Original text:

TMDB 5000 Movie Dataset

Metadata on ~5,000 movies from TMDb

What can we say about the success of a movie before it is released? Are there certain companies (Pixar?) that have found a consistent formula? Given that major films costing over $100 million to produce can still flop, this question is more important than ever to the industry. Film aficionados might have different interests. Can we predict which films will be highly rated, whether or not they are a commercial success?

This is a great place to start digging into those questions, with data on the plot, cast, crew, budget, and revenues of several thousand films.

We have removed the original version of this dataset per a DMCA takedown request from IMDb. To minimize the impact, we're replacing it with a similar set of films and data fields from The Movie Database (TMDb), in accordance with their terms of use. The bad news is that kernels built on the old dataset will most likely no longer work.

The good news is that:

You can port your existing kernels over with a bit of editing. This kernel offers functions and examples for doing so. You can also find a general introduction to the new format here.
The new dataset contains full credits for both the cast and the crew, rather than just the first three actors.
Actors and actresses are now listed in the order they appear in the credits. It's unclear what ordering the original dataset used; for the movies I spot-checked, it didn't line up with either the credits order or IMDb's stars order.
The revenues appear to be more current. For example, IMDb's figures for Avatar seem to be from 2010 and understate the film's global revenues by over $2 billion.
Some of the movies that we weren't able to port over (a couple of hundred) were just bad entries. For example, this IMDb entry has basically no accurate information at all. It lists Star Wars Episode VII as a documentary.
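
The point about cast ordering can be illustrated with a minimal Python sketch. It assumes (this is not stated in the text above) that a film's cast is stored as a JSON list of objects with `name` and `order` keys. The sample record and the helper name `top_billed` are hypothetical, and the sketch only shows one way to recover something like the old dataset's "first three actors" from full credits.

```python
import json

# Hypothetical sample of one film's cast field, stored as a JSON string.
# The "name" and "order" keys are assumptions about the credits format.
sample_cast = json.dumps([
    {"name": "Actor C", "order": 2},
    {"name": "Actor A", "order": 0},
    {"name": "Actor D", "order": 3},
    {"name": "Actor B", "order": 1},
])


def top_billed(cast_json, n=3):
    """Return the names of the first n cast members in credits order."""
    cast = json.loads(cast_json)
    ordered = sorted(cast, key=lambda member: member["order"])
    return [member["name"] for member in ordered[:n]]


print(top_billed(sample_cast))
```

Sorting explicitly on the order key, rather than trusting the list position, keeps the result stable even if the entries arrive shuffled.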

Data Source Transfer Details

Several of the new columns contain JSON. You can save a bit of time by porting the load data functions [from this kernel]().
Even simple fields like runtime may not be consistent across versions. For example, the previous dataset shows the duration of Avatar's extended cut, while TMDb shows the runtime of the original version.
There's now a separate file containing the full credits for both the cast and crew.
All fields are filled out by users, so don't expect them to agree on keywords, genres, ratings, or the like.
Your existing kernels will continue to render normally until they are re-run.
If you are curious about how this dataset was prepared, the code to access TMDb's API is posted here.
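
As a rough illustration of the first and third points in this list, the sketch below decodes JSON-valued columns and attaches credits to films by id, using only the Python standard library. It is not the load functions from the kernel mentioned above. The in-memory sample data, the helper names `make_csv` and `load_rows`, and the `movie_id` and `cast` column names are all assumptions made for the example.

```python
import csv
import io
import json


def make_csv(header, rows):
    """Build an in-memory CSV file (a stand-in for the real data files)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    writer.writerows(rows)
    buffer.seek(0)
    return buffer


# Hypothetical sample data; the real files have many more columns and rows.
movies_file = make_csv(
    ["id", "original_title", "production_companies"],
    [[1, "Film One", json.dumps([{"name": "Studio X", "id": 10}])],
     [2, "Film Two", json.dumps([])]],
)
credits_file = make_csv(
    ["movie_id", "cast"],  # assumed column names for the credits file
    [[1, json.dumps([{"name": "Actor A", "order": 0}])],
     [2, json.dumps([{"name": "Actor B", "order": 0}])]],
)

JSON_COLUMNS = ["production_companies", "cast"]


def load_rows(handle, json_columns):
    """Read CSV rows, decoding the listed columns from JSON strings."""
    rows = []
    for row in csv.DictReader(handle):
        for column in json_columns:
            if column in row:
                row[column] = json.loads(row[column])
        rows.append(row)
    return rows


movies = load_rows(movies_file, JSON_COLUMNS)
credits_by_id = {row["movie_id"]: row
                 for row in load_rows(credits_file, JSON_COLUMNS)}

# Attach each film's cast by matching the movie id across the two files.
for movie in movies:
    movie["cast"] = credits_by_id.get(movie["id"], {}).get("cast", [])

print(movies[0]["production_companies"][0]["name"], movies[0]["cast"][0]["name"])
```

Note that `csv.DictReader` returns every value as a string, so the ids are compared as strings here; a real port would likely convert numeric columns as well.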

New columns:

homepage
id
original_title
overview
popularity
production_companies
production_countries
release_date
spoken_languages
status
tagline
vote_average

Lost columns:

actor_1_facebook_likes
actor_2_facebook_likes
actor_3_facebook_likes
aspect_ratio
cast_total_facebook_likes
color
content_rating
director_facebook_likes
facenumber_in_poster
movie_facebook_likes
movie_imdb_link
num_critic_for_reviews
num_user_for_reviews

