Crawl the Douban Top 250 movies, extract keywords from their comments, then build one word cloud per country from that country's keywords, using the country's map outline as the cloud shape.

Crawling the data

We need to crawl each movie's title, director, year, region, and its first 10 comments. Everything except the region is straightforward, so let's look at how to extract the region information.
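On a movie's detail page the region is plain text inside the #info div, sitting right after a span labeled 制片国家/地区:, so it has no tag of its own and must be read as the label's next sibling. A minimal sketch of just that extraction (the subject URL is an example page, and real requests may need the cookie headers used in the full crawler below):

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0'}
detail = BeautifulSoup(requests.get('https://movie.douban.com/subject/1292052/',
                                    headers=headers).text, 'lxml')
info_div = detail.find('div', attrs={'id': 'info'})
for child in info_div.children:
    # the label span's text is '制片国家/地区:'; the value is the bare text node after it
    if child.string and child.string.startswith('制片国家/地区'):
        print(child.next_sibling.string.strip())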

import requests
from bs4 import BeautifulSoup
import time
import pymysql
import pandas as pd
db = pymysql.connect(host='ip', user='QINYUYOU', password='QINyuyo!', database='homework')
cursor = db.cursor()
headers = {'cookie':'bid=xiXasJy_T2s; ll="118304"; __utmz=30149280.1576307574.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmz=223695111.1576307574.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __yadk_uid=ucYJzWLxVGkxVUZzkLuOr2WKGYDQUChd; _vwo_uuid_v2=DDF040CDC39D506E32CB70680F68474E1|09b885503496bad5cd4ffc77a93035b1; _pk_ses.100001.4cf6=*; __utma=30149280.1798292817.1576307574.1576307574.1576411260.2; __utmb=30149280.0.10.1576411260; __utmc=30149280; __utma=223695111.844953453.1576307574.1576307574.1576411260.2; __utmb=223695111.0.10.1576411260; __utmc=223695111; ap_v=0,6.0; trc_cookie_storage=taboola%2520global%253Auser-id%3Da50462e2-0a35-4fe0-8d41-70f031512552-tuct4efa694; _pk_id.100001.4cf6=774b2f69656869fe.1576307574.2.1576411507.1576309794.','referer':'https://movie.douban.com/top250?start=0&filter=','user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'
}
list_name = []
list_dir = []
list_year = []
list_area = []
list_com = []
for i in range(0, 10):
    time.sleep(5)
    url = 'https://movie.douban.com/top250?start={0}&filter='.format(i * 25)
    url_r = requests.get(url, headers=headers)  # GET the list page
    url_b = BeautifulSoup(url_r.text, 'lxml')   # parse it
    movie_list = url_b.find('ol', attrs={'class': 'grid_view'})  # the movie list
    #print(url_list)
    for movie_li in movie_list.find_all('li'):
        movie_url = movie_li.find('a').attrs['href']  # link to the movie's detail page
        time.sleep(4)
        movie_r = BeautifulSoup(requests.get(movie_url, headers=headers).text, 'lxml')
        movie_name = movie_r.h1.span.string  # the span inside h1 holds the title
        movie_directed = movie_r.find('a', rel='v:directedBy').string  # director
        time_ = movie_r.h1.find('span', class_='year').string  # span with class 'year' holds the year
        info_div = movie_r.find('div', attrs={'id': 'info'})
        for child in info_div.children:
            if child.string and child.string.startswith('制片国家/地区'):
                area = child.next_sibling.string.strip()  # production country/region
        print(area)
        comment_url = movie_r.find('div', id='comments-section').find('div', class_='mod-hd').find('span', class_='pl').a.attrs['href']  # comments page URL
        time.sleep(5)
        #print(url_)
        comment_req = BeautifulSoup(requests.get(comment_url, headers=headers).text, 'lxml')
        comment_item = comment_req.find_all('div', class_='comment-item')
        for j in range(10):  # first 10 comments; one DB row / list entry per comment
            comment = comment_item[j].find('div', class_='comment').find('span', class_='short').string
            print(i, comment)
            list_name.append(movie_name)
            list_dir.append(movie_directed)
            list_year.append(time_)
            list_area.append(area)
            list_com.append(comment)
            sql = 'INSERT INTO tp250(movie_name,movie_dir,movie_year,movie_area,movie_comment) VALUES(%s,%s,%s,%s,%s)'
            cursor.execute(sql, (movie_name, movie_directed, time_, area, comment))
            print('插入成功')  # "inserted successfully"
            db.commit()
dict_ = pd.DataFrame({'name': list_name, 'dir': list_dir, 'time': list_year,'area':list_area,'comment':list_com})
dict_.to_csv('top250.csv')
db.close()
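The INSERT assumes a tp250 table already exists. A possible schema, created once before the crawl loop via the same connection (the column types and lengths here are assumptions, not taken from the original project):

cursor.execute('''
    CREATE TABLE IF NOT EXISTS tp250 (
        movie_name    VARCHAR(128),
        movie_dir     VARCHAR(128),
        movie_year    VARCHAR(16),
        movie_area    VARCHAR(64),
        movie_comment TEXT
    )
''')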

Analysis

import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from PIL import Image
import numpy as np
data = pd.read_csv('国家关键词.csv')
print(data)
def area(string):
    # e.g. '美国 / 英国' -> '美国': keep only the first listed country
    return string.strip().split('/')[0].strip()
data['area'] = data['area'].apply(area)
count = data.groupby(['area'], as_index=False)['area'].agg(cnt='count')  # rows per country (dict-style agg renaming was removed in pandas 1.0)
United_States = data[data['area']=='美国']  # US comments
China = data.loc[(data['area']=='中国台湾') | (data['area']=='中国大陆') | (data['area']=='中国香港')]  # China comments (mainland, Hong Kong, Taiwan)
#print(United_States)
#print(China)
Denmark = data[data['area']=='丹麦']  # Denmark comments
Iran = data[data['area']=='伊朗']   # Iran comments
India = data[data['area']=='印度']  # India comments
# get each country's keywords
def area_comment(area_name):
    return data[data['area'] == area_name]

def key_word_count(area_name):
    data = area_comment(area_name)
    string_key = ''
    for word in data['keyword']:
        string_key = string_key + word
    list_key = string_key.split(' ')
    series_key = pd.DataFrame({'key_word': list_key})
    count = series_key.groupby(['key_word'], as_index=False)['key_word'].agg(cnt='count')
    key_word = count['key_word']
    key_count = count['cnt']
    image = np.array(Image.open("美国.jpg"))  # mask image: the US map outline
    wordcloud = WordCloud(mask=image,
                          font_path="/usr/local/lib/python3.7/site-packages/matplotlib/mpl-data/fonts/ttf/SimHei.ttf").generate(string_key)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.show()

if __name__ == '__main__':
    key_word_count('美国')
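As written, key_word_count always opens 美国.jpg, so every country's cloud would get the US outline. To realize the goal from the intro (one map outline per country), a small mapping from area name to mask file can be swapped in. A sketch, assuming outline images named after each country exist next to the script (the filenames and font path are assumptions):

# hypothetical mask files, one map outline per country
MASKS = {'美国': '美国.jpg', '中国大陆': '中国.jpg', '丹麦': '丹麦.jpg',
         '伊朗': '伊朗.jpg', '印度': '印度.jpg'}

def country_cloud(area_name):
    data = area_comment(area_name)
    string_key = ' '.join(data['keyword'])  # keywords are already space-separated
    image = np.array(Image.open(MASKS.get(area_name, '美国.jpg')))
    wc = WordCloud(mask=image, font_path='SimHei.ttf').generate(string_key)  # reuse the SimHei path from above
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()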

Data processing

import pandas as pd
import thulac
import math
import string
import jieba.analyse
from collections import Counter

lac = thulac.thulac(seg_only=True)  # thulac tokenizer, segmentation only
data = pd.read_csv('./top250_result.csv')
list_area = data['area'].values
def word_cut(string):
    return lac.cut(string, text=True)  # text=True returns a space-separated string
df_comment = data['comment']
df_comment = df_comment.apply(word_cut)
# tf-idf computed by hand (left commented out; jieba.analyse.extract_tags is used instead)
'''
def tf(word, count):
    return count[word] / sum(count.values())

def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

def idf(word, count_list):
    # log(N / (1 + document frequency))
    return math.log(len(count_list) / (1 + n_containing(word, count_list)))

def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

def main():
    #print(df_comment)
    countlist = []
    for word_list in df_comment:
        for i in range(len(word_list)):
            count = Counter(word_list[i])
            countlist.append(count)
    #print(countlist)
    for i, count in enumerate(countlist):
        print("Top words in document {}".format(i + 1))
        scores = {word: tfidf(word, count, countlist) for word in count}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        for word, score in sorted_words[:]:
            print("\tWord: {}, TF-IDF: {}".format(word, round(score, 10)))
'''
if __name__ == "__main__":
    keyword_list = []
    for sentence in df_comment:
        string = ''
        print(sentence)
        keywords = jieba.analyse.extract_tags(sentence, topK=10, withWeight=True)
        for item in keywords:
            string = string + item[0] + ' '
        keyword_list.append(string)
    dict__ = pd.DataFrame({'area': list_area, 'keyword': keyword_list})
    dict__.to_csv('国家关键词.csv')
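jieba.analyse.extract_tags ranks words by TF-IDF internally, which is why the hand-rolled version above could be dropped. A quick standalone usage example (the sentence is illustrative; the returned words and weights depend on jieba's built-in IDF dictionary):

import jieba.analyse

keywords = jieba.analyse.extract_tags('这部电影的剧情和配乐都非常出色', topK=3, withWeight=True)
for word, weight in keywords:
    print(word, round(weight, 3))  # (word, TF-IDF weight) pairs, highest first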

Processing 2

import pandas as pd
from jieba import analyse
tfidf = analyse.extract_tags
data = pd.read_csv('./top250.csv')
df_comment = data['comment']
list1 = []  # merged comments, one string per movie
string = ''
for i in range(len(df_comment)):
    print(i)
    if (i + 1) % 10 == 0:
        # every 10 consecutive rows belong to one movie; flush the buffer
        string = string + df_comment[i]
        list1.append(string)
        string = ''
    else:
        string = string + df_comment[i]
data = data[['name','area','time','dir']]
data = data.drop_duplicates()  # deduplicate: one row per movie
print(data)
print(list1)
df_dir = data['dir'].values    # directors
df_time = data['time'].values  # years
df_area = data['area'].values  # countries/regions
df_name = data['name'].values  # titles
print(len(df_name),len(df_area),len(df_time),len(df_dir),len(list1))
dict_ = pd.DataFrame({'name':df_name,'dir':df_dir,'time':df_time,'area':df_area,'comment':list1})
dict_.to_csv('top250_result.csv')
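The manual buffer above relies on the CSV holding exactly 10 comment rows per movie, in order. Under that same assumption, grouping rows by integer position does the merge in one step; this is a sketch of an equivalent approach, not what the original script uses:

import numpy as np
import pandas as pd

data = pd.read_csv('./top250.csv')
# concatenate every block of 10 consecutive comments into one string per movie
merged = data['comment'].groupby(np.arange(len(data)) // 10).apply(''.join)
print(merged.head())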
