scrapy mysql 报错_Scrapy+MySQL爬取豆瓣电影TOP250

说真的，不知道为啥！只要一问那些做过爬虫的筒靴，不管是自己平时兴趣爱好亦或是刚接触入门，都喜欢拿豆瓣网作为爬虫练手对象，以至于到现在都变成了没爬过豆瓣的都不好意思说自己搞过爬虫了。好了，切入正题......

一、系统环境

Python版本：2.7.12(64位)

Scrapy版本：1.4.0

Mysql版本：5.6.35(64位)

系统版本：Win10(64位)

MySQLdb版本: MySQL-python-1.2.3.win-amd64-py2.7(64位)

开发IDE：PyCharm-2106.3.3(64位)

二、安装MySQL数据库

2.1、安装MySQLdb

ok，到这里，说明上面的MySQL已经安装成功了，接下来你需要安装MySQLdb了。

2.2、什么是MySQLdb？

MySQLdb 是用于Python链接Mysql数据库的接口，它实现了 Python 数据库 API 规范 V2.0，基于 MySQL C API 上建立的；简单来说，就是类似于Java中的JDBC。

2.3、如何安装MySQLdb？

目前你有两个选择：

1、安装已编译好的版本(强烈推荐)

2、从官网下载，自己编译安装(这个真要取决于个人的RP人品了，如果喜欢折腾的话不妨可以试他一试，在此不做介绍，请自行度娘即可)

ok，我们选择第一种方式，官网下载地址：http://www.codegood.com/downloads，大家根据自己的系统自行下载即可，下载完毕直接双击进行安装，可以修改下安装路径，然后一路next即可。

image.png

2.4、验证MySQLdb是否安装成功

cmd——》输入python——》输入import MySQLdb，查看是否报错，没有报错则说明MySQLdb安装成功！

image.png

2.5、如何使用MySQLdb

2.6、熟悉XPath

抓取网页时，你做的最常见的任务是从HTML源码中提取数据。现有的一些库可以达到这个目的。

BeautifulSoup：是在程序员间非常流行的网页分析库，它基于HTML代码的结构来构造一个Python对象，对不良标记的处理也非常合理，但它有一个缺点：慢。

lxml：是一个基于 ElementTree (不是Python标准库的一部分)的python化的XML解析库(也可以解析HTML)。

XPath：即为XML路径语言，它是一种用来确定XML(标准通用标记语言的子集)文档中某部分位置的语言。XPath基于XML的树状结构，有不同类型的节点，包括元素节点，属性节点和文本节点，提供在数据结构树中找寻节点的能力。

Scrapy提取数据有自己的一套机制。它们被称作选择器(seletors)，因为他们通过特定的 XPath 或者 CSS 表达式来“选择” HTML文件中的某个部分。

ok，有了上面这些基本的准备工作之后，我们可以开始正式编写爬虫程序了。这里以豆瓣电影TOP250为例：https://movie.douban.com/top250

三、编写爬虫

首先我们使用Chrome或者Firefox浏览器打开这个地址，然后一起分析下这个页面的html元素结构，按住F12键即可查看网页源代码。分析页面我们可以看到，最终需要提取的信息都已经被包裹在class属性为grid_view的这个ol里面了，所以我们就可以基本确定解析范围了，以这个ol元素为整个大的边框，然后再在里面进行查找定位即可。

image.png

然后具体细节在此就不罗嗦了，直接撸代码吧：

完整的代码已经上传至github上git@github.com:hu1991die/douan_movie_spider.git，欢迎fork，欢迎clone！

1、DoubanMovieTop250Spider.py

# encoding: utf-8

'''

@author: feizi

@file: DoubanMovieTop250Spider.py

@Software: PyCharm

@desc:

'''

import re

from scrapy import Request

from scrapy.spiders import Spider

from douan_movie_spider.items import DouanMovieItem

class DoubanMovieTop250Spider(Spider):

name = 'douban_movie_top250'

def start_requests(self):

url = 'https://movie.douban.com/top250'

yield Request(url)

def parse(self, response):

item = DouanMovieItem()

movieList = response.xpath('//ol[@class="grid_view"]/li')

for movie in movieList:

# 排名

rank = movie.xpath('.//div[@class="pic"]/em/text()').extract_first()

# 封面

cover = movie.xpath('.//div[@class="pic"]/a/img/@src').extract_first()

# 标题

title = movie.xpath('.//div[@class="hd"]/a/span[1]/text()').extract_first()

# 评分

score = movie.xpath('.//div[@class="star"]/span[@class="rating_num"]/text()').extract_first()

# 评价人数

comment_num = movie.xpath('.//div[@class="star"]/span[4]/text()').re(ur'(\d+)')[0]

# 经典语录

quote = movie.xpath('.//p[@class="quote"]/span[@class="inq"]/text()').extract_first()

# 上映年份,上映地区，电影分类

briefList = movie.xpath('.//div[@class="bd"]/p/text()').extract()

if briefList:

# 以'/'进行分割

briefs = re.split(r'/', briefList[1])

# 电影分类

types = re.compile(u'([\u4e00-\u9fa5].*)').findall(briefs[len(briefs) - 1])[0]

# 上映地区

region = re.compile(u'([\u4e00-\u9fa5]+)').findall(briefs[len(briefs) - 2])[0]

if len(briefs) <= 3:

# 上映年份

years = re.compile(ur'(\d+)').findall(briefs[len(briefs) - 3])[0]

else:

# 上映年份

years = ''

for brief in briefs:

if hasNumber(brief):

years = years + re.compile(ur'(\d+)').findall(brief)[0] + ","

print years

if types:

# 替换空格为“,”

types = types.replace(" ", ",")

print(rank, cover, title, score, comment_num, quote, years, region, types)

item['rank'] = rank

item['cover'] = cover

item['title'] = title

item['score'] = score

item['comment_num'] = comment_num

item['quote'] = quote

item['years'] = years

item['region'] = region

item['types'] = types

yield item

# 获取下一页url

next_url = response.xpath('//span[@class="next"]/a/@href').extract_first()

if next_url:

next_url = 'https://movie.douban.com/top250' + next_url

yield Request(next_url)

def hasNumber(str):

return bool(re.search('\d+', str))

2、items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items

# See documentation in:

# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

# 电影实体类

class DouanMovieItem(scrapy.Item):

# 排名

rank = scrapy.Field()

# 封面

cover = scrapy.Field()

# 标题

title = scrapy.Field()

# 评分

score = scrapy.Field()

# 评价人数

comment_num = scrapy.Field()

# 经典语录

quote = scrapy.Field()

# 上映年份

years = scrapy.Field()

# 上映地区

region = scrapy.Field()

# 电影类型

types = scrapy.Field()

3、pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here

# Don't forget to add your pipeline to the ITEM_PIPELINES setting

# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import MySQLdb

from scrapy.exceptions import DropItem

from douan_movie_spider.items import DouanMovieItem

# 获取数据库连接

def getDbConn():

conn = MySQLdb.Connect(

host='127.0.0.1',

port=3306,

user='root',

passwd='123456',

db='testdb',

charset='utf8'

)

return conn

# 关闭数据库资源

def closeConn(cursor, conn):

# 关闭游标

if cursor:

cursor.close()

# 关闭数据库连接

if conn:

conn.close()

class DouanMovieSpiderPipeline(object):

def __init__(self):

self.ids_seen = set()

def process_item(self, item, spider):

if item['title'] in self.ids_seen:

raise DropItem("Duplicate item found: %s" % item)

else:

self.ids_seen.add(item['title'])

if item.__class__ == DouanMovieItem:

self.insert(item)

return

return item

def insert(self, item):

try:

# 获取数据库连接

conn = getDbConn()

# 获取游标

cursor = conn.cursor()

# 插入数据库

sql = "INSERT INTO db_movie(rank, cover, title, score, comment_num, quote, years, region, types)VALUES(%s, %s, %s, %s, %s, %s, %s, %s, %s)"

params = (item['rank'], item['cover'], item['title'], item['score'], item['comment_num'], item['quote'], item['years'], item['region'], item['types'])

cursor.execute(sql, params)

#事务提交

conn.commit()

except Exception, e:

# 事务回滚

conn.rollback()

print 'except:', e.message

finally:

# 关闭游标和数据库连接

closeConn(cursor, conn)

4、main.py

# encoding: utf-8

'''

@author: feizi

@file: main.py

@Software: PyCharm

@desc:

'''

from scrapy import cmdline

name = "douban_movie_top250"

# cmd = "scrapy crawl {0} -o douban.csv".format(name)

cmd = "scrapy crawl {0}".format(name)

cmdline.execute(cmd.split())

5、settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for douan_movie_spider project

# For simplicity, this file contains only settings considered important or

# commonly used. You can find more settings consulting the documentation:

# http://doc.scrapy.org/en/latest/topics/settings.html

# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html

# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'douan_movie_spider'

SPIDER_MODULES = ['douan_movie_spider.spiders']

NEWSPIDER_MODULE = 'douan_movie_spider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3013.3 Safari/537.36'

# Obey robots.txt rules

ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)

#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)

# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay

# See also autothrottle settings and docs

#DOWNLOAD_DELAY = 3

# The download delay setting will honor only one of:

#CONCURRENT_REQUESTS_PER_DOMAIN = 16

#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)

#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)

#TELNETCONSOLE_ENABLED = False

# Override the default request headers:

#DEFAULT_REQUEST_HEADERS = {

# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',

# 'Accept-Language': 'en',

# Enable or disable spider middlewares

# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

#SPIDER_MIDDLEWARES = {

# 'douan_movie_spider.middlewares.DouanMovieSpiderSpiderMiddleware': 543,

# Enable or disable downloader middlewares

# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html

#DOWNLOADER_MIDDLEWARES = {

# 'douan_movie_spider.middlewares.MyCustomDownloaderMiddleware': 543,

# Enable or disable extensions

# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html

#EXTENSIONS = {

# 'scrapy.extensions.telnet.TelnetConsole': None,

# Configure item pipelines

# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html

ITEM_PIPELINES = {

'douan_movie_spider.pipelines.DouanMovieSpiderPipeline': 300,

}

# Enable and configure the AutoThrottle extension (disabled by default)

# See http://doc.scrapy.org/en/latest/topics/autothrottle.html

#AUTOTHROTTLE_ENABLED = True

# The initial download delay

#AUTOTHROTTLE_START_DELAY = 5

# The maximum download delay to be set in case of high latencies

#AUTOTHROTTLE_MAX_DELAY = 60

# The average number of requests Scrapy should be sending in parallel to

# each remote server

#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# Enable showing throttling stats for every response received:

#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)

# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings

#HTTPCACHE_ENABLED = True

#HTTPCACHE_EXPIRATION_SECS = 0

#HTTPCACHE_DIR = 'httpcache'

#HTTPCACHE_IGNORE_HTTP_CODES = []

#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

需要注意一点，为了防止爬虫被ban，我们可以设置一下USER-AGENT.

还是F12键，查看一下Request Headers请求头，找到User-Agent信息然后设置到settings文件中即可。当然，这只是一种简单的方式，其他更复杂的策略如IP池，User-Agent池请自行google吧，这里不做赘述。

image.png

四、运行爬虫

image.png

五、保存结果

image.png

六、简单数据可视化分析

最后，给大家看下简单的数据可视化分析效果。

6.1、评分top10

image.png

6.2、标题云

image.png

6.3、语录云

image.png

6.4、评论TOP10

image.png

6.5、每一年电影上映数统计

image.png

6.6、上映地区统计

image.png

6.7、电影类型汇总

image.png

scrapy mysql 报错_Scrapy+MySQL爬取豆瓣电影TOP250相关推荐

python爬取豆瓣电影top250_【Python3爬虫教程】Scrapy爬取豆瓣电影TOP250
今天要实现的就是使用是scrapy爬取豆瓣电影TOP250榜单上的电影信息. 步骤如下: 一.爬取单页信息首先是建立一个scrapy项目,在文件夹中按住shift然后点击鼠标右键,选择在此处打开命令 ...
03_使用scrapy框架爬取豆瓣电影TOP250
前言: 本次项目是使用scrapy框架,爬取豆瓣电影TOP250的相关信息.其中涉及到代理IP,随机UA代理,最后将得到的数据保存到mongoDB中.本次爬取的内容实则不难.主要是熟悉scrapy相关 ...
scrapy1.3爬取豆瓣电影top250
学习<爬虫框架scrapy,爬取豆瓣电影top250>,用scrapy1.3实践,记录学习过程 1 . 新建项目进入打算存储代码的目录,命令行运行如下语句 scrapy startpro ...
python爬取豆瓣电影top250_用Python爬虫实现爬取豆瓣电影Top250
用Python爬虫实现爬取豆瓣电影Top250 #爬取豆瓣电影Top250 #250个电影 ,分为10个页显示,1页有25个电影 import urllib.request from bs4 imp ...
利用python爬取豆瓣电影top250
利用python爬取豆瓣电影top250: 注:本内容只是作为个人学习记录 1.业务分析进入网页https://movie.douban.com/top250 可以看见每部电影都呈现在眼前,点击电影 ...
Python爬虫爬取豆瓣电影TOP250
Python爬虫爬取豆瓣电影TOP250 最近在b站上学习了一下python的爬虫,实践爬取豆瓣的电影top250,现在对这两天的学习进行一下总结主要分为三步: 爬取豆瓣top250的网页,并通过 ...
jsoup爬取豆瓣电影top250
文章目录 0.准备工作 1. 分析 2. 构思 3. 编程 3.1 定义一个bean,用于保存电影的数据 3.2 按照之前的构思进行编程 4.效果图 5.获取资源 5.1GitHub 5.2百度云 0 ...
爬虫练习-爬取豆瓣电影TOP250的数据
前言: 爬取豆瓣电影TOP250的数据,并将爬取的数据存储于Mysql数据库中本文为整理代码,梳理思路,验证代码有效性--2020.1.4 环境: Python3(Anaconda3) PyChar ...
python爬取豆瓣电影top250编码_Python学习日记1| 用python爬取豆瓣电影top250
今天是3.17号. 离毕业论文开题只剩下不到15天,自己这边还不知道要写什么好,问了导师,导师给的范围超级广泛,实在是想吐槽.想了几天,决定了要尽快给老师说自己的想法和方向,做什么还是靠自己比较靠谱. ...
python BeautifulSoup爬取豆瓣电影top250信息并写入Excel表格
豆瓣是一个社区网站,创立于2005年3月6日.该网站以书影音起家,提供关于书籍,电影,音乐等作品信息,其描述和评论都是由用户提供的,是Web2.0网站中具有特色的一个网站. 豆瓣电影top250网址: ...

scrapy mysql 报错_Scrapy+MySQL爬取豆瓣电影TOP250

scrapy mysql 报错_Scrapy+MySQL爬取豆瓣电影TOP250相关推荐

最新文章

热门文章