scrapy_redis inherits from Scrapy's deduplication component and overrides several of its methods. Scrapy's original dedup filter works within a single machine, but a distributed crawl involves multiple spiders cooperating across multiple machines, so the same spider running on different machines must deduplicate against one shared store. That shared store is a Redis set.

First, let's look at the source of the scrapy_redis dedup component, dupefilter:

```python
import logging
import time

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

from .connection import get_redis_from_settings


DEFAULT_DUPEFILTER_KEY = "dupefilter:%(timestamp)s"

logger = logging.getLogger(__name__)


# TODO: Rename class to RedisDupeFilter.
class RFPDupeFilter(BaseDupeFilter):
    """Redis-based request duplicates filter.

    This class can also be used with default Scrapy's scheduler.

    """

    logger = logger

    def __init__(self, server, key, debug=False):
        """Initialize the duplicates filter.

        Parameters
        ----------
        server : redis.StrictRedis
            The redis server instance.
        key : str
            Redis key Where to store fingerprints.
        debug : bool, optional
            Whether to log filtered requests.

        """
        self.server = server
        self.key = key
        self.debug = debug
        self.logdupes = True

    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.

        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = DEFAULT_DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.

        Parameters
        ----------
        crawler : scrapy.crawler.Crawler

        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.

        """
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        str

        """
        return request_fingerprint(request)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.

        Parameters
        ----------
        reason : str, optional

        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

    def log(self, request, spider):
        """Logs given request.

        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider

        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False
```

The from_settings and from_crawler methods need little explanation: they read the settings, connect to Redis, and set the key. The key methods are request_seen and request_fingerprint. request_seen calls self.request_fingerprint, which in turn calls request_fingerprint from scrapy.utils.request to generate the request's fingerprint.
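The request_seen logic rests entirely on Redis's SADD contract: SADD returns the number of values actually added, so a return of 0 means the fingerprint was already in the shared set. A minimal sketch of that logic, using a hypothetical in-memory stand-in (FakeRedisServer is not a real library class, just enough to mimic sadd) instead of a live Redis connection:

```python
import hashlib


class FakeRedisServer:
    """Hypothetical in-memory stand-in for redis.StrictRedis, supporting
    only the SADD command that the dupefilter relies on."""

    def __init__(self):
        self._sets = {}

    def sadd(self, key, *values):
        # Like Redis SADD: return how many values were actually new.
        s = self._sets.setdefault(key, set())
        added = sum(1 for v in values if v not in s)
        s.update(values)
        return added


def request_seen(server, key, fingerprint):
    # Mirrors RFPDupeFilter.request_seen: SADD returns the number of
    # values added, so 0 means the fingerprint already existed.
    added = server.sadd(key, fingerprint)
    return added == 0


server = FakeRedisServer()
fp = hashlib.sha1(b'GET' + b'http://www.example.com/').hexdigest()
print(request_seen(server, 'dupefilter:test', fp))  # False: first time seen
print(request_seen(server, 'dupefilter:test', fp))  # True: duplicate
```

Because the set lives on the Redis server rather than in spider memory, every machine that checks the same key sees the same fingerprints — which is exactly what makes the dedup distributed.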

Now look at the request_fingerprint source in scrapy.utils.request:

"""
This module provides some useful functions for working with
scrapy.http.Request objects
"""

from __future__ import print_functionimport hashlibimport weakreffrom six.moves.urllib.parse import urlunparse

from w3lib.http import basic_auth_headerfrom scrapy.utils.python import to_bytes, to_native_str

from w3lib.url import canonicalize_urlfrom scrapy.utils.httpobj import urlparse_cached

_fingerprint_cache = weakref.WeakKeyDictionary()def request_fingerprint(request, include_headers=None):"""Return the request fingerprint.The request fingerprint is a hash that uniquely identifies the resource therequest points to. For example, take the following two urls:http://www.example.com/query?id=111&cat=222http://www.example.com/query?cat=222&id=111Even though those are two different URLs both point to the same resourceand are equivalent (ie. they should return the same response).Another example are cookies used to store session ids. Suppose thefollowing page is only accesible to authenticated users:http://www.example.com/members/offers.htmlLot of sites use a cookie to store the session id, which adds a randomcomponent to the HTTP Request and thus should be ignored when calculatingthe fingerprint.For this reason, request headers are ignored by default when calculatingthe fingeprint. If you want to include specific headers use theinclude_headers argument, which is a list of Request headers to include."""if include_headers:include_headers = tuple(to_bytes(h.lower())for h in sorted(include_headers))cache = _fingerprint_cache.setdefault(request, {})if include_headers not in cache:fp = hashlib.sha1()fp.update(to_bytes(request.method))fp.update(to_bytes(canonicalize_url(request.url)))fp.update(request.body or b'')if include_headers:for hdr in include_headers:if hdr in request.headers:fp.update(hdr)for v in request.headers.getlist(hdr):fp.update(v)cache[include_headers] = fp.hexdigest()return cache[include_headers]

def request_authenticate(request, username, password):"""Autenticate the given request (in place) using the HTTP basic accessauthentication mechanism (RFC 2617) and the given username and password"""request.headers['Authorization'] = basic_auth_header(username, password)

def request_httprepr(request):"""Return the raw HTTP representation (as bytes) of the given request.This is provided only for reference since it's not the actual stream ofbytes that will be send when performing the request (that's controlledby Twisted)."""parsed = urlparse_cached(request)path = urlunparse(('', '', parsed.path or '/', parsed.params, parsed.query, ''))s = to_bytes(request.method) + b" " + to_bytes(path) + b" HTTP/1.1\r\n"s += b"Host: " + to_bytes(parsed.hostname or b'') + b"\r\n"if request.headers:s += request.headers.to_string() + b"\r\n"s += b"\r\n"s += request.bodyreturn s

def referer_str(request):""" Return Referer HTTP header suitable for logging. """referrer = request.headers.get('Referer')if referrer is None:return referrerreturn to_native_str(referrer, errors='replace')

request_fingerprint returns a unique fingerprint, and it returns the same fingerprint for URLs that differ only in the order of their query parameters. It does this with a hash computation over request.method, the canonicalized request.url, and request.body. It also accepts an optional include_headers list naming request.headers fields to fold into the computation — a cookie, for instance, can participate in the fingerprint — so that two requests to the same URL with different header values can be treated as distinct requests and escape deduplication.
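To see concretely why parameter order does not matter, here is a simplified sketch of the same recipe (SHA-1 over method + canonical URL + body). The helpers naive_canonicalize_url and naive_fingerprint are illustrative stand-ins, not Scrapy APIs: the canonicalizer only sorts the query string, a small subset of what w3lib's canonicalize_url actually does.

```python
import hashlib
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse


def naive_canonicalize_url(url):
    # Simplified stand-in for w3lib's canonicalize_url: sort the query
    # parameters so that argument order no longer matters.
    parts = urlparse(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunparse((parts.scheme, parts.netloc, parts.path,
                       parts.params, query, parts.fragment))


def naive_fingerprint(method, url, body=b''):
    # Same recipe as scrapy.utils.request.request_fingerprint:
    # SHA-1 over method + canonicalized URL + body.
    fp = hashlib.sha1()
    fp.update(method.encode())
    fp.update(naive_canonicalize_url(url).encode())
    fp.update(body or b'')
    return fp.hexdigest()


fp1 = naive_fingerprint('GET', 'http://www.example.com/query?id=111&cat=222')
fp2 = naive_fingerprint('GET', 'http://www.example.com/query?cat=222&id=111')
print(fp1 == fp2)  # True: different parameter order, same fingerprint
```

Because the method also feeds the hash, a POST to the same URL produces a different fingerprint, which is why form submissions are not deduplicated against plain GETs.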

Once a request's fingerprint is obtained, it can be compared against the dedup set in Redis: if it already exists, the request is discarded; if not, the fingerprint is added to the set and the scheduler is notified that the request may enter the queue. That is the whole deduplication process.
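For reference, wiring this dedup into a project is a matter of configuration. A rough sketch of a typical scrapy_redis settings.py fragment (the Redis address is a placeholder):

```python
# settings.py (sketch) -- point the scheduler and dupefilter at scrapy_redis
# so every spider instance, on every machine, shares one Redis-backed
# fingerprint set.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                # keep queue/fingerprints across runs
REDIS_URL = "redis://127.0.0.1:6379"    # placeholder Redis address
```

With SCHEDULER_PERSIST enabled, close() is not allowed to clear the fingerprint set, so a crawl can be stopped and resumed without revisiting URLs.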
