Handling specially formatted data when parsing web pages with Scrapy

1. Environment

  • Python 3.6.1
  • OS: Windows 7
  • IDE: PyCharm
  • Framework: Scrapy

2. The page source contains JSON data

2.1. Example

  • Reference page:

https://www.amazon.com/Best-Sellers-Sports-Outdoors-Hunting-Shooting-Safety-Glasses/zgbs/sporting-goods/3413661/ref=zg_bs_nav_sg_5_3304074011#3

  • Extract the ASIN, the product's unique identifier

    import json

    productLst = response.xpath("//div[@class='zg_itemImmersion']")
    for elem in productLst:
        asinInfo = elem.xpath(".//div[@class='zg_itemWrapper']/div[@class='a-section a-spacing-none p13n-asin']/@data-p13n-asin-metadata").extract()
        if len(asinInfo) > 0:
            # The attribute value is a JSON string; parse it to get the ASIN
            asinInfoJson = json.loads(asinInfo[0])
            print(f"asinInfoJson = {asinInfoJson}")
            asin = asinInfoJson['asin']

    # Printed output (shown as the raw JSON strings):
asinInfoJson = {"ref":"zg_bs_3413661_41","asin":"B01MUET4LC"}
asinInfoJson = {"ref":"zg_bs_3413661_42","asin":"B007JZ3UG0"}
asinInfoJson = {"ref":"zg_bs_3413661_43","asin":"B01KX6184M"}
asinInfoJson = {"ref":"zg_bs_3413661_44","asin":"B007ZNTN7G"}
asinInfoJson = {"ref":"zg_bs_3413661_45","asin":"B004OFRLQI"}
asinInfoJson = {"ref":"zg_bs_3413661_46","asin":"B077JTTYJQ"}
asinInfoJson = {"ref":"zg_bs_3413661_47","asin":"B01ACIB5YY"}
asinInfoJson = {"ref":"zg_bs_3413661_48","asin":"B00JXLTO5E"}
asinInfoJson = {"ref":"zg_bs_3413661_49","asin":"B01MYBOPKJ"}
asinInfoJson = {"ref":"zg_bs_3413661_50","asin":"B01MZXHWQ9"}
asinInfoJson = {"ref":"zg_bs_3413661_51","asin":"B01GZ2ZI3U"}
asinInfoJson = {"ref":"zg_bs_3413661_52","asin":"B0055BCCUK"}
asinInfoJson = {"ref":"zg_bs_3413661_53","asin":"B005NH4Q64"}
asinInfoJson = {"ref":"zg_bs_3413661_54","asin":"B007HKPAA6"}
asinInfoJson = {"ref":"zg_bs_3413661_55","asin":"B007VKKP5W"}
asinInfoJson = {"ref":"zg_bs_3413661_56","asin":"B01GSUX3QS"}
asinInfoJson = {"ref":"zg_bs_3413661_57","asin":"B007JZ3UP6"}
asinInfoJson = {"ref":"zg_bs_3413661_58","asin":"B06Y12NL7N"}
asinInfoJson = {"ref":"zg_bs_3413661_59","asin":"B01FT2CFYC"}
asinInfoJson = {"ref":"zg_bs_3413661_60","asin":"B00ESH12E4"}
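The same idea can be tried outside a spider. The sketch below is an illustration only: it uses the standard library's html.parser instead of Scrapy selectors, and SAMPLE_HTML, AsinCollector and extract_asins are hypothetical names invented for the example. It assumes the attribute holds a JSON object with an "asin" key, as in the output above, and it skips attributes that are not valid JSON.

```python
import json
from html.parser import HTMLParser

# Hypothetical sample markup modelled on the attribute shown above.
SAMPLE_HTML = """
<div class="zg_itemImmersion">
  <div class="a-section a-spacing-none p13n-asin"
       data-p13n-asin-metadata="{&quot;ref&quot;:&quot;zg_bs_3413661_41&quot;,&quot;asin&quot;:&quot;B01MUET4LC&quot;}"></div>
</div>
<div class="zg_itemImmersion">
  <div class="a-section a-spacing-none p13n-asin"
       data-p13n-asin-metadata="not json"></div>
</div>
"""


class AsinCollector(HTMLParser):
    """Collect ASINs from data-p13n-asin-metadata attributes."""

    def __init__(self):
        super().__init__()
        self.asins = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already unescaped &quot; into " at this point.
        raw = dict(attrs).get("data-p13n-asin-metadata")
        if raw is None:
            return
        try:
            meta = json.loads(raw)
        except json.JSONDecodeError:
            return  # malformed metadata: skip this element
        if isinstance(meta, dict) and meta.get("asin"):
            self.asins.append(meta["asin"])


def extract_asins(html_text):
    collector = AsinCollector()
    collector.feed(html_text)
    return collector.asins


print(extract_asins(SAMPLE_HTML))
```

Catching json.JSONDecodeError keeps one bad attribute from aborting the whole page, which matters when the markup is outside your control.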

2.2. More about JSON

  • Official documentation: http://docs.python.org/library/json.html
  • Online JSON parser: http://www.json.cn/

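Put simply, JSON is the object and array notation of JavaScript. These two structures, objects and arrays, are enough to represent arbitrarily complex data.
Object: written in JS as content enclosed in { }, with the key-value structure { key: value, key: value, … }. In object-oriented terms the key is a property of the object and the value is that property's value, so it is read in JS as object.key. A value can be a number, string, boolean, null, array, or object.
Array: written in JS as content enclosed in [ ], for example ["Python", "javascript", "C++", …]. Elements are read by index, as in any language, and each element can be any of the same value types.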
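As a small illustration of the two structures, the sketch below parses a made-up document (the doc string and its keys are invented for the example) and reads values by key and by index. In Python the parsed object is a dict, so the JS form object.key becomes data["key"].

```python
import json

# Hypothetical JSON text: an object holding a string, an array and a nested object.
doc = '{"name": "scrapy", "langs": ["Python", "javascript", "C++"], "meta": {"stars": 5}}'

data = json.loads(doc)

name = data["name"]              # object member, read by key
first_lang = data["langs"][0]    # array element, read by index
stars = data["meta"]["stars"]    # nested object, read key by key

print(name, first_lang, stars)
```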

    import json

The json module provides four functions, dumps, dump, loads and load, for converting between JSON text and Python data types.

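  • json.loads(): decodes a JSON string into a Python object.
    JSON types map to Python types as follows: object → dict, array → list, string → str, number → int or float, true/false → True/False, null → None.

    # json_loads.py
    import json

    strList = '[1, 2, 3, 4]'
    strDict = '{"city": "北京", "name": "大猫"}'

    json.loads(strList)
    # [1, 2, 3, 4]

    json.loads(strDict)  # JSON strings are decoded into Python str (Unicode)
    # {'city': '北京', 'name': '大猫'}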
  • json.dumps(): encodes a Python object as a JSON string and returns a str.
    Python types map to JSON types as follows: dict → object, list and tuple → array, str → string, int and float → number, True/False → true/false, None → null.

    # json_dumps.py
    import json
    import chardet

    listStr = [1, 2, 3, 4]
    tupleStr = (1, 2, 3, 4)
    dictStr = {"city": "北京", "name": "大刘"}

    json.dumps(listStr)
    # '[1, 2, 3, 4]'
    json.dumps(tupleStr)
    # '[1, 2, 3, 4]'

    # Note: by default json.dumps() escapes every non-ASCII character as \uXXXX.
    # Pass ensure_ascii=False to keep such characters as they are.
    # chardet.detect() takes bytes and returns a dict; 'confidence' is the detection confidence.
    json.dumps(dictStr)
    # '{"city": "\\u5317\\u4eac", "name": "\\u5927\\u5218"}'
    chardet.detect(json.dumps(dictStr).encode("utf-8"))
    # encoding detected as 'ascii' (confidence 1.0 in the original run)

    print(json.dumps(dictStr, ensure_ascii=False))
    # {"city": "北京", "name": "大刘"}
    chardet.detect(json.dumps(dictStr, ensure_ascii=False).encode("utf-8"))
    # encoding detected as 'utf-8' (confidence 0.99 in the original run)

chardet is a very good encoding detection module and can be installed with pip.
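The effect of ensure_ascii can also be checked without chardet. The following is a minimal sketch using only the standard library; the record dict is made up for the example. It shows that both forms decode back to the same data and that a tuple is written as a JSON array.

```python
import json

# Hypothetical record mixing Chinese text, a tuple, a bool and None.
record = {"city": "北京", "tags": ("a", "b"), "ok": True, "note": None}

escaped = json.dumps(record)                       # non-ASCII escaped as \uXXXX
readable = json.dumps(record, ensure_ascii=False)  # non-ASCII kept as-is

print(escaped)
print(readable)

# Both strings describe the same data; only the text representation differs.
same_data = json.loads(escaped) == json.loads(readable)
print(same_data)
```

Note that the round trip does not restore the tuple: it comes back as a list, because JSON has only one array type.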

  • json.dump(): serializes a Python object as JSON and writes it to a file.

    # json_dump.py
    import json

    listStr = [{"city": "北京"}, {"name": "大刘"}]
    with open("listStr.json", "w", encoding="utf-8") as f:
        json.dump(listStr, f, ensure_ascii=False)

    dictStr = {"city": "北京", "name": "大刘"}
    with open("dictStr.json", "w", encoding="utf-8") as f:
        json.dump(dictStr, f, ensure_ascii=False)

  • json.load(): reads JSON text from a file and converts it to Python types.

    # json_load.py
    import json

    with open("listStr.json", encoding="utf-8") as f:
        strList = json.load(f)
    print(strList)
    # [{'city': '北京'}, {'name': '大刘'}]

    with open("dictStr.json", encoding="utf-8") as f:
        strDict = json.load(f)
    print(strDict)
    # {'city': '北京', 'name': '大刘'}
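  • Notes
json.loads() decodes a JSON string into a Python object. If it raises an error, check the encoding of the data being decoded. In Python 3, json.loads() accepts a str or, since Python 3.6, bytes encoded as UTF-8, UTF-16 or UTF-32; the encoding argument known from Python 2 is ignored (and was removed in Python 3.9). So if dataJsonBytes holds GBK-encoded bytes containing non-ASCII characters, this call fails:
    dataDict = json.loads(dataJsonBytes)
Decode the bytes with the correct codec first:
    dataDict = json.loads(dataJsonBytes.decode("GBK"))
If the data nominally uses one encoding but also contains characters from another, convert it to a Unicode str explicitly first, then call json.loads() on the result:
    dataJsonStrUni = dataJsonBytes.decode("GB2312")
    dataDict = json.loads(dataJsonStrUni)

String encoding conversion: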
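This is the most painful part for Chinese programmers: almost every garbled-text problem is caused by Chinese characters.
Encoding problems are actually easy to handle once you remember one thing:
Any encoding on any platform can be converted to and from Unicode.
To convert between UTF-8 and GBK, first convert the UTF-8 data to Unicode, then convert from Unicode to GBK; the reverse works the same way.

    # UTF-8 encoded bytes
    utf8Bytes = "你好地球".encode("UTF-8")

    # 1. Decode the UTF-8 bytes into a Unicode str
    unicodeStr = utf8Bytes.decode("UTF-8")
    # 2. Encode the Unicode str as GBK
    gbkData = unicodeStr.encode("GBK")

    # 1. Decode the GBK bytes back into a Unicode str
    unicodeStr = gbkData.decode("GBK")
    # 2. Encode the Unicode str as UTF-8 again
    utf8Bytes = unicodeStr.encode("UTF-8")

decode converts bytes in some other encoding into a Unicode str.
encode converts a Unicode str into bytes in some other encoding.
In one sentence: UTF-8 is one way of encoding the Unicode character set.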
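To tie the two notes together, here is a minimal Python 3 sketch of decoding JSON that arrives as GBK bytes and of converting bytes between encodings through Unicode. The helpers loads_from_bytes and transcode, and the payload dict, are hypothetical names made up for this example.

```python
import json


def loads_from_bytes(raw, encoding):
    """Decode raw bytes with the given codec, then parse the JSON text."""
    return json.loads(raw.decode(encoding))


def transcode(raw, src, dst):
    """Convert bytes from one encoding to another, going through a Unicode str."""
    return raw.decode(src).encode(dst)


# Hypothetical payload, serialized and then encoded as GBK to simulate a GBK response.
payload = {"city": "北京", "name": "大刘"}
gbk_bytes = json.dumps(payload, ensure_ascii=False).encode("GBK")

decoded = loads_from_bytes(gbk_bytes, "GBK")
utf8_bytes = transcode(gbk_bytes, "GBK", "UTF-8")

print(decoded)
print(utf8_bytes.decode("UTF-8"))
```

Once the data is UTF-8 bytes, json.loads(utf8_bytes) also works directly on Python 3.6 and later, since bytes input in UTF-8 is accepted there.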

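Introduction to JsonPath:
JsonPath is an information extraction library, a tool for pulling specific pieces of information out of a JSON document. Implementations exist for several languages, including Javascript, Python, PHP and Java.
JsonPath is to JSON what XPATH is to XML.
Download: https://pypi.python.org/pypi/jsonpath
Installation: follow the Download URL link to download jsonpath, unpack it, and run python setup.py install
Official documentation: http://goessner.net/articles/JsonPath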
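To give a feel for what a JsonPath query selects, the sketch below hand-rolls the recursive-descent lookup that the expression $..asin describes. It deliberately does not use the jsonpath package, so that it runs with the standard library alone; find_all is a hypothetical helper, not part of any library, and the sample document is made up.

```python
import json


def find_all(node, key):
    """Yield every value stored under `key` at any depth, like JsonPath `$..key`."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            # Keep descending: the same key may appear again further down.
            yield from find_all(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_all(item, key)


# Hypothetical document with the key "asin" at two different depths.
doc = json.loads(
    '{"items": [{"asin": "B01MUET4LC", "ref": "a"}, {"asin": "B007JZ3UG0"}],'
    ' "top": {"asin": "B01KX6184M"}}'
)

asins = list(find_all(doc, "asin"))
print(asins)
```

A real JsonPath implementation supports far more (filters, slices, wildcards); this only mirrors the "find this key anywhere" case that is most common in scraping.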

3. An Ajax request returns a chunk of page source

  • Reference page:

https://www.amazon.com/WizGear-Universal-Swift-Snap-Technology-Smartphones/product-reviews/B00PGJWYJ0/ref=cm_cr_arp_d_paging_btm_3?ie=UTF8&reviewerType=all_reviews&pageNumber=3

  • Returned data: page source, but every double quote (") in it is escaped with a backslash (\)

  • Part of the returned data, for reference:
["script","if(window.ue) { ues('id','reviewsAjax0','NPYJW8456B30FX29R1NR');ues('t0','reviewsAjax0',new Date());ues('ctb','reviewsAjax0','1');uet('bb','reviewsAjax0'); }"]
&&&["update","#cm_cr-review_list",""]
&&&["loaded"]
&&&["append","#cm_cr-review_list","<div id=\"RTT58UH8PH6QT\" data-hook=\"review\" class=\"a-section review\"><div id=\"customer_review-RTT58UH8PH6QT\" class=\"a-section celwidget\"><div class=\"a-row\"><a class=\"a-link-normal\" title=\"4.0 out of 5 stars\" href=\"/gp/customer-reviews/RTT58UH8PH6QT/ref=cm_cr_getr_d_rvw_ttl?ie=UTF8&ASIN=B00PGJWYJ0\"><i data-hook=\"review-star-rating\" class=\"a-icon a-icon-star a-star-4 review-rating\"><span class=\"a-icon-alt\">4.0 out of 5 stars</span></i></a><span class=\"a-letter-space\"></span><a data-hook=\"review-title\" class=\"a-size-base a-link-normal review-title a-color-base a-text-bold\" href=\"/gp/customer-reviews/RTT58UH8PH6QT/ref=cm_cr_getr_d_rvw_ttl?ie=UTF8&ASIN=B00PGJWYJ0\">A good product at a good price and works well.</a></div><div class=\"a-row\"><span data-hook=\"review-author\" class=\"a-size-base a-color-secondary review-byline\"><span class=\"a-color-secondary\">By</span><span class=\"a-letter-space\"></span><a data-hook=\"review-author\" class=\"a-size-base a-link-normal author\" href=\"/gp/profile/amzn1.account.AEQY4EUKALIRYJ4IDW3MMTHU3U2Q/ref=cm_cr_getr_d_pdp?ie=UTF8\">Carroll K.</a></span><span class=\"a-declarative\" data-action=\"cr-popup\" data-cr-popup=\"{&quot;width&quot;:&quot;340&quot;,&quot;title&quot;:&quot;Help&quot;,&quot;url&quot;:&quot;/gp/help/customer/display.html/ref=cm_cr_dp_bdg_help?ie=UTF8&amp;nodeId=14279681&amp;pop-up=1#tr&quot;,&quot;height&quot;:&quot;340&quot;}\"></span><span class=\"a-letter-space\"></span><span data-hook=\"review-date\" class=\"a-size-base a-color-secondary review-date\">on July 16, 2016</span></div><div class=\"a-row a-spacing-mini review-data review-format-strip\"><span class=\"a-declarative\" data-action=\"reviews:filter-action:push-state\" data-reviews:filter-action:push-state=\"{}\"><a data-reftag=\"cm_cr_getr_d_rvw_rvwer\" data-reviews-state-param=\"{&quot;pageNumber&quot;:&quot;1&quot;,&quot;reviewerType&quot;:&quot;avp_only_reviews&quot;}\" 
class=\"a-link-normal\" href=\"/WizGear-Universal-Swift-Snap-Technology-Smartphones/product-reviews/B00PGJWYJ0/ref=cm_cr_getr_d_rvw_rvwer?ie=UTF8&reviewerType=avp_only_reviews&pageSize=10\"><span data-hook=\"avp-badge\" class=\"a-size-mini a-color-state a-text-bold\">Verified Purchase</span></a></span></div><div class=\"a-row review-data\"><span data-hook=\"review-body\" class=\"a-size-base review-text\">This product works very well for holding a cellphone on your vehicle's A/C vents but it does block some of the airflow. The blockage is not enough to cause discomfort and can be overcome if you position the round magnet on the cellphone on one end or the other and not in the middle like you might normally think. If you position the magnets in the appropriate position on the cellphone and on the vent the placement will not have much if any airflow blockage. (So take the time to go to your vehicle and see where the best placement of each magnet will provide the best solution for your particular vehicle and phone combination) I did not realize this when I first attached the magnet to the cellphone and attached it in the middle which made it more difficult to keep the phone out of the airflow. 
Learn from my mistake and you will be happy with your purchase and you will be getting a good value as well since it is one of the lower priced options for this type of product.</span></div><div class=\"a-row a-spacing-top-small review-comments comments-for-RTT58UH8PH6QT\"><div data-reftag=\"cm_cr_getr_d_cmt_opn\" aria-live=\"polite\" data-a-expander-name=\"review_comment_expander\" class=\"a-row a-expander-container a-expander-inline-container\"><a href=\"javascript:void(0)\" data-action=\"a-expander-toggle\" class=\"a-expander-header a-declarative a-expander-inline-header a-link-expander\" data-a-expander-toggle=\"{&quot;allowLinkDefault&quot;:true, &quot;expand_prompt&quot;:&quot;&quot;, &quot;collapse_prompt&quot;:&quot;&quot;}\"><i class=\"a-icon a-icon-expand\"></i><span class=\"a-expander-prompt\"><span class=\"review-comment-total aok-hidden\">0</span><span class=\"a-size-base\">Comment</span></span></a><span data-hook=\"review-voting-widget\" class=\"a-size-base cr-vote\"><span class=\"cr-vote-buttons cr-vote-component\"><span class=\"a-letter-space\"></span><i class=\"a-icon a-icon-text-separator\" aria-label=\"|\"><span class=\"a-icon-alt\">|</span></i><span class=\"a-letter-space\"></span><span class=\"a-color-secondary\">Was this review helpful to you?</span><span class=\"a-letter-space\"></span><div class=\"cr-vote-button-margin\">\n      <span class=\"a-button a-button-beside-text a-button-small cr-vote-yes cr-vote-component\"><span class=\"a-button-inner\"><a 
href=\"https://www.amazon.com/ap/signin?openid.return_to=https%3A%2F%2Fwww.amazon.com%2FWizGear-Universal-Swift-Snap-Technology-Smartphones%2Fproduct-reviews%2FB00PGJWYJ0%2Fref%3Dcm_cr_getr_d_vote_lft%3Fie%3DUTF8%26voteInstanceId%3DRTT58UH8PH6QT%26voteValue%3D1%26reviewerType%3Dall_reviews%26pageSize%3D10%26csrfT%3DgGActz1LPHZe770snBfqcAr6UB3QCtO3YMiLFY8AAAAJAAAAAFootixyYXcAAAAA&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=usflex&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0\" data-hook=\"vote-yes-button\" class=\"a-button-text\" role=\"button\"><div class=\"cr-vote-button\">\n          Yes</div>\n      </a></span></span></div>\n  <span class=\"a-letter-space\"></span><div class=\"cr-vote-button-margin\">\n      <span class=\"a-button a-button-beside-text a-button-small cr-vote-no cr-vote-component\"><span class=\"a-button-inner\"><a href=\"https://www.amazon.com/ap/signin?openid.return_to=https%3A%2F%2Fwww.amazon.com%2FWizGear-Universal-Swift-Snap-Technology-Smartphones%2Fproduct-reviews%2FB00PGJWYJ0%2Fref%3Dcm_cr_getr_d_vote_rgt%3Fie%3DUTF8%26voteInstanceId%3DRTT58UH8PH6QT%26voteValue%3D-1%26reviewerType%3Dall_reviews%26pageSize%3D10%26csrfT%3DgGActz1LPHZe770snBfqcAr6UB3QCtO3YMiLFY8AAAAJAAAAAFootixyYXcAAAAA&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=usflex&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0\" data-hook=\"vote-no-button\" class=\"a-button-text\" role=\"button\"><div class=\"cr-vote-button\">\n          No</div>\n      </a></span></span></div>\n  </span></span><span><span class=\"a-letter-space\"></span><span class=\"a-declarative\" data-action=\"cr-popup\" 
data-cr-popup=\"{&quot;width&quot;:&quot;580&quot;,&quot;title&quot;:&quot;ReportAbuse&quot;,&quot;url&quot;:&quot;/gp/voting/cast/Reviews/2115/RTT58UH8PH6QT/Inappropriate/1/ref=cm_cr_getr_d_rvw_hlp?ie=UTF8&amp;target=L3NzL2N1c3RvbWVyLXJldmlld3MvYWpheC9yZXZpZXdzL2dldC9yZWY9Y21fY3JfYXJwX2RfcGFnaW5nX2J0bV8z&amp;token=A27F9D5E07C938C6950AB49E02E7650B02FB3F07&amp;voteAnchorName=RTT58UH8PH6QT.2115.Inappropriate.Reviews&amp;voteSessionID=136-4489048-3064812&amp;type=popup&quot;,&quot;height&quot;:&quot;380&quot;}\"><a class=\"a-size-base a-link-normal a-color-tertiary report-abuse-link a-text-normal\" href=\"/gp/voting/cast/Reviews/2115/RTT58UH8PH6QT/Inappropriate/1/ref=cm_cr_getr_d_rvw_hlp?ie=UTF8&target=L3NzL2N1c3RvbWVyLXJldmlld3MvYWpheC9yZXZpZXdzL2dldC9yZWY9Y21fY3JfYXJwX2RfcGFnaW5nX2J0bV8z&token=A27F9D5E07C938C6950AB49E02E7650B02FB3F07&voteAnchorName=RTT58UH8PH6QT.2115.Inappropriate.Reviews&voteSessionID=136-4489048-3064812&type=popup\">Report abuse</a></span></span><div aria-expanded=\"false\" class=\"a-expander-content a-spacing-top-base  a-spacing-large a-expander-inline-content a-expander-inner\" style=\"display:none\"><div class=\"a-column a-span12 a-text-right a-spacing-base\"><span class=\"a-declarative\" data-action=\"reviews:open-comment-submission\" data-reviews:open-comment-submission=\"{}\"><span class=\"a-button open-comment-submission-button aok-hidden\"><span class=\"a-button-inner\"><input data-reftag=\"cm_cr_getr_d_cmt_submn\" class=\"a-button-input\" type=\"submit\"><span class=\"a-button-text\" aria-hidden=\"true\">Comment</span></span></span></span></div><div class=\"a-section a-spacing-base comment-submission\"><div class=\"a-row a-spacing-top-small a-grid-vertical-align a-grid-bottom\"><div class=\"a-column a-span8\"></div><div class=\"a-column a-span4 a-text-right a-span-last\"><span class=\"a-declarative\" data-action=\"a-modal\" 
data-a-modal=\"{&quot;name&quot;:&quot;insertProductLink-RTT58UH8PH6QT&quot;,&quot;activate&quot;:&quot;onclick&quot;,&quot;width&quot;:&quot;450&quot;,&quot;header&quot;:&quot;Insert product link&quot;,&quot;position&quot;:&quot;triggerBottom&quot;}\"><span class=\"a-button a-button-normal a-spacing-mini a-button-small product-link-trigger\"><span class=\"a-button-inner\"><input class=\"a-button-input\" type=\"submit\" value=\"submit\"><span class=\"a-button-text a-text-center\" aria-hidden=\"true\">Insert product link</span></span></span></span><div class=\"a-popover-preload\" id=\"a-popover-insertProductLink-RTT58UH8PH6QT\"><div id=\"insertProductLink-RTT58UH8PH6QT\" class=\"a-section a-spacing-none product-link-popover\"><div class=\"a-section a-spacing-none search-product-link-section\"><div class=\"a-row\"><div class=\"a-section a-spacing-small\"><span class=\"a-size-base a-color-tertiary product-link-instruction\">Paste the product's web address below:</span></div></div><div class=\"a-row\"><span class=\"a-declarative\" data-action=\"reviews:input-product-link\" data-reviews:input-product-link=\"{}\"><input type=\"text\" placeholder=\"http://...\" class=\"a-input-text a-width-extra-large a-spacing-small product-link-input\"></span><span class=\"a-spinner a-spinner-small product-link-spinner aok-hidden\"></span><div class=\"a-section a-spacing-none a-spacing-top-small\"><div class=\"a-section a-spacing-none product-link-not-found-section aok-hidden\"><span class=\"a-size-base a-color-error product-link-not-found a-text-italic\">Product not found. 
Only products offered on Amazon can be linked.</span></div></div></div></div><div class=\"a-section a-spacing-base a-spacing-top-base product-link-section aok-hidden\"><hr class=\"a-divider-normal\"><div class=\"a-section a-spacing-base product-link-found-section\"><div class=\"a-fixed-left-grid product-links-item-template\"><div class=\"a-fixed-left-grid-inner\" style=\"padding-left:100px\"><div class=\"a-fixed-left-grid-col a-col-left\" style=\"width:100px;margin-left:-100px;_margin-left:-50px;float:left;\"><img alt=\"\" src=\"https://images-na.ssl-images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V192234675_.gif\" class=\"product-link-image\" height=\"80\" width=\"80\"></div><div class=\"a-fixed-left-grid-col a-col-right\" style=\"padding-left:2.5%;*width:97.1%;float:left;\"><h4><span class=\"a-size-base a-color-tertiary product-link-title productLinkTextOverflow a-text-bold\"></span></h4><span class=\"product-link-creator productLinkTextOverflow\"></span><br><span class=\"a-declarative\" data-action=\"reviews:select-product-link\" data-reviews:select-product-link=\"{}\"><span class=\"a-button a-spacing-top-base a-button-beside-text a-button-small a-button-width-normal select-product-link-button\"><span class=\"a-button-inner\"><input data-reftag=\"cm_cr_getr_d_cmt_pl\" data-reviews-state-param=\"{&quot;parentId&quot;:&quot;RTT58UH8PH6QT&quot;, &quot;asin&quot;:&quot;&quot;}\" class=\"a-button-input\" type=\"submit\"><span class=\"a-button-text\" aria-hidden=\"true\">Select</span></span></span></span></div></div></div></div></div></div></div><span class=\"a-letter-space\"></span><span class=\"a-declarative\" data-action=\"a-modal\" data-a-modal=\"{&quot;name&quot;:&quot;productLinkWhatsThis&quot;,&quot;activate&quot;:&quot;onclick&quot;,&quot;width&quot;:&quot;600&quot;,&quot;header&quot;:&quot;What's this?&quot;}\"><a href=\"javascript:void(0)\" class=\"a-popover-trigger a-declarative\">What's this?<i class=\"a-icon 
a-icon-popover\"></i></a></span><div class=\"a-popover-preload\" id=\"a-popover-productLinkWhatsThis\"><h1 class=\"a-size-base a-spacing-medium a-color-state a-text-bold\">What are product links?</h1><p><div class=\"a-row a-spacing-medium\">In the text of your review, you can link directly to any product offered on Amazon.com. To insert a product link, follow these steps:</div><div class=\"a-row a-spacing-medium\">1. Find the product you want to reference on Amazon.com</div><div class=\"a-row a-spacing-medium\">2. Copy the web address of the product</div><div class=\"a-row a-spacing-medium\">3. Click <span class=\"a-text-bold\">Insert product link</span></div><div class=\"a-row a-spacing-medium\">4. Paste the web address in the box</div><div class=\"a-row a-spacing-medium\">5. Click <span class=\"a-text-bold\">Select</span></div><div class=\"a-row a-spacing-medium\">6. Selecting the item displayed will insert text that looks like this: <span>\n        [[ASIN:014312854X</span><span class=\"a-letter-space\"></span><span>Hamlet (The Pelican Shakespeare)]]\n      </span></div><div class=\"a-row a-spacing-medium\"><span>7. 
When your review is displayed on Amazon.com, this text will be transformed into a hyperlink, like so:</span><span class=\"a-letter-space\"></span><a class=\"a-link-normal\" href=\"/gp/product/014312854X\">Hamlet (The Pelican Shakespeare)</a></div></p><p class=\"a-spacing-top-extra-large\">You are limited to 10 product links in your review, and your link text may not be longer than 256 characters.</p></div></div></div><span class=\"a-declarative\" data-action=\"reviews:input-comment\" data-reviews:input-comment=\"{}\"><input type=\"hidden\" name=\"csrfT\" value=\"gGActz1LPHZe770snBfqcAr6UB3QCtO3YMiLFY8AAAAJAAAAAFootixyYXcAAAAA\"><label for=\"comment-text-area-gGActz1LPHZe770snBfqcAr6UB3QCtO3YMiLFY8AAAAJAAAAAFootixyYXcAAAAA\" class=\"a-form-label aok-offscreen\">Talk about why you like this review, or ask a question.</label><div data-reftag=\"cm_cr_getr_d_cmt_txt\" class=\"a-input-text-wrapper a-spacing-top-mini comment-text-area\"><textarea placeholder=\"Talk about why you like this review, or ask a question.\" id=\"comment-text-area-gGActz1LPHZe770snBfqcAr6UB3QCtO3YMiLFY8AAAAJAAAAAFootixyYXcAAAAA\" rows=\"2\" disabled=\"disabled\" class=\"a-form-disabled\"></textarea></div></span><div class=\"a-row a-spacing-top-small a-grid-vertical-align a-grid-top\"><div class=\"a-column a-span8\"><div class=\"a-box a-alert-inline a-alert-inline-error comment-no-text-error aok-hidden\" aria-live=\"assertive\" role=\"alert\"><div class=\"a-box-inner a-alert-container\"><i class=\"a-icon a-icon-alert\"></i><div class=\"a-alert-content\">Please write at least one word</div></div></div><div class=\"a-box a-alert-inline a-alert-inline-error comment-banned-error aok-hidden\" aria-live=\"assertive\" role=\"alert\"><div class=\"a-box-inner a-alert-container\"><i class=\"a-icon a-icon-alert\"></i><div class=\"a-alert-content\">You must be in good standing in the Amazon community to post.</div></div></div><div class=\"a-box a-alert-inline a-alert-inline-error comment-potty-mouth-error 
aok-hidden\" aria-live=\"assertive\" role=\"alert\"><div class=\"a-box-inner a-alert-container\"><i class=\"a-icon a-icon-alert\"></i><div class=\"a-alert-content\">Your message will not be posted. Please see our guidelines regarding objectionable content.</div></div></div><div class=\"a-box a-alert-inline a-alert-inline-error comment-prs-error aok-hidden\" aria-live=\"assertive\" role=\"alert\"><div class=\"a-box-inner a-alert-container\"><i class=\"a-icon a-icon-alert\"></i><div class=\"a-alert-content\">You must purchase at least one item from Amazon to post a comment</div></div></div><div class=\"a-box a-alert-inline a-alert-inline-error comment-submit-error aok-hidden\" aria-live=\"assertive\" role=\"alert\"><div class=\"a-box-inner a-alert-container\"><i class=\"a-icon a-icon-alert\"></i><div class=\"a-alert-content\">A problem occurred while submitting your comment. Please try again later.</div></div></div></div><div class=\"a-column a-span4 a-text-right a-span-last\"><a href=\"https://www.amazon.com/gp/forum/content/db-guidelines.html?ie=UTF8\" target=\"_blank\">\n        Guidelines</a>\n      <span class=\"a-button submit-comment-button sign-in\"><span class=\"a-button-inner\"><a href=\"https://www.amazon.com/ap/signin?openid.return_to=https%3A%2F%2Fwww.amazon.com%2FWizGear-Universal-Swift-Snap-Technology-Smartphones%2Fproduct-reviews%2FB00PGJWYJ0%2Fref%3Dcm_cr_getr_d_cmt_btn%3Fie%3DUTF8%26pageNumber%3D3%26expandedComment%3DRTT58UH8PH6QT%26openCommentSubmission%3Dtrue%26reviewerType%3Dall_reviews%26pageSize%3D10&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=usflex&openid.mode=checkid_setup&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0\" class=\"a-button-text\" role=\"button\">Sign in to comment</a></span></span></div></div></div><div class=\"a-row a-spacing-mini review-comments-header aok-hidden\"><ul 
class=\"a-viewoptions-list a-viewoptions-section a-span12\">\n    <div class=\"a-row a-spacing-none a-grid-vertical-align a-grid-center\"><div class=\"a-column a-span6\"><span class=\"a-size-base a-viewoptions-list-label\">Showing <span class='review-comment-count'>0</span> comments</span></div><div class=\"a-column a-span6 a-text-right a-span-last\"><span class=\"a-size-base a-viewoptions-list-label\">Sort by:<span class=\"a-letter-space\"></span></span><span class=\"a-declarative\" data-action=\"reviews:sort-comments\" data-reviews:sort-comments=\"{}\"><li class=\"sort-newest a-viewoptions-list-item a-selected a-color-state\">\n            <a data-reftag=\"cm_cr_getr_d_cmt_lft\" class=\"a-link-normal a-selected a-color-state\" href=\"#\">Newest</a></li>\n          <li class=\"sort-oldest a-viewoptions-list-item\">\n            <a data-reftag=\"cm_cr_getr_d_cmt_rgt\" class=\"a-link-normal\" href=\"#\">Oldest</a></li>\n        </span></div></div></ul>\n</div><div class=\"a-section a-spacing-extra-large a-spacing-top-medium a-text-center comment-load-error aok-hidden\"><div class=\"a-box a-alert a-alert-error cr-error\" aria-live=\"assertive\" role=\"alert\"><div class=\"a-box-inner a-alert-container\"><h4 class=\"a-alert-heading\">There was a problem loading comments right now. Please try again later.</h4><i class=\"a-icon a-icon-alert\"></i><div class=\"a-alert-content\"></div></div></div></div><div id=\"RTT58UH8PH6QT\" class=\"a-section a-spacing-none review-comments\"></div><div class=\"a-spinner-wrapper comment-loading aok-hidden a-spacing-top-medium a-spacing-extra-large\"><span class=\"a-spinner a-spinner-medium\"></span></div><hr class=\"a-spacing-none a-spacing-top-large a-divider-normal\"></div></div></div></div></div>"]
&&&
    # Step 1: strip the backslashes
    # Step 2: parse the result with XPath, just like ordinary page source
    import lxml.etree

    # Parse the review data returned by the Ajax request
    def parseReviewsAjaxData(self, response):
        print(f"parseReviewsAjaxData url={response.url}, response.meta = {response.meta}")
        print(f"ajaxData = {response.text}")
        # The response escapes every " with a backslash, so strip all backslashes first.
        # Caveat: this also turns an escaped newline (\n) into a literal "n" in the text.
        # Then parse with lxml's own XPath; unlike Scrapy selectors, no extract() is needed.
        content = response.text.replace("\\", "")
        # print(f"content=\n{content}")
        resultTree = lxml.etree.HTML(content)
        reviewLst = resultTree.xpath("//div[@class='a-section review']")
        for elem in reviewLst:
            # Review ID, e.g. 'R3QS2YXG0S7MA8'
            reviewID = elem.xpath("@id")[0]
            # Star rating, e.g. '5.0 out of 5 stars'
            star = elem.xpath(".//i[@data-hook='review-star-rating']/span/text()")[0]
            # Reviewer, e.g. 'janette McKinnon'
            author = elem.xpath(".//a[@data-hook='review-author']/text()")[0]
            # Review date, e.g. 'on September 10, 2017'
            reviewDate = elem.xpath(".//span[@data-hook='review-date']/text()")[0]
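An alternative worth considering, sketched here rather than taken from the original approach: the sample response looks like a series of JSON arrays separated by &&&, so each chunk can be decoded with json.loads, which undoes the backslash escaping correctly (\" becomes " and \n becomes a real newline). The sketch assumes every chunk is valid JSON and that &&& never occurs inside the HTML itself; SAMPLE_RESPONSE and extract_appended_html are hypothetical names, and the sample is a heavily shortened stand-in for the real data.

```python
import json

# Shortened, hypothetical stand-in for the Ajax response shown above.
SAMPLE_RESPONSE = r'''["script","if(window.ue) { uet('bb','reviewsAjax0'); }"]
&&&["update","#cm_cr-review_list",""]
&&&["loaded"]
&&&["append","#cm_cr-review_list","<div id=\"RTT58UH8PH6QT\" data-hook=\"review\" class=\"a-section review\">\n  Yes</div>"]
&&&
'''


def extract_appended_html(body):
    """Return the HTML fragment of every ["append", selector, html] command."""
    fragments = []
    for chunk in body.split("&&&"):
        chunk = chunk.strip()
        if not chunk:
            continue
        try:
            command = json.loads(chunk)
        except json.JSONDecodeError:
            continue  # not valid JSON: skip rather than guess
        if isinstance(command, list) and len(command) == 3 and command[0] == "append":
            fragments.append(command[2])
    return fragments


fragments = extract_appended_html(SAMPLE_RESPONSE)
print(fragments[0])
```

Each fragment is then plain HTML that can be handed to lxml.etree.HTML() or a Scrapy Selector and queried with the same XPath expressions as before.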
