Building a Book Price Comparison Tool in Python

Contents

  • Building a Book Price Comparison Tool in Python
    • 1 Features
    • 2 Screenshots
    • 3 Code
      • 3.1 Dangdang
      • 3.2 JD.com
      • 3.3 Yhd (Yihaodian)
      • 3.4 Taobao
    • 4 References

1 Features

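The user enters a book's ISBN. The tool then crawls the first page of search results on Dangdang, JD.com, Yhd (Yihaodian), and Taobao in turn, and displays the results sorted by price from high to low.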
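The spiders in section 3 collect and print results, but the final ordering step is not shown there. A minimal sketch of that step, assuming each entry is a dict with a string `price` field as built by the `spider` functions in section 3. The helper name `sort_by_price` and the sample data are illustrative and not part of the original code.

```python
def sort_by_price(books):
    """Return a new list of book dicts ordered from highest to lowest price."""
    def price_key(book):
        # Prices are scraped as strings such as '35.80'; convert them so the
        # comparison is numeric rather than alphabetical.
        try:
            return float(book['price'])
        except (KeyError, TypeError, ValueError):
            return float('-inf')  # entries without a usable price sort last

    return sorted(books, key=price_key, reverse=True)


# Illustrative data only (not real listings)
sample = [
    {'title': 'A', 'price': '9.90', 'store': 'shop-1'},
    {'title': 'B', 'price': '35.80', 'store': 'shop-2'},
    {'title': 'C', 'price': '12.00', 'store': 'shop-3'},
]
for book in sort_by_price(sample):
    print(book['price'], book['title'], book['store'])
```

Converting to `float` matters here: compared as plain strings, `'9.90'` would rank above `'35.80'`.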

2 Screenshots


3 Code

3.1 Dangdang

import requests
from lxml import html


def spider(sn, books=None):
    """Scrape the first page of Dangdang search results for ISBN `sn`."""
    if books is None:  # avoid sharing one mutable default list between calls
        books = []
    url = 'http://search.dangdang.com/?key={sn}&act=input'.format(sn=sn)
    # Fetch the HTML. Note: do not name this variable `html`, or it will
    # shadow the `html` module imported from lxml.
    html_data = requests.get(url).text
    # Parse into an element tree for XPath queries
    selector = html.fromstring(html_data)
    # Locate the list of books
    lis = selector.xpath('//div[@id="search_nature_rg"]/ul/li')
    for li in lis:
        print('---------------------------')
        # Title
        title = li.xpath('./a/@title')[0]
        print(title)
        # Purchase link
        link = li.xpath('./a/@href')[0]
        print(link)
        # Price; fall back to the "ebook_buy" block when the regular node is missing
        origin_price = li.xpath('./p[contains(@class,"price")]/span[@class="search_now_price"]/text()')
        if not origin_price:
            origin_price = li.xpath('./div[contains(@class,"ebook_buy")]/p[contains(@class,"price")]/span/text()')
        price = origin_price[0].replace('¥', '')
        print(price)
        # Seller; listings without a shop name are labeled as sold by Dangdang itself
        store = li.xpath('./p[@class="search_shangjia"]/a[@name="itemlist-shop-name"]/text()')
        if not store:
            store = ['当当自营']
        print(store[0])
        books.append({'title': title, 'link': link, 'price': price, 'store': store[0]})
    return books


if __name__ == '__main__':
    sn = 9787208061644
    spider(sn)
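Both the price lookup and the seller lookup above follow the same pattern: run one XPath query and fall back to something else when it returns an empty list. A small sketch of that pattern as a helper, assuming (as in the code above) that each query result is a plain Python list. `first_or_default` is a hypothetical name introduced here, and the lists below are stand-ins for real XPath results.

```python
def first_or_default(*candidates, default=None):
    """Return the first item of the first non-empty list, else `default`."""
    for result in candidates:
        if result:
            return result[0]
    return default


# Stand-ins for XPath results: the primary price query found nothing,
# the fallback query found one text node.
primary = []
fallback = ['¥12.50']
price_text = first_or_default(primary, fallback, default='')
print(price_text.replace('¥', ''))

# No shop name on the listing, so the default label is used.
print(first_or_default([], default='当当自营'))
```

Using a default also avoids the `IndexError` that `result[0]` raises when every query comes back empty.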

3.2 JD.com

import requests
from lxml import html


def spider(sn, books=None):
    """Scrape the first page of JD.com search results for ISBN `sn`."""
    if books is None:  # avoid sharing one mutable default list between calls
        books = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    }
    url = 'https://search.jd.com/Search?keyword={sn}'.format(sn=sn)
    html_data = requests.get(url, headers=headers).content.decode('utf-8')
    selector = html.fromstring(html_data)
    lis = selector.xpath('//div[@id="J_goodsList"]/ul/li')
    for li in lis:
        print('---------------------------')
        # Title
        title = li.xpath('./div/div[@class="p-name"]/a/em/text()')[0]
        print(title)
        # Purchase link (the scraped href has no scheme, so prepend one)
        link = 'https:' + li.xpath('./div/div[@class="p-img"]/a/@href')[0]
        print(link)
        # Price
        price = li.xpath('./div/div[@class="p-price"]//i/text()')[0].replace('¥', '')
        print(price)
        # Seller. Note: XPath positions start at 1, so i[1] is the first <i>.
        store = li.xpath('./div/div[@class="p-icons"]/i[1]/text()')
        if store != ['自营']:  # not sold by JD itself: use the shop name instead
            store = li.xpath('./div/div[@class="p-shopnum"]/a/@title')
        print(store[0])
        books.append({'title': title, 'link': link, 'price': price, 'store': store[0]})
    return books


if __name__ == '__main__':
    sn = 9787115428028
    spider(sn)
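The code above prepends `'https:'` to each link, which assumes every `href` is protocol-relative (starting with `//`). A hedged alternative is the standard library's `urllib.parse.urljoin`, which resolves such links and leaves already-absolute ones untouched. The helper name `absolute_link` and the sample links are illustrative.

```python
from urllib.parse import urljoin

BASE = 'https://search.jd.com/'  # the page the links were scraped from


def absolute_link(href, base=BASE):
    """Resolve a scraped href against the URL of the page it came from."""
    return urljoin(base, href)


# Protocol-relative link: the scheme is taken from the base URL
print(absolute_link('//item.jd.com/12345.html'))
# Already absolute: returned unchanged
print(absolute_link('https://item.jd.com/12345.html'))
```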

3.3 Yhd (Yihaodian)

import requests
from lxml import html


def spider(sn, books=None):
    """Scrape the first page of Yhd search results for ISBN `sn`."""
    if books is None:  # avoid sharing one mutable default list between calls
        books = []
    url = 'https://search.yhd.com/c0-0/k{0}'.format(sn)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36',
    }
    html_data = requests.get(url, headers=headers).content.decode('utf-8')
    selector = html.fromstring(html_data)
    divs = selector.xpath('//div[@id="itemSearchList"]/div')
    for div in divs:
        print('---------------------------')
        # Title
        title = div.xpath('./div/p[contains(@class,"proName")]/a/@title')[0]
        print(title)
        # Purchase link (the scraped href has no scheme, so prepend one)
        link = 'https:' + div.xpath('./div/p[contains(@class,"proName")]/a/@href')[0]
        print(link)
        # Price, read from the yhdprice attribute
        price = div.xpath('./div/p[@class="proPrice"][1]/em/@yhdprice')[0]
        print(price)
        # Seller
        store = div.xpath('./div/p[contains(@class,"searh_shop_storeName")]/span/text()')
        if store != ['自营']:  # not sold by Yhd itself: use the shop name instead
            store = div.xpath('./div/p[contains(@class,"searh_shop_storeName")]/a/@title')
        print(store[0])
        books.append({'title': title, 'link': link, 'price': price, 'store': store[0]})
    return books


if __name__ == '__main__':
    sn = 9787115428028
    spider(sn)

3.4 Taobao

Remember to fill in your cookie in the request headers before running this script.
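One way to supply the cookie without hard-coding it in the script is to read it from an environment variable. A minimal sketch: the variable name `TAOBAO_COOKIE` and the helper `build_headers` are assumptions made for illustration, and the cookie value itself would typically be copied from a browser session.

```python
import os


def build_headers(user_agent, cookie=None):
    """Build request headers, taking the cookie from the environment if not given."""
    if cookie is None:
        cookie = os.environ.get('TAOBAO_COOKIE', '')  # hypothetical variable name
    if not cookie:
        raise ValueError('No cookie supplied; set TAOBAO_COOKIE or pass cookie=...')
    return {'User-Agent': user_agent, 'cookie': cookie}


# Placeholder values; a real cookie string goes here or in the environment
headers = build_headers('Mozilla/5.0', cookie='name=value')
print(headers)
```

Failing early with a clear error is a deliberate choice here: an empty cookie would otherwise only surface later as a parsing failure on the returned page.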

import requests
import json
import re
import random


def spider(sn, books=None):
    """Scrape the first page of Taobao search results for ISBN `sn`."""
    if books is None:  # avoid sharing one mutable default list between calls
        books = []
    DATA = []
    url = 'https://s.taobao.com/search?q={0}&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20191027&ie=utf8'.format(sn)
    # Pool of User-Agent strings; one is picked at random per request
    user_agents = [
        "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_2 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8H7 Safari/6533.18.5",
        "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_2 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8H7 Safari/6533.18.5",
        "MQQBrowser/25 (Linux; U; 2.3.3; zh-cn; HTC Desire S Build/GRI40;480*800)",
        "Mozilla/5.0 (Linux; U; Android 2.3.3; zh-cn; HTC_DesireS_S510e Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (SymbianOS/9.3; U; Series60/3.2 NokiaE75-1 /110.48.125 Profile/MIDP-2.1 Configuration/CLDC-1.1 ) AppleWebKit/413 (KHTML, like Gecko) Safari/413",
        "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Mobile/8J2",
        "Mozilla/5.0 (Windows NT 5.2) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.122 Safari/534.30",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.202 Safari/535.1",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_2) AppleWebKit/534.51.22 (KHTML, like Gecko) Version/5.1.1 Safari/534.51.22",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A5313e Safari/7534.48.3",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A5313e Safari/7534.48.3",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A5313e Safari/7534.48.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.202 Safari/535.1",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; SAMSUNG; OMNIA7)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; XBLWP7; ZuneWP7)",
        "Mozilla/5.0 (Windows NT 5.2) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.122 Safari/534.30",
        "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.2; Trident/4.0; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C)",
        "Mozilla/4.0 (compatible; MSIE 60; Windows NT 5.1; SV1; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
        "Opera/9.80 (Windows NT 5.1; U; zh-cn) Presto/2.9.168 Version/11.50",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
        "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.648; .NET CLR 3.5.21022; .NET4.0E; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET4.0C)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/533.21.1 (KHTML, like Gecko) Version/5.0.5 Safari/533.21.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; ) AppleWebKit/534.12 (KHTML, like Gecko) Maxthon/3.0 Safari/534.12",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 2.0.50727; TheWorld)",
    ]
    headers = {
        'User-Agent': random.choice(user_agents),
        'referer': 'https://s.taobao.com/search?q=9787115428028&imgfile=&js=1&stats_click=search_radio_all%3A1&initiative_id=staobaoz_20191027&ie=utf8',
        'cookie': '',  # put your cookie string here
    }
    # requests looks up proxies by lowercase URL scheme, so the key must be
    # 'http', not 'HTTP'. This entry only covers http:// URLs; add an 'https'
    # entry if the https:// request above should go through a proxy too.
    proxy = {'http': '139.199.19.174:8114'}
    html_data = requests.get(url, headers=headers, proxies=proxy).text
    print(html_data)
    # For offline debugging, read a saved copy of the page instead:
    # f = open('./static/test.html', 'r', encoding='utf-8')
    # html_data = f.read()
    # The result data is embedded in the page as `g_page_config = {...};`
    content = re.findall(r'g_page_config = (.*?)g_srp_loadCss', html_data, re.S)[0]
    # Trim whitespace and drop the last character (the trailing ';')
    content = content.strip()[:-1]
    # Parse the JSON text into a dict
    content = json.loads(content)
    # Path to the listings, found by inspecting the JSON in an online viewer
    data_list = content['mods']['itemlist']['data']['auctions']
    # Extract the fields of interest
    for item in data_list:
        temp = {
            'title': re.findall('(.*?)<span', item['title'])[0],
            'view_price': item['view_price'],
            'view_sales': item['view_sales'],
            # Free shipping? '是' (yes) when the fee is 0, otherwise '否' (no)
            'view_fee': '否' if float(item['view_fee']) else '是',
            'isTmall': '是' if item['shopcard']['isTmall'] else '否',
            'area': item['item_loc'],
            'name': item['nick'],
            'detail_url': item['detail_url'],
        }
        print('------------------------')
        # Title
        title = temp['title']
        print(title)
        # Purchase link
        link = 'https:' + temp['detail_url']
        print(link)
        # Price
        price = temp['view_price']
        print(price)
        # Seller
        store = temp['name']
        print(store)
        DATA.append(temp)
        books.append({'title': title, 'link': link, 'price': price, 'store': store})
    return books


if __name__ == '__main__':
    sn = 9787115428028
    spider(sn)
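The extraction step above relies on the search page embedding its data as a JavaScript assignment, `g_page_config = {...};`, followed by `g_srp_loadCss`. The sketch below replays that step on a tiny hand-written stand-in for the page; the sample HTML and its values are made up for illustration. It uses `rstrip(';')` rather than slicing off the last character, so a missing semicolon does not corrupt the JSON, and it returns `None` instead of raising when the marker is absent.

```python
import json
import re

# Made-up stand-in for the downloaded page
sample_html = '''
<script>
    g_page_config = {"mods": {"itemlist": {"data": {"auctions": [
        {"title": "Demo <span class=H>book</span>", "view_price": "25.00", "nick": "demo-shop"}
    ]}}}};
    g_srp_loadCss();
</script>
'''


def extract_page_config(html_text):
    """Return the embedded g_page_config object as a dict, or None if absent."""
    match = re.search(r'g_page_config = (.*?)g_srp_loadCss', html_text, re.S)
    if match is None:
        return None  # the page did not contain the expected assignment
    raw = match.group(1).strip().rstrip(';')
    return json.loads(raw)


config = extract_page_config(sample_html)
for item in config['mods']['itemlist']['data']['auctions']:
    print(item['view_price'], item['nick'])
```

Checking for `None` before indexing into the result gives a clearer failure than the `IndexError` raised by `re.findall(...)[0]` when the page layout differs from what the regular expression expects.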

4 References

  • imooc (慕课网): 手把手教你把Python应用到实际开发 ("A hands-on guide to applying Python in real-world development")
