首先看到这么多朋友浏览,证明对大家还是有帮助的,感谢大家的关注!因为文章是在一年前写的,网站更新了不少,下面的代码已经不适用,特意更新了下,供大家参考。

首先,我们输入我们的主题,看到如下:

我们会发现网站的域名变了,然后我们看下页面源代码:

正如我们看到的,网页的内容在源代码中是隐藏的,那我们在看下post请求的关键性信息,少了很多,而且好像也没什么太大用。

用我们最原始的post携带参数已经行不通了。天无绝人之路,我们直接简单粗暴,用selenium模拟点击,这个得配置一下浏览器插件。

代码如下:

from selenium.webdriver import Chrome

from selenium.webdriver.chrome.options import Options

import time

import random

user_agent = [

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",

"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",

"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",

"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",

"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",

"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",

"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",

"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",

"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",

"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",

"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",

"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",

"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",

"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",

"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",

"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",

"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",

"Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",

"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",

"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",

"Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"

]

headers = {

'User-Agent': random.choice(user_agent)

}

def crawl(url):

chrome_options = Options()

chrome_options.add_argument('--disable-gpu') # 禁用浏览器正在被自动化程序控制的提示

chrome_options.add_argument('--blink-settings=imagesEnabled=false') # 禁止图片加载

chrome_options.add_experimental_option('excludeSwitches', ['enable-automation']) # 防止被发现

driver = Chrome(options=chrome_options,

executable_path=r"C:\Users\cherich\AppData\Local\Google\Chrome\Application\chromedriver.exe")

driver.maximize_window()

driver.get(url)

time.sleep(2)

content = driver.find_element_by_class_name("result-table-list").text

# 这里包括标题、作者、来源、发表时间、论文类型、下载量,需要什么可以自己解析

time.sleep(7)

print(content)

# 提取详情页链接

#for link in driver.find_elements_by_xpath("//*[@href]"):

#print(link.get_attribute('href'))

time.sleep(10)

# 翻页

driver.find_element_by_link_text('下一页').click()

time.sleep(7)

a = driver.find_element_by_class_name("result-table-list").text

print(a)

driver.quit()

if __name__ == '__main__':

url = 'https://kns8.cnki.net/kns/DefaultResult/Index?dbcode=SCDB&kw=%E5%8D%B7%E7%A7%AF%E7%A5%9E%E7%BB%8F%E7%BD%91%E7%BB%9C&korder=SU'

crawl(url)

结果如下:

概览页包括标题、作者、来源、发表时间、论文类型、下载量,需要什么可以自己解析一下,如果要详情页的信息,比如摘要,也可以,需要自己拼接url。这里我没有解析,需要的话自行解析。画红线的部分,是变量,需要从下面的代码中提取的。

for link in driver.find_elements_by_xpath("//*[@href]"):

print(link.get_attribute('href'))

详情页内容采集代码如下:

from selenium.webdriver import Chrome

from selenium.webdriver.chrome.options import Options

import time

import random

user_agent = [

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",

"Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",

"Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",

"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",

"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",

"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",

"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",

"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",

"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",

"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",

"Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",

"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",

"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",

"Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",

"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 LBBROWSER",

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",

"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)",

"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",

"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",

"Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",

"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0b13pre) Gecko/20110307 Firefox/4.0b13pre",

"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:16.0) Gecko/20100101 Firefox/16.0",

"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",

"Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10"

]

headers = {

'User-Agent': random.choice(user_agent)

}

def crawl(url):

chrome_options = Options()

chrome_options.add_argument('--disable-gpu') # 禁用浏览器正在被自动化程序控制的提示

chrome_options.add_argument('--blink-settings=imagesEnabled=false') # 禁止图片加载

chrome_options.add_experimental_option('excludeSwitches', ['enable-automation']) # 防止被发现

driver = Chrome(options=chrome_options,

executable_path=r"C:\Users\cherich\AppData\Local\Google\Chrome\Application\chromedriver.exe")

driver.maximize_window()

driver.get(url)

time.sleep(2)

content = driver.find_element_by_class_name("brief").text

# 这里包括标题、作者、摘要、关键词、发表时间、分类号等等

time.sleep(7)

print(content)

time.sleep(10)

driver.quit()

if __name__ == '__main__':

url = 'https://kns8.cnki.net/KCMS/detail/detail.aspx?dbcode=CPFD&dbname=CPFDLAST2020&filename=ZGXD202011001005&v=基于Mo%20E的倒闸操作过程识别研究'

crawl(url)

结果如下:

代码实现方式已经提供给大家。

如果本文的内容对大家的学习或者工作能带来一定的帮助,记得点个赞❤

python3.6爬虫源代码_基于Python3.6爬虫 采集知网文献(更新)相关推荐

  1. python3 网站状态监控_基于python3监控服务器状态进行邮件报警

    在正式的生产环境中,我们常常会需要监控服务器的状态,以保证公司整个业务的正常运转,常常我们会用到像nagios.zabbix这类工具进行实时监控,那么用python我们怎么进行监控呢?这里我们利用了p ...

  2. python爬虫源代码_零基础自学爬虫(5)B站有哪些爬虫的视频学习资源-附Python源代码...

    前几天看到有人提问:.b站哪个python爬虫视频讲的较好?谢谢各位能解答一下.? 于是顺手写了一个小爬虫,把数据爬了下来. 今天有空放一下源代码. 数据源,是在B站搜索框直接搜索"爬虫&q ...

  3. python抓取文献关键信息,python爬虫——使用selenium爬取知网文献相关信息

    python爬虫--使用selenium爬取知网文献相关信息 写在前面: 本文章限于交流讨论,请不要使用文章的代码去攻击别人的服务器 如侵权联系作者删除 文中的错误已经修改过来了,谢谢各位爬友指出错误 ...

  4. 基于多重继承与信息内容的知网词语相似度计算 - 论文及代码讲解

    文章目录 概念 example.py HybridSim.py howNet.py 论文:<基于多重继承与信息内容的知网词语相似度计算>-2017-张波,陈宏朝等 查看 代码:https: ...

  5. python获取app信息的库_基于python3抓取pinpoint应用信息入库

    这篇文章主要介绍了基于python3抓取pinpoint应用信息入库,文中通过示例代码介绍的非常详细,对大家的学习或者工作具有一定的参考学习价值,需要的朋友可以参考下 Pinpoint是用Java编写 ...

  6. python逗号代码_基于Python3 逗号代码 和 字符图网格(详谈)

    逗号代码 假定有下面这样的列表: spam=['apples','bananas','tofu',' cats'] 编写一个函数,它以一个列表值作为参数,返回一个字符串.该字符串包含所有表项,表项之间 ...

  7. python代码检查工具_基于Python3的漏洞检测工具 ( Python3 插件式框架 )

    [TOC] Python3 漏洞检测工具 -- lance lance, a simple version of the vulnerability detection framework based ...

  8. python3类的继承详解_基于python3 类的属性、方法、封装、继承详解

    下面小编就为大家带来一篇基于python3 类的属性.方法.封装.继承实例讲解.小编觉得挺不错的,现在就分享给大家,也给大家做个参考.一起跟随小编过来看看吧 Python 类 Python中的类提供了 ...

  9. python3web库_基于 Python3 写的极简版 webserver

    基于 Python3 写的极简版 webserver.用于学习 HTTP协议,及 WEB服务器 工作原理.笔者对 WEB服务器 的工作原理理解的比较粗浅,仅是基于个人的理解来写的,存在很多不足和漏洞, ...

最新文章

  1. python批量提取word指定内容_使用python批量读取word文档并整理关键信息到excel表格的实例...
  2. cocos2dx游戏解决方案
  3. php输出0到5之间到数,php如何实现输出链表倒数第k个结点(代码实例)
  4. Flask Web中的db.relationship()
  5. icloud 购买存储空间_如何释放iCloud存储空间
  6. 【转载】架构师需要了解的Paxos原理、历程及实战
  7. sql server 触发器应用 insert
  8. 【一】最新多智能体强化学习方法【总结】
  9. 常用的页面布局(两栏布局、三栏(圣杯、双飞翼)布局)
  10. 重试利器之Guava Retrying
  11. python改变字符颜色_Python字符串为颜色
  12. win10升级补丁_干掉烦人的win10升级助手gwx
  13. 126邮箱注册思维导图
  14. android 检测电量变化,Android电池电量检测
  15. 智汇华云 | ArSDN之分布式路由及浮动IP简介
  16. assignin与evalin用法理解
  17. dz邮箱验证怎么设置_详细步骤!Discuz如何设置通过 SOCKET 连接 SMTP 服务器发送(支持 ESMTP 验证)实现论坛邮箱验证功能...
  18. 为什么要用MQ,MQ是什么?(消息队列)
  19. [渗透教程]-006-渗透测试-Metasploit以及实战教程
  20. 按关键字爬取网页信息

热门文章

  1. 数模IC设计笔试面试100真题
  2. Python中的字典到底是有序的吗
  3. 32.java_注解(Annotation)
  4. 计算机软考机考要带2b铅笔吗,软考考试的话是机考还是笔考?
  5. 1176:谁考了第k名
  6. IP地址学习和网络测试命令
  7. html css行高距离算法,CSS行高——line-height 行间距
  8. 智能扑克牌识别软件(Python+YOLOv5深度学习模型+清新界面)
  9. 城市垃圾处理无线监控综合解决方案
  10. API管理工具-Swagger