Multithreaded and multiprocess crawlers with python+selenium

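I used python+selenium to scrape the Shenzhen Stock Exchange's own announcements (本所公告). The first version ran in a single process. I recently reworked the code to crawl with multiple processes and with multiple threads, and it is noticeably faster because four browsers work through separate page ranges at the same time. If you are not familiar with selenium, please learn the basics elsewhere first.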

Multiprocess crawling

# coding=utf-8
'''
Scrape the Shenzhen Stock Exchange's own announcements with multiple processes.
Titles and announcement bodies are written to separate CSV files.
Author: 西兰
Date: 2019-11-30
'''
from selenium import webdriver
import time
import csv
from multiprocessing import Process


def process(start, end, num):
    # Written against the Selenium 3 API (executable_path, find_element_by_*).
    # The ./data directory must already exist.
    driver_path = r"D:\chromedriver.exe"
    # Use developer mode
    # options = webdriver.ChromeOptions()
    # options.add_experimental_option('excludeSwitches', ['enable-automation'])
    browser = webdriver.Chrome(executable_path=driver_path)
    browser.implicitly_wait(1)
    count = 1
    for j in range(start, end):
        if count % 21 == 0:
            count = 1
        if j == 1:
            url = "http://www.szse.cn/disclosure/notice/index.html"
        else:
            url = "http://www.szse.cn/disclosure/notice/index_" + str(j - 1) + ".html"
        # if(j%10==0):  # close and restart the browser after every 10 pages
        #     browser.quit()
        #     browser = webdriver.Chrome(executable_path=driver_path)
        for i in range(20):
            browser.get(url)
            browser.maximize_window()
            print("####################################################page", j, ", record", count)
            # Remember the handle of the list page
            list_page_handle = browser.current_window_handle
            div_content = browser.find_element_by_css_selector('div.g-content-list')
            li_list = div_content.find_elements_by_tag_name('li')
            if i >= len(li_list):
                # Fewer than 20 entries on this page
                break
            a_href = li_list[i].find_element_by_tag_name('a').get_attribute('href')
            # Skip announcements that are PDF or Word attachments
            if a_href.find('.pdf') > 0 or a_href.find('.doc') > 0 or a_href.find('.DOC') > 0:
                continue
            print(a_href)
            li_list[i].find_element_by_tag_name('a').click()
            # The announcement opens in a new window; switch to it
            all_handles = browser.window_handles
            for handle in all_handles:
                if handle != list_page_handle:
                    browser.switch_to.window(handle)
            # Title
            title_div = browser.find_element_by_css_selector('div.des-header')
            title_h2 = title_div.find_element_by_tag_name('h2')
            print(title_h2.text)
            data_row_title = [title_h2.text]
            with open('./data/sz_data_title' + str(num) + '.csv', 'a+', newline="", encoding='utf-8') as f:
                csv_add = csv.writer(f)
                csv_add.writerow(data_row_title)
            # Announcement body
            content_div = browser.find_element_by_id('desContent')
            p_content_list = content_div.find_elements_by_tag_name('p')
            final_text = ""
            for p in p_content_list:
                final_text += p.text.strip()
            print(final_text)
            data_row = [final_text]
            with open('./data/sz_data' + str(num) + '.csv', 'a+', newline="", encoding='utf-8') as f:
                csv_add = csv.writer(f)
                csv_add.writerow(data_row)
            time.sleep(1)
            count += 1
            # Close the announcement window and return to the list page
            browser.close()
            browser.switch_to.window(list_page_handle)
    browser.quit()


def main():
    # Start 4 processes, each with its own page range.
    # The third argument is the output-file suffix; it is distinct per process
    # so that the workers do not append to the same CSV files.
    process_list = []
    p1 = Process(target=process, args=(400, 600, 1))
    p1.start()
    p2 = Process(target=process, args=(600, 800, 2))
    p2.start()
    p3 = Process(target=process, args=(800, 1000, 3))
    p3.start()
    p4 = Process(target=process, args=(1000, 1129, 4))
    p4.start()
    process_list.append(p1)
    process_list.append(p2)
    process_list.append(p3)
    process_list.append(p4)
    for t in process_list:
        t.join()


if __name__ == '__main__':
    s = time.time()
    main()
    e = time.time()
    print('Total time:', e - s)
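The four page ranges passed to process(start, end, num) in main() are written by hand. As a rough sketch of how they could be computed instead, the helper below cuts a page range into contiguous chunks, one per worker. The name split_pages is hypothetical and not part of the original script; it only produces argument tuples and does not start any browser.

```python
def split_pages(start, end, workers):
    """Split the half-open page range [start, end) into `workers` contiguous chunks.

    Returns a list of (chunk_start, chunk_end, num) tuples, where num is a
    1-based worker index that can double as the CSV file suffix.
    Hypothetical helper, not part of the original script.
    """
    total = end - start
    base, extra = divmod(total, workers)
    chunks = []
    lo = start
    for n in range(1, workers + 1):
        # The first `extra` workers take one additional page each
        size = base + (1 if n <= extra else 0)
        hi = lo + size
        chunks.append((lo, hi, n))
        lo = hi
    return chunks


if __name__ == "__main__":
    for args in split_pages(400, 1129, 4):
        print(args)
```

Each tuple can be passed directly as args=... to Process, so changing the number of workers only means changing one argument.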

Multithreaded crawling

# coding=utf-8
'''
Scrape the Shenzhen Stock Exchange's own announcements with multiple threads.
Author: 西兰
Date: 2019-11-30
'''
from selenium import webdriver
import time
import csv
from threading import Thread


def process(start, end, num):
    # Written against the Selenium 3 API (executable_path, find_element_by_*).
    # The ./data directory must already exist.
    driver_path = r"D:\chromedriver.exe"
    # Use developer mode
    # options = webdriver.ChromeOptions()
    # options.add_experimental_option('excludeSwitches', ['enable-automation'])
    browser = webdriver.Chrome(executable_path=driver_path)
    browser.implicitly_wait(1)
    count = 1
    for j in range(start, end):
        if count % 21 == 0:
            count = 1
        if j == 1:
            url = "http://www.szse.cn/disclosure/notice/index.html"
        else:
            url = "http://www.szse.cn/disclosure/notice/index_" + str(j - 1) + ".html"
        # if(j%10==0):  # close and restart the browser after every 10 pages
        #     browser.quit()
        #     browser = webdriver.Chrome(executable_path=driver_path)
        for i in range(20):
            browser.get(url)
            browser.maximize_window()
            print("####################################################page", j, ", record", count)
            # Remember the handle of the list page
            list_page_handle = browser.current_window_handle
            div_content = browser.find_element_by_css_selector('div.g-content-list')
            li_list = div_content.find_elements_by_tag_name('li')
            if i >= len(li_list):
                # Fewer than 20 entries on this page
                break
            a_href = li_list[i].find_element_by_tag_name('a').get_attribute('href')
            # Skip announcements that are PDF or Word attachments
            if a_href.find('.pdf') > 0 or a_href.find('.doc') > 0 or a_href.find('.DOC') > 0:
                continue
            print(a_href)
            li_list[i].find_element_by_tag_name('a').click()
            # The announcement opens in a new window; switch to it
            all_handles = browser.window_handles
            for handle in all_handles:
                if handle != list_page_handle:
                    browser.switch_to.window(handle)
            # Title
            title_div = browser.find_element_by_css_selector('div.des-header')
            title_h2 = title_div.find_element_by_tag_name('h2')
            print(title_h2.text)
            data_row_title = [title_h2.text]
            with open('./data/sz_data_title' + str(num) + '.csv', 'a+', newline="", encoding='utf-8') as f:
                csv_add = csv.writer(f)
                csv_add.writerow(data_row_title)
            # Announcement body
            content_div = browser.find_element_by_id('desContent')
            p_content_list = content_div.find_elements_by_tag_name('p')
            final_text = ""
            for p in p_content_list:
                final_text += p.text.strip()
            print(final_text)
            data_row = [final_text]
            with open('./data/sz_data' + str(num) + '.csv', 'a+', newline="", encoding='utf-8') as f:
                csv_add = csv.writer(f)
                csv_add.writerow(data_row)
            time.sleep(1)
            count += 1
            # Close the announcement window and return to the list page
            browser.close()
            browser.switch_to.window(list_page_handle)
    browser.quit()


def main():
    # Start 4 threads, each with its own page range.
    # The third argument is the output-file suffix; it is distinct per thread
    # so that the workers do not append to the same CSV files.
    thread_list = []
    t1 = Thread(target=process, args=(400, 600, 1))
    t1.start()
    t2 = Thread(target=process, args=(600, 800, 2))
    t2.start()
    t3 = Thread(target=process, args=(800, 1000, 3))
    t3.start()
    t4 = Thread(target=process, args=(1000, 1129, 4))
    t4.start()
    thread_list.append(t1)
    thread_list.append(t2)
    thread_list.append(t3)
    thread_list.append(t4)
    for t in thread_list:
        t.join()


if __name__ == '__main__':
    s = time.time()
    main()
    e = time.time()
    print('Total time:', e - s)
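The start/append/join pattern in main() can also be written with the standard library's concurrent.futures.ThreadPoolExecutor. The sketch below is only an illustration of that alternative, not the method used in this post. run_in_threads and fake_crawl are hypothetical names, and fake_crawl is a stand-in for process(start, end, num) that returns page numbers instead of driving a browser.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_crawl(start, end, num):
    """Stand-in for process(start, end, num): returns the pages it would visit.

    Hypothetical placeholder so the sketch runs without a browser.
    """
    return num, list(range(start, end))


def run_in_threads(worker, jobs):
    """Run worker(*job) for every job on its own thread.

    Results are returned in the same order as `jobs`. Leaving the `with`
    block waits for all threads, which replaces the explicit join() calls.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        futures = [pool.submit(worker, *job) for job in jobs]
        return [f.result() for f in futures]


if __name__ == "__main__":
    jobs = [(400, 600, 1), (600, 800, 2), (800, 1000, 3), (1000, 1129, 4)]
    for num, pages in run_in_threads(fake_crawl, jobs):
        print("worker", num, "would visit", len(pages), "pages")
```

One practical difference from bare Thread objects is that f.result() re-raises any exception from the worker, so a crashed crawler thread does not fail silently.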


Reference: Python multiprocessing and multithreading (python多进程与多线程)
