[Crawler] Scraping TED Talks with Python and Selenium

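1. About the crawler:

TED Talks (www.ted.com/talks) collects videos of world-renowned thinkers, artists, and technologists who have spoken on the TED stage. The videos can be downloaded for free from TED.com, and each one comes with an interactive English transcript and subtitles in more than 80 languages.

The scraping scenario here: for a given talk, extract the English and Hungarian transcripts, pair them line by line, and write them to a file; then use Selenium to click a random next video and repeat the process.
English transcript: example link: https://www.ted.com/talks/fabio_pacucci_could_the_earth_be_swallowed_by_a_black_hole/transcript

Hungarian transcript: example link = English link + ?language=hu

Next video: each video has a column of recommended videos on its right; Selenium only needs to click a random one of them to move on to the next video.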
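The link rule above (English link plus ?language=hu) can be sketched as a small helper. This is a minimal illustration, not the author's code: the function name transcript_urls and its language parameter are hypothetical, and it assumes the input is a talk page URL that may or may not already end in /transcript.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def transcript_urls(talk_url, language="hu"):
    """Return (english_url, translated_url) for a TED talk page URL.

    Hypothetical helper: any existing query string or fragment is dropped,
    and "/transcript" is appended if it is not already present.
    """
    parts = urlsplit(talk_url)
    path = parts.path.rstrip("/")
    if not path.endswith("/transcript"):
        path += "/transcript"
    english = urlunsplit((parts.scheme, parts.netloc, path, "", ""))
    translated = urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode({"language": language}), "")
    )
    return english, translated


en_url, hu_url = transcript_urls(
    "https://www.ted.com/talks/fabio_pacucci_could_the_earth_be_swallowed_by_a_black_hole"
)
print(en_url)
print(hu_url)
```

Building both links from the talk URL in one place avoids concatenating '/transcript' and the language suffix by hand in several spots.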

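2. Problems to solve:

Some videos have no Hungarian subtitles: a video opened by clicking shows English subtitles by default and may not have Hungarian ones; in that case, appending ?language=hu to the English link returns a 404 response. Because every random click in Selenium lands on the default English page, the crawler appends ?language=hu to the link and checks whether Selenium can locate the expected elements. If it cannot, the video has no Hungarian subtitles, so the crawler goes back to the video's English transcript link and clicks another random recommendation, repeating until it reaches a video that does have Hungarian subtitles; only then does it start extracting.
Slow crawling: switch to a headless browser. PhantomJS is not used because its development has stopped and Selenium has deprecated support for it, so Chrome is run in headless mode instead.
options.add_argument('--headless')
Exception handling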
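The "keep clicking random recommendations until a talk has Hungarian subtitles" rule can be sketched independently of the browser. This is a minimal sketch under stated assumptions: find_talk_with_subtitles, has_subtitles, recommendations, and max_hops are hypothetical names, and in a real crawler the two callbacks would wrap Selenium calls. It uses a bounded loop rather than recursion, so a long run of talks without Hungarian subtitles cannot recurse indefinitely.

```python
import random


def find_talk_with_subtitles(start_url, has_subtitles, recommendations,
                             max_hops=20, rng=random):
    """Follow random recommendations until a talk with subtitles is found.

    has_subtitles(url) -> bool and recommendations(url) -> list of URLs are
    hypothetical callbacks. Returns the URL of the first suitable talk, or
    None if none is found within max_hops attempts or a talk has no
    recommendations to follow.
    """
    url = start_url
    for _ in range(max_hops):
        if has_subtitles(url):
            return url
        candidates = recommendations(url)
        if not candidates:
            return None
        url = rng.choice(candidates)
    return None


# Usage with a tiny fake site: only talk "c" has Hungarian subtitles.
fake_links = {"a": ["b"], "b": ["c"], "c": []}
found = find_talk_with_subtitles(
    "a",
    has_subtitles=lambda url: url == "c",
    recommendations=lambda url: fake_links[url],
)
print(found)
```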

3. Environment:

Python 3.6
The selenium library for Python

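4. Project structure:

get_align.py: implements the crawler
main.py: user-facing entry point (omitted)
mul_process.py: multiprocessing (omitted)
options_settings.py: webdriver option settings
parse_settings.py: settings for parsing the transcript sentences
start_setting.py: basic project settings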

1. Crawler implementation

'''The imports go here.
text_path, start_urls, behind, text, click and options are expected to come
from the settings files listed in the project structure.
'''
import random
import threading

from selenium import webdriver

__author = 'cyy'

local = threading.local()
id = 0
local.id = id  # running count of talks written so far


class Get_Align(object):
    '''Get aligned sentence pairs'''

    def __init__(self, num, frequecy):
        self.__num = num
        self.__path = text_path[self.__num - 1]
        self.frequecy = frequecy
        self.driver = None
        self.hu_behind = behind
        self.en_texts = []
        self.hu_texts = []
        self.start_url = start_urls[self.__num - 1]

    @property
    def num(self):
        return self.__num

    @num.setter
    def num(self, value):
        self.__num = value

    def __call__(self, *args, **kwargs):
        return self.get_align()

    def get_align(self):
        f = open(self.__path, 'a', encoding='utf-8')
        self.driver = webdriver.Chrome(chrome_options=options)
        self.driver.maximize_window()
        # Hungarian transcript of the start talk
        self.driver.get(self.start_url + self.hu_behind)
        hu = self.driver.find_elements_by_css_selector(text)
        for h in hu:
            self.hu_texts.append(h.text)
        # English transcript of the same talk
        self.driver.get(self.start_url)
        en = self.driver.find_elements_by_css_selector(text)
        for e in en:
            self.en_texts.append(e.text)
        # Write the pairs; stop at the end of the shorter transcript
        for i in range(len(self.hu_texts)):
            try:
                f.writelines(self.en_texts[i] + '\n')
                f.writelines(self.hu_texts[i] + '\n' + '\n')
            except IndexError as e:
                break
        f.close()
        local.id += 1
        print(local.id)
        self.get_align_continue()

    def check_hu(self):
        '''Check whether the video has Hungarian subtitles
        :return:
        '''
        ne = self.driver.find_elements_by_xpath(click)
        if not ne:
            # No Hungarian subtitles: go back and click a random recommendation
            local.id -= 1
            self.driver.back()
            ne = self.driver.find_elements_by_xpath(click)
            ra = random.choice(ne)
            ra.click()
            return self.check_hu()
        else:
            pass

    def get_align_continue(self):
        next = self.driver.find_element_by_xpath(click).click()
        for i in range(2, self.frequecy):
            f = open(self.__path, 'a', encoding='utf-8')
            self.en_texts, self.hu_texts = [], []
            en_url = self.driver.current_url
            self.driver.get(self.driver.current_url + '/transcript' + self.hu_behind)
            self.check_hu()
            hu = self.driver.find_elements_by_css_selector(text)
            for h in hu:
                self.hu_texts.append(h.text)
            self.driver.get(en_url + '/transcript')
            en = self.driver.find_elements_by_css_selector(text)
            for e in en:
                self.en_texts.append(e.text)
            try:
                for i in range(len(self.hu_texts)):
                    f.writelines(self.en_texts[i] + '\n')
                    f.writelines(self.hu_texts[i] + '\n' + '\n')
            except IndexError as e:
                print(e.args)
            f.close()
            local.id += 1
            print(local.id)
            # Move on to a random recommended video
            try:
                next = self.driver.find_elements_by_xpath(click)
                choice = random.choice(next)
                choice.click()
            except IndexError as e:
                next = self.driver.find_element_by_xpath(click)
                next.click()
        self.driver.close()
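The pairing step in the crawler (write one English line, then the matching Hungarian line, then a blank line, stopping at the end of the shorter transcript) can be isolated from the browser. The sketch below is illustrative only: write_aligned is a hypothetical helper, and it uses zip instead of index access with an IndexError handler, which gives the same "stop at the shorter list" behaviour.

```python
import io


def write_aligned(en_texts, hu_texts, out):
    """Write English/Hungarian line pairs to out and return the pair count.

    Each pair is written as the English line, the Hungarian line, and a
    blank line. zip() stops at the end of the shorter list, so unmatched
    trailing lines are skipped.
    """
    count = 0
    for en_line, hu_line in zip(en_texts, hu_texts):
        out.write(en_line + "\n")
        out.write(hu_line + "\n\n")
        count += 1
    return count


# Usage with an in-memory buffer; a real run would pass an open text file.
buffer = io.StringIO()
written = write_aligned(["Hi.", "Thank you.", "Unmatched line"],
                        ["Szia.", "Köszönöm."], buffer)
print(written)
print(buffer.getvalue())
```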

2. Configuration

options_settings.py: webdriver option settings
parse_settings.py: settings for parsing the transcript sentences
start_setting.py: basic project settings
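The contents of these settings files are not shown here. As a purely hypothetical sketch of what a file like options_settings.py might hold, the snippet below builds Chrome options with the --headless argument mentioned earlier. CHROME_ARGUMENTS and build_options are assumed names, and the window-size argument is an assumption rather than something taken from the project.

```python
# Hypothetical options_settings.py; the original file's contents are not shown.
CHROME_ARGUMENTS = [
    "--headless",               # run Chrome without a visible window
    "--window-size=1920,1080",  # assumed: fixed viewport size for headless runs
]


def build_options():
    """Create a ChromeOptions object carrying CHROME_ARGUMENTS."""
    # Imported lazily so this module can be loaded without selenium installed.
    from selenium import webdriver

    opts = webdriver.ChromeOptions()
    for argument in CHROME_ARGUMENTS:
        opts.add_argument(argument)
    return opts
```

The crawler could then create its driver from build_options() instead of a module-level options object; keeping the arguments in a plain list also makes it easy to toggle headless mode while debugging.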

3. Results
