First open this URL to get the flight list:
http://www.variflight.com/sitemap.html?AE71649A58c77=

Then click into any flight to obtain a Cookie. The site is aggressive about IP bans and CAPTCHAs, so after passing
the CAPTCHA copy the entire Request Header into the `headers` dict passed to `requests` in the code below.
If you have proxy IPs, use them to fetch the data. If not, you can tether through a phone hotspot: whenever the program gets cut off, switch the phone into airplane mode and back, reconnect the computer to the hotspot, and you are on a fresh IP. Repeat this cycle and the site's data can be pulled down at will.

Results

The program uses two Redis hash fields: one stores the flight number currently being crawled, the other stores that flight number's position in the URL list. After an interruption, a restart can then resume from exactly where the last run left off.
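A minimal sketch of that resume scheme, using the same `flight:num1` hash as the scraper (field 1: last flight number handled, field 2: its position in the list). A tiny dict-backed stub stands in for `redis.Redis` here so the logic can be shown without a running server:

```python
class FakeRedis:
    """Dict-backed stand-in for redis.Redis (hset/hget only, string values)."""
    def __init__(self):
        self._h = {}

    def hset(self, name, key, value):
        self._h.setdefault(name, {})[str(key)] = str(value)

    def hget(self, name, key):
        return self._h.get(name, {}).get(str(key))

def checkpoint(r, fnum, index):
    """Record the flight number and its list position after each page."""
    r.hset("flight:num1", 1, fnum)
    r.hset("flight:num1", 2, index)

def resume_index(r):
    """Where to restart: 0 on a fresh run, else the saved position."""
    pos = r.hget("flight:num1", 2)
    return int(pos) if pos is not None else 0
```

On restart, the main loop slices the URL list from `resume_index(...)` onward, which is exactly what the `list_url_fnum[int(num):]` slice in the script does.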

```python
import collections
import json
import os
import re
import signal
import threading
from queue import Queue

import redis
import requests
from bs4 import BeautifulSoup as bs

pid = os.getpid()
# Name the client `rdb` so it does not shadow the imported redis module.
rdb = redis.Redis(decode_responses=True, password="****")
# Static proxy pool.
ip_key = ['60.170.152.46:38888', '111.177.192.26:3256', '125.122.52.15:8088',
          '47.107.128.69:888', '47.92.234.75:80']
url_base = 'http://www.variflight.com'
url = 'http://www.variflight.com/sitemap.html?AE71649A58c77='
print('main process id:', os.getpid())
num_of_threads = 10
buffer_keys = collections.deque(maxlen=len(ip_key))

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'no-cache',
    'Cookie': 'PHPSESSID=et83gh4r9gnrlollsl3lmnfh20; vaptchaNetway=1; ASPSESSIONIDCCQACTCD=OBFOOGLBHOCDNLIICCKGLFCP; ASPSESSIONIDSACTARSB=AKPPJPFCGIAHJENADBCOHAIA; ASPSESSIONIDQCDQATTD=IPJAKPFCMONGAIBDOKHDHLBL; ASPSESSIONIDACTBCTDC=JMNBKPFCAIGHEBEOAKDOLLBK; Hm_lvt_d1f759cd744b691c20c25f874cadc061=1625052094,1625121572; ASPSESSIONIDSCDQDRSD=CHDGGIADEIHELFNICNGMEACA; ASPSESSIONIDCCQDBTCD=GIDGGIADNEJCNPOLCLKPFHKD; ASPSESSIONIDQACSDSQB=OAEGGIADHBLFFBHJADECGNAP; ASPSESSIONIDQCBQCRTC=JNHDCBLDJAFKOPBGOOKALLAA; ASPSESSIONIDQACTDSRA=GHIDCBLDBOGLMFMIICEJKILD; ASPSESSIONIDAATBDSCD=JLIDCBLDGFPNOFEDEICNDGPG; ASPSESSIONIDQCCRDQTD=DPOIOJFACOFMLONNBKENPNGM; ASPSESSIONIDQCBTBQTB=ADOIOJFADCONDFDPHIGDCLIM; ASPSESSIONIDCARDCTDD=KOFKOJFALGIBFPJNKIMIJHKK; authCode=ffd66e69d01fee073453c62715cf0b07; fnumHistory=%5B%7B%22fnum%22%3A%22CZ3474%22%7D%2C%7B%22fnum%22%3A%22CA3954%22%7D%2C%7B%22fnum%22%3A%22CA3681%22%7D%2C%7B%22fnum%22%3A%22CA4432%22%7D%2C%7B%22fnum%22%3A%22CA3879%22%7D%2C%7B%22fnum%22%3A%22CA1101%22%7D%2C%7B%22fnum%22%3A%223U2011%22%7D%2C%7B%22fnum%22%3A%223U2013%22%7D%5D; vaptchaNetwayTime=1625395534387; salt=60e1914ed231e; Hm_lpvt_d1f759cd744b691c20c25f874cadc061=1625395544',
    'Host': 'www.variflight.com',
    'Pragma': 'no-cache',
    'Proxy-Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
}

headers1 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Cache-Control': 'no-cache',
    'Cookie': 'PHPSESSID=et83gh4r9gnrlollsl3lmnfh20; vaptchaNetway=1; ASPSESSIONIDCCQACTCD=OBFOOGLBHOCDNLIICCKGLFCP; ASPSESSIONIDSACTARSB=AKPPJPFCGIAHJENADBCOHAIA; ASPSESSIONIDQCDQATTD=IPJAKPFCMONGAIBDOKHDHLBL; ASPSESSIONIDACTBCTDC=JMNBKPFCAIGHEBEOAKDOLLBK; Hm_lvt_d1f759cd744b691c20c25f874cadc061=1625052094,1625121572; ASPSESSIONIDSCDQDRSD=CHDGGIADEIHELFNICNGMEACA; ASPSESSIONIDCCQDBTCD=GIDGGIADNEJCNPOLCLKPFHKD; ASPSESSIONIDQACSDSQB=OAEGGIADHBLFFBHJADECGNAP; ASPSESSIONIDQCBQCRTC=JNHDCBLDJAFKOPBGOOKALLAA; ASPSESSIONIDQACTDSRA=GHIDCBLDBOGLMFMIICEJKILD; ASPSESSIONIDAATBDSCD=JLIDCBLDGFPNOFEDEICNDGPG; ASPSESSIONIDQCCRDQTD=DPOIOJFACOFMLONNBKENPNGM; ASPSESSIONIDQCBTBQTB=ADOIOJFADCONDFDPHIGDCLIM; ASPSESSIONIDCARDCTDD=KOFKOJFALGIBFPJNKIMIJHKK; ASPSESSIONIDSCCQAQTB=DHIOJCABKBPLCHCIDJJCPBHP; ASPSESSIONIDSCBRCRSC=ALIOJCABFPFHBOOJBNKGGAIA; ASPSESSIONIDAARCDTDD=LDOOJCABBPFKMPCFGADDHOGK; ASPSESSIONIDQADSBQTA=CNABGLKBJGELBNHKAJGDCNAO; ASPSESSIONIDSCBQDRSD=KBCBGLKBHABKJDBNCJPMFIDC; ASPSESSIONIDCCQCASCD=DBCBGLKBCHNCEHNMEAPMOBLN; fnumHistory=%5B%7B%22fnum%22%3A%223U8837%22%7D%2C%7B%22fnum%22%3A%223U8513%22%7D%2C%7B%22fnum%22%3A%22CZ3937%22%7D%2C%7B%22fnum%22%3A%223U8758%22%7D%2C%7B%22fnum%22%3A%223U5103%22%7D%2C%7B%22fnum%22%3A%223U8411%22%7D%2C%7B%22fnum%22%3A%223U5082%22%7D%2C%7B%22fnum%22%3A%223U5048%22%7D%5D; vaptchaSpareCh=1; salt=60e40d13446a7; midsalt=60e40d13634b9; authCode=18373f4fddf15b5400dbb02ec7ecef6b; vaptchaNetwayTime=1625560654940; Hm_lpvt_d1f759cd744b691c20c25f874cadc061=1625560658',
    'Host': 'www.variflight.com',
    'Pragma': 'no-cache',
    'Proxy-Connection': 'keep-alive',
    'Referer': 'http://www.variflight.com/flight/fnum/CA4432.html?AE71649A58c77&fdate=20210703',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
}

# Fetch the sitemap and collect every flight-detail URL.
r = requests.get(url, headers=headers)
soup = bs(r.text, 'lxml')
list_a = soup.find(class_='list').find_all('a')
list_url_fnum = [url_base + a.attrs['href'] for a in list_a]
print(list_url_fnum)


def init_queen():
    # Load the static pool into the rotating buffer. (The message is left over
    # from an Amap-key rotation script; the deque actually holds proxies.)
    for i in range(len(ip_key)):
        buffer_keys.append(ip_key[i])
    print('keys currently available:', buffer_keys)


airnumdict = {}


def get_index():
    # Map each flight number to its position in the URL list.
    for i, url_fr in enumerate(list_url_fnum[1:]):
        flightnum = 'http://www.variflight.com/flight/fnum/(.*?).html.*'
        flightdata = re.compile(flightnum, re.S).findall(str(url_fr))
        airnumdict[flightdata[0]] = i
    return airnumdict


def load_data_from_dict(o, *keys):
    # Safely walk nested dicts: return None instead of raising on a miss.
    oo = o
    for i, key in enumerate(keys):
        if not oo:
            return None
        if i == (len(keys) - 1):
            return oo.get(key) if isinstance(oo, dict) else None
        oo = oo.get(key) if isinstance(oo, dict) else oo


def get_proxy():
    # Fetch a proxy from the pool service, then verify it works by checking
    # the outgoing IP seen by httpbin.org; retry recursively on failure.
    a = requests.get("http://*****/get")
    b = json.loads(a.text)
    proxy = load_data_from_dict(b, "proxy")
    ip_list = [proxy]
    print(proxy)
    url_fr = 'http://httpbin.org/ip'
    try:
        r1 = requests.get(url_fr,
                          proxies={'http': 'http://' + proxy, 'https': 'https://' + proxy},
                          timeout=5)
        origin = load_data_from_dict(json.loads(r1.text), 'origin')
        if origin.split(',')[0] == ip_list[0].split(':')[0]:
            print(proxy)
            return proxy
        else:
            return get_proxy()
    except Exception:
        return get_proxy()


class myThread(threading.Thread):
    def __init__(self, threadID, city_queue, proxy):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.city_queue = city_queue
        self.proxy = proxy
        self.singal = threading.Event()
        self.singal.set()

    def run(self):
        while not self.city_queue.empty():
            code = self.city_queue.get()
            print(code)
            self.mian(code, self.proxy)

    def pause(self):
        # Leftover from a GUI version; log_ctrl is not defined in this script.
        self.log_ctrl.AppendText("pause\n")
        self.singal.clear()

    def restart(self):
        self.log_ctrl.AppendText("continues\n")
        self.singal.set()

    def write_fun(self, line):
        # Append one CSV row; the with-block closes the file automatically.
        with open('飞行{}.csv'.format(fdata), 'a') as f:
            f.write(line)

    def get_location(self, address, i):
        # Geocode an address through the Amap (AutoNavi) REST API.
        url = 'https://restapi.amap.com/v3/geocode/geo'
        params = {'key': '***', 'address': str(address)}
        res = requests.get(url, params)
        # The response is JSON; convert it to a dict.
        jd = json.loads(res.text)
        geocodes = load_data_from_dict(jd, 'geocodes')
        location = load_data_from_dict(geocodes[0], 'location')
        return location

    def mian(self, code, proxy):
        # if buffer_keys.maxlen == 0:
        #     print('keys exhausted, exiting!!!')
        #     exit(0)
        # proxy = buffer_keys[0]  # always take the first key in the queue
        print("*" * 100)
        for index in range(6):
            flight = rdb.hget("flight:num1", 1)
            if flight:
                h = get_index()[flight]   # position saved at the last interruption
                i = get_index()[code[0]]  # position of the current flight
                if i > h:
                    url_fr = ('http://www.variflight.com/flight/fnum/{0}.html'
                              '?AE71649A58c77&fdate={1}').format(code[0], fdata)
                    # proxies = {'http': 'http://' + proxy, 'https': 'https://' + proxy}
                    r1 = requests.get(url_fr, headers=headers1)
                    # print(r1.text)
                    soup = bs(r1.text, 'lxml')
                    try:
                        # Patterns for departure/arrival times, the plain info
                        # spans, the flight number, and the IP-block message.
                        dplan = '<span class="w150" dplan="(.*?)">'
                        aplan = '<span aplan="(.*?)" class="w150">'
                        arae = '<span class="w150">(.*?)</span>'
                        flightnum = 'http://www.variflight.com/flight/fnum/(.*?).html.*'
                        badip = '<html><body><p>{"msg":"(.*?)"}</p></body></html>'
                        dplandata = re.compile(dplan, re.S).findall(str(soup))
                        aplandata = re.compile(aplan, re.S).findall(str(soup))
                        araedata = re.compile(arae, re.S).findall(str(soup))
                        flightdata = re.compile(flightnum, re.S).findall(str(url_fr))
                        badipdata = re.compile(badip, re.S).findall(str(soup))
                        rdb.hset("flight:num1", 2, i)
                        if len(badipdata) > 0:
                            if badipdata[0] == "IP blocked":
                                print('IP blocked; switch to another IP and rerun')
                                try:
                                    print(proxy + " " + str(self.threadID) + " "
                                          + badipdata[0] + ' pool exhausted, exiting...')
                                    rdb.hset("flight:num1", 1, code[0])
                                    os.kill(pid, signal.SIGHUP)
                                    exit(0)
                                except Exception:
                                    print(proxy + " " + str(self.threadID) + " "
                                          + badipdata[0] + ' pool exhausted (exception), exiting...')
                                    rdb.hset("flight:num1", 1, code[0])
                                    os.kill(pid, signal.SIGHUP)
                                    exit(0)
                        if len(dplandata) > 1:
                            # Several legs on one page: araedata holds three
                            # plain spans per leg, so step through it in threes.
                            j = 0
                            for i, value in enumerate(dplandata):
                                line = (str(flightdata[0]) + ',' + str(dplandata[i]) + ','
                                        + str(aplandata[i]) + ',' + str(araedata[j]) + ','
                                        + str(araedata[j + 1]) + '\n')
                                j = (i + 1) * 3
                                print(str(h) + " " + str(i) + " " + line)
                                self.write_fun(line)
                        elif len(dplandata) == 1:
                            line = (str(flightdata[0]) + ',' + str(dplandata[0]) + ','
                                    + str(aplandata[0]) + ',' + str(araedata[0]) + ','
                                    + str(araedata[1]) + '\n')
                            print(str(h) + " " + str(i) + " " + line)
                            self.write_fun(line)
                        elif len(dplandata) == 0:
                            print(str(h) + " " + str(i) + " " + str(flightdata)
                                  + " no data: " + str(dplandata) + " " + str(badipdata))
                            rdb.hset("flight:num1", 1, code[0])
                    except Exception:
                        print("exception: " + str(dplandata))


if __name__ == '__main__':
    fdata = 20210708
    init_queen()
    get_index()
    # proxy = get_proxy()  # enable to fetch a verified proxy dynamically
    proxy = ip_key[0]      # otherwise fall back to the first static proxy
    city_queue = Queue()
    # Resume from the saved position (0 on a fresh run).
    num = rdb.hget("flight:num1", 2) or 0
    for i in list_url_fnum[int(num):]:
        flightnum = 'http://www.variflight.com/flight/fnum/(.*?).html.*'
        flightdata = re.compile(flightnum, re.S).findall(str(i))
        if len(flightdata) > 0:
            city_queue.put(flightdata)
    threads = [myThread(i, city_queue, proxy) for i in range(num_of_threads)]
    for i in range(num_of_threads):
        threads[i].start()
```
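As a quick sanity check, the flight-number extraction the script runs on every sitemap URL can be isolated. This uses the same pattern as the code, with the dots escaped (the original's unescaped `.` matches a literal dot anyway, so behavior on these URLs is identical):

```python
import re

# Pattern the scraper uses to pull the flight code out of a detail URL,
# e.g. .../flight/fnum/CA4432.html?... -> 'CA4432'.
FNUM_RE = re.compile(r'http://www\.variflight\.com/flight/fnum/(.*?)\.html.*', re.S)

def extract_fnum(url: str):
    """Return the flight code embedded in a detail URL, or None."""
    m = FNUM_RE.findall(url)
    return m[0] if m else None
```

The non-greedy `(.*?)` stops at the first `.html`, which is what keeps query strings like `?AE71649A58c77&fdate=...` out of the captured code.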
