Scraping Data with Python: Primary and Secondary Schools Nationwide

Target site: http://www.xuexiaodaquan.com/ (学校大全, a school directory)

Tech stack: requests + BeautifulSoup

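The site's anti-scraping measures seem fairly strong: it often returns a page that automatically redirects to a 139 site, so you may have to keep switching IPs and retrying.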
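One way to notice the redirect and retry is sketched below. It is only a sketch under assumptions: it supposes the redirect shows up as a final URL outside xuexiaodaquan.com (a JavaScript or meta-refresh redirect would not be caught this way), and the helper names is_on_site and fetch_with_retry are made up for illustration. The fetch argument is any callable returning (final_url, text); with requests it could return (r.url, r.text), and it is also the place where switching proxies would go.

```python
import time
from urllib.parse import urlparse


def is_on_site(final_url, expected_domain="xuexiaodaquan.com"):
    """Return True if final_url is on expected_domain or one of its subdomains."""
    host = urlparse(final_url).hostname or ""
    return host == expected_domain or host.endswith("." + expected_domain)


def fetch_with_retry(fetch, url, attempts=3, delay=1.0):
    """Call fetch(url) up to `attempts` times until the response stays on the site.

    fetch is a hypothetical callable returning (final_url, text).
    Returns the page text, or None if every attempt was redirected away.
    """
    for attempt in range(attempts):
        final_url, text = fetch(url)
        if is_on_site(final_url):
            return text
        if attempt < attempts - 1:
            time.sleep(delay)  # pause before the next try
    return None
```

Passing the fetch function in keeps the retry logic independent of how the request is made, so it can be checked without touching the network.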

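Prepare an Excel file containing Chinese city names and their pinyin. The first column is the city name and the second column is the pinyin; city level is detailed enough. The code reads this file from D:\中国省市名称拼音.xls, from a sheet named city.

List only the cities you want to scrape. To scrape the whole country, list every city.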
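The pinyin matters because the script uses it as the subdomain of each city's pages. A minimal sketch of that mapping follows; the function name build_base_urls and the two sample rows are made up for illustration, and whether a given subdomain actually exists on the site is an assumption.

```python
def build_base_urls(city_pinyin):
    """Map each city name to its base URL, normalising the pinyin first."""
    return {
        city: "http://" + pinyin.strip().lower() + ".xuexiaodaquan.com"
        for city, pinyin in city_pinyin.items()
    }


# Hypothetical sample rows; the real rows come from the Excel sheet.
sample = {"北京": "Beijing", "上海": " shanghai "}
urls = build_base_urls(sample)
```

Stripping whitespace and lowercasing guards against stray spaces or capital letters typed into the spreadsheet.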


The output is an Excel file (D:/school.xls in the code) with the following columns:

City

Type

School name

Address

Phone

Website
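To make the column order concrete, here is a small sketch that turns one school record into a row in that order. It writes CSV with the standard library rather than Excel, which is a substitution on my part, not what the post's code does. The column keys and the helper names record_to_row and write_records are hypothetical.

```python
import csv
import io

# Hypothetical English keys, in the same order as the columns listed above.
COLUMNS = ["city", "type", "school_name", "address", "phone", "website"]


def record_to_row(record):
    """Return the record's values in column order, with '' for missing fields."""
    return [record.get(col, "") for col in COLUMNS]


def write_records(records, stream):
    """Write a header row followed by one row per record to a text stream."""
    writer = csv.writer(stream)
    writer.writerow(COLUMNS)
    for record in records:
        writer.writerow(record_to_row(record))


# Made-up record for illustration only.
buf = io.StringIO()
write_records([{"city": "北京", "type": "小学", "school_name": "示例小学"}], buf)
```

Appending to a CSV stream avoids reopening and rewriting the whole workbook for every school.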


Here is the code:

from bs4 import BeautifulSoup
import requests
import re
import xlwt
import xlrd
from xlutils.copy import copy


# Fetch a page and return its HTML text ("" on failure)
def getHtmlText(url, code="GBK"):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.63 Safari/537.36'}
        r = requests.get(url, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except requests.RequestException:
        print("getHtmlText failed: " + url)
        return ""


# Parse the regions on a page and return them as a dict (currently disabled)
'''
def getGroundList(htext):
    try:
        grounddict = {}
        soup = BeautifulSoup(htext, "html.parser")
        gdname = soup.find('dl', attrs={'class':'nobackground'})
        keyList = gdname.find_all('a')
        for i in range(1,len(keyList)):
            key = keyList[i].text
            val = keyList[i].get('href')
            grounddict[key] = val
        return grounddict
    except:
        print("getGroundList failed")
'''


# Parse the number of the last page; return 0 when there is no pagination
def getPageCode(htext, typeitem):
    try:
        soup = BeautifulSoup(htext, "html.parser")
        s1 = soup.find('a', attrs={'class': 'last'})
        if s1 and s1.get('href'):
            pat = re.compile(re.escape(typeitem) + r'pn([0-9]+)\.html')
            code = pat.search(s1.get('href'))
            if code:
                return int(code.group(1))
    except Exception:
        print("getPageCode failed")
    return 0


# Parse school info (name, address, phone, website) and save each school
def getSchoolList(htext, fileAddress, cityitem, typeitem):
    try:
        soup = BeautifulSoup(htext, "html.parser")
        sclist1 = soup.find_all('dl', attrs={'class': 'left'})
        sclist2 = soup.find_all('dl', attrs={'class': 'right'})
        sclist = sclist1 + sclist2
        for item in sclist:
            schoolDict = {}
            schoolDict['城市'] = cityitem
            schoolDict['类型'] = typeitem
            schoolDict['学校名称'] = item.find('p').text
            sl = item.find_all('li')
            schoolDict['地址'] = sl[0].text
            schoolDict['电话'] = sl[1].text
            schoolDict['网址'] = sl[2].text
            #f = open(fileAddress, 'a', encoding='utf-8')
            #f.write(str(schoolDict) + '\n')
            savefile(schoolDict, fileAddress)
    except Exception:
        print("getSchoolList failed")


# Append one school as a new row; the Excel file must already exist
def savefile(schoolDict, fileAddress):
    workbook = xlrd.open_workbook(fileAddress)
    sheet = workbook.sheet_by_index(0)
    wb = copy(workbook)
    ws = wb.get_sheet(0)
    rowNum = sheet.nrows  # first empty row
    ws.write(rowNum, 0, schoolDict['城市'])
    ws.write(rowNum, 1, schoolDict['类型'])
    ws.write(rowNum, 2, schoolDict['学校名称'])
    ws.write(rowNum, 3, schoolDict['地址'])
    ws.write(rowNum, 4, schoolDict['电话'])
    ws.write(rowNum, 5, schoolDict['网址'])
    wb.save(fileAddress)


# Read the city list from the Excel file as {city name: lowercase pinyin}
def getCityList():
    try:
        cityFileAddress = r'D:\中国省市名称拼音.xls'
        file = xlrd.open_workbook(cityFileAddress)
        sheet = file.sheet_by_name('city')
        names = sheet.col_values(0)
        pinyins = sheet.col_values(1)
        cityDic = {}
        for i in range(sheet.nrows):
            cityDic[names[i]] = pinyins[i].lower()
        return cityDic
    except Exception:
        print("getCityList failed")
        return {}


def main():
    cityList = getCityList()
    typeList = {'小学': '/xiaoxue/', '初中': '/chuzhong/', '高中': '/gaozhong/'}
    fileAddress = 'D:/school.xls'
    for cityitem in cityList:
        searchUrl = 'http://' + cityList[cityitem] + '.xuexiaodaquan.com'
        for typeitem in typeList:
            # First page of the listing
            htext = getHtmlText(searchUrl + typeList[typeitem])
            getSchoolList(htext, fileAddress, cityitem, typeitem)
            # Remaining pages: pn2.html up to the last page
            pagecode = getPageCode(htext, typeList[typeitem])
            for i in range(2, pagecode + 1):
                h1text = getHtmlText(searchUrl + typeList[typeitem] + 'pn' + str(i) + '.html')
                getSchoolList(h1text, fileAddress, cityitem, typeitem)


if __name__ == '__main__':
    main()
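The pagination handling in getPageCode and main can be isolated into two small pure functions, which makes it easy to check without hitting the site. This is a sketch only: the names last_page_number and listing_urls are made up, and the sample href assumes the "last page" link looks like /xiaoxue/pn12.html, which is what the regular expression in the code implies.

```python
import re


def last_page_number(href, type_path):
    """Extract N from an href containing '<type_path>pnN.html'; 0 if absent."""
    match = re.search(re.escape(type_path) + r"pn([0-9]+)\.html", href or "")
    return int(match.group(1)) if match else 0


def listing_urls(base_url, type_path, last_page):
    """Return the first listing page followed by pn2.html .. pn<last_page>.html."""
    urls = [base_url + type_path]
    for i in range(2, last_page + 1):
        urls.append(base_url + type_path + "pn" + str(i) + ".html")
    return urls


# Assumed href format for the "last page" link.
pages = last_page_number("/xiaoxue/pn3.html", "/xiaoxue/")
all_urls = listing_urls("http://beijing.xuexiaodaquan.com", "/xiaoxue/", pages)
```

Returning 0 when no page number is found means a listing with a single page yields only the first URL.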
