Douban Movie Top250: Step-by-Step Walkthrough

The overall flow is: 1. send the request; 2. receive the response and extract the information; 3. save the information.

Sending the request

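1. The library used is urllib. The request is sent with ==urllib.request.urlopen(req)==, which returns a response object. Reading that object gives the page source, which holds the information we want, although it still has to be extracted later. req is built with req=urllib.request.Request(url,headers=head), where url is the target address and head is the header used to disguise the script as a browser. head must be a dictionary that maps header names to values.
2. response=urllib.request.urlopen(req), html=response.read().decode("utf-8") reads the response and decodes its bytes as UTF-8.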


    import urllib.request
    import urllib.error

    def askURL(url):
        # Imitate a browser header. The User-Agent tells the server what kind of
        # client is asking (and therefore what kind of content it can accept).
        head = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
        }
        req = urllib.request.Request(url, headers=head)
        html = ""
        try:
            response = urllib.request.urlopen(req)
            html = response.read().decode("utf-8")
            # print(html)
        except urllib.error.URLError as e:
            # HTTPError carries a status code; every URLError carries a reason
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
        return html
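The header handling can be checked without touching the network. The following is a minimal sketch, not part of the original script: `build_request` is a hypothetical helper name, the User-Agent string is a shortened placeholder, and the URL is assumed to be the Top250 address that baseurl points at.

```python
import urllib.request

def build_request(url, user_agent):
    """Return a Request that carries a browser-like User-Agent header."""
    headers = {"User-Agent": user_agent}  # headers must be a dict
    return urllib.request.Request(url, headers=headers)

# Assumed URL and placeholder User-Agent, for illustration only
req = build_request("https://movie.douban.com/top250?start=0", "Mozilla/5.0 (example)")

# urllib stores header names capitalized, so the key becomes "User-agent"
print(req.get_header("User-agent"))
print(req.full_url)

# response.read() returns bytes; decode("utf-8") turns them into str
raw = "豆瓣".encode("utf-8")
print(raw.decode("utf-8"))
```

Nothing is sent until urllib.request.urlopen(req) is called, so this is a cheap way to confirm that the header is attached before running the crawler.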

Receiving and extracting the information

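1. Regex extraction. Patterns are created with re.compile(r'...'). When a pattern contains one pair of parentheses, re.findall returns what the parentheses matched, not what the whole expression matched. The patterns used are listed below.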

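    import re

    # With one pair of parentheses in the pattern, findall returns the text
    # matched by the parentheses, not the text matched by the whole pattern.
    # Link to the film's detail page
    findlink = re.compile(r'<a href="(.*?)">')    # compile the pattern that describes the string
    # Poster image
    findImgSrc = re.compile(r'<img.*src="(.*?)".*/>', re.S)    # re.S lets . match newlines as well
    # Film title
    findTitle = re.compile(r'<span class="title">(.*)</span>')
    # Rating
    findRating = re.compile(r'<span class="rating_num" property="v:average">(.*)</span>')
    # Number of ratings
    findJude = re.compile(r'<span>(\d*)人评价</span>')
    # One-line summary
    findIng = re.compile(r'<span class="inq">(.*)</span>')
    # Related details (director, cast, year, country, genre)
    findBd = re.compile(r'<p class="">(.*?)</p>', re.S)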
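To make the group behaviour concrete, here is a small sketch on a made-up HTML fragment (the `sample` string and the names `find_link`, `lazy` and `greedy` are illustrative assumptions, not taken from the Douban page). It also shows why `.*?` and `.*` differ when several tags share one line.

```python
import re

# Made-up fragment with two title spans on the same line
sample = ('<a href="https://example.com/subject/1/">'
          '<span class="title">A</span><span class="title">B</span></a>')

find_link = re.compile(r'<a href="(.*?)">')
# Only the text inside the parentheses is returned, not the whole <a ...> tag
print(find_link.findall(sample))

lazy = re.compile(r'<span class="title">(.*?)</span>')
greedy = re.compile(r'<span class="title">(.*)</span>')

# Non-greedy stops at the first </span>, so each title is found separately
print(lazy.findall(sample))
# Greedy runs to the last </span> on the line and swallows both spans
print(greedy.findall(sample))
```

The greedy patterns listed above presumably work on the real page because each span sits on its own line and `.` does not cross a newline unless re.S is set.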

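2. The list is spread over 10 pages, so a for loop is needed: for i in range(0,10):

url=baseurl+str(25*i) builds the address of page i. Each page lists 25 films (250 films over 10 pages), and str is required to append the number to the string.
html=askURL(url) fetches the page and passes its source in.
soup=BeautifulSoup(html,"html.parser") parses html with the html.parser parser.
for item in soup.find_all("div",class_="item") finds every div tag whose class attribute is item and collects them in a list. The keyword is class_ with a trailing underscore because class is a reserved word in Python.
link=re.findall(findlink,item)[0] extracts the link after item has been converted with str(item). If there are several matches, [0] takes the first one.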
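The page addresses can be generated and checked on their own before any request is sent. This is a sketch under two assumptions: that baseurl ends in `start=` (the value shown is assumed, since the original does not spell it out) and that each page holds 25 entries, which follows from 250 films over 10 pages. `page_urls` is a hypothetical helper name.

```python
def page_urls(baseurl, pages=10, per_page=25):
    """Return the address of every result page, in order."""
    # The offset of page i is per_page * i; str() is needed to join it to the URL
    return [baseurl + str(per_page * i) for i in range(pages)]

# Assumed base address for illustration
urls = page_urls("https://movie.douban.com/top250?start=")
for u in urls:
    print(u)
```

With these defaults the offsets run 0, 25, 50 and so on up to 225, so no film is fetched twice and none is skipped.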

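Note: soup.find_all("div",class_="item") returns a list of tags, each together with its child tags.
re.findall(findlink,item) returns a list of the strings that match the regular expression, so item has to be converted with str(item) first.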
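Because re.findall always returns a list, indexing with [0] fails when nothing matches. The sketch below shows one way to guard against that, using a made-up fragment; `first_or_default` is a hypothetical helper, not something the original script defines.

```python
import re

find_inq = re.compile(r'<span class="inq">(.*)</span>')

def first_or_default(pattern, text, default=" "):
    """Return the first captured string, or default when there is no match."""
    matches = pattern.findall(text)   # always a list, possibly empty
    return matches[0] if matches else default

with_summary = '<span class="inq">Hope sets you free.</span>'
without_summary = '<span class="other">no summary here</span>'

print(find_inq.findall(with_summary))
print(find_inq.findall(without_summary))
print(repr(first_or_default(find_inq, with_summary)))
print(repr(first_or_default(find_inq, without_summary)))
```

This is the same idea as the len(...) != 0 check used for the summary field, wrapped in a function so it can be reused for any field that may be missing.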

    import re
    from bs4 import BeautifulSoup

    # Crawl the pages and parse the data
    def getData(baseurl):
        datalist = []
        for i in range(0, 10):
            url = baseurl + str(25 * i)    # 25 films per page; str() is needed to append the offset
            html = askURL(url)
            # Parse the page
            soup = BeautifulSoup(html, "html.parser")
            for item in soup.find_all("div", class_="item"):    # every <div class="item">; note the underscore in class_
                # print(item)    # test: show everything in item
                data = []
                item = str(item)
                # Link to the film's detail page
                link = re.findall(findlink, item)[0]    # [0] takes the first match
                data.append(link)
                imgSrc = re.findall(findImgSrc, item)[0]
                data.append(imgSrc)
                title = re.findall(findTitle, item)
                if len(title) == 2:
                    ctitle = title[0]
                    data.append(ctitle)    # Chinese title
                    otitle = title[1].replace("/", "")    # the foreign title is prefixed with "/", so remove it
                    otitle = otitle.replace("\xa0\xa0", "")
                    data.append(otitle)
                else:
                    data.append(title[0])
                    data.append(' ')    # only a Chinese title: fill the second column with a blank
                rating = re.findall(findRating, item)[0]
                data.append(rating)
                judeNumber = re.findall(findJude, item)[0]
                data.append(judeNumber)
                ing = re.findall(findIng, item)
                if len(ing) != 0:
                    ing = ing[0].replace("。", "")    # drop the full stop
                    data.append(ing)
                else:
                    data.append(" ")
                bd = re.findall(findBd, item)[0]
                be = re.sub(r'<br(\s+)?/>(\s+)?', "", bd)    # remove <br/> and the whitespace around it
                be = re.sub('/', "", be)
                data.append(be.strip())    # strip leading and trailing whitespace
                datalist.append(data)
        # print(datalist)
        for row in datalist:
            print(row)
        return datalist
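The two cleanup steps inside getData can be tried in isolation. The sketch below repeats the same replace and re.sub calls on made-up inputs; `split_titles` and `clean_bd` are hypothetical helper names, and the sample strings only imitate the shape of the page data.

```python
import re

def split_titles(titles):
    """Return (chinese_title, other_title) from the list of matched titles."""
    if len(titles) == 2:
        # The second title looks like "\xa0/\xa0Name": drop the slash, then the padding
        other = titles[1].replace("/", "").replace("\xa0\xa0", "")
        return titles[0], other
    return titles[0], " "    # only one title: keep the column with a blank

def clean_bd(bd):
    """Remove <br/> tags, slashes and surrounding whitespace from the details text."""
    bd = re.sub(r'<br(\s+)?/>(\s+)?', "", bd)
    bd = re.sub('/', "", bd)
    return bd.strip()

print(split_titles(["肖申克的救赎", "\xa0/\xa0The Shawshank Redemption"]))
print(split_titles(["霸王别姬"]))
print(clean_bd("\n  Director: X<br/>\n  1994 / USA / Crime\n "))
```

Note that removing the <br/> tag together with the whitespace after it joins the two lines directly, which is why the director line and the year run together in the cleaned text.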

Saving the data

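1. Basic xlwt usage:

    import xlwt

    workbook = xlwt.Workbook(encoding="utf-8")    # create the workbook object
    worksheet = workbook.add_sheet('sheet1')      # create a worksheet
    worksheet.write(0, 0, 'hello')                # write data: first argument is the row, second the column, third the content
    workbook.save('student.xls')                  # save it to student.xls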

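    import xlwt

    def saveData(datalist, savepath):
        print("save---")
        # Remember to close the Excel file before running the script again
        book = xlwt.Workbook(encoding="utf-8")
        sheet = book.add_sheet('Douban Movie Top250')
        col = ("Detail link", "Image link", "Chinese title", "Foreign title",
               "Rating", "Number of ratings", "Summary", "Related info")
        # Header goes in row 0
        for i in range(0, 8):
            sheet.write(0, i, col[i])
        # Iterate over what was actually collected instead of assuming 250 rows
        for i, data in enumerate(datalist):
            print("Item %d" % (i + 1))
            for j in range(0, 8):
                sheet.write(i + 1, j, data[j])    # data rows start at row 1
        book.save(savepath)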
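The row and column arithmetic in saveData can be checked without xlwt. The sketch below only computes the (row, column, value) triples that would be passed to sheet.write; `iter_cells` is a hypothetical helper and the two-column data is made up for brevity.

```python
def iter_cells(header, rows):
    """Yield (row, col, value) in the order the sheet would be filled."""
    for col, name in enumerate(header):
        yield 0, col, name               # the header occupies row 0
    for r, record in enumerate(rows):
        for col, value in enumerate(record):
            yield r + 1, col, value      # data is shifted down by one row

header = ("link", "title")
rows = [["u1", "t1"], ["u2", "t2"]]
cells = list(iter_cells(header, rows))
for cell in cells:
    print(cell)
```

Each triple maps directly onto one sheet.write(row, col, value) call, so an off-by-one in the header offset shows up here before any file is written.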
