Fixing garbled HTML and garbled extracted text when web scraping

response1 = requests.get(url=detail_url, headers=headers)
responseText1 = response1.text
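The fetched HTML contains garbled characters, so the text extracted from it with XPath is garbled as well.
Solution:
responseText1 = response1.text.encode('iso-8859-1')
Encoding with utf-8 does not work either; use iso-8859-1. The likely reason is that requests falls back to ISO-8859-1 when the response headers declare no charset, so re-encoding the text with iso-8859-1 restores the original bytes, and lxml can then decode them using the encoding declared in the page itself.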
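The round trip behind this fix can be reproduced without any network access. The sketch below is only an illustration: the helper name `fix_mojibake`, the sample string, and the assumption that the page's real encoding is GBK are all made up for the demo, and the real encoding depends on the site being scraped.

```python
def fix_mojibake(garbled_text, real_encoding):
    """Undo a wrong ISO-8859-1 decode (hypothetical helper for illustration).

    ISO-8859-1 maps every byte 0x00-0xFF to exactly one code point, so
    encoding the garbled text back with it recovers the original bytes.
    """
    raw_bytes = garbled_text.encode('iso-8859-1')
    return raw_bytes.decode(real_encoding)


# Simulated page body; GBK is an assumption made for this demo.
original_text = '北京 数据分析师'
raw_bytes = original_text.encode('gbk')

# This is what the text looks like when the bytes are decoded with the wrong codec.
garbled = raw_bytes.decode('iso-8859-1')

print(repr(garbled))
print(fix_mojibake(garbled, 'gbk'))
```

With requests itself, passing `response1.content` (the undecoded bytes) to `etree.HTML` should give the same result as `response1.text.encode('iso-8859-1')` whenever the fallback described above is what caused the garbling, and it avoids a `UnicodeEncodeError` when requests chose some other encoding.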

# coding=utf-8
import csv
import time

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36',
}

# Write the CSV header row
header = ['company', 'position', 'salary', 'address', 'experience', 'education', 'number_people', 'date', 'welfare', 'position_type']
with open('./beijing.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)

page = 1
while True:
    # List page URL (the original always formatted in 1, so every loop fetched page 1)
    print('Crawling page {}'.format(page))
    list_url = 'https://search.51job.com/list/010000,000000,7500,38,9,99,%2B,2,{}.html'.format(page)
    response = requests.get(url=list_url, headers=headers)
    responseText = response.text
    html_str = etree.HTML(responseText)

    # Collect the detail page URLs
    detailUrl_list = html_str.xpath("//p[@class='t1 ']/span/a/@href")
    print(detailUrl_list)
    # Stop once a list page yields no detail links; otherwise the loop never ends
    if not detailUrl_list:
        break

    # Request each detail page and parse it with XPath
    for detail_url in detailUrl_list:
        response1 = requests.get(url=detail_url, headers=headers)
        # Re-encode to recover the raw bytes so lxml decodes them correctly
        responseText1 = response1.text.encode('iso-8859-1')
        html_str1 = etree.HTML(responseText1)

        # Position
        position_list = html_str1.xpath("//div[@class='cn']/h1/@title")
        position = position_list[0] if position_list else None
        print(position)

        # Company (None when the field is missing)
        company_list = html_str1.xpath("//p[@class='cname']/a/@title")
        company = company_list[0] if company_list else None
        print(company)

        # Salary
        salary_list = html_str1.xpath("//div[@class='cn']/strong/text()")
        salary = salary_list[0] if salary_list else None
        print(salary)

        # Basic info: address, experience, education, headcount, posting date
        other_list = html_str1.xpath("//p[@class='msg ltype']//text()")
        Other = ''.join(other_list).replace('|', '').split()
        print(Other)
        try:
            address, experience, education, number_people, date = Other[:5]
        except ValueError:
            # Fewer than five fields were found
            address, experience, education, number_people, date = None, None, None, None, None
        print(address, experience, education, number_people, date)

        # Benefits (join() never raises, so test for an empty list instead of using try/except)
        welfare_list = html_str1.xpath("//div[@class='t1']/span/text()")
        welfare = ','.join(welfare_list) if welfare_list else 'Benefits not listed'
        print(welfare)

        # Position type
        position_type_list = html_str1.xpath("//p[@class='fp']/a/text()")
        position_type = ','.join(position_type_list) if position_type_list else 'No information'
        print(position_type)

        # Append the record to the CSV file
        data_tuple = (company, position, salary, address, experience, education, number_people, date, welfare, position_type)
        with open('./beijing.csv', 'a', encoding='utf-8', newline='') as f:
            csv.writer(f).writerow(data_tuple)

        # Throttle the requests; adjust as needed (sleep takes seconds)
        time.sleep(1)

    # Next page
    page += 1
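The basic-info step in the script treats a record with any missing field as entirely empty. A more forgiving variant is sketched below as one possible alternative, not as the author's method: it pads the missing fields with None and keeps the ones that were found. The function name `split_basic_info` and the sample text nodes are hypothetical.

```python
def split_basic_info(text_nodes, field_count=5):
    """Split the text nodes of the basic-info paragraph into a fixed number of fields.

    Hypothetical helper: the separator '|' is replaced with a space (not removed)
    so that neighbouring fields cannot fuse when no whitespace surrounds it.
    """
    joined = ''.join(text_nodes).replace('|', ' ')
    # str.split() with no argument also splits on non-breaking spaces (\xa0)
    fields = joined.split()[:field_count]
    # Pad with None so the caller can always unpack exactly field_count values
    fields += [None] * (field_count - len(fields))
    return fields


# Made-up sample resembling what xpath("//p[@class='msg ltype']//text()") might return
sample_nodes = ['北京-朝阳区', '\xa0\xa0|\xa0\xa0', '3-4年经验', '|', '本科']
address, experience, education, number_people, date = split_basic_info(sample_nodes)
print(address, experience, education, number_people, date)
```

Padding keeps partially filled records usable, at the cost that a missing middle field shifts the later values one column to the left, so the result still deserves a sanity check.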
