A homemade crawler example: scraping images and descriptions from a website

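# coding=UTF-8
# The declaration above must sit on the first or second line of the file,
# because the script contains non-ASCII text and must be saved as UTF-8.

# urllib2 fetches the page content
import urllib2
# re is used for the attribute and text patterns below
import re
# BeautifulSoup is a very handy HTML parser
from BeautifulSoup import BeautifulSoup

# Open the page and read its content
c = urllib2.urlopen('http://xxxxxx.html')
soup = BeautifulSoup(c.read())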

# Read the author
user = soup.find('a', onclick=re.compile('shareRec'))['onclick']
regex = ur"发现：.*的#"
match = re.search(regex, user)
user = match.group()
user = user[3:]    # drop the leading "发现："
user = user[:-2]   # drop the trailing "的#"
print "author: " + user

# Read the date
date = soup.find('span', attrs={'class': 'date m_l_5'}).text
# Fixed-offset slices: each field is followed by one separator character
year = date[:4]
month = date[5:7]
day = date[8:10]
hour = date[11:13]
minute = date[14:16]
second = date[17:19]
print "date: " + date
print "year:" + year
print "month:" + month
print "day:" + day
print "hour:" + hour
print "minute:" + minute
print "second:" + second

# Read the title and the region
title = soup.find('div', attrs={'class': 'Mztit'}).text
print "title:" + title
# In BeautifulSoup 3, attrs is a list of (name, value) pairs;
# this takes the value of the first attribute of the matching link.
areaid = soup.find('a', href=re.compile('mddid')).attrs[0][1]
areaid = areaid[20:]   # keep what follows the first 20 characters
area = soup.findAll('a', href=re.compile('mddid=' + areaid))[1].text
partid = soup.find('a', href=re.compile('travel-scenic-spot')).text
part = partid[:-4]     # drop the last four characters
print "area:" + area
print "part:" + part

# Read the description; what matters is the text before the first image
description = soup.find('div', attrs={'id': 'pnl_contentinfo'})
des = description.contents
length = len(des)
descrip = " "
for d in des:
    try:
        # Tag.find returns None and unicode.find returns -1 when there is no "img"
        if not (d.find("img") == -1 or d.find("img") is None):
            if length < 4:
                # Few top-level nodes: walk the children of this node instead
                des_i = d.contents
                for i in des_i:
                    try:
                        if not (i.find("img") == -1 or i.find("img") is None):
                            break
                        else:
                            descrip = descrip + i.text
                    except Exception:
                        # Plain text nodes have no .text attribute
                        i = i.strip()
                        if not (i.find("img") == -1 or i.find("img") is None):
                            break
                        else:
                            descrip = descrip + i
            leng = len(d.contents)
            if leng > 15:
                descrip = descrip + d.text
            # Stop at the first node that contains an image
            break
        else:
            descrip = descrip + d.text
    except Exception:
        pass
print "description:" + descrip

# The key part: read every image together with its text
data = soup.findAll('div', attrs={'vaname': user})
txt = [""]
image = [""]
for d in data:
    have_jpg = d.find('img', attrs={'src': re.compile('jpeg')})
    start = False
    temp_txt = ""
    if have_jpg is not None:
        content = d.contents
        for x in content:
            try:
                # Start collecting at the first child that contains an image
                if (not (x.find("img") == -1 or x.find("img") is None)) and not start:
                    start = True
                if start:
                    t = x.contents
                    for tt in t:
                        try:
                            img = tt.find('img', src=re.compile('http.*jpeg'))
                            if img is None:
                                temp_txt = temp_txt + tt.text
                            else:
                                # An image closes the text gathered so far
                                txt.append(temp_txt)
                                image.append(img['src'])
                                temp_txt = ""
                        except Exception:
                            # Plain text nodes end up here; keep their stripped text
                            ttt = tt.strip()
                            temp_txt = temp_txt + ttt
            except Exception:
                pass
print len(txt)
print len(image)
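
The author and date steps above cut strings by fixed offsets, which breaks silently if the page text shifts by even one character. As a rough sketch only (written for Python 3, whereas the script above targets Python 2), the same two steps could use a regex capture group and `datetime.strptime`. The helper names `extract_author` and `parse_date`, the sample strings, and the `%Y-%m-%d %H:%M:%S` format are assumptions for illustration; the real page may use different separators.

```python
import re
from datetime import datetime

# Same pattern as the script, with a capture group around the author name.
AUTHOR_RE = re.compile(r"发现：(.*)的#")


def extract_author(onclick_text):
    """Return the text between '发现：' and the last '的#', or None if absent."""
    match = AUTHOR_RE.search(onclick_text)
    return match.group(1) if match else None


def parse_date(date_text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse the first 19 characters of date_text using an ASSUMED format.

    Raises ValueError when the text does not match, instead of returning
    misaligned slices.
    """
    return datetime.strptime(date_text[:19], fmt)


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the real page.
    print(extract_author("shareRec('发现：Alice的#游记')"))
    stamp = parse_date("2012-03-17 10:21:05")
    print(stamp.year, stamp.month, stamp.day, stamp.hour, stamp.minute, stamp.second)
```

Returning `None` for a missing author and raising `ValueError` for a malformed date makes a layout change on the site visible right away, rather than producing shifted substrings.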

Reposted from: https://blog.51cto.com/tiandinanyu/810169
