Batch-crawling images from the 27270 beauty-photos column

It ran for a whole night, but on my slow home connection it only managed a few tens of thousands of images.

Failed downloads are retried: each request is attempted up to eight times before giving up.
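
The retry logic in the script boils down to "try again a few times before giving up". As a standalone sketch (the `fetch_with_retry` name and the fake flaky fetcher are mine, not from the script):

```python
import time

def fetch_with_retry(fetch, attempts=8, delay=1.0):
    """Call fetch() up to `attempts` times, sleeping between failures."""
    for i in range(attempts):
        try:
            return fetch()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: let the last error propagate
            time.sleep(delay)

# Usage with a fake fetch that fails twice, then succeeds.
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise IOError('transient')
    return 'ok'

print(fetch_with_retry(flaky, attempts=8, delay=0))  # prints: ok
```

The script implements the same idea with recursion inside `download_img` (decrementing `num` on each failure); a loop like the one above is equivalent and avoids the recursive calls.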

I wrote some logging as well, but after some thought commented most of it out.

The code still carries plenty of marks from repeated tinkering; feel free to take it and clean it up.

The site being scraped is: http://www.27270.com/

The Python version used is 3.5.2.

# -*- coding:utf-8 -*-
import os
import sys
import time
import random
import logging
import requests
import multiprocessing
from bs4 import BeautifulSoup

img_href = []
a_index = {}
html_index = ''
error_num = []
error_href = []
error_path = []
index = {'start': '', 'end': ''}
url_index = 'http://www.27270.com/ent/meinvtupian/'

# Pagination and album pages are walked recursively, so raise the limit.
sys.setrecursionlimit(1000000)

# Get a logger instance (an empty name would return the root logger).
logger = logging.getLogger("AppName")
# Output format for the logger.
formatter = logging.Formatter('%(asctime)s %(levelname)-8s: %(message)s')
# File handler for the log.
file_handler = logging.FileHandler("test.log")
file_handler.setFormatter(formatter)
logger.addHandler(file_handler)
# Minimum level to emit (the default would be WARNING).
logger.setLevel(logging.INFO)


class Flag(object):
    # Leftover helper from earlier revisions; not used below.
    def __init__(self):
        self.f = True

    def get_f(self):
        return self.f

    def set_f(self):
        self.f = False


def is_folder(file_name=''):
    # Create the image storage folder if it does not exist yet.
    cwd = os.getcwd() + file_name
    if not os.path.exists(cwd):
        os.mkdir(cwd)
        print('Created image folder %s' % file_name)


def get_url(url='', host=''):
    # Fetch a URL and return the response, or the string 'error' on failure.
    response = ''
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    }
    # Currently unused; kept from earlier revisions.
    cookies = {'Cookie': 'Hm_lvt_63a864f136a45557b3e0cbce07b7e572=1519296125,1519296217,1519306647,1519309454; Hm_lpvt_63a864f136a45557b3e0cbce07b7e572=1519310130'}
    # A proxy can be configured here if needed (also unused by default).
    proxies = {
        "http": "http://" + '61.155.164.106:3128',
        "https": "http://" + '61.155.164.106:3128',
    }
    if host != '':
        header['Host'] = host
    try:
        response = requests.get(url, headers=header, timeout=30)
    except Exception:
        response = 'error'
        logger.error('%s \t\t get error' % url)
    finally:
        if host != '':
            del header['Host']
        time.sleep(random.randint(1, 4))
        return response


def download_img(url='http://t1.27270.com/uploads/tu/201802/726/e6e5afe62c.jpg', name='', the_path='', num=8):
    # Download a single image, retrying up to `num` times on failure.
    response = get_url(url, host='t2.hddhhn.com')
    if response != 'error':
        cwd = os.getcwd() + r'\woman'
        file_name = name + '.' + url.split('/')[-1].split('.')[-1]
        logger.warning('%s \t\t download...' % url)
        with open(cwd + '\\' + the_path + file_name, 'wb') as f:
            f.write(response.content)
    else:
        if num > 0:
            return download_img(url, name=name, the_path=the_path, num=num - 1)
        print('download error')
        return


def get_index(url_index):
    # Fetch the HTML of the column's index page.
    response = get_url(url_index)
    response.encoding = 'gb2312'
    return response.text


def get_start_end(url=''):
    # Work out the first and last page numbers of the column.
    response = get_url(url)
    response.encoding = 'gb2312'
    html_index = response.text
    soup = BeautifulSoup(html_index, "html.parser")
    a_index_a_all = soup.find("div", class_="NewPages").find('ul').find_all('a', target='_self')
    for a_index_a in a_index_a_all:
        a_index[a_index_a.string] = url_index + a_index_a['href']
    html_index = get_index(url_index)
    soup = BeautifulSoup(html_index, "html.parser")
    index['start'] = soup.find("div", class_="NewPages").find('li', class_='thisclass').find('a').string
    response = get_url(a_index['末页'])
    response.encoding = 'gb2312'
    html_index = response.text
    soup = BeautifulSoup(html_index, "html.parser")
    index['end'] = soup.find("div", class_="NewPages").find('li', class_='thisclass').find('a').string


def get_page_href(url=''):
    # Follow the pagination buttons, recording every page's URL in a_index.
    new_num = 0
    response = get_url(url)
    if response != 'error':
        response.encoding = 'gb2312'
        html_index = response.text
        soup = BeautifulSoup(html_index, "html.parser")
        a_index_a_all = soup.find("div", class_="NewPages").find('ul').find_all('a', target='_self')
        for a_index_a in a_index_a_all:
            a_index[a_index_a.string] = url_index + a_index_a['href']
            if str(a_index_a.string).isdigit() and int(a_index_a.string) > int(new_num):
                new_num = a_index_a.string
        print('Progress: %.2f%%' % (int(new_num) * 100 / int(index['end'])))
        if int(new_num) >= int(index['end']):
            return
    else:
        new_num -= 1
        print('page error')
    get_page_href(a_index[new_num])


def get_father_img(url_index_child):
    # Collect the album links (<a class="MMPic">) on one listing page.
    a_index_a_all = ''
    response = get_url(url_index_child)
    if response != 'error':
        response.encoding = 'gb2312'
        html_index = response.text
        soup = BeautifulSoup(html_index, "html.parser")
        a_index_a_all = soup.find('div', class_='MeinvTuPianBox').find('ul').find_all('a', class_='MMPic')
    return a_index_a_all


def download_children_img(url, title):
    # Download every image belonging to one album.
    num = 0
    global child_img_href
    max_index = '0'
    child_img_href = {'1': url}
    get_child_href(url, max_index, title)
    print('%d images, downloading\n' % len(child_img_href))
    for key, val in child_img_href.items():
        try:
            response = get_url(val)
            if response != 'error':
                response.encoding = 'gb2312'
                html_index = response.text
                soup = BeautifulSoup(html_index, "html.parser")
                href = str(soup.find('div', class_='articleV4Body').find('img')['src'])
                is_folder(r'\woman\\' + title)
                download_img(href, str(num), title + '\\')
                num += 1
        except Exception:
            print('image download failed')


def get_child_href(url_index_child, max_index, file_name=''):
    # Collect the URLs of all pages of one album into child_img_href.
    num = '0'
    response = get_url(url_index_child)
    if response != 'error':
        if file_name != '':
            is_folder(r'\woman\\' + file_name)
        response.encoding = 'gb2312'
        html_index = response.text
        soup = BeautifulSoup(html_index, "html.parser")
        max_index = soup.find('div', class_='page-tag oh').find('ul').find('li', class_='hide')['pageinfo']
        a_index_a_first = soup.find("div", class_="page-tag oh").find('ul').find('li', class_='thisclass')
        for sibling in a_index_a_first.next_siblings:
            if str(sibling.string).isdigit():
                if int(sibling.string) > int(num):
                    num = int(sibling.string)
                child_img_href[str(sibling.string)] = '/'.join(url_index_child.split('/')[:-1]) + '/' + sibling.find('a')['href']
        if int(num) >= int(max_index):
            return
        else:
            num = str(int(num) + 1)
            get_child_href(child_img_href[str(num)], max_index)


def download_url_all():
    # Walk every listing page and collect all album links into img_href.
    index = 1
    zz = 0
    for key, value in a_index.items():
        a_index_a_all = get_father_img(value)
        print('%d / %s' % (index, len(a_index)))
        for a_index_a in a_index_a_all:
            img_href.append(a_index_a)
            # download_children_img(a_index_a['href'], a_index_a['title'])
            # logger.warning('album: %d : %s %s' % (zz, a_index_a['href'], a_index_a['title']))
            zz += 1
        index += 1
    print('half: %d' % int(len(img_href) / 2))


def func(all_href):
    for a_index_a in all_href:
        download_children_img(a_index_a['href'], a_index_a['title'])


if __name__ == '__main__':
    get_start_end(url_index)
    get_page_href(url_index)
    del a_index['首页']
    del a_index['末页']
    del a_index['上一页']
    del a_index['下一页']
    is_folder(r'\woman')
    download_url_all()
    # Split the album list in half and download with two processes.
    img_href_first = img_href[:int(len(img_href) / 2)]
    img_href_second = img_href[int(len(img_href) / 2):]  # was /2+1, which skipped one album
    p1 = multiprocessing.Process(target=func, args=(img_href_first,))
    p2 = multiprocessing.Process(target=func, args=(img_href_second,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    input('end')
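
One easy cleanup: the script raises the recursion limit to a million because the pagination functions call themselves once per page. Walking pages with a plain loop removes that need entirely. A minimal sketch of the idea (the `walk_pages` name and the fake three-page site are mine; `fetch_links` stands in for the BeautifulSoup parsing above):

```python
def walk_pages(first_url, fetch_links):
    """Iteratively follow 'next page' links instead of recursing.

    fetch_links(url) must return (items_on_page, next_url_or_None).
    """
    items = []
    url = first_url
    seen = set()
    while url and url not in seen:  # guard against pagination loops
        seen.add(url)
        page_items, url = fetch_links(url)
        items.extend(page_items)
    return items

# Usage with a tiny fake site: each page lists items and links the next page.
fake = {
    'p1': (['a', 'b'], 'p2'),
    'p2': (['c'], 'p3'),
    'p3': (['d'], None),
}
print(walk_pages('p1', fake.get))  # prints: ['a', 'b', 'c', 'd']
```

With a loop like this, `sys.setrecursionlimit(1000000)` and the tail-recursive calls in `get_page_href` / `get_child_href` could be dropped.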
