Requirements analysis

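The project uses the first 7 digits of a mobile number to check whether the number is valid and to look up its home region. The existing data is several years old, so I decided to crawl a fresh copy with a Python crawler.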

Single-threaded version

# coding:utf-8
from datetime import datetime

import requests


class PhoneInfoSpider:
    def __init__(self, phoneSections):
        self.phoneSections = phoneSections

    def phoneInfoHandler(self, textData):
        # The response is parsed by line position, one field per line
        text = textData.splitlines(True)
        if len(text) >= 9:
            # The value sits between the first pair of single quotes on each line
            number = text[1].split('\'')[1]
            province = text[2].split('\'')[1]
            mobile_area = text[3].split('\'')[1]
            postcode = text[5].split('\'')[1]
            line_text = number + "," + province + "," + mobile_area + "," + postcode
            print(line_text)
            try:
                # "with" closes the file after every append
                with open('./result.txt', 'a', encoding='utf-8') as f:
                    f.write(line_text + '\n')
            except Exception as e:
                print(type(e).__name__, ":", e)

    def requestPhoneInfo(self, phoneNum):
        try:
            url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
            # A timeout keeps one stalled request from hanging the whole crawl
            response = requests.get(url, timeout=10)
            self.phoneInfoHandler(response.text)
        except Exception as e:
            print(type(e).__name__, ":", e)

    def requestAllSections(self):
        # "last" resumes from the number reached before an abnormal exit
        last = 0
        # last = 4
        # Generate the phone numbers automatically, padding the last four digits with 0
        for head in self.phoneSections:
            head_begin = datetime.now()
            print(head + " begin time:" + str(head_begin))
            # for i in range(last, 10000):  # full segment
            for i in range(last, 10):  # reduced range, e.g. for a quick trial
                middle = str(i).zfill(4)
                phoneNum = head + middle + "0000"
                self.requestPhoneInfo(phoneNum)
            # Only the first segment is resumed; later segments start from 0
            last = 0
            head_end = datetime.now()
            print(head + " end time:" + str(head_end))


if __name__ == '__main__':
    task_begin = datetime.now()
    print("phone check begin time:" + str(task_begin))
    # China Telecom, China Unicom, China Mobile, virtual operators
    dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
    lt = ['130', '131', '132', '145', '146', '155', '156', '166', '171', '175', '176', '185', '186']
    yd = ['134', '135', '136', '137', '138', '139', '147', '148', '150', '151', '152', '157', '158', '159', '172',
          '178', '182', '183', '184', '187', '188', '198']
    add = ['170']
    all_num = dx + lt + yd + add
    print(len(all_num))
    # Segments to crawl
    spider = PhoneInfoSpider(all_num)
    spider.requestAllSections()
    task_end = datetime.now()
    print("phone check end time:" + str(task_end))
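The handler above picks fields by line position, which stops working if the service adds or reorders a line. A minimal sketch of a key-based alternative follows. It assumes the endpoint returns an object-literal style body with one `key:'value'` pair per field, as the quote-splitting in `phoneInfoHandler` implies. The function name `parse_zone_result` is hypothetical, and the sample text (key names and values included) is illustrative rather than captured output.

```python
import re

# Matches pairs such as  province:'...'  anywhere in the response text.
PAIR_RE = re.compile(r"(\w+)\s*:\s*'([^']*)'")


def parse_zone_result(text):
    """Return a dict of every key:'value' pair found in the response text."""
    return dict(PAIR_RE.findall(text))


# Illustrative sample only; the real key names and values may differ.
sample = """__GetZoneResult_ = {
    mts:'1380000',
    province:'北京',
    catName:'中国移动',
    telString:'13800000000',
    areaVid:'29400',
    ispVid:'3236139',
    carrier:'北京移动'
}"""

fields = parse_zone_result(sample)
print(fields.get("mts"), fields.get("province"), fields.get("catName"))
```

Looking fields up with `fields.get(...)` returns `None` for a missing key instead of raising `IndexError`, so an unexpected response can be skipped or logged explicitly.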

Crawling one segment takes 10,000 queries. The single-threaded version needs roughly an hour and a half per segment, which is too slow.
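To put that figure in perspective, the arithmetic below converts it into a per-query latency and a rough total for the whole list. It is a back-of-the-envelope sketch: the 90 minutes is the approximate figure quoted above, and the segment count of 45 is an assumption taken from the number of distinct prefixes in the single-threaded list.

```python
QUERIES_PER_SEGMENT = 10_000
MINUTES_PER_SEGMENT = 90   # "roughly an hour and a half" from the text
SEGMENT_COUNT = 45         # assumption: distinct prefixes in the list above

# Average time spent on one request, in seconds
seconds_per_query = MINUTES_PER_SEGMENT * 60 / QUERIES_PER_SEGMENT
# Time to crawl every segment one after another, in hours
total_hours = SEGMENT_COUNT * MINUTES_PER_SEGMENT / 60

print(f"{seconds_per_query:.2f} s per query")                   # 0.54 s per query
print(f"{total_hours:.1f} hours for {SEGMENT_COUNT} segments")  # 67.5 hours for 45 segments
```

At about half a second per request, nearly all of the time is presumably spent waiting on the network, which is the kind of workload that threads can overlap.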

Multi-threaded version

# coding:utf-8
import queue
import threading
from datetime import datetime

import requests

threadNum = 32
# Serializes appends to result.txt across worker threads
write_lock = threading.Lock()


class MyThread(threading.Thread):
    def __init__(self, func):
        threading.Thread.__init__(self)
        self.func = func

    def run(self):
        self.func()


def requestPhoneInfo():
    # q and phone_head are module-level names set in the main block below
    while True:
        try:
            # queue.Queue is thread-safe, so the check and the fetch are one call
            p = q.get_nowait()  # fetch a task
        except queue.Empty:
            break
        # Build the number from the task value itself so that every thread
        # gets a distinct number regardless of how fast the queue drains
        middle = str(p).zfill(4)
        phoneNum = phone_head + middle + "0000"
        print("phoneNum:" + phoneNum)
        try:
            url = 'https://tcc.taobao.com/cc/json/mobile_tel_segment.htm?tel=' + phoneNum
            # A timeout keeps one stalled request from blocking its thread forever
            response = requests.get(url, timeout=10)
            phoneInfoHandler(response.text)
        except Exception as e:
            print(type(e).__name__, ":", e)


def phoneInfoHandler(textData):
    text = textData.splitlines(True)
    if len(text) >= 9:
        # The value sits between the first pair of single quotes on each line
        number = text[1].split('\'')[1]
        province = text[2].split('\'')[1]
        mobile_area = text[3].split('\'')[1]
        postcode = text[5].split('\'')[1]
        line_text = number + "," + province + "," + mobile_area + "," + postcode
        print(line_text)
        try:
            # Hold the lock while appending so lines from different threads do not interleave
            with write_lock, open('./result.txt', 'a', encoding='utf-8') as f:
                f.write(line_text + '\n')
        except Exception as e:
            print(type(e).__name__, ":", e)


if __name__ == '__main__':
    task_begin = datetime.now()
    print("phone check begin time:" + str(task_begin))
    # China Telecom, China Unicom, China Mobile
    dx = ['133', '149', '153', '173', '177', '180', '181', '189', '199']
    lt = ['130', '131', '132', '145', '155', '156', '166', '171', '175', '176', '185', '186']
    yd = ['134', '135', '136', '137', '138', '139', '147', '150', '151', '152', '157', '158', '159', '172', '178',
          '182', '183', '184', '187', '188', '198']
    all_num = dx + lt + yd
    print(len(all_num))
    for head in all_num:
        head_begin = datetime.now()
        print(head + " begin time:" + str(head_begin))
        # One fresh queue per segment, holding the middle four digits 0..9999
        q = queue.Queue()
        threads = []
        phone_head = head
        for p in range(10000):
            q.put(p)
        print(q.qsize())
        for i in range(threadNum):
            thread = MyThread(requestPhoneInfo)
            thread.start()
            threads.append(thread)
        # Wait for the whole segment to finish before starting the next one
        for thread in threads:
            thread.join()
        head_end = datetime.now()
        print(head + " end time:" + str(head_end))
    task_end = datetime.now()
    print("phone check end time:" + str(task_end))
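The same fan-out can also be written with the standard library's `concurrent.futures.ThreadPoolExecutor`, which takes the place of the hand-rolled queue and thread class. This is a sketch of a swapped-in technique, not the approach used in the script above. The names `build_numbers`, `crawl_segment` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for the real HTTP request so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def build_numbers(head, count=10000):
    """Yield head + 4-digit middle + '0000' for every middle in range(count)."""
    for i in range(count):
        yield head + str(i).zfill(4) + "0000"


def fake_fetch(phone_num):
    """Hypothetical stand-in for the HTTP request; returns one CSV row."""
    return phone_num[:7] + ",province,carrier,area"


def crawl_segment(head, fetch, workers=32, count=10000):
    """Fetch every number of one segment on a thread pool, in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the inputs
        return list(pool.map(fetch, build_numbers(head, count)))


rows = crawl_segment("138", fake_fetch, workers=4, count=20)
print(len(rows), rows[0], rows[-1])
```

Because the rows come back to the calling thread, a single writer can append them to the result file afterwards, and no lock around the file is needed.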

With the multi-threaded version, one segment (10,000 numbers) finishes in roughly 2 to 3 minutes. CPU usage spikes and then stays at around 70%.
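That time is close to what an even split of the work across the threads would predict. The check below is a rough estimate only: it assumes the single-threaded figure of about 90 minutes per segment, requests that overlap perfectly across 32 threads (an idealization), and 42 segments, which is the number of distinct prefixes in the multi-threaded list.

```python
SINGLE_THREAD_MINUTES = 90  # per segment, approximate single-threaded figure
THREADS = 32                # threadNum in the script above
SEGMENT_COUNT = 42          # assumption: distinct prefixes in the multi-threaded list

# Ideal per-segment time if the requests overlap perfectly
ideal_minutes = SINGLE_THREAD_MINUTES / THREADS
# Ideal time for the whole list, in hours
total_hours = ideal_minutes * SEGMENT_COUNT / 60

print(f"{ideal_minutes:.1f} min per segment")                # 2.8 min per segment
print(f"{total_hours:.1f} h for {SEGMENT_COUNT} segments")   # 2.0 h for 42 segments
```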

There are a little over 40 segments in total. Crawling all of them takes roughly 1 to 2 hours and yields about 410,000 records.
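Once the crawl has finished, the result file can back the validity check and home-region lookup that motivated it. The sketch below is one possible way to do that, assuming each line has the four comma-separated fields written by `phoneInfoHandler` and that the first field is the 7-digit prefix. The function names `load_segments` and `lookup` are hypothetical, and the sample rows are made up for illustration.

```python
import csv
import io


def load_segments(lines):
    """Build {7-digit prefix: (province, mobile_area, postcode)} from CSV rows."""
    table = {}
    for row in csv.reader(lines):
        if len(row) == 4:  # skip blank or malformed lines
            number, province, mobile_area, postcode = row
            table[number] = (province, mobile_area, postcode)
    return table


def lookup(table, phone):
    """Return the record for an 11-digit number, or None if it is unknown."""
    if len(phone) != 11 or not phone.isdigit():
        return None
    return table.get(phone[:7])


# Illustrative rows only, not real crawl output.
sample = io.StringIO("1380000,北京,中国移动,29400\n1300001,江苏,中国联通,30511\n")
table = load_segments(sample)
print(lookup(table, "13800001234"))
print(lookup(table, "19912345678"))
```

With real data, the `io.StringIO` sample would be replaced by the opened file, for example `with open('./result.txt', encoding='utf-8') as f: table = load_segments(f)`. A number whose prefix is missing from the table is treated as invalid.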
