1 Introduction

What is a web crawler?

My own short take: a program that scrapes the data you want (videos, images, novels, and so on) from openly accessible web content according to specific rules.

The authoritative definition (Wikipedia):

A web crawler, also called a web spider, is an Internet bot that automatically browses the World Wide Web, typically for the purpose of compiling a web index.

Web search engines and similar sites use crawler software to update their own content or their indexes of other sites. A crawler can save the pages it visits so that a search engine can later build an index from them for users to search.

Crawling consumes resources on the target system, and many sites do not welcome crawlers by default. When visiting large numbers of pages, a crawler therefore has to consider scheduling and load, and be "polite". Public sites that do not wish to be crawled, or to be known to the crawler's operator, can say so with mechanisms such as a robots.txt file, which can ask robots to index only part of the site, or none of it at all.
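For illustration, a robots.txt along the following lines (the directory and bot name here are made up for this example) asks all robots to skip one directory and shuts a specific crawler out entirely:

# keep every robot out of /private/
User-agent: *
Disallow: /private/

# deny one particular crawler access to the whole site
User-agent: BadBot
Disallow: /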

The web contains an enormous number of pages, and even the largest crawler systems cannot index all of them. In the early days of the World Wide Web, before the year 2000, search engines often found few relevant results. Today's search engines have improved greatly in this respect and can return high-quality results instantly.

Crawlers can also be used to validate hyperlinks and HTML code, and for web scraping (see data-driven programming).
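As a taste of the link-checking just mentioned, here is a minimal sketch of extracting hyperlinks with Jsoup (already one of this tutorial's dependencies); the URL is only a placeholder:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkLister {
    public static void main(String[] args) throws Exception {
        // Fetch and parse a page, then list every hyperlink on it.
        Document doc = Jsoup.connect("https://example.com/").get();
        for (Element link : doc.select("a[href]")) {
            // absUrl resolves relative hrefs against the page URL
            System.out.println(link.absUrl("href"));
        }
    }
}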

Before reading, please note:

This tutorial uses Maven for dependency management

Click to view the Maven tutorial

This tutorial uses Fiddler4 to capture and analyze web traffic

Click to view the Fiddler tutorial

Tools: IDEA + Fiddler4 + Google Chrome

2 Preparation

  1. Create a Maven project in IDEA and add the following dependencies

<dependencies>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpcore</artifactId>
        <version>4.4.10</version>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.6</version>
    </dependency>
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.7.3</version>
    </dependency>
    <dependency>
        <groupId>commons-io</groupId>
        <artifactId>commons-io</artifactId>
        <version>2.4</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
        <version>3.9</version>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <version>1.18.8</version>
    </dependency>
    <dependency>
        <groupId>commons-httpclient</groupId>
        <artifactId>commons-httpclient</artifactId>
        <version>3.1</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.13-beta-3</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>com.google.code.gson</groupId>
        <artifactId>gson</artifactId>
        <version>2.8.5</version>
    </dependency>
</dependencies>
  2. Open Fiddler and filter the requests for the target page

    Take Haokan Video (好看视频) as the example:

The usual approach is to switch the response pane at the bottom to text view and copy its contents into an editor, where they are much easier to read and analyze.

  3. From the analysis, work out the path from a homepage video to its detail page, then extract the video link from the detail page

Steps: ① Get the page source

/**
 * Fetch the source of a web page.
 * @param urlstr the page URL
 * @return the page source as a string
 */
static String getPageContent(String urlstr) {
    HttpClientBuilder builder = HttpClients.custom();
    // CloseableHttpClient is an abstract class implementing the HttpClient interface
    CloseableHttpClient client = builder.build();
    URL url = null;
    URI uri = null;
    try {
        url = new URL(urlstr);
        uri = new URI(url.getProtocol(), url.getHost(), url.getPath(), url.getQuery(), null);
    } catch (Exception e) {
        e.printStackTrace();
    }
    HttpGet request = new HttpGet(uri);
    String content = null;
    try {
        CloseableHttpResponse response = client.execute(request);
        HttpEntity entity = response.getEntity();
        content = EntityUtils.toString(entity, "utf-8");
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        request.completed();
    }
    return content;
}
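A quick, hypothetical smoke test for this method, assuming it lives in the same class (the URL is this tutorial's example homepage):

public static void main(String[] args) {
    // Hypothetical sanity check: fetch the homepage and report how much we got.
    String html = getPageContent("https://haokan.baidu.com/");
    System.out.println("fetched " + html.length() + " characters");
}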
/**
 * Fetch the homepage source (before pagination), save it to a local file,
 * read it back, and collect the video links with a regex.
 */
private static List<String> getMainUrl() {
    String content = DownLoadVideo.getPageContent("https://haokan.baidu.com/");
    Document document = Jsoup.parse(content);
    String filePath = "D:\\video_html\\a.html";  // customize this path
    File file = new File(filePath);
    List<String> list = new ArrayList<>();
    try {
        FileUtils.writeStringToFile(file, content, "UTF-8", false);
        String fileToString = FileUtils.readFileToString(file, "utf-8");
        // customize the regex to match the target page
        String regex = "\"https://haokan.baidu.com\\/v\\?vid=" + "(.*?)" + "\"";
        matchRegex(list, fileToString, regex);
    } catch (IOException e) {
        e.printStackTrace();
    }
    // System.out.println("list = " + list);
    return list;
}

② Use a regex to match the video links on the page

/**
 * Collect every match of the given regex in content into list.
 */
private static void matchRegex(List<String> list, String content, String regex) {
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(content);
    while (matcher.find()) {
        list.add(matcher.group(0));
    }
}

③ Analyze the detail pages and extract the video links

/**
 * Build the list of video links (before pagination).
 */
private static List<String> getVideoUrlList(List<String> list) {
    List<String> matchUrlList = new ArrayList<>();
    for (String s1 : list) {
        String s = s1.replace("\"", "");
        // fetch the source of each detail page
        String pageContent = getPageContent(s);
        try {
            // optionally save each page for offline inspection:
            // File file = new File("D:\\video_html\\" + new Random().nextInt(1000) + ".html");
            // FileUtils.writeStringToFile(file, pageContent, "utf-8");
            // String fileToString = FileUtils.readFileToString(file, "utf-8");
            // this regex matches the video links for every resolution
            String regex = "vd" + "(.*?)" + "\\/sc" + "(.*?)" + ".mp4";
            matchRegex(matchUrlList, pageContent, regex);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    return matchUrlList;
}

The core asynchronous download utility class. It first issues a Range request to see whether the server supports segmented (resumable) downloads, splits the file into up to five byte ranges by size, writes each range through its own memory-mapped NIO buffer, and runs a watchdog timer that aborts the download when progress stalls:

/**
 * NIO-based asynchronous file download utility.
 */
public class Downloader extends Observable {
    private String url, savePath;              // download URL and save path
    private FileChannel channel;               // channel for the target file
    private long size, perSize;                // total file size and size of each chunk
    private volatile long downloaded;          // bytes downloaded so far
    private int connectCount;                  // number of connections
    private Connection[] connections;          // connection objects
    private boolean isSupportRange;            // whether the server supports resumable (range) downloads
    private long timeout;                      // timeout
    private boolean exists;                    // whether the target file already exists
    private RandomAccessFile randomAccessFile;
    private volatile boolean stop;             // stop flag
    private static volatile boolean exception; // error flag
    private AtomicLong prevDownloaded = new AtomicLong(0); // bytes downloaded at the previous tick
    private static Log log = LogFactory.getLog(Downloader.class);
    private AtomicInteger loseNum = new AtomicInteger(0);
    private int maxThread;

    public Downloader(String url, String savePath) throws IOException {
        // five-minute default timeout
        this(url, savePath, 1000 * 60 * 5, 50);
    }

    public Downloader(String url, String savePath, long timeout, int maxThread) throws FileNotFoundException {
        this.timeout = timeout;
        this.url = url;
        this.maxThread = maxThread;
        File file = new File(savePath);
        if (!file.exists()) file.mkdirs();
        this.savePath = file.getAbsolutePath() + "/" + url.substring(url.lastIndexOf("/"));
        exists = new File(this.savePath).exists();
        if (!exists) {
            randomAccessFile = new RandomAccessFile(this.savePath + ".temp", "rw");
            channel = randomAccessFile.getChannel();
        }
    }

    public GetMethod method(long start, long end) throws IOException {
        GetMethod method = new GetMethod(Downloader.this.url);
        method.setRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");
        if (end > 0) {
            method.setRequestHeader("Range", "bytes=" + start + "-" + (end - 1));
        } else {
            method.setRequestHeader("Range", "bytes=" + start + "-");
        }
        HttpClientParams clientParams = new HttpClientParams();
        // five-second connection timeout
        clientParams.setConnectionManagerTimeout(5000);
        HttpClient client = new HttpClient(clientParams);
        client.executeMethod(method);
        int statusCode = method.getStatusCode();
        if (statusCode >= 200 && statusCode < 300) {
            // 206 Partial Content means range requests are supported
            isSupportRange = (statusCode == 206);
        }
        return method;
    }

    public void init() throws IOException {
        size = method(0, -1).getResponseContentLength();
        if (isSupportRange) {
            if (size < 4 * 1024 * 1024) {         // under 4 MB: one connection
                connectCount = 1;
            } else if (size < 10 * 1024 * 1024) { // under 10 MB: two connections
                connectCount = 2;
            } else if (size < 30 * 1024 * 1024) { // under 30 MB: three connections
                connectCount = 3;
            } else if (size < 60 * 1024 * 1024) { // under 60 MB: four connections
                connectCount = 4;
            } else {                              // otherwise: five connections
                connectCount = 5;
            }
        } else {
            connectCount = 1;
        }
        log.debug(String.format("%s size:%s connectCount:%s", this.url, this.size, this.connectCount));
        perSize = size / connectCount;
        connections = new Connection[connectCount];
        long offset = 0;
        for (int i = 0; i < connectCount - 1; i++) {
            connections[i] = new Connection(offset, offset + perSize);
            offset += perSize;
        }
        connections[connectCount - 1] = new Connection(offset, size);
    }

    /**
     * Forcibly release a memory-mapped buffer.
     *
     * @param mappedByteBuffer the buffer to unmap
     */
    static void unmapFileChannel(final MappedByteBuffer mappedByteBuffer) {
        try {
            if (mappedByteBuffer == null) {
                return;
            }
            mappedByteBuffer.force();
            AccessController.doPrivileged(new PrivilegedAction<Object>() {
                @Override
                public Object run() {
                    try {
                        Method getCleanerMethod = mappedByteBuffer.getClass().getMethod("cleaner", new Class[0]);
                        getCleanerMethod.setAccessible(true);
                        sun.misc.Cleaner cleaner = (sun.misc.Cleaner) getCleanerMethod.invoke(mappedByteBuffer, new Object[0]);
                        cleaner.clean();
                    } catch (Exception e) {
                        // log.error("unmapFileChannel." + e.getMessage());
                    }
                    return null;
                }
            });
        } catch (Exception e) {
            log.debug("error -> exception=true");
            exception = true;
            log.error(e);
        }
    }

    private void timer() {
        Timer timer = new Timer();
        // first run after 3 seconds, then every 3 seconds
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                log.debug(String.format("downloaded --> %s -> %s", (((double) downloaded) / size * 100) + "%", url));
                // if no progress was made since the last tick, count it as a stall
                if (prevDownloaded.get() == downloaded && downloaded < size) {
                    if (loseNum.getAndIncrement() >= 10) {
                        log.debug(String.format("previous progress %s equals current %s, exception -> true  url:%s", prevDownloaded.get(), downloaded, url));
                        exception = true;
                    }
                }
                // stop once the download completes or an error occurs
                if (downloaded >= size || exception) {
                    stop = true;
                    cancel();
                }
                // remember the current progress for the next tick
                prevDownloaded.set(downloaded);
            }
        }, 3000, 3000);
    }

    public void start() throws IOException {
        if (exists) {
            log.info("File already exists. " + this.url);
            Thread.currentThread().interrupt();
            return;
        }
        while (Thread.activeCount() > maxThread) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
            }
        }
        init();
        timer();
        CountDownLatch countDownLatch = new CountDownLatch(connections.length);
        log.debug("Starting download: " + url);
        for (int i = 0; i < connections.length; i++) {
            new DownloadPart(countDownLatch, i).start();
        }
        end(countDownLatch);
    }

    private boolean rename(File tempFile) {
        File file = new File(this.savePath);
        boolean isRename = tempFile.renameTo(file);
        if (!isRename) {
            // fall back to copying when an atomic rename is not possible
            try {
                IOUtils.copy(new FileInputStream(tempFile), new FileOutputStream(file));
            } catch (IOException e) {
                log.error(e);
            }
        }
        return true;
    }

    public void end(CountDownLatch countDownLatch) {
        try {
            // give up once the overall timeout is exceeded
            countDownLatch.await(timeout, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            exception = true;
            log.error(e);
            log.info("Download failed: " + this.url);
        } finally {
            try {
                channel.force(true);
                channel.close();
                randomAccessFile.close();
            } catch (IOException e) {
                log.error(e);
            }
            File temp = new File(this.savePath + ".temp");
            log.debug(String.format("%s  %s", exception, this.url));
            if (exception) {
                // on failure, delete the partially downloaded temp file
                temp.delete();
            } else {
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                }
                rename(temp);
                setChanged();
                notifyObservers(this.url);
                log.info("Download finished: " + this.url);
            }
        }
    }

    private class Connection {
        long start, end;

        public Connection(long start, long end) {
            this.start = start;
            this.end = end;
        }

        public InputStream getInputStream() throws IOException {
            return method(start, end).getResponseBodyAsStream();
        }
    }

    private class DownloadPart implements Runnable {
        CountDownLatch countDownLatch;
        int i;

        public DownloadPart(CountDownLatch countDownLatch, int i) {
            this.countDownLatch = countDownLatch;
            this.i = i;
        }

        public void start() {
            new Thread(this).start();
        }

        @Override
        public void run() {
            MappedByteBuffer buffer = null;
            InputStream is = null;
            try {
                is = connections[i].getInputStream();
                // map this connection's byte range of the target file into memory
                buffer = channel.map(FileChannel.MapMode.READ_WRITE, connections[i].start, connections[i].end - connections[i].start);
                byte[] bytes = new byte[4 * 1024];
                int len;
                while ((len = is.read(bytes)) != -1 && !exception && !stop) {
                    buffer.put(bytes, 0, len);
                    downloaded += len;
                }
                log.debug(String.format("file block had downloaded. %s %s", i, url));
            } catch (IOException e) {
                log.error(e);
            } finally {
                unmapFileChannel(buffer);
                if (buffer != null) buffer.clear();
                if (is != null) try {
                    is.close();
                } catch (IOException e) {
                }
                countDownLatch.countDown();
            }
        }
    }
}
/**
 * Entity class for parsing the JSON response.
 */
@Data
public class videoBean {
    private int errno;
    private String error;
    private DataBean data;

    @Data
    public static class DataBean {
        private RequestParamBean requestParam;
        private ResponseBean response;

        @Data
        public static class RequestParamBean {
            private String tab;
            private String num;
            private String bfe;
            private String shuaxin_id;
            private String pd;
            private String act;
            private String cuid;
        }

        @Data
        public static class ResponseBean {
            private List<VideosBean> videos;

            @Data
            public static class VideosBean {
                private String id;
                private String title;
                private String poster;
                private String poster_small;
                private String poster_big;
                private String source_name;
                private String play_url;
                private int playcnt;
                private String mthid;
                private String mthpic;
                private String threadId;
                private Object site_name;
                private String duration;
                private String url;
                private String cmd;
                private String loc_id;
                private CommentInfoBean commentInfo;
                private String comment_id;
                private int show_tag;
                private String publish_time;
                private String new_cate_v2;
                private int like;
                private String fmlike;
                private String comment;
                private String fmcomment;
                private String fmplaycnt;
                private String fmplaycnt_2;

                @Data
                public static class CommentInfoBean {
                    private String source;
                    private String key;
                }
            }
        }
    }
}

The bean is kept short with Lombok; for an introduction to Lombok, see Android 使用Lombok工具的入门基本姿势,为简化而生.
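For readers who have not used Lombok, a minimal sketch of what @Data replaces (the Point class here is just an illustration):

import lombok.Data;

// With Lombok, this single annotation generates getters, setters,
// equals/hashCode, and toString at compile time.
@Data
class Point {
    private int x;
    private int y;
}

// Hand-written equivalent of just the accessors @Data would generate:
class PlainPoint {
    private int x;
    private int y;
    public int getX() { return x; }
    public void setX(int x) { this.x = x; }
    public int getY() { return y; }
    public void setY(int y) { this.y = y; }
    // ...plus equals(), hashCode(), and toString()
}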

3 Running the test

@Test
public void getVideoContent() throws IOException {
    // the exact request path found during packet analysis, including parameters (category: yingshi, count: 40)
    String urlstr = "https://haokan.baidu.com/videoui/api/videorec?tab=yingshi&act=pcFeed&pd=pc&num=40&shuaxin_id=1569395270987";
    String pageContent = DownLoadVideo.getPageContent(urlstr);
    // FileUtils.writeStringToFile(new File("D:\\video_html\\test\\test.html"), pageContent, "utf-8");
    videoBean videoBean = new Gson().fromJson(pageContent, videoBean.class);
    List<String> list = videoBean.getData().getResponse().getVideos()
            .stream().map(x -> x.getPlay_url()).collect(Collectors.toList());
    // System.out.println("list = " + list);
    File file = new File("D:\\video_html\\test\\yingshivideo");
    for (String videoUrl : list) {
        // strip the query string from the play URL
        String url = videoUrl.substring(videoUrl.indexOf("h"), videoUrl.indexOf("?"));
        Downloader downloader = new Downloader(url, file.getPath(), 1000 * 60 * 60, 50);
        downloader.init();
        downloader.start();
    }
}

Result:

4 Summary

My short summary of the basic steps for scraping data:

  1. Use Fiddler4 to capture packets and find the paths of the resources you need
  2. Use regular expressions to match and list all the resource detail paths
  3. Download the resources with asynchronous HTTP access
  4. Done

!!! Note: this tutorial uses Haokan Video only as an example and is for learning and reference, not for commercial use!
