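Getting Started with Gecco: Scheduled Crawling of imooc Hands-On Courses

I. What Is Gecco

Gecco is a lightweight, easy-to-use web crawler written in Java. Unlike general-purpose crawlers built for search engines, such as Nutch, Gecco is a topic-focused crawler.

A general-purpose crawler is mainly concerned with three things: downloading, ranking, and indexing.
A topic-focused crawler is mainly concerned with downloading, content extraction, and flexible business-logic processing.

Gecco aims to provide a complete framework for topic-focused crawling. It simplifies the download and content-extraction code, and uses the pipes-and-filters pattern to offer flexible content cleaning and persistence, so that developers can spend more of their effort on processing the content that matters to their business topic.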
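
To make the pipes-and-filters idea concrete, here is a minimal plain-Java sketch. It does not use Gecco at all; `CoursePipelineSketch` and `runStages` are hypothetical names invented for illustration. Each stage receives the extracted item, cleans or stores it, and hands it on to the next stage.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical illustration of pipes-and-filters; not Gecco's API.
class CoursePipelineSketch {

    // Pass the item through every stage in order and return the final value.
    static String runStages(String item, List<UnaryOperator<String>> stages) {
        String current = item;
        for (UnaryOperator<String> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> store = new ArrayList<>(); // stands in for a database
        List<UnaryOperator<String>> stages = List.of(
                String::trim,                       // cleaning stage
                s -> s.replaceAll("\\s+", " "),     // whitespace normalization stage
                s -> { store.add(s); return s; });  // persistence stage (in memory)
        System.out.println(runStages("  Java   course ", stages));
        System.out.println(store);
    }
}
```

In Gecco the stages are `Pipeline` implementations selected by name in the `@Gecco` annotation, as the steps later in this article show.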

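Main features

Simple and easy to use, with jQuery-style selectors for element extraction
Supports asynchronous AJAX requests in pages
Supports extracting JavaScript variables from pages
Distributed crawling with Redis; see gecco-redis
Random User-Agent selection when downloading
Random proxy server selection when downloading
Business logic can be developed with Spring; see gecco-spring
htmlunit extension; see gecco-htmlunit
Plugin extension mechanism

II. Usage Steps

1. Maven dependencies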

    <dependencies>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.18.10</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.geccocrawler</groupId>
            <artifactId>gecco-spring</artifactId>
            <version>1.3.0</version>
        </dependency>
        <dependency>
            <groupId>cn.hutool</groupId>
            <artifactId>hutool-all</artifactId>
            <version>5.5.7</version>
        </dependency>
    </dependencies>

2. Project directory structure

3. Entry point

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.scheduling.annotation.EnableScheduling;

    // Spring Boot entry point; @EnableScheduling activates @Scheduled methods.
    @SpringBootApplication
    @EnableScheduling
    public class Application {

        public static void main(String[] args) {
            SpringApplication.run(Application.class, args);
        }
    }

4. Scheduled task

    import com.geccocrawler.gecco.GeccoEngine;
    import com.geccocrawler.gecco.spring.SpringPipelineFactory;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;

    @Component
    public class Task {

        @Autowired
        private SpringPipelineFactory springPipelineFactory;

        // Fires at second 0 of every minute.
        @Scheduled(cron = "0 0/1 * * * ?")
        public void pull() {
            GeccoEngine.create()
                    // Look up pipelines as Spring beans
                    .pipelineFactory(springPipelineFactory)
                    // Package scanned for @Gecco-annotated beans
                    .classpath("com.mt.imooc.spider")
                    // Seed URL
                    .start("https://www.imooc.com")
                    // Pause between requests, in milliseconds
                    .interval(3000)
                    .start();
        }
    }
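
Because the cron expression fires every minute, a new crawl can be triggered while the previous one is still in progress. One possible way to avoid overlapping runs is a simple busy flag, sketched below in plain Java. `CrawlGuard` and `runIfIdle` are hypothetical names and are not part of Gecco or Spring. The sketch also assumes the wrapped `Runnable` blocks until the crawl has finished; Gecco's `start()` is documented as non-blocking, so the engine call would have to be made in a blocking way for the guard to be meaningful.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper: skips a run when the previous one has not finished yet.
class CrawlGuard {

    private final AtomicBoolean running = new AtomicBoolean(false);

    // Returns true if the crawl was executed, false if it was skipped.
    boolean runIfIdle(Runnable crawl) {
        if (!running.compareAndSet(false, true)) {
            return false; // a previous run is still active
        }
        try {
            crawl.run();
            return true;
        } finally {
            running.set(false); // release the flag even if the crawl throws
        }
    }

    public static void main(String[] args) {
        CrawlGuard guard = new CrawlGuard();
        // The nested call simulates a second trigger arriving mid-crawl.
        boolean outer = guard.runIfIdle(() -> {
            boolean inner = guard.runIfIdle(() -> { });
            System.out.println("nested run executed: " + inner);
        });
        System.out.println("outer run executed: " + outer);
    }
}
```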

5. Gecco configuration

    import org.springframework.context.annotation.ComponentScan;
    import org.springframework.context.annotation.Configuration;

    /**
     * Registers the gecco-spring components as Spring beans.
     *
     * @author jay
     * @date 2021/1/21 22:11
     */
    @Configuration
    @ComponentScan(basePackages = "com.geccocrawler.gecco.spring")
    public class GeccoConfig {
    }

6. Crawling titles and subtitles from the imooc home page

    import com.geccocrawler.gecco.annotation.Gecco;
    import com.geccocrawler.gecco.annotation.HtmlField;
    import com.geccocrawler.gecco.annotation.Request;
    import com.geccocrawler.gecco.annotation.Text;
    import com.geccocrawler.gecco.request.HttpRequest;
    import com.geccocrawler.gecco.spider.HtmlBean;
    import lombok.Data;

    import java.util.List;

    /**
     * @author jay
     * @date 2021/1/21 22:11
     */
    @Data
    @Gecco(matchUrl = "https://www.imooc.com", pipelines = "indexPipeline")
    public class Index implements HtmlBean {

        @Request
        private HttpRequest request;

        /**
         * Titles in the left navigation bar
         */
        @Text
        @HtmlField(cssPath = "#main > div.bgfff.banner-box > div > div.menuContent > div > span.title")
        private List<String> title;

        /**
         * Subtitles in the left navigation bar
         */
        @Text
        @HtmlField(cssPath = "#main > div.bgfff.banner-box > div > div.menuContent > div > span.sub-title")
        private List<String> subTitle;
    }

7. Post-processing the imooc home page data

    import com.geccocrawler.gecco.pipeline.Pipeline;
    import com.geccocrawler.gecco.request.HttpRequest;
    import com.geccocrawler.gecco.scheduler.DeriveSchedulerContext;
    import org.springframework.stereotype.Service;

    import java.util.List;

    /**
     * @author jay
     * @date 2021/1/21 22:11
     */
    @Service
    public class IndexPipeline implements Pipeline<Index> {

        @Override
        public void process(Index index) {
            HttpRequest currRequest = index.getRequest();
            List<String> list = index.getTitle();
            list.forEach(e -> {
                // Only follow the Java listing under back-end development.
                // The literal must match the navigation text on the site exactly.
                if ("后端开发：".equals(e)) {
                    DeriveSchedulerContext.into(currRequest.subRequest("https://coding.imooc.com/?c=java"));
                }
            });
        }
    }

8. Crawling the coding home page (coding.imooc.com)

    import com.geccocrawler.gecco.annotation.*;
    import com.geccocrawler.gecco.request.HttpRequest;
    import com.geccocrawler.gecco.spider.SpiderBean;
    import lombok.Data;

    import java.util.List;

    /**
     * @author jay
     * @date 2021/1/21 22:11
     */
    @Data
    @Gecco(matchUrl = "https://coding.imooc.com/?c={type}", pipelines = "codingIndexPipeline")
    public class CodingIndex implements SpiderBean {

        @RequestParameter
        private String type;

        @Request
        private HttpRequest request;

        /**
         * Course cover; the image URL is inside the style attribute
         */
        @Attr("style")
        @HtmlField(cssPath = "body > div.main > div.w1430 > ul > li > a > div")
        private List<String> style;

        /**
         * Course title
         */
        @Text
        @HtmlField(cssPath = "body > div.main > div.w1430 > ul > li > a > p.title.ellipsis2")
        private List<String> title;

        /**
         * Current page
         */
        @Text
        @HtmlField(cssPath = "body > div.main > div.w1430 > div.page > a.active")
        private int currPage;

        /**
         * Total page count, filled in by a custom field render
         */
        @FieldRenderName("totalPageFieldRender")
        private int totalPage;
    }
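
The `{type}` placeholder in `matchUrl` captures part of the URL into the `@RequestParameter` field. The plain-Java sketch below illustrates the general idea of matching a URL against such a template. It is not Gecco's actual implementation: `UrlTemplateSketch` and `match` are hypothetical names, and the sketch assumes a placeholder matches any run of non-slash characters, so a URL with `&page=2` appended still matches the same template.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of URL template matching; not Gecco's code.
class UrlTemplateSketch {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    // Returns the captured parameters, or null when the URL does not match.
    static Map<String, String> match(String template, String url) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Literal text is quoted; each placeholder becomes a capture group.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            regex.append("([^/]*)"); // assumption: any non-slash characters
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Matcher urlMatcher = Pattern.compile(regex.toString()).matcher(url);
        if (!urlMatcher.matches()) {
            return null;
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            params.put(names.get(i), urlMatcher.group(i + 1));
        }
        return params;
    }

    public static void main(String[] args) {
        String template = "https://coding.imooc.com/?c={type}";
        System.out.println(match(template, "https://coding.imooc.com/?c=java"));
        System.out.println(match(template, "https://www.imooc.com"));
    }
}
```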

9. Post-processing the coding home page data

    import cn.hutool.core.util.StrUtil;
    import com.geccocrawler.gecco.pipeline.Pipeline;
    import com.geccocrawler.gecco.request.HttpRequest;
    import com.geccocrawler.gecco.scheduler.DeriveSchedulerContext;
    import org.apache.commons.lang3.StringUtils;
    import org.springframework.stereotype.Service;

    import java.util.List;

    /**
     * @author jay
     * @date 2021/1/21 22:11
     */
    @Service
    public class CodingIndexPipeline implements Pipeline<CodingIndex> {

        @Override
        public void process(CodingIndex codingIndex) {
            // The cover URL sits in the style attribute, between "//" and ")".
            List<String> style = codingIndex.getStyle();
            style.forEach(e -> System.out.println("Course cover: " + StrUtil.subBetween(e, "//", ")")));

            HttpRequest request = codingIndex.getRequest();
            int currPage = codingIndex.getCurrPage();
            int nextPage = currPage + 1;
            int totalPage = codingIndex.getTotalPage();
            // Queue the next page until the last page has been reached.
            if (nextPage <= totalPage) {
                String nextUrl;
                String currUrl = request.getUrl();
                if (currUrl.contains("page=")) {
                    nextUrl = StringUtils.replaceOnce(currUrl, "page=" + currPage, "page=" + nextPage);
                } else {
                    nextUrl = currUrl + "&" + "page=" + nextPage;
                }
                DeriveSchedulerContext.into(request.subRequest(nextUrl));
            }
        }
    }
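
The paging logic in `process` can be isolated as a pure function, which makes it easy to check by hand. The sketch below re-implements it with the JDK only, using `String.replaceFirst` on a quoted literal in place of commons-lang `StringUtils.replaceOnce`. `NextPageUrl` and `next` are hypothetical names, and returning `null` for "no next page" is a choice made for this sketch.

```java
import java.util.regex.Pattern;

// Hypothetical, JDK-only restatement of the next-page URL logic.
class NextPageUrl {

    // Returns the URL of the next page, or null when currPage is the last page.
    static String next(String currUrl, int currPage, int totalPage) {
        int nextPage = currPage + 1;
        if (nextPage > totalPage) {
            return null;
        }
        if (currUrl.contains("page=")) {
            // Replace only the first literal occurrence of "page=<currPage>".
            return currUrl.replaceFirst(Pattern.quote("page=" + currPage), "page=" + nextPage);
        }
        // The first page has no page parameter yet, so append one.
        return currUrl + "&page=" + nextPage;
    }

    public static void main(String[] args) {
        System.out.println(next("https://coding.imooc.com/?c=java", 1, 3));
        System.out.println(next("https://coding.imooc.com/?c=java&page=2", 2, 3));
        System.out.println(next("https://coding.imooc.com/?c=java&page=3", 3, 3));
    }
}
```

The append branch assumes the URL already contains a query string, which holds for the `?c=java` seed used in this article.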

10. Custom field render

    import cn.hutool.core.util.StrUtil;
    import com.geccocrawler.gecco.annotation.FieldRenderName;
    import com.geccocrawler.gecco.request.HttpRequest;
    import com.geccocrawler.gecco.response.HttpResponse;
    import com.geccocrawler.gecco.spider.SpiderBean;
    import com.geccocrawler.gecco.spider.render.CustomFieldRender;
    import net.sf.cglib.beans.BeanMap;

    import java.lang.reflect.Field;

    /**
     * @author jay
     * @date 2021/1/28 21:47
     */
    @FieldRenderName("totalPageFieldRender")
    public class TotalPageFieldRender implements CustomFieldRender {

        @Override
        public void render(HttpRequest request, HttpResponse response, BeanMap beanMap, SpiderBean bean, Field field) {
            String content = response.getContent();
            if (StrUtil.isNotBlank(content)) {
                // On the last page the "尾页" (last page) link is rendered as a disabled span;
                // totalPage then keeps its default of 0 and paging stops.
                if (!StrUtil.contains(content, "<span class=\"disabled_page\">尾页</span>")) {
                    // Markup between the "下一页" (next page) text and the "尾页" text
                    // contains the last-page link, whose href ends in page=<total>.
                    String s = StrUtil.subBetween(content, "下一页", "尾页");
                    int i = StrUtil.lastIndexOfIgnoreCase(s, "\"");
                    int j = StrUtil.lastIndexOfIgnoreCase(s, "=");
                    int totalPage = Integer.parseInt(StrUtil.sub(s, j + 1, i));
                    beanMap.put(field.getName(), totalPage);
                }
            }
        }
    }
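
To see what the string slicing in `render` does, the sketch below applies the same steps with JDK string methods to a small HTML fragment. The fragment is a made-up approximation of the pager markup, not copied from the site, and `TotalPageSketch` and `totalPage` are hypothetical names. Unlike the original, the sketch returns 0 instead of throwing when the markers are missing.

```java
// Hypothetical, JDK-only restatement of the total-page extraction.
class TotalPageSketch {

    private static final String NEXT = "下一页"; // "next page" link text
    private static final String LAST = "尾页";   // "last page" link text

    // Returns the page number in the last-page link, or 0 if it cannot be found.
    static int totalPage(String html) {
        // On the last page the "last page" entry is a disabled span, not a link.
        if (html.contains("<span class=\"disabled_page\">" + LAST + "</span>")) {
            return 0;
        }
        int start = html.indexOf(NEXT);
        if (start < 0) {
            return 0;
        }
        int end = html.indexOf(LAST, start);
        if (end < 0) {
            return 0;
        }
        // Markup between the two link texts, ending in href="...page=<total>">
        String between = html.substring(start + NEXT.length(), end);
        int quote = between.lastIndexOf('"');
        int eq = between.lastIndexOf('=');
        if (eq < 0 || quote <= eq) {
            return 0;
        }
        return Integer.parseInt(between.substring(eq + 1, quote));
    }

    public static void main(String[] args) {
        // Assumed pager markup, for illustration only.
        String html = "<a href=\"/?c=java&page=2\">下一页</a>"
                + "<a href=\"/?c=java&page=7\">尾页</a>";
        System.out.println(totalPage(html));
    }
}
```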
