Class3: Web scraping

The goal is to extract data from websites

Noisy, weak labels, can be spammy

Available at scale

Many ML datasets are obtained by web scraping

Web crawling vs. scraping

Crawling: indexing whole pages on the internet

Scraping: extracting particular data from the web pages of a website

Web scraping tools

“curl” often doesn’t work

Website owners use various ways to stop bots

Use a headless browser: a web browser without a GUI

You need a lot of new IP addresses, which are easy to get through public clouds
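One common reason a bare command-line fetch gets rejected, though not the only one, is that its default headers identify it as a non-browser client. The sketch below only builds a request with browser-like headers using the standard library; the URL, the header values, and the helper name `build_request` are illustrative assumptions. It does not execute JavaScript, which is what a headless browser adds.

```python
import urllib.request

def build_request(url, user_agent):
    """Build (but do not send) a GET request with browser-like headers."""
    return urllib.request.Request(
        url,
        headers={
            "User-Agent": user_agent,  # hypothetical value, chosen for illustration
            "Accept": "text/html,application/xhtml+xml",
            "Accept-Language": "en-US,en;q=0.9",
        },
    )

req = build_request("https://example.com/listing", "Mozilla/5.0 (X11; Linux x86_64)")
# Sending it would be: urllib.request.urlopen(req, timeout=10)
```

Only send such requests where the site's terms allow it.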

Legal consideration

Web scraping isn’t illegal by itself

But you should

NOT scrape data that contains sensitive information (e.g., private data involving usernames/passwords, personal health/medical information)

NOT scrape copyrighted data (e.g., YouTube videos)

Follow Terms of Service that explicitly prohibit web scraping

Consult a lawyer if you are doing it for profit

Summary

Web scraping is a powerful way to collect data at scale when the website doesn’t offer a data API.

Low cost if using public clouds

Use the browser’s inspection tool to locate the information in the HTML

Be cautious and use it properly
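Once the inspection tool shows which element holds the data, the extraction step can be sketched with the standard library's `html.parser`. The HTML snippet, the `price` class name, and the `PriceParser` class are made up for illustration; a real page will need its own selectors, and nested `<span>` elements are not handled here.

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of every <span class="price"> element."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())

html = """
<ul>
  <li><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li><span class="name">Gadget</span> <span class="price">$24.50</span></li>
</ul>
"""
parser = PriceParser()
parser.feed(html)
print(parser.prices)  # ['$9.99', '$24.50']
```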

Class4: Data labeling

Flow chart: Have enough data? -> Improve labels, data, or model? -> Enough labels? -> Enough budget? -> Use weak labels?

Semi-supervised learning (SSL)

Focus on the scenario where there is a small amount of labeled data along with a large amount of unlabeled data, and on how to use the two together.

Make assumptions about the data distribution in order to use unlabeled data.

Continuity assumption: examples with similar features are more likely to have the same label.

Cluster assumption: data have an inherent cluster structure; examples in the same cluster tend to have the same label.

Manifold assumption: the data lie on a manifold of much lower dimension than the input space, i.e., the intrinsic complexity of the data is much lower than it appears.

Self-training

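Self-training is an SSL method

Labeled data -> train -> model (train a model on the small labeled set)
Unlabeled data -> predict -> model (use the trained model to predict on the unlabeled data)
Model -> pseudo-labeled data (the model's predictions on the unlabeled data become pseudo-labels; only keep highly confident predictions)
Pseudo-labeled data -> merge -> labeled data (merge the pseudo-labeled data with the original labeled data to form the new labeled set, then repeat)

We can use expensive models

Deep neural networks, model ensembles/bagging.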
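The train / predict / keep-confident / merge loop above can be sketched in pure Python. To stay self-contained, this sketch swaps in a tiny 1-D nearest-centroid classifier instead of an expensive model; the confidence formula, the 0.8 threshold, and all function names are assumptions made for illustration.

```python
from statistics import mean

def fit_centroids(labeled):
    """Train a tiny 1-D nearest-centroid model: one mean per class."""
    classes = sorted({y for _, y in labeled})
    return {c: mean(x for x, y in labeled if y == c) for c in classes}

def predict_with_confidence(centroids, x):
    """Return (label, confidence); 1.0 means x sits on a centroid."""
    dists = {c: abs(x - m) for c, m in centroids.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    confidence = 1.0 if total == 0 else 1.0 - dists[label] / total
    return label, confidence

def self_train(labeled, unlabeled, threshold=0.8, max_rounds=5):
    """Repeatedly pseudo-label confident examples and merge them in."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(labeled)            # train
        confident, remaining = [], []
        for x in unlabeled:                           # predict
            label, conf = predict_with_confidence(centroids, x)
            if conf >= threshold:                     # keep only confident ones
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:
            break
        labeled += confident                          # merge
        unlabeled = remaining
    return fit_centroids(labeled), labeled, unlabeled

seed = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
pool = [0.5, 2.0, 5.0, 8.0, 9.5]
model, final_labeled, still_unlabeled = self_train(seed, pool)
print(still_unlabeled)  # [5.0]
```

The point at 5.0 is equidistant from both classes, so it never passes the threshold and stays unlabeled.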

Label through crowdsourcing

ImageNet labeled millions of images through Amazon Mechanical Turk. It took several years and millions of dollars to build.

Challenges

Simplify user interaction: design easy tasks, clear instructions, and a simple-to-use interface

Need to find qualified workers for complex jobs.

Quality control: the quality of labels produced by different labelers varies.

Reduce #task: Active Learning

Focus on the same scenario as SSL, but with human intervention

Select the most “interesting” unlabeled data for labelers

Uncertainty sampling chooses an example whose prediction is most uncertain

For example, the highest class prediction score is close to random (1/n for n classes)

Similar to self-training, we can use expensive models

Query-by-committee trains multiple models and performs majority voting
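Both selection rules can be sketched in a few lines. The probabilities and committee votes below are made-up inputs standing in for real model outputs, and scoring committee disagreement as the share of members outside the majority is one simple choice assumed here, not the only one.

```python
from collections import Counter

def least_confident(probabilities, k=1):
    """Uncertainty sampling: indices of the k examples with the lowest top score."""
    ranked = sorted(range(len(probabilities)), key=lambda i: max(probabilities[i]))
    return ranked[:k]

def committee_disagreement(votes):
    """Share of committee members that disagree with the majority vote."""
    majority_count = Counter(votes).most_common(1)[0][1]
    return 1.0 - majority_count / len(votes)

# One row of class scores per unlabeled example (3 classes, so random is 1/3)
probs = [[0.90, 0.05, 0.05], [0.34, 0.33, 0.33], [0.60, 0.30, 0.10]]
to_label = least_confident(probs, k=2)
print(to_label)  # [1, 2]

# One list of predicted labels per example, one entry per committee model
committee_votes = [["cat", "cat", "cat"], ["cat", "dog", "bird"], ["dog", "dog", "cat"]]
most_disputed = max(
    range(len(committee_votes)),
    key=lambda i: committee_disagreement(committee_votes[i]),
)
print(most_disputed)  # 1
```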

Active Learning + self-training

These two methods are often used together

Quality control

Labelers make mistakes (honest or not) and may fail to understand the instructions

Simplest but most expensive: send the same task to multiple labelers, then determine the label by majority voting.
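A minimal sketch of the majority-voting step, assuming each task's answers arrive as a plain list of labels. Returning `None` on a tie, so the task can be sent out again, is a design choice assumed here; the function name is illustrative.

```python
from collections import Counter

def majority_vote(labels):
    """Return (winning label or None on a tie, share of votes for the top label)."""
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    tied = len(counts) > 1 and counts[1][1] == top_count
    return (None if tied else top_label), top_count / len(labels)

label, agreement = majority_vote(["cat", "cat", "dog"])
print(label)  # cat

tie_label, tie_agreement = majority_vote(["cat", "dog"])
print(tie_label)  # None
```

The agreement share can also serve as a rough signal of which tasks or labelers need a closer look.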

Weak supervision

Semi-automatically generate labels

Less accurate than manual ones, but good enough for training

Data programming: heuristic programs to assign labels

Keyword search, pattern matching, third-party models
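As a rough sketch of data programming, the labeling functions below cover keyword search and pattern matching for a hypothetical spam-comment task (a third-party model would be one more function of the same shape). The task, the function names, and the heuristics are all assumptions for illustration, and combining votes by simple majority is a simplification.

```python
import re
from collections import Counter

SPAM, HAM, ABSTAIN = 1, 0, -1

def lf_keyword(text):
    """Keyword search: promotional phrasing suggests spam."""
    return SPAM if "check out" in text.lower() else ABSTAIN

def lf_url(text):
    """Pattern matching: a link in a comment suggests spam."""
    return SPAM if re.search(r"https?://", text) else ABSTAIN

def lf_short(text):
    """Heuristic: very short comments are usually harmless."""
    return HAM if len(text.split()) < 5 else ABSTAIN

def weak_label(text, labeling_functions):
    """Majority vote over the functions that do not abstain.

    Ties go to the label that was voted first, in function order.
    """
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return Counter(votes).most_common(1)[0][0]

lfs = [lf_keyword, lf_url, lf_short]
print(weak_label("Check out my channel http://example.com now please", lfs))  # 1
print(weak_label("love this song", lfs))                                      # 0
print(weak_label("this is a fairly long but ordinary comment", lfs))          # -1
```

Examples on which every function abstains stay unlabeled, which is one reason the resulting labels are noisy and incomplete.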

Summary

Ways to get labels

Self-training: iteratively train models to label unlabeled data

Crowdsourcing: leverage global labelers to manually label data

Data programming: heuristic programs to assign noisy labels
