When to Use Different Machine Learning Algorithms: A Simple Guide

If you’ve been at machine learning long enough, you know that there is a “no free lunch” principle — there’s no one-size-fits-all algorithm that will help you solve every problem and tackle every dataset.

I work for Springboard — we’ve put a lot of research into machine learning training and resources. At Springboard, we offer the first online course with a machine learning job guarantee.

What helps a lot when confronted with a new problem is to have a primer for what algorithm might be the best fit for certain situations. Here, we talk about different problems and data types and discuss what might be the most effective algorithm to try for each one, along with a resource that can help you implement that particular model.

Remember that the proof is in the pudding: the best approach to your data is the model that empirically gives you the best results. This guide is meant to hone your first instincts and help you remember which models might be the most effective for each problem, and which would be impractical to use.

Let’s start by talking about the variables we need to consider.

Unsupervised learning vs supervised learning

Unsupervised learning is where you let the machine learning algorithm find structure in the data and output a result without any human labeling of that data beforehand.

Supervised learning requires training data that has been labeled and processed beforehand, so that the algorithm can learn the mapping from inputs to known outputs.

The kind of learning you can perform will matter a lot when you start working with different machine learning algorithms.

Space and time considerations

There are space and time considerations for each machine learning algorithm. While in practice you’ll likely work with optimized versions of each algorithm packaged in a framework, it is good to consider how the algorithms you choose can affect performance.

The output

Third, and perhaps most important, is the output that you want to get. Are you trying to categorize data? Use it to predict future data points? What you’re looking to get as a result and what you want to do to your data will largely determine the algorithmic approaches you should take.

Some examples

You’re looking to build a simple predictive model with a well-structured dataset without too many complications.

Your best bet here is probably linear regression, which can take a whole host of factors and give you a prediction along with a simple measure of its error and a simple explanation of which factors contribute to the prediction (the fitted coefficients). It doesn’t take much computational power to run a linear regression either.
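
To make that concrete, here is a minimal pure-Python sketch of ordinary least squares with several input factors. The function names, the synthetic noise-free data, and the choice to solve the normal equations directly are all assumptions made for this illustration; in practice you would likely use an optimized library implementation.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_regression(rows, targets):
    """Return [intercept, coef_1, ..., coef_k] that minimize the squared error."""
    x = [[1.0] + list(r) for r in rows]  # prepend a constant column for the intercept
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, row):
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


def r_squared(weights, rows, targets):
    """Share of the variance in the targets that the model explains."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - predict(weights, r)) ** 2 for r, t in zip(rows, targets))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


# Hypothetical data generated from y = 3 + 2*x1 + 0.5*x2 with no noise.
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 5.0), (4.0, 3.0), (5.0, 8.0)]
targets = [3.0 + 2.0 * a + 0.5 * b for a, b in rows]

weights = fit_linear_regression(rows, targets)
print("intercept and coefficients:", [round(w, 3) for w in weights])
print("R^2:", round(r_squared(weights, rows, targets), 3))
```

The fitted coefficients are the "which factors contribute" part, and R^2 is one simple way to summarize the error. Real data is noisy, so you should not expect the near-perfect fit this toy example produces.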

Resource: Linear Regression — Detailed View

You’re looking to classify data that’s already been labeled into two or more sharply distinct types of labels (e.g., trying to determine if children are likely male or female based on their weight and height) in a supervised setting.

The first instinct you should have when you see a situation like this is to apply the logistic regression model. After running the model, you’ll see that it assigns every data point to one of two categories, allowing you to easily output which point belongs to which category. The logistic regression model can also be easily generalized to more than two target classes if that’s what your problem demands.
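
As a rough sketch of the two-category case, the snippet below trains a logistic regression with plain batch gradient descent. The function names, learning rate, epoch count, and the standardized "height and weight" values are hypothetical choices for this example, not taken from the original article.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, row):
    """Probability that the row belongs to class 1."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], row))
    return sigmoid(z)


def predict_label(weights, row, threshold=0.5):
    return 1 if predict_proba(weights, row) >= threshold else 0


def fit_logistic_regression(rows, labels, lr=0.1, epochs=2000):
    """Batch gradient descent on the mean log-loss; returns [bias, w_1, ..., w_k]."""
    weights = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for row, y in zip(rows, labels):
            err = predict_proba(weights, row) - y  # gradient of the log-loss w.r.t. z
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad)]
    return weights


# Hypothetical standardized (height, weight) pairs with labels 0 and 1.
rows = [(-1.2, -0.8), (-0.9, -1.1), (-1.0, -0.5), (-0.6, -1.3),
        (1.1, 0.9), (0.8, 1.2), (1.3, 0.7), (0.7, 1.0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

weights = fit_logistic_regression(rows, labels)
print([predict_label(weights, r) for r in rows])
```

The model outputs a probability for each point, and the 0.5 threshold is what turns that probability into one of the two categories. Features on very different scales would usually need standardizing first, which is why the toy data is already centered.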

Resource: Building a Logistic Regression

You’re looking to place unlabeled continuous data into different groups (e.g., taking customers with certain recorded traits and trying to discover the categories/groups they can belong to).

The first natural fit for this problem is the K-Means clustering algorithm, which groups data by measuring the distance between each point and a set of cluster centers. Beyond that, there are a variety of other clustering algorithms, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Mean-Shift.
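
The K-Means loop itself is short enough to sketch in pure Python. This is a simplified version under stated assumptions: the function name and customer data are hypothetical, and the centroids simply start at the first k points, whereas library implementations use smarter initialization.

```python
import math


def kmeans(points, k, iterations=100):
    """Lloyd's algorithm. Centroids start at the first k points (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # nothing moved, so the algorithm has converged
        assignments = new_assignments
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Hypothetical customers described by two recorded traits.
customers = [(1.0, 1.0), (9.0, 9.5), (1.5, 2.0), (8.5, 9.0), (1.2, 0.8), (9.5, 8.5)]
centroids, assignments = kmeans(customers, k=2)
print(assignments)
print(centroids)
```

You have to choose k up front, and the result can depend on where the centroids start, which is one reason to also look at DBSCAN or Mean-Shift when the number of groups is unknown.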

Resource: The 5 Clustering Algorithms Data Scientists Need to Know

You’re looking to predict whether a string of characters or a grouping of traits falls into one category of data or another (supervised text classification), e.g., whether a review is positive or negative.

Your best bet here is probably Naive Bayes, which is a simple but powerful model that can be used for text classification. With some text pre-processing and cleaning (being especially careful to remove filler stop words such as “and” that might add noise to your dataset), you can get a remarkable set of results with a very simple model.
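
Here is a minimal sketch of a multinomial Naive Bayes text classifier with stop-word removal, written in pure Python. The stop-word list, the four toy reviews, and the function names are hypothetical, and the tokenizer only splits on whitespace, so real text would need more careful cleaning (punctuation, casing rules, and so on).

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"and", "the", "a", "is", "it", "this", "was"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def train_naive_bayes(docs, labels):
    """Collect per-class word counts, per-class document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(tokenize(doc))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def classify(model, text):
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Log prior plus the sum of log likelihoods, with add-one (Laplace) smoothing.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


reviews = [
    "great movie and great acting",
    "loved the plot",
    "terrible movie and boring plot",
    "hated the acting",
]
sentiments = ["pos", "pos", "neg", "neg"]

model = train_naive_bayes(reviews, sentiments)
print(classify(model, "great plot"))           # pos
print(classify(model, "boring and terrible"))  # neg
```

Working in log space avoids multiplying many tiny probabilities together, and the smoothing keeps a word that never appeared in one class from zeroing out that class entirely.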

Another decent bet is logistic regression, which is a simple model to grasp and explain, and easier to pick apart than Naive Bayes (which will often assign probabilities word by word rather than holistically labeling a text snippet as being part of one group or another).

Moving on to something more powerful, a Linear Support Vector Machine algorithm will likely help improve your performance. You can skip straight to it if you want, though I suggest trying both Naive Bayes and the linear SVM and comparing which one works best: Naive Bayes has an absurdly easy implementation in frameworks like scikit-learn and it isn’t very computationally expensive, so you can afford to test both.

Lastly, bag-of-words analysis could also work. Consider building an ensemble of different methods and testing all of these methods against one another, depending on the dataset in question.
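
For reference, a bag-of-words representation simply counts how often each vocabulary word appears in a document and ignores word order. The sketch below uses hypothetical function names and whitespace tokenization as simplifying assumptions.

```python
from collections import Counter


def build_vocabulary(docs):
    """Map each distinct word to a column index, in sorted order."""
    words = sorted({w for doc in docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def bag_of_words(doc, vocab):
    """Return a count vector; words outside the vocabulary are ignored."""
    counts = Counter(doc.lower().split())
    return [counts.get(w, 0) for w in vocab]  # vocab iterates in index order


docs = ["good good movie", "bad movie"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
print(vocab)    # {'bad': 0, 'good': 1, 'movie': 2}
print(vectors)  # [[0, 2, 1], [1, 0, 1]]
```

These count vectors are the kind of numeric input that models such as Naive Bayes, logistic regression, or a linear SVM can be trained on.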

Resource: Multi-Class Text Classification Model Comparison and Selection

You’re looking to learn from unstructured data in large-scale image or video datasets (e.g., image classification).

The best algorithm for working through different images is a convolutional neural network, which is organized similarly to the animal visual cortex.
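
The building block of such a network is the convolution: a small filter slides across the image and responds to a local pattern. The sketch below shows only that one operation in pure Python, with a hand-picked vertical-edge filter and a toy 4x4 image as assumptions. A real convolutional network learns its filters from data and stacks many such layers, normally inside a deep learning framework.

```python
def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation, the core operation of a convolutional layer."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """Keep positive responses and zero out the rest."""
    return [[max(0, v) for v in row] for row in feature_map]


# Toy image: dark on the left (0), bright on the right (1).
image = [[0, 0, 1, 1] for _ in range(4)]
# Hand-picked filter that responds to a dark-to-bright vertical edge.
vertical_edge = [[-1, 1],
                 [-1, 1]]

feature_map = relu(conv2d(image, vertical_edge))
for row in feature_map:
    print(row)  # each row is [0, 2, 0]
```

The filter fires only in the middle column, where the edge is. Because the same small filter is reused at every position, the layer needs far fewer parameters than a fully connected one, though the sliding computation is still expensive on large images.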

Measured by performance (reduced error rate) in the ImageNet competition, the SE-ResNet architecture comes out on top at the time of writing, though as the field is still developing, new advances come out almost every day.

You should be aware, however, that convolutional neural networks are computationally heavy and require a lot of computing power, so make sure that you have the hardware capability to run these models on large-scale datasets.

Resource: Review of Deep Learning Algorithms for Image Classification

You’re looking to classify result points that come out of a well-defined process (e.g., number of hires from a pre-established interview process, wherein you know or can computationally infer the probabilities of each event).

The best option for this is probably a decision tree algorithm, which makes explicit the split points that send an item into one group or another.
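
To show what a split point looks like, here is a pure-Python sketch that searches for the single best split using Gini impurity, which is one common splitting criterion. The candidate data (years of experience, interview score, hired or not) and the function names are hypothetical. A full decision tree would apply this search recursively to each resulting group.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the group is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best single split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds are midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Hypothetical candidates: (years_of_experience, interview_score), 1 = hired.
candidates = [(1, 60), (2, 85), (3, 55), (5, 90), (6, 70), (8, 95)]
hired = [0, 1, 0, 1, 0, 1]

feature, threshold, impurity = best_split(candidates, hired)
print(feature, threshold, impurity)  # 1 77.5 0.0
```

The result reads directly as a rule: candidates with an interview score above 77.5 were hired in this toy data, and years of experience did not separate the groups. That readability is the main appeal of tree models.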

Resource: Decision Trees in Machine Learning

You’re looking to do time series analysis with well-defined, supervised data (e.g., predicting stock prices based on historical patterns in the stock market arranged on a chronological basis from the past to the present).

A recurrent neural network is set up for sequence analysis: it keeps an internal memory of the data it has already processed, which lets it take into account the order of the data points and how they relate to one another over time.
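
The "internal memory" is easiest to see in the recurrence itself. The sketch below runs a single-unit recurrent cell over a sequence in pure Python. The weights are fixed, hypothetical values chosen for illustration. A real network learns them by training and usually uses many units or LSTM cells.

```python
import math


def rnn_forward(sequence, w_in=0.5, w_rec=0.8, bias=0.0):
    """Run a one-unit recurrent cell over a sequence and return every hidden state."""
    hidden = 0.0
    states = []
    for x in sequence:
        # The new state mixes the current input with the previous state (the memory).
        hidden = math.tanh(w_in * x + w_rec * hidden + bias)
        states.append(hidden)
    return states


# The same two values in a different order lead to different final states.
print(rnn_forward([1.0, 0.0]))
print(rnn_forward([0.0, 1.0]))
```

In the first sequence the final input is zero, yet the final state is still positive because the earlier input is carried forward through the hidden state. That dependence on order is what a model without memory cannot capture.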

Resource: Recurrent Neural Networks and LSTM

Wrapping up

Take the recommendations and resources above, and apply them as a sort of first instinct for your modeling — it’ll help you jump into any work you do just a little bit faster. If you’re interested in being mentored by a machine learning expert in learning how to train your instincts further, check out Springboard’s AI/Machine Learning Career Track.

Translated from: https://www.freecodecamp.org/news/when-to-use-different-machine-learning-algorithms-a-simple-guide-ba615b19fb3b/
