Natural Language Understanding and Natural Language Processing

What is natural language processing?

Natural language processing, or NLP, is a type of artificial intelligence (AI) that specializes in analyzing human language.

It does this by:

  • Reading natural language, which has evolved through natural human usage and which we use to communicate with each other every day
  • Interpreting natural language, typically through probability-based algorithms
  • Analyzing natural language and providing an output

Have you ever used Apple’s Siri and wondered how it understands (most of) what you’re saying? This is an example of NLP in practice.

NLP is becoming an essential part of our lives and, together with machine learning and deep learning, produces results that are far superior to what could be achieved just a few years ago.

In this article we’ll take a closer look at NLP, see how it’s applied, and learn how it works.

What can natural language processing do?

NLP is used in a variety of ways today. These include:

Machine translation

When was the last time you visited a foreign country and used your smart phone for language translation? Perhaps you used Google Translate? This is an example of NLP machine translation.

Machine translation works by using NLP to translate one language into another. Historically, simple rules-based methods have been used to do this. But today’s NLP techniques are a big improvement on the rules-based methods that have been around for years.

To do well at machine translation, NLP employs deep learning techniques. This form of machine translation is sometimes called neural machine translation (NMT), since it makes use of neural networks. NMT therefore interprets language based on a statistical, trial-and-error approach and can deal with context and other subtleties of language.

In addition to applications like Google Translate, NMT is also used in a range of business applications, such as:

  • Translating plain text, web pages or files such as Excel, PowerPoint or Word. Systran is an example of a translation services company that does this.

  • Translating social feeds in real-time, as offered by SDL Government, a company specializing in public sector language services.

  • Translating languages in medical situations, such as when an English-speaking doctor is treating a Spanish-speaking patient, as offered by Canopy Speak.

  • Translating financial documents such as annual reports, investment commentaries and information documents, as offered by Lingua Custodia, a company specializing in financial translations.

Speech recognition

Earlier, we mentioned Siri as an example of NLP. One particular feature of NLP used by Siri is speech recognition. Alexa and Google Assistant (“ok Google”) are other well-known examples of NLP speech recognition.

Speech recognition isn’t a new science and has been around for over 50 years. It’s only recently though that its ease-of-use and accuracy have improved significantly, thanks to NLP.

At the heart of speech recognition is the ability to identify spoken words, interpret them and convert them to text. A range of actions can then follow, such as answering questions, carrying out instructions or writing emails.

The powerful methods of deep learning used in NLP allow today’s speech recognition applications to work better than ever before.

Chatbots

Chatbots are software programs that simulate natural human conversation. They are used by companies to help with customer service, consumer queries and sales enquiries.

You may have interacted with a chatbot the last time you logged on to a company website and used their online help system.

While simple chatbots use rules-based methods, today’s more capable chatbots use NLP to understand what customers are saying and how to respond.

Well-known examples of chatbots include:

  • The World Health Organization (WHO) chatbot, built on the WhatsApp platform, which shares information and answers queries about the spread of the COVID-19 virus

  • National Geographic’s Genius chatbot, which speaks like Albert Einstein and engages with users to promote the National Geographic show of the same name

  • Kian, Korean car manufacturer Kia’s chatbot on Facebook Messenger, which answers queries about Kia cars and helps with sales enquiries

  • Whole Foods’ chatbot, which helps with recipe information, cooking inspiration and product recommendations

Sentiment analysis

Sentiment analysis uses NLP to interpret and classify emotions contained in text data. This is used, for instance, to classify online customer feedback about products or services in terms of positive or negative experience.

In its simplest form, sentiment analysis can be done by categorizing text based on designated words that convey emotion, like “love”, “hate”, “happy”, “sad” or “angry”. This type of sentiment analysis has been around for a long time but is of limited practical use due to its simplicity.
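The word-list approach described here can be sketched in a few lines of Python. The word lists below are made up for illustration and are far smaller than a real sentiment lexicon:

```python
# Toy lexicon-based sentiment classification, as described above.
# These word lists are illustrative; real lexicons contain thousands of words.
POSITIVE = {"love", "happy", "great", "good"}
NEGATIVE = {"hate", "sad", "angry", "bad"}

def classify_sentiment(text: str) -> str:
    words = text.lower().split()
    # Score is the count of positive words minus the count of negative words.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("I love this happy place"))  # positive
print(classify_sentiment("I hate the service"))       # negative
```

The counting logic is essentially what early systems did; the simplicity is also why they struggle with negation ("not happy") and sarcasm.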

Today’s sentiment analysis uses NLP to classify text based on statistical and deep learning methods. The result is sentiment analysis that can handle complex and natural-sounding text.

There’s a huge interest in sentiment analysis nowadays from businesses worldwide. It can provide valuable insights into customer preferences, satisfaction levels and opinions, which can help with marketing campaigns and product design.

Email classification

Email overload is a common challenge in the modern workplace. NLP can help to analyze and classify incoming emails so that they can be automatically forwarded to the right place.

In the past, simple keyword-matching techniques were used to classify emails. This had mixed success. NLP allows a far better classification approach, as it can understand the context of individual sentences, paragraphs and whole sections of text.

Given the sheer volume of emails that businesses have to deal with today, NLP-based email classification can be a great help in improving workplace productivity. Classification using NLP helps to ensure that emails don’t get forgotten in over-burdened inboxes and are properly filed for further action.

How does natural language processing work?

Now that we’ve seen what NLP can do, let’s try to understand how it works.

In essence, NLP works by transforming a collection of text information into designated outputs.

If the application is machine translation, then the input text information would be documents in the source language (say, English) and the output would be the translated documents in the target language (say, French).

If the application is sentiment analysis, then the output would be a classification of the input text into sentiment categories. And so on.

The NLP workflow

Modern NLP is a mixed discipline that draws on linguistics, computer science and machine learning. The process, or workflow, that NLP uses has three broad steps:

Step 1 — Text pre-processing

Step 2 — Text representation

Step 3 — Analysis and modeling

Each step may use a range of techniques which are constantly evolving with continued research.

Step 1: Text pre-processing

The first step is to prepare the input text so that it can be analyzed more easily. This part of NLP is well established and draws on a range of traditional linguistic methods.

Some of the key approaches used in this step are:

  • Tokenization, which breaks up text into useful units (tokens). This separates words using blank spaces, for instance, or separates sentences using full stops. Tokenization also recognizes words that often go together, such as “New York” or “machine learning”. As an example, the tokenization of the sentence “Customer service couldn’t be better” would result in the following tokens: “customer service”, “could”, “not”, “be” and “better”.

  • Normalization transforms words to their base form using techniques like stemming and lemmatization. This is done to help reduce ‘noise’ and simplify the analysis. Stemming identifies the stems of words by removing their suffixes. The stem of the word “studies”, for instance, is “studi”. Lemmatization similarly removes suffixes, but also removes prefixes if required and results in words that are normally used in natural language. The lemma of the word “studies”, for instance, is “study”. In most applications, lemmatization is preferred to stemming as the resulting words have more meaning in natural speech.

  • Part-of-speech (POS) tagging draws on morphology, or the study of inter-relationships between words. Words (or tokens) are tagged based on their function in sentences. This is done by using established rules from text corpora to identify the purpose of words in speech, i.e. verb, noun, adjective, etc.

  • Parsing draws on syntax, or the understanding of how words and sentences fit together. This helps to understand the structure of sentences and is done by breaking down sentences into phrases based on the rules of grammar. A phrase may contain a noun and an article, such as “my rabbit”, or a verb as in “likes to eat carrots”.

  • Semantics identifies the intended meaning of words used in sentences. Words can have more than one meaning. For example “pass” can mean (i) to physically hand over something, (ii) a decision to not take part in something, or (iii) a measure of success in an exam. A word’s meaning can be understood better by looking at the words that appear before and after it.
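The first two techniques above can be illustrated with a toy tokenizer and suffix-stripping stemmer in Python. The suffix list is invented for this sketch; production systems use established algorithms such as the Porter stemmer or a WordNet lemmatizer:

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase and split on anything that isn't a letter or apostrophe.
    return re.findall(r"[a-z']+", text.lower())

def naive_stem(word: str) -> str:
    # A toy stemmer that strips a few common suffixes.
    # The suffix list is illustrative, not a real stemming algorithm.
    for suffix in ("es", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = tokenize("The customer studies NLP every day")
print([naive_stem(t) for t in tokens])
# ['the', 'customer', 'studi', 'nlp', 'every', 'day']
```

Note that "studies" becomes "studi", matching the stemming example in the text; a lemmatizer would instead return the dictionary form "study".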

Step 2: Text representation

In order for text to be analyzed using machine learning and deep learning methods, it needs to be converted into numbers. This is the purpose of text representation.

Some key methods used in this step are:

Bag of words

Bag of words, or BoW, is an approach that represents text by counting how many times each word in an input document occurs in comparison with a known list of reference words (vocabulary).

The result is a set of vectors that contain numbers depicting how many times each word occurs. These vectors are called ‘bags’ as they don’t include any information about the structure of the input documents.

To illustrate how BoW works, consider the sample sentence “the cat sat on the mat”. This contains the words “the”, “cat”, “sat”, “on” and “mat”. The frequency of occurrence of these words can be represented by a vector of the form [2, 1, 1, 1, 1]. Here, the word “the” occurs twice and the other words occur once.
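This example can be reproduced directly in Python; the vocabulary below is just the five words from the sample sentence:

```python
from collections import Counter

def bag_of_words(text: str, vocabulary: list[str]) -> list[int]:
    # Count how often each vocabulary word occurs in the text.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

vocab = ["the", "cat", "sat", "on", "mat"]
print(bag_of_words("the cat sat on the mat", vocab))  # [2, 1, 1, 1, 1]
```

Against a larger vocabulary, every vocabulary word absent from the sentence would simply contribute a zero to the vector.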

When compared with a large vocabulary, the vector will expand to include several zeros. This is because all of the words in the vocabulary that aren’t contained in the sample sentence will have counts of zero. The resulting vector may contain a large number of zeros and hence is referred to as a ‘sparse vector’.

The BoW approach is fairly straightforward and easy to understand. The resulting sparse vectors however can be very large when the vocabulary is large. This leads to computationally challenging vectors that don’t contain much information (i.e. are mostly zeros).

Further, BoW looks at individual words, so any information about words that go together is not captured. This results in a loss of context for later analysis.

Bag of n-grams

One way of reducing the loss of context with BoW is to create vocabularies of grouped words rather than single words. These grouped words are referred to as ‘n-grams’, where ’n’ is the grouping size. The resulting approach is called ‘bag of n-grams’ (BNG).

The advantage of BNG is that each n-gram captures more context than single words.

In the earlier sample sentence, “sat on” and “the mat” are examples of 2-grams, and “on the mat” is an example of a 3-gram.
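Extracting n-grams is a matter of sliding a window over the token list; a minimal sketch, using the same sample sentence:

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    # Slide a window of size n over the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 2))  # ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
print(ngrams(tokens, 3))  # ['the cat sat', 'cat sat on', 'sat on the', 'on the mat']
```

The n-grams can then be counted against an n-gram vocabulary exactly as in the bag-of-words approach.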

TF-IDF

One issue with counting the number of times a word appears in documents is that certain words, like “the”, “a” or “it”, start to dominate the count. These words tend to occur frequently but don’t contain much information.

One way to deal with this is to treat words that appear frequently across documents differently to words that appear uniquely. The words appearing frequently tend to be low value words like “the”. The counts of these words can be penalized to help reduce their dominance.

This approach is called ‘term frequency — inverse document frequency’ or TF-IDF. Term frequency looks at the frequency of a word in a given document while the inverse document frequency looks at how rare the word is across all documents.

The TF-IDF approach acts to downplay frequently occurring words and highlight more unique words that have useful information, such as “cat” or “mat”. This can lead to better results.
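The two quantities can be combined in a few lines of Python. This sketch uses one common formulation (count ratio for TF, natural log for IDF); libraries differ in the exact weighting and smoothing they apply:

```python
import math

def tf_idf(word: str, document: list[str], documents: list[list[str]]) -> float:
    # Term frequency: share of the document's words that are `word`.
    tf = document.count(word) / len(document)
    # Inverse document frequency: rarer words across the corpus score higher.
    df = sum(1 for doc in documents if word in doc)
    idf = math.log(len(documents) / df)
    return tf * idf

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the mat was red".split(),
]

# "the" appears in every document, so idf = log(3/3) = 0 and its score vanishes.
print(tf_idf("the", docs[0], docs))             # 0.0
# "sat" appears in only one document, so it gets a positive score.
print(round(tf_idf("sat", docs[0], docs), 3))   # 0.183
```

This shows the penalization directly: the frequent but uninformative “the” scores zero, while the rarer “sat” is highlighted.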

Word embedding

A more sophisticated approach to text representation involves word embedding. This maps each word to individual vectors, where the vectors tend to be ‘dense’ rather than ‘sparse’ (i.e. smaller and with fewer zeros). Each word and the words surrounding it are considered in the mapping process. The resulting dense vectors allow for a better analysis and comparison between words and their context.

Word embedding approaches use powerful machine learning and deep learning to perform the mapping. It is an evolving area which has produced some excellent results. Key algorithms in use today include Word2Vec, GloVe and FastText.
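Embeddings are typically compared with cosine similarity, which measures the angle between two vectors. The 4-dimensional vectors below are made up for illustration; real models such as Word2Vec produce vectors with hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Made-up embeddings: "cat" and "dog" point in similar directions, "car" doesn't.
cat = [0.8, 0.1, 0.9, 0.2]
dog = [0.7, 0.2, 0.8, 0.1]
car = [0.1, 0.9, 0.0, 0.8]

print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))  # True
```

In a trained embedding space, words used in similar contexts end up with high cosine similarity, which is what makes dense vectors useful for comparing meaning.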

Step 3: Analysis and modeling

The final step in the NLP process is to perform calculations on the vectors generated through steps 1 and 2, to produce the desired outcomes. Here, machine learning and deep learning methods are used. Many of the same machine learning techniques from non-NLP domains, such as image recognition or fraud detection, may be used in this analysis.

Consider sentiment analysis. This can be done using either supervised or unsupervised machine learning. Supervised machine learning requires pre-labeled data while unsupervised machine learning uses pre-prepared databases of curated words (lexicons) to help with classifying sentiment.

Using machine learning, input text vectors are classified using a probabilistic approach. This is done through either a trained model (supervised machine learning) or by comparison with a suitable lexicon (unsupervised machine learning).

The outcomes are sentiment classifications based on the probabilities generated through the machine learning process.
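The supervised route can be sketched with a minimal naive Bayes classifier. The four-example training set below is made up for illustration; real systems train on thousands of labeled examples, usually via a library such as scikit-learn:

```python
import math
from collections import Counter

# Tiny labeled training set (made up for illustration).
train = [
    ("great product love it", "positive"),
    ("terrible service hate it", "negative"),
    ("love the great service", "positive"),
    ("terrible product", "negative"),
]

# Count word occurrences per class (a minimal naive Bayes).
counts = {"positive": Counter(), "negative": Counter()}
for text, label in train:
    counts[label].update(text.split())

def log_prob(text: str, label: str) -> float:
    total = sum(counts[label].values())
    vocab = set().union(*counts.values())
    # Laplace smoothing so unseen words don't zero out the probability.
    # Class priors are equal here (two examples each), so they are omitted.
    return sum(
        math.log((counts[label][w] + 1) / (total + len(vocab)))
        for w in text.split()
    )

def classify(text: str) -> str:
    return max(("positive", "negative"), key=lambda lab: log_prob(text, lab))

print(classify("love this product"))    # positive
print(classify("terrible experience"))  # negative
```

The classifier outputs whichever class has the higher probability given the input words, which is exactly the probability-based classification described above.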

Conclusion

NLP is developing rapidly and is having an increasing impact on society. From language translation to speech recognition, and from chatbots to identifying sentiment, NLP is providing valuable insights and making our lives more productive.

Modern NLP works by using linguistics, computer science and machine learning. Over recent years, NLP has produced results that far surpass what we’ve seen in the past.

The basic workflow of NLP involves text pre-processing, text representation and analysis. A variety of techniques are in use today and more are being developed with ongoing research.

NLP promises to revolutionize many areas of industry and consumer practice. It’s already become a familiar part of our daily lives.

With NLP, we have a powerful way of engaging with a digital future through a medium we are inherently comfortable with — our ability to communicate through natural language.

Translated from: https://towardsdatascience.com/natural-language-processing-a-simple-explanation-7e6379085a50
