Building a Suicidal Tweet Classifier Using NLP

Over the years, suicide has been one of the major causes of death worldwide. According to Wikipedia, suicide resulted in 828,000 global deaths in 2015, up from 712,000 deaths in 1990, making it the 10th leading cause of death worldwide. There is also increasing evidence that the Internet and social media can influence suicide-related behaviour. Using Natural Language Processing, a field in Machine Learning, I built a very simple suicidal ideation classifier that predicts whether a text is likely to be suicidal or not.

Data

I used a Twitter crawler I found on GitHub and made a few changes to the code so that it removes hashtags, links, URLs and symbols whenever it crawls data from Twitter. The data were crawled based on query parameters containing words and phrases like:

Depressed, hopeless, Promise to take care of, I don't belong here, Nobody deserves me, I want to die, etc.

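The crawler itself isn't reproduced here, but the hashtag/link/symbol cleaning described above can be sketched with plain regular expressions. clean_tweet is a hypothetical helper, not the crawler's actual code:

import re

def clean_tweet(raw):
    # Hypothetical sketch of the cleaning applied during crawling
    text = re.sub(r'https?://\S+|www\.\S+', '', raw)  # remove links and URLs
    text = re.sub(r'#\w+', '', text)                  # remove hashtags
    text = re.sub(r'[^\w\s]', ' ', text)              # remove symbols
    return re.sub(r'\s+', ' ', text).strip()          # collapse leftover whitespace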

Although some of the texts were in no way related to suicide at all, I had to manually label the data, which was about 8,200 rows of tweets. I also sourced more Twitter data and concatenated it with what I previously had, which was enough for me to train on.

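A minimal sketch of that concatenation, assuming both sets are CSVs sharing the tweet and label columns used later (the file names are hypothetical):

import pandas as pd

# Hypothetical file names for the two labeled sets
df_old = pd.read_csv('tweets_labeled.csv')
df_new = pd.read_csv('tweets_extra.csv')
df = pd.concat([df_old, df_new], ignore_index=True)
df = df.drop_duplicates(subset='tweet')  # drop any tweets crawled twice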

Building the Model

Data Preprocessing

I imported the following libraries:

import pickle
import re

import numpy as np
import pandas as pd
from tqdm import tqdm

import nltk
nltk.download('stopwords')

I then wrote a function to clean the text data: it removes any form of HTML markup, keeps emoticon characters, removes non-word characters, and converts the text to lowercase.

def preprocess_tweet(text):
    text = re.sub('<[^>]*>', '', text)  # strip HTML markup
    # capture emoticons such as :), :-( or =D before punctuation is removed
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # remove non-word characters and lowercase the text
    lowercase_text = re.sub('[\W]+', ' ', text.lower())
    # re-append the emoticons (with "noses" stripped) at the end
    text = (lowercase_text + ' ' + ' '.join(emoticons).replace('-', '')).strip()
    return text
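
A quick sanity check on a made-up string:

preprocess_tweet('<a>I am so tired :-( of everything</a>')

Output:

'i am so tired of everything :('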

After that, I applied the preprocess_tweet function to the tweet dataset to clean the data.

tqdm.pandas()
df = pd.read_csv('data.csv')
df['tweet'] = df['tweet'].progress_apply(preprocess_tweet)

Then I converted the text to tokens using the .split() method and applied Porter word stemming to reduce words to their root form.

from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

Then I imported NLTK's stopwords corpus to remove stop words from the text.

from nltk.corpus import stopwords

stop = stopwords.words('english')

Testing the function on a single text.

[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

Output:

['runner', 'like', 'run', 'run', 'lot']

Vectorizer

For this project, I used the Hashing Vectorizer because it is data-independent: it uses very little memory, scales to large datasets, and doesn't store a vocabulary dictionary in memory. I then created a tokenizer function for the Hashing Vectorizer.

def tokenizer(text):
    text = re.sub('<[^>]*>', '', text)  # strip HTML markup
    emoticons = re.findall('(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    text = re.sub('[\W]+', ' ', text.lower())  # remove non-word characters
    text += ' ' + ' '.join(emoticons).replace('-', '')  # re-append emoticons
    tokenized = [w for w in tokenizer_porter(text) if w not in stop]  # stem, drop stop words
    return tokenized
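
Trying the full tokenizer on another made-up string:

tokenizer('I am :( so sad today')

Output:

['sad', 'today', ':(']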

Then I created the Hashing Vectorizer object.

from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
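
As a quick check on a made-up string, transforming a single text yields a sparse row with 2**21 hashed feature columns:

vect.transform(['I feel so alone']).shape

Output:

(1, 2097152)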

Model

For the model, I used the stochastic gradient descent classifier. With loss='log' it fits a logistic regression model, which supports predict_proba for probability estimates and partial_fit for incremental learning (in recent scikit-learn versions this loss is spelled 'log_loss').

from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log', random_state=1)

Training and Validation

X = df["tweet"].to_list()
y = df['label']

I used 80% of the data for training and 20% for testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=0)

Then I transformed the text data to vectors with the Hashing Vectorizer we created earlier:

X_train = vect.transform(X_train)
X_test = vect.transform(X_test)

Finally, I fit the training data to the classifier:

classes = np.array([0, 1])

# the full set of class labels must be passed on the first call to partial_fit
clf.partial_fit(X_train, y_train, classes=classes)

Let's test the accuracy on our test data:

print('Accuracy: %.3f' % clf.score(X_test, y_test))

Output:

Accuracy: 0.912

I got an accuracy of 91%, which is fair enough. After that, I updated the model with the test data as well:

clf = clf.partial_fit(X_test, y_test)

Testing and Making Predictions

I fed the text “I’ll kill myself am tired of living depressed and alone” to the model.

label = {0: 'negative', 1: 'positive'}
example = ["I'll kill myself am tired of living depressed and alone"]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%'
      % (label[clf.predict(X)[0]], np.max(clf.predict_proba(X)) * 100))

And I got the output:

Prediction: positive
Probability: 93.76%

And when I used the following text “It’s such a hot day, I’d like to have ice cream and visit the park”, I got the following prediction:

Prediction: negative
Probability: 97.91%

The model predicted both cases correctly. And that's how you build a simple suicidal tweet classifier.

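The pickle library imported at the beginning suggests the notebook also persists the trained model; here is a minimal sketch, assuming hypothetical file names:

# serialize the classifier and stop-word list so predictions can be
# made later without retraining (file names are hypothetical)
with open('classifier.pkl', 'wb') as f:
    pickle.dump(clf, f, protocol=4)
with open('stopwords.pkl', 'wb') as f:
    pickle.dump(stop, f, protocol=4)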

You can find the notebook I used for this article here.

Thanks for reading.
