
Over the years, suicide has been one of the major causes of death worldwide. According to Wikipedia, suicide resulted in 828,000 deaths globally in 2015, up from 712,000 in 1990, which makes it the 10th leading cause of death worldwide. There is also increasing evidence that the Internet and social media can influence suicide-related behaviour. Using Natural Language Processing, a field in Machine Learning, I built a very simple suicidal ideation classifier that predicts whether a text is likely to be suicidal or not.

Data

I used a Twitter crawler that I found on GitHub and made a few changes to the code so that it removes hashtags, links, URLs and symbols whenever it crawls data from Twitter. The data were crawled based on query parameters containing words and phrases like:


Depressed, hopeless, Promise to take care of, I dont belong here, Nobody deserve me, I want to die etc.


Because some of the texts were in no way related to suicide, I had to manually label the data, which came to about 8,200 rows of tweets. I also sourced more Twitter data and concatenated it with what I already had, which gave me enough to train on.

Building the Model

Data Preprocessing

I imported the following libraries:


import pickle
import re
import numpy as np
import pandas as pd
from tqdm import tqdm
import nltk
nltk.download('stopwords')

I then wrote a function to clean the text data: it removes any HTML markup, keeps emoticon characters, removes non-word characters and finally converts the text to lowercase.


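def preprocess_tweet(text):
    # Strip HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase and replace runs of non-word characters with a space
    lowercase_text = re.sub(r'[\W]+', ' ', text.lower())
    # Append the emoticons (without the "-" nose) to the cleaned text
    text = lowercase_text + ' '.join(emoticons).replace('-', '')
    return text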
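To make the cleaning steps concrete, here is a small self-contained sketch that re-declares the function and runs it on a made-up tweet. The sample string is hypothetical and is not taken from the dataset.

```python
import re

def preprocess_tweet(text):
    # Strip HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :-( or ;) before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase and collapse every run of non-word characters into one space
    lowercase_text = re.sub(r'[\W]+', ' ', text.lower())
    # Append the emoticons, dropping the "-" nose character
    return lowercase_text + ' '.join(emoticons).replace('-', '')

sample = "<p>Feeling SO alone today :-( </p>"  # hypothetical example tweet
cleaned = preprocess_tweet(sample)
print(repr(cleaned))  # 'feeling so alone today :('
```

Note that the emoticons are appended directly to the cleaned text, so when the text ends in a word rather than punctuation, the first emoticon is joined to that word: ":) hello" becomes " hello:)".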

After that, I applied the preprocess_tweet function to the tweet dataset to clean the data.


tqdm.pandas()
df = pd.read_csv('data.csv')
df['tweet'] = df['tweet'].progress_apply(preprocess_tweet)

Then I converted the text to tokens using the .split() method and used word stemming to reduce each word to its root form.


from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

Then I imported the NLTK stopwords corpus to remove stop words from the text.


from nltk.corpus import stopwords

stop = stopwords.words('english')

Testing the function on a single text.


[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]



['runner', 'like', 'run', 'run', 'lot']

Vectorizer

For this project, I used the Hashing Vectorizer because it is data-independent: it doesn't store a vocabulary dictionary in memory, so it uses very little memory and scales to large datasets. I then created a tokenizer function for the Hashing Vectorizer:

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\(|D|P)', text.lower())
    text = re.sub(r'[\W]+', ' ', text.lower())
    text += ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in tokenizer_porter(text) if w not in stop]
    return tokenized

Then I created the Hashing Vectorizer object.


from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
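The idea behind a hashing vectorizer can be sketched in a few lines of plain Python: each token is hashed straight to a column index, so nothing has to be learned from or stored about the data. This is only an illustration under simplifying assumptions. It uses MD5 from the standard library as a stand-in hash (scikit-learn uses a different hash function), and it leaves out the sign alternation and normalization that HashingVectorizer applies by default. The names hashed_index and hash_vectorize are made up for this sketch.

```python
import hashlib
from collections import Counter

N_FEATURES = 2 ** 21  # same number of columns as the vectorizer in this article

def hashed_index(token, n_features=N_FEATURES):
    # Map a token to a column index; MD5 is a stand-in hash for illustration
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_features

def hash_vectorize(tokens, n_features=N_FEATURES):
    # Sparse count vector as {column index: count}; no vocabulary is stored
    return dict(Counter(hashed_index(t, n_features) for t in tokens))

vec_a = hash_vectorize(["run", "runner", "run"])
vec_b = hash_vectorize(["run"])
print(vec_a)
print(vec_b)
```

Because the index depends only on the token itself, the same word always lands in the same column, whether it appears in the training data, the test data or a new tweet. The trade-off is that two different tokens can occasionally collide in one column, which a large n_features makes unlikely.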

Model

For the Model, I used the stochastic gradient descent classifier algorithm.


from sklearn.linear_model import SGDClassifier

# Logistic regression trained with SGD; scikit-learn 1.1+ names this loss 'log_loss'
clf = SGDClassifier(loss='log', random_state=1)
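As a rough illustration of what a stochastic gradient descent classifier with a logistic loss does, here is a minimal pure-Python sketch of the per-sample weight update. It is not scikit-learn's implementation: it uses a fixed learning rate and no regularization, whereas SGDClassifier applies an L2 penalty and a learning-rate schedule by default. The function names and the toy data are made up for this sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    # x is a sparse sample given as {feature index: value}
    z = bias + sum(weights.get(i, 0.0) * v for i, v in x.items())
    return sigmoid(z)

def sgd_step(weights, bias, x, y, lr=0.1):
    # Gradient of the log loss with respect to z is (predicted probability - label)
    error = predict_proba(weights, bias, x) - y
    for i, v in x.items():
        weights[i] = weights.get(i, 0.0) - lr * error * v
    return bias - lr * error  # weights are updated in place, bias is returned

# Toy data: feature 0 marks the positive class, feature 1 the negative class
toy_data = [({0: 1.0}, 1), ({1: 1.0}, 0)]
weights, bias = {}, 0.0
for epoch in range(200):
    for x, y in toy_data:
        bias = sgd_step(weights, bias, x, y)

p_pos = predict_proba(weights, bias, {0: 1.0})
p_neg = predict_proba(weights, bias, {1: 1.0})
print('positive sample: %.2f, negative sample: %.2f' % (p_pos, p_neg))
```

Each call updates the existing weights instead of starting from scratch, which is also the idea behind partial_fit: the model can keep learning from new batches of data.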

Training and Validation

X = df["tweet"].to_list()
y = df['label']

I used 80% of the data for training and 20% for testing.


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20,
                                                    random_state=0)

Then I transformed the text data to vectors with the Hashing Vectorizer we created earlier:

X_train = vect.transform(X_train)
X_test = vect.transform(X_test)

Finally, I fit the model on the training data:


classes = np.array([0, 1])
clf.partial_fit(X_train, y_train, classes=classes)

Let's test the accuracy on our test data:


print('Accuracy: %.3f' % clf.score(X_test, y_test))



Accuracy: 0.912
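For reference, the score method of a scikit-learn classifier returns the mean accuracy, that is, the fraction of test samples whose predicted label matches the true label. A minimal sketch of that calculation, using made-up labels and not the article's dataset:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical labels, purely for illustration
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print('Accuracy: %.3f' % accuracy(y_true, y_pred))  # 9 of 10 correct
```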

I got an accuracy of 91%, which is fair enough. After that, I updated the model with the test data:


clf = clf.partial_fit(X_test, y_test)

Testing and Making Predictions

I passed the text “I’ll kill myself am tired of living depressed and alone” to the model.


label = {0: 'negative', 1: 'positive'}
example = ["I'll kill myself am tired of living depressed and alone"]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%'
      % (label[clf.predict(X)[0]], np.max(clf.predict_proba(X)) * 100))

And I got the output:


Prediction: positive
Probability: 93.76%

And when I used the following text “It’s such a hot day, I’d like to have ice cream and visit the park”, I got the following prediction:


Prediction: negative
Probability: 97.91%

The model was able to predict accurately for both cases. And that's how you build a simple suicidal tweet classifier.

You can find the notebook I used for this article here


Thanks for reading


