Text classification with xgboost

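1. Data preparation

As in the previous articles, the data keeps the same format: a CSV file with a sentence column and a label column.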

sentence,label
游戏太坑，暴率太低，太克金，平民不能玩,negative
让人失望,negative
能解决一下服务器问题？网络正常老掉线，换手机也一样。。。,negative
期待,positive
一星也不想给，这特么简直龟速，炫舞老年版？,negative
衣服不好看游戏内容无特色，界面乱糟糟的,negative
喜欢喜欢,positive
从有了这个手游就一直玩，很喜欢呀，希望更多漂漂衣服,positive
因违反评价条例规定被折叠,negative
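
Before training, it can be worth checking how the two labels are distributed. The following is a minimal sketch using only the standard library; the three rows are copied from the sample above, and the helper name count_labels is made up for this illustration.

```python
import csv
import io
from collections import Counter

# Three rows copied from the sample above; a real run would open the CSV file instead.
SAMPLE_CSV = """sentence,label
让人失望,negative
期待,positive
喜欢喜欢,positive
"""


def count_labels(csv_text):
    """Return a Counter of the values found in the 'label' column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["label"] for row in reader)


label_counts = count_labels(SAMPLE_CSV)
print(label_counts)
```

A strongly skewed count here would be a hint to look at scale_pos_weight later on.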

2. Data preprocessing

import time
import jieba
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics
import xgboost as xgb


def get_stop_words():
    # Load the stop-word list, one word per line
    filename = "your stop words file path"
    stop_word_list = []
    with open(filename, encoding='utf-8') as f:
        for line in f.readlines():
            stop_word_list.append(line.strip())
    return stop_word_list


def processing_sentence(x, stop_words):
    # Segment with jieba, drop stop words and bare spaces, then join with spaces
    cut_word = jieba.cut(str(x).strip())
    words = [word for word in cut_word if word not in stop_words and word != ' ']
    return ' '.join(words)


def data_processing():
    train_file = "your train file path"
    df = pd.read_csv(train_file)
    # Hold out 10% of the rows as the test set
    x_train, x_test, y_train, y_test = train_test_split(df['sentence'], df['label'], test_size=0.1)
    stop_words = get_stop_words()
    x_train = x_train.apply(lambda x: processing_sentence(x, stop_words))
    x_test = x_test.apply(lambda x: processing_sentence(x, stop_words))
    # Fit TF-IDF on the training split only, then reuse it for the test split
    tf = TfidfVectorizer()
    x_train = tf.fit_transform(x_train)
    x_test = tf.transform(x_test)
    # Convert the sparse matrices to dense arrays
    x_train_weight = x_train.toarray()
    x_test_weight = x_test.toarray()
    return x_train_weight, x_test_weight, y_train, y_test

Preprocessing again produces TF-IDF features computed from the segmented text.
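
One detail matters for Chinese text: the default token_pattern documented for scikit-learn's TfidfVectorizer only keeps tokens of two or more characters, so single-character words are dropped before TF-IDF is computed. The sketch below applies that pattern with the standard re module; the segmented sentence is a hypothetical jieba result, and the looser pattern is just one possible alternative.

```python
import re

# Default token_pattern documented for scikit-learn's TfidfVectorizer:
# only tokens of two or more word characters are kept.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"
# Looser alternative that also keeps single-character tokens.
SINGLE_CHAR_TOKEN_PATTERN = r"(?u)\b\w+\b"

# A sentence as it might look after jieba segmentation (hypothetical segmentation).
segmented = "游戏 太 坑 暴率 太低"

default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, segmented)
all_tokens = re.findall(SINGLE_CHAR_TOKEN_PATTERN, segmented)

print(default_tokens)  # single-character words are dropped
print(all_tokens)      # every space-separated word is kept
```

Passing token_pattern=r"(?u)\b\w+\b" to TfidfVectorizer would keep the single-character words; whether that helps accuracy on this data would have to be measured.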

3. Model training

def train_model():
    x_train_weight, x_test_weight, y_train, y_test = data_processing()
    start = time.time()
    print("start time is: ", start)
    # verbosity takes the place of the deprecated silent flag in current xgboost releases
    model = xgb.XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=100,
                              verbosity=1, objective='binary:logistic')
    model.fit(x_train_weight, y_train)
    end = time.time()
    print("end time is: ", end)
    print("cost time is: ", (end - start))
    y_predict = model.predict(x_test_weight)
    confusion_mat = metrics.confusion_matrix(y_test, y_predict)
    # Accuracy
    print('准确率：', metrics.accuracy_score(y_test, y_predict))
    print("confusion_matrix is: ", confusion_mat)
    # Classification report
    print('分类报告:', metrics.classification_report(y_test, y_predict))
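
Depending on the installed xgboost version, XGBClassifier may expect integer class labels (0, 1, ...) instead of strings such as negative and positive. The following is a minimal sketch of such an encoding in plain Python; the helper name encode_labels and the sample labels are made up for this illustration, and scikit-learn's LabelEncoder does the same job.

```python
def encode_labels(labels):
    """Map string labels to integer ids, assigned in sorted class order."""
    classes = sorted(set(labels))
    class_to_id = {name: index for index, name in enumerate(classes)}
    encoded = [class_to_id[label] for label in labels]
    return encoded, classes


# Hypothetical labels in the same vocabulary as the training file
labels = ["negative", "positive", "negative", "positive", "positive"]
encoded, classes = encode_labels(labels)

# Predictions come back as integer ids; map them back for reporting
decoded = [classes[index] for index in encoded]

print(encoded)
print(classes)
```

The classes list has to be kept next to the model so that predicted ids can be turned back into label names.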

Running the training code gives the following output:

start time is:  1649228843.700035
end time is:  1649229253.274875
cost time is:  409.57483983039856
准确率： 0.7524366471734892
confusion_matrix is:  [[137  80]
 [ 47 249]]
分类报告:               precision    recall  f1-score   support

    negative       0.74      0.63      0.68       217
    positive       0.76      0.84      0.80       296

    accuracy                           0.75       513
   macro avg       0.75      0.74      0.74       513
weighted avg       0.75      0.75      0.75       513
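
The numbers in the report can be recomputed from the confusion matrix alone, which is a useful sanity check. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predictions, classes in sorted order); the helper name per_class_metrics is made up for this illustration.

```python
# Confusion matrix from the run above: rows are true labels, columns are
# predictions, both in the order [negative, positive].
confusion = [[137, 80],
             [47, 249]]


def per_class_metrics(matrix, index):
    """Return (precision, recall, f1) for the class at the given index."""
    true_positive = matrix[index][index]
    predicted_total = sum(row[index] for row in matrix)  # column sum
    actual_total = sum(matrix[index])                    # row sum
    precision = true_positive / predicted_total
    recall = true_positive / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(len(confusion))) / total

negative_metrics = per_class_metrics(confusion, 0)
positive_metrics = per_class_metrics(confusion, 1)

print(round(accuracy, 4))
print([round(value, 2) for value in negative_metrics])
print([round(value, 2) for value in positive_metrics])
```

The lower recall on the negative class comes from the 80 negative comments that were predicted as positive.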

4. xgboost parameters

xgb has quite a few parameters, and tuning them is an important part of using it in practice. Let's go through them.

General xgb parameters

    booster : string
        Specify which booster to use: gbtree, gblinear or dart.
    n_jobs : int
        Number of parallel threads used to run xgboost.  (replaces ``nthread``)
    verbosity : int
        The degree of verbosity. Valid values are 0 (silent) - 3 (debug).
    scale_pos_weight : float
        Balancing of positive and negative weights.

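booster selects the type of base learner; the default is gbtree.
scale_pos_weight mainly deals with class imbalance; the default is 1. When the classes are highly imbalanced, for example a positive-to-negative ratio of 1:100, setting scale_pos_weight to a value above 1, such as 10, can speed up convergence. The xgboost documentation suggests sum(negative instances) / sum(positive instances) as a typical value, which would be 100 for this ratio.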
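
As a rough illustration of the negative-to-positive ratio heuristic, the sketch below computes a starting value from a label list. The function name suggested_scale_pos_weight and the synthetic labels are made up for this example; the result is a starting point for tuning, not a fixed rule.

```python
from collections import Counter


def suggested_scale_pos_weight(labels, positive_label=1):
    """Return the ratio of negative to positive samples."""
    counts = Counter(labels)
    positives = counts[positive_label]
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive samples in labels")
    return negatives / positives


# Synthetic labels with a 1:100 positive-to-negative ratio
labels = [1] * 10 + [0] * 1000
weight = suggested_scale_pos_weight(labels)
print(weight)
```

The value could then be passed as scale_pos_weight=weight when constructing the classifier.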

Tree parameters

    n_estimators : int
        Number of trees to fit.
    max_depth : int
        Maximum tree depth for base learners.
    min_child_weight : int
        Minimum sum of instance weight(hessian) needed in a child.
    gamma : float
        Minimum loss reduction required to make a further partition on a leaf node of the tree.
    max_delta_step : int
        Maximum delta step we allow each tree's weight estimation to be.
    subsample : float
        Subsample ratio of the training instance.
    colsample_bytree : float
        Subsample ratio of columns when constructing each tree.

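n_estimators: number of trees
max_depth: maximum depth of each tree
min_child_weight: minimum sum of instance weights (hessian) required in a child node
gamma: minimum loss reduction required to split a leaf node further
max_delta_step: maximum delta step allowed for each tree's weight estimation
subsample: fraction of the training instances sampled for each tree
colsample_bytree: fraction of the features (columns) sampled for each tree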
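
To see how gamma, min_child_weight and the L2 term interact, here is a simplified sketch of the split gain from the xgboost formulation, where G and H are the sums of gradients and hessians in each child. It only illustrates the formula; the real tree builder does more than this, and the function names and numbers are made up.

```python
def leaf_score(grad_sum, hess_sum, reg_lambda):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction of a split minus the gamma penalty."""
    before = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    after = (leaf_score(g_left, h_left, reg_lambda)
             + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (after - before) - gamma


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Reject splits with a too-light child or a non-positive gain."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Made-up gradient and hessian sums for the two children of a candidate split
stats = dict(g_left=-4.0, h_left=2.0, g_right=3.0, h_right=2.0)

gain = split_gain(**stats)
print(gain)
print(split_allowed(**stats))                        # accepted with defaults
print(split_allowed(**stats, gamma=5.0))             # rejected: gain below gamma
print(split_allowed(**stats, min_child_weight=3.0))  # rejected: children too light
```

Raising gamma or min_child_weight therefore prunes splits and makes the trees more conservative.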

General algorithm parameters

    learning_rate : float
        Boosting learning rate (xgb's "eta")
    objective : string or callable
        Specify the learning task and the corresponding learning objective or
        a custom objective function to be used (see note below).
    reg_alpha : float (xgb's alpha)
        L1 regularization term on weights
    reg_lambda : float (xgb's lambda)
        L2 regularization term on weights

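The available objectives include:
Regression
reg:linear (default; renamed reg:squarederror in newer xgboost releases)
reg:logistic

Binary classification
binary:logistic  returns probabilities
binary:logitraw  returns the raw score before the logistic transformation

Multiclass classification
multi:softmax num_class=n  returns the predicted class
multi:softprob num_class=n  returns per-class probabilities

Ranking
rank:pairwise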
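
The relationship between these outputs can be sketched in plain Python: a logistic function turns a raw score into a probability, and a softmax turns per-class scores into probabilities whose argmax is the predicted class. The scores below are made up, and this only mirrors the transformations, not xgboost's internals.

```python
import math


def sigmoid(margin):
    """Logistic transformation of a raw score into a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


def softmax(margins):
    """Turn per-class raw scores into probabilities that sum to 1."""
    largest = max(margins)
    exps = [math.exp(margin - largest) for margin in margins]  # shifted for stability
    total = sum(exps)
    return [value / total for value in exps]


# Binary case: a raw score (logitraw style) and its probability (logistic style)
raw_margin = 0.0
binary_prob = sigmoid(raw_margin)

# Multiclass case: probabilities (softprob style) and their argmax (softmax style)
class_margins = [1.0, 3.0, 0.5]
class_probs = softmax(class_margins)
predicted_class = max(range(len(class_probs)), key=class_probs.__getitem__)

print(binary_prob)
print(class_probs)
print(predicted_class)
```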

5. Parameter summary

Parameters that control the complexity of the tree model

n_estimators
max_depth
min_child_weight
gamma

Parameters that add randomness to the trees

subsample
colsample_bytree
learning_rate
num_round

Handling class imbalance

scale_pos_weight
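
One simple way to explore these groups is to enumerate candidate combinations and train one model per combination. The sketch below only builds the combinations; the candidate values and the helper name expand_grid are made up for this illustration.

```python
from itertools import product

# Hypothetical candidate values, grouped as in the summary above
param_grid = {
    "max_depth": [3, 4, 6],           # complexity
    "min_child_weight": [1, 5],       # complexity
    "subsample": [0.8, 1.0],          # randomness
    "colsample_bytree": [0.8, 1.0],   # randomness
}


def expand_grid(grid):
    """Return one dict per combination of the candidate values."""
    keys = list(grid)
    value_lists = [grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


candidates = expand_grid(param_grid)
print(len(candidates))
print(candidates[0])
```

Each dict could be unpacked into the classifier, for example xgb.XGBClassifier(**params); scikit-learn's GridSearchCV offers the same kind of search with cross-validation built in.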

6. Plotting feature importance

Modify the training code slightly to add the feature-importance plot.

import matplotlib.pyplot as plt
from xgboost import plot_importance


def train_model():
    x_train_weight, x_test_weight, y_train, y_test = data_processing()
    start = time.time()
    print("start time is: ", start)
    # verbosity takes the place of the deprecated silent flag in current xgboost releases
    model = xgb.XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=50, n_jobs=2,
                              verbosity=1, objective='binary:logistic')
    model.fit(x_train_weight, y_train)
    end = time.time()
    print("end time is: ", end)
    print("cost time is: ", (end - start))
    y_predict = model.predict(x_test_weight)
    confusion_mat = metrics.confusion_matrix(y_test, y_predict)
    # Accuracy
    print('准确率：', metrics.accuracy_score(y_test, y_predict))
    print("confusion_matrix is: ", confusion_mat)
    # Classification report
    print('分类报告:', metrics.classification_report(y_test, y_predict))
    # Plot the ten most important features
    fig, ax = plt.subplots(figsize=(15, 15))
    plot_importance(model, height=0.5, ax=ax, max_num_features=10)
    plt.show()

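The code finally produces a figure like the following.

The figure above shows the ten features with the highest F score. With plot_importance's default settings, the F score is the number of times a feature is used to split the data across all trees.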
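
Because the model is trained on a plain array, the features appear under generic names such as f0, f1 and so on. The sketch below maps such names back to words. It assumes a score dict of the shape returned by model.get_booster().get_score(importance_type="weight") and a vocabulary list such as tf.get_feature_names_out() would give, which would require returning tf from data_processing; the values here are made up.

```python
def top_features(scores, vocabulary, limit=10):
    """Return (word, score) pairs sorted by descending score."""
    named = []
    for feature_name, score in scores.items():
        index = int(feature_name[1:])  # "f12" -> 12
        named.append((vocabulary[index], score))
    named.sort(key=lambda pair: pair[1], reverse=True)
    return named[:limit]


# Hypothetical importance scores and TF-IDF vocabulary
scores = {"f0": 3, "f2": 11, "f3": 7}
vocabulary = ["衣服", "期待", "喜欢", "失望"]

ranked = top_features(scores, vocabulary)
print(ranked)
```

Features that never appear in a split, such as index 1 here, are simply missing from the score dict.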