Notes and Practice | Rumor Detection with LSTM | A First Look at Long Short-Term Memory Networks
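Long Short-Term Memory Networks
- Introduction to LSTM
- Practice Task
- 1. Environment Setup
- 2. Data Preparation
- 3. Model Configuration
- 4. Model Training
- 5. Model Evaluation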
Introduction to LSTM
Problems with RNNs: long-term dependencies and vanishing gradients
Solutions:
LSTM
GRU
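To make the vanishing-gradient problem concrete, here is a minimal sketch. It assumes a toy scalar RNN, h_t = tanh(w * h_{t-1}), with a hypothetical weight w = 0.9 and starting state 0.5; neither value comes from the original notes. Backpropagating through T steps multiplies T local derivatives, each smaller than |w|, so the gradient with respect to the initial state shrinks roughly geometrically.

```python
import math

def gradient_through_time(w, h0, steps):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1})."""
    h = h0
    grad = 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return grad

# The gradient reaching the first step gets smaller as the sequence gets longer
for steps in (1, 10, 50):
    print(steps, gradient_through_time(w=0.9, h0=0.5, steps=steps))
```

An early input therefore has almost no influence on the weight update after a few dozen steps, which is the long-term dependency problem that LSTM and GRU are designed to ease.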
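Long Short-Term Memory (LSTM) Networks
LSTM is a variant of the RNN. It introduces a set of memory units (Memory Units) that let the network learn when to forget historical information and when to update the memory units with new information.
Four gates (in the usual formulation: the forget gate, the input gate, the candidate memory update, and the output gate)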
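The gating idea can be sketched as a single LSTM time step in NumPy. This follows the standard textbook formulation and is not taken from the original notes; the names `lstm_step`, `W`, `U`, and `b`, the row ordering of the stacked weights, and the random values are all assumptions made for illustration. Library implementations may order the gates differently.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,)
    Rows are stacked in the order: forget, input, candidate, output.
    """
    hidden = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # forget gate: how much old memory to keep
    i = sigmoid(z[1 * hidden:2 * hidden])   # input gate: how much new content to write
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate memory content
    o = sigmoid(z[3 * hidden:4 * hidden])   # output gate: how much memory to expose
    c = f * c_prev + i * g                  # updated memory cell
    h = o * np.tanh(c)                      # new hidden state
    return h, c

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = rng.normal(size=(4 * hidden_dim, input_dim))
U = rng.normal(size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):   # a sequence of 5 input vectors
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

Because the cell state `c` is updated additively (old memory scaled by `f`, plus new content scaled by `i`), information can be carried across many steps when the forget gate stays close to 1.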
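Practice Task
This exercise uses a rumor detection model based on a recurrent neural network (RNN). The text of each rumor event is vectorized, and the network learns deep features of the text through training. This avoids manual feature engineering and can surface features that people would not easily notice, which can lead to better results.
Dataset:
The data used in this exercise is Chinese rumor data crawled from the Sina Weibo false-information reporting platform. The dataset contains 1538 rumors and 1849 non-rumors. Each record is in JSON format, and its text field holds the text of the original Weibo post.
For details on the dataset, see https://github.com/thunlp/Chinese_Rumor_Dataset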
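As a small illustration of the record format, the sketch below parses one JSON record and turns it into the "label, tab, text" line used later in this walkthrough. The record content is made up, and real files in the dataset may carry more fields than `text`.

```python
import json

# Hypothetical record for illustration; only the "text" field is used here
raw = '{"text": "example weibo post"}'

record = json.loads(raw)
label = "0"   # this walkthrough marks rumors as "0" and non-rumors as "1"
line = label + "\t" + record["text"] + "\n"
print(repr(line))
```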
1. Environment Setup
import paddle
import numpy as np
import matplotlib.pyplot as plt
print(paddle.__version__)
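2. Data Preparation
(1) Unzip the data, read and parse it, and generate all_data.txt
(2) Generate the data dictionary, dict.txt
(3) Generate the data lists and split them into a training set and a validation set: train_list.txt and eval_list.txt
(4) Define the training dataset provider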
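Steps (2) and (3) amount to building a character-level vocabulary and encoding each text as a list of ids. Here is a minimal sketch on toy strings. The helper names `build_vocab` and `encode` are hypothetical, and two details are choices of this sketch rather than of the walkthrough: characters are sorted so ids are reproducible, and unseen characters fall back to `<unk>`.

```python
def build_vocab(texts):
    """Map every distinct character to an integer id, then add <unk> and <pad>."""
    chars = sorted({ch for text in texts for ch in text})   # sorted for reproducible ids
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    vocab["<unk>"] = len(vocab)
    vocab["<pad>"] = len(vocab)
    return vocab

def encode(text, vocab):
    """Turn a string into a list of character ids; unseen characters map to <unk>."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]

toy_texts = ["abca", "cab"]
toy_vocab = build_vocab(toy_texts)
print(toy_vocab)
print(encode("abz", toy_vocab))
```

Working at the character level sidesteps Chinese word segmentation, at the cost of longer sequences.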
import os
import zipfile

src_path = "data/data20519/Rumor_Dataset.zip"
target_path = "/home/aistudio/data/Chinese_Rumor_Dataset-master"

# Extract the archive only if the target directory does not exist yet
if not os.path.isdir(target_path):
    with zipfile.ZipFile(src_path, 'r') as z:
        z.extractall(path=target_path)

import io
import random
import json

# File names of the rumor data
rumor_class_dirs = os.listdir(target_path + "/Chinese_Rumor_Dataset-master/CED_Dataset/rumor-repost/")
# File names of the non-rumor data
non_rumor_class_dirs = os.listdir(target_path + "/Chinese_Rumor_Dataset-master/CED_Dataset/non-rumor-repost/")
original_microblog = target_path + "/Chinese_Rumor_Dataset-master/CED_Dataset/original-microblog/"

# Rumors are labeled 0, non-rumors are labeled 1
rumor_label = "0"
non_rumor_label = "1"

# Count rumor and non-rumor samples separately
rumor_num = 0
non_rumor_num = 0
all_rumor_list = []
all_non_rumor_list = []

# Parse the rumor data
for rumor_class_dir in rumor_class_dirs:
    if rumor_class_dir != '.DS_Store':
        # Read the original post with the same file name and parse its JSON
        with open(original_microblog + rumor_class_dir, 'r', encoding='utf-8') as f:
            rumor_content = f.read()
        rumor_dict = json.loads(rumor_content)
        all_rumor_list.append(rumor_label + "\t" + rumor_dict["text"] + "\n")
        rumor_num += 1

# Parse the non-rumor data
for non_rumor_class_dir in non_rumor_class_dirs:
    if non_rumor_class_dir != '.DS_Store':
        with open(original_microblog + non_rumor_class_dir, 'r', encoding='utf-8') as f2:
            non_rumor_content = f2.read()
        non_rumor_dict = json.loads(non_rumor_content)
        all_non_rumor_list.append(non_rumor_label + "\t" + non_rumor_dict["text"] + "\n")
        non_rumor_num += 1

print("Total rumor samples: " + str(rumor_num))
print("Total non-rumor samples: " + str(non_rumor_num))

# Shuffle all samples and write them to all_data.txt
data_list_path = "/home/aistudio/data/"
all_data_path = data_list_path + "all_data.txt"
all_data_list = all_rumor_list + all_non_rumor_list
random.shuffle(all_data_list)

# Opening in 'w' mode empties any existing all_data.txt before writing
with open(all_data_path, 'w', encoding='utf-8') as f:
    for data in all_data_list:
        f.write(data)
# Build the data dictionary
def create_dict(data_path, dict_path):
    dict_set = set()
    # Read all samples
    with open(data_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    # Collect every distinct character into a set
    for line in lines:
        content = line.split('\t')[-1].replace('\n', '')
        for s in content:
            dict_set.add(s)
    # Turn the set into a dictionary: one integer id per character
    dict_list = []
    i = 0
    for s in dict_set:
        dict_list.append([s, i])
        i += 1
    dict_txt = dict(dict_list)
    # Add the unknown-character token and the padding token
    dict_txt.update({"<unk>": i})
    dict_txt.update({"<pad>": i + 1})
    # Save the dictionary locally ('w' mode overwrites any old dict.txt)
    with open(dict_path, 'w', encoding='utf-8') as f:
        f.write(str(dict_txt))
    print("Data dictionary created!")

# Encode the data as id sequences and split it into
# training data (train_list.txt) and validation data (eval_list.txt)
def create_data_list(data_list_path):
    with open(os.path.join(data_list_path, 'dict.txt'), 'r', encoding='utf-8') as f_data:
        dict_txt = eval(f_data.readlines()[0])
    with open(os.path.join(data_list_path, 'all_data.txt'), 'r', encoding='utf-8') as f_data:
        lines = f_data.readlines()
    i = 0
    maxlen = 0
    # Opening in 'w' mode empties eval_list.txt and train_list.txt before writing
    with open(os.path.join(data_list_path, 'eval_list.txt'), 'w', encoding='utf-8') as f_eval, \
            open(os.path.join(data_list_path, 'train_list.txt'), 'w', encoding='utf-8') as f_train:
        for line in lines:
            words = line.split('\t')[-1].replace('\n', '')
            maxlen = max(maxlen, len(words))
            label = line.split('\t')[0]
            # Encode each character as its id, separated by commas
            labs = ','.join(str(dict_txt[s]) for s in words)
            labs = labs + '\t' + label + '\n'
            # Every 8th sample goes to the validation set
            if i % 8 == 0:
                f_eval.write(labs)
            else:
                f_train.write(labs)
            i += 1
    print("Data lists created!")
    print("Longest sample length: " + str(maxlen))
Total rumor samples: 1538
Total non-rumor samples: 1849
# Keep all generated data lists under one root directory
data_root_path = "/home/aistudio/data/"
data_path = os.path.join(data_root_path, 'all_data.txt')
dict_path = os.path.join(data_root_path, "dict.txt")

# Create the data dictionary
create_dict(data_path, dict_path)
# Create the data lists
create_data_list(data_root_path)
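The split rule inside create_data_list (every sample whose index is a multiple of 8 goes to validation) can be checked in isolation. The sketch below uses a hypothetical helper, `split_every_nth`, and plain integers as stand-ins for the 1538 + 1849 = 3387 samples. It assumes every sample is kept, so the real file sizes could differ slightly if some lines are malformed.

```python
def split_every_nth(samples, n=8):
    """Send samples whose index is a multiple of n to validation, the rest to training."""
    train, evaluation = [], []
    for i, sample in enumerate(samples):
        if i % n == 0:
            evaluation.append(sample)
        else:
            train.append(sample)
    return train, evaluation

samples = list(range(1538 + 1849))   # stand-ins for the shuffled samples
train_part, eval_part = split_every_nth(samples)
print(len(train_part), len(eval_part))
```

Under these assumptions the split is roughly 7:1, with 2963 training samples and 424 validation samples. Because all_data.txt was shuffled beforehand, taking every eighth line behaves like a random hold-out.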
import sys

def load_vocab(file_path):
    with open(file_path, 'r', encoding='utf8') as fr:
        vocab = eval(fr.read())  # convert the saved string back into a dict
    return vocab

# Print the first 2 training samples
vocab = load_vocab(os.path.join(data_root_path, 'dict.txt'))

# Reverse lookup table: id -> character
id_to_word = {idx: word for word, idx in vocab.items()}

def ids_to_str(ids):
    return " ".join(id_to_word[int(k)] for k in ids)

file_path = os.path.join(data_root_path, 'train_list.txt')
with io.open(file_path, "r", encoding='utf8') as fin:
    i = 0
    for line in fin:
        i += 1
        cols = line.strip().split("\t")
        if len(cols) != 2:
            sys.stderr.write("[NOTICE] Error Format Line!")
            continue
        label = int(cols[1])
        wids = cols[0].split(",")
        print(str(i) + ":")
        print('sentence list id is:', wids)
        print('sentence list is: ', ids_to_str(wids))
        print('sentence label id is:', label)
        print('---------------------------------')
        if i == 2:
            break
import sys

vocab = load_vocab(os.path.join(data_root_path, 'dict.txt'))

class RumorDataset(paddle.io.Dataset):
    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.all_data = []
        with io.open(self.data_dir, "r", encoding='utf8') as fin:
            for line in fin:
                cols = line.strip().split("\t")
                if len(cols) != 2:
                    sys.stderr.write("[NOTICE] Error Format Line!")
                    continue
                label = np.array([int(cols[1])]).astype('int64')
                wids = [int(w) for w in cols[0].split(",")]
                # Truncate to 150 ids, or right-pad with <pad> up to 150
                if len(wids) >= 150:
                    wids = wids[:150]
                else:
                    wids = wids + [vocab["<pad>"]] * (150 - len(wids))
                wids = np.array(wids).astype('int64')
                self.all_data.append((wids, label))

    def __getitem__(self, index):
        data, label = self.all_data[index]
        return data, label

    def __len__(self):
        return len(self.all_data)

batch_size = 32
train_dataset = RumorDataset(os.path.join(data_root_path, 'train_list.txt'))
test_dataset = RumorDataset(os.path.join(data_root_path, 'eval_list.txt'))

train_loader = paddle.io.DataLoader(train_dataset, places=paddle.CPUPlace(), return_list=True,
                                    shuffle=True, batch_size=batch_size, drop_last=True)
test_loader = paddle.io.DataLoader(test_dataset, places=paddle.CPUPlace(), return_list=True,
                                   shuffle=True, batch_size=batch_size, drop_last=True)

# Check one sample from each dataset
print('=============train_dataset =============')
for data, label in train_dataset:
    print(data)
    print(np.array(data).shape)
    print(label)
    break

print('=============test_dataset =============')
for data, label in test_dataset:
    print(data)
    print(np.array(data).shape)
    print(label)
    break
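RumorDataset forces every sample to exactly 150 ids so that samples can be stacked into fixed-size batches. The rule it applies can be sketched on short lists; `pad_or_truncate` is a hypothetical helper name, and the lengths and pad id below are toy values.

```python
def pad_or_truncate(ids, max_len, pad_id):
    """Return exactly max_len ids: cut long sequences, right-pad short ones."""
    if len(ids) >= max_len:
        return ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

short = pad_or_truncate([5, 6, 7], max_len=5, pad_id=0)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=5, pad_id=0)
print(short)
print(long_)
```

Truncation discards everything after position 150, so the fixed length trades some information in long posts for a simpler model input.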
3. Model Configuration
import paddle
from paddle.nn import Linear, Embedding

class RNN(paddle.nn.Layer):
    def __init__(self):
        super(RNN, self).__init__()
        self.dict_dim = vocab["<pad>"]
        self.emb_dim = 128
        self.hid_dim = 128
        self.class_dim = 2
        self.embedding = Embedding(self.dict_dim + 1, self.emb_dim, sparse=False)
        self._fc1 = Linear(self.emb_dim, self.hid_dim)
        self.lstm = paddle.nn.LSTM(self.hid_dim, self.hid_dim)
        # 19200 = 150 time steps * 128 hidden units
        self.fc2 = Linear(19200, self.class_dim)

    def forward(self, inputs):
        # inputs: [32, 150]
        emb = self.embedding(inputs)
        # emb: [32, 150, 128]
        fc_1 = self._fc1(emb)  # first layer
        # fc_1: [32, 150, 128]
        x = self.lstm(fc_1)
        # x[0] holds the per-step LSTM outputs; flatten them to [32, 19200]
        x = paddle.reshape(x[0], [0, -1])
        x = self.fc2(x)
        # Softmax turns the two class scores into probabilities
        x = paddle.nn.functional.softmax(x)
        return x

rnn = RNN()
paddle.summary(rnn, (32, 150), "int64")
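The hard-coded input size 19200 of the last Linear layer follows from the tensor shapes: 150 time steps times 128 hidden units. The NumPy sketch below traces only the shapes, using zero-filled arrays as stand-ins for the real layer outputs and weights; it does not reproduce what the layers compute.

```python
import numpy as np

batch_size, seq_len = 32, 150
hid_dim, class_dim = 128, 2

# Stand-in for the per-step LSTM outputs: [batch, time, hidden]
lstm_out = np.zeros((batch_size, seq_len, hid_dim), dtype="float32")

# Same effect as paddle.reshape(x, [0, -1]): keep the batch axis, flatten the rest
flat = lstm_out.reshape(batch_size, -1)

# Stand-in for the weight matrix of the final Linear layer
fc2_weight = np.zeros((seq_len * hid_dim, class_dim), dtype="float32")
scores = flat @ fc2_weight

print(flat.shape, scores.shape)
```

One consequence of this design is that the classifier is tied to the sequence length: changing the padding length of 150 would require changing 19200 as well.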
4. Model Training
def draw_process(title, color, iters, data, label):
    plt.title(title, fontsize=24)
    plt.xlabel("iter", fontsize=20)
    plt.ylabel(label, fontsize=20)
    plt.plot(iters, data, color=color, label=label)
    plt.legend()
    plt.grid()
    plt.show()

def train(model):
    model.train()
    opt = paddle.optimizer.Adam(learning_rate=0.002, parameters=model.parameters())
    steps = 0
    Iters, total_loss, total_acc = [], [], []
    for epoch in range(3):
        for batch_id, data in enumerate(train_loader):
            steps += 1
            sent = data[0]
            label = data[1]
            logits = model(sent)
            # Note: the model already ends in softmax, and cross_entropy
            # applies softmax again by default (use_softmax=True)
            loss = paddle.nn.functional.cross_entropy(logits, label)
            acc = paddle.metric.accuracy(logits, label)
            if batch_id % 50 == 0:
                Iters.append(steps)
                # item() returns a Python float for single-element tensors
                total_loss.append(loss.item())
                total_acc.append(acc.item())
                print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, batch_id, loss.numpy()))
            loss.backward()
            opt.step()
            opt.clear_grad()

        # Evaluate the model after each epoch
        model.eval()
        accuracies = []
        losses = []
        for batch_id, data in enumerate(test_loader):
            sent = data[0]
            label = data[1]
            logits = model(sent)
            loss = paddle.nn.functional.cross_entropy(logits, label)
            acc = paddle.metric.accuracy(logits, label)
            accuracies.append(acc.numpy())
            losses.append(loss.numpy())
        avg_acc, avg_loss = np.mean(accuracies), np.mean(losses)
        print("[validation] accuracy: {}, loss: {}".format(avg_acc, avg_loss))
        model.train()

    paddle.save(model.state_dict(), "model_final.pdparams")
    draw_process("training loss", "red", Iters, total_loss, "training loss")
    draw_process("training acc", "green", Iters, total_acc, "training acc")

model = RNN()
train(model)
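For readers who want to see what the two numbers tracked during training measure, here is a minimal NumPy sketch of softmax, cross-entropy, and accuracy on three made-up samples. The function names and scores are hypothetical, and this is not Paddle's implementation.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=1, keepdims=True)   # subtract the row max for stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct class."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

def accuracy(probs, labels):
    """Fraction of samples whose most probable class equals the label."""
    return float((probs.argmax(axis=1) == labels).mean())

scores = np.array([[2.0, 0.0], [0.0, 1.0], [1.5, 0.5]])   # made-up class scores
labels = np.array([0, 1, 0])                               # 0 = rumor, 1 = non-rumor
probs = softmax(scores)
loss = cross_entropy(probs, labels)
acc = accuracy(probs, labels)
print(loss, acc)
```

The loss keeps falling as the model grows more confident in the correct class, while accuracy only changes when a prediction flips, which is why the two training curves can move at different rates.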
5. Model Evaluation
# Model evaluation
model_state_dict = paddle.load('model_final.pdparams')
model = RNN()
model.set_state_dict(model_state_dict)
model.eval()

# Answer to "is it a rumor?": class 0 = rumor, class 1 = non-rumor
label_map = {0: "yes", 1: "no"}
samples = []
predictions = []
accuracies = []
losses = []

for batch_id, data in enumerate(test_loader):
    sent = data[0]
    label = data[1]
    logits = model(sent)
    for idx, probs in enumerate(logits):
        # Map the most probable class id to its label
        label_idx = int(np.argmax(probs.numpy()))
        labels = label_map[label_idx]
        predictions.append(labels)
        samples.append(sent[idx].numpy())
    loss = paddle.nn.functional.cross_entropy(logits, label)
    acc = paddle.metric.accuracy(logits, label)
    accuracies.append(acc.numpy())
    losses.append(loss.numpy())

avg_acc, avg_loss = np.mean(accuracies), np.mean(losses)
print("[validation] accuracy: {}, loss: {}".format(avg_acc, avg_loss))
print('Data: {} \n\nIs rumor: {}'.format(ids_to_str(samples[0]), predictions[0]))
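The decoding step in the evaluation loop (pick the most probable class, then map it to a readable answer) can be sketched without Paddle. The probability rows below are made up, and `decode_predictions` and `answer_map` are hypothetical names.

```python
# Answer to "is it a rumor?": class 0 = rumor, class 1 = non-rumor
answer_map = {0: "yes", 1: "no"}

def decode_predictions(prob_rows):
    """Pick the most probable class in each row and map it to a readable answer."""
    results = []
    for probs in prob_rows:
        class_id = max(range(len(probs)), key=lambda k: probs[k])
        results.append(answer_map[class_id])
    return results

decoded = decode_predictions([[0.9, 0.1], [0.2, 0.8]])
print(decoded)
```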