Task Fine-Tuning with Transformers and Related Pretrained Models

Environment and dependencies:

Python 3.9
'''
Dependencies:
tensorflow==2.11.0
transformers==4.26.0
pandas==1.3.5
scikit-learn==1.0.2
'''

The training code is as follows:

from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf
from sklearn.model_selection import train_test_split
import pandas as pd

max_length = 40
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')


def split_dataset(df):
    # hold out 10% of the data, then split that half-and-half into validation and test
    train_set, x = train_test_split(df, stratify=df['label'], test_size=0.1, random_state=42)
    val_set, test_set = train_test_split(x, stratify=x['label'], test_size=0.5, random_state=43)
    return train_set, val_set, test_set


df_raw = pd.read_csv("data/originalthuctcdata/THUCTC_subdata.txt", sep="\t", header=None, names=["text", "label"])
# label names exactly as they appear in the dataset (finance, real estate, stocks, education,
# technology, society, politics, sports, games, entertainment), mapped to integer ids 0..9
df_label = pd.DataFrame({"label": ["财经", "房产", "股票", "教育", "科技", "社会", "时政", "体育", "游戏", "娱乐"], "y": list(range(10))})
df_raw = pd.merge(df_raw, df_label, on="label", how="left")
train_data, val_data, test_data = split_dataset(df_raw)


def convert_example_to_feature(review):
    return tokenizer.encode_plus(
        review,
        add_special_tokens=True,  # add [CLS], [SEP]
        padding='max_length',
        truncation=True,  # cut longer texts so that every example has exactly max_length ids
        max_length=max_length,  # max length of the text that can go to BERT
        # pad_to_max_length=True,
        return_attention_mask=True,  # add attention mask to not focus on pad tokens
    )


# map to the input format expected by TFBertForSequenceClassification
def map_example_to_dict(input_ids, attention_masks, token_type_ids, label):
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_masks,
    }, label


def encode_examples(ds, limit=-1):
    # prepare lists, so that we can build up the final TensorFlow dataset from slices
    input_ids_list = []
    token_type_ids_list = []
    attention_mask_list = []
    label_list = []
    if limit > 0:
        ds = ds.head(limit)  # ds is a pandas DataFrame, so keep only the first `limit` rows
    for index, row in ds.iterrows():
        review = row["text"]
        label = row["y"]
        bert_input = convert_example_to_feature(review)
        input_ids_list.append(bert_input['input_ids'])
        token_type_ids_list.append(bert_input['token_type_ids'])
        attention_mask_list.append(bert_input['attention_mask'])
        label_list.append([label])
    return tf.data.Dataset.from_tensor_slices(
        (input_ids_list, attention_mask_list, token_type_ids_list, label_list)
    ).map(map_example_to_dict)


batch_size = 100
# train dataset
ds_train_encoded = encode_examples(train_data).shuffle(10000).batch(batch_size)
# val dataset
ds_val_encoded = encode_examples(val_data).batch(batch_size)
# test dataset
ds_test_encoded = encode_examples(test_data).batch(batch_size)

# recommended learning rates for Adam: 5e-5, 3e-5, 2e-5
learning_rate = 2e-5
# we will do just 1 epoch for illustration, though multiple epochs might be better as long as we do not overfit the model
number_of_epochs = 1

# model initialization
model = TFBertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=10)

# optimizer: Adam recommended
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate, epsilon=1e-08, clipnorm=1)

# the labels are integer ids rather than one-hot vectors, so use sparse categorical cross entropy and accuracy
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# fit model
bert_history = model.fit(ds_train_encoded, epochs=number_of_epochs, validation_data=ds_val_encoded)
# evaluate test set
model.evaluate(ds_test_encoded)
tf.keras.models.save_model(model, filepath="my_model")
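To get a feel for what the two `train_test_split` calls and the batch size imply, the arithmetic can be sketched in plain Python. This is only an approximation under stated assumptions: the helper names `split_sizes` and `steps_per_epoch` are hypothetical, the sample count of 1000 is made up, and the rounding (ceiling for the held-out part) is my understanding of how scikit-learn sizes a fractional `test_size`, not something the training code itself states.

```python
import math


def split_sizes(n_samples, holdout=0.1, test_share=0.5):
    """Approximate train/val/test sizes for the two-stage split (assumed rounding)."""
    n_holdout = math.ceil(holdout * n_samples)   # first split: test_size=0.1
    n_train = n_samples - n_holdout
    n_test = math.ceil(test_share * n_holdout)   # second split: test_size=0.5
    n_val = n_holdout - n_test
    return n_train, n_val, n_test


def steps_per_epoch(n_train, batch_size):
    """Number of batches per epoch; the last batch may be smaller."""
    return math.ceil(n_train / batch_size)


n_train, n_val, n_test = split_sizes(1000)  # 1000 is a made-up dataset size
print(n_train, n_val, n_test)
print(steps_per_epoch(n_train, batch_size=100))
```

Halving the batch size doubles the number of steps per epoch while lowering the memory needed per step, which is the usual trade-off on smaller devices.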

Adjust the batch size to suit your own hardware. BERT-family models built on the Transformer take input of the following form:

{"input_ids":[[]],"token_type_ids":[[]],"attention_mask":[[]]}

input_ids: the token ids obtained by tokenizing the input text and then padding or truncating the result to the specified length

# map tokens to ids in the WordPiece vocabulary
input_ids = tokenizer.convert_tokens_to_ids(tokenized)
# precalculation of pad length, so that we can reuse it later on
padding_length = max_length_test - len(input_ids)
# add pad tokens (id 0) for texts shorter than our max length
input_ids = input_ids + ([0] * padding_length)
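The padding and truncation step can be sketched in plain Python. This is a minimal illustration rather than the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and the pad id of 0 follows the snippet above. A real tokenizer also keeps the closing [SEP] when it truncates, which this sketch does not attempt.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Return a new list with exactly max_length ids."""
    if len(input_ids) >= max_length:
        # too long: keep only the first max_length ids
        return input_ids[:max_length]
    # too short: append pad ids until the target length is reached
    return input_ids + [pad_id] * (max_length - len(input_ids))


short_ids = [101, 2769, 4263, 102]            # made-up ids for a short text
long_ids = [101, 11, 12, 13, 14, 15, 16, 102]  # made-up ids for a long text

print(pad_or_truncate(short_ids, 6))
print(pad_or_truncate(long_ids, 6))
```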

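token_type_ids: indicates which sentence each token belongs to; it is mainly used to tell the sentences apart when several sentences are fed to the model together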
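For a sentence pair, BERT's convention is that `[CLS]`, the first sentence, and its `[SEP]` get segment 0, while the second sentence and its `[SEP]` get segment 1. A minimal sketch of that layout follows; `build_token_type_ids` is a hypothetical helper and the tokens are placeholders, so treat it as an illustration rather than the tokenizer's own code.

```python
def build_token_type_ids(tokens_a, tokens_b=None):
    """Lay out one or two token lists in BERT's pair format with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)          # first segment is all zeros
    if tokens_b:
        tokens += tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)  # second segment plus its [SEP]
    return tokens, token_type_ids


tokens, segments = build_token_type_ids(["a1", "a2"], ["b1", "b2", "b3"])
for token, segment in zip(tokens, segments):
    print(token, segment)
```

With a single sentence, as in the classification task here, every entry is 0.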

attention_mask: distinguishes the real tokens in input_ids (1) from the padding (0)

# attention should focus just on sequence with non padded tokens
attention_mask = [1] * len(input_ids)
# do not focus attention on padded tokens
attention_mask = attention_mask + ([0] * padding_length)
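Putting the three fields together, a batch in the dictionary format shown earlier could be assembled roughly as follows. This is a sketch under assumptions: `encode_batch` is a hypothetical helper, the token ids are made up, the pad id is taken to be 0, and every example is a single sentence, so all segment ids are 0.

```python
def encode_batch(sequences, max_length, pad_id=0):
    """Build input_ids, token_type_ids and attention_mask for single-sentence inputs."""
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for ids in sequences:
        ids = ids[:max_length]                    # truncate if too long
        padding_length = max_length - len(ids)    # how many pad ids are needed
        batch["input_ids"].append(ids + [pad_id] * padding_length)
        batch["token_type_ids"].append([0] * max_length)
        # 1 for real tokens, 0 for padding
        batch["attention_mask"].append([1] * len(ids) + [0] * padding_length)
    return batch


batch = encode_batch([[101, 7, 8, 102], [101, 9, 102]], max_length=5)
for name, rows in batch.items():
    print(name, rows)
```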

Prediction with the trained model:

from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf

max_length = 40
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
my_model = tf.keras.models.load_model(filepath="my_model")

# a sample Chinese news headline
input_message = "82岁老太为学生做饭扫地44年获授港大荣誉院士"
bert_input = tokenizer.encode_plus(
    input_message,
    add_special_tokens=True,  # add [CLS], [SEP]
    padding='max_length',
    truncation=True,  # same preprocessing as in training: cut texts longer than max_length
    max_length=max_length,  # max length of the text that can go to BERT
    # pad_to_max_length=True,
    return_attention_mask=True,  # add attention mask to not focus on pad tokens
)
# wrap each field in a list to form a batch of size 1
predict_result = my_model({
    "input_ids": [bert_input['input_ids']],
    "token_type_ids": [bert_input['token_type_ids']],
    "attention_mask": [bert_input['attention_mask']],
})
print(predict_result)

The saved model can also be deployed to TF Serving.

The model returns logits, that is, scores that have not been passed through softmax. To return probabilities instead, apply softmax manually: tf.math.softmax(predict_result["logits"],axis=1)
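What the softmax step does, and how to turn its output back into a label name, can be sketched without TensorFlow. The logits below are made up, `softmax` and `top_prediction` are hypothetical helpers, and the label order is assumed to match the `y` ids assigned in the training code.

```python
import math

# label names in the same order as the integer ids used for training
LABELS = ["财经", "房产", "股票", "教育", "科技", "社会", "时政", "体育", "游戏", "娱乐"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits, labels=LABELS):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


example_logits = [0.1, -1.2, 0.3, 0.0, 2.5, 0.4, -0.7, 0.2, -0.3, 1.1]  # made-up scores
label, prob = top_prediction(example_logits)
print(label, round(prob, 3))
```

Softmax preserves the ranking of the logits, so the predicted class is the same with or without it; only the scale of the scores changes.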

Complete code: bert_related_task: fine-tunes BERT-based pretrained models on tasks in various areas to obtain task-specific models (gitee.com)

Wrapping BERT-based models and adjusting their structure will be covered in the next post.
