Video walkthrough on Bilibili

This post walks through how to reproduce BERT with PyTorch. Please take about ten minutes to read my article "BERT详解(附带ELMo、GPT介绍)" first and then come back; it will make everything here click much faster and save you a lot of effort.

Preparing the Dataset

I did not use any large dataset here. Instead, I hand-typed a short dialogue between two people, mainly to keep the code easy to read; I want readers to focus on the model implementation itself.

'''
  code by Tae Hwan Jung(Jeff Jung) @graykode, modify by wmathor
  Reference : https://github.com/jadore801120/attention-is-all-you-need-pytorch
              https://github.com/JayParks/transformer
              https://github.com/dhlee347/pytorchic-bert
'''
import re
import math
import torch
import numpy as np
from random import *
import torch.nn as nn
import torch.optim as optim
import torch.utils.data as Data

text = (
    'Hello, how are you? I am Romeo.\n' # R
    'Hello, Romeo My name is Juliet. Nice to meet you.\n' # J
    'Nice meet you too. How are you today?\n' # R
    'Great. My baseball team won the competition.\n' # J
    'Oh Congratulations, Juliet\n' # R
    'Thank you Romeo\n' # J
    'Where are you going today?\n' # R
    'I am going shopping. What about you?\n' # J
    'I am going to visit my grandmother. she is not very well' # R
)
sentences = re.sub("[.,!?\\-]", '', text.lower()).split('\n') # filter '.', ',', '?', '!'
word_list = list(set(" ".join(sentences).split())) # ['hello', 'how', 'are', 'you',...]
word2idx = {'[PAD]' : 0, '[CLS]' : 1, '[SEP]' : 2, '[MASK]' : 3}
for i, w in enumerate(word_list):
    word2idx[w] = i + 4
idx2word = {i: w for i, w in enumerate(word2idx)}
vocab_size = len(word2idx)

token_list = list()
for sentence in sentences:
    arr = [word2idx[s] for s in sentence.split()]
    token_list.append(arr)

The resulting token_list is a two-dimensional list in which each row is one tokenized sentence.

print(token_list)
'''
[[12, 7, 22, 5, 39, 21, 15],
 [12, 15, 13, 35, 10, 27, 34, 14, 19, 5],
 [34, 19, 5, 17, 7, 22, 5, 8],
 [33, 13, 37, 32, 28, 11, 16],
 [30, 23, 27],
 [6, 5, 15],
 [36, 22, 5, 31, 8],
 [39, 21, 31, 18, 9, 20, 5],
 [39, 21, 31, 14, 29, 13, 4, 25, 10, 26, 38, 24]]
'''

Model Parameters

# BERT Parameters
maxlen = 30
batch_size = 6
max_pred = 5 # max tokens of prediction
n_layers = 6
n_heads = 12
d_model = 768
d_ff = 768*4 # 4*d_model, FeedForward dimension
d_k = d_v = 64  # dimension of K(=Q), V
n_segments = 2
  • maxlen means every sentence in a batch consists of 30 tokens; shorter sentences are padded with [PAD] (my implementation here is rather crude: every sentence in every batch is simply fixed to length 30)
  • max_pred is the maximum number of tokens to predict, i.e. BERT's cloze (masked LM) task
  • n_layers is the number of Encoder layers
  • d_model is the dimension of the Token Embeddings, Segment Embeddings, and Position Embeddings
  • d_ff is the dimension of the feed-forward layer inside each Encoder layer
  • n_segments is the number of sentences that make up the Encoder input (see the quick sanity check after this list)
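As a quick, self-contained sanity check (the values are simply copied from the parameter block above, nothing new is introduced), the hyperparameters satisfy the usual Transformer relationships:

# Sanity check of the parameter relationships (values mirror the block above)
d_model, n_heads, d_k, d_v, d_ff = 768, 12, 64, 64, 768 * 4

assert n_heads * d_v == d_model   # concatenated heads project back to d_model
assert d_ff == 4 * d_model        # feed-forward inner size is 4x the model width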

Data Preprocessing

In the preprocessing step we need to randomly mask or replace (both referred to as "mask" below) 15% of the tokens in a sentence, and we also need to concatenate two arbitrary sentences into one input. A small sketch of the packing format comes first, followed by the full make_data() function.
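The sketch below uses made-up token ids (not the real word2idx values) purely to illustrate the packing format produced later: [CLS] tokens_a [SEP] tokens_b [SEP], with segment id 0 for the first segment (including [CLS] and the first [SEP]) and 1 for the second.

# Hypothetical ids just for illustration: 1 = [CLS], 2 = [SEP]
CLS, SEP = 1, 2
tokens_a = [12, 7, 22]   # e.g. "hello how are"
tokens_b = [34, 19, 5]   # e.g. "nice meet you"

input_ids   = [CLS] + tokens_a + [SEP] + tokens_b + [SEP]
segment_ids = [0] * (1 + len(tokens_a) + 1) + [1] * (len(tokens_b) + 1)
print(input_ids)    # [1, 12, 7, 22, 2, 34, 19, 5, 2]
print(segment_ids)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]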

# sample IsNext and NotNext to be same in small batch size
def make_data():
    batch = []
    positive = negative = 0
    while positive != batch_size/2 or negative != batch_size/2:
        tokens_a_index, tokens_b_index = randrange(len(sentences)), randrange(len(sentences)) # sample random index in sentences
        tokens_a, tokens_b = token_list[tokens_a_index], token_list[tokens_b_index]
        input_ids = [word2idx['[CLS]']] + tokens_a + [word2idx['[SEP]']] + tokens_b + [word2idx['[SEP]']]
        segment_ids = [0] * (1 + len(tokens_a) + 1) + [1] * (len(tokens_b) + 1)

        # MASK LM
        n_pred = min(max_pred, max(1, int(len(input_ids) * 0.15))) # 15 % of tokens in one sentence
        cand_maked_pos = [i for i, token in enumerate(input_ids)
                          if token != word2idx['[CLS]'] and token != word2idx['[SEP]']] # candidate masked position
        shuffle(cand_maked_pos)
        masked_tokens, masked_pos = [], []
        for pos in cand_maked_pos[:n_pred]:
            masked_pos.append(pos)
            masked_tokens.append(input_ids[pos])
            if random() < 0.8:  # 80%
                input_ids[pos] = word2idx['[MASK]'] # make mask
            elif random() > 0.9:  # 10%
                index = randint(0, vocab_size - 1) # random index in vocabulary
                while index < 4: # can't involve 'CLS', 'SEP', 'PAD'
                    index = randint(0, vocab_size - 1)
                input_ids[pos] = index # replace

        # Zero Paddings
        n_pad = maxlen - len(input_ids)
        input_ids.extend([0] * n_pad)
        segment_ids.extend([0] * n_pad)

        # Zero Padding (100% - 15%) tokens
        if max_pred > n_pred:
            n_pad = max_pred - n_pred
            masked_tokens.extend([0] * n_pad)
            masked_pos.extend([0] * n_pad)

        if tokens_a_index + 1 == tokens_b_index and positive < batch_size/2:
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, True]) # IsNext
            positive += 1
        elif tokens_a_index + 1 != tokens_b_index and negative < batch_size/2:
            batch.append([input_ids, segment_ids, masked_tokens, masked_pos, False]) # NotNext
            negative += 1
    return batch
# Preprocessing Finished

batch = make_data()
input_ids, segment_ids, masked_tokens, masked_pos, isNext = zip(*batch)
input_ids, segment_ids, masked_tokens, masked_pos, isNext = \
    torch.LongTensor(input_ids), torch.LongTensor(segment_ids), torch.LongTensor(masked_tokens), \
    torch.LongTensor(masked_pos), torch.LongTensor(isNext)

class MyDataSet(Data.Dataset):
    def __init__(self, input_ids, segment_ids, masked_tokens, masked_pos, isNext):
        self.input_ids = input_ids
        self.segment_ids = segment_ids
        self.masked_tokens = masked_tokens
        self.masked_pos = masked_pos
        self.isNext = isNext

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.segment_ids[idx], self.masked_tokens[idx], self.masked_pos[idx], self.isNext[idx]

loader = Data.DataLoader(MyDataSet(input_ids, segment_ids, masked_tokens, masked_pos, isNext), batch_size, True)

In the code above, the variable positive counts the sampled pairs whose two sentences are consecutive, and negative counts the pairs that are not; within one batch the two kinds of samples must be balanced 1:1. To decide whether two randomly chosen sentences are consecutive, it is enough to check tokens_a_index + 1 == tokens_b_index.
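A toy illustration of that rule, using hypothetical sentence-index pairs (not taken from the code above):

# Hypothetical (a, b) sentence indices; IsNext holds exactly when b immediately follows a
pairs = [(0, 1), (3, 0), (4, 5), (2, 7)]
labels = [a + 1 == b for a, b in pairs]
print(labels)  # [True, False, True, False] -> two IsNext and two NotNext samples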

Next comes the random masking. n_pred is the number of tokens that will be masked, and cand_maked_pos holds the candidate positions that are allowed to be masked (tokens such as [SEP] and [CLS] must not be masked, since predicting them is meaningless). After a shuffle(), the value of random() decides whether each selected token is replaced by [MASK] or by some other random token.
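A minimal standalone sketch of just the candidate-selection step, reusing the made-up ids from the earlier sketch (1 = [CLS], 2 = [SEP]); the 80%/10%/10% replacement itself is omitted here and can be seen in make_data() above.

from random import shuffle

input_ids = [1, 12, 7, 22, 2, 34, 19, 5, 2]
n_pred = max(1, int(len(input_ids) * 0.15))   # here: 1 of the 9 tokens
cand_maked_pos = [i for i, token in enumerate(input_ids) if token not in (1, 2)]
shuffle(cand_maked_pos)
print(n_pred, cand_maked_pos[:n_pred])        # e.g. 1 [3]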

Two zero-padding steps follow. The first pads the sentence itself so that every sentence in a batch has the same length (maxlen). The second pads the mask bookkeeping: sentences of different lengths yield different numbers of masked tokens, and within one batch the number of masked entries must be identical, so the remaining slots are filled with a meaningless value such as 0.
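Continuing from the preprocessing code above, and assuming batch_size = 6, maxlen = 30, max_pred = 5 as set earlier, a quick shape check confirms what the two padding steps guarantee:

print(input_ids.shape, segment_ids.shape, masked_tokens.shape, masked_pos.shape, isNext.shape)
# torch.Size([6, 30]) torch.Size([6, 30]) torch.Size([6, 5]) torch.Size([6, 5]) torch.Size([6])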

That concludes the data preprocessing.

Building the Model

The model structure is essentially the Transformer Encoder, so I will not repeat the details here; see my post "Transformer的PyTorch实现" (a PyTorch implementation of the Transformer) and the accompanying Bilibili video.

def get_attn_pad_mask(seq_q, seq_k):
    batch_size, seq_len = seq_q.size()
    # eq(zero) is PAD token
    pad_attn_mask = seq_q.data.eq(0).unsqueeze(1)  # [batch_size, 1, seq_len]
    return pad_attn_mask.expand(batch_size, seq_len, seq_len)  # [batch_size, seq_len, seq_len]

def gelu(x):
    """
      Implementation of the gelu activation function.
      For information: OpenAI GPT's gelu is slightly different (and gives slightly different results):
      0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * torch.pow(x, 3))))
      Also see https://arxiv.org/abs/1606.08415
    """
    return x * 0.5 * (1.0 + torch.erf(x / math.sqrt(2.0)))

class Embedding(nn.Module):
    def __init__(self):
        super(Embedding, self).__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # token embedding
        self.pos_embed = nn.Embedding(maxlen, d_model)  # position embedding
        self.seg_embed = nn.Embedding(n_segments, d_model)  # segment(token type) embedding
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, seg):
        seq_len = x.size(1)
        pos = torch.arange(seq_len, dtype=torch.long)
        pos = pos.unsqueeze(0).expand_as(x)  # [seq_len] -> [batch_size, seq_len]
        embedding = self.tok_embed(x) + self.pos_embed(pos) + self.seg_embed(seg)
        return self.norm(embedding)

class ScaledDotProductAttention(nn.Module):
    def __init__(self):
        super(ScaledDotProductAttention, self).__init__()

    def forward(self, Q, K, V, attn_mask):
        scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k) # scores : [batch_size, n_heads, seq_len, seq_len]
        scores.masked_fill_(attn_mask, -1e9) # Fills elements of self tensor with value where mask is one.
        attn = nn.Softmax(dim=-1)(scores)
        context = torch.matmul(attn, V)
        return context

class MultiHeadAttention(nn.Module):
    def __init__(self):
        super(MultiHeadAttention, self).__init__()
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)

    def forward(self, Q, K, V, attn_mask):
        # q: [batch_size, seq_len, d_model], k: [batch_size, seq_len, d_model], v: [batch_size, seq_len, d_model]
        residual, batch_size = Q, Q.size(0)
        # (B, S, D) -proj-> (B, S, D) -split-> (B, S, H, W) -trans-> (B, H, S, W)
        q_s = self.W_Q(Q).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # q_s: [batch_size, n_heads, seq_len, d_k]
        k_s = self.W_K(K).view(batch_size, -1, n_heads, d_k).transpose(1,2)  # k_s: [batch_size, n_heads, seq_len, d_k]
        v_s = self.W_V(V).view(batch_size, -1, n_heads, d_v).transpose(1,2)  # v_s: [batch_size, n_heads, seq_len, d_v]

        attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1) # attn_mask : [batch_size, n_heads, seq_len, seq_len]

        # context: [batch_size, n_heads, seq_len, d_v], attn: [batch_size, n_heads, seq_len, seq_len]
        context = ScaledDotProductAttention()(q_s, k_s, v_s, attn_mask)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, n_heads * d_v) # context: [batch_size, seq_len, n_heads * d_v]
        output = nn.Linear(n_heads * d_v, d_model)(context)
        return nn.LayerNorm(d_model)(output + residual) # output: [batch_size, seq_len, d_model]

class PoswiseFeedForwardNet(nn.Module):
    def __init__(self):
        super(PoswiseFeedForwardNet, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # (batch_size, seq_len, d_model) -> (batch_size, seq_len, d_ff) -> (batch_size, seq_len, d_model)
        return self.fc2(gelu(self.fc1(x)))

class EncoderLayer(nn.Module):
    def __init__(self):
        super(EncoderLayer, self).__init__()
        self.enc_self_attn = MultiHeadAttention()
        self.pos_ffn = PoswiseFeedForwardNet()

    def forward(self, enc_inputs, enc_self_attn_mask):
        enc_outputs = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask) # enc_inputs to same Q,K,V
        enc_outputs = self.pos_ffn(enc_outputs) # enc_outputs: [batch_size, seq_len, d_model]
        return enc_outputs

class BERT(nn.Module):
    def __init__(self):
        super(BERT, self).__init__()
        self.embedding = Embedding()
        self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])
        self.fc = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.Dropout(0.5),
            nn.Tanh(),
        )
        self.classifier = nn.Linear(d_model, 2)
        self.linear = nn.Linear(d_model, d_model)
        self.activ2 = gelu
        # fc2 is shared with embedding layer
        embed_weight = self.embedding.tok_embed.weight
        self.fc2 = nn.Linear(d_model, vocab_size, bias=False)
        self.fc2.weight = embed_weight

    def forward(self, input_ids, segment_ids, masked_pos):
        output = self.embedding(input_ids, segment_ids) # [batch_size, seq_len, d_model]
        enc_self_attn_mask = get_attn_pad_mask(input_ids, input_ids) # [batch_size, maxlen, maxlen]
        for layer in self.layers:
            # output: [batch_size, max_len, d_model]
            output = layer(output, enc_self_attn_mask)
        # it will be decided by first token(CLS)
        h_pooled = self.fc(output[:, 0]) # [batch_size, d_model]
        logits_clsf = self.classifier(h_pooled) # [batch_size, 2] predict isNext

        masked_pos = masked_pos[:, :, None].expand(-1, -1, d_model) # [batch_size, max_pred, d_model]
        h_masked = torch.gather(output, 1, masked_pos) # masking position [batch_size, max_pred, d_model]
        h_masked = self.activ2(self.linear(h_masked)) # [batch_size, max_pred, d_model]
        logits_lm = self.fc2(h_masked) # [batch_size, max_pred, vocab_size]
        return logits_lm, logits_clsf
model = BERT()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adadelta(model.parameters(), lr=0.001)

This code uses the gelu activation function proposed in the BERT paper; for more background see the article "GELU激活函数".
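For reference, the standard definition, which is exactly what the gelu() function above implements, is

$$\text{GELU}(x) = x\,\Phi(x) = \frac{x}{2}\left(1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right) \approx \frac{x}{2}\left(1 + \tanh\!\left(\sqrt{\tfrac{2}{\pi}}\left(x + 0.044715\,x^{3}\right)\right)\right)$$

where Φ is the CDF of the standard normal distribution; the tanh form on the right is the approximation quoted in the function's docstring.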

One part of this code that is particularly hard to follow is the line near the end of BERT.forward() that calls torch.gather(), so let me explain it briefly. The function implements the following behavior

out = torch.gather(input, dim, index)
# out[i][j][k] = input[index[i][j][k]][j][k] # dim=0
# out[i][j][k] = input[i][index[i][j][k]][k] # dim=1
# out[i][j][k] = input[i][j][index[i][j][k]] # dim=2

To make this concrete, I first create an index tensor

index = torch.from_numpy(np.array([[1, 2, 0], [2, 0, 1]])).type(torch.LongTensor)
index = index[:, :, None].expand(-1, -1, 10)
print(index)
'''
tensor([[[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
         [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],

        [[2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]])
'''

Then I generate a random tensor of shape [2, 3, 10]. Think of it as 2 batches, each containing 3 sentences of 10 words each, except that the "words" here appear as continuous values rather than integer indices

input = torch.rand(2, 3, 10)
print(input)
'''
tensor([[[0.7912, 0.7098, 0.7548, 0.8627, 0.1966, 0.6327, 0.6629, 0.8158,
          0.7094, 0.1476],
         [0.0774, 0.6794, 0.0030, 0.1855, 0.7391, 0.0641, 0.2950, 0.9734,
          0.7018, 0.3370],
         [0.2190, 0.3976, 0.0112, 0.5581, 0.1329, 0.2154, 0.6277, 0.0850,
          0.4446, 0.5158]],

        [[0.4145, 0.8486, 0.9515, 0.3826, 0.6641, 0.5192, 0.2311, 0.6960,
          0.4215, 0.5597],
         [0.0221, 0.5232, 0.3971, 0.8972, 0.2772, 0.5046, 0.1881, 0.9044,
          0.6925, 0.9837],
         [0.6797, 0.5538, 0.8139, 0.1199, 0.0095, 0.4940, 0.7814, 0.1484,
          0.0200, 0.7489]]])
'''

Then call torch.gather(input, 1, index)

print(torch.gather(input, 1, index))
'''
tensor([[[0.0774, 0.6794, 0.0030, 0.1855, 0.7391, 0.0641, 0.2950, 0.9734,
          0.7018, 0.3370],
         [0.2190, 0.3976, 0.0112, 0.5581, 0.1329, 0.2154, 0.6277, 0.0850,
          0.4446, 0.5158],
         [0.7912, 0.7098, 0.7548, 0.8627, 0.1966, 0.6327, 0.6629, 0.8158,
          0.7094, 0.1476]],

        [[0.6797, 0.5538, 0.8139, 0.1199, 0.0095, 0.4940, 0.7814, 0.1484,
          0.0200, 0.7489],
         [0.4145, 0.8486, 0.9515, 0.3826, 0.6641, 0.5192, 0.2311, 0.6960,
          0.4215, 0.5597],
         [0.0221, 0.5232, 0.3971, 0.8972, 0.2772, 0.5046, 0.1881, 0.9044,
          0.6925, 0.9837]]])
'''

The first row of index acts on the first batch of input: the three sentences, originally in order [0, 1, 2], are reordered according to [1, 2, 0]. The second row of index acts on the second batch: the original order [0, 1, 2] is reordered according to [2, 0, 1].
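To connect this back to the model: in BERT.forward(), the same trick gathers the hidden states at the masked positions. Here is a small standalone sketch with random values and reduced shapes (the dimensions are made up for readability, not the real ones):

import torch

batch_size, seq_len, d_model, max_pred = 2, 6, 4, 3
output = torch.rand(batch_size, seq_len, d_model)            # stand-in for the encoder output
masked_pos = torch.tensor([[1, 4, 0], [2, 3, 0]])            # [batch_size, max_pred]

masked_pos = masked_pos[:, :, None].expand(-1, -1, d_model)  # [batch_size, max_pred, d_model]
h_masked = torch.gather(output, 1, masked_pos)               # [batch_size, max_pred, d_model]
print(h_masked.shape)  # torch.Size([2, 3, 4])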

Training & Testing

Here is the training code

for epoch in range(180):
    for input_ids, segment_ids, masked_tokens, masked_pos, isNext in loader:
        logits_lm, logits_clsf = model(input_ids, segment_ids, masked_pos)
        loss_lm = criterion(logits_lm.view(-1, vocab_size), masked_tokens.view(-1)) # for masked LM
        loss_lm = (loss_lm.float()).mean()
        loss_clsf = criterion(logits_clsf, isNext) # for sentence classification
        loss = loss_lm + loss_clsf
        if (epoch + 1) % 10 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'loss =', '{:.6f}'.format(loss))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Here is the test code

# Predict mask tokens and isNext
input_ids, segment_ids, masked_tokens, masked_pos, isNext = batch[0]
print(text)
print([idx2word[w] for w in input_ids if idx2word[w] != '[PAD]'])

logits_lm, logits_clsf = model(torch.LongTensor([input_ids]), \
                               torch.LongTensor([segment_ids]), torch.LongTensor([masked_pos]))
logits_lm = logits_lm.data.max(2)[1][0].data.numpy()
print('masked tokens list : ',[pos for pos in masked_tokens if pos != 0])
print('predict masked tokens list : ', [pos for pos in logits_lm if pos != 0])

logits_clsf = logits_clsf.data.max(1)[1].data.numpy()[0]
print('isNext : ', True if isNext else False)
print('predict isNext : ',True if logits_clsf else False)

Finally, here is the link to the complete code (accessing it may require getting around the firewall).
GitHub repository: nlp-tutorial
