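Seq2Seq: Encoder-Decoder for Machine Translation

Seq2Seq is a popular branch of NLP whose models are commonly applied to machine translation and chatbots. It grew out of the original Encoder-Decoder architecture. Between 2014 and 2015 the attention mechanism appeared, and combining it with Seq2Seq further improved model performance.

We now implement an Encoder-Decoder model and apply it to a machine translation task.

Machine Translation Dataset and Preprocessing

We use nmt, a lightweight machine translation dataset (available from the author's personal resources page). It is small, which makes it convenient for experiments: en-cn holds English-Chinese pairs and en-fr holds English-French pairs. The Encoder-Decoder here translates English into Chinese. The en-cn training file contains only 14,533 sentence pairs. Each line holds an English sentence and its Chinese translation separated by a tab (the Chinese side mixes simplified and traditional characters, as the samples show). Every pair is short; for example, the first six lines of train.txt are: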

Anyone can do that. 任何人都可以做到。
How about another piece of cake?    要不要再來一塊蛋糕？
She married him. 她嫁给了他。
I don't like learning irregular verbs.   我不喜欢学习不规则动词。
It's a whole new ball game for me.    這對我來說是個全新的球類遊戲。
He's sleeping like a baby.   他正睡着，像个婴儿一样。

The raw corpus needs preprocessing, so first import the required packages and modules:

import os
import sys
import math
from collections import Counter
import numpy as np
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

Next, import nltk, which is used here to tokenize English:

import nltk

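After a fresh install of nltk, tokenization depends on the punkt tool (also available from the author's personal resources page). punkt is nltk's tokenizer data; unpack it and place it in the current virtual environment (assumed to be named TORCH), so that the directory structure is "TORCH/nltk_data/tokenizers/punkt". Define the function load_data, which reads the sentences and turns each one into a list of tokens, adding boundary markers so that every sentence starts with BOS and ends with EOS:

# Read sentences and turn each into a list of tokens; every sentence starts with BOS and ends with EOS
def load_data(in_file):
    cn=[]
    en=[]
    num_example=0
    with open(in_file,'r',encoding='utf-8') as f:
        # readlines() returns a list with one element per line, unlike readline();
        # it suits small files, for large files prefer readline() with a generator
        for line in f.readlines():
            # strip() removes leading/trailing whitespace, split("\t") splits on the tab
            line=line.strip().split("\t") # line[0] is the English sentence, line[1] the Chinese one
            # nltk.word_tokenize tokenizes English; BOS/EOS mark Beginning/End of Sentence
            en.append(["BOS"]+nltk.word_tokenize(line[0].lower())+["EOS"])
            # Chinese is split character by character
            cn.append(["BOS"]+[c for c in line[1]]+["EOS"])
            num_example+=1
    return en,cn

train_file="./nmt/en-cn/train.txt"
dev_file="./nmt/en-cn/dev.txt"
train_en,train_cn=load_data(train_file)
dev_en,dev_cn=load_data(dev_file)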

Inspect the tokenization result:

# Inspect the tokenization result
dev_en[0],dev_cn[0]
"""
(['BOS', 'she', 'put', 'the', 'magazine', 'on', 'the', 'table', '.', 'EOS'],['BOS', '她', '把', '雜', '誌', '放', '在', '桌', '上', '。', 'EOS'])
"""

Build the vocabulary as a count table {word:counts}, then generate word_to_idx from the words it contains:

UNK_IDX=0
PAD_IDX=1

def build_dict(sentences,max_words=50000):
    # count words with Counter
    word_count=Counter()
    for sentence in sentences:
        for s in sentence:
            word_count[s]+=1
    ls=word_count.most_common(max_words) # list of (word,counts) tuples
    total_words=len(ls)+2 # +2 for UNK and PAD
    # word_to_idx {word:idx}; indices 0 and 1 are reserved for UNK and PAD
    word_dict={w[0]:index+2 for index,w in enumerate(ls)}
    word_dict["UNK"]=UNK_IDX
    word_dict["PAD"]=PAD_IDX
    return word_dict,total_words

en_dict,en_total_words=build_dict(train_en)
cn_dict,cn_total_words=build_dict(train_cn)

Build idx_to_word {idx:word}:

# idx_to_word {idx:word}
inv_en_dict={v:k for k,v in en_dict.items()}
inv_cn_dict={v:k for k,v in cn_dict.items()}
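
To make the vocabulary step concrete, here is a minimal sketch on a two-sentence toy corpus. The helper name build_vocab and the toy sentences are illustrative assumptions, not part of the tutorial's code; the indexing scheme (0 for UNK, 1 for PAD, real words from 2) follows the description above.

```python
from collections import Counter

UNK_IDX, PAD_IDX = 0, 1

def build_vocab(sentences, max_words=50000):
    """Hypothetical helper: map the most frequent tokens to indices starting at 2."""
    counts = Counter(tok for sent in sentences for tok in sent)
    word_to_idx = {w: i + 2 for i, (w, _) in enumerate(counts.most_common(max_words))}
    word_to_idx["UNK"] = UNK_IDX
    word_to_idx["PAD"] = PAD_IDX
    return word_to_idx

toy = [["BOS", "i", "like", "tea", "EOS"],
       ["BOS", "i", "like", "cake", "EOS"]]
w2i = build_vocab(toy)
i2w = {i: w for w, i in w2i.items()}

# "coffee" never appeared in the toy corpus, so it falls back to UNK_IDX
ids = [w2i.get(tok, UNK_IDX) for tok in ["BOS", "i", "like", "coffee", "EOS"]]
decoded = [i2w[i] for i in ids]
print(ids, decoded)
```

Any token missing from the dictionary is looked up with a default of UNK_IDX, which is why unseen words later show up as UNK in the translated output.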

Convert both the English and the Chinese tokens to indices and sort the sentence pairs by length (sorting keeps the sentence lengths within each batch close to one another); for the higher-order function sorted, see Python Notebook, Lesson 5: Python Functions (II):

def encode(en_sentences,cn_sentences,en_dict,cn_dict,sort_by_len=True):
    length=len(en_sentences)
    # D.get(k[,d]) -> D[k] if k in D, else d; unknown words map to 0 (UNK_IDX)
    out_en_sentences=[[en_dict.get(word,0) for word in sent] for sent in en_sentences]
    out_cn_sentences=[[cn_dict.get(word,0) for word in sent] for sent in cn_sentences]

    # return the indices that sort a list of sentences by token count
    def len_argsort(seq):
        # sorted(iterable, key=None, reverse=False) sorts ascending by default; key takes a function
        return sorted(range(len(seq)),key=lambda x:len(seq[x]))

    if sort_by_len:
        # sort by English length and apply the same order to the Chinese side
        sorted_index=len_argsort(out_en_sentences)
        out_en_sentences=[out_en_sentences[i] for i in sorted_index]
        out_cn_sentences=[out_cn_sentences[i] for i in sorted_index]
    return out_en_sentences,out_cn_sentences

train_en,train_cn=encode(train_en,train_cn,en_dict,cn_dict)
dev_en,dev_cn=encode(dev_en,dev_cn,en_dict,cn_dict)
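
The key property of the length sort is that one index order, computed from the English side, is applied to both languages so that the pairs stay aligned. A small sketch with made-up index lists (the values are arbitrary assumptions):

```python
def len_argsort(seqs):
    """Indices that order seqs from shortest to longest (stable for ties)."""
    return sorted(range(len(seqs)), key=lambda i: len(seqs[i]))

# three toy sentence pairs; en[k] and cn[k] belong together
en = [[2, 9, 8, 7, 3], [2, 5, 3], [2, 6, 4, 3]]
cn = [[2, 11, 12, 3], [2, 13, 3], [2, 14, 15, 16, 3]]

order = len_argsort(en)                # computed from English lengths only
en_sorted = [en[i] for i in order]
cn_sorted = [cn[i] for i in order]     # same order keeps the pairs aligned
print(order, en_sorted, cn_sorted)
```

Note that the Chinese sentences end up ordered by the length of their English partner, not by their own length, which is why the decoder later has to sort each batch again before packing.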

Look at the first 10 English sentences; because sorted is ascending by default, the shortest sentences come first:

train_en[:10]
"""
[[2, 475, 4, 3],[2, 1318, 126, 3],[2, 1707, 126, 3],[2, 254, 126, 3],[2, 1318, 126, 3],[2, 130, 11, 3],[2, 2045, 126, 3],[2, 693, 126, 3],[2, 2266, 126, 3],[2, 1707, 126, 3]]
"""

Convert the indices back to the original sentences, for example the 24th pair:

[inv_en_dict[i] for i in train_en[23]],[inv_cn_dict[i] for i in train_cn[23]]
"""
(['BOS', 'why', 'me', '?', 'EOS'],['BOS', '为', '什', '么', '是', '我', '？', 'EOS'])
"""

Split the data into batches:

def get_minibatch(n,minibatch_size,shuffle=True):
    # n: total number of sentence pairs in the dataset
    idx_list=np.arange(0,n,minibatch_size) # start index of every batch
    if shuffle:
        np.random.shuffle(idx_list)
    minibatches=[]
    for idx in idx_list:
        minibatches.append(np.arange(idx,min(idx+minibatch_size,n)))
    return minibatches

def prepare_data(seqs):
    # pad every sentence to the same length by appending 0 at the end
    # (0 is UNK_IDX rather than PAD_IDX; this is harmless here because padded
    # positions are skipped by the packed RNN and masked out of the loss)
    lengths=[len(seq) for seq in seqs]
    n_samples=len(seqs)
    max_len=np.max(lengths)
    x=np.zeros((n_samples,max_len)).astype('int32')
    x_lengths=np.array(lengths).astype('int32')
    for idx,seq in enumerate(seqs):
        x[idx,:lengths[idx]]=seq
    return x,x_lengths

def gen_examples(en_sentences,cn_sentences,batch_size):
    minibatches=get_minibatch(len(en_sentences),batch_size)
    all_ex=[]
    for minibatch in minibatches:
        mb_en_sentences=[en_sentences[t] for t in minibatch]
        mb_cn_sentences=[cn_sentences[t] for t in minibatch]
        mb_x,mb_x_len=prepare_data(mb_en_sentences)
        mb_y,mb_y_len=prepare_data(mb_cn_sentences)
        # mb_x [batch_size, longest English sentence in the batch]
        # mb_x_len [batch_size,]
        # mb_y [batch_size, longest Chinese sentence in the batch]
        # mb_y_len [batch_size,]
        all_ex.append((mb_x,mb_x_len,mb_y,mb_y_len))
    return all_ex

batch_size=64
train_data=gen_examples(train_en,train_cn,batch_size)
#  random.shuffle(x)->"Shuffle list x"
random.shuffle(train_data)
dev_data=gen_examples(dev_en,dev_cn,batch_size)
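
The two ideas in the batching code, cutting the index range into consecutive chunks and padding each chunk to its own longest sentence, can be sketched without NumPy. The names batch_ranges and pad_batch are hypothetical stand-ins for get_minibatch (without shuffling) and prepare_data:

```python
def batch_ranges(n, batch_size):
    """Consecutive index chunks; the last chunk may be shorter."""
    return [list(range(start, min(start + batch_size, n)))
            for start in range(0, n, batch_size)]

def pad_batch(seqs, pad_value=0):
    """Right-pad every sequence to the longest one and keep the true lengths."""
    lengths = [len(s) for s in seqs]
    max_len = max(lengths)
    padded = [s + [pad_value] * (max_len - len(s)) for s in seqs]
    return padded, lengths

ranges = batch_ranges(5, 2)
padded, lengths = pad_batch([[2, 7, 3], [2, 8, 9, 10, 3]])
print(ranges)
print(padded, lengths)
```

Because the sentences were sorted by length beforehand, each chunk contains sentences of similar length and little padding is needed.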

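The Encoder-Decoder Model

An Encoder-Decoder model is essentially two recurrent neural networks (usually GRUs) connected together. Suppose we have one Seq pair, an English sentence and a Chinese sentence, both already tokenized. Let $x$ denote the English tokens and $y$ the Chinese tokens, so that:
$(x,y):[x_{1},x_{2},x_{3}]\,|\,[y_{1},y_{2},y_{3},y_{4}]$
Following the usual Seq2Seq convention, $(x_{1},x_{2},x_{3},y_{1})$ is built as the input data and $(y_{2},y_{3},y_{4})$ as the labels. In the training code later in this section the decoder additionally receives the ground-truth $y_{2},y_{3}$ at the following steps, that is, the target sentence shifted by one position;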
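
The input/label split for the decoder side amounts to shifting the target sentence by one position. A minimal sketch using a sentence from the dataset (the variable names are illustrative assumptions):

```python
target = ["BOS", "为", "什", "么", "是", "我", "？", "EOS"]

decoder_input = target[:-1]   # BOS 为 什 么 是 我 ？
labels = target[1:]           # 为 什 么 是 我 ？ EOS

# at step t the decoder sees decoder_input[t] and should predict labels[t]
pairs = list(zip(decoder_input, labels))
for seen, expected in pairs:
    print(seen, "->", expected)
```

Every position therefore becomes one classification example: given the token seen so far (and the hidden state), predict the next token.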

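The Encoder-Decoder network is structured as follows:

In the structure above, the Encoder's initial hidden state $h_{0}$ can be a zero vector. The Decoder outputs the predictions $(yp_{2},yp_{3},yp_{4})$, which are compared with the labels $(y_{2},y_{3},y_{4})$, so machine translation reduces to an ordinary classification task with one class per vocabulary word at each position. The Decoder is in effect a language model: given the Chinese tokens so far, it predicts the following Chinese tokens in order;

The Encoder-Decoder uses a GRU, so first review GRU in pytorch. A GRU can be seen as a simplified LSTM; it is used much like an LSTM, except that there is no cell state;

GRU parameters:
input_size - dimension of the input word vectors;
hidden_size - dimension of the hidden state (and of each output vector per direction);
num_layers - number of GRU layers;
batch_first - whether the batch dimension comes first; the default is False, i.e. (seq_len, batch, input_size)

Inputs: input, h_0
input has shape (seq_len, batch, input_size), where input_size is the dimension of the input word vectors;
h_0 is the hidden state the GRU starts from and defaults to a zero vector; its shape is (num_layers * num_directions, batch, hidden_size), where num_directions=2 if the GRU is bidirectional and 1 otherwise

Outputs: output, h_n
output has shape (seq_len, batch, num_directions * hidden_size)
h_n has shape (num_layers * num_directions, batch, hidden_size)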
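
The shapes listed above can be checked directly. The sizes below are arbitrary assumptions chosen for illustration; the point is that batch_first changes the layout of input and output but not of h_n.

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size = 5, 3, 8, 16

# two layers, bidirectional, default layout (seq_len, batch, feature)
gru = nn.GRU(input_size, hidden_size, num_layers=2, bidirectional=True)
x = torch.randn(seq_len, batch, input_size)
output, h_n = gru(x)                      # h_0 omitted, so it defaults to zeros
print(output.shape)                       # (seq_len, batch, 2 * hidden_size)
print(h_n.shape)                          # (2 layers * 2 directions, batch, hidden_size)

# single layer, one direction, batch_first=True
gru_bf = nn.GRU(input_size, hidden_size, batch_first=True)
out_bf, h_bf = gru_bf(x.transpose(0, 1))  # input is now (batch, seq_len, feature)
print(out_bf.shape)                       # (batch, seq_len, hidden_size)
print(h_bf.shape)                         # still (1, batch, hidden_size)
```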

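To form a batch, sequences are usually padded at the end with a filler token (for example "<PAD>"). When an RNN processes such a batch it performs useless computation on most sentences, because the padding carries no real meaning, and this lowers efficiency. nn.utils.rnn.pack_padded_sequence lets the RNN stop at the end of each sequence (with the default enforce_sorted=True, the batch must first be sorted by sequence length in descending order, longest first). The result is then passed through nn.utils.rnn.pad_packed_sequence to restore a regular, padded output tensor. From the outside the output has the same shape as if the padding had been fed through the RNN, but the computation on padded positions is skipped;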
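
A small round trip through pack_padded_sequence and pad_packed_sequence shows what the two utilities do. The tensor sizes and the random input are assumptions for illustration only.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
rnn = nn.GRU(input_size=4, hidden_size=6, batch_first=True)

x = torch.randn(3, 5, 4)             # 3 sequences padded to length 5
lengths = torch.tensor([5, 3, 2])    # true lengths, already in descending order

packed = pack_padded_sequence(x, lengths, batch_first=True)
packed_out, h_n = rnn(packed)        # the GRU never touches padded steps
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

print(out.shape)         # same shape as running on the padded batch
print(out[1, 3:])        # steps past the true length are filled with zeros
print(out_lengths)
```

With a single-layer, unidirectional GRU, h_n holds each sequence's state at its last real step, so h_n[0, 1] matches out[1, 2] for the sequence of length 3 rather than a state computed after consuming padding.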

The Encoder model is:

class PlainEncoder(nn.Module):
    def __init__(self,vocab_size,hidden_size,dropout=0.2):
        super().__init__()
        self.embed=nn.Embedding(vocab_size,hidden_size)
        # GRU is used like LSTM but has no cell state
        self.rnn=nn.GRU(hidden_size,hidden_size,batch_first=True)
        self.dropout=nn.Dropout(dropout)

    def forward(self,x,lengths):
        # x: one batch [batch_size,seq_len], seq_len is the longest sentence in the batch
        # lengths: length of each sentence in the batch
        # Sort the sequences by length, longest first.
        # tensor.sort(dim, descending=False) -> (sorted tensor, indices into the original tensor)
        sorted_len,sorted_index=lengths.sort(0,descending=True)
        # arr[[m,n,k]] picks rows m,n,k of arr to form a new matrix
        x_sorted=x[sorted_index] # [batch_size,seq_len]
        embeded=self.dropout(self.embed(x_sorted)) # [batch_size,seq_len,hidden_size]
        # "Pack" the batch so the rnn skips the trailing pads.
        # pack_padded_sequence(input, lengths, batch_first=False) returns a PackedSequence;
        # input is the padded batch (T x B x *, or B x T x * with batch_first=True),
        # ordered longest first; lengths holds the length of every sequence.
        packed_embeded=nn.utils.rnn.pack_padded_sequence(embeded,sorted_len.long().cpu(),batch_first=True)
        # packed_out is a PackedSequence
        # hid [num_layers * num_directions, batch_size, hidden_size] (not affected by batch_first)
        packed_out,hid=self.rnn(packed_embeded)
        # Pad the output back to a regular tensor; padded positions are zeros.
        # pad_packed_sequence(sequence, batch_first) returns (padded batch, lengths before padding)
        out,_=nn.utils.rnn.pad_packed_sequence(packed_out,batch_first=True) # out [batch_size, seq_len, num_directions * hidden_size]
        # sorted_index came from the descending sort; sorting it ascending yields the
        # indices that restore the original order, needed to match the Chinese targets
        _,original_idx=sorted_index.sort(0,descending=False)
        # contiguous() lays the tensor out contiguously in memory
        out=out[original_idx.long()].contiguous() # [batch_size, seq_len, num_directions * hidden_size]
        hid=hid[:,original_idx.long()].contiguous() # [num_layers * num_directions, batch_size, hidden_size]
        # keep only the last row of hid, the hidden state of the last layer
        # hid[[-1]]: [1, batch_size, hidden_size]
        return out,hid[[-1]]

The Decoder model is:

class PlainDecoder(nn.Module):
    def __init__(self,vocab_size,hidden_size,dropout=0.2):
        super().__init__()
        self.embed=nn.Embedding(vocab_size,hidden_size)
        self.rnn=nn.GRU(hidden_size,hidden_size,batch_first=True)
        self.fc=nn.Linear(hidden_size,vocab_size)
        self.dropout=nn.Dropout(dropout)

    def forward(self,y,y_lengths,hid):
        # y [batch_size,seq_len]; y_lengths: length of each sentence; hid [1, batch_size, hidden_size]
        sorted_len,sorted_index=y_lengths.sort(dim=0,descending=True)
        y_sorted=y[sorted_index] # [batch_size,seq_len]
        hid=hid[:,sorted_index]  # [1, batch_size, hidden_size]
        embeded=self.dropout(self.embed(y_sorted)) # [batch_size,seq_len,hidden_size]
        packed_embeded=nn.utils.rnn.pack_padded_sequence(embeded,sorted_len.long().cpu(),batch_first=True)
        packed_out,hid=self.rnn(packed_embeded,hid)
        out,_=nn.utils.rnn.pad_packed_sequence(packed_out,batch_first=True) # out [batch_size, seq_len, num_directions * hidden_size]
        _,original_idx=sorted_index.sort(dim=0,descending=False)
        out=out[original_idx.long()].contiguous() # out [batch_size, seq_len, num_directions * hidden_size]
        hid=hid[:,original_idx.long()].contiguous() # [num_layers * num_directions, batch_size, hidden_size]
        # self.fc(out) [batch_size, seq_len, vocab_size]
        # log_softmax is applied along the vocab_size axis (dim=-1)
        output=F.log_softmax(self.fc(out),dim=-1) # [batch_size, seq_len, vocab_size]
        return output,hid[[-1]]

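Combine the Encoder and Decoder into a Seq2Seq model:

class PlainSeq2Seq(nn.Module):
    def __init__(self,encoder,decoder):
        super().__init__()
        self.encoder=encoder
        self.decoder=decoder

    def forward(self,x,x_lengths,y,y_lengths):
        encoder_out,hid=self.encoder(x,x_lengths)
        output,hid=self.decoder(y,y_lengths,hid)
        return output

    def translate(self,x,x_lengths,y,max_length=10):
        # y holds the Chinese BOS token, shape [batch_size,seq_len=1];
        # starting from BOS, predict the next character one step at a time
        encoder_out,hid=self.encoder(x,x_lengths)
        preds=[]
        batch_size=x.shape[0]
        for i in range(max_length):
            # output [batch_size, seq_len=1, vocab_size]
            output,hid=self.decoder(y=y,y_lengths=torch.ones(batch_size).long().to(y.device),hid=hid)
            # tensor.max(dim) -> (max values, indices of the max values)
            y=output.max(dim=2)[1].view(batch_size,1) # y [batch_size,1]
            preds.append(y)
        return torch.cat(preds,dim=1) # [batch_size,max_length]

Seq2Seq defines the instance method translate, which works as follows:

The English tokens are fed to the Encoder, which produces the output hidden state;
The Decoder's GRU always runs for a fixed max_length steps, producing one prediction per step;
The marker BOS is taken as the first Chinese token and, together with the Encoder's hidden state, fed to the Decoder to obtain the first output vector. A fully connected layer maps that vector to a score for every vocabulary word, the highest-scoring word is taken as the prediction, and that word is fed back as the next Decoder input. Repeating this yields the list of output Chinese tokens;
The list is scanned in order, and if the marker EOS appears, the tokens before it are kept as the Chinese result.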
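
The control flow of translate plus the EOS truncation can be sketched with a toy "model". Here step_fn is a hypothetical stand-in for one Decoder step (embedding, GRU, linear layer and argmax), and the NEXT table is an invented deterministic next-token rule used only for illustration.

```python
def greedy_decode(step_fn, bos, eos, max_length=10):
    """Run a fixed number of steps, feeding each prediction back in,
    then keep only the tokens before the first EOS."""
    token, state, preds = bos, None, []
    for _ in range(max_length):
        token, state = step_fn(token, state)   # prediction becomes the next input
        preds.append(token)
    if eos in preds:
        preds = preds[:preds.index(eos)]
    return preds

# toy next-token table standing in for the trained Decoder
NEXT = {"BOS": "你", "你": "好", "好": "。", "。": "EOS", "EOS": "EOS"}

def toy_step(token, state):
    return NEXT[token], state

result = greedy_decode(toy_step, "BOS", "EOS")
short = greedy_decode(toy_step, "BOS", "EOS", max_length=2)
print(result, short)
```

As in the tutorial's code, decoding does not stop early at EOS; the loop always runs max_length steps and the cut happens afterwards, which is simple but wastes some computation on short sentences.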

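Loss Function

First review gather. torch.gather collects the values at given positions along a chosen dimension of the input; its parameters are:

torch.gather: input (tensor): the tensor to read from; dim (int): the dimension to index along; index (LongTensor): the positions along dim of input whose values are taken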
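
For the loss, gather is used to pick, at every position, the log-probability that the model assigned to the correct word. A minimal sketch with hand-written probabilities (the numbers are assumptions chosen for readability):

```python
import torch

# log-probabilities for 2 positions over a vocabulary of 3 words
log_probs = torch.log(torch.tensor([[0.7, 0.2, 0.1],
                                    [0.1, 0.8, 0.1]]))
target = torch.tensor([[0], [1]])       # correct word index per position, shape [2, 1]

# along dim=1: picked[i, 0] = log_probs[i, target[i, 0]]
picked = log_probs.gather(1, target)    # shape [2, 1]
print(picked)
print(-picked.mean())                   # the average negative log-likelihood
```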

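A custom loss function is used so that only real tokens count toward the loss:

class LanguageModelCriterion(nn.Module):
    """Custom loss: masked negative log-likelihood"""
    def __init__(self):
        super().__init__()

    def forward(self,input,target,mask):
        # input [batch_size, seq_len, vocab_size]
        # target [batch_size, seq_len]
        input=input.contiguous().view(-1,input.size(2)) # [*,vocab_size]
        target=target.contiguous().view(-1,1) # [*,1]
        # mask marks which positions are real tokens and which are padding
        # mask [batch_size, seq_len]
        mask=mask.contiguous().view(-1,1) # reshaped to [*,1]
        # input is the Decoder output after log_softmax, so it already holds logs;
        # negating the gathered values gives the negative log-likelihood
        output=-input.gather(1,target)*mask
        output=torch.sum(output)/torch.sum(mask)
        return output

The loss uses a mask (a 0/1 matrix for the batch: 1 where a position holds a real token of the sentence, 0 where it is padding). Multiplying by the mask removes padded positions from the sum, and dividing by the mask total averages over real tokens only, so the model is neither trained nor scored on padding;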
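
The effect of the mask can be verified by hand in plain Python. The function masked_nll below is a hypothetical, list-based restatement of the same arithmetic (negative log-likelihood summed over real tokens, divided by their count); the probabilities are made up.

```python
import math

def masked_nll(log_probs, targets, mask):
    """log_probs: [positions][vocab]; targets and mask: one entry per position."""
    total = sum(-lp[t] * m for lp, t, m in zip(log_probs, targets, mask))
    return total / sum(mask)

log_probs = [[math.log(0.5), math.log(0.5)],
             [math.log(0.25), math.log(0.75)],
             [math.log(0.9), math.log(0.1)]]   # third position is padding
targets = [0, 1, 1]
mask = [1.0, 1.0, 0.0]

loss = masked_nll(log_probs, targets, mask)
expected = -(math.log(0.5) + math.log(0.75)) / 2

# changing the prediction at the padded position leaves the loss untouched
altered = log_probs[:2] + [[math.log(0.01), math.log(0.99)]]
loss_altered = masked_nll(altered, targets, mask)
print(loss, loss_altered)
```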

Instantiate the model and define the optimizer:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dropout=0.2
hidden_size=100

encoder=PlainEncoder(vocab_size=en_total_words,hidden_size=hidden_size,dropout=dropout)
decoder=PlainDecoder(vocab_size=cn_total_words,hidden_size=hidden_size,dropout=dropout)
model=PlainSeq2Seq(encoder,decoder)
model=model.to(device)
loss_fn=LanguageModelCriterion().to(device)
optimizer=torch.optim.Adam(model.parameters())

Training and Testing

In practice, training a good machine translation model requires a large corpus and can take on the order of two weeks. The dataset in this experiment is simple and training is quick. Define the training function:

def train(model,data,num_epochs=30):
    for epoch in range(num_epochs):
        model.train()
        num_words=total_loss=0.0
        for it,(mb_x,mb_x_len,mb_y,mb_y_len) in enumerate(data):
            # mb_x [batch_size, longest English sentence in the batch]
            # mb_x_len [batch_size,]
            # mb_y [batch_size, longest Chinese sentence in the batch]
            # mb_y_len [batch_size,]
            # torch.from_numpy(ndarray) converts an ndarray to a tensor
            mb_x=torch.from_numpy(mb_x).to(device).long()
            mb_x_len=torch.from_numpy(mb_x_len).to(device).long()
            # For the Chinese sentence: BOS 为 什 么 是 我 EOS
            # the decoder input is:     BOS 为 什 么 是 我
            # and the label is:         为 什 么 是 我 EOS
            mb_input=torch.from_numpy(mb_y[:,:-1]).to(device).long()
            mb_output=torch.from_numpy(mb_y[:,1:]).to(device).long()
            mb_y_len=torch.from_numpy(mb_y_len-1).to(device).long()
            # give any sentence of length 0 a length of 1 to avoid errors
            mb_y_len[mb_y_len<=0]=1
            mb_pred=model(mb_x,mb_x_len,mb_input,mb_y_len)
            # mask marks which positions are real tokens and which are padding
            # positions 0 .. (longest shifted Chinese sentence in the batch - 1)
            mb_out_mask=torch.arange(mb_y_len.max().item(),device=device) # [max_len]
            # add dimensions and compare by broadcasting:
            # mb_out_mask.unsqueeze(dim=0) [1,max_len]
            # mb_y_len.unsqueeze(dim=-1) [batch_size,1]
            # for each sentence, positions smaller than its length count as real tokens
            mb_out_mask=mb_out_mask.unsqueeze(dim=0) < mb_y_len.unsqueeze(dim=-1) # [batch_size,max_len]
            mb_out_mask=mb_out_mask.float()
            loss=loss_fn(mb_pred,mb_output,mb_out_mask)
            # compute gradients and update the model
            optimizer.zero_grad()
            loss.backward()
            # gradient clipping (see Lesson 5, Language Models)
            torch.nn.utils.clip_grad_norm_(model.parameters(),5.0)
            optimizer.step()
            if it % 100 == 0:
                print("Epoch:",epoch,"iter:",it,"loss:",loss.item())
        # measure the loss on dev every 5 epochs
        if epoch % 5 == 0:
            evaluate(model, dev_data)

The training loop calls the validation function evaluate, which runs one epoch of forward passes on dev and prints the average loss per token:

def evaluate(model, data):
    model.eval()
    total_num_words = total_loss = 0.
    with torch.no_grad():
        for it, (mb_x, mb_x_len, mb_y, mb_y_len) in enumerate(data):
            mb_x = torch.from_numpy(mb_x).to(device).long()
            mb_x_len = torch.from_numpy(mb_x_len).to(device).long()
            mb_input = torch.from_numpy(mb_y[:, :-1]).to(device).long()
            mb_output = torch.from_numpy(mb_y[:, 1:]).to(device).long()
            mb_y_len = torch.from_numpy(mb_y_len-1).to(device).long()
            mb_y_len[mb_y_len<=0] = 1
            mb_pred = model(mb_x, mb_x_len, mb_input, mb_y_len)
            mb_out_mask = torch.arange(mb_y_len.max().item(), device=device)[None, :] < mb_y_len[:, None]
            mb_out_mask = mb_out_mask.float()
            loss = loss_fn(mb_pred, mb_output, mb_out_mask)
            # weight each batch's mean loss by its number of real tokens
            num_words = torch.sum(mb_y_len).item()
            total_loss += loss.item() * num_words
            total_num_words += num_words
    print("Evaluation loss", total_loss/total_num_words)
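
The reason evaluate multiplies each batch loss by its token count before averaging is that batches hold different numbers of tokens, so a plain mean of batch losses would over-weight small batches. A small sketch with invented numbers (weighted_mean_loss is a hypothetical helper, not part of the tutorial's code):

```python
def weighted_mean_loss(batch_losses, batch_word_counts):
    """Average per-token loss over all batches, weighting each batch by its token count."""
    total_loss = sum(l * n for l, n in zip(batch_losses, batch_word_counts))
    return total_loss / sum(batch_word_counts)

losses = [2.0, 4.0]      # mean per-token loss of two batches
counts = [300, 100]      # number of real tokens in each batch

per_token = weighted_mean_loss(losses, counts)
naive = sum(losses) / len(losses)
print(per_token, naive)
```

The weighted value is the true average over all 400 tokens, while the naive mean treats the 100-token batch as if it were as large as the 300-token one.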

Train the model:

train(model, train_data, num_epochs=20)
"""
Epoch: 0 iter: 0 loss: 8.087753295898438
Epoch: 0 iter: 100 loss: 5.2040791511535645
Epoch: 0 iter: 200 loss: 5.6370744705200195
Evaluation loss 4.843846693269169
...
Epoch: 15 iter: 0 loss: 2.4589157104492188
Epoch: 15 iter: 100 loss: 2.647231101989746
Epoch: 15 iter: 200 loss: 3.7229344844818115
Evaluation loss 3.2680165134782917
...
Epoch: 19 iter: 0 loss: 2.302227020263672
Epoch: 19 iter: 100 loss: 2.4202232360839844
Epoch: 19 iter: 200 loss: 3.61832857131958
"""

Use the model for machine translation:

# Use the model for machine translation
def translate_dev(i):
    # i: index of the sentence in dev
    model.eval() # switch off dropout, since train() leaves the model in training mode
    en_sent = " ".join([inv_en_dict[w] for w in dev_en[i]])
    print(en_sent)
    cn_sent = " ".join([inv_cn_dict[w] for w in dev_cn[i]])
    print("".join(cn_sent))
    mb_x = torch.from_numpy(np.array(dev_en[i]).reshape(1, -1)).long().to(device)
    mb_x_len = torch.from_numpy(np.array([len(dev_en[i])])).long().to(device)
    # cn_dict is the Chinese word_to_idx {word:idx}
    bos = torch.Tensor([[cn_dict["BOS"]]]).long().to(device)  # [batch_size=1,seq_len=1]
    with torch.no_grad():
        translation = model.translate(mb_x, mb_x_len, bos) # [batch_size=1,max_length=10]
    translation = [inv_cn_dict[idx] for idx in translation.data.cpu().numpy().reshape(-1)]
    trans = []
    for word in translation:
        if word != "EOS":
            trans.append(word)
        else:
            break
    print("".join(trans))

for i in range(100,120):
    print("句子序号:",i) # "句子序号" means "sentence index"
    translate_dev(i)

The results are (each group shows the sentence index, the English source, the reference Chinese, and the model output):

句子序号: 100
BOS you have nice skin . EOS
BOS 你的皮膚真好。 EOS
你最好的很好。
句子序号: 101
BOS you 're UNK correct . EOS
BOS 你部分正确。 EOS
你是个好人的。
句子序号: 102
BOS everyone admired his courage . EOS
BOS 每個人都佩服他的勇氣。 EOS
大家都沒有人。
句子序号: 103
BOS what time is it ? EOS
BOS 几点了？ EOS
那裡有什么？
句子序号: 104
BOS i 'm free tonight . EOS
BOS 我今晚有空。 EOS
我有一個好人。
句子序号: 105
BOS here is your book . EOS
BOS 這是你的書。 EOS
這是你的朋友。
句子序号: 106
BOS they are at lunch . EOS
BOS 他们在吃午饭。 EOS
他們在家裡。
…
句子序号: 119
BOS i made a mistake . EOS
BOS 我犯了一個錯。 EOS
我有一個漂亮的。

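Adding Luong Attention

Luong Attention

The attention mechanisms in common use are Bahdanau attention and Luong attention. The two are similar in theory, and Luong attention is the more widely used. Attention usually combines the outputs of the original Encoder and the original Decoder and reassembles them into a new output:

Let the Encoder output be the sequence $o_{s}$ (each element is a word vector) and the original Decoder output be the sequence $o_{c}$. The attention layer processes these two feature sequences as follows:

Compute the score. The word vectors in $o_{s}$ first pass through a fully connected layer, $W_{a}o_{s}$, which maps them to another representation; then each word vector of $o_{c}$ is dotted with each transformed feature in $W_{a}o_{s}$:
For a single Seq, suppose $o_{s}$ has shape $[\text{EnglishSeqLen}, \text{EncoderHiddenSize}]$;
$o_{c}$ has shape $[\text{ChineseSeqLen}, \text{DecoderHiddenSize}]$;
then $W_{a}$ corresponds to a pytorch Linear set up as nn.Linear(enc_hidden_size, dec_hidden_size, bias=False) (in the implementation later in this section the Encoder is bidirectional, so the input size becomes enc_hidden_size*2);
taking the dot product of each word of $o_{c}$ with each feature of $W_{a}o_{s}$ gives, for every word of $o_{c}$, a vector of shape $[\text{EnglishSeqLen},]$, so the score is:
$score(o_{c},o_{s})=o_{c}(W_{a}o_{s})^{\top}$
Apply softmax to the score to obtain the weights $a(score)$:
$a(score)=softmax(score, dim=-1)$
The tensor $score$ has shape $[\text{ChineseSeqLen}, \text{EnglishSeqLen}]$; taking the softmax over the last dimension yields a tensor whose $i$-th row gives the importance of each English token for the $i$-th Chinese token;
Fold the weights back into the Encoder output $o_{s}$ to obtain $o_{new}$:
$o_{new}=a(score)o_{s}$
$o_{new}$ has shape $[\text{ChineseSeqLen}, \text{EncoderHiddenSize}]$ and can be seen as a feature that encodes how important each English token is. It is concatenated with the original Decoder output $o_{c}$ along the last dimension, giving a tensor of shape:
$[\text{ChineseSeqLen}, \text{EncoderHiddenSize}+\text{DecoderHiddenSize}]$
A fully connected layer then restores the dimension; the whole step can be written as:
$o_{h}=tanh(W_{c}[o_{new};o_{c}])$
The shape of $o_{h}$ usually matches that of $o_{c}$: $[\text{ChineseSeqLen}, \text{DecoderHiddenSize}]$;
Finally $o_{h}$ is mapped to shape $[\text{ChineseSeqLen}, \text{VocabSize}]$, a score for every vocabulary word at each position (log-probabilities after log_softmax in the code); the highest-scoring word at each position gives the translation;

In the process above the learnable parameters are $W_{a}$ and $W_{c}$. $W_{a}$ transforms the Encoder output features so that their dot product with the Chinese token features measures the importance of each English token; that is, for a given Chinese token it yields how much each English token matters to it, which is the "attention". $W_{c}$ mainly serves to change the dimension of the output tensor;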

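In the original Encoder-Decoder model the information of the English sentence is compressed into the Encoder's output hidden state, which inevitably loses a great deal of information and hurts the Chinese translation. With attention, each output vector of the original Decoder is fused with the information of the English tokens most relevant to it, which improves the accuracy of the Chinese token predicted at that position;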
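
The attention steps described in this section can be traced for a single sentence pair with a few tensor operations. This is a sketch under assumptions: the sizes are arbitrary, the inputs are random, and W_c is an nn.Linear with its default bias term even though the formula for $o_{h}$ shows none.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
en_len, cn_len, enc_h, dec_h = 5, 4, 6, 8

o_s = torch.randn(en_len, enc_h)     # Encoder outputs, one row per English token
o_c = torch.randn(cn_len, dec_h)     # Decoder GRU outputs, one row per Chinese token

W_a = nn.Linear(enc_h, dec_h, bias=False)
W_c = nn.Linear(enc_h + dec_h, dec_h)

score = o_c @ W_a(o_s).T                                 # [cn_len, en_len]
a = F.softmax(score, dim=-1)                             # weights over English tokens
o_new = a @ o_s                                          # [cn_len, enc_h]
o_h = torch.tanh(W_c(torch.cat([o_new, o_c], dim=-1)))   # [cn_len, dec_h]

print(score.shape, o_new.shape, o_h.shape)
print(a.sum(dim=-1))     # every row of weights sums to 1
```

Row i of a is the attention distribution of the i-th Chinese position over the English tokens, and o_new[i] is the corresponding weighted average of the Encoder outputs.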

Encoder-Decoder with Luong Attention

Implement the Encoder according to the description above:

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_size, enc_hidden_size, dec_hidden_size, dropout=0.2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, enc_hidden_size, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(enc_hidden_size * 2, dec_hidden_size)

    def forward(self, x, lengths):
        sorted_len, sorted_idx = lengths.sort(0, descending=True)
        x_sorted = x[sorted_idx.long()]
        embedded = self.dropout(self.embed(x_sorted))
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, sorted_len.long().cpu(), batch_first=True)
        packed_out, hid = self.rnn(packed_embedded)
        out, _ = nn.utils.rnn.pad_packed_sequence(packed_out, batch_first=True)
        _, original_idx = sorted_idx.sort(0, descending=False)
        out = out[original_idx.long()].contiguous() # [batch_size, seq_len, num_directions=2 * enc_hidden_size]
        hid = hid[:, original_idx.long()].contiguous() # [num_layers=1 * num_directions=2, batch_size, enc_hidden_size]
        # hid[-2] and hid[-1] are the final forward and backward states, each [batch_size, enc_hidden_size]
        hid = torch.cat([hid[-2], hid[-1]], dim=1) # [batch_size, enc_hidden_size*2]
        # project to the Decoder's hidden size to initialize the Decoder
        hid = torch.tanh(self.fc(hid)).unsqueeze(0) # [1,batch_size,dec_hidden_size]
        return out, hid

Implement the Attention layer:

class Attention(nn.Module):
    def __init__(self, enc_hidden_size, dec_hidden_size):
        super().__init__()
        self.enc_hidden_size = enc_hidden_size
        self.dec_hidden_size = dec_hidden_size
        self.linear_in = nn.Linear(enc_hidden_size*2, dec_hidden_size, bias=False)
        self.linear_out = nn.Linear(enc_hidden_size*2 + dec_hidden_size, dec_hidden_size)

    def forward(self, output, context, mask):
        # output: the Decoder GRU output (Chinese side) [batch_size, output_len, dec_hidden_size]
        # context: the Encoder output (English side) [batch_size, context_len, 2*enc_hidden_size]
        batch_size = output.size(0)
        output_len = output.size(1)
        input_len = context.size(1)
        context_in = self.linear_in(context.view(batch_size*input_len, -1)).view(
            batch_size, input_len, -1) # batch_size, context_len, dec_hidden_size
        # context_in.transpose(1,2): batch_size, dec_hidden_size, context_len
        # output: batch_size, output_len, dec_hidden_size
        attn = torch.bmm(output, context_in.transpose(1,2)) # batch_size, output_len, context_len
        # mask has the same shape as attn; nonzero/True marks positions that are not real tokens.
        # Fill those positions with a very negative value so softmax ignores them.
        # masked_fill returns a new tensor, so the result must be assigned back.
        attn = attn.masked_fill(mask.bool(), -1e6)
        attn = F.softmax(attn, dim=2) # batch_size, output_len, context_len
        context = torch.bmm(attn, context) # batch_size, output_len, 2*enc_hidden_size
        output = torch.cat((context, output), dim=2) # batch_size, output_len, enc_hidden_size*2 + dec_hidden_size
        output = output.view(batch_size*output_len, -1)
        output = torch.tanh(self.linear_out(output))
        output = output.view(batch_size, output_len, -1) # batch_size, output_len, dec_hidden_size
        return output, attn

The new Decoder is simply the original Decoder plus the Attention layer:

class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_size, enc_hidden_size, dec_hidden_size, dropout=0.2):
        super(Decoder, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.attention = Attention(enc_hidden_size, dec_hidden_size)
        self.rnn = nn.GRU(embed_size, dec_hidden_size, batch_first=True)
        self.out = nn.Linear(dec_hidden_size, vocab_size)
        self.dropout = nn.Dropout(dropout)

    def create_mask(self, y_len, x_len):
        # y_len: Chinese lengths; x_len: English lengths
        max_y_len = y_len.max()
        max_x_len = x_len.max()
        x_mask = torch.arange(max_x_len, device=x_len.device)[None, :] < x_len[:, None] # [batch_size,max_x_len]
        y_mask = torch.arange(max_y_len, device=x_len.device)[None, :] < y_len[:, None] # [batch_size,max_y_len]
        # For one sentence with n English tokens: in the row of a real Chinese token,
        # the first n columns are False (valid); True marks positions to mask out.
        # ~ negates a bool tensor
        mask = ~(y_mask[:, :, None] & x_mask[:, None, :]) # [batch_size,max_y_len,max_x_len]
        return mask

    def forward(self, ctx, ctx_lengths, y, y_lengths, hid):
        # ctx: the Encoder output; hid [1,batch_size,dec_hidden_size]
        sorted_len, sorted_idx = y_lengths.sort(0, descending=True)
        y_sorted = y[sorted_idx.long()] # [batch_size,y_len]
        hid = hid[:, sorted_idx.long()]
        y_sorted = self.dropout(self.embed(y_sorted)) # batch_size, Chinese seq_len, embed_size
        packed_seq = nn.utils.rnn.pack_padded_sequence(y_sorted, sorted_len.long().cpu(), batch_first=True)
        out, hid = self.rnn(packed_seq, hid)
        unpacked, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True) # [batch_size,Chinese seq_len,dec_hidden_size]
        _, original_idx = sorted_idx.sort(0, descending=False)
        output_seq = unpacked[original_idx.long()].contiguous() # [batch_size,Chinese seq_len,dec_hidden_size]
        hid = hid[:, original_idx.long()].contiguous() # [1,batch_size,dec_hidden_size]
        mask = self.create_mask(y_lengths, ctx_lengths) # [batch_size,Chinese seq_len,English seq_len]
        # output [batch_size, Chinese seq_len, dec_hidden_size]
        # attn [batch_size,Chinese seq_len,English seq_len]
        output, attn = self.attention(output_seq, ctx, mask)
        output = F.log_softmax(self.out(output), dim=-1) # batch_size, output_len, vocab_size
        return output, hid, attn

To improve results further, a mask is applied inside the Attention layer; it has shape:
$[\text{BatchSize}, \text{ChineseSeqLen}, \text{EnglishSeqLen}]$
Take the $s$-th Seq of a batch as an example (suppose its English sentence has n tokens, $n < \text{EnglishSeqLen}$) and pick the $i$-th Chinese token of that Seq (row $i$ of the matrix mask[s]): only the first n elements, the real English tokens, are valid. The remaining positions are mapped onto $score(o_{c},o_{s})$ and the $score$ there is set to a very large negative number (-1e6 in the code), so that softmax assigns them a weight of practically zero. When $a(score)$ is computed, this strengthens the correspondence between the Chinese and English sentences and "focuses" the attention on the English tokens that actually belong to the Seq;
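
The masking step can be tried on a single row of scores. The values below are assumptions for illustration. Note that Tensor.masked_fill returns a new tensor and leaves its input unchanged, so its result has to be assigned (the in-place variant is masked_fill_).

```python
import torch
import torch.nn.functional as F

score = torch.tensor([[1.0, 2.0, 3.0, 4.0]])            # one Chinese position, 4 English slots
invalid = torch.tensor([[False, False, True, True]])    # the last two slots are padding
original = score.clone()

filled = score.masked_fill(invalid, -1e6)    # out-of-place: score itself is not modified
weights = F.softmax(filled, dim=-1)

print(filled)
print(weights)    # padded slots receive zero weight, the real slots share all of it
```

Without the fill, the two padded slots would hold the largest scores in this toy row and attract most of the attention weight.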

Combine the Encoder and the new Decoder into a Seq2Seq model:

class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, x, x_lengths, y, y_lengths):
        encoder_out, hid = self.encoder(x, x_lengths)
        output, hid, attn = self.decoder(ctx=encoder_out, ctx_lengths=x_lengths, y=y, y_lengths=y_lengths, hid=hid)
        return output

    def translate(self, x, x_lengths, y, max_length=100):
        encoder_out, hid = self.encoder(x, x_lengths)
        preds = []
        batch_size = x.shape[0]
        attns = []
        for i in range(max_length):
            output, hid, attn = self.decoder(ctx=encoder_out, ctx_lengths=x_lengths, y=y, y_lengths=torch.ones(batch_size).long().to(y.device), hid=hid)
            # greedy choice: the highest-scoring word becomes the next input
            y = output.max(2)[1].view(batch_size, 1)
            preds.append(y)
            attns.append(attn)
        return torch.cat(preds, dim=1)

Instantiate the model:

dropout = 0.2
embed_size = hidden_size = 100
encoder = Encoder(vocab_size=en_total_words,embed_size=embed_size,enc_hidden_size=hidden_size,dec_hidden_size=hidden_size,dropout=dropout)
decoder = Decoder(vocab_size=cn_total_words,embed_size=embed_size,enc_hidden_size=hidden_size,dec_hidden_size=hidden_size,dropout=dropout)
model = Seq2Seq(encoder, decoder)
model = model.to(device)
loss_fn = LanguageModelCriterion().to(device)
optimizer = torch.optim.Adam(model.parameters())

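With attention added, the model's inputs and outputs are still the same as those of the original Encoder-Decoder, so the training, validation and translation functions defined earlier can be reused. Training proceeds as follows:

# suppress warnings
import warnings
warnings.filterwarnings("ignore")
train(model, train_data, num_epochs=20)
"""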
Epoch: 0 iter: 0 loss: 8.07270622253418
Epoch: 0 iter: 100 loss: 5.43229866027832
Epoch: 0 iter: 200 loss: 5.791189670562744
Evaluation loss 5.098033384093281
...
Evaluation loss 2.9741732850286744
...
Epoch: 19 iter: 0 loss: 1.8926399946212769
Epoch: 19 iter: 100 loss: 1.9622280597686768
Epoch: 19 iter: 200 loss: 3.3936874866485596
"""

Run machine translation:

for i in range(100,120):
    print("句子序号:",i)
    translate_dev(i)

The results are:

句子序号: 100
BOS you have nice skin . EOS
BOS 你的皮膚真好。 EOS
你不是否明白。
句子序号: 101
BOS you 're UNK correct . EOS
BOS 你部分正确。 EOS
你是个想的。
句子序号: 102
BOS everyone admired his courage . EOS
BOS 每個人都佩服他的勇氣。 EOS
每個人都都在家了。
句子序号: 103
BOS what time is it ? EOS
BOS 几点了？ EOS
它是什麼時候的？
句子序号: 104
BOS i 'm free tonight . EOS
BOS 我今晚有空。 EOS
我今晚了。
句子序号: 105
BOS here is your book . EOS
BOS 這是你的書。 EOS
这里有你的書。
句子序号: 106
BOS they are at lunch . EOS
BOS 他们在吃午饭。 EOS
他們在午餐。
…
句子序号: 119
BOS i made a mistake . EOS
BOS 我犯了一個錯。 EOS
我做了一個錯誤。
