Seq2seq初阶，最原始的模型

文章目录

正文
- 数据处理部分
- 构建seq2seq模型
- seq2seq
- 训练过程

正文

本文将使用PyTorch和torchtext构建一个机器学习模型，这里的例子是德语和英语之间的翻译。
以RNN为例，效果如图：

每个句子加了开始与结束标志
RNN接受序列后的输出送到下个RNN作为隐层，这部分叫编码，隐层输入到下个RNN，对于训练过程这个RNN的输入也是安排好的，叫做解码。
这里黄色的是embedding层，输入与输出的embedding层不一样。

对于语言来说我们需要建立一个叫 tokenizers 的东西，可以把句子分为token 。 e.g. “good morning!” 变为 [“good”, “morning”, “!”]

作者发现反转句子可能是有用的，可以把德语变为token后反转再复制。

还有 index，对每个token有不同的index
使用BucketIterator而不是标准 Iterator ，因为它创建批处理的方式使源语句和目标语句中的填充量最小化
使用迭代器得到实例时需要填充保证所有源语句padding到同样的长度，target也同样，但这里BucketIterator 实现了已经。

数据处理部分


import torch
import torch.nn as nn
import torch.optim as optimfrom torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIteratorimport spacy
import numpy as npimport random
import math
import timeSEED = 1234random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True#  spacy为每种语言提供了模型， 加载这些模型后可以访问每个模型的 tokenizers
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')# 创建tokenize 函数。它们可以传递给torchtext，
# 并将句子作为string接收，并将句子作为list返回。def tokenize_de(text):"""Tokenizes German text from a string into a list of strings (tokens) and reverses it"""return [tok.text for tok in spacy_de.tokenizer(text)][::-1]def tokenize_en(text):"""Tokenizes English text from a string into a list of strings (tokens)"""return [tok.text for tok in spacy_en.tokenizer(text)]# 德语作为source， 英语作为target， 将其都变为小写lowercase
SRC = Field(tokenize = tokenize_de, init_token = '<sos>', eos_token = '<eos>', lower = True)TRG = Field(tokenize = tokenize_en, init_token = '<sos>', eos_token = '<eos>', lower = True)print(f"Number of training examples:{len(train_data.examples)}")
print(f"Number of validation examples:{len(valid_data.examples)}")
print(f"Number of testing examples:{len(test_data.examples)}")print(vars(train_data.examples[0])) # 输出的是倒叙的德语和正序英语>>
>> Number of training examples: 29000
>> Number of validation examples: 1014
>> Number of testing examples: 1000
>> {'src': ['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}# 这就是构建索引吧，将每个token和索引关联 这里还用了minfreq参数，出现次数小于2次的token被转换为<unk>标记
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)print(f"Unique tokens in source (de) vocabulary:{len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary:{len(TRG.vocab)}")>> Unique tokens in source (de) vocabulary: 7853
>> Unique tokens in target (en) vocabulary: 5893# 准备数据的最后一步是创建迭代器,返回的数据带有src属性(就是数字化的源语句)还有trg属性(就是数字化的目标语句)
# 数字化就是建立索引那一步.# 使用BucketIterator而不是标准 Iterator ，因为它创建批处理的方式使源语句和目标语句中的填充量最小化
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')BATCH_SIZE = 128train_iterator, valid_iterator, test_iterator = BucketIterator.splits((train_data, valid_data, test_data), batch_size = BATCH_SIZE, device = device)

构建seq2seq模型

encoder 是这样, 输入有结束和开始标志

# 这里使用的是最后的隐藏状态
class Encoder(nn.Module):def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):super().__init__()self.hid_dim = hid_dimself.n_layers = n_layersself.embedding = nn.Embedding(input_dim, emb_dim)self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)self.dropout = nn.Dropout(dropout)def forward(self, src):# src = [src len, batch size]embedded = self.dropout(self.embedding(src))# embedded = [src len, batch size, emb dim]outputs, (hidden, cell) = self.rnn(embedded)# outputs = [src len, batch size, hid dim * n directions]# hidden = [n layers * n directions, batch size, hid dim]# cell = [n layers * n directions, batch size, hid dim]# outputs are always from the top hidden layerreturn hidden, cell# 这里每次只解码一个token,所以seq是1
class Decoder(nn.Module):def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):super().__init__()self.output_dim = output_dimself.hid_dim = hid_dimself.n_layers = n_layersself.embedding = nn.Embedding(output_dim, emb_dim)self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)self.fc_out = nn.Linear(hid_dim, output_dim)self.dropout = nn.Dropout(dropout)def forward(self, input, hidden, cell):# input = [batch size]# hidden = [n layers * n directions, batch size, hid dim]# cell = [n layers * n directions, batch size, hid dim]# n directions in the decoder will both always be 1, therefore:# hidden = [n layers, batch size, hid dim]# context = [n layers, batch size, hid dim]input = input.unsqueeze(0)# input = [1, batch size]embedded = self.dropout(self.embedding(input))# embedded = [1, batch size, emb dim]output, (hidden, cell) = self.rnn(embedded, (hidden, cell))# output = [seq len, batch size, hid dim * n directions]# hidden = [n layers * n directions, batch size, hid dim]# cell = [n layers * n directions, batch size, hid dim]# seq len and n directions will always be 1 in the decoder, therefore:# output = [1, batch size, hid dim]# hidden = [n layers, batch size, hid dim]# cell = [n layers, batch size, hid dim]prediction = self.fc_out(output.squeeze(0))# prediction = [batch size, output dim]return prediction, hidden, cell

seq2seq

# 这部分实现要确保编码器和解码器的层数和hidden还有cell的维度要相等,首先要建立一个向量Y存储所有的预测项
# 解码器这部分是有一个循环:
# -将输入、先前隐藏和先前单元状态传递到解码器中
# -从解码器接收预测、下一个隐藏状态和下一个单元状态
# -把我们的预测y放在Y中
# - 有个参数**teacher force**
# 如果我们这样做了，下一个输入就是序列中的下一个基本真值标记，y/trg[t]
# 否则，下一个输入是序列中预测的下一个标记，y/top1，我们通过对输出张量执行argmax得到# 我们的解码器循环从1开始，而不是0。这意味着输出张量的第0个元素保持为全零。
# 输入到decoder的最后一个是eos之前的值, eos不被输入到decoder
# 所以我们的trg和输出看起来像下图:

# 稍后当我们计算损耗时，我们切下每个张量的第一个元素，得到

class Seq2Seq(nn.Module):def __init__(self, encoder, decoder, device):super().__init__()self.encoder = encoderself.decoder = decoderself.device = deviceassert encoder.hid_dim == decoder.hid_dim, \"Hidden dimensions of encoder and decoder must be equal!"assert encoder.n_layers == decoder.n_layers, \"Encoder and decoder must have equal number of layers!"def forward(self, src, trg, teacher_forcing_ratio=0.5):# src = [src len, batch size]# trg = [trg len, batch size]# teacher_forcing_ratio is probability to use teacher forcing# e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the timebatch_size = trg.shape[1]trg_len = trg.shape[0]trg_vocab_size = self.decoder.output_dim# tensor to store decoder outputsoutputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)# last hidden state of the encoder is used as the initial hidden state of the decoderhidden, cell = self.encoder(src)# first input to the decoder is the <sos> tokensinput = trg[0, :]for t in range(1, trg_len):# insert input token embedding, previous hidden and previous cell states# receive output tensor (predictions) and new hidden and cell statesoutput, hidden, cell = self.decoder(input, hidden, cell)# place predictions in a tensor holding predictions for each tokenoutputs[t] = output# decide if we are going to use teacher forcing or notteacher_force = random.random() < teacher_forcing_ratio# get the highest predicted token from our predictionstop1 = output.argmax(1)# if teacher forcing, use actual next token as next input# if not, use predicted tokeninput = trg[t] if teacher_force else top1return outputs

训练过程

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)model = Seq2Seq(enc, dec, device).to(device)


# 从-0.08到+0.08之间的均匀分布初始化所有权重
def init_weights(m):for name, param in m.named_parameters():nn.init.uniform_(param.data, -0.08, 0.08)model.apply(init_weights)# 定义一个函数来计算模型中可训练参数的数量。
def count_parameters(model):return sum(p.numel() for p in model.parameters() if p.requires_grad)print(f'The model has{count_parameters(model):,}trainable parameters')optimizer = optim.Adam(model.parameters())# 使用CrossEntropyLoss, 如果是<pad>部分就忽略TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

# training loop
# 当我们计算损失时，我们切断每个张量的第一个元素，得到

'''
在每次迭代中：
从批中获取源句子和目标句子，然后
将最后一批计算的梯度归零
将源和目标输入模型以获得输出，
由于损失函数仅适用于具有1d目标的2d输入，因此我们需要使用.view将每个目标展平
我们切掉输出和目标张量的第一列，如上所述
用loss.backward（）计算梯度
clip the gradients 以防止其爆炸（RNNs中的常见问题）
通过执行优化器步骤来更新模型的参数
将损失值加总成
'''

Seq2seq初阶，最原始的模型相关推荐

学习笔记：C++初阶【C++入门、类和对象、C/C++内存管理、模板初阶、STL简介、string、vector、list、stack、queueu、模板进阶、C++的IO流】
文章目录前言一.C++入门 1. C++关键字 2.命名空间 2.1 C语言缺点之一,没办法很好地解决命名冲突问题 2.2 C++提出了一个新语法--命名空间 2.2.1 命名空间概念 2.2.2 ...
python如何给某列数据打标签_Python map, apply, transform 打标签方法汇总(初阶到高阶)...
前言根据数据的某列进行打标签这个操作在数据分析领域极度常用,对于一些较为复杂的打标签方法,Python 与 SQL 都能很好的实现,这篇针对 Python,主要用到 map,apply 与 tran ...
《JavaEE初阶》HTTP协议和HTTPS
<JavaEE初阶>HTTP协议和HTTPS 文章目录 <JavaEE初阶>HTTP协议和HTTPS HTTP协议是应用层协议: 使用Fiddler抓取HTTP请求和响应: F ...
牛津初阶字典精华总结
本总结主要针对人群为能力在CET4.6一般上下读者.因Time is invaluable,如你是英语专业8以上或大学四级以下者,也许该套总结层次不适合你.该总结从初阶字典里吸取了鄙人自认为精华相应部 ...
SAP WM初阶之事务代码LS22 - Change Quant
SAP WM初阶事务代码LS22 - Change Quant SAP WM模块中,事务代码LS22可以用于修改Quant中的数据. SAP WM中Quant(中文是份)是一个很特殊的概念,用于代表在 ...
R语言学习笔记——入门篇：第三章-图形初阶
R语言 R语言学习笔记--入门篇:第三章-图形初阶文章目录 R语言一.使用图形 1.1.基础绘图函数:plot( ) 1.2.图形控制函数:dev( ) 补充--直方图函数:hist( ) 补充- ...
ERP初阶（九）：ERP的主要功能模块简介
ERP初阶(九):ERP的主要功能模块简介 (by AMT ERP研究小组) 对于ERP原理的初步了解是我们认识ERP的基础,接下来,为了使大家对ERP产品有一个感性的认识,我们就要从它的功能上再给予 ...
readxmls r语言_R语言实战（一）介绍、数据集与图形初阶
本文对应<R语言实战>前3章,因为里面大部分内容已经比较熟悉,所以在这里只是起一个索引的作用. 第1章 R语言介绍获取帮助函数 help(), ? 查看函数帮助 exampl ...
Eclipse开发初阶之Eclipse多国语言包安装
Eclipse开发初阶之Eclipse多国语言包安装 2008-02-01 13:07:13  标签:Eclipse 开发多国语言包 Java 安装 ...

Seq2seq初阶，最原始的模型

文章目录

正文

数据处理部分

构建seq2seq模型

seq2seq

训练过程

Seq2seq初阶，最原始的模型相关推荐

最新文章

热门文章

Seq2seq初阶， 最原始的模型

文章目录

正文

数据处理部分

构建seq2seq模型

seq2seq

训练过程

Seq2seq初阶， 最原始的模型相关推荐

最新文章

热门文章

Seq2seq初阶，最原始的模型

Seq2seq初阶，最原始的模型相关推荐