This project implements three features:

1. Continuing a five-character quatrain (given its first line)

2. Continuing a seven-character quatrain (given its first line)

3. Writing a five-character acrostic poem (given the four head characters)

I already spent a long time on this as my final project for Intro to Computer Science, so I won't rehash it here; for an overview of the content and the implementation approach, see the slides from the final presentation:

https://docs.google.com/presentation/d/1DFy3VwAETeqK0QFsokeBpDwyVkMavOjpckQKpc6XPTI/edit#slide=id.gb037c6e317_2_312

Training data source:

https://github.com/BraveY/AI-with-code/tree/master/Automatic-poem-writing

Data preprocessing for five-character poems (the seven-character version is similar, so that code is omitted):

import os
import numpy as np
from copy import deepcopy

all_data = np.load("tang.npz", allow_pickle=True)
dataset = all_data["data"]
word2ix = all_data["word2ix"].item()
ix2word = all_data["ix2word"].item()
print(len(word2ix))
print(dataset.shape)

# decode every padded index sequence back to characters
l = dataset.tolist()
poems = [[None for i in range(125)] for j in range(57580)]
for i in range(57580):
    for j in range(125):
        poems[i][j] = ix2word[l[i][j]]

# keep only well-formed five-character quatrains: exactly one <START>/<EOP> pair,
# a 48-character body (two quatrains of 24 characters incl. punctuation),
# with ',' and '。' at the expected positions
data = list()
for i in range(57580):
    s = 0
    e = 0
    for ele in poems[i]:
        if ele == '<START>':
            s += 1
        if ele == '<EOP>':
            e += 1
    if s == 1 and e == 1:
        st = poems[i].index('<START>')
        ed = poems[i].index('<EOP>')
        if (ed - st - 1) % 6 == 0 and poems[i][st + 6] == ',' and (ed - st - 1) == 48:
            # five-character poem, 4 lines per quatrain
            line = poems[i][st + 1:ed]
            for j in range(0, len(line), 24):
                cur = line[j:j + 24]
                if cur[5] == ',' and cur[11] == '。' and cur[17] == ',' and cur[23] == '。':
                    data.append(cur)

# for ele in data:
#     print(ele)
print(len(data))

# strip the punctuation, leaving 20 characters per quatrain, and map back to indices
t = list()
for i, line in enumerate(data):
    print(i, line)
    words = line[0:5] + line[6:11] + line[12:17] + line[18:23]
    nums = [word2ix[words[i]] for i in range(len(words))]
    t.append(nums)
t = np.array(t)
print(t.shape)

# drop the first 77 samples so the count (29696) is divisible by the batch size of 128
t = t[77:]

# labels are the inputs shifted left by one step (next-character prediction);
# the last position has no successor, so it is padded with 0
labels = deepcopy(t)
for i in range(29696):
    for j in range(20):
        if j < 19:
            labels[i][j] = labels[i][j + 1]
        else:
            labels[i][j] = 0

np.save("train_x.npy", t)
np.save("train_y.npy", labels)

Training for five-character poems (the seven-character version is similar, so that code is omitted):

# -*- coding: utf-8 -*-
"""54.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1ZdPq71-40K5tGK__OPcUe84mfopWwJP-
"""

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from copy import deepcopy

VOCAB_SIZE = 8293

all_data = np.load("/content/drive/My Drive/Deep Learning/LSTM - Chinese Poem Writer/tang.npz", allow_pickle=True)
word2ix = all_data["word2ix"].item()
ix2word = all_data["ix2word"].item()


class Model(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(Model, self).__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, self.hidden_dim, num_layers=3)
        self.out = nn.Linear(self.hidden_dim, vocab_size)

    def forward(self, input, hidden=None):
        seq_len, batch_size = input.shape
        if hidden is None:
            h_0 = torch.zeros(3, batch_size, self.hidden_dim).cuda()
            c_0 = torch.zeros(3, batch_size, self.hidden_dim).cuda()
        else:
            h_0, c_0 = hidden
        embed = self.embedding(input)
        # embed_size = (seq_len, batch_size, embedding_dim)
        output, hidden = self.lstm(embed, (h_0, c_0))
        # output_size = (seq_len, batch_size, hidden_dim)
        output = output.reshape(seq_len * batch_size, -1)
        # output_size = (seq_len * batch_size, hidden_dim)
        output = self.out(output)
        # output_size = (seq_len * batch_size, vocab_size)
        return output, hidden


train_x = np.load("/content/drive/My Drive/Deep Learning/LSTM - Chinese Poem Writer/train_x_54.npy")
train_y = np.load("/content/drive/My Drive/Deep Learning/LSTM - Chinese Poem Writer/train_y_54.npy")

# inspect one sample: the label sequence is the input shifted left by one
id = 253
print(train_x.shape)
print(train_y.shape)
print(train_x[id])
print(train_y[id])

a = [None for i in range(20)]
b = [None for j in range(20)]
for j in range(20):
    a[j] = ix2word[train_x[id][j]]
    b[j] = ix2word[train_y[id][j]]
b[19] = 'Null'
print(a)
print(b)


class PoemWriter():
    def __init__(self):
        self.model = Model(VOCAB_SIZE, 64, 128)
        self.lr = 1e-3
        self.epochs = 0
        self.seq_len = 20
        self.batch_size = 128
        self.opt = optim.Adam(self.model.parameters(), lr=self.lr)

    def train(self, epochs):
        self.epochs = epochs
        self.model = self.model.cuda()
        criterion = nn.CrossEntropyLoss()
        all_losses = []
        for epoch in range(self.epochs):
            print("Epoch:", epoch + 1)
            total_loss = 0
            for i in range(0, train_x.shape[0], self.batch_size):
                # print(i, i + self.batch_size)
                cur_x = torch.from_numpy(train_x[i:i + self.batch_size])
                cur_y = torch.from_numpy(train_y[i:i + self.batch_size])
                # the LSTM expects (seq_len, batch_size)
                cur_x = torch.transpose(cur_x, 0, 1).long()
                cur_y = torch.transpose(cur_y, 0, 1).long()
                cur_y = cur_y.reshape(self.seq_len * self.batch_size, -1).squeeze(1)
                cur_x, cur_y = cur_x.cuda(), cur_y.cuda()
                pred, _ = self.model.forward(cur_x)
                loss = criterion(pred, cur_y)
                self.opt.zero_grad()
                loss.backward()
                self.opt.step()
                total_loss += loss.item()
            print("Loss:", total_loss)
            all_losses.append(total_loss)
        self.model = self.model.cpu()
        plt.plot(all_losses, 'r')

    def write(self, string="空山新雨后"):
        # encode the 5-character first line, then append 15 placeholder slots
        inp = []
        for c in string:
            inp.append(word2ix[c])
        inp = torch.from_numpy(np.array(inp)).unsqueeze(1).long()
        # print(inp.shape, inp)
        tmp = torch.zeros(15, 1).long()
        inp = torch.cat([inp, tmp], dim=0)
        inp = inp.cuda()
        self.model = self.model.cuda()
        # greedy decoding: slot tim + 5 gets the prediction made at position tim + 4
        for tim in range(15):
            pred, _ = self.model.forward(inp)
            pred = torch.argmax(pred, dim=1)
            inp[tim + 5] = pred[tim + 4]
        ans = list()
        for i in range(20):
            ans.append(ix2word[inp[i].item()])
        # re-insert punctuation after every 5 characters
        out = ""
        for i in range(20):
            out += ans[i]
            if i == 4 or i == 14:
                out += ','
            if i == 9 or i == 19:
                out += '。'
        print(out)


torch.cuda.get_device_name()

Waner = PoemWriter()
Waner.train(1000)
Waner.write()
torch.save(Waner.model, '/content/drive/My Drive/Deep Learning/LSTM - Chinese Poem Writer/model_54.pkl')

while True:
    st = input()
    if st == "-1":
        break
    Waner.write(st)
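
write decodes greedily with argmax, so a given first line always produces the same poem. One optional variation (not part of the original project) is temperature sampling; a minimal sketch of the idea, reusing the logits that Model.forward returns:

import torch
import torch.nn.functional as F

def sample_next(logits, temperature=0.8):
    # draw one index from the temperature-scaled softmax
    # instead of always taking the argmax
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

Replacing pred = torch.argmax(pred, dim=1) with a call like sample_next(pred[tim + 4]) inside the decoding loop would make the output vary from run to run.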

Final model:

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from copy import deepcopy

VOCAB_SIZE = 8293

all_data = np.load("tang.npz", allow_pickle=True)
word2ix = all_data["word2ix"].item()
ix2word = all_data["ix2word"].item()


# Model must be defined exactly as at training time so torch.load can unpickle the checkpoints
class Model(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(Model, self).__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, self.hidden_dim, num_layers=3)
        self.out = nn.Linear(self.hidden_dim, vocab_size)

    def forward(self, input, hidden=None):
        seq_len, batch_size = input.shape
        if hidden is None:
            h_0 = torch.zeros(3, batch_size, self.hidden_dim).cuda()
            c_0 = torch.zeros(3, batch_size, self.hidden_dim).cuda()
        else:
            h_0, c_0 = hidden
        embed = self.embedding(input)
        # embed_size = (seq_len, batch_size, embedding_dim)
        output, hidden = self.lstm(embed, (h_0, c_0))
        # output_size = (seq_len, batch_size, hidden_dim)
        output = output.reshape(seq_len * batch_size, -1)
        # output_size = (seq_len * batch_size, hidden_dim)
        output = self.out(output)
        # output_size = (seq_len * batch_size, vocab_size)
        return output, hidden


class Writer():
    def __init__(self):
        self.model_74 = torch.load('model_74.pkl')
        self.model_54 = torch.load('model_54.pkl')
        # self.lr = 1e-3
        self.epochs = 0
        self.seq_len = 28
        self.batch_size = 128
        # self.opt = optim.Adam(self.model.parameters(), lr=self.lr)

    def write_74(self, string="锦瑟无端五十弦"):
        inp = []
        for c in string:
            inp.append(word2ix[c])
        inp = torch.from_numpy(np.array(inp)).unsqueeze(1).long()
        # print(inp.shape, inp)
        tmp = torch.zeros(21, 1).long()
        inp = torch.cat([inp, tmp], dim=0)
        inp = inp.cuda()
        self.model_74 = self.model_74.cuda()
        for tim in range(21):
            pred, _ = self.model_74.forward(inp)
            pred = torch.argmax(pred, dim=1)
            inp[tim + 7] = pred[tim + 6]
        ans = list()
        for i in range(28):
            ans.append(ix2word[inp[i].item()])
        out = ""
        for i in range(28):
            out += ans[i]
            if i == 6 or i == 20:
                out += ','
            if i == 13 or i == 27:
                out += '。'
        return out

    def write_54(self, string="空山新雨后"):
        inp = []
        for c in string:
            inp.append(word2ix[c])
        inp = torch.from_numpy(np.array(inp)).unsqueeze(1).long()
        # print(inp.shape, inp)
        tmp = torch.zeros(15, 1).long()
        inp = torch.cat([inp, tmp], dim=0)
        inp = inp.cuda()
        self.model_54 = self.model_54.cuda()
        for tim in range(15):
            pred, _ = self.model_54.forward(inp)
            pred = torch.argmax(pred, dim=1)
            inp[tim + 5] = pred[tim + 4]
        ans = list()
        for i in range(20):
            ans.append(ix2word[inp[i].item()])
        out = ""
        for i in range(20):
            out += ans[i]
            if i == 4 or i == 14:
                out += ','
            if i == 9 or i == 19:
                out += '。'
        return out

    def acrostic(self, string="为尔心悦"):
        inp = torch.zeros(20, 1).long()
        inp = inp.cuda()
        self.model_54 = self.model_54.cuda()
        # positions 0, 5, 10, 15 are fixed to the four given head characters;
        # every other position takes the model's prediction for the previous one
        for i in range(20):
            if i == 0 or i == 5 or i == 10 or i == 15:
                inp[i] = word2ix[string[i // 5]]
            else:
                inp[i] = pred[i - 1]
            pred, _ = self.model_54.forward(inp)
            pred = torch.argmax(pred, dim=1)
        ans = list()
        for i in range(20):
            ans.append(ix2word[inp[i].item()])
        out = ""
        for i in range(20):
            out += ans[i]
            if i == 4 or i == 14:
                out += ','
            if i == 9 or i == 19:
                out += '。'
        return out

    def task(self, string):
        l = len(string)
        try:
            if l == 4:
                return self.acrostic(string)
            elif l == 5:
                return self.write_54(string)
            elif l == 7:
                return self.write_74(string)
            else:
                return "I don't know how to write one...QAQ"
        except KeyError:
            # a character outside the vocabulary has no index in word2ix
            return "I don't know how to write one...QAQ"


'''
a = torch.ones(5)
a[1] = 4
print(a)
'''
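
task dispatches on the input length; a minimal usage sketch (assuming model_54.pkl and model_74.pkl are present and a GPU is available):

writer = Writer()
print(writer.task("空山新雨后"))      # 5 characters: continue a five-character quatrain
print(writer.task("锦瑟无端五十弦"))  # 7 characters: continue a seven-character quatrain
print(writer.task("为尔心悦"))        # 4 characters: five-character acrostic poem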

Test examples:

Rhyme is not handled explicitly, so the rhymes are mostly off, but the model does learn some imagery and can evoke a certain mood. A better word embedding might improve performance further (the current embedding is randomly initialized by PyTorch and trained from scratch).
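
For instance, the nn.Embedding layer could be initialized from pretrained vectors instead of random weights. A minimal sketch, assuming a hypothetical pretrained.npy holding a (VOCAB_SIZE, embedding_dim) matrix whose rows are aligned with word2ix:

import numpy as np
import torch
import torch.nn as nn

vectors = np.load("pretrained.npy")  # hypothetical pretrained matrix, one row per vocabulary index
embedding = nn.Embedding.from_pretrained(torch.from_numpy(vectors).float(),
                                         freeze=False)  # freeze=False keeps fine-tuning it during training
# then swap in the layer before training: model.embedding = embedding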
