“There is no rule on how to write. Sometimes it comes easily and perfectly: sometimes it’s like drilling rock and then blasting it out with charges” — Ernest Hemingway
The aim of this blog is to explain the building of an end-to-end model for text generation by implementing a powerful architecture based on LSTMs.
The blog is divided into the following sections:
You can find the complete code at: https://github.com/FernandoLpz/Text-Generation-BiLSTM-PyTorch
Over the years, various proposals have been made to model natural language, but what does the idea of “modeling natural language” actually refer to? We could think of it as reasoning about the semantics and syntax that make up a language. In essence it is, but it goes further.
Nowadays, the field of Natural Language Processing (NLP) deals with different tasks that involve reasoning about, understanding and modeling language through different methods and techniques. The field has grown extremely fast over the past decade, and plenty of models have been proposed to solve different NLP tasks from different perspectives. The common denominator among the most popular proposals is the use of Deep Learning based models.
As already mentioned, the NLP field addresses a huge number of problems. In this blog we will address the problem of text generation with deep learning based models, namely the recurrent neural networks LSTM and Bi-LSTM. We will build them with PyTorch, one of the most sophisticated deep learning frameworks available today, using its LSTMCell class to develop the proposed architecture.
If you want to dig into the mechanics of the LSTM, as well as how it is implemented in PyTorch, take a look at this amazing explanation: From a LSTM Cell to a Multilayer LSTM Network with PyTorch
Problem statement
Given a text, a neural network will be fed with character sequences in order to learn the semantics and syntax of the given text. Subsequently, a sequence of characters will be taken at random and the next character will be predicted.
So, let’s get started!
Text preprocessing
First, we need a text to work with. There are different resources where you can find texts in plain text format; I recommend taking a look at Project Gutenberg.
In this case, I will use the book called Jack Among the Indians by George Bird Grinnell, the one you can find here: link to the book. So, the first lines of chapter 1 look like:
The train rushed down the hill, with a long shrieking whistle, and then began to go more and more slowly. Thomas had brushed Jack off and thanked him for the coin that he put in his hand, and with the bag in one hand and the stool in the other now went out onto the platform and down the steps, Jack closely following.
As you can see, the text contains uppercase and lowercase letters, line breaks, punctuation marks, etc. It is advisable to adapt the text to a form that is easier to handle and, above all, that reduces the complexity of the model we are going to develop. So we are going to transform each character to its lowercase form. It is also advisable to handle the text as a list of characters, that is, instead of having a “big string of characters”, we will have a list of characters. Having the text as a sequence of characters makes it easier to generate the sequences which the model will be fed with (we will see this in detail in the next section).
So let’s do it!
As we can see, in line 2 we define the characters to be used; all other symbols are discarded and only the “white space” symbol is kept. In lines 6 and 10 we read the raw file and transform it into its lowercase form. In the loops of lines 14 and 19 we create a string which represents the entire book and generate a list of characters. In line 23 we filter the text list, keeping only the characters defined in line 2.
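The line numbers above refer to the author's original snippet in the repository linked earlier. As a rough stand-in, here is a minimal sketch of the same preprocessing steps. The function name `preprocess` and the choice of vocabulary (the 26 lowercase letters plus the space) are assumptions made for illustration, and the sketch works on a string rather than reading a file so that it stays self-contained.

```python
import string

# Assumption: the kept characters are the 26 lowercase letters plus the space.
ALLOWED = set(string.ascii_lowercase + " ")


def preprocess(raw_text):
    """Lowercase the text, join its lines with spaces and keep allowed characters."""
    lines = raw_text.lower().splitlines()
    # Join the non-empty lines into one big string that represents the book.
    joined = " ".join(line.strip() for line in lines if line.strip())
    # Turn the string into a list of characters, dropping every other symbol.
    return [ch for ch in joined if ch in ALLOWED]


text = preprocess("The train rushed\ndown the hill.")
print("".join(text))  # the train rushed down the hill
```

To process a whole book, the content of the file could be read into a string first and then passed to `preprocess`.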
So, once the text is loaded and preprocessed, we will go from having a text like this:
text = "The train rushed down the hill."
to having a list of characters like this:
text = ['t','h','e',' ','t','r','a','i','n',' ','r','u','s','h','e','d',' ','d','o','w','n',
' ','t','h','e',' ','h','i','l','l']
Well, we already have the full text as a list of characters. As is well known, we cannot feed raw characters directly to a neural network; we need to transform each character into a numerical representation. For this, we are going to create a pair of dictionaries that store the “character-index” and “index-character” mappings.
So, let’s do it!
As we can notice, in lines 11 and 12 the “char-index” and “index-char” dictionaries are created.
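The two dictionaries can be sketched as follows. This is a minimal illustration, not the author's snippet: the function name `build_vocab` is hypothetical, and sorting the characters (so that the indices are reproducible) is an assumption.

```python
def build_vocab(text):
    """Build the char-index and index-char dictionaries from a list of characters."""
    # Assumption: characters are sorted so that the mapping is deterministic.
    chars = sorted(set(text))
    char_to_idx = {ch: idx for idx, ch in enumerate(chars)}
    idx_to_char = {idx: ch for ch, idx in char_to_idx.items()}
    return char_to_idx, idx_to_char


text = list("the hill")
char_to_idx, idx_to_char = build_vocab(text)

# Encode the text and decode it back to check that both mappings agree.
encoded = [char_to_idx[ch] for ch in text]
decoded = "".join(idx_to_char[idx] for idx in encoded)
print(encoded)
print(decoded)
```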
So far we have shown how to load the text and save it as a list of characters, and we have created a couple of dictionaries that will help us to encode and decode each character. Now it is time to see how we generate the sequences that will be fed to the model. So, let’s go to the next section!
Sequence generation
The way in which the sequences are generated depends entirely on the type of model that we are going to implement. As already mentioned, we will use recurrent neural networks of the LSTM type, which receive data sequentially (time steps).
For our model, we need to form sequences of a given length, which we will call the “window”. Each sequence is made up of the characters included in the window, and the character to predict (the target) is always the character that follows the window. To form the next sequence, the window is slid one character to the right. We can clearly see this process in Figure 1.
Well, so far we have seen how to generate the character sequences in a simple way. Now we need to transform each character to its respective numerical format; for this we will use the dictionary generated in the preprocessing phase. This process can be visualized in Figure 2.
Great, now we know how to generate the character sequences using a window that slides one character at a time and how to transform the characters into a numeric format. The following code snippet shows the process described.
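A minimal sketch of the sliding window, assuming plain Python lists for the sequences and targets. The name `build_sequences` is hypothetical and the tiny text and window size are only for illustration.

```python
def build_sequences(text, char_to_idx, window):
    """Slide a window over the text; the target is the character after the window."""
    x, y = [], []
    for start in range(len(text) - window):
        sequence = text[start:start + window]
        target = text[start + window]
        # Encode the characters with the char-index dictionary.
        x.append([char_to_idx[ch] for ch in sequence])
        y.append(char_to_idx[target])
    return x, y


text = list("the train")
char_to_idx = {ch: idx for idx, ch in enumerate(sorted(set(text)))}

x, y = build_sequences(text, char_to_idx, window=4)
print(len(x))  # 5
print(x[0], y[0])
```

With a text of 9 characters and a window of 4, the window can be placed in 5 positions, so 5 sequence-target pairs are produced. The first sequence is "the " and its target is the following "t".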
Fantastic, now we know how to preprocess raw text, how to transform it into a list of characters and how to generate sequences in a numeric format. Now we go to the most interesting part, the model architecture.
Model architecture
As you already read in the title of this blog, we are going to make use of Bi-LSTM recurrent neural networks and standard LSTMs. Essentially, we make use of this type of neural network due to its great potential when working with sequential data, such as the case of text-type data. Likewise, there are a large number of articles that refer to the use of architectures based on recurrent neural networks (e.g. RNN, LSTM, GRU, Bi-LSTM, etc.) for text modeling, specifically for text generation [1, 2].
The architecture of the proposed neural network consists of an embedding layer followed by a Bi-LSTM and an LSTM layer. The LSTM layer is then connected to a linear layer.
The methodology consists of passing each sequence of characters to the embedding layer, which generates a vector representation for each element of the sequence, so that we end up with a sequence of embedded characters. Each element of this sequence is then passed to the Bi-LSTM layer, and the outputs of the two LSTMs that make up the Bi-LSTM (the forward LSTM and the backward LSTM) are concatenated. Right after, each forward + backward concatenated vector is passed to the LSTM layer, from which the last hidden state is taken to feed the linear layer. The output of this linear layer is passed through a Softmax function in order to represent the probability of each character. Figure 3 shows the described methodology.
Fantastic, so far we have explained the architecture of the model for text generation as well as the implemented methodology. Now we need to know how to do all this with the PyTorch framework. But first, I would like to briefly explain how the Bi-LSTM and the LSTM work together, so let’s see how a Bi-LSTM network works.
Bi-LSTM & LSTM
The key difference between a standard LSTM and a Bi-LSTM is that the Bi-LSTM is made up of 2 LSTMs, better known as the “forward LSTM” and the “backward LSTM”. Basically, the forward LSTM receives the sequence in the original order, while the backward LSTM receives the sequence in reverse. Depending on the task, we can then either join the hidden states of both LSTMs at every time step or use only the last hidden state of each LSTM. In the proposed model, we join both hidden states at each time step.
Perfect, now we understand the key difference between a Bi-LSTM and an LSTM. Going back to the example we are developing, Figure 4 represents the evolution of each sequence of characters when they are passed through the model.
Great, once everything about the interaction between Bi-LSTM and LSTM is clear, let’s see how we do this in code using only LSTMCells from the great PyTorch framework.
So, first let’s understand how we make the constructor of the TextGenerator class, let’s take a look at the following code snippet:
As we can see, from lines 6 to 10 we define the parameters that we will use to initialize each layer of the neural network. It is important to mention that input_size is equal to the size of the vocabulary (that is, the number of elements that our dictionary generated in the preprocessing contains). Likewise, the number of classes to be predicted is also the same size as the vocabulary and sequence_length refers to the size of the window.
On the other hand, in lines 20 and 21 we define the two LSTMCells that make up the Bi-LSTM (forward and backward). In line 24 we define the LSTMCell that will be fed with the output of the Bi-LSTM. It is important to mention that its hidden state size is double that of the Bi-LSTM cells, because the outputs of the forward and backward LSTMs are concatenated. Later, on line 27, we define the linear layer, whose output will be passed through the softmax function.
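The line numbers above refer to the author's original snippet. A minimal sketch of a constructor with the layers described could look like this. The constructor signature and the attribute names are assumptions for illustration; `nn.Embedding`, `nn.LSTMCell` and `nn.Linear` are the standard PyTorch classes.

```python
import torch.nn as nn


class TextGenerator(nn.Module):
    def __init__(self, vocab_size, hidden_dim, sequence_len):
        super().__init__()
        # Parameters used to initialize each layer.
        self.hidden_dim = hidden_dim
        self.input_size = vocab_size    # size of the vocabulary
        self.num_classes = vocab_size   # one class per character
        self.sequence_len = sequence_len  # size of the window

        # Embedding layer: one vector of size hidden_dim per character.
        self.embedding = nn.Embedding(self.input_size, self.hidden_dim)

        # Bi-LSTM: one cell reads forward, the other reads backward.
        self.lstm_cell_forward = nn.LSTMCell(self.hidden_dim, self.hidden_dim)
        self.lstm_cell_backward = nn.LSTMCell(self.hidden_dim, self.hidden_dim)

        # LSTM fed with the concatenated forward + backward hidden states.
        self.lstm_cell = nn.LSTMCell(self.hidden_dim * 2, self.hidden_dim * 2)

        # Linear layer that maps the last hidden state to one score per character.
        self.linear = nn.Linear(self.hidden_dim * 2, self.num_classes)


model = TextGenerator(vocab_size=27, hidden_dim=128, sequence_len=100)
print(model)
```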
Once the constructor is defined, we need to create the tensors that will contain the cell state (cs) and hidden state (hs) for each LSTM. So, we proceed to do it as follows:
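As a minimal sketch, the state tensors can be created as zero tensors with one row per element of the batch. Starting from zeros is an assumption made here (other initializations are possible), and the helper name `init_states` is hypothetical.

```python
import torch


def init_states(batch_size, hidden_dim, device="cpu"):
    """Create the hidden state (hs) and cell state (cs) of each LSTM."""
    def zeros(dim):
        return torch.zeros(batch_size, dim, device=device)

    # Forward and backward LSTMs of the Bi-LSTM.
    hs_forward, cs_forward = zeros(hidden_dim), zeros(hidden_dim)
    hs_backward, cs_backward = zeros(hidden_dim), zeros(hidden_dim)
    # The LSTM layer works on concatenated vectors, hence the doubled size.
    hs_lstm, cs_lstm = zeros(hidden_dim * 2), zeros(hidden_dim * 2)

    return (hs_forward, cs_forward), (hs_backward, cs_backward), (hs_lstm, cs_lstm)


forward_state, backward_state, lstm_state = init_states(batch_size=64, hidden_dim=128)
print(forward_state[0].shape)  # torch.Size([64, 128])
print(lstm_state[0].shape)     # torch.Size([64, 256])
```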
Fantastic, once the tensors that will contain the hidden state and cell state have been defined, it is time to show how the assembly of the entire architecture is done, let’s go for it!
First, let’s take a look at the following code snippet:
For a better understanding, we are going to explain the assembly with some defined values, in such a way that we can understand how each tensor is passed from one layer to another. So say we have:
batch_size = 64
hidden_size = 128
sequence_len = 100
num_classes = 27
so the x input tensor will have a shape:
# torch.Size([batch_size, sequence_len])
x : torch.Size([64, 100])
Then, in line 2, the x tensor is passed through the embedding layer, so the output will have a size:
# torch.Size([batch_size, sequence_len, hidden_size])
x_embedded : torch.Size([64, 100, 128])
It is important to notice that in line 5 we are reshaping the x_embedded tensor. This is because we need to have the sequence length as the first dimension, since in the Bi-LSTM we will iterate over the time steps of the sequences, so the reshaped tensor will have a shape:
# torch.Size([sequence_len, batch_size, hidden_size])
x_embedded_reshaped : torch.Size([100, 64, 128])
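The two shapes above can be checked with a short sketch. Note the assumption: the sequence dimension is moved to the front with `transpose(0, 1)`, which swaps the batch and time axes while keeping each sequence intact. A plain `view` would yield the same shape but would reinterpret the underlying memory, so it is not used here.

```python
import torch
import torch.nn as nn

batch_size, sequence_len, hidden_size, vocab_size = 64, 100, 128, 27

embedding = nn.Embedding(vocab_size, hidden_size)

# A batch of encoded sequences: [batch_size, sequence_len]
x = torch.randint(0, vocab_size, (batch_size, sequence_len))

# [batch_size, sequence_len, hidden_size]
x_embedded = embedding(x)

# [sequence_len, batch_size, hidden_size], so we can iterate over time steps.
x_time_first = x_embedded.transpose(0, 1)

print(x_embedded.shape)    # torch.Size([64, 100, 128])
print(x_time_first.shape)  # torch.Size([100, 64, 128])
```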
Right after, in lines 7 and 8 the forward and backward lists are defined. There we will store the hidden states of the Bi-LSTM.
So it’s time to feed the Bi-LSTM. First, in line 12 we iterate over the sequence with the forward LSTM, saving the hidden state of each time step (hs_forward). In line 19 we iterate with the backward LSTM, again saving the hidden state of each time step (hs_backward). Notice that both loops run over the same sequence; the difference is that the second one reads it in reverse order. Each hidden state will have the following shape:
# hs_forward : torch.Size([batch_size, hidden_size])
hs_forward : torch.Size([64, 128])
# hs_backward : torch.Size([batch_size, hidden_size])
hs_backward: torch.Size([64, 128])
Great, now let’s see how to feed the last LSTM layer. For this, we make use of the forward and backward lists. In line 26 we iterate through the hidden states stored in forward and backward, which are concatenated in line 27. It is important to note that by concatenating both hidden states, the last dimension of the tensor doubles, that is, the tensor will have the following shape:
# input_tensor : torch.Size([batch_size, hidden_size * 2])
input_tensor : torch.Size([64, 256])
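Finally, the LSTM will return a hidden state with the same width as its input, since its hidden size is double that of the Bi-LSTM cells:
# last_hidden_state: torch.Size([batch_size, hidden_size * 2])
last_hidden_state: torch.Size([64, 256])
At the very end, the last hidden state of the LSTM is passed through the linear layer, as shown on line 31, which maps it to one score per character, that is, a tensor of size torch.Size([batch_size, num_classes]) = torch.Size([64, 27]). So, the complete forward function is shown in the following code snippet: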
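The line numbers in this walkthrough refer to the author's original snippet. Below is a minimal, self-contained sketch of the whole assembly under a few assumptions: the states start as zeros, the sequence dimension is moved to the front with `transpose`, and the backward list is put back into time order before the concatenation so that both hidden states belong to the same time step. The sketch returns raw scores (logits); the softmax is left to the loss function or to the generation step.

```python
import torch
import torch.nn as nn


class TextGenerator(nn.Module):
    def __init__(self, vocab_size, hidden_dim):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, hidden_dim)
        self.lstm_cell_forward = nn.LSTMCell(hidden_dim, hidden_dim)
        self.lstm_cell_backward = nn.LSTMCell(hidden_dim, hidden_dim)
        self.lstm_cell = nn.LSTMCell(hidden_dim * 2, hidden_dim * 2)
        self.linear = nn.Linear(hidden_dim * 2, vocab_size)

    def _zeros(self, x, dim):
        # One row of zeros per element of the batch (assumed initialization).
        return torch.zeros(x.size(0), dim, device=x.device)

    def forward(self, x):
        h = self.hidden_dim
        hs_forward, cs_forward = self._zeros(x, h), self._zeros(x, h)
        hs_backward, cs_backward = self._zeros(x, h), self._zeros(x, h)
        hs_lstm, cs_lstm = self._zeros(x, 2 * h), self._zeros(x, 2 * h)

        # [batch, seq] -> [batch, seq, hidden] -> [seq, batch, hidden]
        steps = self.embedding(x).transpose(0, 1)

        forward, backward = [], []

        # Forward LSTM: read the sequence in its original order.
        for step in steps:
            hs_forward, cs_forward = self.lstm_cell_forward(step, (hs_forward, cs_forward))
            forward.append(hs_forward)

        # Backward LSTM: read the same sequence in reverse order.
        for step in steps.flip(0):
            hs_backward, cs_backward = self.lstm_cell_backward(step, (hs_backward, cs_backward))
            backward.append(hs_backward)
        backward.reverse()  # align the backward states with the time steps

        # LSTM layer: fed with the concatenated states, [batch, hidden * 2].
        for fwd, bwd in zip(forward, backward):
            input_tensor = torch.cat((fwd, bwd), dim=1)
            hs_lstm, cs_lstm = self.lstm_cell(input_tensor, (hs_lstm, cs_lstm))

        # The last hidden state goes through the linear layer: [batch, vocab_size]
        return self.linear(hs_lstm)


model = TextGenerator(vocab_size=27, hidden_dim=128)
x = torch.randint(0, 27, (64, 100))
logits = model(x)
print(logits.shape)  # torch.Size([64, 27])
```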
Congratulations! Up to this point we already know how to assemble the neural networks using LSTMCell in PyTorch. Now it’s time to see how we do the training phase, so let’s move on to the next section.
Training phase
Great, we’ve come to training. To perform the training we need to initialize the model and the optimizer, and then iterate over each epoch and each mini-batch, so let’s do it!
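A minimal sketch of such a training loop follows. The Adam optimizer and the cross-entropy loss are assumptions made for this sketch, and the `train` function is hypothetical. To keep the example short and self-contained, a tiny stand-in network replaces the Bi-LSTM model and the data is random; any module that maps `[batch, window]` indices to `[batch, num_classes]` scores could be plugged in instead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def train(model, x, y, epochs=50, batch_size=128, learning_rate=0.001):
    """Train for a fixed number of epochs and return the mean loss per epoch."""
    # Assumption: Adam optimizer; other optimizers would work the same way.
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    history = []
    model.train()
    for epoch in range(epochs):
        total_loss = 0.0
        # Iterate over the mini-batches.
        for start in range(0, len(x), batch_size):
            x_batch = x[start:start + batch_size]
            y_batch = y[start:start + batch_size]

            logits = model(x_batch)
            # cross_entropy applies the (log-)softmax internally.
            loss = F.cross_entropy(logits, y_batch)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item() * len(x_batch)
        history.append(total_loss / len(x))
    return history


# Random stand-in data and model, only to show that the loop runs.
torch.manual_seed(0)
vocab_size, window = 27, 10
x = torch.randint(0, vocab_size, (256, window))
y = torch.randint(0, vocab_size, (256,))
stand_in = nn.Sequential(
    nn.Embedding(vocab_size, 8),
    nn.Flatten(),
    nn.Linear(8 * window, vocab_size),
)

history = train(stand_in, x, y, epochs=3, batch_size=64)
print(history)
```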
Once the model is trained, we need to save the weights of the neural network so that we can later use them to generate text. For this we have two options: the first is to train for a fixed number of epochs and then save the weights; the second is to define a stopping criterion in order to keep the best version of the model. In this particular case, we are going to opt for the first option. After training the model for a certain number of epochs, we save the weights as follows:
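A minimal sketch of saving the weights through the model's `state_dict`, followed by a round trip to check that they can be restored. The helper names and the file name are hypothetical, and a small `nn.Linear` stands in for the trained text generator.

```python
import os
import tempfile

import torch
import torch.nn as nn


def save_weights(model, path):
    """Store only the learned parameters, not the whole Python object."""
    torch.save(model.state_dict(), path)


def load_weights(model, path):
    """Load the parameters into a model with the same architecture."""
    model.load_state_dict(torch.load(path, map_location="cpu"))
    model.eval()
    return model


# Stand-in for the trained model (assumption, for illustration only).
trained = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "textGenerator.pt")  # hypothetical file name
    save_weights(trained, path)
    restored = load_weights(nn.Linear(4, 2), path)

print(torch.equal(trained.weight, restored.weight))  # True
```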
Perfect, up to this point we have seen how to train the text generator and how to save the weights. Now we are going to the highlight of this blog, the text generation! So let’s go to the next section.
Text generation
Fantastic, we have reached the final part of the blog, the text generation. For this, we need to do two things: the first is to load the trained weights and the second is to take a random sample from the set of sequences as the pattern to start generating the next character. So let’s take a look at the following code snippet:
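A minimal sketch of the generation loop. It assumes that the trained weights have already been loaded into the model (for example with `load_state_dict`), and it samples the next character from the softmax distribution; taking the most probable character instead would be an equally valid choice. The `generate` function is hypothetical, and an untrained stand-in network is used so that the example runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def generate(model, seed, idx_to_char, n_chars, window):
    """Predict n_chars characters, sliding the window over its own predictions."""
    model.eval()
    pattern = list(seed)  # encoded seed sequence of length `window`
    generated = []
    with torch.no_grad():
        for _ in range(n_chars):
            # The last `window` indices form the input: [1, window]
            x = torch.tensor([pattern[-window:]], dtype=torch.long)
            probabilities = F.softmax(model(x), dim=1)
            # Assumption: sample the next character from the distribution.
            next_idx = torch.multinomial(probabilities, num_samples=1).item()
            generated.append(next_idx)
            pattern.append(next_idx)
    return "".join(idx_to_char[idx] for idx in generated)


vocab = " abcdefghijklmnopqrstuvwxyz"
idx_to_char = dict(enumerate(vocab))
window = 10

# Untrained stand-in model, for illustration only.
stand_in = nn.Sequential(
    nn.Embedding(len(vocab), 8),
    nn.Flatten(),
    nn.Linear(8 * window, len(vocab)),
)

# Random seed sequence; in practice it would be taken from the set of sequences.
seed = torch.randint(0, len(vocab), (window,)).tolist()
sample = generate(stand_in, seed, idx_to_char, n_chars=20, window=window)
print(repr(sample))
```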
So, by training the model with the following settings:
window : 100
epochs : 50
hidden_dim : 128
batch_size : 128
learning_rate : 0.001
we can generate the following:
Seed: one of the prairie swellswhich gave a little wider view than most of them jack saw quite close to the
Prediction: one of the prairie swellswhich gave a little wider view than most of them jack saw quite close to the wnd banngessejang boffff we outheaedd we band r hes tller a reacarof t t alethe ngothered uhe th wengaco ack fof ace ca e s alee bin cacotee tharss th band fofoutod we we ins sange trre anca y w farer we sewigalfetwher d e we n s shed pack wngaingh tthe we the we javes t supun f the har man bllle s ng ou y anghe ond we nd ba a she t t anthendwe wn me anom ly tceaig t i isesw arawns t d ks wao thalac tharr jad d anongive where the awe w we he is ma mie cack seat sesant sns t imes hethof riges we he d ooushe he hang out f t thu inong bll llveco we see s the he haa is s igg merin ishe d t san wack owhe o or th we sbe se we we inange t ts wan br seyomanthe harntho thengn th me ny we ke in acor offff of wan s arghe we t angorro the wand be thing a sth t tha alelllll willllsse of s wed w brstougof bage orore he anthesww were ofawe ce qur the he sbaing tthe bytondece nd t llllifsffo acke o t in ir me hedlff scewant pi t bri pi owasem the awh thorathas th we hed ofainginictoplid we me
As we can see, the generated text does not make much sense as a whole; however, there are some words and phrases that seem to form an idea, for example:
we, band, pack, the, man, where, he, hang, out, be, thing, me, were
Congratulations, we have reached the end of the blog!
Throughout this blog we have shown how to make an end-to-end model for text generation using PyTorch’s LSTMCell and implementing an architecture based on the recurrent neural networks LSTM and Bi-LSTM.
It is important to note that the suggested model for text generation can be improved in different ways. Some ideas would be to increase the size of the text corpus used for training, to increase the number of epochs, and to increase the memory size of each LSTM. On the other hand, we could think of an interesting architecture based on Convolutional-LSTM (maybe a topic for another blog).