attention

《Attention》 is a song by the American singer Charlie Puth, released on April 21, 2017 [1] and included on his studio album 《Voicenotes》, released on May 11, 2018.

    Artist: Charlie Puth
    Chinese title: 注意 [1]
    Composed by: Jacob Kasher [2], Charlie Puth [2]
    Album: Voicenotes [2]
    Lyrics by: Jacob Kasher [2], Charlie Puth [2]
  • attention

    2021-01-20 12:40:27
    1. What is the Attention mechanism? When a person looks at something, what they attend to at any given moment is some particular part of what they are looking at. In other words, as their gaze moves elsewhere, their attention shifts along with it, which means that when people notice some...
  • Attention

    2019-08-12 18:56:21
    SNAL Attention for landmark Detection. Brief overview of the basic structure: the model's backbone contains four SANL attention blocks, whose spatial attention maps come from GCAM. The model can be used for both classification and landmark detection. Overall it is a heatmap-based atten...

    SNAL Attention for landmark Detection


    The model's backbone contains four SANL attention blocks, whose spatial attention maps come from GCAM. The model can be used for both classification and landmark detection. Overall, it is a heatmap-based attention method.

    Types of attention

    Visual attention falls mainly into two categories. The first is correlation-based attention, which frames out regions directly from the feature map and the keypoints, as in FashionNet. The second uses heatmaps, as this network does.

    The SNAL attention block:



    where X is the input and M is the attention map; the computation involves element-wise matrix products, matrix multiplication, and matrix addition.



    It is an algorithm that filters regions at the pixel level; the activation weight of a feature map is generally obtained by summing the feature map's activation values.



    y_i = \frac{1}{C(x)} \sum_{j} f(x_i, x_j) \, g(x_j)






    where M is the attention map and h is a transformation function applied to X and M.

    The function f is a similarity function, which can be written as


    Landmark detection

    The current trend in landmark detection is to predict from heatmaps associated with the keypoints, so predicting k keypoints requires k heatmaps.
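For instance, the target heatmap for a single keypoint is typically a 2-D Gaussian centred on that keypoint; a minimal sketch (the image size and sigma here are arbitrary choices):

```python
import numpy as np

def keypoint_heatmap(h, w, cx, cy, sigma=2.0):
    """One Gaussian heatmap per keypoint, peaked at (cx, cy)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

hm = keypoint_heatmap(64, 64, cx=20, cy=30)
print(hm.shape, hm.argmax())   # peak at row 30, col 20 -> flat index 30*64+20 = 1940
```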




    where Cvis indicates whether the landmark is visible, tijc is the ground-truth landmark at position ij, and x is the predicted point.
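The loss being described can plausibly be reconstructed as a visibility-masked squared error; this form is an assumption based on the description above, not taken from the paper:

```latex
\mathcal{L} = \sum_{i,j,c} C_{vis}^{c} \, \left\lVert x_{ij}^{c} - t_{ij}^{c} \right\rVert_2^2
```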



  • This document mainly introduces attention and its variants, self-attention and multi-attention, as well as some related papers
  • TCN-with-attention-master_attention_tcn_attention预测_attention-LS

     Attention Mechanism 

    Can I have your Attention please! The introduction of the Attention Mechanism in deep learning has improved the success of various models in recent years, and continues to be an omnipresent component in state-of-the-art models. Therefore, it is vital that we pay Attention to Attention and how it goes about achieving its effectiveness.

    In this article, I will be covering the main concepts behind Attention, including an implementation of a sequence-to-sequence Attention model, followed by the application of Attention in Transformers and how they can be used for state-of-the-art results. It is advised that you have some knowledge of Recurrent Neural Networks (RNNs) and their variants, or an understanding of how sequence-to-sequence models work.

    What is Attention?

    When we think about the English word “Attention”, we know that it means directing your focus at something and taking greater notice. The Attention mechanism in Deep Learning is based on this concept of directing your focus, and it pays greater attention to certain factors when processing the data.

    In broad terms, Attention is one component of a network’s architecture, and is in charge of managing and quantifying the interdependence:

    1. Between the input and output elements (General Attention)
    2. Within the input elements (Self-Attention)

    Let me give you an example of how Attention works in a translation task. Say we have the sentence “How was your day”, which we would like to translate to the French version - “Comment se passe ta journée”. What the Attention component of the network will do for each word in the output sentence is map the important and relevant words from the input sentence and assign higher weights to these words, enhancing the accuracy of the output prediction.


    Weights are assigned to input words at each step of the translation

    The above explanation of Attention is very broad and vague due to the various types of Attention mechanisms available. But fret not, you’ll gain a clearer picture of how Attention works and achieves its objectives further in the article. As the Attention mechanism has undergone multiple adaptations over the years to suit various tasks, there are many different versions of Attention that are applied. We will only cover the more popular adaptations here, which are its usage in sequence-to-sequence models and the more recent Self-Attention.

    While Attention does have its application in other fields of deep learning such as Computer Vision, its main breakthrough and success comes from its application in Natural Language Processing (NLP) tasks. This is because Attention was originally introduced to address the problem of long sequences in Machine Translation, a problem shared by most other NLP tasks.

    Attention in Sequence-to-Sequence Models

    Most articles on the Attention Mechanism will use the example of sequence-to-sequence (seq2seq) models to explain how it works. This is because Attention was originally introduced as a solution to address the main issue surrounding seq2seq models, and to great success. If you are unfamiliar with seq2seq models, also known as the Encoder-Decoder model, I recommend having a read through this article to get you up to speed.

    Overall process of a Sequence-to-sequence model

    The standard seq2seq model is generally unable to accurately process long input sequences, since only the last hidden state of the encoder RNN is used as the context vector for the decoder. On the other hand, the Attention Mechanism directly addresses this issue as it retains and utilises all the hidden states of the input sequence during the decoding process. It does this by creating a unique mapping between each time step of the decoder output to all the encoder hidden states. This means that for each output that the decoder makes, it has access to the entire input sequence and can selectively pick out specific elements from that sequence to produce the output.

    Therefore, the mechanism allows the model to focus and place more “Attention” on the relevant parts of the input sequence as needed.

    Types of Attention

    Comparing Bahdanau Attention with Luong Attention

    Before we delve into the specific mechanics behind Attention, we must note that there are 2 different major types of Attention:

    1. Bahdanau Attention
    2. Luong Attention

    While the underlying principles of Attention are the same in these 2 types, their differences lie mainly in their architectures and computations.

    Bahdanau Attention

    Overall process for Bahdanau Attention seq2seq model

    The first type of Attention, commonly referred to as Additive Attention, came from a paper by Dzmitry Bahdanau, which explains the less-descriptive original name. The paper aimed to improve the sequence-to-sequence model in machine translation by aligning the decoder with the relevant input sentences and implementing Attention. The entire step-by-step process of applying Attention in Bahdanau’s paper is as follows:

    1. Producing the Encoder Hidden States - Encoder produces hidden states of each element in the input sequence
    2. Calculating the Alignment Scores - the alignment scores between the previous decoder hidden state and each of the encoder’s hidden states are calculated (Note: The last encoder hidden state can be used as the first hidden state in the decoder)
    3. Softmaxing the Alignment Scores - the alignment scores for each encoder hidden state are combined and represented in a single vector and subsequently softmaxed
    4. Calculating the Context Vector - the encoder hidden states and their respective alignment scores are multiplied to form the context vector
    5. Decoding the Output - the context vector is concatenated with the previous decoder output and fed into the Decoder RNN for that time step along with the previous decoder hidden state to produce a new output
    6. The process (steps 2-5) repeats itself for each time step of the decoder until an <EOS> token is produced or the output exceeds the specified maximum length

    Flow of calculating Attention weights in Bahdanau Attention

    Now that we have a high-level understanding of the flow of the Attention mechanism for Bahdanau, let’s take a look at the inner workings and computations involved, together with some code implementation of a language seq2seq model with Attention in PyTorch.

    1. Producing the Encoder Hidden States

    Encoder RNNs will produce a hidden state for each input

    For our first step, we’ll be using an RNN or any of its variants (e.g. LSTM, GRU) to encode the input sequence. After passing the input sequence through the encoder RNN, a hidden state/output will be produced for each input passed in. Instead of using only the hidden state at the final time step, we’ll be carrying forward all the hidden states produced by the encoder to the next step.

    class EncoderLSTM(nn.Module):
      def __init__(self, input_size, hidden_size, n_layers=1, drop_prob=0):
        super(EncoderLSTM, self).__init__()
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, n_layers, dropout=drop_prob, batch_first=True)
      def forward(self, inputs, hidden):
        # Embed input words
        embedded = self.embedding(inputs)
        # Pass the embedded word vectors into LSTM and return all outputs
        output, hidden = self.lstm(embedded, hidden)
        return output, hidden
      def init_hidden(self, batch_size=1):
        return (torch.zeros(self.n_layers, batch_size, self.hidden_size, device=device),
                torch.zeros(self.n_layers, batch_size, self.hidden_size, device=device))

    In the code implementation of the encoder above, we’re first embedding the input words into word vectors (assuming that it’s a language task) and then passing it through an LSTM. The encoder over here is exactly the same as a normal encoder-decoder structure without Attention.

    2. Calculating Alignment Scores

    For these next 3 steps, we will be going through the processes that happen in the Attention Decoder and discuss how the Attention mechanism is utilised. The class BahdanauDecoder defined below encompasses these 3 steps in the forward function.

    class BahdanauDecoder(nn.Module):
      def __init__(self, hidden_size, output_size, n_layers=1, drop_prob=0.1):
        super(BahdanauDecoder, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.drop_prob = drop_prob
        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.fc_hidden = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.fc_encoder = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
        self.weight = nn.Parameter(torch.FloatTensor(1, hidden_size))
        self.attn_combine = nn.Linear(self.hidden_size * 2, self.hidden_size)
        self.dropout = nn.Dropout(self.drop_prob)
        self.lstm = nn.LSTM(self.hidden_size*2, self.hidden_size, batch_first=True)
        self.classifier = nn.Linear(self.hidden_size, self.output_size)
      def forward(self, inputs, hidden, encoder_outputs):
        encoder_outputs = encoder_outputs.squeeze()
        # Embed input words
        embedded = self.embedding(inputs).view(1, -1)
        embedded = self.dropout(embedded)
        # Calculating Alignment Scores
        x = torch.tanh(self.fc_hidden(hidden[0])+self.fc_encoder(encoder_outputs))
        alignment_scores = x.bmm(self.weight.unsqueeze(2))  
        # Softmaxing alignment scores to get Attention weights
        attn_weights = F.softmax(alignment_scores.view(1,-1), dim=1)
        # Multiplying the Attention weights with encoder outputs to get the context vector
        context_vector = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs.unsqueeze(0))
        # Concatenating context vector with embedded input word
        output =, context_vector[0]), 1).unsqueeze(0)
        # Passing the concatenated vector as input to the LSTM cell
        output, hidden = self.lstm(output, hidden)
        # Passing the LSTM output through a Linear layer acting as a classifier
        output = F.log_softmax(self.classifier(output[0]), dim=1)
        return output, hidden, attn_weights

    After obtaining all of our encoder outputs, we can start using the decoder to produce outputs. At each time step of the decoder, we have to calculate the alignment score of each encoder output with respect to the decoder input and hidden state at that time step. The alignment score is the essence of the Attention mechanism, as it quantifies the amount of “Attention” the decoder will place on each of the encoder outputs when producing the next output.

    The alignment scores for Bahdanau Attention are calculated using the hidden state produced by the decoder in the previous time step and the encoder outputs with the following equation:
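Reconstructing the additive scoring equation from the description that follows (the symbol names are assumed: h_{t-1} is the previous decoder hidden state, \bar{h}_s an encoder output, W_1 and W_2 the two linear layers, and v_a the trainable vector):

```latex
e_{t,s} = v_a^\top \tanh\left(W_1 h_{t-1} + W_2 \bar{h}_s\right)
```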


    The decoder hidden state and encoder outputs will be passed through their individual Linear layer and have their own individual trainable weights.

    Linear layers for encoder outputs and decoder hidden states

    In the illustration above, the hidden size is 3 and the number of encoder outputs is 2.

    Thereafter, they will be added together before being passed through a tanh activation function. The decoder hidden state is added to each encoder output in this case.

    Above outputs combined and tanh applied

    Lastly, the resultant vector from the previous few steps will undergo matrix multiplication with a trainable vector, obtaining a final alignment score vector which holds a score for each encoder output.

    Matrix Multiplication to obtain Alignment score

    Note: As there is no previous hidden state or output for the first decoder step, the last encoder hidden state and a Start Of String (<SOS>) token can be used to replace these two, respectively.

    3. Softmaxing the Alignment Scores

    After generating the alignment scores vector in the previous step, we can then apply a softmax on this vector to obtain the attention weights. The softmax function will cause the values in the vector to sum up to 1 and each individual value will lie between 0 and 1, therefore representing the weightage each input holds at that time step.

    Alignment scores are softmaxed

    4. Calculating the Context Vector

    After computing the attention weights in the previous step, we can now generate the context vector by doing an element-wise multiplication of the attention weights with the encoder outputs.

    Due to the softmax function in the previous step, if the score of a specific input element is closer to 1 its effect and influence on the decoder output is amplified, whereas if the score is close to 0, its influence is drowned out and nullified.
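As a toy numeric illustration of steps 3 and 4 (the scores and encoder outputs here are made-up values):

```python
import torch
import torch.nn.functional as F

alignment_scores = torch.tensor([[2.0, 0.1, -1.0]])   # one score per encoder output
attn_weights = F.softmax(alignment_scores, dim=1)     # sums to 1, each value in (0, 1)

encoder_outputs = torch.tensor([[[1.0, 0.0],          # 3 encoder outputs of size 2
                                 [0.0, 1.0],
                                 [1.0, 1.0]]])
# Element-wise weighting and summing, expressed as a batch matrix multiplication
context_vector = torch.bmm(attn_weights.unsqueeze(0), encoder_outputs)
print(attn_weights)          # the first output (score 2.0) dominates
print(context_vector.shape)  # torch.Size([1, 1, 2])
```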

    Context Vector is derived from the weights and encoder outputs

    5. Decoding the Output

    The context vector we produced will then be concatenated with the previous decoder output. It is then fed into the decoder RNN cell to produce a new hidden state and the process repeats itself from step 2. The final output for the time step is obtained by passing the new hidden state through a Linear layer, which acts as a classifier to give the probability scores of the next predicted word.

    Context vector and previous output will give new decoder hidden state

    Steps 2 to 4 are repeated until the decoder generates an End Of Sentence token or the output length exceeds a specified maximum length.

    Luong Attention

    Overall process for Luong Attention seq2seq model

    The second type of Attention was proposed by Thang Luong in this paper. It is often referred to as Multiplicative Attention and was built on top of the Attention mechanism proposed by Bahdanau. The two main differences between Luong Attention and Bahdanau Attention are:

    1. The way that the alignment score is calculated
    2. The position at which the Attention mechanism is being introduced in the decoder

    There are three types of alignment scoring functions proposed in Luong’s paper, compared to Bahdanau’s one type. Also, the general structure of the Attention Decoder is different for Luong Attention, as the context vector is only utilised after the RNN has produced the output for that time step. We will explore these differences in greater detail as we go through the Luong Attention process, which is:

    1. Producing the Encoder Hidden States - Encoder produces hidden states of each element in the input sequence
    2. Decoder RNN - the previous decoder hidden state and decoder output are passed through the Decoder RNN to generate a new hidden state for that time step
    3. Calculating Alignment Scores - using the new decoder hidden state and the encoder hidden states, alignment scores are calculated
    4. Softmaxing the Alignment Scores - the alignment scores for each encoder hidden state are combined and represented in a single vector and subsequently softmaxed
    5. Calculating the Context Vector - the encoder hidden states and their respective alignment scores are multiplied to form the context vector
    6. Producing the Final Output - the context vector is concatenated with the decoder hidden state generated in step 2 and passed through a fully connected layer to produce a new output
    7. The process (steps 2-6) repeats itself for each time step of the decoder until an <EOS> token is produced or the output exceeds the specified maximum length

    As we can already see above, the order of steps in Luong Attention is different from Bahdanau Attention. The code implementation and some calculations in this process is different as well, which we will go through now.

    1. Producing the Encoder Hidden States

    Just as in Bahdanau Attention, the encoder produces a hidden state for each element in the input sequence.

    2. Decoder RNN

    Unlike in Bahdanau Attention, the decoder in Luong Attention uses the RNN in the first step of the decoding process rather than the last. The RNN takes the hidden state produced in the previous time step and the word embedding of the final output from the previous time step to produce a new hidden state, which is used in the subsequent steps.

    class LuongDecoder(nn.Module):
      def __init__(self, hidden_size, output_size, attention, n_layers=1, drop_prob=0.1):
        super(LuongDecoder, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.n_layers = n_layers
        self.drop_prob = drop_prob
        # The Attention Mechanism is defined in a separate class
        self.attention = attention
        self.embedding = nn.Embedding(self.output_size, self.hidden_size)
        self.dropout = nn.Dropout(self.drop_prob)
        self.lstm = nn.LSTM(self.hidden_size, self.hidden_size)
        self.classifier = nn.Linear(self.hidden_size*2, self.output_size)
      def forward(self, inputs, hidden, encoder_outputs):
        # Embed input words
        embedded = self.embedding(inputs).view(1,1,-1)
        embedded = self.dropout(embedded)
        # Passing previous output word (embedded) and hidden state into LSTM cell
        lstm_out, hidden = self.lstm(embedded, hidden)
        # Calculating Alignment Scores - see Attention class for the forward pass function
        alignment_scores = self.attention(lstm_out,encoder_outputs)
        # Softmaxing alignment scores to obtain Attention weights
        attn_weights = F.softmax(alignment_scores.view(1,-1), dim=1)
        # Multiplying Attention weights with encoder outputs to get context vector
        context_vector = torch.bmm(attn_weights.unsqueeze(0),encoder_outputs)
        # Concatenating output from LSTM with context vector
        output =, context_vector),-1)
        # Pass concatenated vector through Linear layer acting as a Classifier
        output = F.log_softmax(self.classifier(output[0]), dim=1)
        return output, hidden, attn_weights
    class Attention(nn.Module):
      def __init__(self, hidden_size, method="dot"):
        super(Attention, self).__init__()
        self.method = method
        self.hidden_size = hidden_size
        # Defining the layers/weights required depending on alignment scoring method
        if method == "general":
          self.fc = nn.Linear(hidden_size, hidden_size, bias=False)
        elif method == "concat":
          self.fc = nn.Linear(hidden_size, hidden_size, bias=False)
          self.weight = nn.Parameter(torch.FloatTensor(1, hidden_size))
      def forward(self, decoder_hidden, encoder_outputs):
        if self.method == "dot":
          # For the dot scoring method, no weights or linear layers are involved
          return encoder_outputs.bmm(decoder_hidden.view(1,-1,1)).squeeze(-1)
        elif self.method == "general":
          # For general scoring, decoder hidden state is passed through linear layers to introduce a weight matrix
          out = self.fc(decoder_hidden)
          return encoder_outputs.bmm(out.view(1,-1,1)).squeeze(-1)
        elif self.method == "concat":
          # For concat scoring, decoder hidden state and encoder outputs are concatenated first
          out = torch.tanh(self.fc(decoder_hidden+encoder_outputs))
          return out.bmm(self.weight.unsqueeze(-1)).squeeze(-1)

    3. Calculating Alignment Scores

    In Luong Attention, there are three different ways that the alignment scoring function can be defined: dot, general and concat. These scoring functions make use of the encoder outputs and the decoder hidden state produced in the previous step to calculate the alignment scores.

    • Dot
      The first one is the dot scoring function. This is the simplest of the functions; to produce the alignment score we only need to take the hidden states of the encoder and multiply them by the hidden state of the decoder.


    • General
      The second type is called general and is similar to the dot function, except that a weight matrix is added into the equation as well.


    • Concat
      The last function is slightly similar to the way that alignment scores are calculated in Bahdanau Attention, whereby the decoder hidden state is added to the encoder hidden states.


    However, the difference lies in the fact that the decoder hidden state and encoder hidden states are added together first before being passed through a Linear layer. This means that the decoder hidden state and encoder hidden state will not have their individual weight matrices, but a shared one instead, unlike in Bahdanau Attention. After being passed through the Linear layer, a tanh activation function is applied to the output before it is multiplied by a weight matrix to produce the alignment score.
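The three scoring functions can be written compactly (following the notation in Luong's paper, where h_t is the current decoder hidden state and \bar{h}_s an encoder hidden state; note that the code above simplifies concat by adding the two states instead of concatenating them):

```latex
\mathrm{score}(h_t, \bar{h}_s) =
\begin{cases}
h_t^\top \bar{h}_s & \text{dot} \\
h_t^\top W_a \bar{h}_s & \text{general} \\
v_a^\top \tanh\!\left(W_a [h_t ; \bar{h}_s]\right) & \text{concat}
\end{cases}
```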

    4. Softmaxing the Alignment Scores

    Similar to Bahdanau Attention, the alignment scores are softmaxed so that the weights will be between 0 to 1.

    5. Calculating the Context Vector

    Again, this step is the same as the one in Bahdanau Attention where the attention weights are multiplied with the encoder outputs.

    6. Producing the Final Output

    In the last step, the context vector we just produced is concatenated with the decoder hidden state we generated in step 2. This combined vector is then passed through a Linear layer which acts as a classifier for us to obtain the probability scores of the next predicted word.

    Testing The Model

    Since we’ve defined the structure of the Attention encoder-decoder model and understood how it works, let’s see how we can use it for an NLP task - Machine Translation.


    We will be using English to German sentence pairs obtained from the Tatoeba Project, and the compiled sentence pairs can be found at this link. You can run the code implementation in this article on FloydHub using their cloud GPUs and the main.ipynb notebook, which will speed up the training process significantly. Alternatively, the link to the GitHub repository can be found here.

    The goal of this implementation is not to develop a complete English to German translator, but rather just as a sanity check to ensure that our model is able to learn and fit to a set of training data. I will briefly go through the data preprocessing steps before running through the training procedure.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    import torch.optim as optim
    import numpy as np
    import pandas
    import spacy
    from spacy.lang.en import English
    from spacy.lang.de import German
    import matplotlib.pyplot as plt
    import matplotlib.ticker as ticker
    from tqdm import tqdm_notebook
    import random
    from collections import Counter
    if torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    We start by importing the relevant libraries and defining the device we are running our training on (GPU/CPU). If you’re using FloydHub with GPU to run this code, the training time will be significantly reduced. In the next code block, we’ll be doing our data preprocessing steps:

    1. Tokenizing the sentences and creating our vocabulary dictionaries
    2. Assigning each word in our vocabulary to integer indexes
    3. Converting our sentences into their word token indexes
    # Reading the English-German sentences pairs from the file
    with open("deu.txt","r+") as file:
      deu = [x[:-1] for x in file.readlines()]
    en = []
    de = []
    for line in deu:
        # Each line holds a tab-separated English-German sentence pair
        en.append(line.split("\t")[0])
        de.append(line.split("\t")[1])
    # Setting the number of training sentences we'll use
    training_examples = 10000
    # We'll be using the spaCy's English and German tokenizers
    spacy_en = English()
    spacy_de = German()
    en_words = Counter()
    de_words = Counter()
    en_inputs = []
    de_inputs = []
    # Tokenizing the English and German sentences and creating our word banks for both languages
    for i in tqdm_notebook(range(training_examples)):
        en_tokens = spacy_en(en[i])
        de_tokens = spacy_de(de[i])
        if len(en_tokens)==0 or len(de_tokens)==0:
            continue
        for token in en_tokens:
            en_words.update([token.text.lower()])
        en_inputs.append([token.text.lower() for token in en_tokens] + ['_EOS'])
        for token in de_tokens:
            de_words.update([token.text.lower()])
        de_inputs.append([token.text.lower() for token in de_tokens] + ['_EOS'])
    # Assigning an index to each word token, including the Start Of String(SOS), End Of String(EOS) and Unknown(UNK) tokens
    en_words = ['_SOS','_EOS','_UNK'] + sorted(en_words,key=en_words.get,reverse=True)
    en_w2i = {o:i for i,o in enumerate(en_words)}
    en_i2w = {i:o for i,o in enumerate(en_words)}
    de_words = ['_SOS','_EOS','_UNK'] + sorted(de_words,key=de_words.get,reverse=True)
    de_w2i = {o:i for i,o in enumerate(de_words)}
    de_i2w = {i:o for i,o in enumerate(de_words)}
    # Converting our English and German sentences to their token indexes
    for i in range(len(en_inputs)):
        en_sentence = en_inputs[i]
        de_sentence = de_inputs[i]
        en_inputs[i] = [en_w2i[word] for word in en_sentence]
        de_inputs[i] = [de_w2i[word] for word in de_sentence]

    Since we’ve already defined our Encoder and Attention Decoder model classes earlier, we can now instantiate the models.

    hidden_size = 256
    encoder = EncoderLSTM(len(en_words), hidden_size).to(device)
    attn = Attention(hidden_size,"concat")
    decoder = LuongDecoder(hidden_size,len(de_words),attn).to(device)
    lr = 0.001
    encoder_optimizer = optim.Adam(encoder.parameters(), lr=lr)
    decoder_optimizer = optim.Adam(decoder.parameters(), lr=lr)

    We’ll be testing the LuongDecoder model with the scoring function set as concat. During our training cycle, we’ll be using a method called teacher forcing for 50% of the training inputs, which uses the real target outputs as the input to the next step of the decoder instead of our decoder output for the previous time step. This allows the model to converge faster, although there are some drawbacks involved (e.g. instability of trained model).

    EPOCHS = 10
    teacher_forcing_prob = 0.5
    tk0 = tqdm_notebook(range(1,EPOCHS+1),total=EPOCHS)
    for epoch in tk0:
        avg_loss = 0.
        tk1 = tqdm_notebook(enumerate(en_inputs),total=len(en_inputs),leave=False)
        for i, sentence in tk1:
            loss = 0.
            h = encoder.init_hidden()
            inp = torch.tensor(sentence).unsqueeze(0).to(device)
            encoder_outputs, h = encoder(inp,h)
            #First decoder input will be the SOS token
            decoder_input = torch.tensor([en_w2i['_SOS']],device=device)
            #First decoder hidden state will be last encoder hidden state
            decoder_hidden = h
            output = []
            teacher_forcing = True if random.random() < teacher_forcing_prob else False
            for ii in range(len(de_inputs[i])):
              decoder_output, decoder_hidden, attn_weights = decoder(decoder_input, decoder_hidden, encoder_outputs)
              # Get the index value of the word with the highest score from the decoder output
              top_value, top_index = decoder_output.topk(1)
              if teacher_forcing:
                decoder_input = torch.tensor([de_inputs[i][ii]],device=device)
              else:
                decoder_input = torch.tensor([top_index.item()],device=device)
              # Calculate the loss of the prediction against the actual word
              loss += F.nll_loss(decoder_output.view(1,-1), torch.tensor([de_inputs[i][ii]],device=device))
            # Backpropagate and update the weights of both the encoder and the decoder
            encoder_optimizer.zero_grad()
            decoder_optimizer.zero_grad()
            loss.backward()
            encoder_optimizer.step()
            decoder_optimizer.step()
            avg_loss += loss.item()/len(en_inputs)
        # Save model after every epoch (Optional){"encoder":encoder.state_dict(),"decoder":decoder.state_dict(),"e_optimizer":encoder_optimizer.state_dict(),"d_optimizer":decoder_optimizer.state_dict()},"./")

    Using our trained model, let’s visualise some of the outputs that the model produces and the attention weights the model assigns to each input element.

    # Choose a random sentences
    i = random.randint(0,len(en_inputs)-1)
    h = encoder.init_hidden()
    inp = torch.tensor(en_inputs[i]).unsqueeze(0).to(device)
    encoder_outputs, h = encoder(inp,h)
    decoder_input = torch.tensor([en_w2i['_SOS']],device=device)
    decoder_hidden = h
    output = []
    attentions = []
    while True:
      decoder_output, decoder_hidden, attn_weights = decoder(decoder_input, decoder_hidden, encoder_outputs)
      _, top_index = decoder_output.topk(1)
      decoder_input = torch.tensor([top_index.item()],device=device)
      # If the decoder output is the End Of Sentence token, stop decoding process
      if top_index.item() == de_w2i["_EOS"]:
        break
      # Otherwise, store the predicted token and its Attention weights
      output.append(top_index.item())
      attentions.append(attn_weights.squeeze().cpu().detach().numpy())
    print("English: "+ " ".join([en_i2w[x] for x in en_inputs[i]]))
    print("Predicted: " + " ".join([de_i2w[x] for x in output]))
    print("Actual: " + " ".join([de_i2w[x] for x in de_inputs[i]]))
    # Plotting the heatmap for the Attention weights
    fig = plt.figure(figsize=(12,9))
    ax = fig.add_subplot(111)
    cax = ax.matshow(np.array(attentions))
    ax.set_xticklabels(['']+[en_i2w[x] for x in en_inputs[i]])
    ax.set_yticklabels(['']+[de_i2w[x] for x in output])
    [Out]: English: she kissed him . _EOS
           Predicted: sie küsste ihn .
           Actual: sie küsste ihn . _EOS

    Plot showing the weights assigned to each input word for each output

    From the example above, we can see that for each output word from the decoder, the weights assigned to the input words are different and we can see the relationship between the inputs and outputs that the model is able to draw. You can try this on a few more examples to test the results of the translator.

    In our training, we have clearly overfitted our model to the training sentences. If we were to test the trained model on sentences it has never seen before, it is unlikely to produce decent results. Nevertheless, this process acts as a sanity check to ensure that our model works and is able to function end-to-end and learn.

    The challenge of training an effective model can be attributed largely to the lack of training data and training time. Due to the complex nature of the different languages involved, the large vocabulary, and the many grammatical permutations, an effective model requires tons of data and training time before any results can be seen on evaluation data.


    The Attention mechanism has revolutionised the way we create NLP models and is currently a standard fixture in most state-of-the-art NLP models. This is because it enables the model to “remember” all the words in the input and focus on specific words when formulating a response.

    We covered the early implementations of Attention in seq2seq models with RNNs in this article. However, more recent adaptations of Attention have seen models move beyond RNNs to Self-Attention and the realm of Transformer models. Google’s BERT, OpenAI’s GPT and the more recent XLNet are among the most popular NLP models today, and all are largely based on self-attention and the Transformer architecture.
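For readers who want to experiment with self-attention before the next article, PyTorch already ships a ready-made multi-head attention module. A short sketch with toy, arbitrary dimensions (not the model built in this tutorial):

```python
import torch
import torch.nn as nn

# Self-attention over a toy sequence: query, key and value are all the
# same tensor, as in a Transformer encoder layer.
embed_dim, num_heads = 16, 4
attn = nn.MultiheadAttention(embed_dim, num_heads)

x = torch.randn(10, 1, embed_dim)   # (seq_len, batch, embed_dim)
out, weights = attn(x, x, x)        # self-attention: Q = K = V = x

print(out.shape)      # torch.Size([10, 1, 16])
print(weights.shape)  # torch.Size([1, 10, 10]) -- one 10x10 map per batch item
```

The returned weight matrix plays the same role as the heatmap we plotted above, except that here every position attends to every other position in the same sequence.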

    I’ll be covering the workings of these models and how you can implement and fine-tune them for your own downstream tasks in my next article. Stay tuned!


  • Self-attention



            While reading the landmark paper "Attention Is All You Need", I came across material on attention; the parts about self-attention in particular caught my interest, so I did quite a bit of digging. Below is a summary of what I found.


           1. Attention Mechanism

               This article focuses on self-attention, which is built on top of the attention mechanism; if you are not yet familiar with attention, I suggest reading up on it briefly first. I will also give a short explanation here.

               In an encoder-decoder model, the decoder's input includes (note: includes) the encoder's output. Common sense suggests, however, that a given output does not need all of the encoder's information, only part of it; that observation is the essence of attention. How should we understand this? Take machine translation as an example: suppose we are translating "I am a student" into the Chinese "我是一个学生". Under a plain encoder-decoder model, when emitting "学生" ("student") we use the encoder output for all of "I", "am", "a", and "student". In reality, we may not need the comparatively unimportant "I am a" at all: the information in the word "student" alone is nearly enough to produce "学生" (or put differently, "I am a" matters far less than "student"). This is where the attention mechanism comes in: it assigns a weight to each of "I", "am", "a", and "student", for example 0.1 each to "I", "am", and "a", and the remaining 0.7 to "student", so the importance of "student" stands out. I will not go into the exact computation here.
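The weighting can be made concrete with a toy calculation: given one encoder state per source word and the 0.1/0.1/0.1/0.7 weights from the example, the context vector passed to the decoder is just the weighted sum of those states (all numbers made up purely for illustration):

```python
import numpy as np

# Toy encoder states for "I", "am", "a", "student" (4 words, hidden size 3).
encoder_states = np.array([
    [0.2, 0.1, 0.0],   # I
    [0.1, 0.3, 0.2],   # am
    [0.0, 0.1, 0.1],   # a
    [0.9, 0.8, 0.7],   # student
])

# Attention weights from the example: "student" gets most of the mass.
weights = np.array([0.1, 0.1, 0.1, 0.7])

# The context vector fed to the decoder is the weighted sum of the states.
context = weights @ encoder_states
print(context)  # dominated by the "student" state
```

In a real model the weights are not hand-picked; they are produced by a softmax over learned alignment scores between the decoder state and each encoder state.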


               Self-attention is clearly one kind of attention mechanism. The attention described above weights the input against the output: in the example, the weights of "I am a student" with respect to "学生". Self-attention instead weights a sequence against itself, e.g. the weights of each word in "I am a student" with respect to "am", or with respect to "student". The purpose is to fully capture the semantic and syntactic relationships between the different words within a sentence.
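The word-against-word weighting described above can be sketched in a few lines. In this toy NumPy version each word's query, key, and value are simply its own embedding (all numbers illustrative), so the weights form an n×n matrix rather than a single row:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy embeddings for "I am a student" (4 words, dim 3); values are random.
X = np.random.default_rng(0).normal(size=(4, 3))

# Simplest self-attention: each word scores every word in the same sentence.
scores = X @ X.T / np.sqrt(X.shape[1])   # (4, 4) word-to-word similarities
A = softmax(scores, axis=-1)             # each row sums to 1
out = A @ X                              # each word becomes a mixture of all words
```

Row i of `A` holds the weights of the whole sentence with respect to word i, which is exactly the "weight of 'I am a student' with respect to 'am'" idea above.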





  • The Difference Between the Attention Mechanism and Self-Attention


           This article explains the difference between the attention mechanism and self-attention. It assumes the reader already knows about Attention, Self-Attention, the Transformer, and seq2seq models.

           In the traditional attention mechanism used in typical encoder-decoder models, the input Source and the output Target differ. In English-to-Chinese machine translation, for example, Source is the English sentence and Target is the translated Chinese sentence, and attention operates between a query element of Target and all the elements of Source. Put simply, computing the attention weights requires Target to participate: in an encoder-decoder model, the weights depend not only on the encoder's hidden states but also on the decoder's hidden states.

           Self-attention, as the name suggests, is not attention between Target and Source but attention among the elements within Source itself (or within Target itself); it can also be understood as the special case Target = Source. In the Transformer, for instance, when the weights are computed the token vectors are projected into the corresponding K, Q, and V, and all the matrix operations happen on the Source side alone, using no information from Target.
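The point that K, Q, and V all come from Source alone can be sketched as scaled dot-product self-attention. A minimal NumPy sketch; the projection matrices `Wq`, `Wk`, `Wv` are random stand-ins for what would be learned parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 5, 8                      # 5 source tokens, model dimension 8
X = rng.normal(size=(n, d))      # Source token vectors only -- no Target involved

# Learned projections in a real model; random stand-ins here.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

scores = Q @ K.T / np.sqrt(d)    # (n, n): every source token vs. every source token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V             # (n, d) contextualized source representations
```

Note that the decoder's hidden state never appears: unlike traditional encoder-decoder attention, every quantity here is derived from the same Source sequence.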

  • Attention? Attention! (translated notes on the widely read survey of attention in deep learning)
  • Attention plays a critical role in human visual experience, and it has recently been demonstrated that it can also play an important role in artificial neural networks
  • A Summary of Attention Mechanisms in Natural Language Processing
  • Attention Transfer ("Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer")
  • Attention models for neural machine translation ("Effective Approaches to Attention-based Neural Machine Translation": global vs. local attention)
  • The Attention Mechanism (Soft, Hard, Local, Static, and Self-Attention, and why attention is needed)
  • Attention code (attention models for seq2seq text processing: summarization and sentence generation)
  • Attention, please: A Survey of Neural Attention Models in Deep Learning
  • Attention Fundamentals (Hierarchical Attention Networks, the Transformer, attention-based text classification)
  • An Introduction to the BahdanauAttention (additive) and LuongAttention (multiplicative) mechanisms in TensorFlow
  • Attention Flows: Analyzing and Comparing Attention Mechanisms in Language Models
  • Self-Attention and Multi-Head Attention in the Transformer, Explained ("Attention Is All You Need")
  • A Decomposable Attention Model for Natural Language Inference
  • The Attention Mechanism: its essence, advantages, principles, and types
  • Attention Mechanism.pdf
  • Attention mechanism implementation for Keras (with mismatched Keras versions, `merge` raises an error; import the `Multiply` layer instead)