Deep Natural Language Processing 101

Written on 16 Mar 2019



Sequence-to-Sequence

As expected, a sequence-to-sequence model transforms one sequence into another using an encoder and a decoder module (two RNNs). As the encoder processes each item in the input sequence, it compiles the information into a context vector. The decoder is then conditioned on the context vector and begins generating the output sequence. An example is machine translation (say from English to French), in which the encoder reads the English sentence and maps it to a context vector, which then allows the decoder to generate the corresponding sentence in French.

Because the encoder module is a recurrent network, it processes the input sequence one element at a time; at each step it produces a new hidden state, which is used in processing the next element. The hidden state generated after processing the last element becomes the context vector, which is used to condition the decoder.
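As a minimal sketch of this idea (plain numpy, with a toy vanilla-RNN cell rather than the LSTM/GRU cells typically used in practice; all names and shapes here are illustrative assumptions):

```python
import numpy as np

def rnn_encoder(inputs, W_x, W_h, b):
    """Run a vanilla RNN over the input sequence and return the final
    hidden state, which serves as the context vector."""
    h = np.zeros(W_h.shape[0])
    for x in inputs:                         # one element at a time
        h = np.tanh(W_x @ x + W_h @ h + b)   # new hidden state from current input + previous state
    return h                                 # last hidden state = context vector

# Toy dimensions: 4-dim input elements, 8-dim hidden state.
rng = np.random.default_rng(0)
W_x, W_h, b = rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), np.zeros(8)
sequence = [rng.normal(size=4) for _ in range(5)]   # e.g. 5 embedded source tokens
context = rnn_encoder(sequence, W_x, W_h, b)
# The decoder would then be conditioned on `context` while generating the output.
```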

Attention

While this procedure works in theory, in practice the context vector turns out to be a bottleneck when reasoning about long sequences (e.g. long sentences of text). And so in 2014 and 2015, a lot of academics started to turn their attention to attention (pun 100% intended).

Attention is a technique that allows the decoder to pay specific attention to the parts of the input sequence that are most relevant at each time step. In attention-based models, the encoder passes not only the last hidden state but all the intermediate ones as well, each of which is most associated with a certain element in the input sequence. This allows the decoder, at each step of generating the output sequence, to select which of the hidden states to pay attention to, wherever they may have occurred in the input sequence.
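A rough sketch of one attention step on the decoder side (dot-product scoring is just one common choice; the names and shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(decoder_state, encoder_states):
    """Score every encoder hidden state against the current decoder state,
    then build a context vector as the weighted sum of the encoder states."""
    scores = encoder_states @ decoder_state   # one relevance score per input position
    weights = softmax(scores)                 # normalize scores into a distribution
    context = weights @ encoder_states        # weighted sum of all hidden states
    return context, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))   # all 6 intermediate hidden states, not just the last
decoder_state = rng.normal(size=8)
context, weights = attention_step(decoder_state, encoder_states)
# `weights` shows which input positions the decoder is attending to at this step.
```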

Transformer

The Transformer model outperforms the Google NMT (Neural Machine Translation) model on specific tasks by leveraging parallelization and a slightly more sophisticated encoder-decoder setup. Both the encoder and decoder components are composed of a stack of individual encoders/decoders (without any weight sharing). Each encoder is composed of a self-attention layer followed by a feed-forward network; the self-attention layer allows the encoder to look at other elements in the input sequence when encoding a specific element. Each decoder is composed of a self-attention layer, followed by an encoder-decoder attention layer which allows it to focus on relevant parts of the input sequence, followed by the usual feed-forward layer.

For each encoder, each element flows through the self-attention layer, which embeds it into a latent space, followed by the feed-forward layer - a key point is that all elements can flow through the feed-forward layer in parallel, as there are no dependencies between their computations. Let’s step through the self-attention layer. For each element e_i, we calculate a query, key and value vector by multiplying the element with three matrices learned during training. We then calculate the relevance of all other elements w.r.t. e_i by dotting the query vector of e_i with the key vector of each respective element. We then normalize all relevance scores by dividing them by the square root of the key/query dimension and running them through a softmax function. Finally, we multiply each value vector by its softmax score and sum them up to produce the output of the self-attention layer for e_i.
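The same steps, written as a small numpy sketch over a whole sequence at once (the weight matrices here are random stand-ins for the ones learned during training):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for a single head."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v   # query/key/value vectors for every element
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # relevance of every element w.r.t. every other element
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # weighted sum of value vectors, one output per element

rng = np.random.default_rng(0)
L, d_model, d_k = 5, 16, 8                # sequence length, embedding size, head size
X = rng.normal(size=(L, d_model))         # embedded input elements X_1 ... X_L
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = self_attention(X, W_q, W_k, W_v)      # shape (L, d_k): one Z vector per element
```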

We can extend this technique by using multi-headed attention, in which we have multiple sets of Q/K/V matrices. This expands the encoder’s ability to focus on different positions, since each head learns its own (initially randomized, but tuned during training) Q/K/V matrices. In multi-headed attention, we concatenate the resulting outputs of all attention heads, multiply by a weight matrix that is jointly learned with the model, and then finally pass the resulting matrix through the feed-forward network.
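A sketch of the multi-headed version, again with random stand-ins for the learned matrices (the per-head helper mirrors the single-head sketch above):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def head(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

def multi_head_attention(X, heads, W_o):
    """Run each head independently, concatenate the results, and project
    back to the model dimension with the jointly learned matrix W_o."""
    Z = np.concatenate([head(X, *w) for w in heads], axis=-1)
    return Z @ W_o

rng = np.random.default_rng(0)
L, d_model, n_heads = 5, 16, 4
d_k = d_model // n_heads
X = rng.normal(size=(L, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d_model))
out = multi_head_attention(X, heads, W_o)   # shape (L, d_model), then fed to the feed-forward layer
```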

Let’s summarize what happens in each encoder:

  1. Take the input as a sequence of vectors (length L)
  2. Embed each element E into a vector X (X_1 … X_L)
  3. Use N different heads, each with its own Q/K/V weight matrices (3 * N matrices in total), to produce query/key/value vectors for each element E
  4. Calculate attention with each head (N Z matrices, i.e. N Z vectors for each E)
  5. Concatenate the N Z matrices and multiply by the output weight matrix to produce the final output (1 Z vector for each E)

The decoder uses much of the same construction. The output of the last encoder module (L Z vectors) is transformed into a pair of K/V matrices that are fed to the decoder modules. At each timestep, the decoder receives the previously generated element as input. The encoder-decoder attention layer creates its Q matrix from the layer below it and takes the K/V matrices from the output of the encoder stack; the decoder's self-attention layer is only allowed to look at earlier positions in the output sequence. The output of the decoder stack is a vector of floats, which a final linear layer, followed by a softmax layer, turns into an element to be output. An element can be generated from the output of the softmax layer in various ways. We can simply take the highest probability (greedy decoding). Alternatively, we can keep the N most probable partial sequences at each step, extend each of them, and finally return the M best complete candidates (beam search). Hyper-parameter N is called beam_size and M top_beams.
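To make the two decoding strategies concrete, here is a small sketch over a hypothetical next_log_probs(prefix) function that returns log-probabilities for the next element given a prefix (any autoregressive model would do; the function name, start/end markers and parameters are assumptions for illustration):

```python
import numpy as np

def greedy_decode(next_log_probs, start, end, max_len=20):
    """At each step, output the single most probable next element."""
    seq = [start]
    while seq[-1] != end and len(seq) < max_len:
        seq.append(int(np.argmax(next_log_probs(seq))))
    return seq

def beam_search(next_log_probs, start, end, beam_size=4, top_beams=1, max_len=20):
    """Keep the `beam_size` most probable partial sequences at each step,
    and return the `top_beams` best candidates at the end."""
    beams = [([start], 0.0)]                        # (sequence, total log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                      # finished sequence; carry it along unchanged
                candidates.append((seq, score))
                continue
            log_probs = next_log_probs(seq)
            for tok in np.argsort(log_probs)[-beam_size:]:   # expand with the most probable next elements
                candidates.append((seq + [int(tok)], score + log_probs[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return [seq for seq, _ in beams[:top_beams]]
```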