
Transformer understanding

The transformer model was introduced in 2017 in Google's paper Attention Is All You Need.

The image below shows the architecture of the transformer model, taken from Google's paper Attention Is All You Need.

Both the encoder and the decoder consist of 6 layers.

Encoder and Decoder Stacks

  • The output of layer \(l\) is the input of layer \(l+1\).
  • The left side is a stack of N = 6 identical encoder layers.

    Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.

  • The right side is a stack of N=6 identical decoder layers.

    The decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.

  • In each layer, there is a residual connection around each of the two sub-layers, followed by layer normalization, to make sure that information such as the positional encoding is not lost. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself (see the sketch after this list).

  • To facilitate these residual connections, all sub-layers in the model, as well as the embedding layers, produce outputs of dimension \(d_{model}=512\).
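
As a structural sketch in Python (not code from the paper), one encoder layer composes its two sub-layers like this; `self_attention`, `feed_forward`, and `layer_norm` are placeholders for the components described in the sections below:

```python
def encoder_layer(x, self_attention, feed_forward, layer_norm):
    # Each sub-layer's output is LayerNorm(x + Sublayer(x)).
    x = layer_norm(x + self_attention(x))   # sub-layer 1: multi-head self-attention
    x = layer_norm(x + feed_forward(x))     # sub-layer 2: position-wise feed-forward
    return x

def encoder_stack(x, layers):
    # N = 6 identical layers; the output of layer l is the input of layer l + 1.
    for self_attention, feed_forward, layer_norm in layers:
        x = encoder_layer(x, self_attention, feed_forward, layer_norm)
    return x
```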

Encoder

Input Embedding

Convert each input token into a vector of dimension \(d_{model}=512\).
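
A minimal sketch of the embedding lookup, assuming a toy vocabulary size and randomly initialized weights (in a trained model the embedding matrix is learned; the paper also scales the embeddings by \(\sqrt{d_{model}}\)):

```python
import numpy as np

d_model, vocab_size = 512, 10000                          # vocab_size is an assumed toy value
embedding = np.random.randn(vocab_size, d_model) * 0.01   # learned in practice

def embed(token_ids):
    # Look up each token id and scale by sqrt(d_model).
    return embedding[token_ids] * np.sqrt(d_model)

x = embed(np.array([5, 42, 7]))                           # shape (3, 512)
```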

Positional Encoding

To encode positional information into the token embeddings, the following formulas are used to compute the positional encoding.

\[ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right) \]
\[ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right) \]
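
A NumPy sketch of these formulas, with one row per position (illustrative only):

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model / 2), the even indices 2i
    angle = pos / np.power(10000, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = positional_encoding(max_len=100)            # added element-wise to the token embeddings
```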

Multi-Head Attention

For each token \(x_n\), the positionally encoded input vector has dimension \(d_{model}=512\):

\[ pe(x_n)=[d_1=9.09297407\times 10^{-1},\ d_2=-4.16146845\times 10^{-1},\ \dots,\ d_{512}=1.00000000] \]
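
The first two components above are simply \(\sin(2)\) and \(\cos(2)\), i.e. the \(i=0\) pair of the positional encoding at position \(pos=2\) (assuming that is the position being illustrated); a quick check:

```python
import numpy as np
print(np.sin(2.0), np.cos(2.0))   # 0.9092974268256817 -0.4161468365471424
```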

The outputs of the \(h=8\) attention heads: \(Z=(Z_0, Z_1, Z_2, Z_3, Z_4, Z_5, Z_6, Z_7)\)

Final output: \(MultiHead(Q, K, V)=Concat(Z_0, Z_1, Z_2, Z_3, Z_4, Z_5, Z_6, Z_7)W^O\), one vector of dimension \(d_{model}\) per token.

In each attention head, the word matrix has three representations:

  • Query matrix \(Q\), dimension \(d_q=64\)
  • Key matrix \(K\), dimension \(d_k=64\)
  • Value matrix \(V\), dimension \(d_v=64\)

In the original transformer model, the attention can be described as,

\[ Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V \]
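
A NumPy sketch of scaled dot-product attention and the 8-head multi-head attention built on it; the projection matrices here are random stand-ins for the learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(x, W_q, W_k, W_v, W_o, h=8):
    # Project x into per-head Q, K, V (d_q = d_k = d_v = 64), run attention
    # in each head, concatenate the head outputs Z_i, and project back to d_model.
    heads = []
    for i in range(h):
        Q, K, V = x @ W_q[i], x @ W_k[i], x @ W_v[i]
        heads.append(attention(Q, K, V))            # Z_i, shape (seq_len, 64)
    return np.concatenate(heads, axis=-1) @ W_o     # (seq_len, d_model)

# Toy usage with randomly initialized projections.
seq_len, d_model, h, d_k = 5, 512, 8, 64
x = np.random.randn(seq_len, d_model)
W_q = np.random.randn(h, d_model, d_k) * 0.01
W_k = np.random.randn(h, d_model, d_k) * 0.01
W_v = np.random.randn(h, d_model, d_k) * 0.01
W_o = np.random.randn(h * d_k, d_model) * 0.01
z = multi_head_attention(x, W_q, W_k, W_v, W_o)     # (5, 512)
```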

Add & Norm

Normalization can be described as \(LayerNormalization(x+Sublayer(x))\)

  • \(Sublayer\) is the sub-layer itself, and \(x\) is its input.
  • vector \(v = x + Sublayer(x)\)

There are many ways to do layer normalization; one way is:

\[ LayerNormalization(v)=\gamma\frac{v-\mu}{\sigma} + \beta \]
  • \(\mu=\frac{1}{d}\sum_{k=1}^{d}v_k\)
  • \(\sigma^2=\frac{1}{d}\sum_{k=1}^{d}(v_k-\mu)^2\)
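
A minimal NumPy version of this normalization, assuming scalar \(\gamma\) and \(\beta\) for simplicity (in practice they are learned per-feature parameters) and a small epsilon for numerical stability:

```python
import numpy as np

def layer_normalization(v, gamma=1.0, beta=0.0, eps=1e-6):
    # v = x + Sublayer(x); normalize each vector over its d features.
    mu = v.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(((v - mu) ** 2).mean(axis=-1, keepdims=True))
    return gamma * (v - mu) / (sigma + eps) + beta

x = np.random.randn(5, 512)
sublayer_out = np.random.randn(5, 512)   # stand-in for Sublayer(x)
out = layer_normalization(x + sublayer_out)
```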

Feed Forward

The input and output dimension is \(d_{model}=512\), and the inner layer has dimension \(d_{ff}=2048\):

\[ FFN(x)=max(0, xW_1 + b_1)W_2 + b_2 \]
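
A NumPy sketch of the position-wise feed-forward network; the weights are random stand-ins for learned parameters:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2: a ReLU between two linear transformations.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048                          # inner dimension from the paper
W1 = np.random.randn(d_model, d_ff) * 0.01
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.01
b2 = np.zeros(d_model)
out = ffn(np.random.randn(5, d_model), W1, b1, W2, b2)   # (5, 512)
```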

Decoder

There is also a residual connection around each of the two sub-layers, followed by layer normalization:

\[ LayerNormalization(x+Sublayer(x)) \]

Input Embedding & Positional Encoding

They are the same as in the encoder.

Masked Multi-Head Attention

The masked self-attention sub-layer prevents each position from attending to subsequent positions, so the prediction for position \(i\) can depend only on the known outputs at positions before \(i\). In the encoder-decoder attention sub-layer that follows, the input is \(Input\_Attention=(Output\_previous\_decoder\_sublayer(Q),\ Output\_encoder\_stack(K, V))\): the queries come from the previous decoder sub-layer, while the keys and values come from the output of the encoder stack.
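
A NumPy sketch of the masking: before the softmax, the scores for future positions are set to a large negative value so each position can attend only to itself and earlier positions. This is an illustrative variant of the `attention` function sketched earlier, not code from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    # Scaled dot-product attention with a causal mask:
    # scores above the diagonal (future positions) are pushed toward -inf.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -1e9, scores)
    return softmax(scores) @ V

seq_len, d_k = 5, 64
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_k)
z = masked_attention(Q, K, V)   # (5, 64); row i ignores positions > i
```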

FFN Sub-Layer, Add & Norm and Linear

The transformer generates the output sequence one element at a time:

\[ Output\_sequence=(y_1, y_2, ..., y_n) \]

The linear layer generates the output; its size varies based on the model, but it takes the form \(y = w \cdot x + b\).

\(w\) and \(b\) are learnable parameters.
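
A minimal sketch of this final projection, assuming a toy vocabulary size; a softmax over the resulting scores (not shown here) then gives the next-token probabilities:

```python
import numpy as np

d_model, vocab_size = 512, 10000                 # vocab_size is an assumed toy value
W = np.random.randn(d_model, vocab_size) * 0.01  # learnable weight
b = np.zeros(vocab_size)                         # learnable bias

def linear(x):
    # y = x W + b: one score per vocabulary entry for each decoder position.
    return x @ W + b

decoder_out = np.random.randn(1, d_model)        # output at the last decoder position
logits = linear(decoder_out)                     # (1, vocab_size)
```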

to be continued...