Implementation details: From the original Transformer to GPT
1 The original Transformer reviewed
1.1 Attention block
1.1.1 Scaled dot-product attention
 Queries and attention outputs are mapped “one-to-one”. For each query (a vector), we need to compute an output.
 So, if we have $n$ queries (i.e., Q is $n\times d$), the output needs to be $n\times d$.
 The output for each query is a weighted sum of values.
 Softmax is applied to each row of $QK^T$, so every row of the attention weights sums to 1, i.e., sum(i,:)=1
When we compute $QK^T$, we compute many dot products between vectors of length $d_{k}$. When $d_{k}$ is large (e.g., 512 or 4096), the variance of the results becomes large. That is, if we pick one row of $QK^T$ (recall that softmax is applied to rows), we will observe that some elements are very small (strongly negative) while others are very large. If we apply softmax to this row, the resulting weights will be pushed towards either 0 or 1.
This leads to small gradient updates, in the same way that the gradient at the two tails of a sigmoid function is close to zero.
If we divide by $\sqrt{d_{k}}$, the softmax inputs stay closer to the center of the distribution, where softmax is not saturated, and the gradients become larger.
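The brief explanation from the authors is that if the components of $q$ and $k$ are independent random variables with mean 0 and variance 1, then their dot product $q\cdot k=\sum_{i=1}^{d_{k}}q_{i}k_{i}$ has mean 0 and variance $d_{k}$.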
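The variance argument can be checked numerically. The following is a minimal sketch in plain Python (no PyTorch, so it runs anywhere). The names `dot_product_variance` and `softmax` are made up for this illustration, and the i.i.d. standard-normal components are an assumption, not something measured from a trained model.

```python
import math
import random


def dot_product_variance(d_k, n_samples=2000, scale=False, seed=0):
    """Empirical variance of q.k when q and k have i.i.d. N(0, 1) components."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        s = sum(qi * ki for qi, ki in zip(q, k))
        if scale:
            s /= math.sqrt(d_k)  # the 1/sqrt(d_k) factor
        values.append(s)
    mean = sum(values) / n_samples
    return sum((v - mean) ** 2 for v in values) / n_samples


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(dot_product_variance(64))              # roughly d_k = 64
    print(dot_product_variance(64, scale=True))  # roughly 1
    print(softmax([0.0, 8.0, 16.0]))             # almost one-hot (saturated)
    print(softmax([0.0, 1.0, 2.0]))              # same scores divided by 8
```

The last two calls use the same scores before and after dividing by $\sqrt{64}=8$: the unscaled row puts almost all weight on one element, while the scaled row keeps a softer distribution.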
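1.1.2 Multi-head attention (MHA)
If we only use dot-product attention, there are no learnable parameters! MHA adds them through the per-head projections $W^Q$, $W^K$, $W^V$ and the output projection $W^O$.
 Each head can learn a different attention pattern.
 Each head is like a “filter” in a CNN.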
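To make the "learnable parameters" point concrete, here is a small sketch that counts the weights of one MHA layer. The function name `mha_param_count` is hypothetical, and it assumes $d_k=d_v=d_{model}/h$ as in the original paper; the optional bias term is an assumption borrowed from common implementations.

```python
def mha_param_count(d_model, n_heads, bias=False):
    """Number of learnable parameters in one multi-head attention layer."""
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_k = d_v = d_model // n_heads  # per-head size (assumption: d_k == d_v)
    per_head = 2 * d_model * d_k + d_model * d_v  # W^Q, W^K, W^V of one head
    w_o = (n_heads * d_v) * d_model               # output projection W^O
    total = n_heads * per_head + w_o
    if bias:
        total += 4 * d_model  # one bias vector for each of the 4 projections
    return total


if __name__ == "__main__":
    print(mha_param_count(512, 8))  # equals 4 * 512 * 512
```

Note that the count does not depend on the number of heads: splitting into more heads only slices the same $d_{model}\times d_{model}$ projections into narrower pieces.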
1.1.3 Cross attention (Encoder & decoder)
1.2 FFN block
FFN is simply an MLP applied to the last dimension. The output of the attention block is $n\times d_{model}$, where $n$ is the sequence length (number of tokens). Let $x$ be the embedding of one token (a vector). What FFN does is: $$ \text{FFN}(x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2} $$ $W_{1}$ projects from $d_{model}$ to $4\times d_{model}$, and $W_{2}$ projects it back to $d_{model}$.
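The formula translates almost directly into code. This is a dependency-free sketch with nested lists standing in for tensors; `matvec`, `ffn`, and `ffn_sequence` are illustrative names, and nothing here enforces the 4x hidden width.

```python
def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single token vector x."""
    hidden = [max(0.0, h + b) for h, b in zip(matvec(x, W1), b1)]  # ReLU
    return [o + b for o, b in zip(matvec(hidden, W2), b2)]


def ffn_sequence(X, W1, b1, W2, b2):
    """Apply the same FFN to every token independently (position-wise)."""
    return [ffn(x, W1, b1, W2, b2) for x in X]


if __name__ == "__main__":
    W1 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]    # 2 -> 3 (toy sizes)
    W2 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 -> 2
    print(ffn([1.0, -1.0], W1, [0.0, 0.0, 0.0], W2, [0.5, -0.5]))
```

Because `ffn_sequence` just maps `ffn` over the tokens, no information moves between positions in this block; mixing across tokens only happens in attention.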
1.3 Embedding and positional encoding
 In the original paper, the authors use the same embedding matrix in three places: 1) the encoder input, 2) the decoder input, and 3) the final linear layer before the softmax (where each token produces $V$ logits; $V$ is the vocabulary size)
 embedding matrix: $V\times d_{model}$
 The authors also multiply the embeddings by $\sqrt{d_{model}}$
 An observation is that after training, the l2 norm of an embedding vector is usually very small and doesn’t increase with $d_{model}$. For example, when $d_{model}$ is increased from 512 to 4096, the l2 norm of the embedding may still be about 1.
 However, the l2 norm of the positional encoding DOES increase with the vector length $d_{model}$: each sin/cos pair contributes 1 to the squared norm, so the norm is $\sqrt{d_{model}/2}$.
 So, if we don’t scale the embedding, it will be dominated by the positional encoding when they’re added up.
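The growth of the positional-encoding norm is easy to verify. Below is a short sketch of the sinusoidal encoding from the original paper; the helper names `sinusoidal_pe` and `l2_norm` are made up here, and `d_model` is assumed to be even.

```python
import math


def sinusoidal_pe(pos, d_model):
    """Sinusoidal positional encoding for one position (d_model must be even)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))  # even index
        pe.append(math.cos(angle))  # odd index
    return pe


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


if __name__ == "__main__":
    for d_model in (512, 4096):
        # sin^2 + cos^2 = 1 per pair, so the norm is sqrt(d_model / 2)
        print(d_model, l2_norm(sinusoidal_pe(10, d_model)), math.sqrt(d_model / 2))
```

An embedding with l2 norm around 1, once multiplied by $\sqrt{d_{model}}$, has norm around $\sqrt{d_{model}}$, which is on the same scale as the $\sqrt{d_{model}/2}$ of the positional encoding.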
1.4 Dropout
 Residual dropout: applied to the sublayer output before it is added to the sublayer input (see “Drop1” and “Drop2” in the following pre-LN example; note that “Drop” is the FFN dropout)
x + Drop1(MHA(LN(x))) (Attention sublayer)
x + Drop2(Linear2(Drop(activation(Linear1(LN(x)))))) (FFN sublayer)
 Attention dropout:
 It’s applied after softmax, before multiplying V (i.e., on the attention weights).
 $Drop\left(Softmax\left( \frac{QK^T}{\sqrt{d_{k}}}\right) \right)V$
 Embedding dropout: after the sum of embedding and positional encoding
Drop(input_embed + pos_enc)
 FFN dropout: only for the hidden layer in FFN. (The “Drop” in the above example)
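All four dropout sites use the same operation and differ only in which tensor it is applied to. As a rough illustration, here is inverted dropout (the variant used by common deep-learning libraries) written in plain Python; the function `dropout` and its signature are assumptions made for this sketch.

```python
import random


def dropout(values, p, rng, training=True):
    """Inverted dropout: zero each element with probability p and
    rescale the survivors by 1 / (1 - p) so the expectation is unchanged."""
    if not training or p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


if __name__ == "__main__":
    rng = random.Random(0)
    attn_row = [0.2, 0.3, 0.5]          # one softmax row; sums to 1
    dropped = dropout(attn_row, 0.1, rng)
    print(dropped, sum(dropped))        # the row usually no longer sums to 1
```

For attention dropout this means the weights multiplied with V are generally not a normalized distribution during training; at evaluation time dropout is disabled and the rows sum to 1 again.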
1.5 LayerNorm
 There are two LayerNorm placements: pre-LN and post-LN. In the original Transformer paper, the authors use post-LN, but GPT and later models prefer pre-LN.
 Pre-LN makes the gradients smoother. See (Xiong et al., 2020).
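The difference between the two placements fits in two one-line functions. This is a simplified sketch: `layer_norm` omits the learnable gain and bias, and `sublayer` is any function (attention or FFN) passed in by the caller; all names are illustrative.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm over one token vector, without learnable gain/bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Original Transformer: LN(x + Sublayer(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """GPT-style: x + Sublayer(LN(x)); the residual path bypasses the LN."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    zero = lambda v: [0.0] * len(v)  # a sublayer that contributes nothing
    print(pre_ln_block(x, zero))     # x passes through unchanged
    print(post_ln_block(x, zero))    # x is still renormalized
```

With pre-LN the input reaches the output through an untouched identity path, whereas post-LN renormalizes the residual stream at every sublayer.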


2 PyTorch implementation of Transformer
Last update: PyTorch 2.0.1
2.1 Config
norm_first: If True, use pre-norm. Default: False (post-norm).
dim_feedforward: Default 2048.
activation: Default “relu”.
dropout: Default 0.1.
2.2 Encoder


2.3 Decoder


3 Compare with GPT
Below, I show how GPTs differ from the original Transformer. Since the GPT family is not open source, the GPT code is from Hugging Face’s implementation of GPT-2. OpenAI stated that GPT-3 uses the same architecture as GPT-2, except for the Sparse Transformer part.
 In the original paper, the decoder has three sublayers (SelfAttn, CrossAttn, FFN) because it needs input from the encoder
 GPT-2 has no “CrossAttn”
 So, GPT’s decoder is equivalent to an encoder, except for the mask in the SelfAttn (See this SO post).
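The mask mentioned above is a lower-triangular (causal) mask: position i may only attend to positions j ≤ i. A minimal sketch in plain Python, with hypothetical helper names `causal_mask` and `masked_softmax`:

```python
import math


def causal_mask(n):
    """mask[i][j] is True when query i may attend to key j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax where disallowed positions get a score of -inf."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        m = max(masked)  # finite: the diagonal is always allowed
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


if __name__ == "__main__":
    scores = [[0.0] * 3 for _ in range(3)]  # uniform scores for clarity
    for row in masked_softmax(scores, causal_mask(3)):
        print(row)
```

With uniform scores, each row spreads its weight evenly over the current and earlier tokens and gives exactly zero weight to future tokens.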
3.1 Tokenizer: A variant of BPE
 Works on bytes, but avoids merges across character categories (e.g., punctuation and letters are not allowed to merge), except for spaces.
 e.g., "Hello world" => ["Hello", " world"]. Notice there’s a space in front of “world.”
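The "no merges across categories" rule is enforced by splitting the text into chunks before any BPE merge happens. The sketch below imitates that pre-tokenization step with a simplified, ASCII-only regular expression. It is an approximation written for this note, not the actual GPT-2 pattern (which uses Unicode categories and handles contractions and trailing whitespace differently); `PRETOKENIZE` and `pre_tokenize` are made-up names.

```python
import re

# Simplified stand-in: optional leading space + letters, digits, or other symbols.
PRETOKENIZE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pre_tokenize(text):
    """Split text into chunks; BPE merges would then run inside each chunk only."""
    return PRETOKENIZE.findall(text)


if __name__ == "__main__":
    print(pre_tokenize("Hello world"))
    print(pre_tokenize("Hi, there!"))
```

Since letters and punctuation land in different chunks, a merge such as "there" + "!" can never be learned, while the leading space stays attached to the word that follows it.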
3.2 Embedding & positional encoding
 Both are learned
 Embedding is not scaled before adding to positional encoding (PE)
 In the original Transformer, input_embeds is multiplied by $\sqrt{d_{model}}$. That’s because the PE is determined by sin/cos (not learned!) and its l2 norm increases with $d_{model}$.
 But in GPT, PE is learned. Therefore, PE and input_embeds can be of similar scale.
3.3 LayerNorm
 It uses “pre-LN”
 Another LN is added after the final attention block


3.4 Dropout
 GPTs have dropout in residual, embedding, and attention, same as the original Transformer (drop=0.1)
 GPT has no dropout in FFN!
3.5 Activation: GELU
(Hendrycks & Gimpel, 2016) $$ \text{GELU}(x) = x\Phi(x) $$ where $\Phi(x)$ is the CDF of the standard normal distribution. It can be approximated with: $$ 0.5 \cdot x \cdot \left( 1 + \tanh\left[ \sqrt{2/\pi}\left(x+0.044715x^3 \right)\right] \right) $$
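To see how close the tanh form is to the exact definition, both can be written with the standard library alone, using $\Phi(x)=\frac{1}{2}\left(1+\text{erf}(x/\sqrt{2})\right)$. The function names are illustrative.

```python
import math


def gelu_exact(x):
    """x * Phi(x), with Phi the standard normal CDF expressed via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """The tanh approximation given above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(x, gelu_exact(x), gelu_tanh(x))
```

On a grid over [-5, 5] the two versions stay within about 1e-3 of each other, which is why the cheaper tanh form is a common drop-in.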


3.6 Initialization
 Linear and conv1d weights are normal with mean=0, std=0.02
 Embedding & PE are normal with mean=0, std=0.02; the row at padding_idx (if set) is zeroed
 Layer norm starts with no effective affine transformation (weight=1, bias=0)
 (important!) Reinit selected weights
c_proj is the $W^O$ matrix in the original paper; it’s $d_{model}\times d_{model}$. The concatenated attention outputs are multiplied by $W^O$ before being sent to the residual path. I still do not fully understand why “training signals will accumulate through the residual path.”
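One way to build intuition for the accumulation argument is a toy model: treat every residual branch as adding independent zero-mean noise to the stream, so the variance of the sum grows linearly with the number of branches. Scaling each branch by $1/\sqrt{N}$ cancels that growth. The sketch below assumes this toy model; as far as I can tell, the Hugging Face GPT-2 code applies the idea by reinitializing `c_proj.weight` with std $0.02/\sqrt{2\cdot n_{layer}}$ (two residual branches per layer), but the function names and the simulation here are my own illustration.

```python
import math
import random


def scaled_init_std(n_layer, base_std=0.02):
    """Std for the residual output projections: base_std / sqrt(2 * n_layer)."""
    return base_std / math.sqrt(2 * n_layer)


def residual_sum_variance(n_branches, branch_std, n_samples=5000, seed=0):
    """Variance of a sum of n_branches independent N(0, branch_std^2) terms."""
    rng = random.Random(seed)
    totals = [sum(rng.gauss(0.0, branch_std) for _ in range(n_branches))
              for _ in range(n_samples)]
    mean = sum(totals) / n_samples
    return sum((t - mean) ** 2 for t in totals) / n_samples


if __name__ == "__main__":
    n = 2 * 12  # 12 layers, each with an attention and an FFN branch
    print(residual_sum_variance(n, 1.0))                 # grows like n
    print(residual_sum_variance(n, 1.0 / math.sqrt(n)))  # stays near 1
    print(scaled_init_std(12))
```

Under these assumptions the unscaled residual stream has variance proportional to depth, while the scaled version keeps it roughly constant regardless of how many layers are stacked.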