Implementation details: From the original Transformer to GPT
1 The original Transformer reviewed
1.1 Attention block
1.1.1 Scaled dot-product attention


- Query and the attention output are one-to-one mapped. For each query (a vector), we compute one output.
- So, if we have $n$ queries (i.e., $Q$ is $n \times d_k$), the output is $n \times d_v$.
- The output for each query is a weighted sum of values.
- Softmax is applied to the rows of $\frac{QK^T}{\sqrt{d_k}}$, i.e., $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$.
When we compute $QK^T$, we compute many vector dot products of size $d_k$. When $d_k$ is large (e.g., 512, 4096), the variance of the result becomes large. Selecting a row (recall softmax is row-wise) of $QK^T$, some elements are very small while others are very large. If we apply softmax on this row, the result will be skewed toward 0 or 1.
This leads to small gradient updates: softmax saturates much like a sigmoid, and the gradient at the two tails of a sigmoid is close to zero.

Dividing by $\sqrt{d_k}$ keeps the softmax inputs closer to the center of the distribution, leading to larger gradients.
The brief explanation from the authors (Vaswani et al., 2017) is: "We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$."

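A minimal sketch of scaled dot-product attention in PyTorch (the function name and shapes below are my own, not from the paper's code):
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # (n_q, n_k)
    if mask is not None:
        scores = scores + mask                           # e.g., -inf on masked positions
    weights = torch.softmax(scores, dim=-1)              # softmax over each row
    return weights @ V                                   # (n_q, d_v): one output per query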
1.1.2 Multi-head attention (MHA)
If we only used dot-product attention, there would be no learnable parameters. MHA therefore first projects Q, K, and V with learned matrices $W_i^Q$, $W_i^K$, $W_i^V$ (one set per head), runs scaled dot-product attention per head, concatenates the heads, and applies an output projection $W^O$.
- Each head can learn different attention patterns.
- Each head is like a filter, as in CNNs.


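A rough sketch of multi-head attention as described above (class and variable names are mine; real implementations usually fuse the per-head projections and add dropout/masking):
import torch
import torch.nn as nn

class SimpleMHA(nn.Module):
    # minimal multi-head attention: no dropout, no masking
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)       # the W^O matrix

    def forward(self, q, k, v):
        # q: (B, T_q, d_model); k, v: (B, T_kv, d_model)
        B, T_q, _ = q.shape
        def split(x):
            # (B, T, d_model) -> (B, n_heads, T, d_head)
            return x.view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        q, k, v = split(self.q_proj(q)), split(self.k_proj(k)), split(self.v_proj(v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        out = torch.softmax(scores, dim=-1) @ v           # each head attends independently
        out = out.transpose(1, 2).reshape(B, T_q, -1)     # concatenate the heads
        return self.out_proj(out)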
1.1.3 Cross attention (Encoder & decoder)
- In cross attention, the queries come from the decoder, while the keys and values come from the encoder output (the "memory"); the attention computation itself is unchanged.
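Continuing the SimpleMHA sketch above, cross attention is the same module fed with different inputs (the tensors here are random placeholders):
decoder_x = torch.randn(2, 10, 512)             # decoder hidden states
memory = torch.randn(2, 16, 512)                # encoder output
cross_attn = SimpleMHA(d_model=512, n_heads=8)
out = cross_attn(decoder_x, memory, memory)     # Q from the decoder; K, V from the encoder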
1.2 FFN block
FFN is simply an MLP applied to the last dimension. The output of the attention block is $n \times d_{\text{model}}$, where $n$ is the sequence length (number of tokens). Let $x$ be the embedding of one token (a vector of size $d_{\text{model}}$). FFN does $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$: $W_1$ projects $x$ from $d_{\text{model}}$ to $d_{ff}$, and $W_2$ projects it back to $d_{\text{model}}$.
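A minimal FFN sketch using the paper's defaults ($d_{\text{model}} = 512$, $d_{ff} = 2048$); the variable names are mine:
import torch.nn as nn

d_model, d_ff = 512, 2048
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),    # W1: d_model -> d_ff
    nn.ReLU(),
    nn.Linear(d_ff, d_model),    # W2: d_ff -> d_model
)
# applied position-wise: (batch, seq_len, d_model) -> (batch, seq_len, d_model)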
1.3 Embedding and positional encoding
- In the original paper, the authors use the same embedding matrix in three places: 1) encoder, 2) decoder, and 3) the final FF layer before softmax (where for each token you produce $V$ logits, with $V$ the vocabulary size)
- embedding matrix: $V \times d_{\text{model}}$
- The authors also multiply the embedding matrix by $\sqrt{d_{\text{model}}}$.
- After training, the norm of an embedding vector is usually very small and does not increase with $d_{\text{model}}$. For example, when $d_{\text{model}}$ increases from 512 to 4096, the norm of the embedding may still be ~1.
- However, the norm of the positional encoding does increase with the vector length $d_{\text{model}}$: its entries are sin/cos values, so a PE vector has norm about $\sqrt{d_{\text{model}}/2}$.
- So, if we do not scale the embedding, it will be dominated by the positional encoding when added.
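A sketch of the scaling step (assuming pos_enc is a precomputed sin/cos table; the names are mine):
import math
import torch
import torch.nn as nn

vocab_size, d_model, seq_len = 32000, 512, 128
embed = nn.Embedding(vocab_size, d_model)
pos_enc = torch.zeros(seq_len, d_model)              # stand-in for the sin/cos table

tokens = torch.randint(0, vocab_size, (1, seq_len))
x = embed(tokens) * math.sqrt(d_model) + pos_enc     # scale so PE does not dominate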
1.4 Dropout
- Residual dropout: before adding to the sublayer input (see “Drop1” and “Drop2” in the pre-LN example below; “Drop” is for FFN)
x + Drop1(MHA(LN(x)))                                  # Attention sublayer
x + Drop2(Linear2(Drop(activation(Linear1(LN(x))))))   # FFN sublayer
- Attention dropout: applied after softmax, before multiplying V (i.e., on the attention weights); see the short sketch after this list.
- Embedding dropout: after summing embedding and positional encoding
Drop(input_embed + pos_enc)
- FFN dropout: only for the hidden layer in FFN.
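The attention-dropout placement, as a tiny self-contained sketch (toy tensors, names are mine):
import torch
import torch.nn as nn

attn_dropout = nn.Dropout(0.1)
scores = torch.randn(4, 4)                              # toy attention scores (n_q x n_k)
V = torch.randn(4, 8)                                   # toy values
weights = attn_dropout(torch.softmax(scores, dim=-1))   # dropout on the attention weights
out = weights @ V                                       # then multiply by V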
1.5 LayerNorm
- There are two variants: pre-LN and post-LN. In the original Transformer, the authors use post-norm, but GPT and later models prefer pre-LN.
- Pre-LN makes gradients smoother; see Xiong et al. (2020).
# pre-norm (preferred)
x = x + MHA(LN(x))
x = x + FFN(LN(x))
# post-norm
x = LN(x + MHA(x))
x = LN(x + FFN(x))
2 PyTorch implementation of Transformer
last update: pytorch-2.0.1
2.1 Config
- norm_first: if True, use pre-norm. Default: False (post-norm).
- dim_feedforward: default 2048.
- activation: default "relu".
- dropout: default 0.1.
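For example, building a pre-norm encoder layer with these arguments (the non-default values below are only for illustration):
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=512,
    nhead=8,
    dim_feedforward=2048,   # default
    dropout=0.1,            # default
    activation="relu",      # default
    norm_first=True,        # switch from the default post-norm to pre-norm
    batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=6)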
2.2 Encoder
# How TransformerEncoderLayer.forward works
# x: input source
# Drop1, Drop2: residual dropout
# Drop: FFN dropout
# note: attention dropout is applied inside SA() (on the attention weights); Drop1/Drop2 are the residual dropouts
# Pre-LN (preferred)
x = x + Drop1(SA(LN(x))) # Self-Attn sublayer
x = x + Drop2(Linear2(Drop(Activation(Linear1(LN(x)))))) # FFN sublayer
# Post-LN
x = LN(x + Drop1(SA(x))) # Self-Attn sublayer
x = LN(x + Drop2(Linear2(Drop(Activation(Linear1(x)))))) # FFN sublayer
2.3 Decoder
# How DecoderLayer.forward works
# x: the "Q"
# memory: the "K and V", from encoder
# Drop1, Drop2, Drop3: residual dropout
# Drop: FFN dropout
# note: attention dropout is applied inside SA() and MHA(); Drop1-Drop3 are the residual dropouts
# Pre-LN (preferred)
x = x + Drop1(SA(LN1(x))) # SA sublayer
x = x + Drop2(MHA(LN2(x), memory)) # Multi-Head Attn
x = x + Drop3(Linear2(Drop(Activation(Linear1(LN3(x)))))) # FFN sublayer
# Post-LN
x = LN1(x + Drop1(SA(x))) # SA
x = LN2(x + Drop2(MHA(x, memory))) # MHA
x = LN3(x + Drop3(Linear2(Drop(Activation(Linear1(x)))))) # FFN sublayer
3 Compare with GPT
Below, I show how GPTs differ from the original Transformer. Since OpenAI has not open-sourced the newer GPT models, the GPT code here is from Hugging Face's implementation of GPT-2. OpenAI reports that GPT-3 uses the same architecture as GPT-2, except that it alternates dense and locally banded sparse attention patterns, as in the Sparse Transformer.
- In the original paper, the decoder has three sublayers: SelfAttn, CrossAttn, FFN, because it needs input from the encoder.
- GPT-2 has no CrossAttn.
- So, GPT’s decoder is equivalent to an encoder, except for the mask in SelfAttn (see this post).
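The mask in masked self-attention is a causal (lower-triangular) mask; a sketch using the additive convention (add it to the attention scores before softmax):
import torch

seq_len = 5
# -inf above the diagonal: a token cannot attend to future positions
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
# scores = scores + causal_mask, then softmax row-wise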
3.1 Tokenizer: A variant of BPE
- Works on bytes, but avoids merges across character categories (e.g., punctuation and letters are not allowed to merge), except for spaces.
- e.g., "Hello world" => ["Hello", " world"]. Notice the leading space before "world".
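For instance, with Hugging Face's GPT-2 tokenizer (in the printed tokens, the Ġ symbol stands for a leading space):
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
print(tok.tokenize("Hello world"))       # ['Hello', 'Ġworld']
print(tok("Hello world")["input_ids"])   # two token ids, one per sub-token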
3.2 Embedding & positional encoding
- Both are learned
Source: Hugging Face
- Embedding is not scaled before adding to positional encoding (PE)
- In the original Transformer, input_embeds is multiplied by $\sqrt{d_{\text{model}}}$. That's because PE is based on sin/cos (not learned) and its norm increases with $d_{\text{model}}$.
- But in GPT, PE is learned. Therefore, PE and input embeddings can be of similar scale.
Source: Hugging Face
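A simplified sketch of the GPT-2-style input pipeline with learned position embeddings (the wte/wpe names follow Hugging Face's code; the sizes are GPT-2 small):
import torch
import torch.nn as nn

vocab_size, n_positions, d_model = 50257, 1024, 768
wte = nn.Embedding(vocab_size, d_model)      # token embeddings (learned)
wpe = nn.Embedding(n_positions, d_model)     # position embeddings (learned)
drop = nn.Dropout(0.1)                       # embedding dropout

input_ids = torch.randint(0, vocab_size, (1, 16))
positions = torch.arange(input_ids.size(1)).unsqueeze(0)
hidden = drop(wte(input_ids) + wpe(positions))   # note: no sqrt(d_model) scaling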
3.3 LayerNorm
- Uses pre-LN.
- Another LN is added after the final decoder block (ln_f in Hugging Face's code).
# first go through the blocks
for block in DecoderList:
x = block(x)
# the final LN before output!
output = LN(x)
3.4 Dropout
- GPTs have dropout in residual, embedding, and attention, same as the original Transformer (drop=0.1).
- GPT has no dropout in FFN.
3.5 Activation: GELU
$\mathrm{GELU}(x) = x \cdot \Phi(x)$ (Hendrycks & Gimpel, 2016), where $\Phi(x)$ is the CDF of the standard normal distribution. A common approximation is:
import math
import torch
from torch import Tensor

def gelu(input: Tensor) -> Tensor:  # tanh approximation of GELU, as used in GPT-2
    return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
3.6 Initialization
- Linear and Conv1D weights are normal with mean=0, std=0.02.
- Embedding and PE weights are normal with mean=0, std=0.02; the embedding row at padding_idx (if any) is set to 0.
- LayerNorm is initialized with weight = 1 and bias = 0 (an identity affine transformation at init).
- (Important) Reinitialize selected weights: c_proj is the $W^O$ matrix in the original paper, sized $d_{\text{model}} \times d_{\text{model}}$. The concatenated attention outputs are multiplied by $W^O$ before the residual addition.
- I still do not fully understand why "training signals accumulate through the residual path."
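One common reading of that sentence: every block adds its output onto the residual stream, so with N residual layers the variance of the stream grows roughly linearly in N; shrinking the residual projections at init keeps the stream's scale roughly constant. A simplified sketch in the spirit of Hugging Face's GPT2PreTrainedModel._init_weights (n_layer and the c_proj name come from the GPT-2 config/code; this is not the verbatim implementation):
import math
import torch.nn as nn

n_layer, base_std = 12, 0.02                     # GPT-2 small: 12 blocks, std = 0.02

def init_weights(module: nn.Module) -> None:
    # normal(0, 0.02) for linear/embedding weights
    if isinstance(module, (nn.Linear, nn.Embedding)):
        module.weight.data.normal_(mean=0.0, std=base_std)
    # residual projections ("c_proj.weight") are re-initialized with a smaller std:
    # each of the n_layer blocks contributes two residual branches (attn + FFN),
    # so the std is divided by sqrt(2 * n_layer)
    for name, p in module.named_parameters():
        if name == "c_proj.weight":
            p.data.normal_(mean=0.0, std=base_std / math.sqrt(2 * n_layer))

# usage: model.apply(init_weights)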