# Implementation details: From the original Transformer to GPT


• Query and the attention output are “one-to-one” mapped. For each query (a vector), we need to compute an output.
• So, if we have $n$ queries (i.e., Q is $n\times d$), the output needs to be $n\times d$.
• The output for each query is a weighted sum of values.
• Softmax is applied to the rows of $QK^T$, i.e., sum(i,:)=1
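The points above can be checked with a minimal, dependency-free sketch of scaled dot-product attention. The sizes (3 queries, 5 key/value pairs, dimension 4) and the helper names `softmax` and `attention` are illustrative assumptions, not part of any library.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    # scores[i][j] = <Q[i], K[j]> / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    weights = [softmax(row) for row in scores]  # softmax over each row
    d_v = len(V[0])
    # out[i] = sum_j weights[i][j] * V[j], a weighted sum of values
    out = [[sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(d_v)]
           for w_row in weights]
    return out, weights


random.seed(0)
n, m, d = 3, 5, 4  # hypothetical sizes: 3 queries, 5 key/value pairs
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
out, weights = attention(Q, K, V)
print(len(out), len(out[0]))           # one output row per query
print([round(sum(r), 6) for r in weights])  # every row of weights sums to 1
```

Each query row produces exactly one output row, regardless of how many keys and values there are.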
Why divide by $\sqrt{d_{k}}$?

When we compute $QK^T$, we compute many dot products between vectors of dimension $d_{k}$. When $d_{k}$ is large (e.g., 512, 4096), the variance of these dot products becomes large. That is, if we pick one row of $QK^T$ (recall softmax is applied to rows), we will observe that some elements are very small while others are very large. If we apply softmax to this row, the resulting weights will be pushed towards either 0 or 1.

This leads to small gradient updates. You can see that the gradient at the two sides of a sigmoid function is close to zero:

If we divide by $\sqrt{d_{k}}$, the inputs to softmax stay closer to the center of the distribution, where the gradient is larger.
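The variance argument can be illustrated with a small simulation. This is a sketch under the assumption that the entries of the query and key vectors are independent standard normal samples; the function name `dot_product_variance` and the trial count are made up for illustration.

```python
import math
import random


def dot_product_variance(d_k, scale, trials=2000, seed=0):
    """Empirical variance of q.k for q, k ~ N(0, I) in d_k dimensions."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        dots.append(dot / math.sqrt(d_k) if scale else dot)
    mean = sum(dots) / trials
    return sum((x - mean) ** 2 for x in dots) / trials


for d_k in (16, 64):
    raw = dot_product_variance(d_k, scale=False)
    scaled = dot_product_variance(d_k, scale=True)
    print(d_k, round(raw, 1), round(scaled, 2))
```

Under this assumption the unscaled variance grows roughly like $d_{k}$, while the scaled variance stays near 1 for every $d_{k}$.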

The brief explanation from the authors is:

Why MHA?

If we only use dot-product attention, there are no learnable parameters!

• Each head could learn different attention pattern.
• Each head is like a “filter” as in CNN
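To make the "learnable parameters" point concrete, here is a rough count of the projection weights that multi-head attention adds. It assumes bias-free projections with $d_k = d_v = d_{model}/h$, as in the original paper's formulation; the function name is hypothetical.

```python
def mha_param_count(d_model: int, num_heads: int) -> int:
    """Count projection weights in multi-head attention (no biases)."""
    assert d_model % num_heads == 0
    d_k = d_model // num_heads
    per_head = 3 * d_model * d_k             # W_i^Q, W_i^K, W_i^V of one head
    output_proj = num_heads * d_k * d_model  # W^O applied to the concatenated heads
    return num_heads * per_head + output_proj


print(mha_param_count(d_model=512, num_heads=8))  # 4 * 512 * 512 = 1048576
```

Under these assumptions the total is $4 d_{model}^2$ no matter how many heads are used, so adding heads changes how the parameters are partitioned, not how many there are.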

The FFN is simply an MLP applied to the last dimension. The output of the attention block is $n\times d_{model}$, where $n$ is the sequence length (number of tokens), and $x$ is the embedding of one token (a vector). What the FFN does is:

$$\text{FFN}(x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2}$$

$W_{1}$ projects from $d_{model}$ to $4\times d_{model}$, and $W_{2}$ projects it back to $d_{model}$.

Why are the weights of the FFN the same for every token?
Because we already “processed each token individually” in the previous attention block.
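A minimal sketch of the position-wise FFN, written with plain lists so it runs without dependencies. The token dimension `d = 4`, the hidden width `4 * d`, and the random weights are illustrative assumptions only.

```python
import random


def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single token vector x."""
    hidden = [max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    return [sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]


random.seed(0)
d, d_hidden, n = 4, 16, 3  # hypothetical sizes: hidden width is 4 * d
W1 = [[random.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d)]
b1 = [0.0] * d_hidden
W2 = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d_hidden)]
b2 = [0.0] * d

tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
# The same W1, b1, W2, b2 are reused for every token in the sequence.
outputs = [ffn(x, W1, b1, W2, b2) for x in tokens]
print(len(outputs), len(outputs[0]))
```

Because each token goes through the function on its own, the output for one token does not depend on the other tokens in the sequence.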
• In the original paper, the authors use the same embedding matrix in three places: 1) encoder, 2) decoder, and 3) the final FF layer before softmax (where for each token you produce V logits (V is the vocabulary size))
• embedding matrix: $V\times d_{model}$
• The authors also multiply the embedding matrix by $\sqrt{d_{model}}$
• An observation is that after training, the l2 norm of an embedding vector is usually very small and doesn’t increase with $d_{model}$. For example, when $d_{model}$ is increased from 512 to 4096, the l2 norm of the embedding may still be 1.
• However, the l2 norm of the positional encoding DOES increase with the vector length $d_{model}$.
• So, if we don’t scale embedding, it will be dominated by the positional encoding when they’re added up.
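The growth of the positional-encoding norm can be verified directly. The sketch below assumes the sinusoidal encoding from the original paper; the helper names and the chosen position are arbitrary.

```python
import math


def sinusoidal_pe(pos: int, d_model: int):
    """Sinusoidal positional encoding for one position (d_model must be even)."""
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


for d_model in (512, 4096):
    norm = l2_norm(sinusoidal_pe(pos=10, d_model=d_model))
    print(d_model, round(norm, 2), round(math.sqrt(d_model / 2), 2))
```

Each sin/cos pair contributes $\sin^2 + \cos^2 = 1$ to the squared norm, so the norm is $\sqrt{d_{model}/2}$ at every position: 16 for $d_{model}=512$ and about 45.25 for $d_{model}=4096$. An embedding with norm around 1 would be small next to either value.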
• Residual dropout: applied to the sublayer output before it is added to the sublayer input (see “Drop1” and “Drop2” in the following pre-LN example; note that “Drop” is the FFN dropout)
• x + Drop1(MHA(LN(x))) (Attention sublayer)
• x + Drop2(Linear2(Drop(activation(Linear1(LN(x)))))) (FFN sublayer)
• Attention dropout:
• It’s applied after softmax, before multiplying V (i.e., on the attention weights).
• $Drop\left(Softmax\left( \frac{QK^T}{\sqrt{d_{k}}}\right) \right)V$
• Embedding dropout: after the sum of embedding and positional encoding
• Drop(input_embed + pos_enc)
• FFN dropout: only for the hidden layer in FFN. (The “Drop” in the above example)
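What each "Drop" above does to its input can be sketched as follows. This is a plain-Python illustration of inverted dropout (the behavior of `torch.nn.Dropout` in training mode), applied to one hypothetical row of attention weights; the function name and the values are made up.

```python
import random


def dropout(values, p, rng):
    """Inverted dropout: zero each entry with probability p, scale survivors by 1/(1-p)."""
    if p == 0.0:
        return list(values)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
attn_row = [0.1, 0.2, 0.3, 0.4]  # one softmax row, sums to 1
dropped = dropout(attn_row, p=0.1, rng=rng)
print(dropped, sum(dropped))
```

One consequence for attention dropout: after the mask and rescaling, a row of attention weights generally no longer sums to exactly 1.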
• There are two layer norm placements: pre-LN and post-LN. In the original Transformer paper, the authors use post-LN, but GPT and later models prefer pre-LN.
• Pre-LN makes the gradients smoother. See (Xiong et al., 2020).
```python
# pre-norm (preferred)
x = x + MHA(LN(x))
x = x + FFN(LN(x))

# post-norm
x = LN(x + MHA(x))
x = LN(x + FFN(x))
```

last update: pytorch-2.0.1

• norm_first: If True, use pre-norm. Default: False (post-norm).
• dim_feedforward: default=2048
• activation: Default “relu”
• dropout: Default 0.1
```python
# How TransformerEncoderLayer.forward works
# x: input source
# Drop1, Drop2: residual dropout
# Drop: FFN dropout
# note: SA() doesn't include any dropout layer

# Pre-LN (preferred)
x = x + Drop1(SA(LN(x)))  # Self-Attn sublayer
x = x + Drop2(Linear2(Drop(Activation(Linear1(LN(x))))))  # FFN sublayer

# Post-LN
x = LN(x + Drop1(SA(x)))  # Self-Attn sublayer
x = LN(x + Drop2(Linear2(Drop(Activation(Linear1(x))))))  # FFN sublayer
```
```python
# How DecoderLayer.forward works
# x: the "Q"
# memory: the "K" and "V", from the encoder
# Drop1, Drop2, Drop3: residual dropout
# Drop: FFN dropout
# note: SA() and MHA() don't include any dropout layer

# Pre-LN (preferred)
x = x + Drop1(SA(LN1(x)))  # SA sublayer
x = x + Drop2(MHA(LN2(x), memory))  # Multi-Head Attn
x = x + Drop3(Linear2(Drop(Activation(Linear1(LN3(x))))))  # FFN sublayer

# Post-LN
x = LN1(x + Drop1(SA(x)))  # SA
x = LN2(x + Drop2(MHA(x, memory)))  # MHA
x = LN3(x + Drop3(Linear2(Drop(Activation(Linear1(x))))))  # FFN sublayer
```

Below, I show how GPTs differ from the original Transformer. Since the GPT family is not open-source, the GPT code is from Huggingface’s implementation of GPT-2. OpenAI claimed GPT-3 uses the same architecture as GPT-2, except for the Sparse Transformer part.

"GPT's decoder" vs. "Original Transformer decoder"
• In the original paper, the decoder has three sublayers, SelfAttn, CrossAttn, FFN because it needs input from the encoder
• GPT-2 has no “CrossAttn”
• So, GPT’s decoder is equivalent to an encoder, except for the mask in the SelfAttn (See this SO post).
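The mask mentioned above can be sketched in a few lines: scores for future positions are set to negative infinity before the softmax, so they receive zero weight. The score matrix is a made-up example, and the helper names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_weights(scores):
    """Mask every position j > i with -inf, then apply softmax to each row."""
    n = len(scores)
    masked = [[scores[i][j] if j <= i else float("-inf") for j in range(n)]
              for i in range(n)]
    return [softmax(row) for row in masked]


scores = [[0.5, 1.0, -0.3],
          [0.2, 0.1, 0.9],
          [1.5, -0.7, 0.4]]
weights = causal_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The resulting weight matrix is lower triangular: token $i$ attends only to tokens $0..i$, which is the only difference from an encoder-style self-attention.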
• The tokenizer (byte-level BPE) works on bytes, but avoids merges across character categories (e.g., punctuation and letters are not allowed to merge), except for spaces.
• e.g., "Hello world" => ["Hello", " world"]. Notice there’s a space in front of “world.”
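The category rule can be imitated with a small regular expression that splits text into chunks before any BPE merges happen. This is a simplified, ASCII-only stand-in written for illustration; it is an assumption, not the actual pattern used by GPT-2.

```python
import re

# Simplified, ASCII-only approximation (NOT the real GPT-2 pattern):
# an optional leading space followed by letters, digits, or other symbols.
PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")


def pre_tokenize(text: str):
    """Split text into chunks; merges would only be allowed inside a chunk."""
    return PATTERN.findall(text)


print(pre_tokenize("Hello world"))    # ['Hello', ' world']
print(pre_tokenize("Hello, world!"))  # ['Hello', ',', ' world', '!']
```

Letters and punctuation land in separate chunks, while a leading space stays attached to the word that follows it.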
• Embedding is not scaled before adding to positional encoding (PE)
• In the original Transformer, input_embeds is multiplied by $\sqrt{d_{model}}$. That’s because the PE is determined by sin/cos (not learned!) and its l2 norm increases with $d_{model}$
• But in GPT, PE is learned. Therefore, PE and input_embeds can be of similar scale. Source: Huggingface
• It uses “pre-LN”
• Another LN is added after the final attention block
```python
# first go through the blocks
for block in DecoderList:
    x = block(x)

# the final LN before output!
output = LN(x)
```
• GPTs have dropout in residual, embedding, and attention, same as the original Transformer (drop=0.1)
• GPT has no dropout in FFN!

(Hendrycks & Gimpel, 2016)

$$\text{GELU}(x) = x\Phi(x)$$

where $\Phi(x)$ is the CDF of the standard normal distribution. It can be approximated with:

$$0.5 \cdot x \cdot \left( 1 + \tanh\left[ \sqrt{2/\pi}\left(x+0.044715x^3 \right)\right] \right)$$

```python
import math
import torch

def gelu(input: torch.Tensor) -> torch.Tensor:
    # tanh approximation of GELU
    return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
```
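To see how close the tanh form is to the exact definition, the two can be compared on scalars using only the standard library. The function names and the grid of test points are illustrative assumptions.

```python
import math


def gelu_exact(x: float) -> float:
    """x * Phi(x), with Phi written in terms of the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


xs = [i / 10 for i in range(-50, 51)]  # grid on [-5, 5]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(max_gap)
```

On this grid the largest gap between the two is small compared with the scale of the activations.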
• Linear and Conv1D weights are drawn from a normal distribution with mean=0, std=0.02
• Embedding & PE weights are drawn from a normal distribution with mean=0, std=0.02; padding_idx is 0
• Layer norm has no affine transformation
• (important!) Reinit selected weights
• c_proj is the $W^O$ matrix in the original paper; it’s $d_{model}\times d_{model}$. The concatenated attention outputs are multiplied by $W^O$ before being sent to the residual path.
• I still do not fully understand why “training signals will accumulate through the residual path.”
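One possible reading of the quoted sentence is a variance argument: every sublayer adds its output onto the residual stream, so the variance of the stream grows with the number of additions unless each contribution is scaled down. The toy simulation below treats each contribution as independent Gaussian noise, which is a strong simplification. It assumes the $1/\sqrt{N}$ scaling described in the GPT-2 paper with $N = 2 \times$ n_layer (two residual additions per block); the function name and the settings are hypothetical.

```python
import math
import random


def residual_variance(num_terms, std, trials=2000, seed=0):
    """Empirical variance of a sum of num_terms independent N(0, std^2) samples."""
    rng = random.Random(seed)
    sums = [sum(rng.gauss(0.0, std) for _ in range(num_terms)) for _ in range(trials)]
    mean = sum(sums) / trials
    return sum((s - mean) ** 2 for s in sums) / trials


n_layer = 12                 # assumed depth, for illustration
n_residual = 2 * n_layer     # attention + FFN each add to the residual stream
base_std = 0.02

unscaled = residual_variance(n_residual, base_std)
scaled = residual_variance(n_residual, base_std / math.sqrt(n_residual))
print(unscaled / base_std ** 2)  # roughly n_residual
print(scaled / base_std ** 2)    # roughly 1
```

In this toy model the unscaled sum has about $N$ times the variance of a single contribution, while the rescaled sum stays at the variance of a single contribution regardless of depth.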