Implementation details: From the original Transformer to GPT

  • Queries and attention outputs are mapped “one-to-one”: for each query (a vector), we compute one output.
    • So, if we have $n$ queries (i.e., $Q$ is $n\times d$), the output is also $n\times d$.
    • The output for each query is a weighted sum of values.
  • Softmax is applied to the rows of $QK^T$, i.e., each row sums to 1.
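
The points above can be sketched in a few lines of plain Python. This is only a toy illustration on nested lists; the helper names `softmax` and `attention` are mine, not from any library.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on plain nested lists.

    Q is n x d_k, K is m x d_k, V is m x d_v; the result is n x d_v.
    """
    d_k = len(Q[0])
    # scores[i][j] = (q_i . k_j) / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    # softmax is applied row by row, so every row of weights sums to 1
    weights = [softmax(row) for row in scores]
    # output[i] = sum_j weights[i][j] * V[j]  (a weighted sum of values)
    out = [[sum(w * v_row[c] for w, v_row in zip(w_row, V))
            for c in range(len(V[0]))] for w_row in weights]
    return out, weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 queries
K = [[1.0, 0.0], [0.0, 1.0]]               # m = 2 keys
V = [[10.0, 0.0], [0.0, 10.0]]             # m = 2 values
out, weights = attention(Q, K, V)
```

There are three queries, so `out` has three rows. The third query matches both keys equally, so its output is the plain average of the two values.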
Why divide by $\sqrt{d_{k}}$?

When we compute $QK^T$, we compute many dot products between vectors of dimension $d_{k}$. When $d_{k}$ is large (e.g., 512, 4096), the variance of the results becomes large. That is, if we pick out one row of $QK^T$ (recall softmax is applied to rows), we will observe that some elements are very small while others are very large. If we apply softmax to this row, the result will be skewed towards either 0 or 1.

This leads to small gradient updates. The same effect is visible in a sigmoid function, whose gradient is close to zero on both sides (the saturated tails).

If we divide by $\sqrt{d_{k}}$, the result of softmax is closer to the center of the distribution, and the gradients become larger.
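
A quick numerical check of this argument, assuming the components of the query and key vectors are independent with mean 0 and variance 1 (an idealization; trained vectors will not follow it exactly). All helper names here are mine.

```python
import math
import random

random.seed(0)

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def variance(samples):
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

def random_vector(dim):
    """A vector with i.i.d. standard normal components (the assumption above)."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

d_k = 512
raw = [dot(random_vector(d_k), random_vector(d_k)) for _ in range(1000)]
var_raw = variance(raw)                                    # roughly d_k
var_scaled = variance([s / math.sqrt(d_k) for s in raw])   # roughly 1

# One row of QK^T: a single query scored against 8 keys.
q = random_vector(d_k)
row = [dot(q, random_vector(d_k)) for _ in range(8)]
peak_raw = max(softmax(row))                                    # largest weight, unscaled
peak_scaled = max(softmax([s / math.sqrt(d_k) for s in row]))   # largest weight, scaled
```

Under this assumption the unscaled dot products have variance close to $d_{k}$, and the scaled ones close to 1. The largest softmax weight is always higher without scaling, which is the saturation described above.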

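The authors' brief explanation is that, for large values of $d_{k}$, the dot products grow large in magnitude and push the softmax into regions where it has extremely small gradients; scaling by $\frac{1}{\sqrt{d_{k}}}$ counteracts this effect.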

Why MHA?

If we only use dot-product attention, there are no learnable parameters! MHA adds learned projections ($W^Q$, $W^K$, $W^V$ per head, plus $W^O$), so:

  • Each head can learn a different attention pattern.
  • Each head is like a “filter” in a CNN.

FFN is simply an MLP applied to the last dimension. The output of the attention block is $n\times d_{model}$, where $n$ is the sequence length (number of tokens). $x$ is the embedding of one token (a vector). What FFN does is:

$$FFN(x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2}$$

$W_{1}$ projects from $d_{model}$ to $4\times d_{model}$, and $W_{2}$ projects it back to $d_{model}$.

Why are the FFN weights the same for every token?
Because we already “processed each token individually” in the previous attention block.
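
To make “the same weights for every token” concrete, here is a minimal sketch of a position-wise FFN on plain lists. The weights are arbitrary made-up numbers, the width is tiny (`d_model = 2`, `d_ff = 8`), and `linear` and `ffn` are hypothetical helpers, not a library API.

```python
def linear(x, W, b):
    """Compute x W + b for a single row vector x (W is in_dim x out_dim)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for one token embedding x."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]   # ReLU on the hidden layer
    return linear(hidden, W2, b2)

d_model, d_ff = 2, 8   # d_ff = 4 * d_model
# Arbitrary illustrative weights (not trained values).
W1 = [[0.1 * (i + j) - 0.3 for j in range(d_ff)] for i in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[0.05 * (i - j) for j in range(d_model)] for i in range(d_ff)]
b2 = [0.1] * d_model

# n = 3 tokens; the first and last embeddings are identical on purpose.
tokens = [[1.0, 2.0], [-1.0, 0.5], [1.0, 2.0]]
outputs = [ffn(x, W1, b1, W2, b2) for x in tokens]      # same weights at every position
```

Because each token goes through the same function independently, identical embeddings give identical outputs wherever they sit in the sequence, and reordering the tokens simply reorders the outputs.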
  • In the original paper, the authors share the same embedding matrix in three places: 1) the encoder, 2) the decoder, and 3) the final linear layer before softmax (where each token produces $V$ logits; $V$ is the vocabulary size)
    • embedding matrix: $V\times d_{model}$
  • The authors also multiply the embedding matrix by $\sqrt{d_{model}}$
    • An observation is that, after training, the l2 norm of an embedding vector is usually very small and doesn’t increase with $d_{model}$. For example, when $d_{model}$ is increased from 512 to 4096, the l2 norm of the embedding may still be 1.
    • However, the l2 norm of the positional encoding DOES increase with the vector length $d_{model}$.
    • So, if we don’t scale the embedding, it will be dominated by the positional encoding when the two are added.
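
The claim about the positional encoding norm can be checked directly. The sketch below assumes the sinusoidal encoding from the original paper; `sinusoidal_pe` and `l2_norm` are my own helper names.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Sinusoidal positional encoding for one position (d_model must be even)."""
    pe = []
    for i in range(0, d_model, 2):
        # i is the even index "2i" in the paper's formula
        angle = pos / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

# The position is arbitrary: each (sin, cos) pair contributes exactly 1 to the
# squared norm, so the norm is sqrt(d_model / 2) at every position.
norms = {d: l2_norm(sinusoidal_pe(pos=7, d_model=d)) for d in (512, 4096)}
```

Since $\sin^2+\cos^2=1$ for every pair, the norm is $\sqrt{d_{model}/2}$: about 16 for $d_{model}=512$ and about 45 for $d_{model}=4096$, whatever the position.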
  • Residual dropout: applied to each sublayer’s output before it is added to the sublayer input (see “Drop1” and “Drop2” in the following pre-LN example; the plain “Drop” is the FFN dropout)
    • x + Drop1(MHA(LN(x))) (Attention sublayer)
    • x + Drop2(Linear2(Drop(activation(Linear1(LN(x)))))) (FFN sublayer)
  • Attention dropout:
    • It’s applied after softmax, before multiplying V (i.e., on the attention weights).
    • $\text{Drop}\left(\text{Softmax}\left( \frac{QK^T}{\sqrt{d_{k}}}\right) \right)V$
  • Embedding dropout: after the sum of embedding and positional encoding
    • Drop(input_embed + pos_enc)
  • FFN dropout: only for the hidden layer in FFN. (The “Drop” in the above example)
  • There are two layer norm placements: pre-LN and post-LN. In the original Transformer paper, the authors use post-LN, but GPT and later models prefer pre-LN.
  • Pre-LN makes the gradients smoother. See (Xiong et al., 2020).

```python
# pre-norm (preferred)
x = x + MHA(LN(x))
x = x + FFN(LN(x))

# post-norm
x = LN(x + MHA(x))
x = LN(x + FFN(x))
```

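PyTorch implementation (last update: pytorch-2.0.1). Relevant constructor arguments of nn.TransformerEncoderLayer and nn.TransformerDecoderLayer: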

  • norm_first: If True, use pre-norm. Default: False (post-norm).
  • dim_feedforward: default=2048
  • activation: Default “relu”
  • dropout: Default 0.1

```python
# How TransformerEncoderLayer.forward works (schematic, not runnable as-is)

# x: input source
# Drop1, Drop2: residual dropout
# Drop: FFN dropout
# note: SA() excludes the residual dropout, which is written out as Drop1;
#       attention dropout (on the attention weights) happens inside the attention module

# Pre-LN (preferred)
x = x + Drop1(SA(LN(x)))  # Self-Attn sublayer
x = x + Drop2(Linear2(Drop(Activation(Linear1(LN(x))))))  # FFN sublayer

# Post-LN
x = LN(x + Drop1(SA(x)))  # Self-Attn sublayer
x = LN(x + Drop2(Linear2(Drop(Activation(Linear1(x))))))  # FFN sublayer
```

```python
# How TransformerDecoderLayer.forward works (schematic, not runnable as-is)

# x: the "Q"
# memory: the "K" and "V", from the encoder
# Drop1, Drop2, Drop3: residual dropout
# Drop: FFN dropout
# note: SA() and MHA() exclude the residual dropout, which is written out as Drop1/Drop2;
#       attention dropout (on the attention weights) happens inside the attention modules

# Pre-LN (preferred)
x = x + Drop1(SA(LN1(x)))  # SA sublayer
x = x + Drop2(MHA(LN2(x), memory))  # Multi-Head (cross) Attn sublayer
x = x + Drop3(Linear2(Drop(Activation(Linear1(LN3(x))))))  # FFN sublayer

# Post-LN
x = LN1(x + Drop1(SA(x)))  # SA sublayer
x = LN2(x + Drop2(MHA(x, memory)))  # Multi-Head (cross) Attn sublayer
x = LN3(x + Drop3(Linear2(Drop(Activation(Linear1(x))))))  # FFN sublayer
```

Below, I show how GPTs differ from the original Transformer. Since the GPT family is not open-source, the GPT code is from Huggingface’s implementation of GPT-2. OpenAI claimed GPT-3 uses the same architecture as GPT-2, except for the Sparse Transformer part.

"GPT's decoder" vs. "Original Transformer decoder"
  • In the original paper, the decoder has three sublayers (SelfAttn, CrossAttn, FFN) because it needs input from the encoder
  • GPT-2 has no “CrossAttn”
  • So, GPT’s decoder is equivalent to an encoder, except for the mask in the SelfAttn (See this SO post).
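
A toy sketch of that last point: the only difference between “encoder-style” and “decoder-style” self-attention here is a causal mask on the scores. The score matrix is made up, and `attention_weights` is a hypothetical helper of mine.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, causal):
    """Row-wise softmax of an n x n score matrix, optionally with a causal mask."""
    n = len(scores)
    # With the causal mask, token i may only attend to tokens j <= i;
    # masked positions get -inf, which softmax turns into weight 0.
    masked = [[scores[i][j] if (not causal or j <= i) else float("-inf")
               for j in range(n)] for i in range(n)]
    return [softmax(row) for row in masked]

scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 1.0],
          [2.0, 1.0, 0.0]]
encoder_style = attention_weights(scores, causal=False)  # every token sees every token
decoder_style = attention_weights(scores, causal=True)   # lower-triangular weights
```

With the mask, the first token can only attend to itself, and every weight above the diagonal is exactly zero.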
  • The tokenizer (byte-level BPE) works on bytes, but avoids merges across character categories (e.g., punctuation and letters are not allowed to merge), except for spaces.
  • e.g., "Hello world" => ["Hello", " world"]. Notice there’s a space in front of “world.”
  • Both the token embedding and the positional encoding are learned
    Source: Huggingface


  • Embedding is not scaled before adding to positional encoding (PE)
    • In the original Transformer, input_embeds is multiplied by $\sqrt{d_{model}}$. That’s because the PE is determined by sin/cos (not learned!) and its l2 norm increases with $d_{model}$.
    • But in GPT, PE is learned. Therefore, PE and input_embeds can be of similar scale.
      Source: Huggingface


  • GPT-2 uses “pre-LN”
  • Another LN is added after the final attention block

```python
# first go through the blocks
for block in DecoderList:
    x = block(x)

# the final LN before output!
output = LN(x)
```
  • GPTs have dropout in residual, embedding, and attention, same as the original Transformer (drop=0.1)
  • GPT has no dropout in FFN!

GPT uses GELU (Hendrycks & Gimpel, 2016) as the activation: $\text{GELU}(x) = x\Phi(x)$, where $\Phi(x)$ is the CDF of the standard normal distribution. It can be approximated with:

$$0.5 \cdot x \cdot \left( 1 + \tanh\left[ \sqrt{2/\pi}\left(x+0.044715x^3 \right)\right] \right)$$

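```python
import math

import torch
from torch import Tensor

def gelu(input: Tensor) -> Tensor:
    # tanh approximation of GELU, applied element-wise
    return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
```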
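
To see how close the tanh formula is to the exact definition, here is a small scalar comparison using only the standard library (`math.erf` gives the normal CDF). The function names are mine.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF written via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The tanh approximation of GELU shown above."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

# Compare the two on a grid over [-5, 5].
xs = [i / 100.0 for i in range(-500, 501)]
max_abs_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
```

On this grid the two curves are nearly indistinguishable, which is why the cheaper tanh form is commonly used in place of the exact one.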
  • Linear and Conv1D weights are initialized from a normal distribution with mean=0, std=0.02
  • Embedding & PE weights are initialized from a normal distribution with mean=0, std=0.02; the row at padding_idx is set to 0
  • Layer norm is initialized to the identity affine transformation (weight=1, bias=0)
  • (important!) Reinit selected weights
    • c_proj is the $W^O$ matrix in the original paper; it’s $d_{model}\times d_{model}$. The concatenated attention outputs are multiplied by $W^O$ before being sent to the residual.
    • I still don’t fully understand why “training signals will accumulate through the residual path.”
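
One way to build intuition for the “accumulation” remark is a toy model. It is my own simplification, not the actual training dynamics: treat the residual stream as a sum of $N$ independent zero-mean contributions, one per residual sublayer. Their sum has a standard deviation $\sqrt{N}$ times larger than each term, and scaling each term by $1/\sqrt{N}$ (the factor described in the GPT-2 paper) cancels that growth. Counting $N$ as `2 * n_layer` (one attention and one FFN sublayer per block) is an assumption in this sketch.

```python
import math
import random

random.seed(0)

def residual_std(n_residual, std, trials=5000):
    """Std of a toy residual stream that sums n_residual independent N(0, std^2) terms."""
    totals = [sum(random.gauss(0.0, std) for _ in range(n_residual))
              for _ in range(trials)]
    mean = sum(totals) / trials
    return math.sqrt(sum((t - mean) ** 2 for t in totals) / trials)

n_layer = 12                 # illustrative depth
n_residual = 2 * n_layer     # assumption: two residual sublayers per block
base_std = 0.02

plain = residual_std(n_residual, base_std)                            # grows like sqrt(N)
scaled = residual_std(n_residual, base_std / math.sqrt(n_residual))   # stays near base_std
```

In this toy setting, `plain` comes out near `0.02 * sqrt(24)` while `scaled` stays near `0.02`. Real sublayer outputs are neither independent nor Gaussian, so this only shows the direction of the effect.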