Implementation details: From the original Transformer to GPT

  • Queries and attention outputs are in one-to-one correspondence: for each query (a vector), we compute one output vector.
    • So, if we have $n$ queries (i.e., $Q$ is $n\times d$), the output is $n\times d$.
    • The output for each query is a weighted sum of values.
  • Softmax is applied to the rows of $QK^T$, i.e., $\sum_j \text{softmax}(QK^T)_{i,j}=1$.
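A minimal sketch of scaled dot-product attention in plain PyTorch (shapes and names are illustrative):

import torch

def scaled_dot_product_attention(Q, K, V):
    # Q: (n, d_k), K: (m, d_k), V: (m, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / d_k ** 0.5        # (n, m)
    weights = scores.softmax(dim=-1)     # each row sums to 1
    return weights @ V                   # (n, d_v): one output row per query

Q, K, V = torch.randn(4, 64), torch.randn(6, 64), torch.randn(6, 64)
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 64)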
Why divide by $\sqrt{d_{k}}$?

When we compute $QK^T$, each entry is a dot product of two $d_k$-dimensional vectors. If the components of $q$ and $k$ are independent with mean 0 and variance 1, the dot product has variance $d_k$, so when $d_k$ is large (e.g., 512, 4096) the entries of $QK^T$ spread out widely. Looking at one row of $QK^T$ (recall softmax is row-wise), some elements are very small while others are very large, and applying softmax to such a row yields a near one-hot distribution (entries pushed toward 0 or 1).

This leads to small gradient updates: just as a sigmoid saturates at its two tails, a saturated softmax has gradients close to zero.

Dividing by $\sqrt{d_k}$ keeps the softmax inputs closer to the center of the distribution, leading to larger gradients.

The brief explanation from the authors (a footnote in the paper): "We suspect that for large values of $d_k$, the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. To counteract this effect, we scale the dot products by $\frac{1}{\sqrt{d_k}}$."
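A quick numerical check of the variance argument (the sample sizes are arbitrary):

import torch

d_k = 512
q = torch.randn(1000, d_k)            # unit-variance query rows
k = torch.randn(1000, d_k)            # unit-variance key rows
scores = q @ k.T
print(scores.std())                   # roughly sqrt(d_k) ~ 22.6
print((scores / d_k ** 0.5).std())    # roughly 1 after scaling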

Why MHA?

If we only use (scaled) dot-product attention, there are no learnable parameters. Multi-head attention introduces the learnable projections $W_i^Q$, $W_i^K$, $W_i^V$ (one set per head) and the output projection $W^O$. In addition:

  • Each head can learn different attention patterns.
  • Each head is like a filter, as in CNNs.
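A rough sketch of the per-head projections, using PyTorch 2.x's F.scaled_dot_product_attention for the attention itself (shapes and names are illustrative, not a faithful re-implementation of any library):

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_heads = 512, 8
d_head = d_model // n_heads

# these projections are the learnable parameters that plain dot-product attention lacks
W_q, W_k, W_v, W_o = [nn.Linear(d_model, d_model) for _ in range(4)]

x = torch.randn(2, 10, d_model)                      # (batch, seq, d_model)

def split_heads(t):                                  # (batch, seq, d_model) -> (batch, heads, seq, d_head)
    b, s, _ = t.shape
    return t.view(b, s, n_heads, d_head).transpose(1, 2)

heads = F.scaled_dot_product_attention(
    split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x)))
out = W_o(heads.transpose(1, 2).reshape(2, 10, d_model))   # concat heads, then apply W^O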

FFN is simply a two-layer MLP applied to the last dimension, i.e., to each token independently. The output of the attention block is $n\times d_{model}$, where $n$ is the sequence length (number of tokens). Let $x$ be the embedding of one token (a vector). FFN does: $$ \operatorname{FFN}(x)=\max(0,xW_{1}+b_{1})W_{2}+b_{2} $$ $W_{1}$ projects from $d_{model}$ to $4\times d_{model}$ (in the paper, $d_{ff}=2048$ for $d_{model}=512$), and $W_{2}$ projects back to $d_{model}$.
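A minimal position-wise FFN sketch (dimensions follow the original paper; GPT swaps the ReLU for GELU, discussed later):

import torch
import torch.nn as nn

d_model = 512
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),   # W1, b1
    nn.ReLU(),
    nn.Linear(4 * d_model, d_model),   # W2, b2
)
x = torch.randn(10, d_model)   # 10 tokens
y = ffn(x)                     # the same weights are applied to every token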

Why are the FFN weights the same for every token?
Because the interaction between tokens already happened in the attention sublayer; the FFN only needs to transform each position independently (like a $1\times1$ convolution), so the same weights are shared across all positions.
  • In the original paper, the authors share the same embedding matrix in three places: 1) the encoder input embedding, 2) the decoder input embedding, and 3) the final linear layer before softmax (which produces $V$ logits per token, where $V$ is the vocabulary size)
    • embedding matrix: $V\times d_{model}$
  • The authors also multiply the embedding matrix by $\sqrt{d_{model}}$.
    • After training, the $\ell_2$ norm of an embedding vector is usually very small and does not increase with $d_{model}$. For example, when $d_{model}$ increases from 512 to 4096, the $\ell_2$ norm of the embedding may still be ~1.
    • However, the $\ell_2$ norm of the sinusoidal positional encoding grows with $d_{model}$ (since $\sin^2+\cos^2=1$ per frequency pair, it is exactly $\sqrt{d_{model}/2}$).
    • So, if we do not scale the embedding, it will be dominated by the positional encoding when added.
  • Residual dropout: applied to each sublayer's output before it is added to the sublayer input (see "Drop1" and "Drop2" in the pre-LN example below; "Drop" is the dropout inside the FFN)
    • x + Drop1(MHA(LN(x))) (Attention sublayer)
    • x + Drop2(Linear2(Drop(activation(Linear1(LN(x)))))) (FFN sublayer)
  • Attention dropout:
    • Applied after softmax, before multiplying V (i.e., on the attention weights).
    • $\operatorname{Drop}\left(\operatorname{Softmax}\left( \frac{QK^T}{\sqrt{d_k}}\right) \right)V$
  • Embedding dropout: after summing embedding and positional encoding
    • Drop(input_embed + pos_enc)
  • FFN dropout: only for the hidden layer in FFN.
  • There are two variants: pre-LN and post-LN. In the original Transformer, the authors use post-LN, but GPT and later models prefer pre-LN.
  • Pre-LN makes gradients smoother; see Xiong et al. (2020).
# pre-norm (preferred)
x = x + MHA(LN(x))
x = x + FFN(LN(x))

# post-norm
x = LN(x + MHA(x))
x = LN(x + FFN(x))

PyTorch's nn.TransformerEncoderLayer (last checked against pytorch-2.0.1):

  • norm_first: If True, use pre-norm. Default: False (post-norm).
  • dim_feedforward: default=2048
  • activation: Default “relu”
  • dropout: Default 0.1
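A usage sketch with the pre-LN option enabled (the shapes are illustrative):

import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=512,
    nhead=8,
    dim_feedforward=2048,
    dropout=0.1,
    activation="relu",
    norm_first=True,     # pre-LN; the default (False) is post-LN
    batch_first=True,    # input is (batch, seq, d_model)
)
x = torch.randn(2, 10, 512)
out = layer(x)           # same shape as x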
# How TransformerEncoderLayer.forward works

# x: input source
# Drop1, Drop2: residual dropout
# Drop: FFN dropout
# note: SA() doesn't include any dropout layer

# Pre-LN (preferred)
x = x + Drop1(SA(LN(x)))  # Self-Attn sublayer
x = x + Drop2(Linear2(Drop(Activation(Linear1(LN(x))))))  # FFN sublayer

# Post-LN
x = LN(x + Drop1(SA(x)))  # Self-Attn sublayer
x = LN(x + Drop2(Linear2(Drop(Activation(Linear1(x))))))  # FFN sublayer
# How DecoderLayer.forward works

# x: the "Q"
# memory: the "K and V", from encoder
# Drop1, Drop2, Drop3: residual dropout
# Drop: FFN dropout
# note: SA() and MHA() don't include any dropout layer

# Pre-LN (preferred)
x = x + Drop1(SA(LN1(x)))  # SA sublayer
x = x + Drop2(MHA(LN2(x), memory))  # Multi-Head Attn
x = x + Drop3(Linear2(Drop(Activation(Linear1(LN3(x))))))  # FFN sublayer

# Post-LN
x = LN1(x + Drop1(SA(x)))  # SA
x = LN2(x + Drop2(MHA(x, memory)))  # MHA
x = LN3(x + Drop3(Linear2(Drop(Activation(Linear1(x))))))  # FFN sublayer

Below, I show how GPTs differ from the original Transformer. Since the newer GPT models are not open-source, the GPT code here is taken from Hugging Face's implementation of GPT-2. OpenAI reports that GPT-3 uses the same architecture as GPT-2, except that it alternates dense and locally banded sparse attention patterns, as in the Sparse Transformer.

"GPT's decoder" vs. "Original Transformer decoder"
  • In the original paper, the decoder has three sublayers: SelfAttn, CrossAttn, FFN, because it needs input from the encoder.
  • GPT-2 has no CrossAttn.
  • So, GPT's decoder is equivalent to an encoder, except for the causal mask in SelfAttn (a minimal mask sketch is shown below).
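A minimal sketch of the causal mask (toy sizes):

import torch

n = 5
scores = torch.randn(n, n)                                   # stands in for QK^T / sqrt(d_k)
causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = scores.softmax(dim=-1)   # row i puts zero weight on positions j > i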
  • The tokenizer works on bytes (byte-level BPE), but avoids merges across character categories (e.g., punctuation and letters are not allowed to merge), except for spaces.
  • e.g., "Hello world" => ["Hello", " world"]. Notice the leading space before “world.”
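For example, with Hugging Face's GPT-2 tokenizer (requires the transformers package; the 'Ġ' marker is how the byte-level BPE encodes a leading space):

from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
print(tok.tokenize("Hello world"))
# ['Hello', 'Ġworld'] -- two tokens, the second carrying the leading space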
  • Both the token embedding and the positional embedding are learned.

  • Embedding is not scaled before adding to positional encoding (PE)
    • In the original Transformer, input_embeds is multiplied by $\sqrt{d_{model}}$. That’s because PE is based on sin/cos (not learned) and its $\ell_2$ norm increases with $d_{model}$.
    • But in GPT, PE is learned. Therefore, PE and input embeddings can be of similar scale.

  • Uses pre-LN.
  • Another LN is added after the final decoder block.
# first go through the blocks
for block in DecoderList:
    x = block(x)

# the final LN before output!
output = LN(x) 
  • GPTs have dropout in residual, embedding, and attention, same as the original Transformer (drop=0.1).
  • GPT has no dropout in FFN.

(Hendrycks & Gimpel, 2016) $$ \operatorname{GELU}(x) = x\,\Phi(x) $$ where $\Phi(x)$ is the CDF of the standard normal distribution. A common approximation is: $$ 0.5\,x\,\Big( 1 + \tanh\big[ \sqrt{2/\pi}\,(x+0.044715x^3)\big] \Big) $$

import math
import torch

def gelu(input: torch.Tensor) -> torch.Tensor:
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
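Continuing from the snippet above, a quick sanity check against the exact form $x\Phi(x)$, computed with the error function:

x = torch.linspace(-3.0, 3.0, steps=101)
exact = 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))   # x * Phi(x)
print((exact - gelu(x)).abs().max())   # very small, on the order of 1e-3 or less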
  • Linear and Conv1D weights: normal with mean=0, std=0.02 (biases initialized to 0).
  • Token embedding and PE weights: normal with mean=0, std=0.02; if a padding_idx is set, that row is zeroed.
  • LayerNorm: weight initialized to 1, bias to 0 (the default affine parameters).
  • (Important) Reinitialize selected weights: the GPT-2 paper scales the weights of residual-path layers at initialization by $1/\sqrt{N}$, where $N$ is the number of residual layers, to account for the accumulation along the residual path.
    • c_proj in the attention block is the $W^O$ matrix from the original paper, sized $d_{model}\times d_{model}$: the concatenated head outputs are multiplied by $W^O$ before the residual add. (The FFN's output projection is also named c_proj in Hugging Face's code and gets the same treatment.)
    • I still do not fully understand why “training signals accumulate through the residual path.”
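A rough sketch of this re-initialization (the $1/\sqrt{2\cdot n_{layer}}$ factor mirrors Hugging Face's GPT-2 code, where each block has two residual sublayers; the sizes are illustrative):

import math
import torch.nn as nn

n_layer, d_model = 12, 768
std = 0.02 / math.sqrt(2 * n_layer)      # shrink residual-path projections with depth

c_proj = nn.Linear(d_model, d_model)     # stands in for W^O (and the FFN output projection)
nn.init.normal_(c_proj.weight, mean=0.0, std=std)
nn.init.zeros_(c_proj.bias)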