Quicktake: BPE, WordPiece, and SentencePiece

This post draws upon information from:

  • Mainstream tokenization for neural models can be summarized as “BPE by default, maybe SentencePiece for multilingual.” BPE-dropout may become more widespread because it is simple.
  • Byte-level BPE (BBPE) and SentencePiece (unigram) are the two most popular variants.
  • BPE vs. SentencePiece (unigram)
    • BPE is greedy and deterministic. It can’t sample different tokenizations for the same string. BPE-dropout, however, introduces stochasticity.
    • In SentencePiece (unigram), every token has a probability, so sampling different tokenizations of the same string is possible.
  • “Lossless” is a matter of degree.
    • BPE (GPT) is “fully” lossless. It preserves runs of consecutive spaces of any length.
    • SentencePiece (XLNet) is “partially” lossless. It collapses multiple consecutive spaces into a single space.
    • WordPiece is lossy. It doesn’t even preserve spaces.
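
To make the “greedy and deterministic” point concrete, here is a minimal sketch of BPE encoding with an optional dropout knob. The merge table `MERGES` and the function `bpe_encode` are made up for illustration and are not taken from any library. The dropout handling is also a simplification of BPE-dropout: each candidate merge is skipped with probability `dropout` at every step, and encoding stops when no candidate survives.

```python
import random

# Hypothetical toy merge table, in priority order (earlier = higher priority).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]
RANK = {pair: i for i, pair in enumerate(MERGES)}


def bpe_encode(word, dropout=0.0, rng=None):
    """Repeatedly apply the best-ranked available merge.

    With dropout == 0 the result is fully determined by the merge table.
    With dropout > 0 each candidate merge is randomly skipped, so the
    same word can be segmented differently on different calls.
    """
    rng = rng or random.Random()
    tokens = list(word)
    while len(tokens) > 1:
        # Adjacent pairs that are in the table and survive dropout this step.
        candidates = [
            (RANK[pair], i)
            for i, pair in enumerate(zip(tokens, tokens[1:]))
            if pair in RANK and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # lowest rank wins, leftmost on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


print(bpe_encode("lower"))               # ['lower'], the same on every call
print(bpe_encode("lower", dropout=0.5))  # varies from call to call
```

Whatever segmentation comes out, concatenating the tokens still gives back the original word; only the split points change. A unigram model reaches a similar effect differently, by sampling among segmentations according to token probabilities.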

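The three levels of “losslessness” above can be illustrated with a round-trip test: tokenize, detokenize, and compare with the input. The functions below are toy stand-ins written for this post, not the real GPT, SentencePiece, or WordPiece tokenizers; they only mimic how each scheme treats whitespace as described above, and real behavior depends on each tokenizer's normalization settings.

```python
import re

text = "a  b,   c"  # contains runs of two and three spaces


def roundtrip_byte_level(s):
    # Byte-level style: a space is a symbol like any other, nothing is dropped.
    tokens = list(s.encode("utf-8"))  # one token per byte, for simplicity
    return bytes(tokens).decode("utf-8")


def roundtrip_sentencepiece_like(s):
    # SentencePiece style: collapse space runs, then mark each space with "▁".
    normalized = re.sub(r" +", " ", s.strip())
    marked = "▁" + normalized.replace(" ", "▁")
    pieces = re.findall(r"▁[^▁]*", marked)  # toy pieces: one per word
    # Decoding turns "▁" back into a space and drops the leading one.
    return "".join(pieces).replace("▁", " ")[1:]


def roundtrip_wordpiece_like(s):
    # WordPiece style: pre-split on whitespace and punctuation, so the
    # tokens carry no whitespace information at all.
    tokens = re.findall(r"\w+|[^\w\s]", s)
    # The best a decoder can do is guess, here by joining with single spaces.
    return " ".join(tokens)


print(repr(roundtrip_byte_level(text)))          # 'a  b,   c'
print(repr(roundtrip_sentencepiece_like(text)))  # 'a b, c'
print(repr(roundtrip_wordpiece_like(text)))      # 'a b , c'
```

Only the first round trip returns the input unchanged. The second loses the length of each space run, and the third also loses the fact that there was no space before the comma.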
