Quicktake: BPE, WordPiece, and SentencePiece

  • Mainstream tokenization for neural models can be summarized as “BPE by default, maybe SentencePiece for multilingual.” BPE-dropout may see wider adoption thanks to its simplicity.
  • Byte-level BPE (BBPE) and SentencePiece (unigram) are the two most popular variants.
  • BPE vs. SentencePiece (unigram)
    • BPE is greedy and deterministic: it cannot produce different tokenizations of the same string. BPE-dropout reintroduces stochasticity by randomly skipping merges (see the sampling sketch after this list).
    • In SentencePiece’s unigram model, each token carries a probability, so alternative segmentations of the same string can be sampled during tokenization.
  • “Lossless” is a matter of degree (the round-trip sketch after this list makes the differences concrete).
    • Byte-level BPE (as used in GPT) is “fully” lossless: it preserves runs of consecutive spaces of any length.
    • SentencePiece (as configured for XLNet) is “partially” lossless: it collapses multiple consecutive spaces into a single space.
    • WordPiece is lossy: it does not preserve spaces at all.
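
To make the sampling contrast concrete, here is a minimal sketch, assuming the `sentencepiece` and `tokenizers` Python packages; the model file `unigram.model` is a hypothetical placeholder for any pre-trained unigram model:

```python
import sentencepiece as spm
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Unigram (SentencePiece): tokens carry probabilities, so the same string
# can be segmented differently on each call when sampling is enabled.
# "unigram.model" is a hypothetical pre-trained unigram model file.
sp = spm.SentencePieceProcessor(model_file="unigram.model")
for _ in range(3):
    # enable_sampling draws a segmentation from the lattice; alpha is the
    # smoothing temperature, and nbest_size=-1 samples over all hypotheses.
    print(sp.encode("unbelievable", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))

# Plain BPE is greedy and deterministic: identical input, identical merges,
# identical output. BPE-dropout randomly skips each merge with probability p
# at encode time, which restores stochasticity.
tokenizer = Tokenizer(BPE(dropout=0.1))  # vocab and merges still to be trained or loaded
```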
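
The degrees of losslessness can be checked with an encode/decode round trip. A minimal sketch, assuming Hugging Face `transformers` and the stock `gpt2`, `xlnet-base-cased`, and `bert-base-uncased` checkpoints as stand-ins for byte-level BPE, SentencePiece, and WordPiece; exact behavior depends on each checkpoint’s normalization settings:

```python
from transformers import AutoTokenizer

text = "Hello    world"  # four consecutive spaces

for name in ["gpt2", "xlnet-base-cased", "bert-base-uncased"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{name:20s} -> {tok.decode(ids)!r}")

# Expected (roughly):
#   gpt2                 -> 'Hello    world'  (byte-level BPE keeps every space)
#   xlnet-base-cased     -> 'Hello world'     (SentencePiece collapses space runs)
#   bert-base-uncased    -> 'hello world'     (WordPiece reconstructs spacing; this
#                                              checkpoint also lowercases)
```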

This post draws upon information from:

  • Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., & Launay, J. (2023). The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (arXiv:2306.01116). arXiv. https://doi.org/10.48550/arXiv.2306.01116
  • Dowling, M., & Lucey, B. (2023). ChatGPT for (Finance) research: The Bananarama Conjecture. Finance Research Letters, 53, 103662. https://doi.org/10.1016/j.frl.2023.103662
  • Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H., Lan, Y., Wang, L., & Liu, T. (2020). On layer normalization in the transformer architecture. International Conference on Machine Learning, 10524–10533.
  • Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units (arXiv:1508.07909). arXiv. https://doi.org/10.48550/arXiv.1508.07909
  • Schuster, M., & Nakajima, K. (2012). Japanese and Korean voice search. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5149–5152. https://doi.org/10.1109/ICASSP.2012.6289079
  • Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing (arXiv:1808.06226). arXiv. https://doi.org/10.48550/arXiv.1808.06226