Understanding GPT tokenizers

Large language models like GPT-3/4, LLaMA, and PaLM operate on tokens rather than raw text. OpenAI provides a web-based Tokenizer tool, and the author has built his own Observable notebook for exploring tokenization. Experimenting with it shows that most common English words are assigned a single token, and many tokens include a leading space. Languages other than English tokenize less efficiently because the vocabulary is biased toward English. The notebook also surfaces “glitch tokens,” unusual tokens that can cause the model to behave unpredictably. Finally, the author introduces tiktoken, OpenAI’s Python library, and ttok, his command-line tool, both useful for counting the tokens in a string before passing it to the API.
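As a minimal sketch of the token-counting workflow the post describes, here is how tiktoken can be used to count and inspect tokens locally before calling the API (the sample text and the model name "gpt-3.5-turbo" are illustrative choices, not taken from the post):

```python
# Minimal sketch: count tokens with tiktoken before sending text to the API.
import tiktoken

text = "The dog eats the apples"

# Look up the encoding used by a particular model (illustrative model name).
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

# Encode the string into token IDs and count them.
token_ids = encoding.encode(text)
print(len(token_ids), token_ids)

# Decode each token individually to see how the text was split,
# including any leading spaces baked into the tokens.
print([encoding.decode([t]) for t in token_ids])
```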

https://simonwillison.net/2023/Jun/8/gpt-tokenizers/