Educational code for the (byte-level) Byte Pair Encoding (BPE) algorithm

minbpe offers minimal, clean code for the byte-level Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization. The algorithm operates on the raw bytes of UTF-8 encoded text and is used in modern LLMs such as GPT and Llama. The repository includes three tokenizers: BasicTokenizer, which runs BPE directly on the byte stream; RegexTokenizer, which first splits text with a regex pattern (as in GPT-2/GPT-4) so that merges never cross category boundaries such as letters, digits, and punctuation; and GPT4Tokenizer, which reproduces the exact tokenization of GPT-4. The script train.py trains the tokenizers on input text and saves the resulting vocabulary for visualization. Detailed comments and usage examples are included in the files. Upcoming improvements may include optimized versions and handling of special tokens.
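The core training loop described above can be sketched as follows. This is a minimal illustration of byte-level BPE, not the repository's exact code; the helper names `get_stats` and `merge` are chosen here for clarity:

```python
def get_stats(ids):
    # Count the frequency of each adjacent pair of token ids.
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    # Replace every occurrence of `pair` with the new token id `idx`.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(idx)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from the raw UTF-8 bytes of the text (token ids 0..255),
# then repeatedly merge the most frequent pair into a new id.
text = "aaabdaaabac"
ids = list(text.encode("utf-8"))
merges = {}
num_merges = 3
for j in range(num_merges):
    stats = get_stats(ids)
    pair = max(stats, key=stats.get)  # most frequent adjacent pair
    idx = 256 + j                     # new token ids start after the 256 byte values
    ids = merge(ids, pair, idx)
    merges[pair] = idx
```

Each merge shrinks the sequence and grows the vocabulary by one token; encoding new text replays the learned merges in order.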

https://github.com/karpathy/minbpe

To top