---
library_name: transformers
tags: []
---

# finewebedu_32000

## About

🇬🇧 An English tokenizer, trained on the [FineWeb-Edu dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).

## Description

This is a **character-level**, (mainly) English (en) tokenizer, trained on the CC-MAIN-2024-10 subset of FineWeb-Edu. It has a vocabulary size of 32,000 (a [multiple of 128](https://x.com/karpathy/status/1621578354024677377)), which keeps the embedding and output-projection matrix shapes GPU-friendly when the tokenizer is integrated into a model.

## Usage

```python
from tokenizers import Tokenizer

# Load the trained tokenizer from the Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("gvlassis/finewebedu_32000")
```
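
Once loaded, the tokenizer exposes the standard `tokenizers` API. A minimal sketch of an encode/decode round trip (the sample sentence is arbitrary):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("gvlassis/finewebedu_32000")

# Encode a sample sentence into an Encoding object
encoding = tokenizer.encode("FineWeb-Edu is a high-quality web dataset.")
print(encoding.tokens)  # token strings
print(encoding.ids)     # integer token IDs

# Decode the IDs back into text
print(tokenizer.decode(encoding.ids))
```

Since the card declares `library_name: transformers`, the tokenizer may also be loadable via `transformers.AutoTokenizer.from_pretrained("gvlassis/finewebedu_32000")`, assuming the repository ships the files `AutoTokenizer` expects.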