Current LLMs process text by first splitting it into tokens. They use a module called a "tokenizer" that -spl-it-s- th-e- te-xt- in-to- arbitrary tokens according to a fixed dictionary. On the Hub, you can find this dictionary in a model's files under tokenizer.json.
⚡️ This process is called BPE tokenization, and it is widely considered suboptimal: it breaks text into predefined chunks that often fail to capture the nuance of language. But it has been a necessary evil in language models since their inception.
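To make this concrete, here is a minimal sketch of what a BPE tokenizer does, using the transformers AutoTokenizer. The model ID "gpt2" and the example sentence are just assumptions for illustration; any Hub model with a tokenizer.json behaves similarly.

```python
# Minimal sketch: how a BPE tokenizer chops text into predefined chunks.
# "gpt2" is only a familiar example; any Hub model with a tokenizer.json works the same way.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization is suboptimal."
tokens = tokenizer.tokenize(text)               # sub-word pieces from the fixed dictionary
ids = tokenizer.convert_tokens_to_ids(tokens)   # the integer IDs the model actually sees

print(tokens)  # e.g. ['Token', 'ization', 'Ġis', 'Ġsub', 'optimal', '.']
print(ids)
```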
🔥 In Byte Latent Transformer (BLT), Meta researchers propose an elegant solution by eliminating tokenization entirely, working directly with raw bytes while maintaining efficiency through dynamic "patches."
Byte-level approaches had been tried before, but this is the first time an architecture of this type scales as well as BPE tokenization. It could mean a real paradigm shift!
Architecture: Instead of a lightweight tokenizer, BLT has a lightweight encoder that processes raw bytes into patches. The patches are then processed by the main, heavy-duty transformer as usual (but over patches of bytes instead of tokens), before being converted back to bytes.
🧩 Dynamic Patching: Instead of fixed tokens, BLT groups bytes based on their predictability (measured by entropy), spending more compute on complex sequences and handling simple ones cheaply. This allows efficient processing while maintaining byte-level understanding.
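Here is a rough sketch of that dynamic patching idea, assuming a tiny byte-level LM that returns next-byte probabilities and a hand-picked entropy threshold. This is my reading of the mechanism, not Meta's implementation.

```python
# Hedged sketch of entropy-based dynamic patching (an interpretation, not BLT's actual code).
import math
from typing import Callable, List

def entropy(probs: List[float]) -> float:
    """Shannon entropy (in bits) of a next-byte distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def dynamic_patches(
    data: bytes,
    next_byte_probs: Callable[[bytes], List[float]],  # assumed tiny byte-level LM: prefix -> 256 probs
    threshold: float = 2.0,                            # assumed hyperparameter, not from the paper
) -> List[bytes]:
    """Group bytes into variable-length patches: a new patch starts where the
    next byte is hard to predict (high entropy), so complex regions get more,
    smaller patches and easy regions get fewer, larger ones."""
    if not data:
        return []
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropy(next_byte_probs(data[:i])) > threshold:
            patches.append(data[start:i])   # close the current patch before the "hard" byte
            start = i
    patches.append(data[start:])
    return patches

# The large latent transformer then attends over these patches (not individual
# bytes), and a lightweight decoder maps its outputs back to bytes.
```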
I hope this breakthrough is confirmed so we can get rid of all the tokenizer machinery: it would make model handling much easier!
After some heated discussion 🔥, we're clarifying our intent regarding storage limits on the Hub.
TL;DR:
- Public storage is free and (barring blatant abuse) unlimited. We do ask that you consider upgrading to PRO and/or Enterprise Hub if possible.
- Private storage is paid above a significant free tier (1 TB if you have a paid account, 100 GB otherwise).
We continuously optimize our infrastructure to scale our storage for the coming years of growth in machine learning, to the benefit of the community 🔥
Six predictions for AI in 2025 (and a review of how my 2024 predictions turned out):
- There will be the first major public protest related to AI.
- A big company will see its market cap divided by two or more because of AI.
- At least 100,000 personal AI robots will be pre-ordered.
- China will start to lead the AI race (as a consequence of leading the open-source AI race).
- There will be big breakthroughs in AI for biology and chemistry.
- We will begin to see the economic and employment growth potential of AI, with 15M AI builders on Hugging Face.
How my predictions for 2024 turned out:
- A hyped AI company will go bankrupt or get acquired for a ridiculously low price ✅ (Inflection, AdeptAI, ...)
- Open-source LLMs will reach the level of the best closed-source LLMs ✅ with QwQ and dozens of others
- Big breakthroughs in AI for video, time-series, biology and chemistry ✅ for video 🔴 for time-series, biology and chemistry
- We will talk much more about the cost (monetary and environmental) of AI ✅ Monetary 🔴 Environmental (😢)
- A popular media will be mostly AI-generated ✅ with NotebookLM by Google
- 10 million AI builders on Hugging Face, leading to no increase in unemployment: currently 7M AI builders on Hugging Face
Are you familiar with the difference between discrete learning and predictive learning? This distinction is exactly why LLMs are not designed to perform and execute function calls: they are not the right shape for it. LLMs are prediction machines; function calling requires discrete learning machines. Fortunately, you can easily couple an LLM with a discrete learning algorithm; you simply need to know the math to do it. Want to dive deeper into this subject? Check out this video.
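The post does not spell out how the coupling works, so here is one possible, hedged reading: the LLM side is reduced to a hypothetical embed_text() feature extractor (in practice you would call a real LLM or embedding model), and a decision tree plays the "discrete learning" role of choosing which function to call. The function registry and example requests are made up for illustration.

```python
# Hedged sketch of one way to couple an LLM-style feature extractor with a
# discrete learner for function routing. embed_text and FUNCTIONS are
# hypothetical stand-ins, not an API from the post or any specific library.
from sklearn.tree import DecisionTreeClassifier

FUNCTIONS = ["get_weather", "book_meeting", "send_email"]  # hypothetical registry

def embed_text(text: str) -> list[float]:
    # Stand-in for an LLM or embedding-model call; here just crude keyword indicators.
    t = text.lower()
    return [
        float("weather" in t or "rain" in t),
        float("schedule" in t or "meeting" in t or "call" in t),
        float("email" in t or "send" in t),
    ]

# Tiny labelled set mapping example requests to the function they should trigger.
requests = [
    "What's the weather in Paris?",
    "Schedule a call with Ana tomorrow",
    "Email the report to the team",
]
X = [embed_text(r) for r in requests]
y = [0, 1, 2]  # indices into FUNCTIONS

clf = DecisionTreeClassifier().fit(X, y)  # the discrete decision-maker

def route(request: str) -> str:
    """Discrete decision: which function should handle this request?"""
    return FUNCTIONS[int(clf.predict([embed_text(request)])[0])]

print(route("Any rain expected in Lisbon today?"))  # -> "get_weather"
```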