Aymeric Roucher (m-ric)

AI & ML interests

MLE at Hugging Face 🤗 LLMs, Agents, RAG, Multimodal.

Posts

By far the coolest release of the day!
> The Open LLM Leaderboard, the most comprehensive suite for comparing open LLMs across many benchmarks, just released a comparator tool that lets you dig into the details of the differences between any two models.

Here's me checking how the new Llama-3.1-Nemotron-70B that we've heard so much about compares to the original Llama-3.1-70B. 🤔🔎

Try it out here 👉 open-llm-leaderboard/comparator
⚡️ This month's most important breakthrough: Differential Transformer vastly improves attention ⇒ better retrieval and fewer hallucinations!

Thought self-attention couldn't be improved any further?

Microsoft researchers have dropped a novel "differential attention" mechanism that amplifies focus on relevant context while canceling out noise. It sounds like a free lunch, but it really does seem to vastly improve LLM performance!

Key insights:

🧠 Differential attention computes the difference between two separate softmax attention maps, canceling out noise and promoting sparse attention patterns (see the sketch after this list)

🔥 DIFF Transformer outperforms standard Transformers while using 35-40% fewer parameters or training tokens

📏 Scales well to long contexts up to 64K tokens, leveraging increasing context length more effectively

🔎 Dramatically improves key information retrieval, enhancing in-context learning and possibly reducing the risk of hallucinations 🤯

🔢 Reduces activation outliers, potentially enabling lower-bit quantization without performance drop!

⚙️ Can be directly implemented using existing FlashAttention kernels
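
To make the mechanism concrete, here's a minimal single-head PyTorch sketch of the core idea: two independent softmax attention maps over the same values, with the second one subtracted out under a learnable λ. This is an illustrative simplification, not the paper's actual implementation (the paper uses multi-head attention, reparameterizes λ through learnable vectors, and applies a per-head group norm on the output); the class name `DifferentialAttention` and the toy dimensions are my own choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentialAttention(nn.Module):
    """Minimal single-head sketch: the difference of two softmax attention
    maps, weighted by a learnable lambda (simplified vs. the paper)."""

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        # Two sets of query/key projections, one shared value projection
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        # The paper reparameterizes lambda through learnable vectors;
        # a single learnable scalar is enough for this sketch
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)

        # Two independent softmax attention maps over the same sequence
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)

        # Subtracting the second map cancels attention noise common to both,
        # leaving sparser, more focused attention on the relevant tokens
        return (a1 - self.lmbda * a2) @ v


# Toy usage: batch of 2 sequences, 16 tokens, model dim 64
x = torch.randn(2, 16, 64)
out = DifferentialAttention(d_model=64, d_head=32)(x)
print(out.shape)  # torch.Size([2, 16, 32])
```

Note that each of the two maps is just a standard softmax attention, which is why the whole thing can reuse existing FlashAttention kernels as mentioned above.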

This new architecture could lead to much more capable LLMs, with much stronger long-context understanding and factual accuracy.

But they didn’t release weights on the Hub: let’s wait for the community to train the first open-weights DiffTransformer! 🚀

Read their paper 👉  Differential Transformer (2410.05258)