
Kuldeep Singh Sidhu

singhsidhukuldeep

AI & ML interests

Seeking contributors for a completely open-source 🚀 Data Science platform! singhsidhukuldeep.github.io


Posts 73

All the way from Korea, a novel approach called Mentor-KD significantly improves the reasoning abilities of small language models.

Mentor-KD introduces an intermediate-sized "mentor" model to augment training data and provide soft labels during knowledge distillation from large language models (LLMs) to smaller models.

Broadly, it's a two-stage process:
1) Fine-tune the mentor on filtered Chain-of-Thought (CoT) annotations from an LLM teacher.
2) Use the mentor to generate additional CoT rationales and soft probability distributions.
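
For intuition, here is a minimal PyTorch/Transformers sketch of what that second stage could look like. The checkpoint path, prompt format, and sampling settings are illustrative assumptions rather than the paper's exact setup, and the rationale-filtering step is omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical mentor checkpoint, already fine-tuned on filtered teacher CoT (stage 1).
MENTOR_NAME = "path/to/finetuned-mentor"  # placeholder, not the paper's checkpoint

tokenizer = AutoTokenizer.from_pretrained(MENTOR_NAME)
mentor = AutoModelForCausalLM.from_pretrained(MENTOR_NAME).eval()

@torch.no_grad()
def augment(question: str, num_rationales: int = 4):
    """Generate extra CoT rationales and the mentor's soft probability distribution."""
    prompt = f"Q: {question}\nA: Let's think step by step."
    inputs = tokenizer(prompt, return_tensors="pt")

    # Sample several candidate rationales from the mentor.
    generations = mentor.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=num_rationales,
        max_new_tokens=256,
    )
    rationales = tokenizer.batch_decode(generations, skip_special_tokens=True)

    # Soft labels: the mentor's next-token probability distributions.
    logits = mentor(**inputs).logits          # (1, seq_len, vocab)
    soft_labels = logits.softmax(dim=-1)

    return rationales, soft_labels
```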

The student model is then trained using:
- CoT rationales from both the teacher and mentor (rationale distillation).
- Soft labels from the mentor (soft label distillation).
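
Conceptually, the two signals can be folded into one training objective. Below is a minimal sketch of such a combined loss; the weighting alpha, the temperature, and the tensor shapes are assumptions for illustration, not the paper's exact formulation.

```python
import torch.nn.functional as F

def mentor_kd_loss(student_logits, rationale_labels, mentor_logits,
                   alpha: float = 0.5, temperature: float = 2.0):
    """Blend rationale distillation (cross-entropy on CoT tokens) with
    soft label distillation (KL to the mentor's smoothed distribution).

    student_logits   : (batch, seq_len, vocab) raw scores from the student
    rationale_labels : (batch, seq_len) token ids of teacher/mentor rationales
    mentor_logits    : (batch, seq_len, vocab) raw scores from the mentor
    """
    vocab = student_logits.size(-1)

    # Rationale distillation: next-token cross-entropy on the CoT targets.
    ce = F.cross_entropy(
        student_logits.reshape(-1, vocab),
        rationale_labels.reshape(-1),
        ignore_index=-100,          # mask out padding / prompt positions
    )

    # Soft label distillation: KL between temperature-smoothed distributions.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    q_mentor = F.softmax(mentor_logits / temperature, dim=-1)
    kl = F.kl_div(log_p_student, q_mentor, reduction="batchmean") * temperature ** 2

    return alpha * ce + (1.0 - alpha) * kl
```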

Results show that Mentor-KD consistently outperforms baselines, with up to 5% accuracy gains on some tasks.

Mentor-KD is especially effective in low-resource scenarios, achieving comparable performance to baselines while using only 40% of the original training data.

This work opens up exciting possibilities for making smaller, more efficient language models better at complex reasoning tasks.

What are your thoughts on this approach?
While Google's original Transformer paper declared "Attention is all you need," Microsoft and Tsinghua University are here with the DIFF Transformer, effectively saying, "Sparse attention is all you need."

The DIFF Transformer outperforms traditional Transformers in scaling properties, requiring only about 65% of the model size or training tokens to achieve comparable performance.

The secret sauce? A differential attention mechanism that amplifies focus on relevant context while canceling out noise, leading to sparser and more effective attention patterns.

How?
- It uses two separate softmax attention maps and subtracts them.
- It employs a learnable scalar λ for balancing the attention maps.
- It implements GroupNorm for each attention head independently.
- It is compatible with FlashAttention for efficient computation.
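
For intuition, here is a minimal single-head PyTorch sketch of the differential attention idea. The paper's multi-head splitting, λ reparameterization, causal masking, and FlashAttention kernels are omitted; the plain learnable scalar λ and the GroupNorm placement here are simplifying assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentialAttention(nn.Module):
    """Simplified single-head differential attention.

    Two softmax attention maps are computed from two sets of Q/K projections,
    and the second map (scaled by a learnable lambda) is subtracted from the
    first, cancelling common-mode attention noise.
    """

    def __init__(self, d_model: int, d_head: int, lambda_init: float = 0.8):
        super().__init__()
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)  # Q1 and Q2
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)  # K1 and K2
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        self.out_proj = nn.Linear(d_head, d_model, bias=False)
        # The paper derives lambda from learnable vectors; a plain learnable
        # scalar is used here for simplicity.
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))
        # Per-head normalization of the attention output.
        self.norm = nn.GroupNorm(num_groups=1, num_channels=d_head)
        self.scale = 1.0 / math.sqrt(d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)

        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        attn = a1 - self.lmbda * a2           # differential attention map

        out = attn @ v                        # (batch, seq_len, d_head)
        out = self.norm(out.transpose(1, 2)).transpose(1, 2)  # normalize channels
        return self.out_proj(out)
```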

What do you get?
- Superior long-context modeling (up to 64K tokens).
- Enhanced key information retrieval.
- Reduced hallucination in question-answering and summarization tasks.
- More robust in-context learning, less affected by prompt order.
- Mitigation of activation outliers, opening doors for efficient quantization.

Extensive experiments show DIFF Transformer's advantages across various tasks and model sizes, from 830M to 13.1B parameters.

This innovative architecture could be a game-changer for the next generation of LLMs. What are your thoughts on DIFF Transformer's potential impact?

models

None public yet

datasets

None public yet