I have been tinkering with quantization and pruning to reduce model sizes. So far, I've had modest success in producing, on average, 8% smaller versions with negligible loss of quality, and I think further reductions in the 10-15% range are realistic, but I've come across a behaviour I wasn't expecting!
Part of the process I'm following consists of quantizing the embedding and output layers aggressively. Since the embedding layer is more about lookup than complex computation, the vectors representing the relative distances between embeddings are usually preserved well enough making this layer fairly robust to quantization. So far, so good.
The output layer, on the other hand, maps the final hidden state to the vocabulary logits and therefore, small changes in these logits could lead to a different probability distribution over the vocabulary, resulting in incorrect word predictions, or so I thought.
Surprisingly, I'm finding that even at Q2_K the loss of overall capability is minimal. Was this to be expected? or am I missing something?
Researchers developed Sonic AI enabling precise facial animation from speech cues 🎧 Decouples head/expression control via audio tone analysis + time-aware fusion for natural long-form synthesis
Can we teach a model to think completely on its own without reinforcement learning? Actually, yes.
We can do straightforward supervised fine-tuning using a relatively simple trick: blurring a part of CoT thoughts. But why is this effective?
We observed that various models differ in their thinking processes, and fine-tuning one model on another model’s thoughts (CoT) can sometimes be inefficient—often resulting in the model simply memorizing reasoning rather than learning how to actually think.
I discovered that this process can still be efficient if we clearly indicate when the model should start and stop thinking and uncover only a part of CoT and the expected answer, blurring the other part of CoT. This approach allows the model to learn only a portion of the thought process while still arriving at an expected answer.
To demonstrate this, you can watch my experimental BT-SFT on meditsolutions/Llama-3.2-SUN-2.5B-chat model, which was fine-tuned on 151 million tokens from the Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B dataset.
Enjoy! 🚀
PS. If you were curious enough to read this, leave me a comment. It's always nice to chat with open-minded and intelligent ppl.
3 replies
·
reacted to Kseniase's
post with 🚀about 2 months ago
7 Open-source Methods to Improve Video Generation and Understanding
AI community is making great strides toward achieving the full potential of multimodality in video generation and understanding. Last week studies showed that working with videos is now one of the main focuses for improving AI models. Another highlight of the week is that open source, once again, proves its value. For those who were impressed by DeepSeek-R1, we’re with you!
Today, we’re combining these two key focuses and bringing you a list of open-source methods for better video generation and understanding:
Evolution and The Knightian Blindspot of Machine Learning
The paper discusses machine learning's limitations in addressing Knightian Uncertainty (KU), highlighting the fragility of models like reinforcement learning (RL) in unpredictable, open-world environments. KU refers to uncertainty that can't be quantified or predicted, a challenge that RL fails to handle due to its reliance on fixed data distributions and limited formalisms.
### Key Approaches:
1. **Artificial Life (ALife):** Simulating diverse, evolving systems to generate adaptability, mimicking biological evolution's robustness to unpredictable environments. 2. **Open-Endedness:** Creating AI systems capable of continuous innovation and adaptation, drawing inspiration from human creativity and scientific discovery.
3. **Revising RL Formalisms:** Modifying reinforcement learning (RL) models to handle dynamic, open-world environments by integrating more flexible assumptions and evolutionary strategies.
These approaches aim to address ML’s limitations in real-world uncertainty and move toward more adaptive, general intelligence.
🚀 HuggingFace Spaces Ranking Tracker - Your Complete AI Trend Analytics!
Introducing the Spaces Ranking Tracker, a comprehensive analytics dashboard that tracks and analyzes every AI application in the HuggingFace ecosystem.
✨ Key Features: • Real-time tracking of daily ranking changes over 30 days • Detailed analysis of top 100 trending spaces • User-based integrated score visualization • One-click access to space details • Interactive rank change graphs
📊 Dashboard Components: 1. Main Dashboard - Daily rank trend graphs - Top 20 creators' combined score chart - Detailed space information cards - Real-time trending score updates
2. Space Detailed Analysis - Creation date, current rank, and trending score - 30-day ranking history - Direct space access - Custom color coding for intuitive rank display
🎯 How to Use: • Monitor latest AI community trends • Track your project's performance • Discover popular AI demos • Analyze competing projects • Follow AI ecosystem dynamics
3. Interactive Features - Custom filtering options - Sorting by various metrics - Detailed performance statistics - Comprehensive trending scores - Historical data tracking
Stay on top of every movement in the HuggingFace ecosystem with daily ranking updates! 👉 Try it now!
Before you panic, there's a new "preferred" method which is online (I prefer the term on-the-fly) repacking, so if you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (minor losses I think due to using intrinsics instead of assembly, but intrinsics are more maintainable)
So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless they add backwards compatibility back), but Q4_0 should be the same speeds (though it may currently be bugged on some platforms)
As such, I'll stop making those newer model formats soon, probably end of this week unless something changes, but you should be safe to download and Q4_0 quants and use those !
Also IQ4_NL supports repacking though not in as many shapes yet, but should get a respectable speed up on ARM chips, PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541
Remember, these are not meant for Apple silicon since those use the GPU and don't benefit from the repacking of weights
16 replies
·
reacted to burtenshaw's
post with 🚀about 2 months ago
There's so much you could do with these developments. Especially combining them together into agentic applications or fine-tuning them on your use case.