alkinun
AtAndDev
AI & ML interests
LLMs, Alignment, Merging, Unsloth, DPO, SFT, ORPO, SPIN..
Recent Activity
liked
a dataset
about 13 hours ago
TALRAS/magpie-ultra-5k-11-tasks
liked
a model
about 24 hours ago
Qwen/Qwen3-30B-A3B-Instruct-2507
updated
a dataset
1 day ago
TALRAS/magpie-ultra-5k-11-tasks
Organizations

reacted to
FlameF0X's
post
3 days ago

posted
an
update
7 days ago
Post
266
Qwen 3 Coder is a personal attack on K2, and I love it.
It achieves near-SOTA on LCB while not having reasoning.
Finally, people are understanding that reasoning isn't necessary for high benches...
Qwen ftw!
DECENTRALIZE DECENTRALIZE DECENTRALIZE

reacted to
AdinaY's
post with 🔥
7 days ago
Post
3363
Qwen3-Coder 💻 agentic code model by the Alibaba Qwen team
Qwen/Qwen3-Coder-480B-A35B-Instruct
✨ 480B total, 35B activated MoE
✨ Agentic Coding + Browser Use → Top code model performance
✨ 256K context (up to 1M via YaRN) for repo-scale understanding
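Not part of the post: a minimal sketch of querying this checkpoint through the huggingface_hub InferenceClient, assuming an inference provider actually serves Qwen/Qwen3-Coder-480B-A35B-Instruct (at 480B it is far too large for most local setups); the prompt and token handling are placeholders.

from huggingface_hub import InferenceClient  # pip install huggingface_hub

# Assumes you are logged in (or pass token=...) and that a provider serves this checkpoint.
client = InferenceClient(model="Qwen/Qwen3-Coder-480B-A35B-Instruct")

response = client.chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    max_tokens=512,
)
print(response.choices[0].message.content)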

reacted to
eliebak's
post with 🔥
8 days ago
Post
4428
Kimi K2 tech report is full of gems as always. Here are my notes on it:
> MuonClip: Pretty crazy how after 70k the training stabilizes and the QK-clip is basically inactive. There is also no loss in perf with QK-clip, which is not trivial at all (at small scale but with an aggressive threshold). Also a cool explanation of why Muon makes the logits explode in appendix E (tl;dr: Muon makes the singular values of the update matrix higher). (A rough sketch of the clipping idea follows after these notes.)
> Sparsity scaling laws to justify their ratio; they have a very solid training infra that allows the model to be trained at this sparsity level. They could have increased it even more, but as sparsity increases the training becomes less efficient.
> They reduce the number of attention heads to make the model more efficient for long context, since attention heads are a big bottleneck for long context. They also remove 2 of the 3 "first dense" layers in the DSv3 arch.
With the sparsity and the attention heads (divided by 2) they achieve 83% increased FLOPs compared to the DeepSeek V3 arch at 128k.
> Data: Rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus to have different styles; for longer documents they do it by chunk (see the second sketch below). I'm (half) surprised by the fact that ONLY 1 epoch (assuming the same number of training tokens, I think?) of data rephrased 10 times has better accuracy than 10 epochs of the same data rephrased once.
> They do rewriting for Math and Knowledge; for Math they apply the SwallowMath recipe and instruct the model to rephrase in a "learning note" style.
> They talk about diversity and probably have some internal stuff/evals to test that; as always, it's still a bit unclear to me how to properly measure it.
The infra is also very nice, quick summary:
> PP=16 (1F1B schedule, a bit custom), EP=16, ZeRO-1
> No FP8 computation, but FP8 storage for specific layers; selective recomputation for inexpensive blocks; activation offloading to CPU
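Not part of eliebak's notes: a minimal Python/PyTorch sketch of the QK-clip idea described above, assuming that "clipping" means rescaling a head's query and key projection weights whenever its maximum pre-softmax attention logit exceeds a threshold tau. The function name, the even split of the rescale factor, and the default threshold are illustrative assumptions, not the authors' code.

import torch

def qk_clip_(w_q: torch.Tensor, w_k: torch.Tensor, max_logit: float, tau: float = 100.0) -> None:
    """Hypothetical QK-clip step for one attention head (sketch of the idea, not Kimi K2's code).

    If the largest pre-softmax attention logit observed for this head exceeds `tau`,
    rescale the query and key projection weights in place so that q @ k^T shrinks
    back to roughly `tau`.
    """
    if max_logit > tau:
        gamma = tau / max_logit   # overall shrink factor for the logits
        scale = gamma ** 0.5      # split evenly between W_q and W_k
        with torch.no_grad():
            w_q.mul_(scale)
            w_k.mul_(scale)

# usage sketch, run after each optimizer step for every head:
# qk_clip_(head.w_q.weight, head.w_k.weight, max_logit=observed_max_logit, tau=100.0)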
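Also not from the notes: a toy sketch of the chunk-wise rephrasing recipe mentioned above (rephrase long documents chunk by chunk, produce k stylistic rephrasings, then train one epoch over the union instead of k epochs over the originals). rephrase_with_llm is a hypothetical helper standing in for whatever model and prompt were actually used.

from typing import Callable, List

def rephrase_document(doc: str, rephrase_with_llm: Callable[[str], str], chunk_chars: int = 4000) -> str:
    """Rephrase a long document chunk by chunk and stitch the chunks back together (toy sketch)."""
    chunks = [doc[i:i + chunk_chars] for i in range(0, len(doc), chunk_chars)]
    return "".join(rephrase_with_llm(chunk) for chunk in chunks)

def build_rephrased_corpus(docs: List[str], rephrase_with_llm: Callable[[str], str], k: int = 10) -> List[str]:
    """Produce k stylistic rephrasings per document; train one epoch over this instead of k epochs over `docs`."""
    return [rephrase_document(doc, rephrase_with_llm) for doc in docs for _ in range(k)]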
Zig ftw

reacted to
erikkaum's
post with 🔥
12 days ago
Post
2532
ZML just released a technical preview of their new Inference Engine: LLMD.
- Just 2.4GB container, which means fast startup times and efficient autoscaling
- Cross-Platform GPU Support: works on both NVIDIA and AMD GPUs.
- Written in Zig
I just tried it out, deployed it on Hugging Face Inference Endpoints, and wrote a quick guide. You can try it in like 5 minutes!
https://huggingface.co/blog/erikkaum/test-driving-llmd-inference-engine
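Not from the post: a rough sketch of calling such an Inference Endpoint once the LLMD container is deployed, under the assumption that it exposes an OpenAI-style /v1/chat/completions route (the linked guide has the authoritative instructions; the endpoint URL, model name, and token below are placeholders).

import os
import requests

ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"  # placeholder endpoint URL
HF_TOKEN = os.environ["HF_TOKEN"]  # the endpoint's access token

# Assumption: the deployed LLMD container exposes an OpenAI-compatible chat route.
response = requests.post(
    f"{ENDPOINT_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {HF_TOKEN}", "Content-Type": "application/json"},
    json={
        "model": "placeholder-model-name",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])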

reacted to
fdaudens's
post with 🔥
14 days ago
Post
2481
You might not have heard of Moonshot AI, but within 24 hours their new model Kimi K2 shot to the top of Hugging Face's trending leaderboard.
So… who are they, and why does it matter?
Had a lot of fun co-writing this blog post with @xianbao, with key insights translated from Chinese, to unpack how this startup built a model that outperforms GPT-4.1, Claude Opus, and DeepSeek V3 on several major benchmarks.
🧵 A few standout facts:
1. From zero to $3.3B in 18 months:
Founded in March 2023, Moonshot is now backed by Alibaba, Tencent, Meituan, and HongShan.
2. A CEO who thinks from the end:
Yang Zhilin (31) previously worked at Meta AI, Google Brain, and Carnegie Mellon. His vision? Nothing less than AGI, still a rare ambition among Chinese AI labs.
3. A trillion-parameter model that's surprisingly efficient:
Kimi K2 uses a mixture-of-experts architecture (32B active params per inference) and dominates on coding/math benchmarks.
4. The secret weapon: Muon optimizer:
A new training method that doubles efficiency, cuts memory in half, and ran 15.5T tokens with zero failures. Big implications.
Most importantly, their move from closed to open source signals a broader shift in China's AI scene, following Baidu's pivot. But as Yang puts it: "Users are the only real leaderboard."
Check out the full post to explore what Kimi K2 can do, how to try it, and why it matters for the future of open-source LLMs:
https://huggingface.co/blog/fdaudens/moonshot-ai-kimi-k2-explained
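Not from the post: point 4 above mentions the Muon optimizer, so here is a toy Python/PyTorch sketch of its publicly known core idea, orthogonalizing the momentum matrix with a Newton-Schulz iteration before applying the update. The iteration coefficients, hyperparameters, and update scaling below are simplified placeholders, not Moonshot's implementation.

import torch

def newton_schulz_orthogonalize(m: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D matrix with a simple Newton-Schulz iteration (toy sketch)."""
    x = m / (m.norm() + 1e-7)  # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # keep the smaller dimension first so x @ x.T stays small
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T @ x)  # cubic step toward the nearest orthogonal matrix
    return x.T if transposed else x

def muon_like_step(weight: torch.Tensor, momentum: torch.Tensor, grad: torch.Tensor,
                   lr: float = 0.02, beta: float = 0.95) -> None:
    """One simplified Muon-style update for a 2D weight matrix (illustrative, not Moonshot's code)."""
    with torch.no_grad():
        momentum.mul_(beta).add_(grad)                  # heavy-ball momentum accumulation
        update = newton_schulz_orthogonalize(momentum)  # orthogonalized update direction
        weight.add_(update, alpha=-lr)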

reacted to
prithivMLmods's
post with ❤️
14 days ago
Post
2163
Open Omega Ω (Forge, Atom, Explora):
A Fusion of Math, Science, and Coding 🧪
Datasets :
> Open-Omega-Forge-1M [Mathematics, Coding, and Science]: prithivMLmods/Open-Omega-Forge-1M
> Open-Omega-Atom-1.5M [Mathematics and Science]: prithivMLmods/Open-Omega-Atom-1.5M
> Open-Omega-Explora-2.5M [Forge + Atom]: prithivMLmods/Open-Omega-Explora-2.5M
> Others [Subordinate portion] - Curated and blended modular datasets.
Models :
> Omega-Qwen3-Atom-8B : prithivMLmods/Omega-Qwen3-Atom-8B
> Omega-Qwen2.5-Coder-3B : prithivMLmods/Omega-Qwen2.5-Coder-3B
Dataset Collection: prithivMLmods/open-omega-a-fusion-of-math-science-and-coding-68756c37769fa39c4055cc0e
.
.
.
For more information, refer to the dataset card(s).
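Not part of the announcement: a minimal sketch of pulling one of the datasets above with the datasets library, streaming because Forge-1M is large; the split name and schema are assumptions to check against the dataset card.

from datasets import load_dataset  # pip install datasets

# Stream the ~1M-row Forge dataset instead of downloading it all at once.
# The split name and columns are assumptions; check the dataset card.
ds = load_dataset("prithivMLmods/Open-Omega-Forge-1M", split="train", streaming=True)

for i, row in enumerate(ds):
    print(row)  # inspect the schema from the first few rows
    if i >= 2:
        break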

reacted to
hba123's
post with 🔥
14 days ago
Post
1941
Ark is now pip-installable and supports the following robots!! If you want to do robotics in python, check it out here: https://robotics-ark.github.io/ark_robotics.github.io/
Now you can pip-install robotics and work completely in Python. Why Ark, you ask? Well, we love Python :D

reacted to
danielhanchen's
post with 🔥
14 days ago
Post
2805

replied to
jasoncorkill's
post
15 days ago
Love this kind of research!

reacted to
jasoncorkill's
post with 🔥
15 days ago
Post
3202
"Why did the bee get married?"
"Because he found his honey!"
This was the "funniest" joke out of the 10'000 jokes we generated with LLMs, with 68% of respondents rating it as "funny".
Original jokes are particularly hard for LLMs, as jokes are very nuanced and a lot of context is needed to understand whether something is "funny", which can only reliably be measured with humans.
LLMs are not equally good at generating jokes in every language. The generated English jokes turned out to be way funnier than the Japanese ones: 46% of English-speaking voters on average found the generated joke funny. The same statistic for other languages:
Vietnamese: 44%
Portuguese: 40%
Arabic: 37%
Japanese: 28%
There is not much variance in generation quality among models for any fixed language, but Claude Sonnet 4 still slightly outperforms the others in Vietnamese, Arabic, and Japanese, and Gemini 2.5 Flash does in Portuguese and English.
We have released the 1 million (!) native-speaker ratings and the 10'000 jokes as a dataset for anyone to use:
Rapidata/multilingual-llm-jokes-4o-claude-gemini
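Not from the post: a minimal sketch of loading the released jokes/ratings with the datasets library; the split name and columns are assumptions to verify against the dataset card.

from datasets import load_dataset  # pip install datasets

# The split name is an assumption; check the dataset card for the exact configuration.
ds = load_dataset("Rapidata/multilingual-llm-jokes-4o-claude-gemini", split="train")

print(ds)     # number of rows and column names
print(ds[0])  # one joke/rating record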

reacted to
hesamation's
post with ❤️
17 days ago
Post
4642
in case you didn't know, Claude now has a developer training course with certificates,
this is better than anything you can find on Coursera.
covers Claude Code, MCP and its advanced topics, and even more:
https://www.anthropic.com/learn/build-with-claude