t.d.a.g.

sequelbox

AI & ML interests

open source, infinite games. (they/them)

Recent Activity

liked a dataset about 17 hours ago
KingNish/reasoning-base-20k
liked a dataset about 18 hours ago
qingy2024/QwQ-LongCoT-Verified-130K
liked a dataset about 18 hours ago
amphora/QwQ-LongCoT-130K

Organizations

Valiant Labs

sequelbox's activity

reacted to m-ric's post with 👀 4 days ago
๐‡๐ฎ๐ ๐ ๐ข๐ง๐  ๐…๐š๐œ๐ž ๐ซ๐ž๐ฅ๐ž๐š๐ฌ๐ž๐ฌ ๐๐ข๐œ๐จ๐ญ๐ซ๐จ๐ง, ๐š ๐ฆ๐ข๐œ๐ซ๐จ๐ฌ๐œ๐จ๐ฉ๐ข๐œ ๐ฅ๐ข๐› ๐ญ๐ก๐š๐ญ ๐ฌ๐จ๐ฅ๐ฏ๐ž๐ฌ ๐‹๐‹๐Œ ๐ญ๐ซ๐š๐ข๐ง๐ข๐ง๐  ๐Ÿ’๐ƒ ๐ฉ๐š๐ซ๐š๐ฅ๐ฅ๐ž๐ฅ๐ข๐ณ๐š๐ญ๐ข๐จ๐ง ๐Ÿฅณ

๐Ÿ•ฐ๏ธ Llama-3.1-405B took 39 million GPU-hours to train, i.e. about 4.5 thousand years.

๐Ÿ‘ด๐Ÿป If they had needed all this time, we would have GPU stories from the time of Pharaoh ๐“‚€: "Alas, Lord of Two Lands, the shipment of counting-stones arriving from Cathay was lost to pirates, this shall delay the building of your computing temple by many moons "

๐Ÿ› ๏ธ But instead, they just parallelized the training on 24k H100s, which made it take just a few months.
This required parallelizing across 4 dimensions: data, tensor, context, pipeline.
And it is infamously hard to do, making for bloated code repos that hold together only by magic.
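For a concrete picture of what those four axes look like in code, here is a minimal sketch (an illustration with assumed axis sizes built on PyTorch's DeviceMesh, not Picotron's actual API):

```python
# Minimal sketch of a 4D-parallel layout, assuming PyTorch >= 2.3 and 16 GPUs
# launched via torchrun -- illustrative only, not Picotron's actual code.
from torch.distributed.device_mesh import init_device_mesh

# Split 16 GPUs across the four axes (sizes here are made-up assumptions):
# 2-way data x 2-way tensor x 2-way context x 2-way pipeline = 16 ranks.
mesh = init_device_mesh(
    "cuda",
    mesh_shape=(2, 2, 2, 2),
    mesh_dim_names=("dp", "tp", "cp", "pp"),
)

# Each named axis yields a process group: gradients are all-reduced over "dp",
# weight shards are gathered over "tp", long sequences are split over "cp",
# and activations flow to the next stage over "pp".
dp_group = mesh["dp"].get_group()
tp_group = mesh["tp"].get_group()
cp_group = mesh["cp"].get_group()
pp_group = mesh["pp"].get_group()
```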

๐Ÿค ๐—•๐˜‚๐˜ ๐—ป๐—ผ๐˜„ ๐˜„๐—ฒ ๐—ฑ๐—ผ๐—ป'๐˜ ๐—ป๐—ฒ๐—ฒ๐—ฑ ๐—ต๐˜‚๐—ด๐—ฒ ๐—ฟ๐—ฒ๐—ฝ๐—ผ๐˜€ ๐—ฎ๐—ป๐˜†๐—บ๐—ผ๐—ฟ๐—ฒ! Instead of building mega-training codes, Hugging Face colleagues cooked in the other direction, towards tiny 4D parallelism libs. A team has built Nanotron, already widely used in industry.
And now a team releases Picotron, a radical approach to code 4D Parallelism in just a few hundred lines of code, a real engineering prowess, making it much easier to understand what's actually happening!

โšก ๐—œ๐˜'๐˜€ ๐˜๐—ถ๐—ป๐˜†, ๐˜†๐—ฒ๐˜ ๐—ฝ๐—ผ๐˜„๐—ฒ๐—ฟ๐—ณ๐˜‚๐—น:
Counting in MFU (Model FLOPs Utilization, how much the model actually uses all the compute potential), this lib reaches ~50% on SmolLM-1.7B model with 8 H100 GPUs, which is really close to what huge libs would reach. (Caution: the team is leading further benchmarks to verify this)
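For context, MFU is just achieved FLOPs divided by the hardware's peak FLOPs. A back-of-the-envelope version of that calculation, with a made-up token throughput chosen to land near the quoted ~50%, might look like this:

```python
# Back-of-the-envelope MFU estimate. The throughput below is a made-up number
# chosen to illustrate what ~50% MFU corresponds to -- it is not a measurement.
num_params = 1.7e9            # SmolLM-1.7B
tokens_per_second = 388_000   # hypothetical aggregate throughput on 8 GPUs
num_gpus = 8
peak_flops_per_gpu = 989e12   # H100 SXM, dense BF16 peak

# Common approximation: ~6 FLOPs per parameter per token for forward + backward.
achieved_flops = 6 * num_params * tokens_per_second
mfu = achieved_flops / (num_gpus * peak_flops_per_gpu)
print(f"MFU ~= {mfu:.1%}")    # ~50.0% with these illustrative numbers
```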

Go take a look 👉 https://github.com/huggingface/picotron/tree/main/picotron
reacted to takarajordan's post with ❤️ 12 days ago
I'm super excited to release my first open-source text dataset:

WorldScenario 20K is a novel dataset of 20,000 synthetically generated multi-stakeholder scenarios designed to simulate real-world decision-making processes. Each scenario explores a unique environmental, societal, or economic issue.

I used the brand-new meta-llama/Llama-3.3-70B-Instruct model to generate the scenarios, then ran the dataset through some post-processing to clean it and evaluate it for diversity.
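For anyone curious how this kind of generation can be wired up, here is a minimal sketch under assumed prompts and sampling parameters, using the huggingface_hub Inference API client (not necessarily the author's actual pipeline):

```python
# Minimal sketch of LLM-based scenario generation -- the prompt wording,
# sampling parameters, and use of InferenceClient are illustrative assumptions.
from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Llama-3.3-70B-Instruct")

PROMPT = (
    "Write a multi-stakeholder scenario about a real-world environmental, "
    "societal, or economic issue. Name the stakeholders, their competing "
    "interests, and the decision that has to be made."
)

scenarios = []
for _ in range(5):  # the real dataset has 20,000 rows
    out = client.chat_completion(
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=512,
        temperature=0.9,  # higher temperature for more diverse scenarios
    )
    scenarios.append(out.choices[0].message.content)

# Post-processing (cleaning, deduplication, diversity checks) would follow here.
```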

I'd appreciate some feedback and thoughts on my new release! Thanks!

takarajordan/WorldScenario_20K