Stas Bekman

stas

https://stasosphere.com/machine-learning/

AI & ML interests

Toolmaker. Software creator, optimizer and harmonizer. Makes things work and fly at Contextual.AI Training LLM/RAG/Generative AI/Machine Learning/Scalability

Recent Activity

updated a model 5 days ago

stas/ml-engineering-book

posted an update 21 days ago

If you remember my work on MAMF - to find the realistic TFLOPS achievable ceiling - the Intel AI team has shared their measurements and they scored ... an incredible 99.4% TFLOPS efficiency for Gaudi 2! That's quite amazing! Your ROI on these accelerators will be very high. The full table is here: https://github.com/stas00/ml-engineering/tree/master/compute/accelerator#maximum-achievable-matmul-flops-comparison-table As we have seen the competitors get their achievable efficiency worse with each new generation, I'm looking forward to see if Gaudi 3 will keep the high bar! Thanks to Avi Rubin, Lakshman Chari, Imtiaz Sajwani, Ramy J and Zhiqi Tao for helping to get these numbers to the community.

View all activity

Articles

From DeepSpeed to FSDP and Back Again with Hugging Face Accelerate

Jun 13

• 45

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Aug 22, 2023

• 28

Organizations

stas's activity

updated a model 5 days ago

stas/ml-engineering-book

Updated 5 days ago • 14

posted an update 21 days ago

Post

1152

If you remember my work on MAMF - to find the realistic TFLOPS achievable ceiling - the Intel AI team has shared their measurements and they scored ...

an incredible 99.4% TFLOPS efficiency for Gaudi 2!

That's quite amazing! Your ROI on these accelerators will be very high.

The full table is here: https://github.com/stas00/ml-engineering/tree/master/compute/accelerator#maximum-achievable-matmul-flops-comparison-table

As we have seen the competitors get their achievable efficiency worse with each new generation, I'm looking forward to see if Gaudi 3 will keep the high bar!

Thanks to Avi Rubin, Lakshman Chari, Imtiaz Sajwani, Ramy J and Zhiqi Tao for helping to get these numbers to the community.

New activity in joaogante/assisted_generation_demo 3 months ago

Space isn't working because there is a runtime error

#9 opened 3 months ago by

stas

upvoted an article 4 months ago

Article

Mixture of Experts Explained

Dec 11, 2023

• 211

New activity in tatsu-lab/alpaca_eval 4 months ago

Fix FileNotFoundError

#2 opened 4 months ago by

lhoestq

updated a model 5 months ago

ContextualAI/tiny-random-MistralForCausalLM

Text Generation • Updated Aug 9 • 1.68k

New activity in HuggingFaceFW/fineweb 5 months ago

Casting Issue?

#40 opened 6 months ago by

FelixLabelle

authored a paper 6 months ago

Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training

Paper • 2406.18820 • Published Jun 27

posted an update 6 months ago

Post

1113

The Universal Checkpointing paper is out! https://arxiv.org/abs/2406.18820

If you remember the Bigscience BLOOM-176B training, Tunji Ruwase and I co-invented this technology for Megatron-Deepspeed in order to enable to quickly scale up and down node topology while continuing training.

Since then the DeepSpeed team continued improving on that and it has now been fully integrated into Deepspeed.

The blog post is here: https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-ucp/README.md

upvoted an article 6 months ago

Article

From DeepSpeed to FSDP and Back Again with Hugging Face Accelerate

Jun 13

• 45

upvoted an article 8 months ago

Article

Introducing Idefics2: A Powerful 8B Vision-Language Model for the community

Apr 15

• 170

New activity in stas/ml-engineering-book 9 months ago

Upload book cover

#1 opened 9 months ago by

julien-c

metadata: set license

#2 opened 9 months ago by

julien-c

reacted to their post with 🤗 9 months ago

Post

A combined effort from the IBM + Pytorch teams achieved an incredible training performance with ZeRO/FSDP on par with 3D parallelism on H100s, while having just 800Gbps inter-node connection.

This is because they got an almost full overlap between comms and compute and have introduced a novel selective activation recomputation method which recalculates only large but inexpensive activations.

Check out their post here: https://pytorch.org/blog/maximizing-training/

posted an update 10 months ago

Post

posted an update 10 months ago

Post

If you're trying to run MoE Mixtral-8x7b under DeepSpeed w/ HF Transformers it's likely to hang on the first forward.

The solution is here https://github.com/microsoft/DeepSpeed/pull/4966?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US#issuecomment-1989671378

and you need deepspeed>=0.13.0

Thanks to Masahiro Tanaka for the fix.

replied to their post 11 months ago

I pinged Elio to see if he wants to join.

posted an update 11 months ago

Post

Hear, hear, AMD MI300Xs have started to emerge much sooner than expected.

Here is a 2-part benchmarks report on performing BLOOM-176B inference using @MSFTDeepSpeed optimized for AMD MI300X.

1. https://www.evp.cloud/post/diving-deeper-insights-from-our-llm-inference-testing
2. https://www.evp.cloud/post/diving-deeper-insights-from-our-llm-inference-testing-part-2

This was published in response to our BLOOM-176B super-fast inference blog post https://huggingface.co/blog/bloom-inference-pytorch-scripts

Note that these have 192GB of HBM!

The NVIDIA monopoly is strong, but it'll have to start sharing the pie and hopefully drive the costs down at least somewhat.

Thanks to https://www.linkedin.com/in/eliovp for sharing this writeup with me.

p.s. at the PyTorch conference in the fall, the AMD representative said we will see MI300X available to us mortals in Q4-2024/Q1-2025.

5 replies

authored a paper 11 months ago

The Case for Co-Designing Model Architectures with Hardware

Paper • 2401.14489 • Published Jan 25 • 3

replied to their post 11 months ago

Thank you for the kind words, Jeff!

We are still waiting for BLOOM v2.0 from HF!

Stas Bekman

AI & ML interests

Recent Activity

Articles

From DeepSpeed to FSDP and Back Again with Hugging Face Accelerate

Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model

Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate

The Technology Behind BLOOM Training

Fit More and Train Faster With ZeRO via DeepSpeed and FairScale

Porting fairseq wmt19 translation system to transformers

Organizations

stas's activity

Space isn't working because there is a runtime error

Mixture of Experts Explained

Fix FileNotFoundError

Casting Issue?

From DeepSpeed to FSDP and Back Again with Hugging Face Accelerate

Introducing Idefics2: A Powerful 8B Vision-Language Model for the community

Upload book cover

metadata: set license