M Veselovskiy

Yuuru
ยท

AI & ML interests

None yet

Recent Activity

reacted to m-ric's post with ๐Ÿ‘€ 20 days ago
๐—”๐—ฑ๐˜†๐—ฒ๐—ป'๐˜€ ๐—ป๐—ฒ๐˜„ ๐——๐—ฎ๐˜๐—ฎ ๐—”๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐—•๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ ๐˜€๐—ต๐—ผ๐˜„๐˜€ ๐˜๐—ต๐—ฎ๐˜ ๐——๐—ฒ๐—ฒ๐—ฝ๐—ฆ๐—ฒ๐—ฒ๐—ธ-๐—ฅ๐Ÿญ ๐˜€๐˜๐—ฟ๐˜‚๐—ด๐—ด๐—น๐—ฒ๐˜€ ๐—ผ๐—ป ๐—ฑ๐—ฎ๐˜๐—ฎ ๐˜€๐—ฐ๐—ถ๐—ฒ๐—ป๐—ฐ๐—ฒ ๐˜๐—ฎ๐˜€๐—ธ๐˜€! โŒ โžก๏ธ How well do reasoning models perform on agentic tasks? Until now, all indicators seemed to show that they worked really well. On our recent reproduction of Deep Search, OpenAI's o1 was by far the best model to power an agentic system. So when our partner Adyen built a huge benchmark of 450 data science tasks, and built data agents with smolagents to test different models, I expected reasoning models like o1 or DeepSeek-R1 to destroy the tasks at hand. ๐Ÿ‘Ž But they really missed the mark. DeepSeek-R1 only got 1 or 2 out of 10 questions correct. Similarly, o1 was only at ~13% correct answers. ๐Ÿง These results really surprised us. We thoroughly checked them, we even thought our APIs for DeepSeek were broken and colleagues Leandro Anton helped me start custom instances of R1 on our own H100s to make sure it worked well. But there seemed to be no mistake. Reasoning LLMs actually did not seem that smart. Often, these models made basic mistakes, like forgetting the content of a folder that they had just explored, misspelling file names, or hallucinating data. Even though they do great at exploring webpages through several steps, the same level of multi-step planning seemed much harder to achieve when reasoning over files and data. It seems like there's still lots of work to do in the Agents x Data space. Congrats to Adyen for this great benchmark, looking forward to see people proposing better agents! ๐Ÿš€ Read more in the blog post ๐Ÿ‘‰ https://huggingface.co/blog/dabstep
upvoted a collection 5 months ago
Qwen2.5
View all activity

Organizations

Stable Diffusion Dreambooth Concepts Library's profile picture

Yuuru's activity

reacted to m-ric's post with ๐Ÿ‘€ 20 days ago
view post
Post
3690
๐—”๐—ฑ๐˜†๐—ฒ๐—ป'๐˜€ ๐—ป๐—ฒ๐˜„ ๐——๐—ฎ๐˜๐—ฎ ๐—”๐—ด๐—ฒ๐—ป๐˜๐˜€ ๐—•๐—ฒ๐—ป๐—ฐ๐—ต๐—บ๐—ฎ๐—ฟ๐—ธ ๐˜€๐—ต๐—ผ๐˜„๐˜€ ๐˜๐—ต๐—ฎ๐˜ ๐——๐—ฒ๐—ฒ๐—ฝ๐—ฆ๐—ฒ๐—ฒ๐—ธ-๐—ฅ๐Ÿญ ๐˜€๐˜๐—ฟ๐˜‚๐—ด๐—ด๐—น๐—ฒ๐˜€ ๐—ผ๐—ป ๐—ฑ๐—ฎ๐˜๐—ฎ ๐˜€๐—ฐ๐—ถ๐—ฒ๐—ป๐—ฐ๐—ฒ ๐˜๐—ฎ๐˜€๐—ธ๐˜€! โŒ

โžก๏ธ How well do reasoning models perform on agentic tasks? Until now, all indicators seemed to show that they worked really well. On our recent reproduction of Deep Search, OpenAI's o1 was by far the best model to power an agentic system.

So when our partner Adyen built a huge benchmark of 450 data science tasks, and built data agents with smolagents to test different models, I expected reasoning models like o1 or DeepSeek-R1 to destroy the tasks at hand.

๐Ÿ‘Ž But they really missed the mark. DeepSeek-R1 only got 1 or 2 out of 10 questions correct. Similarly, o1 was only at ~13% correct answers.

๐Ÿง These results really surprised us. We thoroughly checked them, we even thought our APIs for DeepSeek were broken and colleagues Leandro Anton helped me start custom instances of R1 on our own H100s to make sure it worked well.
But there seemed to be no mistake. Reasoning LLMs actually did not seem that smart. Often, these models made basic mistakes, like forgetting the content of a folder that they had just explored, misspelling file names, or hallucinating data. Even though they do great at exploring webpages through several steps, the same level of multi-step planning seemed much harder to achieve when reasoning over files and data.

It seems like there's still lots of work to do in the Agents x Data space. Congrats to Adyen for this great benchmark, looking forward to see people proposing better agents! ๐Ÿš€

Read more in the blog post ๐Ÿ‘‰ https://huggingface.co/blog/dabstep
New activity in G-reen/gpt5o-reflexion-q-agi-llama-3.1-8b 5 months ago

How to pay

1
#17 opened 5 months ago by
Yuuru
New activity in mattshumer/Reflection-Llama-3.1-70B 6 months ago

DLETE THIS MODEL

2
#76 opened 6 months ago by
MaziyarPanahi
reacted to m-ric's post with ๐Ÿ‘ 6 months ago
view post
Post
1912
๐Ÿคฏ ๐—” ๐—ป๐—ฒ๐˜„ ๐Ÿณ๐Ÿฌ๐—• ๐—ผ๐—ฝ๐—ฒ๐—ป-๐˜„๐—ฒ๐—ถ๐—ด๐—ต๐˜๐˜€ ๐—Ÿ๐—Ÿ๐—  ๐—ฏ๐—ฒ๐—ฎ๐˜๐˜€ ๐—–๐—น๐—ฎ๐˜‚๐—ฑ๐—ฒ-๐Ÿฏ.๐Ÿฑ-๐—ฆ๐—ผ๐—ป๐—ป๐—ฒ๐˜ ๐—ฎ๐—ป๐—ฑ ๐—š๐—ฃ๐—ง-๐Ÿฐ๐—ผ!

@mattshumer , CEO from Hyperwrite AI, had an idea he wanted to try out: why not fine-tune LLMs to always output their thoughts in specific parts, delineated by <thinking> tags?

Even better: inside of that, you could nest other sections, to reflect critically on previous output. Letโ€™s name this part <reflection>. Planning is also put in a separate step.

He named the method โ€œReflection tuningโ€ and set out to fine-tune a Llama-3.1-70B with it.

Well it turns out, it works mind-boggingly well!

๐Ÿคฏ Reflection-70B beats GPT-4o, Sonnet-3.5, and even the much bigger Llama-3.1-405B!

๐—ง๐—Ÿ;๐——๐—ฅ
๐ŸฅŠ This new 70B open-weights model beats GPT-4o, Claude Sonnet, et al.
โฐ 405B in training, coming soon
๐Ÿ“š Report coming next week
โš™๏ธ Uses GlaiveAI synthetic data
๐Ÿค— Available on HF!

Iโ€™m starting an Inference Endpoint right now for this model to give it a spin!

Check it out ๐Ÿ‘‰ mattshumer/Reflection-Llama-3.1-70B
ยท
New activity in yodayo-ai/kivotos-xl-2.0 9 months ago

Broken results

3
#1 opened 9 months ago by
Yuuru
New activity in saltlux/luxia-21.4b-alignment-v1.0 12 months ago

Quantized GGUF available

3
#3 opened 12 months ago by
MaziyarPanahi
New activity in chargoddard/mixtralnt-4x7b-test about 1 year ago

It works!!!

7
#1 opened about 1 year ago by
HoangHa
New activity in TheBloke/Mixtral-8x7B-v0.1-GGUF about 1 year ago

It works.

6
#3 opened about 1 year ago by
Yuuru
New activity in mistralai/Mistral-7B-Instruct-v0.2 about 1 year ago

How is this different from v1?

7
#2 opened about 1 year ago by
amgadhasan
New activity in TheBlokeAI/Mixtral-tiny-GPTQ about 1 year ago

What is this model?

3
#1 opened about 1 year ago by
Yuuru
New activity in TheBloke/Yi-34B-GPTQ over 1 year ago

How do i run it?

4
#2 opened over 1 year ago by
Yuuru