AI & ML interests

Computer Vision

Recent Activity

timm's activity

merve 
posted an update 4 days ago
view post
Post
3459
This week in open AI was 🔥 Let's recap! 🤗 merve/january-31-releases-679a10669bd4030090c5de4d
LLMs 💬
> Huge: AllenAI released new Tülu models that outperform DeepSeek R1 using Reinforcement Learning with Verifiable Reward (RLVR) based on Llama 3.1 405B 🔥
> Mistral AI is back to open-source with their "small" 24B models (base & SFT), with Apache 2.0 license 😱
> Alibaba Qwen released their 1M context length models Qwen2.5-Instruct-1M, great for agentic use with Apache 2.0 license 🔥
> Arcee AI released Virtuoso-medium, 32.8B LLMs distilled from DeepSeek V3 with dataset of 5B+ tokens
> Velvet-14B is a new family of 14B Italian LLMs trained on 10T tokens in six languages
> OpenThinker-7B is fine-tuned version of Qwen2.5-7B-Instruct on OpenThoughts dataset

VLMs & vision 👀
> Alibaba Qwen is back with Qwen2.5VL, amazing new capabilities ranging from agentic computer use to zero-shot localization 🔥
> NVIDIA released new series of Eagle2 models with 1B and 9B sizes
> DeepSeek released Janus-Pro, new any-to-any model (image-text generation from image-text input) with MIT license
> BEN2 is a new background removal model with MIT license!

Audio 🗣️
> YuE is a new open-source music generation foundation model, lyrics-to-song generation

Codebase 👩🏻‍💻
> We are open-sourcing our SmolVLM training and eval codebase! https://github.com/huggingface/smollm/tree/main/vision
> Open-R1 is open-source reproduction of R1 by @huggingface science team https://huggingface.co/blog/open-r1
  • 1 reply
·
merve 
posted an update 11 days ago
view post
Post
4847
Oof, what a week! 🥵 So many things have happened, let's recap! merve/jan-24-releases-6793d610774073328eac67a9

Multimodal 💬
- We have released SmolVLM -- tiniest VLMs that come in 256M and 500M, with it's retrieval models ColSmol for multimodal RAG 💗
- UI-TARS are new models by ByteDance to unlock agentic GUI control 🤯 in 2B, 7B and 72B
- Alibaba DAMO lab released VideoLlama3, new video LMs that come in 2B and 7B
- MiniMaxAI released Minimax-VL-01, where decoder is based on MiniMax-Text-01 456B MoE model with long context
- Dataset: Yale released a new benchmark called MMVU
- Dataset: CAIS released Humanity's Last Exam (HLE) a new challenging MM benchmark

LLMs 📖
- DeepSeek-R1 & DeepSeek-R1-Zero: gigantic 660B reasoning models by DeepSeek, and six distilled dense models, on par with o1 with MIT license! 🤯
- Qwen2.5-Math-PRM: new math models by Qwen in 7B and 72B
- NVIDIA released AceMath and AceInstruct, new family of models and their datasets (SFT and reward ones too!)

Audio 🗣️
- Llasa is a new speech synthesis model based on Llama that comes in 1B,3B, and 8B
- TangoFlux is a new audio generation model trained from scratch and aligned with CRPO

Image/Video/3D Generation ⏯️
- Flex.1-alpha is a new 8B pre-trained diffusion model by ostris similar to Flux
- tencent released Hunyuan3D-2, new 3D asset generation from images
·
merve 
posted an update 11 days ago
view post
Post
2209
smolagents can see 🔥
we just shipped vision support to smolagents 🤗 agentic computers FTW

you can now:
💻 let the agent get images dynamically (e.g. agentic web browser)
📑 pass images at the init of the agent (e.g. chatting with documents, filling forms automatically etc)
with few LoC change! 🤯
you can use transformers models locally (like Qwen2VL) OR plug-in your favorite multimodal inference provider (gpt-4o, antrophic & co) 🤠

read our blog http://hf.co/blog/smolagents-can-see
pcuenq 
updated 16 models 14 days ago