Huggingface Projects

Activity Feed

AI & ML interests: None defined yet.

Recent Activity


merve
posted an update about 11 hours ago
AdinaY
posted an update about 12 hours ago
QvQ-72B-Preview 🎄 an open-weight model for visual reasoning, just released by the Alibaba_Qwen team
Qwen/qvq-676448c820912236342b9888
✨ Combines visual understanding & language reasoning.
✨ Scores 70.3 on MMMU
✨ Outperforms Qwen2-VL-72B-Instruct in complex problem-solving
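
For reference, a minimal loading sketch with transformers, assuming QvQ follows the Qwen2-VL interface and lives at Qwen/QVQ-72B-Preview (both assumptions; check the linked collection):

```python
# Hedged sketch: the repo id and Qwen2-VL-style API are assumptions, not confirmed by the post.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/QVQ-72B-Preview"  # assumed repo id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Reason step by step: what is happening in this image?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(
    text=[prompt], images=[Image.open("photo.jpg")], return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```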
sayakpaul
posted an update 1 day ago
Commits speak louder than words 🤪

* 4 new video models
* Multiple image models, including SANA & Flux Control
* New quantizers -> GGUF & TorchAO
* New training scripts

Enjoy this holiday-special Diffusers release 🤗
Notes: https://github.com/huggingface/diffusers/releases/tag/v0.32.0
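
As a quick taste of the new quantizers, a minimal sketch of loading a GGUF checkpoint into a Flux transformer via the GGUFQuantizationConfig added in this release (requires the gguf package; the checkpoint URL is illustrative):

```python
# pip install gguf
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Illustrative community GGUF checkpoint; any Flux GGUF single-file checkpoint should work.
ckpt_path = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"

transformer = FluxTransformer2DModel.from_single_file(
    ckpt_path,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # keep the quantized model within a single-GPU budget
image = pipe("a cozy cabin in the snow", num_inference_steps=28).images[0]
image.save("cabin.png")
```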
akhaliq
posted an update 5 days ago
Google drops Gemini 2.0 Flash Thinking

A new experimental model that unlocks stronger reasoning capabilities and shows its thoughts. The model plans (with its thoughts visible), can solve complex problems at Flash speeds, and more.

Now available in anychat; try it out: akhaliq/anychat
Xenova
posted an update 6 days ago
Introducing Moonshine Web: real-time speech recognition running 100% locally in your browser!
🚀 Faster and more accurate than Whisper
🔒 Privacy-focused (no data leaves your device)
⚡️ WebGPU accelerated (w/ WASM fallback)
🔥 Powered by ONNX Runtime Web and Transformers.js

Demo: webml-community/moonshine-web
Source code: https://github.com/huggingface/transformers.js-examples/tree/main/moonshine-web
freddyaboulton
posted an update 7 days ago
jbilcke-hf
posted an update 7 days ago
Doing some testing with HunyuanVideo on Hugging Face Inference Endpoints 🤗

prompt: "a Shiba Inu is acting as a DJ, he wears sunglasses and is mixing and scratching with vinyl discs at a Ibiza sunny sand beach party"

1280x720, 22 steps, 121 frames

There are still some things to iron out regarding speed and memory usage; right now it takes about 20 minutes on an A100 (see attached charts)

but you can check it out here:

https://huggingface.co/jbilcke-hf/HunyuanVideo-for-InferenceEndpoints

There are various things I want to try, like the 100% Diffusers version and other models (LTX-Video, ...)
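
For the 100% Diffusers route mentioned above, a minimal sketch with the HunyuanVideoPipeline that landed in Diffusers v0.32; the hunyuanvideo-community/HunyuanVideo repo id is assumed to be a Diffusers-format mirror of the official weights, and memory savers are enabled since VRAM use is the pain point:

```python
import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils import export_to_video

# Assumed Diffusers-format checkpoint (the official weights live at tencent/HunyuanVideo).
model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.float16
)
pipe.vae.enable_tiling()         # cut VRAM spikes while decoding frames
pipe.enable_model_cpu_offload()  # trade speed for memory

video = pipe(
    prompt="a Shiba Inu DJ mixing vinyl at a sunny beach party",
    height=720, width=1280, num_frames=121, num_inference_steps=22,
).frames[0]
export_to_video(video, "shiba_dj.mp4", fps=15)
```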
merve
posted an update 7 days ago
Aya by Cohere For AI can now see! 👀

The C4AI community has built Maya 8B, a new open-source multilingual VLM built on SigLIP and Aya 8B 🌱 It works in 8 languages! 🗣️

The authors extend the LLaVA dataset to 558k examples using Aya's translation capabilities!
Try it here: kkr5155/maya_demo

Dataset: maya-multimodal/pretrain

Model: maya-multimodal/maya 👏
Kudos to @nahidalam and team
sayakpaul
posted an update 7 days ago
In the past seven days, the Diffusers team has shipped:

1. Two new video models
2. One new image model
3. Two new quantization backends
4. Three new fine-tuning scripts
5. Multiple fixes and library QoL improvements

Coffee on me if someone can guess 1 - 4 correctly.
freddyaboulton
posted an update 7 days ago
merve
posted an update 8 days ago
Apollo is a new family of open-source video language models by Meta, where the 3B model outperforms most 7B models and the 7B outperforms most 30B models 🧶

✨ the models come in 1.5B https://huggingface.co/Apollo-LMMs/Apollo-1_5B-t32, 3B https://huggingface.co/Apollo-LMMs/Apollo-3B-t32 and 7B https://huggingface.co/Apollo-LMMs/Apollo-7B-t32 with an Apache 2.0 license, based on Qwen1.5 & Qwen2
✨ the authors also release a benchmark dataset https://huggingface.co/spaces/Apollo-LMMs/ApolloBench

The paper has a lot of experiments (they trained 84 models!) about what makes video LMs work ⏯️

Try the demo for the best setup here: https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B

They evaluate sampling strategies, scaling laws for models and datasets, video representation and more!
> The authors find that design decisions validated on small models scale properly when the model and dataset are scaled up 📈 scaling the dataset has diminishing returns for smaller models
> They evaluate frame sampling strategies and find that FPS sampling is better than uniform sampling, with 8-32 tokens per frame being optimal
> They also compare image encoders, trying a range of models from shape-optimized SigLIP to DINOv2; they find google/siglip-so400m-patch14-384 to be the most powerful 🔥
> They also compare freezing different parts of the models; training all stages with some parts frozen gives the best yield

They eventually release three models, where Apollo-3B outperforms most 7B models and Apollo-7B outperforms most 30B models 🔥
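
To make the sampling finding concrete, a small self-contained sketch contrasting uniform sampling (a fixed frame count regardless of length) with FPS sampling (a fixed temporal rate); this illustrates the idea only and is not Apollo's code:

```python
import numpy as np

def uniform_sample(num_video_frames: int, num_samples: int) -> np.ndarray:
    """Pick a fixed number of frames, evenly spaced, regardless of duration."""
    return np.linspace(0, num_video_frames - 1, num_samples).round().astype(int)

def fps_sample(num_video_frames: int, video_fps: float, target_fps: float) -> np.ndarray:
    """Pick frames at a fixed temporal rate, so longer videos yield more frames."""
    step = video_fps / target_fps
    return np.arange(0, num_video_frames, step).round().astype(int)

# A 60 s clip at 30 fps: uniform sampling treats it exactly like a 10 s clip,
# while FPS sampling preserves the temporal density the model sees.
print(uniform_sample(1800, 32))     # always 32 frames
print(fps_sample(1800, 30.0, 2.0))  # 120 frames at 2 fps
```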
AdinaY
posted an update 9 days ago
Megrez-3B-Omni 🔥 an on-device multimodal LLM by Infinigence AI, another startup emerging from the Tsinghua University ecosystem.
Model: Infinigence/Megrez-3B-Omni
Demo: Infinigence/Megrez-3B-Omni
✨ Supports analysis of image, text, and audio modalities
✨ Leads in bilingual speech (English & Chinese) input, multi-turn conversations, and voice-based queries
✨ Outperforms comparable models in scene understanding and OCR across major benchmarks
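
Since Megrez-3B-Omni ships custom modeling code, here is a hedged loading sketch only; the trust_remote_code pattern is standard for such repos, but the actual multimodal chat interface is defined by the repo itself, so follow the model card for inference:

```python
from transformers import AutoModelForCausalLM

# Assumption: the repo exposes its multimodal model through remote code, as is
# typical for this kind of release; the inference API is repo-specific.
model = AutoModelForCausalLM.from_pretrained(
    "Infinigence/Megrez-3B-Omni",
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
)
```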
freddyaboulton
posted an update 12 days ago
Version 0.0.21 of gradio-pdf now properly loads Chinese characters!
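
For anyone who hasn't tried the component, a minimal sketch of dropping it into a Blocks app (usage per the gradio-pdf README; pin >= 0.0.21 to get the Chinese-character fix):

```python
# pip install "gradio_pdf>=0.0.21"
import gradio as gr
from gradio_pdf import PDF

with gr.Blocks() as demo:
    # Renders uploaded PDFs in-browser, including CJK text as of 0.0.21.
    pdf = PDF(label="Upload a PDF")

demo.launch()
```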
freddyaboulton
posted an update 12 days ago
Hello Llama 3.2! 🗣️🦙

Build a Siri-like coding assistant that responds to "Hello Llama" in 100 lines of Python! All with Gradio and WebRTC 😎

freddyaboulton/hey-llama-code-editor
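
The wake-word part is easy to sketch: transcribe incoming audio and only forward text to the assistant once the phrase is heard. A toy illustration of that gating logic (not the Space's actual code; plug any ASR transcript in):

```python
WAKE_PHRASE = "hello llama"

class WakeWordGate:
    """Ignore transcripts until the wake phrase is heard, then pass them through."""

    def __init__(self) -> None:
        self.awake = False

    def feed(self, transcript: str) -> str | None:
        text = transcript.lower()
        if not self.awake:
            if WAKE_PHRASE in text:
                self.awake = True
                # Keep only what was said after the wake phrase.
                tail = text.split(WAKE_PHRASE, 1)[1].strip()
                return tail or None
            return None
        return transcript

gate = WakeWordGate()
assert gate.feed("some background chatter") is None
assert gate.feed("hello llama write a fibonacci function") == "write a fibonacci function"
```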
merve
posted an update 13 days ago
A complete RAG pipeline includes a reranker, which reorders the retrieved documents to surface the best ones 📓
The same goes for multimodal RAG: multimodal rerankers can be integrated into multimodal RAG pipelines!
Learn how to build a complete multimodal RAG pipeline with vidore/colqwen2-v1.0 as the retriever, lightonai/MonoQwen2-VL-v0.1 as the reranker, and Qwen/Qwen2-VL-7B-Instruct as the VLM in this notebook, which runs on a GPU as small as an L4 🔥 https://huggingface.co/learn/cookbook/multimodal_rag_using_document_retrieval_and_reranker_and_vlms
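
A minimal sketch of the retrieval stage with ColQwen2 via the colpali-engine package (API as in its README; the MonoQwen2 reranking and VLM answering stages are covered in the linked notebook):

```python
import torch
from PIL import Image
from colpali_engine.models import ColQwen2, ColQwen2Processor

model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v1.0", torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v1.0")

# Stand-in document pages; in practice these are screenshots of your PDF pages.
page_images = [Image.new("RGB", (448, 448), "white") for _ in range(2)]
queries = ["What does the revenue chart show?"]

batch_images = processor.process_images(page_images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

# Late-interaction (MaxSim) scoring: one score per (query, page) pair;
# the top pages would then go to the reranker and the VLM.
scores = processor.score_multi_vector(query_embeddings, image_embeddings)
best_page = scores.argmax(dim=1)
```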
freddyaboulton
posted an update 14 days ago
sayakpaul
posted an update 15 days ago
Introducing a high-quality open-preference dataset to further this line of research for image generation.

Despite being such an inseparable component of modern image generation, open preference datasets are a rarity!

So, we decided to work on one with the community!

Check it out here:
https://huggingface.co/blog/image-preferences