Introducing Voxtral WebGPU: State-of-the-art audio transcription directly in your browser! 🤯
- Transcribe videos, meeting notes, songs and more
- Runs on-device, meaning no data is sent to a server
- Multilingual (8 languages)
- Completely free (forever) & open source
That's right, we're running Mistral's new Voxtral-Mini-3B model 100% locally in-browser on WebGPU, powered by Transformers.js and ONNX Runtime Web! 🔥
Fine-tune Gemma3n on videos with audio on a Colab A100 🔥 Just dropped the notebook where you can learn how to fine-tune Gemma3n on images + audio + text at the same time!
Keep in mind, it's made for educational purposes 🫡 We use LoRA, audio resampling and video downsampling to be able to train in under 40GB of VRAM. Stretch modalities and unfreeze layers as you wish! merve/smol-vision
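To make the two preprocessing ideas in the post concrete, here is a minimal sketch of video downsampling (keeping a few evenly spaced frames) and audio resampling (changing the sample rate). It is an illustration only and not the notebook's actual code: the function names `evenly_spaced_indices` and `resample_linear` are hypothetical, and the linear interpolation stands in for whatever resampler the notebook really uses. A real pipeline would normally apply a proper anti-aliasing resampler.

```python
def evenly_spaced_indices(num_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across a clip (hypothetical helper)."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= num_frames:
        # Nothing to drop: keep every frame.
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the frame at the centre of each of the num_samples equal segments.
    return [int(step * i + step / 2) for i in range(num_samples)]


def resample_linear(samples: list[float], src_rate: int, dst_rate: int) -> list[float]:
    """Resample mono audio by linear interpolation (assumption: no anti-aliasing filter)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    ratio = src_rate / dst_rate  # how far to advance in the source per output sample
    out = []
    for n in range(out_len):
        pos = n * ratio
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Blend the two neighbouring source samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

Fewer frames and a lower sample rate both shorten the input the model has to process, which is one way to keep memory use down during training.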
They have an image tokenizer unified with text, and they de-tokenize using either of two models (LLM and diffusion). The model is actually a full LLM (Qwen2); the tokenizer converts image tokens 🤯
Dataset Viewer for PDFs just landed on Hugging Face! You can now preview all the PDFs more easily than before!
On top of this, there's the PdfFolder format to load PDF datasets quicker:
> to use it, your dataset should follow a directory format like folder/train/doc1.pdf, folder/train/doc2.pdf
> if you want to include bounding boxes, labels etc., you can keep them in a metadata.csv file in the same folder
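The layout described above can be sketched with the standard library alone. This is a hedged example: the helper `write_pdf_folder_metadata` and the `label` column are hypothetical, and the `file_name` column is an assumption borrowed from the convention used by the ImageFolder and AudioFolder loaders, which the post does not confirm for PdfFolder.

```python
import csv
from pathlib import Path


def write_pdf_folder_metadata(split_dir, rows):
    """Write metadata.csv next to the PDFs of one split (hypothetical helper).

    rows: list of dicts, each with a "file_name" key plus any extra columns.
    """
    split_dir = Path(split_dir)
    split_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: "file_name" links each row to a PDF, as in ImageFolder.
    extra = sorted({key for row in rows for key in row if key != "file_name"})
    path = split_dir / "metadata.csv"
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["file_name"] + extra)
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
    return path


if __name__ == "__main__":
    # Produces folder/train/metadata.csv; the PDFs themselves go in the same folder.
    write_pdf_folder_metadata(
        "folder/train",
        [
            {"file_name": "doc1.pdf", "label": "invoice"},
            {"file_name": "doc2.pdf", "label": "report"},
        ],
    )
```

Keeping the annotations in one CSV per split means the PDFs stay untouched and extra fields such as bounding boxes can be added later as new columns.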