🔥 Google releases Gemini 2.0, starting with a Flash model that steamrolls GPT-4o and Claude-3.6 Sonnet! And they start a huge effort on agentic capabilities.
The performance improvements are crazy for such a fast model:
‣ Gemini 2.0 Flash outperforms the previous 1.5 Pro model at twice the speed
‣ Now supports both input AND output of images, video, audio and text
‣ Can natively use tools like Google Search and execute code
➡️ If the price is on par with the previous Flash iteration ($0.30 / M tokens, compared with GPT-4o's $1.25), the competition will have a big problem with this 4x cheaper model that gets better benchmarks 🤯
🤔 What about the agentic capabilities?
‣ Project Astra: a universal AI assistant that can use Google Search, Lens and Maps
‣ Project Mariner: a Chrome extension that can complete complex web tasks (83.5% success rate on the WebVoyager benchmark, which is really impressive!)
‣ Jules: an AI coding agent that integrates with GitHub workflows
I'll be eagerly awaiting further news from Google!
Multimodal 🖼️
> Google shipped PaliGemma 2, a new iteration of PaliGemma in more sizes: 3B, 10B and 28B, with pre-trained and captioning variants
> OpenGVLab released InternVL2, seven new vision LMs in different sizes, with a SOTA checkpoint under MIT license ✨
> The Qwen team at Alibaba released the base models of Qwen2-VL in 2B, 7B and 72B checkpoints
LLMs 💬
> Meta released a new iteration of Llama 70B, Llama-3.3-70B, trained further
> EuroLLM-9B-Instruct is a new multilingual LLM for European languages with Apache 2.0 license 🔥
> Dataset: CohereForAI released GlobalMMLU, a multilingual version of MMLU covering 42 languages, with Apache 2.0 license
> Dataset: QwQ-LongCoT-130K is a new dataset to train reasoning models
> Dataset: FineWeb2 just landed with a multilinguality update! 🔥 Nearly 8TB of pretraining data in many languages!
Image/Video Generation 🖼️
> Tencent released HunyuanVideo, a new photorealistic video generation model
> OminiControl is a new editing/control framework for image generation models like Flux
Audio 🔊
> Indic-Parler-TTS is a new text-to-speech model made by the community
Introducing TTS WebGPU: the first-ever text-to-speech web app built with WebGPU acceleration! 🔥 High-quality and natural speech generation that runs 100% locally in your browser, powered by OuteTTS and Transformers.js. Try it out yourself!
A team from NUS and Microsoft just released an agent that can act on any UI (Desktop, Android, Web) without needing additional text information. It works extremely well: they applied their method to a tiny Qwen2-VL-2B and managed to beat methods that use much more powerful vision models (like GPT-4V), without using any additional info (e.g. leveraging the DOM of a webpage) like previous methods did!
They started from the idea that most existing methods rely heavily on text, which makes them less generalizable, while setting aside the rich UI structure that users actually rely on when navigating these interfaces.
⚙️ They put several good ideas to work:
💡 Simplify screenshots to the max: they heavily prune the visual content of UI screenshots by removing cloned image patches (for instance, any vast patch of the same color is reduced to a small patch, while positional embeddings are maintained), then group patches from the same GUI element together to simplify even further (see the sketch after this post).
💡 Build a truly generalist dataset: to train a general UI agent, you need trajectories from every possible UI, expressed in a common language. The authors merge datasets like OmniAct for Desktop, Mind2Web for websites and AMEX for Android trajectories to create a high-quality and diverse dataset.
➡️ Nice results ensued: they fine-tuned a tiny Qwen2-VL-2B with their method, and it reaches SOTA on several tasks (element identification, web navigation), even beating methods that either use additional info from the DOM or use much bigger VLMs like GPT-4V!
And performance could certainly jump with a slightly bigger vision model. Let's hope the community builds this soon!
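To make the patch-pruning idea above more concrete, here is a minimal illustrative sketch (my own toy version, not the authors' implementation): split a screenshot into fixed-size tiles and drop tiles that are near-duplicates of their neighbour, while keeping each tile's position. The function name, patch size and similarity threshold are all assumptions.

```python
# Illustrative sketch of UI-screenshot patch pruning (not the paper's exact method).
# Idea: many UI patches are visually identical (large flat-color areas), so we keep
# one representative patch per run of near-duplicates and remember its position.
import numpy as np
from PIL import Image

def prune_uniform_patches(image_path: str, patch: int = 28, tol: float = 1.0):
    """Split the screenshot into `patch`x`patch` tiles and drop tiles that are
    near-duplicates of their left neighbour. Returns kept tiles and their
    (row, col) positions, so positional information is preserved."""
    img = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32)
    h, w, _ = img.shape
    kept_tiles, kept_positions = [], []
    for r in range(h // patch):
        prev = None
        for c in range(w // patch):
            tile = img[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
            # mean absolute pixel difference with the previous tile in the row
            if prev is not None and np.abs(tile - prev).mean() < tol:
                prev = tile
                continue  # near-duplicate of its neighbour -> pruned
            kept_tiles.append(tile)
            kept_positions.append((r, c))
            prev = tile
    return kept_tiles, kept_positions

# Example: on a typical UI screenshot, most flat background tiles get pruned away.
# tiles, positions = prune_uniform_patches("screenshot.png")
# print(f"kept {len(tiles)} patches")
```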
We just released Transformers.js v3.1 and you're not going to believe what's now possible in the browser w/ WebGPU! 🤯 Let's take a look:
‣ Janus from DeepSeek for unified multimodal understanding and generation (Text-to-Image and Image-Text-to-Text)
‣ Qwen2-VL from Qwen for dynamic-resolution image understanding
‣ JinaCLIP from Jina AI for general-purpose multilingual multimodal embeddings
‣ LLaVA-OneVision from ByteDance for Image-Text-to-Text generation
‣ ViTPose for pose estimation
‣ MGP-STR for optical character recognition (OCR)
‣ PatchTST & PatchTSMixer for time series forecasting
That's right, everything running 100% locally in your browser (no data sent to a server)! 🔥 Huge for privacy!
How does it work?
- You give a URL
- The AI assistant crawls the website content and embeds it
- You add it to your frontend in one line of code
- People on your website can ask the assistant questions
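As a rough sketch of what happens under the hood (my own illustrative pipeline, not the product's actual code; the embedding model and chunking scheme are assumptions): crawl the page, chunk and embed the text, then retrieve the closest chunks for each question.

```python
# Illustrative sketch of a "chat with this website" assistant (not the actual product code).
import requests
from bs4 import BeautifulSoup
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def index_website(url: str, chunk_size: int = 500):
    """Fetch the page, strip HTML, split the text into chunks and embed them."""
    html = requests.get(url, timeout=10).text
    text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks, embedder.encode(chunks, convert_to_tensor=True)

def ask(question: str, chunks, chunk_embeddings, top_k: int = 3):
    """Retrieve the chunks most similar to the question (the LLM answering step is omitted)."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, chunk_embeddings, top_k=top_k)[0]
    return [chunks[h["corpus_id"]] for h in hits]

# chunks, embs = index_website("https://example.com")
# print(ask("What does this site offer?", chunks, embs))
```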
Been reading about the "bigger models = better AI" narrative getting pushed back today.
@thomwolf tackled this head-on at Web Summit and highlighted how important small models are (and why closed-source companies haven't pushed for this 😬). They're crushing it: today's 1B parameter models outperform last year's 10B models.
Fascinating to hear him talk about the secret sauce behind this approach.
NYT leveraged AI to investigate election interference by analyzing 400+ hours of recorded meetings - that's 5M words of data!
AI spotted patterns, humans verified facts. Every AI-flagged quote was manually verified against source recordings. Really appreciate that they published their full methodology - transparency matters when using AI in journalism.
A perfect blend of tech & journalism.
The future of journalism isn't robots replacing reporters - it's AI helping humans process massive datasets more efficiently. Sometimes the most powerful tech solutions are the least flashy ones.
Let's say you're doing RAG, and in an effort to improve performance, you try to rerank a few possible source snippets by their relevancy to a query.
How can you score similarity between your query and any source document? 🤔
The simplest option is no interaction (bi-encoders): you encode each token from both the query and the doc as separate vectors, then average the tokens of each text separately to get 2 vectors in total, then you compute similarity via cosine or something similar. ➡️ Notable examples: check the top of the MTEB leaderboard!
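A minimal sketch of this no-interaction scoring, averaging token embeddings into one vector per text and comparing with cosine similarity (the model name is just an example):

```python
# Minimal sketch of "no interaction" (bi-encoder) similarity: one mean-pooled
# vector per text, compared with cosine similarity. Model choice is illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_vectors = model(**inputs).last_hidden_state[0]  # one vector per token
    return token_vectors.mean(dim=0)  # average the tokens -> a single vector

query_vec = embed("How do I reset my password?")
doc_vec = embed("To reset your password, go to Settings > Security.")
score = torch.nn.functional.cosine_similarity(query_vec, doc_vec, dim=0)
print(f"similarity: {score.item():.3f}")
```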
Late-interaction models, like ColBERT, encode each token from both query and doc as separate vectors as before, but compare them all together without averaging them first and losing information.
This is more accurate than no interaction but also slower, because you have to compare n*m vectors instead of 2. At least you can still precompute and store the document vectors. And ColBERT has some optimisations, like pooling, to be faster.
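And here is a toy sketch of the late-interaction idea, a ColBERT-style MaxSim over the same kind of token embeddings (a simplified illustration, not the actual ColBERT implementation, which also trains a projection on top of the token vectors):

```python
# Toy sketch of ColBERT-style late interaction ("MaxSim"): keep one vector per token,
# and for each query token take its best match among the document tokens.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def token_embeddings(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        vectors = model(**inputs).last_hidden_state[0]   # (n_tokens, hidden)
    return F.normalize(vectors, dim=-1)                  # unit vectors, so dot = cosine

def maxsim_score(query: str, doc: str) -> float:
    q, d = token_embeddings(query), token_embeddings(doc)   # (n, h), (m, h)
    sim = q @ d.T                                            # compare all n*m token pairs
    return sim.max(dim=1).values.sum().item()                # best doc token per query token

print(maxsim_score("reset password", "To reset your password, go to Settings."))
```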
Anthropic just released Contextual Retrieval, a chunk-contextualization technique that vastly improves RAG performance! 🔥
Crash reminder: Retrieval Augmented Generation (RAG) is a widely-used technique for improving your LLM chatbot's answers to user questions.
It goes like this: instead of generating an LLM answer straight away, you add a preliminary step called Retrieval, which retrieves relevant documents from your knowledge base through semantic search and appends the top K documents to the prompt. ➡️ As a result, the LLM answer is grounded in context.
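As a tiny illustration of that flow (retrieval details omitted; the prompt wording is my own):

```python
# Minimal sketch of the RAG flow described above: retrieve top-K chunks, then
# prepend them to the prompt so the LLM's answer is grounded in that context.
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    context = "\n\n".join(retrieved_chunks)  # top-K documents from semantic search
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    "When was the company founded?",
    ["Acme Corp was founded in 1999 in Berlin.", "Acme sells rocket skates."],
)
# `prompt` is then sent to your LLM of choice
```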
The difficulty with this retrieval step is that when you split your documents into chunks that will be retrieved, you lose context. So important chunks could be missed.
💡 Anthropic's freshly released blog post shows that you can add some context to each chunk with one LLM call. Then you embed the original chunk plus that bit of added context, so that the embedding is much more representative of the document in its context!
🤔 Isn't that crazy expensive? Well, it would have been before, but not so much anymore with their new prompt caching feature, which makes duplicating thousands of requests with the same prompt much less expensive. They give an indicative price tag of only $1.02 per million document tokens processed!
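Here is a hedged sketch of what that could look like with the Anthropic Python SDK; the prompt wording and model name are my assumptions, not Anthropic's exact recipe from the blog post. The full document goes in a cached system block, so the thousands of per-chunk calls can reuse it cheaply:

```python
# Illustrative sketch of contextual chunk augmentation with prompt caching
# (Anthropic Python SDK; prompt wording and model name are assumptions).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def contextualize_chunk(full_document: str, chunk: str) -> str:
    """Ask the model for a short context blurb situating `chunk` in the document.
    The full document is marked as cacheable, so repeated calls over the same
    document mostly hit the prompt cache instead of paying for it every time."""
    response = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=150,
        system=[
            {
                "type": "text",
                "text": f"<document>\n{full_document}\n</document>",
                "cache_control": {"type": "ephemeral"},  # cache the big, repeated part
            }
        ],
        messages=[
            {
                "role": "user",
                "content": f"Here is a chunk from the document above:\n<chunk>\n{chunk}\n</chunk>\n"
                           "Give a short context (1-2 sentences) situating this chunk within "
                           "the document, to improve search retrieval. Answer with the context only.",
            }
        ],
    )
    return response.content[0].text

# The string you then embed is: added_context + "\n" + original_chunk
```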
And this vastly improves performance on their benchmark!