
Ame Vi

Ameeeee

AI & ML interests

None yet

Recent Activity

updated a dataset 5 days ago
Ameeeee/Cultural_diversity
published a dataset 5 days ago
Ameeeee/Cultural_diversity
reacted to tomaarsen's post with šŸ”„ 14 days ago

Organizations

Hugging Face · Argilla · Women on Hugging Face · Data Is Better Together · Social Post Explorers · HuggingFaceFW-Dev · Data Is Better Together Contributor · Bluesky Community · Hugging Face DG

Ameeeee's activity

reacted to tomaarsen's post with šŸ”„ 14 days ago
An assembly of 18 European companies, labs, and universities has banded together to launch šŸ‡ŖšŸ‡ŗ EuroBERT! It's a state-of-the-art multilingual encoder covering 15 languages, designed to be finetuned for retrieval, classification, etc.

šŸ‡ŖšŸ‡ŗ 15 Languages: English, French, German, Spanish, Chinese, Italian, Russian, Polish, Portuguese, Japanese, Vietnamese, Dutch, Arabic, Turkish, Hindi
3ļøāƒ£ 3 model sizes: 210M, 610M, and 2.1B parameters - very very useful sizes in my opinion
āž”ļø Sequence length of 8192 tokens! Nice to see these higher sequence lengths for encoders becoming more common.
āš™ļø Architecture based on Llama, but with bi-directional (non-causal) attention to turn it into an encoder. Flash Attention 2 is supported.
šŸ”„ A new Pareto frontier (stronger *and* smaller) for multilingual encoder models
šŸ“Š Evaluated against mDeBERTa, mGTE, XLM-RoBERTa for Retrieval, Classification, and Regression (after finetuning for each task separately): EuroBERT punches way above its weight.
šŸ“ Detailed paper with all details, incl. data: FineWeb for English and CulturaX for multilingual data, The Stack v2 and Proof-Pile-2 for code.

Check out the release blogpost here: https://huggingface.co/blog/EuroBERT/release
* EuroBERT/EuroBERT-210m
* EuroBERT/EuroBERT-610m
* EuroBERT/EuroBERT-2.1B

The next step is for researchers to build upon the 3 EuroBERT base models and publish strong retrieval, zero-shot classification, etc. models for all to use. I'm very much looking forward to it!
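
For anyone who wants to poke at the base models right away, here is a minimal sketch (mine, not from the post) of turning EuroBERT into a sentence embedder with transformers. The repo ships custom modeling code, so trust_remote_code=True is assumed, and mean pooling is just one reasonable readout rather than an official recipe:

import torch
from transformers import AutoModel, AutoTokenizer

model_id = "EuroBERT/EuroBERT-210m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

sentences = ["The weather is lovely today.", "Il fait beau aujourd'hui."]
batch = tokenizer(sentences, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_dim)

# Average over non-padding tokens to get one vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, hidden_dim)

For retrieval or classification you would finetune from here (e.g. with Sentence Transformers) rather than use the raw pretrained weights as-is.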
reacted to fdaudens's post with šŸ”„ 26 days ago
Is this the best tool to extract clean info from PDFs, handwriting and complex documents yet?

Open source olmOCR just dropped and the results are impressive.

Tested the free demo with various documents, including a handwritten Claes Oldenburg letter. The speed is impressive: 3,000 tokens/second on your own GPU, and at roughly $190 per million pages it's about 1/32 the cost of GPT-4o. Game-changer for content extraction and digital archives.

To achieve this, Ai2 trained a 7B vision language model on 260K pages from 100K PDFs using "document anchoring" - combining PDF metadata with page images.

Best part: it actually understands document structure (columns, tables, equations) instead of just jumbling everything together like most OCR tools. Their human eval results back this up.

šŸ‘‰ Try the demo: https://olmocr.allenai.org

Going right into the AI toolkit: JournalistsonHF/ai-toolkit
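
If you would rather run it locally than use the demo, the rough shape of the workflow looks like this. This is a sketch only: the entry point and flags are assumptions based on the allenai/olmocr repo, so check its README for the current interface (a local GPU is required):

import subprocess

# Assumed CLI: the pipeline takes a workspace directory plus one or more PDFs,
# runs the 7B vision language model over each page, and writes the extracted
# text into the workspace.
subprocess.run(
    [
        "python", "-m", "olmocr.pipeline",  # assumed module entry point
        "./workspace",                      # directory for intermediate files and results
        "--pdfs", "sample.pdf",             # hypothetical input PDF
    ],
    check=True,
)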
reacted to burtenshaw's post with šŸ‘ 26 days ago
I made a real-time voice agent with FastRTC, smolagents, and Hugging Face Inference Providers. Check it out in this Space:

šŸ”— burtenshaw/coworking_agent
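
The rough shape of such an agent, as a minimal FastRTC sketch (mine, not the Space's actual code): ReplyOnPause wraps an audio handler so it fires whenever the caller stops speaking. In the real Space the handler would chain speech-to-text, a smolagents agent, and text-to-speech through Inference Providers; this placeholder just echoes the audio back:

from fastrtc import ReplyOnPause, Stream

def respond(audio):
    # Real agent: transcribe `audio`, pass the text to a smolagents agent,
    # synthesize its answer, and yield the reply audio. Placeholder: echo.
    yield audio

stream = Stream(ReplyOnPause(respond), modality="audio", mode="send-receive")
stream.ui.launch()  # serves a Gradio UI over the real-time audio stream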
upvoted an article 28 days ago

Synthetic data: save money, time and carbon with open source

published an article 3 months ago

Introducing the Synthetic Data Generator - Build Datasets with Natural Language

By davidberenstein1957 and 5 others