giux78
posted an update Mar 13
Wonderful open-source Italian dataset from @manalog and @ruggsea:

https://huggingface.co/datasets/manalog/UsenetArchiveIT

The dataset contributes to the https://huggingface.co/mii-community project, which aims to advance the creation of Italian open-source Large Language Models (LLMs). 🇮🇹 🤖 At about 10-20 billion tokens, it is probably the best open-source conversational dataset in the Italian language. 🇮🇹🇮🇹🇮🇹🇮🇹🇮🇹🇮🇹🇮🇹

AFAIK, this could be the biggest Italian-language dataset on Hugging Face and probably one of the biggest Italian text datasets ever (excluding Common Crawl-based datasets).
