Running 503 π» Open Source Ai Year In Review 2024 What happened in open-source AI this year, and whatβs next?
view post Post 3126 You can clean and format datasets entirely in the browser with a few lines of SQL. In this post, I replicate the process @mlabonne used to clean the new microsoft/orca-agentinstruct-1M-v1 dataset. The cleaning process consists of:- Joining the separate splits together / add split column- Converting string messages into list of structs- Removing empty system promptshttps://huggingface.co/blog/cfahlgren1/the-beginners-guide-to-cleaning-a-datasetHere's his new cleaned dataset: mlabonne/orca-agentinstruct-1M-v1-cleaned 1 reply Β· β€οΈ 19 19 + Reply
view article Article ColPali: Efficient Document Retrieval with Vision Language Models π By manu β’ Jul 5, 2024 β’ 186
view article Article Training and Finetuning Embedding Models with Sentence Transformers v3 May 28, 2024 β’ 171
In Defense of RAG in the Era of Long-Context Language Models Paper β’ 2409.01666 β’ Published Sep 3, 2024 β’ 2
AI Paper of the Day Collection A collection of papers that I think are interesting, one added each day β’ 266 items β’ Updated about 14 hours ago β’ 34