It turns out it's super easy to use Qdrant to index and search ColPali embeddings efficiently.
Blog post here: https://danielvanstrien.xyz/posts/post-with-code/colpali-qdrant/2024-10-02_using_colpali_with_qdrant.html
Very silly demo: davanstrien/ufo-ColPali-Search
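To give a flavour of how simple this is, here's a minimal sketch (assuming a recent qdrant-client with multivector support; the embeddings below are placeholders standing in for real ColPali outputs, which are one 128-dim vector per image patch or query token):

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(":memory:")  # swap for a real Qdrant instance

# ColPali produces a *bag* of 128-dim vectors per page, so we configure
# the collection with multivectors compared via MaxSim.
client.create_collection(
    collection_name="docs",
    vectors_config=models.VectorParams(
        size=128,
        distance=models.Distance.COSINE,
        multivector_config=models.MultiVectorConfig(
            comparator=models.MultiVectorComparator.MAX_SIM
        ),
    ),
)

# Placeholder embeddings; in practice these come from the ColPali model.
page_embedding = [[0.1] * 128, [0.2] * 128]  # n_patches x 128
query_embedding = [[0.15] * 128]             # n_query_tokens x 128

client.upsert(
    collection_name="docs",
    points=[models.PointStruct(id=0, vector=page_embedding, payload={"page": 1})],
)

hits = client.query_points(collection_name="docs", query=query_embedding)
print(hits)
```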
To simplify testing this approach, I created a Space that lets you generate queries from an input document page image: davanstrien/ColPali-Query-Generator
I think there is much room for improvement, but I'm excited about the potential for relatively small VLMs to create synthetic data.
You can read the original blog post that goes into more detail here: https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html
Check out my latest blog post, where I guide you through creating a ColPali fine-tuning dataset using Qwen/Qwen2-VL-7B-Instruct to generate queries for a collection of UFO documents sourced from the Internet Archive.
The post covers:
- Introduction to data for ColPali models
- Using Qwen2-VL for retrieval query generation
- Tips for better query generation
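For a sense of the query-generation step, here's a minimal sketch following the standard transformers recipe from the Qwen2-VL model card (the qwen_vl_utils helper comes from that card; the prompt wording is illustrative, not the one used in the post):

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "document_page.png"},  # a page image
        {"type": "text", "text": "Write one short search query a user might type to find this document page."},
    ],
}]

# Build the multimodal prompt and run generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```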
Check out the post here:
https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html
The resulting Hugging Face dataset: davanstrien/ufo-ColPali
Very exciting to see this! I often want to use an LLM for a short period, and setting up a whole endpoint for this can be overkill. This seems like a very neat solution!
Do you think there is a chance that any VLMs will be added soon!?
I will have a full notebook to share on Monday, but you can already take a look at the dataset here: davanstrien/ufo-ColPali
Soon, you'll be able to find deep-cut datasets even if they don't have a full dataset card (you should still document your datasets!)
You can help improve this project by rating synthetic user search queries for hub datasets.
If you have a Hub login, you can start annotating in Argilla in < 5 seconds here: https://davanstrien-my-argilla.hf.space/dataset/1100a091-7f3f-4a6e-ad51-4e859abab58f/annotation-mode
I need to do some tidying, but I'll share all the code and in-progress datasets for this soon!
Give it a try: librarian-bots/huggingface-datasets-semantic-search
Model: https://huggingface.co/davanstrien/nasa_concept_art
Dataset: davanstrien/nasa_concept_art
So far, training was done without captions, but I'm experimenting with using VLMs to generate captions to see if that improves the model.
At the moment, this is relying on the dataset cards, so the similarity does indeed work better for longer dataset cards. I plan for a version that will directly use the dataset to create the similarity scores, which should hopefully work better!
✨ Adds a "Similar Datasets" section to Hugging Face dataset pages
🔍 Recommendations based on dataset READMEs
🏗️ Powered by ChromaDB (https://huggingface.co/chromadb) and Snowflake (https://huggingface.co/Snowflake) embeddings.
You can try it here: https://chromewebstore.google.com/detail/hugging-face-similar/aijelnjllajooinkcpkpbhckbghghpnl?authuser=0&hl=en.
I am very happy to get feedback on whether this could be useful or not 🤗
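Roughly how the recommendations work, as a sketch (the exact Arctic embedding model the extension uses is an assumption here, and the READMEs are toy stand-ins):

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Embed dataset READMEs with a Snowflake Arctic embedding model.
model = SentenceTransformer("Snowflake/snowflake-arctic-embed-m")

readmes = {
    "user/dataset-a": "A dataset of UFO sighting reports...",
    "user/dataset-b": "Question-answering pairs about astronomy...",
}

client = chromadb.Client()
collection = client.create_collection("dataset-readmes")
collection.add(
    ids=list(readmes),
    embeddings=model.encode(list(readmes.values())).tolist(),
)

# Find datasets similar to a given README's content.
query = model.encode(["Reports of unidentified flying objects"]).tolist()
print(collection.query(query_embeddings=query, n_results=2))
```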
A demo using Hugging Face Spaces and Gradio to collect LLM output preferences: davanstrien/would-you-read-it
The collection is loaded from an environment variable, so you can duplicate this Space to create a Space for exploring datasets in another collection!
- Search for question-answering datasets that include context
- Find alpaca-style datasets
- Locate DPO datasets
Try it out here: librarian-bots/dataset-column-search-api, or explore real-world applications in this notebook: librarian-bots/dataset-column-search-api
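Because it's a Gradio Space, you can also call it from code with gradio_client. The endpoint name and argument below are assumptions, not the Space's documented signature; check the Space's "Use via API" page for the real one:

```python
from gradio_client import Client

client = Client("librarian-bots/dataset-column-search-api")
# Hypothetical endpoint and argument -- consult the Space's API page.
result = client.predict("instruction,input,output", api_name="/predict")
print(result)
```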
Very cool! One small suggestion would be to add a tag to the dataset so that it's easy to find all datasets made using this script. This could then be combined into a bigger dataset ;)
Just added this to the collection :)
@nihalnayak would love to add Bonito (I really like how many tasks it supports!). Do you already have a Spaces demo for it?
IMO, these approaches that rely on smaller models for synthetic data generation are quite valuable for scaling up synthetic data and democratizing access to creating domain-specific synthetic datasets.
I've compiled a collection of Gradio demos showcasing some of these methods here: davanstrien/synthetic-data-generation-demos-667573f248b97360ff3668a5
✅ Pre-configured environment
✅ Ready-to-use notebooks
✅ No local GPU needed
You can try the Space here: davanstrien/synthetic-data-workshop
I also wrote a blog post going into more detail about the motivations for the Space: https://huggingface.co/blog/davanstrien/synthetic-data-workshop
This Gradio app ( davanstrien/corpus-creator) takes you from your local files to a Hugging Face Dataset via Llama Index.
The goal of the tool is to make it quick and easy to get local files you want to use for ML tasks into a Hugging Face Dataset (see the sketch after the list below). Perfect for building datasets for:
- synthetic data pipelines
- annotation
- RAG
- Other ML tasks that start from a HF dataset
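Under the hood this amounts to something like the following sketch (not the app's actual code; the directory path and repo id are placeholders):

```python
from llama_index.core import SimpleDirectoryReader
from datasets import Dataset

# Load local files (PDF, txt, md, ...) as LlamaIndex documents.
docs = SimpleDirectoryReader("my_local_files").load_data()

# Flatten into rows and push to the Hub as a dataset.
rows = [{"text": d.text, "source": d.metadata.get("file_name")} for d in docs]
ds = Dataset.from_list(rows)
ds.push_to_hub("your-username/my-corpus")  # placeholder repo id
```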
I'll share something more substantial that uses this tomorrow 🤗
Here's what the pipeline offers:
- **Dataset Generation**: Automatically create synthetic sentence pairs
- **Mine hard negatives**: Use an existing embedding model to mine hard negatives
- **Model Training**: Train a model using the latest release of Sentence Transformers (sketched below).
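As a taste of the training step, here's a minimal sketch using the Sentence Transformers v3 trainer API (the base model and example pair are illustrative; a real run would also add the mined hard negatives as a third column):

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Synthetic (anchor, positive) pairs; in practice, load the generated dataset.
train_dataset = Dataset.from_dict({
    "anchor": ["How do I sort a list in Python?"],
    "positive": ["What's the way to order a Python list?"],
})

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),  # in-batch negatives loss
)
trainer.train()
```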
Check out this collection ( davanstrien/sentence-transformers-from-synthetic-data-66571a6133480d1b70066b70) to see an example of what you can achieve with this pipeline. It features a sentence transformer model to detect coding prompt similarities in a @bigcode dataset.
Excited to get started? Find a tutorial here: https://github.com/davanstrien/awesome-synthetic-datasets/tree/main/examples/embedding-datasets.
One of the most exciting use cases for LLMs is generating synthetic datasets that can be used to train non-LLM models. In the past, gathering enough data was one of the most significant barriers to training task-specific models. LLMs can potentially help in this area.
I've just written a new blog post on using meta-llama/Meta-Llama-3-70B-Instruct to generate synthetic similarity data based on the approach from Retrieving Texts based on Abstract Descriptions (2305.12517).
https://huggingface.co/blog/davanstrien/synthetic-similarity-datasets
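As a sketch of the generation step, you can call the model via huggingface_hub's InferenceClient (the prompt wording here is illustrative, not the one from the blog post):

```python
from huggingface_hub import InferenceClient

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")

# Ask the LLM to write a concrete passage matching an abstract description.
abstract = "A text about a scientific discovery that changed medicine."
out = client.chat_completion(
    messages=[{
        "role": "user",
        "content": f"Write a short passage that matches this abstract description:\n{abstract}",
    }],
    max_tokens=200,
)
print(out.choices[0].message.content)
```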
I am starting with a great post by @MoritzLaurer on utilizing an open LLM to generate data for training a specialized Roberta model.
Read the blog post: https://huggingface.co/blog/synthetic-data-save-costs
See the rest of the list: https://github.com/davanstrien/awesome-synthetic-datasets
Great work on this!! Excited to see what's next!!
The aim is to lightly curate a collection of resources, tutorials, and tools for generating synthetic datasets using large language models.
I plan to add some "key techniques" to the repo, but for now, it focuses on important datasets, papers, and tools.
🔗 https://github.com/davanstrien/awesome-synthetic-datasets
Thanks, I'm currently using distilabel, which is working very well for me, but I will take a look at your tool!
🎯 Goals:
💬 Create multi-turn chats seeded from Cosmopedia
🎓 Customize questions for different audience levels
🔍 Evaluate the model's ability to elaborate and clarify
🤓 (I want to learn more about creating valuable synthetic datasets, and I learn best by doing stuff rather than reading stuff).
Cosmochat is created using the excellent distilabel library.
🔗 Explore the current version of the dataset: davanstrien/cosmochat
📝 Read more: https://huggingface.co/blog/davanstrien/cosmochat
The Cohere For AI Aya dataset CohereForAI/aya_dataset has human-annotated prompt-completion pairs in 71 languages. We can use this to create DPO datasets for more languages!
Using Aya's prompt/response pairs as a starting point, we can use an LLM to generate an additional response to each prompt. We then use an LLM judge to rank each response.
✅ In many languages, human responses may be better than LLM ones, but we may want to check that assumption for each language.
🚀 We use Argilla's distilabel library to push data to Argilla for validation. This also allows us to determine if an LLM judge is effective for different languages.
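A bare-bones sketch of the generate-then-judge loop (column names are taken from the Aya dataset card; the judge rubric is a toy stand-in for the distilabel pipeline):

```python
from datasets import load_dataset
from huggingface_hub import InferenceClient

# Column names ("inputs", "targets", "language") per the Aya dataset card.
aya = load_dataset("CohereForAI/aya_dataset", split="train")
dutch = aya.filter(lambda row: row["language"] == "Dutch")

client = InferenceClient("meta-llama/Meta-Llama-3-70B-Instruct")
row = dutch[0]

# 1. Generate a second candidate response to the human-written prompt.
generated = client.chat_completion(
    messages=[{"role": "user", "content": row["inputs"]}], max_tokens=300
).choices[0].message.content

# 2. Ask a judge LLM which response is better (toy rubric; the real
#    pipeline uses distilabel's judging tasks).
judge_prompt = (
    f"Prompt: {row['inputs']}\n\nResponse A: {row['targets']}\n\n"
    f"Response B: {generated}\n\nWhich response is better? Answer 'A' or 'B'."
)
verdict = client.chat_completion(
    messages=[{"role": "user", "content": judge_prompt}], max_tokens=5
).choices[0].message.content
print(verdict)
```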
As an example of what this pipeline produces:
- DIBT/aya_dutch_dpo: a DPO-style dataset for Dutch using Llama 3 as the generator/judge LM.
- An annotation Space that anyone with a HF account can contribute to: https://dibt-demo-argilla-space.hf.space/dataset/924ef8a8-a447-4563-8806-0e2a668a5314/annotation-mode?page=1&status=pending
As part of Data is Better Together we want to build more DPO datasets. Join us here: https://github.com/huggingface/data-is-better-together#4-dpoorpo-datasets-for-more-languages 🤗
@burtenshaw recently launched the Domain Specific Dataset Project as part of Data is Better Together. As part of this, Ben created a Space that you can use to define some key perspectives and concepts from a domain. This seed dataset can then be used to generate a synthetic dataset for a particular domain.
In less than 30 minutes this afternoon, I created a domain-specific dataset focused on data-centric machine learning using these tools: davanstrien/data-centric-ml-sft.
You can create your own domain specific datasets using this approach. Find the steps to follow here: https://github.com/huggingface/data-is-better-together/blob/main/domain-specific-datasets/README.md
Our next step is to use these translated prompts to evaluate the performance of LLMs for non-English languages.
Does LLM-as-a-judge work outside of English?
It would be compelling to leverage LLMs to judge models for non-English languages, since this significantly lowers the barrier to evaluating models (although it doesn't remove it altogether).
What we want to know is:
- does auto/LLM eval work in general for a particular language
- which model(s) works best as a judge
- do LLMs' judgments of non-English models match human preferences?
We're starting to think about how to approach this. If you have any ideas of possible approaches feel free to comment or join the discussion here: https://github.com/huggingface/data-is-better-together/issues/61
Other ideas...
Could an approach like Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models (2404.18796) with the SOTA models for a particular language work? i.e., choose 4 of the best open LLMs for Arabic and use those as the pool of raters rather than relying on one powerful judge LLM?
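A rough sketch of what that jury could look like (the judge model ids are placeholders, and the 1-5 rubric is illustrative):

```python
from statistics import mean
from huggingface_hub import InferenceClient

# Placeholder ids for a pool of strong open LLMs for the target language.
JUDGES = ["open-model-a", "open-model-b", "open-model-c", "open-model-d"]

def panel_score(prompt: str, answer: str) -> float:
    """Average 1-5 ratings from a small panel of judge models (PoLL-style)."""
    scores = []
    for model_id in JUDGES:
        reply = InferenceClient(model_id).chat_completion(
            messages=[{
                "role": "user",
                "content": (
                    f"Rate the answer to this prompt from 1 to 5.\n"
                    f"Prompt: {prompt}\nAnswer: {answer}\n"
                    "Reply with a single digit only."
                ),
            }],
            max_tokens=3,
        ).choices[0].message.content
        digits = [c for c in reply if c.isdigit()]
        if digits:
            scores.append(int(digits[0]))
    return mean(scores) if scores else 0.0
```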
Leveraging a 7k preference dataset, Argilla ( @alvarobartt ), Hugging Face ( @lewtun ) and Kaist AI ( @JW17 & @nlee-208 ) utilized Kaist AI's recently introduced ORPO technique ORPO: Monolithic Preference Optimization without Reference Model (2403.07691) with the latest Mistral AI MoE model to create a very high-performing open LLM: HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1
Since ORPO doesn't require a separate SFT stage, all that is needed is a strong base model + high-quality DPO-style datasets.
Currently, there is a significant lack of non-English DPO datasets. Filling this gap could significantly improve open LLMs in various languages.
You can get an overview of the current state of DPO datasets across different languages here: DIBT/preference_data_by_language
A Hugging Face Pro subscription includes access to many models you want to test when developing an app (https://huggingface.co/blog/inference-pro). Using the endpoint and tracing your generations during this development process is an excellent way for GPU-poor people to bootstrap an initial dataset quickly while prototyping.
That's a good point! It might be nice to combine the textual tl;dr description with some critical bits of metadata (where it exists).
For example, for the togethercomputer/RedPajama-Data-1T dataset, would the following summary help give you a quick sense of its content?
> tl;dr: RedPajama is a fully open-source implementation of the LLaMa dataset, consisting of 1.2 trillion tokens from sources like Commoncrawl, C4, GitHub, Books, ArXiv, Wikipedia, and StackExchange, primarily in English, and is structured with metadata for each text sample.
I've created a dataset with example summaries of the 500 most liked datasets on the Hub: davanstrien/dataset-tldr
Would these kinds of summaries be helpful?
Hopefully I'll have something to share for this soon! I still need to do some more annotating!
Awesome work @ZennyKenny ! Thanks for leading this effort 🤗
Using this approach, you can create a dataset that anyone with a Hugging Face account can contribute to 🤯
See an example of the kind of Space you can create following this tutorial here: davanstrien/haiku-preferences
🆕 New tutorial covers:
💬 Generating responses with open models
👥 Collecting human feedback (do you like this model response? Yes/No)
🤖 Preparing a TRL-compatible dataset for training aligned models
Check it out here: https://github.com/huggingface/data-is-better-together/tree/main/kto-preference
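For reference, the unpaired prompt/completion/label shape that TRL's KTOTrainer expects looks like this (the example rows and repo id are illustrative):

```python
from datasets import Dataset

# KTO uses *unpaired* feedback: each row is a prompt, a completion,
# and a boolean label (did the annotator like the response?).
rows = [
    {"prompt": "Write a haiku about autumn.",
     "completion": "Crisp leaves drift and fall...", "label": True},
    {"prompt": "Write a haiku about autumn.",
     "completion": "Autumn is a season.", "label": False},
]
kto_dataset = Dataset.from_list(rows)
kto_dataset.push_to_hub("your-username/kto-preferences")  # placeholder repo id
```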
Great work! It's nice to see some open reproduction efforts of SPIN, and it's cool to see that some high-quality data can reduce the amount of data required. cc @teknium , who I know was also excited about SPIN.
I am excited to see what other amazing things the community can collectively build together! 💪
Step 1: Evaluate current SOTA.
The Data Is Better Together community has rated more than 10K prompts for quality. We now want to translate a subset of these to help address the language gap in model evals.
The plan is roughly this:
- We started with DIBT/10k_prompts_ranked and took a subset of 500 high-quality prompts
- We're asking the community to translate these prompts into different languages
- We'll evaluate the extent to which we can use AlpacaEval and similar approaches to rate the outputs of models across these different languages
- If it works well, we can more easily evaluate open LLMs across different languages by using a judge LLM to rate the quality of outputs from different models.
You can find more details in our new GitHub repo: https://github.com/huggingface/data-is-better-together (don't forget to give it a ⭐!)
This dataset uses a subset of HuggingFaceTB/cosmopedia, a synthetic textbook-quality dataset, and Genstruct to generate user/assistant response pairs.
My current results are mixed, but I'm excited to see how much work is happening around synthetic data generation in the community. The most crucial next step is improving data filtering from Cosmopedia.
Massive thanks to @euclaise @teknium and the other NousResearch folks for sharing this model ❤️
Good data is essential for the open-source AI community. Recently, Argilla and Hugging Face launched Data is Better Together. In less than two weeks, over 350 people ranked over 10k prompts.
Today, we're shifting our focus to help support other community efforts to create datasets using Argilla and Hugging Face Spaces. This workflow means anyone with a Hugging Face account can contribute to a dataset in less than a minute. We want to hear from anyone with ideas for creating important datasets as a community. This could include things like:
- Creating preference data for a language that lacks high-quality preference datasets.
- Building evaluation datasets for a new domain.
- Developing a dataset for a new task.
If you would like to get involved, join us in the #data-is-better-together Discord channel: https://discord.com/channels/879548962464493619/1205128865735770142. You can read more in this blog post from @dvilasuero and me: https://huggingface.co/blog/community-datasets
Announcing DIBT/10k_prompts_ranked, the first dataset release from Data Is Better Together.
Created in <2 weeks by the community. Includes:
✨ 10,000+ prompt quality ratings
🧑💻 Human and synthetic data prompts
🌐 Generated by 300+ contributors
How and why collaborative datasets?
It's no secret that high-quality open data is essential for creating better open models. The open-source community openly shares hundreds of models, datasets, and demos every week, but collectively building open datasets has been less explored.
Datasets have a massive role in shaping what models can be created. If we want more high-quality models for all languages, domains and tasks, we need more and better open datasets for all languages, domains and tasks!
To explore how the community could build impactful datasets collectively, Argilla added support for HF authentication for Argilla instances hosted on a Hugging Face Space. Anyone with an HF login could begin contributing to a dataset in <1 minute.
To test this new workflow, we launched a task to rank the quality of prompts (human and synthetically generated).
In less than two weeks, we built a community of over 300 contributors for this dataset 🤗
This dataset became a reality thanks to the dedication of all the individuals who lent their support ❤️ To see the amazing people behind this dataset, visit DIBT/prompt-collective-dashboard
This is just the start for collectively building powerful open datasets!
This is a very popular opinion with me!
Really great work! I'm very pleased to see people explore beyond using GPT-4 for all preference ranking!
On Monday, Argilla and Hugging Face launched #data-is-better-together, an experiment focused on collectively building datasets on the Hub.
For our V1 experiment we're aiming to collectively rank 50k prompts!
In the few days since launch we've had:
❤️ 158 people contribute
🚀 2,796 prompts ranked
🤔 How Can You Contribute?
1. Sign up if you don’t have a Hugging Face account (why not!?)
2. Go to this Argilla Space and sign in: DIBT/prompt-collective
3. Read the guidelines and start rating prompts!
You can also join the #data-is-better-together channel in the Hugging Face Discord 🔗 https://discord.com/channels/879548962464493619/1205128865735770142
We're aiming to judge the text as a full prompt. Some of them are synthetically generated, so I would rank this as a bad prompt since the additional context doesn't seem to make sense as a prompt!
The progress tracking Space is very motivating!
Really nice work, and I really appreciate the depth of the technical report and that the data is available 🤗
Excellent work! Would be great if someone did a big sweep across a bunch of datasets and parameters to produce guidelines based on dataset properties/model size, etc.
I'm really excited to see collections shared like this. Making collections easily accessible unlocks so many interesting use cases often well beyond what was originally imagined by the collection holder.
Do you know how much this format is currently being used? i.e. what % of datasets adopted this format? Could be a nice community effort to convert some existing datasets with permissive licences into a standard format?
I would recommend following @dvilasuero and other folks at Argilla, who are doing some very cool work in this area, particularly via their distilabel library.
I'm also working on a modest intro to synthetic data generation here 🤗
I think it will be a big focus of 2024. I also believe there is a lot of scope for creative approaches to generating synthetic data that don't always rely on a big GPU budget (though this will also be important!). As an example of this, I created haiku_dpo, a synthetic dataset for making LLMs better at writing haiku. The development happened locally on a laptop with <1 hour of Colab GPU time used at the end to generate a larger dataset. I think this topic will be an area where many community members can contribute more through their creativity rather than their GPU budget.