metadata
title: README
emoji: 🍫
colorFrom: yellow
colorTo: green
sdk: static
pinned: false
Large-scale data processing made easy and reusable
Explore the docs »
🍫 Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making containerized components reusable across pipelines and execution environments and shareable within the community.
It offers:
- 🔧 Plug 'n' play composable pipelines for creating datasets for
  - AI image generation model fine-tuning (Stable Diffusion, ControlNet)
  - Large language model fine-tuning (LLaMA, Falcon)
  - Code generation model fine-tuning (StarCoder)
- 🧱 Library of off-the-shelf reusable components for
  - Extracting data from public sources such as Common Crawl, LAION, ...
  - Filtering on
    - Content, e.g. language, visual style, topic, format, aesthetics, etc.
    - Context, e.g. copyright license, origin
    - Metadata
  - Removal of unwanted data such as toxic, NSFW or generated content
  - Removal of unwanted data patterns such as societal bias
  - Transforming data (resizing, cropping, reformatting, …)
  - Tuning the data for model performance (normalization, deduplication, …)
  - Enriching data (captioning, metadata generation, synthetics, …)
  - Transparency, auditability, compliance
- 📄 🖼️ 🎞️ ♾️ Out of the box multimodal capabilities: text, images, video, etc.
- 🐍 Standardized, Python/Pandas-based way of creating custom components
- 🏭 Production-ready, scalable deployment
- ☁️ Multi-cloud integrations
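
To make the idea of reusable, composable components concrete, here is a minimal sketch. It is not Fondant's actual API: the names `Component`, `filter_language`, `deduplicate_text`, `add_text_length` and `run_pipeline` are hypothetical, and rows are plain dictionaries rather than Pandas DataFrames so that the example stays dependency-free. It assumes only that a component can be modelled as a function from a dataset to a dataset.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List

# Hypothetical types for illustration only (not part of Fondant).
Row = Dict[str, object]
Dataset = List[Row]
Component = Callable[[Dataset], Dataset]


def filter_language(language: str) -> Component:
    """Build a filtering component that keeps rows in one language."""
    def component(rows: Dataset) -> Dataset:
        return [row for row in rows if row["language"] == language]
    return component


def deduplicate_text(rows: Dataset) -> Dataset:
    """Drop rows whose text repeats an earlier row (first occurrence wins)."""
    seen = set()
    unique = []
    for row in rows:
        if row["text"] not in seen:
            seen.add(row["text"])
            unique.append(row)
    return unique


def add_text_length(rows: Dataset) -> Dataset:
    """Enrich each row with the character count of its text."""
    return [{**row, "text_length": len(str(row["text"]))} for row in rows]


def run_pipeline(rows: Dataset, components: Iterable[Component]) -> Dataset:
    """Apply components in order, feeding each output into the next."""
    return reduce(lambda data, component: component(data), components, rows)


sample = [
    {"text": "hello world", "language": "en"},
    {"text": "hello world", "language": "en"},
    {"text": "bonjour", "language": "fr"},
    {"text": "data at scale", "language": "en"},
]
result = run_pipeline(
    sample, [filter_language("en"), deduplicate_text, add_text_length]
)
for row in result:
    print(row)
```

Because every component shares the same input and output shape, the same `deduplicate_text` step could be dropped into a different pipeline unchanged, which is the property the list above describes.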
🪤 Why Fondant?
In the age of foundation models, control over your data is key, yet building pipelines for large-scale data processing is costly, especially when they require advanced machine-learning-based operations. This need not be the case if processing components are reusable and exchangeable and pipelines are easily composable. Realizing this is the main vision behind Fondant.