---
title: README
emoji: 🍫
colorFrom: yellow
colorTo: green
sdk: static
pinned: false
---

Large-scale data processing made easy and reusable

---

🍫 **Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making containerized components reusable across pipelines and execution environments and shareable within the community.**

It offers:

- 🔧 Plug ‘n’ play composable pipelines for creating datasets for
  - AI image generation model fine-tuning (Stable Diffusion, ControlNet)
  - Large language model fine-tuning (LLaMA, Falcon)
  - Code generation model fine-tuning (StarCoder)
- 🧱 Library of off-the-shelf reusable components for
  - Extracting data from public sources such as Common Crawl, LAION, ...
  - Filtering on
    - Content, e.g. language, visual style, topic, format, aesthetics, etc.
    - Context, e.g. copyright license, origin
    - Metadata
  - Removal of unwanted data such as toxic, NSFW or generated content
  - Removal of unwanted data patterns such as societal bias
  - Transforming data (resizing, cropping, reformatting, …)
  - Tuning the data for model performance (normalization, deduplication, …)
  - Enriching data (captioning, metadata generation, synthetics, …)
  - Transparency, auditability, compliance
- 📖 🖼️ 🎞️ ♾️ Out-of-the-box multimodal capabilities: text, images, video, etc.
- 🐍 Standardized, Python/Pandas-based way of creating custom components
- 🏭 Production-ready, scalable deployment
- ☁️ Multi-cloud integrations

## 🪤 Why Fondant?

In the age of Foundation Models, control over your data is key, and building pipelines for large-scale data processing is costly, especially when they require advanced machine learning-based operations. This need not be the case, however, when processing components are reusable and exchangeable and pipelines are easily composable. Realizing this is the main vision behind Fondant.
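
To make the idea of reusable, composable components a little more concrete, here is a minimal sketch in plain pandas. It does not use Fondant's actual API: the `Component` type, the helper names (`filter_by_license`, `deduplicate`, `add_caption_length`, `run_pipeline`) and the sample columns are all hypothetical and chosen only for illustration. It simply treats each component as a function from one DataFrame to another, so the same pieces can be reordered or swapped between pipelines.

```python
from typing import Callable, Iterable

import pandas as pd

# Hypothetical component type (not Fondant's API): DataFrame in, DataFrame out.
Component = Callable[[pd.DataFrame], pd.DataFrame]


def filter_by_license(allowed: Iterable[str]) -> Component:
    """Build a filtering component that keeps rows with an allowed license."""
    allowed_list = list(allowed)

    def component(df: pd.DataFrame) -> pd.DataFrame:
        return df[df["license"].isin(allowed_list)].reset_index(drop=True)

    return component


def deduplicate(column: str) -> Component:
    """Build a component that drops rows with a repeated value in `column`."""

    def component(df: pd.DataFrame) -> pd.DataFrame:
        return df.drop_duplicates(subset=column).reset_index(drop=True)

    return component


def add_caption_length(df: pd.DataFrame) -> pd.DataFrame:
    """Enrichment component: add the character length of each caption."""
    return df.assign(caption_length=df["caption"].str.len())


def run_pipeline(df: pd.DataFrame, components: Iterable[Component]) -> pd.DataFrame:
    """Apply each component in order, feeding the output of one into the next."""
    for component in components:
        df = component(df)
    return df


# Made-up sample data, for illustration only.
raw = pd.DataFrame(
    {
        "url": ["a.png", "b.png", "b.png", "c.png"],
        "caption": ["a cat", "a dog", "a dog", "a bird in a tree"],
        "license": ["cc0", "cc-by", "cc-by", "proprietary"],
    }
)

pipeline = [
    filter_by_license({"cc0", "cc-by"}),
    deduplicate("url"),
    add_caption_length,
]
result = run_pipeline(raw, pipeline)
print(result)
```

Because every step shares the same signature, a different pipeline can reuse `deduplicate` or `filter_by_license` unchanged. The framework described above additionally containerizes components and runs them at scale, which this in-memory sketch does not attempt to show.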