Spaces: fondant-ai / README

janvanlooy committed on Sep 25, 2023

Commit ff11f2b

1 Parent(s): 3dc70f2

Update README.md

README.md +33 -28

@@ -11,44 +11,49 @@ pinned: false
     <img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" height="250px"/>
 </p>
 <p align="center">
-    <i>Sweet data-centric foundation model fine-tuning</i>
     <br>
     <a href="https://fondant.readthedocs.io/en/stable/"><strong>Explore the docs »</strong></a>
     <br>
     <br>
     <a href="https://discord.gg/HnTdWhydGp"><img alt="Discord" src="https://dcbadge.vercel.app/api/server/HnTdWhydGp?style=flat-square"></a>
 </p>
 ---
-**Fondant helps you create high quality datasets to train or fine-tune foundation models such as:**
-- 🎨 Stable Diffusion
-- 📄 GPT-like Large Language Models (LLMs)
-- 🔎 CLIP
-- ✂️ Segment Anything (SAM)
-- ➕ And many more
 ## 🪤 Why Fondant?
-Foundation models simplify inference by solving multiple tasks across modalities with a simple
-prompt-based interface. But what they've gained in the front, they've lost in the back.
-**These models require enormous amounts of data, moving complexity towards data preparation**, and
-leaving few parties able to train their own models.
-We believe that **innovation is a group effort**, requiring collaboration. While the community has
-been building and sharing models, everyone is still building their data preparation from scratch.
-**Fondant is the platform where we meet to build and share data preparation workflows.**
-Fondant offers a framework to build **composable data preparation pipelines, with reusable
-components, optimized to handle massive datasets**. Stop building from scratch, and start
-reusing components to:
-- Extend your data with public datasets
-- Generate new modalities using captioning, segmentation, translation, image generation, ...
-- Distill knowledge from existing foundation models
-- Filter out low quality data
-- Deduplicate data
-And create high quality datasets to fine-tune your own foundation models.
 <p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>

     <img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" height="250px"/>
 </p>
 <p align="center">
+    <i>Large-scale data processing made easy and reusable</i>
     <br>
     <a href="https://fondant.readthedocs.io/en/stable/"><strong>Explore the docs »</strong></a>
     <br>
     <br>
     <a href="https://discord.gg/HnTdWhydGp"><img alt="Discord" src="https://dcbadge.vercel.app/api/server/HnTdWhydGp?style=flat-square"></a>
+    <a href="https://pypi.org/project/fondant/"><img alt="PyPI version" src="https://img.shields.io/pypi/v/fondant?color=brightgreen&style=flat-square"></a>
+    <a href="https://fondant.readthedocs.io/en/latest/license/"><img alt="License" src="https://img.shields.io/github/license/ml6team/fondant?style=flat-square&color=brightgreen"></a>
+    <a href="https://github.com/ml6team/fondant/actions/workflows/pipeline.yaml"><img alt="GitHub Workflow Status" src="https://img.shields.io/github/actions/workflow/status/ml6team/fondant/pipeline.yaml?style=flat-square"></a>
+    <a href="https://coveralls.io/github/ml6team/fondant?branch=main"><img alt="Coveralls" src="https://img.shields.io/coverallsCoverage/github/ml6team/fondant?style=flat-square"></a>
 </p>
 ---
+🍫 **Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making
+containerized components reusable across pipelines and execution environments and shareable within the community.**\
+It offers:
+- 🔧 Plug ‘n’ play composable pipelines for creating datasets for
+    - AI image generation model fine-tuning (Stable Diffusion, ControlNet)
+    - Large language model fine-tuning (LLaMA, Falcon)
+    - Code generation model fine-tuning (StarCoder)
+- 🧱 Library of off-the-shelf reusable components for
+    - Extracting data from public sources such as Common Crawl, LAION, ...
+    - Filtering on
+        - Content, e.g. language, visual style, topic, format, aesthetics, etc.
+        - Context, e.g. copyright license, origin
+        - Metadata
+    - Removal of unwanted data such as toxic, NSFW or generated content
+    - Removal of unwanted data patterns such as societal bias
+    - Transforming data (resizing, cropping, reformatting, …)
+    - Tuning the data for model performance (normalization, deduplication, …)
+    - Enriching data (captioning, metadata generation, synthetics, …)
+    - Transparency, auditability, compliance
+- 📖 🖼️ 🎞️ ♾️ Out of the box multimodal capabilities: text, images, video, etc.
+- 🐍 Standardized, Python/Pandas-based way of creating custom components
+- 🏭 Production-ready, scalable deployment
+- ☁️ Multi-cloud integrations
 ## 🪤 Why Fondant?
+In the age of Foundation Models, control over your data is key and building pipelines
+for large-scale data processing is costly, especially when they require advanced
+machine learning-based operations. This need not be the case, however, if processing
+components were reusable and exchangeable and pipelines were easily composable.
+Realizing this is the main vision behind Fondant.
 <p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>