janvanlooy committed
Commit ff11f2b
Parent(s): 3dc70f2
Update README.md
README.md
CHANGED
@@ -11,44 +11,49 @@ pinned: false
|
|
11 |
<img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" height="250px"/>
|
12 |
</p>
|
13 |
<p align="center">
|
14 |
-
<i>
|
15 |
<br>
|
16 |
<a href="https://fondant.readthedocs.io/en/stable/"><strong>Explore the docs »</strong></a>
|
17 |
<br>
|
18 |
<br>
|
19 |
<a href="https://discord.gg/HnTdWhydGp"><img alt="Discord" src="https://dcbadge.vercel.app/api/server/HnTdWhydGp?style=flat-square"></a>
|
20 |
</p>
|
21 |
|
22 |
---
|
23 |
-
|
24 |
-
|
25 |
-
|
26 |
-
-
|
27 |
-
-
|
28 |
-
-
|
29 |
-
-
|
30 |
|
31 |
## 🪤 Why Fondant?
|
32 |
|
33 |
-
|
34 |
-
|
35 |
-
|
36 |
-
|
37 |
-
|
38 |
-
We believe that **innovation is a group effort**, requiring collaboration. While the community has
|
39 |
-
been building and sharing models, everyone is still building their data preparation from scratch.
|
40 |
-
**Fondant is the platform where we meet to build and share data preparation workflows.**
|
41 |
-
|
42 |
-
Fondant offers a framework to build **composable data preparation pipelines, with reusable
|
43 |
-
components, optimized to handle massive datasets**. Stop building from scratch, and start
|
44 |
-
reusing components to:
|
45 |
-
|
46 |
-
- Extend your data with public datasets
|
47 |
-
- Generate new modalities using captioning, segmentation, translation, image generation, ...
|
48 |
-
- Distill knowledge from existing foundation models
|
49 |
-
- Filter out low quality data
|
50 |
-
- Deduplicate data
|
51 |
-
|
52 |
-
And create high quality datasets to fine-tune your own foundation models.
|
53 |
|
54 |
<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>
|
|
|
11 |
<img src="https://raw.githubusercontent.com/ml6team/fondant/main/docs/art/fondant_banner.svg" height="250px"/>
|
12 |
</p>
|
13 |
<p align="center">
|
14 |
+
<i>Large-scale data processing made easy and reusable</i>
|
15 |
<br>
|
16 |
<a href="https://fondant.readthedocs.io/en/stable/"><strong>Explore the docs »</strong></a>
|
17 |
<br>
|
18 |
<br>
|
19 |
<a href="https://discord.gg/HnTdWhydGp"><img alt="Discord" src="https://dcbadge.vercel.app/api/server/HnTdWhydGp?style=flat-square"></a>
|
20 |
+
<a href="https://pypi.org/project/fondant/"><img alt="PyPI version" src="https://img.shields.io/pypi/v/fondant?color=brightgreen&style=flat-square"></a>
|
21 |
+
<a href="https://fondant.readthedocs.io/en/latest/license/"><img alt="License" src="https://img.shields.io/github/license/ml6team/fondant?style=flat-square&color=brightgreen"></a>
|
22 |
+
<a href="https://github.com/ml6team/fondant/actions/workflows/pipeline.yaml"><img alt="GitHub Workflow Status" src="https://img.shields.io/github/actions/workflow/status/ml6team/fondant/pipeline.yaml?style=flat-square"></a>
|
23 |
+
<a href="https://coveralls.io/github/ml6team/fondant?branch=main"><img alt="Coveralls" src="https://img.shields.io/coverallsCoverage/github/ml6team/fondant?style=flat-square"></a>
|
24 |
</p>
|
25 |
|
26 |
---
|
27 |
+
🍫**Fondant is an open-source framework that aims to simplify and speed up large-scale data processing by making
|
28 |
+
containerized components reusable across pipelines and execution environments and shareable within the community.**\
|
29 |
+
It offers:
|
30 |
+
- 🔧 Plug ‘n’ play composable pipelines for creating datasets for
|
31 |
+
- AI image generation model fine-tuning (Stable Diffusion, ControlNet)
|
32 |
+
- Large language model fine-tuning (LLaMA, Falcon)
|
33 |
+
- Code generation model fine-tuning (StarCoder)
|
34 |
+
- 🧱 Library of off-the-shelf reusable components for
|
35 |
+
- Extracting data from public sources such as Common Crawl, LAION, ...
|
36 |
+
- Filtering on
|
37 |
+
- Content, e.g. language, visual style, topic, format, aesthetics, etc.
|
38 |
+
- Context, e.g. copyright license, origin
|
39 |
+
- Metadata
|
40 |
+
- Removal of unwanted data such as toxic, NSFW or generated content
|
41 |
+
- Removal of unwanted data patterns such as societal bias
|
42 |
+
- Transforming data (resizing, cropping, reformatting, …)
|
43 |
+
- Tuning the data for model performance (normalization, deduplication, …)
|
44 |
+
- Enriching data (captioning, metadata generation, synthetics, …)
|
45 |
+
- Transparency, auditability, compliance
|
46 |
+
- 📄 🖼️ 🎞️ ♾️ Out of the box multimodal capabilities: text, images, video, etc.
|
47 |
+
- 🐍 Standardized, Python/Pandas-based way of creating custom components
|
48 |
+
- 🏭 Production-ready, scalable deployment
|
49 |
+
- ☁️ Multi-cloud integrations
|
50 |
|
51 |
## 🪤 Why Fondant?
|
52 |
|
53 |
+
In the age of Foundation Models, control over your data is key and building pipelines
|
54 |
+
for large-scale data processing is costly, especially when they require advanced
|
55 |
+
machine learning-based operations. This need not be the case, however, if processing
|
56 |
+
components were reusable and exchangeable and pipelines were easily composable.
|
57 |
+
Realizing this is the main vision behind Fondant.
|
58 |
|
59 |
<p align="right">(<a href="#chocolate_bar-fondant">back to top</a>)</p>
|