sileod committed on
Commit
5872b44
·
1 Parent(s): 290ce2b

Update README.md

Files changed (1)
  1. README.md +38 -1
README.md CHANGED
@@ -7,4 +7,41 @@ sdk: static
  pinned: false
  ---
 
- Edit this `README.md` markdown file to author your organization card 🔥
+ ## tasksource: 500+ dataset harmonization preprocessings with structured annotations for frictionless extreme multi-task learning and evaluation
+
+ Hugging Face Datasets is a great library, but it lacks standardization, and datasets require preprocessing work before they can be used interchangeably.
+ `tasksource` automates this preprocessing and facilitates reproducible scaling of multi-task learning.
+
+ Each dataset is standardized to a `MultipleChoice`, `Classification`, or `TokenClassification` dataset with identical fields. We do not support generation tasks, as they are addressed by [promptsource](https://github.com/bigscience-workshop/promptsource). All implemented preprocessings are in [tasks.py](https://github.com/sileod/tasksource/blob/main/src/tasksource/tasks.py) or [tasks.md](https://github.com/sileod/tasksource/blob/main/tasks.md). A preprocessing is a function that accepts a dataset and returns the standardized dataset. Preprocessing code is concise and human-readable.
+
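To make the idea of "a function that accepts a dataset and returns the standardized dataset" concrete, here is a minimal sketch in plain Python. It is not tasksource's implementation: the helper `make_classification_preprocessing`, the target field names in `TARGET_FIELDS`, and the toy rows are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List

# Hypothetical shared schema for classification tasks. These field names are
# assumptions made for this sketch, not tasksource's documented schema.
TARGET_FIELDS = ("sentence1", "sentence2", "labels")


def make_classification_preprocessing(mapping: Dict[str, str]):
    """Build a preprocessing: a function from raw rows to standardized rows.

    `mapping` maps each target field to the column that holds it in the raw data.
    """
    missing = [field for field in TARGET_FIELDS if field not in mapping]
    if missing:
        raise ValueError(f"mapping lacks target fields: {missing}")

    def preprocess(rows: Iterable[dict]) -> List[dict]:
        # Keep only the target fields, renamed to the shared schema.
        return [
            {field: row[mapping[field]] for field in TARGET_FIELDS}
            for row in rows
        ]

    return preprocess


# Two toy datasets with different column names (invented for illustration).
nli_rows = [{"premise": "A man sleeps.", "hypothesis": "A man is awake.", "label": 2}]
para_rows = [{"q1": "How old are you?", "q2": "What is your age?", "is_dup": 1}]

nli = make_classification_preprocessing(
    {"sentence1": "premise", "sentence2": "hypothesis", "labels": "label"}
)
para = make_classification_preprocessing(
    {"sentence1": "q1", "sentence2": "q2", "labels": "is_dup"}
)

# After preprocessing, both datasets expose the same fields.
standardized = [nli(nli_rows), para(para_rows)]
```

Once every dataset goes through such a function, downstream training code only needs to know the shared field names, which is what allows datasets to be swapped freely.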
+ ### Installation and usage:
+ `pip install tasksource`
+ ```python
+ from tasksource import list_tasks, load_task
+ df = list_tasks()
+
+ for task_id in df[df.task_type=="MultipleChoice"].id:
+     dataset = load_task(task_id)
+     # all yielded datasets can be used interchangeably
+ ```
+
+ See the 500+ supported tasks in [tasks.md](https://github.com/sileod/tasksource/blob/main/tasks.md) (200+ MultipleChoice tasks, 200+ Classification tasks), and feel free to request a new task. Datasets are downloaded to `$HF_DATASETS_CACHE` (like any Hugging Face dataset), so make sure you have more than 100GB of free space there.
+
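Before loading many tasks, it can help to check how much space is left where the cache lives. The sketch below uses only the standard library. The helper names are hypothetical, and the fallback path `~/.cache/huggingface/datasets` is an assumption about the default location when `HF_DATASETS_CACHE` is unset.

```python
import os
import shutil
from pathlib import Path

REQUIRED_GB = 100  # threshold taken from the note above


def datasets_cache_dir() -> Path:
    # Assumed fallback when the variable is unset: ~/.cache/huggingface/datasets
    default = Path.home() / ".cache" / "huggingface" / "datasets"
    return Path(os.environ.get("HF_DATASETS_CACHE", str(default))).expanduser()


def free_gb(path: Path) -> float:
    # shutil.disk_usage needs an existing path, so walk up to the nearest
    # existing parent directory (the filesystem root always exists).
    probe = path.absolute()
    while not probe.exists():
        probe = probe.parent
    return shutil.disk_usage(probe).free / 1e9


cache = datasets_cache_dir()
available = free_gb(cache)
if available < REQUIRED_GB:
    print(f"Only {available:.0f} GB free for {cache}; loading all tasks may fail.")
else:
    print(f"{available:.0f} GB free for {cache}.")
```

If space is short, pointing `HF_DATASETS_CACHE` at a larger disk before importing the library is one option.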
+ ### Pretrained model:
+
+ A text encoder pretrained on tasksource reached state-of-the-art results: [🤗/deberta-v3-base-tasksource-nli](https://hf.co/sileod/deberta-v3-base-tasksource-nli)
+
+ ### Contact and citation
+ I can help you integrate tasksource into your experiments. `[email protected]`
+
+ More details in this [article](https://arxiv.org/abs/2301.05948):
+ ```bib
+ @article{sileo2023tasksource,
+   title={tasksource: Structured Dataset Preprocessing Annotations for Frictionless Extreme Multi-Task Learning and Evaluation},
+   author={Sileo, Damien},
+   url={https://arxiv.org/abs/2301.05948},
+   journal={arXiv preprint arXiv:2301.05948},
+   year={2023}
+ }
+ ```
+