nielsr (HF Staff) committed on
Commit c546cd3 · verified
1 Parent(s): 467086f

Add license to metadata and fix broken link in summary

This PR improves the model card by:
- Adding the `apache-2.0` license to the YAML metadata.
- Fixing a broken link in the "Model Summary" section, ensuring it correctly points to the paper.
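The broken link in question is a Markdown link with an empty target (`]()`), whose link text also wrapped onto a second line. As a rough illustration of how such links could be detected, here is a minimal sketch; the helper name `find_empty_links` and the sample strings are hypothetical and not part of this PR.

```python
import re

# Matches a Markdown inline link whose target is empty, e.g. "[text]()".
# The negated character class also matches newlines, so link text that
# wraps across lines is still caught.
EMPTY_LINK = re.compile(r"\[([^\]]*)\]\(\s*\)")


def find_empty_links(markdown: str) -> list[str]:
    """Return the whitespace-normalized text of every link with no target."""
    return [" ".join(m.group(1).split()) for m in EMPTY_LINK.finditer(markdown)]


# Shortened stand-ins for the old and new README sentences.
old = "introduced in paper: [Predictive Data Selection\n](). And this"
new = "introduced in paper: [Predictive Data Selection](https://arxiv.org/abs/2503.00808)."

print(find_empty_links(old))  # ['Predictive Data Selection']
print(find_empty_links(new))  # []
```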

Files changed (1)
  1. README.md +6 -4
README.md CHANGED
@@ -1,8 +1,11 @@
 ---
-pipeline_tag: text-classification
 library_name: fasttext
+pipeline_tag: text-classification
+license: apache-2.0
 ---
 
+# Predictive Data Selection: The Data That Predicts Is the Data That Teaches
+
 <p align="center">
 📑 <a href="https://arxiv.org/abs/2503.00808" target="_blank">Paper</a> &nbsp&nbsp | &nbsp&nbsp 🔨 <a href="https://huggingface.co/hkust-nlp/preselect-fasttext-classifier" target="_blank">fastText Classifier</a> &nbsp&nbsp | &nbsp&nbsp 🤗 <a href="https://huggingface.co/datasets/hkust-nlp/PreSelect-100B" target="_blank">Released Dataset</a> &nbsp&nbsp | &nbsp&nbsp 📦 <a href="https://github.com/hkust-nlp/PreSelect" target="_blank">Repo</a>
 <br>
@@ -10,8 +13,7 @@ library_name: fasttext
 
 
 ## Model Summary
-This is a fastText-based binary classifier for identifying high-quality data in the pretraining corpus introduced in paper: [Predictive Data Selection: The Data That Predicts Is the Data That Teaches
-](). And this is also the classifier we used to build [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset with a selection threshold of 10%.
+This is a fastText-based binary classifier for identifying high-quality data in the pretraining corpus introduced in paper: [Predictive Data Selection: The Data That Predicts Is the Data That Teaches](https://arxiv.org/abs/2503.00808). And this is also the classifier we used to build [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset with a selection threshold of 10%.
 The positive label name and negative label name are "__label__1" and "__label__0" respectively.
 
 ## How to use
@@ -47,7 +49,7 @@ dist_executor.run()
 
 ## Training
 For more training details, you can refer to the paper and the training code is available on GitHub
-[PreSelect](https://github.com/hkust-nlp/preselect).
+[PreSelect](https://github.com/hkust-nlp/PreSelect).
 
 ## Citation
 If you find this work helpful, please kindly cite as:
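
The Model Summary in the diff names the labels `__label__1` and `__label__0` and mentions a selection threshold of 10%. The sketch below shows one way those two facts might combine. It assumes the threshold means "keep the 10% of documents with the highest positive-label score", and that each prediction is a `(labels, probs)` pair shaped like the return value of the fastText Python `predict` call. The function names and sample data are hypothetical, and this is not the repository's actual filtering pipeline.

```python
POSITIVE = "__label__1"  # positive label name, as stated in the model card


def positive_score(labels, probs):
    """Probability assigned to the positive label in one (labels, probs) pair."""
    for label, prob in zip(labels, probs):
        if label == POSITIVE:
            return float(prob)
    return 0.0  # positive label absent from the returned top-k


def select_top_fraction(docs, predictions, fraction=0.10):
    """Keep the `fraction` of docs with the highest positive-label score.

    Assumption: "selection threshold of 10%" is read as a top-10% cut.
    """
    scored = sorted(
        zip(docs, (positive_score(labels, probs) for labels, probs in predictions)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    keep = max(1, round(len(scored) * fraction))
    return [doc for doc, _ in scored[:keep]]


# Hypothetical data: ten documents whose positive scores are 0.0, 0.1, ..., 0.9.
docs = [f"doc{i}" for i in range(10)]
predictions = [(("__label__1", "__label__0"), (i / 10, 1 - i / 10)) for i in range(10)]

print(select_top_fraction(docs, predictions))  # ['doc9']
```

With a real model, each prediction would presumably come from something like `model.predict(text, k=2)` after `fasttext.load_model(...)`. The card's own "How to use" section remains the authoritative usage reference.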