hkust-nlp/preselect-fasttext-classifier

Add license to metadata and fix broken link in summary

by nielsr HF Staff - opened 16 days ago

Files changed (1)

README.md

@@ -1,8 +1,11 @@
 ---
-pipeline_tag: text-classification
 library_name: fasttext
 ---
 <p align="center">
     📑 <a href="https://arxiv.org/abs/2503.00808" target="_blank">Paper</a> &nbsp&nbsp | &nbsp&nbsp 🔨 <a href="https://huggingface.co/hkust-nlp/preselect-fasttext-classifier" target="_blank">fastText Classifier</a> &nbsp&nbsp | &nbsp&nbsp 🤗 <a href="https://huggingface.co/datasets/hkust-nlp/PreSelect-100B" target="_blank">Released Dataset</a> &nbsp&nbsp | &nbsp&nbsp 📦 <a href="https://github.com/hkust-nlp/PreSelect" target="_blank">Repo</a>
 <br>
@@ -10,8 +13,7 @@ library_name: fasttext
 ## Model Summary
-This is a fastText-based binary classifier for identifying high-quality data in the pretraining corpus introduced in paper:  [Predictive Data Selection: The Data That Predicts Is the Data That Teaches
-](). And this is also the classifier we used to build [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset with a selection threshold of 10%.
 The positive label name and negative label name are "__label__1" and "__label__0" respectively.
 ## How to use
@@ -47,7 +49,7 @@ dist_executor.run()
 ## Training
 For more training details, you can refer to the paper and the training code is available on GitHub
-[PreSelect](https://github.com/hkust-nlp/preselect).
 ## Citation
 If you find this work helpful, please kindly cite as:

 ---
 library_name: fasttext
+pipeline_tag: text-classification
+license: apache-2.0
 ---
+# Predictive Data Selection: The Data That Predicts Is the Data That Teaches
 <p align="center">
     📑 <a href="https://arxiv.org/abs/2503.00808" target="_blank">Paper</a> &nbsp&nbsp | &nbsp&nbsp 🔨 <a href="https://huggingface.co/hkust-nlp/preselect-fasttext-classifier" target="_blank">fastText Classifier</a> &nbsp&nbsp | &nbsp&nbsp 🤗 <a href="https://huggingface.co/datasets/hkust-nlp/PreSelect-100B" target="_blank">Released Dataset</a> &nbsp&nbsp | &nbsp&nbsp 📦 <a href="https://github.com/hkust-nlp/PreSelect" target="_blank">Repo</a>
 <br>
 ## Model Summary
+This is a fastText-based binary classifier for identifying high-quality data in the pretraining corpus introduced in paper: [Predictive Data Selection: The Data That Predicts Is the Data That Teaches](https://arxiv.org/abs/2503.00808). And this is also the classifier we used to build [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset with a selection threshold of 10%.
 The positive label name and negative label name are "__label__1" and "__label__0" respectively.
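
As a rough illustration of how the two label names above might be consumed, the sketch below turns one fastText-style prediction into a positive-class probability and then keeps the top-scoring fraction of documents. It is not the authors' pipeline: the helper names `positive_probability` and `select_top_fraction` are hypothetical, and it assumes the 10% selection threshold means keeping the highest-scoring tenth of documents, which the summary does not spell out.

```python
POSITIVE = "__label__1"
NEGATIVE = "__label__0"


def positive_probability(labels, probs):
    """Return P(positive) from one fastText-style prediction.

    `labels` and `probs` are parallel sequences, in the shape returned for a
    single text by the fastText Python `model.predict(text, k=...)` call.
    """
    scores = dict(zip(labels, (float(p) for p in probs)))
    if POSITIVE in scores:
        return scores[POSITIVE]
    if NEGATIVE in scores:
        # Only the negative label came back (e.g. k=1). Assumption: the two
        # class probabilities sum to 1, as for a two-class softmax.
        return 1.0 - scores[NEGATIVE]
    raise ValueError(f"unexpected labels: {list(labels)}")


def select_top_fraction(scored_docs, fraction=0.10):
    """Keep the highest-scoring `fraction` of (doc, score) pairs."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction))  # always keep at least one
    return ranked[:keep]
```

In practice the `labels` and `probs` arguments would come from a loaded classifier's prediction for each document; here they are passed in directly so the helpers can be tried without downloading the model.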
 ## How to use
 ## Training
 For more training details, you can refer to the paper and the training code is available on GitHub
+[PreSelect](https://github.com/hkust-nlp/PreSelect).
 ## Citation
 If you find this work helpful, please kindly cite as: