Text Classification
fastText
English
Commit 3a5f29c (verified) by kenhktsui · 1 Parent(s): 4537db0

Update README.md

  1. README.md +69 -3
README.md CHANGED
@@ -1,3 +1,69 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ datasets:
+ - kenhktsui/math-classifiers-data
+ language:
+ - en
+ metrics:
+ - f1
+ pipeline_tag: text-classification
+ library_name: fasttext
+ ---
+ # maths-fasttext-classifier
+
+ [Dataset](https://huggingface.co/datasets/kenhktsui/math-classifiers-data)
+ This is part of my [fasttext classifier collection](https://huggingface.co/collections/kenhktsui/fasttext-model-for-pretraining-data-curation-67220374c8acb97a1839553c) for curating pretraining datasets.
+ This classifier labels a text as Maths or Others.
+ The model is trained on 1.6M records, a 50:50 mix of maths and non-maths text from websites, and achieved a test F1 score of 0.97. The 50:50 ratio is an intentional upsampling of maths data.
+ The classifier can be used for LLM pretraining data curation, to enhance capability in mathematics.
+ It is ultra fast ⚡ with a throughput of ~2000 doc/s on CPU.
+
+ Don't underestimate the "old" fasttext classifier! It is indeed a good and scalable practice.
+ For example, [QWEN2.5-MATH](https://arxiv.org/pdf/2409.12122) leverages fasttext to curate pretraining data, although its classifier is not open sourced.
+
+
+ ## 🛠️Usage
+ ```python
+ from typing import List
+ import re
+ from huggingface_hub import hf_hub_download
+ import fasttext
+
+
+ model_hf = fasttext.load_model(hf_hub_download("kenhktsui/maths-fasttext-classifier", "model.bin"))
+
+
+ def replace_newlines(text: str) -> str:
+     return re.sub("\n+", " ", text)  # fastText's predict() rejects input containing newlines
+
+
+ def predict(text_list: List[str]) -> List[dict]:
+     text_list = [replace_newlines(text) for text in text_list]
+     pred = model_hf.predict(text_list)  # (labels, scores), one entry per input text
+     return [{"label": l[0].replace("__label__", "", 1), "score": s[0]}  # drop the prefix; lstrip() strips a character set
+             for l, s in zip(*pred)]
+
+
+ predict([
+     """This is a lightning fast model, which can classify at throughtput of 2000 doc/s with CPU""",
+     """Differential geometry is a mathematical discipline that studies the geometry of smooth shapes and smooth spaces, otherwise known as smooth manifolds. It uses the techniques of single variable calculus, vector calculus, linear algebra and multilinear algebra.""",
+ ])
+ # [{'label': 'Others', 'score': 0.99998367},
+ #  {'label': 'Maths', 'score': 0.99995637}]
+ ```
+ ## 📊Evaluation
+ full version
+ ```
+               precision    recall  f1-score   support
+
+        Maths       0.98      0.98      0.98    200000
+       Others       0.98      0.98      0.98    200000
+
+     accuracy                           0.98    400000
+    macro avg       0.98      0.98      0.98    400000
+ weighted avg       0.98      0.98      0.98    400000
+ ```
+
+
+ ## ⚠️Known Limitation
+ The classifier does not handle short text well, which might not be surprising.
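
The README names two practical points: the classifier is meant for pretraining data curation, and it is unreliable on short text. The following is a minimal sketch, not part of this commit, of how the two might be combined in a filtering step. The function name `select_maths_docs`, the `min_chars` and `min_score` defaults, and the `stub_predict` stand-in are all hypothetical. The only thing taken from the README is the output shape of its `predict()` helper (a list of `{"label", "score"}` dicts), which would be passed in as `predict_fn` in real use.

```python
from typing import Callable, Dict, List

# Hypothetical defaults for illustration; neither value comes from the README.
MIN_CHARS = 200
MIN_SCORE = 0.9

Prediction = Dict[str, object]  # {"label": str, "score": float}


def select_maths_docs(
    docs: List[str],
    predict_fn: Callable[[List[str]], List[Prediction]],
    min_chars: int = MIN_CHARS,
    min_score: float = MIN_SCORE,
) -> List[str]:
    """Keep documents labelled "Maths" with a score of at least min_score.

    Documents shorter than min_chars are skipped without being scored,
    since the classifier is reported to handle short text poorly.
    """
    candidates = [d for d in docs if len(d) >= min_chars]
    if not candidates:
        return []
    preds = predict_fn(candidates)
    return [
        doc
        for doc, pred in zip(candidates, preds)
        if pred["label"] == "Maths" and float(pred["score"]) >= min_score
    ]


def stub_predict(texts: List[str]) -> List[Prediction]:
    """Stand-in for the README's predict(); flags texts that mention 'theorem'."""
    return [
        {"label": "Maths", "score": 0.99}
        if "theorem" in t.lower()
        else {"label": "Others", "score": 0.99}
        for t in texts
    ]


docs = [
    "Theorem",  # shorter than MIN_CHARS, so never scored
    "The mean value theorem relates a function's average slope to its derivative. " * 10,
    "A pancake recipe needs flour, eggs and milk. " * 10,
]
kept = select_maths_docs(docs, stub_predict)
print(len(kept))  # 1
```

Passing the predictor in as an argument keeps the filter independent of how the model is loaded, so the thresholds can be tuned on a labelled sample before a full curation run.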