Text Classification · fastText · English
Commit 658569c by kenhktsui (verified) · 1 Parent(s): 89a286b

Update README.md

  1. README.md +11 -8
README.md CHANGED
@@ -14,7 +14,7 @@ library_name: fasttext
  [Dataset](https://huggingface.co/datasets/kenhktsui/math-classifiers-data)
  This is part of my [fasttext classifier collection](https://huggingface.co/collections/kenhktsui/fasttext-model-for-pretraining-data-curation-67220374c8acb97a1839553c) for curating pretraining dataset.
  This classifier classifies a text into Maths or Others.
- The model is trained over 1.6M records, which is a 50:50 mix of maths and non maths in website and achieved a test F1 score of 0.97. It is an intended upsampling of maths data.
+ The model is trained over 1.6M records, which is a 50:50 mix of maths and non maths in website and achieved a test F1 score of 0.99 (too good to be true?). It is an intended upsampling of maths data.
  The classifier can be used for LLM pretraining data curation, to enhance capability in mathematics.
  It is ultra fast ⚡ with a throughtput of ~2000 doc/s with CPU.
 
@@ -47,21 +47,24 @@ def predict(text_list: List[str]) -> List[dict]:
  predict([
  """This is a lightning fast model, which can classify at throughtput of 2000 doc/s with CPU""",
  """Differential geometry is a mathematical discipline that studies the geometry of smooth shapes and smooth spaces, otherwise known as smooth manifolds. It uses the techniques of single variable calculus, vector calculus, linear algebra and multilinear algebra.""",
+ """Given $p$: $|4x-3|\leqslant 1$ and $q$: $x^{2}-(2a+1)x+a^{2}+a\leqslant 0$, find the range of values for $a$ if $p$ is a necessary but not sufficient condition for $q$."""
  ])
- # [{'label': 'Others', 'score': 0.99998367},
- # {'label': 'Maths', 'score': 0.99995637},
+ # [{'label': 'Others', 'score': 1.00000834},
+ # {'label': 'Maths', 'score': 0.99995351},
+ # {'label': 'Maths', 'score': 0.99801832}]
  ```
  ## 📊Evaluation
  full version
  ```
                precision    recall  f1-score   support
 
-        Maths       0.98      0.98      0.98    200000
-       Others       0.98      0.98      0.98    200000
+        Maths       0.99      0.98      0.99    200000
+       Others       0.98      0.99      0.99    200000
+
+     accuracy                           0.99    400000
+    macro avg       0.99      0.99      0.99    400000
+ weighted avg       0.99      0.99      0.99    400000
 
-     accuracy                           0.98    400000
-    macro avg       0.98      0.98      0.98    400000
- weighted avg       0.98      0.98      0.98    400000
  ```
 
 
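The README says the classifier is meant for pretraining data curation. As a minimal sketch of that step, the snippet below filters documents using the list-of-dicts output shape shown in the README example (`label` and `score` keys). The helper name `keep_maths`, the `min_score` threshold of 0.9, and the placeholder document strings are assumptions made for illustration; they are not part of the repository.

```python
from typing import Dict, List


def keep_maths(
    texts: List[str],
    predictions: List[Dict],
    min_score: float = 0.9,  # assumed threshold, not taken from the README
) -> List[str]:
    """Return the texts predicted as 'Maths' with score >= min_score."""
    if len(texts) != len(predictions):
        raise ValueError("texts and predictions must have the same length")
    return [
        text
        for text, pred in zip(texts, predictions)
        if pred["label"] == "Maths" and pred["score"] >= min_score
    ]


# Placeholder documents standing in for the three README example inputs.
sample_texts = ["throughput sentence", "differential geometry", "inequality problem"]

# Predictions copied from the updated README example output.
sample_predictions = [
    {"label": "Others", "score": 1.00000834},
    {"label": "Maths", "score": 0.99995351},
    {"label": "Maths", "score": 0.99801832},
]

kept_default = keep_maths(sample_texts, sample_predictions)
kept_strict = keep_maths(sample_texts, sample_predictions, min_score=0.999)
```

With the default threshold both maths documents are kept; the stricter 0.999 threshold drops the third one, whose score is 0.99801832. In practice the threshold would need to be tuned against held-out data rather than taken from this sketch.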