hyunjongkimmath committed c385858 (parent: 4d9c18f): Update README.md

README.md (changed):
# math_text_tag_categorization model

## Disclaimer

The author trained this model for his personal use first, and decided to upload it to Hugging Face only after learning about the platform. As such, the model is not yet optimized for general use.

## Description

math_text_tag_categorization is a multi-label text classification model. It was trained via the ULMFiT approach (cf. [The fastai book](https://github.com/fastai/fastbook)'s presentation of ULMFiT): the author of this repository fine-tuned a language model available in [fast.ai](https://github.com/fastai) on a corpus of mathematical text in LaTeX, then fine-tuned the encoder obtained from that language model for multi-label classification.

The model classifies whether a mathematical text is, or contains, any of the following common types of mathematical text: definition, notation, concept (i.e. theorems, propositions, corollaries, lemmas, etc.), proof, narrative (e.g. the text one encounters at the beginning of a chapter or section of a book, or in between theorems), exercise, remark, and example.

The model specifies these classifications as Python strings of the form `'#_meta/<type>'`; this format is for the author's personal [Obsidian](https://obsidian.md/) vault.
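
For instance, the type name can be read off a returned tag with a small helper; this function is illustrative and not part of the model's API:

```python
def tag_to_type(tag: str) -> str:
    """Extract the text type from a tag of the form '#_meta/<type>'.

    (Illustrative helper, not part of the model's API.)
    """
    prefix = "#_meta/"
    if not tag.startswith(prefix):
        raise ValueError(f"unexpected tag format: {tag!r}")
    return tag[len(prefix):]

print(tag_to_type("#_meta/definition"))  # definition
```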

## Intended uses & limitations

This model is intended to take as input mathematical text that one might encounter in an undergraduate, graduate, or research setting, and to output tags describing what kind of text the input is. It is intended to accept text of at most a few tens of thousands of characters (several pages of most undergraduate or graduate textbooks), but in practice the author has seen better results with shorter text.
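
Given the note above about shorter inputs, one might clip long text before prediction; the constant and helper name below are made up for illustration:

```python
# Illustrative only: the "few tens of thousands of characters" figure above
# is a rough guide, not a hard model limit.
MAX_CHARS = 20_000

def truncate_input(text: str, max_chars: int = MAX_CHARS) -> str:
    """Clip overly long input, since the README reports better
    results on shorter text. (Illustrative helper.)"""
    return text[:max_chars]

print(len(truncate_input("x" * 50_000)))  # 20000
```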

This model was trained on a corpus consisting mostly of algebra, algebraic geometry, arithmetic geometry, and number theory, the author's primary mathematical interests. The corpus also contains some text from topology and analysis.

This model was also trained on inputs ending with a line generally of the form `[^1]: <Last names of authors of the text>, <Some identification of the text>`. As such, text without such a line (or at least without the leading `[^1]:`) may yield less reliable predictions. The author initially constructed the training pipeline to include such lines in the training data in the hope that the text identification would aid training when the data contained only a few hundred to a few thousand samples.
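
A minimal sketch of putting an input into this shape; the helper name and the citation below are made up for illustration:

```python
def with_source_line(text: str, authors: str, source: str) -> str:
    """Append a '[^1]: <authors>, <source>' line, matching the format
    the model saw during training. (Illustrative helper.)"""
    return f"{text.rstrip()}\n[^1]: {authors}, {source}\n"

sample = with_source_line(
    "A group is a set with an associative binary operation, an identity, and inverses.",
    "Dummit, Foote",            # hypothetical authors
    "Abstract Algebra, Ch. 1",  # hypothetical identification of the text
)
print(sample)
```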

## How to use

The author pieced together the following code block based on [this blog post](https://huggingface.co/blog/fastai).

First load the model:

```python
from huggingface_hub import from_pretrained_fastai

repo_id = 'hyunjongkimmath/math_text_tag_categorization'
model = from_pretrained_fastai(repo_id)
```

### Example usage

The following are predictions that the model makes on small pieces of text that the author came up with on the spot.

```python
model.predict(r"""
EXERCISE 1. Prove that a ring homomorphism $k \to k'$ of fields is injective.
"""
)
```

```python
((#1) ['#_meta/exercise'],
 tensor([False, False, False, False, False, False, False, False,  True, False,
         False, False, False, False]),
 tensor([1.6156e-03, 2.4018e-04, 8.5640e-03, 1.0265e-01, 1.3623e-05, 1.1528e-06,
         2.0139e-02, 4.6183e-04, 9.9342e-01, 5.3418e-03, 2.1265e-04, 1.0663e-03,
         2.8052e-03, 1.5590e-02]))
```

```python
model.predict(r"""
A scheme is a locally ringed space $(X, \mathscr{O}_X)$ such that $X$ has a cover $X = \bigcup_{i \in I} U_i$ by open subsets for which $(U_i, \mathscr{O}_X|_{U_i})$ is an affine scheme for every $i \in I$.
"""
)
```

```python
((#1) ['#_meta/definition'],
 tensor([False, False, False, False, False, False,  True, False, False, False,
         False, False, False, False]),
 tensor([1.5086e-04, 2.4751e-03, 3.2685e-03, 1.0054e-02, 2.2898e-09, 2.7758e-08,
         9.8818e-01, 3.4790e-06, 2.0679e-04, 2.6567e-04, 8.2683e-03, 2.9197e-04,
         3.8374e-05, 2.4789e-04]))
```

```python
model.predict(r"""
Theorem. $\mathbb{C}$ is algebraically closed.
"""
)
```

```python
((#1) ['#_meta/concept'],
 tensor([False, False, False,  True, False, False, False, False, False, False,
         False, False, False, False]),
 tensor([3.7847e-03, 2.3521e-03, 1.8541e-02, 9.3016e-01, 7.3878e-06, 1.2939e-04,
         7.3363e-02, 4.7909e-04, 2.8213e-05, 7.7005e-03, 1.4716e-02, 1.4401e-02,
         1.2660e-04, 3.5716e-02]))
```

```python
model.predict(r"""
Theorem. $\mathbb{C}$ is algebraically closed.
Proof. Do some algebraic topology stuff.
"""
)
```

```python
((#2) ['#_meta/concept','#_meta/proof'],
 tensor([False, False, False,  True, False, False, False, False, False, False,
         False,  True, False, False]),
 tensor([4.8982e-03, 2.5899e-03, 4.1548e-02, 9.9772e-01, 1.4636e-05, 4.6300e-04,
         6.7154e-02, 7.7806e-04, 4.3370e-05, 3.9953e-03, 4.8603e-03, 8.3507e-01,
         1.3487e-03, 1.0530e-02]))
```

```python
model.predict(r"""
We write "\alpha+1" to denote the successor ordinal of "\alpha".
"""
)
```

Notice that the model correctly identifies this text as containing a notation, but incorrectly identifies it as also containing a concept.

```python
((#2) ['#_meta/concept','#_meta/notation'],
 tensor([False, False, False,  True, False, False, False, False, False, False,
          True, False, False, False]),
 tensor([3.1503e-02, 9.0384e-05, 3.4131e-02, 6.2074e-01, 3.1992e-03, 1.8337e-04,
         6.9371e-02, 1.1152e-03, 1.4070e-02, 6.4971e-02, 6.8179e-01, 2.8263e-02,
         7.2798e-04, 8.7313e-02]))
```
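
In the outputs above, the final tensor holds per-label probabilities, and a label is emitted when its probability exceeds fastai's default 0.5 threshold. A plain-Python sketch of that decoding, using a made-up three-label vocabulary (the real label ordering lives in the learner's dataloaders):

```python
def decode_prediction(probs, vocab, thresh=0.5):
    """Return the labels whose predicted probability exceeds thresh,
    mirroring fastai's default multi-label decoding. (Illustrative.)"""
    return [label for label, p in zip(vocab, probs) if p > thresh]

# Toy example with a hypothetical three-label vocabulary:
print(decode_prediction([0.99, 0.02, 0.61],
                        ['#_meta/exercise', '#_meta/proof', '#_meta/notation']))
# ['#_meta/exercise', '#_meta/notation']
```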

## Evaluation metrics

During training, the model achieved over 95% accuracy on its validation dataset, which was chosen randomly from the entire dataset, according to fastai's [accuracy_multi](https://docs.fast.ai/metrics.html) metric.
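
This metric averages per-label correctness after thresholding probabilities at 0.5; a plain-Python sketch of the computation (an illustration, not fastai's implementation):

```python
def multi_label_accuracy(probs, targets, thresh=0.5):
    """Fraction of (sample, label) pairs predicted correctly after
    thresholding, in the spirit of fastai's accuracy_multi. (Sketch.)"""
    correct = total = 0
    for prob_row, target_row in zip(probs, targets):
        for p, t in zip(prob_row, target_row):
            correct += (p > thresh) == bool(t)  # per-label hit or miss
            total += 1
    return correct / total

print(multi_label_accuracy([[0.9, 0.2], [0.4, 0.8]],
                           [[1, 0], [1, 1]]))  # 0.75
```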

## TODO

The model was trained on text tokenized via fastai's default word-tokenization methods. These tokenizers do not necessarily handle common LaTeX tokens such as the dollar sign `$` and backslash `\`, so the author suspects the model could be improved with a modified tokenization.
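
One possible direction (a rough sketch, not the author's pipeline) is a pre-tokenization rule that keeps LaTeX control sequences and math delimiters intact:

```python
import re

# Match \commands, $ or $$ delimiters, brace/script characters, words,
# and any other non-space character, each as its own token.
LATEX_TOKEN = re.compile(r"\\[a-zA-Z]+|\$\$?|[{}^_]|[A-Za-z0-9]+|\S")

def latex_tokenize(text: str) -> list[str]:
    """Split text so LaTeX control sequences survive as single tokens.
    (Illustrative sketch of a modified tokenization rule.)"""
    return LATEX_TOKEN.findall(text)

print(latex_tokenize(r"$\mathbb{C}$ is algebraically closed"))
# ['$', '\\mathbb', '{', 'C', '}', '$', 'is', 'algebraically', 'closed']
```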

The model also outputs whether a text "should be" deleted, split, or merged - this was originally intended for the author's personal use, but the author has neither found the model useful for these categorizations nor taken the time to remove the feature.

The author intends to retrain the model on inputs
|