# math_text_tag_categorization model

## Disclaimer

The author trained this model for his personal use first and only later decided to upload it to Hugging Face. As such, the model is not yet optimized for general use.

## Description

math_text_tag_categorization is a multi-label text classification model. It was trained via the ULMFiT approach (cf. [The fastai book](https://github.com/fastai/fastbook)'s presentation of ULMFiT): the author of this repository fine-tuned a language model available in [fast.ai](https://github.com/fastai) on a corpus of mathematical text in LaTeX, then fine-tuned the encoder obtained from the fine-tuned language model for multi-label classification.

The model classifies whether a mathematical text is or contains the following common types of mathematical text: definition, notation, concept (i.e. theorems, propositions, corollaries, lemmas, etc.), proof, narrative (e.g. the text one encounters at the beginning of a chapter or section in a book, or in between theorems), exercise, remark, and example.

The model reports these classifications as Python strings of the form `'#_meta/<type>'`; this format is for the author's personal [Obsidian](https://obsidian.md/) vault.
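
If one prefers plain category names over the Obsidian-style tags, the prefix can simply be stripped. The helper below is a hypothetical illustration and is not part of the model:

```python
# Hypothetical helper: strip the Obsidian-style '#_meta/' prefix to recover
# the plain category name.
def tag_to_category(tag: str) -> str:
    prefix = '#_meta/'
    return tag[len(prefix):] if tag.startswith(prefix) else tag

print(tag_to_category('#_meta/definition'))  # -> definition
```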

## Intended uses & limitations

This model is intended to take as input mathematical text that one might encounter in an undergraduate/graduate/research setting and to output tags describing what kind of text the input is. The input is intended to be at most a few tens of thousands of characters long (or several pages of most undergraduate or graduate textbooks), but in practice, the author has experienced better results with shorter text.

This model was trained on a corpus mostly of algebra, algebraic geometry, arithmetic geometry, and number theory, which are the author's primary mathematical interests. The corpus also contains some text from topology and analysis.

This model was also trained on inputs which end with a line generally of the form `[^1]: <Last names of authors of the text>, <Some identification of the text>`. As such, text without such a line (or at least without a line beginning with `[^1]:`) may produce less reliable predictions. The author initially constructed the training pipeline to include such lines in the training data in hopes that the text identification would aid training when the training data contained only a few hundred to a few thousand samples of text.
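
For example, an input might be formatted as follows (the citation line here is made up purely for illustration) before being passed to the model's `predict` method described below:

```python
# An input string ending with a '[^1]:' identification line, mirroring the
# format used in training; the source named in that line is illustrative.
text = r"""
Definition. A group is a set $G$ together with an associative binary operation admitting an identity element and inverses.

[^1]: Dummit, Foote, Abstract Algebra
"""
```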
 
## How to use

The author pieced together the following code block based on his reading of [this blog post](https://huggingface.co/blog/fastai).

First load the model (this requires the `huggingface_hub` and `fastai` packages to be installed):

```python
from huggingface_hub import from_pretrained_fastai

repo_id = 'hyunjongkimmath/math_text_tag_categorization'
model = from_pretrained_fastai(repo_id)
```
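
The loaded object is a fastai `Learner`. Assuming the standard fastai API, the label corresponding to each position of the boolean and probability tensors returned by `predict` (see the examples below) can be inspected via the `DataLoaders` vocabulary:

```python
# For a fastai text classifier, dls.vocab holds the token vocabulary first
# and the label names second; the label order matches the tensor positions
# returned by model.predict.
print(model.dls.vocab[1])
```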

### Example usage

The following are predictions that the model makes on small pieces of text that the author came up with on the spot.

```python
model.predict(r"""
EXERCISE 1. Prove that a ring homomorphism $k \to k'$ of fields is injective.
"""
)
```

```python
((#1) ['#_meta/exercise'],
 tensor([False, False, False, False, False, False, False, False, True, False,
         False, False, False, False]),
 tensor([1.6156e-03, 2.4018e-04, 8.5640e-03, 1.0265e-01, 1.3623e-05, 1.1528e-06,
         2.0139e-02, 4.6183e-04, 9.9342e-01, 5.3418e-03, 2.1265e-04, 1.0663e-03,
         2.8052e-03, 1.5590e-02]))
```
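
As the examples show, `predict` returns a triple: the decoded tags, a boolean tensor marking which labels were assigned, and the per-label probabilities. Assuming the label order from `model.dls.vocab[1]` noted above, one might collect the confident tags into a dictionary as in the following sketch (the 0.5 threshold is illustrative):

```python
# Run the same exercise text through the model and keep labels whose
# predicted probability exceeds an illustrative 0.5 threshold.
tags, bools, probs = model.predict(r"""
EXERCISE 1. Prove that a ring homomorphism $k \to k'$ of fields is injective.
""")

labels = model.dls.vocab[1]
confident = {label: float(p) for label, p in zip(labels, probs) if p > 0.5}
print(confident)  # e.g. {'#_meta/exercise': 0.993...}
```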

```python
model.predict(r"""
A scheme is a locally ringed space $(X, \mathscr{O}_X)$ such that $X$ has a cover $X = \bigcup_{i \in I} U_i$ by open subsets for which $(U_i, \mathscr{O}_X|_{U_i})$ is an affine scheme for every $i \in I$.
"""
)
```

```python
((#1) ['#_meta/definition'],
 tensor([False, False, False, False, False, False, True, False, False, False,
         False, False, False, False]),
 tensor([1.5086e-04, 2.4751e-03, 3.2685e-03, 1.0054e-02, 2.2898e-09, 2.7758e-08,
         9.8818e-01, 3.4790e-06, 2.0679e-04, 2.6567e-04, 8.2683e-03, 2.9197e-04,
         3.8374e-05, 2.4789e-04]))
```

```python
model.predict(r"""
Theorem. $\mathbb{C}$ is algebraically closed.
"""
)
```

```python
((#1) ['#_meta/concept'],
 tensor([False, False, False, True, False, False, False, False, False, False,
         False, False, False, False]),
 tensor([3.7847e-03, 2.3521e-03, 1.8541e-02, 9.3016e-01, 7.3878e-06, 1.2939e-04,
         7.3363e-02, 4.7909e-04, 2.8213e-05, 7.7005e-03, 1.4716e-02, 1.4401e-02,
         1.2660e-04, 3.5716e-02]))
```

```python
model.predict(r"""
Theorem. $\mathbb{C}$ is algebraically closed.
Proof. Do some algebraic topology stuff.
"""
)
```

```python
((#2) ['#_meta/concept','#_meta/proof'],
 tensor([False, False, False, True, False, False, False, False, False, False,
         False, True, False, False]),
 tensor([4.8982e-03, 2.5899e-03, 4.1548e-02, 9.9772e-01, 1.4636e-05, 4.6300e-04,
         6.7154e-02, 7.7806e-04, 4.3370e-05, 3.9953e-03, 4.8603e-03, 8.3507e-01,
         1.3487e-03, 1.0530e-02]))
```

```python
model.predict(r"""
We write "\alpha+1" to denote the successor ordinal of "\alpha".
"""
)
```

Notice that the model correctly identifies this text as containing a notation, but incorrectly identifies it as containing a concept.

```python
((#2) ['#_meta/concept','#_meta/notation'],
 tensor([False, False, False, True, False, False, False, False, False, False,
         True, False, False, False]),
 tensor([3.1503e-02, 9.0384e-05, 3.4131e-02, 6.2074e-01, 3.1992e-03, 1.8337e-04,
         6.9371e-02, 1.1152e-03, 1.4070e-02, 6.4971e-02, 6.8179e-01, 2.8263e-02,
         7.2798e-04, 8.7313e-02]))
```
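
For more than a handful of inputs, calling `predict` in a loop can be slow. Assuming the standard fastai API, a batch of texts can instead be run through a test `DataLoader`; the snippet below is a sketch, and the texts are just the examples from above:

```python
texts = [
    r"EXERCISE 1. Prove that a ring homomorphism $k \to k'$ of fields is injective.",
    r"Theorem. $\mathbb{C}$ is algebraically closed.",
]

# Build an unlabeled test DataLoader from raw strings and compute per-label
# probabilities for the whole batch; rows follow the order of `texts` and
# columns follow model.dls.vocab[1].
dl = model.dls.test_dl(texts)
probs, _ = model.get_preds(dl=dl)
print(probs.shape)  # torch.Size([2, 14]): one row per text, one column per label
```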

## Evaluation metrics

During training, the model achieved over 95% accuracy on its validation dataset, as measured by fastai's [`accuracy_multi`](https://docs.fast.ai/metrics.html) metric; the validation dataset was chosen randomly from the entire dataset.
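
For reference, fastai's `accuracy_multi` averages per-label correctness after thresholding the predicted probabilities, as the toy computation below illustrates (the numbers are made up and unrelated to this model's training):

```python
import torch
from fastai.metrics import accuracy_multi

# Two samples, three labels: thresholding the probabilities at 0.5 matches
# the targets in 5 of the 6 positions, so the metric is 5/6.
probs = torch.tensor([[0.9, 0.2, 0.7],
                      [0.1, 0.8, 0.3]])
targets = torch.tensor([[1., 0., 1.],
                        [0., 1., 1.]])

# The inputs are already probabilities, so skip the extra sigmoid.
print(accuracy_multi(probs, targets, thresh=0.5, sigmoid=False))  # ~tensor(0.8333)
```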
 
## TODO

The model has been trained on text tokenized via fastai's default word-tokenizing methods. These tokenizations do not necessarily treat common LaTeX tokens such as the dollar sign `$` and the backslash `\` as separate tokens, and thus the author imagines that the model can be improved with modified tokenizations.
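
One simple direction, sketched below purely as an illustration (this rule is not part of the published model), would be a pre-tokenization rule that isolates dollar signs and LaTeX commands so that the default word tokenizer sees them as separate tokens; such a rule could, for instance, be added to the `rules` of a fastai `Tokenizer`:

```python
import re

# Hypothetical pre-tokenization rule: surround dollar signs and LaTeX
# commands with spaces so they survive word tokenization as their own tokens.
def separate_latex_tokens(text: str) -> str:
    text = re.sub(r'(\$)', r' \1 ', text)           # isolate $
    text = re.sub(r'(\\[A-Za-z]+)', r' \1 ', text)  # isolate \commands
    return re.sub(r'\s+', ' ', text).strip()

print(separate_latex_tokens(r"Let $\alpha+1$ denote the successor ordinal."))
# -> Let $ \alpha +1 $ denote the successor ordinal.
```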
 
The model also outputs whether a text "should be" deleted, split, or merged; this was originally intended for the author's personal use, but the author has neither found the model to be actually useful for these categorizations nor taken the time to remove the feature.

The author intends to retrain the model on inputs