hyunjongkimmath committed on
Commit
5f072b7
·
1 Parent(s): 8c08a3b
Files changed (3)
  1. README.md +15 -124
  2. model.pkl +2 -2
  3. pyproject.toml +1 -1
README.md CHANGED
@@ -1,141 +1,32 @@
1
  ---
2
- license: gpl-2.0
3
  tags:
4
  - fastai
5
- - multi-label-classification
6
- - mathematics
7
- - text-classification
8
- - math
9
  ---
10
 
 
11
 
12
- # math_text_tag_categorization model
13
 
14
- ## Disclaimer
 
15
 
16
- The author trained this model for his personal use first then decided to upload it to Hugging Face only after learning about it. As such, the model is not yet optimized for general use.
17
 
18
- ## Description
19
 
20
- math_text_tag_categorization is a multi-label text classification model. It was trained via the ULMFiT approach (cf. [The fastai book](https://github.com/fastai/fastbook)'s presentation of ULMFiT) - the author of this repository fine-tuned a language model available in [fast.ai](https://github.com/fastai) on a corpus of mathematical text in LaTeX, then fine-tuned the encoder obtained from the fine-tuned language model for a multi-label classification.
21
 
22
- The model classifies whether a mathematical text is or contains the following common types of mathematical text: definition, notation, concept (i.e. theorems, propositions, corollaries, lemmas, etc.), proof, narrative (e.g. the text one encounters in the beginning of a chapter or section in a book or in between theorems), exercise, remark, example.
23
 
24
- The model specifies these classifications as Python str's of the form `'#_meta/<type>'` - this format is for the author's personal [Obsidian](https://obsidian.md/) vault
25
-
26
- ## Intended uses & limitations
27
-
28
- This model is intended to take as input mathematical text that one might encounter in an undergraduate/graduate/research setting and output some tags concerning what kind of text the input is. The input is intended to be at most a few tens of thousands of characters long (roughly several pages of most undergraduate or graduate textbooks), but in practice, the author has experienced better results with shorter text.
29
-
30
- This model was trained on a corpus mostly of algebra, algebraic geometry, arithmetic geometry, and number theory, which are the author's primary mathematical interests. The corpus also contains some text of topology and analysis.
31
-
32
- This model is also trained on inputs which end with a line generally of the form `[^1]: <Last names of authors of the text>, <Some identification of the text>`. As such, the model may perform worse on text without such a line (or at least without just the beginning `[^1]:`). The author initially constructed the training pipeline to include such lines in the training data in hopes that the text identification would aid in training when the training data contained only a few hundred to a few thousand samples of text.
33
-
34
- ## How to use
35
-
36
- The author pieced together the following code block based on his readings from [this blog post](https://huggingface.co/blog/fastai).
37
-
38
- First load the model:
39
-
40
- ```python
41
- from huggingface_hub import from_pretrained_fastai
42
- repo_id = 'hyunjongkimmath/math_text_tag_categorization'
43
- model = from_pretrained_fastai(repo_id)
44
- ```
45
-
46
- ### Example usage
47
-
48
- The following are predictions that the model makes on small pieces of text that the author came up with on the spot.
49
-
50
- ```python
51
- model.predict(r"""
52
- EXERCISE 1. Prove that a ring homomorphism $k \to k'$ of fields is injective.
53
- """
54
- )
55
- ```
56
-
57
- ```python
58
- ((#1) ['#_meta/exercise'],
59
- tensor([False, False, False, False, False, False, False, False, True, False,
60
- False, False, False, False]),
61
- tensor([1.6156e-03, 2.4018e-04, 8.5640e-03, 1.0265e-01, 1.3623e-05, 1.1528e-06,
62
- 2.0139e-02, 4.6183e-04, 9.9342e-01, 5.3418e-03, 2.1265e-04, 1.0663e-03,
63
- 2.8052e-03, 1.5590e-02]))
64
- ```
65
-
66
- ```python
67
- model.predict(r"""
68
- A scheme is a locally ringed space $(X, \mathscr{O}_X)$ such that $X$ has a cover $X = \bigcup_{i \in I} U_i$ by open subsets for which $(U_i, \mathscr{O}_X|_{U_i}$ is an affine scheme for every $i \in I$.
69
- """
70
- )
71
- ```
72
-
73
- ```python
74
- ((#1) ['#_meta/definition'],
75
- tensor([False, False, False, False, False, False, True, False, False, False,
76
- False, False, False, False]),
77
- tensor([1.5086e-04, 2.4751e-03, 3.2685e-03, 1.0054e-02, 2.2898e-09, 2.7758e-08,
78
- 9.8818e-01, 3.4790e-06, 2.0679e-04, 2.6567e-04, 8.2683e-03, 2.9197e-04,
79
- 3.8374e-05, 2.4789e-04]))
80
- ```
81
-
82
- ```python
83
- model.predict(r"""
84
- Theorem. $\mathbb{C}$ is algebraically closed.
85
- """
86
- )
87
- ```
88
-
89
- ```python
90
- ((#1) ['#_meta/concept'],
91
- tensor([False, False, False, True, False, False, False, False, False, False,
92
- False, False, False, False]),
93
- tensor([3.7847e-03, 2.3521e-03, 1.8541e-02, 9.3016e-01, 7.3878e-06, 1.2939e-04,
94
- 7.3363e-02, 4.7909e-04, 2.8213e-05, 7.7005e-03, 1.4716e-02, 1.4401e-02,
95
- 1.2660e-04, 3.5716e-02]))
96
- ```
97
-
98
- ```python
99
- model.predict(r"""
100
- Theorem. $\mathbb{C}$ is algebraically closed.
101
- Proof. Do so algebraic topology stuff.
102
- """
103
- )
104
- ```
105
-
106
- ```python
107
- ((#2) ['#_meta/concept','#_meta/proof'],
108
- tensor([False, False, False, True, False, False, False, False, False, False,
109
- False, True, False, False]),
110
- tensor([4.8982e-03, 2.5899e-03, 4.1548e-02, 9.9772e-01, 1.4636e-05, 4.6300e-04,
111
- 6.7154e-02, 7.7806e-04, 4.3370e-05, 3.9953e-03, 4.8603e-03, 8.3507e-01,
112
- 1.3487e-03, 1.0530e-02]))
113
- ```
114
-
115
- ```python
116
- model.predict(r"""
117
- We write "\alpha+1" to denote the successor ordinal of "\alpha".
118
- """)
119
- ```
120
-
121
- Notice that the model correctly identifies this text as containing a notation, but incorrectly identifies it as containing a concept.
122
-
123
- ```python
124
- ((#2) ['#_meta/concept','#_meta/notation'],
125
- tensor([False, False, False, True, False, False, False, False, False, False,
126
- True, False, False, False]),
127
- tensor([3.1503e-02, 9.0384e-05, 3.4131e-02, 6.2074e-01, 3.1992e-03, 1.8337e-04,
128
- 6.9371e-02, 1.1152e-03, 1.4070e-02, 6.4971e-02, 6.8179e-01, 2.8263e-02,
129
- 7.2798e-04, 8.7313e-02]))
130
- ```
131
 
132
 
133
- ## Evaluation metrics
134
- During training, the model achieved over 95% accuracy on its validation dataset, which was chosen randomly from its entire dataset, according to fastai's [accuracy_multi](https://docs.fast.ai/metrics.html) metric.
135
 
136
- ## TODO
137
- The model has been trained on text tokenized via fastai's default word-tokenizing methods. These tokenizations do not necessarily tokenize common LaTeX tokens such as the dollar sign `$` and backslash `\` and thus the author imagines that the model can be improved with modified tokenizations.
138
 
139
- The model also outputs whether a text "should be" deleted, split, or merged - this was originally intended for the author's personal use, but the author has neither found the model to be actually useful in these categorizations nor taken the time to remove this feature.
 
140
 
141
- The author intends to eventually retrain the model on inputs that do not end with a line starting with `[^1]:`, see [above](#intended-uses--limitations).
 
 
1
  ---
 
2
  tags:
3
  - fastai
 
 
 
 
4
  ---
5
 
6
+ # Amazing!
7
 
8
+ 🥳 Congratulations on hosting your fastai model on the Hugging Face Hub!
9
 
10
+ # Some next steps
11
+ 1. Fill out this model card with more information (see the template below and the [documentation here](https://huggingface.co/docs/hub/model-repos))!
12
 
13
+ 2. Create a demo in Gradio or Streamlit using 🤗 Spaces ([documentation here](https://huggingface.co/docs/hub/spaces)).
14
 
15
+ 3. Join the fastai community on the [Fastai Discord](https://discord.com/invite/YKrxeNn)!
16
 
17
+ Greetings fellow fastlearner 🤝! Don't forget to delete this content from your model card.
18
 
 
19
 
20
+ ---
 
21
 
22
 
23
+ # Model card
 
24
 
25
+ ## Model description
26
+ More information needed
27
 
28
+ ## Intended uses & limitations
29
+ More information needed
30
 
31
+ ## Training and evaluation data
32
+ More information needed
model.pkl CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:b4b3eb7007440cc946ca872887f5f99c957d4e95573efc43d7196a5136b02f68
3
- size 188960371
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3d7917f8f3625f0023d0ecbe095fcb52ab354e7430bc262fece0d6cb877743ca
3
+ size 166126233
pyproject.toml CHANGED
@@ -1,3 +1,3 @@
1
  [build-system]
2
- requires = ["setuptools>=40.8.0", "wheel", "python=3.10.6", "fastai=2.7.9", "fastcore=1.5.22"]
3
  build-backend = "setuptools.build_meta:__legacy__"
 
1
  [build-system]
2
+ requires = ["setuptools>=40.8.0", "wheel", "python=3.10.6", "fastai=2.7.10", "fastcore=1.5.27"]
3
  build-backend = "setuptools.build_meta:__legacy__"
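
The example outputs in the removed README.md are triples of (tags, boolean mask, per-label probabilities). The following is a minimal sketch of how such a probability vector could be decoded into tags. It assumes a threshold of 0.5 applied with a strict `>` comparison, which is consistent with the printed masks but is not stated in the README. The index-to-tag pairs are inferred from the five example outputs; the README does not identify the other nine positions, so they are given placeholders. `KNOWN_TAGS` and `decode` are hypothetical names, not part of fastai or of this repository.

```python
# Index -> tag pairs inferred from the example outputs in the removed README.
# The remaining positions of the 14-label vector are not identified there.
KNOWN_TAGS = {
    3: "#_meta/concept",
    6: "#_meta/definition",
    8: "#_meta/exercise",
    10: "#_meta/notation",
    11: "#_meta/proof",
}


def decode(probs, thresh=0.5):
    """Return (tags, mask) for a list of per-label probabilities.

    thresh=0.5 is an assumption; it matches the masks shown in the README.
    """
    mask = [p > thresh for p in probs]
    tags = [KNOWN_TAGS.get(i, f"<unknown:{i}>") for i, hit in enumerate(mask) if hit]
    return tags, mask


# Probabilities copied from the "EXERCISE 1" example output.
exercise_probs = [
    1.6156e-03, 2.4018e-04, 8.5640e-03, 1.0265e-01, 1.3623e-05, 1.1528e-06,
    2.0139e-02, 4.6183e-04, 9.9342e-01, 5.3418e-03, 2.1265e-04, 1.0663e-03,
    2.8052e-03, 1.5590e-02,
]

# Probabilities copied from the successor-ordinal (notation) example output.
notation_probs = [
    3.1503e-02, 9.0384e-05, 3.4131e-02, 6.2074e-01, 3.1992e-03, 1.8337e-04,
    6.9371e-02, 1.1152e-03, 1.4070e-02, 6.4971e-02, 6.8179e-01, 2.8263e-02,
    7.2798e-04, 8.7313e-02,
]

print(decode(exercise_probs)[0])  # ['#_meta/exercise']
print(decode(notation_probs)[0])  # ['#_meta/concept', '#_meta/notation']
```

In the notation example both retained probabilities (about 0.62 and 0.68) sit close to the assumed threshold, so a stricter threshold would change which tags are reported for that input.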