parislo committed
Commit 97c86b4 · verified · 1 parent: 1c6f564

Upload tokenizer

Files changed (4)
  1. README.md +199 -0
  2. special_tokens_map.json +1 -0
  3. tokenizer.json +539 -0
  4. tokenizer_config.json +8 -0
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for use of the model without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for use of the model when fine-tuned for a task, or when plugged into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about, as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!-- fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here. -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly. -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and BibTeX information for it should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {}
tokenizer.json ADDED
@@ -0,0 +1,539 @@
+ {
+ "version": "1.0",
+ "truncation": null,
+ "padding": null,
+ "added_tokens": [],
+ "normalizer": null,
+ "pre_tokenizer": {
+ "type": "ByteLevel",
+ "add_prefix_space": false,
+ "trim_offsets": true,
+ "use_regex": true
+ },
+ "post_processor": {
+ "type": "ByteLevel",
+ "add_prefix_space": true,
+ "trim_offsets": false,
+ "use_regex": true
+ },
+ "decoder": {
+ "type": "ByteLevel",
+ "add_prefix_space": true,
+ "trim_offsets": true,
+ "use_regex": true
+ },
+ "model": {
+ "type": "BPE",
+ "dropout": null,
+ "unk_token": null,
+ "continuing_subword_prefix": null,
+ "end_of_word_suffix": null,
+ "fuse_unk": false,
+ "byte_fallback": false,
+ "ignore_merges": false,
+ "vocab": {
+ "!": 0,
+ "\"": 1,
+ "#": 2,
+ "$": 3,
+ "%": 4,
+ "&": 5,
+ "'": 6,
+ "(": 7,
+ ")": 8,
+ "*": 9,
+ "+": 10,
+ ",": 11,
+ "-": 12,
+ ".": 13,
+ "/": 14,
+ "0": 15,
+ "1": 16,
+ "2": 17,
+ "3": 18,
+ "4": 19,
+ "5": 20,
+ "6": 21,
+ "7": 22,
+ "8": 23,
+ "9": 24,
+ ":": 25,
+ ";": 26,
+ "<": 27,
+ "=": 28,
+ ">": 29,
+ "?": 30,
+ "@": 31,
+ "A": 32,
+ "B": 33,
+ "C": 34,
+ "D": 35,
+ "E": 36,
+ "F": 37,
+ "G": 38,
+ "H": 39,
+ "I": 40,
+ "J": 41,
+ "K": 42,
+ "L": 43,
+ "M": 44,
+ "N": 45,
+ "O": 46,
+ "P": 47,
+ "Q": 48,
+ "R": 49,
+ "S": 50,
+ "T": 51,
+ "U": 52,
+ "V": 53,
+ "W": 54,
+ "X": 55,
+ "Y": 56,
+ "Z": 57,
+ "[": 58,
+ "\\": 59,
+ "]": 60,
+ "^": 61,
+ "_": 62,
+ "`": 63,
+ "a": 64,
+ "b": 65,
+ "c": 66,
+ "d": 67,
+ "e": 68,
+ "f": 69,
+ "g": 70,
+ "h": 71,
+ "i": 72,
+ "j": 73,
+ "k": 74,
+ "l": 75,
+ "m": 76,
+ "n": 77,
+ "o": 78,
+ "p": 79,
+ "q": 80,
+ "r": 81,
+ "s": 82,
+ "t": 83,
+ "u": 84,
+ "v": 85,
+ "w": 86,
+ "x": 87,
+ "y": 88,
+ "z": 89,
+ "{": 90,
+ "|": 91,
+ "}": 92,
+ "~": 93,
+ "¡": 94,
+ "¢": 95,
+ "£": 96,
+ "¤": 97,
+ "¥": 98,
+ "¦": 99,
+ "§": 100,
+ "¨": 101,
+ "©": 102,
+ "ª": 103,
+ "«": 104,
+ "¬": 105,
+ "®": 106,
+ "¯": 107,
+ "°": 108,
+ "±": 109,
+ "²": 110,
+ "³": 111,
+ "´": 112,
+ "µ": 113,
+ "¶": 114,
+ "·": 115,
+ "¸": 116,
+ "¹": 117,
+ "º": 118,
+ "»": 119,
+ "¼": 120,
+ "½": 121,
+ "¾": 122,
+ "¿": 123,
+ "À": 124,
+ "Á": 125,
+ "Â": 126,
+ "Ã": 127,
+ "Ä": 128,
+ "Å": 129,
+ "Æ": 130,
+ "Ç": 131,
+ "È": 132,
+ "É": 133,
+ "Ê": 134,
+ "Ë": 135,
+ "Ì": 136,
+ "Í": 137,
+ "Î": 138,
+ "Ï": 139,
+ "Ð": 140,
+ "Ñ": 141,
+ "Ò": 142,
+ "Ó": 143,
+ "Ô": 144,
+ "Õ": 145,
+ "Ö": 146,
+ "×": 147,
+ "Ø": 148,
+ "Ù": 149,
+ "Ú": 150,
+ "Û": 151,
+ "Ü": 152,
+ "Ý": 153,
+ "Þ": 154,
+ "ß": 155,
+ "à": 156,
+ "á": 157,
+ "â": 158,
+ "ã": 159,
+ "ä": 160,
+ "å": 161,
+ "æ": 162,
+ "ç": 163,
+ "è": 164,
+ "é": 165,
+ "ê": 166,
+ "ë": 167,
+ "ì": 168,
+ "í": 169,
+ "î": 170,
+ "ï": 171,
+ "ð": 172,
+ "ñ": 173,
+ "ò": 174,
+ "ó": 175,
+ "ô": 176,
+ "õ": 177,
+ "ö": 178,
+ "÷": 179,
+ "ø": 180,
+ "ù": 181,
+ "ú": 182,
+ "û": 183,
+ "ü": 184,
+ "ý": 185,
+ "þ": 186,
+ "ÿ": 187,
+ "Ā": 188,
+ "ā": 189,
+ "Ă": 190,
+ "ă": 191,
+ "Ą": 192,
+ "ą": 193,
+ "Ć": 194,
+ "ć": 195,
+ "Ĉ": 196,
+ "ĉ": 197,
+ "Ċ": 198,
+ "ċ": 199,
+ "Č": 200,
+ "č": 201,
+ "Ď": 202,
+ "ď": 203,
+ "Đ": 204,
+ "đ": 205,
+ "Ē": 206,
+ "ē": 207,
+ "Ĕ": 208,
+ "ĕ": 209,
+ "Ė": 210,
+ "ė": 211,
+ "Ę": 212,
+ "ę": 213,
+ "Ě": 214,
+ "ě": 215,
+ "Ĝ": 216,
+ "ĝ": 217,
+ "Ğ": 218,
+ "ğ": 219,
+ "Ġ": 220,
+ "ġ": 221,
+ "Ģ": 222,
+ "ģ": 223,
+ "Ĥ": 224,
+ "ĥ": 225,
+ "Ħ": 226,
+ "ħ": 227,
+ "Ĩ": 228,
+ "ĩ": 229,
+ "Ī": 230,
+ "ī": 231,
+ "Ĭ": 232,
+ "ĭ": 233,
+ "Į": 234,
+ "į": 235,
+ "İ": 236,
+ "ı": 237,
+ "IJ": 238,
+ "ij": 239,
+ "Ĵ": 240,
+ "ĵ": 241,
+ "Ķ": 242,
+ "ķ": 243,
+ "ĸ": 244,
+ "Ĺ": 245,
+ "ĺ": 246,
+ "Ļ": 247,
+ "ļ": 248,
+ "Ľ": 249,
+ "ľ": 250,
+ "Ŀ": 251,
+ "ŀ": 252,
+ "Ł": 253,
+ "ł": 254,
+ "Ń": 255,
+ "th": 256,
+ "the": 257,
+ "Ġthe": 258,
+ "Ġi": 259,
+ "Ġa": 260,
+ "en": 261,
+ "re": 262,
+ "Ġo": 263,
+ "si": 264,
+ "Ġis": 265,
+ "al": 266,
+ "ri": 267,
+ "at": 268,
+ "es": 269,
+ "le": 270,
+ "on": 271,
+ "Ġf": 272,
+ "Ġof": 273,
+ "nd": 274,
+ "an": 275,
+ "he": 276,
+ "Ġb": 277,
+ "Ġc": 278,
+ "Ġe": 279,
+ "Ġs": 280,
+ "Ġt": 281,
+ "Eu": 282,
+ "ion": 283,
+ "la": 284,
+ "mu": 285,
+ "om": 286,
+ "or": 287,
+ "ore": 288,
+ "se": 289,
+ "ten": 290,
+ "ĠT": 291,
+ "Ġsi": 292,
+ "ĠEu": 293,
+ "Ġand": 294,
+ "Ġfu": 295,
+ "mula": 296,
+ "ormula": 297,
+ "Ġsid": 298,
+ "'s": 299,
+ "ag": 300,
+ "et": 301,
+ "hy": 302,
+ "po": 303,
+ "qu": 304,
+ "use": 305,
+ "Ġ-": 306,
+ "Ġ2": 307,
+ "Ġl": 308,
+ "Ġw": 309,
+ "ĠÎ": 310,
+ "Ġth": 311,
+ "Ġre": 312,
+ "ther": 313,
+ "Ġin": 314,
+ "eng": 315,
+ "ent": 316,
+ "Ġother": 317,
+ "rig": 318,
+ "ler": 319,
+ "Ġformula": 320,
+ "ĠThe": 321,
+ "ĠEuler": 322,
+ "Ġsides": 323,
+ "It": 324,
+ "Py": 325,
+ "am": 326,
+ "are": 327,
+ "ct": 328,
+ "gle": 329,
+ "hi": 330,
+ "ht": 331,
+ "in": 332,
+ "li": 333,
+ "lat": 334,
+ "nct": 335,
+ "ple": 336,
+ "ry": 337,
+ "um": 338,
+ "wo": 339,
+ "Ġ+": 340,
+ "Ġ=": 341,
+ "Ġg": 342,
+ "Ġn": 343,
+ "Ġhy": 344,
+ "Ġrig": 345,
+ "ĠIt": 346,
+ "ĠPy": 347,
+ "thag": 348,
+ "Ġan": 349,
+ "ndam": 350,
+ "here": 351,
+ "Ġcom": 352,
+ "Ġex": 353,
+ "Ġsqu": 354,
+ "Ġtwo": 355,
+ "ions": 356,
+ "omet": 357,
+ "orem": 358,
+ "orean": 359,
+ "tenuse": 360,
+ "Ġfunct": 361,
+ "Ġfundam": 362,
+ "potenuse": 363,
+ "Ġleng": 364,
+ "Ġwhere": 365,
+ "Ġthat": 366,
+ "Ġrelat": 367,
+ "ental": 368,
+ "plex": 369,
+ "Ġhypotenuse": 370,
+ "Ġright": 371,
+ "ĠPythag": 372,
+ "Ġcomplex": 373,
+ "Ġsquare": 374,
+ "Ġfundamental": 375,
+ "Ġlength": 376,
+ "ĠPythagorean": 377
+ },
+ "merges": [
+ "t h",
+ "th e",
+ "Ġ the",
+ "Ġ i",
+ "Ġ a",
+ "e n",
+ "r e",
+ "Ġ o",
+ "s i",
+ "Ġi s",
+ "a l",
+ "r i",
+ "a t",
+ "e s",
+ "l e",
+ "o n",
+ "Ġ f",
+ "Ġo f",
+ "n d",
+ "a n",
+ "h e",
+ "Ġ b",
+ "Ġ c",
+ "Ġ e",
+ "Ġ s",
+ "Ġ t",
+ "E u",
+ "i on",
+ "l a",
+ "m u",
+ "o m",
+ "o r",
+ "o re",
+ "s e",
+ "t en",
+ "Ġ T",
+ "Ġ si",
+ "Ġ Eu",
+ "Ġa nd",
+ "Ġf u",
+ "mu la",
+ "or mula",
+ "Ġsi d",
+ "' s",
+ "a g",
+ "e t",
+ "h y",
+ "p o",
+ "q u",
+ "u se",
+ "Ġ -",
+ "Ġ 2",
+ "Ġ l",
+ "Ġ w",
+ "Ġ Î",
+ "Ġ th",
+ "Ġ re",
+ "the r",
+ "Ġi n",
+ "en g",
+ "en t",
+ "Ġo ther",
+ "ri g",
+ "le r",
+ "Ġf ormula",
+ "ĠT he",
+ "ĠEu ler",
+ "Ġsid es",
+ "I t",
+ "P y",
+ "a m",
+ "a re",
+ "c t",
+ "g le",
+ "h i",
+ "h t",
+ "i n",
+ "l i",
+ "l at",
+ "n ct",
+ "p le",
+ "r y",
+ "u m",
+ "w o",
+ "Ġ +",
+ "Ġ =",
+ "Ġ g",
+ "Ġ n",
+ "Ġ hy",
+ "Ġ rig",
+ "Ġ It",
+ "Ġ Py",
+ "th ag",
+ "Ġa n",
+ "nd am",
+ "he re",
+ "Ġc om",
+ "Ġe x",
+ "Ġs qu",
+ "Ġt wo",
+ "ion s",
+ "om et",
+ "ore m",
+ "ore an",
+ "ten use",
+ "Ġfu nct",
+ "Ġfu ndam",
+ "po tenuse",
+ "Ġl eng",
+ "Ġw here",
+ "Ġth at",
+ "Ġre lat",
+ "ent al",
+ "ple x",
+ "Ġhy potenuse",
+ "Ġrig ht",
+ "ĠPy thag",
+ "Ġcom plex",
+ "Ġsqu are",
+ "Ġfundam ental",
+ "Ġleng th",
+ "ĠPythag orean"
+ ]
+ }
+ }
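Two details of the file above are worth unpacking. First, the vocab entries for ids 0–255 are the usual byte-level alphabet: every possible byte gets a printable stand-in character, which is why a leading space shows up as `Ġ` (id 220). Second, the `merges` list is applied greedily, lowest rank first, to build the larger tokens. A minimal stdlib-only sketch of both mechanisms (a reimplementation for illustration; the function names are my own, not from this repo):

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character.

    Printable ASCII and most Latin-1 bytes map to themselves; the
    remaining bytes (control characters, space, etc.) are shifted to
    codepoints 256 and up so every byte has a visible stand-in.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def bpe(word, merges):
    """Greedily apply a ranked merge list to a sequence of symbols."""
    ranks = {tuple(m.split()): i for i, m in enumerate(merges)}
    word = tuple(word)
    while len(word) > 1:
        # find the adjacent pair with the best (lowest) merge rank
        rank, i = min((ranks.get(p, float("inf")), j)
                      for j, p in enumerate(zip(word, word[1:])))
        if rank == float("inf"):
            break  # no applicable merge left
        word = word[:i] + (word[i] + word[i + 1],) + word[i + 2:]
    return word


print(bytes_to_unicode()[ord(" ")])       # space is shown as "Ġ"
# two merges taken verbatim from the file above
print(bpe("the", ["t h", "th e"]))        # ('the',)
```

With this mapping, `" the"` becomes `"Ġthe"` before BPE runs, which is exactly the form the merged tokens in the vocab take.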
tokenizer_config.json ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "added_tokens_decoder": {},
3
+ "clean_up_tokenization_spaces": true,
4
+ "merges_file": "./openwebmath_tokenizer/merges.txt",
5
+ "model_max_length": 1000000000000000019884624838656,
6
+ "tokenizer_class": "PreTrainedTokenizerFast",
7
+ "vocab_file": "./openwebmath_tokenizer/vocab.json"
8
+ }
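The odd-looking `model_max_length` above is not hand-typed: it is `int(1e30)`, the sentinel `transformers` stores when a tokenizer has no real length limit, and the trailing digits are simply the nearest double-precision float to 10^30. A quick check:

```python
# transformers writes int(1e30) as the "no limit" sentinel for
# model_max_length; 1e30 is a float, so the integer carries the
# rounding error of the nearest IEEE-754 double to 10**30.
sentinel = int(1e30)
print(sentinel)       # 1000000000000000019884624838656
print(sentinel == 10**30)  # False: the float is not exactly 10**30
```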