parislo committed
Commit 97c86b4 · verified · 1 parent: 1c6f564

Upload tokenizer

Files changed (4)
  1. README.md +199 -0
  2. special_tokens_map.json +1 -0
  3. tokenizer.json +539 -0
  4. tokenizer_config.json +8 -0
README.md ADDED
@@ -0,0 +1,199 @@
+ ---
+ library_name: transformers
+ tags: []
+ ---
+
+ # Model Card for Model ID
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+
+
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
+
+ - **Developed by:** [More Information Needed]
+ - **Funded by [optional]:** [More Information Needed]
+ - **Shared by [optional]:** [More Information Needed]
+ - **Model type:** [More Information Needed]
+ - **Language(s) (NLP):** [More Information Needed]
+ - **License:** [More Information Needed]
+ - **Finetuned from model [optional]:** [More Information Needed]
+
+ ### Model Sources [optional]
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [More Information Needed]
+ - **Paper [optional]:** [More Information Needed]
+ - **Demo [optional]:** [More Information Needed]
+
+ ## Uses
+
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+
+ ### Direct Use
+
+ <!-- This section is for use of the model without fine-tuning or plugging into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Downstream Use [optional]
+
+ <!-- This section is for use of the model when fine-tuned for a task, or when plugged into a larger ecosystem/app. -->
+
+ [More Information Needed]
+
+ ### Out-of-Scope Use
+
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+
+ [More Information Needed]
+
+ ## Bias, Risks, and Limitations
+
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
+
+ [More Information Needed]
+
+ ### Recommendations
+
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+
+ Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ [More Information Needed]
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about, as well as documentation related to data pre-processing or additional filtering. -->
+
+ [More Information Needed]
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing [optional]
+
+ [More Information Needed]
+
+
+ #### Training Hyperparameters
+
+ - **Training regime:** [More Information Needed] <!-- fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+
+ #### Speeds, Sizes, Times [optional]
+
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+
+ [More Information Needed]
+
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Testing Data, Factors & Metrics
+
+ #### Testing Data
+
+ <!-- This should link to a Dataset Card if possible. -->
+
+ [More Information Needed]
+
+ #### Factors
+
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+
+ [More Information Needed]
+
+ #### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ [More Information Needed]
+
+ ### Results
+
+ [More Information Needed]
+
+ #### Summary
+
+
+
+ ## Model Examination [optional]
+
+ <!-- Relevant interpretability work for the model goes here. -->
+
+ [More Information Needed]
+
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly. -->
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** [More Information Needed]
+ - **Hours used:** [More Information Needed]
+ - **Cloud Provider:** [More Information Needed]
+ - **Compute Region:** [More Information Needed]
+ - **Carbon Emitted:** [More Information Needed]
+
+ ## Technical Specifications [optional]
+
+ ### Model Architecture and Objective
+
+ [More Information Needed]
+
+ ### Compute Infrastructure
+
+ [More Information Needed]
+
+ #### Hardware
+
+ [More Information Needed]
+
+ #### Software
+
+ [More Information Needed]
+
+ ## Citation [optional]
+
+ <!-- If there is a paper or blog post introducing the model, the APA and BibTeX information for it should go in this section. -->
+
+ **BibTeX:**
+
+ [More Information Needed]
+
+ **APA:**
+
+ [More Information Needed]
+
+ ## Glossary [optional]
+
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact
+
+ [More Information Needed]
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {}
tokenizer.json ADDED
@@ -0,0 +1,539 @@
+ {
+ "version": "1.0",
+ "truncation": null,
+ "padding": null,
+ "added_tokens": [],
+ "normalizer": null,
+ "pre_tokenizer": {
+ "type": "ByteLevel",
+ "add_prefix_space": false,
+ "trim_offsets": true,
+ "use_regex": true
+ },
+ "post_processor": {
+ "type": "ByteLevel",
+ "add_prefix_space": true,
+ "trim_offsets": false,
+ "use_regex": true
+ },
+ "decoder": {
+ "type": "ByteLevel",
+ "add_prefix_space": true,
+ "trim_offsets": true,
+ "use_regex": true
+ },
+ "model": {
+ "type": "BPE",
+ "dropout": null,
+ "unk_token": null,
+ "continuing_subword_prefix": null,
+ "end_of_word_suffix": null,
+ "fuse_unk": false,
+ "byte_fallback": false,
+ "ignore_merges": false,
+ "vocab": {
+ "!": 0,
+ "\"": 1,
+ "#": 2,
+ "$": 3,
+ "%": 4,
+ "&": 5,
+ "'": 6,
+ "(": 7,
+ ")": 8,
+ "*": 9,
+ "+": 10,
+ ",": 11,
+ "-": 12,
+ ".": 13,
+ "/": 14,
+ "0": 15,
+ "1": 16,
+ "2": 17,
+ "3": 18,
+ "4": 19,
+ "5": 20,
+ "6": 21,
+ "7": 22,
+ "8": 23,
+ "9": 24,
+ ":": 25,
+ ";": 26,
+ "<": 27,
+ "=": 28,
+ ">": 29,
+ "?": 30,
+ "@": 31,
+ "A": 32,
+ "B": 33,
+ "C": 34,
+ "D": 35,
+ "E": 36,
+ "F": 37,
+ "G": 38,
+ "H": 39,
+ "I": 40,
+ "J": 41,
+ "K": 42,
+ "L": 43,
+ "M": 44,
+ "N": 45,
+ "O": 46,
+ "P": 47,
+ "Q": 48,
+ "R": 49,
+ "S": 50,
+ "T": 51,
+ "U": 52,
+ "V": 53,
+ "W": 54,
+ "X": 55,
+ "Y": 56,
+ "Z": 57,
+ "[": 58,
+ "\\": 59,
+ "]": 60,
+ "^": 61,
+ "_": 62,
+ "`": 63,
+ "a": 64,
+ "b": 65,
+ "c": 66,
+ "d": 67,
+ "e": 68,
+ "f": 69,
+ "g": 70,
+ "h": 71,
+ "i": 72,
+ "j": 73,
+ "k": 74,
+ "l": 75,
+ "m": 76,
+ "n": 77,
+ "o": 78,
+ "p": 79,
+ "q": 80,
+ "r": 81,
+ "s": 82,
+ "t": 83,
+ "u": 84,
+ "v": 85,
+ "w": 86,
+ "x": 87,
+ "y": 88,
+ "z": 89,
+ "{": 90,
+ "|": 91,
+ "}": 92,
+ "~": 93,
+ "¡": 94,
+ "¢": 95,
+ "£": 96,
+ "¤": 97,
+ "¥": 98,
+ "¦": 99,
+ "§": 100,
+ "¨": 101,
+ "©": 102,
+ "ª": 103,
+ "«": 104,
+ "¬": 105,
+ "®": 106,
+ "¯": 107,
+ "°": 108,
+ "±": 109,
+ "²": 110,
+ "³": 111,
+ "´": 112,
+ "µ": 113,
+ "¶": 114,
+ "·": 115,
+ "¸": 116,
+ "¹": 117,
+ "º": 118,
+ "»": 119,
+ "¼": 120,
+ "½": 121,
+ "¾": 122,
+ "¿": 123,
+ "À": 124,
+ "Á": 125,
+ "Â": 126,
+ "Ã": 127,
+ "Ä": 128,
+ "Å": 129,
+ "Æ": 130,
+ "Ç": 131,
+ "È": 132,
+ "É": 133,
+ "Ê": 134,
+ "Ë": 135,
+ "Ì": 136,
+ "Í": 137,
+ "Î": 138,
+ "Ï": 139,
+ "Ð": 140,
+ "Ñ": 141,
+ "Ò": 142,
+ "Ó": 143,
+ "Ô": 144,
+ "Õ": 145,
+ "Ö": 146,
+ "×": 147,
+ "Ø": 148,
+ "Ù": 149,
+ "Ú": 150,
+ "Û": 151,
+ "Ü": 152,
+ "Ý": 153,
+ "Þ": 154,
+ "ß": 155,
+ "à": 156,
+ "á": 157,
+ "â": 158,
+ "ã": 159,
+ "ä": 160,
+ "å": 161,
+ "æ": 162,
+ "ç": 163,
+ "è": 164,
+ "é": 165,
+ "ê": 166,
+ "ë": 167,
+ "ì": 168,
+ "í": 169,
+ "î": 170,
+ "ï": 171,
+ "ð": 172,
+ "ñ": 173,
+ "ò": 174,
+ "ó": 175,
+ "ô": 176,
+ "õ": 177,
+ "ö": 178,
+ "÷": 179,
+ "ø": 180,
+ "ù": 181,
+ "ú": 182,
+ "û": 183,
+ "ü": 184,
+ "ý": 185,
+ "þ": 186,
+ "ÿ": 187,
+ "Ā": 188,
+ "ā": 189,
+ "Ă": 190,
+ "ă": 191,
+ "Ą": 192,
+ "ą": 193,
+ "Ć": 194,
+ "ć": 195,
+ "Ĉ": 196,
+ "ĉ": 197,
+ "Ċ": 198,
+ "ċ": 199,
+ "Č": 200,
+ "č": 201,
+ "Ď": 202,
+ "ď": 203,
+ "Đ": 204,
+ "đ": 205,
+ "Ē": 206,
+ "ē": 207,
+ "Ĕ": 208,
+ "ĕ": 209,
+ "Ė": 210,
+ "ė": 211,
+ "Ę": 212,
+ "ę": 213,
+ "Ě": 214,
+ "ě": 215,
+ "Ĝ": 216,
+ "ĝ": 217,
+ "Ğ": 218,
+ "ğ": 219,
+ "Ġ": 220,
+ "ġ": 221,
+ "Ģ": 222,
+ "ģ": 223,
+ "Ĥ": 224,
+ "ĥ": 225,
+ "Ħ": 226,
+ "ħ": 227,
+ "Ĩ": 228,
+ "ĩ": 229,
+ "Ī": 230,
+ "ī": 231,
+ "Ĭ": 232,
+ "ĭ": 233,
+ "Į": 234,
+ "į": 235,
+ "İ": 236,
+ "ı": 237,
+ "IJ": 238,
+ "ij": 239,
+ "Ĵ": 240,
+ "ĵ": 241,
+ "Ķ": 242,
+ "ķ": 243,
+ "ĸ": 244,
+ "Ĺ": 245,
+ "ĺ": 246,
+ "Ļ": 247,
+ "ļ": 248,
+ "Ľ": 249,
+ "ľ": 250,
+ "Ŀ": 251,
+ "ŀ": 252,
+ "Ł": 253,
+ "ł": 254,
+ "Ń": 255,
+ "th": 256,
+ "the": 257,
+ "Ġthe": 258,
+ "Ġi": 259,
+ "Ġa": 260,
+ "en": 261,
+ "re": 262,
+ "Ġo": 263,
+ "si": 264,
+ "Ġis": 265,
+ "al": 266,
+ "ri": 267,
+ "at": 268,
+ "es": 269,
+ "le": 270,
+ "on": 271,
+ "Ġf": 272,
+ "Ġof": 273,
+ "nd": 274,
+ "an": 275,
+ "he": 276,
+ "Ġb": 277,
+ "Ġc": 278,
+ "Ġe": 279,
+ "Ġs": 280,
+ "Ġt": 281,
+ "Eu": 282,
+ "ion": 283,
+ "la": 284,
+ "mu": 285,
+ "om": 286,
+ "or": 287,
+ "ore": 288,
+ "se": 289,
+ "ten": 290,
+ "ĠT": 291,
+ "Ġsi": 292,
+ "ĠEu": 293,
+ "Ġand": 294,
+ "Ġfu": 295,
+ "mula": 296,
+ "ormula": 297,
+ "Ġsid": 298,
+ "'s": 299,
+ "ag": 300,
+ "et": 301,
+ "hy": 302,
+ "po": 303,
+ "qu": 304,
+ "use": 305,
+ "Ġ-": 306,
+ "Ġ2": 307,
+ "Ġl": 308,
+ "Ġw": 309,
+ "ĠÎ": 310,
+ "Ġth": 311,
+ "Ġre": 312,
+ "ther": 313,
+ "Ġin": 314,
+ "eng": 315,
+ "ent": 316,
+ "Ġother": 317,
+ "rig": 318,
+ "ler": 319,
+ "Ġformula": 320,
+ "ĠThe": 321,
+ "ĠEuler": 322,
+ "Ġsides": 323,
+ "It": 324,
+ "Py": 325,
+ "am": 326,
+ "are": 327,
+ "ct": 328,
+ "gle": 329,
+ "hi": 330,
+ "ht": 331,
+ "in": 332,
+ "li": 333,
+ "lat": 334,
+ "nct": 335,
+ "ple": 336,
+ "ry": 337,
+ "um": 338,
+ "wo": 339,
+ "Ġ+": 340,
+ "Ġ=": 341,
+ "Ġg": 342,
+ "Ġn": 343,
+ "Ġhy": 344,
+ "Ġrig": 345,
+ "ĠIt": 346,
+ "ĠPy": 347,
+ "thag": 348,
+ "Ġan": 349,
+ "ndam": 350,
+ "here": 351,
+ "Ġcom": 352,
+ "Ġex": 353,
+ "Ġsqu": 354,
+ "Ġtwo": 355,
+ "ions": 356,
+ "omet": 357,
+ "orem": 358,
+ "orean": 359,
+ "tenuse": 360,
+ "Ġfunct": 361,
+ "Ġfundam": 362,
+ "potenuse": 363,
+ "Ġleng": 364,
+ "Ġwhere": 365,
+ "Ġthat": 366,
+ "Ġrelat": 367,
+ "ental": 368,
+ "plex": 369,
+ "Ġhypotenuse": 370,
+ "Ġright": 371,
+ "ĠPythag": 372,
+ "Ġcomplex": 373,
+ "Ġsquare": 374,
+ "Ġfundamental": 375,
+ "Ġlength": 376,
+ "ĠPythagorean": 377
+ },
+ "merges": [
+ "t h",
+ "th e",
+ "Ġ the",
+ "Ġ i",
+ "Ġ a",
+ "e n",
+ "r e",
+ "Ġ o",
+ "s i",
+ "Ġi s",
+ "a l",
+ "r i",
+ "a t",
+ "e s",
+ "l e",
+ "o n",
+ "Ġ f",
+ "Ġo f",
+ "n d",
+ "a n",
+ "h e",
+ "Ġ b",
+ "Ġ c",
+ "Ġ e",
+ "Ġ s",
+ "Ġ t",
+ "E u",
+ "i on",
+ "l a",
+ "m u",
+ "o m",
+ "o r",
+ "o re",
+ "s e",
+ "t en",
+ "Ġ T",
+ "Ġ si",
+ "Ġ Eu",
+ "Ġa nd",
+ "Ġf u",
+ "mu la",
+ "or mula",
+ "Ġsi d",
+ "' s",
+ "a g",
+ "e t",
+ "h y",
+ "p o",
+ "q u",
+ "u se",
+ "Ġ -",
+ "Ġ 2",
+ "Ġ l",
+ "Ġ w",
+ "Ġ Î",
+ "Ġ th",
+ "Ġ re",
+ "the r",
+ "Ġi n",
+ "en g",
+ "en t",
+ "Ġo ther",
+ "ri g",
+ "le r",
+ "Ġf ormula",
+ "ĠT he",
+ "ĠEu ler",
+ "Ġsid es",
+ "I t",
+ "P y",
+ "a m",
+ "a re",
+ "c t",
+ "g le",
+ "h i",
+ "h t",
+ "i n",
+ "l i",
+ "l at",
+ "n ct",
+ "p le",
+ "r y",
+ "u m",
+ "w o",
+ "Ġ +",
+ "Ġ =",
+ "Ġ g",
+ "Ġ n",
+ "Ġ hy",
+ "Ġ rig",
+ "Ġ It",
+ "Ġ Py",
+ "th ag",
+ "Ġa n",
+ "nd am",
+ "he re",
+ "Ġc om",
+ "Ġe x",
+ "Ġs qu",
+ "Ġt wo",
+ "ion s",
+ "om et",
+ "ore m",
+ "ore an",
+ "ten use",
+ "Ġfu nct",
+ "Ġfu ndam",
+ "po tenuse",
+ "Ġl eng",
+ "Ġw here",
+ "Ġth at",
+ "Ġre lat",
+ "ent al",
+ "ple x",
+ "Ġhy potenuse",
+ "Ġrig ht",
+ "ĠPy thag",
+ "Ġcom plex",
+ "Ġsqu are",
+ "Ġfundam ental",
+ "Ġleng th",
+ "ĠPythag orean"
+ ]
+ }
+ }
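Two details of the file above are worth unpacking. First, the vocab entries for ids 0–255 are the usual byte-level alphabet: every possible byte gets a printable stand-in character, which is why a leading space shows up as `Ġ` (id 220). Second, the `merges` list is applied greedily, lowest rank first, to build the larger tokens. A minimal stdlib-only sketch of both mechanisms (a reimplementation for illustration; the function names are my own, not from this repo):

```python
def bytes_to_unicode():
    """Map every byte 0-255 to a printable unicode character.

    Printable ASCII and most Latin-1 bytes map to themselves; the
    remaining bytes (control characters, space, etc.) are shifted to
    codepoints 256 and up so every byte has a visible stand-in.
    """
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


def bpe(word, merges):
    """Greedily apply a ranked merge list to a sequence of symbols."""
    ranks = {tuple(m.split()): i for i, m in enumerate(merges)}
    word = tuple(word)
    while len(word) > 1:
        # find the adjacent pair with the best (lowest) merge rank
        rank, i = min((ranks.get(p, float("inf")), j)
                      for j, p in enumerate(zip(word, word[1:])))
        if rank == float("inf"):
            break  # no applicable merge left
        word = word[:i] + (word[i] + word[i + 1],) + word[i + 2:]
    return word


print(bytes_to_unicode()[ord(" ")])       # space is shown as "Ġ"
# two merges taken verbatim from the file above
print(bpe("the", ["t h", "th e"]))        # ('the',)
```

With this mapping, `" the"` becomes `"Ġthe"` before BPE runs, which is exactly the form the merged tokens in the vocab take.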
tokenizer_config.json ADDED
@@ -0,0 +1,8 @@
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "added_tokens_decoder": {},
3
+ "clean_up_tokenization_spaces": true,
4
+ "merges_file": "./openwebmath_tokenizer/merges.txt",
5
+ "model_max_length": 1000000000000000019884624838656,
6
+ "tokenizer_class": "PreTrainedTokenizerFast",
7
+ "vocab_file": "./openwebmath_tokenizer/vocab.json"
8
+ }
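The odd-looking `model_max_length` above is not hand-typed: it is `int(1e30)`, the sentinel `transformers` stores when a tokenizer has no real length limit, and the trailing digits are simply the nearest double-precision float to 10^30. A quick check:

```python
# transformers writes int(1e30) as the "no limit" sentinel for
# model_max_length; 1e30 is a float, so the integer carries the
# rounding error of the nearest IEEE-754 double to 10**30.
sentinel = int(1e30)
print(sentinel)       # 1000000000000000019884624838656
print(sentinel == 10**30)  # False: the float is not exactly 10**30
```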