alea-institute committed
Commit 9660d3b · verified · Parent(s): 3deda4a

Upload README.md with huggingface_hub

Files changed (1): README.md +105 -149
README.md CHANGED
@@ -1,199 +1,155 @@
  ---
- library_name: transformers
- tags: []
  ---

- # Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->

  ## Model Details

- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]

  ---
+ library_name: tokenizers
+ tags: ['kl3m', 'kl3m-003', 'alea', 'legal', 'financial']
+ date: 2024-03-15
  ---

+ # kl3m-003-64k tokenizer

+ The `kl3m-003-64k` tokenizer is a domain-specific tokenizer trained on ~1.5T tokens of financial and legal text from sources primarily in English, German, Spanish, and French.

+ This tokenizer was used for the second generation of KL3M embedding and generative models, including
+ `kl3m-3.7B`, `kl3m-7B`, `kl3m-embedding-003`, and `kl3m-embedding-004`.

+ Please see `kl3m-001-32k` for the first iteration of our research on domain-specific tokenization.

  ## Model Details

+ ### Summary

+ - **Vocabulary:** 65,536
+ - **Tokenizer type:** BPE
+ - **Special token support:** Both causal and masked language modeling
+ - **Language(s) (NLP):** Primarily English, Spanish, German, French, with a small percentage of other EU languages.
+ - **Developed by:** Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai).
+ - **License:** [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)

+ ### Model Description

+ The `kl3m-003-64k` tokenizer is a domain-specific tokenizer trained on ~1.5T tokens of financial and legal text from primarily-English sources.

+ This tokenizer is notable for a number of reasons:

+ #### Domain Specific

+ As part of our research on more efficient small language model (SLM) training for the legal and financial domain, we
+ trained a domain-specific tokenizer on a large corpus of financial and legal text. This tokenizer
+ has not, for example, seen any common general pretraining sources like Wikipedia or Common Crawl.

+ #### Large Added Token Set

+ As part of our research on efficient and reliable extraction and generation, we inserted
+ a large number of deterministic "whole" tokens into the tokenizer, such as HTML tags
+ like `<span`, common Markdown elements like `#` and `##`, and legal enumerations like `(a)`.

+ **Note that the kl3m-003-64k tokenizer adds a number of citation formats that were not included
+ in the kl3m-001-32k tokenizer.** These were sourced primarily from empirical data and
+ the [Free Law Project's reporters-db](https://raw.githubusercontent.com/freelawproject/reporters-db/main/reporters_db/data/),
+ and were added to improve model behavior related to legal citations.

+ See the `get_custom_tokens` method in `kl3m_embeddings/training/kl3m_003/train_tokenizer.py` for
+ more details:

+ ```python
+ def get_custom_tokens(
+     include_whitespace: bool = True,
+     include_markdown: bool = True,
+     include_html: bool = True,
+     include_json: bool = True,
+     include_xml: bool = True,
+     include_years: bool = True,
+     include_citations: bool = True,
+     lowercase: bool = False,
+ ) -> list[str]:
+ ```
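
+ As a quick, hedged illustration (the token splits and IDs printed below are whatever the published vocabulary assigns; nothing here is hard-coded), you can inspect how these added "whole" tokens surface using the `tokenizers` library:

+ ```python
+ from tokenizers import Tokenizer

+ # Load the published tokenizer from the Hugging Face Hub.
+ tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-003-64k")

+ # Candidate strings the added token set should cover; each is expected to
+ # come back as a single token rather than several subword pieces.
+ for text in ["<span", "##", "(a)", "2010"]:
+     encoding = tokenizer.encode(text)
+     print(f"{text!r} -> {encoding.tokens} {encoding.ids}")
+ ```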

+ #### Space Preservation

+ Unlike `kl3m-001-32k`, we *do not* retain the space character as a token. This decision was made after adding
+ additional legal citation tokens to the vocabulary, which reduced the number of issues related to space tokenization
+ in legal text. As a result, the `kl3m-003-64k` tokenizer uses substantially fewer tokens than `kl3m-001-32k` for most text.

+ #### Special Tokens for both Embedding and Generative Models

+ For both training and inference efficiency, we intended this tokenizer vocabulary to be
+ usable for both embedding and generative models. As such, we included special tokens
+ suitable for both causal and masked language modeling tasks:

+ * `<|start|>`: `0`
+ * `<|end|>`: `1`
+ * `<|pad|>`: `2`
+ * `<|unk|>`: `3`
+ * `<|sep|>`: `4`
+ * `<|cls|>`: `5`
+ * `<|mask|>`: `6`

+ We also added a number of chat and instruction tokens that were not included in `kl3m-001-32k`, including:

+ * `<|system|>`: `7`
+ * `</|system|>`: `8`
+ * `<|user|>`: `9`
+ * `</|user|>`: `10`
+ * `<|instruction|>`: `11`
+ * `</|instruction|>`: `12`
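
+ As a hedged sketch only (the prompt template below is an assumption for illustration; the exact chat format used by the KL3M generative models is not documented in this card), these markers can be assembled as plain strings and encoded directly:

+ ```python
+ from tokenizers import Tokenizer

+ tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-003-64k")

+ # Hypothetical chat-style prompt built from the special tokens above;
+ # the real KL3M prompt format may differ.
+ prompt = (
+     "<|system|>You are a legal drafting assistant.</|system|>"
+     "<|user|>Define 'transfer date'.</|user|>"
+ )
+ encoding = tokenizer.encode(prompt)
+ print(encoding.tokens[:6])
+ ```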

+ ### Replication

+ The entire data collection and preprocessing pipeline is being made available, along with
+ training data, as part of the [ALEA Institute](https://aleainstitute.ai) [KL3M project](https://aleainstitute.ai/work/kl3m/).

+ The source code used to train the tokenizer is available on GitHub at:
+ [https://github.com/alea-institute/kl3m-embedding-research](https://github.com/alea-institute/kl3m-embedding-research)

+ The data pipeline will be available on GitHub and S3 in the near future.

+ ## Uses

+ This tokenizer is intended for English, Spanish, German, or French text in professional contexts
+ such as legal and financial documents.

+ ### Recommendations

+ In general, the `kl3m-003-64k` tokenizer is recommended over the original `kl3m-001-32k` tokenizer. The comparison
+ below tokenizes the same statutory passage with both tokenizers: `kl3m-001-32k` produces 147 tokens, while
+ `kl3m-003-64k` produces 70.

+ ```text
+ Original text: The Comptroller of the Currency shall have the same authority with respect to functions transferred to
+ the Comptroller of the Currency under the Enhancing Financial Institution Safety and Soundness Act of 2010 as was
+ vested in the Director of the Office of Thrift Supervision on the transfer date, as defined in section 311 of that
+ Act [12 U.S.C. 5411].

+ kl3m-001-32k
+ --------------------
+ Size: 147
+ Tokens: ['The', ' ', 'Comp', 'troller', ' ', 'of', ' ', 'the', ' ', 'C', 'urrency', ' ', 'shall', ' ', 'have', ' ', 'the', ' ', 'same', ' ', 'authority', ' ', 'with', ' ', 'respect', ' ', 'to', ' ', 'fun', 'ctions', ' ', 'transferred', ' ', 'to', '\n', ' ', 'the', ' ', 'Comp', 'troller', ' ', 'of', ' ', 'the', ' ', 'C', 'urrency', ' ', 'under', ' ', 'the', ' ', 'En', 'ha', 'ncing', ' ', 'Financial', ' ', 'Institution', ' ', 'Sa', 'fe', 'ty', ' ', 'a', 'n', 'd', ' ', 'S', 'ound', 'ness', ' ', 'Act', ' ', 'of', ' ', '2010', ' ', 'as', ' ', 'was', '\n', ' ', 'vested', ' ', 'i', 'n', ' ', 'the', ' ', 'Director', ' ', 'of', ' ', 'the', ' ', 'Office', ' ', 'of', ' ', 'Th', 'rift', ' ', 'Superv', 'ision', ' ', 'o', 'n', ' ', 'the', ' ', 'transfer', ' ', 'date', ',', ' ', 'as', ' ', 'defined', ' ', 'i', 'n', ' ', 'section', ' ', '311', ' ', 'of', ' ', 'that', '\n', ' ', 'Act', ' ', '[', '12', ' ', 'U', '.', 'S', '.', 'C', '.', ' ', '54', '11', '].']
+ IDs: [815, 31673, 3546, 14529, 31673, 269, 31673, 441, 31673, 41, 9646, 31673, 5516, 31673, 4130, 31673, 441, 31673, 8685, 31673, 14765, 31673, 1946, 31673, 12500, 31673, 265, 31673, 12122, 1935, 31673, 12677, 31673, 265, 31674, 31673, 441, 31673, 3546, 14529, 31673, 269, 31673, 441, 31673, 41, 9646, 31673, 2823, 31673, 441, 31673, 1871, 288, 2655, 31673, 20796, 31673, 29543, 31673, 4778, 362, 1004, 31673, 71, 84, 74, 31673, 57, 1098, 1647, 31673, 8494, 31673, 269, 31673, 3629, 31673, 310, 31673, 3182, 31674, 31673, 9761, 31673, 79, 84, 31673, 441, 31673, 21209, 31673, 269, 31673, 441, 31673, 8827, 31673, 269, 31673, 788, 11004, 31673, 28799, 873, 31673, 85, 84, 31673, 441, 31673, 12790, 31673, 2726, 18, 31673, 310, 31673, 10212, 31673, 79, 84, 31673, 3517, 31673, 15340, 31673, 269, 31673, 1704, 31674, 31673, 8494, 31673, 65, 534, 31673, 59, 20, 57, 20, 41, 20, 31673, 2195, 572, 5582]

+ kl3m-003-64k
+ --------------------
+ Size: 70
+ Tokens: ['The', 'ĠComptroller', 'Ġof', 'Ġthe', 'ĠCurrency', 'Ġshall', 'Ġhave', 'Ġthe', 'Ġsame', 'Ġauthority', 'Ġwith', 'Ġrespect', 'Ġto', 'Ġfunctions', 'Ġtransferred', 'Ġto', 'Ċ', 'Ġthe', 'ĠComptroller', 'Ġof', 'Ġthe', 'ĠCurrency', 'Ġunder', 'Ġthe', 'ĠEnh', 'ancing', 'ĠFinancial', 'ĠInstitution', 'ĠSafety', 'Ġand', 'Ġ', 'Sound', 'ness', 'ĠAct', 'Ġof', 'Ġ2010', 'Ġas', 'Ġwas', 'Ċ', 'Ġvested', 'Ġin', 'Ġthe', 'ĠDirector', 'Ġof', 'Ġthe', 'ĠOffice', 'Ġof', 'ĠThrift', 'ĠSupervision', 'Ġon', 'Ġthe', 'Ġtransfer', 'Ġdate', ',', 'Ġas', 'Ġdefined', 'Ġin', 'Ġsection', 'Ġ311', 'Ġof', 'Ġthat', 'Ċ', 'ĠAct', 'Ġ[', '12', 'Ġ', 'U.S.C.', 'Ġ54', '11', '].']
+ IDs: [671, 13273, 295, 281, 25922, 735, 704, 281, 1913, 2451, 440, 1894, 312, 5860, 7264, 312, 211, 281, 13273, 295, 281, 25922, 621, 281, 18926, 4406, 3195, 24448, 5617, 310, 233, 63589, 2130, 854, 295, 1611, 398, 725, 211, 11978, 300, 281, 2827, 295, 281, 1767, 295, 44029, 37141, 395, 281, 3696, 1548, 24, 398, 3011, 300, 782, 6590, 295, 407, 211, 854, 1327, 524, 233, 63761, 3789, 547, 8578]

+ ```
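
+ A minimal sketch for reproducing this comparison, assuming both tokenizers are published under the `alea-institute` namespace on the Hub:

+ ```python
+ from tokenizers import Tokenizer

+ TEXT = (
+     "The Comptroller of the Currency shall have the same authority with "
+     "respect to functions transferred to the Comptroller of the Currency"
+ )

+ # Compare token counts for the same passage across both tokenizers.
+ for name in ("alea-institute/kl3m-001-32k", "alea-institute/kl3m-003-64k"):
+     tok = Tokenizer.from_pretrained(name)
+     enc = tok.encode(TEXT)
+     print(f"{name}: {len(enc.ids)} tokens")
+ ```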

+ ## How to Get Started with the Model

+ Use the code below to get started with the tokenizer.

+ ```python
+ from tokenizers import Tokenizer

+ tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-003-64k')
+ ```
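
+ For example, a short encode/decode round trip (the printed values are illustrative, not verbatim output):

+ ```python
+ encoding = tokenizer.encode("Defined in section 311 of that Act [12 U.S.C. 5411].")
+ print(len(encoding.ids))               # token count for the passage
+ print(encoding.tokens[:5])             # first few tokens
+ print(tokenizer.decode(encoding.ids))  # round-trip back to text
+ ```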

+ ## Citation

+ Tokenizer and dataset publications are pending.

+ ## Contact

+ For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]) or
+ create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-embedding-research).

+ ![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)