alea-institute committed
Commit 8829ee9 · verified · 1 Parent(s): 6136dc8

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,177 @@
---
language:
- en
library_name: transformers
license: cc-by-4.0
tags:
- kl3m
- kl3m-002
- legal
- financial
- enterprise
- slm
date: '2024-02-20T00:00:00.000Z'
pipeline_tag: text-generation
widget:
- text: "Medical devices are regulated by"
- temperature: 0.3
- do_sample: True
---

# kl3m-002-520m (Draft) Model

**This model was part of our scale-up efforts to build `kl3m-003-3.7b`, another Mixtral-architecture model. We are
making this model public for historical reference and research, but you should probably consider using other models
for production purposes.**

kl3m-002-520m is a (very) small language model (SLM) trained on clean, legally permissible data. Originally
developed by [273 Ventures](https://273ventures.com) and donated to the [ALEA Institute](https://aleainstitute.ai),
kl3m-002-520m was the first LLM to obtain the [Fairly Trained L-Certification](https://www.fairlytrained.org/certifications)
for its ethical training data and practices. The model is designed for legal, regulatory, and financial workflows,
with a focus on low toxicity and high efficiency.

Given its small size and lack of instruction-aligned training data, kl3m-002-520m is best suited either for
SLM fine-tuning or as a component in training larger models without using unethical data or models.

The model was originally trained between November 2023 and January 2024 on a 12xRTX4090 node using DDP. A similar model is
being provided with complete source and data replication as part of the `kl3m-004` family, to be released in Q4 2024.

## Source

[https://github.com/alea-institute/kl3m-model-research](https://github.com/alea-institute/kl3m-model-research)


## Training Data

While the original training data collection and training infrastructure rely on software that was not donated by
273 Ventures, the ALEA Institute is open-sourcing an improved dataset, including both replication code and an API.

[https://github.com/alea-institute/kl3m-data](https://github.com/alea-institute/kl3m-data)

At this time, data is available upon request via S3 under a Requester Pays model. We are actively working on a
zero-cost distribution model as soon as we can obtain additional support.

This model, the original `kl3m-002-520m` model, was trained on a US-only subset of the Kelvin Legal DataPack that
we believe is 100% public domain material. However, to ensure maximum transparency for all
downstream users in the event of any future determination otherwise, we are licensing this model under CC-BY 4.0.

## Model Details

### Summary

- **Architecture**: Mixtral (`num_local_experts=4, num_experts_per_tok=2`)
- **Parameters**: 520 million
- **Context Window**: 1,024 tokens (`sliding_window=256`)
- **Language(s)**: Primarily English
- **Tokenizer**: kl3m-001-32k BPE tokenizer (32,768 vocabulary size with unorthodox whitespace handling)
- **Developed by**: Originally by [273 Ventures LLC](https://273ventures.com), donated to [ALEA Institute](https://aleainstitute.ai)
- **License**: [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
- **Hardware Requirements**: Runs in real time in fp32 on CPU or Apple M1-class hardware

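The architecture values above can be cross-checked against the `config.json` shipped in this commit. Below is a minimal sketch (assuming `transformers` is installed and the published model ID `alea-institute/kl3m-002-520m` is reachable on the Hugging Face Hub) that inspects the configuration without downloading the full weights:

```python
from transformers import AutoConfig

# Fetch and parse only the model configuration (see config.json later in this commit).
config = AutoConfig.from_pretrained("alea-institute/kl3m-002-520m")

# Mixture-of-experts and context settings listed in the summary above.
print(config.model_type)               # "mixtral"
print(config.num_local_experts)        # 4
print(config.num_experts_per_tok)      # 2
print(config.max_position_embeddings)  # 1024
print(config.sliding_window)           # 256
print(config.vocab_size)               # 32768
```
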
## Performance Metrics

N/A

## Key Features

- **Clean Training Data**: Built on what was originally referred to as the Kelvin Legal DataPack, ensuring all training data is ethically sourced and legally permissible.
- **Low Toxicity**: [Empirically lower toxicity and bias](https://github.com/alea-institute/kl3m-toxicity).
- **Enterprise Focus**: Specifically designed for legal, regulatory, and financial workflows.
- **Efficient Deployment**: Optimized for real-time inference on consumer hardware.

## Use Cases

- Basic regulatory question answering
- Contract provision drafting
- Structured JSON information extraction
- Foundation for downstream optimization
- Base model for domain-specific fine-tuning (see the sketch below)

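The last use case above, domain-specific fine-tuning, is sketched below using the Hugging Face `Trainer` for standard causal-language-modeling fine-tuning. This is only an illustrative sketch: the `train_texts` examples, output directory, and hyperparameters are hypothetical placeholders, not values published with this model.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the base model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-520m")
model = AutoModelForCausalLM.from_pretrained("alea-institute/kl3m-002-520m")

# Hypothetical in-domain documents; replace with a real dataset.
train_texts = [
    "This Agreement shall be governed by the laws of the State of Delaware.",
    "The Secretary shall promulgate regulations to carry out this section.",
]
train_dataset = [tokenizer(text, truncation=True, max_length=1024) for text in train_texts]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="kl3m-002-520m-finetuned", num_train_epochs=1),
    train_dataset=train_dataset,
    # mlm=False gives standard next-token-prediction labels for causal LM training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
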
## Getting Started

```python
import json
from transformers import pipeline

# Load the model and tokenizer
p = pipeline('text-generation', 'alea-institute/kl3m-002-520m', device='cpu')

# Example usage on CPU
text = "Under this"
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.5, num_return_sequences=3, max_new_tokens=32)
        ],
        indent=2
    )
)
```

```json
[
  "Under this rule, the operator of a vessel in the Gulf reef fish fishery ",
  "Under this proposed rule, the Department is proposing to amend the regulations in \u00a7\u00a7\u200951.2 ",
  "Under this proposed rule, CBP would need to collect information from all entities to perform the necessary"
]
```

## Contract Example

```python
# Reuses the `json` import and the `p` pipeline created in the example above.
text = "Governing Law."
print(
    json.dumps(
        [
            r.get("generated_text")
            for r in p(text, do_sample=True, temperature=0.5, num_return_sequences=3, max_new_tokens=32)
        ],
        indent=2
    )
)
```

```json
[
  "Governing Law.\n (a) No provision of this Agreement shall be interpreted or construed to confer ",
  "Governing Law.\nThe law of the United States shall be interpreted and enforced in accordance",
  "Governing Law.\n (a) The validity of any contract or agreement to which the \nUnited States is "
]
```

## Technical Implementation

The model implements several techniques during training:

- Hybrid next-token prediction (NTP) and supervised fine-tuning (SFT) co-training
- Dynamic, document-aware segmentation (see the sketch below)
- Randomized padding
- Traditional fixed-attention mechanisms

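The original training pipeline was not released with this checkpoint, so the sketch below is only an illustration (not the KL3M implementation) of the document-aware segmentation and randomized padding ideas listed above: each document is tokenized and split on its own boundaries rather than packed across documents, and a random amount of padding is appended so segments are not all exactly 1,024 tokens. The `documents` list and `segment_document` helper are hypothetical.

```python
import random
from transformers import AutoTokenizer

# Illustrative sketch only -- not the original KL3M training code.
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-520m")
MAX_LENGTH = 1024  # matches max_position_embeddings in config.json


def segment_document(text):
    """Split one document into segments of at most MAX_LENGTH tokens,
    never mixing tokens from different documents in a single segment."""
    token_ids = tokenizer.encode(text)
    segments = []
    for start in range(0, len(token_ids), MAX_LENGTH):
        segment = token_ids[start:start + MAX_LENGTH]
        # Randomized padding: pad by a random amount up to the remaining room,
        # instead of always padding out to the full context window.
        room = MAX_LENGTH - len(segment)
        if room > 0:
            segment = segment + [tokenizer.pad_token_id] * random.randint(0, room)
        segments.append(segment)
    return segments


# Hypothetical example documents.
documents = ["Section 1. Definitions. ...", "Governing Law. This Agreement ..."]
training_segments = [seg for doc in documents for seg in segment_document(doc)]
```
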
## License

This model was originally developed by 273 Ventures and has been donated to the ALEA Institute.

The model weights are released under the CC-BY 4.0 License.

## Contact

The KL3M model family is now maintained by the [ALEA Institute](https://aleainstitute.ai). For technical support, collaboration opportunities, or general inquiries, please reach out through any of the channels below or create an issue on this repository:

- GitHub: https://github.com/alea-institute/kl3m-model-research
- Email: [email protected]
- Website: https://aleainstitute.ai

## Acknowledgments

Special thanks to 273 Ventures for developing and donating this model to the open-source community through the ALEA Institute.

## Citation

Tokenizer, dataset, and model publications are pending.

![https://aleainstitute.ai](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)
config.json ADDED
@@ -0,0 +1,31 @@
{
  "_name_or_path": "kl3m-002-520m",
  "architectures": [
    "MixtralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 0,
  "eos_token_id": 1,
  "hidden_act": "silu",
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 2048,
  "max_position_embeddings": 1024,
  "model_type": "mixtral",
  "num_attention_heads": 16,
  "num_experts_per_tok": 2,
  "num_hidden_layers": 16,
  "num_key_value_heads": 8,
  "num_local_experts": 4,
  "output_router_logits": false,
  "pad_token_id": 2,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000,
  "router_aux_loss_coef": 0.001,
  "sliding_window": 256,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.38.0.dev0",
  "use_cache": false,
  "vocab_size": 32768
}
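As a sanity check, the hyperparameters above are enough to rebuild the architecture and confirm the roughly 520 million parameters reported in the model card. This is a minimal sketch (assuming a `transformers` release with Mixtral support); it creates a randomly initialized model and does not load the released weights:

```python
from transformers import MixtralConfig, MixtralForCausalLM

# Mirror the key values from config.json above.
config = MixtralConfig(
    hidden_size=1024,
    intermediate_size=2048,
    num_hidden_layers=16,
    num_attention_heads=16,
    num_key_value_heads=8,
    num_local_experts=4,
    num_experts_per_tok=2,
    max_position_embeddings=1024,
    sliding_window=256,
    vocab_size=32768,
)

# Randomly initialized model used only to count parameters (no checkpoint download).
model = MixtralForCausalLM(config)
print(f"{model.num_parameters():,}")  # approximately 520 million
```
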
generation_config.json ADDED
@@ -0,0 +1,8 @@
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 1,
  "pad_token_id": 2,
  "transformers_version": "4.38.0.dev0",
  "use_cache": false
}
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ae708023097727ef0a2dbe766175c255d4b9c2de13b81e1ebe1fe7cdb7fb6744
size 2080878870
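The entry above is a Git LFS pointer: the actual fp32 checkpoint (about 2.08 GB) is stored via LFS rather than in the git history. If only the weight file is needed, a minimal sketch using `huggingface_hub` (the same library used for this upload) is:

```python
from huggingface_hub import hf_hub_download

# Resolves the LFS pointer and downloads the checkpoint into the local cache.
path = hf_hub_download(
    repo_id="alea-institute/kl3m-002-520m",
    filename="pytorch_model.bin",
)
print(path)
```
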
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
{
  "bos_token": {
    "content": "<|start|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|pad|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<|unk|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
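These special tokens belong to the kl3m-001-32k tokenizer bundled in this repository. A minimal sketch (assuming `transformers` is installed) for loading the tokenizer and confirming them; the expected IDs in the comments follow `config.json` above:

```python
from transformers import AutoTokenizer

# Loads tokenizer.json, tokenizer_config.json, and special_tokens_map.json from this repo.
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-002-520m")

print(tokenizer.bos_token, tokenizer.bos_token_id)  # <|start|>, expected id 0
print(tokenizer.eos_token, tokenizer.eos_token_id)  # <|end|>, expected id 1
print(tokenizer.pad_token, tokenizer.pad_token_id)  # <|pad|>, expected id 2
print(len(tokenizer))                               # expected 32768 (vocab_size)
```
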
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff