Upload folder using huggingface_hub

#1
by sharpenb - opened
README.md ADDED
@@ -0,0 +1,85 @@
1
+ ---
2
+ thumbnail: "https://assets-global.website-files.com/646b351987a8d8ce158d1940/64ec9e96b4334c0e1ac41504_Logo%20with%20white%20text.svg"
3
+ base_model: hywu/Camelidae-8x7B
4
+ metrics:
5
+ - memory_disk
6
+ - memory_inference
7
+ - inference_latency
8
+ - inference_throughput
9
+ - inference_CO2_emissions
10
+ - inference_energy_consumption
11
+ tags:
12
+ - pruna-ai
13
+ ---
14
+ <!-- header start -->
15
+ <!-- 200823 -->
16
+ <div style="width: auto; margin-left: auto; margin-right: auto">
17
+ <a href="https://www.pruna.ai/" target="_blank" rel="noopener noreferrer">
18
+ <img src="https://i.imgur.com/eDAlcgk.png" alt="PrunaAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
19
+ </a>
20
+ </div>
21
+ <!-- header end -->
22
+
23
+ [![Twitter](https://img.shields.io/twitter/follow/PrunaAI?style=social)](https://twitter.com/PrunaAI)
24
+ [![GitHub](https://img.shields.io/github/followers/PrunaAI?label=Follow%20%40PrunaAI&style=social)](https://github.com/PrunaAI)
25
+ [![LinkedIn](https://img.shields.io/badge/LinkedIn-Connect-blue)](https://www.linkedin.com/company/93832878/admin/feed/posts/?feedType=following)
26
+ [![Discord](https://img.shields.io/badge/Discord-Join%20Us-blue?style=social&logo=discord)](https://discord.gg/CP4VSgck)
27
+
28
+ # Simply make AI models cheaper, smaller, faster, and greener!
29
+
30
+ - Give a thumbs up if you like this model!
31
+ - Contact us and tell us which model to compress next [here](https://www.pruna.ai/contact).
32
+ - Request access to easily compress your *own* AI models [here](https://z0halsaff74.typeform.com/pruna-access?typeform-source=www.pruna.ai).
33
+ - Read the documentation to learn more [here](https://pruna-ai-pruna.readthedocs-hosted.com/en/latest/).
34
+ - Join the Pruna AI community on Discord [here](https://discord.gg/CP4VSgck) to share feedback and suggestions or to get help.
35
+
36
+ ## Results
37
+
38
+ ![image info](./plots.png)
39
+
40
+ **Frequently Asked Questions**
41
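+ - ***How does the compression work?*** The model is compressed with bitsandbytes 4-bit quantization (`load_in_4bit` with the `fp4` quant type, as recorded under `quantization_config` in `config.json`).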
42
+ - ***How does the model quality change?*** The quality of the model output might vary compared to the base model.
43
+ - ***How is the model efficiency evaluated?*** These results were obtained on HARDWARE_NAME with the configuration described in `model/smash_config.json`, after a hardware warmup. The smashed model is compared directly to the original base model. Efficiency results may vary in other settings (e.g. other hardware, image size, batch size, ...). We recommend running the benchmarks under your own use-case conditions to see whether the smashed model benefits you.
44
+ - ***What is the model format?*** We use safetensors.
45
+ - ***What calibration data has been used?*** If needed by the compression method, we used WikiText as the calibration data.
46
+ - ***What is the naming convention for Pruna Hugging Face models?*** We take the original model name and append "turbo", "tiny", or "green" if the smashed model's measured inference speed, inference memory, or inference energy consumption, respectively, is less than 90% of that of the original base model.
47
+ - ***How to compress my own models?*** You can request premium access to more compression methods and tech support for your specific use-cases [here](https://z0halsaff74.typeform.com/pruna-access?typeform-source=www.pruna.ai).
48
+ - ***What are "first" metrics?*** Results mentioning "first" are obtained after the first run of the model. The first run might take more memory or be slower than subsequent runs due to CUDA overheads.
49
+ - ***What are "Sync" and "Async" metrics?*** "Sync" metrics are obtained by syncing all GPU processes and stopping the measurement once all of them have finished. "Async" metrics are obtained without syncing all GPU processes; the measurement stops as soon as the model output can be used by the CPU. We provide both metrics since either could be relevant depending on the use case. We recommend testing the efficiency gains directly in your use cases.
50
+
51
+ ## Setup
52
+
53
+ You can run the smashed model with these steps:
54
+
55
+ 0. Check that the requirements of the original repo hywu/Camelidae-8x7B are installed. In particular, check the python, cuda, and transformers versions.
56
+ 1. Make sure that you have installed the quantization-related packages.
57
+ ```bash
58
+ pip install transformers accelerate "bitsandbytes>0.37.0"
59
+ ```
60
+ 2. Load & run the model.
61
+ ```python
62
+ from transformers import AutoModelForCausalLM, AutoTokenizer
63
+
64
+
65
+ model = AutoModelForCausalLM.from_pretrained("PrunaAI/hywu-Camelidae-8x7B-bnb-4bit-smashed", trust_remote_code=True, device_map='auto')
66
+ tokenizer = AutoTokenizer.from_pretrained("hywu/Camelidae-8x7B")
67
+
68
+ input_ids = tokenizer("What is the color of prunes?", return_tensors='pt').to(model.device)["input_ids"]
69
+
70
+ outputs = model.generate(input_ids, max_new_tokens=216)
71
+ tokenizer.decode(outputs[0])
72
+ ```
73
+
74
+ ## Configurations
75
+
76
+ The configuration details are in `smash_config.json`.
77
+
78
+ ## Credits & License
79
+
80
+ The license of the smashed model follows the license of the original model. Please check the license of the original model, hywu/Camelidae-8x7B, which provided the base model, before using this model. The license of the `pruna-engine` is [here](https://pypi.org/project/pruna-engine/) on PyPI.
81
+
82
+ ## Want to compress other models?
83
+
84
+ - Contact us and tell us which model to compress next [here](https://www.pruna.ai/contact).
85
+ - Request access to easily compress your own AI models [here](https://z0halsaff74.typeform.com/pruna-access?typeform-source=www.pruna.ai).
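Review note on the naming convention described in the README FAQ: the rule can be sketched as a small function. This is only an illustration. The function name, the metric keys, and the reading of "inference speed" as latency (lower is better) are assumptions, not part of the PR or of any Pruna API.

```python
# Hypothetical helper illustrating the README naming rule.
# Assumption: every metric is "lower is better", so "speed" is read as latency.
SUFFIX_RULES = (
    ("turbo", "latency_s"),   # inference speed, measured here as latency
    ("tiny", "memory_mb"),    # inference memory
    ("green", "energy_j"),    # inference energy consumption
)


def earned_suffixes(base, smashed, threshold=0.9):
    """Return the suffixes whose metric is below `threshold` times the base value."""
    return [
        suffix
        for suffix, key in SUFFIX_RULES
        if smashed[key] < threshold * base[key]
    ]


# Made-up numbers, for illustration only.
base = {"latency_s": 1.0, "memory_mb": 1000, "energy_j": 50}
smashed = {"latency_s": 0.95, "memory_mb": 400, "energy_j": 44}
print(earned_suffixes(base, smashed))  # ['tiny', 'green']
```

With these made-up numbers the latency only improves by 5%, so "turbo" is not earned, while memory and energy both fall under the 90% threshold.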
config.json ADDED
@@ -0,0 +1,53 @@
1
+ {
2
+ "_name_or_path": "/ceph/hdd/staff/charpent/.cache/models1trx2z9_cod73un1",
3
+ "adapter_dim": 512,
4
+ "architectures": [
5
+ "LlamaForCausalLM"
6
+ ],
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_camelidae.CamelidaeConfig",
9
+ "AutoModel": "hywu/Camelidae-8x7B--modeling_camelidae.LlamaModel",
10
+ "AutoModelForCausalLM": "modeling_camelidae.LlamaForCausalLM"
11
+ },
12
+ "bos_token_id": 1,
13
+ "eos_token_id": 2,
14
+ "hidden_act": "silu",
15
+ "hidden_size": 4096,
16
+ "initializer_range": 0.02,
17
+ "intermediate_size": 11008,
18
+ "max_position_embeddings": 4096,
19
+ "model_type": "llama",
20
+ "moe_dtype": "bfloat16",
21
+ "moe_scaling": 0.25,
22
+ "num_attention_heads": 32,
23
+ "num_experts": 8,
24
+ "num_hidden_layers": 32,
25
+ "num_key_value_heads": 32,
26
+ "output_router_logits": false,
27
+ "pretraining_tp": 1,
28
+ "quantization_config": {
29
+ "_load_in_4bit": true,
30
+ "_load_in_8bit": false,
31
+ "bnb_4bit_compute_dtype": "bfloat16",
32
+ "bnb_4bit_quant_storage": "uint8",
33
+ "bnb_4bit_quant_type": "fp4",
34
+ "bnb_4bit_use_double_quant": false,
35
+ "llm_int8_enable_fp32_cpu_offload": false,
36
+ "llm_int8_has_fp16_weight": false,
37
+ "llm_int8_skip_modules": [
38
+ "lm_head"
39
+ ],
40
+ "llm_int8_threshold": 6.0,
41
+ "load_in_4bit": true,
42
+ "load_in_8bit": false,
43
+ "quant_method": "bitsandbytes"
44
+ },
45
+ "rms_norm_eps": 1e-05,
46
+ "rope_scaling": null,
47
+ "tie_word_embeddings": false,
48
+ "topk": 2,
49
+ "torch_dtype": "float16",
50
+ "transformers_version": "4.40.0",
51
+ "use_cache": true,
52
+ "vocab_size": 32000
53
+ }
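Review note on `config.json`: a few derived quantities can be read off the values above. The sketch below uses only the standard library and a hand-copied subset of the file. The `summarize` helper is hypothetical, the head dimension uses the usual `hidden_size // num_attention_heads` convention, and the MHA/MQA/GQA rule follows the `num_key_value_heads` docstring in `configuration_camelidae.py`.

```python
import json

# Subset of the config.json added in this PR, copied by hand.
CONFIG_SNIPPET = """
{
  "hidden_size": 4096,
  "num_attention_heads": 32,
  "num_key_value_heads": 32,
  "num_experts": 8,
  "topk": 2,
  "quantization_config": {
    "load_in_4bit": true,
    "load_in_8bit": false,
    "bnb_4bit_quant_type": "fp4"
  }
}
"""


def summarize(config):
    """Hypothetical helper: derive a few quantities from the config dict."""
    heads = config["num_attention_heads"]
    kv_heads = config["num_key_value_heads"]
    if kv_heads == heads:
        attention = "MHA"
    elif kv_heads == 1:
        attention = "MQA"
    else:
        attention = "GQA"

    quant = config.get("quantization_config", {})
    if quant.get("load_in_4bit"):
        weight_bits = 4
    elif quant.get("load_in_8bit"):
        weight_bits = 8
    else:
        weight_bits = None

    return {
        "head_dim": config["hidden_size"] // heads,
        "attention": attention,
        "experts_per_token": f'{config["topk"]}/{config["num_experts"]}',
        "weight_bits": weight_bits,
    }


summary = summarize(json.loads(CONFIG_SNIPPET))
print(summary)
```

The summary reports 4-bit weights, which matches the `bnb-4bit` part of the repository name used in the README loading snippet.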
configuration_camelidae.py ADDED
@@ -0,0 +1,207 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ LLaMA model configuration"""
21
+
22
+ from transformers.configuration_utils import PretrainedConfig
23
+ from transformers.utils import logging
24
+
25
+
26
+ logger = logging.get_logger(__name__)
27
+
28
+ LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
29
+
30
+
31
+ class CamelidaeConfig(PretrainedConfig):
32
+ r"""
33
+ This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
34
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
35
+ defaults will yield a similar configuration to that of the LLaMA-7B.
36
+
37
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
38
+ documentation from [`PretrainedConfig`] for more information.
39
+
40
+
41
+ Args:
42
+ vocab_size (`int`, *optional*, defaults to 32000):
43
+ Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
44
+ `inputs_ids` passed when calling [`LlamaModel`]
45
+ hidden_size (`int`, *optional*, defaults to 4096):
46
+ Dimension of the hidden representations.
47
+ intermediate_size (`int`, *optional*, defaults to 11008):
48
+ Dimension of the MLP representations.
49
+ num_hidden_layers (`int`, *optional*, defaults to 32):
50
+ Number of hidden layers in the Transformer encoder.
51
+ num_attention_heads (`int`, *optional*, defaults to 32):
52
+ Number of attention heads for each attention layer in the Transformer encoder.
53
+ num_key_value_heads (`int`, *optional*):
54
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
55
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
56
+ `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
57
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
58
+ by mean-pooling all the original heads within that group. For more details, check out [this
59
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
60
+ `num_attention_heads`.
61
+ pretraining_tp (`int`, *optional*, defaults to `1`):
62
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
63
+ document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
64
+ necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
65
+ issue](https://github.com/pytorch/pytorch/issues/76232).
66
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
67
+ The non-linear activation function (function or string) in the decoder.
68
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
69
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
70
+ just in case (e.g., 512 or 1024 or 2048).
71
+ initializer_range (`float`, *optional*, defaults to 0.02):
72
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
73
+ rms_norm_eps (`float`, *optional*, defaults to 1e-6):
74
+ The epsilon used by the rms normalization layers.
75
+ use_cache (`bool`, *optional*, defaults to `True`):
76
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
77
+ relevant if `config.is_decoder=True`.
78
+ tie_word_embeddings(`bool`, *optional*, defaults to `False`):
79
+ Whether to tie weight embeddings
80
+ rope_scaling (`Dict`, *optional*):
81
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
82
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format
83
+ is `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
84
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
85
+ these scaling strategies behave:
86
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
87
+ experimental feature, subject to breaking API changes in future versions.
88
+ moe_dtype (`str`, *optional*, defaults to `"bfloat16"`):
89
+ The `dtype` used for the moe layers. It is preferable to keep the `dtype` to `"bfloat16"`
90
+ moe_scaling (`float`, *optional*, defaults to 0.25):
91
+ The scaling factor of the experts.
92
+ num_experts (`int`, *optional*, defaults to 8):
93
+ The number of MoE experts.
94
+ topk (`int`, *optional*, defaults to 2):
95
+ The number of experts to route per token; can also be interpreted as the `top-k` routing
96
+ parameter
97
+ output_router_logits (`bool`, *optional*, defaults to `False`):
98
+ Whether or not the router logits should be returned by the model. Enabling this will also
99
+ allow the model to output the auxiliary loss.
100
+ adapter_dim (`int`, *optional*, defaults to 64):
101
+ Dimension of the adapter.
102
+ Example:
103
+
104
+ ```python
105
+ >>> from transformers import CamelidaeModel, CamelidaeConfig
106
+
107
+ >>> # Initializing a Camelidae camelidae-7b style configuration
108
+ >>> configuration = CamelidaeConfig()
109
+
110
+ >>> # Initializing a model from the camelidae-7b style configuration
111
+ >>> model = CamelidaeModel(configuration)
112
+
113
+ >>> # Accessing the model configuration
114
+ >>> configuration = model.config
115
+ ```"""
116
+ model_type = "llama"
117
+ keys_to_ignore_at_inference = ["past_key_values"]
118
+
119
+ def __init__(
120
+ self,
121
+ vocab_size=32000,
122
+ hidden_size=4096,
123
+ intermediate_size=11008,
124
+ num_hidden_layers=32,
125
+ num_attention_heads=32,
126
+ num_key_value_heads=None,
127
+ hidden_act="silu",
128
+ max_position_embeddings=2048,
129
+ initializer_range=0.02,
130
+ rms_norm_eps=1e-6,
131
+ use_cache=True,
132
+ pad_token_id=None,
133
+ bos_token_id=1,
134
+ eos_token_id=2,
135
+ pretraining_tp=1,
136
+ tie_word_embeddings=False,
137
+ rope_scaling=None,
138
+ moe_dtype="bfloat16",
139
+ moe_scaling=0.25,
140
+ num_experts=8,
141
+ topk=2,
142
+ output_router_logits=False,
143
+ adapter_dim=64,
144
+ **kwargs,
145
+ ):
146
+ self.vocab_size = vocab_size
147
+ self.max_position_embeddings = max_position_embeddings
148
+ self.hidden_size = hidden_size
149
+ self.intermediate_size = intermediate_size
150
+ self.num_hidden_layers = num_hidden_layers
151
+ self.num_attention_heads = num_attention_heads
152
+
153
+ # for backward compatibility
154
+ if num_key_value_heads is None:
155
+ num_key_value_heads = num_attention_heads
156
+
157
+ self.num_key_value_heads = num_key_value_heads
158
+ self.hidden_act = hidden_act
159
+ self.initializer_range = initializer_range
160
+ self.rms_norm_eps = rms_norm_eps
161
+ self.pretraining_tp = pretraining_tp
162
+ self.use_cache = use_cache
163
+ self.rope_scaling = rope_scaling
164
+ self._rope_scaling_validation()
165
+
166
+ self.moe_dtype = moe_dtype
167
+ self.moe_scaling = moe_scaling
168
+ self.num_experts = num_experts
169
+ self.topk = topk
170
+ self.output_router_logits = output_router_logits
171
+
172
+ self.adapter_dim = adapter_dim
173
+
174
+ super().__init__(
175
+ pad_token_id=pad_token_id,
176
+ bos_token_id=bos_token_id,
177
+ eos_token_id=eos_token_id,
178
+ tie_word_embeddings=tie_word_embeddings,
179
+ **kwargs,
180
+ )
181
+
182
+ def _rope_scaling_validation(self):
183
+ """
184
+ Validate the `rope_scaling` configuration.
185
+ """
186
+ if self.rope_scaling is None:
187
+ return
188
+
189
+ if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
190
+ raise ValueError(
191
+ "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
192
+ f"got {self.rope_scaling}"
193
+ )
194
+ rope_scaling_type = self.rope_scaling.get("type", None)
195
+ rope_scaling_factor = self.rope_scaling.get("factor", None)
196
+ if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
197
+ raise ValueError(
198
+ f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
199
+ )
200
+ if (
201
+ rope_scaling_factor is None
202
+ or not isinstance(rope_scaling_factor, float)
203
+ or rope_scaling_factor <= 1.0
204
+ ):
205
+ raise ValueError(
206
+ f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}"
207
+ )
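Review note on `_rope_scaling_validation`: the checks can be exercised without `transformers` by restating them as a standalone function. The sketch below mirrors the logic of the method in this file. The names `validate_rope_scaling` and `is_valid` are hypothetical and the error messages are shortened.

```python
def validate_rope_scaling(rope_scaling):
    """Standalone restatement of the checks in `_rope_scaling_validation`."""
    if rope_scaling is None:
        return  # no RoPE scaling configured, nothing to validate

    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(f"expected a dict with `type` and `factor`, got {rope_scaling}")

    scaling_type = rope_scaling.get("type")
    factor = rope_scaling.get("factor")

    if scaling_type not in ("linear", "dynamic"):
        raise ValueError(f"`type` must be 'linear' or 'dynamic', got {scaling_type}")

    # The isinstance check is strict: an int such as 2 is rejected.
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError(f"`factor` must be a float > 1, got {factor}")


def is_valid(rope_scaling):
    """Hypothetical convenience wrapper returning a bool instead of raising."""
    try:
        validate_rope_scaling(rope_scaling)
    except ValueError:
        return False
    return True


for candidate in (
    None,
    {"type": "linear", "factor": 2.0},
    {"type": "linear", "factor": 2},    # int factor
    {"name": "linear", "factor": 2.0},  # wrong key
):
    print(candidate, is_valid(candidate))
```

One consequence worth knowing: a JSON config with `"factor": 2` is parsed as an `int` and fails validation, so the factor has to be written as `2.0`.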
generation_config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "bos_token_id": 1,
3
+ "do_sample": true,
4
+ "eos_token_id": 2,
5
+ "max_length": 4096,
6
+ "pad_token_id": 0,
7
+ "temperature": 0.6,
8
+ "top_p": 0.9,
9
+ "transformers_version": "4.40.0"
10
+ }
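Review note on `generation_config.json`: with `do_sample` enabled, `temperature` and `top_p` shape the distribution that tokens are drawn from. The sketch below is a textbook version of temperature scaling followed by nucleus (top-p) filtering in plain Python. It is not the `transformers` implementation, and the function name and toy logits are assumptions made for illustration.

```python
import math
import random


def nucleus_distribution(logits, temperature=0.6, top_p=0.9):
    """Return {token_index: probability} after temperature and top-p filtering."""
    # Temperature below 1 sharpens the distribution before the softmax.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [value / total for value in exps]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
dist = nucleus_distribution(toy_logits)
print(dist)

# Draw one token index from the filtered distribution.
sampled = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
print(sampled)
```

For the toy logits only the two highest-scoring tokens survive the filter, so the sampled index is always 0 or 1.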
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:af3a39a0bc72e9fa0a1486075bbfa0a3de527d8444e5207241f0f2aff17f4906
3
+ size 4772693800
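Review note on `model.safetensors`: what the diff shows is a Git LFS pointer, not the weights themselves. A minimal sketch of parsing the three fields and converting the size to GiB follows. The `parse_lfs_pointer` helper is hypothetical and handles only the fields shown in this pointer.

```python
# Pointer text copied from the diff above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:af3a39a0bc72e9fa0a1486075bbfa0a3de527d8444e5207241f0f2aff17f4906
size 4772693800
"""


def parse_lfs_pointer(text):
    """Split each `key value` line and decode the oid and size fields."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)
size_gib = info["size_bytes"] / 2**30
print(f"{info['hash_algo']} digest, {size_gib:.2f} GiB")  # sha256 digest, 4.44 GiB
```

The size is the on-disk size of the quantized checkpoint. Memory use at inference time will differ and is reported separately in the README metrics.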
modeling_camelidae.py ADDED
@@ -0,0 +1,1240 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ PyTorch LLaMA model."""
21
+ import math
22
+ from typing import List, Optional, Tuple, Union
23
+ from dataclasses import dataclass
24
+
25
+ import numpy as np
26
+ import copy
27
+
28
+ import torch
29
+ import torch.nn.functional as F
30
+ import torch.utils.checkpoint
31
+ from torch import nn
32
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
33
+ from torch.distributions.normal import Normal
34
+
35
+ from transformers.activations import ACT2FN
36
+ from transformers.modeling_outputs import (
37
+ BaseModelOutputWithPast,
38
+ CausalLMOutputWithPast,
39
+ MoECausalLMOutputWithPast,
40
+ )
41
+ from transformers.modeling_utils import PreTrainedModel
42
+ from transformers.utils import (
43
+ ModelOutput,
44
+ add_start_docstrings,
45
+ add_start_docstrings_to_model_forward,
46
+ logging,
47
+ replace_return_docstrings,
48
+ )
49
+
50
+ from .configuration_camelidae import CamelidaeConfig
51
+
52
+ logger = logging.get_logger(__name__)
53
+
54
+ _CONFIG_FOR_DOC = "CamelidaeConfig"
55
+
56
+
57
+ @dataclass
58
+ class MoEModelOutputWithPast(ModelOutput):
59
+ last_hidden_state: torch.FloatTensor = None
60
+ past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
61
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
62
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
63
+ router_logits: Optional[Tuple[torch.FloatTensor]] = None
64
+
65
+
66
+ @dataclass
67
+ class MoECausalLMOutputWithPast(ModelOutput):
68
+ loss: Optional[torch.FloatTensor] = None
69
+ aux_loss: Optional[torch.FloatTensor] = None
70
+ logits: torch.FloatTensor = None
71
+ past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None
72
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
73
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
74
+ router_logits: Optional[Tuple[torch.FloatTensor]] = None
75
+
76
+
77
+
78
+ # Copied from transformers.models.bart.modeling_bart._make_causal_mask
79
+ def _make_causal_mask(
80
+ input_ids_shape: torch.Size,
81
+ dtype: torch.dtype,
82
+ device: torch.device,
83
+ past_key_values_length: int = 0,
84
+ ):
85
+ """
86
+ Make causal mask used for uni-directional (causal) self-attention.
87
+ """
88
+ bsz, tgt_len = input_ids_shape
89
+ mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
90
+ mask_cond = torch.arange(mask.size(-1), device=device)
91
+ mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
92
+ mask = mask.to(dtype)
93
+
94
+ if past_key_values_length > 0:
95
+ mask = torch.cat(
96
+ [
97
+ torch.zeros(
98
+ tgt_len, past_key_values_length, dtype=dtype, device=device
99
+ ),
100
+ mask,
101
+ ],
102
+ dim=-1,
103
+ )
104
+ return mask[None, None, :, :].expand(
105
+ bsz, 1, tgt_len, tgt_len + past_key_values_length
106
+ )
107
+
108
+
109
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
110
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
111
+ """
112
+ Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
113
+ """
114
+ bsz, src_len = mask.size()
115
+ tgt_len = tgt_len if tgt_len is not None else src_len
116
+
117
+ expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
118
+
119
+ inverted_mask = 1.0 - expanded_mask
120
+
121
+ return inverted_mask.masked_fill(
122
+ inverted_mask.to(torch.bool), torch.finfo(dtype).min
123
+ )
124
+
125
+
126
+ class LlamaRMSNorm(nn.Module):
127
+ def __init__(self, hidden_size, eps=1e-6):
128
+ """
129
+ LlamaRMSNorm is equivalent to T5LayerNorm
130
+ """
131
+ super().__init__()
132
+ self.weight = nn.Parameter(torch.ones(hidden_size))
133
+ self.variance_epsilon = eps
134
+
135
+ def forward(self, hidden_states):
136
+ input_dtype = hidden_states.dtype
137
+ hidden_states = hidden_states.to(torch.float32)
138
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
139
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
140
+ return self.weight * hidden_states.to(input_dtype)
141
+
142
+
143
+ class LlamaRotaryEmbedding(torch.nn.Module):
144
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
145
+ super().__init__()
146
+
147
+ self.dim = dim
148
+ self.max_position_embeddings = max_position_embeddings
149
+ self.base = base
150
+ inv_freq = 1.0 / (
151
+ self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim)
152
+ )
153
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
154
+
155
+ # Build here to make `torch.jit.trace` work.
156
+ self._set_cos_sin_cache(
157
+ seq_len=max_position_embeddings,
158
+ device=self.inv_freq.device,
159
+ dtype=torch.get_default_dtype(),
160
+ )
161
+
162
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
163
+ self.max_seq_len_cached = seq_len
164
+ t = torch.arange(
165
+ self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype
166
+ )
167
+
168
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
169
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
170
+ emb = torch.cat((freqs, freqs), dim=-1)
171
+ self.register_buffer(
172
+ "cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False
173
+ )
174
+ self.register_buffer(
175
+ "sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False
176
+ )
177
+
178
+ def forward(self, x, seq_len=None):
179
+ # x: [bs, num_attention_heads, seq_len, head_size]
180
+ if seq_len > self.max_seq_len_cached:
181
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
182
+
183
+ return (
184
+ self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
185
+ self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
186
+ )
187
+
188
+
189
+ class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
190
+ """LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
191
+
192
+ def __init__(
193
+ self,
194
+ dim,
195
+ max_position_embeddings=2048,
196
+ base=10000,
197
+ device=None,
198
+ scaling_factor=1.0,
199
+ ):
200
+ self.scaling_factor = scaling_factor
201
+ super().__init__(dim, max_position_embeddings, base, device)
202
+
203
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
204
+ self.max_seq_len_cached = seq_len
205
+ t = torch.arange(
206
+ self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype
207
+ )
208
+ t = t / self.scaling_factor
209
+
210
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
211
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
212
+ emb = torch.cat((freqs, freqs), dim=-1)
213
+ self.register_buffer(
214
+ "cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False
215
+ )
216
+ self.register_buffer(
217
+ "sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False
218
+ )
219
+
220
+
221
+ class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding):
222
+ """LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
223
+
224
+ def __init__(
225
+ self,
226
+ dim,
227
+ max_position_embeddings=2048,
228
+ base=10000,
229
+ device=None,
230
+ scaling_factor=1.0,
231
+ ):
232
+ self.scaling_factor = scaling_factor
233
+ super().__init__(dim, max_position_embeddings, base, device)
234
+
235
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
236
+ self.max_seq_len_cached = seq_len
237
+
238
+ if seq_len > self.max_position_embeddings:
239
+ base = self.base * (
240
+ (self.scaling_factor * seq_len / self.max_position_embeddings)
241
+ - (self.scaling_factor - 1)
242
+ ) ** (self.dim / (self.dim - 2))
243
+ inv_freq = 1.0 / (
244
+ base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim)
245
+ )
246
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
247
+
248
+ t = torch.arange(
249
+ self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype
250
+ )
251
+
252
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
253
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
254
+ emb = torch.cat((freqs, freqs), dim=-1)
255
+ self.register_buffer(
256
+ "cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False
257
+ )
258
+ self.register_buffer(
259
+ "sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False
260
+ )
261
+
262
+
263
+ def rotate_half(x):
264
+ """Rotates half the hidden dims of the input."""
265
+ x1 = x[..., : x.shape[-1] // 2]
266
+ x2 = x[..., x.shape[-1] // 2 :]
267
+ return torch.cat((-x2, x1), dim=-1)
268
+
269
+
270
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
271
+ # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
272
+ cos = cos.squeeze(1).squeeze(0) # [seq_len, dim]
273
+ sin = sin.squeeze(1).squeeze(0) # [seq_len, dim]
274
+ cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
275
+ sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
276
+ q_embed = (q * cos) + (rotate_half(q) * sin)
277
+ k_embed = (k * cos) + (rotate_half(k) * sin)
278
+ return q_embed, k_embed
279
+
280
+
281
+ # Llama MoE
282
+ def load_balancing_loss_func(gate_logits: torch.Tensor, num_experts: Optional[int] = None, top_k=2) -> float:
283
+ r"""
284
+ Computes auxiliary load balancing loss as in Switch Transformer - implemented in PyTorch.
285
+
286
+ See Switch Transformer (https://arxiv.org/abs/2101.03961) for more details. This function implements the loss
287
+ function presented in equations (4) - (6) of the paper. It aims at penalizing cases where the routing between
288
+ experts is too unbalanced.
289
+
290
+ Args:
291
+ gate_logits (Union[`torch.Tensor`, Tuple[torch.Tensor]]):
292
+ Logits from the `gate`, should be a tuple of tensors. Shape: [batch_size, sequence_length, num_experts].
293
+ num_experts (`int`, *optional*):
294
+ Number of experts
295
+
296
+ Returns:
297
+ The auxiliary loss.
298
+ """
299
+ if gate_logits is None:
300
+ return 0
301
+
302
+ if isinstance(gate_logits, tuple):
303
+ # cat along the layers?
304
+ compute_device = gate_logits[0].device
305
+ gate_logits = torch.cat([gate.to(compute_device) for gate in gate_logits], dim=0)
306
+
307
+ routing_weights, selected_experts = torch.topk(gate_logits, top_k, dim=-1)
308
+ routing_weights = routing_weights.softmax(dim=-1)
309
+
310
+ # cast the expert indices to int64, otherwise one-hot encoding will fail
311
+ if selected_experts.dtype != torch.int64:
312
+ selected_experts = selected_experts.to(torch.int64)
313
+
314
+ if len(selected_experts.shape) == 2:
315
+ selected_experts = selected_experts.unsqueeze(2)
316
+
317
+ expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts)
318
+
319
+ # For a given token, determine if it was routed to a given expert.
320
+ expert_mask = torch.max(expert_mask, axis=-2).values
321
+
322
+ # cast to float32 otherwise mean will fail
323
+ expert_mask = expert_mask.to(torch.float32)
324
+ tokens_per_group_and_expert = torch.mean(expert_mask, axis=-2)
325
+
326
+ router_prob_per_group_and_expert = torch.mean(routing_weights, axis=-1)
327
+ return torch.mean(tokens_per_group_and_expert * router_prob_per_group_and_expert.unsqueeze(-1)) * (num_experts**2)
328
+
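For reference, the quantity in equations (4) to (6) of the Switch Transformer paper is N times the sum over experts of f_i * P_i, where f_i is the fraction of tokens dispatched to expert i and P_i is the mean router probability for expert i. The sketch below computes that quantity for top-1 routing with plain Python lists. It follows the paper's formula rather than transcribing the function above line by line, it leaves out the scaling coefficient, and `switch_aux_loss` is a hypothetical name.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def switch_aux_loss(gate_logits, num_experts):
    """N * sum_i f_i * P_i for top-1 routing over a flat list of tokens."""
    probs = [softmax(row) for row in gate_logits]
    num_tokens = len(probs)
    # f_i: fraction of tokens whose highest-probability expert is i.
    fraction = [0.0] * num_experts
    for p in probs:
        fraction[p.index(max(p))] += 1.0 / num_tokens
    # P_i: mean router probability assigned to expert i.
    mean_prob = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(f * p for f, p in zip(fraction, mean_prob))


# Tokens alternate between the two experts: the balanced case.
balanced = switch_aux_loss([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [0.0, 2.0]], 2)
# Every token prefers expert 0: the collapsed case, which is penalized more.
collapsed = switch_aux_loss([[2.0, 0.0]] * 4, 2)
```

With perfectly balanced routing the value is close to 1; routing that collapses onto one expert pushes it higher, which is what the auxiliary term is meant to discourage.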
329
+ class ParallelAdapterMLP(nn.Module):
330
+ def __init__(self, config, adapter_dim, adapter_scaling):
331
+ super().__init__()
332
+ self.config = config
333
+ self.intermediate_size = config.intermediate_size
334
+ self.hidden_size = config.hidden_size
335
+ self.adapter_down = nn.Linear(self.hidden_size, adapter_dim, bias=False)
336
+ self.adapter_up = nn.Linear(adapter_dim, self.hidden_size, bias=False)
337
+ self.adapter_act = nn.GELU()
338
+
339
+ self.adapter_dropout = nn.Dropout(p=0.1)
340
+ self.adapter_scaling = adapter_scaling
341
+
342
+ def forward(self, x):
343
+ x = self.adapter_dropout(x)
344
+ x = self.adapter_scaling * self.adapter_up(self.adapter_act(self.adapter_down(x)))
345
+ return x
346
+
347
+
348
+ class CamelidaeGateAdapter(nn.Module):
349
+ def __init__(self, config: CamelidaeConfig):
350
+ super().__init__()
351
+
352
+ self.intermediate_size = config.intermediate_size
353
+ self.hidden_size = config.hidden_size
354
+
355
+ # Step 1: Router
356
+ self.num_experts = config.num_experts
357
+ self.topk = config.topk
358
+ self.router = nn.Linear(
359
+ config.hidden_size, self.num_experts, bias=False
360
+ )
361
+ self.dtype = getattr(torch, config.moe_dtype)
362
+
363
+ # Step 2: Get the experts
364
+ self.experts = nn.ModuleDict()
365
+ for idx in range(config.num_experts):
366
+ self.experts[f"expert_{idx}"] = ParallelAdapterMLP(config, config.adapter_dim, config.moe_scaling)
367
+
368
+ def forward(self, input_hidden_states, output_hidden_states, router_hidden_states):
369
+ orig_shape = output_hidden_states.shape
370
+ input_hidden_states = input_hidden_states.view(-1, input_hidden_states.shape[-1])
371
+ output_hidden_states = output_hidden_states.view(-1, output_hidden_states.shape[-1])
372
+ router_hidden_states = router_hidden_states.view(-1, router_hidden_states.shape[-1])
373
+
374
+ router_logits = self.router(router_hidden_states)
375
+
376
+ expert_weights, expert_indices = torch.topk(router_logits, self.topk, dim=-1)
377
+ expert_weights = expert_weights.softmax(dim=-1)
378
+ flat_expert_indices = expert_indices.view(-1)
379
+
380
+ input_hidden_states = input_hidden_states.repeat_interleave(self.topk, dim=0)
381
+ expert_hidden_states = output_hidden_states.repeat_interleave(self.topk, dim=0)
382
+ for idx, expert in enumerate(self.experts.values()):
383
+ expert_hidden_states[flat_expert_indices == idx] += expert(input_hidden_states[flat_expert_indices == idx])
384
+ hidden_states = (expert_hidden_states.view(*expert_weights.shape, -1) * expert_weights.unsqueeze(-1)).sum(dim=1)
385
+
386
+ return hidden_states.view(*orig_shape), router_logits
387
+
388
+
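The routing in `CamelidaeGateAdapter.forward` can be hard to follow through the `repeat_interleave` and masking steps, so here is a per-token sketch in plain Python. It assumes scalar activations and uses simple callables as stand-ins for the adapter experts; `route_token` and `adapter_output` are hypothetical names, not functions from this file.

```python
import math


def route_token(router_logits, top_k):
    """Return the top-k expert indices and a softmax over just those k logits."""
    order = sorted(
        range(len(router_logits)), key=lambda e: router_logits[e], reverse=True
    )
    chosen = order[:top_k]
    peak = router_logits[chosen[0]]
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return chosen, [v / total for v in exps]


def adapter_output(base_output, adapter_input, router_logits, experts, top_k):
    """Weighted sum over the chosen experts of (base output + expert output)."""
    chosen, weights = route_token(router_logits, top_k)
    return sum(
        w * (base_output + experts[e](adapter_input))
        for e, w in zip(chosen, weights)
    )


# Stand-in experts (assumptions for illustration only).
experts = [lambda x: 2.0 * x, lambda x: -x, lambda x: x + 1.0]
out = adapter_output(10.0, 4.0, [0.1, 3.0, 3.0], experts, top_k=2)
```

Because the top-k weights sum to 1, the result equals the base output plus the weighted sum of the selected adapter outputs, so the adapters act as a routed correction on top of the dense MLP.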
389
+ class LlamaMLP(nn.Module):
390
+ def __init__(self, config):
391
+ super().__init__()
392
+ self.config = config
393
+ self.hidden_size = config.hidden_size
394
+ self.intermediate_size = config.intermediate_size
395
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
396
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
397
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
398
+ self.act_fn = ACT2FN[config.hidden_act]
399
+
400
+ self.moe_adapter = CamelidaeGateAdapter(config)
401
+
402
+ def forward(self, x):
403
+ router_hidden_states = x
404
+ up_proj = self.act_fn(self.gate_proj(x)) * self.up_proj(x)
405
+ down_proj = self.down_proj(up_proj)
406
+ down_proj, router_logits = self.moe_adapter(down_proj, down_proj, router_hidden_states)
407
+
408
+ return down_proj, router_logits
409
+
410
+
411
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
412
+ """
413
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
414
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
415
+ """
416
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
417
+ if n_rep == 1:
418
+ return hidden_states
419
+ hidden_states = hidden_states[:, :, None, :, :].expand(
420
+ batch, num_key_value_heads, n_rep, slen, head_dim
421
+ )
422
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
423
+
424
+
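To make the head layout produced by `repeat_kv` concrete, the following sketch repeats a list of labelled key/value heads the same way, with copies of each head kept adjacent. `repeat_kv_heads` is a hypothetical list-based stand-in, not the tensor function above.

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each key/value head `n_rep` times, keeping the copies adjacent."""
    return [head for head in kv_heads for _ in range(n_rep)]


kv_heads = ["kv0", "kv1"]
expanded = repeat_kv_heads(kv_heads, 3)
# Query head h reads expanded[h], which is the original head h // n_rep.
```

This ordering is what lets grouped-query attention pair query head `h` with key/value head `h // n_rep` after the expansion.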
425
+ class LlamaAttention(nn.Module):
426
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
427
+
428
+ def __init__(self, config: CamelidaeConfig):
429
+ super().__init__()
430
+ self.config = config
431
+ self.hidden_size = config.hidden_size
432
+ self.num_heads = config.num_attention_heads
433
+ self.head_dim = self.hidden_size // self.num_heads
434
+ self.num_key_value_heads = config.num_key_value_heads
435
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
436
+ self.max_position_embeddings = config.max_position_embeddings
437
+
438
+ if (self.head_dim * self.num_heads) != self.hidden_size:
439
+ raise ValueError(
440
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
441
+ f" and `num_heads`: {self.num_heads})."
442
+ )
443
+ self.q_proj = nn.Linear(
444
+ self.hidden_size, self.num_heads * self.head_dim, bias=False
445
+ )
446
+ self.k_proj = nn.Linear(
447
+ self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False
448
+ )
449
+ self.v_proj = nn.Linear(
450
+ self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False
451
+ )
452
+ self.o_proj = nn.Linear(
453
+ self.num_heads * self.head_dim, self.hidden_size, bias=False
454
+ )
455
+ self._init_rope()
456
+
457
+ def _init_rope(self):
458
+ if self.config.rope_scaling is None:
459
+ self.rotary_emb = LlamaRotaryEmbedding(
460
+ self.head_dim, max_position_embeddings=self.max_position_embeddings
461
+ )
462
+ else:
463
+ scaling_type = self.config.rope_scaling["type"]
464
+ scaling_factor = self.config.rope_scaling["factor"]
465
+ if scaling_type == "linear":
466
+ self.rotary_emb = LlamaLinearScalingRotaryEmbedding(
467
+ self.head_dim,
468
+ max_position_embeddings=self.max_position_embeddings,
469
+ scaling_factor=scaling_factor,
470
+ )
471
+ elif scaling_type == "dynamic":
472
+ self.rotary_emb = LlamaDynamicNTKScalingRotaryEmbedding(
473
+ self.head_dim,
474
+ max_position_embeddings=self.max_position_embeddings,
475
+ scaling_factor=scaling_factor,
476
+ )
477
+ else:
478
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
479
+
480
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
481
+ return (
482
+ tensor.view(bsz, seq_len, self.num_heads, self.head_dim)
483
+ .transpose(1, 2)
484
+ .contiguous()
485
+ )
486
+
487
+ def forward(
488
+ self,
489
+ hidden_states: torch.Tensor,
490
+ attention_mask: Optional[torch.Tensor] = None,
491
+ position_ids: Optional[torch.LongTensor] = None,
492
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
493
+ output_attentions: bool = False,
494
+ use_cache: bool = False,
495
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
496
+ bsz, q_len, _ = hidden_states.size()
497
+
498
+ if self.config.pretraining_tp > 1:
499
+ key_value_slicing = (
500
+ self.num_key_value_heads * self.head_dim
501
+ ) // self.config.pretraining_tp
502
+ query_slices = self.q_proj.weight.split(
503
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
504
+ )
505
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
506
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
507
+
508
+ query_states = [
509
+ F.linear(hidden_states, query_slices[i])
510
+ for i in range(self.config.pretraining_tp)
511
+ ]
512
+ query_states = torch.cat(query_states, dim=-1)
513
+
514
+ key_states = [
515
+ F.linear(hidden_states, key_slices[i])
516
+ for i in range(self.config.pretraining_tp)
517
+ ]
518
+ key_states = torch.cat(key_states, dim=-1)
519
+
520
+ value_states = [
521
+ F.linear(hidden_states, value_slices[i])
522
+ for i in range(self.config.pretraining_tp)
523
+ ]
524
+ value_states = torch.cat(value_states, dim=-1)
525
+
526
+ else:
527
+ query_states = self.q_proj(hidden_states)
528
+ key_states = self.k_proj(hidden_states)
529
+ value_states = self.v_proj(hidden_states)
530
+
531
+ query_states = query_states.view(
532
+ bsz, q_len, self.num_heads, self.head_dim
533
+ ).transpose(1, 2)
534
+ key_states = key_states.view(
535
+ bsz, q_len, self.num_key_value_heads, self.head_dim
536
+ ).transpose(1, 2)
537
+ value_states = value_states.view(
538
+ bsz, q_len, self.num_key_value_heads, self.head_dim
539
+ ).transpose(1, 2)
540
+
541
+ kv_seq_len = key_states.shape[-2]
542
+ if past_key_value is not None:
543
+ kv_seq_len += past_key_value[0].shape[-2]
544
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
545
+ query_states, key_states = apply_rotary_pos_emb(
546
+ query_states, key_states, cos, sin, position_ids
547
+ )
548
+
549
+ if past_key_value is not None:
550
+ # reuse k, v, self_attention
551
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
552
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
553
+
554
+ past_key_value = (key_states, value_states) if use_cache else None
555
+
556
+ # repeat k/v heads if n_kv_heads < n_heads
557
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
558
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
559
+
560
+ attn_weights = torch.matmul(
561
+ query_states, key_states.transpose(2, 3)
562
+ ) / math.sqrt(self.head_dim)
563
+
564
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
565
+ raise ValueError(
566
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
567
+ f" {attn_weights.size()}"
568
+ )
569
+
570
+ if attention_mask is not None:
571
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
572
+ raise ValueError(
573
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
574
+ )
575
+ attn_weights = attn_weights + attention_mask
576
+
577
+ # upcast attention to fp32
578
+ attn_weights = nn.functional.softmax(
579
+ attn_weights, dim=-1, dtype=torch.float32
580
+ ).to(query_states.dtype)
581
+ attn_output = torch.matmul(attn_weights, value_states)
582
+
583
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
584
+ raise ValueError(
585
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
586
+ f" {attn_output.size()}"
587
+ )
588
+
589
+ attn_output = attn_output.transpose(1, 2).contiguous()
590
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
591
+
592
+ if self.config.pretraining_tp > 1:
593
+ attn_output = attn_output.split(
594
+ self.hidden_size // self.config.pretraining_tp, dim=2
595
+ )
596
+ o_proj_slices = self.o_proj.weight.split(
597
+ self.hidden_size // self.config.pretraining_tp, dim=1
598
+ )
599
+ attn_output = sum(
600
+ [
601
+ F.linear(attn_output[i], o_proj_slices[i])
602
+ for i in range(self.config.pretraining_tp)
603
+ ]
604
+ )
605
+ else:
606
+ attn_output = self.o_proj(attn_output)
607
+
608
+ if not output_attentions:
609
+ attn_weights = None
610
+
611
+ return attn_output, attn_weights, past_key_value
612
+
613
+
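The core of `LlamaAttention.forward` is scaled dot-product attention with an additive mask. The sketch below shows that step for a single query using plain Python lists. It uses negative infinity for masked positions as a simplifying assumption (the docstrings in this file describe "very large negative values"), and `attend` is a hypothetical helper name.

```python
import math


def attend(query, keys, values, mask=None):
    """Single-query scaled dot-product attention with an additive mask."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    if mask is not None:
        scores = [s + m for s, m in zip(scores, mask)]
    # Numerically stable softmax over the key positions.
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 20.0]]
# Both keys score equally, so the unmasked output averages the two values.
unmasked = attend([1.0, 1.0], keys, values)
# Masking the second key leaves all attention weight on the first.
masked = attend([1.0, 1.0], keys, values, mask=[0.0, float("-inf")])
```

Adding the mask before the softmax, rather than zeroing weights afterwards, keeps the remaining weights normalized to 1.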
614
+ class LlamaDecoderLayer(nn.Module):
615
+ def __init__(self, config: CamelidaeConfig):
616
+ super().__init__()
617
+ self.config = config
618
+ self.hidden_size = config.hidden_size
619
+ self.self_attn = LlamaAttention(config=config)
620
+ self.mlp = LlamaMLP(config)
621
+ self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
622
+ self.post_attention_layernorm = LlamaRMSNorm(
623
+ config.hidden_size, eps=config.rms_norm_eps
624
+ )
625
+
626
+ def forward(
627
+ self,
628
+ hidden_states: torch.Tensor,
629
+ attention_mask: Optional[torch.Tensor] = None,
630
+ position_ids: Optional[torch.LongTensor] = None,
631
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
632
+ output_attentions: Optional[bool] = False,
633
+ output_router_logits: Optional[bool] = False,
634
+ use_cache: Optional[bool] = False,
635
+ ) -> Tuple[
636
+ torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]
637
+ ]:
638
+ """
639
+ Args:
640
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
641
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
642
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
643
+ output_attentions (`bool`, *optional*):
644
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
645
+ returned tensors for more detail.
646
+ use_cache (`bool`, *optional*):
647
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
648
+ (see `past_key_values`).
649
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
650
+ """
651
+
652
+ residual = hidden_states
653
+
654
+ hidden_states = self.input_layernorm(hidden_states)
655
+ # router_hidden_states = hidden_states
656
+
657
+ # Self Attention
658
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
659
+ hidden_states=hidden_states,
660
+ attention_mask=attention_mask,
661
+ position_ids=position_ids,
662
+ past_key_value=past_key_value,
663
+ output_attentions=output_attentions,
664
+ use_cache=use_cache,
665
+ )
666
+ hidden_states = residual + hidden_states
667
+
668
+ # Fully Connected
669
+ residual = hidden_states
670
+ hidden_states = self.post_attention_layernorm(hidden_states)
671
+ hidden_states, router_logits = self.mlp(hidden_states)
672
+ hidden_states = residual + hidden_states
673
+
674
+ outputs = (hidden_states,)
675
+
676
+ if output_attentions:
677
+ outputs += (self_attn_weights,)
678
+
679
+ if use_cache:
680
+ outputs += (present_key_value,)
681
+
682
+ if output_router_logits:
683
+ outputs += (router_logits,)
684
+
685
+ return outputs
686
+
687
+
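The data flow in `LlamaDecoderLayer.forward` is a pre-norm residual block: normalize, apply the sub-layer, add the result back to the input, and do this twice. A minimal sketch with plain Python lists follows. It assumes `LlamaRMSNorm` implements the standard RMSNorm definition, treats the attention and MLP sub-layers as opaque callables, and ignores caching and router logits; `rms_norm` and `decoder_layer` are hypothetical names.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then scale by weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]


def decoder_layer(x, attn, mlp, norm1_weight, norm2_weight):
    """Pre-norm residual block: h = x + attn(norm(x)); out = h + mlp(norm(h))."""
    h = [a + b for a, b in zip(x, attn(rms_norm(x, norm1_weight)))]
    return [a + b for a, b in zip(h, mlp(rms_norm(h, norm2_weight)))]


def zero_branch(v):
    """Stand-in sub-layer that contributes nothing."""
    return [0.0] * len(v)


ones = [1.0, 1.0, 1.0]
out = decoder_layer([1.0, -2.0, 3.0], zero_branch, zero_branch, ones, ones)
```

When both sub-layers contribute nothing, the input passes through unchanged, which is the property the residual connections are there to provide.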
688
+ LLAMA_START_DOCSTRING = r"""
689
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
690
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
691
+ etc.)
692
+
693
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
694
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
695
+ and behavior.
696
+
697
+ Parameters:
698
+ config ([`CamelidaeConfig`]):
699
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
700
+ load the weights associated with the model, only the configuration. Check out the
701
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
702
+ """
703
+
704
+
705
+ @add_start_docstrings(
706
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
707
+ LLAMA_START_DOCSTRING,
708
+ )
709
+ class LlamaPreTrainedModel(PreTrainedModel):
710
+ config_class = CamelidaeConfig
711
+ base_model_prefix = "model"
712
+ supports_gradient_checkpointing = True
713
+ _no_split_modules = ["LlamaDecoderLayer"]
714
+ _skip_keys_device_placement = "past_key_values"
715
+
716
+ def _init_weights(self, module):
717
+ std = self.config.initializer_range
718
+ if isinstance(module, nn.Linear):
719
+ module.weight.data.normal_(mean=0.0, std=std)
720
+ if module.bias is not None:
721
+ module.bias.data.zero_()
722
+ elif isinstance(module, nn.Embedding):
723
+ module.weight.data.normal_(mean=0.0, std=std)
724
+ if module.padding_idx is not None:
725
+ module.weight.data[module.padding_idx].zero_()
726
+
727
+ def _set_gradient_checkpointing(self, module, value=False):
728
+ if isinstance(module, LlamaModel):
729
+ module.gradient_checkpointing = value
730
+
731
+
732
+ LLAMA_INPUTS_DOCSTRING = r"""
733
+ Args:
734
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
735
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
736
+ it.
737
+
738
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
739
+ [`PreTrainedTokenizer.__call__`] for details.
740
+
741
+ [What are input IDs?](../glossary#input-ids)
742
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
743
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
744
+
745
+ - 1 for tokens that are **not masked**,
746
+ - 0 for tokens that are **masked**.
747
+
748
+ [What are attention masks?](../glossary#attention-mask)
749
+
750
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
751
+ [`PreTrainedTokenizer.__call__`] for details.
752
+
753
+ If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
754
+ `past_key_values`).
755
+
756
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
757
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
758
+ information on the default strategy.
759
+
760
+ - 1 indicates the head is **not masked**,
761
+ - 0 indicates the head is **masked**.
762
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
763
+ Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
764
+ config.n_positions - 1]`.
765
+
766
+ [What are position IDs?](../glossary#position-ids)
767
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
768
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
769
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)` and 2 additional tensors of shape
770
+ `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
771
+
772
+ Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
773
+ blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
774
+
775
+ If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
776
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
777
+ `decoder_input_ids` of shape `(batch_size, sequence_length)`.
778
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
779
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
780
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
781
+ model's internal embedding lookup matrix.
782
+ use_cache (`bool`, *optional*):
783
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
784
+ `past_key_values`).
785
+ output_attentions (`bool`, *optional*):
786
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
787
+ tensors for more detail.
788
+ output_hidden_states (`bool`, *optional*):
789
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
790
+ more detail.
791
+ output_router_logits (`bool`, *optional*):
792
+ Whether or not to return the logits of all the routers. They are useful for computing the router loss, and
793
+ should not be returned during inference.
794
+ return_dict (`bool`, *optional*):
795
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
796
+ """
797
+
798
+
799
+ @add_start_docstrings(
800
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
801
+ LLAMA_START_DOCSTRING,
802
+ )
803
+ class LlamaModel(LlamaPreTrainedModel):
804
+ """
805
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`]
806
+
807
+ Args:
808
+ config: CamelidaeConfig
809
+ """
810
+
811
+ def __init__(self, config: CamelidaeConfig):
812
+ super().__init__(config)
813
+ self.padding_idx = config.pad_token_id
814
+ self.vocab_size = config.vocab_size
815
+
816
+ self.embed_tokens = nn.Embedding(
817
+ config.vocab_size, config.hidden_size, self.padding_idx
818
+ )
819
+ self.layers = nn.ModuleList(
820
+ [LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)]
821
+ )
822
+ self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
823
+
824
+ self.gradient_checkpointing = False
825
+ # Initialize weights and apply final processing
826
+ self.post_init()
827
+
828
+ def get_input_embeddings(self):
829
+ return self.embed_tokens
830
+
831
+ def set_input_embeddings(self, value):
832
+ self.embed_tokens = value
833
+
834
+ # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
835
+ def _prepare_decoder_attention_mask(
836
+ self, attention_mask, input_shape, inputs_embeds, past_key_values_length
837
+ ):
838
+ # create causal mask
839
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
840
+ combined_attention_mask = None
841
+ if input_shape[-1] > 1:
842
+ combined_attention_mask = _make_causal_mask(
843
+ input_shape,
844
+ inputs_embeds.dtype,
845
+ device=inputs_embeds.device,
846
+ past_key_values_length=past_key_values_length,
847
+ )
848
+
849
+ if attention_mask is not None:
850
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
851
+ expanded_attn_mask = _expand_mask(
852
+ attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]
853
+ ).to(inputs_embeds.device)
854
+ combined_attention_mask = (
855
+ expanded_attn_mask
856
+ if combined_attention_mask is None
857
+ else expanded_attn_mask + combined_attention_mask
858
+ )
859
+
860
+ return combined_attention_mask
861
+
862
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
863
+ def forward(
864
+ self,
865
+ input_ids: torch.LongTensor = None,
866
+ attention_mask: Optional[torch.Tensor] = None,
867
+ position_ids: Optional[torch.LongTensor] = None,
868
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
869
+ inputs_embeds: Optional[torch.FloatTensor] = None,
870
+ use_cache: Optional[bool] = None,
871
+ output_attentions: Optional[bool] = None,
872
+ output_hidden_states: Optional[bool] = None,
873
+ output_router_logits: Optional[bool] = None,
874
+ return_dict: Optional[bool] = None,
875
+ ) -> Union[Tuple, MoEModelOutputWithPast]:
876
+ output_attentions = (
877
+ output_attentions
878
+ if output_attentions is not None
879
+ else self.config.output_attentions
880
+ )
881
+ output_hidden_states = (
882
+ output_hidden_states
883
+ if output_hidden_states is not None
884
+ else self.config.output_hidden_states
885
+ )
886
+ output_router_logits = (
887
+ output_router_logits
888
+ if output_router_logits is not None
889
+ else self.config.output_router_logits
890
+ )
891
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
892
+
893
+ return_dict = (
894
+ return_dict if return_dict is not None else self.config.use_return_dict
895
+ )
896
+
897
+ # retrieve input_ids and inputs_embeds
898
+ if input_ids is not None and inputs_embeds is not None:
899
+ raise ValueError(
900
+ "You cannot specify both input_ids and inputs_embeds at the same time"
901
+ )
902
+ elif input_ids is not None:
903
+ batch_size, seq_length = input_ids.shape
904
+ elif inputs_embeds is not None:
905
+ batch_size, seq_length, _ = inputs_embeds.shape
906
+ else:
907
+ raise ValueError(
908
+ "You have to specify either input_ids or inputs_embeds"
909
+ )
910
+
911
+ seq_length_with_past = seq_length
912
+ past_key_values_length = 0
913
+
914
+ if past_key_values is not None:
915
+ past_key_values_length = past_key_values[0][0].shape[2]
916
+ seq_length_with_past = seq_length_with_past + past_key_values_length
917
+
918
+ if position_ids is None:
919
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
920
+ position_ids = torch.arange(
921
+ past_key_values_length,
922
+ seq_length + past_key_values_length,
923
+ dtype=torch.long,
924
+ device=device,
925
+ )
926
+ position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
927
+ else:
928
+ position_ids = position_ids.view(-1, seq_length).long()
929
+
930
+ if inputs_embeds is None:
931
+ inputs_embeds = self.embed_tokens(input_ids)
932
+ # embed positions
933
+ if attention_mask is None:
934
+ attention_mask = torch.ones(
935
+ (batch_size, seq_length_with_past),
936
+ dtype=torch.bool,
937
+ device=inputs_embeds.device,
938
+ )
939
+ attention_mask = self._prepare_decoder_attention_mask(
940
+ attention_mask,
941
+ (batch_size, seq_length),
942
+ inputs_embeds,
943
+ past_key_values_length,
944
+ )
945
+
946
+ hidden_states = inputs_embeds
947
+
948
+ if self.gradient_checkpointing and self.training:
949
+ if use_cache:
950
+ logger.warning_once(
951
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
952
+ )
953
+ use_cache = False
954
+
955
+ # decoder layers
956
+ all_hidden_states = () if output_hidden_states else None
957
+ all_self_attns = () if output_attentions else None
958
+ all_router_logits = () if output_router_logits else None
959
+ next_decoder_cache = () if use_cache else None
960
+
961
+ for idx, decoder_layer in enumerate(self.layers):
962
+ if output_hidden_states:
963
+ all_hidden_states += (hidden_states,)
964
+
965
+ past_key_value = (
966
+ past_key_values[idx] if past_key_values is not None else None
967
+ )
968
+
969
+ if self.gradient_checkpointing and self.training:
970
+
971
+ def create_custom_forward(module):
972
+ def custom_forward(*inputs):
973
+ # the trailing None is `use_cache`; `past_key_value` is passed as None through `inputs`
974
+ return module(
975
+ *inputs, output_attentions, output_router_logits, None
976
+ )
977
+
978
+ return custom_forward
979
+
980
+ layer_outputs = torch.utils.checkpoint.checkpoint(
981
+ create_custom_forward(decoder_layer),
982
+ hidden_states,
983
+ attention_mask,
984
+ position_ids,
985
+ None,
986
+ )
987
+ else:
988
+ layer_outputs = decoder_layer(
989
+ hidden_states,
990
+ attention_mask=attention_mask,
991
+ position_ids=position_ids,
992
+ past_key_value=past_key_value,
993
+ output_attentions=output_attentions,
994
+ output_router_logits=output_router_logits,
995
+ use_cache=use_cache,
996
+ )
997
+
998
+ hidden_states = layer_outputs[0]
999
+
1000
+ if use_cache:
1001
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
1002
+
1003
+ if output_attentions:
1004
+ all_self_attns += (layer_outputs[1],)
1005
+
1006
+ if output_router_logits:
1007
+ all_router_logits += (layer_outputs[-1],)
1008
+
1009
+ hidden_states = self.norm(hidden_states)
1010
+
1011
+ # add hidden states from the last decoder layer
1012
+ if output_hidden_states:
1013
+ all_hidden_states += (hidden_states,)
1014
+
1015
+ next_cache = next_decoder_cache if use_cache else None
1016
+ if not return_dict:
1017
+ return tuple(
1018
+ v
1019
+ for v in [
1020
+ hidden_states,
1021
+ next_cache,
1022
+ all_hidden_states,
1023
+ all_self_attns,
1024
+ all_router_logits
1025
+ ]
1026
+ if v is not None
1027
+ )
1028
+ return MoEModelOutputWithPast(
1029
+ last_hidden_state=hidden_states,
1030
+ past_key_values=next_cache,
1031
+ hidden_states=all_hidden_states,
1032
+ attentions=all_self_attns,
1033
+ router_logits=all_router_logits,
1034
+ )
1035
+
1036
+
1037
+ class LlamaForCausalLM(LlamaPreTrainedModel):
1038
+ _tied_weights_keys = ["lm_head.weight"]
1039
+
1040
+ def __init__(self, config):
1041
+ super().__init__(config)
1042
+ self.config = config
1043
+ self.model = LlamaModel(config)
1044
+ self.vocab_size = config.vocab_size
1045
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1046
+
1047
+ # Initialize weights and apply final processing
1048
+ self.post_init()
1049
+
1050
+ def get_input_embeddings(self):
1051
+ return self.model.embed_tokens
1052
+
1053
+ def set_input_embeddings(self, value):
1054
+ self.model.embed_tokens = value
1055
+
1056
+ def get_output_embeddings(self):
1057
+ return self.lm_head
1058
+
1059
+ def set_output_embeddings(self, new_embeddings):
1060
+ self.lm_head = new_embeddings
1061
+
1062
+ def set_decoder(self, decoder):
1063
+ self.model = decoder
1064
+
1065
+ def get_decoder(self):
1066
+ return self.model
1067
+
1068
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
1069
+ @replace_return_docstrings(
1070
+ output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC
1071
+ )
1072
+ def forward(
1073
+ self,
1074
+ input_ids: torch.LongTensor = None,
1075
+ attention_mask: Optional[torch.Tensor] = None,
1076
+ position_ids: Optional[torch.LongTensor] = None,
1077
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1078
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1079
+ labels: Optional[torch.LongTensor] = None,
1080
+ use_cache: Optional[bool] = None,
1081
+ output_attentions: Optional[bool] = None,
1082
+ output_hidden_states: Optional[bool] = None,
1083
+ output_router_logits: Optional[bool] = None,
1084
+ return_dict: Optional[bool] = None,
1085
+ ) -> Union[Tuple, MoECausalLMOutputWithPast]:
1086
+ r"""
1087
+ Args:
1088
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1089
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
1090
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1091
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
1092
+
1093
+ Returns:
1094
+
1095
+ Example:
1096
+
1097
+ ```python
1098
+ >>> from transformers import AutoTokenizer, LlamaForCausalLM
1099
+
1100
+ >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
1101
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
1102
+
1103
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1104
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1105
+
1106
+ >>> # Generate
1107
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1108
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1109
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1110
+ ```"""
+
+        output_attentions = (
+            output_attentions
+            if output_attentions is not None
+            else self.config.output_attentions
+        )
+        output_hidden_states = (
+            output_hidden_states
+            if output_hidden_states is not None
+            else self.config.output_hidden_states
+        )
+        output_router_logits = (
+            output_router_logits if output_router_logits is not None else self.config.output_router_logits
+        )
+        return_dict = (
+            return_dict if return_dict is not None else self.config.use_return_dict
+        )
+
+        # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
+        outputs = self.model(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            inputs_embeds=inputs_embeds,
+            use_cache=use_cache,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            output_router_logits=output_router_logits,
+            return_dict=return_dict,
+        )
+
+        hidden_states = outputs[0]
+        if self.config.pretraining_tp > 1:
+            lm_head_slices = self.lm_head.weight.split(
+                self.vocab_size // self.config.pretraining_tp, dim=0
+            )
+            logits = [
+                F.linear(hidden_states, lm_head_slices[i])
+                for i in range(self.config.pretraining_tp)
+            ]
+            logits = torch.cat(logits, dim=-1)
+        else:
+            logits = self.lm_head(hidden_states)
+        logits = logits.float()
+
+        loss = None
+
+        if labels is not None:
+            # Shift so that tokens < n predict n
+            shift_logits = logits[..., :-1, :].contiguous()
+            shift_labels = labels[..., 1:].contiguous()
+            # Flatten the tokens
+            loss_fct = CrossEntropyLoss()
+            shift_logits = shift_logits.view(-1, self.config.vocab_size)
+            shift_labels = shift_labels.view(-1)
+            # Enable model parallelism
+            shift_labels = shift_labels.to(shift_logits.device)
+            loss = loss_fct(shift_logits, shift_labels)
+
+        aux_loss = None
+        if output_router_logits:
+            aux_loss = load_balancing_loss_func(
+                outputs.router_logits if return_dict else outputs[-1], self.config.num_experts, self.config.topk
+            )
+            if labels is not None:
+                loss += 0.01 * aux_loss
+
+        if not return_dict:
+            output = (logits,) + outputs[1:]
+            if output_router_logits:
+                output = (aux_loss,) + output
+            return (loss,) + output if loss is not None else output
+
+        return MoECausalLMOutputWithPast(
+            loss=loss,
+            aux_loss=aux_loss,
+            logits=logits,
+            past_key_values=outputs.past_key_values,
+            hidden_states=outputs.hidden_states,
+            attentions=outputs.attentions,
+            router_logits=outputs.router_logits,
+        )
+
+    def prepare_inputs_for_generation(
+        self,
+        input_ids,
+        past_key_values=None,
+        attention_mask=None,
+        inputs_embeds=None,
+        **kwargs,
+    ):
+        if past_key_values:
+            input_ids = input_ids[:, -1:]
+
+        position_ids = kwargs.get("position_ids", None)
+        if attention_mask is not None and position_ids is None:
+            # create position_ids on the fly for batch generation
+            position_ids = attention_mask.long().cumsum(-1) - 1
+            position_ids.masked_fill_(attention_mask == 0, 1)
+            if past_key_values:
+                position_ids = position_ids[:, -1].unsqueeze(-1)
+
+        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
+        if inputs_embeds is not None and past_key_values is None:
+            model_inputs = {"inputs_embeds": inputs_embeds}
+        else:
+            model_inputs = {"input_ids": input_ids}
+
+        model_inputs.update(
+            {
+                "position_ids": position_ids,
+                "past_key_values": past_key_values,
+                "use_cache": kwargs.get("use_cache"),
+                "attention_mask": attention_mask,
+            }
+        )
+        return model_inputs
+
+    @staticmethod
+    def _reorder_cache(past_key_values, beam_idx):
+        reordered_past = ()
+        for layer_past in past_key_values:
+            reordered_past += (
+                tuple(
+                    past_state.index_select(0, beam_idx.to(past_state.device))
+                    for past_state in layer_past
+                ),
+            )
+        return reordered_past
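
The on-the-fly `position_ids` logic in `prepare_inputs_for_generation` is easier to follow on a small batch. The sketch below re-creates the two tensor operations (`cumsum(-1) - 1`, then `masked_fill_(attention_mask == 0, 1)`) over plain nested lists instead of tensors. `position_ids_from_mask` is a hypothetical helper name used only for illustration; it is not part of the uploaded file.

```python
from itertools import accumulate


def position_ids_from_mask(attention_mask):
    """Pure-Python stand-in for `cumsum(-1) - 1` followed by
    `masked_fill_(attention_mask == 0, 1)`, for a batch of 0/1 rows."""
    batch = []
    for row in attention_mask:
        # Running count of attended tokens, shifted so the first real token is 0.
        running = [total - 1 for total in accumulate(row)]
        # Padding slots get the placeholder position 1, as in the diff above.
        batch.append([pos if keep else 1 for pos, keep in zip(running, row)])
    return batch


# First row is left-padded with two pad tokens; second row has no padding.
mask = [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]
position_ids = position_ids_from_mask(mask)

# With a populated cache, only the last column is kept
# (the `[:, -1].unsqueeze(-1)` step in the diff).
last_step = [[row[-1]] for row in position_ids]
```

Real tokens count from 0 regardless of how much left padding precedes them, so the padded row ends at position 2 while the unpadded row ends at position 4.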
smash_config.json ADDED
@@ -0,0 +1,31 @@
+{
+  "api_key": null,
+  "verify_url": "http://johnrachwan.pythonanywhere.com",
+  "smash_config": {
+    "pruners": "None",
+    "pruning_ratio": 0.0,
+    "factorizers": "None",
+    "quantizers": "['llm-int8']",
+    "weight_quantization_bits": 4,
+    "output_deviation": 0.005,
+    "compilers": "None",
+    "static_batch": true,
+    "static_shape": true,
+    "controlnet": "None",
+    "unet_dim": 4,
+    "device": "cuda",
+    "cache_dir": "/ceph/hdd/staff/charpent/.cache/models1trx2z9_",
+    "batch_size": 1,
+    "model_name": "hywu/Camelidae-8x7B",
+    "task": "text_text_generation",
+    "max_batch_size": 1,
+    "qtype_weight": "torch.qint8",
+    "qtype_activation": "torch.quint8",
+    "qobserver": "<class 'torch.ao.quantization.observer.MinMaxObserver'>",
+    "qscheme": "torch.per_tensor_symmetric",
+    "qconfig": "x86",
+    "group_size": 128,
+    "damp_percent": 0.1,
+    "save_load_fn": "bitsandbytes"
+  }
+}
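
Several `smash_config` values are Python literals stored as JSON strings (`"None"`, `"['llm-int8']"`), so a plain `json.loads` leaves them as text. The sketch below shows one way a reader might decode them with the standard library. `decode_field` is a hypothetical helper, and the excerpt copies only a few keys from the file above; nothing here is taken from Pruna's own loading code.

```python
import ast
import json

# A few keys copied from smash_config.json above.
SMASH_CONFIG_EXCERPT = """
{
  "smash_config": {
    "pruners": "None",
    "quantizers": "['llm-int8']",
    "weight_quantization_bits": 4,
    "save_load_fn": "bitsandbytes"
  }
}
"""


def decode_field(value):
    """Turn stringified Python literals back into objects; leave other values as-is."""
    if not isinstance(value, str):
        return value
    try:
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Ordinary strings such as "bitsandbytes" are not literals; keep them.
        return value


config = json.loads(SMASH_CONFIG_EXCERPT)["smash_config"]
decoded = {key: decode_field(value) for key, value in config.items()}
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for strings read from a repository file.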
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
+size 499723
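
The three added lines are a Git LFS pointer, not the tokenizer model itself: each line is a key followed by a space and a value. A minimal sketch of reading such a pointer follows, assuming exactly the `version`, `oid`, and `size` keys seen above. `parse_lfs_pointer` is a hypothetical helper, not a Git LFS or `huggingface_hub` API.

```python
# Pointer text copied from the tokenizer.model diff above.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
size 499723
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line, then break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer = parse_lfs_pointer(POINTER_TEXT)
```

The `size_bytes` and `digest` fields can be compared against a downloaded `tokenizer.model` to confirm that the real file, rather than the pointer, was fetched.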
tokenizer_config.json ADDED
@@ -0,0 +1,43 @@
+{
+  "add_bos_token": true,
+  "add_eos_token": false,
+  "add_prefix_space": true,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "</s>",
+  "legacy": false,
+  "model_max_length": 1000000000000000019884624838656,
+  "pad_token": null,
+  "padding_side": "right",
+  "sp_model_kwargs": {},
+  "spaces_between_special_tokens": false,
+  "tokenizer_class": "LlamaTokenizer",
+  "unk_token": "<unk>",
+  "use_default_system_prompt": true
+}
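
Two values in this config deserve a second look before batching inputs: `model_max_length` equals `int(1e30)`, which reads as a "no limit recorded" placeholder rather than a real context window, and `pad_token` is `null`. The sketch below checks both with the standard library only. `summarize_limits` and `UNSET_MAX_LENGTH` are hypothetical names, and treating the large value as a placeholder is an assumption rather than something the diff states.

```python
import json

# Keys copied from tokenizer_config.json above.
TOKENIZER_CONFIG_EXCERPT = """
{
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": null,
  "padding_side": "right"
}
"""

# Assumption: a value this large means no maximum length was recorded.
UNSET_MAX_LENGTH = int(1e30)


def summarize_limits(config):
    """Report whether the config carries a usable length limit and a pad token."""
    return {
        "has_explicit_max_length": config["model_max_length"] < UNSET_MAX_LENGTH,
        "can_pad_batches": config["pad_token"] is not None,
    }


summary = summarize_limits(json.loads(TOKENIZER_CONFIG_EXCERPT))
```

If both flags come back false, callers would likely need to pass their own truncation length and assign a pad token before padding a batch.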