Abhaykoul committed on
Commit
983f690
·
verified ·
1 Parent(s): 53a805d

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,149 @@
+ ---
+ datasets:
+ - liuhaotian/LLaVA-Pretrain
+ - liuhaotian/LLaVA-Instruct-150K
+ language:
+ - en
+ tags:
+ - llava
+ - phi
+ license: mit
+ library_name: transformers
+ widget:
+ - text: "What animal is it?"
+   src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/tiger.jpg"
+ - text: "Where is it?"
+   src: "https://huggingface.co/datasets/mishig/sample_images/resolve/main/palace.jpg"
+ ---
+
+ # Multi-crop LLaVA-3b
+
+ <a target="_blank" href="https://colab.research.google.com/drive/1W7JQrFXwFunAY1XvS31mwC7mrXBgGD_M">
+   <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+ </a>
+
+ ## Model details
+
+ The core idea behind multi-crop LLaVA (MC-LLaVA) is that instead of producing N visual token embeddings for the whole image, I generate one token embedding for each of N parts (crops) of the image.
+ Having high-quality embeddings for smaller parts of the image helps to extract more details and understand the scene better.
+
+ For every crop of the image, I generate an embedding from the full SigLIP encoder (size [1, 1152]) and then push all N embeddings through the LLaVA adapter, which
+ gives token embeddings of size [N, 2560]. Right now, the tokens do not contain explicit information about their position in the original image. I plan to add it later.
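
The shape bookkeeping described above can be sketched in a few lines. The README does not say how crops are chosen, so the regular grid used here, the helper name `grid_crop_boxes`, and the 768x768 image size are all assumptions made for illustration; only the sizes 1152 and 2560 come from the text.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_crop_boxes(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height image into rows * cols non-overlapping boxes.

    Hypothetical helper: a plain grid is only one possible cropping scheme.
    """
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = c * width // cols
            top = r * height // rows
            right = (c + 1) * width // cols
            bottom = (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes


boxes = grid_crop_boxes(768, 768, rows=2, cols=2)
n_crops = len(boxes)

# Sizes stated in the README: one SigLIP embedding per crop, one token per crop.
per_crop_embedding_shape = (1, 1152)
token_embeddings_shape = (n_crops, 2560)
```

With a 2x2 grid there are four crops, so under this assumption the language model would receive four image tokens of width 2560.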
+
+ MC-LLaVA-3b was fine-tuned from [Dolphin 2.6 Phi](https://huggingface.co/cognitivecomputations/dolphin-2_6-phi-2) using the vision tower from
+ [SigLIP 400M](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384).
+
+ The context length during training was 1200 tokens, as the L4 GPUs I used did not allow a longer context.
+
+ Like Dolphin 2.6 Phi, LLaVA-3b uses the ChatML prompt format:
+
+ ```
+ <|im_start|>system
+ You are Dolphin, a helpful AI assistant.<|im_end|>
+ <|im_start|>user
+ {prompt}<|im_end|>
+ <|im_start|>assistant
+ ```
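
A small helper can make the ChatML layout above less error-prone to type by hand. This is a minimal sketch: the function name `build_chatml_prompt` is hypothetical and not part of the repository, and the default system message is simply the one shown in the template.

```python
def build_chatml_prompt(
    user_message: str,
    system_message: str = "You are Dolphin, a helpful AI assistant.",
) -> str:
    """Assemble a ChatML prompt that ends with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


# For image inputs, the usage example in this README puts an <image> line
# before the question inside the user turn.
prompt = build_chatml_prompt("<image>\nDescribe the image.")
```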
+
+ ## How to use
+
+ **Install dependencies**
+
+ ```bash
+ pip install -q open_clip_torch timm einops
+ ```
+
+ **Download modeling files**
+
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_llava.py", local_dir="./", force_download=True)
+ hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="configuration_phi.py", local_dir="./", force_download=True)
+ hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_llava.py", local_dir="./", force_download=True)
+ hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="modeling_phi.py", local_dir="./", force_download=True)
+ hf_hub_download(repo_id="visheratin/LLaVA-3b", filename="processing_llava.py", local_dir="./", force_download=True)
+ ```
+
+ **Create a model**
+
+ ```python
+ from modeling_llava import LlavaForConditionalGeneration
+ import torch
+
+ model = LlavaForConditionalGeneration.from_pretrained("visheratin/LLaVA-3b", torch_dtype=torch.float16)
+ model = model.to("cuda")
+ ```
+
+ **Create processors**
+
+ ```python
+ from transformers import AutoTokenizer
+ from processing_llava import LlavaProcessor, OpenCLIPImageProcessor
+
+ tokenizer = AutoTokenizer.from_pretrained("visheratin/LLaVA-3b")
+ image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
+ processor = LlavaProcessor(image_processor, tokenizer)
+ ```
+
+ **Set image and text**
+
+ ```python
+ from PIL import Image
+ import requests
+
+ image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
+ raw_image = Image.open(requests.get(image_file, stream=True).raw)
+
+ prompt = """<|im_start|>system
+ A chat between a curious human and an artificial intelligence assistant.
+ The assistant gives helpful, detailed, and polite answers to the human's questions.
+ The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
+ <|im_start|>user
+ <image>
+ Describe the image.<|im_end|>
+ <|im_start|>assistant
+ """
+ ```
+
+ **Process inputs**
+
+ ```python
+ inputs = processor(prompt, raw_image, model, return_tensors='pt')
+
+ inputs['input_ids'] = inputs['input_ids'].to(model.device)
+ inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
+ ```
+
+ **Generate the data**
+
+ ```python
+ import torch
+
+ with torch.inference_mode():
+     output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.4, pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id)
+ ```
+
+ ## Benchmarks
+
+ - TextVQA - 38.59%
+ - GQA - 49.6%
+ - VQAv2 - 64.24%
+ - VizWiz - 24.88%
+ - POPE - 80.59%
+ - V*-bench - 52.25% (OCR - 46.66%, GPT4V-hard - 41.17%, direct attributes - 43.48%, relative position - 65.79%)
+
+ ## Examples
+
+ <a target="_blank" href="https://colab.research.google.com/drive/1sXDvVl5s9fTcE0N2bQGOlXhnNlKEdeun">
+   <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+ </a>
+
+ ## License
+
+ The model is licensed under the MIT license. However, since the data used for model training is largely synthetic, you should also follow the OpenAI and Google Gemini terms of service,
+ which means you should not use the model to create models that compete with theirs.
+
+ ## Acknowledgments
+
+ Thanks to [ML Collective](https://mlcollective.org/) for providing credits for computing resources.
added_tokens.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "\t\t": 50294,
+   "\t\t\t": 50293,
+   "\t\t\t\t": 50292,
+   "\t\t\t\t\t": 50291,
+   "\t\t\t\t\t\t": 50290,
+   "\t\t\t\t\t\t\t": 50289,
+   "\t\t\t\t\t\t\t\t": 50288,
+   "\t\t\t\t\t\t\t\t\t": 50287,
+   " ": 50286,
+   " ": 50285,
+   " ": 50284,
+   " ": 50283,
+   " ": 50282,
+   " ": 50281,
+   " ": 50280,
+   " ": 50279,
+   " ": 50278,
+   " ": 50277,
+   " ": 50276,
+   " ": 50275,
+   " ": 50274,
+   " ": 50273,
+   " ": 50272,
+   " ": 50271,
+   " ": 50270,
+   " ": 50269,
+   " ": 50268,
+   " ": 50267,
+   " ": 50266,
+   " ": 50265,
+   " ": 50264,
+   " ": 50263,
+   " ": 50262,
+   " ": 50261,
+   " ": 50260,
+   " ": 50259,
+   " ": 50258,
+   " ": 50257,
+   "<image>": 50297,
+   "<pad>": 50298,
+   "<|im_end|>": 50295,
+   "<|im_start|>": 50296
+ }
config.json ADDED
@@ -0,0 +1,119 @@
+ {
+   "architectures": [
+     "LlavaForConditionalGeneration"
+   ],
+   "ignore_index": -100,
+   "image_token_index": 50297,
+   "max_image_tokens": 100,
+   "model_type": "llava",
+   "projector_hidden_act": "gelu",
+   "projector_tokens_num": 5,
+   "text_config": {
+     "_name_or_path": "cognitivecomputations/dolphin-2_6-phi-2",
+     "activation_function": "gelu_new",
+     "add_cross_attention": false,
+     "architectures": [
+       "PhiForCausalLM"
+     ],
+     "attn_pdrop": 0.0,
+     "auto_map": {
+       "AutoConfig": "cognitivecomputations/dolphin-2_6-phi-2--configuration_phi.PhiConfig",
+       "AutoModelForCausalLM": "cognitivecomputations/dolphin-2_6-phi-2--modeling_phi.PhiForCausalLM"
+     },
+     "bad_words_ids": null,
+     "begin_suppress_tokens": null,
+     "bos_token_id": null,
+     "chunk_size_feed_forward": 0,
+     "cross_attention_hidden_size": null,
+     "decoder_start_token_id": null,
+     "diversity_penalty": 0.0,
+     "do_sample": false,
+     "early_stopping": false,
+     "embd_pdrop": 0.0,
+     "encoder_no_repeat_ngram_size": 0,
+     "eos_token_id": null,
+     "exponential_decay_length_penalty": null,
+     "finetuning_task": null,
+     "flash_attn": false,
+     "flash_rotary": false,
+     "forced_bos_token_id": null,
+     "forced_eos_token_id": null,
+     "fused_dense": false,
+     "id2label": {
+       "0": "LABEL_0",
+       "1": "LABEL_1"
+     },
+     "img_processor": null,
+     "initializer_range": 0.02,
+     "is_decoder": false,
+     "is_encoder_decoder": false,
+     "label2id": {
+       "LABEL_0": 0,
+       "LABEL_1": 1
+     },
+     "layer_norm_epsilon": 1e-05,
+     "length_penalty": 1.0,
+     "max_length": 20,
+     "min_length": 0,
+     "model_type": "phi-msft",
+     "n_embd": 2560,
+     "n_head": 32,
+     "n_head_kv": null,
+     "n_inner": null,
+     "n_layer": 32,
+     "n_positions": 2048,
+     "no_repeat_ngram_size": 0,
+     "num_beam_groups": 1,
+     "num_beams": 1,
+     "num_return_sequences": 1,
+     "output_attentions": false,
+     "output_hidden_states": false,
+     "output_scores": false,
+     "pad_token_id": null,
+     "prefix": null,
+     "problem_type": null,
+     "pruned_heads": {},
+     "remove_invalid_values": false,
+     "repetition_penalty": 1.0,
+     "resid_pdrop": 0.1,
+     "return_dict": true,
+     "return_dict_in_generate": false,
+     "rotary_dim": 32,
+     "sep_token_id": null,
+     "suppress_tokens": null,
+     "task_specific_params": null,
+     "temperature": 1.0,
+     "tf_legacy_loss": false,
+     "tie_encoder_decoder": false,
+     "tie_word_embeddings": false,
+     "tokenizer_class": null,
+     "top_k": 50,
+     "top_p": 1.0,
+     "torch_dtype": "float16",
+     "torchscript": false,
+     "typical_p": 1.0,
+     "use_bfloat16": false,
+     "use_cache": false,
+     "vocab_size": 51200
+   },
+   "preprocess_config": {
+     "mean": [
+       0.5,
+       0.5,
+       0.5
+     ],
+     "std": [
+       0.5,
+       0.5,
+       0.5
+     ],
+     "interpolation": "bicubic",
+     "resize_mode": "squash",
+     "size": 384
+   },
+   "torch_dtype": "float16",
+   "transformers_version": "4.36.2",
+   "vision_embed_dim": 1152,
+   "vision_tower_name": "ViT-SO400M-14-SigLIP-384",
+   "vocab_size": 51200
+ }
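
The token ids in `added_tokens.json` and the values in `config.json` have to agree for the `<image>` placeholder to be recognised. The check below is a sketch of how that could be verified; the function name `check_token_config` is hypothetical, and the dictionaries copy only the handful of values from the two files that the check needs.

```python
from typing import Dict, List

# Values copied from added_tokens.json and config.json in this commit.
added_tokens = {"<image>": 50297, "<pad>": 50298, "<|im_end|>": 50295, "<|im_start|>": 50296}
config = {
    "image_token_index": 50297,
    "vocab_size": 51200,
    "text_config": {"vocab_size": 51200},
}


def check_token_config(tokens: Dict[str, int], cfg: dict) -> List[str]:
    """Return a list of human-readable inconsistencies (empty if none are found)."""
    problems = []
    if tokens.get("<image>") != cfg["image_token_index"]:
        problems.append("<image> id does not match image_token_index")
    if max(tokens.values()) >= cfg["vocab_size"]:
        problems.append("an added token id falls outside the vocabulary")
    if cfg["vocab_size"] != cfg["text_config"]["vocab_size"]:
        problems.append("top-level and text_config vocab sizes differ")
    return problems


problems = check_token_config(added_tokens, config)
```

For the values in this commit the list comes back empty: the highest added id is 50298, which fits inside the 51200-entry vocabulary.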
configuration_llava.py ADDED
@@ -0,0 +1,41 @@
+ # coding=utf-8
+
+ from transformers.configuration_utils import PretrainedConfig
+ from open_clip import get_model_config
+ from configuration_phi import PhiConfig
+
+
+ class LlavaConfig(PretrainedConfig):
+     model_type = "llava"
+     is_composition = False
+
+     def __init__(
+         self,
+         text_config=None,
+         vision_tower_name="ViT-SO400M-14-SigLIP-384",
+         ignore_index=-100,
+         image_token_index=50297,
+         projector_hidden_act="gelu",
+         projector_tokens_num=1,
+         vocab_size=51200,
+         **kwargs,
+     ):
+         self.ignore_index = ignore_index
+         self.image_token_index = image_token_index
+         self.projector_hidden_act = projector_hidden_act
+         self.projector_tokens_num = projector_tokens_num
+         self.vocab_size = vocab_size
+
+         self.vision_tower_name = vision_tower_name
+         vision_config = get_model_config(vision_tower_name)
+         self.vision_embed_dim = vision_config["embed_dim"]
+
+         self.vocab_size = self.vocab_size
+
+         self.text_config = text_config
+         if isinstance(self.text_config, dict):
+             text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
+             self.text_config = PhiConfig(**text_config)
+             self.vocab_size = self.text_config.vocab_size
+
+         super().__init__(**kwargs)
configuration_phi.py ADDED
@@ -0,0 +1,62 @@
+ # Copyright (c) Microsoft Corporation.
+ # Licensed under the MIT license.
+
+ import math
+ from typing import Optional
+
+ from transformers import PretrainedConfig
+
+
+ class PhiConfig(PretrainedConfig):
+     """Phi configuration."""
+
+     model_type = "phi-msft"
+     attribute_map = {
+         "max_position_embeddings": "n_positions",
+         "hidden_size": "n_embd",
+         "num_attention_heads": "n_head",
+         "num_hidden_layers": "n_layer",
+     }
+
+     def __init__(
+         self,
+         vocab_size: int = 51200,
+         n_positions: int = 2048,
+         n_embd: int = 1024,
+         n_layer: int = 20,
+         n_inner: Optional[int] = None,
+         n_head: int = 16,
+         n_head_kv: Optional[int] = None,
+         rotary_dim: Optional[int] = 32,
+         activation_function: Optional[str] = "gelu_new",
+         flash_attn: bool = False,
+         flash_rotary: bool = False,
+         fused_dense: bool = False,
+         attn_pdrop: float = 0.0,
+         embd_pdrop: float = 0.0,
+         resid_pdrop: float = 0.0,
+         layer_norm_epsilon: float = 1e-5,
+         initializer_range: float = 0.02,
+         tie_word_embeddings: bool = False,
+         pad_vocab_size_multiple: int = 64,
+         **kwargs
+     ) -> None:
+         self.vocab_size = int(math.ceil(vocab_size / pad_vocab_size_multiple) * pad_vocab_size_multiple)
+         self.n_positions = n_positions
+         self.n_embd = n_embd
+         self.n_layer = n_layer
+         self.n_inner = n_inner
+         self.n_head = n_head
+         self.n_head_kv = n_head_kv
+         self.rotary_dim = min(rotary_dim, n_embd // n_head)
+         self.activation_function = activation_function
+         self.flash_attn = flash_attn
+         self.flash_rotary = flash_rotary
+         self.fused_dense = fused_dense
+         self.attn_pdrop = attn_pdrop
+         self.embd_pdrop = embd_pdrop
+         self.resid_pdrop = resid_pdrop
+         self.layer_norm_epsilon = layer_norm_epsilon
+         self.initializer_range = initializer_range
+
+         super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
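
Two values in `PhiConfig.__init__` are derived rather than stored as given: the vocabulary size is rounded up to a multiple of `pad_vocab_size_multiple`, and `rotary_dim` is capped at the per-head dimension. The sketch below reproduces just that arithmetic as standalone functions; the function names are hypothetical, and the example inputs other than those from `config.json` are made up for illustration.

```python
import math


def padded_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the next multiple, as PhiConfig does."""
    return int(math.ceil(vocab_size / multiple) * multiple)


def effective_rotary_dim(rotary_dim: int, n_embd: int, n_head: int) -> int:
    """Cap rotary_dim at the per-head dimension n_embd // n_head."""
    return min(rotary_dim, n_embd // n_head)


# Values from config.json in this commit: 51200 is already a multiple of 64,
# and the head dimension 2560 // 32 = 80 is larger than rotary_dim = 32.
vocab = padded_vocab_size(51200)
rotary = effective_rotary_dim(32, n_embd=2560, n_head=32)

# Illustrative inputs where the derived values differ from the inputs.
vocab_rounded = padded_vocab_size(50299)
rotary_capped = effective_rotary_dim(32, n_embd=1024, n_head=64)
```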
convert_model.py ADDED
@@ -0,0 +1,102 @@
+ # Copyright 2023 The HuggingFace Inc. team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ import argparse
+
+ import torch
+
+ from transformers import (
+     AddedToken,
+     AutoConfig,
+     AutoTokenizer,
+ )
+ from configuration_llava import LlavaConfig
+ from modeling_llava import LlavaForConditionalGeneration
+
+
+ KEYS_TO_MODIFY_MAPPING = {
+     "transformer.vision_tower.vision_tower": "vision_model",
+     "transformer.mm_projector": "multi_modal_projector",
+     "transformer": "language_model.transformer",
+     "lm_head": "language_model.lm_head",
+     "model.model": "language_model.transformer",
+     "multi_modal_projector.0": "multi_modal_projector.linear_1",
+     "multi_modal_projector.2": "multi_modal_projector.linear_2",
+ }
+
+
+ def convert_state_dict_to_hf(state_dict):
+     new_state_dict = {}
+     for key, value in state_dict.items():
+         for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():
+             if key_to_modify in key:
+                 key = key.replace(key_to_modify, new_key)
+
+         new_state_dict[key] = value
+     return new_state_dict
+
+
+ def convert_llava_llama_to_hf(text_model_id, vision_model_id, projector_tokens_num, output_path, old_state_dict_path):
+     torch.set_default_dtype(torch.float16)
+     text_config = AutoConfig.from_pretrained(text_model_id, trust_remote_code=True)
+
+     tokenizer = AutoTokenizer.from_pretrained(text_model_id)
+     tokenizer.add_tokens(AddedToken("<image>", special=True, normalized=False), special_tokens=True)
+     tokenizer.add_special_tokens({"pad_token": "<pad>"})
+
+     config = LlavaConfig(text_config=text_config, vocab_size=51200, vision_tower_name=vision_model_id, projector_tokens_num=projector_tokens_num)
+     config.text_config.vocab_size = config.vocab_size
+
+     with torch.device("cuda"):
+         model = LlavaForConditionalGeneration(config)
+
+     state_dict = torch.load(old_state_dict_path, map_location="cpu")
+     state_dict = convert_state_dict_to_hf(state_dict)
+     model.load_state_dict(state_dict, strict=True, assign=True)
+
+     model.config.vocab_size = model.config.vocab_size
+     model.config.text_config.vocab_size = model.config.text_config.vocab_size
+
+     model.save_pretrained(output_path)
+     tokenizer.save_pretrained(output_path)
+
+
+ def main():
+     parser = argparse.ArgumentParser()
+     parser.add_argument(
+         "--text_model_id",
+         help="Hub location of the text model",
+     )
+     parser.add_argument(
+         "--vision_model_id",
+         help="Hub location of the vision model",
+     )
+     parser.add_argument(
+         "--output_path",
+         help="Location of the converted model",
+     )
+     parser.add_argument(
+         "--old_state_dict_path",
+         help="Location on the hub of the raw state dict of the original model. The filename needs to be `model_state_dict.bin`",
+     )
+     parser.add_argument(
+         "--tokens_num",
+         type=int,
+         default=1
+     )
+     args = parser.parse_args()
+     convert_llava_llama_to_hf(args.text_model_id, args.vision_model_id, args.tokens_num, args.output_path, args.old_state_dict_path)
+
+
+ if __name__ == "__main__":
+     main()
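
The key renaming in `convert_state_dict_to_hf` depends on the order of `KEYS_TO_MODIFY_MAPPING`, because every rule is tried in turn against the already-rewritten key. The standalone sketch below copies the mapping and the loop from the script and runs them on a few sample keys; the sample key names and placeholder values are assumptions chosen for illustration, not taken from a real checkpoint.

```python
KEYS_TO_MODIFY_MAPPING = {
    "transformer.vision_tower.vision_tower": "vision_model",
    "transformer.mm_projector": "multi_modal_projector",
    "transformer": "language_model.transformer",
    "lm_head": "language_model.lm_head",
    "model.model": "language_model.transformer",
    "multi_modal_projector.0": "multi_modal_projector.linear_1",
    "multi_modal_projector.2": "multi_modal_projector.linear_2",
}


def convert_state_dict_to_hf(state_dict):
    """Apply every matching substring replacement, in mapping order, to each key."""
    new_state_dict = {}
    for key, value in state_dict.items():
        for key_to_modify, new_key in KEYS_TO_MODIFY_MAPPING.items():
            if key_to_modify in key:
                key = key.replace(key_to_modify, new_key)
        new_state_dict[key] = value
    return new_state_dict


# Hypothetical source keys; the values stand in for tensors.
sample = {
    "transformer.h.0.ln.weight": 0,
    "transformer.mm_projector.0.weight": 1,
    "transformer.mm_projector.2.bias": 2,
    "lm_head.linear.weight": 3,
    "transformer.vision_tower.vision_tower.proj": 4,
}
converted = convert_state_dict_to_hf(sample)
```

The projector keys are rewritten twice: first `transformer.mm_projector` becomes `multi_modal_projector`, then the numeric suffixes `.0` and `.2` become `linear_1` and `linear_2`. The more specific vision-tower and projector rules come before the generic `transformer` rule, so those keys are not prefixed with `language_model`.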
generation_config.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "_from_model_config": true,
+   "transformers_version": "4.36.2",
+   "use_cache": false
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4a26ade091346f7adf46838555315d0ae89b3e704485be64f5cd6490f17f1f73
+ size 4989958040
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3bf488b78e5b0b9929d4dfa3afb8760759fb610ac8ae7835797f3dc5670272f0
+ size 1520997992
model.safetensors.index.json ADDED
@@ -0,0 +1,678 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 6773041280
4
+ },
5
+ "weight_map": {
6
+ "language_model.lm_head.linear.bias": "model-00002-of-00002.safetensors",
7
+ "language_model.lm_head.linear.weight": "model-00002-of-00002.safetensors",
8
+ "language_model.lm_head.ln.bias": "model-00002-of-00002.safetensors",
9
+ "language_model.lm_head.ln.weight": "model-00002-of-00002.safetensors",
10
+ "language_model.transformer.embd.wte.weight": "model-00001-of-00002.safetensors",
11
+ "language_model.transformer.h.0.ln.bias": "model-00001-of-00002.safetensors",
12
+ "language_model.transformer.h.0.ln.weight": "model-00001-of-00002.safetensors",
13
+ "language_model.transformer.h.0.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
14
+ "language_model.transformer.h.0.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
15
+ "language_model.transformer.h.0.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
16
+ "language_model.transformer.h.0.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
17
+ "language_model.transformer.h.0.mlp.fc1.bias": "model-00001-of-00002.safetensors",
18
+ "language_model.transformer.h.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
19
+ "language_model.transformer.h.0.mlp.fc2.bias": "model-00001-of-00002.safetensors",
20
+ "language_model.transformer.h.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
21
+ "language_model.transformer.h.1.ln.bias": "model-00001-of-00002.safetensors",
22
+ "language_model.transformer.h.1.ln.weight": "model-00001-of-00002.safetensors",
23
+ "language_model.transformer.h.1.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
24
+ "language_model.transformer.h.1.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
25
+ "language_model.transformer.h.1.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
26
+ "language_model.transformer.h.1.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
27
+ "language_model.transformer.h.1.mlp.fc1.bias": "model-00001-of-00002.safetensors",
28
+ "language_model.transformer.h.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
29
+ "language_model.transformer.h.1.mlp.fc2.bias": "model-00001-of-00002.safetensors",
30
+ "language_model.transformer.h.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
31
+ "language_model.transformer.h.10.ln.bias": "model-00001-of-00002.safetensors",
32
+ "language_model.transformer.h.10.ln.weight": "model-00001-of-00002.safetensors",
33
+ "language_model.transformer.h.10.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
34
+ "language_model.transformer.h.10.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
35
+ "language_model.transformer.h.10.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
36
+ "language_model.transformer.h.10.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
37
+ "language_model.transformer.h.10.mlp.fc1.bias": "model-00001-of-00002.safetensors",
38
+ "language_model.transformer.h.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
39
+ "language_model.transformer.h.10.mlp.fc2.bias": "model-00001-of-00002.safetensors",
40
+ "language_model.transformer.h.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
41
+ "language_model.transformer.h.11.ln.bias": "model-00001-of-00002.safetensors",
42
+ "language_model.transformer.h.11.ln.weight": "model-00001-of-00002.safetensors",
43
+ "language_model.transformer.h.11.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
44
+ "language_model.transformer.h.11.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
45
+ "language_model.transformer.h.11.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
46
+ "language_model.transformer.h.11.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
47
+ "language_model.transformer.h.11.mlp.fc1.bias": "model-00001-of-00002.safetensors",
48
+ "language_model.transformer.h.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
49
+ "language_model.transformer.h.11.mlp.fc2.bias": "model-00001-of-00002.safetensors",
50
+ "language_model.transformer.h.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
51
+ "language_model.transformer.h.12.ln.bias": "model-00001-of-00002.safetensors",
52
+ "language_model.transformer.h.12.ln.weight": "model-00001-of-00002.safetensors",
53
+ "language_model.transformer.h.12.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
54
+ "language_model.transformer.h.12.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
55
+ "language_model.transformer.h.12.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
56
+ "language_model.transformer.h.12.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
57
+ "language_model.transformer.h.12.mlp.fc1.bias": "model-00001-of-00002.safetensors",
58
+ "language_model.transformer.h.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
59
+ "language_model.transformer.h.12.mlp.fc2.bias": "model-00001-of-00002.safetensors",
60
+ "language_model.transformer.h.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
61
+ "language_model.transformer.h.13.ln.bias": "model-00001-of-00002.safetensors",
62
+ "language_model.transformer.h.13.ln.weight": "model-00001-of-00002.safetensors",
63
+ "language_model.transformer.h.13.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
64
+ "language_model.transformer.h.13.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
65
+ "language_model.transformer.h.13.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
66
+ "language_model.transformer.h.13.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
67
+ "language_model.transformer.h.13.mlp.fc1.bias": "model-00001-of-00002.safetensors",
68
+ "language_model.transformer.h.13.mlp.fc1.weight": "model-00001-of-00002.safetensors",
69
+ "language_model.transformer.h.13.mlp.fc2.bias": "model-00001-of-00002.safetensors",
70
+ "language_model.transformer.h.13.mlp.fc2.weight": "model-00001-of-00002.safetensors",
71
+ "language_model.transformer.h.14.ln.bias": "model-00001-of-00002.safetensors",
72
+ "language_model.transformer.h.14.ln.weight": "model-00001-of-00002.safetensors",
73
+ "language_model.transformer.h.14.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
74
+ "language_model.transformer.h.14.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
75
+ "language_model.transformer.h.14.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
76
+ "language_model.transformer.h.14.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
77
+ "language_model.transformer.h.14.mlp.fc1.bias": "model-00001-of-00002.safetensors",
78
+ "language_model.transformer.h.14.mlp.fc1.weight": "model-00001-of-00002.safetensors",
79
+ "language_model.transformer.h.14.mlp.fc2.bias": "model-00001-of-00002.safetensors",
80
+ "language_model.transformer.h.14.mlp.fc2.weight": "model-00001-of-00002.safetensors",
81
+ "language_model.transformer.h.15.ln.bias": "model-00001-of-00002.safetensors",
82
+ "language_model.transformer.h.15.ln.weight": "model-00001-of-00002.safetensors",
83
+ "language_model.transformer.h.15.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
84
+ "language_model.transformer.h.15.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
85
+ "language_model.transformer.h.15.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
86
+ "language_model.transformer.h.15.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
87
+ "language_model.transformer.h.15.mlp.fc1.bias": "model-00001-of-00002.safetensors",
88
+ "language_model.transformer.h.15.mlp.fc1.weight": "model-00001-of-00002.safetensors",
89
+ "language_model.transformer.h.15.mlp.fc2.bias": "model-00001-of-00002.safetensors",
90
+ "language_model.transformer.h.15.mlp.fc2.weight": "model-00001-of-00002.safetensors",
91
+ "language_model.transformer.h.16.ln.bias": "model-00001-of-00002.safetensors",
92
+ "language_model.transformer.h.16.ln.weight": "model-00001-of-00002.safetensors",
93
+ "language_model.transformer.h.16.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
94
+ "language_model.transformer.h.16.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
95
+ "language_model.transformer.h.16.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
96
+ "language_model.transformer.h.16.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
97
+ "language_model.transformer.h.16.mlp.fc1.bias": "model-00001-of-00002.safetensors",
98
+ "language_model.transformer.h.16.mlp.fc1.weight": "model-00001-of-00002.safetensors",
99
+ "language_model.transformer.h.16.mlp.fc2.bias": "model-00001-of-00002.safetensors",
100
+ "language_model.transformer.h.16.mlp.fc2.weight": "model-00001-of-00002.safetensors",
101
+ "language_model.transformer.h.17.ln.bias": "model-00001-of-00002.safetensors",
102
+ "language_model.transformer.h.17.ln.weight": "model-00001-of-00002.safetensors",
103
+ "language_model.transformer.h.17.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
104
+ "language_model.transformer.h.17.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
105
+ "language_model.transformer.h.17.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
106
+ "language_model.transformer.h.17.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
107
+ "language_model.transformer.h.17.mlp.fc1.bias": "model-00001-of-00002.safetensors",
108
+ "language_model.transformer.h.17.mlp.fc1.weight": "model-00001-of-00002.safetensors",
109
+ "language_model.transformer.h.17.mlp.fc2.bias": "model-00001-of-00002.safetensors",
110
+ "language_model.transformer.h.17.mlp.fc2.weight": "model-00001-of-00002.safetensors",
111
+ "language_model.transformer.h.18.ln.bias": "model-00001-of-00002.safetensors",
112
+ "language_model.transformer.h.18.ln.weight": "model-00001-of-00002.safetensors",
113
+ "language_model.transformer.h.18.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
114
+ "language_model.transformer.h.18.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
115
+ "language_model.transformer.h.18.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
116
+ "language_model.transformer.h.18.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
117
+ "language_model.transformer.h.18.mlp.fc1.bias": "model-00001-of-00002.safetensors",
118
+ "language_model.transformer.h.18.mlp.fc1.weight": "model-00001-of-00002.safetensors",
119
+ "language_model.transformer.h.18.mlp.fc2.bias": "model-00001-of-00002.safetensors",
120
+ "language_model.transformer.h.18.mlp.fc2.weight": "model-00001-of-00002.safetensors",
121
+ "language_model.transformer.h.19.ln.bias": "model-00001-of-00002.safetensors",
122
+ "language_model.transformer.h.19.ln.weight": "model-00001-of-00002.safetensors",
123
+ "language_model.transformer.h.19.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
124
+ "language_model.transformer.h.19.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
125
+ "language_model.transformer.h.19.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
126
+ "language_model.transformer.h.19.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
127
+ "language_model.transformer.h.19.mlp.fc1.bias": "model-00001-of-00002.safetensors",
128
+ "language_model.transformer.h.19.mlp.fc1.weight": "model-00001-of-00002.safetensors",
129
+ "language_model.transformer.h.19.mlp.fc2.bias": "model-00001-of-00002.safetensors",
130
+ "language_model.transformer.h.19.mlp.fc2.weight": "model-00001-of-00002.safetensors",
131
+ "language_model.transformer.h.2.ln.bias": "model-00001-of-00002.safetensors",
132
+ "language_model.transformer.h.2.ln.weight": "model-00001-of-00002.safetensors",
133
+ "language_model.transformer.h.2.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
134
+ "language_model.transformer.h.2.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
135
+ "language_model.transformer.h.2.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
136
+ "language_model.transformer.h.2.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
137
+ "language_model.transformer.h.2.mlp.fc1.bias": "model-00001-of-00002.safetensors",
138
+ "language_model.transformer.h.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
139
+ "language_model.transformer.h.2.mlp.fc2.bias": "model-00001-of-00002.safetensors",
140
+ "language_model.transformer.h.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
141
+ "language_model.transformer.h.20.ln.bias": "model-00001-of-00002.safetensors",
142
+ "language_model.transformer.h.20.ln.weight": "model-00001-of-00002.safetensors",
143
+ "language_model.transformer.h.20.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
144
+ "language_model.transformer.h.20.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
145
+ "language_model.transformer.h.20.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
146
+ "language_model.transformer.h.20.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
147
+ "language_model.transformer.h.20.mlp.fc1.bias": "model-00001-of-00002.safetensors",
148
+ "language_model.transformer.h.20.mlp.fc1.weight": "model-00001-of-00002.safetensors",
149
+ "language_model.transformer.h.20.mlp.fc2.bias": "model-00001-of-00002.safetensors",
150
+ "language_model.transformer.h.20.mlp.fc2.weight": "model-00001-of-00002.safetensors",
151
+ "language_model.transformer.h.21.ln.bias": "model-00001-of-00002.safetensors",
152
+ "language_model.transformer.h.21.ln.weight": "model-00001-of-00002.safetensors",
153
+ "language_model.transformer.h.21.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
154
+ "language_model.transformer.h.21.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
155
+ "language_model.transformer.h.21.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
156
+ "language_model.transformer.h.21.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
157
+ "language_model.transformer.h.21.mlp.fc1.bias": "model-00001-of-00002.safetensors",
158
+ "language_model.transformer.h.21.mlp.fc1.weight": "model-00001-of-00002.safetensors",
159
+ "language_model.transformer.h.21.mlp.fc2.bias": "model-00001-of-00002.safetensors",
160
+ "language_model.transformer.h.21.mlp.fc2.weight": "model-00001-of-00002.safetensors",
161
+ "language_model.transformer.h.22.ln.bias": "model-00001-of-00002.safetensors",
162
+ "language_model.transformer.h.22.ln.weight": "model-00001-of-00002.safetensors",
163
+ "language_model.transformer.h.22.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
164
+ "language_model.transformer.h.22.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
165
+ "language_model.transformer.h.22.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
166
+ "language_model.transformer.h.22.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
167
+ "language_model.transformer.h.22.mlp.fc1.bias": "model-00002-of-00002.safetensors",
168
+ "language_model.transformer.h.22.mlp.fc1.weight": "model-00002-of-00002.safetensors",
169
+ "language_model.transformer.h.22.mlp.fc2.bias": "model-00002-of-00002.safetensors",
170
+ "language_model.transformer.h.22.mlp.fc2.weight": "model-00002-of-00002.safetensors",
171
+ "language_model.transformer.h.23.ln.bias": "model-00002-of-00002.safetensors",
172
+ "language_model.transformer.h.23.ln.weight": "model-00002-of-00002.safetensors",
173
+ "language_model.transformer.h.23.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
174
+ "language_model.transformer.h.23.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
175
+ "language_model.transformer.h.23.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
176
+ "language_model.transformer.h.23.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
177
+ "language_model.transformer.h.23.mlp.fc1.bias": "model-00002-of-00002.safetensors",
178
+ "language_model.transformer.h.23.mlp.fc1.weight": "model-00002-of-00002.safetensors",
179
+ "language_model.transformer.h.23.mlp.fc2.bias": "model-00002-of-00002.safetensors",
180
+ "language_model.transformer.h.23.mlp.fc2.weight": "model-00002-of-00002.safetensors",
181
+ "language_model.transformer.h.24.ln.bias": "model-00002-of-00002.safetensors",
182
+ "language_model.transformer.h.24.ln.weight": "model-00002-of-00002.safetensors",
183
+ "language_model.transformer.h.24.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
184
+ "language_model.transformer.h.24.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
185
+ "language_model.transformer.h.24.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
186
+ "language_model.transformer.h.24.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
187
+ "language_model.transformer.h.24.mlp.fc1.bias": "model-00002-of-00002.safetensors",
188
+ "language_model.transformer.h.24.mlp.fc1.weight": "model-00002-of-00002.safetensors",
189
+ "language_model.transformer.h.24.mlp.fc2.bias": "model-00002-of-00002.safetensors",
190
+ "language_model.transformer.h.24.mlp.fc2.weight": "model-00002-of-00002.safetensors",
191
+ "language_model.transformer.h.25.ln.bias": "model-00002-of-00002.safetensors",
192
+ "language_model.transformer.h.25.ln.weight": "model-00002-of-00002.safetensors",
193
+ "language_model.transformer.h.25.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
194
+ "language_model.transformer.h.25.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
195
+ "language_model.transformer.h.25.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
196
+ "language_model.transformer.h.25.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
197
+ "language_model.transformer.h.25.mlp.fc1.bias": "model-00002-of-00002.safetensors",
198
+ "language_model.transformer.h.25.mlp.fc1.weight": "model-00002-of-00002.safetensors",
199
+ "language_model.transformer.h.25.mlp.fc2.bias": "model-00002-of-00002.safetensors",
200
+ "language_model.transformer.h.25.mlp.fc2.weight": "model-00002-of-00002.safetensors",
201
+ "language_model.transformer.h.26.ln.bias": "model-00002-of-00002.safetensors",
202
+ "language_model.transformer.h.26.ln.weight": "model-00002-of-00002.safetensors",
203
+ "language_model.transformer.h.26.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
204
+ "language_model.transformer.h.26.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
205
+ "language_model.transformer.h.26.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
206
+ "language_model.transformer.h.26.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
207
+ "language_model.transformer.h.26.mlp.fc1.bias": "model-00002-of-00002.safetensors",
208
+ "language_model.transformer.h.26.mlp.fc1.weight": "model-00002-of-00002.safetensors",
209
+ "language_model.transformer.h.26.mlp.fc2.bias": "model-00002-of-00002.safetensors",
210
+ "language_model.transformer.h.26.mlp.fc2.weight": "model-00002-of-00002.safetensors",
211
+ "language_model.transformer.h.27.ln.bias": "model-00002-of-00002.safetensors",
212
+ "language_model.transformer.h.27.ln.weight": "model-00002-of-00002.safetensors",
213
+ "language_model.transformer.h.27.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
214
+ "language_model.transformer.h.27.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
215
+ "language_model.transformer.h.27.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
216
+ "language_model.transformer.h.27.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
217
+ "language_model.transformer.h.27.mlp.fc1.bias": "model-00002-of-00002.safetensors",
218
+ "language_model.transformer.h.27.mlp.fc1.weight": "model-00002-of-00002.safetensors",
219
+ "language_model.transformer.h.27.mlp.fc2.bias": "model-00002-of-00002.safetensors",
220
+ "language_model.transformer.h.27.mlp.fc2.weight": "model-00002-of-00002.safetensors",
221
+ "language_model.transformer.h.28.ln.bias": "model-00002-of-00002.safetensors",
222
+ "language_model.transformer.h.28.ln.weight": "model-00002-of-00002.safetensors",
223
+ "language_model.transformer.h.28.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
224
+ "language_model.transformer.h.28.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
225
+ "language_model.transformer.h.28.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
226
+ "language_model.transformer.h.28.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
227
+ "language_model.transformer.h.28.mlp.fc1.bias": "model-00002-of-00002.safetensors",
228
+ "language_model.transformer.h.28.mlp.fc1.weight": "model-00002-of-00002.safetensors",
229
+ "language_model.transformer.h.28.mlp.fc2.bias": "model-00002-of-00002.safetensors",
230
+ "language_model.transformer.h.28.mlp.fc2.weight": "model-00002-of-00002.safetensors",
231
+ "language_model.transformer.h.29.ln.bias": "model-00002-of-00002.safetensors",
232
+ "language_model.transformer.h.29.ln.weight": "model-00002-of-00002.safetensors",
233
+ "language_model.transformer.h.29.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
234
+ "language_model.transformer.h.29.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
235
+ "language_model.transformer.h.29.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
236
+ "language_model.transformer.h.29.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
237
+ "language_model.transformer.h.29.mlp.fc1.bias": "model-00002-of-00002.safetensors",
238
+ "language_model.transformer.h.29.mlp.fc1.weight": "model-00002-of-00002.safetensors",
239
+ "language_model.transformer.h.29.mlp.fc2.bias": "model-00002-of-00002.safetensors",
240
+ "language_model.transformer.h.29.mlp.fc2.weight": "model-00002-of-00002.safetensors",
241
+ "language_model.transformer.h.3.ln.bias": "model-00001-of-00002.safetensors",
242
+ "language_model.transformer.h.3.ln.weight": "model-00001-of-00002.safetensors",
243
+ "language_model.transformer.h.3.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
244
+ "language_model.transformer.h.3.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
245
+ "language_model.transformer.h.3.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
246
+ "language_model.transformer.h.3.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
247
+ "language_model.transformer.h.3.mlp.fc1.bias": "model-00001-of-00002.safetensors",
248
+ "language_model.transformer.h.3.mlp.fc1.weight": "model-00001-of-00002.safetensors",
249
+ "language_model.transformer.h.3.mlp.fc2.bias": "model-00001-of-00002.safetensors",
250
+ "language_model.transformer.h.3.mlp.fc2.weight": "model-00001-of-00002.safetensors",
251
+ "language_model.transformer.h.30.ln.bias": "model-00002-of-00002.safetensors",
252
+ "language_model.transformer.h.30.ln.weight": "model-00002-of-00002.safetensors",
253
+ "language_model.transformer.h.30.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
254
+ "language_model.transformer.h.30.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
255
+ "language_model.transformer.h.30.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
256
+ "language_model.transformer.h.30.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
257
+ "language_model.transformer.h.30.mlp.fc1.bias": "model-00002-of-00002.safetensors",
258
+ "language_model.transformer.h.30.mlp.fc1.weight": "model-00002-of-00002.safetensors",
259
+ "language_model.transformer.h.30.mlp.fc2.bias": "model-00002-of-00002.safetensors",
260
+ "language_model.transformer.h.30.mlp.fc2.weight": "model-00002-of-00002.safetensors",
261
+ "language_model.transformer.h.31.ln.bias": "model-00002-of-00002.safetensors",
262
+ "language_model.transformer.h.31.ln.weight": "model-00002-of-00002.safetensors",
263
+ "language_model.transformer.h.31.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
264
+ "language_model.transformer.h.31.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
265
+ "language_model.transformer.h.31.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
266
+ "language_model.transformer.h.31.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
267
+ "language_model.transformer.h.31.mlp.fc1.bias": "model-00002-of-00002.safetensors",
268
+ "language_model.transformer.h.31.mlp.fc1.weight": "model-00002-of-00002.safetensors",
269
+ "language_model.transformer.h.31.mlp.fc2.bias": "model-00002-of-00002.safetensors",
270
+ "language_model.transformer.h.31.mlp.fc2.weight": "model-00002-of-00002.safetensors",
271
+ "language_model.transformer.h.4.ln.bias": "model-00001-of-00002.safetensors",
272
+ "language_model.transformer.h.4.ln.weight": "model-00001-of-00002.safetensors",
273
+ "language_model.transformer.h.4.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
274
+ "language_model.transformer.h.4.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
275
+ "language_model.transformer.h.4.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
276
+ "language_model.transformer.h.4.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
277
+ "language_model.transformer.h.4.mlp.fc1.bias": "model-00001-of-00002.safetensors",
278
+ "language_model.transformer.h.4.mlp.fc1.weight": "model-00001-of-00002.safetensors",
279
+ "language_model.transformer.h.4.mlp.fc2.bias": "model-00001-of-00002.safetensors",
280
+ "language_model.transformer.h.4.mlp.fc2.weight": "model-00001-of-00002.safetensors",
281
+ "language_model.transformer.h.5.ln.bias": "model-00001-of-00002.safetensors",
282
+ "language_model.transformer.h.5.ln.weight": "model-00001-of-00002.safetensors",
283
+ "language_model.transformer.h.5.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
284
+ "language_model.transformer.h.5.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
285
+ "language_model.transformer.h.5.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
286
+ "language_model.transformer.h.5.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
287
+ "language_model.transformer.h.5.mlp.fc1.bias": "model-00001-of-00002.safetensors",
288
+ "language_model.transformer.h.5.mlp.fc1.weight": "model-00001-of-00002.safetensors",
289
+ "language_model.transformer.h.5.mlp.fc2.bias": "model-00001-of-00002.safetensors",
290
+ "language_model.transformer.h.5.mlp.fc2.weight": "model-00001-of-00002.safetensors",
291
+ "language_model.transformer.h.6.ln.bias": "model-00001-of-00002.safetensors",
292
+ "language_model.transformer.h.6.ln.weight": "model-00001-of-00002.safetensors",
293
+ "language_model.transformer.h.6.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
294
+ "language_model.transformer.h.6.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
295
+ "language_model.transformer.h.6.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
296
+ "language_model.transformer.h.6.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
297
+ "language_model.transformer.h.6.mlp.fc1.bias": "model-00001-of-00002.safetensors",
298
+ "language_model.transformer.h.6.mlp.fc1.weight": "model-00001-of-00002.safetensors",
299
+ "language_model.transformer.h.6.mlp.fc2.bias": "model-00001-of-00002.safetensors",
300
+ "language_model.transformer.h.6.mlp.fc2.weight": "model-00001-of-00002.safetensors",
301
+ "language_model.transformer.h.7.ln.bias": "model-00001-of-00002.safetensors",
302
+ "language_model.transformer.h.7.ln.weight": "model-00001-of-00002.safetensors",
303
+ "language_model.transformer.h.7.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
304
+ "language_model.transformer.h.7.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
305
+ "language_model.transformer.h.7.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
306
+ "language_model.transformer.h.7.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
307
+ "language_model.transformer.h.7.mlp.fc1.bias": "model-00001-of-00002.safetensors",
308
+ "language_model.transformer.h.7.mlp.fc1.weight": "model-00001-of-00002.safetensors",
309
+ "language_model.transformer.h.7.mlp.fc2.bias": "model-00001-of-00002.safetensors",
310
+ "language_model.transformer.h.7.mlp.fc2.weight": "model-00001-of-00002.safetensors",
311
+ "language_model.transformer.h.8.ln.bias": "model-00001-of-00002.safetensors",
312
+ "language_model.transformer.h.8.ln.weight": "model-00001-of-00002.safetensors",
313
+ "language_model.transformer.h.8.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
314
+ "language_model.transformer.h.8.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
315
+ "language_model.transformer.h.8.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
316
+ "language_model.transformer.h.8.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
317
+ "language_model.transformer.h.8.mlp.fc1.bias": "model-00001-of-00002.safetensors",
318
+ "language_model.transformer.h.8.mlp.fc1.weight": "model-00001-of-00002.safetensors",
319
+ "language_model.transformer.h.8.mlp.fc2.bias": "model-00001-of-00002.safetensors",
320
+ "language_model.transformer.h.8.mlp.fc2.weight": "model-00001-of-00002.safetensors",
321
+ "language_model.transformer.h.9.ln.bias": "model-00001-of-00002.safetensors",
322
+ "language_model.transformer.h.9.ln.weight": "model-00001-of-00002.safetensors",
323
+ "language_model.transformer.h.9.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
324
+ "language_model.transformer.h.9.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
325
+ "language_model.transformer.h.9.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
326
+ "language_model.transformer.h.9.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
327
+ "language_model.transformer.h.9.mlp.fc1.bias": "model-00001-of-00002.safetensors",
328
+ "language_model.transformer.h.9.mlp.fc1.weight": "model-00001-of-00002.safetensors",
329
+ "language_model.transformer.h.9.mlp.fc2.bias": "model-00001-of-00002.safetensors",
330
+ "language_model.transformer.h.9.mlp.fc2.weight": "model-00001-of-00002.safetensors",
331
+ "multi_modal_projector.linear_1.bias": "model-00001-of-00002.safetensors",
332
+ "multi_modal_projector.linear_1.weight": "model-00001-of-00002.safetensors",
333
+ "multi_modal_projector.linear_2.bias": "model-00001-of-00002.safetensors",
334
+ "multi_modal_projector.linear_2.weight": "model-00001-of-00002.safetensors",
335
+ "vision_model.trunk.attn_pool.kv.bias": "model-00001-of-00002.safetensors",
336
+ "vision_model.trunk.attn_pool.kv.weight": "model-00001-of-00002.safetensors",
337
+ "vision_model.trunk.attn_pool.latent": "model-00001-of-00002.safetensors",
338
+ "vision_model.trunk.attn_pool.mlp.fc1.bias": "model-00001-of-00002.safetensors",
339
+ "vision_model.trunk.attn_pool.mlp.fc1.weight": "model-00001-of-00002.safetensors",
340
+ "vision_model.trunk.attn_pool.mlp.fc2.bias": "model-00001-of-00002.safetensors",
341
+ "vision_model.trunk.attn_pool.mlp.fc2.weight": "model-00001-of-00002.safetensors",
342
+ "vision_model.trunk.attn_pool.norm.bias": "model-00001-of-00002.safetensors",
343
+ "vision_model.trunk.attn_pool.norm.weight": "model-00001-of-00002.safetensors",
344
+ "vision_model.trunk.attn_pool.proj.bias": "model-00001-of-00002.safetensors",
345
+ "vision_model.trunk.attn_pool.proj.weight": "model-00001-of-00002.safetensors",
346
+ "vision_model.trunk.attn_pool.q.bias": "model-00001-of-00002.safetensors",
347
+ "vision_model.trunk.attn_pool.q.weight": "model-00001-of-00002.safetensors",
348
+ "vision_model.trunk.blocks.0.attn.proj.bias": "model-00001-of-00002.safetensors",
349
+ "vision_model.trunk.blocks.0.attn.proj.weight": "model-00001-of-00002.safetensors",
350
+ "vision_model.trunk.blocks.0.attn.qkv.bias": "model-00001-of-00002.safetensors",
351
+ "vision_model.trunk.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
352
+ "vision_model.trunk.blocks.0.mlp.fc1.bias": "model-00001-of-00002.safetensors",
353
+ "vision_model.trunk.blocks.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
354
+ "vision_model.trunk.blocks.0.mlp.fc2.bias": "model-00001-of-00002.safetensors",
355
+ "vision_model.trunk.blocks.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
356
+ "vision_model.trunk.blocks.0.norm1.bias": "model-00001-of-00002.safetensors",
357
+ "vision_model.trunk.blocks.0.norm1.weight": "model-00001-of-00002.safetensors",
358
+ "vision_model.trunk.blocks.0.norm2.bias": "model-00001-of-00002.safetensors",
359
+ "vision_model.trunk.blocks.0.norm2.weight": "model-00001-of-00002.safetensors",
360
+ "vision_model.trunk.blocks.1.attn.proj.bias": "model-00001-of-00002.safetensors",
361
+ "vision_model.trunk.blocks.1.attn.proj.weight": "model-00001-of-00002.safetensors",
362
+ "vision_model.trunk.blocks.1.attn.qkv.bias": "model-00001-of-00002.safetensors",
363
+ "vision_model.trunk.blocks.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
364
+ "vision_model.trunk.blocks.1.mlp.fc1.bias": "model-00001-of-00002.safetensors",
365
+ "vision_model.trunk.blocks.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
366
+ "vision_model.trunk.blocks.1.mlp.fc2.bias": "model-00001-of-00002.safetensors",
367
+ "vision_model.trunk.blocks.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
368
+ "vision_model.trunk.blocks.1.norm1.bias": "model-00001-of-00002.safetensors",
369
+ "vision_model.trunk.blocks.1.norm1.weight": "model-00001-of-00002.safetensors",
370
+ "vision_model.trunk.blocks.1.norm2.bias": "model-00001-of-00002.safetensors",
371
+ "vision_model.trunk.blocks.1.norm2.weight": "model-00001-of-00002.safetensors",
372
+ "vision_model.trunk.blocks.10.attn.proj.bias": "model-00001-of-00002.safetensors",
373
+ "vision_model.trunk.blocks.10.attn.proj.weight": "model-00001-of-00002.safetensors",
374
+ "vision_model.trunk.blocks.10.attn.qkv.bias": "model-00001-of-00002.safetensors",
375
+ "vision_model.trunk.blocks.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
376
+ "vision_model.trunk.blocks.10.mlp.fc1.bias": "model-00001-of-00002.safetensors",
377
+ "vision_model.trunk.blocks.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
378
+ "vision_model.trunk.blocks.10.mlp.fc2.bias": "model-00001-of-00002.safetensors",
379
+ "vision_model.trunk.blocks.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
380
+ "vision_model.trunk.blocks.10.norm1.bias": "model-00001-of-00002.safetensors",
381
+ "vision_model.trunk.blocks.10.norm1.weight": "model-00001-of-00002.safetensors",
382
+ "vision_model.trunk.blocks.10.norm2.bias": "model-00001-of-00002.safetensors",
383
+ "vision_model.trunk.blocks.10.norm2.weight": "model-00001-of-00002.safetensors",
384
+ "vision_model.trunk.blocks.11.attn.proj.bias": "model-00001-of-00002.safetensors",
385
+ "vision_model.trunk.blocks.11.attn.proj.weight": "model-00001-of-00002.safetensors",
386
+ "vision_model.trunk.blocks.11.attn.qkv.bias": "model-00001-of-00002.safetensors",
387
+ "vision_model.trunk.blocks.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
388
+ "vision_model.trunk.blocks.11.mlp.fc1.bias": "model-00001-of-00002.safetensors",
389
+ "vision_model.trunk.blocks.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
390
+ "vision_model.trunk.blocks.11.mlp.fc2.bias": "model-00001-of-00002.safetensors",
391
+ "vision_model.trunk.blocks.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
392
+ "vision_model.trunk.blocks.11.norm1.bias": "model-00001-of-00002.safetensors",
393
+ "vision_model.trunk.blocks.11.norm1.weight": "model-00001-of-00002.safetensors",
394
+ "vision_model.trunk.blocks.11.norm2.bias": "model-00001-of-00002.safetensors",
395
+ "vision_model.trunk.blocks.11.norm2.weight": "model-00001-of-00002.safetensors",
396
+ "vision_model.trunk.blocks.12.attn.proj.bias": "model-00001-of-00002.safetensors",
397
+ "vision_model.trunk.blocks.12.attn.proj.weight": "model-00001-of-00002.safetensors",
398
+ "vision_model.trunk.blocks.12.attn.qkv.bias": "model-00001-of-00002.safetensors",
399
+ "vision_model.trunk.blocks.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
400
+ "vision_model.trunk.blocks.12.mlp.fc1.bias": "model-00001-of-00002.safetensors",
401
+ "vision_model.trunk.blocks.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
402
+ "vision_model.trunk.blocks.12.mlp.fc2.bias": "model-00001-of-00002.safetensors",
403
+ "vision_model.trunk.blocks.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
404
+ "vision_model.trunk.blocks.12.norm1.bias": "model-00001-of-00002.safetensors",
405
+ "vision_model.trunk.blocks.12.norm1.weight": "model-00001-of-00002.safetensors",
406
+ "vision_model.trunk.blocks.12.norm2.bias": "model-00001-of-00002.safetensors",
407
+ "vision_model.trunk.blocks.12.norm2.weight": "model-00001-of-00002.safetensors",
408
+ "vision_model.trunk.blocks.13.attn.proj.bias": "model-00001-of-00002.safetensors",
409
+ "vision_model.trunk.blocks.13.attn.proj.weight": "model-00001-of-00002.safetensors",
410
+ "vision_model.trunk.blocks.13.attn.qkv.bias": "model-00001-of-00002.safetensors",
411
+ "vision_model.trunk.blocks.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
412
+ "vision_model.trunk.blocks.13.mlp.fc1.bias": "model-00001-of-00002.safetensors",
413
+ "vision_model.trunk.blocks.13.mlp.fc1.weight": "model-00001-of-00002.safetensors",
414
+ "vision_model.trunk.blocks.13.mlp.fc2.bias": "model-00001-of-00002.safetensors",
415
+ "vision_model.trunk.blocks.13.mlp.fc2.weight": "model-00001-of-00002.safetensors",
416
+ "vision_model.trunk.blocks.13.norm1.bias": "model-00001-of-00002.safetensors",
417
+ "vision_model.trunk.blocks.13.norm1.weight": "model-00001-of-00002.safetensors",
418
+ "vision_model.trunk.blocks.13.norm2.bias": "model-00001-of-00002.safetensors",
419
+ "vision_model.trunk.blocks.13.norm2.weight": "model-00001-of-00002.safetensors",
420
+ "vision_model.trunk.blocks.14.attn.proj.bias": "model-00001-of-00002.safetensors",
421
+ "vision_model.trunk.blocks.14.attn.proj.weight": "model-00001-of-00002.safetensors",
422
+ "vision_model.trunk.blocks.14.attn.qkv.bias": "model-00001-of-00002.safetensors",
423
+ "vision_model.trunk.blocks.14.attn.qkv.weight": "model-00001-of-00002.safetensors",
424
+ "vision_model.trunk.blocks.14.mlp.fc1.bias": "model-00001-of-00002.safetensors",
425
+ "vision_model.trunk.blocks.14.mlp.fc1.weight": "model-00001-of-00002.safetensors",
426
+ "vision_model.trunk.blocks.14.mlp.fc2.bias": "model-00001-of-00002.safetensors",
427
+ "vision_model.trunk.blocks.14.mlp.fc2.weight": "model-00001-of-00002.safetensors",
428
+ "vision_model.trunk.blocks.14.norm1.bias": "model-00001-of-00002.safetensors",
429
+ "vision_model.trunk.blocks.14.norm1.weight": "model-00001-of-00002.safetensors",
430
+ "vision_model.trunk.blocks.14.norm2.bias": "model-00001-of-00002.safetensors",
431
+ "vision_model.trunk.blocks.14.norm2.weight": "model-00001-of-00002.safetensors",
432
+ "vision_model.trunk.blocks.15.attn.proj.bias": "model-00001-of-00002.safetensors",
433
+ "vision_model.trunk.blocks.15.attn.proj.weight": "model-00001-of-00002.safetensors",
434
+ "vision_model.trunk.blocks.15.attn.qkv.bias": "model-00001-of-00002.safetensors",
435
+ "vision_model.trunk.blocks.15.attn.qkv.weight": "model-00001-of-00002.safetensors",
436
+ "vision_model.trunk.blocks.15.mlp.fc1.bias": "model-00001-of-00002.safetensors",
437
+ "vision_model.trunk.blocks.15.mlp.fc1.weight": "model-00001-of-00002.safetensors",
438
+ "vision_model.trunk.blocks.15.mlp.fc2.bias": "model-00001-of-00002.safetensors",
439
+ "vision_model.trunk.blocks.15.mlp.fc2.weight": "model-00001-of-00002.safetensors",
440
+ "vision_model.trunk.blocks.15.norm1.bias": "model-00001-of-00002.safetensors",
441
+ "vision_model.trunk.blocks.15.norm1.weight": "model-00001-of-00002.safetensors",
442
+ "vision_model.trunk.blocks.15.norm2.bias": "model-00001-of-00002.safetensors",
443
+ "vision_model.trunk.blocks.15.norm2.weight": "model-00001-of-00002.safetensors",
444
+ "vision_model.trunk.blocks.16.attn.proj.bias": "model-00001-of-00002.safetensors",
445
+ "vision_model.trunk.blocks.16.attn.proj.weight": "model-00001-of-00002.safetensors",
446
+ "vision_model.trunk.blocks.16.attn.qkv.bias": "model-00001-of-00002.safetensors",
447
+ "vision_model.trunk.blocks.16.attn.qkv.weight": "model-00001-of-00002.safetensors",
448
+ "vision_model.trunk.blocks.16.mlp.fc1.bias": "model-00001-of-00002.safetensors",
449
+ "vision_model.trunk.blocks.16.mlp.fc1.weight": "model-00001-of-00002.safetensors",
450
+ "vision_model.trunk.blocks.16.mlp.fc2.bias": "model-00001-of-00002.safetensors",
451
+ "vision_model.trunk.blocks.16.mlp.fc2.weight": "model-00001-of-00002.safetensors",
452
+ "vision_model.trunk.blocks.16.norm1.bias": "model-00001-of-00002.safetensors",
453
+ "vision_model.trunk.blocks.16.norm1.weight": "model-00001-of-00002.safetensors",
454
+ "vision_model.trunk.blocks.16.norm2.bias": "model-00001-of-00002.safetensors",
455
+ "vision_model.trunk.blocks.16.norm2.weight": "model-00001-of-00002.safetensors",
456
+ "vision_model.trunk.blocks.17.attn.proj.bias": "model-00001-of-00002.safetensors",
457
+ "vision_model.trunk.blocks.17.attn.proj.weight": "model-00001-of-00002.safetensors",
458
+ "vision_model.trunk.blocks.17.attn.qkv.bias": "model-00001-of-00002.safetensors",
459
+ "vision_model.trunk.blocks.17.attn.qkv.weight": "model-00001-of-00002.safetensors",
460
+ "vision_model.trunk.blocks.17.mlp.fc1.bias": "model-00001-of-00002.safetensors",
461
+ "vision_model.trunk.blocks.17.mlp.fc1.weight": "model-00001-of-00002.safetensors",
462
+ "vision_model.trunk.blocks.17.mlp.fc2.bias": "model-00001-of-00002.safetensors",
463
+ "vision_model.trunk.blocks.17.mlp.fc2.weight": "model-00001-of-00002.safetensors",
464
+ "vision_model.trunk.blocks.17.norm1.bias": "model-00001-of-00002.safetensors",
465
+ "vision_model.trunk.blocks.17.norm1.weight": "model-00001-of-00002.safetensors",
466
+ "vision_model.trunk.blocks.17.norm2.bias": "model-00001-of-00002.safetensors",
467
+ "vision_model.trunk.blocks.17.norm2.weight": "model-00001-of-00002.safetensors",
468
+ "vision_model.trunk.blocks.18.attn.proj.bias": "model-00001-of-00002.safetensors",
469
+ "vision_model.trunk.blocks.18.attn.proj.weight": "model-00001-of-00002.safetensors",
470
+ "vision_model.trunk.blocks.18.attn.qkv.bias": "model-00001-of-00002.safetensors",
471
+ "vision_model.trunk.blocks.18.attn.qkv.weight": "model-00001-of-00002.safetensors",
472
+ "vision_model.trunk.blocks.18.mlp.fc1.bias": "model-00001-of-00002.safetensors",
473
+ "vision_model.trunk.blocks.18.mlp.fc1.weight": "model-00001-of-00002.safetensors",
474
+ "vision_model.trunk.blocks.18.mlp.fc2.bias": "model-00001-of-00002.safetensors",
475
+ "vision_model.trunk.blocks.18.mlp.fc2.weight": "model-00001-of-00002.safetensors",
476
+ "vision_model.trunk.blocks.18.norm1.bias": "model-00001-of-00002.safetensors",
477
+ "vision_model.trunk.blocks.18.norm1.weight": "model-00001-of-00002.safetensors",
478
+ "vision_model.trunk.blocks.18.norm2.bias": "model-00001-of-00002.safetensors",
479
+ "vision_model.trunk.blocks.18.norm2.weight": "model-00001-of-00002.safetensors",
480
+ "vision_model.trunk.blocks.19.attn.proj.bias": "model-00001-of-00002.safetensors",
481
+ "vision_model.trunk.blocks.19.attn.proj.weight": "model-00001-of-00002.safetensors",
482
+ "vision_model.trunk.blocks.19.attn.qkv.bias": "model-00001-of-00002.safetensors",
483
+ "vision_model.trunk.blocks.19.attn.qkv.weight": "model-00001-of-00002.safetensors",
484
+ "vision_model.trunk.blocks.19.mlp.fc1.bias": "model-00001-of-00002.safetensors",
485
+ "vision_model.trunk.blocks.19.mlp.fc1.weight": "model-00001-of-00002.safetensors",
486
+ "vision_model.trunk.blocks.19.mlp.fc2.bias": "model-00001-of-00002.safetensors",
487
+ "vision_model.trunk.blocks.19.mlp.fc2.weight": "model-00001-of-00002.safetensors",
488
+ "vision_model.trunk.blocks.19.norm1.bias": "model-00001-of-00002.safetensors",
489
+ "vision_model.trunk.blocks.19.norm1.weight": "model-00001-of-00002.safetensors",
490
+ "vision_model.trunk.blocks.19.norm2.bias": "model-00001-of-00002.safetensors",
491
+ "vision_model.trunk.blocks.19.norm2.weight": "model-00001-of-00002.safetensors",
492
+ "vision_model.trunk.blocks.2.attn.proj.bias": "model-00001-of-00002.safetensors",
493
+ "vision_model.trunk.blocks.2.attn.proj.weight": "model-00001-of-00002.safetensors",
494
+ "vision_model.trunk.blocks.2.attn.qkv.bias": "model-00001-of-00002.safetensors",
495
+ "vision_model.trunk.blocks.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
496
+ "vision_model.trunk.blocks.2.mlp.fc1.bias": "model-00001-of-00002.safetensors",
497
+ "vision_model.trunk.blocks.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
498
+ "vision_model.trunk.blocks.2.mlp.fc2.bias": "model-00001-of-00002.safetensors",
499
+ "vision_model.trunk.blocks.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
500
+ "vision_model.trunk.blocks.2.norm1.bias": "model-00001-of-00002.safetensors",
501
+ "vision_model.trunk.blocks.2.norm1.weight": "model-00001-of-00002.safetensors",
502
+ "vision_model.trunk.blocks.2.norm2.bias": "model-00001-of-00002.safetensors",
503
+ "vision_model.trunk.blocks.2.norm2.weight": "model-00001-of-00002.safetensors",
504
+ "vision_model.trunk.blocks.20.attn.proj.bias": "model-00001-of-00002.safetensors",
505
+ "vision_model.trunk.blocks.20.attn.proj.weight": "model-00001-of-00002.safetensors",
506
+ "vision_model.trunk.blocks.20.attn.qkv.bias": "model-00001-of-00002.safetensors",
507
+ "vision_model.trunk.blocks.20.attn.qkv.weight": "model-00001-of-00002.safetensors",
508
+ "vision_model.trunk.blocks.20.mlp.fc1.bias": "model-00001-of-00002.safetensors",
509
+ "vision_model.trunk.blocks.20.mlp.fc1.weight": "model-00001-of-00002.safetensors",
510
+ "vision_model.trunk.blocks.20.mlp.fc2.bias": "model-00001-of-00002.safetensors",
511
+ "vision_model.trunk.blocks.20.mlp.fc2.weight": "model-00001-of-00002.safetensors",
512
+ "vision_model.trunk.blocks.20.norm1.bias": "model-00001-of-00002.safetensors",
513
+ "vision_model.trunk.blocks.20.norm1.weight": "model-00001-of-00002.safetensors",
514
+ "vision_model.trunk.blocks.20.norm2.bias": "model-00001-of-00002.safetensors",
515
+ "vision_model.trunk.blocks.20.norm2.weight": "model-00001-of-00002.safetensors",
516
+ "vision_model.trunk.blocks.21.attn.proj.bias": "model-00001-of-00002.safetensors",
517
+ "vision_model.trunk.blocks.21.attn.proj.weight": "model-00001-of-00002.safetensors",
518
+ "vision_model.trunk.blocks.21.attn.qkv.bias": "model-00001-of-00002.safetensors",
519
+ "vision_model.trunk.blocks.21.attn.qkv.weight": "model-00001-of-00002.safetensors",
520
+ "vision_model.trunk.blocks.21.mlp.fc1.bias": "model-00001-of-00002.safetensors",
521
+ "vision_model.trunk.blocks.21.mlp.fc1.weight": "model-00001-of-00002.safetensors",
522
+ "vision_model.trunk.blocks.21.mlp.fc2.bias": "model-00001-of-00002.safetensors",
523
+ "vision_model.trunk.blocks.21.mlp.fc2.weight": "model-00001-of-00002.safetensors",
524
+ "vision_model.trunk.blocks.21.norm1.bias": "model-00001-of-00002.safetensors",
525
+ "vision_model.trunk.blocks.21.norm1.weight": "model-00001-of-00002.safetensors",
526
+ "vision_model.trunk.blocks.21.norm2.bias": "model-00001-of-00002.safetensors",
527
+ "vision_model.trunk.blocks.21.norm2.weight": "model-00001-of-00002.safetensors",
528
+ "vision_model.trunk.blocks.22.attn.proj.bias": "model-00001-of-00002.safetensors",
529
+ "vision_model.trunk.blocks.22.attn.proj.weight": "model-00001-of-00002.safetensors",
530
+ "vision_model.trunk.blocks.22.attn.qkv.bias": "model-00001-of-00002.safetensors",
531
+ "vision_model.trunk.blocks.22.attn.qkv.weight": "model-00001-of-00002.safetensors",
532
+ "vision_model.trunk.blocks.22.mlp.fc1.bias": "model-00001-of-00002.safetensors",
533
+ "vision_model.trunk.blocks.22.mlp.fc1.weight": "model-00001-of-00002.safetensors",
534
+ "vision_model.trunk.blocks.22.mlp.fc2.bias": "model-00001-of-00002.safetensors",
535
+ "vision_model.trunk.blocks.22.mlp.fc2.weight": "model-00001-of-00002.safetensors",
536
+ "vision_model.trunk.blocks.22.norm1.bias": "model-00001-of-00002.safetensors",
537
+ "vision_model.trunk.blocks.22.norm1.weight": "model-00001-of-00002.safetensors",
538
+ "vision_model.trunk.blocks.22.norm2.bias": "model-00001-of-00002.safetensors",
539
+ "vision_model.trunk.blocks.22.norm2.weight": "model-00001-of-00002.safetensors",
540
+ "vision_model.trunk.blocks.23.attn.proj.bias": "model-00001-of-00002.safetensors",
541
+ "vision_model.trunk.blocks.23.attn.proj.weight": "model-00001-of-00002.safetensors",
542
+ "vision_model.trunk.blocks.23.attn.qkv.bias": "model-00001-of-00002.safetensors",
543
+ "vision_model.trunk.blocks.23.attn.qkv.weight": "model-00001-of-00002.safetensors",
544
+ "vision_model.trunk.blocks.23.mlp.fc1.bias": "model-00001-of-00002.safetensors",
545
+ "vision_model.trunk.blocks.23.mlp.fc1.weight": "model-00001-of-00002.safetensors",
546
+ "vision_model.trunk.blocks.23.mlp.fc2.bias": "model-00001-of-00002.safetensors",
547
+ "vision_model.trunk.blocks.23.mlp.fc2.weight": "model-00001-of-00002.safetensors",
548
+ "vision_model.trunk.blocks.23.norm1.bias": "model-00001-of-00002.safetensors",
549
+ "vision_model.trunk.blocks.23.norm1.weight": "model-00001-of-00002.safetensors",
550
+ "vision_model.trunk.blocks.23.norm2.bias": "model-00001-of-00002.safetensors",
551
+ "vision_model.trunk.blocks.23.norm2.weight": "model-00001-of-00002.safetensors",
552
+ "vision_model.trunk.blocks.24.attn.proj.bias": "model-00001-of-00002.safetensors",
553
+ "vision_model.trunk.blocks.24.attn.proj.weight": "model-00001-of-00002.safetensors",
554
+ "vision_model.trunk.blocks.24.attn.qkv.bias": "model-00001-of-00002.safetensors",
555
+ "vision_model.trunk.blocks.24.attn.qkv.weight": "model-00001-of-00002.safetensors",
556
+ "vision_model.trunk.blocks.24.mlp.fc1.bias": "model-00001-of-00002.safetensors",
557
+ "vision_model.trunk.blocks.24.mlp.fc1.weight": "model-00001-of-00002.safetensors",
558
+ "vision_model.trunk.blocks.24.mlp.fc2.bias": "model-00001-of-00002.safetensors",
559
+ "vision_model.trunk.blocks.24.mlp.fc2.weight": "model-00001-of-00002.safetensors",
560
+ "vision_model.trunk.blocks.24.norm1.bias": "model-00001-of-00002.safetensors",
561
+ "vision_model.trunk.blocks.24.norm1.weight": "model-00001-of-00002.safetensors",
562
+ "vision_model.trunk.blocks.24.norm2.bias": "model-00001-of-00002.safetensors",
563
+ "vision_model.trunk.blocks.24.norm2.weight": "model-00001-of-00002.safetensors",
564
+ "vision_model.trunk.blocks.25.attn.proj.bias": "model-00001-of-00002.safetensors",
565
+ "vision_model.trunk.blocks.25.attn.proj.weight": "model-00001-of-00002.safetensors",
566
+ "vision_model.trunk.blocks.25.attn.qkv.bias": "model-00001-of-00002.safetensors",
567
+ "vision_model.trunk.blocks.25.attn.qkv.weight": "model-00001-of-00002.safetensors",
568
+ "vision_model.trunk.blocks.25.mlp.fc1.bias": "model-00001-of-00002.safetensors",
569
+ "vision_model.trunk.blocks.25.mlp.fc1.weight": "model-00001-of-00002.safetensors",
570
+ "vision_model.trunk.blocks.25.mlp.fc2.bias": "model-00001-of-00002.safetensors",
571
+ "vision_model.trunk.blocks.25.mlp.fc2.weight": "model-00001-of-00002.safetensors",
572
+ "vision_model.trunk.blocks.25.norm1.bias": "model-00001-of-00002.safetensors",
573
+ "vision_model.trunk.blocks.25.norm1.weight": "model-00001-of-00002.safetensors",
574
+ "vision_model.trunk.blocks.25.norm2.bias": "model-00001-of-00002.safetensors",
575
+ "vision_model.trunk.blocks.25.norm2.weight": "model-00001-of-00002.safetensors",
576
+ "vision_model.trunk.blocks.26.attn.proj.bias": "model-00001-of-00002.safetensors",
577
+ "vision_model.trunk.blocks.26.attn.proj.weight": "model-00001-of-00002.safetensors",
578
+ "vision_model.trunk.blocks.26.attn.qkv.bias": "model-00001-of-00002.safetensors",
579
+ "vision_model.trunk.blocks.26.attn.qkv.weight": "model-00001-of-00002.safetensors",
580
+ "vision_model.trunk.blocks.26.mlp.fc1.bias": "model-00001-of-00002.safetensors",
581
+ "vision_model.trunk.blocks.26.mlp.fc1.weight": "model-00001-of-00002.safetensors",
582
+ "vision_model.trunk.blocks.26.mlp.fc2.bias": "model-00001-of-00002.safetensors",
583
+ "vision_model.trunk.blocks.26.mlp.fc2.weight": "model-00001-of-00002.safetensors",
584
+ "vision_model.trunk.blocks.26.norm1.bias": "model-00001-of-00002.safetensors",
585
+ "vision_model.trunk.blocks.26.norm1.weight": "model-00001-of-00002.safetensors",
586
+ "vision_model.trunk.blocks.26.norm2.bias": "model-00001-of-00002.safetensors",
587
+ "vision_model.trunk.blocks.26.norm2.weight": "model-00001-of-00002.safetensors",
588
+ "vision_model.trunk.blocks.3.attn.proj.bias": "model-00001-of-00002.safetensors",
589
+ "vision_model.trunk.blocks.3.attn.proj.weight": "model-00001-of-00002.safetensors",
590
+ "vision_model.trunk.blocks.3.attn.qkv.bias": "model-00001-of-00002.safetensors",
591
+ "vision_model.trunk.blocks.3.attn.qkv.weight": "model-00001-of-00002.safetensors",
592
+ "vision_model.trunk.blocks.3.mlp.fc1.bias": "model-00001-of-00002.safetensors",
593
+ "vision_model.trunk.blocks.3.mlp.fc1.weight": "model-00001-of-00002.safetensors",
594
+ "vision_model.trunk.blocks.3.mlp.fc2.bias": "model-00001-of-00002.safetensors",
595
+ "vision_model.trunk.blocks.3.mlp.fc2.weight": "model-00001-of-00002.safetensors",
596
+ "vision_model.trunk.blocks.3.norm1.bias": "model-00001-of-00002.safetensors",
597
+ "vision_model.trunk.blocks.3.norm1.weight": "model-00001-of-00002.safetensors",
598
+ "vision_model.trunk.blocks.3.norm2.bias": "model-00001-of-00002.safetensors",
599
+ "vision_model.trunk.blocks.3.norm2.weight": "model-00001-of-00002.safetensors",
600
+ "vision_model.trunk.blocks.4.attn.proj.bias": "model-00001-of-00002.safetensors",
601
+ "vision_model.trunk.blocks.4.attn.proj.weight": "model-00001-of-00002.safetensors",
602
+ "vision_model.trunk.blocks.4.attn.qkv.bias": "model-00001-of-00002.safetensors",
603
+ "vision_model.trunk.blocks.4.attn.qkv.weight": "model-00001-of-00002.safetensors",
604
+ "vision_model.trunk.blocks.4.mlp.fc1.bias": "model-00001-of-00002.safetensors",
605
+ "vision_model.trunk.blocks.4.mlp.fc1.weight": "model-00001-of-00002.safetensors",
606
+ "vision_model.trunk.blocks.4.mlp.fc2.bias": "model-00001-of-00002.safetensors",
607
+ "vision_model.trunk.blocks.4.mlp.fc2.weight": "model-00001-of-00002.safetensors",
608
+ "vision_model.trunk.blocks.4.norm1.bias": "model-00001-of-00002.safetensors",
609
+ "vision_model.trunk.blocks.4.norm1.weight": "model-00001-of-00002.safetensors",
610
+ "vision_model.trunk.blocks.4.norm2.bias": "model-00001-of-00002.safetensors",
611
+ "vision_model.trunk.blocks.4.norm2.weight": "model-00001-of-00002.safetensors",
612
+ "vision_model.trunk.blocks.5.attn.proj.bias": "model-00001-of-00002.safetensors",
613
+ "vision_model.trunk.blocks.5.attn.proj.weight": "model-00001-of-00002.safetensors",
614
+ "vision_model.trunk.blocks.5.attn.qkv.bias": "model-00001-of-00002.safetensors",
615
+ "vision_model.trunk.blocks.5.attn.qkv.weight": "model-00001-of-00002.safetensors",
616
+ "vision_model.trunk.blocks.5.mlp.fc1.bias": "model-00001-of-00002.safetensors",
617
+ "vision_model.trunk.blocks.5.mlp.fc1.weight": "model-00001-of-00002.safetensors",
618
+ "vision_model.trunk.blocks.5.mlp.fc2.bias": "model-00001-of-00002.safetensors",
619
+ "vision_model.trunk.blocks.5.mlp.fc2.weight": "model-00001-of-00002.safetensors",
620
+ "vision_model.trunk.blocks.5.norm1.bias": "model-00001-of-00002.safetensors",
621
+ "vision_model.trunk.blocks.5.norm1.weight": "model-00001-of-00002.safetensors",
622
+ "vision_model.trunk.blocks.5.norm2.bias": "model-00001-of-00002.safetensors",
623
+ "vision_model.trunk.blocks.5.norm2.weight": "model-00001-of-00002.safetensors",
624
+ "vision_model.trunk.blocks.6.attn.proj.bias": "model-00001-of-00002.safetensors",
625
+ "vision_model.trunk.blocks.6.attn.proj.weight": "model-00001-of-00002.safetensors",
626
+ "vision_model.trunk.blocks.6.attn.qkv.bias": "model-00001-of-00002.safetensors",
627
+ "vision_model.trunk.blocks.6.attn.qkv.weight": "model-00001-of-00002.safetensors",
628
+ "vision_model.trunk.blocks.6.mlp.fc1.bias": "model-00001-of-00002.safetensors",
629
+ "vision_model.trunk.blocks.6.mlp.fc1.weight": "model-00001-of-00002.safetensors",
630
+ "vision_model.trunk.blocks.6.mlp.fc2.bias": "model-00001-of-00002.safetensors",
631
+ "vision_model.trunk.blocks.6.mlp.fc2.weight": "model-00001-of-00002.safetensors",
632
+ "vision_model.trunk.blocks.6.norm1.bias": "model-00001-of-00002.safetensors",
633
+ "vision_model.trunk.blocks.6.norm1.weight": "model-00001-of-00002.safetensors",
634
+ "vision_model.trunk.blocks.6.norm2.bias": "model-00001-of-00002.safetensors",
635
+ "vision_model.trunk.blocks.6.norm2.weight": "model-00001-of-00002.safetensors",
636
+ "vision_model.trunk.blocks.7.attn.proj.bias": "model-00001-of-00002.safetensors",
637
+ "vision_model.trunk.blocks.7.attn.proj.weight": "model-00001-of-00002.safetensors",
638
+ "vision_model.trunk.blocks.7.attn.qkv.bias": "model-00001-of-00002.safetensors",
639
+ "vision_model.trunk.blocks.7.attn.qkv.weight": "model-00001-of-00002.safetensors",
640
+ "vision_model.trunk.blocks.7.mlp.fc1.bias": "model-00001-of-00002.safetensors",
641
+ "vision_model.trunk.blocks.7.mlp.fc1.weight": "model-00001-of-00002.safetensors",
642
+ "vision_model.trunk.blocks.7.mlp.fc2.bias": "model-00001-of-00002.safetensors",
643
+ "vision_model.trunk.blocks.7.mlp.fc2.weight": "model-00001-of-00002.safetensors",
644
+ "vision_model.trunk.blocks.7.norm1.bias": "model-00001-of-00002.safetensors",
645
+ "vision_model.trunk.blocks.7.norm1.weight": "model-00001-of-00002.safetensors",
646
+ "vision_model.trunk.blocks.7.norm2.bias": "model-00001-of-00002.safetensors",
647
+ "vision_model.trunk.blocks.7.norm2.weight": "model-00001-of-00002.safetensors",
648
+ "vision_model.trunk.blocks.8.attn.proj.bias": "model-00001-of-00002.safetensors",
649
+ "vision_model.trunk.blocks.8.attn.proj.weight": "model-00001-of-00002.safetensors",
650
+ "vision_model.trunk.blocks.8.attn.qkv.bias": "model-00001-of-00002.safetensors",
651
+ "vision_model.trunk.blocks.8.attn.qkv.weight": "model-00001-of-00002.safetensors",
652
+ "vision_model.trunk.blocks.8.mlp.fc1.bias": "model-00001-of-00002.safetensors",
653
+ "vision_model.trunk.blocks.8.mlp.fc1.weight": "model-00001-of-00002.safetensors",
654
+ "vision_model.trunk.blocks.8.mlp.fc2.bias": "model-00001-of-00002.safetensors",
655
+ "vision_model.trunk.blocks.8.mlp.fc2.weight": "model-00001-of-00002.safetensors",
656
+ "vision_model.trunk.blocks.8.norm1.bias": "model-00001-of-00002.safetensors",
657
+ "vision_model.trunk.blocks.8.norm1.weight": "model-00001-of-00002.safetensors",
658
+ "vision_model.trunk.blocks.8.norm2.bias": "model-00001-of-00002.safetensors",
659
+ "vision_model.trunk.blocks.8.norm2.weight": "model-00001-of-00002.safetensors",
660
+ "vision_model.trunk.blocks.9.attn.proj.bias": "model-00001-of-00002.safetensors",
661
+ "vision_model.trunk.blocks.9.attn.proj.weight": "model-00001-of-00002.safetensors",
662
+ "vision_model.trunk.blocks.9.attn.qkv.bias": "model-00001-of-00002.safetensors",
663
+ "vision_model.trunk.blocks.9.attn.qkv.weight": "model-00001-of-00002.safetensors",
664
+ "vision_model.trunk.blocks.9.mlp.fc1.bias": "model-00001-of-00002.safetensors",
665
+ "vision_model.trunk.blocks.9.mlp.fc1.weight": "model-00001-of-00002.safetensors",
666
+ "vision_model.trunk.blocks.9.mlp.fc2.bias": "model-00001-of-00002.safetensors",
667
+ "vision_model.trunk.blocks.9.mlp.fc2.weight": "model-00001-of-00002.safetensors",
668
+ "vision_model.trunk.blocks.9.norm1.bias": "model-00001-of-00002.safetensors",
669
+ "vision_model.trunk.blocks.9.norm1.weight": "model-00001-of-00002.safetensors",
670
+ "vision_model.trunk.blocks.9.norm2.bias": "model-00001-of-00002.safetensors",
671
+ "vision_model.trunk.blocks.9.norm2.weight": "model-00001-of-00002.safetensors",
672
+ "vision_model.trunk.norm.bias": "model-00001-of-00002.safetensors",
673
+ "vision_model.trunk.norm.weight": "model-00001-of-00002.safetensors",
674
+ "vision_model.trunk.patch_embed.proj.bias": "model-00001-of-00002.safetensors",
675
+ "vision_model.trunk.patch_embed.proj.weight": "model-00001-of-00002.safetensors",
676
+ "vision_model.trunk.pos_embed": "model-00001-of-00002.safetensors"
677
+ }
678
+ }
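
The index entries above map each tensor name to the shard file that stores it. As a rough sketch of how such a map can be inspected, the snippet below groups tensor names by shard. It assumes the usual sharded-checkpoint layout with a top-level `weight_map` key (the opening of the file is not part of this hunk, so the key name is an assumption), and both `group_by_shard` and the two-entry `example_index` are hypothetical, for illustration only.

```python
import json
from collections import defaultdict


def group_by_shard(index: dict) -> dict:
    """Map each shard file name to the sorted tensor names it stores."""
    shards = defaultdict(list)
    # Assumption: tensor-to-shard entries live under a top-level "weight_map" key.
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return {shard: sorted(names) for shard, names in shards.items()}


# Hypothetical two-entry index; a real one would be read with json.load(open(path)).
example_index = json.loads(
    """
    {
      "weight_map": {
        "vision_model.trunk.pos_embed": "model-00001-of-00002.safetensors",
        "vision_model.trunk.norm.bias": "model-00001-of-00002.safetensors"
      }
    }
    """
)

by_shard = group_by_shard(example_index)
for shard, names in by_shard.items():
    print(shard, len(names))
```

The keys in the index are sorted as strings rather than numerically, which is why `blocks.19` is listed before `blocks.2` in the entries above.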
modeling_llava.py ADDED
@@ -0,0 +1,307 @@
+ # coding=utf-8
+ from dataclasses import dataclass
+ from typing import List, Optional, Tuple, Union
+
+ import torch
+ import torch.utils.checkpoint
+ from torch import nn
+
+ from transformers import PreTrainedModel
+ from transformers.modeling_outputs import ModelOutput
+
+ from modeling_phi import PhiForCausalLM
+ from configuration_llava import LlavaConfig
+ from open_clip import create_model
+
+
+ @dataclass
+ class LlavaCausalLMOutputWithPast(ModelOutput):
+     loss: Optional[torch.FloatTensor] = None
+     logits: torch.FloatTensor = None
+     past_key_values: Optional[List[torch.FloatTensor]] = None
+     hidden_states: Optional[Tuple[torch.FloatTensor]] = None
+     attentions: Optional[Tuple[torch.FloatTensor]] = None
+     image_features: Optional[torch.FloatTensor] = None
+
+
+ class LlavaMultiModalProjector(nn.Module):
+     def __init__(self, config: LlavaConfig):
+         super().__init__()
+
+         self.linear_1 = nn.Linear(
+             config.vision_embed_dim,
+             config.text_config.n_embd * config.projector_tokens_num,
+             bias=True,
+         )
+         self.act = nn.GELU()
+         self.linear_2 = nn.Linear(
+             # Input width must equal the output width of linear_1.
+             config.text_config.n_embd * config.projector_tokens_num,
+             config.text_config.n_embd,
+             bias=True,
+         )
+         self.projector_tokens_num = config.projector_tokens_num
+
+     def forward(self, image_features):
+         hidden_states = self.linear_1(image_features)
+         hidden_states = self.act(hidden_states)
+         hidden_states = self.linear_2(hidden_states)
+         return hidden_states
+
+
+ class LlavaPreTrainedModel(PreTrainedModel):
+     config_class = LlavaConfig
+     base_model_prefix = "model"
+     supports_gradient_checkpointing = True
+     _no_split_modules = ["LlavaVisionAttention"]
+     _skip_keys_device_placement = "past_key_values"
+     _supports_flash_attn_2 = True
+
+     def __init__(self, config):
+         super().__init__(config)
+
+     def _init_weights(self, module):
+         return
+
+     @property
+     def _supports_sdpa(self):
+         """
+         Retrieve language_model's attribute to check whether the model supports
+         SDPA or not.
+         """
+         return self.language_model._supports_sdpa
+
+
+ class LlavaForConditionalGeneration(LlavaPreTrainedModel):
+     def __init__(self, config: LlavaConfig):
+         super().__init__(config)
+         clip_model = create_model(config.vision_tower_name)
+         self.vision_model = clip_model.visual
+
+         self.multi_modal_projector = LlavaMultiModalProjector(config)
+         self.vocab_size = config.vocab_size
+         self.language_model = PhiForCausalLM(config.text_config)
+         self.pad_token_id = (
+             self.config.pad_token_id if self.config.pad_token_id is not None else -1
+         )
+         self.post_init()
+
+     def get_input_embeddings(self):
+         return self.language_model.get_input_embeddings()
+
+     def set_input_embeddings(self, value):
+         self.language_model.set_input_embeddings(value)
+
+     def get_output_embeddings(self):
+         return self.language_model.get_output_embeddings()
+
+     def set_output_embeddings(self, new_embeddings):
+         self.language_model.set_output_embeddings(new_embeddings)
+
+     def set_decoder(self, decoder):
+         self.language_model.transformer = decoder
+
+     def get_decoder(self):
+         return self.language_model.transformer
+
+     def tie_weights(self):
+         return self.language_model.tie_weights()
+
+     def resize_token_embeddings(
+         self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None
+     ) -> nn.Embedding:
+         model_embeds = self.language_model.resize_token_embeddings(
+             new_num_tokens, pad_to_multiple_of
+         )
+         # update vocab size
+         self.config.text_config.vocab_size = model_embeds.num_embeddings
+         self.config.vocab_size = model_embeds.num_embeddings
+         self.vocab_size = model_embeds.num_embeddings
+         return model_embeds
+
+     def _merge_input_ids_with_image_features(
+         self, image_features, inputs_embeds, input_ids, attention_mask, position_ids
+     ):
+         num_images, num_image_patches, embed_dim = image_features.shape
+         batch_size, sequence_length = input_ids.shape
+         left_padding = not torch.sum(
+             input_ids[:, -1] == torch.tensor(self.pad_token_id)
+         )
+         # 1. Create a mask to know where special image tokens are
+         special_image_token_mask = input_ids == self.config.image_token_index
+         num_special_image_tokens = torch.sum(special_image_token_mask, dim=-1)
+         # Compute the maximum length of the merged image-text sequence
+         max_embed_dim = (
+             num_special_image_tokens.max() * (num_image_patches - 1)
+         ) + sequence_length
+         batch_indices, non_image_indices = torch.where(
+             input_ids != self.config.image_token_index
+         )
+
+         # 2. Compute the positions where text should be written
+         # Calculate new positions for text tokens in merged image-text sequence.
+         # `special_image_token_mask` identifies image tokens. Each image token expands into
+         # `num_image_patches` embeddings, so it shifts every later token by `num_image_patches - 1` positions.
+         # `torch.cumsum` computes how each image token shifts subsequent text token positions.
+         # - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.
+         new_token_positions = (
+             torch.cumsum((special_image_token_mask * (num_image_patches - 1) + 1), -1)
+             - 1
+         )
+         nb_image_pad = max_embed_dim - 1 - new_token_positions[:, -1]
+         if left_padding:
+             new_token_positions += nb_image_pad[:, None]  # offset for left padding
+         text_to_overwrite = new_token_positions[batch_indices, non_image_indices]
+
+         # 3. Create the full embedding, already padded to the maximum position
+         final_embedding = torch.zeros(
+             batch_size,
+             max_embed_dim,
+             embed_dim,
+             dtype=inputs_embeds.dtype,
+             device=inputs_embeds.device,
+         )
+         final_attention_mask = torch.zeros(
+             batch_size,
+             max_embed_dim,
+             dtype=attention_mask.dtype,
+             device=inputs_embeds.device,
+         )
+         # In case the Vision model or the Language model has been offloaded to CPU, we need to manually
+         # set the corresponding tensors into their correct target device.
+         target_device = inputs_embeds.device
+         batch_indices, non_image_indices, text_to_overwrite = (
+             batch_indices.to(target_device),
+             non_image_indices.to(target_device),
+             text_to_overwrite.to(target_device),
+         )
+         attention_mask = attention_mask.to(target_device)
+
+         # 4. Fill the embeddings based on the mask. If we have ["hey", "<image>", "how", "are"] and 576 image patches,
+         # we need to index copy on [0, 577, 578] for the text and [1:577] for the image features
+         final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[
+             batch_indices, non_image_indices
+         ]
+         final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[
+             batch_indices, non_image_indices
+         ]
+
+         # 5. Fill the embeddings corresponding to the images. Anything that is still zeros needs filling
+         image_to_overwrite = torch.all(final_embedding == 0, dim=-1)
+         image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[
+             :, None
+         ].to(target_device)
+
+         if image_to_overwrite.sum() != image_features.shape[:-1].numel():
+             raise ValueError(
+                 f"The inputs provided to the model are wrong. The number of image tokens is {torch.sum(special_image_token_mask)} while"
+                 f" the number of images given to the model is {num_images}. This prevents correct indexing and breaks batch generation."
+             )
+
+         final_embedding[image_to_overwrite] = (
+             image_features.contiguous().reshape(-1, embed_dim).to(target_device)
+         )
+         final_attention_mask |= image_to_overwrite
+         position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill_(
+             (final_attention_mask == 0), 1
+         )
+         return final_embedding, final_attention_mask, position_ids
+
+     def forward(
+         self,
+         input_ids: torch.LongTensor = None,
+         image_features: torch.FloatTensor = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[List[torch.FloatTensor]] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+     ) -> Union[Tuple, LlavaCausalLMOutputWithPast]:
+         output_attentions = (
+             output_attentions
+             if output_attentions is not None
+             else self.config.output_attentions
+         )
+         output_hidden_states = (
+             output_hidden_states
+             if output_hidden_states is not None
+             else self.config.output_hidden_states
+         )
+         return_dict = (
+             return_dict if return_dict is not None else self.config.use_return_dict
+         )
+
+         if inputs_embeds is None:
+             inputs_embeds = self.get_input_embeddings()(input_ids)
+             if image_features is not None and input_ids.shape[1] != 1:
+                 (
+                     inputs_embeds,
+                     attention_mask,
+                     position_ids,
+                 ) = self._merge_input_ids_with_image_features(
+                     image_features,
+                     inputs_embeds,
+                     input_ids,
+                     attention_mask,
+                     position_ids,
+                 )
+
+         outputs = self.language_model(
+             input_ids=None,
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_values,
+             inputs_embeds=inputs_embeds,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+         )
+
+         logits = outputs[0]
+
+         if not return_dict:
+             output = (logits,) + outputs[1:]
+             return output
+
+         return LlavaCausalLMOutputWithPast(
+             logits=logits,
+             past_key_values=outputs.past_key_values,
+             hidden_states=outputs.hidden_states,
+             attentions=outputs.attentions,
+             image_features=image_features,
+         )
+
+     def prepare_inputs_for_generation(
+         self,
+         input_ids,
+         past_key_values=None,
+         inputs_embeds=None,
+         attention_mask=None,
+         image_features=None,
+         **kwargs,
+     ):
+         res = self.language_model.prepare_inputs_for_generation(input_ids, past_key_values, attention_mask, **kwargs)
+         input_ids = res["input_ids"]
+         past_key_values = res["past_key_values"]
+         attention_mask = res["attention_mask"]
+
+         if inputs_embeds is not None and past_key_values is None:
+             model_inputs = {"inputs_embeds": inputs_embeds}
+         else:
+             model_inputs = {"input_ids": input_ids}
+
+         model_inputs.update(
+             {
+                 "past_key_values": past_key_values,
+                 "use_cache": kwargs.get("use_cache"),
+                 "attention_mask": attention_mask,
+                 "image_features": image_features,
+             }
+         )
+         return model_inputs
+
+     def _reorder_cache(self, *args, **kwargs):
+         return self.language_model._reorder_cache(*args, **kwargs)
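
The position arithmetic in `_merge_input_ids_with_image_features` is easier to follow on a toy input. The sketch below is not part of the commit. It re-creates only the cumulative-sum step in plain Python for a single unpadded sequence, and the helper name `merged_positions` and the token ids are hypothetical.

```python
from itertools import accumulate


def merged_positions(input_ids, image_token_id, num_image_patches):
    """Return (text_positions, image_positions, merged_length) for one unpadded sequence."""
    # An image placeholder occupies num_image_patches slots; a text token occupies one.
    widths = [num_image_patches if tok == image_token_id else 1 for tok in input_ids]
    # Cumulative sum minus one is the index of the last slot each token occupies,
    # mirroring `cumsum(mask * (num_image_patches - 1) + 1) - 1` in the method above.
    last_slot = [total - 1 for total in accumulate(widths)]
    merged_length = last_slot[-1] + 1
    text_positions = [
        pos for tok, pos in zip(input_ids, last_slot) if tok != image_token_id
    ]
    # Every slot not claimed by a text token receives an image feature.
    taken = set(text_positions)
    image_positions = [pos for pos in range(merged_length) if pos not in taken]
    return text_positions, image_positions, merged_length


# Hypothetical ids: 7, 8 and 9 are text tokens, 99 stands in for the <image> placeholder.
text_pos, image_pos, length = merged_positions(
    [7, 99, 8, 9], image_token_id=99, num_image_patches=4
)
print(text_pos, image_pos, length)
```

With four embeddings per image, the text tokens land at positions 0, 5 and 6 and the image embeddings fill positions 1 to 4. The real method additionally handles batches, left padding and the attention mask.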
modeling_phi.py ADDED
@@ -0,0 +1,988 @@
+ # Copyright (c) Microsoft Corporation.
+ # Licensed under the MIT license.
+ #
+ # Copyright (c) 2022, Tri Dao, [email protected].
+ # Licensed under the BSD 3-Clause License.
+
+ from __future__ import annotations
+
+ import math
+ from dataclasses import dataclass, field
+ from typing import Any, Dict, Optional, Tuple, Union
+
+ import torch
+ import torch.nn as nn
+ from einops import rearrange, repeat
+ from transformers import PretrainedConfig, PreTrainedModel
+ from transformers.activations import ACT2FN
+ from transformers.modeling_outputs import CausalLMOutputWithPast
+
+ from configuration_phi import PhiConfig
+
+ try:
+     from flash_attn.bert_padding import pad_input, unpad_input
+     from flash_attn.layers.rotary import RotaryEmbedding as FlashRotaryEmbedding
+     from flash_attn.modules.mha import FlashCrossAttention, FlashSelfAttention
+     from flash_attn.ops.fused_dense import FusedDense
+     print("Using Flash Attention!")
+ except Exception as exc:
+     print(exc)
+     pad_input, unpad_input = None, None
+     FlashRotaryEmbedding = None
+     FlashSelfAttention, FlashCrossAttention = None, None
+     FusedDense = None
+     print("Not using Flash Attention!")
+
+
+ @dataclass
+ class InferenceParams:
+     """Inference parameters passed to model to efficiently calculate
+     and store context during inference.
+
+     Reference:
+         https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/utils/generation.py.
+
+     Args:
+         max_seqlen: Maximum sequence length.
+         max_batch_size: Maximum batch size.
+         seqlen_offset: Sequence length offset.
+         batch_size_offset: Batch size offset.
+         key_value_memory_dict: Key value memory dictionary.
+         lengths_per_sample: Lengths per sample.
+
+     """
+
+     max_seqlen: int = field(metadata={"help": "Maximum sequence length."})
+
+     max_batch_size: int = field(metadata={"help": "Maximum batch size."})
+
+     seqlen_offset: int = field(default=0, metadata={"help": "Sequence length offset."})
+
+     batch_size_offset: int = field(default=0, metadata={"help": "Batch size offset."})
+
+     key_value_memory_dict: Dict[str, Any] = field(
+         default_factory=dict, metadata={"help": "Key value memory dictionary."}
+     )
+
+     lengths_per_sample: torch.Tensor = field(default=None, metadata={"help": "Lengths per sample."})
+
+
70
+ class Embedding(nn.Module):
71
+ """Token embedding with dropout."""
72
+
73
+ def __init__(self, config: PretrainedConfig) -> None:
74
+ super().__init__()
75
+
76
+ self.wte = nn.Embedding(config.vocab_size, config.n_embd)
77
+ self.drop = nn.Dropout(config.embd_pdrop)
78
+
79
+ def forward(self, input_ids: torch.LongTensor) -> torch.FloatTensor:
80
+ input_shape = input_ids.size()
81
+ input_ids = input_ids.view(-1, input_shape[-1])
82
+
83
+ hidden_states = self.wte(input_ids)
84
+ hidden_states = self.drop(hidden_states)
85
+
86
+ return hidden_states
87
+
88
+
89
+ def _apply_rotary_emb(
90
+ x: torch.FloatTensor,
91
+ cos: torch.FloatTensor,
92
+ sin: torch.FloatTensor,
93
+ ) -> torch.FloatTensor:
94
+ _, seqlen, _, _ = x.shape
95
+ _, rotary_dim = cos.shape
96
+ rotary_dim *= 2
97
+
98
+ x_rot = x[:, :, :, :rotary_dim]
99
+ x_pass = x[:, :, :, rotary_dim:]
100
+
101
+ x1, x2 = x_rot.chunk(2, dim=-1)
102
+ c, s = rearrange(cos[:seqlen], "s d -> s 1 d"), rearrange(sin[:seqlen], "s d -> s 1 d")
103
+ x1, x2, c, s = [t.to(dtype=torch.float32) for t in [x1, x2, c, s]]
104
+
105
+ x_rot = torch.cat([x1 * c - x2 * s, x1 * s + x2 * c], axis=-1).to(x.dtype)
106
+
107
+ return torch.cat([x_rot, x_pass], axis=-1)
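The rotation applied by `_apply_rotary_emb` can be illustrated without tensors. The following is a minimal sketch in plain Python, assuming a single vector at a single position; the helper names `inv_frequencies` and `rotate_vector` are hypothetical and are not part of this file. It uses the same half-split layout as `x_rot.chunk(2, dim=-1)` and the same inverse-frequency formula as `_compute_inv_freq`.

```python
import math
from typing import List


def inv_frequencies(rotary_dim: int, base: float = 10000.0) -> List[float]:
    """One inverse frequency per rotated pair: 1 / base ** (i / rotary_dim), i = 0, 2, 4, ..."""
    return [1.0 / (base ** (i / rotary_dim)) for i in range(0, rotary_dim, 2)]


def rotate_vector(x: List[float], position: int, rotary_dim: int, base: float = 10000.0) -> List[float]:
    """Rotate the first `rotary_dim` features of `x`; pass the remaining features through."""
    half = rotary_dim // 2
    x1, x2 = x[:half], x[half:rotary_dim]  # first and second half of the rotated slice
    x_pass = x[rotary_dim:]                # features beyond `rotary_dim` are left untouched

    out1, out2 = [], []
    for a, b, freq in zip(x1, x2, inv_frequencies(rotary_dim, base)):
        angle = position * freq
        c, s = math.cos(angle), math.sin(angle)
        out1.append(a * c - b * s)
        out2.append(a * s + b * c)
    return out1 + out2 + x_pass


vec = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rotated = rotate_vector(vec, position=7, rotary_dim=4)
```

Because each pair `(x1[i], x2[i])` is rotated by a plain 2D rotation, the norm of the rotated slice is preserved, and position 0 leaves the vector unchanged.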
108
+
109
+
110
+ def _apply_rotary_emb_kv(
111
+ kv: torch.FloatTensor,
112
+ cos: torch.FloatTensor,
113
+ sin: torch.FloatTensor,
114
+ cos_k: Optional[torch.FloatTensor] = None,
115
+ sin_k: Optional[torch.FloatTensor] = None,
116
+ ) -> torch.FloatTensor:
117
+ _, seqlen, _, _, _ = kv.shape
118
+ _, rotary_dim = cos.shape
119
+ rotary_dim *= 2
120
+
121
+ k_rot = kv[:, :, 0, :, :rotary_dim]
122
+ k_pass = kv[:, :, 0, :, rotary_dim:]
123
+
124
+ k1, k2 = k_rot.chunk(2, dim=-1)
125
+ c, s = rearrange(cos[:seqlen], "s d -> s 1 d"), rearrange(sin[:seqlen], "s d -> s 1 d")
126
+ k1, k2, c, s = [t.to(dtype=torch.float32) for t in [k1, k2, c, s]]
127
+
128
+ k_rot = torch.cat([k1 * c - k2 * s, k1 * s + k2 * c], axis=-1).to(kv.dtype)
129
+
130
+ return torch.cat(
131
+ [
132
+ torch.cat([k_rot, k_pass], axis=-1).unsqueeze(2),
133
+ kv[:, :, 1:2, :, :],
134
+ ],
135
+ axis=2,
136
+ )
137
+
138
+
139
+ def _apply_rotary_emb_qkv(
140
+ qkv: torch.FloatTensor,
141
+ cos: torch.FloatTensor,
142
+ sin: torch.FloatTensor,
143
+ cos_k: Optional[torch.FloatTensor] = None,
144
+ sin_k: Optional[torch.FloatTensor] = None,
145
+ ) -> torch.FloatTensor:
146
+ _, seqlen, _, _, _ = qkv.shape
147
+ _, rotary_dim = cos.shape
148
+ rotary_dim *= 2
149
+
150
+ q_rot = qkv[:, :, 0, :, :rotary_dim]
151
+ q_pass = qkv[:, :, 0, :, rotary_dim:]
152
+
153
+ k_rot = qkv[:, :, 1, :, :rotary_dim]
154
+ k_pass = qkv[:, :, 1, :, rotary_dim:]
155
+
156
+ q1, q2 = q_rot.chunk(2, dim=-1)
157
+ k1, k2 = k_rot.chunk(2, dim=-1)
158
+ c, s = rearrange(cos[:seqlen], "s d -> s 1 d"), rearrange(sin[:seqlen], "s d -> s 1 d")
159
+ q1, q2, k1, k2, c, s = [t.to(dtype=torch.float32) for t in [q1, q2, k1, k2, c, s]]
160
+
161
+ q_rot = torch.cat([q1 * c - q2 * s, q1 * s + q2 * c], axis=-1).to(qkv.dtype)
162
+ k_rot = torch.cat([k1 * c - k2 * s, k1 * s + k2 * c], axis=-1).to(qkv.dtype)
163
+
164
+ return torch.cat(
165
+ [
166
+ torch.cat([q_rot, q_pass], axis=-1).unsqueeze(2),
167
+ torch.cat([k_rot, k_pass], axis=-1).unsqueeze(2),
168
+ qkv[:, :, 2:3, :, :],
169
+ ],
170
+ axis=2,
171
+ )
172
+
173
+
174
+ class RotaryEmbedding(nn.Module):
175
+ """Rotary positional embedding (RoPE).
176
+
177
+ Reference:
178
+ RoFormer: Enhanced Transformer with Rotary Position Embedding.
179
+ https://arxiv.org/pdf/2104.09864.pdf.
180
+
181
+ """
182
+
183
+ def __init__(
184
+ self,
185
+ dim: int,
186
+ base: int = 10000,
187
+ scale_base: Optional[float] = None,
188
+ pos_idx_in_fp32: bool = True,
189
+ max_position_embeddings: int = 2048,
190
+ device: Optional[str] = None,
191
+ **kwargs,
192
+ ) -> None:
193
+ super().__init__()
194
+
195
+ if scale_base is not None:
196
+ raise NotImplementedError
197
+
198
+ self.dim = dim
199
+ self.base = float(base)
200
+ self.scale_base = scale_base
201
+ self.pos_idx_in_fp32 = pos_idx_in_fp32
202
+ self.max_position_embeddings = max_position_embeddings
203
+ self.device = device
204
+
205
+ # Generate and save the inverse frequency buffer (non-trainable)
206
+ inv_freq = self._compute_inv_freq(device)
207
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
208
+
209
+ # Generate and save the scale buffer (non-trainable)
210
+ scale = (
211
+ (torch.arange(0, dim, 2, device=device, dtype=torch.float32) + 0.4 * dim) / (1.4 * dim)
212
+ if scale_base is not None
213
+ else None
214
+ )
215
+ self.register_buffer("scale", scale, persistent=False)
216
+
217
+ # Initialize cached attributes since ONNX can't rely on dynamic initialization
218
+ self._update_cos_sin_cache(max_position_embeddings, device=device, dtype=torch.float32)
219
+
220
+ def _compute_inv_freq(self, device: Optional[str] = None) -> torch.FloatTensor:
221
+ return 1.0 / (self.base ** (torch.arange(0, self.dim, 2, device=device, dtype=torch.float32) / self.dim))
222
+
223
+ def _update_cos_sin_cache(
224
+ self,
225
+ seqlen: int,
226
+ device: Optional[str] = None,
227
+ dtype: Optional[torch.dtype] = None,
228
+ ) -> None:
229
+ self._seq_len_cached = seqlen
230
+
231
+ # fp32 is preferred since the output of `torch.arange` can be quite large
232
+ # and bf16 would lose a lot of precision
233
+ if self.pos_idx_in_fp32:
234
+ t = torch.arange(seqlen, device=device, dtype=torch.float32)
235
+ if self.inv_freq.dtype != torch.float32:
236
+ inv_freq = self._compute_inv_freq(device=device)
237
+ else:
238
+ inv_freq = self.inv_freq
239
+ else:
240
+ t = torch.arange(seqlen, device=device, dtype=self.inv_freq.dtype)
241
+ inv_freq = self.inv_freq
242
+
243
+ # `torch.outer` is preferred since `torch.einsum` converts from fp32 to fp16 if used with AMP
244
+ freqs = torch.outer(t, inv_freq)
245
+ if self.scale is None:
246
+ self._cos_cached = torch.cos(freqs).to(dtype)
247
+ self._sin_cached = torch.sin(freqs).to(dtype)
248
+ else:
249
+ power = (
250
+ torch.arange(seqlen, dtype=self.scale.dtype, device=self.scale.device) - seqlen // 2
251
+ ) / self.scale_base
252
+ scale = self.scale.to(device=power.device) ** rearrange(power, "s -> s 1")
253
+
254
+ # Force the scale multiplication to happen in fp32
255
+ self._cos_cached = (torch.cos(freqs) * scale).to(dtype)
256
+ self._sin_cached = (torch.sin(freqs) * scale).to(dtype)
257
+ self._cos_k_cached = (torch.cos(freqs) / scale).to(dtype)
258
+ self._sin_k_cached = (torch.sin(freqs) / scale).to(dtype)
259
+
260
+ def forward(
261
+ self,
262
+ qkv: torch.Tensor,
263
+ kv: Optional[torch.Tensor] = None,
264
+ seqlen_offset: int = 0,
265
+ **kwargs,
266
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
267
+ if (
268
+ self._seq_len_cached < qkv.shape[1] + seqlen_offset
269
+ or self._cos_cached.device != qkv.device
270
+ or self._cos_cached.dtype != qkv.dtype
271
+ or (self.training and self._cos_cached.is_inference())
272
+ ):
273
+ self._update_cos_sin_cache(qkv.shape[1] + seqlen_offset, device=qkv.device, dtype=qkv.dtype)
274
+
275
+ if kv is None:
276
+ return _apply_rotary_emb_qkv(
277
+ qkv,
278
+ self._cos_cached[seqlen_offset:],
279
+ self._sin_cached[seqlen_offset:],
280
+ )
281
+ else:
282
+ q = _apply_rotary_emb(
283
+ qkv,
284
+ self._cos_cached[seqlen_offset:],
285
+ self._sin_cached[seqlen_offset:],
286
+ )
287
+ kv = _apply_rotary_emb_kv(
288
+ kv,
289
+ self._cos_cached[seqlen_offset:],
290
+ self._sin_cached[seqlen_offset:],
291
+ )
292
+
293
+ return q, kv
294
+
295
+
296
+ class MLP(nn.Module):
297
+ """Multi-Layer Perceptron.
298
+
299
+ Reference:
300
+ Attention Is All You Need.
301
+ https://arxiv.org/pdf/1706.03762.pdf.
302
+
303
+ """
304
+
305
+ def __init__(
306
+ self,
307
+ config: PretrainedConfig,
308
+ n_inner: Optional[int] = None,
309
+ act_fn: Optional[str] = None,
310
+ ) -> None:
311
+ super().__init__()
312
+
313
+ act_fn = config.activation_function if act_fn is None else act_fn
314
+
315
+ n_inner = getattr(config, "n_inner", None) if n_inner is None else n_inner
316
+ n_inner = n_inner if n_inner is not None else 4 * config.n_embd
317
+
318
+ self.fc1 = nn.Linear(config.n_embd, n_inner)
319
+ self.fc2 = nn.Linear(n_inner, config.n_embd)
320
+ self.act = ACT2FN[act_fn]
321
+
322
+ def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
323
+ hidden_states = self.fc1(hidden_states)
324
+ hidden_states = self.act(hidden_states)
325
+ hidden_states = self.fc2(hidden_states)
326
+
327
+ return hidden_states
328
+
329
+
330
+ class SelfAttention(nn.Module):
331
+ """Self-attention layer (compatible with PyTorch).
332
+
333
+ Reference:
334
+ https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/modules/mha.py.
335
+
336
+ """
337
+
338
+ def __init__(
339
+ self,
340
+ causal: bool = True,
341
+ softmax_scale: Optional[float] = None,
342
+ attention_dropout: float = 0.0,
343
+ ) -> None:
344
+ super().__init__()
345
+
346
+ self.causal = causal
347
+ self.softmax_scale = softmax_scale
348
+ self.drop = nn.Dropout(attention_dropout)
349
+
350
+ @torch.autocast("cpu", enabled=False)
351
+ @torch.autocast("cuda", enabled=False)
352
+ def forward(
353
+ self,
354
+ qkv: torch.FloatTensor,
355
+ causal: Optional[bool] = None,
356
+ key_padding_mask: Optional[torch.BoolTensor] = None,
357
+ **kwargs,
358
+ ) -> torch.FloatTensor:
359
+ batch_size, seqlen = qkv.shape[0], qkv.shape[1]
360
+ q, k, v = qkv.unbind(dim=2)
361
+
362
+ q = q.to(torch.float32)
363
+ k = k.to(torch.float32)
364
+
365
+ causal = self.causal if causal is None else causal
366
+ softmax_scale = self.softmax_scale or 1.0 / math.sqrt(q.shape[-1])
367
+
368
+ # Autocast is manually disabled to avoid `torch.einsum` performing the operation
369
+ # using float16, which might lead to overflow
370
+ scores = torch.einsum("bthd,bshd->bhts", q, k * softmax_scale)
371
+
372
+ if key_padding_mask is not None:
373
+ padding_mask = torch.full((batch_size, seqlen), -10000.0, dtype=scores.dtype, device=scores.device)
374
+ padding_mask.masked_fill_(key_padding_mask, 0.0)
375
+
376
+ scores = scores + rearrange(padding_mask, "b s -> b 1 1 s")
377
+
378
+ if causal:
379
+ causal_mask = torch.triu(torch.full((seqlen, seqlen), -10000.0, device=scores.device), 1)
380
+ scores = scores + causal_mask.to(dtype=scores.dtype)
381
+
382
+ attention = torch.softmax(scores, dim=-1).to(v.dtype)
383
+ attention = self.drop(attention)
384
+
385
+ output = torch.einsum("bhts,bshd->bthd", attention, v)
386
+
387
+ return output
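The masking scheme in `SelfAttention.forward` can be sketched in plain Python. This is an illustration only, assuming a single head, no dropout, and no padding mask; `causal_attention` and `softmax` are hypothetical helpers, not functions from this file. As in the layer, masked positions receive an additive `-10000.0` rather than negative infinity.

```python
import math
from typing import List

MASK_VALUE = -10000.0  # same additive constant the layer uses for masked positions


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q: List[List[float]], k: List[List[float]], v: List[List[float]]) -> List[List[float]]:
    """Single-head causal attention over lists shaped [seqlen][head_dim]."""
    scale = 1.0 / math.sqrt(len(q[0]))  # default softmax scale: 1 / sqrt(head_dim)
    outputs = []
    for t, q_t in enumerate(q):
        scores = []
        for s, k_s in enumerate(k):
            score = sum(a * b for a, b in zip(q_t, k_s)) * scale
            if s > t:  # key lies in the future of query t
                score += MASK_VALUE
            scores.append(score)
        weights = softmax(scores)
        outputs.append([sum(w * v_s[d] for w, v_s in zip(weights, v)) for d in range(len(v[0]))])
    return outputs


q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
out = causal_attention(q, k, v)
```

The first query can only see the first key, so its output equals the first value vector.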
388
+
389
+
390
+ class CrossAttention(nn.Module):
391
+ """Cross-attention layer (compatible with PyTorch).
392
+
393
+ Reference:
394
+ https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/modules/mha.py.
395
+
396
+ """
397
+
398
+ def __init__(
399
+ self,
400
+ causal: bool = True,
401
+ softmax_scale: Optional[float] = None,
402
+ attention_dropout: float = 0.0,
403
+ ) -> None:
404
+ super().__init__()
405
+
406
+ self.causal = causal
407
+ self.softmax_scale = softmax_scale
408
+ self.drop = nn.Dropout(attention_dropout)
409
+
410
+ @torch.autocast("cpu", enabled=False)
411
+ @torch.autocast("cuda", enabled=False)
412
+ def forward(
413
+ self,
414
+ q: torch.FloatTensor,
415
+ kv: torch.FloatTensor,
416
+ causal: Optional[bool] = None,
417
+ key_padding_mask: Optional[torch.BoolTensor] = None,
418
+ **kwargs,
419
+ ) -> torch.FloatTensor:
420
+ batch_size, seqlen_q = q.shape[0], q.shape[1]
421
+ seqlen_k = kv.shape[1]
422
+
423
+ if kv.shape[3] != q.shape[2]:
424
+ kv = repeat(kv, "... hkv d -> ... (hkv g) d", g=q.shape[2] // kv.shape[3])
425
+ k, v = kv.unbind(dim=2)
426
+
427
+ q = q.to(torch.float32)
428
+ k = k.to(torch.float32)
429
+
430
+ causal = self.causal if causal is None else causal
431
+ softmax_scale = self.softmax_scale or 1.0 / math.sqrt(q.shape[-1])
432
+
433
+ # Autocast is manually disabled to avoid `torch.einsum` performing the operation
434
+ # using float16, which might lead to overflow
435
+ scores = torch.einsum("bthd,bshd->bhts", q, k * softmax_scale)
436
+
437
+ if key_padding_mask is not None:
438
+ padding_mask = torch.full(
439
+ (batch_size, seqlen_k),
440
+ -10000.0,
441
+ dtype=scores.dtype,
442
+ device=scores.device,
443
+ )
444
+ padding_mask.masked_fill_(key_padding_mask, 0.0)
445
+
446
+ scores = scores + rearrange(padding_mask, "b s -> b 1 1 s")
447
+
448
+ if causal:
449
+ rows = rearrange(torch.arange(seqlen_q, device=q.device, dtype=torch.long), "s -> s 1")
450
+ cols = torch.arange(seqlen_k, device=k.device, dtype=torch.long)
451
+ causal_mask = cols > rows + seqlen_k - seqlen_q
452
+
453
+ scores = scores.masked_fill(causal_mask, -10000.0)
454
+
455
+ attention = torch.softmax(scores, dim=-1).to(v.dtype)
456
+ attention = self.drop(attention)
457
+
458
+ output = torch.einsum("bhts,bshd->bthd", attention, v)
459
+
460
+ return output
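The causal mask in `CrossAttention.forward` differs from the square mask in `SelfAttention` because the queries may cover only the tail of the key sequence (for example, one new token attending to a longer cache). A minimal sketch of the same comparison, `cols > rows + seqlen_k - seqlen_q`, in plain Python; `cross_causal_mask` is a hypothetical name used for illustration.

```python
from typing import List


def cross_causal_mask(seqlen_q: int, seqlen_k: int) -> List[List[bool]]:
    """Return a [seqlen_q][seqlen_k] mask; True marks a key the query may NOT attend to."""
    offset = seqlen_k - seqlen_q  # number of keys that precede the first query
    return [[col > row + offset for col in range(seqlen_k)] for row in range(seqlen_q)]


# Two new queries attending to four keys (two cached, two new).
mask = cross_causal_mask(seqlen_q=2, seqlen_k=4)
for row in mask:
    print(row)
```

When `seqlen_q == seqlen_k` the offset is zero and the mask reduces to the usual strictly upper-triangular pattern.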
461
+
462
+
463
+ def _find_mha_dims(
464
+ config: PretrainedConfig,
465
+ n_head: Optional[int] = None,
466
+ n_head_kv: Optional[int] = None,
467
+ head_dim: Optional[int] = None,
468
+ ) -> Tuple[int, int, int]:
469
+ if n_head is None and head_dim is None:
470
+ head_dim = config.n_embd // config.n_head
471
+ n_head = config.n_head
472
+ elif n_head is None or head_dim is None:
473
+ raise ValueError("`n_head` and `head_dim` must be both specified or `None`.")
474
+
475
+ if n_head_kv is None:
476
+ n_head_kv = getattr(config, "n_head_kv", None) or n_head
477
+
478
+ return n_head, n_head_kv, head_dim
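The dimensions returned by `_find_mha_dims` determine the width of the fused `Wqkv` projection, `head_dim * (n_head + 2 * n_head_kv)`. The sketch below reproduces that arithmetic with a hypothetical helper, `qkv_projection_size`. The value `n_embd=2560` matches the token embedding size quoted in the README; `n_head=32` and the grouped-query settings are assumptions chosen for illustration, not values read from this repository's config.

```python
from typing import Optional


def qkv_projection_size(n_embd: int, n_head: int, n_head_kv: Optional[int] = None) -> int:
    """Output width of a fused QKV projection with optional grouped key/value heads."""
    head_dim = n_embd // n_head
    n_head_kv = n_head_kv or n_head  # fall back to standard multi-head attention
    return head_dim * (n_head + 2 * n_head_kv)


mha_width = qkv_projection_size(2560, 32)               # every query head has its own K/V head
gqa_width = qkv_projection_size(2560, 32, n_head_kv=8)  # 4 query heads share each K/V head
mqa_width = qkv_projection_size(2560, 32, n_head_kv=1)  # all query heads share one K/V head
print(mha_width, gqa_width, mqa_width)
```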
479
+
480
+
481
+ def _update_kv_cache(kv: torch.FloatTensor, inference_params: InferenceParams, layer_idx: int) -> torch.FloatTensor:
482
+ num_heads, head_dim = kv.shape[-2:]
483
+
484
+ if layer_idx not in inference_params.key_value_memory_dict:
485
+ inference_params.key_value_memory_dict[layer_idx] = torch.empty(
486
+ inference_params.max_batch_size,
487
+ inference_params.max_seqlen,
488
+ 2,
489
+ num_heads,
490
+ head_dim,
491
+ dtype=kv.dtype,
492
+ device=kv.device,
493
+ )
494
+
495
+ batch_start = inference_params.batch_size_offset
496
+ batch_end = batch_start + kv.shape[0]
497
+
498
+ sequence_start = inference_params.seqlen_offset
499
+ sequence_end = sequence_start + kv.shape[1]
500
+
501
+ # When the current sequence length is equal to or larger than the maximum sequence length,
502
+ # we need to concatenate the current `kv` with the cached `kv` to expand its length
503
+ if sequence_end >= inference_params.max_seqlen:
504
+ inference_params.key_value_memory_dict[layer_idx] = torch.concatenate((inference_params.key_value_memory_dict[layer_idx], kv), dim=1)
505
+
506
+ inference_params.key_value_memory_dict[layer_idx][batch_start:batch_end, sequence_start:sequence_end, ...] = kv
507
+ kv = inference_params.key_value_memory_dict[layer_idx][batch_start:batch_end, :sequence_end, ...]
508
+
509
+ return kv
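The indexing in `_update_kv_cache` is easier to follow on a toy example. The sketch below is a simplification under stated assumptions: one batch element, Python lists instead of tensors, and a hypothetical `ToyCache` class standing in for `InferenceParams`. Unlike the function above, which concatenates to grow the buffer when it runs out of room, this sketch simply raises.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyCache:
    """Hypothetical stand-in for `InferenceParams`, for a single batch element."""

    max_seqlen: int
    seqlen_offset: int = 0
    memory: Dict[int, List] = field(default_factory=dict)


def update_cache(kv_new: List, cache: ToyCache, layer_idx: int) -> List:
    """Write `kv_new` at the current offset and return every entry cached so far."""
    if layer_idx not in cache.memory:
        # Preallocate once per layer, as the real function does with `torch.empty`.
        cache.memory[layer_idx] = [None] * cache.max_seqlen

    start = cache.seqlen_offset
    end = start + len(kv_new)
    if end > cache.max_seqlen:
        raise ValueError("sequence does not fit in the preallocated cache")

    cache.memory[layer_idx][start:end] = kv_new
    return cache.memory[layer_idx][:end]


cache = ToyCache(max_seqlen=8)
prompt_kv = update_cache(["a", "b", "c"], cache, layer_idx=0)  # prompt pass
cache.seqlen_offset = 3                                        # three tokens now cached
step_kv = update_cache(["d"], cache, layer_idx=0)              # one decoding step
```

The caller is responsible for advancing `seqlen_offset` between steps; the update function only reads it.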
510
+
511
+
512
+ class MHA(nn.Module):
513
+ """Multi-head attention layer."""
514
+
515
+ def __init__(
516
+ self,
517
+ config: PretrainedConfig,
518
+ dtype: Optional[torch.dtype] = None,
519
+ device: Optional[str] = None,
520
+ rotary_dim: Optional[int] = None,
521
+ rotary_base: float = 10000.0,
522
+ rotary_scale_base: Optional[float] = None,
523
+ n_head: Optional[int] = None,
524
+ n_head_kv: Optional[int] = None,
525
+ head_dim: Optional[int] = None,
526
+ bias: bool = True,
527
+ causal: bool = True,
528
+ softmax_scale: Optional[float] = None,
529
+ layer_idx: Optional[int] = None,
530
+ return_residual: bool = False,
531
+ checkpointing: bool = True,
532
+ ) -> None:
533
+ super().__init__()
534
+
535
+ # Rotary embedding
536
+ self.rotary_dim = rotary_dim if rotary_dim is not None else getattr(config, "rotary_dim", 0)
537
+ if self.rotary_dim > 0:
538
+ rotary_cls = FlashRotaryEmbedding if config.flash_rotary else RotaryEmbedding
539
+ if rotary_cls is None:
540
+ rotary_cls = RotaryEmbedding
541
+
542
+ rotary_kwargs = {}
543
+ if rotary_cls is RotaryEmbedding:
544
+ rotary_kwargs["max_position_embeddings"] = config.n_positions
545
+
546
+ self.rotary_emb = rotary_cls(
547
+ self.rotary_dim,
548
+ base=rotary_base,
549
+ scale_base=rotary_scale_base,
550
+ device=device,
551
+ **rotary_kwargs,
552
+ )
553
+
554
+ # QKV and output projections
555
+ self.n_head, self.n_head_kv, self.head_dim = _find_mha_dims(
556
+ config, n_head=n_head, n_head_kv=n_head_kv, head_dim=head_dim
557
+ )
558
+ op_size = self.head_dim * (self.n_head + 2 * self.n_head_kv)
559
+ hidden_size = config.n_embd
560
+
561
+ linear_cls = FusedDense if config.fused_dense else nn.Linear
562
+ if linear_cls is None:
563
+ linear_cls = nn.Linear
564
+
565
+ self.Wqkv = linear_cls(hidden_size, op_size, bias=bias, device=device, dtype=dtype)
566
+ self.out_proj = linear_cls(hidden_size, hidden_size, bias=bias, device=device, dtype=dtype)
567
+
568
+ # Attention
569
+ attn_cls = FlashSelfAttention if config.flash_attn else SelfAttention
570
+ if attn_cls is None:
571
+ attn_cls = SelfAttention
572
+
573
+ cross_attn_cls = FlashCrossAttention if config.flash_attn else CrossAttention
574
+ if cross_attn_cls is None:
575
+ cross_attn_cls = CrossAttention
576
+
577
+ self.inner_attn = attn_cls(
578
+ causal=causal,
579
+ softmax_scale=softmax_scale,
580
+ attention_dropout=config.attn_pdrop,
581
+ )
582
+ self.inner_cross_attn = cross_attn_cls(
583
+ causal=causal,
584
+ softmax_scale=softmax_scale,
585
+ attention_dropout=config.attn_pdrop,
586
+ )
587
+
588
+ self.flash_attn = config.flash_attn and attn_cls is FlashSelfAttention
589
+ self.layer_idx = layer_idx
590
+ self.return_residual = return_residual
591
+ self.checkpointing = checkpointing
592
+
593
+ def _forward_self_attn(
594
+ self, x: torch.FloatTensor, key_padding_mask: Optional[torch.BoolTensor]
595
+ ) -> torch.FloatTensor:
596
+ qkv = self.Wqkv(x)
597
+ qkv = rearrange(qkv, "... (three h d) -> ... three h d", three=3, d=self.head_dim)
598
+
599
+ if self.rotary_dim > 0:
600
+ qkv = self.rotary_emb(qkv)
601
+
602
+ if self.flash_attn:
603
+ batch_size, seqlen = qkv.shape[0], qkv.shape[1]
604
+
605
+ cu_seqlens, max_seqlen = None, None
606
+ if key_padding_mask is not None:
607
+ # If `key_padding_mask` is supplied, we need to unpad the input and retrieve
608
+ # the `cu_seqlens` and `max_seqlen` to be used by `flash-attn`
609
+ qkv, indices, cu_seqlens, max_seqlen = unpad_input(qkv, key_padding_mask)
610
+
611
+ if self.checkpointing:
612
+ attn_output = torch.utils.checkpoint.checkpoint(
613
+ self.inner_attn, qkv, None, cu_seqlens, max_seqlen, use_reentrant=False
614
+ )
615
+ else:
616
+ attn_output = self.inner_attn(qkv, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen).to(qkv.device)
617
+
618
+ # If `key_padding_mask` is supplied, we need to pad the output back to the original shape
619
+ return pad_input(attn_output, indices, batch_size, seqlen) if key_padding_mask is not None else attn_output
620
+
621
+ if self.checkpointing:
622
+ return torch.utils.checkpoint.checkpoint(self.inner_attn, qkv, None, key_padding_mask, use_reentrant=False)
623
+
624
+ return self.inner_attn(qkv, key_padding_mask=key_padding_mask)
625
+
626
+ def _forward_cross_attn(
627
+ self,
628
+ x: torch.FloatTensor,
629
+ past_key_values: Optional[InferenceParams],
630
+ key_padding_mask: Optional[torch.BoolTensor],
631
+ ) -> torch.FloatTensor:
632
+ batch_size = x.shape[0]
633
+
634
+ qkv = self.Wqkv(x)
635
+
636
+ q = qkv[..., : self.n_head * self.head_dim]
637
+ q = rearrange(q, "... (h d) -> ... h d", d=self.head_dim)
638
+
639
+ kv = qkv[..., self.n_head * self.head_dim :]
640
+ kv = rearrange(kv, "... (two hkv d) -> ... two hkv d", two=2, d=self.head_dim)
641
+
642
+ seqlen_offset = past_key_values.seqlen_offset if past_key_values is not None else 0
643
+ causal = None if seqlen_offset == 0 else False
644
+ if self.rotary_dim > 0:
645
+ q, kv = self.rotary_emb(q, kv=kv, seqlen_offset=seqlen_offset)
646
+
647
+ if past_key_values is not None:
648
+ kv = _update_kv_cache(kv, past_key_values, self.layer_idx)
649
+
650
+ if self.flash_attn:
651
+ batch_size, seqlen_q = q.shape[0], q.shape[1]
652
+ seqlen_k = kv.shape[1]
653
+
654
+ cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k = (
655
+ None,
656
+ None,
657
+ None,
658
+ None,
659
+ )
660
+ if key_padding_mask is not None:
661
+ kv, _, cu_seqlens_k, max_seqlen_k = unpad_input(kv, key_padding_mask)
662
+
663
+ if seqlen_q == 1:
664
+ key_padding_mask = torch.ones(batch_size, 1, device=q.device)
665
+ elif seqlen_q != seqlen_k:
666
+ key_padding_mask = key_padding_mask[:, -seqlen_q:]
667
+
668
+ q, indices_q, cu_seqlens_q, max_seqlen_q = unpad_input(q, key_padding_mask)
669
+
670
+ if self.checkpointing:
671
+ attn_output = torch.utils.checkpoint.checkpoint(
672
+ self.inner_cross_attn,
673
+ q,
674
+ kv,
675
+ causal,
676
+ cu_seqlens_q,
677
+ max_seqlen_q,
678
+ cu_seqlens_k,
679
+ max_seqlen_k,
680
+ use_reentrant=False,
681
+ )
682
+ else:
683
+ attn_output = self.inner_cross_attn(
684
+ q,
685
+ kv,
686
+ causal=causal,
687
+ cu_seqlens=cu_seqlens_q,
688
+ max_seqlen=max_seqlen_q,
689
+ cu_seqlens_k=cu_seqlens_k,
690
+ max_seqlen_k=max_seqlen_k,
691
+ )
692
+
693
+ return (
694
+ pad_input(attn_output, indices_q, batch_size, max_seqlen_q)
695
+ if key_padding_mask is not None
696
+ else attn_output
697
+ )
698
+
699
+ if self.checkpointing:
700
+ return torch.utils.checkpoint.checkpoint(
701
+ self.inner_cross_attn,
702
+ q,
703
+ kv,
704
+ causal,
705
+ key_padding_mask,
706
+ use_reentrant=False,
707
+ )
708
+
709
+ return self.inner_cross_attn(q, kv, key_padding_mask=key_padding_mask, causal=causal)
710
+
711
+ def forward(
712
+ self,
713
+ x: torch.FloatTensor,
714
+ past_key_values: Optional[InferenceParams] = None,
715
+ attention_mask: Optional[Union[torch.LongTensor, torch.BoolTensor]] = None,
716
+ **kwargs,
717
+ ) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
718
+ if attention_mask is not None:
719
+ attention_mask = attention_mask.bool()
720
+ else:
721
+ attention_mask = None
722
+
723
+ # MHA
724
+ if self.n_head == self.n_head_kv:
725
+ if past_key_values is None:
726
+ # If `past_key_values` are not supplied, we run self-attention
727
+ attn_output = self._forward_self_attn(x, attention_mask)
728
+ else:
729
+ # If `past_key_values` are supplied, it means that we might have cached values and
730
+ # could take advantage of cross-attention
731
+ attn_output = self._forward_cross_attn(x, past_key_values, attention_mask)
732
+ # MQA / GQA
733
+ else:
734
+ # Regardless of `past_key_values` being supplied or not, we always use cross-attention
735
+ # because `q` and `kv` lengths might be different
736
+ attn_output = self._forward_cross_attn(x, past_key_values, attention_mask)
737
+
738
+ output = rearrange(attn_output, "... h d -> ... (h d)")
739
+ output = self.out_proj(output)
740
+
741
+ return output if not self.return_residual else (output, x)
742
+
743
+
744
+ class ParallelBlock(nn.Module):
745
+ """Parallel block.
746
+
747
+ This block applies parallel mixer and MLP layers to the input (used in GPT-J and CodeGen).
748
+
749
+ """
750
+
751
+ def __init__(
752
+ self,
753
+ config: PretrainedConfig,
754
+ block_idx: Optional[int] = None,
755
+ ) -> None:
756
+ super().__init__()
757
+
758
+ self.ln = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
759
+ self.resid_dropout = nn.Dropout(config.resid_pdrop)
760
+ self.block_idx = block_idx
761
+
762
+ self.mixer = MHA(config, layer_idx=block_idx)
763
+ self.mlp = MLP(config)
764
+
765
+ def forward(
766
+ self,
767
+ hidden_states: torch.FloatTensor,
768
+ past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
769
+ attention_mask: Optional[torch.BoolTensor] = None,
770
+ **kwargs,
771
+ ) -> torch.FloatTensor:
772
+ residual = hidden_states
773
+ hidden_states = self.ln(hidden_states)
774
+
775
+ attn_outputs = self.mixer(
776
+ hidden_states,
777
+ past_key_values=past_key_values,
778
+ attention_mask=attention_mask,
779
+ )
780
+ if isinstance(attn_outputs, tuple):
781
+ attn_outputs = attn_outputs[0]
782
+
783
+ attn_outputs = self.resid_dropout(attn_outputs)
784
+ feed_forward_hidden_states = self.resid_dropout(self.mlp(hidden_states))
785
+
786
+ hidden_states = attn_outputs + feed_forward_hidden_states + residual
787
+
788
+ return hidden_states
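The difference between this parallel layout and a conventional sequential transformer block can be shown with scalar stand-ins. In the sketch below, `ln`, `attn`, and `mlp` are arbitrary toy functions (assumptions for illustration, not the real layers), and dropout is omitted. In the parallel form both branches read the same normalized input and are summed with the residual in one step.

```python
from typing import Callable

Fn = Callable[[float], float]


def parallel_block(x: float, ln: Fn, attn: Fn, mlp: Fn) -> float:
    """GPT-J style: attention and MLP both consume ln(x); one residual add."""
    normed = ln(x)
    return attn(normed) + mlp(normed) + x


def sequential_block(x: float, ln1: Fn, attn: Fn, ln2: Fn, mlp: Fn) -> float:
    """Conventional pre-norm layout: the MLP sees the output of the attention step."""
    x = x + attn(ln1(x))
    return x + mlp(ln2(x))


def ln(v: float) -> float:
    return v / 2


def attn(v: float) -> float:
    return v * 3


def mlp(v: float) -> float:
    return v + 1


parallel_out = parallel_block(4.0, ln, attn, mlp)
sequential_out = sequential_block(4.0, ln, attn, ln, mlp)
print(parallel_out, sequential_out)
```

The parallel form needs only one layer norm per block, and the two branches do not depend on each other.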
789
+
790
+
791
+ class CausalLMHead(nn.Module):
792
+ """Causal Language Modeling head.
793
+
794
+ Reference:
795
+ Improving Language Understanding by Generative Pre-Training.
796
+ https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
797
+
798
+ """
799
+
800
+ def __init__(self, config: PretrainedConfig) -> None:
801
+ super().__init__()
802
+
803
+ self.ln = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
804
+ self.linear = nn.Linear(config.n_embd, config.vocab_size)
805
+
806
+ def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
807
+ hidden_states = self.ln(hidden_states)
808
+ logits = self.linear(hidden_states).to(torch.float32)
809
+
810
+ return logits
811
+
812
+
813
+ class CausalLMLoss(nn.Module):
814
+ """Causal Language Modeling loss.
815
+
816
+ Reference:
817
+ Improving Language Understanding by Generative Pre-Training.
818
+ https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
819
+
820
+ """
821
+
822
+ def __init__(self, shift_labels: bool = True) -> None:
823
+ super().__init__()
824
+
825
+ self.shift_labels = shift_labels
826
+ self.loss_fct = nn.CrossEntropyLoss()
827
+
828
+ def forward(self, logits: torch.FloatTensor, labels: torch.LongTensor) -> torch.FloatTensor:
829
+ if self.shift_labels:
830
+ logits = logits[..., :-1, :].contiguous()
831
+ labels = labels[..., 1:].contiguous()
832
+
833
+ loss = self.loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
834
+
835
+ return loss
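The label shift in `CausalLMLoss.forward` pairs the logits at position `t` with the token at position `t + 1`. A minimal sketch in plain Python, assuming a single sequence and mean reduction; `shifted_pairs` and `causal_lm_loss` are hypothetical helpers. This sketch does not model the `ignore_index` handling of `nn.CrossEntropyLoss`.

```python
import math
from typing import List, Sequence, Tuple


def shifted_pairs(logits: Sequence, labels: Sequence) -> List[Tuple]:
    """Align each position's logits with the NEXT token's label."""
    return list(zip(logits[:-1], labels[1:]))


def causal_lm_loss(logits: List[List[float]], labels: List[int]) -> float:
    """Mean cross-entropy over the shifted (logits, label) pairs."""
    pairs = shifted_pairs(logits, labels)
    total = 0.0
    for row, target in pairs:
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))  # log-sum-exp
        total += log_z - row[target]
    return total / len(pairs)


# Uniform logits over a two-token vocabulary give a loss of log(2).
loss = causal_lm_loss([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0, 1, 0])
```

The last position's logits and the first label are dropped, so a sequence of length `n` contributes `n - 1` prediction targets.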
836
+
837
+
838
+ class PhiPreTrainedModel(PreTrainedModel):
839
+ """Phi pre-trained model."""
840
+
841
+ config_class = PhiConfig
842
+ base_model_prefix = "transformer"
843
+ supports_gradient_checkpointing = True
844
+ _no_split_modules = ["ParallelBlock"]
845
+
846
+ def __init__(self, *inputs, **kwargs) -> None:
847
+ super().__init__(*inputs, **kwargs)
848
+
849
+ def _init_weights(self, module: nn.Module) -> None:
850
+ if isinstance(module, (nn.Linear,)):
851
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
852
+ if module.bias is not None:
853
+ module.bias.data.zero_()
854
+ elif isinstance(module, nn.Embedding):
855
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
856
+ if module.padding_idx is not None:
857
+ module.weight.data[module.padding_idx].zero_()
858
+ elif isinstance(module, nn.LayerNorm):
859
+ if module.bias is not None:
860
+ module.bias.data.zero_()
861
+ module.weight.data.fill_(1.0)
862
+
863
+ def prepare_inputs_for_generation(
864
+ self,
865
+ input_ids: torch.LongTensor,
866
+ past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
867
+ attention_mask: Optional[Union[torch.LongTensor, torch.BoolTensor]] = None,
868
+ **kwargs,
869
+ ) -> Dict[str, Any]:
870
+ # if past_key_values is None or not (isinstance(past_key_values, InferenceParams)):
871
+ # past_key_values = InferenceParams(
872
+ # max_seqlen=self.config.n_positions,
873
+ # max_batch_size=input_ids.shape[0],
874
+ # seqlen_offset=0,
875
+ # batch_size_offset=0,
876
+ # key_value_memory_dict={},
877
+ # lengths_per_sample=None,
878
+ # )
879
+ # else:
880
+ # # Assume that `past_key_values` has cached all tokens up to the last token in `input_ids`
881
+ # past_key_values.seqlen_offset = input_ids.shape[1] - 1
882
+ # input_ids = input_ids[:, -1].unsqueeze(-1)
883
+ # attention_mask = attention_mask[:, -1].unsqueeze(-1)
884
+
885
+ return {
886
+ "input_ids": input_ids,
887
+ "past_key_values": past_key_values,
888
+ "attention_mask": attention_mask,
889
+ }
890
+
891
+
892
+ class PhiModel(PhiPreTrainedModel):
893
+ """Phi model."""
894
+
895
+ _keys_to_ignore_on_load_missing = [""]
896
+ _keys_to_ignore_on_load_unexpected = [r"h\.\d+\.mlp\.(fc_in|fc_out)\.(weight|bias)"]
897
+
898
+ def __init__(self, config: PhiConfig) -> None:
899
+ config.flash_attn = True
900
+ config.flash_rotary = True
901
+ super().__init__(config)
902
+
903
+ self.embd = Embedding(config)
904
+ self.h = nn.ModuleList([ParallelBlock(config, block_idx=i) for i in range(config.n_layer)])
905
+ self.gradient_checkpointing = True
906
+ self.post_init()
907
+
908
+ def get_input_embeddings(self) -> nn.Embedding:
909
+ return self.embd.wte
910
+
911
+ def set_input_embeddings(self, new_embeddings: nn.Embedding) -> None:
912
+ self.embd.wte = new_embeddings
913
+
914
+ def forward(
915
+ self,
916
+ input_ids: Optional[torch.LongTensor],
917
+ inputs_embeds: Optional[torch.FloatTensor] = None,
918
+ past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
919
+ attention_mask: Optional[torch.BoolTensor] = None,
920
+ ) -> torch.FloatTensor:
921
+ if input_ids is not None:
922
+ hidden_states = self.embd(input_ids)
923
+ elif inputs_embeds is not None:
924
+ hidden_states = inputs_embeds
925
+ else:
926
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
927
+
928
+ for layer in self.h:
929
+ if self.gradient_checkpointing:
930
+ hidden_states = torch.utils.checkpoint.checkpoint(
931
+ layer.__call__,
932
+ hidden_states,
933
+ past_key_values,
934
+ attention_mask,
935
+ use_reentrant=False,
936
+ )
937
+ else:
938
+ hidden_states = layer(
939
+ hidden_states,
940
+ past_key_values=past_key_values,
941
+ attention_mask=attention_mask,
942
+ )
943
+
944
+ return hidden_states
945
+
946
+
947
+ class PhiForCausalLM(PhiPreTrainedModel):
948
+ """Phi for Causal Language Modeling."""
949
+
950
+ _keys_to_ignore_on_load_missing = [""]
951
+ _keys_to_ignore_on_load_unexpected = [r"transformer\.h\.\d+\.mlp\.(fc_in|fc_out)\.(weight|bias)"]
952
+
953
+ supports_gradient_checkpointing = True
954
+ _no_split_modules = ["ParallelBlock"]
955
+ _skip_keys_device_placement = "past_key_values"
956
+
957
+ def __init__(self, config: PhiConfig) -> None:
958
+ super().__init__(config)
959
+
960
+ self.transformer = PhiModel(config)
961
+ self.lm_head = CausalLMHead(config)
962
+ self.loss = CausalLMLoss()
963
+
964
+ self.post_init()
965
+
966
+ def get_output_embeddings(self) -> nn.Linear:
967
+ return self.lm_head.linear
968
+
969
+ def set_output_embeddings(self, new_embeddings: nn.Linear) -> None:
970
+ self.lm_head.linear = new_embeddings
971
+
972
+ def forward(
973
+ self,
974
+ input_ids: Optional[torch.LongTensor],
975
+ inputs_embeds: Optional[torch.FloatTensor] = None,
976
+ past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
977
+ attention_mask: Optional[torch.BoolTensor] = None,
978
+ labels: Optional[torch.LongTensor] = None,
979
+ **kwargs,
980
+ ) -> CausalLMOutputWithPast:
981
+ hidden_states = self.transformer(input_ids, inputs_embeds=inputs_embeds, past_key_values=past_key_values, attention_mask=attention_mask)
982
+ lm_logits = self.lm_head(hidden_states)
983
+
984
+ loss = None
985
+ if labels is not None:
986
+ loss = self.loss(lm_logits, labels)
987
+
988
+ return CausalLMOutputWithPast(loss=loss, logits=lm_logits, past_key_values=past_key_values)
processing_llava.py ADDED
@@ -0,0 +1,152 @@
+ # coding=utf-8
+ # Copyright 2023 The HuggingFace Inc. team.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """
+ Processor class for Llava.
+ """
+
+
+ from typing import List, Optional, Union
+
+ from transformers.feature_extraction_utils import BatchFeature
+ from transformers.image_utils import ImageInput
+ from transformers.tokenization_utils_base import (
+     PaddingStrategy,
+     PreTokenizedInput,
+     TextInput,
+     TruncationStrategy,
+ )
+ from transformers.utils import TensorType
+ import torch
+ from open_clip.transform import PreprocessCfg, image_transform_v2
+ from modeling_llava import LlavaForConditionalGeneration
+ from PIL import Image
+ import math
+
+
+ class OpenCLIPImageProcessor:
+     def __init__(self, config, crop_size=384, max_tokens=100):
+         cfg = PreprocessCfg(**config)
+         transform = image_transform_v2(cfg=cfg, is_train=False)
+         self.transform = transform
+         self.crop_size = crop_size
+         self.max_tokens = max_tokens
+
+     def __call__(self, image: Image.Image):
+         output = self.transform_func(image)
+         return {
+             "pixel_values": output,
+         }
+
+     def transform_func(self, image: Image.Image):
+         outputs = []
+         outputs.append(self.transform(image))
+         width, height = image.size
+         crop_size = self.crop_size
+         if width <= crop_size and height <= crop_size:
+             outputs = torch.stack(outputs, dim=0)
+             return outputs
+         total_tokens = math.inf
+         while total_tokens > self.max_tokens:
+             total_tokens = math.floor(
+                 (2 * width - crop_size)
+                 / crop_size
+                 * (2 * height - crop_size)
+                 / crop_size
+             )
+             if total_tokens > self.max_tokens:
+                 crop_size += 10
+         stride = crop_size // 2
+         x_steps = int(round((2 * width - crop_size) / crop_size))
+         if x_steps < 1:
+             x_steps = 1
+         y_steps = int(round((2 * height - crop_size) / crop_size))
+         if y_steps < 1:
+             y_steps = 1
+         x_coords = []
+         y_coords = []
+         for i in range(x_steps):
+             x_coords.append([i * stride, i * stride + crop_size])
+         if x_coords[-1][1] != width:
+             x_coords[-1][1] = width
+         for i in range(y_steps):
+             y_coords.append([i * stride, i * stride + crop_size])
+         if y_coords[-1][1] != height:
+             y_coords[-1][1] = height
+         image_parts = []
+         for i in range(len(x_coords)):
+             for j in range(len(y_coords)):
+                 image_parts.append(
+                     image.crop(
+                         (x_coords[i][0], y_coords[j][0], x_coords[i][1], y_coords[j][1])
+                     )
+                 )
+         for image_part in image_parts:
+             outputs.append(self.transform(image_part))
+         outputs = torch.stack(outputs, dim=0)
+         return outputs
+
+     @property
+     def model_input_names(self):
+         return ["pixel_values"]
+
+
+ class LlavaProcessor:
+     def __init__(self, image_processor: OpenCLIPImageProcessor, tokenizer):
+         self.image_processor = image_processor
+         self.tokenizer = tokenizer
+
+     def __call__(
+         self,
+         text: Union[
+             TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]
+         ] = None,
+         images: ImageInput = None,
+         model: LlavaForConditionalGeneration = None,
+         padding: Union[bool, str, PaddingStrategy] = False,
+         truncation: Union[bool, str, TruncationStrategy] = None,
+         max_length=None,
+         return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
+     ) -> BatchFeature:
+         if images is not None:
+             pixel_values = self.image_processor(images)[
+                 "pixel_values"
+             ]
+             pixel_values = pixel_values.to(model.device).to(model.dtype)
+             image_outputs = model.vision_model(pixel_values)
+             image_features = model.multi_modal_projector(image_outputs)
+             image_features = image_features.unsqueeze(0)
+         else:
+             image_features = None
+         text_inputs = self.tokenizer(
+             text,
+             return_tensors=return_tensors,
+             padding=padding,
+             truncation=truncation,
+             max_length=max_length,
+         )
+
+         return BatchFeature(data={**text_inputs, "image_features": image_features})
+
+     def batch_decode(self, *args, **kwargs):
+         return self.tokenizer.batch_decode(*args, **kwargs)
+
+     def decode(self, *args, **kwargs):
+         return self.tokenizer.decode(*args, **kwargs)
+
+     @property
+     def model_input_names(self):
+         tokenizer_input_names = self.tokenizer.model_input_names
+         image_processor_input_names = self.image_processor.model_input_names
+         return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
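The tiling arithmetic in `OpenCLIPImageProcessor.transform_func` is easier to follow without the tensor and image plumbing. The following is a minimal sketch, not part of the commit, that reproduces only the crop-box computation in plain Python; the function name `crop_boxes` is hypothetical, and the defaults are copied from the constructor above.

```python
import math


def crop_boxes(width, height, crop_size=384, max_tokens=100):
    """Return (left, top, right, bottom) boxes, mirroring the tiling in transform_func.

    Hypothetical helper for illustration only; it skips the full-image entry
    that the processor always places first in its output.
    """
    # Small images are not tiled at all: only the full image is encoded.
    if width <= crop_size and height <= crop_size:
        return []

    # Grow the crop size in steps of 10 until the estimated crop count fits the budget.
    total_tokens = math.inf
    while total_tokens > max_tokens:
        total_tokens = math.floor(
            (2 * width - crop_size) / crop_size * (2 * height - crop_size) / crop_size
        )
        if total_tokens > max_tokens:
            crop_size += 10

    # Neighbouring crops overlap by half a crop.
    stride = crop_size // 2
    x_steps = max(1, int(round((2 * width - crop_size) / crop_size)))
    y_steps = max(1, int(round((2 * height - crop_size) / crop_size)))

    x_coords = [[i * stride, i * stride + crop_size] for i in range(x_steps)]
    y_coords = [[i * stride, i * stride + crop_size] for i in range(y_steps)]
    # The last crop along each axis is stretched or clipped to the image border.
    x_coords[-1][1] = width
    y_coords[-1][1] = height

    return [(x0, y0, x1, y1) for x0, x1 in x_coords for y0, y1 in y_coords]


if __name__ == "__main__":
    for box in crop_boxes(768, 384):
        print(box)
```

For a 768x384 image this yields three half-overlapping 384x384 boxes along the x axis, so the processor would stack four tensors in total: the full image plus three crops.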
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "<|im_end|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<|endoftext|>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,357 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "50256": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "50257": {
13
+ "content": " ",
14
+ "lstrip": false,
15
+ "normalized": true,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": false
19
+ },
20
+ "50258": {
21
+ "content": " ",
22
+ "lstrip": false,
23
+ "normalized": true,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": false
27
+ },
28
+ "50259": {
29
+ "content": " ",
30
+ "lstrip": false,
31
+ "normalized": true,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": false
35
+ },
36
+ "50260": {
37
+ "content": " ",
38
+ "lstrip": false,
39
+ "normalized": true,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": false
43
+ },
44
+ "50261": {
45
+ "content": " ",
46
+ "lstrip": false,
47
+ "normalized": true,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": false
51
+ },
52
+ "50262": {
53
+ "content": " ",
54
+ "lstrip": false,
55
+ "normalized": true,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": false
59
+ },
60
+ "50263": {
61
+ "content": " ",
62
+ "lstrip": false,
63
+ "normalized": true,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": false
67
+ },
68
+ "50264": {
69
+ "content": " ",
70
+ "lstrip": false,
71
+ "normalized": true,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": false
75
+ },
76
+ "50265": {
77
+ "content": " ",
78
+ "lstrip": false,
79
+ "normalized": true,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": false
83
+ },
84
+ "50266": {
85
+ "content": " ",
86
+ "lstrip": false,
87
+ "normalized": true,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": false
91
+ },
92
+ "50267": {
93
+ "content": " ",
94
+ "lstrip": false,
95
+ "normalized": true,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": false
99
+ },
100
+ "50268": {
101
+ "content": " ",
102
+ "lstrip": false,
103
+ "normalized": true,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": false
107
+ },
108
+ "50269": {
109
+ "content": " ",
110
+ "lstrip": false,
111
+ "normalized": true,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": false
115
+ },
116
+ "50270": {
117
+ "content": " ",
118
+ "lstrip": false,
119
+ "normalized": true,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": false
123
+ },
124
+ "50271": {
125
+ "content": " ",
126
+ "lstrip": false,
127
+ "normalized": true,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": false
131
+ },
132
+ "50272": {
133
+ "content": " ",
134
+ "lstrip": false,
135
+ "normalized": true,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": false
139
+ },
140
+ "50273": {
141
+ "content": " ",
142
+ "lstrip": false,
143
+ "normalized": true,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": false
147
+ },
148
+ "50274": {
149
+ "content": " ",
150
+ "lstrip": false,
151
+ "normalized": true,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": false
155
+ },
156
+ "50275": {
157
+ "content": " ",
158
+ "lstrip": false,
159
+ "normalized": true,
160
+ "rstrip": false,
161
+ "single_word": false,
162
+ "special": false
163
+ },
164
+ "50276": {
165
+ "content": " ",
166
+ "lstrip": false,
167
+ "normalized": true,
168
+ "rstrip": false,
169
+ "single_word": false,
170
+ "special": false
171
+ },
172
+ "50277": {
173
+ "content": " ",
174
+ "lstrip": false,
175
+ "normalized": true,
176
+ "rstrip": false,
177
+ "single_word": false,
178
+ "special": false
179
+ },
180
+ "50278": {
181
+ "content": " ",
182
+ "lstrip": false,
183
+ "normalized": true,
184
+ "rstrip": false,
185
+ "single_word": false,
186
+ "special": false
187
+ },
188
+ "50279": {
189
+ "content": " ",
190
+ "lstrip": false,
191
+ "normalized": true,
192
+ "rstrip": false,
193
+ "single_word": false,
194
+ "special": false
195
+ },
196
+ "50280": {
197
+ "content": " ",
198
+ "lstrip": false,
199
+ "normalized": true,
200
+ "rstrip": false,
201
+ "single_word": false,
202
+ "special": false
203
+ },
204
+ "50281": {
205
+ "content": " ",
206
+ "lstrip": false,
207
+ "normalized": true,
208
+ "rstrip": false,
209
+ "single_word": false,
210
+ "special": false
211
+ },
212
+ "50282": {
213
+ "content": " ",
214
+ "lstrip": false,
215
+ "normalized": true,
216
+ "rstrip": false,
217
+ "single_word": false,
218
+ "special": false
219
+ },
220
+ "50283": {
221
+ "content": " ",
222
+ "lstrip": false,
223
+ "normalized": true,
224
+ "rstrip": false,
225
+ "single_word": false,
226
+ "special": false
227
+ },
228
+ "50284": {
229
+ "content": " ",
230
+ "lstrip": false,
231
+ "normalized": true,
232
+ "rstrip": false,
233
+ "single_word": false,
234
+ "special": false
235
+ },
236
+ "50285": {
237
+ "content": " ",
238
+ "lstrip": false,
239
+ "normalized": true,
240
+ "rstrip": false,
241
+ "single_word": false,
242
+ "special": false
243
+ },
244
+ "50286": {
245
+ "content": " ",
246
+ "lstrip": false,
247
+ "normalized": true,
248
+ "rstrip": false,
249
+ "single_word": false,
250
+ "special": false
251
+ },
+     "50287": {
+       "content": "\t\t\t\t\t\t\t\t\t",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "50288": {
+       "content": "\t\t\t\t\t\t\t\t",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "50289": {
+       "content": "\t\t\t\t\t\t\t",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "50290": {
+       "content": "\t\t\t\t\t\t",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "50291": {
+       "content": "\t\t\t\t\t",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "50292": {
+       "content": "\t\t\t\t",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "50293": {
+       "content": "\t\t\t",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "50294": {
+       "content": "\t\t",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "50295": {
+       "content": "<|im_end|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "50296": {
+       "content": "<|im_start|>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": false
+     },
+     "50297": {
+       "content": "<image>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "50298": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<|endoftext|>",
+   "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+   "clean_up_tokenization_spaces": true,
+   "eos_token": "<|im_end|>",
+   "model_max_length": 1200,
+   "pad_token": "<pad>",
+   "tokenizer_class": "CodeGenTokenizer",
+   "unk_token": "<|endoftext|>"
+ }
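The `chat_template` entry above is a Jinja template that produces a ChatML-style prompt. As a rough illustration of the string it renders, here is a plain-Python sketch, not part of the commit, that mirrors the template's logic; the function name `render_chatml` is hypothetical, and in practice the tokenizer applies the template itself.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Mirror the Jinja chat template in plain Python (illustrative sketch only)."""
    parts = []
    for message in messages:
        # Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(
            "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>" + "\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    prompt = render_chatml(
        [{"role": "user", "content": "What animal is it?"}],
        add_generation_prompt=True,
    )
    print(prompt)
```

Because `eos_token` is set to `<|im_end|>`, generation should stop at the end of the assistant turn that the generation prompt opens.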
vocab.json ADDED
The diff for this file is too large to render. See raw diff