Canstralian committed on
Commit 1b4786d · verified · 1 Parent(s): 88a170b

Delete cog-replit-code-v1-3b-main
cog-replit-code-v1-3b-main/.dockerignore DELETED
@@ -1,3 +0,0 @@
- model/*.bin
- model/*.tensors
- notebooks
 
cog-replit-code-v1-3b-main/LICENSE.txt DELETED
@@ -1,201 +0,0 @@
- Apache License
- Version 2.0, January 2004
- http://www.apache.org/licenses/
-
- TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
-
- 1. Definitions.
-
- "License" shall mean the terms and conditions for use, reproduction,
- and distribution as defined by Sections 1 through 9 of this document.
-
- "Licensor" shall mean the copyright owner or entity authorized by
- the copyright owner that is granting the License.
-
- "Legal Entity" shall mean the union of the acting entity and all
- other entities that control, are controlled by, or are under common
- control with that entity. For the purposes of this definition,
- "control" means (i) the power, direct or indirect, to cause the
- direction or management of such entity, whether by contract or
- otherwise, or (ii) ownership of fifty percent (50%) or more of the
- outstanding shares, or (iii) beneficial ownership of such entity.
-
- "You" (or "Your") shall mean an individual or Legal Entity
- exercising permissions granted by this License.
-
- "Source" form shall mean the preferred form for making modifications,
- including but not limited to software source code, documentation
- source, and configuration files.
-
- "Object" form shall mean any form resulting from mechanical
- transformation or translation of a Source form, including but
- not limited to compiled object code, generated documentation,
- and conversions to other media types.
-
- "Work" shall mean the work of authorship, whether in Source or
- Object form, made available under the License, as indicated by a
- copyright notice that is included in or attached to the work
- (an example is provided in the Appendix below).
-
- "Derivative Works" shall mean any work, whether in Source or Object
- form, that is based on (or derived from) the Work and for which the
- editorial revisions, annotations, elaborations, or other modifications
- represent, as a whole, an original work of authorship. For the purposes
- of this License, Derivative Works shall not include works that remain
- separable from, or merely link (or bind by name) to the interfaces of,
- the Work and Derivative Works thereof.
-
- "Contribution" shall mean any work of authorship, including
- the original version of the Work and any modifications or additions
- to that Work or Derivative Works thereof, that is intentionally
- submitted to Licensor for inclusion in the Work by the copyright owner
- or by an individual or Legal Entity authorized to submit on behalf of
- the copyright owner. For the purposes of this definition, "submitted"
- means any form of electronic, verbal, or written communication sent
- to the Licensor or its representatives, including but not limited to
- communication on electronic mailing lists, source code control systems,
- and issue tracking systems that are managed by, or on behalf of, the
- Licensor for the purpose of discussing and improving the Work, but
- excluding communication that is conspicuously marked or otherwise
- designated in writing by the copyright owner as "Not a Contribution."
-
- "Contributor" shall mean Licensor and any individual or Legal Entity
- on behalf of whom a Contribution has been received by Licensor and
- subsequently incorporated within the Work.
-
- 2. Grant of Copyright License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- copyright license to reproduce, prepare Derivative Works of,
- publicly display, publicly perform, sublicense, and distribute the
- Work and such Derivative Works in Source or Object form.
-
- 3. Grant of Patent License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- (except as stated in this section) patent license to make, have made,
- use, offer to sell, sell, import, and otherwise transfer the Work,
- where such license applies only to those patent claims licensable
- by such Contributor that are necessarily infringed by their
- Contribution(s) alone or by combination of their Contribution(s)
- with the Work to which such Contribution(s) was submitted. If You
- institute patent litigation against any entity (including a
- cross-claim or counterclaim in a lawsuit) alleging that the Work
- or a Contribution incorporated within the Work constitutes direct
- or contributory patent infringement, then any patent licenses
- granted to You under this License for that Work shall terminate
- as of the date such litigation is filed.
-
- 4. Redistribution. You may reproduce and distribute copies of the
- Work or Derivative Works thereof in any medium, with or without
- modifications, and in Source or Object form, provided that You
- meet the following conditions:
-
- (a) You must give any other recipients of the Work or
- Derivative Works a copy of this License; and
-
- (b) You must cause any modified files to carry prominent notices
- stating that You changed the files; and
-
- (c) You must retain, in the Source form of any Derivative Works
- that You distribute, all copyright, patent, trademark, and
- attribution notices from the Source form of the Work,
- excluding those notices that do not pertain to any part of
- the Derivative Works; and
-
- (d) If the Work includes a "NOTICE" text file as part of its
- distribution, then any Derivative Works that You distribute must
- include a readable copy of the attribution notices contained
- within such NOTICE file, excluding those notices that do not
- pertain to any part of the Derivative Works, in at least one
- of the following places: within a NOTICE text file distributed
- as part of the Derivative Works; within the Source form or
- documentation, if provided along with the Derivative Works; or,
- within a display generated by the Derivative Works, if and
- wherever such third-party notices normally appear. The contents
- of the NOTICE file are for informational purposes only and
- do not modify the License. You may add Your own attribution
- notices within Derivative Works that You distribute, alongside
- or as an addendum to the NOTICE text from the Work, provided
- that such additional attribution notices cannot be construed
- as modifying the License.
-
- You may add Your own copyright statement to Your modifications and
- may provide additional or different license terms and conditions
- for use, reproduction, or distribution of Your modifications, or
- for any such Derivative Works as a whole, provided Your use,
- reproduction, and distribution of the Work otherwise complies with
- the conditions stated in this License.
-
- 5. Submission of Contributions. Unless You explicitly state otherwise,
- any Contribution intentionally submitted for inclusion in the Work
- by You to the Licensor shall be under the terms and conditions of
- this License, without any additional terms or conditions.
- Notwithstanding the above, nothing herein shall supersede or modify
- the terms of any separate license agreement you may have executed
- with Licensor regarding such Contributions.
-
- 6. Trademarks. This License does not grant permission to use the trade
- names, trademarks, service marks, or product names of the Licensor,
- except as required for reasonable and customary use in describing the
- origin of the Work and reproducing the content of the NOTICE file.
-
- 7. Disclaimer of Warranty. Unless required by applicable law or
- agreed to in writing, Licensor provides the Work (and each
- Contributor provides its Contributions) on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- implied, including, without limitation, any warranties or conditions
- of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
- PARTICULAR PURPOSE. You are solely responsible for determining the
- appropriateness of using or redistributing the Work and assume any
- risks associated with Your exercise of permissions under this License.
-
- 8. Limitation of Liability. In no event and under no legal theory,
- whether in tort (including negligence), contract, or otherwise,
- unless required by applicable law (such as deliberate and grossly
- negligent acts) or agreed to in writing, shall any Contributor be
- liable to You for damages, including any direct, indirect, special,
- incidental, or consequential damages of any character arising as a
- result of this License or out of the use or inability to use the
- Work (including but not limited to damages for loss of goodwill,
- work stoppage, computer failure or malfunction, or any and all
- other commercial damages or losses), even if such Contributor
- has been advised of the possibility of such damages.
-
- 9. Accepting Warranty or Additional Liability. While redistributing
- the Work or Derivative Works thereof, You may choose to offer,
- and charge a fee for, acceptance of support, warranty, indemnity,
- or other liability obligations and/or rights consistent with this
- License. However, in accepting such obligations, You may act only
- on Your own behalf and on Your sole responsibility, not on behalf
- of any other Contributor, and only if You agree to indemnify,
- defend, and hold each Contributor harmless for any liability
- incurred by, or claims asserted against, such Contributor by reason
- of your accepting any such warranty or additional liability.
-
- END OF TERMS AND CONDITIONS
-
- APPENDIX: How to apply the Apache License to your work.
-
- To apply the Apache License to your work, attach the following
- boilerplate notice, with the fields enclosed by brackets "[]"
- replaced with your own identifying information. (Don't include
- the brackets!) The text should be enclosed in the appropriate
- comment syntax for the file format. We also recommend that a
- file or class name and description of purpose be included on the
- same "printed page" as the copyright notice for easier
- identification within third-party archives.
-
- Copyright 2022, Replicate, Inc.
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
 
cog-replit-code-v1-3b-main/README.md DELETED
@@ -1,5 +0,0 @@
- # replit-code-v1-3b
-
- [![Replicate](https://replicate.com/replicate/replit-code-v1-3b/badge)](https://replicate.com/replicate/replit-code-v1-3b)
-
- A [Cog](https://cog.run) implementation of Replit's [replit-code-v1-3b](https://huggingface.co/replit/replit-code-v1-3b) Large Language Model
 
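For context, a model packaged this way is normally called through the Replicate Python client. The sketch below is illustrative only: it assumes the `replicate` package is installed, `REPLICATE_API_TOKEN` is set, and that the model is published under the `replicate/replit-code-v1-3b` name shown in the badge above; the input names mirror the `predict.py` signature that this commit deletes.

```python
# Minimal sketch: streaming a completion from the deployed Cog model.
# Assumes `pip install replicate` and REPLICATE_API_TOKEN in the environment.
import replicate

output = replicate.run(
    "replicate/replit-code-v1-3b",
    input={
        "prompt": "def fibonacci(n):",
        "max_length": 128,
        "temperature": 0.75,
        "top_p": 0.95,
    },
)

# The predictor yields tokens word-by-word (ConcatenateIterator), so the
# client returns an iterator rather than a single string.
for token in output:
    print(token, end="", flush=True)
```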
cog-replit-code-v1-3b-main/cog.yaml DELETED
@@ -1,15 +0,0 @@
- build:
-   gpu: true
-   cuda: "11.7"
-   python_version: "3.10"
-   python_requirements: requirements.txt
-
-   # commands run after the environment is set up
-   run:
-     - pip install flash-attn==0.2.8
-     - pip install triton==2.0.0.dev20221202
-     - pip install tensorizer==1.1.0
-     - echo 'deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main' | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
-     - curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
-     - apt-get update && apt-get install google-cloud-cli
- predict: "predict.py:Predictor"
 
cog-replit-code-v1-3b-main/predict.py DELETED
@@ -1,202 +0,0 @@
- import time
- from typing import Optional
- import subprocess
-
- import torch
- import os
-
- from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
- from tensorizer import TensorDeserializer
- from tensorizer.utils import no_init_or_tensor
- from collections import OrderedDict
- from cog import BasePredictor, ConcatenateIterator, Input, Path
-
- # from config import DEFAULT_MODEL_NAME, DEFAULT_CONFIG_PATH, load_tokenizer, load_tensorizer
- from subclass import YieldingReplitCode
-
- # Weights are either local or in a cloud bucket.
-
- # For development, point to a local path on disk.
- # This is the path from which we pull weights when there's no COG_WEIGHTS environment variable (COG_WEIGHTS is a thing for trainable models)
- # TENSORIZER_WEIGHTS_PATH = "model/model.tensors"
- TENSORIZER_WEIGHTS_PATH = "gs://replicate-weights/replit-code-v1-3b/model.tensors"
-
- # Set this to a GCP URL when pushing the model
- # TENSORIZER_WEIGHTS_PATH = None
-
- DEFAULT_CONFIG_PATH = "model/"
- TOKENIZER_PATH = "model/"
-
- def maybe_download(path):
-     if path.startswith("gs://"):
-         st = time.time()
-         output_path = "/tmp/weights.tensors"
-         subprocess.check_call(["gcloud", "storage", "cp", path, output_path])
-         print(f"weights downloaded in {time.time() - st}")
-         return output_path
-     return path
-
-
- class Predictor(BasePredictor):
-     def setup(self):
-         self.device = "cuda" if torch.cuda.is_available() else "cpu"
-
-         # set TOKENIZERS_PARALLELISM to false to avoid a warning
-         os.environ["TOKENIZERS_PARALLELISM"] = "false"
-
-         self.model = self.load_tensorizer(
-             weights=maybe_download(TENSORIZER_WEIGHTS_PATH), plaid_mode=True, cls=YieldingReplitCode, config_path=DEFAULT_CONFIG_PATH,
-         )
-         self.tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
-
-     def load_tensorizer(self, weights, plaid_mode, cls, config_path):
-         st = time.time()
-         print(f"deserializing weights from {weights}")
-
-         config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
-         config.attn_config['attn_impl'] = 'triton'
-
-         # with no_init_or_tensor():
-         #     model = YieldingReplitCode.from_pretrained('./model/', config=config, trust_remote_code=True)
-
-         model = no_init_or_tensor(
-             lambda: cls.from_pretrained(
-                 None, config=config, state_dict=OrderedDict(), trust_remote_code=True,
-             )
-         )
-
-         deserialized = TensorDeserializer(weights, plaid_mode=True)
-         deserialized.load_into_module(model)
-         try:
-             model = model.to(dtype=torch.bfloat16)
-         except:
-             # best-effort cast; keep the deserialized dtype if the cast fails
-             pass
-
-         print(f"weights loaded in {time.time() - st}")
-         return model
-
-     def predict(
-         self,
-         prompt: str = Input(description="Text prompt"),
-         max_length: int = Input(
-             description="Maximum number of tokens to generate. A word is generally 2-3 tokens",
-             ge=1,
-             default=500,
-         ),
-         temperature: float = Input(
-             description="Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic, 0.75 is a good starting value.",
-             ge=0.01,
-             le=5,
-             default=0.75,
-         ),
-         top_p: float = Input(
-             description="When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens",
-             ge=0.01,
-             le=1.0,
-             default=1.0,
-         ),
-         repetition_penalty: float = Input(
-             description="Penalty for repeated words in generated text; 1 is no penalty, values greater than 1 discourage repetition, less than 1 encourage it.",
-             ge=0.01,
-             le=5,
-             default=1,
-         ),
-         length_penalty: float = Input(
-             description="Increasing the length_penalty parameter above 1.0 will cause the model to favor longer sequences, while decreasing it below 1.0 will cause the model to favor shorter sequences.",
-             ge=0.01,
-             le=5,
-             default=1,
-         ),
-         no_repeat_ngram_size: int = Input(
-             description="If set to int > 0, all ngrams of size no_repeat_ngram_size can only occur once.",
-             ge=0,
-             default=0,
-         ),
-         stop_sequence: str = Input(
-             description="Generation will halt if this token is produced. Currently, only single-token stop sequences are supported, and it is recommended to use `###` as the stop sequence if you want to control generation termination.",
-             default=None,
-         ),
-         seed: int = Input(
-             description="Set seed for reproducible outputs. Set to -1 for random seed.",
-             ge=-1,
-             default=-1,
-         ),
-         debug: bool = Input(
-             description="provide debugging output in logs", default=False
-         ),
-     ) -> ConcatenateIterator[str]:
-         input = self.tokenizer(prompt, return_tensors="pt").input_ids.to(self.device)
-
-         # set torch seed
-         if seed == -1:
-             torch.seed()
-
-         else:
-             torch.manual_seed(seed)
-             torch.cuda.manual_seed(seed)
-
-         with torch.inference_mode():
-             first_token_yielded = False
-             prev_ids = []
-             for output in self.model.generate(
-                 input,
-                 max_length=max_length,
-                 do_sample=True,
-                 temperature=temperature,
-                 top_p=top_p,
-                 repetition_penalty=repetition_penalty,
-                 length_penalty=length_penalty,
-                 no_repeat_ngram_size=no_repeat_ngram_size,
-             ):
-                 cur_id = output.item()
-
-                 # in order to properly handle spaces, we need to do our own tokenizing. Fun!
-                 # we're building up a buffer of sub-word / punctuation tokens until we hit a space, and then yielding whole words + punctuation.
-                 cur_token = self.tokenizer.convert_ids_to_tokens(cur_id)
-
-                 # skip initial newline, which this almost always yields. hack - newline id = 187.
-                 if not first_token_yielded and not prev_ids and cur_id == 187:
-                     continue
-
-                 # Ġ means a space, means we yield previous tokens
-                 if cur_token.startswith("Ġ"):  # this is not a standard G.
-                     # first token
-                     if not prev_ids:
-                         prev_ids = [cur_id]
-                         continue
-
-                     # there are tokens to yield
-                     else:
-                         token = self.tokenizer.decode(prev_ids, clean_up_tokenization_spaces=False)
-                         prev_ids = [cur_id]
-
-                         if not first_token_yielded:
-                             # no leading space for first token
-                             token = token.strip()
-                             first_token_yielded = True
-                         yield token
-                 # End token
-                 elif cur_token == "<|endoftext|>":
-                     break
-
-                 elif stop_sequence and cur_token == stop_sequence:
-                     break
-
-                 else:
-                     prev_ids.append(cur_id)
-                     continue
-
-             # remove any special tokens such as </s>
-             token = self.tokenizer.decode(prev_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
-             if not first_token_yielded:
-                 # no leading space for first token
-                 token = token.strip()
-                 first_token_yielded = True
-             yield token
-
-         if debug:
-             print(f"cur memory: {torch.cuda.memory_allocated()}")
-             print(f"max allocated: {torch.cuda.max_memory_allocated()}")
-             print(f"peak memory: {torch.cuda.max_memory_reserved()}")
 
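The word-buffering logic in the deleted predictor hinges on the BPE convention that a leading `Ġ` marks a token that starts a new word (i.e., follows a space). A small stand-alone illustration of that mapping, using the `gpt2` tokenizer purely as a stand-in (the Replit tokenizer follows the same convention, but this snippet is not part of the original repository):

```python
# Illustrative only: shows the "Ġ" (word-start) marker the streaming loop keys on.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in BPE tokenizer
ids = tok("print('hello world')").input_ids

for i in ids:
    piece = tok.convert_ids_to_tokens(i)
    # Pieces prefixed with "Ġ" begin a new word; the predictor buffers ids
    # until it sees one, then decodes and yields the buffered word.
    print(i, repr(piece))
```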
cog-replit-code-v1-3b-main/requirements.txt DELETED
@@ -1,6 +0,0 @@
- einops==0.6.1
- sentencepiece==0.1.99
- torch==2.0.1
- transformers==4.29.2
- # flash-attn==0.2.8
- # triton==2.0.0.dev20221202
 
cog-replit-code-v1-3b-main/scripts/download_and_prepare_model.py DELETED
@@ -1,107 +0,0 @@
- #!/usr/bin/env python
-
- import os
- import shutil
- import argparse
- import logging
- import sys
- import torch
-
- from distutils.dir_util import copy_tree
- from pathlib import Path
- from tempfile import TemporaryDirectory
- from huggingface_hub import snapshot_download, login
- from tensorizer import TensorSerializer
- from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
-
- from tensorize_model import tensorize_model
-
- logger = logging.getLogger(__name__)
- logging.basicConfig(level=logging.INFO, stream=sys.stdout)
-
-
- def download_model_from_hf_hub(
-     model_name: str,
-     model_path: str,
-     rm_existing_model: bool = True,
- ) -> dict:
-     """
-     Downloads a model from the Hugging Face Hub and saves it locally.
-     It also saves the tokenizer in a separate location so that it can be easily included in a Docker image
-     without including the model weights.
-
-     Args:
-         model_name (str): Name of the model on the Hugging Face Hub
-         model_path (str): Local path where the model is saved
-         rm_existing_model (bool, optional): Whether to remove the existing model or not. Defaults to True.
-
-     Returns:
-         dict: Dictionary containing the model name and path
-     """
-
-     # model_weights_path = os.path.join(os.getcwd(), "model_weights/torch_weights")
-     # model_path = os.path.join(model_weights_path, model_name)
-
-     if rm_existing_model:
-         logger.info(f"Removing existing model at {model_path}")
-         if os.path.exists(model_path):
-             shutil.rmtree(model_path)
-
-     # setup temporary directory
-     with TemporaryDirectory() as tmpdir:
-         logger.info(f"Downloading {model_name} weights to temp...")
-
-         snapshot_dir = snapshot_download(
-             repo_id=model_name,
-             cache_dir=tmpdir,
-             allow_patterns=["*.bin", "*.json", "*.md", "*.model", "*.py"],
-         )
-         # copy snapshot to model dir
-         logger.info(f"Copying weights to {model_path}...")
-         copy_tree(snapshot_dir, str(model_path))
-
-     return {"model_name": model_name, "model_path": model_path}
-
-
- def download_hf_model_and_copy_tokenizer(
-     model_name: str,
-     model_path: str,
-     tokenizer_path: str,
-     rm_existing_model: bool = True,
- ):
-     model_info = download_model_from_hf_hub(model_name, model_path)
-
-     if tokenizer_path:
-         # Move tokenizer to separate location
-         logging.info(f"Copying tokenizer and model config to {tokenizer_path}...")
-         tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")
-         tokenizer.save_pretrained(tokenizer_path)
-
-         # Set the source and destination file paths
-         config_path = os.path.join(model_path, "config.json")
-
-         # Use the shutil.copy() function to copy the file to the destination directory
-         shutil.copy(config_path, tokenizer_path)
-
-     return model_info
-
-
- if __name__ == "__main__":
-     parser = argparse.ArgumentParser()
-     parser.add_argument("--model_name", type=str)
-     parser.add_argument("--model_path", type=str)
-     parser.add_argument("--tokenizer_path", type=str, default=None)
-     parser.add_argument("--hf_token", type=str, default=None)
-     parser.add_argument("--tensorize", action="store_true", default=False)
-     parser.add_argument("--dtype", type=str, default="fp32")
-
-     args = parser.parse_args()
-     if args.hf_token is not None:
-         login(token=args.hf_token)
-
-     # download_hf_model_and_copy_tokenizer(args.model_name, model_path=args.model_path, tokenizer_path=args.tokenizer_path)
-     tensorizer_path = os.path.join(args.model_path, "model.tensors")
-     if args.tensorize:
-         model = tensorize_model(args.model_name, model_path=args.model_path, dtype=args.dtype, tensorizer_path=tensorizer_path)
 
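Taken together, this script downloads the Hugging Face snapshot, copies the tokenizer and config out so they can ship inside the image, and optionally tensorizes the weights. A hypothetical driver showing how the helpers compose; the paths and the `bf16` choice are illustrative, not part of the original repository, and the weight locations are the ones `.dockerignore` and `predict.py` above refer to:

```python
# Hypothetical end-to-end preparation, assuming both scripts are importable
# from the working directory (e.g., run from inside scripts/).
from download_and_prepare_model import download_hf_model_and_copy_tokenizer
from tensorize_model import tensorize_model

info = download_hf_model_and_copy_tokenizer(
    "replit/replit-code-v1-3b",
    model_path="model/",       # raw weights land here (excluded from the image by .dockerignore)
    tokenizer_path="model/",   # tokenizer + config kept alongside for the image
)

# Serialize to the flat tensor format that predict.py deserializes at startup.
tensorize_model(
    info["model_name"],
    model_path=info["model_path"],
    tensorizer_path="model/model.tensors",
    dtype="bf16",
)
```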
cog-replit-code-v1-3b-main/scripts/tensorize_model.py DELETED
@@ -1,91 +0,0 @@
- #!/usr/bin/env python
- import torch
- import os
- import argparse
- import logging
- import sys
-
- from tensorizer import TensorSerializer
- from transformers import AutoModelForCausalLM, AutoConfig
-
-
- logger = logging.getLogger(__name__)
- logging.basicConfig(level=logging.INFO, stream=sys.stdout)
-
- def tensorize_model(
-     model_name: str,
-     model_path: str,
-     tensorizer_path: str,
-     dtype: str = "fp32",
- ) -> dict:
-     """
-     Create a tensorized version of model weights. If dtype is fp16 or bf16,
-     the model will be converted to that dtype before serialization.
-
-     If `model_path` is None weights will be saved in `./model_weights/torch_weights/model_name`.
-     If `tensorizer_path` is None weights will be saved in `./model_weights/tensorizer_weights/model_name/dtype_str`.
-
-     Args:
-         model_name (str): Name of the model on the Hugging Face Hub
-         model_path (str, optional): Local path where model weights are saved.
-         tensorizer_path (str, optional): Local path where tensorized model weights are saved.
-         dtype (str): One of `"fp32"`, `"fp16"`, and `"bf16"`. Defaults to `"fp32"`.
-
-     Returns:
-         dict: Dictionary containing the tensorized model path and dtype.
-     """
-
-     if dtype == 'fp32' or dtype is None:
-         torch_dtype = torch.float32
-
-     elif dtype == 'bf16':
-         torch_dtype = torch.bfloat16
-
-     elif dtype == 'fp16':
-         torch_dtype = torch.float16
-
-     logger.info(f"Loading {model_name} in {dtype} from {model_path}...")
-
-     # load in the requested dtype so the serialized tensors match it
-     model = AutoModelForCausalLM.from_pretrained(
-         model_path, torch_dtype=torch_dtype, trust_remote_code=True,
-     ).to('cuda:0')
-
-     logger.info(f"Tensorizing model {model_name} in {dtype} and writing tensors to {tensorizer_path}...")
-
-     serializer = TensorSerializer(tensorizer_path)
-     serializer.write_module(model)
-     serializer.close()
-
-     # Write config to tensorized model weights directory
-     dir_path = os.path.dirname(tensorizer_path)
-     config_path = os.path.join(dir_path, 'config.json')
-     model_config = model.config
-     model_config.save_pretrained(dir_path)
-
-     logger.info(f"Tensorized model {model_name} in {dtype} and wrote tensors to {tensorizer_path} and config to {config_path}...")
-
-     return {"tensorized_weights_path": tensorizer_path, "dtype": dtype}
-
-
- if __name__ == "__main__":
-     parser = argparse.ArgumentParser(
-         description="A simple script for tensorizing a torch model."
-     )
-
-     parser.add_argument("--model_name", type=str)
-     parser.add_argument("--model_path", type=str, default=None)
-     parser.add_argument("--tensorizer_path", type=str, default=None)
-     parser.add_argument("--dtype", type=str, default="fp32")
-
-     args = parser.parse_args()
-
-     model_info = tensorize_model(
-         args.model_name,
-         model_path=args.model_path,
-         tensorizer_path=args.tensorizer_path,
-         dtype=args.dtype
-     )
 
cog-replit-code-v1-3b-main/subclass.py DELETED
@@ -1,284 +0,0 @@
- """sampling code pulled from Transformers & slightly modified to stream tokens"""
- import warnings
- from typing import List, Optional, Union
-
- import torch
- import torch.distributed as dist
- from torch import nn
-
- from transformers.generation.logits_process import LogitsProcessorList
- from transformers.generation.stopping_criteria import StoppingCriteriaList, validate_stopping_criteria
- from transformers.generation.utils import SampleOutput, SampleDecoderOnlyOutput, SampleEncoderDecoderOutput
-
- # from transformers import AutoModelForCausalLM
- from model.modeling_mpt import MPTForCausalLM
-
- class YieldingReplitCode(MPTForCausalLM):
-     """Overriding sample to yield tokens"""
-     def sample(
-         self,
-         input_ids: torch.LongTensor,
-         logits_processor: Optional[LogitsProcessorList] = None,
-         stopping_criteria: Optional[StoppingCriteriaList] = None,
-         logits_warper: Optional[LogitsProcessorList] = None,
-         max_length: Optional[int] = None,
-         pad_token_id: Optional[int] = None,
-         eos_token_id: Optional[Union[int, List[int]]] = None,
-         output_attentions: Optional[bool] = None,
-         output_hidden_states: Optional[bool] = None,
-         output_scores: Optional[bool] = None,
-         return_dict_in_generate: Optional[bool] = None,
-         synced_gpus: Optional[bool] = False,
-         **model_kwargs,
-     ) -> Union[SampleOutput, torch.LongTensor]:
-         r"""
-         Generates sequences of token ids for models with a language modeling head using **multinomial sampling** and
-         can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.
-
-         <Tip warning={true}>
-
-         In most cases, you do not need to call [`~generation.GenerationMixin.sample`] directly. Use generate() instead.
-         For an overview of generation strategies and code examples, check the [following
-         guide](./generation_strategies).
-
-         </Tip>
-
-         Parameters:
-             input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
-                 The sequence used as a prompt for the generation.
-             logits_processor (`LogitsProcessorList`, *optional*):
-                 An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsProcessor`]
-                 used to modify the prediction scores of the language modeling head applied at each generation step.
-             stopping_criteria (`StoppingCriteriaList`, *optional*):
-                 An instance of [`StoppingCriteriaList`]. List of instances of class derived from [`StoppingCriteria`]
-                 used to tell if the generation loop should stop.
-             logits_warper (`LogitsProcessorList`, *optional*):
-                 An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsWarper`] used
-                 to warp the prediction score distribution of the language modeling head applied before multinomial
-                 sampling at each generation step.
-             max_length (`int`, *optional*, defaults to 20):
-                 **DEPRECATED**. Use `logits_processor` or `stopping_criteria` directly to cap the number of generated
-                 tokens. The maximum length of the sequence to be generated.
-             pad_token_id (`int`, *optional*):
-                 The id of the *padding* token.
-             eos_token_id (`int`, *optional*):
-                 The id of the *end-of-sequence* token.
-             output_attentions (`bool`, *optional*, defaults to `False`):
-                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under
-                 returned tensors for more details.
-             output_hidden_states (`bool`, *optional*, defaults to `False`):
-                 Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
-                 for more details.
-             output_scores (`bool`, *optional*, defaults to `False`):
-                 Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
-             return_dict_in_generate (`bool`, *optional*, defaults to `False`):
-                 Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
-             synced_gpus (`bool`, *optional*, defaults to `False`):
-                 Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
-             model_kwargs:
-                 Additional model specific kwargs will be forwarded to the `forward` function of the model. If model is
-                 an encoder-decoder model the kwargs should include `encoder_outputs`.
-
-         Return:
-             [`~generation.SampleDecoderOnlyOutput`], [`~generation.SampleEncoderDecoderOutput`] or `torch.LongTensor`:
-                 A `torch.LongTensor` containing the generated tokens (default behaviour) or a
-                 [`~generation.SampleDecoderOnlyOutput`] if `model.config.is_encoder_decoder=False` and
-                 `return_dict_in_generate=True` or a [`~generation.SampleEncoderDecoderOutput`] if
-                 `model.config.is_encoder_decoder=True`.
-
-         Examples:
-
-         ```python
-         >>> from transformers import (
-         ...     AutoTokenizer,
-         ...     AutoModelForCausalLM,
-         ...     LogitsProcessorList,
-         ...     MinLengthLogitsProcessor,
-         ...     TopKLogitsWarper,
-         ...     TemperatureLogitsWarper,
-         ...     StoppingCriteriaList,
-         ...     MaxLengthCriteria,
-         ... )
-         >>> import torch
-
-         >>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
-         >>> model = AutoModelForCausalLM.from_pretrained("gpt2")
-
-         >>> # set pad_token_id to eos_token_id because GPT2 does not have a EOS token
-         >>> model.config.pad_token_id = model.config.eos_token_id
-         >>> model.generation_config.pad_token_id = model.config.eos_token_id
-
-         >>> input_prompt = "Today is a beautiful day, and"
-         >>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids
-
-         >>> # instantiate logits processors
-         >>> logits_processor = LogitsProcessorList(
-         ...     [
-         ...         MinLengthLogitsProcessor(15, eos_token_id=model.generation_config.eos_token_id),
-         ...     ]
-         ... )
-         >>> # instantiate logits processors
-         >>> logits_warper = LogitsProcessorList(
-         ...     [
-         ...         TopKLogitsWarper(50),
-         ...         TemperatureLogitsWarper(0.7),
-         ...     ]
-         ... )
-
-         >>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])
-
-         >>> torch.manual_seed(0)  # doctest: +IGNORE_RESULT
-         >>> outputs = model.sample(
-         ...     input_ids,
-         ...     logits_processor=logits_processor,
-         ...     logits_warper=logits_warper,
-         ...     stopping_criteria=stopping_criteria,
-         ... )
-
-         >>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
-         ['Today is a beautiful day, and a wonderful day.\n\nI was lucky enough to meet the']
-         ```"""
-         # init values
-         logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
-         stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()
-         if max_length is not None:
-             warnings.warn(
-                 "`max_length` is deprecated in this function, use"
-                 " `stopping_criteria=StoppingCriteriaList(MaxLengthCriteria(max_length=max_length))` instead.",
-                 UserWarning,
-             )
-             stopping_criteria = validate_stopping_criteria(stopping_criteria, max_length)
-         logits_warper = logits_warper if logits_warper is not None else LogitsProcessorList()
-         pad_token_id = pad_token_id if pad_token_id is not None else self.generation_config.pad_token_id
-         eos_token_id = eos_token_id if eos_token_id is not None else self.generation_config.eos_token_id
-         if isinstance(eos_token_id, int):
-             eos_token_id = [eos_token_id]
-         output_scores = output_scores if output_scores is not None else self.generation_config.output_scores
-         output_attentions = (
-             output_attentions if output_attentions is not None else self.generation_config.output_attentions
-         )
-         output_hidden_states = (
-             output_hidden_states if output_hidden_states is not None else self.generation_config.output_hidden_states
-         )
-         return_dict_in_generate = (
-             return_dict_in_generate
-             if return_dict_in_generate is not None
-             else self.generation_config.return_dict_in_generate
-         )
-
-         # init attention / hidden states / scores tuples
-         scores = () if (return_dict_in_generate and output_scores) else None
-         decoder_attentions = () if (return_dict_in_generate and output_attentions) else None
-         cross_attentions = () if (return_dict_in_generate and output_attentions) else None
-         decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None
-
-         # if model is an encoder-decoder, retrieve encoder attention weights and hidden states
-         if return_dict_in_generate and self.config.is_encoder_decoder:
-             encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None
-             encoder_hidden_states = (
-                 model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None
-             )
-
-         # keep track of which sequences are already finished
-         unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
-
-         this_peer_finished = False  # used by synced_gpus only
-         # auto-regressive generation
-         while True:
-             if synced_gpus:
-                 # Under synced_gpus the `forward` call must continue until all gpus complete their sequence.
-                 # The following logic allows an early break if all peers finished generating their sequence
-                 this_peer_finished_flag = torch.tensor(0.0 if this_peer_finished else 1.0).to(input_ids.device)
-                 # send 0.0 if we finished, 1.0 otherwise
-                 dist.all_reduce(this_peer_finished_flag, op=dist.ReduceOp.SUM)
-                 # did all peers finish? the reduced sum will be 0.0 then
-                 if this_peer_finished_flag.item() == 0.0:
-                     break
-
-             # prepare model inputs
-             model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
-
-             # forward pass to get next token
-             outputs = self(
-                 **model_inputs,
-                 return_dict=True,
-                 output_attentions=output_attentions,
-                 output_hidden_states=output_hidden_states,
-             )
-
-             if synced_gpus and this_peer_finished:
-                 continue  # don't waste resources running the code we don't need
-
-             next_token_logits = outputs.logits[:, -1, :]
-
-             # pre-process distribution
-             next_token_scores = logits_processor(input_ids, next_token_logits)
-             next_token_scores = logits_warper(input_ids, next_token_scores)
-
-             # Store scores, attentions and hidden_states when required
-             if return_dict_in_generate:
-                 if output_scores:
-                     scores += (next_token_scores,)
-                 if output_attentions:
-                     decoder_attentions += (
-                         (outputs.decoder_attentions,) if self.config.is_encoder_decoder else (outputs.attentions,)
-                     )
-                     if self.config.is_encoder_decoder:
-                         cross_attentions += (outputs.cross_attentions,)
-
-                 if output_hidden_states:
-                     decoder_hidden_states += (
-                         (outputs.decoder_hidden_states,)
-                         if self.config.is_encoder_decoder
-                         else (outputs.hidden_states,)
-                     )
-
-             # sample
-             probs = nn.functional.softmax(next_token_scores, dim=-1)
-             next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
-
-             # finished sentences should have their next token be a padding token
-             if eos_token_id is not None:
-                 if pad_token_id is None:
-                     raise ValueError("If `eos_token_id` is defined, make sure that `pad_token_id` is defined.")
-                 next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)
-
-             # update generated ids, model inputs, and length for next step
-             input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
-             model_kwargs = self._update_model_kwargs_for_generation(
-                 outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
-             )
-
-             # if eos_token was found in one sentence, set sentence to finished
-             if eos_token_id is not None:
-                 unfinished_sequences = unfinished_sequences.mul((sum(next_tokens != i for i in eos_token_id)).long())
-
-             # stop when each sentence is finished, or if we exceed the maximum length
-             if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
-                 if not synced_gpus:
-                     break
-                 else:
-                     this_peer_finished = True
-             else:
-                 yield next_tokens
-
-         if return_dict_in_generate:
-             if self.config.is_encoder_decoder:
-                 yield SampleEncoderDecoderOutput(
-                     sequences=input_ids,
-                     scores=scores,
-                     encoder_attentions=encoder_attentions,
-                     encoder_hidden_states=encoder_hidden_states,
-                     decoder_attentions=decoder_attentions,
-                     cross_attentions=cross_attentions,
-                     decoder_hidden_states=decoder_hidden_states,
-                 )
-             else:
-                 yield SampleDecoderOnlyOutput(
-                     sequences=input_ids,
-                     scores=scores,
-                     attentions=decoder_attentions,
-                     hidden_states=decoder_hidden_states,
-                 )
-         else:
-             yield next_tokens