Canstralian committed on
Commit 88a170b · verified · 1 Parent(s): d1e65a6

Upload 18 files

.dockerignore ADDED
@@ -0,0 +1,3 @@
+ model/*.bin
+ model/*.tensors
+ notebooks
LICENSE.txt ADDED
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf
+ of any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright 2022, Replicate, Inc.
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
README.md CHANGED
@@ -1,36 +1,5 @@
- ---
- sdk: streamlit
- sdk_version: 1.41.1
- license: mit
- title: replit-code-v1-3b-fine-tuned
- emoji: 📚
- colorFrom: yellow
- colorTo: blue
- ---
-
- # Replit Code V1 3B Fine-Tuned Model
-
- This model is a fine-tuned version of the Replit Code model, designed to assist with generating Python code from pseudocode and offering AI-driven suggestions for code optimization. It helps streamline machine learning workflows and automates coding tasks with the power of AI.
-
- ## Features:
- - **Text Generation:** Generate human-like code based on descriptions.
- - **Pseudocode to Python:** Convert pseudocode into optimized Python code.
- - **Code Optimization:** Provide suggestions for optimizing Python code.
- - **ML Debugging:** Analyze and provide feedback for machine learning pipeline errors.
-
- ## License:
- This model is licensed under the MIT License. Feel free to use and adapt it according to the terms of the license.
-
- ## Tags:
- `machine learning`, `code generation`, `python`, `AI`, `code optimization`, `streamlit`, `transformers`
-
- ## Model Details:
- - **Base Model:** Replit Code (fine-tuned)
- - **Purpose:** AI assistant for improving Python code and machine learning pipelines.
-
- ## Usage:
- Interact with this model through the provided interface in Streamlit. Input pseudocode or Python code, and the model will assist with text generation, optimization, or debugging.
-
- ---
-
- Powered by [Replit LLM](https://replit.com) and [Hugging Face](https://huggingface.co).
+ # replit-code-v1-3b
+
+ [![Replicate](https://replicate.com/replicate/replit-code-v1-3b/badge)](https://replicate.com/replicate/replit-code-v1-3b)
+
+ A [Cog](https://cog.run) implementation of Replit's [replit-code-v1-3b](https://huggingface.co/replit/replit-code-v1-3b) Large Language Model
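The new README points at the hosted Replicate model. As a quick orientation, here is a minimal sketch of calling that model from Python with the `replicate` client; the package install and API token are assumptions, and the input names simply mirror the `Input` fields in `predict.py` below:

```python
# Sketch: streaming from the hosted model via the replicate Python client.
# Assumes `pip install replicate` and REPLICATE_API_TOKEN in the environment.
import replicate

output = replicate.run(
    "replicate/replit-code-v1-3b",
    input={
        "prompt": "def fibonacci(n):",
        "max_length": 128,    # mirrors predict.py's max_length Input
        "temperature": 0.75,  # mirrors predict.py's default
    },
)
# predict() returns a ConcatenateIterator, so tokens arrive as a stream.
for token in output:
    print(token, end="", flush=True)
```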
cog-replit-code-v1-3b-main/.dockerignore ADDED
@@ -0,0 +1,3 @@
+ model/*.bin
+ model/*.tensors
+ notebooks
cog-replit-code-v1-3b-main/LICENSE.txt ADDED
@@ -0,0 +1,201 @@
+ [Apache License 2.0 text, identical to LICENSE.txt above]
cog-replit-code-v1-3b-main/README.md ADDED
@@ -0,0 +1,5 @@
+ # replit-code-v1-3b
+
+ [![Replicate](https://replicate.com/replicate/replit-code-v1-3b/badge)](https://replicate.com/replicate/replit-code-v1-3b)
+
+ A [Cog](https://cog.run) implementation of Replit's [replit-code-v1-3b](https://huggingface.co/replit/replit-code-v1-3b) Large Language Model
cog-replit-code-v1-3b-main/cog.yaml ADDED
@@ -0,0 +1,15 @@
+ build:
+   gpu: true
+   cuda: "11.7"
+   python_version: "3.10"
+   python_requirements: requirements.txt
+
+   # commands run after the environment is set up
+   run:
+     - pip install flash-attn==0.2.8
+     - pip install triton==2.0.0.dev20221202
+     - pip install tensorizer==1.1.0
+     - echo 'deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main' | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
+     - curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
+     - apt-get update && apt-get install -y google-cloud-cli
+ predict: "predict.py:Predictor"
cog-replit-code-v1-3b-main/predict.py ADDED
@@ -0,0 +1,202 @@
+ import time
+ from typing import Optional
+ import subprocess
+
+ import torch
+ import os
+
+ from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
+ from tensorizer import TensorDeserializer
+ from tensorizer.utils import no_init_or_tensor
+ from collections import OrderedDict
+ from cog import BasePredictor, ConcatenateIterator, Input, Path
+
+ # from config import DEFAULT_MODEL_NAME, DEFAULT_CONFIG_PATH, load_tokenizer, load_tensorizer
+ from subclass import YieldingReplitCode
+
+ # Weights are either local or in a cloud bucket.
+
+ # For development, point to a local path on disk.
+ # This is the path from which we pull weights when there's no COG_WEIGHTS environment variable (COG_WEIGHTS is a thing for trainable models)
+ # TENSORIZER_WEIGHTS_PATH = "model/model.tensors"
+ TENSORIZER_WEIGHTS_PATH = "gs://replicate-weights/replit-code-v1-3b/model.tensors"
+
+ # Set this to a GCP URL when pushing the model
+ # TENSORIZER_WEIGHTS_PATH = None
+
+ DEFAULT_CONFIG_PATH = "model/"
+ TOKENIZER_PATH = "model/"
+
+ def maybe_download(path):
+     if path.startswith("gs://"):
+         st = time.time()
+         output_path = "/tmp/weights.tensors"
+         subprocess.check_call(["gcloud", "storage", "cp", path, output_path])
+         print(f"weights downloaded in {time.time() - st}")
+         return output_path
+     return path
+
+
+ class Predictor(BasePredictor):
+     def setup(self):
+         self.device = "cuda" if torch.cuda.is_available() else "cpu"
+
+         # set TOKENIZERS_PARALLELISM to false to avoid a warning
+         os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
+         self.model = self.load_tensorizer(
+             weights=maybe_download(TENSORIZER_WEIGHTS_PATH), plaid_mode=True, cls=YieldingReplitCode, config_path=DEFAULT_CONFIG_PATH,
+         )
+         self.tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
+
+     def load_tensorizer(self, weights, plaid_mode, cls, config_path):
+         st = time.time()
+         print(f"deserializing weights from {weights}")
+
+         config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
+         config.attn_config['attn_impl'] = 'triton'
+
+         # with no_init_or_tensor():
+         #     model = YieldingReplitCode.from_pretrained('./model/', config=config, trust_remote_code=True)
+
+         model = no_init_or_tensor(
+             lambda: cls.from_pretrained(
+                 None, config=config, state_dict=OrderedDict(), trust_remote_code=True,
+             )
+         )
+
+         deserialized = TensorDeserializer(weights, plaid_mode=True)
+         deserialized.load_into_module(model)
+         try:
+             model = model.to(dtype=torch.bfloat16)
+         except Exception:
+             pass
+
+         print(f"weights loaded in {time.time() - st}")
+         return model
+
+     def predict(
+         self,
+         prompt: str = Input(description="Text prompt"),
+         max_length: int = Input(
+             description="Maximum number of tokens to generate. A word is generally 2-3 tokens.",
+             ge=1,
+             default=500,
+         ),
+         temperature: float = Input(
+             description="Adjusts randomness of outputs; greater than 1 is random, 0 is deterministic, and 0.75 is a good starting value.",
+             ge=0.01,
+             le=5,
+             default=0.75,
+         ),
+         top_p: float = Input(
+             description="When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens.",
+             ge=0.01,
+             le=1.0,
+             default=1.0,
+         ),
+         repetition_penalty: float = Input(
+             description="Penalty for repeated words in generated text; 1 is no penalty, values greater than 1 discourage repetition, less than 1 encourage it.",
+             ge=0.01,
+             le=5,
+             default=1,
+         ),
+         length_penalty: float = Input(
+             description="Increasing the length_penalty parameter above 1.0 will cause the model to favor longer sequences, while decreasing it below 1.0 will cause the model to favor shorter sequences.",
+             ge=0.01,
+             le=5,
+             default=1,
+         ),
+         no_repeat_ngram_size: int = Input(
+             description="If set to int > 0, all ngrams of size no_repeat_ngram_size can only occur once.",
+             ge=0,
+             default=0,
+         ),
+         stop_sequence: str = Input(
+             description="Generation will halt if this token is produced. Currently, only single-token stop sequences are supported; `###` is a recommended stop sequence if you want to control generation termination.",
+             default=None,
+         ),
+         seed: int = Input(
+             description="Set seed for reproducible outputs. Set to -1 for a random seed.",
+             ge=-1,
+             default=-1,
+         ),
+         debug: bool = Input(
+             description="provide debugging output in logs", default=False
+         ),
+     ) -> ConcatenateIterator[str]:
+         input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(self.device)
+
+         # set torch seed
+         if seed == -1:
+             torch.seed()
+         else:
+             torch.manual_seed(seed)
+             torch.cuda.manual_seed(seed)
+
+         with torch.inference_mode():
+             first_token_yielded = False
+             prev_ids = []
+             for output in self.model.generate(
+                 input_ids,
+                 max_length=max_length,
+                 do_sample=True,
+                 temperature=temperature,
+                 top_p=top_p,
+                 repetition_penalty=repetition_penalty,
+                 length_penalty=length_penalty,
+                 no_repeat_ngram_size=no_repeat_ngram_size,
+             ):
+                 cur_id = output.item()
+
+                 # in order to properly handle spaces, we need to do our own tokenizing. Fun!
+                 # we're building up a buffer of sub-word / punctuation tokens until we hit a space, and then yielding whole words + punctuation.
+                 cur_token = self.tokenizer.convert_ids_to_tokens(cur_id)
+
+                 # skip the initial newline, which this model almost always yields. hack - newline id = 187.
+                 if not first_token_yielded and not prev_ids and cur_id == 187:
+                     continue
+
+                 # Ġ marks a space, which means we yield the buffered tokens
+                 if cur_token.startswith("Ġ"):  # this is not a standard G.
+                     # first token
+                     if not prev_ids:
+                         prev_ids = [cur_id]
+                         continue
+                     # there are tokens to yield
+                     else:
+                         token = self.tokenizer.decode(prev_ids, clean_up_tokenization_spaces=False)
+                         prev_ids = [cur_id]
+
+                         if not first_token_yielded:
+                             # no leading space for first token
+                             token = token.strip()
+                             first_token_yielded = True
+                         yield token
+                 # End token
+                 elif cur_token == "<|endoftext|>":
+                     break
+                 elif stop_sequence and cur_token == stop_sequence:
+                     break
+                 else:
+                     prev_ids.append(cur_id)
+                     continue
+
+             # remove any special tokens such as </s>
+             token = self.tokenizer.decode(prev_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+             if not first_token_yielded:
+                 # no leading space for first token
+                 token = token.strip()
+                 first_token_yielded = True
+             yield token
+
+         if debug:
+             print(f"cur memory: {torch.cuda.memory_allocated()}")
+             print(f"max allocated: {torch.cuda.max_memory_allocated()}")
+             print(f"peak memory: {torch.cuda.max_memory_reserved()}")
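The decode loop above is the subtle part: token ids are buffered until a `Ġ`-prefixed (space-marked) token arrives, so whole words are yielded instead of sub-word fragments. A self-contained toy sketch of the same buffering idea, using made-up tokens rather than the real tokenizer:

```python
# Toy illustration of predict()'s word buffering: a leading "Ġ" marks a word
# boundary in BPE-style tokens, so pieces are grouped and emitted as words.
def stream_words(tokens):
    buffer = []
    for tok in tokens:
        if tok.startswith("Ġ") and buffer:
            yield "".join(buffer)   # a word boundary arrived: flush the buffer
            buffer = [tok[1:]]      # start the next word, dropping the marker
        else:
            buffer.append(tok.lstrip("Ġ"))
    if buffer:
        yield "".join(buffer)       # flush whatever remains at the end

print(list(stream_words(["def", "Ġfib", "on", "acci", "Ġ(", "n", ")"])))
# -> ['def', 'fibonacci', '(n)']
```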
cog-replit-code-v1-3b-main/requirements.txt ADDED
@@ -0,0 +1,6 @@
+ einops==0.6.1
+ sentencepiece==0.1.99
+ torch==2.0.1
+ transformers==4.29.2
+ # flash-attn==0.2.8
+ # triton==2.0.0.dev20221202
cog-replit-code-v1-3b-main/scripts/download_and_prepare_model.py ADDED
@@ -0,0 +1,107 @@
+ #!/usr/bin/env python
+
+
+ import os
+ import shutil
+ import argparse
+ import logging
+ import sys
+ import torch
+
+ from distutils.dir_util import copy_tree
+ from pathlib import Path
+ from tempfile import TemporaryDirectory
+ from huggingface_hub import snapshot_download, login
+ from tensorizer import TensorSerializer
+ from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
+
+ from tensorize_model import tensorize_model
+
+ logger = logging.getLogger(__name__)
+ logging.basicConfig(level=logging.INFO, stream=sys.stdout)
+
+
+ def download_model_from_hf_hub(
+     model_name: str,
+     model_path: str,
+     rm_existing_model: bool = True,
+ ) -> dict:
+     """
+     Downloads a model from the Hugging Face Hub and saves it locally.
+     It also saves the tokenizer in a separate location so that it can be easily included in a Docker image
+     without including the model weights.
+
+     Args:
+         model_name (str): Name of the model on the Hugging Face Hub
+         model_path (str): Local path where the model is saved
+         rm_existing_model (bool, optional): Whether to remove an existing model first. Defaults to True.
+
+     Returns:
+         dict: Dictionary containing the model name and path
+     """
+
+     # model_weights_path = os.path.join(os.getcwd(), "model_weights/torch_weights")
+     # model_path = os.path.join(model_weights_path, model_name)
+
+     if rm_existing_model:
+         logger.info(f"Removing existing model at {model_path}")
+         if os.path.exists(model_path):
+             shutil.rmtree(model_path)
+
+     # set up a temporary directory
+     with TemporaryDirectory() as tmpdir:
+         logger.info(f"Downloading {model_name} weights to temp...")
+
+         snapshot_dir = snapshot_download(
+             repo_id=model_name,
+             cache_dir=tmpdir,
+             allow_patterns=["*.bin", "*.json", "*.md", "*.model", "*.py"],
+         )
+         # copy snapshot to model dir
+         logger.info(f"Copying weights to {model_path}...")
+         copy_tree(snapshot_dir, str(model_path))
+
+     return {"model_name": model_name, "model_path": model_path}
+
+
+ def download_hf_model_and_copy_tokenizer(
+     model_name: str,
+     model_path: str,
+     tokenizer_path: str,
+     rm_existing_model: bool = True,
+ ):
+
+     model_info = download_model_from_hf_hub(model_name, model_path, rm_existing_model=rm_existing_model)
+
+     if tokenizer_path:
+         # Move tokenizer to a separate location
+         logging.info(f"Copying tokenizer and model config to {tokenizer_path}...")
+         tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")
+         tokenizer.save_pretrained(tokenizer_path)
+
+         # Set the source and destination file paths
+         config_path = os.path.join(model_path, "config.json")
+
+         # Use the shutil.copy() function to copy the file to the destination directory
+         shutil.copy(config_path, tokenizer_path)
+
+     return model_info
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--model_name", type=str)
+     parser.add_argument("--model_path", type=str)
+     parser.add_argument("--tokenizer_path", type=str, default=None)
+     parser.add_argument("--hf_token", type=str, default=None)
+     parser.add_argument("--tensorize", action="store_true", default=False)
+     parser.add_argument("--dtype", type=str, default="fp32")
+
+     args = parser.parse_args()
+     if args.hf_token is not None:
+         login(token=args.hf_token)
+
+     # download_hf_model_and_copy_tokenizer(args.model_name, model_path=args.model_path, tokenizer_path=args.tokenizer_path)
+     tensorizer_path = os.path.join(args.model_path, "model.tensors")
+     if args.tensorize:
+         model = tensorize_model(args.model_name, model_path=args.model_path, dtype=args.dtype, tensorizer_path=tensorizer_path)
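For context, a sketch of how the helpers above would be driven from Python rather than the CLI; the model name and paths here are illustrative only:

```python
# Hypothetical direct invocation of the download helpers (paths are examples).
from download_and_prepare_model import download_hf_model_and_copy_tokenizer

info = download_hf_model_and_copy_tokenizer(
    model_name="replit/replit-code-v1-3b",  # repo id on the Hugging Face Hub
    model_path="model/",                    # full weights land here
    tokenizer_path="tokenizer/",            # tokenizer + config only, for the image
)
print(info)  # {"model_name": "...", "model_path": "model/"}
```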
cog-replit-code-v1-3b-main/scripts/tensorize_model.py ADDED
@@ -0,0 +1,91 @@
+ #!/usr/bin/env python
+ import torch
+ import os
+ import argparse
+ import logging
+ import sys
+
+ from tensorizer import TensorSerializer
+ from transformers import AutoModelForCausalLM, AutoConfig
+
+
+ logger = logging.getLogger(__name__)
+ logging.basicConfig(level=logging.INFO, stream=sys.stdout)
+
+ def tensorize_model(
+     model_name: str,
+     model_path: str,
+     tensorizer_path: str,
+     dtype: str = "fp32",
+ ) -> dict:
+     """
+     Create a tensorized version of model weights. If dtype is "fp16" or "bf16",
+     the model will be converted to fp16 or bf16.
+
+     If `model_path` is None, weights will be saved in `./model_weights/torch_weights/model_name`.
+     If `tensorizer_path` is None, weights will be saved in `./model_weights/tensorizer_weights/model_name/dtype_str`.
+
+     Args:
+         model_name (str): Name of the model on the Hugging Face Hub
+         model_path (str, optional): Local path where model weights are saved.
+         tensorizer_path (str, optional): Local path where tensorized weights are saved.
+         dtype (str): One of `"fp32"`, `"fp16"`, and `"bf16"`. Defaults to `"fp32"`.
+
+     Returns:
+         dict: Dictionary containing the tensorized model path and dtype.
+     """
+
+     if dtype == 'fp32' or dtype is None:
+         torch_dtype = torch.float32
+     elif dtype == 'bf16':
+         torch_dtype = torch.bfloat16
+     elif dtype == 'fp16':
+         torch_dtype = torch.float16
+     else:
+         raise ValueError(f"Unsupported dtype: {dtype}")
+
+     logger.info(f"Loading {model_name} in {dtype} from {model_path}...")
+
+     model = AutoModelForCausalLM.from_pretrained(
+         model_path, torch_dtype=torch_dtype, trust_remote_code=True,
+     ).to('cuda:0')
+
+     logger.info(f"Tensorizing model {model_name} in {dtype} and writing tensors to {tensorizer_path}...")
+
+     serializer = TensorSerializer(tensorizer_path)
+     serializer.write_module(model)
+     serializer.close()
+
+     # Write config to the tensorized model weights directory
+     # dir_path = os.path.dirname(tensorizer_path)
+     # config_path = os.path.join(dir_path, 'config.json')
+     model_config = model.config
+     model_config.save_pretrained(model_name)
+
+     logger.info(f"Tensorized model {model_name} in {dtype}; wrote tensors to {tensorizer_path} and config to {model_name}...")
+
+     return {"tensorized_weights_path": tensorizer_path, "dtype": dtype}
+
+
+ if __name__ == "__main__":
+
+     parser = argparse.ArgumentParser(description=(
+         "A simple script for tensorizing a torch model."
+         )
+     )
+
+     parser.add_argument("--model_name", type=str)
+     parser.add_argument("--model_path", type=str, default=None)
+     parser.add_argument("--tensorizer_path", type=str, default=None)
+     parser.add_argument("--dtype", type=str, default="fp32")
+
+     args = parser.parse_args()
+
+     model_info = tensorize_model(
+         args.model_name,
+         model_path=args.model_path,
+         tensorizer_path=args.tensorizer_path,
+         dtype=args.dtype
+     )
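The inverse of this script lives in `predict.py`: the serialized tensors are streamed back into an empty module with `TensorDeserializer`. A condensed sketch of that round trip, assuming the config sits next to the weights in `model/`:

```python
# Sketch: loading tensors written by tensorize_model, as predict.py does.
from tensorizer import TensorDeserializer
from tensorizer.utils import no_init_or_tensor
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("model/", trust_remote_code=True)
# Build the module skeleton without allocating real weight tensors...
model = no_init_or_tensor(
    lambda: AutoModelForCausalLM.from_config(config, trust_remote_code=True)
)
# ...then stream the serialized tensors straight into it.
TensorDeserializer("model/model.tensors", plaid_mode=True).load_into_module(model)
```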
cog-replit-code-v1-3b-main/subclass.py ADDED
@@ -0,0 +1,284 @@
+ """sampling code pulled from Transformers & slightly modified to stream tokens"""
+ import warnings
+ from typing import List, Optional, Union
+
+ import torch
+ import torch.distributed as dist
+ from torch import nn
+
+ from transformers.generation.logits_process import LogitsProcessorList
+ from transformers.generation.stopping_criteria import StoppingCriteriaList, validate_stopping_criteria
+ from transformers.generation.utils import SampleOutput, SampleDecoderOnlyOutput, SampleEncoderDecoderOutput
+
+ # from transformers import AutoModelForCausalLM
+ from model.modeling_mpt import MPTForCausalLM
+
+ class YieldingReplitCode(MPTForCausalLM):
+     """Overriding sample to yield tokens"""
+     def sample(
+         self,
+         input_ids: torch.LongTensor,
+         logits_processor: Optional[LogitsProcessorList] = None,
+         stopping_criteria: Optional[StoppingCriteriaList] = None,
+         logits_warper: Optional[LogitsProcessorList] = None,
+         max_length: Optional[int] = None,
+         pad_token_id: Optional[int] = None,
+         eos_token_id: Optional[Union[int, List[int]]] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         output_scores: Optional[bool] = None,
+         return_dict_in_generate: Optional[bool] = None,
+         synced_gpus: Optional[bool] = False,
+         **model_kwargs,
+     ) -> Union[SampleOutput, torch.LongTensor]:
+         r"""
+         Generates sequences of token ids for models with a language modeling head using **multinomial sampling** and
+         can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.
+
+         <Tip warning={true}>
+
+         In most cases, you do not need to call [`~generation.GenerationMixin.sample`] directly. Use generate() instead.
+         For an overview of generation strategies and code examples, check the [following
+         guide](./generation_strategies).
+
+         </Tip>
+
+         Parameters:
+             input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+                 The sequence used as a prompt for the generation.
+             logits_processor (`LogitsProcessorList`, *optional*):
+                 An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsProcessor`]
+                 used to modify the prediction scores of the language modeling head applied at each generation step.
+             stopping_criteria (`StoppingCriteriaList`, *optional*):
+                 An instance of [`StoppingCriteriaList`]. List of instances of class derived from [`StoppingCriteria`]
+                 used to tell if the generation loop should stop.
+             logits_warper (`LogitsProcessorList`, *optional*):
+                 An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsWarper`] used
+                 to warp the prediction score distribution of the language modeling head applied before multinomial
+                 sampling at each generation step.
+             max_length (`int`, *optional*, defaults to 20):
+                 **DEPRECATED**. Use `logits_processor` or `stopping_criteria` directly to cap the number of generated
+                 tokens. The maximum length of the sequence to be generated.
+             pad_token_id (`int`, *optional*):
+                 The id of the *padding* token.
+             eos_token_id (`int`, *optional*):
+                 The id of the *end-of-sequence* token.
+             output_attentions (`bool`, *optional*, defaults to `False`):
+                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                 returned tensors for more details.
+             output_hidden_states (`bool`, *optional*, defaults to `False`):
+                 Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
+                 for more details.
+             output_scores (`bool`, *optional*, defaults to `False`):
+                 Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
+             return_dict_in_generate (`bool`, *optional*, defaults to `False`):
+                 Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+             synced_gpus (`bool`, *optional*, defaults to `False`):
+                 Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
+             model_kwargs:
+                 Additional model specific kwargs will be forwarded to the `forward` function of the model. If model is
+                 an encoder-decoder model the kwargs should include `encoder_outputs`.
+
+         Return:
+             [`~generation.SampleDecoderOnlyOutput`], [`~generation.SampleEncoderDecoderOutput`] or `torch.LongTensor`:
+                 A `torch.LongTensor` containing the generated tokens (default behaviour) or a
+                 [`~generation.SampleDecoderOnlyOutput`] if `model.config.is_encoder_decoder=False` and
+                 `return_dict_in_generate=True` or a [`~generation.SampleEncoderDecoderOutput`] if
+                 `model.config.is_encoder_decoder=True`.
+
+         Examples:
+
+         ```python
+         >>> from transformers import (
+         ...     AutoTokenizer,
+         ...     AutoModelForCausalLM,
+         ...     LogitsProcessorList,
+         ...     MinLengthLogitsProcessor,
+         ...     TopKLogitsWarper,
+         ...     TemperatureLogitsWarper,
+         ...     StoppingCriteriaList,
+         ...     MaxLengthCriteria,
+         ... )
+         >>> import torch
+
+         >>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
+         >>> model = AutoModelForCausalLM.from_pretrained("gpt2")
+
+         >>> # set pad_token_id to eos_token_id because GPT2 does not have a EOS token
+         >>> model.config.pad_token_id = model.config.eos_token_id
+         >>> model.generation_config.pad_token_id = model.config.eos_token_id
+
+         >>> input_prompt = "Today is a beautiful day, and"
+         >>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids
+
+         >>> # instantiate logits processors
+         >>> logits_processor = LogitsProcessorList(
+         ...     [
+         ...         MinLengthLogitsProcessor(15, eos_token_id=model.generation_config.eos_token_id),
+         ...     ]
+         ... )
+         >>> # instantiate logits processors
+         >>> logits_warper = LogitsProcessorList(
+         ...     [
+         ...         TopKLogitsWarper(50),
+         ...         TemperatureLogitsWarper(0.7),
+         ...     ]
+         ... )
+
+         >>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])
+
+         >>> torch.manual_seed(0)  # doctest: +IGNORE_RESULT
+         >>> outputs = model.sample(
+         ...     input_ids,
+         ...     logits_processor=logits_processor,
+         ...     logits_warper=logits_warper,
+         ...     stopping_criteria=stopping_criteria,
+         ... )
+
+         >>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
+         ['Today is a beautiful day, and a wonderful day.\n\nI was lucky enough to meet the']
+         ```"""
+         # init values
+         logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
+         stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()
+         if max_length is not None:
+             warnings.warn(
+                 "`max_length` is deprecated in this function, use"
+                 " `stopping_criteria=StoppingCriteriaList(MaxLengthCriteria(max_length=max_length))` instead.",
+                 UserWarning,
+             )
+             stopping_criteria = validate_stopping_criteria(stopping_criteria, max_length)
+         logits_warper = logits_warper if logits_warper is not None else LogitsProcessorList()
+         pad_token_id = pad_token_id if pad_token_id is not None else self.generation_config.pad_token_id
+         eos_token_id = eos_token_id if eos_token_id is not None else self.generation_config.eos_token_id
+         if isinstance(eos_token_id, int):
+             eos_token_id = [eos_token_id]
+         output_scores = output_scores if output_scores is not None else self.generation_config.output_scores
+         output_attentions = (
+             output_attentions if output_attentions is not None else self.generation_config.output_attentions
+         )
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.generation_config.output_hidden_states
+         )
+         return_dict_in_generate = (
+             return_dict_in_generate
+             if return_dict_in_generate is not None
+             else self.generation_config.return_dict_in_generate
+         )
+
+         # init attention / hidden states / scores tuples
+         scores = () if (return_dict_in_generate and output_scores) else None
+         decoder_attentions = () if (return_dict_in_generate and output_attentions) else None
+         cross_attentions = () if (return_dict_in_generate and output_attentions) else None
+         decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None
+
+         # if model is an encoder-decoder, retrieve encoder attention weights and hidden states
+         if return_dict_in_generate and self.config.is_encoder_decoder:
+             encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None
+             encoder_hidden_states = (
+                 model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None
+             )
+
+         # keep track of which sequences are already finished
+         unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
+
+         this_peer_finished = False  # used by synced_gpus only
+         # auto-regressive generation
+         while True:
+             if synced_gpus:
+                 # Under synced_gpus the `forward` call must continue until all gpus complete their sequence.
+                 # The following logic allows an early break if all peers finished generating their sequence
+                 this_peer_finished_flag = torch.tensor(0.0 if this_peer_finished else 1.0).to(input_ids.device)
+                 # send 0.0 if we finished, 1.0 otherwise
+                 dist.all_reduce(this_peer_finished_flag, op=dist.ReduceOp.SUM)
+                 # did all peers finish? the reduced sum will be 0.0 then
+                 if this_peer_finished_flag.item() == 0.0:
+                     break
+
+             # prepare model inputs
+             model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
+
+             # forward pass to get next token
+             outputs = self(
+                 **model_inputs,
+                 return_dict=True,
+                 output_attentions=output_attentions,
+                 output_hidden_states=output_hidden_states,
+             )
+
+             if synced_gpus and this_peer_finished:
+                 continue  # don't waste resources running the code we don't need
+
+             next_token_logits = outputs.logits[:, -1, :]
+
+             # pre-process distribution
+             next_token_scores = logits_processor(input_ids, next_token_logits)
+             next_token_scores = logits_warper(input_ids, next_token_scores)
+
+             # Store scores, attentions and hidden_states when required
+             if return_dict_in_generate:
+                 if output_scores:
+                     scores += (next_token_scores,)
+                 if output_attentions:
+                     decoder_attentions += (
+                         (outputs.decoder_attentions,) if self.config.is_encoder_decoder else (outputs.attentions,)
+                     )
+                     if self.config.is_encoder_decoder:
+                         cross_attentions += (outputs.cross_attentions,)
+
+                 if output_hidden_states:
+                     decoder_hidden_states += (
+                         (outputs.decoder_hidden_states,)
+                         if self.config.is_encoder_decoder
+                         else (outputs.hidden_states,)
+                     )
+
+             # sample
+             probs = nn.functional.softmax(next_token_scores, dim=-1)
+             next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
+
+             # finished sentences should have their next token be a padding token
+             if eos_token_id is not None:
+                 if pad_token_id is None:
+                     raise ValueError("If `eos_token_id` is defined, make sure that `pad_token_id` is defined.")
+                 next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)
+
+             # update generated ids, model inputs, and length for next step
+             input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
+             model_kwargs = self._update_model_kwargs_for_generation(
+                 outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
+             )
+
+             # if eos_token was found in one sentence, set sentence to finished
+             if eos_token_id is not None:
+                 unfinished_sequences = unfinished_sequences.mul((sum(next_tokens != i for i in eos_token_id)).long())
+
+             # stop when each sentence is finished, or if we exceed the maximum length
+             if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
+                 if not synced_gpus:
+                     break
+                 else:
+                     this_peer_finished = True
+             else:
+                 yield next_tokens
+
+         if return_dict_in_generate:
+             if self.config.is_encoder_decoder:
+                 yield SampleEncoderDecoderOutput(
+                     sequences=input_ids,
+                     scores=scores,
+                     encoder_attentions=encoder_attentions,
+                     encoder_hidden_states=encoder_hidden_states,
+                     decoder_attentions=decoder_attentions,
+                     cross_attentions=cross_attentions,
+                     decoder_hidden_states=decoder_hidden_states,
+                 )
+             else:
+                 yield SampleDecoderOnlyOutput(
+                     sequences=input_ids,
+                     scores=scores,
+                     attentions=decoder_attentions,
+                     hidden_states=decoder_hidden_states,
+                 )
+         else:
+             yield next_tokens
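Because `sample` yields ids one at a time, `generate` becomes a generator that can be consumed as a stream; this is exactly how `predict.py` uses it. A minimal sketch, assuming local weights in `model/`:

```python
# Sketch: consuming YieldingReplitCode.generate as a token stream.
import torch
from transformers import AutoTokenizer
from subclass import YieldingReplitCode

tokenizer = AutoTokenizer.from_pretrained("model/", trust_remote_code=True)
model = YieldingReplitCode.from_pretrained("model/", trust_remote_code=True)

input_ids = tokenizer("def hello():", return_tensors="pt").input_ids
with torch.inference_mode():
    # each iteration is one sampled token id tensor of shape (batch,)
    for next_tokens in model.generate(input_ids, max_length=32, do_sample=True):
        print(tokenizer.decode([next_tokens.item()]), end="", flush=True)
```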
cog.yaml ADDED
@@ -0,0 +1,15 @@
+ build:
+   gpu: true
+   cuda: "11.7"
+   python_version: "3.10"
+   python_requirements: requirements.txt
+
+   # commands run after the environment is set up
+   run:
+     - pip install flash-attn==0.2.8
+     - pip install triton==2.0.0.dev20221202
+     - pip install tensorizer==1.1.0
+     - echo 'deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main' | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list
+     - curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -
+     - apt-get update && apt-get install -y google-cloud-cli
+ predict: "predict.py:Predictor"
predict.py ADDED
@@ -0,0 +1,202 @@
+ import time
+ import subprocess
+ import os
+ from collections import OrderedDict
+
+ import torch
+
+ from transformers import AutoTokenizer, AutoConfig
+ from tensorizer import TensorDeserializer
+ from tensorizer.utils import no_init_or_tensor
+ from cog import BasePredictor, ConcatenateIterator, Input
+
+ # from config import DEFAULT_MODEL_NAME, DEFAULT_CONFIG_PATH, load_tokenizer, load_tensorizer
+ from subclass import YieldingReplitCode
+
+ # Weights are either local or in a cloud bucket.
+
+ # For development, point to a local path on disk.
+ # This is the path from which we pull weights when there's no COG_WEIGHTS
+ # environment variable (COG_WEIGHTS is used for trainable models).
+ # TENSORIZER_WEIGHTS_PATH = "model/model.tensors"
+ TENSORIZER_WEIGHTS_PATH = "gs://replicate-weights/replit-code-v1-3b/model.tensors"
+
+ # Set this to a GCP URL when pushing the model
+ # TENSORIZER_WEIGHTS_PATH = None
+
+ DEFAULT_CONFIG_PATH = "model/"
+ TOKENIZER_PATH = "model/"
+
+
+ def maybe_download(path):
+     """Download weights from GCS if `path` is a gs:// URL; otherwise return it unchanged."""
+     if path.startswith("gs://"):
+         st = time.time()
+         output_path = "/tmp/weights.tensors"
+         subprocess.check_call(["gcloud", "storage", "cp", path, output_path])
+         print(f"weights downloaded in {time.time() - st:.2f}s")
+         return output_path
+     return path
+
+
+ class Predictor(BasePredictor):
+     def setup(self):
+         self.device = "cuda" if torch.cuda.is_available() else "cpu"
+
+         # set TOKENIZERS_PARALLELISM to false to avoid a warning
+         os.environ["TOKENIZERS_PARALLELISM"] = "false"
+
+         self.model = self.load_tensorizer(
+             weights=maybe_download(TENSORIZER_WEIGHTS_PATH),
+             plaid_mode=True,
+             cls=YieldingReplitCode,
+             config_path=DEFAULT_CONFIG_PATH,
+         )
+         self.tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_PATH, trust_remote_code=True)
+
+     def load_tensorizer(self, weights, plaid_mode, cls, config_path):
+         st = time.time()
+         print(f"deserializing weights from {weights}")
+
+         config = AutoConfig.from_pretrained(config_path, trust_remote_code=True)
+         config.attn_config["attn_impl"] = "triton"
+
+         # with no_init_or_tensor():
+         #     model = YieldingReplitCode.from_pretrained('./model/', config=config, trust_remote_code=True)
+
+         # Instantiate the model skeleton without allocating or initializing weights;
+         # the tensorizer fills in the parameters below.
+         model = no_init_or_tensor(
+             lambda: cls.from_pretrained(
+                 None, config=config, state_dict=OrderedDict(), trust_remote_code=True,
+             )
+         )
+
+         deserialized = TensorDeserializer(weights, plaid_mode=plaid_mode)
+         deserialized.load_into_module(model)
+         try:
+             model = model.to(dtype=torch.bfloat16)
+         except Exception:
+             pass
+
+         print(f"weights loaded in {time.time() - st:.2f}s")
+         return model
+
+     def predict(
+         self,
+         prompt: str = Input(description="Text prompt"),
+         max_length: int = Input(
+             description="Maximum number of tokens to generate. A word is generally 2-3 tokens.",
+             ge=1,
+             default=500,
+         ),
+         temperature: float = Input(
+             description="Adjusts randomness of outputs; values greater than 1 are more random, 0 is deterministic, and 0.75 is a good starting value.",
+             ge=0.01,
+             le=5,
+             default=0.75,
+         ),
+         top_p: float = Input(
+             description="When decoding text, samples from the top p fraction of most likely tokens; lower to ignore less likely tokens.",
+             ge=0.01,
+             le=1.0,
+             default=1.0,
+         ),
+         repetition_penalty: float = Input(
+             description="Penalty for repeated words in generated text; 1 is no penalty, values greater than 1 discourage repetition, less than 1 encourage it.",
+             ge=0.01,
+             le=5,
+             default=1,
+         ),
+         length_penalty: float = Input(
+             description="Values above 1.0 cause the model to favor longer sequences; values below 1.0 favor shorter sequences.",
+             ge=0.01,
+             le=5,
+             default=1,
+         ),
+         no_repeat_ngram_size: int = Input(
+             description="If set to an int > 0, all ngrams of that size can only occur once.",
+             ge=0,
+             default=0,
+         ),
+         stop_sequence: str = Input(
+             description="Generation will halt if this token is produced. Currently, only single-token stop sequences are supported; `###` is recommended if you want to control generation termination.",
+             default=None,
+         ),
+         seed: int = Input(
+             description="Set seed for reproducible outputs. Set to -1 for a random seed.",
+             ge=-1,
+             default=-1,
+         ),
+         debug: bool = Input(
+             description="Provide debugging output in logs", default=False
+         ),
+     ) -> ConcatenateIterator[str]:
+         input_ids = self.tokenizer(prompt, return_tensors="pt").input_ids.to(self.device)
+
+         # set torch seed
+         if seed == -1:
+             torch.seed()
+         else:
+             torch.manual_seed(seed)
+             torch.cuda.manual_seed(seed)
+
+         with torch.inference_mode():
+             first_token_yielded = False
+             prev_ids = []
+             for output in self.model.generate(
+                 input_ids,
+                 max_length=max_length,
+                 do_sample=True,
+                 temperature=temperature,
+                 top_p=top_p,
+                 repetition_penalty=repetition_penalty,
+                 length_penalty=length_penalty,
+                 no_repeat_ngram_size=no_repeat_ngram_size,
+             ):
+                 cur_id = output.item()
+
+                 # In order to handle spaces properly, we do our own tokenizing:
+                 # buffer sub-word / punctuation tokens until we hit a space,
+                 # then yield whole words plus punctuation.
+                 cur_token = self.tokenizer.convert_ids_to_tokens(cur_id)
+
+                 # skip the initial newline, which this model almost always yields (hack: newline id is 187 for this tokenizer)
+                 if not first_token_yielded and not prev_ids and cur_id == 187:
+                     continue
+
+                 # "Ġ" (not a standard capital G) marks a token that begins with a space,
+                 # so the buffered tokens form a complete word we can yield
+                 if cur_token.startswith("Ġ"):
+                     # first token
+                     if not prev_ids:
+                         prev_ids = [cur_id]
+                         continue
+
+                     # there are tokens to yield
+                     else:
+                         token = self.tokenizer.decode(prev_ids, clean_up_tokenization_spaces=False)
+                         prev_ids = [cur_id]
+
+                         if not first_token_yielded:
+                             # no leading space for first token
+                             token = token.strip()
+                             first_token_yielded = True
+                         yield token
+                 # End token
+                 elif cur_token == "<|endoftext|>":
+                     break
+
+                 elif stop_sequence and cur_token == stop_sequence:
+                     break
+
+                 else:
+                     prev_ids.append(cur_id)
+                     continue
+
+             # remove any special tokens such as </s>
+             token = self.tokenizer.decode(prev_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
+             if not first_token_yielded:
+                 # no leading space for first token
+                 token = token.strip()
+                 first_token_yielded = True
+             yield token
+
+         if debug:
+             print(f"cur memory: {torch.cuda.memory_allocated()}")
+             print(f"max allocated: {torch.cuda.max_memory_allocated()}")
+             print(f"peak reserved: {torch.cuda.max_memory_reserved()}")
requirements.txt CHANGED
@@ -1,4 +1,6 @@
- streamlit==1.25.0 # Latest stable version of Streamlit
- transformers==4.33.0 # Hugging Face Transformers library
- torch>=1.9.0 # PyTorch, required for Hugging Face models
- numpy>=1.21.0 # Numerical library for model dependencies
+ einops==0.6.1
+ sentencepiece==0.1.99
+ torch==2.0.1
+ transformers==4.29.2
+ # flash-attn==0.2.8  # installed via the run commands in cog.yaml
+ # triton==2.0.0.dev20221202  # installed via the run commands in cog.yaml
scripts/download_and_prepare_model.py ADDED
@@ -0,0 +1,107 @@
+ #!/usr/bin/env python
+
+ import os
+ import shutil
+ import argparse
+ import logging
+ import sys
+
+ from tempfile import TemporaryDirectory
+ from huggingface_hub import snapshot_download, login
+ from transformers import AutoTokenizer
+
+ from tensorize_model import tensorize_model
+
+ logger = logging.getLogger(__name__)
+ logging.basicConfig(level=logging.INFO, stream=sys.stdout)
+
+
+ def download_model_from_hf_hub(
+     model_name: str,
+     model_path: str,
+     rm_existing_model: bool = True,
+ ) -> dict:
+     """
+     Download a model from the Hugging Face Hub and save it locally.
+     The tokenizer can also be saved to a separate location (see
+     `download_hf_model_and_copy_tokenizer`) so that it can be easily included
+     in a Docker image without including the model weights.
+
+     Args:
+         model_name (str): Name of the model on the Hugging Face Hub
+         model_path (str): Local path where the model is saved
+         rm_existing_model (bool, optional): Whether to remove an existing model first. Defaults to True.
+
+     Returns:
+         dict: Dictionary containing the model name and path
+     """
+
+     # model_weights_path = os.path.join(os.getcwd(), "model_weights/torch_weights")
+     # model_path = os.path.join(model_weights_path, model_name)
+
+     if rm_existing_model:
+         logger.info(f"Removing existing model at {model_path}")
+         if os.path.exists(model_path):
+             shutil.rmtree(model_path)
+
+     # set up a temporary directory for the download
+     with TemporaryDirectory() as tmpdir:
+         logger.info(f"Downloading {model_name} weights to temp...")
+
+         snapshot_dir = snapshot_download(
+             repo_id=model_name,
+             cache_dir=tmpdir,
+             allow_patterns=["*.bin", "*.json", "*.md", "*.model", "*.py"],
+         )
+         # copy snapshot to model dir (dirs_exist_ok replaces the deprecated distutils copy_tree)
+         logger.info(f"Copying weights to {model_path}...")
+         shutil.copytree(snapshot_dir, str(model_path), dirs_exist_ok=True)
+
+     return {"model_name": model_name, "model_path": model_path}
+
+
+ def download_hf_model_and_copy_tokenizer(
+     model_name: str,
+     model_path: str,
+     tokenizer_path: str,
+     rm_existing_model: bool = True,
+ ):
+     model_info = download_model_from_hf_hub(model_name, model_path, rm_existing_model)
+
+     if tokenizer_path:
+         # Move tokenizer to a separate location
+         logger.info(f"Copying tokenizer and model config to {tokenizer_path}...")
+         tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")
+         tokenizer.save_pretrained(tokenizer_path)
+
+         # Set the source and destination file paths
+         config_path = os.path.join(model_path, "config.json")
+
+         # Copy the model config alongside the tokenizer
+         shutil.copy(config_path, tokenizer_path)
+
+     return model_info
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--model_name", type=str)
+     parser.add_argument("--model_path", type=str)
+     parser.add_argument("--tokenizer_path", type=str, default=None)
+     parser.add_argument("--hf_token", type=str, default=None)
+     parser.add_argument("--tensorize", action="store_true", default=False)
+     parser.add_argument("--dtype", type=str, default="fp32")
+
+     args = parser.parse_args()
+     if args.hf_token is not None:
+         login(token=args.hf_token)
+
+     # download_hf_model_and_copy_tokenizer(args.model_name, model_path=args.model_path, tokenizer_path=args.tokenizer_path)
+     tensorizer_path = os.path.join(args.model_path, "model.tensors")
+     if args.tensorize:
+         model = tensorize_model(args.model_name, model_path=args.model_path, dtype=args.dtype, tensorizer_path=tensorizer_path)
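
A hypothetical programmatic use of the download helper, run from within scripts/ (the model id and paths are illustrative; replit-code-v1-3b is the model this repo targets):

# sketch, not part of the upload
from download_and_prepare_model import download_hf_model_and_copy_tokenizer

info = download_hf_model_and_copy_tokenizer(
    model_name="replit/replit-code-v1-3b",
    model_path="../model",
    tokenizer_path="../model",
)
print(info)
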
scripts/tensorize_model.py ADDED
@@ -0,0 +1,91 @@
+ #!/usr/bin/env python
+ import torch
+ import os
+ import argparse
+ import logging
+ import sys
+
+ from tensorizer import TensorSerializer
+ from transformers import AutoModelForCausalLM
+
+
+ logger = logging.getLogger(__name__)
+ logging.basicConfig(level=logging.INFO, stream=sys.stdout)
+
+
+ def tensorize_model(
+     model_name: str,
+     model_path: str,
+     tensorizer_path: str,
+     dtype: str = "fp32",
+ ) -> dict:
+     """
+     Create a tensorized version of the model weights. If dtype is "fp16" or
+     "bf16", the model is loaded in that precision before serialization.
+
+     Args:
+         model_name (str): Name of the model on the Hugging Face Hub
+         model_path (str): Local path where the model weights are saved
+         tensorizer_path (str): Local path where the tensorized weights are written
+         dtype (str): One of "fp32", "fp16", and "bf16". Defaults to "fp32".
+
+     Returns:
+         dict: Dictionary containing the tensorized model path and dtype.
+     """
+
+     if dtype == "fp32" or dtype is None:
+         torch_dtype = torch.float32
+     elif dtype == "bf16":
+         torch_dtype = torch.bfloat16
+     elif dtype == "fp16":
+         torch_dtype = torch.float16
+     else:
+         raise ValueError(f"Unsupported dtype: {dtype}")
+
+     logger.info(f"Loading {model_name} in {dtype} from {model_path}...")
+
+     model = AutoModelForCausalLM.from_pretrained(
+         model_path, torch_dtype=torch_dtype, trust_remote_code=True,
+     ).to("cuda:0")
+
+     logger.info(f"Tensorizing model {model_name} in {dtype} and writing tensors to {tensorizer_path}...")
+
+     serializer = TensorSerializer(tensorizer_path)
+     serializer.write_module(model)
+     serializer.close()
+
+     # Write the config to the tensorized model weights directory
+     dir_path = os.path.dirname(tensorizer_path)
+     model.config.save_pretrained(dir_path)
+
+     logger.info(f"Tensorized model {model_name} in {dtype}; wrote tensors to {tensorizer_path} and config to {dir_path}.")
+
+     return {"tensorized_weights_path": tensorizer_path, "dtype": dtype}
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description="A simple script for tensorizing a torch model.")
+
+     parser.add_argument("--model_name", type=str)
+     parser.add_argument("--model_path", type=str, default=None)
+     parser.add_argument("--tensorizer_path", type=str, default=None)
+     parser.add_argument("--dtype", type=str, default="fp32")
+
+     args = parser.parse_args()
+
+     model_info = tensorize_model(
+         args.model_name,
+         model_path=args.model_path,
+         tensorizer_path=args.tensorizer_path,
+         dtype=args.dtype,
+     )
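
A hedged example of producing bf16 tensorized weights next to the torch weights (the paths are illustrative, but they match the model/model.tensors layout predict.py expects):

# sketch, not part of the upload
from tensorize_model import tensorize_model

info = tensorize_model(
    "replit/replit-code-v1-3b",
    model_path="model/",
    tensorizer_path="model/model.tensors",
    dtype="bf16",
)
print(info["tensorized_weights_path"])
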
subclass.py ADDED
@@ -0,0 +1,284 @@
+ """sampling code pulled from Transformers & slightly modified to stream tokens"""
+ import warnings
+ from typing import List, Optional, Union
+
+ import torch
+ import torch.distributed as dist
+ from torch import nn
+
+ from transformers.generation.logits_process import LogitsProcessorList
+ from transformers.generation.stopping_criteria import StoppingCriteriaList, validate_stopping_criteria
+ from transformers.generation.utils import SampleOutput, SampleDecoderOnlyOutput, SampleEncoderDecoderOutput
+
+ # from transformers import AutoModelForCausalLM
+ from model.modeling_mpt import MPTForCausalLM
+
+
+ class YieldingReplitCode(MPTForCausalLM):
+     """Overrides `sample` to yield tokens as they are generated."""
+
+     def sample(
+         self,
+         input_ids: torch.LongTensor,
+         logits_processor: Optional[LogitsProcessorList] = None,
+         stopping_criteria: Optional[StoppingCriteriaList] = None,
+         logits_warper: Optional[LogitsProcessorList] = None,
+         max_length: Optional[int] = None,
+         pad_token_id: Optional[int] = None,
+         eos_token_id: Optional[Union[int, List[int]]] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         output_scores: Optional[bool] = None,
+         return_dict_in_generate: Optional[bool] = None,
+         synced_gpus: Optional[bool] = False,
+         **model_kwargs,
+     ) -> Union[SampleOutput, torch.LongTensor]:
+         r"""
+         Generates sequences of token ids for models with a language modeling head using **multinomial sampling** and
+         can be used for text-decoder, text-to-text, speech-to-text, and vision-to-text models.
+
+         <Tip warning={true}>
+
+         In most cases, you do not need to call [`~generation.GenerationMixin.sample`] directly. Use `generate()`
+         instead. For an overview of generation strategies and code examples, check the [following
+         guide](./generation_strategies).
+
+         </Tip>
+
+         Parameters:
+             input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
+                 The sequence used as a prompt for the generation.
+             logits_processor (`LogitsProcessorList`, *optional*):
+                 An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsProcessor`]
+                 used to modify the prediction scores of the language modeling head applied at each generation step.
+             stopping_criteria (`StoppingCriteriaList`, *optional*):
+                 An instance of [`StoppingCriteriaList`]. List of instances of class derived from [`StoppingCriteria`]
+                 used to tell if the generation loop should stop.
+             logits_warper (`LogitsProcessorList`, *optional*):
+                 An instance of [`LogitsProcessorList`]. List of instances of class derived from [`LogitsWarper`] used
+                 to warp the prediction score distribution of the language modeling head applied before multinomial
+                 sampling at each generation step.
+             max_length (`int`, *optional*, defaults to 20):
+                 **DEPRECATED**. Use `logits_processor` or `stopping_criteria` directly to cap the number of generated
+                 tokens. The maximum length of the sequence to be generated.
+             pad_token_id (`int`, *optional*):
+                 The id of the *padding* token.
+             eos_token_id (`int`, *optional*):
+                 The id of the *end-of-sequence* token.
+             output_attentions (`bool`, *optional*, defaults to `False`):
+                 Whether or not to return the attentions tensors of all attention layers. See `attentions` under
+                 returned tensors for more details.
+             output_hidden_states (`bool`, *optional*, defaults to `False`):
+                 Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
+                 for more details.
+             output_scores (`bool`, *optional*, defaults to `False`):
+                 Whether or not to return the prediction scores. See `scores` under returned tensors for more details.
+             return_dict_in_generate (`bool`, *optional*, defaults to `False`):
+                 Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
+             synced_gpus (`bool`, *optional*, defaults to `False`):
+                 Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
+             model_kwargs:
+                 Additional model specific kwargs will be forwarded to the `forward` function of the model. If model is
+                 an encoder-decoder model the kwargs should include `encoder_outputs`.
+
+         Return:
+             [`~generation.SampleDecoderOnlyOutput`], [`~generation.SampleEncoderDecoderOutput`] or `torch.LongTensor`:
+             A `torch.LongTensor` containing the generated tokens (default behaviour) or a
+             [`~generation.SampleDecoderOnlyOutput`] if `model.config.is_encoder_decoder=False` and
+             `return_dict_in_generate=True` or a [`~generation.SampleEncoderDecoderOutput`] if
+             `model.config.is_encoder_decoder=True`.
+
+         Examples:
+
+         ```python
+         >>> from transformers import (
+         ...     AutoTokenizer,
+         ...     AutoModelForCausalLM,
+         ...     LogitsProcessorList,
+         ...     MinLengthLogitsProcessor,
+         ...     TopKLogitsWarper,
+         ...     TemperatureLogitsWarper,
+         ...     StoppingCriteriaList,
+         ...     MaxLengthCriteria,
+         ... )
+         >>> import torch
+
+         >>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
+         >>> model = AutoModelForCausalLM.from_pretrained("gpt2")
+
+         >>> # set pad_token_id to eos_token_id because GPT2 does not have a PAD token
+         >>> model.config.pad_token_id = model.config.eos_token_id
+         >>> model.generation_config.pad_token_id = model.config.eos_token_id
+
+         >>> input_prompt = "Today is a beautiful day, and"
+         >>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids
+
+         >>> # instantiate logits processors
+         >>> logits_processor = LogitsProcessorList(
+         ...     [
+         ...         MinLengthLogitsProcessor(15, eos_token_id=model.generation_config.eos_token_id),
+         ...     ]
+         ... )
+         >>> # instantiate logits warpers
+         >>> logits_warper = LogitsProcessorList(
+         ...     [
+         ...         TopKLogitsWarper(50),
+         ...         TemperatureLogitsWarper(0.7),
+         ...     ]
+         ... )
+
+         >>> stopping_criteria = StoppingCriteriaList([MaxLengthCriteria(max_length=20)])
+
+         >>> torch.manual_seed(0)  # doctest: +IGNORE_RESULT
+         >>> outputs = model.sample(
+         ...     input_ids,
+         ...     logits_processor=logits_processor,
+         ...     logits_warper=logits_warper,
+         ...     stopping_criteria=stopping_criteria,
+         ... )
+
+         >>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
+         ['Today is a beautiful day, and a wonderful day.\n\nI was lucky enough to meet the']
+         ```"""
+         # init values
+         logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
+         stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()
+         if max_length is not None:
+             warnings.warn(
+                 "`max_length` is deprecated in this function, use"
+                 " `stopping_criteria=StoppingCriteriaList(MaxLengthCriteria(max_length=max_length))` instead.",
+                 UserWarning,
+             )
+             stopping_criteria = validate_stopping_criteria(stopping_criteria, max_length)
+         logits_warper = logits_warper if logits_warper is not None else LogitsProcessorList()
+         pad_token_id = pad_token_id if pad_token_id is not None else self.generation_config.pad_token_id
+         eos_token_id = eos_token_id if eos_token_id is not None else self.generation_config.eos_token_id
+         if isinstance(eos_token_id, int):
+             eos_token_id = [eos_token_id]
+         output_scores = output_scores if output_scores is not None else self.generation_config.output_scores
+         output_attentions = (
+             output_attentions if output_attentions is not None else self.generation_config.output_attentions
+         )
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.generation_config.output_hidden_states
+         )
+         return_dict_in_generate = (
+             return_dict_in_generate
+             if return_dict_in_generate is not None
+             else self.generation_config.return_dict_in_generate
+         )
+
+         # init attention / hidden states / scores tuples
+         scores = () if (return_dict_in_generate and output_scores) else None
+         decoder_attentions = () if (return_dict_in_generate and output_attentions) else None
+         cross_attentions = () if (return_dict_in_generate and output_attentions) else None
+         decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None
+
+         # if model is an encoder-decoder, retrieve encoder attention weights and hidden states
+         if return_dict_in_generate and self.config.is_encoder_decoder:
+             encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None
+             encoder_hidden_states = (
+                 model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None
+             )
+
+         # keep track of which sequences are already finished
+         unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
+
+         this_peer_finished = False  # used by synced_gpus only
+         # auto-regressive generation
+         while True:
+             if synced_gpus:
+                 # Under synced_gpus the `forward` call must continue until all gpus complete their sequence.
+                 # The following logic allows an early break if all peers finished generating their sequence
+                 this_peer_finished_flag = torch.tensor(0.0 if this_peer_finished else 1.0).to(input_ids.device)
+                 # send 0.0 if we finished, 1.0 otherwise
+                 dist.all_reduce(this_peer_finished_flag, op=dist.ReduceOp.SUM)
+                 # did all peers finish? the reduced sum will be 0.0 then
+                 if this_peer_finished_flag.item() == 0.0:
+                     break
+
+             # prepare model inputs
+             model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
+
+             # forward pass to get next token
+             outputs = self(
+                 **model_inputs,
+                 return_dict=True,
+                 output_attentions=output_attentions,
+                 output_hidden_states=output_hidden_states,
+             )
+
+             if synced_gpus and this_peer_finished:
+                 continue  # don't waste resources running the code we don't need
+
+             next_token_logits = outputs.logits[:, -1, :]
+
+             # pre-process distribution
+             next_token_scores = logits_processor(input_ids, next_token_logits)
+             next_token_scores = logits_warper(input_ids, next_token_scores)
+
+             # Store scores, attentions and hidden_states when required
+             if return_dict_in_generate:
+                 if output_scores:
+                     scores += (next_token_scores,)
+                 if output_attentions:
+                     decoder_attentions += (
+                         (outputs.decoder_attentions,) if self.config.is_encoder_decoder else (outputs.attentions,)
+                     )
+                     if self.config.is_encoder_decoder:
+                         cross_attentions += (outputs.cross_attentions,)
+
+                 if output_hidden_states:
+                     decoder_hidden_states += (
+                         (outputs.decoder_hidden_states,)
+                         if self.config.is_encoder_decoder
+                         else (outputs.hidden_states,)
+                     )
+
+             # sample
+             probs = nn.functional.softmax(next_token_scores, dim=-1)
+             next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
+
+             # finished sentences should have their next token be a padding token
+             if eos_token_id is not None:
+                 if pad_token_id is None:
+                     raise ValueError("If `eos_token_id` is defined, make sure that `pad_token_id` is defined.")
+                 next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)
+
+             # update generated ids, model inputs, and length for next step
+             input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
+             model_kwargs = self._update_model_kwargs_for_generation(
+                 outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
+             )
+
+             # if eos_token was found in one sentence, set sentence to finished
+             if eos_token_id is not None:
+                 unfinished_sequences = unfinished_sequences.mul((sum(next_tokens != i for i in eos_token_id)).long())
+
+             # stop when each sentence is finished, or if we exceed the maximum length
+             if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
+                 if not synced_gpus:
+                     break
+                 else:
+                     this_peer_finished = True
+             else:
+                 # modified from upstream: stream each step's sampled tokens to the caller
+                 yield next_tokens
+
+         # modified from upstream: the final result is yielded rather than returned
+         if return_dict_in_generate:
+             if self.config.is_encoder_decoder:
+                 yield SampleEncoderDecoderOutput(
+                     sequences=input_ids,
+                     scores=scores,
+                     encoder_attentions=encoder_attentions,
+                     encoder_hidden_states=encoder_hidden_states,
+                     decoder_attentions=decoder_attentions,
+                     cross_attentions=cross_attentions,
+                     decoder_hidden_states=decoder_hidden_states,
+                 )
+             else:
+                 yield SampleDecoderOnlyOutput(
+                     sequences=input_ids,
+                     scores=scores,
+                     attentions=decoder_attentions,
+                     hidden_states=decoder_hidden_states,
+                 )
+         else:
+             # emit the final step's tokens (upstream returns input_ids here)
+             yield next_tokens
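
One consequence of the terminal behavior above: the tokens sampled on the stopping step are only emitted by the post-loop yield, and with return_dict_in_generate=True the iterator's final item is a Sample*Output rather than a token tensor. A hedged consumer sketch that separates the streamed steps from the structured result (assumes a loaded model, batch size 1):

# sketch, not part of the upload
stream = model.generate(
    input_ids,
    do_sample=True,
    max_length=32,
    return_dict_in_generate=True,
    output_scores=True,
)
*step_tokens, result = list(stream)
print(len(step_tokens), "streamed steps")
print(result.sequences.shape)  # full generated sequence, prompt included
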