Initial model card version

#1
Files changed (1)
  1. README.md +444 -0
README.md CHANGED
@@ -1,3 +1,447 @@
---
pipeline_tag: text-generation
inference: true
# widget:
# - text: 'Question: Please write a function in Python that performs bubble sort.\n\nAnswer:'
#   example_title: Bubble sort
#   group: Python
license: apache-2.0
datasets:
# Mentioned in paper
- codeparrot/github-code-clean
- bigcode/starcoderdata
# - Stackexchange
# - CommonCrawl
# - open-web-math/open-web-math # Phase 1
# - math-ai/StackMathQA # Phase 2
# - Arxiv
# - Wikipedia
# - conceptofmind/FLAN_2022 # Original link is broken, we used IBM's filtered version | Phase 2
# - bigcode/commitpackft # Phase 2
# - bigcode/oasst-octopack # Phase 2

# Phase 1 datasets
- togethercomputer/RedPajama-Data-V2 # Common Crawl - CC (Redpajama v2)
- togethercomputer/RedPajama-Data-1T # Books (Redpajama v1)
- allenai/peS2o
- open-web-math/open-web-math
- EleutherAI/proof-pile-2 # Algebraic-stack (HF)
# - Code pile v2 w/o GPL (dp08)
# - Webhose (dp08)
# - Patents (dp08)
# - Arxiv (dp08)
# - IEEE (dp08)
# - DMMath (dp08)
# - Financial research paper (dp08)
# - Paper with code (dp08)
# - Wikipedia (dp08)
# - Stackexchange (dp08)
# - doabooks (dp08)
# - Freelaw (dp08)
# - Pubmed (dp08)
# - EDGAR (dp08)
# - Secfiling (dp08)
# - FIDC (dp08)
# - Earning call transcript (dp08)
#
# Phase 2 datasets: add high-quality + instruction-tuning datasets into the mixture
# High quality:
# - sap_revised
# - cybersecurity
# - ibm-redbooks
# - ibm.com
# - superknowa
# - multilingual – wikipedia + doabooks (de/es/fr/ja/pt/ar/cs/it/ko/nl/zh)
# Instruction-tuning:
- nvidia/HelpSteer
- garage-bAInd/Open-Platypus
- mosaicml/dolly_hhrlhf
- mosaicml/instruct-v3
- conceptofmind/FLAN_2022
- KnutJaegersberg/longinstruct
- bigcode/oasst-octopack
- CohereForAI/xP3x
- math-ai/StackMathQA
- math-ai/TemplateGSM
- bugdaryan/sql-create-context-instruction
- glaiveai/glaive-function-calling-v2
- glaiveai/glaive-code-assistant-v3
- cognitivecomputations/dolphin-coder
- glaiveai/glaive-code-assistant
- TokenBender/code_instructions_122k_alpaca_style
- TIGER-Lab/MathInstruct
- meta-math/MetaMathQA
- tiedong/goat
- bigcode/commitpack
- bigcode/commitpackft
- HuggingFaceTB/cosmopedia
- deepmind/code_contests
- ise-uiuc/Magicoder-Evol-Instruct-110K
- ise-uiuc/Magicoder-OSS-Instruct-75K
- theblackcat102/evol-codealpaca-v1
- ajibawa-2023/Code-290k-ShareGPT
- Locutusque/UltraTextbooks-2.0
- teknium/OpenHermes-2.5
- stingning/ultrachat
# - API Blend
#
# DATASET LINKS
# NL
# - nvidia/HelpSteer
# - garage-bAInd/Open-Platypus
# - mosaicml/dolly_hhrlhf
# - mosaicml/instruct-v3
# - conceptofmind/FLAN_2022
# - KnutJaegersberg/longinstruct
# - CohereForAI/xP3x
# - HuggingFaceTB/cosmopedia
# - open-web-math/open-web-math
# - EleutherAI/proof-pile-2
# - math-ai/StackMathQA
# - math-ai/TemplateGSM
# - IBM ConvAI 0111
# - IBM Forca 30K
# - IBM Hardcoded
# Code
# - bugdaryan/sql-create-context-instruction
# - glaiveai/glaive-function-calling-v2
# - cognitivecomputations/dolphin-coder
# - glaiveai/glaive-code-
# - bigcode/commitpackft
# - TIGER-Lab/MathInstruct
# - meta-math/MetaMathQA
# - tiedong/goat
# - CohereForAI/xP3x
metrics:
- code_eval
library_name: transformers
tags:
- code
model-index:
- name: granite-3b-code-base
  results:
  - task:
      type: text-generation
    dataset:
      type: openai_humaneval # https://arxiv.org/pdf/2107.03374
      name: HumanEval
    metrics:
    - name: pass@1
      type: pass@1
      value: 34.1
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: evalplus/humanevalplus # https://arxiv.org/pdf/2305.01210 https://github.com/evalplus/evalplus
      name: HumanEval+
    metrics:
    - name: pass@1
      type: pass@1
      value: 29.9
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: mbpp # https://arxiv.org/abs/2108.07732
      name: MBPP
    metrics:
    - name: pass@1
      type: pass@1
      value: 36.0
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: evalplus/mbppplus
      name: MBPP+
    metrics:
    - name: pass@1
      type: pass@1
      value: 45.1
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalSynthesis(Python)
    metrics:
    - name: pass@1
      type: pass@1
      value: 36.0
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalSynthesis(JavaScript)
    metrics:
    - name: pass@1
      type: pass@1
      value: 37.2
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalSynthesis(Java)
    metrics:
    - name: pass@1
      type: pass@1
      value: 40.9
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalSynthesis(Go)
    metrics:
    - name: pass@1
      type: pass@1
      value: 26.2
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalSynthesis(C++)
    metrics:
    - name: pass@1
      type: pass@1
      value: 35.4
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalSynthesis(Rust)
    metrics:
    - name: pass@1
      type: pass@1
      value: 22.0
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalExplain(Python)
    metrics:
    - name: pass@1
      type: pass@1
      value: 25.0
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalExplain(JavaScript)
    metrics:
    - name: pass@1
      type: pass@1
      value: 18.9
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalExplain(Java)
    metrics:
    - name: pass@1
      type: pass@1
      value: 29.9
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalExplain(Go)
    metrics:
    - name: pass@1
      type: pass@1
      value: 17.1
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalExplain(C++)
    metrics:
    - name: pass@1
      type: pass@1
      value: 26.8
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalExplain(Rust)
    metrics:
    - name: pass@1
      type: pass@1
      value: 14.0
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalFix(Python)
    metrics:
    - name: pass@1
      type: pass@1
      value: 18.3
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalFix(JavaScript)
    metrics:
    - name: pass@1
      type: pass@1
      value: 23.2
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalFix(Java)
    metrics:
    - name: pass@1
      type: pass@1
      value: 29.9
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalFix(Go)
    metrics:
    - name: pass@1
      type: pass@1
      value: 24.4
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalFix(C++)
    metrics:
    - name: pass@1
      type: pass@1
      value: 16.5
      verified: false # Check
  - task:
      type: text-generation
    dataset:
      type: bigcode/humanevalpack
      name: HumanEvalFix(Rust)
    metrics:
    - name: pass@1
      type: pass@1
      value: 3.7
      verified: false # Check
---
<!--
Granite 3B Code Base

Model Summary: few sentences like starcoder
- Developers:
- GH repository:
- Release date:
- License:

Usage
Intended use
Generation
Fill-in-the-middle

Training Data

Infrastructure

Limitations

Citation
-->

# Granite 3B Code Base
<!-- ![granite](https://github.com/ibm-granite/granite-code-models/blob/main/figures/granite.png) -->

## Model Summary
**Granite 3B Code Base** is a decoder-only code model designed for generative code tasks (e.g., code generation, code explanation, code fixing). It was trained from scratch on 4 trillion tokens sourced from 116 programming languages, ensuring a comprehensive understanding of programming languages and syntax.

- **Developers:** IBM Research
- **GitHub Repository:** [ibm-granite/granite-code-models](https://github.com/ibm-granite/granite-code-models)
- **Paper:** [Granite Code Models: A Family of Open Foundation Models for Code Intelligence](https://)
- **Release Date:** May 6th, 2024
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Usage
### Intended use
Prominent enterprise use cases of LLMs in software engineering productivity include code generation, code explanation, code fixing, generating unit tests, generating documentation, addressing technical debt issues, vulnerability detection, code translation, and more. All Granite Code Base models, including the **3B parameter model**, can handle these tasks, as they were trained on a large amount of code data from 116 programming languages.

### Generation
Before proceeding, you need to install the necessary dependencies. You can do this by running the following command:
```
pip install -r requirements.txt
```

This is a simple example of how to use the Granite 3B Code Base model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ibm-granite/granite-3b-code-base"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="cuda")
model.eval()

input_text = "def generate():"

# Tokenize the prompt and move the input tensors to the GPU.
input_tokens = tokenizer(input_text, return_tensors="pt")
for i in input_tokens:
    input_tokens[i] = input_tokens[i].cuda()

# Generate a completion and decode it back to text.
output = model.generate(**input_tokens)
output = tokenizer.batch_decode(output)
for i in output:
    print(i)
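
Because `batch_decode` operates on a batch of sequences, the same pattern extends to several prompts at once. Below is a minimal sketch under the same setup; the left-padding and pad-token handling are our assumptions, not part of the original example:

```python
# Decoder-only models should be left-padded when batching prompts of
# different lengths; this tokenizer may have no pad token by default.
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

prompts = ["def fibonacci(n):", "def quicksort(arr):"]
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(**batch, max_new_tokens=48, pad_token_id=tokenizer.pad_token_id)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```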

### Fill-in-the-middle

Fill-in-the-middle uses special tokens to identify the prefix, middle, and suffix parts of the input and output:

```python
# Reuses the tokenizer and model loaded in the Generation example above.
input_text = "<fim_prefix>def print_hello_world():\n    <fim_suffix>\n    print('Hello world!')<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
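
The prompt assembly and extraction of the infilled span can be wrapped in a small helper. This is a minimal sketch reusing the tokenizer and model from above; the `infill` name and its defaults are illustrative, not part of the model's API:

```python
def infill(prefix: str, suffix: str, max_new_tokens: int = 64) -> str:
    """Generate the code between prefix and suffix via fill-in-the-middle."""
    prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the tokens generated after the prompt, i.e. the middle span.
    middle = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(middle, skip_special_tokens=True)

print(infill("def add(a, b):\n    ", "\n    return result"))
```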

## Training Data
- **Data Collection and Filtering:** Pretraining code data is sourced from a combination of publicly available datasets (e.g., [GitHub Code Clean](https://huggingface.co/datasets/codeparrot/github-code-clean), [Starcoder data](https://huggingface.co/datasets/bigcode/starcoderdata)) and additional public code repositories and issues from GitHub. We filter the raw data to retain a list of 116 programming languages. After language filtering, we also filter out low-quality code.
- **Exact and Fuzzy Deduplication:** We adopt an aggressive deduplication strategy that includes both exact and fuzzy deduplication to remove documents with (near-)identical code content.
- **HAP, PII, Malware Filtering:** We apply a HAP content filter that reduces the models' likelihood of generating hateful, abusive, or profane language. We also redact Personally Identifiable Information (PII) by replacing PII content (e.g., names, email addresses, keys, passwords) with corresponding tokens (e.g., ⟨NAME⟩, ⟨EMAIL⟩, ⟨KEY⟩, ⟨PASSWORD⟩), as sketched after this list. Moreover, we scan all datasets using [ClamAV](https://www.clamav.net/) to identify and remove instances of malware in the source code.
- **Natural Language Datasets:** In addition to collecting code data for model training, we curate several publicly available high-quality natural language datasets to improve the models' proficiency in language understanding and mathematical reasoning. Unlike the code data, we do not deduplicate these datasets.
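
As an illustration of the redaction step, the sketch below replaces matched PII spans with placeholder tokens. It is a minimal sketch assuming simple regular-expression detectors; the actual detectors and rules used in the pipeline are not published here:

```python
import re

# Hypothetical detector patterns; the real pipeline's detectors are more sophisticated.
PII_PATTERNS = {
    "⟨EMAIL⟩": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "⟨KEY⟩": re.compile(r"(?:api|secret)_?key\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
}

def redact_pii(text: str) -> str:
    """Replace detected PII spans with their placeholder tokens."""
    for token, pattern in PII_PATTERNS.items():
        text = pattern.sub(token, text)
    return text

print(redact_pii('contact = "jane.doe@example.com"  # api_key = "abc123"'))
```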

## Infrastructure
We train the Granite Code models using two of IBM's supercomputing clusters, Vela and Blue Vela, outfitted with NVIDIA A100 and H100 GPUs, respectively. These clusters provide a scalable and efficient infrastructure for training our models over thousands of GPUs.

## Limitations
Large Language Models are often prone to generating incorrect information, typically referred to as hallucinations, and the **Granite 3B Code Base** model is no exception in this regard. Even though this model is suited for code-related tasks, as it is trained on source code from 116 programming languages, the generated code is not guaranteed to work as intended: it can be inefficient and may contain bugs or exploits. Moreover, Granite Code Base models are _not_ instruction-following models, so instruction-style prompts like *"Write a function that computes the square root"* may not work well; completion-style prompts, as sketched below, tend to be more reliable.
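
For example, rather than issuing that instruction, a base model is better served by a completion-style prompt it can continue. A minimal sketch, reusing the tokenizer and model from the Generation example; the prompt wording is illustrative:

```python
# Give the base model a signature and docstring to complete instead of a command.
prompt = 'def sqrt(x: float) -> float:\n    """Compute the square root of x."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```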

## Citation
```
@misc{granite-models,
  author = {author 1, author2, ...},
  title = {Granite Code Large Language Models: IBM Foundation Models for Code},
  journal = {},
  volume = {},
  year = {2024},
  url = {https://arxiv.org/abs/0000.00000},
}
```