yuewang-sf committed on
Commit
e2fc071
·
1 Parent(s): a1dcd88

Update README.md

Files changed (1)
  1. README.md +66 -0
README.md CHANGED
@@ -1,3 +1,69 @@
1
  ---
2
  license: bsd-3-clause
3
  ---
4
+
5
+ # CodeT5+ 770M
6
+
7
+ ## Model description
8
+
9
+ [CodeT5+](https://github.com/salesforce/CodeT5/tree/main/CodeT5+) is a new family of open code large language models with an encoder-decoder architecture that can flexibly operate in different modes (i.e. _encoder-only_, _decoder-only_, and _encoder-decoder_) to support a wide range of code understanding and generation tasks.
10
+ It is introduced in the paper:
11
+
12
+ [CodeT5+: Open Code Large Language Models for Code Understanding and Generation](https://github.com/salesforce/CodeT5/tree/main/CodeT5+)
13
+ by [Yue Wang](https://yuewang-cuhk.github.io/)\*, [Hung Le](https://sites.google.com/view/henryle2018/home?pli=1)\*, [Akhilesh Deepak Gotmare](https://akhileshgotmare.github.io/), [Nghi D.Q. Bui](https://bdqnghi.github.io/), [Junnan Li](https://sites.google.com/site/junnanlics), [Steven C.H. Hoi](https://sites.google.com/view/stevenhoi/home) (* indicates equal contribution).
14
+
15
+ Compared to the original CodeT5 family (CodeT5-base: `220M`, CodeT5-large: `770M`), CodeT5+ is pretrained with a diverse set of pretraining tasks including _span denoising_, _causal language modeling_, _contrastive learning_, and _text-code matching_ to learn rich representations from both unimodal code data and bimodal code-text data.
16
+ Additionally, it employs a simple yet effective _compute-efficient pretraining_ method to initialize the model components with frozen off-the-shelf LLMs such as [CodeGen](https://github.com/salesforce/CodeGen) to efficiently scale up the model (i.e. `2B`, `6B`, `16B`), and adopts a "shallow encoder and deep decoder" architecture.
17
+ Furthermore, it is instruction-tuned to align with natural language instructions (see [InstructCodeT5+ 16B](https://github.com/salesforce/CodeT5/tree/main/CodeT5+)) following [Code Alpaca](https://github.com/sahil280114/codealpaca).
18
+
19
+ ## How to use
20
+
21
+ This model can be loaded with the `T5ForConditionalGeneration` class and uses the same tokenizer as the original [CodeT5](https://github.com/salesforce/CodeT5).
22
+
23
+ ```python
24
+ from transformers import T5ForConditionalGeneration, AutoTokenizer
25
+
26
+ checkpoint = "Salesforce/codet5p-770m"
27
+ device = "cuda"  # use "cuda" for GPU or "cpu" for CPU
28
+
29
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
30
+ model = T5ForConditionalGeneration.from_pretrained(checkpoint).to(device)
31
+
32
+ inputs = tokenizer.encode("def print_hello_world():<extra_id_0>", return_tensors="pt").to(device)
33
+ outputs = model.generate(inputs, max_length=10)
34
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
35
+ # ==> print "Hello World"
36
+ ```
37
+
38
+ ## Pretraining data
39
+
40
+ This checkpoint is trained on the stricter permissive subset of the deduplicated version of the [github-code dataset](https://huggingface.co/datasets/codeparrot/github-code).
41
+ The data is preprocessed by retaining only permissively licensed code ("mit", "apache-2", "bsd-3-clause", "bsd-2-clause", "cc0-1.0", "unlicense", "isc").
42
+ Supported languages (9 in total) are as follows:
43
+ `c`, `c++`, `c-sharp`, `go`, `java`, `javascript`, `php`, `python`, `ruby`.
44
+
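The license and language filtering described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing pipeline: the record field names (`license`, `language`) and the sample records are assumptions made for the example, and the license identifiers are copied from the list above.

```python
# Illustrative filter over in-memory records.
# Assumption: each record is a dict with "license" and "language" keys.
PERMISSIVE_LICENSES = {
    "mit", "apache-2", "bsd-3-clause", "bsd-2-clause",
    "cc0-1.0", "unlicense", "isc",
}
SUPPORTED_LANGUAGES = {
    "c", "c++", "c-sharp", "go", "java",
    "javascript", "php", "python", "ruby",
}


def keep_record(record):
    """Return True if the record has a permissive license and a supported language."""
    license_id = record.get("license", "").strip().lower()
    language = record.get("language", "").strip().lower()
    return license_id in PERMISSIVE_LICENSES and language in SUPPORTED_LANGUAGES


def filter_records(records):
    """Keep only the records that pass both checks."""
    return [record for record in records if keep_record(record)]


# Hypothetical sample records, for illustration only.
samples = [
    {"code": "print('hi')", "language": "Python", "license": "mit"},
    {"code": "int main() {}", "language": "C", "license": "gpl-3.0"},
    {"code": "fn main() {}", "language": "Rust", "license": "mit"},
]
kept = filter_records(samples)
print([record["language"] for record in kept])
```

Only the first sample passes: the second is excluded by its license and the third by its language.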
45
+ ## Training procedure
46
+
47
+ This checkpoint is trained on unimodal code data in the first pretraining stage, which combines _span denoising_ with two variants of _causal language modeling_.
48
+ Please refer to the paper for more details.
49
+
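To make the _span denoising_ objective concrete, here is a minimal sketch of T5-style span corruption using the `<extra_id_N>` sentinel tokens that also appear in the usage example. It is an illustration under stated assumptions, not the authors' training code: the spans are chosen by hand, whereas pretraining samples them randomly, and `span_corrupt` is a hypothetical helper name.

```python
def span_corrupt(tokens, spans):
    """Build an (encoder input, decoder target) pair for span denoising.

    tokens: list of string tokens.
    spans:  sorted, non-overlapping (start, end) index pairs, end exclusive.
    Each masked span is replaced by one sentinel in the source, and the
    target lists each sentinel followed by the tokens it replaced.
    """
    source, target = [], []
    cursor = 0
    for sentinel_id, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{sentinel_id}>"
        source.extend(tokens[cursor:start])  # keep the unmasked prefix
        source.append(sentinel)              # hide the span behind a sentinel
        target.append(sentinel)
        target.extend(tokens[start:end])     # the decoder must recover the span
        cursor = end
    source.extend(tokens[cursor:])           # keep the unmasked tail
    return source, target


tokens = "def add ( a , b ) : return a + b".split()
source, target = span_corrupt(tokens, [(1, 2), (8, 11)])
print(" ".join(source))  # def <extra_id_0> ( a , b ) : <extra_id_1> b
print(" ".join(target))  # <extra_id_0> add <extra_id_1> return a +
```

The usage example above relies on the same convention: the prompt ends with `<extra_id_0>`, and the model generates the content of that masked span.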
50
+ ## Evaluation results
51
+
52
+ CodeT5+ models have been comprehensively evaluated on a wide range of code understanding and generation tasks in various settings: _zero-shot_, _finetuning_, and _instruction-tuning_.
53
+ Specifically, CodeT5+ yields substantial performance gains on many downstream tasks compared to their SoTA baselines, e.g.,
54
+ 8 text-to-code retrieval tasks (+3.2 avg. MRR), 2 line-level code completion tasks (+2.1 avg. Exact Match), and 2 retrieval-augmented code generation tasks (+5.8 avg. BLEU-4).
55
+ On 2 math programming tasks, MathQA-Python and GSM8K-Python, CodeT5+ models with fewer than one billion parameters significantly outperform many LLMs of up to 137B parameters.
56
+ In particular, on the zero-shot text-to-code generation task of the HumanEval benchmark, InstructCodeT5+ 16B sets new SoTA results of 35.0% pass@1 and 54.5% pass@10 against other open code LLMs, even surpassing the closed-source OpenAI code-cushman-001 model.
57
+ Please refer to the [paper](https://github.com/salesforce/CodeT5/tree/main/CodeT5+) for more details.
58
+
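For readers unfamiliar with the pass@k metric quoted above, the sketch below implements the unbiased estimator commonly used with HumanEval: given `n` generated samples per problem, of which `c` pass the unit tests, it estimates the probability that at least one of `k` randomly drawn samples passes. Whether the reported numbers were computed in exactly this way is an assumption, and the sample counts in the example are made up.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for a single problem.

    n: total number of generated samples
    c: number of samples that pass all tests
    k: number of samples drawn (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a passing one.
        return 1.0
    # 1 - P(all k drawn samples fail)
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 20 samples for one problem, 7 of them correct.
p1 = pass_at_k(20, 7, 1)
p10 = pass_at_k(20, 7, 10)
print(f"pass@1 = {p1:.3f}, pass@10 = {p10:.3f}")
```

A benchmark-level score is the mean of this per-problem estimate over all problems.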
59
+
60
+ ## BibTeX entry and citation info
61
+
62
+ ```bibtex
63
+ @article{wang2023codet5plus,
64
+ title={CodeT5+: Open Code Large Language Models for Code Understanding and Generation},
65
+ author={Wang, Yue and Le, Hung and Gotmare, Akhilesh Deepak and Bui, Nghi D.Q. and Li, Junnan and Hoi, Steven C. H.},
66
+ journal={arXiv preprint},
67
+ year={2023}
68
+ }
69
+ ```