---
license: bsd-3-clause
tags:
- codet5
datasets:
- code_search_net
inference: true
---
# CodeT5-base for Code Summarization

[CodeT5-base](https://huggingface.co/Salesforce/codet5-base) model fine-tuned on CodeSearchNet data in a multilingual training setting
(Ruby/JavaScript/Go/Python/Java/PHP) for code summarization. It was introduced in the EMNLP 2021
paper [CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation](https://arxiv.org/abs/2109.00859)
by Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. More details are available
at [this repository](https://github.com/salesforce/CodeT5).
## How to use

Here is how to use this model:

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

if __name__ == '__main__':
    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base-multi-sum')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')

    text = """def svg_to_image(string, size=None):
    if isinstance(string, unicode):
        string = string.encode('utf-8')
    renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
    if not renderer.isValid():
        raise ValueError('Invalid SVG data.')
    if size is None:
        size = renderer.defaultSize()
    image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
    painter = QtGui.QPainter(image)
    renderer.render(painter)
    return image"""

    input_ids = tokenizer(text, return_tensors="pt").input_ids

    generated_ids = model.generate(input_ids, max_length=20)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
    # this prints: "Convert a SVG string to a QImage."
```
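The checkpoint also handles batched inputs. Below is a minimal sketch assuming standard `transformers` batching; the example snippets, beam size, and length limits are illustrative rather than settings from the paper:

```python
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base-multi-sum')
model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')

# Two small, made-up snippets in different languages, just for illustration.
snippets = [
    "def add(a, b):\n    return a + b",
    "function isEven(n) { return n % 2 === 0; }",
]

# Pad/truncate so snippets of different lengths can share one batch.
batch = tokenizer(snippets, return_tensors="pt", padding=True, truncation=True, max_length=512)
generated_ids = model.generate(
    batch.input_ids,
    attention_mask=batch.attention_mask,
    max_length=20,
    num_beams=4,  # illustrative decoding choice, not necessarily the paper's
)
for summary in tokenizer.batch_decode(generated_ids, skip_special_tokens=True):
    print(summary)
```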
## Fine-tuning data

We employ the filtered version of the CodeSearchNet data [[Husain et al., 2019](https://arxiv.org/abs/1909.09436)]
from the [CodeXGLUE](https://github.com/microsoft/CodeXGLUE/tree/main/Code-Text/code-to-text) benchmark for fine-tuning on
code summarization. The data is tokenized with our pre-trained code-specific BPE (Byte-Pair Encoding) tokenizer. One can
prepare text (or code) for the model using `RobertaTokenizer` with the vocab files from [codet5-base](https://huggingface.co/Salesforce/codet5-base).
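For instance, a minimal sketch of preparing an input (the code snippet and the 512-token limit are illustrative):

```python
from transformers import RobertaTokenizer

# The multi-sum checkpoint shares the same code-specific BPE vocab as codet5-base.
tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')

code = "def greet(name):\n    return 'Hello, ' + name"  # made-up example input
print(tokenizer.tokenize(code)[:10])    # inspect the BPE pieces
encoded = tokenizer(code, truncation=True, max_length=512)
print(encoded.input_ids[:10])           # token ids ready to feed the model
```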
### Data statistics

| Programming Language | Training | Dev    | Test   |
| :------------------- | :------: | :----: | :----: |
| Python               | 251,820  | 13,914 | 14,918 |
| PHP                  | 241,241  | 12,982 | 14,014 |
| Go                   | 167,288  | 7,325  | 8,122  |
| Java                 | 164,923  | 5,183  | 10,955 |
| JavaScript           | 58,025   | 3,885  | 3,291  |
| Ruby                 | 24,927   | 1,400  | 1,261  |
## Training procedure

We fine-tune codet5-base on these six programming languages (Ruby/JavaScript/Go/Python/Java/PHP) in the multi-task learning setting. We employ
balanced sampling to avoid biasing towards high-resource tasks. Please refer to the [paper](https://arxiv.org/abs/2109.00859) for more details.
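As a rough illustration of the idea (not the exact recipe used for this checkpoint), temperature-based balancing raises each language's example count to an exponent below 1 and renormalizes, which shifts sampling probability from high-resource to low-resource languages:

```python
import numpy as np

# Training-set sizes from the data statistics table above.
sizes = {
    "python": 251_820, "php": 241_241, "go": 167_288,
    "java": 164_923, "javascript": 58_025, "ruby": 24_927,
}

def sampling_probs(sizes, alpha=0.7):
    """Temperature-based balancing: weight each language by size**alpha (alpha < 1),
    then renormalize. alpha here is illustrative, not the value used in the paper."""
    langs = list(sizes)
    weights = np.array([sizes[l] for l in langs], dtype=np.float64) ** alpha
    return dict(zip(langs, weights / weights.sum()))

total = sum(sizes.values())
for lang, p in sampling_probs(sizes).items():
    print(f"{lang:>10}: raw {sizes[lang] / total:.3f} -> balanced {p:.3f}")
```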
## Evaluation results

Unlike the paper, which allows selecting a different best checkpoint for each programming language (PL), here we employ one checkpoint for
all PLs. Besides, we remove the task control prefix that specifies the PL during training and inference. The results on the test set are shown below:
| Model                                                      | Ruby      | JavaScript | Go        | Python    | Java      | PHP       | Overall   |
| ---------------------------------------------------------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
| Seq2Seq                                                    | 9.64      | 10.21      | 13.98     | 15.93     | 15.09     | 21.08     | 14.32     |
| Transformer                                                | 11.18     | 11.59      | 16.38     | 15.81     | 16.26     | 22.12     | 15.56     |
| [RoBERTa](https://arxiv.org/pdf/1907.11692.pdf)            | 11.17     | 11.90      | 17.72     | 18.14     | 16.47     | 24.02     | 16.57     |
| [CodeBERT](https://arxiv.org/pdf/2002.08155.pdf)           | 12.16     | 14.90      | 18.07     | 19.06     | 17.65     | 25.16     | 17.83     |
| [PLBART](https://aclanthology.org/2021.naacl-main.211.pdf) | 14.11     | 15.56      | 18.91     | 19.30     | 18.45     | 23.58     | 18.32     |
| [CodeT5-small](https://arxiv.org/abs/2109.00859)           | 14.87     | 15.32      | 19.25     | 20.04     | 19.92     | 25.46     | 19.14     |
| [CodeT5-base](https://arxiv.org/abs/2109.00859)            | **15.24** | 16.16      | 19.56     | 20.01     | **20.31** | 26.03     | 19.55     |
| [CodeT5-base-multi-sum](https://arxiv.org/abs/2109.00859)  | **15.24** | **16.18**  | **19.95** | **20.42** | 20.26     | **26.10** | **19.69** |
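If you want to score your own generations, one option is corpus-level BLEU via `sacrebleu`. This is only a sketch with made-up outputs and references; the table above follows the paper's evaluation setup on the CodeXGLUE test sets, so a different BLEU implementation will not reproduce those numbers exactly:

```python
import sacrebleu

# Made-up model outputs and reference docstrings, for illustration only.
hypotheses = [
    "Convert a SVG string to a QImage.",
    "Return the sum of two numbers.",
]
references = [
    "Convert an SVG document to a QImage.",
    "Add two numbers and return the result.",
]

# corpus_bleu takes a list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```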
## Citation

```bibtex
@inproceedings{wang2021codet5,
  title={CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation},
  author={Wang, Yue and Wang, Weishi and Joty, Shafiq and Hoi, Steven C.H.},
  booktitle={Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021},
  year={2021},
}
```