---
library_name: transformers
datasets:
- NamCyan/tesoro-code
base_model:
- Salesforce/codet5p-220m
---

# Improving the detection of technical debt in Java source code with an enriched dataset

## Model Details

### Model Description

This model is part of the Tesoro project and is used for detecting technical debt in source code. More information can be found at the Tesoro HomePage.

- **Developed by:** Nam Hai Le
- **Model type:** Encoder-Decoder-based PLM
- **Language(s):** Java
- **Finetuned from model:** CodeT5+

### Model Sources

- **Repository:** Tesoro
- **Paper:** [To be updated]

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("NamCyan/codet5p-220m-technical-debt-code-tesoro")
model = AutoModelForSequenceClassification.from_pretrained("NamCyan/codet5p-220m-technical-debt-code-tesoro")
```
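
Once loaded, the model can be run as a standard sequence classifier. The snippet below is a minimal inference sketch: the Java example and the binary label interpretation (index 1 = technical debt) are illustrative assumptions, not taken from the model config.

```python
import torch

# Illustrative Java snippet to classify (assumption, not from the card).
code = "public void process() { /* TODO: refactor this quick fix later */ }"

# Tokenize and run a forward pass without gradient tracking.
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

# Assumed label mapping: index 1 = technical debt, index 0 = clean code.
predicted_id = logits.argmax(dim=-1).item()
print("technical debt" if predicted_id == 1 else "no technical debt")
```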

## Training Details

- **Training Data:** The model is fine-tuned on the tesoro-code dataset (a loading sketch is shown below).

- **Infrastructure:** Training was conducted on two NVIDIA A100 GPUs, each with 80GB of VRAM.
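
A minimal sketch of loading the training data with the `datasets` library, using the dataset ID from this card's metadata; the split and column layout are not documented here and should be inspected before use.

```python
from datasets import load_dataset

# Load the Tesoro code dataset from the Hugging Face Hub
# (dataset ID taken from this card's metadata).
dataset = load_dataset("NamCyan/tesoro-code")

# Inspect the available splits and features; the exact column
# names are not documented in this card.
print(dataset)
print(next(iter(dataset.values()))[0])
```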

## Leaderboard

| Model | Model size | EM | F1 |
|---|---|---|---|
| **Encoder-based PLMs** | | | |
| CodeBERT | 125M | 38.28 | 43.47 |
| UniXCoder | 125M | 38.12 | 42.58 |
| GraphCodeBERT | 125M | 39.38 | 44.21 |
| RoBERTa | 125M | 35.37 | 38.22 |
| ALBERT | 11.8M | 39.32 | 41.99 |
| **Encoder-Decoder-based PLMs** | | | |
| PLBART | 140M | 36.85 | 39.90 |
| CodeT5 | 220M | 32.66 | 35.41 |
| CodeT5+ | 220M | 37.91 | 41.96 |
| **Decoder-based PLMs (LLMs)** | | | |
| TinyLlama | 1.03B | 37.05 | 40.05 |
| DeepSeek-Coder | 1.28B | 42.52 | 46.19 |
| OpenCodeInterpreter | 1.35B | 38.16 | 41.76 |
| phi-2 | 2.78B | 37.92 | 41.57 |
| starcoder2 | 3.03B | 35.37 | 41.77 |
| CodeLlama | 6.74B | 34.14 | 38.16 |
| Magicoder | 6.74B | 39.14 | 42.49 |

## Citing us

```bibtex
@article{nam2024tesoro,
  title={Improving the detection of technical debt in Java source code with an enriched dataset},
  author={Hai, Nam Le and Bui, Anh M. T. and Nguyen, Phuong T. and Ruscio, Davide Di and Kazman, Rick},
  journal={},
  year={2024}
}
```