Table of Contents

Model Description

This model is developed based on Codebert and a 5M subset of The Vault to detect the inconsistency between docstring/comment and function. It is used to remove noisy examples in The Vault dataset.

More information:

Model Details

  • Developed by: Fsoft AI Center
  • License: MIT
  • Model type: Transformer-Encoder based Language Model
  • Architecture: BERT-base
  • Data set: The Vault
  • Tokenizer: Byte Pair Encoding
  • Vocabulary Size: 50265
  • Sequence Length: 512
  • Language: English and 10 Programming languages (Python, Java, JavaScript, PHP, C#, C, C++, Go, Rust, Ruby)
  • Training details:
    • Self-supervised learning, binary classification
    • Positive class: Original code-docstring pair
    • Negative class: Random pairing code and docstring

Usage

The input to the model follows the below template:

"""
Template:
<s>{docstring}</s></s>{code}</s>

Example:
from transformers import AutoTokenizer

#Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

input = "<s>Sum two integers</s></s>def sum(a, b):\n    return a + b</s>"
tokenized_input = tokenizer(input, add_special_tokens= False)
"""

Using model with Jax and Pytorch

from transformers import AutoTokenizer, AutoModelForSequenceClassification, FlaxAutoModelForSequenceClassification

#Load model with jax
model = FlaxAutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

#Load model with torch
model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

Limitations

This model is trained on 5M subset of The Vault in a self-supervised manner. Since the negative samples are generated artificially, the model's ability to identify instances that require a strong semantic understanding between the code and the docstring might be restricted.

It is hard to evaluate the model due to the unavailable labeled datasets. GPT-3.5-turbo is adopted as a reference to measure the correlation between the model and GPT-3.5-turbo's scores. However, the result could be influenced by GPT-3.5-turbo's potential biases and ambiguous conditions. Therefore, we recommend having human labeling dataset and fine-tune this model to achieve the best result.

Additional information

Licensing Information

MIT License

Citation Information

@article{manh2023vault,
  title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
  author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
  journal={arXiv preprint arXiv:2305.06156},
  year={2023}
}
Downloads last month
26
Safetensors
Model size
125M params
Tensor type
I64
·
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Collection including Fsoft-AIC/Codebert-docstring-inconsistency