---
language:
- code
- en
task_categories:
- text-classification
tags:
- arxiv:2305.06156
license: mit
metrics:
- accuracy
widget:
- text: |-
Sum two integers</s></s>def sum(a, b):
return a + b
example_title: Simple toy
- text: |-
Look for methods that might be dynamically defined and define them for lookup.</s></s>def respond_to_missing?(name, include_private = false)
if name == :to_ary || name == :empty?
false
else
return true if mapping(name).present?
mounting = all_mountings.find{ |mount| mount.respond_to?(name) }
return false if mounting.nil?
end
end
example_title: Ruby example
- text: |-
Method that adds a candidate to the party @param c the candidate that will be added to the party</s></s>public void addCandidate(Candidate c)
{
this.votes += c.getVotes();
candidates.add(c);
}
example_title: Java example
- text: |-
we do not need Buffer pollyfill for now</s></s>function(str){
var ret = new Array(str.length), len = str.length;
while(len--) ret[len] = str.charCodeAt(len);
return Uint8Array.from(ret);
}
example_title: JavaScript example
pipeline_tag: text-classification
---
## Table of Contents
- [Model Description](#model-description)
- [Model Details](#model-details)
- [Usage](#usage)
- [Limitations](#limitations)
- [Additional Information](#additional-information)
- [Licensing Information](#licensing-information)
- [Citation Information](#citation-information)
## Model Description
This model is built on [CodeBERT](https://github.com/microsoft/CodeBERT) and trained on a 5M subset of [The Vault](https://huggingface.co/datasets/Fsoft-AIC/thevault-function-level) to detect inconsistencies between a docstring/comment and the function it describes. It is used to remove noisy examples from The Vault dataset.
More information:
- **Repository:** [FSoft-AI4Code/TheVault](https://github.com/FSoft-AI4Code/TheVault)
- **Paper:** [The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation](https://arxiv.org/abs/2305.06156)
- **Contact:** [email protected]
## Model Details
* Developed by: [Fsoft AI Center](https://www.fpt-aicenter.com/ai-residency/)
* License: MIT
* Model type: Transformer-Encoder based Language Model
* Architecture: BERT-base
* Data set: [The Vault](https://huggingface.co/datasets/Fsoft-AIC/thevault-function-level)
* Tokenizer: Byte Pair Encoding
* Vocabulary Size: 50265
* Sequence Length: 512
* Language: English and 10 programming languages (Python, Java, JavaScript, PHP, C#, C, C++, Go, Rust, Ruby)
* Training details:
  * Self-supervised learning, binary classification
  * Positive class: original code-docstring pair
  * Negative class: randomly paired code and docstring
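
The pairing scheme in the training details can be illustrated with a minimal sketch. This is not the authors' training code: the function name `make_training_pairs`, the label values (1 for an original pair, 0 for a random pairing), and the one-negative-per-positive ratio are all assumptions made for illustration.

```python
import random


def make_training_pairs(samples, seed=0):
    """Build (docstring, code, label) triples from (docstring, code) pairs.

    ASSUMPTION: label 1 marks an original pair and label 0 a random pairing;
    the actual label encoding used for this model is not stated in the card.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to build a negative pair")
    rng = random.Random(seed)
    pairs = []
    for i, (docstring, code) in enumerate(samples):
        # Positive example: the docstring with its own function.
        pairs.append((docstring, code, 1))
        # Negative example: the same docstring with the code of a
        # different, randomly chosen sample (index j != i).
        j = rng.randrange(len(samples) - 1)
        if j >= i:
            j += 1
        pairs.append((docstring, samples[j][1], 0))
    return pairs


samples = [
    ("Sum two integers", "def sum(a, b):\n    return a + b"),
    ("Return the absolute value", "def absolute(x):\n    return abs(x)"),
    ("Check whether a list is empty", "def is_empty(xs):\n    return len(xs) == 0"),
]
for docstring, code, label in make_training_pairs(samples):
    print(label, "|", docstring, "|", code.splitlines()[0])
```

Because the negatives are produced by shuffling rather than by editing docstrings, they tend to be easy to tell apart from the positives, which is the restriction noted under Limitations.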
## Usage
The input to the model follows the template below:
```python
# Template:
#   <s>{docstring}</s></s>{code}</s>

# Example:
from transformers import AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")

# The special tokens are already written into the string,
# so the tokenizer must not add them a second time.
text = "<s>Sum two integers</s></s>def sum(a, b):\n    return a + b</s>"
tokenized_input = tokenizer(text, add_special_tokens=False)
```
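When many pairs have to be formatted, the template can be wrapped in a small helper. The sketch below is only an illustration; `build_input` is a hypothetical name and not part of the released code.

```python
def build_input(docstring: str, code: str) -> str:
    """Format a pair as <s>{docstring}</s></s>{code}</s>.

    Hypothetical helper: it only concatenates strings, so the result
    should be tokenized with add_special_tokens=False.
    """
    return f"<s>{docstring}</s></s>{code}</s>"


text = build_input("Sum two integers", "def sum(a, b):\n    return a + b")
print(text)
```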
Using the model with JAX or PyTorch:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, FlaxAutoModelForSequenceClassification
# Load the model with JAX/Flax
model = FlaxAutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
# Or load the model with PyTorch
model = AutoModelForSequenceClassification.from_pretrained("Fsoft-AIC/Codebert-docstring-inconsistency")
```
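A sequence-classification head returns one logit per class, so a typical next step is to convert the logits into probabilities and apply a threshold. The sketch below uses plain Python and hypothetical logit values. It assumes that index 1 is the "consistent" class, which this card does not state; check `model.config.id2label` before relying on it. The helper names and the 0.5 threshold are also illustrative.

```python
import math

# ASSUMPTION: index 1 is the "consistent" class. Verify against
# model.config.id2label; the card does not document the label order.
CONSISTENT_INDEX = 1


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_consistent(logits, threshold=0.5):
    """Return True if the assumed 'consistent' class reaches the threshold."""
    return softmax(logits)[CONSISTENT_INDEX] >= threshold


# Hypothetical logits for one docstring/code pair, e.g. the values of
# model(**inputs).logits[0] converted to a Python list.
example_logits = [-1.2, 2.3]
print(softmax(example_logits), is_consistent(example_logits))
```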
## Limitations
This model is trained on a 5M subset of The Vault in a self-supervised manner. Because the negative samples are generated artificially, the model may struggle with cases that require a deep semantic understanding of how the code and the docstring relate.
The model is hard to evaluate because no labeled datasets are available. GPT-3.5-turbo is used as a reference, and we measure the correlation between the model's scores and GPT-3.5-turbo's scores. However, the result could be influenced by GPT-3.5-turbo's potential biases and by ambiguous cases. We therefore recommend building a human-labeled dataset and fine-tuning this model on it to achieve the best results.
## Additional Information
### Licensing Information
MIT License
### Citation Information
```
@article{manh2023vault,
title={The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation},
author={Manh, Dung Nguyen and Hai, Nam Le and Dau, Anh TV and Nguyen, Anh Minh and Nghiem, Khanh and Guo, Jin and Bui, Nghi DQ},
journal={arXiv preprint arXiv:2305.06156},
year={2023}
}
```