arxiv:2001.00059

Learning and Evaluating Contextual Embedding of Source Code

Published on Dec 21, 2019
Authors:

Abstract

Recent research has achieved impressive results on understanding and improving source code by building on machine-learning techniques developed for natural languages. A significant advance in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving better accuracy. However, there has been no attempt yet to obtain a high-quality contextual embedding of source code and to evaluate it on multiple program-understanding tasks simultaneously; that is the gap this paper aims to fill. Specifically, we first curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; second, we create an open-sourced benchmark comprising five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training and fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark and from comparing against CuBERT models as a strong baseline.
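The corpus-curation step mentioned in the abstract (a deduplicated corpus of Python files) can be illustrated with a minimal sketch. The function below performs exact-match deduplication by content hash; this is a hypothetical illustration of one plausible approach, not the paper's actual pipeline, which may use more sophisticated near-duplicate detection. The function name `dedup_files` and the path-to-text mapping are assumptions made for this example.

```python
import hashlib

def dedup_files(files):
    """Keep one copy of each distinct source file.

    `files` maps a file path to its source text. Files whose contents
    hash identically are treated as duplicates, and only the first
    occurrence is kept. This sketches exact-match deduplication; the
    paper's actual curation pipeline may differ.
    """
    seen = set()
    unique = {}
    for path, text in files.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[path] = text
    return unique
```

For example, two files with byte-identical contents collapse to one entry, so pre-training never sees the same file twice under different paths.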
