Learning and Evaluating Contextual Embedding of Source Code
Abstract
Recent research has achieved impressive results on understanding and improving source code by building on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving better accuracy. However, there has been no attempt yet to obtain a high-quality contextual embedding of source code and to evaluate it on multiple program-understanding tasks simultaneously; this paper aims to fill that gap. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-sourced code-understanding BERT model; and, second, we create an open-sourced benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature before. We fine-tune CuBERT on our benchmark tasks and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training and with fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark and from comparing against CuBERT models as a strong baseline.
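For illustration, the following is a minimal sketch of the fine-tuning setup the abstract describes: a pre-trained BERT-style encoder adapted to a binary code-classification task on a small labeled set. The Hugging Face transformers API and the bert-base-uncased checkpoint are stand-in assumptions, not the paper's released CuBERT checkpoints or its program-derived vocabulary; the snippets and labels are hypothetical.

# Sketch: fine-tuning a BERT-style encoder for a binary
# code-classification task (e.g., "does this snippet contain a bug?").
# "bert-base-uncased" is a placeholder for an actual CuBERT checkpoint.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import BertTokenizerFast, BertForSequenceClassification


class SnippetDataset(Dataset):
    """Pairs of (source-code string, 0/1 label); illustrative only."""

    def __init__(self, snippets, labels, tokenizer, max_length=128):
        self.enc = tokenizer(snippets, truncation=True, padding="max_length",
                             max_length=max_length, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item


tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

train_data = SnippetDataset(
    ["def add(a, b): return a - b", "def add(a, b): return a + b"],
    [1, 0],  # hypothetical labels: 1 = buggy, 0 = correct
    tokenizer,
)
loader = DataLoader(train_data, batch_size=2, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        optimizer.zero_grad()
        loss = model(**batch).loss  # cross-entropy over the two classes
        loss.backward()
        optimizer.step()

Swapping in an actual CuBERT checkpoint and its code-aware tokenizer would follow the same pattern; only the pre-trained weights and vocabulary change.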