CodeBERTaJS

CodeBERTaJS is a RoBERTa-like model trained by Manuel Romero on the JavaScript portion of the CodeSearchNet dataset from GitHub.

The tokenizer is a Byte-level BPE tokenizer trained on the corpus using Hugging Face tokenizers.

Because it is trained on a corpus of code (vs. natural language), it encodes the corpus efficiently: the sequences are between 33% and 50% shorter than the same corpus tokenized by gpt2/roberta.
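One way to check the sequence-length claim on your own snippets is to tokenize the same text with both tokenizers and compare total token counts. The sketch below is only illustrative: `length_reduction` and `load_hf_tokenizers` are hypothetical helper names, and `roberta-base` is assumed as the baseline checkpoint (the card only says "gpt2/roberta"). Loading the tokenizers needs `transformers` and network access, so that part is kept in a separate function.

```python
from typing import Callable, Iterable, List


def length_reduction(
    code_tokenize: Callable[[str], List[str]],
    baseline_tokenize: Callable[[str], List[str]],
    samples: Iterable[str],
) -> float:
    """Return the fraction by which the code tokenizer shortens the samples.

    0.4 means the code tokenizer produced 40% fewer tokens than the baseline.
    """
    samples = list(samples)
    code_total = sum(len(code_tokenize(s)) for s in samples)
    base_total = sum(len(baseline_tokenize(s)) for s in samples)
    if base_total == 0:
        raise ValueError("baseline tokenizer produced no tokens")
    return 1.0 - code_total / base_total


def load_hf_tokenizers():
    """Load both tokenizers from the Hub (assumes network access)."""
    from transformers import AutoTokenizer

    code_tok = AutoTokenizer.from_pretrained("mrm8488/codeBERTaJS")
    base_tok = AutoTokenizer.from_pretrained("roberta-base")  # assumed baseline
    return code_tok.tokenize, base_tok.tokenize
```

Typical use would be `code_fn, base_fn = load_hf_tokenizers()` followed by `length_reduction(code_fn, base_fn, my_js_snippets)`. The result depends on the snippets you pass in, so it will not necessarily land in the 33% to 50% range reported for the full corpus.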

The (small) model is a 6-layer, 84M-parameter, RoBERTa-like Transformer model (the same number of layers and heads as DistilBERT), initialized from the default initialization settings and trained from scratch on the full JavaScript corpus (120M after preprocessing) for 2 epochs.
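As a sanity check on the 84M figure, the parameter count of a RoBERTa-style masked-LM can be worked out by hand. This card states only the layer count, so the sketch below assumes the remaining hyperparameters match those commonly used for CodeBERTa-small: hidden size 768, feed-forward size 3072, vocabulary of 52,000, 514 positions, and tied input/output embeddings. The function name is hypothetical.

```python
def roberta_mlm_param_count(
    vocab_size: int = 52_000,   # assumption, not stated on this card
    hidden: int = 768,          # assumption
    layers: int = 6,            # from the card
    ffn: int = 3072,            # assumption
    max_positions: int = 514,   # assumption
    type_vocab: int = 1,        # assumption
) -> int:
    """Count weights of a RoBERTa-like encoder with a tied masked-LM head."""

    def linear(n_in: int, n_out: int) -> int:
        return n_in * n_out + n_out  # weight matrix plus bias

    layer_norm = 2 * hidden  # scale and shift vectors

    # Token, position and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections. The number of heads only
    # splits these matrices, so it does not change the count.
    attention = 4 * linear(hidden, hidden) + layer_norm
    feed_forward = linear(hidden, ffn) + linear(ffn, hidden) + layer_norm

    # LM head: dense + LayerNorm + output bias (decoder weights are tied).
    lm_head = linear(hidden, hidden) + layer_norm + vocab_size

    return embeddings + layers * (attention + feed_forward) + lm_head


total = roberta_mlm_param_count()
print(f"{total / 1e6:.1f}M parameters")  # prints "83.5M parameters"
```

Under these assumptions the total comes out just under 84M, with nearly half of the weights in the token embedding matrix.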

Quick start: masked language modeling prediction

JS_CODE = """
async function createUser(req, <mask>) {
  if (!validUser(req.body.user)) {
\t  return res.status(400);
  }
  user = userService.createUser(req.body.user);
  return res.json(user);
}
""".lstrip()

Does the model know how to complete simple JS/express like code?

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="mrm8488/codeBERTaJS",
    tokenizer="mrm8488/codeBERTaJS"
)

fill_mask(JS_CODE)

## Top 5 predictions:
#
'res' # prob  0.069489665329
'next'
'req'
'user'
',req'

Yes! That was easy 🎉 Let's try another example.

JS_CODE_ = """
function getKeys(obj) {
  keys = [];
  for (var [key, value] of Object.entries(obj)) {
     keys.push(<mask>);
  }
  return keys
}
""".lstrip()

fill_mask(JS_CODE_)

Results:

'obj', 'key', ' value', 'keys', 'i'

Not so bad! The right token was predicted as the second option! 🎉
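The prediction lists above are condensed views of what the pipeline returns. For an input with a single `<mask>`, the `fill-mask` pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str` and `sequence`. A small helper like the following (the name `top_tokens` is hypothetical) can turn that list into the compact form shown here. The sample data is a placeholder with made-up scores, not real model output.

```python
from typing import Dict, List, Tuple


def top_tokens(predictions: List[Dict], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (token_str, score) pairs."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"], round(p["score"], 4)) for p in ranked[:k]]


# Placeholder data shaped like fill-mask output (scores are invented).
placeholder_predictions = [
    {"score": 0.3, "token_str": " bar", "sequence": "..."},
    {"score": 0.5, "token_str": "foo", "sequence": "..."},
    {"score": 0.1, "token_str": "baz", "sequence": "..."},
]

for token, score in top_tokens(placeholder_predictions, k=2):
    # repr() keeps a leading space visible, as in ' value' above.
    print(repr(token), score)
```

With the real pipeline this would be called as `top_tokens(fill_mask(JS_CODE))`.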

This work is heavily inspired by CodeBERTa from the Hugging Face team.


CodeSearchNet citation

@article{husain_codesearchnet_2019,
    title = {{CodeSearchNet} {Challenge}: {Evaluating} the {State} of {Semantic} {Code} {Search}},
    shorttitle = {{CodeSearchNet} {Challenge}},
    url = {http://arxiv.org/abs/1909.09436},
    urldate = {2020-03-12},
    journal = {arXiv:1909.09436 [cs, stat]},
    author = {Husain, Hamel and Wu, Ho-Hsiang and Gazit, Tiferet and Allamanis, Miltiadis and Brockschmidt, Marc},
    month = sep,
    year = {2019},
    note = {arXiv: 1909.09436},
}

Created by Manuel Romero/@mrm8488

Made with ♥ in Spain
