---
pipeline_tag: text-generation
inference: true
widget:
  - text: >-
      <commit_before>def has_close_elements(numbers: List[float], threshold:
      float) -> bool:\n    for idx, elem in enumerate(numbers):\n        for
      idx2, elem2 in enumerate(numbers):\n            if idx !=
      idx2:\n                distance = elem - elem2\n                if
      distance < threshold:\n                    return True\n\n    return
      False<commit_msg>Fix bugs in has_close_elements.<commit_after>
    example_title: Fix has_close_elements
    group: Python
license: bigcode-openrail-m
datasets:
  - bigcode/commitpack-subset-cf
metrics:
  - code_eval
library_name: transformers
tags:
  - code
model-index:
  - name: SantaCoderPack
    results:
      - task:
          type: text-generation
        dataset:
          type: bigcode/humanevalpack
          name: HumanEvalFix Python
        metrics:
          - name: pass@1
            type: pass@1
            value: 3.2
            verified: false
      - task:
          type: text-generation
        dataset:
          type: bigcode/humanevalpack
          name: HumanEvalFix JavaScript
        metrics:
          - name: pass@1
            type: pass@1
            value: 4.9
            verified: false
      - task:
          type: text-generation
        dataset:
          type: bigcode/humanevalpack
          name: HumanEvalFix Java
        metrics:
          - name: pass@1
            type: pass@1
            value: 1.8
            verified: false
      - task:
          type: text-generation
        dataset:
          type: bigcode/humanevalpack
          name: HumanEvalFix Go
        metrics:
          - name: pass@1
            type: pass@1
            value: 3.6
            verified: false
      - task:
          type: text-generation
        dataset:
          type: bigcode/humanevalpack
          name: HumanEvalFix C++
        metrics:
          - name: pass@1
            type: pass@1
            value: 4.2
            verified: false
      - task:
          type: text-generation
        dataset:
          type: bigcode/humanevalpack
          name: HumanEvalFix Rust
        metrics:
          - name: pass@1
            type: pass@1
            value: 1.7
            verified: false
      - task:
          type: text-generation
        dataset:
          type: bigcode/humanevalpack
          name: HumanEvalFix Average
        metrics:
          - name: pass@1
            type: pass@1
            value: 3.3
            verified: false
---

# Octopack

# Table of Contents

1. [Model Summary](#model-summary)
2. [Use](#use)
3. [Training](#training)
4. [Citation](#citation)

# Model Summary

SantaCoderPack is a pre-trained model with the same architecture as SantaCoder, trained on CommitPack using the following format:

`<commit_before>code_before<commit_msg>message<commit_after>code_after`

- **Repository:** [bigcode/octopack](https://github.com/bigcode-project/octopack)
- **Paper:** TODO
- **Languages:** Python, JavaScript, Java, C++, Go, Rust

| SantaCoderPack |  |
| --- | --- |
| Data | [CommitPack](https://huggingface.co/datasets/bigcode/commitpack), 4TB of GitHub commits across 350 programming languages |
| Model | SantaCoderPack (1.1B parameters) pre-trained on CommitPack |
| Evaluation | [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack) (HumanEvalFix), an extension of OpenAI's HumanEval to bug fixing |
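The scores in the metadata above are pass@1 on HumanEvalFix: a fix counts only if it passes the problem's unit tests. As a minimal sketch of how such a score can be computed, assuming the Hugging Face `evaluate` library's `code_eval` metric (this is illustrative, not the exact OctoPack evaluation harness; the test case and candidate below are made up):

```python
# Illustrative pass@1 computation with the `evaluate` library's code_eval
# metric -- not the exact OctoPack harness. code_eval executes
# model-generated code, so it must be enabled explicitly.
import os

import evaluate

os.environ["HF_ALLOW_CODE_EVAL"] = "1"
code_eval = evaluate.load("code_eval")

# One problem: a unit test and a list of candidate solutions for it.
test_cases = ["assert has_close_elements([1.0, 2.0, 3.9], 0.3) == False"]
candidates = [[
    "def has_close_elements(numbers, threshold):\n"
    "    return any(abs(a - b) < threshold\n"
    "               for i, a in enumerate(numbers)\n"
    "               for j, b in enumerate(numbers) if i != j)"
]]

pass_at_k, results = code_eval.compute(
    references=test_cases, predictions=candidates, k=[1]
)
print(pass_at_k)  # {'pass@1': 1.0}
```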

# Use

## Intended use

The model follows the instruction given as the commit message in its input. We recommend formatting your input in the commit format it was pre-trained on, e.g.:

`<commit_before>def has_close_elements(numbers: List[float], threshold: float) -> bool:\n    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = elem - elem2\n                if distance < threshold:\n                    return True\n\n    return False<commit_msg>Fix bugs in has_close_elements.<commit_after>`
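For programmatic use, a small helper can assemble such prompts. The `build_commitpack_prompt` function below is a hypothetical convenience wrapper, not part of the model or the `transformers` API:

```python
# Hypothetical helper for assembling SantaCoderPack prompts; the function
# name and signature are illustrative, not part of any released API.
def build_commitpack_prompt(code_before: str, commit_message: str) -> str:
    """Wrap buggy code and a commit message in the CommitPack format.

    The prompt ends with <commit_after> so the model continues with the
    fixed code.
    """
    return (
        f"<commit_before>{code_before}"
        f"<commit_msg>{commit_message}"
        f"<commit_after>"
    )


buggy = (
    "def has_close_elements(numbers, threshold):\n"
    "    ...\n"  # buggy function body, as in the example above
)
prompt = build_commitpack_prompt(buggy, "Fix bugs in has_close_elements.")
```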

Feel free to share your generations in the Community tab!

## Generation

```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoderpack"
device = "cuda"  # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

# Prompt in the commit format used during pre-training; the model
# generates the fixed code after <commit_after>.
inputs = tokenizer.encode("<commit_before>def has_close_elements(numbers: List[float], threshold: float) -> bool:\n    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = elem - elem2\n                if distance < threshold:\n                    return True\n\n    return False<commit_msg>Fix bugs in has_close_elements.<commit_after>", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
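By default `generate` stops at a small `max_length`, which can cut the fix short. A common adjustment, using standard `transformers` arguments (the token budget below is illustrative), is to cap new tokens explicitly and decode only the completion:

```python
# Generate up to 192 new tokens (an illustrative budget) and decode only
# the tokens after the prompt, i.e. the code following <commit_after>.
outputs = model.generate(inputs, max_new_tokens=192)
completion = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True)
print(completion)
```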

# Training

## Model

- **Architecture:** GPT-2 model with multi-query attention (see the sketch below)
- **Steps:** 250k pretraining
- **Pretraining tokens:** 131B
- **Precision:** bfloat16
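For illustration, multi-query attention shares a single key/value head across all query heads, which shrinks the KV cache during generation by a factor of the head count. A minimal PyTorch sketch of the idea (not the actual SantaCoder implementation):

```python
# Minimal multi-query attention sketch -- illustrative only, not the
# SantaCoder source. All query heads attend with one shared K/V head.
import torch
from torch import nn


class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)             # per-head queries
        self.kv_proj = nn.Linear(d_model, 2 * self.head_dim)  # one shared K/V head
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)    # (b, t, head_dim) each
        k, v = k.unsqueeze(1), v.unsqueeze(1)                  # broadcast over heads
        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5  # (b, heads, t, t)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        out = scores.softmax(dim=-1) @ v                       # (b, heads, t, head_dim)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))


x = torch.randn(2, 16, 256)
print(MultiQueryAttention(256, 8)(x).shape)  # torch.Size([2, 16, 256])
```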

## Hardware

- **Pretraining:**
  - **GPUs:** 32 Tesla A100
  - **Training time:** 15 days

## Software

# Citation

TODO