Spaces:
Running
Running
File size: 5,639 Bytes
49cf832 c02139b 49cf832 5867bc5 3ab57b2 5867bc5 aa053f8 a7327bc 5867bc5 e56af96 5867bc5 b9a94ab 5867bc5 b9a94ab 05ab9b5 b9a94ab 2d7e77d 8f259ff 5867bc5 75b0c8e 5867bc5 f3f95d5 5867bc5 4449d38 5867bc5 eda1407 f3f95d5 5867bc5 3068326 5867bc5 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 |
---
title: README
emoji: ✨
colorFrom: gray
colorTo: red
sdk: static
pinned: false
---
<img id="bclogo" src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/bigcode_light.png" alt="drawing" width="440"/>
<style type="text/css">
#bclogo {
display: block;
margin-left: auto;
margin-right: auto }
</style>
# BigCode
BigCode is an open scientific collaboration working on responsible training of large language models for coding applications. You can find more information on the main [website](https://www.bigcode-project.org/) or follow Big Code on [Twitter](https://twitter.com/BigCodeProject). In this organization you can find the artefacts of this collaboration: **StarCoder**, a state-of-the-art language model for code, **OctoPack**, artifacts for instruction tuning large code models, **The Stack**, the largest available pretraining dataset with perimssive code, and **SantaCoder**, a 1.1B parameter model for code.
---
## 💫StarCoder
StarCoder is a 15.5B parameters language model for code trained for 1T tokens on 80+ programming languages. It uses MQA for efficient generation, has 8,192 tokens context window and can do fill-in-the-middle.
### Models
- [Paper](https://arxiv.org/abs/2305.06161): A technical report about StarCoder.
- [GitHub](https://github.com/bigcode-project/starcoder/tree/main): All you need to know about using or fine-tuning StarCoder.
- [StarCoder](https://huggingface.co/bigcode/starcoder): StarCoderBase further trained on Python.
- [StarCoderBase](https://huggingface.co/bigcode/starcoderbase): Trained on 80+ languages from The Stack.
- [StarCoder+](https://huggingface.co/bigcode/starcoderplus): StarCoderBase further trained on English web data.
- [StarEncoder](https://huggingface.co/bigcode/starencoder): Encoder model trained on TheStack.
- [StarPii](https://huggingface.co/bigcode/starpii): StarEncoder based PII detector.
### Tools & Demos
- [StarCoder Playground](https://huggingface.co/spaces/bigcode/bigcode-playground): Write with StarCoder Models!
- [VSCode Extension](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode): Code with StarCoder!
- [StarChat](https://huggingface.co/spaces/HuggingFaceH4/starchat-playground): Chat with StarCoder!
- [Tech Assistant Prompt](https://huggingface.co/datasets/bigcode/ta-prompt): With this prompt you can turn StarCoder into tech assistant.
- [StarCoder Editor](https://huggingface.co/spaces/bigcode/bigcode-editor): Edit with StarCoder!
### Data & Governance
- [Governance Card](https://huggingface.co/datasets/bigcode/governance-card): A card outlining the governance of the model.
- [StarCoder License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
- [StarCoder Data](https://huggingface.co/datasets/bigcode/starcoderdata): Pretraining dataset of StarCoder.
- [StarCoder Search](https://huggingface.co/spaces/bigcode/search): Full-text search code in the pretraining dataset.
- [StarCoder Membership Test](https://stack.dataportraits.org/): Blazing fast test if code was present in pretraining dataset.
---
## 🐙OctoPack
OctoPack consists of data, evals & models relating to Code LLMs that follow human instructions.
- [Paper](https://arxiv.org/abs/2308.07124): Research paper with details about all components of OctoPack.
- [GitHub](https://github.com/bigcode-project/octopack): All code used for the creation of OctoPack.
- [CommitPack](https://huggingface.co/datasets/bigcode/commitpack): 4TB of Git commits.
- [Am I in the CommitPack](https://huggingface.co/spaces/bigcode/in-the-commitpack): Check if your code is in the CommitPack.
- [CommitPackFT](https://huggingface.co/datasets/bigcode/commitpackft): 2GB of high-quality Git commits that resemble instructions.
- [HumanEvalPack](https://huggingface.co/datasets/bigcode/humanevalpack): Benchmark for Code Fixing/Explaining/Synthesizing across Python/JavaScript/Java/Go/C++/Rust.
- [OctoCoder](https://huggingface.co/bigcode/octocoder): Instruction tuned model of StarCoder by training on CommitPackFT.
- [OctoCoder Demo](https://huggingface.co/spaces/bigcode/OctoCoder-Demo): Play with OctoCoder.
- [OctoGeeX](https://huggingface.co/bigcode/octogeex): Instruction tuned model of CodeGeeX2 by training on CommitPackFT.
---
## 📑The Stack
The Stack is a 6.4TB of source code in 358 programming languages from permissive licenses.
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact deduplicated version of The Stack.
- [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): Near deduplicated version of The Stack (recommended for training).
- [The Stack issues](https://huggingface.co/datasets/bigcode/the-stack-github-issues): Collection of GitHub issues.
- [The Stack Metadata](https://huggingface.co/datasets/bigcode/the-stack-metadata): Metadata of the repositories in The Stack.
- [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
---
## 🎅SantaCoder
SantaCoder aka smol StarCoder: same architecture but only trained on Python, Java, JavaScript.
- [SantaCoder](https://huggingface.co/bigcode/santacoder): SantaCoder Model.
- [SantaCoder Demo](https://huggingface.co/spaces/bigcode/santacoder-demo): Write with SantaCoder.
- [SantaCoder Search](https://huggingface.co/spaces/bigcode/santacoder-search): Search code in the pretraining dataset.
- [SantaCoder License](https://huggingface.co/spaces/bigcode/license): The OpenRAIL license for SantaCoder. |