Update Org Card. (#3)
- Update Org Card. (94f2d5bdcab5d81447add22135c8bd4489a34436)
- Update README.md (81a7150d3953663aa4e9e159fb7c3d140dbdf8e9)
- Update README.md (ab1f9b0db256990cca9344cf6a652b39b0bcb80c)
- Update README.md (4e05844b41aa543d5b305230c11cd5d1466332cb)
- add paper link (7ce4f8e1b0dd3c899beeaa9b68f435390e5d433d)
Co-authored-by: Leandro von Werra <[email protected]>
README.md
CHANGED
@@ -6,15 +6,63 @@ colorTo: red
 sdk: static
 pinned: false
 ---
-<p>
-<img src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/bigcode_light.png" alt="drawing" width="440"/>
-</p>
 
-<
-
+<img id="bclogo" src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/bigcode_light.png" alt="drawing" width="440"/>
+<style type="text/css">
+#bclogo {
+display: block;
+margin-left: auto;
+margin-right: auto }
+</style>
 
-
+# BigCode
 
-
-
-
+BigCode is an open scientific collaboration working on the responsible training of large language models for coding applications. You can find more information on the main [website](https://www.bigcode-project.org/) or follow BigCode on [Twitter](https://twitter.com/BigCodeProject). In this organization you can find the artefacts of this collaboration: StarCoder, a state-of-the-art language model for code; The Stack, the largest available pretraining dataset of permissively licensed code; and SantaCoder, a 1.1B-parameter model for code.
+
+---
+
+## 💫StarCoder
+StarCoder is a 15.5B-parameter language model for code, trained for 1T tokens on 80+ programming languages. It uses Multi-Query Attention (MQA) for efficient generation, has a context window of 8,192 tokens, and supports fill-in-the-middle.
+
+### Models
+- [Paper](https://drive.google.com/file/d/1cN-b9GnWtHzQRoE7M7gAEyivY0kl4BYs/view): A technical report about StarCoder.
+- [GitHub](https://github.com/bigcode-project/starcoder/tree/main): All you need to know about using or fine-tuning StarCoder.
+- [StarCoder](https://huggingface.co/bigcode/starcoder): StarCoderBase further trained on Python.
+- [StarCoderBase](https://huggingface.co/bigcode/starcoderbase): Trained on 80+ languages from The Stack.
+- [StarEncoder](https://huggingface.co/bigcode/starencoder): Encoder model trained on The Stack.
+- [StarPii](https://huggingface.co/bigcode/starpii): StarEncoder-based PII detector.
+
+### Tools & Demos
+- [StarCoder Chat](https://hf.co/chat/starcoder): Chat with StarCoder!
+- [VSCode Extension](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode): Code with StarCoder!
+- [StarCoder Playground](https://huggingface.co/spaces/bigcode/bigcode-playground): Write with StarCoder!
+- [StarCoder Editor](https://huggingface.co/spaces/bigcode/bigcode-playground): Edit with StarCoder!
+
+### Data & Governance
+- [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata): Pretraining dataset of StarCoder.
+- [Tech Assistant Prompt](https://huggingface.co/datasets/bigcode/ta-prompt): With this prompt you can turn StarCoder into a tech assistant.
+- [Governance Card](https://huggingface.co/spaces/bigcode/governance-card): A card outlining the governance of the model.
+- [StarCoder License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
+- [StarCoder Search](https://huggingface.co/spaces/bigcode/search): Full-text code search over the pretraining dataset.
+- [StarCoder Membership Test](https://stack.dataportraits.org): Blazing-fast test of whether code was present in the pretraining dataset.
+
+---
+
+## 📑The Stack
+The Stack is a 6.4TB dataset of permissively licensed source code in 358 programming languages.
+
+- [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact-deduplicated version of The Stack.
+- [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): Near-deduplicated version of The Stack (recommended for training).
+- [The Stack issues](https://huggingface.co/datasets/bigcode/the-stack-issues): Collection of GitHub issues.
+- [The Stack Metadata](https://huggingface.co/datasets/bigcode/the-stack-metadata): Metadata of the repositories in The Stack.
+- [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check if your data is in The Stack and request opt-out.
+
+---
+
+## 🎅SantaCoder
+SantaCoder, aka smol StarCoder: same architecture, but trained only on Python, Java, and JavaScript.
+
+- [SantaCoder](https://huggingface.co/bigcode/santacoder): SantaCoder Model.
+- [SantaCoder Demo](https://huggingface.co/spaces/bigcode/santacoder-demo): Write with SantaCoder.
+- [SantaCoder Search](https://huggingface.co/spaces/bigcode/santacoder-search): Search code in the pretraining dataset.
+- [SantaCoder License](https://huggingface.co/spaces/bigcode/license): The OpenRAIL license for SantaCoder.
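Note on the StarCoder description added in this change: it says the model can do fill-in-the-middle, but does not show what such a request looks like. The sketch below is one way a fill-in-the-middle prompt might be assembled in plain Python. The sentinel strings `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` are an assumption (they do not appear in this README and should be checked against the model card's tokenizer), and the helper names `split_around_cursor` and `build_fim_prompt` are hypothetical, introduced only for illustration.

```python
from __future__ import annotations

# Assumed sentinel strings; they are not stated in the README above,
# so verify them against the tokenizer of the model you actually use.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def split_around_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Split source text at a character offset into (prefix, suffix)."""
    if not 0 <= cursor <= len(source):
        raise ValueError("cursor is outside the source text")
    return source[:cursor], source[cursor:]


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Lay out the text before and after a gap so the model is asked for the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


if __name__ == "__main__":
    code = "def add(a, b):\n    return \n"
    # Place the gap right after "return " so the model would complete the expression.
    cursor = code.index("return ") + len("return ")
    prefix, suffix = split_around_cursor(code, cursor)
    print(build_fim_prompt(prefix, suffix))
```

The resulting string would be passed to the model as an ordinary generation prompt, and whatever it generates after the final sentinel is the proposed middle. The prompt, the suffix, and the generated tokens all have to fit in the 8,192-token context window mentioned in the description.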