---
license: openrail
tags:
- generated_from_trainer
- bash
- shell
- code
- codegen
model-index:
- name: santacoder-finetuned-the-stack-bash-shell
results: []
datasets:
- bigcode/the-stack-dedup
language:
- code
pipeline_tag: text-generation
---
# SantaCoder fine-tuned on bash/shell scripts
This model is a fine-tuned version of [BigCode/SantaCoder](https://huggingface.co/bigcode/santacoder) on The Stack [bash/shell scripts](https://huggingface.co/datasets/bigcode/the-stack-dedup).
It achieves the following results on the evaluation set:
- Loss: 1.2272
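Assuming this loss is the mean per-token cross-entropy reported by the `Trainer` (the usual convention), it corresponds to a validation perplexity of roughly

$$\mathrm{PPL} = e^{\mathcal{L}} = e^{1.2272} \approx 3.41$$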
## Model description
The [SantaCoder](https://huggingface.co/bigcode/santacoder) models are a series of 1.1B parameter models trained on the Python, Java, and JavaScript subset of [The Stack (v1.1)](https://huggingface.co/datasets/bigcode/the-stack) (which excluded opt-out requests).
The main model uses [Multi Query Attention](https://arxiv.org/abs/1911.02150), was trained on data filtered by near-deduplication and comment-to-code ratio, and uses the [Fill-in-the-Middle objective](https://arxiv.org/abs/2207.14255).
In addition, there are several models that were trained on datasets with different filter parameters and with architecture and objective variations.
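As an illustration of how the Fill-in-the-Middle objective is typically exercised at inference time, the sketch below builds a FIM prompt using the sentinel tokens documented for the base SantaCoder model (`<fim-prefix>`, `<fim-suffix>`, `<fim-middle>`); whether this fine-tune preserves those exact tokens should be checked against the checkpoint's tokenizer.

```python
# Minimal FIM prompt construction, assuming the base SantaCoder sentinel tokens.
# The model is asked to generate the code that belongs between prefix and suffix.
prefix = "#!/bin/bash\n# Count the number of lines in every *.log file\nfor f in *.log; do\n"
suffix = "\ndone\n"
fim_prompt = f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"
print(fim_prompt)
```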
## Intended uses & limitations
The model has been trained on source code in Python, Java, and JavaScript and fine-tuned on bash/shell scripts. The predominant natural language in the source code is English, although other languages are present as well. As such, the model can generate code snippets given some context, but the generated code is not guaranteed to work as intended: it can be inefficient and may contain bugs or exploits.
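A minimal usage sketch with `transformers` is shown below; `trust_remote_code=True` is needed because the base SantaCoder architecture ships custom modeling code, and the prompt and generation settings are purely illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "mrm8488/santacoder-finetuned-the-stack-bash-shell"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

# Complete a bash snippet from a short natural-language comment as context.
prompt = "#!/bin/bash\n# Back up /etc to a dated tarball in /tmp\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.2,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

In line with the limitations above, generated shell commands should be reviewed before being executed.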
## Training and evaluation data
The Stack contains over 6TB of permissively-licensed source code files covering 358 programming languages. The dataset was created as part of the [BigCode Project](https://www.bigcode-project.org/), an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs). The Stack serves as a pre-training dataset for Code LLMs, i.e., code-generating AI systems that enable the synthesis of programs from natural language descriptions as well as from other code snippets. **This is the near-deduplicated version with 3TB of data.**
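For reference, the bash/shell slice of the near-deduplicated Stack can be streamed roughly as follows. The `data/shell` directory name is an assumption about the dataset layout and should be verified against the dataset card, and access requires accepting the dataset's terms on the Hub.

```python
from datasets import load_dataset

# Stream the shell subset of The Stack (dedup) without downloading everything.
ds = load_dataset(
    "bigcode/the-stack-dedup",
    data_dir="data/shell",   # assumed subset path; check the dataset card
    split="train",
    streaming=True,
)
for example in ds.take(3):
    print(example["content"][:200])
```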
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (a `TrainingArguments` sketch that mirrors them is shown after the list):
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 8
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- training_steps: 10000
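The sketch below reconstructs these settings as a `TrainingArguments` object; the model, tokenizer, and data pipeline are omitted, and the `output_dir` name is illustrative rather than taken from the original training script.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="santacoder-finetuned-the-stack-bash-shell",  # illustrative
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    seed=42,
    gradient_accumulation_steps=4,  # effective train batch size: 2 * 4 = 8
    adam_beta1=0.9,                 # reported Adam settings (Trainer defaults)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_steps=10_000,
)
```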
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 1.6101 | 0.05 | 500 | 1.5078 |
| 1.6156 | 0.1 | 1000 | 1.4687 |
| 1.4916 | 0.15 | 1500 | 1.4728 |
| 1.4027 | 0.2 | 2000 | 1.4237 |
| 1.499 | 0.25 | 2500 | 1.4067 |
| 1.4378 | 0.3 | 3000 | 1.3838 |
| 1.3698 | 0.35 | 3500 | 1.3767 |
| 1.3021 | 0.4 | 4000 | 1.3562 |
| 4.0521 | 0.45 | 4500 | 1.3433 |
| 0.9722 | 0.5 | 5000 | 1.3461 |
| 1.3836 | 0.55 | 5500 | 1.2955 |
| 1.3727 | 0.6 | 6000 | 1.2809 |
| 1.3332 | 0.65 | 6500 | 1.2665 |
| 1.2232 | 0.7 | 7000 | 1.2573 |
| 1.2373 | 0.75 | 7500 | 1.2463 |
| 1.3759 | 0.8 | 8000 | 1.2391 |
| 1.3021 | 0.85 | 8500 | 1.2325 |
| 1.369 | 0.9 | 9000 | 1.2292 |
| 1.4911 | 0.95 | 9500 | 1.2275 |
| 1.1677 | 1.0 | 10000 | 1.2272 |
### Framework versions
- Transformers 4.26.0.dev0
- Pytorch 1.13.1+cu116
- Datasets 2.7.1
- Tokenizers 0.13.2
### Citation
```
@misc {manuel_romero_2023,
author = { {Manuel Romero} },
title = { santacoder-finetuned-the-stack-bash-shell (Revision d3e56a7) },
year = 2023,
url = { https://huggingface.co/mrm8488/santacoder-finetuned-the-stack-bash-shell },
doi = { 10.57967/hf/0320 },
publisher = { Hugging Face }
}
```