---
datasets:
- togethercomputer/RedPajama-Data-V2
- LSX-UniWue/LLaMmlein-Dataset
language:
- de
pipeline_tag: text-generation
library_name: transformers
license: other
---

# LLäMmlein 120M

LLäMmlein 120M is a German LLaMA model trained from scratch using our adapted [TinyLlama](https://github.com/jzhang38/TinyLlama) codebase on the German portion of [RedPajama V2](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2).

To enhance data quality, we additionally deduplicated the dataset at the paragraph level and filtered it using a token-to-word ratio filter. The resulting dataset is available [here](https://huggingface.co/datasets/LSX-UniWue/LLaMmlein-Dataset).
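
For illustration, a minimal sketch of what such a token-to-word ratio filter could look like (the threshold value and the use of this model's tokenizer are assumptions, not the exact training setup):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LSX-UniWue/LLaMmlein_120M")

def keep_paragraph(text: str, max_ratio: float = 2.0) -> bool:
    """Keep a paragraph only if it yields at most `max_ratio`
    tokens per whitespace-separated word (threshold is illustrative)."""
    words = text.split()
    if not words:
        return False
    tokens = tokenizer.encode(text, add_special_tokens=False)
    return len(tokens) / len(words) <= max_ratio
```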

We provide three model sizes:

* [LLäMmlein 7B](https://huggingface.co/LSX-UniWue/LLaMmlein_7B)
* [LLäMmlein 1B](https://huggingface.co/LSX-UniWue/LLaMmlein_1B)
* [LLäMmlein 120M](https://huggingface.co/LSX-UniWue/LLaMmlein_120M) ← You are here

Find more details on our [project page](https://www.informatik.uni-wuerzburg.de/datascience/projects/nlp/llammlein/) and in our [preprint](https://arxiv.org/abs/2411.11171)!

### Usage

You can use LLäMmlein with the `transformers` library.
(Optional: install `flash-attn` for maximum efficiency.)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "LSX-UniWue/LLaMmlein_120M"

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
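
For a quick test, a minimal generation sketch (the German prompt and decoding parameters are illustrative choices, not tuned recommendations):

```python
# Illustrative prompt and decoding settings
inputs = tokenizer("Die Würzburger Residenz ist", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```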

### Intermediate Checkpoints

In addition to the final model checkpoint, we publish intermediate checkpoints from the entire training run as separate branches of this repository.
A specific checkpoint can be loaded like this:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "LSX-UniWue/LLaMmlein_120M"
revision = "iter-00420000-ckpt"  # branch name of the intermediate checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
model = AutoModelForCausalLM.from_pretrained(model_id, revision=revision)
```
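
To see which checkpoint branches are available, you can list the repository's refs with `huggingface_hub`; a small sketch (filtering by the `iter-` prefix is an assumption based on the branch name shown above):

```python
from huggingface_hub import list_repo_refs

# List all branches of the model repository
refs = list_repo_refs("LSX-UniWue/LLaMmlein_120M")
checkpoint_branches = [b.name for b in refs.branches if b.name.startswith("iter-")]
print(sorted(checkpoint_branches))
```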

In addition to the model weights, each branch contains the IDs of all data points used to train the model up to that point.
In the corresponding folder, named after the checkpoint, you will find several `.log` files (their number depends on the number of GPUs used) in the following format:

```json
{
  "time": 1739809392.679516,
  "iter_num": 0,
  "data_id": ["sha1:EDQMBYDCYBLDAZH3MGYM276BM2DEHPPJ", "sha1:SAJCI75DRHZZFGQORV66NB5FVWUAVLFH", "sha1:7RBZV2MCEM4TUGBBWGTFQAKTWUOGETZU", "sha1:234M32IMLZF7455AKOFWDP6HT6YXAYB4", "sha1:2BIZ7LLSHRK5GUGPZM2GM55APTDKBUG2", "sha1:OF7OI77ZT7ROXGMB6LL4RSRANX7REAYK", "sha1:LGPUOCOV3MKETI5F3IHVGZPD4M26NNJL", "sha1:SHIHUW7FJTP5YHFFV2JZ2CAHUVMKK7XG"],
  "file_id": [0, 0, 0, 0, 0, 0, 0, 0],
  "process_rank": 0
}
```
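
To reconstruct which documents a checkpoint has seen, the logs can be aggregated across GPU ranks; a minimal sketch (the folder name and the assumption of one JSON record per line are illustrative):

```python
import json
from pathlib import Path

# Illustrative: the folder is named after the checkpoint branch
log_dir = Path("iter-00420000-ckpt")

seen_ids = set()
for log_file in log_dir.glob("*.log"):
    with log_file.open() as f:
        for line in f:  # assumes one JSON record per line
            record = json.loads(line)
            seen_ids.update(record["data_id"])

print(f"{len(seen_ids)} unique document IDs seen up to this checkpoint")
```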

Note: Our earlier models from the paper, which do not include data logging, are available at:

* [LLäMmlein 1B prerelease](https://huggingface.co/LSX-UniWue/LLaMmlein_1B_prerelease)
* [LLäMmlein 120M prerelease](https://huggingface.co/LSX-UniWue/LLaMmlein_120M_prerelease)

### License

We release the LLäMmlein models under a research-only RAIL-M license. See [license.md](./license.md) for details.