---

datasets:
- louisbrulenaudet/Romulus-cpt-fr
license: llama3
language:
- fr
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- law
- droit
- unsloth
- trl
- transformers
- sft
- llama

---

[![QuantFactory Banner](https://lh7-rt.googleusercontent.com/docsz/AD_4nXeiuCm7c8lEwEJuRey9kiVZsRn2W-b4pWlu3-X534V3YmVuVc2ZL-NXg2RkzSOOS2JXGHutDuyyNAUtdJI65jGTo8jT9Y99tMi4H4MqL44Uc5QKG77B0d6-JfIkZHFaUA71-RtjyYZWVIhqsNZcx8-OMaA?key=xt3VSDoCbmTY7o-cwwOFwQ)](https://hf.co/QuantFactory)


# QuantFactory/Romulus-cpt-Llama-3.1-8B-v0.1-GGUF
This is a quantized version of [louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1](https://huggingface.co/louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1), created using llama.cpp.

# Original Model Card

<img src="assets/thumbnail.webp">

# Romulus, continually pre-trained models for French law.

Romulus is a series of continually pre-trained models enriched with French law and intended to serve as the basis for fine-tuning on labeled data. Please note that these models have not been aligned to produce usable text as they stand; they will need to be fine-tuned for the desired tasks in order to produce satisfactory results.

The training corpus contains 34,864,949 tokens (calculated with the meta-llama/Meta-Llama-3.1-8B tokenizer).
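A token count like the one above can be obtained by summing the lengths of the encoded texts. The sketch below is only illustrative: `count_tokens` and `whitespace_encode` are hypothetical helpers, and the whitespace encoder is a stand-in so the example runs offline. To approximate the reported figure, one would presumably pass the `encode` method of the meta-llama/Meta-Llama-3.1-8B tokenizer instead (a gated model), keeping in mind that special tokens such as BOS change the count.

```python
from typing import Callable, Iterable, List


def count_tokens(texts: Iterable[str], encode: Callable[[str], List[int]]) -> int:
    """Sum the number of token ids that `encode` produces over all texts."""
    return sum(len(encode(text)) for text in texts)


def whitespace_encode(text: str) -> List[int]:
    """Stand-in encoder (assumption): one dummy id per whitespace-separated word."""
    return list(range(len(text.split())))


# Hypothetical sample texts, not taken from the training corpus.
sample_texts = ["premier texte d'exemple", "second texte"]
total_tokens = count_tokens(sample_texts, whitespace_encode)
print(total_tokens)
```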

## Hyperparameters

The following table outlines the key hyperparameters used for training Romulus.

| **Parameter**                   | **Description**                                                 | **Value**                   |
|----------------------------------|-----------------------------------------------------------------|-----------------------------|
| `max_seq_length`                 | Maximum sequence length for the model                           | 4096                        |
| `load_in_4bit`                   | Whether to load the model in 4-bit precision                    | False                       |
| `model_name`                     | Pre-trained model name from Hugging Face                        | meta-llama/Meta-Llama-3.1-8B|
| `r`                              | Rank of the LoRA adapter                                        | 128                         |
| `lora_alpha`                     | Alpha value for the LoRA module                                 | 32                          |
| `lora_dropout`                   | Dropout rate for LoRA layers                                    | 0                           |
| `bias`                           | Bias type for LoRA adapters                                     | none                        |
| `use_gradient_checkpointing`     | Whether to use gradient checkpointing                           | unsloth                     |
| `train_batch_size`               | Per device training batch size                                  | 8                           |
| `gradient_accumulation_steps`    | Number of gradient accumulation steps                           | 8                           |
| `warmup_ratio`                   | Warmup steps as a fraction of total steps                       | 0.1                         |
| `num_train_epochs`               | Number of training epochs                                       | 1                           |
| `learning_rate`                  | Learning rate for the model                                     | 5e-5                        |
| `embedding_learning_rate`        | Learning rate for embeddings                                    | 1e-5                        |
| `optim`                          | Optimizer used for training                                     | adamw_8bit                  |
| `weight_decay`                   | Weight decay to prevent overfitting                             | 0.01                        |
| `lr_scheduler_type`              | Type of learning rate scheduler                                 | linear                      |
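Two quantities follow from these values but are not listed in the table: the effective batch size per optimizer step and the LoRA scaling factor. The sketch below works them out, assuming a single training device (the card mentions one H100 instance but does not state the device count) and assuming the rank-stabilized scaling `lora_alpha / sqrt(r)`, since the training script sets `use_rslora=True`. Variable names are illustrative.

```python
import math

# Values copied from the hyperparameter table.
per_device_batch_size = 8
gradient_accumulation_steps = 8
max_seq_length = 4096
r = 128
lora_alpha = 32

# Assumption: a single device, so no multiplication by a world size.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps  # 64

# Upper bound on tokens per optimizer step (reached only if every sequence is full).
max_tokens_per_step = effective_batch_size * max_seq_length  # 262144

# Classic LoRA scales the adapter update by alpha / r.
standard_scaling = lora_alpha / r  # 0.25

# Rank-stabilized LoRA (rsLoRA) scales by alpha / sqrt(r) instead.
rslora_scaling = lora_alpha / math.sqrt(r)  # about 2.83

print(effective_batch_size, max_tokens_per_step)
print(standard_scaling, round(rslora_scaling, 2))
```

With a rank as high as 128, the two scaling conventions differ by more than a factor of ten, so the `use_rslora` flag matters when reproducing this configuration.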

## Training script

Romulus was trained with Unsloth on an NVIDIA H100 Azure EST US instance, provided by the Microsoft for Startups program, using the following script:

```python
# -*- coding: utf-8 -*-
import os

from typing import (
    Dict,
)

from datasets import load_dataset
from unsloth import (
    FastLanguageModel,
    is_bfloat16_supported,
    UnslothTrainer,
    UnslothTrainingArguments,
)

max_seq_length = 4096
dtype = None
load_in_4bit = False

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    token="hf_token",
)

model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "embed_tokens",
        "lm_head",
    ],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=True,
    loftq_config=None,
)

prompt = """### Référence :
{}
### Contenu :
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """
    Format input examples into prompts for a language model.

    This function takes a dictionary of examples containing references and texts,
    combines them into formatted prompts, and appends an end-of-sequence token.

    Parameters
    ----------
    examples : dict
        A dictionary containing two keys:
        - 'ref': A list of references.
        - 'texte': A list of corresponding text content.

    Returns
    -------
    dict
        A dictionary with a single key 'text', containing a list of formatted prompts.

    Notes
    -----
    - The function assumes the existence of a global `prompt` variable, which is a
      formatting string used to combine the reference and text.
    - The function also assumes the existence of a global `EOS_TOKEN` variable,
      which is appended to the end of each formatted prompt.
    - The input lists 'ref' and 'texte' are expected to have the same length.

    Examples
    --------
    >>> examples = {
    ...     'ref': ['Reference 1', 'Reference 2'],
    ...     'texte': ['Content 1', 'Content 2']
    ... }
    >>> formatting_prompts_func(examples)
    {'text': ['<formatted_prompt_1><EOS>', '<formatted_prompt_2><EOS>']}
    """
    refs = examples["ref"]
    texts = examples["texte"]
    outputs = []

    for ref, text in zip(refs, texts):
        text = prompt.format(ref, text) + EOS_TOKEN
        outputs.append(text)

    return {
        "text": outputs,
    }


cpt_dataset = load_dataset(
    "louisbrulenaudet/Romulus-cpt-fr",
    split="train",
    token="hf_token",
)

cpt_dataset = cpt_dataset.map(
    formatting_prompts_func,
    batched=True,
)

trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=cpt_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,
        warmup_ratio=0.1,
        num_train_epochs=1,
        learning_rate=5e-5,
        embedding_learning_rate=1e-5,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        report_to="wandb",
        save_steps=350,
        run_name="romulus-cpt",
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

trainer_stats = trainer.train()
```
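To make the training format concrete, the sketch below applies the same template as the script to a single record, without loading the model. The record contents and the `format_example` helper are hypothetical, and the EOS string is an assumption standing in for `tokenizer.eos_token`.

```python
# Same template as in the training script.
prompt = """### Référence :
{}
### Contenu :
{}"""

# Assumption: stand-in for tokenizer.eos_token; the real value comes from the tokenizer.
EOS_TOKEN = "<|end_of_text|>"


def format_example(ref: str, texte: str, eos_token: str = EOS_TOKEN) -> str:
    """Fill the template with one reference/content pair and append the EOS token."""
    return prompt.format(ref, texte) + eos_token


# Hypothetical record, not taken from the dataset.
formatted = format_example("Référence fictive", "Contenu fictif.")
print(formatted)
```

Since the model was continually pre-trained on text in this layout, prompts that begin with `### Référence :` followed by a reference and `### Contenu :` are likely to match what it has seen most closely.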

<img src="assets/loss.png">

## Citing & Authors

If you use this code in your research, please cite it using the following BibTeX entry.

```BibTeX
@misc{louisbrulenaudet2024,
  author =       {Louis Brulé Naudet},
  title =        {Romulus, continually pre-trained models for French law},
  year =         {2024},
  howpublished = {\url{https://huggingface.co/datasets/louisbrulenaudet/Romulus-cpt-fr}},
}
```

## Feedback

If you have any feedback, please reach out at [[email protected]](mailto:[email protected]).