nicholasKluge committed on
Commit
b2a8af5
·
1 Parent(s): c60dcfe

Update README.md

  1. README.md +166 -19
README.md CHANGED
@@ -1,29 +1,176 @@
- ## nicholasKluge/TeenyTinyLlama-162m-Assin2

- | Epoch | Training Loss | Validation Loss | Accuracy |
- |-------|---------------|------------------|----------|
- | 1 | No log | 0.378027 | 0.846405 |
- | 2 | 0.352600 | 0.474960 | 0.849265 |
- | 3 | 0.148100 | 0.575100 | 0.857843 |

- ## neuralmind/bert-base-portuguese-cased

- | Epoch | Training Loss | Validation Loss | Accuracy |
- |-------|---------------|------------------|----------|
- | 1 | No log | 0.341371 | 0.872958 |
- | 2 | 0.349900 | 0.429437 | 0.870098 |
- | 3 | 0.168700 | 0.578217 | 0.874592 |

- ## neuralmind/bert-large-portuguese-cased

- | Epoch | Training Loss | Validation Loss | Accuracy |
- |-------|---------------|------------------|----------|
- | 1 | No log | 0.329105 | 0.873775 |
- | 2 | 0.337000 | 0.403772 | 0.876634 |
- | 3 | 0.151400 | 0.563161 | 0.889706 |

- ## pierreguillou/gpt2-small-portuguese
+ ---
+ license: apache-2.0
+ datasets:
+ - assin2
+ language:
+ - pt
+ metrics:
+ - accuracy
+ library_name: transformers
+ pipeline_tag: text-classification
+ tags:
+ - textual-entailment
+ widget:
+ - text: "<s>Qual a capital do Brasil?<s>A capital do Brasil é Brasília!</s>"
+   example_title: Exemplo
+ - text: "<s>Qual a capital do Brasil?<s>Anões são muito mais legais do que elfos!</s>"
+   example_title: Exemplo
+ ---
+ # TeenyTinyLlama-162m-Assin2

+ TeenyTinyLlama is a series of small foundational models trained on Brazilian Portuguese text.

+ This repository contains a version of [TeenyTinyLlama-162m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-162m) (`TeenyTinyLlama-162m-Assin2`) fine-tuned on the [Assin2](https://huggingface.co/datasets/assin2) dataset.

+ ## Details

+ - **Number of Epochs:** 3
+ - **Batch size:** 16
+ - **Optimizer:** `torch.optim.AdamW` (learning_rate = 4e-5, epsilon = 1e-8)
+ - **GPU:** 1 NVIDIA A100-SXM4-40GB

+ ## Usage

+ Using `transformers.pipeline`:

+ ```python
+ from transformers import pipeline

+ text = "<s>Qual a capital do Brasil?<s>A capital do Brasil é Brasília!</s>"

+ classifier = pipeline("text-classification", model="nicholasKluge/TeenyTinyLlama-162m-Assin2")
+ classifier(text)

+ # >>> [{'label': 'ENTAILED', 'score': 0.9774010181427002}]
+ ```
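
The example input above joins a premise and a hypothesis into a single string using the tokenizer's special tokens, the same layout the fine-tuning script builds from `tokenizer.bos_token` and `tokenizer.eos_token`. A minimal sketch of that formatting step, assuming the BOS and EOS tokens are the literal strings `<s>` and `</s>` seen in the example text (`format_pair` is a hypothetical helper name, not part of this repository):

```python
def format_pair(premise: str, hypothesis: str, bos: str = "<s>", eos: str = "</s>") -> str:
    """Join a premise/hypothesis pair as BOS + premise + BOS + hypothesis + EOS."""
    return f"{bos}{premise}{bos}{hypothesis}{eos}"


# Rebuild the example string that is passed to the classifier above.
text = format_pair("Qual a capital do Brasil?", "A capital do Brasil é Brasília!")
print(text)
```

If the tokenizer's special tokens differ, pass them explicitly through `bos` and `eos` rather than relying on the defaults.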
+
+ ## Reproducing
+
+ To reproduce the fine-tuning process, use the following code snippet:
+
+ ```python
+ # Assin2
+ # Notebook shell command; in a terminal, run it without the leading "!"
+ ! pip install transformers datasets evaluate accelerate -q
+
+ import evaluate
+ import numpy as np
+ from datasets import load_dataset, Dataset, DatasetDict
+ from transformers import AutoTokenizer, DataCollatorWithPadding
+ from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
+
+ # Load the task
+ dataset = load_dataset("assin2")
+
+ # Create a `ModelForSequenceClassification`
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "nicholasKluge/TeenyTinyLlama-162m",
+     num_labels=2,
+     id2label={0: "UNENTAILED", 1: "ENTAILED"},
+     label2id={"UNENTAILED": 0, "ENTAILED": 1}
+ )
+
+ tokenizer = AutoTokenizer.from_pretrained("nicholasKluge/TeenyTinyLlama-162m")
+
+ # Format the dataset: BOS + premise + BOS + hypothesis + EOS
+ train = dataset['train'].to_pandas()
+ train['text'] = tokenizer.bos_token + train['premise'] + tokenizer.bos_token + train['hypothesis'] + tokenizer.eos_token
+ train = train[["text", "entailment_judgment"]]
+ train.columns = ['text', 'label']
+ train['label'] = train['label'].astype(int)
+ train = Dataset.from_pandas(train)
+
+ test = dataset['test'].to_pandas()
+ test['text'] = tokenizer.bos_token + test['premise'] + tokenizer.bos_token + test['hypothesis'] + tokenizer.eos_token
+ test = test[["text", "entailment_judgment"]]
+ test.columns = ['text', 'label']
+ test['label'] = test['label'].astype(int)
+ test = Dataset.from_pandas(test)
+
+ dataset = DatasetDict({
+     "train": train,
+     "test": test
+ })
+
+ # Preprocess the dataset
+ def preprocess_function(examples):
+     return tokenizer(examples["text"], truncation=True)
+
+ dataset_tokenized = dataset.map(preprocess_function, batched=True)
+
+ # Create a simple data collator
+ data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
+
+ # Use accuracy as evaluation metric
+ accuracy = evaluate.load("accuracy")
+
+ # Function to compute accuracy
+ def compute_metrics(eval_pred):
+     predictions, labels = eval_pred
+     predictions = np.argmax(predictions, axis=1)
+     return accuracy.compute(predictions=predictions, references=labels)
+
+ # Define training arguments
+ training_args = TrainingArguments(
+     output_dir="checkpoints",
+     learning_rate=4e-5,
+     per_device_train_batch_size=16,
+     per_device_eval_batch_size=16,
+     num_train_epochs=3,
+     weight_decay=0.01,
+     evaluation_strategy="epoch",
+     save_strategy="epoch",
+     load_best_model_at_end=True,
+     push_to_hub=True,
+     hub_token="your_token_here",
+     hub_model_id="username/model-ID",
+ )
+
+ # Define the Trainer
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=dataset_tokenized["train"],
+     eval_dataset=dataset_tokenized["test"],
+     tokenizer=tokenizer,
+     data_collator=data_collator,
+     compute_metrics=compute_metrics,
+ )
+
+ # Train!
+ trainer.train()
+ ```
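
The `compute_metrics` function in the script reduces each row of logits to its highest-scoring class index and compares that index with the reference label. A dependency-free sketch of the same computation, for intuition only (`argmax_accuracy` and the sample values are hypothetical; the script itself relies on `numpy` and `evaluate`):

```python
from typing import Sequence


def argmax_accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Fraction of rows whose highest-scoring class index equals the label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    # Index of the largest value in each row, i.e. the predicted class.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# Three made-up examples with two classes (0 = UNENTAILED, 1 = ENTAILED).
logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3]]
labels = [1, 0, 0]
print(argmax_accuracy(logits, labels))
```

Here the first two rows are classified correctly and the third is not, so two of the three predictions match their labels.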
+
+ ## Fine-Tuning Comparisons
+
+ | Models                                                                                       | [Assin2](https://huggingface.co/datasets/assin2) accuracy (%) |
+ |----------------------------------------------------------------------------------------------|---------------------------------------------------------------|
+ | [Teeny Tiny Llama 162m](https://huggingface.co/nicholasKluge/TeenyTinyLlama-162m)            | 85.78                                                         |
+ | [Bert-base-portuguese-cased](https://huggingface.co/neuralmind/bert-base-portuguese-cased)   | 87.45                                                         |
+ | [Bert-large-portuguese-cased](https://huggingface.co/neuralmind/bert-large-portuguese-cased) | 88.97                                                         |
+ | [Gpt2-small-portuguese](https://huggingface.co/pierreguillou/gpt2-small-portuguese)          | 86.11                                                         |
+
+ ## Cite as 🤗
+
+ ```latex
+
+ @misc{nicholas22llama,
+   doi = {10.5281/zenodo.6989727},
+   url = {https://huggingface.co/nicholasKluge/TeenyTinyLlama-162m},
+   author = {Nicholas Kluge Corrêa},
+   title = {TeenyTinyLlama},
+   year = {2023},
+   publisher = {HuggingFace},
+   journal = {HuggingFace repository},
+ }
+
+ ```
+
+ ## Funding
+
+ This repository was built as part of the RAIES ([Rede de Inteligência Artificial Ética e Segura](https://www.raies.org/)) initiative, a project supported by FAPERGS ([Fundação de Amparo à Pesquisa do Estado do Rio Grande do Sul](https://fapergs.rs.gov.br/inicial)), Brazil.
+
+ ## License
+
+ TeenyTinyLlama-162m-Assin2 is licensed under the Apache License, Version 2.0. See the [LICENSE](LICENSE) file for more details.