---
library_name: transformers
language:
- en
license: apache-2.0
base_model: BEE-spoke-data/tFINE-680m-e32-d16-gqa-1024
tags:
- flan
- t5
- gqa
- instruct
datasets:
- pszemraj/flan-subsets-deduped
---


# tFINE-680m-e32-d16-gqa-flan

A FLAN-tuned variant of a tFINE (t5-architecture) model with grouped-query attention (GQA).

- 32 encoder layers
- 16 decoder layers
- 1024 hidden size
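
These values can be double-checked from the model config. A minimal sketch, assuming the standard `T5Config` attribute names carry over in the GQA fork (installed as described in the Testing section below):

```py
from transformers import AutoConfig

# standard T5Config attribute names; assumed unchanged in the GQA fork
cfg = AutoConfig.from_pretrained("BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan")
print(cfg.num_layers)          # encoder layers -> 32
print(cfg.num_decoder_layers)  # decoder layers -> 16
print(cfg.d_model)             # hidden size -> 1024
```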

## Testing


Install the [transformers fork with GQA updates for t5](https://github.com/pszemraj/transformers/tree/t5-gqa) (⚠️WIP🚧):

```sh
pip install -U git+https://github.com/pszemraj/transformers.git@t5-gqa
```

then run:

```py
# pip install -U git+https://github.com/pszemraj/transformers.git@t5-gqa
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan"
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# greedy decoding by default; cap new tokens and block repeated trigrams
generated_ids = model.generate(**inputs, max_new_tokens=64, no_repeat_ngram_size=3)
print(
    tokenizer.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
)
```

## Quick eval

Quick eval for `BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan`


hf (pretrained=BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan,trust_remote_code=True,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8

|    Tasks    |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|------|
|boolq        |      2|none  |     0|acc     |↑  |0.7040|±  |0.0080|
|openbookqa   |      1|none  |     0|acc     |↑  |0.1580|±  |0.0163|
|             |       |none  |     0|acc_norm|↑  |0.2420|±  |0.0192|
|piqa         |      1|none  |     0|acc     |↑  |0.6132|±  |0.0114|
|             |       |none  |     0|acc_norm|↑  |0.6159|±  |0.0113|
|social_iqa   |      0|none  |     0|acc     |↑  |0.4319|±  |0.0112|
|tinyArc      |      0|none  |    25|acc_norm|↑  |0.2898|±  |   N/A|
|tinyHellaswag|      0|none  |    10|acc_norm|↑  |0.3295|±  |   N/A|
|tinyMMLU     |      0|none  |     0|acc_norm|↑  |0.2980|±  |   N/A|
|winogrande   |      1|none  |     0|acc     |↑  |0.5020|±  |0.0141|
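
The numbers above come from lm-evaluation-harness with the settings shown in the header; a command along these lines should reproduce them. The task list is taken directly from the table, the remaining flags are assumptions (the few-shot counts for the `tiny*` tasks come from the task definitions themselves, since the header reports `num_fewshot: None`):

```sh
# lm-evaluation-harness; exact version and flags may differ from the original run
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan,dtype=bfloat16,trust_remote_code=True \
  --tasks boolq,openbookqa,piqa,social_iqa,tinyArc,tinyHellaswag,tinyMMLU,winogrande \
  --batch_size 8
```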

## Training and evaluation data

Trained on [`pszemraj/flan-subsets-deduped`](https://huggingface.co/datasets/pszemraj/flan-subsets-deduped) using the `all` config.
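
A minimal sketch for pulling the same data with `datasets` (the config name comes from the note above; split handling is not specified here):

```py
from datasets import load_dataset

# config name "all" as noted above
ds = load_dataset("pszemraj/flan-subsets-deduped", "all")
print(ds)
```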

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 8e-05
- train_batch_size: 4
- eval_batch_size: 2
- seed: 17868
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 32
- total_train_batch_size: 256
- total_eval_batch_size: 4
- optimizer: paged_ademamix_32bit (no additional optimizer arguments)
- lr_scheduler_type: constant_with_warmup
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 1.0
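
For reference, a rough sketch of how these settings map onto `Seq2SeqTrainingArguments`. This is a reconstruction from the list above, not the original training script; fields such as `output_dir` are placeholders and anything not listed (precision, logging, etc.) is omitted:

```py
from transformers import Seq2SeqTrainingArguments

# reconstruction of the hyperparameters listed above (not the actual training script)
training_args = Seq2SeqTrainingArguments(
    output_dir="tFINE-680m-e32-d16-gqa-flan",  # placeholder
    learning_rate=8e-5,
    per_device_train_batch_size=4,   # x 32 grad accum x 2 GPUs = 256 total train batch
    per_device_eval_batch_size=2,    # x 2 GPUs = 4 total eval batch
    gradient_accumulation_steps=32,
    seed=17868,
    optim="paged_ademamix_32bit",
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.05,
    num_train_epochs=1.0,
)
```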