---
library_name: transformers
language:
- en
license: apache-2.0
base_model: BEE-spoke-data/tFINE-680m-e32-d16-gqa-1024
tags:
- flan
- t5
- gqa
- instruct
datasets:
- pszemraj/flan-subsets-deduped
---
# tFINE-680m-e32-d16-gqa-flan
FLAN-tuned variant of a tFINE (t5) model with grouped-query attention (GQA):
- 32 encoder layers
- 16 decoder layers
- 1024 hidden size
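To double-check these dimensions, you can read them from the hub config; a minimal sketch assuming the standard T5 config attribute names (`num_layers`, `num_decoder_layers`, `d_model`):
```py
from transformers import AutoConfig

# read the architecture dimensions listed above from the hub config
cfg = AutoConfig.from_pretrained("BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan")
print(cfg.num_layers, cfg.num_decoder_layers, cfg.d_model)  # expected: 32 16 1024
```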
## Testing
Install the [transformers fork with GQA updates for t5](https://github.com/pszemraj/transformers/tree/t5-gqa) (⚠️ WIP 🚧):
```sh
pip install -U git+https://github.com/pszemraj/transformers.git@t5-gqa
```
Then run:
```py
# pip install -U git+https://github.com/pszemraj/transformers.git@t5-gqa
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# load the tokenizer and the GQA-enabled t5 model
tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan"
)

# run a simple instruction prompt through generate()
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=64, no_repeat_ngram_size=3)
print(
    tokenizer.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
)
```
## Quick eval
Quick eval for `BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan`:

`hf (pretrained=BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan,trust_remote_code=True,dtype=bfloat16)`, gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks |Version|Filter|n-shot| Metric | |Value | |Stderr|
|-------------|------:|------|-----:|--------|---|-----:|---|------|
|boolq | 2|none | 0|acc |↑ |0.7040|± |0.0080|
|openbookqa | 1|none | 0|acc |↑ |0.1580|± |0.0163|
| | |none | 0|acc_norm|↑ |0.2420|± |0.0192|
|piqa | 1|none | 0|acc |↑ |0.6132|± |0.0114|
| | |none | 0|acc_norm|↑ |0.6159|± |0.0113|
|social_iqa | 0|none | 0|acc |↑ |0.4319|± |0.0112|
|tinyArc | 0|none | 25|acc_norm|↑ |0.2898|± | N/A|
|tinyHellaswag| 0|none | 10|acc_norm|↑ |0.3295|± | N/A|
|tinyMMLU | 0|none | 0|acc_norm|↑ |0.2980|± | N/A|
|winogrande | 1|none | 0|acc |↑ |0.5020|± |0.0141|
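The table matches lm-evaluation-harness output; a hedged reproduction sketch using its Python API (assumes `lm-eval` >= 0.4 and the GQA fork above are installed; the model args and batch size mirror the header line, and the fewshot `tiny*` tasks are omitted here):
```py
import lm_eval

# re-run the zero-shot tasks from the table above
results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan,"
        "trust_remote_code=True,dtype=bfloat16"
    ),
    tasks=["boolq", "openbookqa", "piqa", "social_iqa", "winogrande"],
    batch_size=8,
)
print(results["results"])
```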
## Training and evaluation data
Trained on the `all` config of [pszemraj/flan-subsets-deduped](https://huggingface.co/datasets/pszemraj/flan-subsets-deduped).
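For reference, a minimal sketch of loading that config with 🤗 Datasets (dataset name and config taken from the card metadata):
```py
from datasets import load_dataset

# the 'all' config of the deduplicated FLAN subsets used for training
ds = load_dataset("pszemraj/flan-subsets-deduped", "all")
print(ds)
```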
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 8e-05
- train_batch_size: 4
- eval_batch_size: 2
- seed: 17868
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 32
- total_train_batch_size: 256
- total_eval_batch_size: 4
- optimizer: paged_ademamix_32bit (no additional optimizer arguments)
- lr_scheduler_type: constant_with_warmup
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 1.0
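Purely as an illustration, these settings roughly map onto `Seq2SeqTrainingArguments` as sketched below (the `output_dir` is a placeholder; the optimizer string is taken verbatim from the list above):
```py
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="tFINE-680m-e32-d16-gqa-flan",  # placeholder
    learning_rate=8e-5,
    per_device_train_batch_size=4,   # x 2 GPUs x 32 grad accum = 256 effective
    per_device_eval_batch_size=2,    # x 2 GPUs = 4 effective
    gradient_accumulation_steps=32,
    seed=17868,
    optim="paged_ademamix_32bit",
    lr_scheduler_type="constant_with_warmup",
    warmup_ratio=0.05,
    num_train_epochs=1.0,
)
```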