# tFINE-680m-e32-d16-gqa-flan

A FLAN-tuned variant of a tFINE (T5-architecture) model with grouped-query attention (GQA).
- 32 encoder layers
- 16 decoder layers
- 1024 hidden size
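As a quick illustration of what GQA does, the sketch below shows query heads sharing a smaller set of key/value heads, which shrinks the KV cache. The head counts here are made up for the example and are not this model's actual configuration:

```python
import numpy as np

# Illustrative GQA head grouping (head counts are hypothetical, not this model's config):
# each group of query heads attends with the same shared K/V head.
num_q_heads, num_kv_heads, head_dim, seq = 8, 2, 4, 3
group = num_q_heads // num_kv_heads  # query heads per KV head

rng = np.random.default_rng(0)
q = rng.standard_normal((num_q_heads, seq, head_dim))
k = rng.standard_normal((num_kv_heads, seq, head_dim))  # only 2 KV heads stored
v = rng.standard_normal((num_kv_heads, seq, head_dim))

# Expand KV heads so each group of query heads reuses the same K/V
k_exp = np.repeat(k, group, axis=0)  # (num_q_heads, seq, head_dim)
v_exp = np.repeat(v, group, axis=0)

# Standard scaled dot-product attention over the expanded K/V
scores = q @ k_exp.transpose(0, 2, 1) / np.sqrt(head_dim)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
out = weights @ v_exp
print(out.shape)  # (8, 3, 4): full query-head output from a quarter of the KV state
```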
## Testing

Install the `transformers` fork with GQA updates for T5 (⚠️WIP🚧):

```sh
pip install -U git+https://github.com/pszemraj/transformers.git@t5-gqa
```

Then:
```python
# pip install -U git+https://github.com/pszemraj/transformers.git@t5-gqa
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan"
)

prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=64, no_repeat_ngram_size=3)
print(
    tokenizer.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]
)
```
## Quick eval

Quick eval for: `BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan`

hf (pretrained=BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan,trust_remote_code=True,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 8
| Tasks         | Version | Filter | n-shot | Metric   |   | Value  |   | Stderr |
|---------------|---------|--------|--------|----------|---|--------|---|--------|
| boolq         | 2       | none   | 0      | acc      | ↑ | 0.7040 | ± | 0.0080 |
| openbookqa    | 1       | none   | 0      | acc      | ↑ | 0.1580 | ± | 0.0163 |
|               |         | none   | 0      | acc_norm | ↑ | 0.2420 | ± | 0.0192 |
| piqa          | 1       | none   | 0      | acc      | ↑ | 0.6132 | ± | 0.0114 |
|               |         | none   | 0      | acc_norm | ↑ | 0.6159 | ± | 0.0113 |
| social_iqa    | 0       | none   | 0      | acc      | ↑ | 0.4319 | ± | 0.0112 |
| tinyArc       | 0       | none   | 25     | acc_norm | ↑ | 0.2898 | ± | N/A    |
| tinyHellaswag | 0       | none   | 10     | acc_norm | ↑ | 0.3295 | ± | N/A    |
| tinyMMLU      | 0       | none   | 0      | acc_norm | ↑ | 0.2980 | ± | N/A    |
| winogrande    | 1       | none   | 0      | acc      | ↑ | 0.5020 | ± | 0.0141 |
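The header line above is lm-evaluation-harness output, so results of this shape can be reproduced with a command along these lines (the exact task list and invocation used for this card are an assumption; per-task shot counts like tinyArc's 25-shot come from the task defaults):

```sh
# Hypothetical reconstruction of the eval command, not the author's exact invocation
lm_eval --model hf \
  --model_args pretrained=BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan,trust_remote_code=True,dtype=bfloat16 \
  --tasks boolq,openbookqa,piqa,social_iqa,tinyArc,tinyHellaswag,tinyMMLU,winogrande \
  --batch_size 8
```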
## Training and evaluation data

Used config `'all'`.
## Training procedure

### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 8e-05
- train_batch_size: 4
- eval_batch_size: 2
- seed: 17868
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 32
- total_train_batch_size: 256
- total_eval_batch_size: 4
- optimizer: paged_ademamix_32bit (no additional optimizer arguments)
- lr_scheduler_type: constant_with_warmup
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 1.0
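The reported `total_train_batch_size` follows directly from the per-device batch size, gradient accumulation, and device count; a one-line sanity check:

```python
# Effective batch size from the hyperparameters listed above
train_batch_size = 4             # per-device
gradient_accumulation_steps = 32
num_devices = 2

total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print(total_train_batch_size)  # 256, matching the reported total_train_batch_size
```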
## Model tree for BEE-spoke-data/tFINE-680m-e32-d16-gqa-flan

Base model: `BEE-spoke-data/tFINE-680m-e32-d16-gqa-1024`