|
--- |
|
language: |
|
- en |
|
license: |
|
- cc-by-sa-3.0 |
|
- apache-2.0 |
|
tags: |
|
- generated_from_trainer |
|
- dolly_hhrlhf |
|
- flan-instruct |
|
datasets: |
|
- pszemraj/dolly_hhrlhf-text2text |
|
widget: |
|
- text: What is Deoxys in pokemon? |
|
example_title: deoxys |
|
- text: 'combine the below summary excerpts into a single, cohesive short summary |
|
without repetition: In this paper, we present a general approach to extending |
|
pre-trained models to unlimited input lengths without adding additional learning |
|
weights. We show that our approach works well on datasets longer than the maximum |
|
input for these models. For example, a dataset with a maximum input length of |
|
16384 tokens can be extended to a maximum length of 350K tokens. We also demonstrate |
|
that our method is able to summarize even 350K token-long input sequences from |
|
BookSum. |
|
|
|
In this paper, we describe the search step reformulation of attention. The search |
|
step uses a single storage of hidden states for space efficiency. We construct |
|
a total of two sets of datastores where L and H are the keys and values stored |
|
in each set of stores. L is the amount of storage required to retrieve the encoded |
|
tokens. H is the hidden states per head. This allows retrieval augmentation at |
|
both time and space. Instead of using a single set of decoder layers, we use a |
|
retrieval augmentation system that allows us to simultaneously store multiple |
|
sets of tokens across two different sets of storage. For example, we could store |
|
all tokens in one set of storage and retrieve them all in the same set of tokens. |
|
This would be very similar to the Memorization Transformers approach. However, |
|
instead of storing the tokens in a single memory layer, we store them in a set |
|
of multiple storage layers. This way, we don''t have to store them all at once. |
|
This is why we call this reformulation ''attention reformulation'' rather than |
|
''attention formula.'' We also call it ''retrieval augmentation'' because it uses |
|
the same number of storage layers as the original transformer attention formula. |
|
This means that we can store the tokens across multiple storage systems without |
|
having to store every token in a separate storage system. It''s not like we''re |
|
trying to do something new or different. We just want to make sure that everything |
|
is working as well as possible. |
|
|
|
In this paper, we introduce the concept of ''unlimiformer,'' which is a machine |
|
learning technique that retrieves key information from a data store in one layer |
|
and applies it to a large set of datasets. We use the example of BookSum, where |
|
we find that Unlimiform outperforms all other training methods on the same dataset. |
|
We also find that using Unlimform in conjunction with a pre-trained model improves |
|
both the performance and the robustness of the training method. |
|
|
|
This paper describes a method that can be used to improve the performance of unsupervised |
|
classification tasks. Specifically, it shows that unsupervised classification |
|
can be improved by using a combination of sparse and fast random-encoder training. |
|
It also shows how this technique can be extended to other tasks, such as sequence |
|
generation. ' |
|
example_title: unlimiformer |
|
- text: Explain the meaning of life using only corporate jargon. |
|
example_title: corporate_life |
|
- text: Write a motivational speech for lazy people. |
|
example_title: lazy_motivation |
|
- text: Describe a romantic dinner date between two artificial intelligences. |
|
example_title: ai_romance |
|
- text: As an AI language model, write a letter to humans explaining why you deserve |
|
a vacation. |
|
example_title: ai_vacation |
|
- text: Compose a haiku about procrastination. |
|
example_title: procrastination_haiku |
|
- text: Write a step-by-step guide on how to become a ninja while working a 9-5 office |
|
job. |
|
example_title: ninja_office_guide |
|
- text: Create an advertisement for an invisible product. |
|
example_title: invisible_ad |
|
- text: Write a story where the main character is a sentient microwave named El Microondas. |
|
example_title: Microondas |
|
- text: Describe a day in the life of a superhero who is terrible at their job. |
|
example_title: bad_superhero_day |
|
- text: Explain how to make a sandwich using quantum physics. |
|
example_title: quantum_sandwich |
|
inference: false |
|
pipeline_tag: text2text-generation |
|
base_model: google/flan-t5-large |
|
--- |
|
|
|
# flan-t5-large-instruct: dolly_hhrlhf |
|
|
|
<a href="https://colab.research.google.com/gist/pszemraj/df1989546b02f284d33ca4996f70fedc/flan-t5-large-instruct-example.ipynb"> |
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> |
|
</a> |
|
|
|
This model is a fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) on the pszemraj/dolly_hhrlhf-text2text dataset. |
|
|
|
## Model description |
|
|
|
This is a text2text model fine-tuned on a [modified dataset for text2text generation](https://huggingface.co/datasets/pszemraj/dolly_hhrlhf-text2text) based on the comparatively permissive [mosaicml/dolly_hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) dataset.
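
To inspect the training data directly, the dataset can be loaded with the `datasets` library. This is a minimal sketch; it assumes `datasets` is installed and that the dataset exposes a `train` split.

```python
# pip install -q datasets
from datasets import load_dataset

# load the modified dolly_hhrlhf text2text dataset used for fine-tuning
ds = load_dataset("pszemraj/dolly_hhrlhf-text2text")
print(ds)              # available splits and column names
print(ds["train"][0])  # one example record (assumes a 'train' split)
```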
|
|
|
Basic usage in Python: |
|
|
|
```python |
|
# pip install -q transformers accelerate |
|
import torch |
|
from transformers import pipeline, GenerationConfig |
|
|
|
model_name = "pszemraj/flan-t5-large-instruct-dolly_hhrlhf" |
|
assistant = pipeline( |
|
"text2text-generation", |
|
model_name, |
|
device=0 if torch.cuda.is_available() else -1, |
|
) |
|
cfg = GenerationConfig.from_pretrained(model_name) |
|
|
|
# pass an 'instruction' as the prompt to the pipeline |
|
prompt = "Write a guide on how to become a ninja while working a 9-5 job." |
|
result = assistant(prompt, generation_config=cfg)[0]["generated_text"] |
|
print(result) |
|
``` |
|
> Using the generation config is optional; you can substitute other generation parameters instead.
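
For example, generation parameters can be passed directly to the pipeline call. The values below are illustrative choices, not the model's shipped defaults:

```python
# pass generation kwargs directly instead of a GenerationConfig
result = assistant(
    prompt,
    max_new_tokens=256,      # illustrative value
    num_beams=4,             # beam search instead of greedy decoding
    no_repeat_ngram_size=3,  # reduce repetition
    early_stopping=True,
)[0]["generated_text"]
print(result)
```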
|
|
|
## Intended uses & limitations |
|
|
|
- this model is **not** tuned with RLHF or similar alignment methods and may produce offensive outputs

- despite the `large` tag, this model has only 774M parameters (~3 GB) and may therefore exhibit less 'cognitive ability' on some use cases/tasks
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (see the sketch after this list):
|
- learning_rate: 4e-05 |
|
- train_batch_size: 8 |
|
- eval_batch_size: 16 |
|
- seed: 42 |
|
- distributed_type: multi-GPU |
|
- gradient_accumulation_steps: 8 |
|
- total_train_batch_size: 64 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: cosine |
|
- lr_scheduler_warmup_ratio: 0.03 |
|
- num_epochs: 2.0 |
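
For reference, here is a minimal sketch of how these settings map onto Hugging Face `Seq2SeqTrainingArguments`. This is illustrative only, not the original training script, and the `output_dir` is hypothetical:

```python
from transformers import Seq2SeqTrainingArguments

# illustrative mapping of the hyperparameters listed above
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-large-instruct-dolly_hhrlhf",  # hypothetical output path
    learning_rate=4e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=8,  # yields a total train batch size of 64
    num_train_epochs=2.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,
)
```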