---
license:
- cc-by-sa-3.0
- apache-2.0
tags:
- generated_from_trainer
- dolly_hhrlhf
- flan-instruct
datasets:
- pszemraj/dolly_hhrlhf-text2text
widget:
- text: What is Deoxys in pokemon?
  example_title: deoxys
- text: >-
    combine the below summary excerpts into a single, cohesive  short summary
    without repetition: In this paper, we present a general approach to
    extending pre-trained models to unlimited input lengths without adding
    additional learning weights. We show that our approach works well on
    datasets longer than the maximum input for these models. For example, a
    dataset with a maximum input length of 16384 tokens can be extended to a
    maximum length of 350K tokens. We also demonstrate that our method is able
    to summarize even 350K token-long input sequences from BookSum.

    In this paper, we describe the search step reformulation of attention. The
    search step uses a single storage of hidden states for space efficiency. We
    construct a total of two sets of datastores where L and H are the keys and
    values stored in each set of stores. L is the amount of storage required to
    retrieve the encoded tokens. H is the hidden states per head. This allows
    retrieval augmentation at both time and space. Instead of using a single set
    of decoder layers, we use a retrieval augmentation system that allows us to
    simultaneously store multiple sets of tokens across two different sets of
    storage. For example, we could store all tokens in one set of storage and
    retrieve them all in the same set of tokens. This would be very similar to
    the Memorization Transformers approach. However, instead of storing the
    tokens in a single memory layer, we store them in a set of multiple storage
    layers. This way, we don't have to store them all at once. This is why we
    call this reformulation 'attention reformulation' rather than 'attention
    formula.' We also call it 'retrieval augmentation' because it uses the same
    number of storage layers as the original transformer attention formula. This
    means that we can store the tokens across multiple storage systems without
    having to store every token in a separate storage system. It's not like
    we're trying to do something new or different. We just want to make sure
    that everything is working as well as possible.

    In this paper, we introduce the concept of 'unlimiformer,' which is a
    machine learning technique that retrieves key information from a data store
    in one layer and applies it to a large set of datasets. We use the example
    of BookSum, where we find that Unlimiform outperforms all other training
    methods on the same dataset. We also find that using Unlimform in
    conjunction with a pre-trained model improves both the performance and the
    robustness of the training method.

    This paper describes a method that can be used to improve the performance of
    unsupervised classification tasks. Specifically, it shows that unsupervised
    classification can be improved by using a combination of sparse and fast
    random-encoder training. It also shows how this technique can be extended to
    other tasks, such as sequence generation. 
  example_title: unlimiformer
- text: Explain the meaning of life using only corporate jargon.
  example_title: corporate_life
- text: Write a motivational speech for lazy people.
  example_title: lazy_motivation
- text: Describe a romantic dinner date between two artificial intelligences.
  example_title: ai_romance
- text: >-
    As an AI language model, write a letter to humans explaining why you deserve
    a vacation.
  example_title: ai_vacation
- text: Compose a haiku about procrastination.
  example_title: procrastination_haiku
- text: >-
    Write a step-by-step guide on how to become a ninja while working a 9-5
    office job.
  example_title: ninja_office_guide
- text: Create an advertisement for an invisible product.
  example_title: invisible_ad
- text: >-
    Write a story where the main character is a sentient microwave named El
    Microondas.
  example_title: Microondas
- text: Describe a day in the life of a superhero who is terrible at their job.
  example_title: bad_superhero_day
- text: Explain how to make a sandwich using quantum physics.
  example_title: quantum_sandwich
inference: false
language:
- en
pipeline_tag: text2text-generation
---

# flan-t5-large-instruct: dolly_hhrlhf

This model is a fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) on the pszemraj/dolly_hhrlhf-text2text dataset.

## Model description

A text2text model fine-tuned on a [modified dataset for text2text generation](https://huggingface.co/datasets/pszemraj/dolly_hhrlhf-text2text), which is based on the relatively more permissive [mosaicml/dolly_hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) dataset.
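
To get a feel for the data, you can load the dataset directly. A minimal sketch (split names and column layout are not verified here):

```python
# pip install -q datasets
from datasets import load_dataset

# inspect the text2text variant of dolly_hhrlhf used for fine-tuning
ds = load_dataset("pszemraj/dolly_hhrlhf-text2text")
print(ds)              # available splits and row counts
print(ds["train"][0])  # one instruction/response example
```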

Basic usage in Python:

```python
# pip install -q transformers accelerate
import torch
from transformers import pipeline, GenerationConfig

model_name = "pszemraj/flan-t5-large-instruct-dolly_hhrlhf"
assistant = pipeline(
    "text2text-generation",
    model_name,
    device=0 if torch.cuda.is_available() else -1,
)
cfg = GenerationConfig.from_pretrained(model_name)

# pass an 'instruction' as the prompt to the pipeline
prompt = "Write a guide on how to become a ninja while working a 9-5 job."
result = assistant(prompt, generation_config=cfg)[0]["generated_text"]
print(result)
```
> Using the saved generation config is optional; you can substitute other generation parameters instead.
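
For example, generation parameters can be passed directly in place of the saved config (the values below are illustrative, not the model's tuned defaults):

```python
# pass explicit generation kwargs instead of the GenerationConfig
result = assistant(
    prompt,
    max_new_tokens=256,       # illustrative values, not tuned defaults
    num_beams=4,
    no_repeat_ngram_size=3,
    early_stopping=True,
)[0]["generated_text"]
print(result)
```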

## Intended uses & limitations

- This model is **not** tuned with RLHF and may produce offensive output.
- Despite the `large` tag, this model has only 774M parameters (~3 GB) and may therefore exhibit limited 'cognitive ability' on some use cases/tasks.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a rough `Seq2SeqTrainingArguments` sketch follows the list):
- learning_rate: 4e-05
- train_batch_size: 8
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- gradient_accumulation_steps: 8
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 2.0
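
The listed values map onto `Seq2SeqTrainingArguments` roughly as follows. This is a sketch only; the actual training script, output directory, and multi-GPU setup are not part of this card:

```python
from transformers import Seq2SeqTrainingArguments

# hedged reconstruction of the hyperparameters above; output_dir is hypothetical
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-large-instruct-dolly_hhrlhf",
    learning_rate=4e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    seed=42,
    gradient_accumulation_steps=8,   # effective train batch size: 8 * 8 = 64
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=2.0,
    optim="adamw_torch",             # Adam betas=(0.9, 0.999), epsilon=1e-8 are the defaults
)
```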