|
--- |
|
language: |
|
- en |
|
license: |
|
- cc-by-sa-3.0 |
|
- apache-2.0 |
|
tags: |
|
- generated_from_trainer |
|
- dolly_hhrlhf |
|
- flan-instruct |
|
datasets: |
|
- pszemraj/dolly_hhrlhf-text2text |
|
widget: |
|
- text: What is Deoxys in pokemon? |
|
example_title: deoxys |
|
- text: 'combine the below summary excerpts into a single, cohesive short summary |
|
without repetition: In this paper, we present a general approach to extending |
|
pre-trained models to unlimited input lengths without adding additional learning |
|
weights. We show that our approach works well on datasets longer than the maximum |
|
input for these models. For example, a dataset with a maximum input length of |
|
16384 tokens can be extended to a maximum length of 350K tokens. We also demonstrate |
|
that our method is able to summarize even 350K token-long input sequences from |
|
BookSum. |
|
|
|
In this paper, we describe the search step reformulation of attention. The search |
|
step uses a single storage of hidden states for space efficiency. We construct |
|
a total of two sets of datastores where L and H are the keys and values stored |
|
in each set of stores. L is the amount of storage required to retrieve the encoded |
|
tokens. H is the hidden states per head. This allows retrieval augmentation at |
|
both time and space. Instead of using a single set of decoder layers, we use a |
|
retrieval augmentation system that allows us to simultaneously store multiple |
|
sets of tokens across two different sets of storage. For example, we could store |
|
all tokens in one set of storage and retrieve them all in the same set of tokens. |
|
This would be very similar to the Memorization Transformers approach. However, |
|
instead of storing the tokens in a single memory layer, we store them in a set |
|
of multiple storage layers. This way, we don''t have to store them all at once. |
|
This is why we call this reformulation ''attention reformulation'' rather than |
|
''attention formula.'' We also call it ''retrieval augmentation'' because it uses |
|
the same number of storage layers as the original transformer attention formula. |
|
This means that we can store the tokens across multiple storage systems without |
|
having to store every token in a separate storage system. It''s not like we''re |
|
trying to do something new or different. We just want to make sure that everything |
|
is working as well as possible. |
|
|
|
In this paper, we introduce the concept of ''unlimiformer,'' which is a machine |
|
learning technique that retrieves key information from a data store in one layer |
|
and applies it to a large set of datasets. We use the example of BookSum, where |
|
we find that Unlimiform outperforms all other training methods on the same dataset. |
|
We also find that using Unlimform in conjunction with a pre-trained model improves |
|
both the performance and the robustness of the training method. |
|
|
|
This paper describes a method that can be used to improve the performance of unsupervised |
|
classification tasks. Specifically, it shows that unsupervised classification |
|
can be improved by using a combination of sparse and fast random-encoder training. |
|
It also shows how this technique can be extended to other tasks, such as sequence |
|
generation. ' |
|
example_title: unlimiformer |
|
- text: Explain the meaning of life using only corporate jargon. |
|
example_title: corporate_life |
|
- text: Write a motivational speech for lazy people. |
|
example_title: lazy_motivation |
|
- text: Describe a romantic dinner date between two artificial intelligences. |
|
example_title: ai_romance |
|
- text: As an AI language model, write a letter to humans explaining why you deserve |
|
a vacation. |
|
example_title: ai_vacation |
|
- text: Compose a haiku about procrastination. |
|
example_title: procrastination_haiku |
|
- text: Write a step-by-step guide on how to become a ninja while working a 9-5 office |
|
job. |
|
example_title: ninja_office_guide |
|
- text: Create an advertisement for an invisible product. |
|
example_title: invisible_ad |
|
- text: Write a story where the main character is a sentient microwave named El Microondas. |
|
example_title: Microondas |
|
- text: Describe a day in the life of a superhero who is terrible at their job. |
|
example_title: bad_superhero_day |
|
- text: Explain how to make a sandwich using quantum physics. |
|
example_title: quantum_sandwich |
|
inference: false |
|
pipeline_tag: text2text-generation |
|
base_model: google/flan-t5-large |
|
--- |
|
|
|
# flan-t5-large-instruct: dolly_hhrlhf |
|
|
|
<a href="https://colab.research.google.com/gist/pszemraj/df1989546b02f284d33ca4996f70fedc/flan-t5-large-instruct-example.ipynb"> |
|
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/> |
|
</a> |
|
|
|
This model is a fine-tuned version of [google/flan-t5-large](https://huggingface.co/google/flan-t5-large) on the pszemraj/dolly_hhrlhf-text2text dataset. |
|
|
|
## Model description |
|
|
|
This is a text2text model fine-tuned on a [modified dataset for text2text generation](https://huggingface.co/datasets/pszemraj/dolly_hhrlhf-text2text) based on the comparatively permissive [mosaicml/dolly_hhrlhf](https://huggingface.co/datasets/mosaicml/dolly_hhrlhf) dataset.
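
To inspect the training data directly, the dataset can be loaded with the `datasets` library. This is a minimal sketch; it assumes `datasets` is installed and that the dataset exposes a `train` split.

```python
# pip install -q datasets
from datasets import load_dataset

# load the modified dolly_hhrlhf text2text dataset used for fine-tuning
ds = load_dataset("pszemraj/dolly_hhrlhf-text2text")
print(ds)              # available splits and column names
print(ds["train"][0])  # one example record (assumes a 'train' split)
```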
|
|
|
Basic usage in Python: |
|
|
|
```python |
|
# pip install -q transformers accelerate |
|
import torch |
|
from transformers import pipeline, GenerationConfig |
|
|
|
model_name = "pszemraj/flan-t5-large-instruct-dolly_hhrlhf" |
|
assistant = pipeline( |
|
"text2text-generation", |
|
model_name, |
|
device=0 if torch.cuda.is_available() else -1, |
|
) |
|
cfg = GenerationConfig.from_pretrained(model_name) |
|
|
|
# pass an 'instruction' as the prompt to the pipeline |
|
prompt = "Write a guide on how to become a ninja while working a 9-5 job." |
|
result = assistant(prompt, generation_config=cfg)[0]["generated_text"] |
|
print(result) |
|
``` |
|
> Using the generation config is optional; you can substitute other generation parameters instead.
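
For example, generation parameters can be passed directly to the pipeline call. The values below are illustrative choices, not the model's shipped defaults:

```python
# pass generation kwargs directly instead of a GenerationConfig
result = assistant(
    prompt,
    max_new_tokens=256,      # illustrative value
    num_beams=4,             # beam search instead of greedy decoding
    no_repeat_ngram_size=3,  # reduce repetition
    early_stopping=True,
)[0]["generated_text"]
print(result)
```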
|
|
|
## Intended uses & limitations |
|
|
|
- this model is **not** tuned with RLHF or similar alignment methods and may produce offensive outputs

- despite the `large` tag, this model has only 774M parameters (~3 GB) and may therefore exhibit less 'cognitive ability' on some use cases/tasks
|
|
|
## Training procedure |
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (see the sketch after this list):
|
- learning_rate: 4e-05 |
|
- train_batch_size: 8 |
|
- eval_batch_size: 16 |
|
- seed: 42 |
|
- distributed_type: multi-GPU |
|
- gradient_accumulation_steps: 8 |
|
- total_train_batch_size: 64 |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: cosine |
|
- lr_scheduler_warmup_ratio: 0.03 |
|
- num_epochs: 2.0 |
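
For reference, here is a minimal sketch of how these settings map onto Hugging Face `Seq2SeqTrainingArguments`. This is illustrative only, not the original training script, and the `output_dir` is hypothetical:

```python
from transformers import Seq2SeqTrainingArguments

# illustrative mapping of the hyperparameters listed above
training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-large-instruct-dolly_hhrlhf",  # hypothetical output path
    learning_rate=4e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=8,  # yields a total train batch size of 64
    num_train_epochs=2.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,
)
```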