---
license: openrail
datasets:
- humarin/chatgpt-paraphrases
language:
- en
library_name: transformers
---
This model was trained on our [ChatGPT paraphrase dataset](https://huggingface.co/datasets/humarin/chatgpt-paraphrases).

The dataset is built from the [Quora paraphrase question pairs](https://www.kaggle.com/competitions/quora-question-pairs), texts from [SQuAD 2.0](https://huggingface.co/datasets/squad_v2), and the [CNN/DailyMail news dataset](https://huggingface.co/datasets/cnn_dailymail).

This model is based on T5-base. We used transfer learning to get our model to generate paraphrases as well as ChatGPT does, and we believe it is now one of the best paraphrasing models on the Hugging Face Hub.
The dataset is also available on [Kaggle](https://www.kaggle.com/datasets/vladimirvorobevv/chatgpt-paraphrases).

**Deployment example:**
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda"  # switch to "cpu" if no GPU is available

tokenizer = AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")
model = AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base").to(device)

def paraphrase(text, max_length=128, num_return_sequences=5, num_beams=25, temperature=0.7):
    # T5 expects a task prefix; this model was trained with "paraphrase: "
    input_ids = tokenizer(
        f"paraphrase: {text}",
        return_tensors="pt",
        padding="longest",
        max_length=max_length,
        truncation=True,
    ).input_ids.to(device)

    outputs = model.generate(
        input_ids,
        temperature=temperature,  # note: temperature only takes effect when do_sample=True
        repetition_penalty=1.5,
        num_return_sequences=num_return_sequences,
        no_repeat_ngram_size=5,
        num_beams=num_beams,
        max_length=max_length,
    )

    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```
**Usage examples**
**Input:**
```python
text = 'What are the best places to see in New York?'
paraphrase(text)
```
**Output:**
```python
['What are some of the must-visit places in New York?',
'Which are the top destinations to explore in New York?',
'What are some of the must-visit spots in New York?',
'What are some of the must-see places in New York?',
'Which places should I not miss when visiting New York?']
```
**Input:**
```python
text = "Rammstein's album Mutter was recorded in the south of France in May and June 2000, and mixed in Stockholm in October of that year."
paraphrase(text)
```
**Output:**
```python
['In May and June 2000, Rammstein recorded Mutter in the south of France, with the album being mixed in Stockholm in October of the same year.',
'In May and June 2000, Rammstein recorded Mutter in the south of France, with the album being mixed in Stockholm in October of that year.',
'In May and June 2000, Rammstein recorded Mutter, his album, in the south of France, with mixing taking place in Stockholm in October of the same year.',
'In May and June 2000, Rammstein filmed the recording of his album Mutter in the south of France, with the mixing process taking place in Stockholm in October of the same year.',
'In May and June 2000, Rammstein recorded Mutter in the south of France, with mixing taking place in Stockholm in October of the same year.']
```
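As the first example shows, beam search can return near-duplicate candidates ("must-visit places" vs. "must-visit spots"). If you only want clearly distinct paraphrases, a simple post-filter on token overlap works; this is a sketch that is not part of the model itself, and the Jaccard threshold of 0.8 is an arbitrary illustrative choice:

```python
def dedupe(candidates, threshold=0.8):
    """Keep only candidates whose token-level Jaccard similarity to every
    already-kept candidate is below the threshold."""
    kept = []
    for cand in candidates:
        tokens = set(cand.lower().split())
        if all(
            len(tokens & set(k.lower().split())) / len(tokens | set(k.lower().split())) < threshold
            for k in kept
        ):
            kept.append(cand)
    return kept

examples = [
    'What are some of the must-visit places in New York?',
    'Which are the top destinations to explore in New York?',
    'What are some of the must-visit spots in New York?',
]
print(dedupe(examples))  # the third candidate is dropped as a near-duplicate of the first
```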
**Train parameters:**
```python
epochs = 3
batch_size = 64
max_length = 128
lr = 5e-5
batches_qty = 196465
betas = (0.9, 0.999)
eps = 1e-08
```
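For reference, `lr`, `betas`, and `eps` are the standard Adam/AdamW hyperparameters. Below is a pure-Python sketch of a single bias-corrected Adam update using these exact values; it is illustrative only, since the real training would use a framework optimizer such as PyTorch's `AdamW` over the full T5 parameter set:

```python
import math

# Hyperparameters from the train-parameters block above.
lr, betas, eps = 5e-5, (0.9, 0.999), 1e-08

def adam_step(param, grad, m, v, t):
    """One Adam update for a scalar parameter (with bias-corrected moments)."""
    m = betas[0] * m + (1 - betas[0]) * grad          # first-moment estimate
    v = betas[1] * v + (1 - betas[1]) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - betas[0] ** t)                   # bias correction
    v_hat = v / (1 - betas[1] ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Single step on a toy scalar parameter with an illustrative gradient of 0.5.
p, m, v = 1.0, 0.0, 0.0
p, m, v = adam_step(p, grad=0.5, m=m, v=v, t=1)
```

At step 1 the bias correction makes the update size approximately `lr` regardless of the gradient's magnitude, which is why Adam's early steps are well-behaved even before the moment estimates warm up.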