---
language: ta
datasets:
- oscar
- IndicNLP
- Wiki-Tamil novels scraped data
widget:
- text: 'ஆதித்த கரிகாலர் தஞ்சைக்குச் செல்ல உடனடியாக ஒப்புக்கொண்டார்.'
- text: 'நந்தினி பெரிய பழுவேட்டரையரை உண்மையாக நேசித்தால் '
- text: 'மதுராந்தகருக்கு இராஜ்யமாளும் விருப்பம் இருப்பதாக இல்லை '
---

# GPT2-Kalki

## Model description

GPT2-Kalki is a GPT-2 transformer model fine-tuned on a corpus of Tamil-language data from Wikipedia. It has been specifically fine-tuned on the works of [Kalki Krishnamurthy](https://en.wikipedia.org/wiki/Kalki_Krishnamurthy), a Tamil writer from the early 1900s.

The model is meant for exploring "what if" scenarios with the characters of his novels. The recently released film [Ponniyin Selvan - I](https://en.wikipedia.org/wiki/Ponniyin_Selvan:_I) is based on a novel by the same author.

This model was fine-tuned from [GPT2-Tamil](https://huggingface.co/abinayam/gpt-2-tamil), which was already trained on a Tamil dataset.

## Datasets used

The GPT-2 model was trained on the [oscar dataset - ta](https://huggingface.co/datasets/oscar) and the [IndicNLP dataset - ta](https://indicnlp.ai4bharat.org/corpora/), along with a manually scraped Wikipedia dataset consisting of stories and novels.

The scraped dataset will be released soon.

## Usage

You can use this model for Tamil text generation:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the fine-tuned model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('tsaditya/GPT-Kalki')
model = AutoModelForCausalLM.from_pretrained('tsaditya/GPT-Kalki')

# Prompt (Tamil): "Aditha Karikalar immediately agreed to go to Thanjavur."
text = "ஆதித்த கரிகாலர் தஞ்சைக்குச் செல்ல உடனடியாக ஒப்புக்கொண்டார். "
encoded_text = tokenizer.encode(text, return_tensors='pt')

# Generate a continuation with top-k / top-p sampling
output = model.generate(
    encoded_text,
    do_sample=True,
    max_length=512,
    top_k=50,
    top_p=0.95,
    num_return_sequences=1,
    no_repeat_ngram_size=3,
    temperature=0.7,
)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
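
Alternatively, the same checkpoint can be wrapped in a `text-generation` pipeline. The snippet below is a minimal sketch, not a tested recipe: it reuses one of the widget prompts above and the same sampling settings, with a shorter `max_length` chosen arbitrarily.

```python
from transformers import pipeline

# Wrap the checkpoint in a text-generation pipeline
generator = pipeline('text-generation', model='tsaditya/GPT-Kalki')

# One of the widget prompts from this card
prompt = "மதுராந்தகருக்கு இராஜ்யமாளும் விருப்பம் இருப்பதாக இல்லை "

# Generation keyword arguments are forwarded to model.generate()
outputs = generator(
    prompt,
    do_sample=True,
    max_length=256,
    top_k=50,
    top_p=0.95,
    temperature=0.7,
)
print(outputs[0]['generated_text'])
```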

---