---
license: mit
datasets:
- ELiRF/dacsa
- projecte-aina/CATalog
language:
- ca
- en
base_model:
- openai-community/gpt2
- openai-community/gpt2-medium
pipeline_tag: text-generation
---
# GPT-2 Medium Catalan-English Model
The model is still being trained, and I will be making updates. Please do not expect great results just yet. 😀
## Model Overview
This model uses the GPT-2 Medium architecture trained **from scratch**, meaning it does not inherit any weights from existing models. It has been trained on **Catalan** data, specifically the **ELiRF/dacsa** and **projecte-aina/CATalog** datasets.
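For reference, a from-scratch GPT-2 Medium model can be instantiated with `transformers` as sketched below. This is an assumption for illustration, not the published training setup: the dimensions are the standard GPT-2 Medium values, and the vocabulary size matches the 52,000-token tokenizer described in the Tokenizer section.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Minimal sketch of a from-scratch GPT-2 Medium setup (assumed, not the
# author's published training config). n_embd/n_layer/n_head are the
# standard GPT-2 Medium values; vocab_size matches the 52,000-token tokenizer.
config = GPT2Config(
    vocab_size=52_000,
    n_positions=1024,
    n_embd=1024,
    n_layer=24,
    n_head=16,
)
model = GPT2LMHeadModel(config)  # randomly initialized, no pretrained weights
print(f"{model.num_parameters():,} parameters")
```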
## License and Usage
This model is **free to use** under the MIT license. However, proper credit must be given when using it in research, applications, or any derived work.
## Tokenizer
The model uses a **52,000-token vocabulary** built with the GPT-2 tokenizer configuration and trained specifically to handle Catalan. The tokenizer is also available on its own as `Marxx01/gpt2-catalan-tokenizer`.
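The standalone tokenizer can be loaded directly from the Hub. A minimal sketch (the example sentence is illustrative):

```python
from transformers import AutoTokenizer

# Load the standalone Catalan tokenizer
tokenizer = AutoTokenizer.from_pretrained("Marxx01/gpt2-catalan-tokenizer")

# Example sentence: "Good morning, how are you?"
tokens = tokenizer.tokenize("Bon dia, com estàs?")
print(tokens)
print(tokenizer.vocab_size)  # expected: 52000
```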
## How to Use
To use this model for text generation, you can load it with the `transformers` library as follows:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Marxx01/gpt2_catalan"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Catalan prompt: "The president of the Generalitat said "
text = "El president de la generalitat va dir "
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,            # sample instead of greedy decoding
    max_length=150,            # total length in tokens (prompt + generation)
    temperature=0.7,           # soften the next-token distribution
    top_p=0.8,                 # nucleus sampling threshold
    top_k=1000,                # sample only from the 1000 most likely tokens
    no_repeat_ngram_size=2,    # never repeat the same 2-gram
    num_return_sequences=1,    # return a single completion
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
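Alternatively, since the model is tagged for text generation, it can be used through the high-level `pipeline` API. A minimal sketch (the generation parameters here are illustrative, not recommended settings):

```python
from transformers import pipeline

# Text-generation pipeline; loads model and tokenizer in one step
generator = pipeline("text-generation", model="Marxx01/gpt2_catalan")

# Catalan prompt: "The president of the Generalitat said "
result = generator(
    "El president de la generalitat va dir ",
    max_length=150,
    do_sample=True,
)
print(result[0]["generated_text"])
```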