|
--- |
|
tags: |
|
- text-generation |
|
- pytorch |
|
inference: false |
|
license: llama2 |
|
language: |
|
- pt |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
datasets: |
|
- dominguesm/CC-MAIN-2023-23 |
|
--- |
|
|
|
|
|
<p align="center"> |
|
<img width="250" alt="Canarim Logo" src="https://raw.githubusercontent.com/DominguesM/Canarim-Instruct-PTBR/main/assets/canarim.png">
|
</p> |
|
|
|
<hr> |
|
|
|
# Canarim-7B |
|
|
|
Canarim-7B is a Portuguese large language model developed by [Maicon Domingues](https://nlp.rocks). |
|
|
|
## Model description |
|
|
|
The model was pretrained on 16 billion tokens from the Portuguese subset of [CommonCrawl 2023-23](https://huggingface.co/datasets/dominguesm/CC-MAIN-2023-23), starting from the weights of LLaMA2-7B. The pretraining data has a cutoff of mid-2023.
|
|
|
## Key Features |
|
|
|
- **Language:** Specialized in understanding and generating Portuguese text, making it ideal for applications targeting Portuguese-speaking audiences. |
|
- **Architecture:** Inherits the robust architecture from LLaMA2-7B, ensuring efficient performance and accurate results. |
|
- **Diverse Dataset:** The pretraining dataset includes a wide range of topics and writing styles, enhancing the model's ability to understand various contexts and nuances in Portuguese. |
|
|
|
## Applications |
|
|
|
Canarim-7B was trained solely on a language modeling objective and has not been fine-tuned to follow instructions. It is therefore better suited to few-shot tasks than zero-shot tasks: the model tends to perform better when the prompt includes a few examples of the desired outcome. Here are some practical applications:
|
|
|
- **Natural Language Understanding (NLU):** Efficient in tasks such as sentiment analysis, topic classification, and entity recognition in Portuguese text, especially when relevant examples are provided. |
|
- **Natural Language Generation (NLG):** Capable of generating coherent and contextually relevant text, useful for content creation, chatbots, and more, with improved results when provided examples of the desired style or format. |
|
- **Language Translation:** Suitable for translation between Portuguese and other languages, especially when examples of the desired translations are included in the prompt.
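The few-shot pattern for translation can be sketched as plain prompt construction; the example sentence pairs below are illustrative, not drawn from the model's training data:

```python
# Build a few-shot English-to-Portuguese translation prompt.
# The translation pairs are illustrative examples.
pairs = [
    ("The weather is nice today.", "O tempo está agradável hoje."),
    ("Where is the nearest train station?", "Onde fica a estação de trem mais próxima?"),
]
source = "I would like a cup of coffee, please."

# Each example shows the model the desired input/output format;
# the final line is left open for the model to complete.
prompt = "\n".join(f"English: {en}\nPortuguese: {pt}\n" for en, pt in pairs)
prompt += f"English: {source}\nPortuguese:"
print(prompt)
```

The resulting string can be passed directly to the `pipeline` call shown in the Getting Started section below.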
|
|
|
### Tips for Efficient Use |
|
|
|
- **Few-shot Learning:** When using Canarim-7B for specific tasks, it is beneficial to provide a few relevant examples. This helps the model better understand the context and purpose of the task. |
|
- **Contextualization:** Including additional context in the input can significantly improve the quality of the model’s predictions and text generation. |
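The two tips above can be combined in a small helper. This is a minimal sketch of the few-shot pattern for sentiment classification; the `build_few_shot_prompt` function, the labels, and the example texts are all illustrative, not part of the model or the Transformers API:

```python
def build_few_shot_prompt(examples, query):
    """Join labeled examples and the new input into a single prompt string."""
    lines = [f"Texto: {text}\nSentimento: {label}" for text, label in examples]
    # Leave the final label blank for the model to fill in.
    lines.append(f"Texto: {query}\nSentimento:")
    return "\n\n".join(lines)

# Illustrative labeled examples providing both context and format.
examples = [
    ("Adorei o atendimento, voltarei com certeza!", "positivo"),
    ("O produto chegou quebrado e ninguém respondeu.", "negativo"),
]

prompt = build_few_shot_prompt(examples, "A entrega foi rápida e o preço justo.")
print(prompt)
```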
|
|
|
--- |
|
|
|
## Getting Started |
|
|
|
To start using Canarim-7B with the Transformers library, first install the library if you haven't already: |
|
|
|
```bash |
|
pip install transformers |
|
``` |
|
|
|
You can then load the model using the Transformers library. Here's a simple example of how to use the model for text generation using the `pipeline` function: |
|
|
|
```python |
|
from transformers import AutoTokenizer, pipeline |
|
import torch |
|
|
|
model_id = "dominguesm/canarim-7b" |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
|
pipe = pipeline( |
|
"text-generation", |
|
model=model_id, |
|
torch_dtype=torch.float16, |
|
device_map="auto", |
|
) |
|
|
|
# Prompt for the base LM; replace with your own input text
prompt = "Os pontos turísticos mais famosos do Rio de Janeiro são"
|
sequences = pipe( |
|
prompt, |
|
do_sample=True, |
|
num_return_sequences=1, |
|
eos_token_id=tokenizer.eos_token_id, |
|
max_length=2048, |
|
temperature=0.9, |
|
top_p=0.6, |
|
repetition_penalty=1.15 |
|
)

# Each returned item is a dict whose "generated_text" includes the prompt
print(sequences[0]["generated_text"])
|
``` |
|
|
|
This code snippet demonstrates how to generate text with Canarim-7B. You can customize the input text and adjust parameters like `max_length` according to your requirements. |
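By default, the text-generation pipeline's `generated_text` field includes the original prompt. If you only want the model's continuation, a small helper can strip it; `strip_prompt` below is a sketch for illustration, not part of the Transformers library:

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the newly generated continuation after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # Fall back to the full text if the prompt was not echoed verbatim.
    return generated_text

# Usage: strip_prompt(sequences[0]["generated_text"], prompt)
continuation = strip_prompt("O céu é azul porque a luz se espalha.", "O céu é azul")
print(continuation)  # "porque a luz se espalha."
```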
|
|
|
## Citation |
|
|
|
If you want to cite **Canarim-7B**, you can use the following:
|
|
|
```bibtex
|
@misc{maicon_domingues_2023,
  author    = {Maicon Domingues},
  title     = {canarim-7b (Revision 08fdd2b)},
  year      = {2023},
  url       = {https://huggingface.co/dominguesm/canarim-7b},
  doi       = {10.57967/hf/1356},
  publisher = {Hugging Face}
}
|
``` |
|
|
|
## License |
|
|
|
Canarim-7B is released under the [LLAMA 2 COMMUNITY LICENSE AGREEMENT](https://ai.meta.com/llama/license/). |