|
--- |
|
tags: |
|
- text-generation |
|
- pytorch |
|
inference: false |
|
license: llama2 |
|
language: |
|
- pt |
|
pipeline_tag: text-generation |
|
library_name: transformers |
|
datasets: |
|
- dominguesm/CC-MAIN-2023-23 |
|
--- |
|
|
|
|
|
<p align="center"> |
|
<img width="250" alt="Canarim Logo" src="https://raw.githubusercontent.com/DominguesM/Canarim-Instruct-PTBR/main/assets/canarim.png">
|
</p> |
|
|
|
<hr> |
|
|
|
# Canarim-7B |
|
|
|
Canarim-7B is a Portuguese large language model developed by [Maicon Domingues](https://nlp.rocks). |
|
|
|
## Model description |
|
|
|
The model was pretrained on 16 billion tokens from the Portuguese subset of [CommonCrawl 2023-23](https://huggingface.co/datasets/dominguesm/CC-MAIN-2023-23), starting from the weights of LLaMA2-7B. The pretraining data has a cutoff of mid-2023.
|
|
|
## Key Features |
|
|
|
- **Language:** Specialized in understanding and generating Portuguese text, making it ideal for applications targeting Portuguese-speaking audiences. |
|
- **Architecture:** Inherits the robust architecture from LLaMA2-7B, ensuring efficient performance and accurate results. |
|
- **Diverse Dataset:** The pretraining dataset includes a wide range of topics and writing styles, enhancing the model's ability to understand various contexts and nuances in Portuguese. |
|
|
|
## Applications |
|
|
|
Canarim-7B was trained solely on a language modeling objective and has not been fine-tuned to follow instructions. It is therefore better suited to few-shot tasks than zero-shot tasks: the model tends to perform better when the prompt includes a few examples of the desired outcome. Here are some practical applications:
|
|
|
- **Natural Language Understanding (NLU):** Efficient in tasks such as sentiment analysis, topic classification, and entity recognition in Portuguese text, especially when relevant examples are provided. |
|
- **Natural Language Generation (NLG):** Capable of generating coherent and contextually relevant text, useful for content creation, chatbots, and more, with improved results when provided examples of the desired style or format. |
|
- **Language Translation:** Suitable for translation between Portuguese and other languages, especially when examples of the desired translations are included in the prompt.
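The few-shot pattern for translation can be sketched as plain prompt construction; the example sentence pairs below are illustrative, not drawn from the model's training data:

```python
# Build a few-shot English-to-Portuguese translation prompt.
# The translation pairs are illustrative examples.
pairs = [
    ("The weather is nice today.", "O tempo está agradável hoje."),
    ("Where is the nearest train station?", "Onde fica a estação de trem mais próxima?"),
]
source = "I would like a cup of coffee, please."

# Each example shows the model the desired input/output format;
# the final line is left open for the model to complete.
prompt = "\n".join(f"English: {en}\nPortuguese: {pt}\n" for en, pt in pairs)
prompt += f"English: {source}\nPortuguese:"
print(prompt)
```

The resulting string can be passed directly to the `pipeline` call shown in the Getting Started section below.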
|
|
|
### Tips for Efficient Use |
|
|
|
- **Few-shot Learning:** When using Canarim-7B for specific tasks, it is beneficial to provide a few relevant examples. This helps the model better understand the context and purpose of the task. |
|
- **Contextualization:** Including additional context in the input can significantly improve the quality of the model’s predictions and text generation. |
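The two tips above can be combined in a small helper. This is a minimal sketch of the few-shot pattern for sentiment classification; the `build_few_shot_prompt` function, the labels, and the example texts are all illustrative, not part of the model or the Transformers API:

```python
def build_few_shot_prompt(examples, query):
    """Join labeled examples and the new input into a single prompt string."""
    lines = [f"Texto: {text}\nSentimento: {label}" for text, label in examples]
    # Leave the final label blank for the model to fill in.
    lines.append(f"Texto: {query}\nSentimento:")
    return "\n\n".join(lines)

# Illustrative labeled examples providing both context and format.
examples = [
    ("Adorei o atendimento, voltarei com certeza!", "positivo"),
    ("O produto chegou quebrado e ninguém respondeu.", "negativo"),
]

prompt = build_few_shot_prompt(examples, "A entrega foi rápida e o preço justo.")
print(prompt)
```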
|
|
|
--- |
|
|
|
## Getting Started |
|
|
|
To start using Canarim-7B with the Transformers library, first install the library if you haven't already: |
|
|
|
```bash |
|
pip install transformers |
|
``` |
|
|
|
You can then load the model using the Transformers library. Here's a simple example of how to use the model for text generation using the `pipeline` function: |
|
|
|
```python |
|
from transformers import AutoTokenizer, pipeline |
|
import torch |
|
|
|
model_id = "dominguesm/canarim-7b" |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
|
|
pipe = pipeline( |
|
"text-generation", |
|
model=model_id, |
|
torch_dtype=torch.float16, |
|
device_map="auto", |
|
) |
|
|
|
# Prompt for the base LM; replace with your own input text
prompt = "Os pontos turísticos mais famosos do Rio de Janeiro são"
|
sequences = pipe( |
|
prompt, |
|
do_sample=True, |
|
num_return_sequences=1, |
|
eos_token_id=tokenizer.eos_token_id, |
|
max_length=2048, |
|
temperature=0.9, |
|
top_p=0.6, |
|
repetition_penalty=1.15 |
|
)

# Each returned item is a dict whose "generated_text" includes the prompt
print(sequences[0]["generated_text"])
|
``` |
|
|
|
This code snippet demonstrates how to generate text with Canarim-7B. You can customize the input text and adjust parameters like `max_length` according to your requirements. |
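By default, the text-generation pipeline's `generated_text` field includes the original prompt. If you only want the model's continuation, a small helper can strip it; `strip_prompt` below is a sketch for illustration, not part of the Transformers library:

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the newly generated continuation after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # Fall back to the full text if the prompt was not echoed verbatim.
    return generated_text

# Usage: strip_prompt(sequences[0]["generated_text"], prompt)
continuation = strip_prompt("O céu é azul porque a luz se espalha.", "O céu é azul")
print(continuation)  # "porque a luz se espalha."
```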
|
|
|
## Citation |
|
|
|
If you want to cite **Canarim-7B**, you can use the following:
|
|
|
```bibtex
|
@misc{maicon_domingues_2023,
  author    = {Maicon Domingues},
  title     = {canarim-7b (Revision 08fdd2b)},
  year      = {2023},
  url       = {https://huggingface.co/dominguesm/canarim-7b},
  doi       = {10.57967/hf/1356},
  publisher = {Hugging Face}
}
|
``` |
|
|
|
## License |
|
|
|
Canarim-7B is released under the [LLAMA 2 COMMUNITY LICENSE AGREEMENT](https://ai.meta.com/llama/license/). |