---
license: apache-2.0
datasets:
- jed351/cantonese-wikipedia
- raptorkwok/cantonese-traditional-chinese-parallel-corpus
language:
- zh
- en
pipeline_tag: text-generation
tags:
- Cantonese
- Qwen2
- chat
---
# Qwen2-Cantonese-7B-Instruct

## Model Overview
Qwen2-Cantonese-7B-Instruct is a Cantonese language model based on Qwen2-7B-Instruct and fine-tuned with LoRA to improve Cantonese text generation and comprehension. It supports tasks such as dialogue generation, text summarization, and question answering.
## Model Features
- Base Model: Qwen2-7B-Instruct
- Fine-tuning Method: LoRA instruction tuning
- Training Steps: 4,572
- Primary Language: Cantonese
- Datasets:
  - [jed351/cantonese-wikipedia](https://huggingface.co/datasets/jed351/cantonese-wikipedia)
  - [raptorkwok/cantonese-traditional-chinese-parallel-corpus](https://huggingface.co/datasets/raptorkwok/cantonese-traditional-chinese-parallel-corpus)
- Training Tools: LLaMA-Factory
## Usage
You can easily load and use this model with Hugging Face's Transformers library. Here is a simple example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("lordjia/Qwen2-Cantonese-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("lordjia/Qwen2-Cantonese-7B-Instruct")

# "Please introduce yourself in Cantonese."
input_text = "唔該你用廣東話講下你係邊個。"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
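Because this is an instruction-tuned model, responses are usually better when the prompt is wrapped in the tokenizer's chat template rather than passed as raw text. A minimal sketch, reusing the `tokenizer` and `model` loaded above and assuming the standard Qwen2 chat template shipped with the tokenizer:

```python
# Build a single-turn conversation; the question text is just an illustration.
messages = [
    {"role": "user", "content": "唔該你用廣東話講下你係邊個。"}  # "Please introduce yourself in Cantonese."
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker
    return_tensors="pt",
)
outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```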
## Quantized Version
A 4-bit quantized version of this model is also available: `qwen2-cantonese-7b-instruct-q4_0.gguf`.
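The GGUF file can be run with llama.cpp or any compatible runtime. A minimal sketch using llama-cpp-python (an assumption; any GGUF-compatible runtime works), with a placeholder path for wherever you saved the downloaded file:

```python
from llama_cpp import Llama

# Placeholder local path: download the GGUF file from this repository first.
llm = Llama(model_path="./qwen2-cantonese-7b-instruct-q4_0.gguf", n_ctx=2048)

# "Please introduce yourself in Cantonese."
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "唔該你用廣東話講下你係邊個。"}],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```

4-bit quantization trades a small amount of output quality for a much smaller memory footprint, which makes the model practical on CPU-only machines.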
## Alternative Model Recommendation
For an alternative, consider Llama-3-Cantonese-8B-Instruct, also fine-tuned by LordJia and based on Meta-Llama-3-8B-Instruct.
## License
This model is licensed under the Apache 2.0 license. Please review the terms before use.
## Contributors
- LordJia
## Acknowledgements
Thanks to Hugging Face for providing the platform and tools, and to all the developers and researchers contributing to the open-source community.