
Model Card for Taigi-Llama-2-7B

The Taigi-Llama-2 series is built on the Traditional Chinese version of the LLaMA-2 model. We continued pre-training on roughly 78 MB of web-scraped Taiwanese Hokkien text written in Hanzi, POJ, and Hanlo.
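
To make the writing systems named above concrete, here is a minimal sketch that buckets a string by the scripts it contains: Han characters only (Hanzi), Latin letters only (romanization such as POJ), or a mix of both (Hanlo). The function names and labels are hypothetical and are not taken from the authors' data pipeline. The heuristic looks only at Unicode code points, so it cannot tell POJ apart from other Latin-script text such as English.

```python
import unicodedata

# CJK Unified Ideographs: Extension A, the basic block, and Extension B.
HAN_RANGES = ((0x3400, 0x4DBF), (0x4E00, 0x9FFF), (0x20000, 0x2A6DF))


def is_han(ch: str) -> bool:
    """Return True if the character falls in one of the Han ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HAN_RANGES)


def is_latin_letter(ch: str) -> bool:
    """Return True for Latin letters, including precomposed accented ones."""
    return ch.isalpha() and "LATIN" in unicodedata.name(ch, "")


def classify_writing_system(text: str) -> str:
    """Heuristic label: 'hanzi', 'romanized', 'hanlo', or 'other'."""
    has_han = any(is_han(ch) for ch in text)
    has_latin = any(is_latin_letter(ch) for ch in text)
    if has_han and has_latin:
        return "hanlo"      # mixed Han characters and romanization
    if has_han:
        return "hanzi"      # Han characters only
    if has_latin:
        return "romanized"  # Latin script only, e.g. POJ
    return "other"


print(classify_writing_system("臺灣上懸的山是啥物?"))
print(classify_writing_system("Tâi-oân"))
```

A filter like this could serve as a first pass when sorting scraped text by writing system, but real data would likely need a language-aware check on top of it.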

For more details, please refer to our GitHub repository and the paper: Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems

Explore other models and datasets in the Taiwanese Hokkien LLM collection.

Model description

  • Usage: This model can be used for causal language modeling tasks in Taiwanese Hokkien. It is also suitable for further fine-tuning on specific datasets for downstream tasks.
  • Language(s) (NLP): The primary language is Taiwanese Hokkien (Hanzi and POJ). The model also retains capabilities in English and Mandarin Chinese due to prior pre-training.
  • Input: Text
  • Output: Text
  • Model Size: 7B parameters
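
As a rough guide to what the parameter count means in practice, the sketch below estimates the memory needed to hold the weights alone. It assumes 2 bytes per parameter for 16-bit types such as float16 and 4 bytes for float32. The helper name is hypothetical, and the estimate excludes activations, the KV cache, and framework overhead, so actual usage will be higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate the memory for model weights in GiB (weights only)."""
    return num_params * bytes_per_param / 2**30


# 7B parameters stored in a 16-bit type (2 bytes each): about 13 GiB.
print(f"{weight_memory_gib(7e9):.2f} GiB")

# The same weights in float32 (4 bytes each): about 26 GiB.
print(f"{weight_memory_gib(7e9, bytes_per_param=4):.2f} GiB")
```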

Usage Example

from transformers import AutoModelForCausalLM, AutoTokenizer, TextGenerationPipeline
import torch
import accelerate

def get_pipeline(path: str, tokenizer: AutoTokenizer, accelerator: accelerate.Accelerator) -> TextGenerationPipeline:
    # Load the weights in half precision and let accelerate place them on the available devices.
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True)

    # Stop generation on the EOS or padding token; skip any ID the tokenizer does not define.
    terminators = [token_id for token_id in (tokenizer.eos_token_id, tokenizer.pad_token_id)
                   if token_id is not None]

    pipeline = TextGenerationPipeline(
        model=model,
        tokenizer=tokenizer,
        num_workers=accelerator.state.num_processes * 4,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=terminators,
    )

    return pipeline

model_dir = "Bohanlu/Taigi-Llama-2-7B" # or Bohanlu/Taigi-Llama-2-13B for the 13B model
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)

accelerator = accelerate.Accelerator()
pipe = get_pipeline(model_dir, tokenizer, accelerator)

# Few-shot example: question answering
qa_prompt = """Example 1:
問題:台北101有偌懸?
答案:台北101的高度是五百空八公尺。

Example 2:
問題:台灣上長的溪仔是佗一條?
答案:台灣上長的溪仔是濁水溪,規个長度有百八公里遐爾長。

Example 3:
問題:臺灣上懸的山是啥物?
答案:"""

print(pipe(qa_prompt, return_full_text=False))
# Output: [{'generated_text': '臺灣上懸的山是玉山,海拔三千九百五十二公尺。'}]
# (English: "Taiwan's highest mountain is Yushan, 3,952 meters above sea level.")
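
The few-shot prompt above follows a regular pattern: numbered examples, each with a 問題 (question) line and a 答案 (answer) line, and a final example whose answer is left empty for the model to complete. A minimal sketch of a helper that assembles prompts in that layout is shown below. The name `build_few_shot_prompt` is hypothetical and not part of this repository; it only mirrors the format of `qa_prompt`.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Build a prompt in the same layout as qa_prompt.

    examples: (question, answer) pairs shown to the model as demonstrations.
    question: the new question; its answer line is left empty.
    """
    blocks = [
        f"Example {i}:\n問題:{q}\n答案:{a}"
        for i, (q, a) in enumerate(examples, start=1)
    ]
    # The final block ends right after the answer marker so the model continues from there.
    blocks.append(f"Example {len(examples) + 1}:\n問題:{question}\n答案:")
    return "\n\n".join(blocks)


examples = [
    ("台北101有偌懸?", "台北101的高度是五百空八公尺。"),
    ("台灣上長的溪仔是佗一條?", "台灣上長的溪仔是濁水溪,規个長度有百八公里遐爾長。"),
]
prompt = build_few_shot_prompt(examples, "臺灣上懸的山是啥物?")
print(prompt)
```

The resulting string can be passed to `pipe(prompt, return_full_text=False)` in the same way as `qa_prompt`.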

Citation

If you find the resources in the Taiwanese Hokkien LLM collection useful in your work, please cite it using the following reference:

@misc{lu2024enhancing,
      title={Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems}, 
      author={Bo-Han Lu and Yi-Hsuan Lin and En-Shiun Annie Lee and Richard Tzong-Han Tsai},
      year={2024},
      eprint={2403.12024},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}