Bohanlu committed on
Commit
f52a6d3
·
verified ·
1 Parent(s): 3d19ea9

Create README.md

Files changed (1)
  1. README.md +90 -0
README.md ADDED
@@ -0,0 +1,90 @@
+ ---
+ license: cc-by-nc-sa-4.0
+ ---
+ <p align="center">
+ <img src="https://github.com/lbh0830/TW-Hokkien-LLM/blob/main/pics/logo.jpg?raw=true" alt="Taigi-llama-logo" width="350">
+ </p>
+
+ # Model Card for Taigi-Llama-2-Translator-13B
+ The Taigi-Llama-2-Translator series is built on the Taigi-Llama-2 series of models. We fine-tuned the base models on 263k parallel examples to create translation models for Taiwanese Hokkien and related languages.
+
+ For more details, please refer to our [GitHub repository](https://github.com/lbh0830/TW-Hokkien-LLM/tree/main) and the paper [Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems](https://arxiv.org/abs/2403.12024).
+
+ Explore other models and datasets in the [Taiwanese Hokkien LLM collection](https://huggingface.co/collections/Bohanlu/taiwanese-hokkien-llm-6614ba7456e6789bc2f10ca0).
+
+ ## Model description
+
+ - **Base Model:** [Bohanlu/Taigi-Llama-2-13B](https://huggingface.co/Bohanlu/Taigi-Llama-2-13B)
+ - **Usage:** This model can be used for translating between Traditional Chinese or English and Taiwanese Hokkien (Hanzi, POJ, or Hanlo). It also supports translation between different scripts of Taiwanese Hokkien (Hanzi, POJ, Hanlo).
+ - **Language(s) (NLP):** Taiwanese Hokkien (Hanzi, POJ and Hanlo), Traditional Chinese and English
+ - **Input:** Text in source language
+ - **Output:** Text in target language
+ - **Model Size:** 13B parameters
+
+ ## Prompt Template
+ ```
+ {BOS}[TRANS]\n{source_sentence}\n[/TRANS]\n[{target_language}]\n
+ ```
+
+ - `source_sentence`: The sentence you want to translate.
+ - `target_language`: The code of the language to translate into. Use "ZH" for Traditional Chinese, "EN" for English, "POJ" for Taiwanese Hokkien POJ, "HL" for Taiwanese Hokkien Hanlo, and "HAN" for Taiwanese Hokkien Hanzi.
+ - Ensure the prompt ends with a newline, as shown in the template.
+
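The template and language codes above can be sketched as a small prompt builder. This is a minimal illustration, not part of the released code: `TARGET_LANGUAGES` and `build_prompt` are hypothetical names, and `{BOS}` is left out on the assumption that the tokenizer prepends it (the usage example in this README also omits it from its template string).

```python
# Hypothetical helper for illustration only: builds the prompt body (without BOS)
# and rejects target-language codes that the model card does not list.
TARGET_LANGUAGES = {
    "ZH": "Traditional Chinese",
    "EN": "English",
    "POJ": "Taiwanese Hokkien POJ",
    "HL": "Taiwanese Hokkien Hanlo",
    "HAN": "Taiwanese Hokkien Hanzi",
}

# Same template string as in the model card, minus the {BOS} placeholder.
PROMPT_TEMPLATE = "[TRANS]\n{source_sentence}\n[/TRANS]\n[{target_language}]\n"


def build_prompt(source_sentence: str, target_language: str) -> str:
    """Fill the template, normalising the language code to upper case."""
    code = target_language.strip().upper()
    if code not in TARGET_LANGUAGES:
        raise ValueError(
            f"Unknown target language {target_language!r}; "
            f"expected one of {sorted(TARGET_LANGUAGES)}"
        )
    return PROMPT_TEMPLATE.format(
        source_sentence=source_sentence.strip(), target_language=code
    )


prompt = build_prompt("How are you today?", "poj")
print(repr(prompt))
# prints '[TRANS]\nHow are you today?\n[/TRANS]\n[POJ]\n'
```

Validating the code before generation is a design choice of this sketch: an unlisted code would otherwise be passed to the model silently.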
+ ## Usage Example
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, TextGenerationPipeline
+ import torch
+ import accelerate
+
+ def get_pipeline(path: str, tokenizer: AutoTokenizer, accelerator: accelerate.Accelerator) -> TextGenerationPipeline:
+     # Load the model in half precision and let Accelerate place it on the available devices.
+     model = AutoModelForCausalLM.from_pretrained(
+         path, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True)
+
+     # Generation stops at either the EOS or the PAD token.
+     terminators = [tokenizer.eos_token_id, tokenizer.pad_token_id]
+
+     pipeline = TextGenerationPipeline(
+         model=model,
+         tokenizer=tokenizer,
+         num_workers=accelerator.state.num_processes * 4,
+         pad_token_id=tokenizer.pad_token_id,
+         eos_token_id=terminators,
+     )
+
+     return pipeline
+
+ model_dir = "Bohanlu/Taigi-Llama-2-Translator-7B" # or "Bohanlu/Taigi-Llama-2-Translator-13B" for the 13B model
+ tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
+
+ accelerator = accelerate.Accelerator()
+ pipe = get_pipeline(model_dir, tokenizer, accelerator)
+
+ PROMPT_TEMPLATE = "[TRANS]\n{source_sentence}\n[/TRANS]\n[{target_language}]\n"
+
+ def translate(source_sentence: str, target_language: str) -> str:
+     prompt = PROMPT_TEMPLATE.format(source_sentence=source_sentence, target_language=target_language)
+     out = pipe(prompt, return_full_text=False, repetition_penalty=1.1, do_sample=False)[0]['generated_text']
+     # Keep only the text before the first "[/" marker; if the marker is absent, keep everything.
+     return out.split("[/", 1)[0].strip()
+
+ source_sentence = "How are you today?"
+
+ print("To Hanzi: " + translate(source_sentence, "HAN"))
+ # >>> To Hanzi: 你今仔日好無?
+
+ print("To POJ: " + translate(source_sentence, "POJ"))
+ # >>> To POJ: Lí kin-á-ji̍t án-chóaⁿ?
+
+ print("To Traditional Chinese: " + translate(source_sentence, "ZH"))
+ # >>> To Traditional Chinese: 你今天好嗎?
+
+ print("To Hanlo: " + translate(source_sentence, "HL"))
+ # >>> To Hanlo: 你今仔日好無?
+ ```
+
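The usage example trims the generated text at the first `[/` marker. The sketch below isolates that post-processing step so it can be checked without loading the model. `extract_translation` is a hypothetical helper name, and the sample strings are made up for illustration rather than taken from real model output. It uses `str.partition` because `str.find` returns `-1` when the marker is missing, and slicing with `out[:-1]` would then silently drop the last character.

```python
def extract_translation(generated_text: str, marker: str = "[/") -> str:
    """Return the text before the first marker, or the whole text if the marker is absent."""
    head, _, _ = generated_text.partition(marker)
    return head.strip()


# Made-up sample strings (not real model output), used only to exercise the helper.
with_marker = "Li ho bo?\n[/POJ]\ntrailing text"
without_marker = "Li ho bo?"

print(extract_translation(with_marker))     # text before the marker
print(extract_translation(without_marker))  # unchanged, final character preserved

# For comparison, slicing at find() loses a character when the marker is absent:
print(without_marker[:without_marker.find("[/")])
```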
+ ## Citation
+
+ If you find the resources in the Taiwanese Hokkien LLM collection useful in your work, please cite it using the following reference:
+
+ ```
+ @misc{lu2024enhancing,
+       title={Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems},
+       author={Bo-Han Lu and Yi-Hsuan Lin and En-Shiun Annie Lee and Richard Tzong-Han Tsai},
+       year={2024},
+       eprint={2403.12024},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```