---
library_name: transformers
license: llama3.1
language:
- th
- en
pipeline_tag: text-generation
---

# Typhoon2-Audio

<div align="center">
<img src="https://storage.googleapis.com/typhoon-public/assets/typhoon2-audio/typhoon2_audio.png" alt="Typhoon2-Audio" width="20%" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
</div>

**Typhoon2-Audio** is an end-to-end speech-to-speech model architecture capable of processing audio, speech, and text inputs and generating both text and speech outputs simultaneously. It is optimized specifically for Thai, but it also supports English.

- **GitHub**: https://github.com/scb-10x/typhoon2-audio/
- **Demo**: https://audio.opentyphoon.ai/
- **Paper**: https://arxiv.org/abs/2412.13702

## Model Description

- **Model type**: End-to-end speech-to-speech model; the LLM backbone is based on Typhoon2.
- **Requirements**: Python 3.10, transformers==4.45.2, fairseq==0.12.2, and flash-attn (matching the pinned versions under Installation below)
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **License-Speech-Input & LLM**: [Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)
- **License-Speech-Output**: [CC-BY-NC](https://creativecommons.org/licenses/by-nc/4.0/)

## Installation

```bash
pip install pip==24.0
pip install transformers==4.45.2
pip install fairseq==0.12.2 # fairseq requires pip==24.0 to install and only works on Python 3.10
pip install flash-attn
```
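
As a quick sanity check after installation, you can confirm that the pinned dependencies import cleanly (a minimal sketch; it only verifies versions, not GPU availability):

```python
# Optional sanity check: confirm the pinned dependencies are importable.
import torch
import transformers
import fairseq

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("fairseq:", fairseq.__version__)
```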

## Usage

### Load Model
```python
import torch
from transformers import AutoModel
model = AutoModel.from_pretrained(
    "scb10x/llama3.1-typhoon2-audio-8b-instruct",
    torch_dtype=torch.float16, 
    trust_remote_code=True
)
model.to("cuda")
```
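
If a single GPU cannot hold the fp16 weights, the standard `transformers` loading options also apply. The sketch below uses `device_map="auto"` (which requires the `accelerate` package) instead of an explicit `model.to("cuda")`; whether sharded placement works smoothly with this remote-code model is an assumption, so treat it as a starting point rather than the recommended path.

```python
# Alternative loading sketch (assumes the `accelerate` package is installed).
# device_map="auto" lets transformers place weights across available devices
# instead of calling model.to("cuda") manually.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "scb10x/llama3.1-typhoon2-audio-8b-instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
```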

### Inference - Single turn example
```python
conversation = [
    {"role": "system", "content": "You are a helpful female assistant named ไต้ฝุ่น."},
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio_url": "examples/tmp-2860cd0a094b64043226167340af03a3.wav",
            },
            {"type": "text", "text": "Transcribe this audio"},
        ],
    },
]
x = model.generate(
    conversation=conversation,
    max_new_tokens=500,
    do_sample=True,
    num_beams=1,
    top_p=0.9,
    repetition_penalty=1.0,
    length_penalty=1.0,
    temperature=0.7,
)
# x => x['text'] (text), x['audio'] (numpy array)
# to save the audio output
# import soundfile as sf
# sf.write("examples/speechout.wav", x["audio"]["array"], x["audio"]["sampling_rate"])
```

### Inference - Multi turn example
```python
conversation_multi_turn = [
    {
        "role": "system",
        "content": "You are a helpful female assistant named ไต้ฝุ่น. Respond conversationally to the speech provided in the language it is spoken in.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio_url": "examples/tmp-2860cd0a094b64043226167340af03a3.wav",
                # บอกชื่อเมืองใหญ่ๆในอเมริกามาให้หน่อยสิ -- "List some names of US cities"
            },
            {
                "type": "text",
                "text": "",
            },
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "โอเคค่ะ, ฉันจะบอกชื่อเมืองใหญ่ๆ ในอเมริกาให้คุณฟัง:\n\n1. นิวยอร์ก\n2. ลอสแอนเจลิส\n3. ชิคาโก\n4. ฮิวสตัน\n5. ฟิลาเดลเฟีย\n6. บอสตัน\n7. ซานฟรานซิสโก\n8. วอชิงตัน ดี.ซี. (Washington D.C.)\n9. แอตแลนต้า\n10. ซีแอตเทิล\n\nถ้าคุณต้องการข้อมูลเพิ่มเติมหรือมีคำถามอื่นๆ กรุณาถามได้เลยค่ะ'",
            },
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio_url": "examples/tmp-2284cd76e1c875525ff75327a2fc3610.wav",
                # แล้วถ้าเป็นประเทศอังกฤษล่ะ -- "How about the UK"

            },
        ],
    },
]
x = model.generate(conversation=conversation_multi_turn)
# x => x['text'] (text), x['audio'] (numpy array)
# to save the audio output
# import soundfile as sf
# sf.write("examples/speechout.wav", x["audio"]["array"], x["audio"]["sampling_rate"])
```

### TTS functionality
```python
y = model.synthesize_speech("Hello, my name is ไต้ฝุ่น. I am a language model specialized in Thai.")
# y => numpy array
```
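
To save the synthesized speech, the `soundfile` pattern from the earlier examples can be reused. The card only states that `synthesize_speech` returns a numpy array, so the structure assumed below (either a raw waveform or a dict mirroring `x["audio"]` with `"array"` and `"sampling_rate"`) is an assumption:

```python
import soundfile as sf

# Hedged sketch: handle both a dict like x["audio"] and a raw waveform array.
if isinstance(y, dict):
    sf.write("examples/tts_out.wav", y["array"], y["sampling_rate"])
else:
    sf.write("examples/tts_out.wav", y, 16000)  # 16 kHz sampling rate is an assumption
```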

## Evaluation Results

- **1) Audio and Speech Understanding**

| Model                       | ASR-en (WER↓)      | ASR-th (WER↓) | En2Th (BLEU↑) | X2Th (BLEU↑) | Th2En (BLEU↑) |
|:----------------------------|:-------------------|:--------------|:--------------|:-------------|:--------------|
| SALMONN-13B                 | 5.79      | 98.07         | 0.07         | 0.10        | 14.97        |
| DiVA-8B                     | 30.28     | 65.21         | 9.82         | 5.31        | 7.97         |
| Gemini-1.5-pro-001          | 5.98      | 13.56         | 20.69        | 13.52       | 22.54        |
| Typhoon-Audio               | 8.72      | 14.17         | 17.52        | 10.67       | 24.14        |
| Typhoon2-Audio              | 5.83      | 14.04         | 27.15        | 15.93       | 33.25        |

| Model                          | Gender-th (Acc) | SpokenQA-th (F1)   | SpeechInstruct-(en,th) |
|:-------------------------------|:---------------|:-------------------|:-------------------|
| SALMONN-13B                   |     93.26       |    2.95     |        2.47, 1.18         |
| DiVA-8B                       |     50.12       |    15.13    |        6.81, 2.68         |
| Gemini-1.5-pro-001            |     81.32       |    62.10    |        3.24, 3.93         |
| Typhoon-Audio                 |     93.74       |    64.60    |        5.62, 6.11         |
| Typhoon2-Audio                |     75.65       |    70.01    |        6.00, 6.79         |

- **2) Speech-to-Speech Evaluation**

- 2.1) *Content Generation*


| Model                         | SpeechIF(En)-Quality | SpeechIF(En)-Style   | SpeechIF(Th)-Quality | SpeechIF(Th)-Style   |
|:------------------------------|:---------------|:-------------------|:-------------------|:-------------------|
| Llama-Omni                    |     5.15       |    5.79    |        1.71      |     2.14         |
| GPT-4o-Audio                  |     6.82       |    7.86    |        6.66      |     8.07         |
| Typhoon2-Audio                |     4.92       |    5.39    |        7.19      |     8.04         |

- 2.2) *Speech Quality*

| Model                         | SpeechIF(En)-CER | SpeechIF(En)-UTMOS   | SpeechIF(Th)-CER | SpeechIF(Th)-UTMOS   |
|:------------------------------|:---------------|:-------------------|:-------------------|:-------------------|
| Llama-Omni*                   |     3.40       |    3.93    |        6.30      |     3.93         |
| GPT-4o-Audio                  |     3.20       |    3.65    |        8.05      |     3.46         |
| Typhoon2-Audio                |     26.50      |    2.29    |        8.67      |     2.35         |

*Note that Llama-Omni does not generate Thai text/speech; its low CER and high UTMOS on SpeechIF(Th) reflect the fact that its outputs are in English.
  
## Intended Uses & Limitations
This model is experimental and may not always follow human instructions accurately, which makes it prone to hallucinations. It also lacks moderation mechanisms, so it may produce harmful or inappropriate responses. Developers should carefully assess these risks for their specific applications.

## Follow us & Support
- https://twitter.com/opentyphoon
- https://discord.gg/CqyBscMFpg

## Acknowledgements
We would like to thank the SALMONN and Llama-Omni teams for open-sourcing their code and data, and the Biomedical and Data Lab at Mahidol University for releasing the fine-tuned Whisper whose encoder we adopted. We are also grateful to the many other open-source projects that have shared knowledge, data, code, and model weights.

## Typhoon Team
Potsawee Manakul, Warit Sirichotedumrong, Kunat Pipatanakul, Pittawat Taveekitworachai, Natapong Nitarach, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, Kasima Tharnpipitchai

## Citation

- If you find Typhoon2 useful for your work, please cite it using:
```
@misc{typhoon2,
      title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models}, 
      author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
      year={2024},
      eprint={2412.13702},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13702}, 
}
```