license: apache-2.0
language:
  - th
  - en
pipeline_tag: text-generation
library_name: transformers
tags:
  - chat
  - audio

Pathumma-Audio

Model Description

Pathumma-llm-audio-1.0.0 is an 8-billion-parameter Thai large language model designed for audio understanding tasks. The model accepts several types of audio input, including speech, general audio, and music, and converts them into text.

Model Architecture

The model combines two key components:

Quickstart

To load the model and generate responses using the Hugging Face Transformers library, follow the steps below.

1. Install the required dependencies:

Make sure you have the necessary libraries installed by running:

pip install librosa torch transformers peft

2. Load the model and generate a response:

You can load the model and use it to generate a response with the following code snippet:

import librosa
import torch
from transformers import AutoModel

# Use the GPU with bfloat16 when available; otherwise fall back to CPU with float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

# trust_remote_code=True runs the custom model code shipped in the model repository.
model = AutoModel.from_pretrained(
    "nectec/Pathumma-llm-audio-1.0.0",
    torch_dtype=torch_dtype,
    lora_infer_mode=True,
    init_from_scratch=True,
    trust_remote_code=True,
)
model = model.to(device)
model.eval()

prompt = "ถอดเสียงเป็นข้อความ"  # Thai for "Transcribe the speech into text"
audio_path = "audio_path.wav"
# Load the audio as a mono waveform resampled to 16 kHz.
audio, sr = librosa.load(audio_path, sr=16000)

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    response = model.generate(
        raw_wave=audio,
        prompts=prompt,
        device=device,
        max_new_tokens=200,
        repetition_penalty=1.0,
    )
print(response[0])
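
If you want to ask several questions about the same clip, the call above can be wrapped in a small loop. The following is a minimal sketch, not part of the model's API: `run_prompts` is a hypothetical helper name, and it assumes only that `model.generate` accepts the keyword arguments used in the snippet above and returns a sequence whose first element is the response text.

```python
from typing import Any, Dict, Sequence


def run_prompts(
    model: Any,
    audio: Any,
    prompts: Sequence[str],
    device: str,
    **generate_kwargs: Any,
) -> Dict[str, str]:
    """Run each prompt against the same waveform and collect the responses.

    Hypothetical helper: it calls model.generate with the same keyword
    arguments as the quickstart snippet and keeps the first returned item.
    """
    responses: Dict[str, str] = {}
    for prompt in prompts:
        output = model.generate(
            raw_wave=audio,
            prompts=prompt,
            device=device,
            **generate_kwargs,
        )
        responses[prompt] = output[0]
    return responses
```

With the `model`, `audio`, `prompt`, and `device` from the snippet above, a call could look like `run_prompts(model, audio, [prompt], device, max_new_tokens=200)`, ideally inside a `torch.no_grad()` block as before.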

Evaluation Performance

| Model | ASR-th CV18 Th (WER↓) | ASR-en CV18 En (WER↓) | ASR-en Librispeech En (WER↓) | ThaiSER Emotion (Acc↑, F1↑) | ThaiSER Gender (Acc↑, F1↑) |
|---|---|---|---|---|---|
| Typhoon-Audio-Preview | 13.26 | 13.34 (partial result) | 5.07 (partial result) | 41.50, 33.48 | 96.20, 96.69 |
| DIVA | 69.15 (partial result) | 37.40 | 49.06 | 18.64, 8.16 | 47.50, 35.90 |
| Gemini-1.5-Pro | 16.49 | 12.94 | 25.83 | 26.00, 18.26 | 79.66, 77.32 |
| Pathumma-llm-audio-1.0.0 | 12.03 | 12.20 | 11.36 | 42.30, 36.88 | 90.30, 92.07 |
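
For readers unfamiliar with the WER columns, word error rate is the word-level edit distance between a reference transcript and the model's transcript, divided by the number of reference words. The sketch below is an illustration of that standard definition, not the evaluation script used for the table. It assumes whitespace-separated words; Thai is written without spaces between words, so Thai text would need a word segmenter first, and this card does not say which one was used. The function name `word_error_rate` is our own.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Assumes both strings are already split into words by whitespace.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            )
        previous = current
    return previous[-1] / len(ref_words)
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" gives one substitution and one deletion over six reference words, so the function returns 2/6. Multiply by 100 to express the result as a percentage.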

Limitations and Future Work

At present, our model remains in the experimental research phase and is not yet fully suitable for practical use as an assistant. Future work will focus on upgrading the language model to a newer version, Pathumma-llm-text-1.0.0, and on curating more refined and robust datasets to improve performance. We also aim to prioritize the safety and reliability of the model's outputs.

Acknowledgements

We are grateful to ThaiSC, also known as the NSTDA Supercomputer Centre, for providing access to the LANTA supercomputer, which was used for model training and fine-tuning. We would also like to thank the SALMONN team for making their code publicly available, and Typhoon Audio at SCB 10X for releasing their Hugging Face project, source code, and technical paper, which served as a valuable guide for us. Many other open-source projects have contributed valuable information, code, data, and model weights; we are grateful to them all.

Pathumma Audio Team

Pattara Tipaksorn, Wayupuk Sommuang, Oatsada Chatthong, Kwanchiva Thangthai

Citation

@misc{tipaksorn2024PathummaAudio,
    title        = { {Pathumma-Audio} },
    author       = { Pattara Tipaksorn and Wayupuk Sommuang and Oatsada Chatthong and Kwanchiva Thangthai },
    url          = { https://huggingface.co/nectec/Pathumma-llm-audio-1.0.0 },
    publisher    = { Hugging Face },
    year         = { 2024 },
}