Ratchada-Fang-Thon-Whisper

Model Description

Ratchada-Fang-Thon-Whisper is a fine-tuned version of the Whisper model, specifically adapted for Thai speech recognition in financial contexts. This model is designed to transcribe Thai audio with high accuracy, particularly for financial terminology and discussions.

Whisper is a state-of-the-art transformer model that transcribes speech into text with high accuracy and low latency. We use Hugging Face's Whisper implementation to fine-tune the model on our own GPU infrastructure, using a varied custom dataset of audio recordings and transcripts.

We also monitor the training process and evaluate model performance with TensorBoard, a visualization tool for machine learning experiments.

Key Features

  • Specialized in Thai language transcription
  • Fine-tuned for financial domain vocabulary
  • Based on the Whisper medium model architecture
  • Supports long-form transcription

Model Details

  • Model Type: WhisperForConditionalGeneration
  • Language: Thai
  • Task: Automatic Speech Recognition (ASR)
  • License: MIT

Usage

Standard Pipeline (Recommended)

You can use this model with the standard Transformers pipeline:

import torch
from transformers import pipeline

# Use the first GPU if one is available, otherwise fall back to the CPU
device = 0 if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper",
    device=device,
    generate_kwargs={"language": "th", "task": "transcribe"}
)

result = pipe("path/to/audio/file.wav")  # path to an audio file, or a NumPy array holding the waveform
print(result["text"])

Note: Audio input should be sampled at 16 kHz (sample_rate=16_000) before it is passed to the model.
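As a quick way to act on the note above, the following minimal sketch reads the sample rate from a WAV header using only Python's standard library and reports whether the file needs resampling. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this model or of Transformers; the sketch also assumes uncompressed PCM WAV input, which is what the `wave` module can read.

```python
import wave

EXPECTED_RATE = 16_000  # sample rate recommended in the note above


def wav_sample_rate(path: str) -> int:
    """Return the sample rate stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, expected_rate: int = EXPECTED_RATE) -> bool:
    """Return True when the file's sample rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling` returns True, one option is to load the file with `librosa.load(path, sr=16000)`, which resamples while loading, and pass the resulting array to the pipeline.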

Transformers Directly

You can also use this model directly through the Transformers model classes:

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

processor = AutoProcessor.from_pretrained("ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper")
model = AutoModelForSpeechSeq2Seq.from_pretrained("ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper").to(device)

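# waveform is a NumPy array obtained from an audio library such as librosa or torchaudio.
# One option (requires `pip install librosa`): load as mono and resample to 16 kHz on load.
import librosa

waveform, _ = librosa.load("path/to/audio/file.wav", sr=16000)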

input_features = processor(waveform.squeeze(), sampling_rate=16000, return_tensors="pt").input_features.to(device)

with torch.no_grad():
    predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]  # first item in the batch

from ratchada_processor import tokenize_text  # post-processor, strongly recommended

processed_text = tokenize_text(transcription)  # split the text into components and post-process them (see GitHub)

result = "".join(processed_text)

print(result)

Note: With this method you must apply the post-processor to the model output yourself. The post-processor is available in this library on PyPI:

python3 -m pip install ratchada-util
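The Key Features list mentions long-form transcription. Whisper processes audio in windows of about 30 seconds, so longer recordings are usually cut into overlapping windows that are transcribed one by one. The sketch below only illustrates that window arithmetic; `chunk_bounds` is a hypothetical helper, the 5-second overlap is an illustrative choice, and this is not the exact algorithm used by Transformers or by this model.

```python
from typing import List, Tuple


def chunk_bounds(
    num_samples: int,
    sample_rate: int = 16_000,
    chunk_s: float = 30.0,
    overlap_s: float = 5.0,  # illustrative value, not a library default
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")

    bounds: List[Tuple[int, int]] = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:  # the last window reached the end of the audio
            break
        start += step
    return bounds


if __name__ == "__main__":
    # A 70-second recording at 16 kHz is covered by three overlapping windows.
    for start, end in chunk_bounds(70 * 16_000):
        print(start / 16_000, end / 16_000)
```

In practice the Transformers speech recognition pipeline can handle the chunking itself when called with the `chunk_length_s` argument, for example `pipe("path/to/audio/file.wav", chunk_length_s=30)`.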

Training

Training Data

This model was fine-tuned on a proprietary dataset: ThinkingMachinesDataScience/Ratchada-STT. The dataset contains Thai speech audio from financial contexts.

Training Procedure

The model was fine-tuned from the biodatlab/whisper-th-medium-combined checkpoint, which is a Thai-specific version of the Whisper medium model. After each model prediction, post-processing code is applied to refine the results.

Limitations and Bias

  1. The model is specifically trained on Thai financial audio data and may not perform as well on general Thai speech or other domains.
  2. There might be biases present in the training data, which could affect the model's performance on certain types of speech or accents.

Evaluation Results

Using our own evaluation algorithm, the models perform as follows:

  • Lower is better

Model             WER       CER (jiwer)  Deletions  Substitutions  Insertions
RATFT-WHISPER     0.332685  0.272674     1884       1806           5466
WHISPER-LARGE-V3  0.392162  0.318666     2499       1489           6752
THON-WHISPER      0.474360  0.405920     1722       2603           8597
WHISPER-LARGE     0.593637  0.578926     5441       1500           9433
WHISPER-LARGE-V2  0.595292  0.652592     4924       1866           9580
WHISPER-MEDIUM    0.643084  0.66565      7471       1312           9090
WHISPER-SMALL     0.667453  0.603361     4397       1817           12028
WHISPER-BASE      0.791954  0.73896      3362       1906           16252

Note: CER is computed with jiwer, a library for evaluating automatic speech recognition systems.
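To make the columns above concrete, here is a minimal pure-Python sketch of how substitutions, deletions, and insertions relate to an error rate: all three come from a minimum edit-distance alignment, and the rate is their sum divided by the reference length. It is not the evaluation algorithm used for the table and it is not jiwer; `edit_counts` and `error_rate` are hypothetical names. For WER the inputs are lists of words, which for Thai assumes the text has already been segmented because Thai is written without spaces between words. For CER the inputs are lists of characters.

```python
from typing import Sequence, Tuple


def edit_counts(reference: Sequence[str], hypothesis: Sequence[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) of one minimal alignment."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    # cost[i][j] = (total, subs, dels, ins) for turning reference[:i] into hypothesis[:j]
    cost = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # delete every reference token
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # insert every hypothesis token
    for i in range(1, rows):
        for j in range(1, cols):
            if reference[i - 1] == hypothesis[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # match, no edit needed
                continue
            t, s, d, n = cost[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare by total edits first, so this keeps a minimal alignment
            cost[i][j] = min(substitution, deletion, insertion)
    _, subs, dels, ins = cost[-1][-1]
    return subs, dels, ins


def error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """(S + D + I) / N, where N is the number of reference tokens."""
    if not reference:
        raise ValueError("reference must not be empty")
    subs, dels, ins = edit_counts(reference, hypothesis)
    return (subs + dels + ins) / len(reference)


if __name__ == "__main__":
    # WER on space-separated words; CER on the characters of a Thai word
    print(error_rate("the cat sat".split(), "the cat sat down".split()))
    print(error_rate(list("ธนาคาร"), list("ทนาคาร")))
```

Because insertions count against the reference length, an error rate computed this way can exceed 1.0 when the hypothesis is much longer than the reference.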

Ethical Considerations

Users should be aware that this model is designed for transcribing Thai speech in financial contexts. It should not be used for making financial decisions without human verification. Always cross-check important financial information obtained from this model.

Citations

If you use this model in your research, please cite:

@misc{Ratchada-Fang-Thon-Whisper,
  author = {ThinkingMachinesDataScience},
  title = {Ratchada-Fang-Thon-Whisper: Thai Financial Speech Recognition Model},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://huggingface.co/ThinkingMachinesDataScience/Ratchada-Fang-Thon-Whisper}}
}

Contacts

For questions and feedback about this model, please contact us through the ThinkingMachinesDataScience GitHub repository for this project.
