Tags: Text-to-Audio · Transformers · Safetensors · Divehi · csm · dhivehi-tts


CSM-1B Dhivehi

Multispeaker Dhivehi speech generation model based on sesame/csm-1b, fine-tuned on synthetic male and female Dhivehi voice data.

Usage

import torch
from transformers import CsmForConditionalGeneration, AutoProcessor

model_id = "alakxender/csm-1b-dhivehi-5-spk-gd"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Set speaker and input Dhivehi text
role = "0"  # "0" for female, "1" for male
content = "މެލޭޝިއާގައި އިތުރުކުރާ ޓެކްސް، ދިވެހި ދަރިވަރުންނަށް ބުރައަކަށް ނުވާނެ ގޮތެއް ހޯދައިދޭނަން: ހައިދަރު"

conversation = [
    {"role": role, "content": [{"type": "text", "text": content}]}
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True
).to(device)

# Generate audio
audio = model.generate(**inputs, output_audio=True)

# Save to file
processor.save_audio(audio, f"output_{role}.wav")
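
The same pattern extends to the other speaker IDs listed in the Dataset Overview below. A minimal sketch reusing the model, processor, content, and device from above; it assumes every ID from "0" to "5" seen during fine-tuning is accepted as a role value, although only "0" and "1" are documented here:

# Generate the same sentence with each fine-tuned speaker ID ("0"-"5" assumed)
for speaker in ["0", "1", "2", "3", "4", "5"]:
    conversation = [
        {"role": speaker, "content": [{"type": "text", "text": content}]}
    ]
    inputs = processor.apply_chat_template(
        conversation,
        tokenize=True,
        return_dict=True
    ).to(device)
    audio = model.generate(**inputs, output_audio=True)
    processor.save_audio(audio, f"output_speaker_{speaker}.wav")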

More usage info at: sesame/csm-1b

Training Details

  • Epochs: 3
  • Global Steps: 24,408
  • Training Loss: 0.89
  • Final Loss: 3.35
  • Gradient Norm: 3.31
  • Learning Rate: ~8.38e-7
  • FLOPs: 436,376,769,022,130,240
  • Runtime: 4.59 hours
  • Samples/sec: 11.83
  • Steps/sec: 1.48

Dataset Overview

alakxender/dv_syn_speech_md

  • Synthetic TTS dataset with aligned Dhivehi text and audio
  • Six distinct speaker IDs (see the loading sketch after this list):
    • "0": Female synthetic voice
    • "1": Male synthetic voice
    • "2": Female synthetic voice
    • "3": Male synthetic voice
    • "4": Female voice (unknown)
    • "5": Male voice (unknown)

Notes

  • The model is suitable for Dhivehi TTS tasks with controllable speaker voice.
  • Speaker identity is selected via the role field in the chat input template.
  • This setup allows simple voice switching without changing the architecture; a context-conditioned sketch follows these notes.
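
Because speaker identity rides on the role field, earlier turns can also be passed as context so generation is conditioned on reference audio in the chosen voice. A sketch following the upstream sesame/csm-1b chat-template pattern and reusing the objects from the Usage section; the reference clip, its transcript, and the assumption that the template accepts audio turns via a "path" entry are all unverified placeholders:

# Context turn: Dhivehi transcript paired with its 24 kHz reference clip (placeholders)
conversation = [
    {
        "role": "0",
        "content": [
            {"type": "text", "text": "Transcript of the reference clip (placeholder)"},
            {"type": "audio", "path": "speaker0_reference.wav"},
        ],
    },
    # Final turn: text only, generated in speaker "0"'s voice
    {"role": "0", "content": [{"type": "text", "text": content}]},
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True
).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "output_with_context.wav")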

Disclaimer

This fine-tuned checkpoint was created for Dhivehi speech synthesis and is intended for research and educational use only. All voice outputs generated by this model are entirely synthetic. Any resemblance to real persons, living or deceased, is purely coincidental and unintentional. The creators of this model do not endorse or condone the use of this system for:

  • Impersonation or deepfake purposes
  • Deceptive content generation
  • Harassment, misinformation, or manipulation

Model size: 1.63B parameters (F32, Safetensors)