English-Swahili Subtitle Translator (v1)
A specialized machine translation model for translating subtitle content from English to Swahili, preserving timing formats and handling common subtitle annotations.
Model Details
- Architecture: MarianMT (Transformer-based)
- Training Data: 500k subtitle pairs + general domain text
- Max Sequence Length: 512 tokens
- Special Features:
  - Preserves timestamps (00:01:23,456 --> 00:01:25,678)
  - Handles subtitle annotations ([MUSIC], (OFF-SCREEN))
  - Context-aware translation for dialogue continuity
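The timestamp format listed above can be checked or converted with a small helper. The following is a minimal sketch using only the standard library; the names `TIMING_RE` and `parse_timing` are hypothetical and are not part of the model or of pysrt.

```python
import re

# Matches an SRT timing line such as "00:01:23,456 --> 00:01:25,678".
TIMING_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def parse_timing(line):
    """Return (start_ms, end_ms) for an SRT timing line, or None if it does not match."""
    match = TIMING_RE.fullmatch(line.strip())
    if match is None:
        return None
    h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
    start = ((h1 * 60 + m1) * 60 + s1) * 1000 + ms1
    end = ((h2 * 60 + m2) * 60 + s2) * 1000 + ms2
    return start, end

print(parse_timing("00:01:23,456 --> 00:01:25,678"))  # (83456, 85678)
```

A helper like this can be used to confirm that timing lines are identical before and after translation.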
First install requirements:
pip install transformers sentencepiece pysrt
## Basic Translation
from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model = MarianMTModel.from_pretrained("ngosha/English-Swahili-Subs-Translator-v1")
tokenizer = MarianTokenizer.from_pretrained("ngosha/English-Swahili-Subs-Translator-v1")

def translate_subtitle(text):
    # Tokenize one subtitle; input beyond the model's maximum length is truncated
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
print(translate_subtitle("We need to hurry, the show starts in 5 minutes!"))
# Output: "Tunahitaji kufanya haraka, kipindi kinaanza baada ya dakika 5!"
## Full Subtitle File Processing
import pysrt

def translate_srt_file(input_path, output_path):
    subs = pysrt.open(input_path)
    for sub in subs:
        # Timestamps live in sub.start and sub.end; pysrt writes them back on save,
        # so only the text is replaced (adding them to sub.text would duplicate them)
        # Clean and translate text
        clean_text = ' '.join([line for line in sub.text.split('\n') if '-->' not in line])
        sub.text = translate_subtitle(clean_text)
    subs.save(output_path)

# Usage
translate_srt_file("episode.srt", "translated_episode.srt")
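The loop above calls the model once per subtitle. For longer files, translating several subtitles per `generate` call is a common way to reduce overhead. The sketch below assumes the `model` and `tokenizer` objects loaded earlier are passed in; the helper names `batched` and `translate_batch` and the default `batch_size=16` are illustrative assumptions, not settings recommended by the model author.

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def translate_batch(texts, model, tokenizer, batch_size=16):
    """Translate a list of strings, several per generate() call."""
    results = []
    for chunk in batched(texts, batch_size):
        # Padding lets subtitles of different lengths share one tensor
        inputs = tokenizer(chunk, return_tensors="pt", padding=True, truncation=True)
        outputs = model.generate(**inputs)
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results
```

With this helper, the texts of all subtitles can be collected first, translated with `translate_batch(texts, model, tokenizer)`, and then assigned back to each `sub.text` in order.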
Best Practices
1. Preserve Formatting:
Keep annotations like [MUSIC] unchanged
text = "[INTENSE MUSIC] The final battle begins"
translation = "[INTENSE MUSIC] Vita vya mwisho vyaanza"
2. Handle Line Breaks:
Split long lines into subtitle-friendly chunks
import textwrap
def split_subtitle(text, max_length=42):
    # Wrap at word boundaries so that words are not cut in half
    return textwrap.fill(text, width=max_length)
3. Context Window:
Use previous 2 lines as context
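context = []
def contextual_translate(text):
    context.append(text)
    # Keep the current line plus the previous 2 lines
    if len(context) > 3:
        context.pop(0)
    # Note: the returned translation covers the whole window, not only `text`
    return translate_subtitle(' '.join(context))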
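For the first practice (keeping annotations unchanged), one option is to separate annotations from dialogue before translation so they never reach the model. This is a minimal sketch under that assumption; `ANNOTATION_RE` and `translate_keeping_annotations` are hypothetical names, and `translate` stands for any function that maps a string to its translation, such as `translate_subtitle`. Note that the pattern treats every parenthesised span as an annotation, which may not suit dialogue that contains ordinary parentheses.

```python
import re

# Bracketed or parenthesised annotations such as [MUSIC] or (OFF-SCREEN).
ANNOTATION_RE = re.compile(r"(\[[^\]]*\]|\([^)]*\))")

def translate_keeping_annotations(text, translate):
    """Translate only the dialogue parts of `text`; copy annotations through verbatim."""
    out = []
    # Splitting on a capturing group keeps the annotations in the result list
    for part in ANNOTATION_RE.split(text):
        if ANNOTATION_RE.fullmatch(part):
            out.append(part)                     # annotation: keep as-is
        elif part.strip():
            out.append(translate(part.strip()))  # dialogue: translate
    return " ".join(out)
```

Usage would look like `translate_keeping_annotations("[INTENSE MUSIC] The final battle begins", translate_subtitle)`.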
Limitations
- Max 3 lines per subtitle segment
- Best performance on conversational text
- May require post-editing for:
  - Cultural references
  - Idiomatic expressions
  - Proper noun pronunciations
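The line limit above can be checked automatically before a file is published. The sketch below flags segments with more than 3 lines and, borrowing the 42-character default from `split_subtitle`, lines that are too long. The function name `check_segment` and its parameters are hypothetical.

```python
def check_segment(text, max_lines=3, max_chars=42):
    """Return a list of human-readable problems found in one subtitle segment."""
    problems = []
    lines = text.split("\n")
    if len(lines) > max_lines:
        problems.append(f"{len(lines)} lines (limit {max_lines})")
    for number, line in enumerate(lines, start=1):
        if len(line) > max_chars:
            problems.append(f"line {number} has {len(line)} characters (limit {max_chars})")
    return problems

print(check_segment("a\nb\nc\nd"))  # ['4 lines (limit 3)']
```

An empty list means the segment passed both checks; anything else is a candidate for post-editing.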
Ethical Considerations
- May reflect biases in training data
- Cultural nuances in Swahili dialects:
  - Prefer Tanzanian variants
  - Avoid regional slang
- Always human-validate translations for sensitive content
Finetuned by Emmanuel Minga (0755652681)
Base model: Helsinki-NLP/opus-mt-en-swc