---
license: apache-2.0
language:
  - en
  - sw
base_model:
  - Helsinki-NLP/opus-mt-en-swc
pipeline_tag: translation
tags:
  - translation
  - Swahili
  - subtitle
---

English-Swahili Subtitle Translator (v1)

A specialized machine translation model for translating English subtitle content into Swahili, preserving timing formats and handling common subtitle annotations.

Model Details

  • Architecture: MarianMT (Transformer-based)
  • Training Data: 500k subtitle pairs + general domain text
  • Max Sequence Length: 512 tokens
  • Special Features:
    • Preserves timestamps (00:01:23,456 --> 00:01:25,678)
    • Handles subtitle annotations ([MUSIC], (OFF-SCREEN))
    • Context-aware translation for dialogue continuity

First, install the requirements:

pip install transformers sentencepiece pysrt

Basic Translation

from transformers import MarianMTModel, MarianTokenizer

# Load the fine-tuned model and tokenizer
model = MarianMTModel.from_pretrained("ngosha/English-Swahili-Subs-Translator-v1")
tokenizer = MarianTokenizer.from_pretrained("ngosha/English-Swahili-Subs-Translator-v1")

def translate_subtitle(text):
    # Tokenize the line, generate a translation, and decode it back to text
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Example usage

print(translate_subtitle("We need to hurry, the show starts in 5 minutes!"))
# Output: "Tunahitaji kufanya haraka, kipindi kinaanza baada ya dakika 5!"
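
For longer files it is often faster to translate several lines in a single generate call. A minimal batched variant (the translate_batch helper below is illustrative, not part of the released code):

def translate_batch(texts):
    # Tokenize all lines together; padding aligns them to the longest line
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(translate_batch(["Open the door.", "We need to hurry!"]))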

Full Subtitle File Processing

import pysrt

def translate_srt_file(input_path, output_path):
    subs = pysrt.open(input_path)

    for sub in subs:
        # pysrt keeps timing in sub.start / sub.end, so the original
        # timestamps are preserved automatically when the file is saved
        # Join multi-line cues into a single line before translating
        clean_text = ' '.join(sub.text.split('\n'))
        sub.text = translate_subtitle(clean_text)

    subs.save(output_path, encoding='utf-8')

# Usage
translate_srt_file("episode.srt", "translated_episode.srt")

Best Practices

1. Preserve Formatting:

Keep annotations like [MUSIC] unchanged

  text = "[INTENSE MUSIC] The final battle begins"
  translation = "[INTENSE MUSIC] Vita vya mwisho vyaanza"
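
One way to guarantee annotations pass through unchanged is to strip a leading tag before translation and re-attach it afterwards. A rough sketch (the regex and helper name are illustrative, not part of the released code):

import re

# Matches a leading [TAG] or (TAG) style annotation
ANNOTATION = re.compile(r'^\s*(\[[^\]]+\]|\([^)]+\))\s*')

def translate_with_annotation(text):
    match = ANNOTATION.match(text)
    if match:
        tag = match.group(1)
        rest = text[match.end():]
        # Translate only the dialogue, then re-attach the untouched tag
        return f"{tag} {translate_subtitle(rest)}" if rest else tag
    return translate_subtitle(text)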

2. Handle Line Breaks:

Split long lines into subtitle-friendly chunks, wrapping at word boundaries:

import textwrap

def split_subtitle(text, max_length=42):
    # textwrap keeps whole words together instead of cutting mid-word
    return '\n'.join(textwrap.wrap(text, width=max_length))
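
Typical usage, applied to the translated text before it is written back into a cue:

print(split_subtitle("Tunahitaji kufanya haraka, kipindi kinaanza baada ya dakika 5!"))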

3. Context Window:

Keep a short rolling window of recent lines as context:

from collections import deque

# Rolling window: the previous line plus the current one
context = deque(maxlen=2)

def contextual_translate(text):
    context.append(text)
    # Note: the returned translation covers the whole window, not just the newest line
    return translate_subtitle(' '.join(context))
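
Applied to consecutive cues, for example:

for line in ["Who is at the door?", "It is your brother.", "Tell him to wait."]:
    print(contextual_translate(line))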

Limitations

  • Max 3 lines per subtitle segment
  • Best performance on conversational text
  • May require post-editing for:
    • Cultural references
    • Idiomatic expressions
    • Proper nouns and transliteration

Ethical Considerations

  • May reflect biases in training data
  • Cultural nuances in Swahili dialects:
    • Prefer Tanzanian variants
    • Avoid regional slang
  • Always human-validate translations for sensitive content

Fine-tuned by Ngosha