---
license: apache-2.0
language:
- en
- sw
base_model:
- Helsinki-NLP/opus-mt-en-swc
pipeline_tag: translation
tags:
- translation
- Swahili
- subtitle
---
# English-Swahili Subtitle Translator (v1)

### A specialized machine translation model optimized for subtitle content conversion between English and Swahili, preserving timing formats and handling common subtitle annotations.

## Model Details

 
- **Architecture**: MarianMT (Transformer-based)
- **Training Data**: 500k subtitle pairs + general domain text
- **Max Sequence Length**: 512 tokens
- **Special Features**:
  - **Preserves timestamps (`00:01:23,456 --> 00:01:25,678`)**
  - **Handles subtitle annotations (`[MUSIC]`, `(OFF-SCREEN)`)**
  - **Context-aware translation for dialogue continuity**

## First install requirements:
```bash
pip install transformers sentencepiece pysrt

## Basic Translation
from transformers import MarianMTModel, MarianTokenizer

## Load model and tokenizer
model = MarianMTModel.from_pretrained("ngosha/English-Swahili-Subs-Translator-v1")
tokenizer = MarianTokenizer.from_pretrained("ngosha/English-Swahili-Subs-Translator-v1")

def translate_subtitle(text):
  inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
  outputs = model.generate(**inputs)
  return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

## Example usage
```bash
print(translate_subtitle("We need to hurry, the show starts in 5 minutes!"))
# Output: "Tunahitaji kufanya haraka, kipindi kinaanza baada ya dakika 5!"
```

## Full Subtitle File Processing
```bash
import pysrt

def translate_srt_file(input_path, output_path):
    subs = pysrt.open(input_path)
    
    for sub in subs:
        # Preserve original timestamps
        original_time = f"{sub.start} --> {sub.end}"
        
        # Clean and translate text
        clean_text = ' '.join([line for line in sub.text.split('\n') if '-->' not in line])
        translated = translate_subtitle(clean_text)
        
        # Rebuild subtitle with timing
        sub.text = f"{original_time}\n{translated}"
    
    subs.save(output_path)

# Usage
translate_srt_file("episode.srt", "translated_episode.srt")
```

# Best Practices
## 1. Preserve Formatting:

### Keep annotations like [MUSIC] unchanged
```bash
  text = "[INTENSE MUSIC] The final battle begins"
  translation = "[INTENSE MUSIC] Vita vya mwisho vyaanza"
```

## 2. Handle Line Breaks:

### Split long lines into subtitle-friendly chunks
```bash
def split_subtitle(text, max_length=42):
    return '\n'.join([text[i:i+max_length] for i in range(0, len(text), max_length)])
```

## 3. Context Window:

# Use previous 2 lines as context
```bash
context = []
def contextual_translate(text):
    context.append(text)
    if len(context) > 2:
        context.pop(0)
    return translate_subtitle(' '.join(context))
```

## Limitations

  - **Max 3 lines per subtitle segment**
  - **Best performance on conversational text**
  - **May require post-editing for**:
    - **Cultural references**
    - **Idiomatic expressions**
    - **Proper noun pronunciations**

## Ethical Considerations

  - **May reflect biases in training data**
  - **Cultural nuances in Swahili dialects**:
    - **Prefer Tanzanian variants**
    - **Avoid regional slang**
    - **Always human-validate translations for sensitive content**

## Finetuned by Ngosha