--- license: apache-2.0 language: - en - sw base_model: - Helsinki-NLP/opus-mt-en-swc pipeline_tag: translation tags: - translation - Swahili - subtitle --- # English-Swahili Subtitle Translator (v1) ### A specialized machine translation model optimized for subtitle content conversion between English and Swahili, preserving timing formats and handling common subtitle annotations. ## Model Details - **Architecture**: MarianMT (Transformer-based) - **Training Data**: 500k subtitle pairs + general domain text - **Max Sequence Length**: 512 tokens - **Special Features**: - **Preserves timestamps (`00:01:23,456 --> 00:01:25,678`)** - **Handles subtitle annotations (`[MUSIC]`, `(OFF-SCREEN)`)** - **Context-aware translation for dialogue continuity** ## First install requirements: ```bash pip install transformers sentencepiece pysrt ## Basic Translation from transformers import MarianMTModel, MarianTokenizer ## Load model and tokenizer model = MarianMTModel.from_pretrained("ngosha/English-Swahili-Subs-Translator-v1") tokenizer = MarianTokenizer.from_pretrained("ngosha/English-Swahili-Subs-Translator-v1") def translate_subtitle(text): inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True) outputs = model.generate(**inputs) return tokenizer.decode(outputs[0], skip_special_tokens=True) ``` ## Example usage ```bash print(translate_subtitle("We need to hurry, the show starts in 5 minutes!")) # Output: "Tunahitaji kufanya haraka, kipindi kinaanza baada ya dakika 5!" ``` ## Full Subtitle File Processing ```bash import pysrt def translate_srt_file(input_path, output_path): subs = pysrt.open(input_path) for sub in subs: # Preserve original timestamps original_time = f"{sub.start} --> {sub.end}" # Clean and translate text clean_text = ' '.join([line for line in sub.text.split('\n') if '-->' not in line]) translated = translate_subtitle(clean_text) # Rebuild subtitle with timing sub.text = f"{original_time}\n{translated}" subs.save(output_path) # Usage translate_srt_file("episode.srt", "translated_episode.srt") ``` # Best Practices ## 1. Preserve Formatting: ### Keep annotations like [MUSIC] unchanged ```bash text = "[INTENSE MUSIC] The final battle begins" translation = "[INTENSE MUSIC] Vita vya mwisho vyaanza" ``` ## 2. Handle Line Breaks: ### Split long lines into subtitle-friendly chunks ```bash def split_subtitle(text, max_length=42): return '\n'.join([text[i:i+max_length] for i in range(0, len(text), max_length)]) ``` ## 3. Context Window: # Use previous 2 lines as context ```bash context = [] def contextual_translate(text): context.append(text) if len(context) > 2: context.pop(0) return translate_subtitle(' '.join(context)) ``` ## Limitations - **Max 3 lines per subtitle segment** - **Best performance on conversational text** - **May require post-editing for**: - **Cultural references** - **Idiomatic expressions** - **Proper noun pronunciations** ## Ethical Considerations - **May reflect biases in training data** - **Cultural nuances in Swahili dialects**: - **Prefer Tanzanian variants** - **Avoid regional slang** - **Always human-validate translations for sensitive content** ## Finetuned by Ngosha