---
title: Modal Transcriber MCP
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
tag: mcp-server-track
---
# 🎙️ Modal Transcriber MCP
A powerful audio transcription system integrating Gradio UI, FastMCP Tools, and Modal cloud computing with intelligent speaker identification.
## Key Features
- Multi-platform Audio Download: Support for Apple Podcasts, XiaoYuZhou, and other podcast platforms
- High-performance Transcription: Based on OpenAI Whisper, with support for multiple models (turbo, large-v3, etc.)
- Intelligent Speaker Identification: Uses pyannote.audio for speaker separation and embedding clustering
- Distributed Processing: Large files are split into chunks and processed concurrently, significantly speeding up transcription
- FastMCP Tools: Complete MCP (Model Context Protocol) tool integration
- Modal Deployment: Supports both local and cloud deployment modes (see the sketch after this list)
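To give a rough picture of the cloud deployment mode, the sketch below shows how a chunk-transcription function might be declared to run on Modal GPUs. It is illustrative only: the app name, container image contents, GPU type, and function names are assumptions, not this project's actual code.

```python
import modal

app = modal.App("modal-transcriber-sketch")  # hypothetical app name

# Container image with ffmpeg and Whisper; package choices are illustrative.
image = modal.Image.debian_slim().apt_install("ffmpeg").pip_install("openai-whisper")

@app.function(image=image, gpu="A10G", timeout=600)
def transcribe_chunk(audio_bytes: bytes, model_name: str = "turbo") -> dict:
    """Transcribe one audio chunk on a cloud GPU (illustrative helper)."""
    import tempfile
    import whisper

    model = whisper.load_model(model_name)
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(audio_bytes)
        tmp.flush()
        result = model.transcribe(tmp.name)
    return {"text": result["text"], "segments": result["segments"]}

@app.local_entrypoint()
def main(path: str = "episode.wav"):
    # `modal run <file>.py` executes main locally and runs transcribe_chunk in the cloud.
    with open(path, "rb") as f:
        print(transcribe_chunk.remote(f.read())["text"])
```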
## Core Advantages

### Intelligent Audio Segmentation
- Silence-based Segmentation: Silent passages are detected automatically and used as intelligent chunk boundaries
- Fallback Mechanism: Long spans without usable silence automatically fall back to time-based segmentation, keeping processing efficient
- Concurrent Processing: Multiple chunks are transcribed simultaneously, dramatically improving throughput (a sketch of this strategy follows the list)
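One plausible way to implement this chunking strategy is sketched below with pydub and a thread pool. The thresholds, chunk cap, and helper names are assumptions for illustration, not the project's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

from pydub import AudioSegment
from pydub.silence import detect_nonsilent

MAX_CHUNK_MS = 60_000  # illustrative cap; longer spans fall back to fixed windows

def split_audio(path: str) -> list[AudioSegment]:
    """Split at silent passages, falling back to time-based windows."""
    audio = AudioSegment.from_file(path)
    spans = detect_nonsilent(audio, min_silence_len=500,
                             silence_thresh=audio.dBFS - 16)  # illustrative thresholds
    chunks = []
    for start, end in spans or [(0, len(audio))]:
        # Fallback: spans longer than MAX_CHUNK_MS are cut into fixed windows.
        for pos in range(start, end, MAX_CHUNK_MS):
            chunks.append(audio[pos:min(pos + MAX_CHUNK_MS, end)])
    return chunks

def transcribe_chunk(chunk: AudioSegment) -> str:
    # Placeholder: in the real pipeline this would hand the chunk to Whisper,
    # e.g. via a cloud function like the Modal sketch earlier in this README.
    return f"[{len(chunk) / 1000:.1f}s chunk]"

def transcribe_all(path: str, workers: int = 10) -> list[str]:
    """Transcribe chunks concurrently while preserving their original order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(transcribe_chunk, split_audio(path)))
```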
### Advanced Speaker Identification
- Embedding Clustering: Deep-learning speaker embeddings are clustered to keep speaker identities consistent
- Cross-chunk Unification: Resolves the inconsistent speaker labels that distributed chunk processing would otherwise produce
- Quality Filtering: Low-quality segments are filtered out automatically to improve output accuracy (see the clustering sketch below)
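A minimal sketch of how cross-chunk speaker unification could work, assuming pyannote.audio embeddings clustered with scikit-learn. The model name, distance threshold, and function names are illustrative, not necessarily what this project uses.

```python
import os

import numpy as np
from pyannote.audio import Inference, Model
from sklearn.cluster import AgglomerativeClustering

# pyannote/embedding is a gated model, hence the HF_TOKEN requirement below.
model = Model.from_pretrained("pyannote/embedding", use_auth_token=os.environ["HF_TOKEN"])
embedder = Inference(model, window="whole")  # one embedding vector per file

def unify_speakers(chunk_paths: list[str]) -> dict[str, int]:
    """Map each chunk file to a global speaker id via agglomerative clustering."""
    embeddings = np.stack([embedder(p) for p in chunk_paths])
    labels = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=1.0,  # illustrative threshold, not a tuned project value
        metric="cosine",
        linkage="average",
    ).fit_predict(embeddings)
    return dict(zip(chunk_paths, labels.tolist()))
```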
### Developer Friendly
- MCP Protocol Support: Complete tool invocation interface (see the FastMCP sketch below)
- REST API: Standardized API interface
- Gradio UI: Intuitive web interface
- Test Coverage: 29 unit and integration tests
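To illustrate what the MCP integration looks like at the code level, here is a minimal FastMCP server exposing a single transcription tool. The tool name and parameters are hypothetical and do not necessarily match this project's actual tool signatures.

```python
from fastmcp import FastMCP

mcp = FastMCP("Modal Transcriber MCP")

@mcp.tool()
def transcribe_audio(url: str, model_size: str = "turbo",
                     enable_speaker_id: bool = False) -> dict:
    """Download a podcast episode and return its transcript (illustrative tool)."""
    # In the real project this would dispatch to the transcription backend;
    # here it only returns a placeholder result.
    return {"url": url, "model": model_size, "speakers": enable_speaker_id, "text": "..."}

if __name__ == "__main__":
    mcp.run()  # serves the tool over the MCP protocol (stdio by default)
```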
## Quick Start

### Local Setup
1. Clone the repository

   ```bash
   git clone https://huggingface.co/spaces/Agents-MCP-Hackathon/ModalTranscriberMCP
   cd ModalTranscriberMCP
   ```

2. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

3. Configure a Hugging Face token (optional, needed for speaker identification)

   ```bash
   # Create .env file
   echo "HF_TOKEN=your_huggingface_token_here" > .env
   ```

4. Start the application

   ```bash
   python app.py
   ```
### Usage Instructions
1. Upload an audio file or enter a podcast URL
2. Select transcription options:
   - Model size: turbo (recommended) or large-v3
   - Output format: SRT or TXT
   - Speaker identification: on or off
3. Start transcription; the system processes the audio and generates the results automatically (the Space can also be called programmatically, as sketched below)
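The same workflow can be driven from code. The sketch below uses gradio_client against this Space; the endpoint name and argument order are assumptions, so check the running app's "Use via API" page for the real signature.

```python
from gradio_client import Client

# Space id is real; the api_name and argument order below are illustrative guesses.
client = Client("Agents-MCP-Hackathon/ModalTranscriberMCP")
result = client.predict(
    "https://podcasts.apple.com/us/podcast/example-episode",  # audio URL or file
    "turbo",   # model size
    "srt",     # output format
    True,      # enable speaker identification
    api_name="/transcribe",
)
print(result)
```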
## Technical Architecture
- Frontend: Gradio 4.44.0
- Backend: FastAPI + FastMCP
- Transcription Engine: OpenAI Whisper
- Speaker Identification: pyannote.audio
- Cloud Computing: Modal.com
- Audio Processing: FFmpeg
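For reference, this is roughly what using the transcription engine listed above looks like when called directly, together with a minimal SRT renderer. It is a sketch of the underlying OpenAI Whisper API plus a hypothetical helper, not this project's own wrapper code.

```python
import whisper

def to_srt(segments) -> str:
    """Render Whisper segments as SRT (minimal sketch, no speaker labels)."""
    def ts(seconds: float) -> str:
        ms = int(seconds * 1000)
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    lines = []
    for i, seg in enumerate(segments, start=1):
        lines.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(lines)

model = whisper.load_model("large-v3")   # or "turbo" for faster runs
result = model.transcribe("episode.mp3")
print(to_srt(result["segments"]))
```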
## Performance Metrics
- Processing Speed: Up to 30x real-time transcription
- Concurrency: Up to 10 chunks processed simultaneously
- Accuracy: Over 95% for Chinese-language audio
- Supported Formats: MP3, WAV, M4A, FLAC, etc.
## Contributing
Issues and Pull Requests are welcome!
## License
MIT License
## Related Links
- Project Documentation: see the `docs/` directory in the repository
- Test Coverage: 29 test cases ensuring functional stability
- Modal Deployment: supports high-performance processing in the cloud
Last updated: 2025-06-11