---
title: Modal Transcriber MCP
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
tags:
  - mcp-server-track
---

# 🎙️ Modal Transcriber MCP

An audio transcription system that combines a Gradio UI, FastMCP tools, and Modal cloud computing, with intelligent speaker identification.

## ✨ Key Features

- **🎵 Multi-platform Audio Download**: Supports Apple Podcasts, XiaoYuZhou, and other podcast platforms
- **🚀 High-performance Transcription**: Built on OpenAI Whisper, with multiple models available (turbo, large-v3, etc.)
- **🎤 Intelligent Speaker Identification**: Uses pyannote.audio for speaker separation and embedding clustering
- **⚡ Distributed Processing**: Splits large files into chunks and processes them concurrently, significantly reducing processing time
- **🔧 FastMCP Tools**: Complete MCP (Model Context Protocol) tool integration
- **☁️ Modal Deployment**: Supports both local and cloud deployment modes

## 🎯 Core Advantages

### 🧠 Intelligent Audio Segmentation
- **Silence Detection Segmentation**: Automatically detects silent segments in the audio and chunks the audio at those points
- **Fallback Mechanism**: Long audio automatically falls back to time-based segmentation to keep processing efficient
- **Concurrent Processing**: Multiple chunks are processed simultaneously, greatly improving transcription speed
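
As a rough illustration of the segmentation strategy above, the sketch below plans chunk boundaries from a list of already-detected silence intervals and falls back to a fixed-length cut when no silence lies within reach. The function name `plan_chunks`, its parameters, and the 60-second default are assumptions made for this example, not the project's actual implementation.

```python
from typing import List, Tuple


def plan_chunks(
    duration: float,
    silences: List[Tuple[float, float]],
    max_chunk: float = 60.0,  # assumed chunk limit in seconds
) -> List[Tuple[float, float]]:
    """Split [0, duration] into chunks no longer than max_chunk seconds.

    A cut is placed at the midpoint of the latest silence inside the current
    window; if the window contains no silence, the cut falls back to exactly
    max_chunk seconds after the chunk start (time-based segmentation).
    """
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    midpoints = sorted((start + end) / 2 for start, end in silences)
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while duration - start > max_chunk:
        limit = start + max_chunk
        # Silence midpoints that fall inside the current window.
        candidates = [m for m in midpoints if start < m <= limit]
        cut = candidates[-1] if candidates else limit
        chunks.append((start, cut))
        start = cut
    chunks.append((start, duration))
    return chunks
```

For example, `plan_chunks(150.0, [(50.0, 52.0), (100.0, 104.0)])` cuts at the two silence midpoints, while `plan_chunks(130.0, [])` cuts every 60 seconds.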

### 🎤 Advanced Speaker Identification
- **Embedding Clustering**: Uses deep learning embeddings to identify speakers consistently
- **Cross-chunk Unification**: Resolves inconsistent speaker labels across chunks in distributed processing
- **Quality Filtering**: Automatically filters out low-quality segments to improve output accuracy
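
To make the cross-chunk unification idea concrete, here is a minimal sketch that maps each chunk's local speaker labels onto shared global labels by comparing embeddings with cosine similarity. It uses greedy nearest-reference matching rather than a full clustering algorithm, and the names `unify_speakers` and `cosine`, the `SPEAKER_NN` label format, and the 0.8 threshold are all assumptions for illustration. In a real pipeline the embeddings would come from a speaker model such as pyannote.audio.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def unify_speakers(
    chunk_embeddings: List[Dict[str, Sequence[float]]],
    threshold: float = 0.8,  # assumed similarity cutoff
) -> List[Dict[str, str]]:
    """Return, per chunk, a mapping from local label to global label.

    Simplification: each local speaker is matched greedily against the
    first embedding seen for every global speaker, so two local speakers
    in one chunk may end up with the same global label.
    """
    global_refs: List[Tuple[str, Sequence[float]]] = []
    mappings: List[Dict[str, str]] = []
    for local in chunk_embeddings:
        mapping: Dict[str, str] = {}
        for label, embedding in local.items():
            best_name: Optional[str] = None
            best_sim = threshold
            for name, reference in global_refs:
                sim = cosine(embedding, reference)
                if sim >= best_sim:
                    best_name, best_sim = name, sim
            if best_name is None:
                # No existing speaker is similar enough: register a new one.
                best_name = f"SPEAKER_{len(global_refs):02d}"
                global_refs.append((best_name, embedding))
            mapping[label] = best_name
        mappings.append(mapping)
    return mappings
```

With this sketch, a speaker labeled `A` in one chunk and `B` in the next receives the same global label as long as the two embeddings are similar enough.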

### 🔧 Developer Friendly
- **MCP Support**: Complete tool invocation interface
- **REST API**: Standardized API
- **Gradio UI**: Intuitive web interface
- **Test Coverage**: 29 unit and integration tests

## 🚀 Quick Start

### Local Setup

1. **Clone Repository**
```bash
git clone https://huggingface.co/spaces/Agents-MCP-Hackathon/ModalTranscriberMCP
cd ModalTranscriberMCP
```

2. **Install Dependencies**
```bash
pip install -r requirements.txt
```

3. **Configure Hugging Face Token** (Optional, for speaker identification)
```bash
# Create .env file
echo "HF_TOKEN=your_huggingface_token_here" > .env
```

4. **Start Application**
```bash
python app.py
```

### Usage Instructions

1. **Upload an audio file** or **enter a podcast URL**
2. **Select transcription options**:
   - Model size: turbo (recommended) / large-v3
   - Output format: SRT / TXT
   - Enable speaker identification
3. **Start transcription**; the system processes the audio and generates the results automatically
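
For readers unfamiliar with the SRT output option, the sketch below shows how timed segments are typically rendered in that format: a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, and the text. The helper names `srt_timestamp` and `to_srt` and the `(start, end, text)` tuple layout are assumptions for this example, not the project's own functions.

```python
from typing import Iterable, Tuple


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as SRT subtitle blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Blocks are separated by a blank line, as the SRT format expects.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 2.5, "Hello"), (2.5, 4.0, "World")]))
```

The TXT option would simply join the segment texts without indices or timestamps.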

## 🛠️ Technical Architecture

- **Frontend**: Gradio 4.44.0
- **Backend**: FastAPI + FastMCP
- **Transcription Engine**: OpenAI Whisper
- **Speaker Identification**: pyannote.audio
- **Cloud Computing**: Modal.com
- **Audio Processing**: FFmpeg

## 📊 Performance Metrics

- **Processing Speed**: Up to 30x real-time transcription (for example, a 60-minute recording in about 2 minutes)
- **Concurrency**: Up to 10 chunks processed simultaneously
- **Accuracy**: >95% on Chinese audio
- **Supported Formats**: MP3, WAV, M4A, FLAC, etc.
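
The concurrency limit listed above could be enforced with a bounded worker pool. The following is a minimal sketch using Python's standard `concurrent.futures`; the function `process_chunks` and the `worker` callable are hypothetical, and a local thread pool is an assumption made for illustration (the project itself may dispatch chunks to Modal instead).

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_chunks(
    chunks: Sequence[T],
    worker: Callable[[T], R],
    max_workers: int = 10,  # mirrors the "up to 10 chunks" figure above
) -> List[R]:
    """Apply worker to every chunk with at most max_workers running at once.

    Results are returned in the same order as the input chunks, so the
    transcript pieces can be concatenated directly afterwards.
    """
    if not chunks:
        return []
    pool_size = min(max_workers, len(chunks))
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(worker, chunks))


if __name__ == "__main__":
    # Placeholder worker standing in for a real transcription call.
    demo = process_chunks([(0, 60), (60, 120)], lambda c: f"chunk {c[0]}-{c[1]}")
    print(demo)
```

Because `pool.map` preserves input order, no extra bookkeeping is needed to reassemble the chunks.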

## 🤝 Contributing

Issues and Pull Requests are welcome!

## 📜 License

MIT License

## 🔗 Related Links

- **Project Documentation**: See the `docs/` directory in the repository
- **Test Coverage**: 29 test cases covering core functionality
- **Modal Deployment**: High-performance cloud processing on Modal

---
*Last updated: 2025-06-11*