Commit aad85c9 by richard-su · verified · 1 Parent(s): faa63c6

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +58 -55
README.md CHANGED
@@ -13,96 +13,99 @@ python_version: 3.10.12
 
  # 🎙️ Modal Transcriber MCP
 
- 一个功能强大的音频转录系统,集成了 Gradio UI、FastMCP Tools 和 Modal 云计算,支持智能说话人识别。
 
- ## ✨ 主要功能
 
- - **🎵 多平台音频下载**:支持 Apple Podcasts、小宇宙等播客平台
- - **🚀 高性能转录**:基于 OpenAI Whisper,支持多种模型(turbo, large-v3等)
- - **🎤 智能说话人识别**:使用 pyannote.audio 进行说话人分离和embedding聚类
- - **⚡ 分布式处理**:支持大文件并发切片处理,显著提升处理速度
- - **🔧 FastMCP 工具**:提供完整的 MCP (Model Context Protocol) 工具集成
- - **☁️ Modal 部署**:支持本地和云端双模式部署
 
- ## 🎯 核心优势
 
- ### 🧠 智能音频分割
- - **静音检测分割**:自动识别音频中的静音段落进行智能切分
- - **Fallback机制**:长音频自动降级为时间分割,确保处理效率
- - **并发处理**:多chunk同时处理,大幅提升转录速度
 
- ### 🎤 高级说话人识别
- - **Embedding聚类**:使用深度学习embedding进行说话人一致性识别
- - **跨chunk统一**:解决分布式处理中说话人标签不一致问题
- - **质量过滤**:自动过滤低质量片段,提升输出准确性
 
- ### 🔧 开发者友好
- - **MCP协议支持**:完整的工具调用接口
- - **REST API**:标准化的API接口
- - **Gradio UI**:直观的Web界面
- - **测试覆盖**:29个单元测试和集成测试
 
- ## 🚀 快速开始
 
- ### 本地运行
 
- 1. **克隆仓库**
  ```bash
  git clone https://huggingface.co/spaces/Agents-MCP-Hackathon/ModalTranscriberMCP
  cd ModalTranscriberMCP
  ```
 
- 2. **安装依赖**
  ```bash
  pip install -r requirements.txt
  ```
 
- 3. **配置 Hugging Face Token**(可选,用于说话人识别)
  ```bash
- # 创建 .env 文件
  echo "HF_TOKEN=your_huggingface_token_here" > .env
  ```
 
- 4. **启动应用**
  ```bash
  python app.py
  ```
 
- ### 使用说明
 
- 1. **上传音频文件** 或 **输入播客URL**
- 2. **选择转录选项**:
- - 模型大小:turbo (推荐) / large-v3
- - 输出格式:SRT / TXT
- - 是否启用说话人识别
- 3. **开始转录**,系统会自动处理并生成结果
 
- ## 🛠️ 技术架构
 
- - **前端**:Gradio 4.44.0
- - **后端**:FastAPI + FastMCP
- - **转录引擎**:OpenAI Whisper
- - **说话人识别**:pyannote.audio
- - **云计算**:Modal.com
- - **音频处理**:FFmpeg
 
- ## 📊 性能指标
 
- - **处理速度**:支持30倍实时速度转录
- - **并发能力**:最多10个chunks同时处理
- - **准确率**:中文准确率>95%
- - **支持格式**:MP3, WAV, M4A, FLAC
 
- ## 🤝 贡献
 
- 欢迎提交 Issue 和 Pull Request!
 
- ## 📜 许可证
 
  MIT License
 
- ## 🔗 相关链接
 
- - **项目文档**:详见仓库中的 `docs/` 目录
- - **测试覆盖**:29个测试用例确保功能稳定性
- - **Modal部署**:支持云端高性能处理
 
 
  # 🎙️ Modal Transcriber MCP
 
+ A powerful audio transcription system that integrates Gradio UI, FastMCP Tools, and Modal cloud computing, with intelligent speaker identification.
 
+ ## ✨ Key Features
 
+ - **🎵 Multi-platform Audio Download**: Supports Apple Podcasts, XiaoYuZhou, and other podcast platforms
+ - **🚀 High-performance Transcription**: Built on OpenAI Whisper, with support for multiple models (turbo, large-v3, etc.)
+ - **🎤 Intelligent Speaker Identification**: Uses pyannote.audio for speaker diarization and embedding clustering
+ - **⚡ Distributed Processing**: Large files are split into chunks and processed concurrently, significantly improving processing speed
+ - **🔧 FastMCP Tools**: Complete MCP (Model Context Protocol) tool integration
+ - **☁️ Modal Deployment**: Supports both local and cloud deployment modes
 
+ ## 🎯 Core Advantages
 
+ ### 🧠 Intelligent Audio Segmentation
+ - **Silence Detection Segmentation**: Automatically identifies silent segments in the audio and uses them as chunk boundaries
+ - **Fallback Mechanism**: Long audio automatically falls back to time-based segmentation, ensuring processing efficiency
+ - **Concurrent Processing**: Multiple chunks are processed simultaneously, dramatically improving transcription speed
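
The README names the segmentation strategy but does not show it. As a rough illustration only, the sketch below plans chunk boundaries from a list of already-detected silent intervals and falls back to a fixed time cut whenever no silence lies inside the current window. The function name `plan_chunks`, the 60-second `max_chunk` limit, and the `(start, end)` input format are assumptions made for this example, not taken from the project's code.

```python
from typing import List, Tuple


def plan_chunks(
    duration: float,
    silences: List[Tuple[float, float]],
    max_chunk: float = 60.0,  # assumed limit, not a documented project setting
) -> List[Tuple[float, float]]:
    """Return (start, end) chunk boundaries in seconds.

    A cut is placed at the midpoint of the latest silent interval that
    falls within `max_chunk` seconds of the chunk start. If the window
    contains no silence, the cut falls back to a fixed time offset.
    """
    # Candidate cut points: midpoints of the silent intervals.
    cuts = sorted(
        (s + e) / 2 for s, e in silences if 0 < (s + e) / 2 < duration
    )

    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while duration - start > max_chunk:
        limit = start + max_chunk
        candidates = [c for c in cuts if start < c <= limit]
        # Prefer the latest silence in the window; otherwise cut by time.
        end = candidates[-1] if candidates else limit
        chunks.append((start, end))
        start = end

    # Whatever remains fits in a single chunk.
    chunks.append((start, duration))
    return chunks
```

Each returned pair could then be handed to a separate worker. Cutting at silence midpoints keeps words from being split across chunks, which is presumably why the README lists it ahead of the time-based fallback.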
 
+ ### 🎤 Advanced Speaker Identification
+ - **Embedding Clustering**: Uses deep learning embeddings to identify speakers consistently
+ - **Cross-chunk Unification**: Resolves inconsistent speaker labels across chunks in distributed processing
+ - **Quality Filtering**: Automatically filters low-quality segments to improve output accuracy
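
When chunks are diarized independently, the same person can be labeled differently in each chunk. The README does not describe how labels are reconciled, so the following is only a sketch of one possible approach: greedy centroid clustering with a cosine-similarity threshold. The function `unify_speakers`, the 0.7 threshold, and the `SPEAKER_NN` naming are hypothetical, and the embeddings are assumed to have been computed elsewhere (for example by a speaker-embedding model); this is not pyannote.audio's API.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def unify_speakers(
    local: List[Tuple[int, str, Sequence[float]]],
    threshold: float = 0.7,  # assumed value; would need tuning on real data
) -> Dict[Tuple[int, str], str]:
    """Map each (chunk_id, local_label) to a global speaker label.

    `local` holds one (chunk_id, local_label, embedding) entry per
    speaker per chunk. Entries are assigned greedily to the most similar
    existing centroid, or start a new global speaker when none is close.
    """
    centroids: List[List[float]] = []
    counts: List[int] = []
    mapping: Dict[Tuple[int, str], str] = {}

    for chunk_id, label, emb in local:
        sims = [cosine(emb, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)

        if best is not None and sims[best] >= threshold:
            # Update the running mean of the matched centroid.
            n = counts[best]
            centroids[best] = [
                (c * n + x) / (n + 1) for c, x in zip(centroids[best], emb)
            ]
            counts[best] = n + 1
        else:
            # No centroid is close enough: start a new global speaker.
            centroids.append(list(emb))
            counts.append(1)
            best = len(centroids) - 1

        mapping[(chunk_id, label)] = f"SPEAKER_{best:02d}"

    return mapping
```

A greedy pass is order-dependent, so a batch method such as agglomerative clustering over all embeddings may give more stable results on long recordings.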
 
+ ### 🔧 Developer Friendly
+ - **MCP Protocol Support**: Complete tool invocation interface
+ - **REST API**: Standardized API interface
+ - **Gradio UI**: Intuitive web interface
+ - **Test Coverage**: 29 unit and integration tests
 
+ ## 🚀 Quick Start
 
+ ### Local Setup
 
+ 1. **Clone Repository**
  ```bash
  git clone https://huggingface.co/spaces/Agents-MCP-Hackathon/ModalTranscriberMCP
  cd ModalTranscriberMCP
  ```
 
+ 2. **Install Dependencies**
  ```bash
  pip install -r requirements.txt
  ```
 
+ 3. **Configure Hugging Face Token** (Optional, for speaker identification)
  ```bash
+ # Create .env file
  echo "HF_TOKEN=your_huggingface_token_here" > .env
  ```
 
+ 4. **Start Application**
  ```bash
  python app.py
  ```
 
+ ### Usage Instructions
 
+ 1. **Upload an audio file** or **enter a podcast URL**
+ 2. **Select transcription options**:
+ - Model size: turbo (recommended) / large-v3
+ - Output format: SRT / TXT
+ - Enable speaker identification
+ 3. **Start transcription**; the system automatically processes the audio and generates the results
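
For readers unfamiliar with the SRT option, the sketch below shows how timed segments could be rendered as SubRip text. The timestamp layout (`HH:MM:SS,mmm`) and the numbered-block structure follow the SubRip format; the `to_srt` helper, the segment tuple layout, and the `[SPEAKER] text` prefix are assumptions for illustration and may differ from what the application actually emits.

```python
from typing import List, Optional, Tuple

# (start_seconds, end_seconds, text, speaker or None): a hypothetical layout
Segment = Tuple[float, float, str, Optional[str]]


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: List[Segment]) -> str:
    """Render segments as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, (start, end, text, speaker) in enumerate(segments, start=1):
        # Prefix the speaker label only when diarization produced one.
        line = f"[{speaker}] {text}" if speaker else text
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{line}\n"
        )
    return "\n".join(blocks)
```

The TXT option would presumably drop the index and timestamp lines and keep only the text.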
 
+ ## 🛠️ Technical Architecture
 
+ - **Frontend**: Gradio 4.44.0
+ - **Backend**: FastAPI + FastMCP
+ - **Transcription Engine**: OpenAI Whisper
+ - **Speaker Identification**: pyannote.audio
+ - **Cloud Computing**: Modal.com
+ - **Audio Processing**: FFmpeg
 
+ ## 📊 Performance Metrics
 
+ - **Processing Speed**: Supports transcription at 30x real-time speed
+ - **Concurrency**: Up to 10 chunks processed simultaneously
+ - **Accuracy**: Over 95% accuracy for Chinese
+ - **Supported Formats**: MP3, WAV, M4A, FLAC, etc.
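
The 10-chunk concurrency limit can be pictured with a bounded `asyncio` fan-out. This is a minimal sketch, assuming each chunk is transcribed by an async callable; `transcribe_all` and `transcribe_chunk` are hypothetical names, and the project may instead rely on Modal's own scaling controls.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

Chunk = TypeVar("Chunk")


async def transcribe_all(
    chunks: Sequence[Chunk],
    transcribe_chunk: Callable[[Chunk], Awaitable[str]],
    max_concurrent: int = 10,  # mirrors the "up to 10 chunks" figure above
) -> List[str]:
    """Transcribe every chunk, running at most `max_concurrent` at once.

    Results are returned in the same order as `chunks`.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def worker(chunk: Chunk) -> str:
        # Wait for a free slot before starting this chunk.
        async with semaphore:
            return await transcribe_chunk(chunk)

    # gather() preserves input order regardless of completion order.
    return await asyncio.gather(*(worker(c) for c in chunks))
```

Because `asyncio.gather` keeps results in input order, the per-chunk transcripts can be concatenated directly once all workers finish.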
 
+ ## 🤝 Contributing
 
+ Issues and Pull Requests are welcome!
 
+ ## 📜 License
 
  MIT License
 
+ ## 🔗 Related Links
 
+ - **Project Documentation**: See the `docs/` directory in the repository
+ - **Test Coverage**: 29 test cases ensure functional stability
+ - **Modal Deployment**: Supports high-performance cloud processing
+
+ ---
+ *Last updated: 2025-06-11*