Advanced Speech Processing with faster-whisper
This utility performs speech recognition with the Whisper large-v2 model converted for the CTranslate2 framework and run through faster-whisper. It supports a wide range of languages and can process video inputs, detecting presentation slides and transcribing the audio track.
Features
- Language Support: Speech recognition across major global languages.
- Video Processing: Download MP4 files from links and extract audio content for transcription.
- Slide Detection: Detect and sort presentation slides from video lectures or meetings.
- Audio Transcription: Leverage the Whisper large-v2 model to transcribe audio content with high accuracy.
Getting Started
To begin using this utility, set up the WhisperModel from the faster_whisper package with the provided language configurations. The EndpointHandler class is your main interface for processing the data.
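A minimal sketch of that setup, assuming the faster-whisper package is installed and the large-v2 weights can be downloaded. The helper names load_model, join_segments, and transcribe_file are illustrative and not part of this project, and the device and compute type are assumptions to adjust for your hardware.

```python
def load_model(model_size="large-v2", device="cpu", compute_type="int8"):
    """Load a Whisper model through faster-whisper (imported lazily)."""
    from faster_whisper import WhisperModel  # requires `pip install faster-whisper`
    return WhisperModel(model_size, device=device, compute_type=compute_type)


def join_segments(segments):
    """Concatenate the text of each transcribed segment into one string."""
    return " ".join(segment.text.strip() for segment in segments)


def transcribe_file(model, audio_path, language="en", task="transcribe"):
    """Transcribe one audio file and return (text, detected_language)."""
    # transcribe() returns a lazy generator of segments plus an info object.
    segments, info = model.transcribe(audio_path, language=language, task=task)
    return join_segments(segments), info.language
```

Because the segments come back as a generator, the audio is only decoded when join_segments iterates over them. On a GPU, device="cuda" with compute_type="float16" is a common choice.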
Example Usage
import os

import requests

# Sample data dict with the audio or video link and the desired language for transcription
DATA = {
    "inputs": "<base64_encoded_audio_string>",
    "link": "<your_mp4_video_link>",
    "language": "en",  # Choose from supported languages
    "task": "transcribe",
    "type": "audio",  # Use "link" for video files
}

HF_ACCESS_TOKEN = os.environ.get("HF_TRANSCRIPTION_ACCESS_TOKEN")
API_URL = os.environ.get("HF_TRANSCRIPTION_ENDPOINT")
HEADERS = {
    # Hugging Face endpoints usually expect the form "Bearer <token>";
    # store the token with that prefix or add it here.
    "Authorization": HF_ACCESS_TOKEN,
    "Content-Type": "application/json",
}

response = requests.post(API_URL, headers=HEADERS, json=DATA)
response.raise_for_status()  # Stop here on an HTTP error status
# The response will contain transcribed audio and detected slides if a video link was provided
print(response.json())
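The request body can also be assembled with a small helper. This is a sketch under assumptions: build_payload is a hypothetical name, and the rule that "type" is "audio" for base64 input and "link" for a video URL is inferred from the comments in the example, not from the handler's source.

```python
import base64


def build_payload(audio_bytes=None, link=None, language="en", task="transcribe"):
    """Build the JSON body for either raw audio or an MP4 link (not both)."""
    if (audio_bytes is None) == (link is None):
        raise ValueError("Provide exactly one of audio_bytes or link")
    payload = {"language": language, "task": task}
    if audio_bytes is not None:
        # JSON cannot carry raw bytes, so the audio is sent as base64 text.
        payload["inputs"] = base64.b64encode(audio_bytes).decode("ascii")
        payload["type"] = "audio"
    else:
        payload["link"] = link
        payload["type"] = "link"
    return payload
```

For a local file, something like build_payload(audio_bytes=open("talk.wav", "rb").read()) would produce the audio variant, and the result can be passed as the json argument of requests.post.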
Processing Video Files
To process video files, the process_video function downloads the MP4 file, extracts the audio, and passes it to the Whisper model for transcription. It also uses the Detector and SlideSorter classes to identify and sort presentation slides within the video.
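The download and audio-extraction steps might look roughly like the sketch below. It is not the project's process_video implementation: the helper names are hypothetical, it assumes requests and moviepy 1.x from the dependency list, and it leaves out Detector and SlideSorter because their interfaces are not shown here.

```python
from pathlib import Path


def audio_path_for(video_path):
    """Return the path where the extracted audio track will be written."""
    return str(Path(video_path).with_suffix(".wav"))


def download_mp4(link, dest_path, chunk_size=1 << 20):
    """Stream the MP4 at `link` to `dest_path` without loading it into memory."""
    import requests  # imported lazily so audio_path_for works without it

    with requests.get(link, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(dest_path, "wb") as out:
            for chunk in response.iter_content(chunk_size=chunk_size):
                out.write(chunk)
    return dest_path


def extract_audio(video_path):
    """Write the video's audio track to a WAV file and return its path."""
    from moviepy.editor import VideoFileClip  # moviepy 1.x import path

    audio_path = audio_path_for(video_path)
    clip = VideoFileClip(video_path)
    try:
        # clip.audio is None for videos without an audio track.
        clip.audio.write_audiofile(audio_path)
    finally:
        clip.close()
    return audio_path
```

The resulting WAV path can then be handed to the Whisper model for transcription.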
Error Handling
Comprehensive logging and error handling are in place to ensure you're informed of each step's success or failure.
Installation
Ensure that you have the following dependencies installed:
opencv-python~=4.8.1.78
numpy~=1.26.1
Pillow~=10.0.1
tqdm~=4.66.1
requests~=2.31.0
moviepy~=1.0.3
scipy~=1.11.3
Install them using pip with the provided requirements.txt file:
pip install -r requirements.txt
Languages Supported
This tool supports a wide range of languages, making it suitable for global applications. The full list of supported languages can be found in the language section of the old README.
License
This project is available under the MIT license.
More Information
For more information about the original Whisper large-v2 model, please refer to its model card on Hugging Face.