Advanced Speech Processing with faster-whisper

Welcome to the advanced speech processing utility, built on the Whisper large-v2 model converted for the CTranslate2 inference framework. This tool is designed for high-performance speech recognition and processing, supporting a wide array of languages and handling video inputs for slide detection and audio transcription.

Features

  • Language Support: Extensive language support covering major global languages for speech recognition tasks.
  • Video Processing: Download MP4 files from links and extract audio content for transcription.
  • Slide Detection: Detect and sort presentation slides from video lectures or meetings.
  • Audio Transcription: Leverage the Whisper large-v2 model to transcribe audio content with high accuracy.

Getting Started

To begin using this utility, set up the WhisperModel from the faster_whisper package with the provided language configurations. The EndpointHandler class is your main interface for processing the data.
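
As a minimal sketch, loading the model directly with faster_whisper might look like the following; the device and compute_type values shown here are illustrative choices, not necessarily this repository's exact configuration:

from faster_whisper import WhisperModel

# Load the large-v2 model; device and compute_type are illustrative choices
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Transcribe a local audio file in the chosen language
segments, info = model.transcribe("sample.wav", language="en", task="transcribe")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")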

Example Usage

import requests
import os

# Request payload: use "type": "audio" with a base64-encoded audio string in "inputs",
# or "type": "link" with an MP4 video URL in "link"
DATA = {
    "inputs": "<base64_encoded_audio_string>",
    "link": "<your_mp4_video_link>",
    "language": "en",  # Choose from supported languages
    "task": "transcribe",
    "type": "audio"  # Use "link" for video files
}

HF_ACCESS_TOKEN = os.environ.get("HF_TRANSCRIPTION_ACCESS_TOKEN")
API_URL = os.environ.get("HF_TRANSCRIPTION_ENDPOINT")

HEADERS = {
    "Authorization": HF_ACCESS_TOKEN,
    "Content-Type": "application/json"
}

response = requests.post(API_URL, headers=HEADERS, json=DATA)
print(response.json())

# The response contains the transcription and, if a video link was provided, the detected slides

Processing Video Files

To process video files, the process_video function downloads the MP4 file, extracts the audio, and passes it to the Whisper model for transcription. It also utilizes the Detector and SlideSorter classes to identify and sort presentation slides within the video.
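
The handler code contains the actual implementation; as a rough, simplified sketch of the download-and-extract step (the function name, file paths, and the omission of slide detection here are assumptions for illustration):

import requests
from moviepy.editor import VideoFileClip

def download_and_extract_audio(link, video_path="video.mp4", audio_path="audio.wav"):
    # Download the MP4 file from the provided link
    with requests.get(link, stream=True) as response:
        response.raise_for_status()
        with open(video_path, "wb") as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
    # Extract the audio track so it can be passed to the Whisper model for transcription
    clip = VideoFileClip(video_path)
    clip.audio.write_audiofile(audio_path)
    clip.close()
    return audio_path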

Error Handling

Comprehensive logging and error handling are in place to ensure you're informed of each step's success or failure.
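
On the client side, a simple pattern for surfacing failures might look like the sketch below (illustrative only, reusing DATA, HEADERS, and API_URL from the example above; the endpoint's exact error payload is not assumed):

import logging
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("transcription-client")

try:
    response = requests.post(API_URL, headers=HEADERS, json=DATA, timeout=600)
    response.raise_for_status()
    logger.info("Transcription request succeeded")
except requests.exceptions.RequestException as err:
    logger.error("Transcription request failed: %s", err)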

Installation

Ensure that you have the following dependencies installed:

opencv-python~=4.8.1.78
numpy~=1.26.1
Pillow~=10.0.1
tqdm~=4.66.1
requests~=2.31.0
moviepy~=1.0.3
scipy~=1.11.3

Install them using pip with the provided requirements.txt file:

pip install -r requirements.txt

Languages Supported

This tool supports a wide range of languages, making it suitable for global applications. The full list of supported languages can be found in the language section of the old README.

License

This project is available under the MIT license.

More Information

For more information about the original Whisper large-v2 model, please refer to its model card on Hugging Face.