{
"cells": [
{
"cell_type": "markdown",
"id": "75b58048-7d14-4fc6-8085-1fc08c81b4a6",
"metadata": {},
"source": [
"# Fine-Tune Whisper With π€ Transformers and Streaming Mode"
]
},
{
"cell_type": "markdown",
"id": "fbfa8ad5-4cdc-4512-9058-836cbbf65e1a",
"metadata": {},
"source": [
"In this Colab, we present a step-by-step guide on fine-tuning Whisper with Hugging Face π€ Transformers on 400 hours of speech data! Using streaming mode, we'll show how you can train a speech recongition model on any dataset, irrespective of size. With streaming mode, storage requirements are no longer a consideration: you can train a model on whatever dataset you want, even if it's download size exceeds your devices disk space. How can this be possible? It simply seems too good to be true! Well, rest assured it's not π Carry on reading to find out more."
]
},
{
"cell_type": "markdown",
"id": "afe0d503-ae4e-4aa7-9af4-dbcba52db41e",
"metadata": {},
"source": [
"## Introduction"
]
},
{
"cell_type": "markdown",
"id": "9ae91ed4-9c3e-4ade-938e-f4c2dcfbfdc0",
"metadata": {},
"source": [
"Speech recognition datasets are large. A typical speech dataset consists of approximately 100 hours of audio-transcription data, requiring upwards of 130GB of storage space for download and preparation. For most ASR researchers, this is already at the upper limit of what is feasible for disk space. So what happens when we want to train on a larger dataset? The full [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) dataset consists of 960 hours of audio data. Kensho's [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech) contains 5,000 hours of audio data. ML Commons [People's Speech](https://huggingface.co/datasets/MLCommons/peoples_speech) contains **30,000+** hours of audio data! Do we need to bite the bullet and buy additional storage? Or is there a way we can train on all of these datasets with no disk drive requirements?\n",
"\n",
"When training machine learning systems, we rarely use the entire dataset at once. We typically _batch_ our data into smaller subsets of data, and pass these incrementally through our training pipeline. This is because we train our system on an accelerator device, such as a GPU or TPU, which has a memory limit typically around 16GB. We have to fit our model, optimiser and training data all on the same accelerator device, so we usually have to divide the dataset up into smaller batches and move them from the CPU to the GPU when required.\n",
"\n",
"Consequently, we don't require the entire dataset to be downloaded at once; we simply need the batch of data that we pass to our model at any one go. We can leverage this principle of partial dataset loading when preparing our dataset: rather than downloading the entire dataset at the start, we can load each piece of data as and when we need it. For each batch, we load the relevant data from a remote server and pass it through the training pipeline. For the next batch, we load the next items and again pass them through the training pipeline. At no point do we have to save data to our disk drive, we simply load them in memory and use them in our pipeline. In doing so, we only ever need as much memory as each individual batch requires.\n",
"\n",
"This is analogous to downloading a TV show versus streaming it πΊ When we download a TV show, we download the entire video offline and save it to our disk. Compare this to when we stream a TV show. Here, we don't download any part of the video to memory, but iterate over the video file and load each part in real-time as required. It's this same principle that we can apply to our ML training pipeline! We want to iterate over the dataset and load each sample of data as required.\n",
"\n",
"While the principle of partial dataset loading sounds ideal, it also seems **pretty** difficult to do. Luckily for us, π€ Datasets allows us to do this with minimal code changes! We'll make use of the principle of [_streaming_](https://huggingface.co/docs/datasets/stream), depicted graphically in Figure 1. Streaming does exactly this: the data is loaded progressively as we iterate over the dataset, meaning it is only loaded as and when we need it. If you're familiar with π€ Transformers and Datasets, the content of this notebook will be very familiar, with some small extensions to support streaming mode."
]
},
{
"cell_type": "markdown",
"id": "1c87f76e-47be-4a5d-bc52-7b1c2e9d4f5a",
"metadata": {},
"source": [
"\n",
"
Step | \n", "Training Loss | \n", "Validation Loss | \n", "Wer | \n", "
---|---|---|---|
200 | \n", "0.007300 | \n", "0.196525 | \n", "41.187308 | \n", "
400 | \n", "0.008800 | \n", "0.212407 | \n", "42.436029 | \n", "
600 | \n", "0.003400 | \n", "0.215344 | \n", "41.494371 | \n", "
"
],
"text/plain": [
"