Cristian Tatu committed
Commit 611f013 · 1 Parent(s): 1e50356

add description article
.gitattributes CHANGED
@@ -33,11 +33,13 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
-male_1.wav filter=lfs diff=lfs merge=lfs -text
-male_2.wav filter=lfs diff=lfs merge=lfs -text
-male_3.wav filter=lfs diff=lfs merge=lfs -text
-male_4.wav filter=lfs diff=lfs merge=lfs -text
-female_1.wav filter=lfs diff=lfs merge=lfs -text
-female_2.wav filter=lfs diff=lfs merge=lfs -text
-female_3.wav filter=lfs diff=lfs merge=lfs -text
-female_4.wav filter=lfs diff=lfs merge=lfs -text
+male_foreign_lang.wav filter=lfs diff=lfs merge=lfs -text
+male_foreign_lang2.wav filter=lfs diff=lfs merge=lfs -text
+male_foreign_lang3.wav filter=lfs diff=lfs merge=lfs -text
+male_radio.wav filter=lfs diff=lfs merge=lfs -text
+female_foreign_lang.wav filter=lfs diff=lfs merge=lfs -text
+female_phone.wav filter=lfs diff=lfs merge=lfs -text
+female_laugh.wav filter=lfs diff=lfs merge=lfs -text
+female_blah.wav filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
female_3.wav → acc.png RENAMED
File without changes
cnn_arch.png ADDED

Git LFS Details

  • SHA256: 7f0f23ff98929b6ec7c1d5ef7b01a863e99ea9c83ea675e2a43691c9f3ef4e86
  • Pointer size: 131 Bytes
  • Size of remote file: 675 kB
female_blah.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:39c9f005b211e842519dd47fde05308f2689ee667ccd61822948ad1b12a98542
+size 330794
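The three lines above are the entire checked-in file: Git LFS stores only this small pointer stub in the repo and keeps the real 330 kB audio in remote storage. A minimal sketch of reading such a stub (hypothetical helper, not part of this commit):

```python
# Parse a Git LFS pointer stub like the one added above into its fields:
# the spec URL, the 'sha256:<hash>' object id, and the payload size in bytes.
def parse_lfs_pointer(path):
    fields = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            key, _, value = line.strip().partition(' ')
            fields[key] = value
    return fields['version'], fields['oid'], int(fields['size'])
```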
female_4.wav → female_foreign_lang.wav RENAMED
File without changes
female_2.wav → female_laugh.wav RENAMED
File without changes
female_1.wav → female_phone.wav RENAMED
File without changes
inv_block.png ADDED

Git LFS Details

  • SHA256: 859bea7919b3d57bf9586c5230e1ae464962cb117ac5672759f09933471b10fe
  • Pointer size: 131 Bytes
  • Size of remote file: 123 kB
loss.png ADDED

Git LFS Details

  • SHA256: 5621bb2a06ca0f7ced2a4a609094e1fc07cddd3bbf361ddf4840e883705a6b71
  • Pointer size: 130 Bytes
  • Size of remote file: 32.4 kB
lr.png ADDED

Git LFS Details

  • SHA256: f4387a8d19b8b2378df9bf331e4de905a33c02b1edcd1db16e76aae65acabefb
  • Pointer size: 130 Bytes
  • Size of remote file: 23.3 kB
main.py CHANGED
@@ -1,13 +1,73 @@
 import gradio as gr
 from model import CnnVoiceClassifier
+from glob import glob
 
 model = CnnVoiceClassifier()
 
 audio_component = gr.Audio(type='filepath', label='Upload your audio file here')
 label_component = gr.Label(label='Gender classification result')
 
-sample_female = [f'female_{i}.wav' for i in range(1, 5)]
-sample_male = [f'male_{i}.wav' for i in range(1, 5)]
+sample_female = glob('female_*.wav')
+sample_male = glob('male_*.wav')
 
-demo = gr.Interface(fn=model.inference, inputs=audio_component, outputs=label_component, examples=sample_female + sample_male)
-demo.launch()
+title = 'CNN Voice Classifier 👨👩'
+description = '<h2>This application is a Convolutional Neural Network (CNN) voice gender classifier. You can upload an audio file or record it from your microphone, and the application will predict the gender of the speaker.</h2>'
+
+css = '''
+.article-text {
+    font-size: 1.6em;
+    text-align: justify;
+}
+'''
+
+article = '''
+<div style="text-align: center; padding: 20px;">
+  <img src="https://media.licdn.com/dms/image/v2/C4E03AQHo_SiOFDMLtA/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1563539658469?e=1756339200&v=beta&t=--Zge8UfsgSSSE1SyNrOQYxDmgq_Q87szn5pMNE4xpE" alt="Profile Image" style="border-radius: 50%; margin: 0 auto;">
+  <h1>Cristian Tatu</h1>
+  <p class="article-text">
+    I created this AI model as a side project stemming from a larger one. While exploring the Mozilla Common Voice dataset for training a Text-to-Speech (TTS) model, I noticed that a significant number of audio samples had incorrect gender labels, which would severely degrade the training and quality of any TTS model.
+    My goal was to find a quick and easy way to determine a person's gender from their voice. To my surprise, "easy" solutions like the one found <a href="https://github.com/primaryobjects/voice-gender">here</a> weren't robust enough to handle background noise or poor microphone quality. Conversely, robust solutions, such as complex Transformer models like the one <a href="https://huggingface.co/JaesungHuh/voice-gender-classifier">here</a>, were too resource-intensive.
+    I believed there had to be a relatively simple model, like a Convolutional Neural Network (CNN), that could reliably distinguish a person's voice despite environmental noise and recording artifacts.
+  </p>
+</div>
+
+<div style="display: grid; grid-template-columns: 70% 2fr; gap: 5%; margin-top: 30px; align-items: start;">
+  <div>
+    <p class="article-text">
+      To develop this Convolutional Neural Network (CNN), the quality of the dataset was a top priority. I aimed to use datasets with a wide variety of sound conditions and languages to prevent the model from being biased towards any particular pattern other than a person's gender.
+      <br>
+      To achieve this, I combined three distinct datasets:
+    </p>
+    <ul class="article-text" style="padding-left: 5%;">
+      <li><a href="https://www.islrn.org/resources/038-476-412-610-4/">aGender</a> - German speech samples recorded over public telephone lines, featuring both read and semi-spontaneous speech.</li>
+      <li><a href="https://huggingface.co/datasets/litagin/moe-speech">litagin/moe-speech</a> - A high-quality collection of character-acting speech performed by professional Japanese voice actors.</li>
+      <li><a href="https://research.google.com/audioset/dataset/index.html">AudioSet</a> - A large-scale compilation of human-labeled 10-second sound clips extracted from YouTube videos.</li>
+    </ul>
+    <p class="article-text">
+      By combining these diverse datasets with data augmentation techniques, such as adding background noise to the audio clips at training time, I was able to create a very robust model.
+    </p>
+    <h1>Model Architecture:</h1>
+    <p class="article-text">
+      The architecture of this model was inspired by MobileNetV3, which is known for being lightweight and efficient. Although it was originally designed for images, it is easily adaptable to 1D signals like audio.
+      The diagram below illustrates the core building block of the architecture, the <i>Universal Inverted Bottleneck</i>. This block features two convolutional layers surrounding a <i>DepthWise</i> layer and a simple attention mechanism known as <i>Squeeze and Excitation</i>; a skip connection can be included if the configuration allows for it. The complete model, shown on the right, is composed of many such blocks with various configurations of skip connections, filters, and attention mechanisms.
+    </p>
+    <img src="gradio_api/file=inv_block.png" style="max-width: 80%; height: auto; object-fit: contain; margin: 0 auto;">
+    <p class="article-text">
+      The model is incredibly compact, taking up only 4 MB, which makes it well suited to running on mobile devices or even directly within web browsers.
+      I trained it on Google Colab for approximately 114 epochs, using ReduceLROnPlateau and EarlyStopping for optimization. To ensure broad compatibility, I converted the model to the universal ONNX format, making it easy to deploy across platforms including Linux, Windows, macOS, Android, iOS, and WebGPU.
+      Below, you can see graphs of the model's accuracy, loss, and learning rate over the course of training.
+    </p>
+  </div>
+  <img src="gradio_api/file=cnn_arch.png" style="max-width: 80%; height: auto; object-fit: contain;">
+</div>
+
+<div style="display: flex; justify-content: center; gap: 10px; margin-top: 30px;">
+  <img src="gradio_api/file=acc.png" style="max-width: 33%; height: auto; object-fit: contain;">
+  <img src="gradio_api/file=loss.png" style="max-width: 33%; height: auto; object-fit: contain;">
+  <img src="gradio_api/file=lr.png" style="max-width: 33%; height: auto; object-fit: contain;">
+</div>
+'''
+
+demo = gr.Interface(fn=model.inference, inputs=audio_component, outputs=label_component, flagging_mode="never", css=css,
+                    examples=sample_female + sample_male, title=title, description=description, article=article)
+demo.launch(allowed_paths=glob('*.png'))
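The article above describes the Universal Inverted Bottleneck only in prose, and the model code itself is not part of this commit. As a rough sketch under stated assumptions: a 1D variant of that block in Keras (the ReduceLROnPlateau/EarlyStopping callbacks mentioned suggest a Keras training loop), with every filter count, kernel size, and squeeze ratio invented for illustration.

```python
# Illustrative 1D Universal Inverted Bottleneck, as described in the article:
# expansion conv -> depthwise conv -> Squeeze-and-Excitation -> projection conv,
# with a skip connection when the configuration allows. All hyperparameters
# here are assumptions, not the repo's actual values.
from tensorflow.keras import layers

def squeeze_excite_1d(x, ratio=4):
    # Squeeze: global average over time; excite: per-channel sigmoid gates
    channels = x.shape[-1]
    se = layers.GlobalAveragePooling1D()(x)
    se = layers.Dense(channels // ratio, activation='relu')(se)
    se = layers.Dense(channels, activation='sigmoid')(se)
    se = layers.Reshape((1, channels))(se)
    return layers.Multiply()([x, se])

def inverted_bottleneck_1d(x, filters, expand=4, kernel=9, stride=1, use_se=True):
    inp = x
    x = layers.Conv1D(filters * expand, 1, padding='same', activation='relu')(x)  # expand
    x = layers.DepthwiseConv1D(kernel, strides=stride, padding='same', activation='relu')(x)
    if use_se:
        x = squeeze_excite_1d(x)
    x = layers.Conv1D(filters, 1, padding='same')(x)  # linear projection
    if stride == 1 and inp.shape[-1] == filters:
        x = layers.Add()([x, inp])  # skip connection only when shapes match
    return x
```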
male_3.wav → male_foreign_lang.wav RENAMED
File without changes
male_2.wav → male_foreign_lang2.wav RENAMED
File without changes
male_4.wav → male_foreign_lang3.wav RENAMED
File without changes
male_1.wav → male_radio.wav RENAMED
File without changes
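main.py imports CnnVoiceClassifier from a model.py that does not appear in this commit. Since the article notes the trained CNN was exported to ONNX, a plausible minimal sketch of that wrapper is shown below; the ONNX file name, expected input shape, sample rate, and label order are all assumptions, not the repo's actual values.

```python
# Hypothetical sketch of model.py (not in this commit). Assumes an ONNX
# export named 'voice_classifier.onnx' that takes a raw (batch, time, 1)
# waveform and returns two class probabilities; these details are guesses.
import librosa
import numpy as np
import onnxruntime as ort

class CnnVoiceClassifier:
    def __init__(self, model_path='voice_classifier.onnx', sample_rate=16000):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name
        self.sample_rate = sample_rate

    def inference(self, audio_path):
        # Decode and resample the uploaded file to the assumed training rate
        wav, _ = librosa.load(audio_path, sr=self.sample_rate, mono=True)
        x = wav[np.newaxis, :, np.newaxis].astype(np.float32)
        probs = self.session.run(None, {self.input_name: x})[0][0]
        # gr.Label renders a {label: confidence} dict directly
        return {'female': float(probs[0]), 'male': float(probs[1])}
```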