Cristian Tatu committed
Commit 611f013 · 1 Parent(s): 1e50356

add description article
.gitattributes CHANGED
@@ -33,11 +33,13 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
-male_1.wav filter=lfs diff=lfs merge=lfs -text
-male_2.wav filter=lfs diff=lfs merge=lfs -text
-male_3.wav filter=lfs diff=lfs merge=lfs -text
-male_4.wav filter=lfs diff=lfs merge=lfs -text
-female_1.wav filter=lfs diff=lfs merge=lfs -text
-female_2.wav filter=lfs diff=lfs merge=lfs -text
-female_3.wav filter=lfs diff=lfs merge=lfs -text
-female_4.wav filter=lfs diff=lfs merge=lfs -text
+male_foreign_lang.wav filter=lfs diff=lfs merge=lfs -text
+male_foreign_lang2.wav filter=lfs diff=lfs merge=lfs -text
+male_foreign_lang3.wav filter=lfs diff=lfs merge=lfs -text
+male_radio.wav filter=lfs diff=lfs merge=lfs -text
+female_foreign_lang.wav filter=lfs diff=lfs merge=lfs -text
+female_phone.wav filter=lfs diff=lfs merge=lfs -text
+female_laugh.wav filter=lfs diff=lfs merge=lfs -text
+female_blah.wav filter=lfs diff=lfs merge=lfs -text
+*.wav filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
female_3.wav → acc.png RENAMED
File without changes
cnn_arch.png ADDED

Git LFS Details

  • SHA256: 7f0f23ff98929b6ec7c1d5ef7b01a863e99ea9c83ea675e2a43691c9f3ef4e86
  • Pointer size: 131 Bytes
  • Size of remote file: 675 kB
female_blah.wav ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:39c9f005b211e842519dd47fde05308f2689ee667ccd61822948ad1b12a98542
+size 330794
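The three lines above are the entire checked-in file: Git LFS stores only this small pointer stub in the repo and keeps the real 330 kB audio in remote storage. A minimal sketch of reading such a stub (hypothetical helper, not part of this commit):

```python
# Parse a Git LFS pointer stub like the one added above into its fields:
# the spec URL, the 'sha256:<hash>' object id, and the payload size in bytes.
def parse_lfs_pointer(path):
    fields = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            key, _, value = line.strip().partition(' ')
            fields[key] = value
    return fields['version'], fields['oid'], int(fields['size'])
```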
female_4.wav → female_foreign_lang.wav RENAMED
File without changes
female_2.wav → female_laugh.wav RENAMED
File without changes
female_1.wav → female_phone.wav RENAMED
File without changes
inv_block.png ADDED

Git LFS Details

  • SHA256: 859bea7919b3d57bf9586c5230e1ae464962cb117ac5672759f09933471b10fe
  • Pointer size: 131 Bytes
  • Size of remote file: 123 kB
loss.png ADDED

Git LFS Details

  • SHA256: 5621bb2a06ca0f7ced2a4a609094e1fc07cddd3bbf361ddf4840e883705a6b71
  • Pointer size: 130 Bytes
  • Size of remote file: 32.4 kB
lr.png ADDED

Git LFS Details

  • SHA256: f4387a8d19b8b2378df9bf331e4de905a33c02b1edcd1db16e76aae65acabefb
  • Pointer size: 130 Bytes
  • Size of remote file: 23.3 kB
main.py CHANGED
@@ -1,13 +1,73 @@
 import gradio as gr
 from model import CnnVoiceClassifier
+from glob import glob
 
 model = CnnVoiceClassifier()
 
 audio_component = gr.Audio(type='filepath', label='Upload your audio file here')
 label_component = gr.Label(label='Gender classification result')
 
-sample_female = [f'female_{i}.wav' for i in range(1, 5)]
-sample_male = [f'male_{i}.wav' for i in range(1, 5)]
+sample_female = glob('female_*.wav')
+sample_male = glob('male_*.wav')
 
-demo = gr.Interface(fn=model.inference, inputs=audio_component, outputs=label_component, examples=sample_female + sample_male)
-demo.launch()
+title = 'CNN Voice Classifier 👨👩'
+description = '<h2>This application is a Convolutional Neural Network (CNN) voice gender classifier. You can upload an audio file or record it from your microphone, and the application will predict the gender of the speaker.</h2>'
+
+css = '''
+.article-text {
+    font-size: 1.6em;
+    text-align: justify;
+}
+'''
+
+article = '''
+<div style="text-align: center; padding: 20px;">
+  <img src="https://media.licdn.com/dms/image/v2/C4E03AQHo_SiOFDMLtA/profile-displayphoto-shrink_200_200/profile-displayphoto-shrink_200_200/0/1563539658469?e=1756339200&v=beta&t=--Zge8UfsgSSSE1SyNrOQYxDmgq_Q87szn5pMNE4xpE" alt="Profile Image" style="border-radius: 50%; margin: 0 auto;">
+  <h1>Cristian Tatu</h1>
+  <p class="article-text">
+    I created this AI model as a side project stemming from a larger one. While exploring the Mozilla Common Voice dataset for training a Text-to-Speech (TTS) model, I noticed that a significant number of audio samples had incorrect gender labels, which would severely degrade the training and quality of any TTS model.
+    My goal was to find a quick and easy way to determine a person's gender from their voice. To my surprise, "easy" solutions like the one found <a href="https://github.com/primaryobjects/voice-gender">here</a> weren't robust enough to handle background noise or poor microphone quality. Conversely, robust solutions, such as complex Transformer models like the one <a href="https://huggingface.co/JaesungHuh/voice-gender-classifier">here</a>, were too resource-intensive.
+    I believed there had to be a relatively simple model, like a Convolutional Neural Network (CNN), that could reliably distinguish a person's voice despite environmental noise and recording artifacts.
+  </p>
+</div>
+
+<div style="display: grid; grid-template-columns: 70% 2fr; gap: 5%; margin-top: 30px; align-items: start;">
+  <div>
+    <p class="article-text">
+      To develop this Convolutional Neural Network (CNN), the quality of the dataset was a top priority. I aimed to use datasets with a wide variety of sound conditions and languages to prevent the model from being biased towards any particular pattern other than a person's gender.
+      <br>
+      To achieve this, I combined three distinct datasets:
+    </p>
+    <ul class="article-text" style="padding-left: 5%;">
+      <li><a href="https://www.islrn.org/resources/038-476-412-610-4/">aGender</a> - German speech samples recorded over public telephone lines, featuring both read and semi-spontaneous speech.</li>
+      <li><a href="https://huggingface.co/datasets/litagin/moe-speech">litagin/moe-speech</a> - A high-quality collection of character-acting speech performed by professional Japanese voice actors.</li>
+      <li><a href="https://research.google.com/audioset/dataset/index.html">AudioSet</a> - A large-scale compilation of human-labeled 10-second sound clips extracted from YouTube videos.</li>
+    </ul>
+    <p class="article-text">
+      By combining these diverse datasets with data augmentation techniques, such as adding background noise to the audio clips at training time, I was able to create a very robust model.
+    </p>
+    <h1>Model Architecture:</h1>
+    <p class="article-text">
+      The architecture of this model was inspired by MobileNetV3, which is known for being lightweight and efficient. Although it was originally designed for images, it is easily adaptable to 1D signals like audio.
+      The diagram below illustrates the core building block of the architecture, the <i>Universal Inverted Bottleneck</i>. This block features two convolutional layers surrounding a <i>DepthWise</i> layer and a simple attention mechanism known as <i>Squeeze and Excitation</i>; a skip connection can be included if the configuration allows for it. The complete model, shown on the right, is composed of many such blocks with various configurations of skip connections, filters, and attention mechanisms.
+    </p>
+    <img src="gradio_api/file=inv_block.png" style="max-width: 80%; height: auto; object-fit: contain; margin: 0 auto;">
+    <p class="article-text">
+      The model is incredibly compact, taking up only 4 MB, which makes it well suited to running on mobile devices or even directly within web browsers.
+      I trained it on Google Colab for approximately 114 epochs, using ReduceLROnPlateau and EarlyStopping for optimization. To ensure broad compatibility, I converted the model to the universal ONNX format, making it easy to deploy across platforms including Linux, Windows, macOS, Android, iOS, and WebGPU.
+      Below, you can see graphs of the model's accuracy, loss, and learning rate over the course of training.
+    </p>
+  </div>
+  <img src="gradio_api/file=cnn_arch.png" style="max-width: 80%; height: auto; object-fit: contain;">
+</div>
+
+<div style="display: flex; justify-content: center; gap: 10px; margin-top: 30px;">
+  <img src="gradio_api/file=acc.png" style="max-width: 33%; height: auto; object-fit: contain;">
+  <img src="gradio_api/file=loss.png" style="max-width: 33%; height: auto; object-fit: contain;">
+  <img src="gradio_api/file=lr.png" style="max-width: 33%; height: auto; object-fit: contain;">
+</div>
+'''
+
+demo = gr.Interface(fn=model.inference, inputs=audio_component, outputs=label_component, flagging_mode="never", css=css,
+                    examples=sample_female + sample_male, title=title, description=description, article=article)
+demo.launch(allowed_paths=glob('*.png'))
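The article above describes the Universal Inverted Bottleneck only in prose, and the model code itself is not part of this commit. As a rough sketch under stated assumptions: a 1D variant of that block in Keras (the ReduceLROnPlateau/EarlyStopping callbacks mentioned suggest a Keras training loop), with every filter count, kernel size, and squeeze ratio invented for illustration.

```python
# Illustrative 1D Universal Inverted Bottleneck, as described in the article:
# expansion conv -> depthwise conv -> Squeeze-and-Excitation -> projection conv,
# with a skip connection when the configuration allows. All hyperparameters
# here are assumptions, not the repo's actual values.
from tensorflow.keras import layers

def squeeze_excite_1d(x, ratio=4):
    # Squeeze: global average over time; excite: per-channel sigmoid gates
    channels = x.shape[-1]
    se = layers.GlobalAveragePooling1D()(x)
    se = layers.Dense(channels // ratio, activation='relu')(se)
    se = layers.Dense(channels, activation='sigmoid')(se)
    se = layers.Reshape((1, channels))(se)
    return layers.Multiply()([x, se])

def inverted_bottleneck_1d(x, filters, expand=4, kernel=9, stride=1, use_se=True):
    inp = x
    x = layers.Conv1D(filters * expand, 1, padding='same', activation='relu')(x)  # expand
    x = layers.DepthwiseConv1D(kernel, strides=stride, padding='same', activation='relu')(x)
    if use_se:
        x = squeeze_excite_1d(x)
    x = layers.Conv1D(filters, 1, padding='same')(x)  # linear projection
    if stride == 1 and inp.shape[-1] == filters:
        x = layers.Add()([x, inp])  # skip connection only when shapes match
    return x
```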
male_3.wav → male_foreign_lang.wav RENAMED
File without changes
male_2.wav → male_foreign_lang2.wav RENAMED
File without changes
male_4.wav → male_foreign_lang3.wav RENAMED
File without changes
male_1.wav → male_radio.wav RENAMED
File without changes
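main.py imports CnnVoiceClassifier from a model.py that does not appear in this commit. Since the article notes the trained CNN was exported to ONNX, a plausible minimal sketch of that wrapper is shown below; the ONNX file name, expected input shape, sample rate, and label order are all assumptions, not the repo's actual values.

```python
# Hypothetical sketch of model.py (not in this commit). Assumes an ONNX
# export named 'voice_classifier.onnx' that takes a raw (batch, time, 1)
# waveform and returns two class probabilities; these details are guesses.
import librosa
import numpy as np
import onnxruntime as ort

class CnnVoiceClassifier:
    def __init__(self, model_path='voice_classifier.onnx', sample_rate=16000):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name
        self.sample_rate = sample_rate

    def inference(self, audio_path):
        # Decode and resample the uploaded file to the assumed training rate
        wav, _ = librosa.load(audio_path, sr=self.sample_rate, mono=True)
        x = wav[np.newaxis, :, np.newaxis].astype(np.float32)
        probs = self.session.run(None, {self.input_name: x})[0][0]
        # gr.Label renders a {label: confidence} dict directly
        return {'female': float(probs[0]), 'male': float(probs[1])}
```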