CNN Musical Note Classifier
A deep learning model for classifying musical notes by pitch and note length. The model achieves a test accuracy of 99.66% on this dataset.
Model Overview
Architecture
This model is built using a Convolutional Neural Network (CNN) architecture with the following features:
- Input size: `(64, 64, 1)`
- Number of parameters: 696,255
- Layers include:
  - Multiple `Conv2D` and `BatchNormalization` layers for feature extraction
  - `GlobalAveragePooling2D` and `Dense` layers for classification
  - Regularization via `Dropout` layers
- Output: 85 classes, representing combinations of pitch and note length.
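The model card does not give the exact layer configuration, so the sketch below is a hypothetical Keras reconstruction of the described architecture: stacked `Conv2D` + `BatchNormalization` blocks, `GlobalAveragePooling2D` and `Dense` layers for classification, and `Dropout` for regularization. The filter counts, kernel sizes, and dropout rates are assumptions, so the parameter count will not match 696,255 exactly.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model(num_classes=85):
    """Hypothetical reconstruction of the described CNN (exact config unknown)."""
    return models.Sequential([
        layers.Input(shape=(64, 64, 1)),            # 64x64 grayscale input
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.GlobalAveragePooling2D(),            # collapse spatial dims
        layers.Dropout(0.5),                        # regularization
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

model = build_model()
```

Global average pooling in place of `Flatten` keeps the parameter count low, which is consistent with the modest total of ~700k parameters reported above.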
Dataset
- Original dataset size: 1,785 samples
- Augmented dataset size: 71,400 samples
- Total size: 73,185 samples
- Labels include various combinations of pitch (`A3`, `B4`, `C6`, etc.) and note length (`16th`, `quarter`, `whole`, etc.)
Training Details
- Optimizer: Adam
- Loss function: Categorical Crossentropy
- Epochs: 400+
- Batch size: Tuned to balance training speed and accuracy
- Final evaluation results:
  - Test Loss: 0.1286
  - Test Accuracy: 99.66%
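The reported test loss is categorical crossentropy over the 85 one-hot classes. As a minimal numpy illustration of how that number is computed (independent of the model itself):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean crossentropy between one-hot labels and softmax probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return float(-np.sum(y_true * np.log(y_pred), axis=-1).mean())

# One sample, 3 classes: the true class is predicted with probability 0.9,
# so the loss is -log(0.9) ~= 0.105
y_true = np.array([[0.0, 1.0, 0.0]])
y_pred = np.array([[0.05, 0.9, 0.05]])
loss = categorical_crossentropy(y_true, y_pred)
```

A test loss of 0.1286 alongside 99.66% accuracy suggests the model assigns high probability to the correct class on almost all test samples.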
Examples of Feature Detection
Filter Visualization
Filters from the first convolutional layers demonstrate the features captured by the network.
Training and Validation Loss
The following graph shows the training and validation loss during model training:
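A plot like this is typically produced from the history returned by `model.fit`. Below is a minimal matplotlib sketch; the `history` dict here is a stand-in with made-up values, whereas in practice it would come from `model.fit(...).history`.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, no display needed
import matplotlib.pyplot as plt

# Stand-in for model.fit(...).history (illustrative values only)
history = {"loss": [1.8, 0.9, 0.4, 0.2], "val_loss": [1.9, 1.0, 0.5, 0.3]}

plt.plot(history["loss"], label="training loss")
plt.plot(history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("categorical crossentropy")
plt.legend()
plt.savefig("loss_curve.png")
```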
Labels
The model supports 85 classes, which include:
- Pitches: `A3`, `B4`, `C6`, etc.
- Note lengths: `16th`, `quarter`, `whole`, etc.
Full label list:
```python
['A316th', 'A3eighth', 'A3half', 'A3quarter', 'A3whole', 'A416th', 'A4eighth', 'A4half', 'A4quarter', 'A4whole', 'A516th', 'A5eighth', 'A5half', 'A5quarter', 'A5whole', 'B316th', 'B3eighth', 'B3half', 'B3quarter', 'B3whole', 'B416th', 'B4eighth', 'B4half', 'B4quarter', 'B4whole', 'B516th', 'B5eighth', 'B5half', 'B5quarter', 'B5whole', 'C416th', 'C4eighth', 'C4half', 'C4quarter', 'C4whole', 'C516th', 'C5eighth', 'C5half', 'C5quarter', 'C5whole', 'C616th', 'C6eighth', 'C6half', 'C6quarter', 'C6whole', 'D416th', 'D4eighth', 'D4half', 'D4quarter', 'D4whole', 'D516th', 'D5eighth', 'D5half', 'D5quarter', 'D5whole', 'E416th', 'E4eighth', 'E4half', 'E4quarter', 'E4whole', 'E516th', 'E5eighth', 'E5half', 'E5quarter', 'E5whole', 'F416th', 'F4eighth', 'F4half', 'F4quarter', 'F4whole', 'F516th', 'F5eighth', 'F5half', 'F5quarter', 'F5whole', 'G416th', 'G4eighth', 'G4half', 'G4quarter', 'G4whole', 'G516th', 'G5eighth', 'G5half', 'G5quarter', 'G5whole']
```
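The 85 classes are the Cartesian product of 17 pitches and 5 note lengths, and each label is simply the pitch concatenated with the length. A small sketch reconstructing the list and splitting a label back into its parts:

```python
# Pitches and note lengths, in the order they appear in the full label list above
pitches = ['A3', 'A4', 'A5', 'B3', 'B4', 'B5', 'C4', 'C5', 'C6',
           'D4', 'D5', 'E4', 'E5', 'F4', 'F5', 'G4', 'G5']
lengths = ['16th', 'eighth', 'half', 'quarter', 'whole']

# 17 pitches x 5 lengths = 85 classes
labels = [p + l for p in pitches for l in lengths]

def split_label(label):
    """Split e.g. 'A316th' into ('A3', '16th'); the pitch is always 2 chars."""
    return label[:2], label[2:]
```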
Datasets
To ensure the model generalizes well, we augmented the data using the following techniques:
- Cropping the notes in different ways to simulate varied positioning on the musical staff.
- Using random transformations like rotation, zooming, and shifting with TensorFlow's ImageDataGenerator.
- Augmentation expanded the dataset from 1,785 images to a total of 55,335 images.

The augmented dataset is available at https://huggingface.co/datasets/dongim04/augmented_musical_note_dataset
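The random-transformation step above can be sketched with Keras's `ImageDataGenerator`. The rotation, zoom, and shift ranges below are illustrative assumptions; the card does not state the exact values used.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Hypothetical parameter values (the exact ranges used are not documented)
datagen = ImageDataGenerator(
    rotation_range=5,        # small random rotations, in degrees
    zoom_range=0.1,          # random zoom in/out
    width_shift_range=0.1,   # random horizontal shift
    height_shift_range=0.1,  # random vertical shift
)

# Dummy 64x64 grayscale images matching the model's input shape
images = np.random.rand(8, 64, 64, 1).astype("float32")
labels = np.zeros((8, 85), dtype="float32")

# Each call yields a freshly transformed batch
batch_x, batch_y = next(datagen.flow(images, labels, batch_size=8))
```

Because transformations are sampled on the fly, iterating repeatedly over `flow` produces many distinct variants of each original image, which is how a small set of 1,785 source images can be expanded into tens of thousands of samples.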