AST Fine-Tuned Model for Emotion Classification

This is an Audio Spectrogram Transformer (AST) model fine-tuned to classify emotions in speech audio. Starting from MIT's pre-trained AST checkpoint, the model was fine-tuned on the CREMA-D dataset to recognize six emotional categories.


Model Details

  • Base Model: MIT/ast-finetuned-audioset-10-10-0.4593
  • Fine-Tuned Dataset: CREMA-D
  • Architecture: Audio Spectrogram Transformer (AST)
  • Model Type: Single-label classification
  • Input Features: Log-Mel Spectrograms (128 mel bins)
  • Output Classes:
    • ANG: Anger
    • DIS: Disgust
    • FEA: Fear
    • HAP: Happiness
    • NEU: Neutral
    • SAD: Sadness
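For quick reference, the six classes above can be written as an id-to-label mapping. Note that the alphabetical index order shown here is an assumption; the authoritative mapping ships with the checkpoint as `model.config.id2label`:

```python
# Hypothetical id2label mapping for the six CREMA-D emotion classes
# (alphabetical ordering is an assumption; check model.config.id2label
# of the released checkpoint for the authoritative mapping)
id2label = {0: "ANG", 1: "DIS", 2: "FEA", 3: "HAP", 4: "NEU", 5: "SAD"}
label2id = {label: idx for idx, label in id2label.items()}
```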

Model Configuration

  • Hidden Size: 768
  • Number of Attention Heads: 12
  • Number of Hidden Layers: 12
  • Patch Size: 16
  • Maximum Length: 1024
  • Dropout Probability: 0.0
  • Activation Function: GELU (Gaussian Error Linear Unit)
  • Optimizer: Adam
  • Learning Rate: 1e-4
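The architectural hyperparameters above can be expressed as a transformers `ASTConfig` (a sketch using the library's standard field names; the optimizer and learning rate are training arguments, not part of the model config):

```python
from transformers import ASTConfig

# Sketch of the model configuration; field names follow the
# transformers AST implementation.
config = ASTConfig(
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    patch_size=16,
    max_length=1024,          # spectrogram time frames
    num_mel_bins=128,         # matches the 128-bin log-mel input
    hidden_dropout_prob=0.0,
    hidden_act="gelu",
)
```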

Training Details

  • Dataset: CREMA-D (Emotion-Labeled Speech Data)
  • Data Augmentation:
    • Noise injection
    • Time shifting
    • Speed perturbation
  • Fine-Tuning Epochs: 5
  • Batch Size: 16
  • Learning Rate Scheduler: Linear decay
  • Best Validation Accuracy: 60.71%
  • Best Checkpoint: ./results/checkpoint-1119
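The three augmentations listed above can be sketched with plain NumPy. Parameter values here are illustrative defaults, not the settings used in training:

```python
import numpy as np

def add_noise(wav: np.ndarray, snr_db: float = 20.0) -> np.ndarray:
    """Noise injection: add white Gaussian noise at a target SNR (dB)."""
    signal_power = np.mean(wav ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return wav + np.random.randn(len(wav)) * np.sqrt(noise_power)

def time_shift(wav: np.ndarray, max_shift: int = 1600) -> np.ndarray:
    """Time shifting: roll the waveform by a random sample offset."""
    return np.roll(wav, np.random.randint(-max_shift, max_shift))

def speed_perturb(wav: np.ndarray, rate: float = 1.1) -> np.ndarray:
    """Speed perturbation via naive linear resampling (rate > 1 speeds up)."""
    idx = np.arange(0, len(wav), rate)
    return np.interp(idx, np.arange(len(wav)), wav)
```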

How to Use

Load the Model

```python
import torch
import torchaudio
from transformers import AutoModelForAudioClassification, AutoProcessor

# Load the model and processor
model = AutoModelForAudioClassification.from_pretrained("forwarder1121/ast-finetuned-model")
processor = AutoProcessor.from_pretrained("forwarder1121/ast-finetuned-model")

# Load the audio, downmix to mono, and resample to 16 kHz
# (the processor expects a raw waveform array, not a file path)
waveform, sr = torchaudio.load("path_to_audio.wav")
waveform = torchaudio.functional.resample(waveform.mean(dim=0), sr, 16000)

# The processor converts the waveform into a log-mel spectrogram
inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

# Make predictions
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()

print(f"Predicted emotion: {model.config.id2label[predicted_class]}")
```

Metrics

Validation Results

  • Best Validation Accuracy: 60.71%
  • Validation Loss: 1.1126

Evaluation Details

  • Eval Dataset: CREMA-D test split
  • Batch Size: 16
  • Number of Steps: 94
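For scale, 94 evaluation steps at batch size 16 cover roughly 1,504 clips. Accuracy over such a loop can be sketched as follows (names are illustrative, not the actual evaluation code):

```python
def accuracy(batches):
    """Compute accuracy over (logits, labels) batches.

    `batches` yields pairs of per-example logit lists and integer labels;
    the predicted class is the argmax of each example's logits.
    """
    correct = total = 0
    for logits, labels in batches:
        for scores, label in zip(logits, labels):
            correct += int(scores.index(max(scores)) == label)
            total += 1
    return correct / total
```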

Limitations

  • The model was trained on CREMA-D, which has a specific set of speech data. It may not generalize well to datasets with different accents, speech styles, or languages.
  • The best validation accuracy is 60.71%, which leaves substantial room for improvement before real-world deployment.

Acknowledgments

This work is based on the Audio Spectrogram Transformer (AST) model by MIT, fine-tuned for emotion classification. Special thanks to the developers of Hugging Face and the CREMA-D dataset contributors.


License

The model is shared under the MIT License. Refer to the licensing details in the repository.


Citation

If you use this model in your work, please cite:

@misc{ast-finetuned-model,
  author = {forwarder1121},
  title = {Fine-Tuned Audio Spectrogram Transformer for Emotion Classification},
  year = {2024},
  url = {https://huggingface.co/forwarder1121/ast-finetuned-model},
}

Contact

For questions, reach out to [email protected].
