---
language:
- en
datasets:
- CREMA-D
library_name: transformers
tags:
- emotion-classification
- audio-classification
- audio-spectrogram
- transformer
- fine-tuned
license: apache-2.0
pipeline_tag: audio-classification
base_model: "MIT/ast-finetuned-audioset-10-10-0.4593"
metrics:
- accuracy
- f1
task_categories:
- audio-classification
---
# **AST Fine-Tuned Model for Emotion Classification**
This is a fine-tuned Audio Spectrogram Transformer (AST) model, specifically designed for classifying emotions in speech audio. The model was fine-tuned on the **CREMA-D dataset**, focusing on six emotional categories. The base model was sourced from **MIT's pre-trained AST model**.
---
## **Model Details**
- **Base Model**: `MIT/ast-finetuned-audioset-10-10-0.4593`
- **Fine-Tuned Dataset**: CREMA-D
- **Architecture**: Audio Spectrogram Transformer (AST)
- **Model Type**: Single-label classification
- **Input Features**: Log-Mel Spectrograms (128 mel bins)
- **Output Classes**:
- **ANG**: Anger
- **DIS**: Disgust
- **FEA**: Fear
- **HAP**: Happiness
- **NEU**: Neutral
- **SAD**: Sadness
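The six labels above can be held in a simple index-to-label mapping. The ordering below is an assumption (alphabetical by code); the authoritative mapping is `model.config.id2label`, so verify against that before relying on specific indices:

```python
# Assumed alphabetical ordering -- check model.config.id2label for the real mapping
id2label = {0: "ANG", 1: "DIS", 2: "FEA", 3: "HAP", 4: "NEU", 5: "SAD"}
label2id = {label: idx for idx, label in id2label.items()}

# Convert a predicted class index back to its emotion code
predicted_class = 3
emotion = id2label[predicted_class]  # "HAP" under this assumed ordering
```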
---
## **Model Configuration**
- **Hidden Size**: 768
- **Number of Attention Heads**: 12
- **Number of Hidden Layers**: 12
- **Patch Size**: 16
- **Maximum Length**: 1024
- **Dropout Probability**: 0.0
- **Activation Function**: GELU (Gaussian Error Linear Unit)
- **Optimizer**: Adam
- **Learning Rate**: 1e-4
---
## **Training Details**
- **Dataset**: CREMA-D (Emotion-Labeled Speech Data)
- **Data Augmentation**:
- Noise injection
- Time shifting
- Speed perturbation
- **Fine-Tuning Epochs**: 5
- **Batch Size**: 16
- **Learning Rate Scheduler**: Linear decay
- **Best Validation Accuracy**: 60.71%
- **Best Checkpoint**: `./results/checkpoint-1119`
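The three augmentations listed above can be sketched on a raw waveform with NumPy. This is a minimal illustration, not the exact training pipeline; the noise level, shift amount, and speed factor are assumed values:

```python
import numpy as np

def add_noise(waveform, noise_level=0.005):
    """Noise injection: mix in Gaussian noise scaled by noise_level."""
    return waveform + noise_level * np.random.randn(len(waveform))

def time_shift(waveform, shift=1600):
    """Time shifting: roll the waveform by `shift` samples (0.1 s at 16 kHz)."""
    return np.roll(waveform, shift)

def speed_perturb(waveform, factor=1.1):
    """Speed perturbation: linearly resample to change playback speed."""
    new_len = int(len(waveform) / factor)
    old_idx = np.linspace(0, len(waveform) - 1, num=new_len)
    return np.interp(old_idx, np.arange(len(waveform)), waveform)

waveform = np.random.randn(16000)  # 1 s of dummy audio at 16 kHz
augmented = speed_perturb(time_shift(add_noise(waveform)))
```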
---
## **How to Use**
### **Load the Model**
```python
import librosa
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor

# Load the fine-tuned model and its feature extractor
model = AutoModelForAudioClassification.from_pretrained("forwarder1121/ast-finetuned-model")
feature_extractor = AutoFeatureExtractor.from_pretrained("forwarder1121/ast-finetuned-model")

# Load the audio and resample to the 16 kHz rate the model expects
waveform, _ = librosa.load("path_to_audio.wav", sr=16000)

# Convert the waveform to log-mel spectrogram features
inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")

# Run inference and pick the highest-scoring class
outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()
print(f"Predicted emotion: {model.config.id2label[predicted_class]}")
```
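The `argmax` above returns only the top class. To get a probability per emotion, apply a softmax to the logits, sketched here with NumPy on example values (the actual logits depend on the input audio):

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

# Example logits for the six classes (illustrative values only)
logits = np.array([1.2, -0.3, 0.1, 2.5, 0.4, -1.0])
probs = softmax(logits)  # sums to 1; largest value marks the predicted class
```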
---
## **Metrics**
### **Validation Results**
- **Best Validation Accuracy**: 60.71%
- **Validation Loss**: 1.1126
### **Evaluation Details**
- **Eval Dataset**: CREMA-D test split
- **Batch Size**: 16
- **Number of Steps**: 94
---
## **Limitations**
- The model was trained on CREMA-D, which consists of acted emotional speech from English speakers. It may not generalize well to audio with different accents, speech styles, recording conditions, or languages.
- Validation accuracy is 60.71%, indicating room for improvement for real-world deployment.
---
## **Acknowledgments**
This work is based on the **Audio Spectrogram Transformer (AST)** model by MIT, fine-tuned for emotion classification. Special thanks to the developers of Hugging Face and the CREMA-D dataset contributors.
---
## **License**
The model is shared under the Apache 2.0 license, as declared in the model card metadata. Refer to the licensing details in the repository.
---
## **Citation**
If you use this model in your work, please cite:
```bibtex
@misc{ast-finetuned-model,
  author = {forwarder1121},
  title  = {Fine-Tuned Audio Spectrogram Transformer for Emotion Classification},
  year   = {2024},
  url    = {https://huggingface.co/forwarder1121/ast-finetuned-model},
}
```
---
## **Contact**
For questions, reach out to `[email protected]`.