This model is fine-tuned specifically for song transcription. It was later applied to a lyric video generation task.