Audio-Text-to-Text
glap_model