---
metrics:
- wer
- cer
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- Cretan
- Greek dialect
---

# Cretan XLS-R model

Cretan is a variety of Modern Greek predominantly used by speakers who reside on the island of Crete or
belong to the Cretan diaspora. This includes communities of Cretan origin that were relocated to the
village of Hamidieh in Syria and to Western Asia Minor following the population exchange between
Greece and Turkey in 1923. The historical and geographical factors that have shaped the development
and preservation of the dialect include the long-term isolation of Crete from the mainland and the
successive domination of the island by foreign powers, such as the Arabs, the Venetians, and the Turks,
over a period of seven centuries. Based on its phonological, phonetic, morphological, and lexical
characteristics, Cretan has been divided into two major dialect groups: the western and the eastern.
The boundary between these groups coincides with the administrative division of the island into the
prefectures of Rethymno and Heraklion. Kontosopoulos (2008) argues that the eastern dialect group is more
homogeneous than the western one, which shows more variation across all levels of linguistic analysis.
Unlike other Modern Greek dialects, Cretan does not face the threat of extinction, as it remains
the sole means of communication for a large number of speakers in various parts of the island.

This is the first automatic speech recognition (ASR) model for Cretan.
To train the model, we fine-tuned a Greek XLS-R model ([jonatasgrosman/wav2vec2-large-xlsr-53-greek](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-greek)) on the Cretan resources (see below).
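
A minimal inference sketch using the `transformers` pipeline API. The repo id and
audio path below are placeholders, not this model's actual identifiers:

```python
from transformers import pipeline

# Placeholder repo id -- substitute this model's actual Hugging Face Hub id.
asr = pipeline("automatic-speech-recognition", model="<org>/<cretan-xlsr-model>")

# Input audio should be 16 kHz mono, matching the training data.
result = asr("cretan_sample.wav")
print(result["text"])
```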

## Resources

For the compilation of the Cretan corpus, we gathered 32 tapes containing material from
radio broadcasts in digital format, with permission from the Audiovisual Department of the
Vikelaia Municipal Library of Heraklion, Crete. These broadcasts were recorded and
aired by Radio Mires in the Messara region of Heraklion during the period 1998-2001,
totaling 958 minutes and 47 seconds. The recordings primarily consist of narratives
composed and delivered by a single speaker, Ioannis Anagnostakis. In terms
of textual genre, the linguistic content of the broadcasts consists of folklore
narratives expressed in the local linguistic variety. Out of the total volume of material
collected, we utilized nine tapes. The selection criteria were, on the one hand,
maximizing the digital clarity of speech and, on the other, ensuring representative sampling
across the entire three-year period of radio recordings. To obtain an initial transcription,
we employed Whisper Large-v2, the largest Whisper model available at the time. The transcripts
were then manually corrected in collaboration with the local community.
The transcription system was based on the Greek alphabet and orthography,
and the transcripts were annotated in Praat.

To prepare the dataset, the texts were normalized (see [greek_dialects_asr/](https://gitlab.com/ilsp-spmd-all/speech/greek_dialects_asr/) for scripts),
and all audio files were converted to 16 kHz mono.
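
The conversion step can be sketched with the Python standard library alone (a
linear-interpolation resampler for 16-bit PCM WAV; a real pipeline would more
likely use ffmpeg, librosa, or torchaudio):

```python
import struct
import wave


def to_mono_16k(in_path: str, out_path: str, target_rate: int = 16000) -> None:
    """Convert a 16-bit PCM WAV file to 16 kHz mono (sketch, not production code)."""
    with wave.open(in_path, "rb") as wf:
        n_channels = wf.getnchannels()
        rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())

    # Decode interleaved 16-bit samples and average the channels to mono.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    mono = [
        sum(samples[i:i + n_channels]) // n_channels
        for i in range(0, len(samples), n_channels)
    ]

    # Resample to the target rate with linear interpolation.
    out_len = int(len(mono) * target_rate / rate)
    resampled = []
    for j in range(out_len):
        pos = j * rate / target_rate
        i0 = int(pos)
        i1 = min(i0 + 1, len(mono) - 1)
        frac = pos - i0
        resampled.append(int(mono[i0] * (1 - frac) + mono[i1] * frac))

    with wave.open(out_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(target_rate)
        wf.writeframes(struct.pack("<%dh" % len(resampled), *resampled))
```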

We split the Praat annotations into audio-transcription segments, resulting in a dataset with a total duration of 1h 21m 12s.
Note that removing music, long pauses, and non-transcribed segments reduces the total audio duration
compared to the initial 2h of recordings on the nine tapes.
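
The filtering logic amounts to keeping only labeled intervals. A minimal sketch,
assuming the tier has already been read from the TextGrid as `(start, end, label)`
tuples; the skip labels here are hypothetical and would need to match the corpus's
actual annotation scheme:

```python
def select_segments(intervals, skip_labels=("", "music", "pause")):
    """Keep only transcribed intervals; return (segments, total_seconds).

    `intervals` is a list of (start, end, label) tuples, e.g. one interval
    tier of a Praat TextGrid. Intervals whose label is empty or marks
    music/pauses are dropped, shrinking the total audio duration.
    """
    segments = [
        (start, end, label.strip())
        for start, end, label in intervals
        if label.strip().lower() not in skip_labels
    ]
    total = sum(end - start for start, end, _ in segments)
    return segments, total
```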

## Metrics

We evaluated the model on the test split, which comprises 10% of the dataset recordings.

|Model|WER|CER|
|---|---|---|
|pre-trained|104.83%|91.73%|
|fine-tuned|28.27%|7.88%|
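
Both metrics are normalized edit distances: WER over word tokens, CER over
characters. A minimal reference implementation (a standard Levenshtein sketch,
not the exact evaluation script used here):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over words / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: the same ratio computed over characters."""
    return edit_distance(reference, hypothesis) / len(reference)
```

Note that WER can exceed 100% when the hypothesis contains many insertions
relative to the reference, which is how the pre-trained model scores 104.83%.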

## Training hyperparameters

We fine-tuned the baseline model (`wav2vec2-large-xlsr-53-greek`) on an NVIDIA GeForce RTX 3090, using the following hyperparameters:

| argument                      | value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 8     |
| `gradient_accumulation_steps` | 2     |
| `num_train_epochs`            | 35    |
| `learning_rate`               | 3e-4  |
| `warmup_steps`                | 500   |
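
In `transformers`, these values map onto `TrainingArguments` roughly as follows
(an illustrative sketch: the output directory and any arguments not listed in the
table above are assumptions, not the authors' exact training script):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlsr-cretan",        # hypothetical path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size of 8 * 2 = 16
    num_train_epochs=35,
    learning_rate=3e-4,
    warmup_steps=500,
)
```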

## Citation

To cite this work or read more about the training pipeline, see:

S. Vakirtzian, C. Tsoukala, S. Bompolas, K. Mouzou, V. Stamou, G. Paraskevopoulos, A. Dimakis, S. Markantonatou, A. Ralli, and A. Anastasopoulos, "Speech Recognition for Greek Dialects: A Challenging Benchmark," Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2024.