---
metrics:
- wer
- cer
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- Cretan
- Greek dialect
---

# Cretan XLS-R model

Cretan is a variety of Modern Greek predominantly used by speakers who reside on the island of Crete or
belong to the Cretan diaspora. This includes communities of Cretan origin that were relocated to the
village of Hamidieh in Syria and to Western Asia Minor following the population exchange between
Greece and Turkey in 1923. The historical and geographical factors that have shaped the development
and preservation of the dialect include the long-term isolation of Crete from the mainland and the
successive domination of the island by foreign powers, such as the Arabs, the Venetians, and the Turks,
over a period of seven centuries. Based on its phonological, phonetic, morphological, and lexical
characteristics, Cretan has been divided into two major dialect groups: the western and the eastern.
The boundary between these groups coincides with the administrative division of the island into the
prefectures of Rethymno and Heraklion. Kontosopoulos (2008) argues that the eastern dialect group is more
homogeneous than the western one, which shows more variation across all levels of linguistic analysis.
Unlike other Modern Greek dialects, Cretan does not face the threat of extinction, as it remains
the sole means of communication for a large number of speakers in various parts of the island.

This is the first automatic speech recognition (ASR) model for Cretan.
To train it, we fine-tuned a Greek XLS-R model ([jonatasgrosman/wav2vec2-large-xlsr-53-greek](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-greek)) on recorded Cretan speech (see the Resources section below).

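A minimal usage sketch with the Transformers ASR pipeline; the repository id and audio file name below are placeholders to replace with the actual values:

```python
from transformers import pipeline

# "<this-model-repo-id>" is a placeholder for the Hub id of this model.
asr = pipeline(
    "automatic-speech-recognition",
    model="<this-model-repo-id>",
)

# The model expects 16 kHz mono audio (see the data preparation notes below).
result = asr("example_cretan_utterance.wav")  # placeholder file name
print(result["text"])
```
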
## Resources

For the compilation of the Cretan corpus, we gathered 32 tapes containing material from
radio broadcasts in digital format, with permission from the Audiovisual Department of the
Vikelaia Municipal Library of Heraklion, Crete. These broadcasts were recorded and
aired by Radio Mires, in the Messara region of Heraklion, during the period 1998–2001,
totaling 958 minutes and 47 seconds. The recordings primarily consist of narratives
by one speaker, Ioannis Anagnostakis, who is responsible for their composition. In terms
of textual genre, the linguistic content of the broadcasts consists of folklore
narratives expressed in the local linguistic variety. Out of the total volume of material
collected, we utilized nine tapes, selected to maximize the clarity of the digitized speech
and to ensure representative sampling across the entire three-year period of radio recordings.
To obtain an initial transcription, we employed Whisper Large-v2, the largest Whisper model
available at the time. The transcripts were then manually corrected in collaboration with the
local community. The transcriptions follow the Greek alphabet and standard Greek orthography
and were annotated in Praat.

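As an illustration of this first-pass transcription step, a minimal sketch using the openai-whisper package; the tape file name is a placeholder, and the exact setup used for the corpus may have differed:

```python
import whisper

# Load the Large-v2 checkpoint (the largest Whisper model available at the time).
model = whisper.load_model("large-v2")

# Draft transcription of one digitized tape; "el" (Greek) is the closest supported
# language tag for Cretan. The file name is a placeholder.
result = model.transcribe("radio_mires_tape_01.wav", language="el")
print(result["text"])
```
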
To prepare the dataset, the texts were normalized (see [greek_dialects_asr/](https://gitlab.com/ilsp-spmd-all/speech/greek_dialects_asr/) for scripts),
and all audio files were converted to 16 kHz mono.

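The audio conversion might look like the following sketch, assuming librosa and soundfile are used; the actual scripts live in the GitLab repository linked above, and the file names are placeholders:

```python
import librosa
import soundfile as sf

# Downmix to mono and resample to 16 kHz, the input format expected by
# XLS-R / wav2vec 2.0 models.
audio, sr = librosa.load("radio_mires_tape_01.wav", sr=16_000, mono=True)
sf.write("radio_mires_tape_01_16k.wav", audio, sr)
```
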
We split the Praat annotations into audio-transcription segments, which resulted in a dataset with a total duration of 1h 21m 12s.
Note that the removal of music, long pauses, and non-transcribed segments reduces the total audio duration
compared to the initial 2h of recordings from the nine tapes.

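A sketch of the segmentation step, assuming the Praat annotations have already been exported as (start, end, transcription) tuples; the interval values and file names below are purely illustrative:

```python
import soundfile as sf

# Illustrative intervals taken from a Praat TextGrid: (start_s, end_s, transcription).
intervals = [
    (12.4, 17.9, "..."),
    (18.3, 24.1, "..."),
]

audio, sr = sf.read("radio_mires_tape_01_16k.wav")

for i, (start, end, text) in enumerate(intervals):
    # Cut the utterance by converting interval times to sample indices.
    segment = audio[int(start * sr):int(end * sr)]
    sf.write(f"segment_{i:04d}.wav", segment, sr)
    # The transcription `text` would be stored alongside the clip, e.g. in a manifest file.
```
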
## Metrics

We evaluated the model on the test split, which consists of 10% of the dataset recordings.

|Model|WER|CER|
|---|---|---|
|pre-trained|104.83%|91.73%|
|fine-tuned|28.27%|7.88%|

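These scores can be computed with standard tooling; a sketch using the Hugging Face `evaluate` library on placeholder reference and prediction lists:

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder lists: reference transcriptions and model outputs for the test split.
references = ["..."]
predictions = ["..."]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```
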
## Training hyperparameters

We fine-tuned the baseline model (`wav2vec2-large-xlsr-53-greek`) on an NVIDIA GeForce RTX 3090, using the following hyperparameters:

| Argument | Value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 8 |
| `gradient_accumulation_steps` | 2 |
| `num_train_epochs` | 35 |
| `learning_rate` | 3e-4 |
| `warmup_steps` | 500 |

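For reference, these values map onto Transformers `TrainingArguments` roughly as in the sketch below; the output directory and the remaining arguments are assumptions, not the exact training configuration:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlsr-cretan",        # placeholder output directory
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=35,
    learning_rate=3e-4,
    warmup_steps=500,
    fp16=True,                       # assumption: mixed precision on the RTX 3090
)
```
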
## Citation

To cite this work or read more about the training pipeline, see:

S. Vakirtzian, C. Tsoukala, S. Bompolas, K. Mouzou, V. Stamou, G. Paraskevopoulos, A. Dimakis, S. Markantonatou, A. Ralli, and A. Anastasopoulos, "Speech Recognition for Greek Dialects: A Challenging Benchmark," Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2024.