---
metrics:
- wer
- cer
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- Cretan
- Greek dialect
---

# Cretan XLS-R model

Cretan is a variety of Modern Greek predominantly used by speakers who reside on the island of Crete or 
belong to the Cretan diaspora. This includes communities of Cretan origin that were relocated to the 
village of Hamidieh in Syria and to Western Asia Minor, following the population exchange between 
Greece and Turkey in 1923. The historical and geographical factors that have shaped the development 
and preservation of the dialect include the long-term isolation of Crete from the mainland, and the 
successive domination of the island by foreign powers, such as the Arabs, the Venetians, and the Turks, 
over a period of seven centuries. Cretan has been divided based on its phonological, phonetic, 
morphological, and lexical characteristics into two major dialect groups: the western and the eastern. 
The boundary between these groups coincides with the administrative division of the island into the 
prefectures of Rethymno and Heraklion. Kontosopoulos (2008) argues that the eastern dialect group is more 
homogeneous than the western one, which shows more variation across all levels of linguistic analysis. 
Unlike other Modern Greek dialects, Cretan does not face the threat of extinction, as it remains 
the sole means of communication for a large number of speakers in various parts of the island.

This is the first automatic speech recognition (ASR) model for Cretan. 
To train the model, we fine-tuned a Greek XLS-R model ([jonatasgrosman/wav2vec2-large-xlsr-53-greek](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-greek)) on the Cretan resources (see below).
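
As a minimal usage sketch (the checkpoint id below is a placeholder for this repository; the pipeline expects 16 kHz mono audio, matching the preprocessing described under Resources):

```python
from transformers import pipeline

# Placeholder model id: substitute the id of this repository.
asr = pipeline(
    "automatic-speech-recognition",
    model="<this-repo-id>",
)

# Transcribe a 16 kHz mono WAV file.
print(asr("sample_cretan.wav")["text"])
```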

## Resources

For the compilation of the Cretan corpus, we gathered 32 tapes containing material from  
radio broadcasts in digital format, with permission from the Audiovisual Department of the 
Vikelaia Municipal Library of Heraklion, Crete. These broadcasts were recorded and 
aired by Radio Mires, in the Messara region of Heraklion, during the period 1998-2001, 
totaling 958 minutes and 47 seconds. The recordings consist primarily of folklore 
narratives in the local linguistic variety, composed and delivered by a single speaker, 
Ioannis Anagnostakis. Out of the material collected, we used nine tapes, selected to 
maximize the digital clarity of the speech and to ensure representative sampling across 
the entire three-year period of radio recordings. To obtain an initial transcription, 
we employed the Large-v2 model, which was the largest Whisper model at the time. Subsequently, 
the transcripts were manually corrected in collaboration with the local community. 
The transcription system was based on the Greek alphabet and standard Greek orthography, 
and the transcripts were annotated in Praat.

To prepare the dataset, the texts were normalized (see [greek_dialects_asr/](https://gitlab.com/ilsp-spmd-all/speech/greek_dialects_asr/) for scripts), 
and all audio files were converted to 16 kHz mono.
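
A conversion along these lines (sketch only; file names are illustrative) can be done with `librosa` and `soundfile`:

```python
import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono, then write back as WAV.
audio, sr = librosa.load("tape_segment.wav", sr=16_000, mono=True)
sf.write("tape_segment_16k.wav", audio, sr)
```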

We split the Praat annotations into audio-transcription segments, which resulted in a dataset with a total duration of 1h 21m 12s.
Note that removing music, long pauses, and non-transcribed segments reduces the total audio duration 
(compared to the initial 2h of recordings on the nine tapes).
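
The segmentation step can be sketched as follows (assumptions: the `textgrid` package, a tier named "transcription", and illustrative file names):

```python
import textgrid          # pip install textgrid
import soundfile as sf

# Slice a 16 kHz mono WAV into audio-transcription pairs using the
# intervals of a Praat TextGrid tier; empty intervals (music, pauses,
# non-transcribed material) are skipped.
tg = textgrid.TextGrid.fromFile("broadcast.TextGrid")
audio, sr = sf.read("broadcast_16k.wav")

for i, interval in enumerate(tg.getFirst("transcription")):
    text = interval.mark.strip()
    if not text:
        continue
    start, end = int(interval.minTime * sr), int(interval.maxTime * sr)
    sf.write(f"segment_{i:04d}.wav", audio[start:end], sr)
    with open(f"segment_{i:04d}.txt", "w", encoding="utf-8") as f:
        f.write(text)
```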

## Metrics

We evaluated the model on the test split, which consists of 10% of the dataset recordings.

| Model       | WER     | CER    |
|-------------|---------|--------|
| pre-trained | 104.83% | 91.73% |
| fine-tuned  | 28.27%  | 7.88%  |
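
Both metrics can be computed with the `evaluate` library (sketch; the strings below are placeholders for the reference transcripts and model outputs):

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["reference transcript"]    # ground-truth transcripts
predictions = ["recognized transcript"]  # model outputs

print("WER:", wer_metric.compute(references=references, predictions=predictions))
print("CER:", cer_metric.compute(references=references, predictions=predictions))
```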

## Training hyperparameters

We fine-tuned the baseline model (`wav2vec2-large-xlsr-53-greek`) on an NVIDIA GeForce RTX 3090, using the following hyperparameters:

| arg                           | value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 8     |
| `gradient_accumulation_steps` | 2     |
| `num_train_epochs`            | 35    |
| `learning_rate`               | 3e-4  |
| `warmup_steps`                | 500   |
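
Expressed as `transformers` `TrainingArguments`, the table corresponds to a configuration along these lines (sketch; `output_dir` and `fp16` are assumptions not stated in the table):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlsr-cretan",          # assumption: illustrative path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=35,
    learning_rate=3e-4,
    warmup_steps=500,
    fp16=True,                         # assumption: mixed precision on RTX 3090
)
```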

## Citation

To cite this work or read more about the training pipeline, see:

S. Vakirtzian, C. Tsoukala, S. Bompolas, K. Mouzou, V. Stamou, G. Paraskevopoulos, A. Dimakis, S. Markantonatou, A. Ralli, A. Anastasopoulos, Speech Recognition for Greek Dialects: A Challenging Benchmark, Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2024.