---
metrics:
- wer
- cer
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- Cretan
- Greek dialect
---

# Cretan XLS-R model

Cretan is a variety of Modern Greek predominantly used by speakers who reside on the island of Crete or
belong to the Cretan diaspora. This includes communities of Cretan origin that were relocated to the
village of Hamidieh in Syria and to Western Asia Minor following the population exchange between
Greece and Turkey in 1923. The historical and geographical factors that have shaped the development
and preservation of the dialect include the long-term isolation of Crete from the mainland and the
successive domination of the island by foreign powers, such as the Arabs, the Venetians, and the Turks,
over a period of seven centuries. Based on its phonological, phonetic, morphological, and lexical
characteristics, Cretan has been divided into two major dialect groups: the western and the eastern.
The boundary between these groups coincides with the administrative division of the island into the
prefectures of Rethymno and Heraklion. Kontosopoulos (2008) argues that the eastern dialect group is more
homogeneous than the western one, which shows more variation across all levels of linguistic analysis.
Unlike other Modern Greek dialects, Cretan does not face the threat of extinction, as it remains
the sole means of communication for a large number of speakers in various parts of the island.

This is the first automatic speech recognition (ASR) model for Cretan.
To train it, we fine-tuned a Greek XLS-R model ([jonatasgrosman/wav2vec2-large-xlsr-53-greek](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-greek)) on recorded Cretan speech (see the Resources section below).

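A minimal usage sketch with the Transformers ASR pipeline; the repository id and audio file name below are placeholders to replace with the actual values:

```python
from transformers import pipeline

# "<this-model-repo-id>" is a placeholder for the Hub id of this model.
asr = pipeline(
    "automatic-speech-recognition",
    model="<this-model-repo-id>",
)

# The model expects 16 kHz mono audio (see the data preparation notes below).
result = asr("example_cretan_utterance.wav")  # placeholder file name
print(result["text"])
```
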
## Resources

For the compilation of the Cretan corpus, we gathered 32 tapes containing material from
radio broadcasts in digital format, with permission from the Audiovisual Department of the
Vikelaia Municipal Library of Heraklion, Crete. These broadcasts were recorded and
aired by Radio Mires, in the Messara region of Heraklion, during the period 1998–2001,
totaling 958 minutes and 47 seconds. The recordings primarily consist of narratives
by one speaker, Ioannis Anagnostakis, who is responsible for their composition. In terms
of textual genre, the linguistic content of the broadcasts consists of folklore
narratives expressed in the local linguistic variety. Out of the total volume of material
collected, we utilized nine tapes, selected to maximize the clarity of the digitized speech
and to ensure representative sampling across the entire three-year period of radio recordings.
To obtain an initial transcription, we employed Whisper Large-v2, the largest Whisper model
available at the time. The transcripts were then manually corrected in collaboration with the
local community. The transcriptions follow the Greek alphabet and standard Greek orthography
and were annotated in Praat.

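As an illustration of this first-pass transcription step, a minimal sketch using the openai-whisper package; the tape file name is a placeholder, and the exact setup used for the corpus may have differed:

```python
import whisper

# Load the Large-v2 checkpoint (the largest Whisper model available at the time).
model = whisper.load_model("large-v2")

# Draft transcription of one digitized tape; "el" (Greek) is the closest supported
# language tag for Cretan. The file name is a placeholder.
result = model.transcribe("radio_mires_tape_01.wav", language="el")
print(result["text"])
```
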
To prepare the dataset, the texts were normalized (see [greek_dialects_asr/](https://gitlab.com/ilsp-spmd-all/speech/greek_dialects_asr/) for scripts),
and all audio files were converted to 16 kHz mono.

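The audio conversion might look like the following sketch, assuming librosa and soundfile are used; the actual scripts live in the GitLab repository linked above, and the file names are placeholders:

```python
import librosa
import soundfile as sf

# Downmix to mono and resample to 16 kHz, the input format expected by
# XLS-R / wav2vec 2.0 models.
audio, sr = librosa.load("radio_mires_tape_01.wav", sr=16_000, mono=True)
sf.write("radio_mires_tape_01_16k.wav", audio, sr)
```
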
We split the Praat annotations into audio-transcription segments, which resulted in a dataset with a total duration of 1h 21m 12s.
Note that the removal of music, long pauses, and non-transcribed segments reduces the total audio duration
compared to the initial 2h of recordings from the nine tapes.

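A sketch of the segmentation step, assuming the Praat annotations have already been exported as (start, end, transcription) tuples; the interval values and file names below are purely illustrative:

```python
import soundfile as sf

# Illustrative intervals taken from a Praat TextGrid: (start_s, end_s, transcription).
intervals = [
    (12.4, 17.9, "..."),
    (18.3, 24.1, "..."),
]

audio, sr = sf.read("radio_mires_tape_01_16k.wav")

for i, (start, end, text) in enumerate(intervals):
    # Cut the utterance by converting interval times to sample indices.
    segment = audio[int(start * sr):int(end * sr)]
    sf.write(f"segment_{i:04d}.wav", segment, sr)
    # The transcription `text` would be stored alongside the clip, e.g. in a manifest file.
```
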
## Metrics

We evaluated the model on the test split, which consists of 10% of the dataset recordings.

|Model|WER|CER|
|---|---|---|
|pre-trained|104.83%|91.73%|
|fine-tuned|28.27%|7.88%|

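These scores can be computed with standard tooling; a sketch using the Hugging Face `evaluate` library on placeholder reference and prediction lists:

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder lists: reference transcriptions and model outputs for the test split.
references = ["..."]
predictions = ["..."]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```
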
## Training hyperparameters

We fine-tuned the baseline model (`wav2vec2-large-xlsr-53-greek`) on an NVIDIA GeForce RTX 3090, using the following hyperparameters:

| Argument | Value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 8 |
| `gradient_accumulation_steps` | 2 |
| `num_train_epochs` | 35 |
| `learning_rate` | 3e-4 |
| `warmup_steps` | 500 |

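For reference, these values map onto Transformers `TrainingArguments` roughly as in the sketch below; the output directory and the remaining arguments are assumptions, not the exact training configuration:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlsr-cretan",        # placeholder output directory
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    num_train_epochs=35,
    learning_rate=3e-4,
    warmup_steps=500,
    fp16=True,                       # assumption: mixed precision on the RTX 3090
)
```
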
## Citation

To cite this work or read more about the training pipeline, see:

S. Vakirtzian, C. Tsoukala, S. Bompolas, K. Mouzou, V. Stamou, G. Paraskevopoulos, A. Dimakis, S. Markantonatou, A. Ralli, and A. Anastasopoulos, "Speech Recognition for Greek Dialects: A Challenging Benchmark," Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2024.