---
metrics:
- wer
- cer
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- Cretan
- Greek dialect
---

# Cretan XLS-R model

Cretan is a variety of Modern Greek predominantly used by speakers who reside on the island of Crete or
belong to the Cretan diaspora. This includes communities of Cretan origin that were relocated to the
village of Hamidieh in Syria and to Western Asia Minor following the population exchange between
Greece and Turkey in 1923. The historical and geographical factors that have shaped the development
and preservation of the dialect include the long-term isolation of Crete from the mainland and the
successive domination of the island by foreign powers, such as the Arabs, the Venetians, and the Turks,
over a period of seven centuries. Based on its phonological, phonetic, morphological, and lexical
characteristics, Cretan has been divided into two major dialect groups: the western and the eastern.
The boundary between these groups coincides with the administrative division of the island into the
prefectures of Rethymno and Heraklion. Kontosopoulos (2008) argues that the eastern dialect group is more
homogeneous than the western one, which shows more variation across all levels of linguistic analysis.
Unlike other Modern Greek dialects, Cretan does not face the threat of extinction, as it remains
the sole means of communication for a large number of speakers in various parts of the island.

This is the first automatic speech recognition (ASR) model for Cretan.
To train the model, we fine-tuned a Greek XLS-R model ([jonatasgrosman/wav2vec2-large-xlsr-53-greek](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-greek)) on the Cretan resources (see below).
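
A minimal inference sketch using the `transformers` pipeline API. The repo id and
audio path below are placeholders, not this model's actual identifiers:

```python
from transformers import pipeline

# Placeholder repo id -- substitute this model's actual Hugging Face Hub id.
asr = pipeline("automatic-speech-recognition", model="<org>/<cretan-xlsr-model>")

# Input audio should be 16 kHz mono, matching the training data.
result = asr("cretan_sample.wav")
print(result["text"])
```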

## Resources

For the compilation of the Cretan corpus, we gathered 32 tapes containing material from
radio broadcasts in digital format, with permission from the Audiovisual Department of the
Vikelaia Municipal Library of Heraklion, Crete. These broadcasts were recorded and
aired by Radio Mires in the Messara region of Heraklion during the period 1998-2001,
totaling 958 minutes and 47 seconds. The recordings primarily consist of narratives
composed and delivered by a single speaker, Ioannis Anagnostakis. In terms
of textual genre, the linguistic content of the broadcasts consists of folklore
narratives expressed in the local linguistic variety. Out of the total volume of material
collected, we utilized nine tapes. The selection criteria were, on the one hand,
maximizing the digital clarity of speech and, on the other, ensuring representative sampling
across the entire three-year period of radio recordings. To obtain an initial transcription,
we employed Whisper Large-v2, the largest Whisper model available at the time. The transcripts
were then manually corrected in collaboration with the local community.
The transcription system was based on the Greek alphabet and orthography,
and the transcripts were annotated in Praat.

To prepare the dataset, the texts were normalized (see [greek_dialects_asr/](https://gitlab.com/ilsp-spmd-all/speech/greek_dialects_asr/) for scripts),
and all audio files were converted to 16 kHz mono.
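
The conversion step can be sketched with the Python standard library alone (a
linear-interpolation resampler for 16-bit PCM WAV; a real pipeline would more
likely use ffmpeg, librosa, or torchaudio):

```python
import struct
import wave


def to_mono_16k(in_path: str, out_path: str, target_rate: int = 16000) -> None:
    """Convert a 16-bit PCM WAV file to 16 kHz mono (sketch, not production code)."""
    with wave.open(in_path, "rb") as wf:
        n_channels = wf.getnchannels()
        rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())

    # Decode interleaved 16-bit samples and average the channels to mono.
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    mono = [
        sum(samples[i:i + n_channels]) // n_channels
        for i in range(0, len(samples), n_channels)
    ]

    # Resample to the target rate with linear interpolation.
    out_len = int(len(mono) * target_rate / rate)
    resampled = []
    for j in range(out_len):
        pos = j * rate / target_rate
        i0 = int(pos)
        i1 = min(i0 + 1, len(mono) - 1)
        frac = pos - i0
        resampled.append(int(mono[i0] * (1 - frac) + mono[i1] * frac))

    with wave.open(out_path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(target_rate)
        wf.writeframes(struct.pack("<%dh" % len(resampled), *resampled))
```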

We split the Praat annotations into audio-transcription segments, resulting in a dataset with a total duration of 1h 21m 12s.
Note that removing music, long pauses, and non-transcribed segments reduces the total audio duration
compared to the initial 2h of recordings on the nine tapes.
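
The filtering logic amounts to keeping only labeled intervals. A minimal sketch,
assuming the tier has already been read from the TextGrid as `(start, end, label)`
tuples; the skip labels here are hypothetical and would need to match the corpus's
actual annotation scheme:

```python
def select_segments(intervals, skip_labels=("", "music", "pause")):
    """Keep only transcribed intervals; return (segments, total_seconds).

    `intervals` is a list of (start, end, label) tuples, e.g. one interval
    tier of a Praat TextGrid. Intervals whose label is empty or marks
    music/pauses are dropped, shrinking the total audio duration.
    """
    segments = [
        (start, end, label.strip())
        for start, end, label in intervals
        if label.strip().lower() not in skip_labels
    ]
    total = sum(end - start for start, end, _ in segments)
    return segments, total
```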

## Metrics

We evaluated the model on the test split, which comprises 10% of the dataset recordings.

|Model|WER|CER|
|---|---|---|
|pre-trained|104.83%|91.73%|
|fine-tuned|28.27%|7.88%|
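
Both metrics are normalized edit distances: WER over word tokens, CER over
characters. A minimal reference implementation (a standard Levenshtein sketch,
not the exact evaluation script used here):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over words / number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: the same ratio computed over characters."""
    return edit_distance(reference, hypothesis) / len(reference)
```

Note that WER can exceed 100% when the hypothesis contains many insertions
relative to the reference, which is how the pre-trained model scores 104.83%.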

## Training hyperparameters

We fine-tuned the baseline model (`wav2vec2-large-xlsr-53-greek`) on an NVIDIA GeForce RTX 3090, using the following hyperparameters:

| argument                      | value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 8     |
| `gradient_accumulation_steps` | 2     |
| `num_train_epochs`            | 35    |
| `learning_rate`               | 3e-4  |
| `warmup_steps`                | 500   |
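
In `transformers`, these values map onto `TrainingArguments` roughly as follows
(an illustrative sketch: the output directory and any arguments not listed in the
table above are assumptions, not the authors' exact training script):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlsr-cretan",        # hypothetical path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size of 8 * 2 = 16
    num_train_epochs=35,
    learning_rate=3e-4,
    warmup_steps=500,
)
```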

## Citation

To cite this work or read more about the training pipeline, see:

S. Vakirtzian, C. Tsoukala, S. Bompolas, K. Mouzou, V. Stamou, G. Paraskevopoulos, A. Dimakis, S. Markantonatou, A. Ralli, and A. Anastasopoulos, "Speech Recognition for Greek Dialects: A Challenging Benchmark," Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), 2024.