parapar committed on
Commit
1da9d7f
1 Parent(s): 960cd74

Update README.md

Files changed (1):
  1. README.md +17 -2
README.md CHANGED
@@ -29,8 +29,14 @@ widget:
   output:
   text: Frades é un concello da provincia da Coruña, pertencente á comarca de Ordes. Está situado a 15 quilómetros de Santiago de Compostela.
   ---
- # Llama-3.1-8B-Instruct-Galician
+ <div align="center">
+ <p align="center"><img width=20% src="https://gitlab.irlab.org/eliseo.bao/xovetic-llms-underrepresented-languages/-/raw/main/img/logo.png" /></p>
+ </div>
+
+
+
+ # Llama-3.1-8B-Instruct-Galician a.k.a. Cabuxa 2.0

  This model is a continued pretraining version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) on the [CorpusNós](https://zenodo.org/records/11655219) dataset.

@@ -108,4 +114,13 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]

  ## Citation

- _Coming soon_
+ ```
+ @inproceedings{bao-perez-parapar-xovetic-2024,
+   title={Adapting Large Language Models for Underrepresented Languages},
+   author={Eliseo Bao and Anxo Pérez and Javier Parapar},
+   booktitle={VII Congreso XoveTIC: impulsando el talento cient{\'\i}fico},
+   year={2024},
+   organization={Universidade da Coru{\~n}a, Servizo de Publicaci{\'o}ns},
+   abstract={The popularization of Large Language Models (LLMs), especially with the development of conversational systems, makes it mandatory to think about facilitating the use of artificial intelligence (AI) for everyone. Most models neglect minority languages, prioritizing widely spoken ones. This exacerbates their underrepresentation in the digital world and negatively affects their speakers. We present two resources aimed at improving natural language processing (NLP) for Galician: (i) a Llama 3.1 instruct model adapted through continuous pre-training on the CorpusNos dataset; and (ii) a Galician version of the Alpaca dataset, used to assess the improvement over the base model. In this evaluation, our model outperformed both the base model and another Galician model in quantitative and qualitative terms.}
+ }
+ ```
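
The commit only documents how the model was built and how to cite it. As a complement, here is a minimal usage sketch with the Hugging Face `transformers` library; the repo id `irlab-udc/Llama-3.1-8B-Instruct-Galician` and the Galician prompt are illustrative assumptions (neither appears in this commit), so substitute the model's actual identifier.

```python
# Minimal usage sketch (not part of the commit): load the model and ask it
# a question in Galician, similar to the widget example in the README metadata.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id, inferred from the model name; replace with the actual one.
model_id = "irlab-udc/Llama-3.1-8B-Instruct-Galician"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # reduces memory versus fp32
    device_map="auto",           # place layers on available GPU(s)/CPU
)

# Build a chat-formatted prompt for the instruct model.
messages = [{"role": "user", "content": "Fálame do concello de Frades, na Coruña."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding keeps the sketch deterministic.
output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```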