vesteinn/IceBERT

	---
	language: is
	widget:
	- text: Má bjóða þér <mask> í kvöld?
	- text: Forseti <mask> er ágæt.
	- text: Súpan var <mask> á bragðið.
	tags:
	- roberta
	- icelandic
	- masked-lm
	- pytorch
	license: cc-by-4.0
	datasets:
	- mideind/icelandic-common-crawl-corpus-IC3
	---

	# IceBERT

	IceBERT was trained with fairseq using the RoBERTa-base architecture. The table below lists the training data.

	| Dataset                                              | Size    | Tokens |
	|------------------------------------------------------|---------|--------|
	| Icelandic Gigaword Corpus v20.05 (IGC)               | 8.2 GB  | 1,388M |
	| Icelandic Common Crawl Corpus (IC3)                  | 4.9 GB  | 824M   |
	| Greynir News articles                                | 456 MB  | 76M    |
	| Icelandic Sagas                                      | 9 MB    | 1.7M   |
	| Open Icelandic e-books (Rafbókavefurinn)             | 14 MB   | 2.6M   |
	| Data from the medical library of Landspitali         | 33 MB   | 5.2M   |
	| Student theses from Icelandic universities (Skemman) | 2.2 GB  | 367M   |
	| Total                                                | 15.8 GB | 2,664M |
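
To get a feel for how the corpus is composed, the token counts in the table can be turned into per-dataset shares. The sketch below is only illustrative arithmetic on the figures reported above; the names `TOKENS_M`, `REPORTED_TOTAL_M`, and `token_shares` are made up for this example and are not part of any IceBERT tooling.

```python
# Token counts in millions, copied from the training-data table.
TOKENS_M = {
    "Icelandic Gigaword Corpus v20.05 (IGC)": 1388.0,
    "Icelandic Common Crawl Corpus (IC3)": 824.0,
    "Greynir News articles": 76.0,
    "Icelandic Sagas": 1.7,
    "Open Icelandic e-books (Rafbókavefurinn)": 2.6,
    "Data from the medical library of Landspitali": 5.2,
    "Student theses from Icelandic universities (Skemman)": 367.0,
}

# Total as reported in the last row of the table (rounded in the source).
REPORTED_TOTAL_M = 2664.0


def token_shares(tokens):
    """Return each dataset's fraction of the summed token count."""
    total = sum(tokens.values())
    return {name: count / total for name, count in tokens.items()}


total_m = sum(TOKENS_M.values())
shares = token_shares(TOKENS_M)

# Print the datasets from largest to smallest share.
print(f"sum of rows: {total_m:.1f}M (table reports {REPORTED_TOTAL_M:.0f}M)")
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{share:6.1%}  {name}")
```

The rows add up to within rounding of the reported total, and IGC and IC3 together account for roughly 83% of the tokens.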


	If you find this model useful, please cite:

	```
	@inproceedings{snaebjarnarson-etal-2022-warm,
	    title = "A Warm Start and a Clean Crawled Corpus - A Recipe for Good Language Models",
	    author = "Sn{\ae}bjarnarson, V{\'e}steinn and
	      S{\'\i}monarson, Haukur Barri and
	      Ragnarsson, P{\'e}tur Orri and
	      Ing{\'o}lfsd{\'o}ttir, Svanhv{\'\i}t Lilja and
	      J{\'o}nsson, Haukur and
	      Thorsteinsson, Vilhjalmur and
	      Einarsson, Hafsteinn",
	    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
	    month = jun,
	    year = "2022",
	    address = "Marseille, France",
	    publisher = "European Language Resources Association",
	    url = "https://aclanthology.org/2022.lrec-1.464",
	    pages = "4356--4366",
	}
	```
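
Because the card tags the model as a RoBERTa masked language model, one way to try it is the `fill-mask` pipeline from the Hugging Face `transformers` library. The following is a minimal sketch, not an official usage recipe: it assumes the checkpoint loads under the Hub id `vesteinn/IceBERT` and that `transformers` and PyTorch are installed. The prompts are the `widget` examples from the front matter; `top_predictions` is a hypothetical helper name.

```python
# Prompts taken from the `widget` entries in the model card's front matter.
PROMPTS = [
    "Má bjóða þér <mask> í kvöld?",
    "Forseti <mask> er ágæt.",
    "Súpan var <mask> á bragðið.",
]

MASK = "<mask>"                # mask token as written in the widget examples
MODEL_ID = "vesteinn/IceBERT"  # assumption: Hub id of this model card


def top_predictions(prompts, model_id=MODEL_ID, top_k=3):
    """Return the top_k (token, score) candidates for each single-mask prompt."""
    # Validate first, so bad input fails before any model download starts.
    for prompt in prompts:
        if prompt.count(MASK) != 1:
            raise ValueError(f"expected exactly one {MASK} in: {prompt!r}")

    # Imported lazily so the module can be loaded without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    results = {}
    for prompt in prompts:
        # For a single-mask string, the pipeline returns a list of dicts
        # with "token_str" and "score" keys, best candidate first.
        candidates = fill_mask(prompt, top_k=top_k)
        results[prompt] = [(c["token_str"].strip(), c["score"]) for c in candidates]
    return results
```

Calling `top_predictions(PROMPTS)` downloads the checkpoint on first use and returns a dictionary mapping each prompt to its highest-scoring fill-ins.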