---
pretty_name: mC4
annotations_creators:
- no-annotation
language_creators:
- found
languages:
- af
- am
- ar
- az
- be
- bg
- bg-Latn
- bn
- ca
- ceb
- co
- cs
- cy
- da
- de
- el
- el-Latn
- en
- eo
- es
- et
- eu
- fa
- fi
- fil
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- haw
- hi
- hi-Latn
- hmn
- ht
- hu
- hy
- id
- ig
- is
- it
- iw
- ja
- ja-Latn
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lb
- lo
- lt
- lv
- mg
- mi
- mk
- ml
- mn
- mr
- ms
- mt
- my
- ne
- nl
- "no"
- ny
- pa
- pl
- ps
- pt
- ro
- ru
- ru-Latn
- sd
- si
- sk
- sl
- sm
- sn
- so
- sq
- sr
- st
- su
- sv
- sw
- ta
- te
- tg
- th
- tr
- uk
- und
- ur
- uz
- vi
- xh
- yi
- yo
- zh
- zh-Latn
- zu
licenses:
- odc-by-1.0
multilinguality:
- multilingual
size_categories:
- n<1K
- 1K<n<10K
- 10K<n<100K
- 100K<n<1M
- 1M<n<10M
- 10M<n<100M
- 100M<n<1B
- 1B<n<10B
source_datasets:
- original
task_categories:
- sequence-modeling
task_ids:
- language-modeling
paperswithcode_id: mc4
---

# Dataset Card for mC4

## Table of Contents

- [Dataset Card for mC4](#dataset-card-for-mc4)
  - [Table of Contents](#table-of-contents)
  - [Dataset Description](#dataset-description)
    - [Dataset Summary](#dataset-summary)
    - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
    - [Languages](#languages)
  - [Dataset Structure](#dataset-structure)
    - [Data Instances](#data-instances)
    - [Data Fields](#data-fields)
    - [Data Splits](#data-splits)
  - [Dataset Creation](#dataset-creation)
    - [Curation Rationale](#curation-rationale)
    - [Source Data](#source-data)
      - [Initial Data Collection and Normalization](#initial-data-collection-and-normalization)
      - [Who are the source language producers?](#who-are-the-source-language-producers)
    - [Annotations](#annotations)
      - [Annotation process](#annotation-process)
      - [Who are the annotators?](#who-are-the-annotators)
    - [Personal and Sensitive Information](#personal-and-sensitive-information)
  - [Considerations for Using the Data](#considerations-for-using-the-data)
    - [Social Impact of Dataset](#social-impact-of-dataset)
    - [Discussion of Biases](#discussion-of-biases)
    - [Other Known Limitations](#other-known-limitations)
  - [Additional Information](#additional-information)
    - [Dataset Curators](#dataset-curators)
    - [Licensing Information](#licensing-information)
    - [Citation Information](#citation-information)
    - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://huggingface.co/datasets/allenai/c4
- **Paper:** https://arxiv.org/abs/1910.10683

### Dataset Summary

A multilingual colossal, cleaned version of Common Crawl's web crawl corpus. Based on Common Crawl dataset: "https://commoncrawl.org".

This is the version prepared by AllenAI, hosted at this address: https://huggingface.co/datasets/allenai/c4

108 languages are available and are reported in the table below.

Note that the languages that end with "-Latn" are simply romanized variants, i.e. written using the Latin script.

| language code   | language name        |
|:----------------|:---------------------|
| af              | Afrikaans            |
| am              | Amharic              |
| ar              | Arabic               |
| az              | Azerbaijani          |
| be              | Belarusian           |
| bg              | Bulgarian            |
| bg-Latn         | Bulgarian (Latin)    |
| bn              | Bangla               |
| ca              | Catalan              |
| ceb             | Cebuano              |
| co              | Corsican             |
| cs              | Czech                |
| cy              | Welsh                |
| da              | Danish               |
| de              | German               |
| el              | Greek                |
| el-Latn         | Greek (Latin)        |
| en              | English              |
| eo              | Esperanto            |
| es              | Spanish              |
| et              | Estonian             |
| eu              | Basque               |
| fa              | Persian              |
| fi              | Finnish              |
| fil             | Filipino             |
| fr              | French               |
| fy              | Western Frisian      |
| ga              | Irish                |
| gd              | Scottish Gaelic      |
| gl              | Galician             |
| gu              | Gujarati             |
| ha              | Hausa                |
| haw             | Hawaiian             |
| hi              | Hindi                |
| hi-Latn         | Hindi (Latin script) |
| hmn             | Hmong, Mong          |
| ht              | Haitian              |
| hu              | Hungarian            |
| hy              | Armenian             |
| id              | Indonesian           |
| ig              | Igbo                 |
| is              | Icelandic            |
| it              | Italian              |
| iw              | former Hebrew        |
| ja              | Japanese             |
| ja-Latn         | Japanese (Latin)     |
| jv              | Javanese             |
| ka              | Georgian             |
| kk              | Kazakh               |
| km              | Khmer                |
| kn              | Kannada              |
| ko              | Korean               |
| ku              | Kurdish              |
| ky              | Kyrgyz               |
| la              | Latin                |
| lb              | Luxembourgish        |
| lo              | Lao                  |
| lt              | Lithuanian           |
| lv              | Latvian              |
| mg              | Malagasy             |
| mi              | Maori                |
| mk              | Macedonian           |
| ml              | Malayalam            |
| mn              | Mongolian            |
| mr              | Marathi              |
| ms              | Malay                |
| mt              | Maltese              |
| my              | Burmese              |
| ne              | Nepali               |
| nl              | Dutch                |
| no              | Norwegian            |
| ny              | Nyanja               |
| pa              | Punjabi              |
| pl              | Polish               |
| ps              | Pashto               |
| pt              | Portuguese           |
| ro              | Romanian             |
| ru              | Russian              |
| ru-Latn         | Russian (Latin)      |
| sd              | Sindhi               |
| si              | Sinhala              |
| sk              | Slovak               |
| sl              | Slovenian            |
| sm              | San Marino           |
| sn              | Shona                |
| so              | Somali               |
| sq              | Albanian             |
| sr              | Serbian              |
| st              | Southern Sotho       |
| su              | Sundanese            |
| sv              | Swedish              |
| sw              | Swahili              |
| ta              | Tamil                |
| te              | Telugu               |
| tg              | Tajik                |
| th              | Thai                 |
| tr              | Turkish              |
| uk              | Ukrainian            |
| und             | Unknown language     |
| ur              | Urdu                 |
| uz              | Uzbek                |
| vi              | Vietnamese           |
| xh              | Xhosa                |
| yi              | Yiddish              |
| yo              | Yoruba               |
| zh              | Chinese              |
| zh-Latn         | Chinese (Latin)      |
| zu              | Zulu                 |

You can load the mC4 subset of any language like this:

```python
from datasets import load_dataset

en_mc4 = load_dataset("mc4", "en")
```

And if you can even specify a list of languages:

```python
from datasets import load_dataset

mc4_subset_with_five_languages = load_dataset("mc4", languages=["en", "fr", "es", "de", "zh"])
```

### Supported Tasks and Leaderboards

mC4 is mainly intended to pretrain language models and word representations.

### Languages

The dataset supports 108 languages.

## Dataset Structure

### Data Instances

An example form the `en` config is:

```
{'timestamp': '2018-06-24T01:32:39Z',
 'text': 'Farm Resources in Plumas County\nShow Beginning Farmer Organizations & Professionals (304)\nThere are 304 resources serving Plumas County in the following categories:\nMap of Beginning Farmer Organizations & Professionals serving Plumas County\nVictoria Fisher - Office Manager - Loyalton, CA\nAmy Lynn Rasband - UCCE Plumas-Sierra Administrative Assistant II - Quincy , CA\nShow Farm Income Opportunities Organizations & Professionals (353)\nThere are 353 resources serving Plumas County in the following categories:\nFarm Ranch And Forest Retailers (18)\nMap of Farm Income Opportunities Organizations & Professionals serving Plumas County\nWarner Valley Wildlife Area - Plumas County\nShow Farm Resources Organizations & Professionals (297)\nThere are 297 resources serving Plumas County in the following categories:\nMap of Farm Resources Organizations & Professionals serving Plumas County\nThere are 57 resources serving Plumas County in the following categories:\nMap of Organic Certification Organizations & Professionals serving Plumas County',
 'url': 'http://www.californialandcan.org/Plumas/Farm-Resources/'}
```

### Data Fields

The data have several fields:

- `url`: url of the source as a string
- `text`: text content as a string
- `timestamp`: timestamp as a string

### Data Splits

To build mC4, the authors used [CLD3](https://github.com/google/cld3) to identify over 100 languages. The resulting mC4 subsets for each language are reported in this table:

| config   | train   | validation   |
|:---------|:--------|:-------------|
| af       | ?       | ?            |
| am       | ?       | ?            |
| ar       | ?       | ?            |
| az       | ?       | ?            |
| be       | ?       | ?            |
| bg       | ?       | ?            |
| bg-Latn  | ?       | ?            |
| bn       | ?       | ?            |
| ca       | ?       | ?            |
| ceb      | ?       | ?            |
| co       | ?       | ?            |
| cs       | ?       | ?            |
| cy       | ?       | ?            |
| da       | ?       | ?            |
| de       | ?       | ?            |
| el       | ?       | ?            |
| el-Latn  | ?       | ?            |
| en       | ?       | ?            |
| eo       | ?       | ?            |
| es       | ?       | ?            |
| et       | ?       | ?            |
| eu       | ?       | ?            |
| fa       | ?       | ?            |
| fi       | ?       | ?            |
| fil      | ?       | ?            |
| fr       | ?       | ?            |
| fy       | ?       | ?            |
| ga       | ?       | ?            |
| gd       | ?       | ?            |
| gl       | ?       | ?            |
| gu       | ?       | ?            |
| ha       | ?       | ?            |
| haw      | ?       | ?            |
| hi       | ?       | ?            |
| hi-Latn  | ?       | ?            |
| hmn      | ?       | ?            |
| ht       | ?       | ?            |
| hu       | ?       | ?            |
| hy       | ?       | ?            |
| id       | ?       | ?            |
| ig       | ?       | ?            |
| is       | ?       | ?            |
| it       | ?       | ?            |
| iw       | ?       | ?            |
| ja       | ?       | ?            |
| ja-Latn  | ?       | ?            |
| jv       | ?       | ?            |
| ka       | ?       | ?            |
| kk       | ?       | ?            |
| km       | ?       | ?            |
| kn       | ?       | ?            |
| ko       | ?       | ?            |
| ku       | ?       | ?            |
| ky       | ?       | ?            |
| la       | ?       | ?            |
| lb       | ?       | ?            |
| lo       | ?       | ?            |
| lt       | ?       | ?            |
| lv       | ?       | ?            |
| mg       | ?       | ?            |
| mi       | ?       | ?            |
| mk       | ?       | ?            |
| ml       | ?       | ?            |
| mn       | ?       | ?            |
| mr       | ?       | ?            |
| ms       | ?       | ?            |
| mt       | ?       | ?            |
| my       | ?       | ?            |
| ne       | ?       | ?            |
| nl       | ?       | ?            |
| no       | ?       | ?            |
| ny       | ?       | ?            |
| pa       | ?       | ?            |
| pl       | ?       | ?            |
| ps       | ?       | ?            |
| pt       | ?       | ?            |
| ro       | ?       | ?            |
| ru       | ?       | ?            |
| ru-Latn  | ?       | ?            |
| sd       | ?       | ?            |
| si       | ?       | ?            |
| sk       | ?       | ?            |
| sl       | ?       | ?            |
| sm       | ?       | ?            |
| sn       | ?       | ?            |
| so       | ?       | ?            |
| sq       | ?       | ?            |
| sr       | ?       | ?            |
| st       | ?       | ?            |
| su       | ?       | ?            |
| sv       | ?       | ?            |
| sw       | ?       | ?            |
| ta       | ?       | ?            |
| te       | ?       | ?            |
| tg       | ?       | ?            |
| th       | ?       | ?            |
| tr       | ?       | ?            |
| uk       | ?       | ?            |
| und      | ?       | ?            |
| ur       | ?       | ?            |
| uz       | ?       | ?            |
| vi       | ?       | ?            |
| xh       | ?       | ?            |
| yi       | ?       | ?            |
| yo       | ?       | ?            |
| zh       | ?       | ?            |
| zh-Latn  | ?       | ?            |
| zu       | ?       | ?            |

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

[More Information Needed]

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

AllenAI are releasing this dataset under the terms of ODC-BY. By using this, you are also bound by the Common Crawl terms of use in respect of the content contained in the dataset.

### Citation Information

```
@article{2019t5,
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    journal = {arXiv e-prints},
    year = {2019},
    archivePrefix = {arXiv},
    eprint = {1910.10683},
}
```

### Contributions

Thanks to [@dirkgr](https://github.com/dirkgr) and [@lhoestq](https://github.com/lhoestq) for adding this dataset.