Create README.md
README.md
ADDED
@@ -0,0 +1,43 @@
|
1 |
+
---
|
2 |
+
language: id
|
3 |
+
tags:
|
4 |
+
- indobert
|
5 |
+
- indolem
|
6 |
+
license: mit
|
7 |
+
inference: false
|
8 |
+
datasets:
|
9 |
+
- 220M words (IndoWiki, IndoWC, News)
|
10 |
+
- Squad 2.0 (Indonesian translated)
|
11 |
+
---
|
12 |
+
## IndoBERT Base-Uncased fine-tuned on Translated SQuAD v2.0
|
13 |
+
[IndoBERT](https://huggingface.co/indolem/indobert-base-uncased), trained by [IndoLEM](https://indolem.github.io/) and fine-tuned on [Translated SQuAD 2.0](https://github.com/Wikidepia/indonesian_datasets/tree/master/question-answering/squad) for the **Q&A** downstream task.
|
14 |
+
|
15 |
+
**Model Size** (after training): 420 MB
|
16 |
+
|
17 |
+
## Details of IndoBERT (from their documentation)
|
18 |
+
[IndoBERT](https://huggingface.co/indolem/indobert-base-uncased) is the Indonesian version of the BERT model. We trained the model using over 220M words, aggregated from three main sources:
|
19 |
+
|
20 |
+
- Indonesian Wikipedia (74M words)
|
21 |
+
- news articles from Kompas, Tempo (Tala et al., 2003), and Liputan6 (55M words in total)
|
22 |
+
- an Indonesian Web Corpus (Medved and Suchomel, 2017) (90M words).
|
23 |
+
|
24 |
+
We trained the model for 2.4M steps (180 epochs) with the final perplexity over the development set being 3.97 (similar to English BERT-base).
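
For reference, perplexity is conventionally the exponential of the average per-token negative log-likelihood. The snippet below is a minimal sketch of that relationship, assuming losses measured in nats; the `perplexity` helper and the loss values are made up for illustration and are not taken from the IndoBERT training logs.

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood), with losses in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Hypothetical per-token losses (illustrative only, not real training logs).
example_nlls = [1.2, 1.5, 1.4, 1.4]
print(round(perplexity(example_nlls), 2))

# Conversely, a reported perplexity of 3.97 implies this average loss in nats.
implied_loss = math.log(3.97)
print(round(implied_loss, 3))
```

Under this definition, a development-set perplexity of 3.97 corresponds to an average loss of roughly 1.38 nats per predicted token.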
|
25 |
+
|
26 |
+
This IndoBERT was used to examine IndoLEM, an Indonesian benchmark that comprises seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse.
|
27 |
+
|
28 |
+
## Details of the downstream task (Q&A) - Dataset
|
29 |
+
SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
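
To make the answerable/unanswerable distinction concrete, here is a small sketch using two made-up records in the layout of the official SQuAD v2.0 JSON (`question`, `answers`, `is_impossible`). The Indonesian sentences, the IDs, and the `gold_answer` helper are illustrative assumptions; the translated dataset linked above may arrange its fields differently.

```python
# Two made-up records in the official SQuAD v2.0 JSON layout (illustrative only).
context = "Jakarta adalah ibu kota Indonesia."
qas = [
    {
        "id": "example-1",
        "question": "Apa ibu kota Indonesia?",
        "answers": [{"text": "Jakarta", "answer_start": 0}],
        "is_impossible": False,
    },
    {
        "id": "example-2",
        "question": "Berapa jumlah penduduk Jakarta?",
        "answers": [],
        "is_impossible": True,
    },
]

def gold_answer(qa):
    """Return the first gold answer text, or "" when the system should abstain."""
    if qa["is_impossible"] or not qa["answers"]:
        return ""
    return qa["answers"][0]["text"]

for qa in qas:
    answer = gold_answer(qa)
    if answer:
        # Check that the character offset really points at the answer span.
        start = qa["answers"][0]["answer_start"]
        assert context[start:start + len(answer)] == answer
    print(qa["id"], repr(answer))
```

An empty string is the usual convention for "no answer", so a system is rewarded for abstaining on the second question.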
|
30 |
+
|
31 |
+
| Dataset | Split | # samples |
|
32 |
+
| -------- | ----- | --------- |
|
33 |
+
| SQuAD2.0 | train | 130k |
|
34 |
+
| SQuAD2.0 | eval | 12.3k |
|
35 |
+
|
36 |
+
## Model Training
|
37 |
+
The model was trained on a Tesla T4 GPU with 12 GB of RAM.
|
38 |
+
|
39 |
+
## Results
|
40 |
+
| Metric | Value |
|
41 |
+
| ------ | --------- |
|
42 |
+
| **EM** | **51.61** |
|
43 |
+
| **F1** | **69.09** |
|
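
For readers unfamiliar with the two metrics, the following is a minimal sketch of how SQuAD-style exact match (EM) and token-level F1 are typically computed. It is an assumption about the general technique, not the evaluation script used to produce the numbers above; the official SQuAD script also strips English articles and takes the maximum over several gold answers, both of which are omitted here.

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Unanswerable case: if either side is empty, score 1.0 only when both are.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# A partially correct prediction: no exact match, but some token overlap.
print(exact_match("ibu kota Jakarta", "Jakarta"))
print(f1_score("ibu kota Jakarta", "Jakarta"))
```

The reported EM and F1 are averages of such per-question scores, expressed as percentages, which is why F1 is always at least as high as EM.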