Update README.md
README.md CHANGED
@@ -17,20 +17,26 @@ pinned: false

<table>
<tr>
+ <td valign="top">
+ <h3>NusaBERT: Teaching IndoBERT to be multilingual and multicultural!</h3>
+ <p>This project extends the multilingual and multicultural capabilities of <a href="https://github.com/IndoNLP/indonlu">IndoBERT</a>. We expanded the IndoBERT tokenizer to cover 12 regional languages of Indonesia, then continued pre-training on a large-scale corpus of Indonesian and those 12 regional languages. Our models are highly competitive and robust on multilingual and multicultural benchmarks such as <a href="https://github.com/IndoNLP/indonlu">IndoNLU</a>, <a href="https://github.com/IndoNLP/nusax">NusaX</a>, and <a href="https://github.com/IndoNLP/nusa-writes">NusaWrites</a>.</p>
+ </td>
<td valign="top">
<h3>IndoT5: T5 Language Models for the Indonesian Language</h3>
<p>IndoT5 is a T5-based language model trained specifically for Indonesian. With just 8 hours of training on a limited budget, we developed a competitive sequence-to-sequence, encoder-decoder model that can be fine-tuned for tasks such as summarization, chit-chat, and question-answering. Despite these training constraints, our model remains competitive on the <a href="https://github.com/IndoNLP/indonlg">IndoNLG</a> (text generation) benchmark.</p>
</td>
+ </tr>
+ <tr>
<td valign="top">
<h3>Indonesian Sentence Embedding Models</h3>
<p>We trained open-source sentence embedding models for Indonesian, enabling applications such as information retrieval (useful for retrieval-augmented generation!), semantic textual similarity, and zero-shot text classification. We build on existing pre-trained Indonesian language models like <a href="https://github.com/IndoNLP/indonlu">IndoBERT</a>, state-of-the-art unsupervised training techniques, and established sentence embedding benchmarks.</p>
</td>
- </tr>
- <tr>
<td valign="top">
<h3>Indonesian Natural Language Inference Models</h3>
<p>Open-source lightweight NLI models that are competitive with larger models on the IndoNLI benchmark while using significantly fewer parameters. We applied knowledge distillation to small existing pre-trained language models like IndoBERT Lite. These models offer efficient solutions for tasks that require natural language inference, such as cross-encoder-based semantic search, while minimizing computational cost.</p>
</td>
+ </tr>
+ <tr>
<td valign="top">
<h3>Many-to-Many Multilingual Translation Models</h3>
<p>By adapting mT5 to 45 languages of Indonesia, we developed a robust baseline for multilingual translation among the languages of Indonesia. This facilitates further fine-tuning for niche domains and low-resource languages, contributing to greater linguistic inclusivity. Our models are competitive with existing multilingual translation models on the <a href="https://github.com/IndoNLP/nusax">NusaX</a> benchmark.</p>
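
To make the projects above concrete, here is a minimal sketch of the vocabulary-expansion step the NusaBERT cell describes, assuming the IndoNLU IndoBERT checkpoint on the Hugging Face Hub; the token list is a placeholder, and this is not the authors' exact training code:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Base IndoBERT checkpoint from IndoNLU (assumed Hub ID).
base = "indobenchmark/indobert-base-p1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Placeholder subwords standing in for vocabulary learned from the
# 12 regional-language corpora.
new_tokens = ["token_daerah_1", "token_daerah_2"]
tokenizer.add_tokens(new_tokens)

# Give the new tokens trainable embedding rows, then continue
# masked-language-model pre-training on the expanded corpus.
model.resize_token_embeddings(len(tokenizer))
```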
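
A sketch of inference with an IndoT5 model after summarization fine-tuning; the checkpoint name is hypothetical:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical Hub ID for an IndoT5 checkpoint fine-tuned on an
# Indonesian summarization dataset.
name = "LazarusNLP/indot5-base-summarization"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

article = "Pemerintah mengumumkan kebijakan energi baru pada hari Senin ..."
inputs = tokenizer(article, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, num_beams=4, max_new_tokens=64)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```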
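
A sketch of retrieval-style usage for the sentence embedding models, assuming a hypothetical sentence-transformers-compatible checkpoint:

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical Hub ID; any sentence-transformers-compatible Indonesian
# embedding model would be used the same way.
model = SentenceTransformer("LazarusNLP/indo-sentence-embedding")

query = "cara memperpanjang paspor"  # "how to renew a passport"
docs = [
    "Prosedur perpanjangan paspor di kantor imigrasi.",
    "Resep nasi goreng spesial keluarga.",
]
q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)

# Rank documents by cosine similarity to the query; the first
# document should score highest.
print(util.cos_sim(q_emb, d_emb))
```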
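
A sketch of cross-encoder-style scoring with one of the distilled NLI models; the checkpoint name is hypothetical:

```python
from transformers import pipeline

# Hypothetical Hub ID for a distilled IndoBERT Lite NLI model.
nli = pipeline("text-classification", model="LazarusNLP/indobert-lite-nli")

# A cross-encoder reads premise and hypothesis jointly, which is what
# makes these models usable as re-rankers in semantic search.
premise = "Seorang pria sedang memasak nasi goreng di dapur."
hypothesis = "Ada orang yang sedang menyiapkan makanan."
print(nli({"text": premise, "text_pair": hypothesis}))  # e.g. entailment
```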
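
And a sketch of inference with the mT5-based translation baseline; both the checkpoint name and the task-prefix format are assumptions:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical Hub ID for the mT5-based many-to-many translation model.
name = "LazarusNLP/mt5-nusantara-translation"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

# mT5-style models are typically steered with a task prefix; this
# exact prefix is illustrative, not the model's documented format.
text = "translate Indonesian to Javanese: Selamat pagi, apa kabar?"
inputs = tokenizer(text, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```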