Update README.md
README.md CHANGED
@@ -1,37 +1,90 @@
---
license: mit
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name:
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

#
This model is a fine-tuned version of [indolem/indobert-base-uncased](https://huggingface.co/indolem/indobert-base-uncased) on an unknown dataset.
It achieves the following results on the evaluation set:
- Loss: 1.9968
- Accuracy: 0.6241

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

@@ -44,13 +97,9 @@ The following hyperparameters were used during training:
- lr_scheduler_type: linear
- num_epochs: 3.0

### Training results

### Framework versions

- Transformers 4.26.0
- Pytorch 1.12.0+cu102
- Datasets 2.9.0
- Tokenizers 0.12.1

---
tags:
- generated_from_trainer
model-index:
- name: code-mixed-ijebertweet
  results: []
language:
- id
- jv
- en
pipeline_tag: fill-mask
widget:
- text: biasane nek arep [MASK] file bs pake software ini
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# IndoJavE-BERT

## About
IndoJavE-BERT is a pre-trained masked language model for code-mixed Indonesian-Javanese-English tweet data.
This model is trained based on the [IndoBERT](https://arxiv.org/pdf/2011.00677.pdf) model using
Hugging Face's [Transformers](https://huggingface.co/transformers) library.

## Pre-training Data
The Twitter data is collected from January 2022 until January 2023. The tweets are collected using 8698 random keyword phrases.
To make sure the retrieved data are code-mixed, we use keyword phrases that contain code-mixed Indonesian, Javanese, or English words.
The following are a few examples of the keyword phrases:
- travelling terus
- proud koncoku
- great kalian semua
- chattingane ilang
- baru aja launching

We acquire 40,788,384 raw tweets. We then apply first-stage pre-processing tasks such as the following (see the sketch after this list):
- remove duplicate tweets,
- remove tweets with a token length of less than 5,
- remove multiple spaces,
- convert emoticons,
- convert all tweets to lower case.

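The exact pre-processing scripts are not released with this model card; the following is only a minimal sketch of what this first-stage clean-up could look like. The function name and the use of the `emoji` package for emoticon conversion are illustrative assumptions.

```python
import re
import emoji  # assumption: emoticons/emoji are converted to text with the `emoji` package


def first_stage_clean(tweets):
    """Illustrative first-stage pre-processing: whitespace normalization,
    emoticon conversion, lower-casing, length filter, and de-duplication."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        text = re.sub(r"\s+", " ", tweet).strip()  # remove multiple spaces
        text = emoji.demojize(text)                # convert emoticons/emoji to text
        text = text.lower()                        # convert to lower case
        if len(text.split()) < 5:                  # remove tweets with token length < 5
            continue
        if text in seen:                           # remove duplicate tweets
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```
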
After the first-stage pre-processing, we obtain 17,385,773 tweets.
In the second stage, we perform the following pre-processing tasks (see the sketch after this list):
- split the tweets into sentences,
- remove sentences with a token length of less than 4,
- convert ‘@username’ to ‘@USER’,
- convert URLs to HTTPURL.

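Likewise, a minimal sketch of the second-stage steps; the naive sentence splitter and the regular expressions are illustrative assumptions, not the exact rules used.

```python
import re


def second_stage_clean(tweets):
    """Illustrative second-stage pre-processing: sentence splitting, length filter,
    and normalization of user mentions and URLs."""
    sentences = []
    for tweet in tweets:
        # naive split on end-of-sentence punctuation (assumption)
        for sent in re.split(r"(?<=[.!?])\s+", tweet):
            sent = re.sub(r"@\w+", "@USER", sent)                      # '@username' -> '@USER'
            sent = re.sub(r"https?://\S+|www\.\S+", "HTTPURL", sent)   # URL -> HTTPURL
            if len(sent.split()) < 4:                                  # remove sentences with token length < 4
                continue
            sentences.append(sent)
    return sentences
```
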
Finally, we have 28,121,693 sentences for the training process.
This pre-training data will not be released to the public due to Twitter policy.

## Model
| Model name      | Base model | Size of training data | Size of validation data |
|-----------------|------------|-----------------------|-------------------------|
| `IndoJavE-BERT` | IndoBERT   | 2.24 GB of text       | 249 MB of text          |

## Evaluation Results
We train the model for 3 epochs (about 296K steps in total), which takes 4 days.
The following are the results obtained from the training:

| train loss | eval loss | eval perplexity |
|------------|-----------|-----------------|
| 2.2431     | 1.9968    | 7.3657          |

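The reported perplexity is consistent with the usual masked-language-modeling convention of exponentiating the evaluation loss; a quick check with the rounded loss from the table:

```python
import math

eval_loss = 1.9968                # eval loss from the table above
perplexity = math.exp(eval_loss)  # ~= 7.366, matching the reported eval perplexity up to rounding
print(round(perplexity, 4))
```
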
## How to use
### Load model and tokenizer
```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("fathan/indojave-codemixed-bert")
model = AutoModel.from_pretrained("fathan/indojave-codemixed-bert")
```
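
As a quick sanity check, the loaded encoder can be run on the code-mixed widget sentence from the front matter. This is an illustrative sketch; the hidden-size comment assumes the usual BERT-base configuration.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("fathan/indojave-codemixed-bert")
model = AutoModel.from_pretrained("fathan/indojave-codemixed-bert")

# Widget example from the model card (the [MASK] token is kept as-is)
text = "biasane nek arep [MASK] file bs pake software ini"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# (batch_size, sequence_length, hidden_size); hidden_size is 768 for a BERT-base model
print(outputs.last_hidden_state.shape)
```
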
### Masked language model
```python
from transformers import pipeline

pretrained_model = "fathan/indojave-codemixed-bert"

fill_mask = pipeline(
    "fill-mask",
    model=pretrained_model,
    tokenizer=pretrained_model
)
```
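
Assuming the `fill_mask` pipeline from the snippet above has been created, it can be called on the widget example; each returned candidate is a dict with the predicted token and its score:

```python
# Call the pipeline on the code-mixed widget example from the model card
predictions = fill_mask("biasane nek arep [MASK] file bs pake software ini")

# Each candidate has 'token_str', 'score', and the completed 'sequence'
for pred in predictions:
    print(pred["token_str"], round(pred["score"], 4))
```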

### Training hyperparameters

- lr_scheduler_type: linear
- num_epochs: 3.0

### Framework versions

- Transformers 4.26.0
- Pytorch 1.12.0+cu102
- Datasets 2.9.0
- Tokenizers 0.12.1