---
tags:
- generated_from_trainer
model-index:
- name: code_mixed_ijebert
results: []
language:
- id
- jv
- en
pipeline_tag: fill-mask
widget:
- text: biasane nek arep [MASK] file bs pake software ini
---
# IndoJavE: BERT-base
## About
This is a pre-trained masked language model for code-mixed Indonesian-Javanese-English tweet data.
The model is based on [BERT](https://arxiv.org/abs/1810.04805) and trained with
Hugging Face's [Transformers](https://huggingface.co/transformers) library.
## Pre-training Data
The Twitter data is collected from January 2022 to January 2023, using 8,698 random keyword phrases.
To make sure the retrieved data are code-mixed, we use keyword phrases that contain code-mixed Indonesian, Javanese, or English words.
The following are a few examples of the keyword phrases (a simple keyword filter is sketched after the list):
- travelling terus
- proud koncoku
- great kalian semua
- chattingane ilang
- baru aja launching
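As an illustration only (this is not the actual collection pipeline, which is not described in detail here), a tweet can be flagged as a code-mixed candidate by checking whether it contains one of these keyword phrases:
```python
# Illustrative sketch, not the actual collection pipeline: flag tweets that
# contain at least one of the code-mixed keyword phrases.
keyword_phrases = [
    "travelling terus",
    "proud koncoku",
    "great kalian semua",
    "chattingane ilang",
    "baru aja launching",
]

def is_code_mixed_candidate(tweet: str) -> bool:
    text = tweet.lower()
    return any(phrase in text for phrase in keyword_phrases)
```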
We acquire 40,788,384 raw tweets and apply the following first-stage pre-processing steps (see the code sketch below):
- remove duplicate tweets,
- remove tweets with fewer than 5 tokens,
- remove multiple spaces,
- convert emoticons,
- convert all tweets to lower case.
After the first-stage pre-processing, we obtain 17,385,773 tweets.
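The following is a minimal sketch of these steps; the emoticon mapping and the exact order of operations are assumptions, since the card does not specify them:
```python
import re

# Sketch of the first-stage pre-processing; the emoticon map below is a
# hypothetical placeholder, not the actual conversion rules.
EMOTICON_MAP = {":)": ":smile:", ":(": ":sad:", ":D": ":laugh:"}

def first_stage(tweets):
    seen, cleaned = set(), []
    for tweet in tweets:
        text = tweet.lower()                         # convert to lower case
        for emoticon, label in EMOTICON_MAP.items():
            text = text.replace(emoticon, label)     # convert emoticons
        text = re.sub(r"\s+", " ", text).strip()     # remove multiple spaces
        if len(text.split()) < 5:                    # drop tweets with < 5 tokens
            continue
        if text not in seen:                         # remove duplicate tweets
            seen.add(text)
            cleaned.append(text)
    return cleaned
```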
In the second stage of pre-processing, we perform the following tasks (a sketch follows below):
- split the tweets into sentences,
- remove sentences with fewer than 4 tokens,
- convert ‘@username’ to ‘@USER’,
- convert URLs to HTTPURL.
Finally, we have 28,121,693 sentences for the training process.
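A minimal sketch of the second-stage steps, using a naive regex-based sentence splitter (the actual splitter and normalization patterns are not specified in this card):
```python
import re

def second_stage(tweets):
    sentences = []
    for tweet in tweets:
        text = re.sub(r"@\w+", "@USER", tweet)           # '@username' -> '@USER'
        text = re.sub(r"https?://\S+", "HTTPURL", text)  # URL -> HTTPURL
        # Naive split on sentence-final punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if len(sentence.split()) >= 4:               # keep sentences with >= 4 tokens
                sentences.append(sentence)
    return sentences
```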
The pre-training data will not be released to the public due to Twitter policy.
## Model
| Model name | Architecture | Size of training data | Size of validation data |
|-----------------------------------|-----------------|----------------------------|-------------------------|
| `indojave-codemixed-bert-base` | BERT | 2.24 GB of text | 249 MB of text |
## Evaluation Results
We train the model for 3 epochs (296K total steps), which takes 12 days.
The following are the results obtained from the training:
| train loss | eval loss | eval perplexity |
|------------|------------|-----------------|
| 3.5057 | 3.0559 | 21.2398 |
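The reported perplexity is the exponential of the evaluation loss, which can be verified directly:
```python
import math

eval_loss = 3.0559
print(math.exp(eval_loss))  # ~21.24, matching the reported eval perplexity
```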
## How to use
### Load model and tokenizer
```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the base encoder (without a task-specific head).
tokenizer = AutoTokenizer.from_pretrained("fathan/indojave-codemixed-bert-base")
model = AutoModel.from_pretrained("fathan/indojave-codemixed-bert-base")
```
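A quick usage sketch (not part of the original card) that encodes the widget sentence from above and inspects the contextual embeddings:
```python
import torch

# Encode a code-mixed example and inspect the contextual embeddings.
inputs = tokenizer("biasane nek arep [MASK] file bs pake software ini", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```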
### Masked language model
```python
from transformers import pipeline

pretrained_model = "fathan/indojave-codemixed-bert-base"

# Build a fill-mask pipeline backed by this model.
fill_mask = pipeline(
    "fill-mask",
    model=pretrained_model,
    tokenizer=pretrained_model,
)
```
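For example, using the widget sentence from this card:
```python
# Predict candidates for the masked token in a code-mixed tweet.
predictions = fill_mask("biasane nek arep [MASK] file bs pake software ini")
for prediction in predictions:
    print(prediction["token_str"], round(prediction["score"], 4))
```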
### Training hyperparameters
The following hyperparameters were used during training (see the `TrainingArguments` sketch after this list):
- learning_rate: 5e-05
- train_batch_size: 256
- eval_batch_size: 256
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0
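These values map onto `transformers.TrainingArguments` roughly as shown below; this is a sketch rather than the authors' training script, and the output directory and the per-device batch-size interpretation are assumptions:
```python
from transformers import TrainingArguments

# Sketch only: output_dir is a placeholder, and the batch size is assumed to be
# per device; this is not the authors' actual training configuration.
training_args = TrainingArguments(
    output_dir="./indojave-codemixed-bert-base",
    learning_rate=5e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    num_train_epochs=3.0,
)
```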
### Framework versions
- Transformers 4.26.0
- Pytorch 1.12.0+cu102
- Datasets 2.9.0
- Tokenizers 0.12.1