fathan committed
Commit 666b1c9
Parent: a8dbcb3

Update README.md

Files changed (1): README.md +76 -27

README.md CHANGED
@@ -1,37 +1,90 @@
  ---
- license: mit
  tags:
  - generated_from_trainer
- metrics:
- - accuracy
  model-index:
- - name: code_mixed_ije_indobert
    results: []
  ---

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->

- # code_mixed_ije_indobert

- This model is a fine-tuned version of [indolem/indobert-base-uncased](https://huggingface.co/indolem/indobert-base-uncased) on an unknown dataset.
- It achieves the following results on the evaluation set:
- - Loss: 1.9968
- - Accuracy: 0.6241

- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure

  ### Training hyperparameters

@@ -44,13 +97,9 @@ The following hyperparameters were used during training:
  - lr_scheduler_type: linear
  - num_epochs: 3.0

- ### Training results
-
-
-
  ### Framework versions

  - Transformers 4.26.0
  - Pytorch 1.12.0+cu102
  - Datasets 2.9.0
- - Tokenizers 0.12.1
 
  ---
  tags:
  - generated_from_trainer
  model-index:
+ - name: code-mixed-ijebertweet
  results: []
+ language:
+ - id
+ - jv
+ - en
+ pipeline_tag: fill-mask
+ widget:
+ - text: biasane nek arep [MASK] file bs pake software ini
  ---

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment. -->

+ # IndoJavE-BERT
+
+ ## About
+ IndoJavE-BERT is a pre-trained masked language model for code-mixed Indonesian-Javanese-English tweet data.
+ It is trained on top of the [IndoBERT](https://arxiv.org/pdf/2011.00677.pdf) model using
+ Hugging Face's [Transformers](https://huggingface.co/transformers) library.
+
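+ As a rough sketch only (not our actual training script; the checkpoint name and masking probability below are illustrative), continued masked-language-model pre-training with Transformers is typically set up along these lines:
+ ```python
+ from transformers import (
+     AutoModelForMaskedLM,
+     AutoTokenizer,
+     DataCollatorForLanguageModeling,
+ )
+
+ # Start from the IndoBERT checkpoint and continue masked-language-model training.
+ tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
+ model = AutoModelForMaskedLM.from_pretrained("indolem/indobert-base-uncased")
+
+ # The collator masks a fraction of tokens on the fly in every batch.
+ data_collator = DataCollatorForLanguageModeling(
+     tokenizer=tokenizer, mlm=True, mlm_probability=0.15
+ )
+ ```
+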
+ ## Pre-training Data
+ The Twitter data was collected from January 2022 to January 2023 using 8,698 random keyword phrases.
+ To make sure the retrieved data are code-mixed, we use keyword phrases that mix Indonesian, Javanese, and English words.
+ A few examples of the keyword phrases:
+ - travelling terus
+ - proud koncoku
+ - great kalian semua
+ - chattingane ilang
+ - baru aja launching
+
+ We acquire 40,788,384 raw tweets and apply the following first-stage pre-processing steps (see the sketch after this list):
+ - remove duplicate tweets,
+ - remove tweets shorter than 5 tokens,
+ - collapse multiple spaces,
+ - convert emoticons,
+ - lowercase all tweets.
+
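+ A minimal sketch of what this first stage could look like; the `emoji` package and the exact rules here are illustrative assumptions, not our actual pipeline:
+ ```python
+ import re
+
+ import emoji  # assumed emoticon-to-text converter; any equivalent works
+
+
+ def first_stage_clean(tweets):
+     """Illustrative first-stage cleaning: normalize, length-filter, dedupe."""
+     seen, cleaned = set(), []
+     for tweet in tweets:
+         tweet = emoji.demojize(tweet)                # convert emoticons to text aliases
+         tweet = re.sub(r"\s+", " ", tweet).strip()   # collapse multiple spaces
+         tweet = tweet.lower()                        # lowercase
+         if len(tweet.split()) < 5 or tweet in seen:  # drop short tweets and duplicates
+             continue
+         seen.add(tweet)
+         cleaned.append(tweet)
+     return cleaned
+ ```
+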
+ After the first stage, 17,385,773 tweets remain.
+ The second-stage pre-processing consists of the following steps (sketched below):
+ - split the tweets into sentences,
+ - remove sentences shorter than 4 tokens,
+ - convert '@username' to '@USER',
+ - convert URLs to HTTPURL.
+
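+ A corresponding sketch of the second stage; the regex-based sentence splitter is an illustrative stand-in for whatever splitter is actually used:
+ ```python
+ import re
+
+
+ def second_stage_clean(tweets):
+     """Illustrative second-stage cleaning: anonymize, mask URLs, split, filter."""
+     sentences = []
+     for tweet in tweets:
+         tweet = re.sub(r"@\w+", "@USER", tweet)             # '@username' -> '@USER'
+         tweet = re.sub(r"https?://\S+", "HTTPURL", tweet)   # URL -> HTTPURL
+         for sent in re.split(r"(?<=[.!?])\s+", tweet):      # naive sentence split
+             if len(sent.split()) >= 4:                      # keep sentences with >= 4 tokens
+                 sentences.append(sent)
+     return sentences
+ ```
+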
+ Finally, we have 28,121,693 sentences for the training process.
+ This pre-training data will not be released publicly due to Twitter's data policy.
+
+ ## Model
+ | Model name      | Base model | Size of training data | Size of validation data |
+ |-----------------|------------|-----------------------|-------------------------|
+ | `IndoJavE-BERT` | IndoBERT   | 2.24 GB of text       | 249 MB of text          |
+
+ ## Evaluation Results
+ We train the model for 3 epochs (296K total steps); training takes 4 days.
+ The following results are obtained from training:
+
+ | train loss | eval loss | eval perplexity |
+ |------------|-----------|-----------------|
+ | 2.2431     | 1.9968    | 7.3657          |
+
+ Note that the evaluation perplexity is simply the exponential of the evaluation loss: exp(1.9968) ≈ 7.37.
+
+ ## How to use
+ ### Load model and tokenizer
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("fathan/indojave-codemixed-bert")
+ model = AutoModel.from_pretrained("fathan/indojave-codemixed-bert")
+ ```
+ ### Masked language model
+ ```python
+ from transformers import pipeline
+
+ pretrained_model = "fathan/indojave-codemixed-bert"
+
+ fill_mask = pipeline(
+     "fill-mask",
+     model=pretrained_model,
+     tokenizer=pretrained_model,
+ )
+ ```
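+
+ For example, you can query the model with the widget sentence from the metadata above; each prediction returned by the pipeline is a dict with `token_str`, `score`, and `sequence` keys:
+ ```python
+ predictions = fill_mask("biasane nek arep [MASK] file bs pake software ini")
+ for pred in predictions:
+     print(pred["token_str"], pred["score"])
+ ```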
  ### Training hyperparameters

  - lr_scheduler_type: linear
  - num_epochs: 3.0

  ### Framework versions

  - Transformers 4.26.0
  - Pytorch 1.12.0+cu102
  - Datasets 2.9.0
+ - Tokenizers 0.12.1