	---
	tags:
	- generated_from_trainer
	model-index:
	- name: code-mixed-ijebertweet
	  results: []
	language:
	- id
	- jv
	- en
	pipeline_tag: fill-mask
	widget:
	- text: biasane nek arep [MASK] file bs pake software ini
	---


	# Indojave: IndoBERT-base

	## About
	This is a pre-trained masked language model for code-mixed Indonesian-Javanese-English tweet data.
	The model is built on top of the [IndoBERT](https://arxiv.org/pdf/2011.00677.pdf) model using
	Hugging Face's [Transformers](https://huggingface.co/transformers) library.

	## Pre-training Data
	The Twitter data was collected from January 2022 until January 2023 using 8,698 random keyword phrases.
	To make sure the retrieved data are code-mixed, we use keyword phrases that contain code-mixed Indonesian, Javanese, or English words.
	The following are a few examples of the keyword phrases:
	- travelling terus
	- proud koncoku
	- great kalian semua
	- chattingane ilang
	- baru aja launching

	We acquired 40,788,384 raw tweets and applied first-stage pre-processing steps, including:
	- remove duplicate tweets,
	- remove tweets with fewer than 5 tokens,
	- collapse multiple spaces,
	- convert emoticons,
	- convert all tweets to lower case.
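The first-stage steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes tokens are whitespace-separated, normalises whitespace and case before deduplicating (an ordering choice of this sketch), and leaves out emoticon conversion because the mapping is not described here. The function name `first_stage` and the sample tweets are hypothetical.

```python
import re

def first_stage(tweets, min_tokens=5):
    """Collapse whitespace, lowercase, drop duplicates and short tweets.

    Assumption: a "token" is a whitespace-separated word.
    Emoticon conversion is intentionally omitted (mapping unknown).
    """
    seen = set()
    cleaned = []
    for tweet in tweets:
        # Collapse runs of whitespace into one space, then lowercase.
        text = re.sub(r"\s+", " ", tweet).strip().lower()
        if text in seen:
            continue  # duplicate tweet
        seen.add(text)
        if len(text.split()) < min_tokens:
            continue  # fewer than min_tokens tokens
        cleaned.append(text)
    return cleaned

# Made-up sample input, for illustration only.
sample = [
    "Proud  koncoku wes lulus kuliah",
    "proud koncoku wes lulus kuliah",
    "baru aja launching",
]
first_stage_result = first_stage(sample)
print(first_stage_result)
```

Normalising before deduplicating means tweets that differ only in spacing or case count as duplicates; a pipeline that deduplicates on the raw text first would keep more tweets.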

	After the first-stage pre-processing, we obtained 17,385,773 tweets.
	In the second stage, we apply the following pre-processing steps:
	- split the tweets into sentences,
	- remove sentences with fewer than 4 tokens,
	- convert ‘@username’ to ‘@USER’,
	- convert URL to HTTPURL.
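A rough sketch of the second-stage steps is shown below. The regular expressions, the punctuation-based sentence splitter, and the name `second_stage` are assumptions made for illustration; the actual sentence splitter and patterns used for this model are not specified.

```python
import re

# Illustrative patterns (assumptions, not the authors' exact rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")  # split after ., ! or ?

def second_stage(tweet, min_tokens=4):
    """Mask URLs and user mentions, split into sentences, drop short ones."""
    text = URL_RE.sub("HTTPURL", tweet)
    text = USER_RE.sub("@USER", text)
    sentences = SENT_SPLIT_RE.split(text)
    # Assumption: tokens are whitespace-separated words.
    return [s.strip() for s in sentences if len(s.split()) >= min_tokens]

# Made-up example tweet, for illustration only.
example = "@someone baru aja launching produk baru! cek https://example.com ya. ok sip"
second_stage_result = second_stage(example)
print(second_stage_result)
```

Here only the first sentence survives, because the other two have fewer than 4 tokens after masking.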

	Finally, we have 28,121,693 sentences for the training process.
	The pre-training data will not be released publicly due to Twitter policy.

	## Model
	| Model name                         | Base model | Size of training data | Size of validation data |
	|------------------------------------|------------|-----------------------|-------------------------|
	| `indojave-codemixed-indobert-base` | IndoBERT   | 2.24 GB of text       | 249 MB of text          |

	## Evaluation Results
	We trained the model for 3 epochs, a total of 296K steps, over 4 days.
	The training produced the following results:

	| train loss | eval loss | eval perplexity |
	|------------|-----------|-----------------|
	| 2.2431     | 1.9968    | 7.3657          |
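The perplexity in the table appears to be the exponential of the evaluation loss, which is the usual convention for masked language modeling scripts. Assuming that convention holds here, the relationship can be checked as follows:

```python
import math

eval_loss = 1.9968  # value from the table above

# Assumption: perplexity = exp(mean cross-entropy loss in nats).
perplexity = math.exp(eval_loss)
print(f"{perplexity:.3f}")
```

The result is about 7.365, which agrees with the reported 7.3657 up to the rounding of the loss to four decimals.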

	## How to use
	### Load model and tokenizer
	```python
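	from transformers import AutoTokenizer, AutoModel

	# AutoModel loads the encoder without the masked-LM head,
	# which suits feature extraction or further fine-tuning.
	tokenizer = AutoTokenizer.from_pretrained("fathan/indojave-codemixed-indobert-base")
	model = AutoModel.from_pretrained("fathan/indojave-codemixed-indobert-base")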
	```
	### Masked language model
	```python
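	from transformers import pipeline

	pretrained_model = "fathan/indojave-codemixed-indobert-base"

	# Build a fill-mask pipeline; model and tokenizer come from the same checkpoint.
	fill_mask = pipeline(
	    "fill-mask",
	    model=pretrained_model,
	    tokenizer=pretrained_model,
	)

	# Predict candidates for the [MASK] token in a code-mixed sentence.
	print(fill_mask("biasane nek arep [MASK] file bs pake software ini"))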
	```



	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 256
	- eval_batch_size: 256
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 3.0
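As a sanity check, the 296K total steps reported in the evaluation section are roughly consistent with these hyperparameters. The sketch below rests on several assumptions that the card does not confirm: that the 28,121,693 sentences were split between training and validation in proportion to the reported text sizes (2.24 GB and 249 MB), and that 256 is the effective global batch size with no gradient accumulation.

```python
import math

total_sentences = 28_121_693          # from the pre-training data section
train_gb, valid_gb = 2.24, 0.249      # reported text sizes
train_batch_size = 256
num_epochs = 3

# Assumption: sentence counts are proportional to file sizes.
train_fraction = train_gb / (train_gb + valid_gb)
train_sentences = round(total_sentences * train_fraction)

# One optimizer step per batch; the last partial batch counts as a step.
steps_per_epoch = math.ceil(train_sentences / train_batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions the estimate lands close to 296K steps, matching the reported figure.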

	### Framework versions

	- Transformers 4.26.0
	- PyTorch 1.12.0+cu102
	- Datasets 2.9.0
	- Tokenizers 0.12.1