fathan/indojave-codemixed-bert-base


	---
	tags:
	- generated_from_trainer
	model-index:
	- name: code_mixed_ijebert
	  results: []
	language:
	- id
	- jv
	- en
	pipeline_tag: fill-mask
	---


	# Code-mixed IJEBERT

	## About
	Code-mixed IJEBERT is a pre-trained masked language model for code-mixed Indonesian-Javanese-English tweet data.
	The model is based on the [BERT](https://huggingface.co/bert-base-multilingual-cased) model and was trained with
	Hugging Face's [Transformers](https://huggingface.co/transformers) library.
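
As a rough illustration of how a masked language model like this one is typically queried, here is a minimal sketch using the Transformers `fill-mask` pipeline. The repository id is taken from the model page; the helper name `top_predictions` and the `{mask}` placeholder convention are assumptions made for this example and are not part of the model.

```python
MODEL_ID = "fathan/indojave-codemixed-bert-base"  # repository id as shown on the model page


def top_predictions(template, model_id=MODEL_ID, top_k=5):
    """Fill the single `{mask}` slot in `template` and return (token, score) pairs.

    Hypothetical helper for illustration only.
    """
    # Imported lazily: transformers (plus a backend such as PyTorch) must be
    # installed, and the first call downloads the model weights.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token instead of hard-coding "[MASK]".
    text = template.format(mask=fill_mask.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill_mask(text, top_k=top_k)]
```

A call such as `top_predictions("baru aja launching {mask} baru")` would return the five highest-scoring candidate tokens. Because the pre-training text was lowercased, lowercased input is likely to match the training distribution best.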

	## Pre-training Data
	The Twitter data was collected from January 2022 until January 2023 using 8,698 random keyword phrases.
	To make sure the retrieved data are code-mixed, we used keyword phrases that contain code-mixed Indonesian, Javanese, or English words.
	The following are a few examples of the keyword phrases:
	- travelling terus
	- proud koncoku
	- great kalian semua
	- chattingane ilang
	- baru aja launching

	We acquired 40,788,384 raw tweets. We then applied first-stage pre-processing tasks such as:
	- remove duplicate tweets,
	- remove tweets with a token length of less than 5,
	- collapse multiple spaces,
	- convert emoticons,
	- convert all tweets to lower case.
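
The first-stage steps above can be sketched roughly as follows. This is not the authors' script: the order of operations, whitespace tokenization, and the tiny `EMOTICONS` table are all assumptions, since the card does not specify the tokenizer or how emoticons were converted.

```python
import re

# Hypothetical emoticon table; the card does not say which mapping was used.
EMOTICONS = {":)": "smile", ":(": "sad"}


def convert_emoticons(text):
    """Replace each known emoticon with a word, padded so it becomes its own token."""
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, f" {word} ")
    return text


def first_stage(tweets, min_tokens=5):
    """Normalize tweets, drop short ones, and drop duplicates (keeping the first)."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        text = convert_emoticons(tweet)
        # Collapse runs of whitespace, then lowercase.
        text = re.sub(r"\s+", " ", text).strip().lower()
        # Assumption: "token length" means whitespace-separated tokens.
        if len(text.split()) < min_tokens:
            continue
        # Assumption: duplicates are detected on the normalized text.
        if text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

Deduplicating after normalization means two tweets that differ only in casing or spacing count as duplicates; deduplicating the raw text first would be stricter about what counts as the same tweet.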

	After the first-stage pre-processing, we obtained 17,385,773 tweets.
	In the second-stage pre-processing, we performed the following tasks:
	- split the tweets into sentences,
	- remove sentences with a token length of less than 4,
	- convert ‘@username’ to ‘@USER’,
	- convert URL to HTTPURL.
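
A minimal sketch of the second stage, under stated assumptions: the regular expressions, the punctuation-based sentence splitter, and whitespace tokenization are illustrative choices, not the authors' implementation. Mentions and URLs are replaced before splitting here, which leaves the token counts unchanged because each replacement maps one token to one token.

```python
import re

# Assumed patterns; the card does not give the exact rules that were used.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
# Naive splitter: break after '.', '!' or '?' when followed by whitespace.
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def second_stage(tweets, min_tokens=4):
    """Mask URLs and mentions, split tweets into sentences, drop short sentences."""
    sentences = []
    for tweet in tweets:
        text = URL_RE.sub("HTTPURL", tweet)
        text = USER_RE.sub("@USER", text)
        for sentence in SENT_SPLIT_RE.split(text):
            sentence = sentence.strip()
            # Assumption: "token length" means whitespace-separated tokens.
            if len(sentence.split()) >= min_tokens:
                sentences.append(sentence)
    return sentences
```

Applying the same `@USER` and `HTTPURL` substitutions to text at inference time should keep the input closer to what the model saw during pre-training.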

	Finally, we obtained 28,121,693 sentences for our pre-training task.

	## Model
	| Model name | #params | Arch. | Size of training data | Size of validation data |
	|----------------------|---------|----------|----------------------------|-------------------------|
	| `code-mixed-ijebert` | | BERT | 2.24 GB of text | 249 MB of text |

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 256
	- eval_batch_size: 256
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 3.0
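
To make the `linear` scheduler setting concrete, the sketch below restates the hyperparameters as a dictionary and computes the learning rate at a given step. The dictionary keys mirror `TrainingArguments` keyword names; mapping `train_batch_size` to a per-device value and using zero warmup steps are assumptions, since the card lists neither the device count nor a warmup setting. The function `linear_lr` is a hypothetical stand-alone helper, not a library call.

```python
# Hyperparameters from the list above, keyed by TrainingArguments keyword names.
# Assumption: the reported batch sizes are per-device values.
HYPERPARAMS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 256,
    "per_device_eval_batch_size": 256,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3.0,
}


def linear_lr(step, total_steps, base_lr=HYPERPARAMS["learning_rate"], warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

With no warmup, the rate starts at 5e-05, reaches half that value midway through training, and ends at zero on the final step.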

	### Training results



	### Framework versions

	- Transformers 4.26.0
	- Pytorch 1.12.0+cu102
	- Datasets 2.9.0
	- Tokenizers 0.12.1