dv-labse

This is an experiment in cross-lingual transfer learning: it inserts Dhivehi word and word-piece tokens into Google's LaBSE model.

It currently outperforms dv-wave and dv-MuRIL (a similar transfer-learning model) on the Maldivian News Classification task: https://github.com/Sofwath/DhivehiDatasets

  • mBERT: 52%
  • dv-wave (ELECTRA): 89%
  • dv-muril: 90.7%
  • dv-labse: 91.3-91.5% (training may continue)

Training

  • Start with LaBSE (similar to mBERT), which has no Thaana vocabulary
  • Using PanLex dictionaries, attach 1,100 Dhivehi words to the embeddings of their Sinhalese or English translations
  • Add remaining words and word-pieces from dv-wave's vocabulary to vocab.txt
  • Continue BERT pretraining on Dhivehi text
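
The vocabulary steps above can be sketched in a few lines. This is a toy illustration on plain Python lists, not the code from the notebook: the function name `extend_vocab`, the example word pair, and the random initialization of the remaining tokens (standard deviation 0.02) are all assumptions made for the example. The model card does not say how the remaining tokens were initialized.

```python
import random


def extend_vocab(vocab, embeddings, aligned, extra_tokens, seed=0):
    """Return a new (vocab, embeddings) pair with Dhivehi tokens appended.

    vocab        : list of existing tokens
    embeddings   : list of vectors, one per token in vocab
    aligned      : dict mapping a new Dhivehi word to an existing token
                   (for example a dictionary translation)
    extra_tokens : remaining new words and word-pieces
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    index = {tok: i for i, tok in enumerate(vocab)}
    new_vocab = list(vocab)
    new_embeddings = [list(row) for row in embeddings]

    # Dictionary-aligned words start from a copy of their translation's vector.
    for dv_word, known_word in aligned.items():
        if dv_word in index or known_word not in index:
            continue  # already present, or nothing to copy from
        index[dv_word] = len(new_vocab)
        new_vocab.append(dv_word)
        new_embeddings.append(list(new_embeddings[index[known_word]]))

    # Assumption: remaining tokens get small random vectors, to be learned
    # during continued pretraining.
    for tok in extra_tokens:
        if tok in index:
            continue
        index[tok] = len(new_vocab)
        new_vocab.append(tok)
        new_embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])

    return new_vocab, new_embeddings


# Tiny made-up example: a 3-token vocabulary with 2-dimensional embeddings.
vocab = ["[PAD]", "[UNK]", "water"]
embeddings = [[0.0, 0.0], [0.1, 0.1], [0.5, -0.2]]
aligned = {"ފެން": "water"}   # illustrative Dhivehi/English pair
extra_tokens = ["##ން"]       # illustrative word-piece

new_vocab, new_embeddings = extend_vocab(vocab, embeddings, aligned, extra_tokens)
print(new_vocab)
print(new_embeddings[new_vocab.index("ފެން")])
```

With a real checkpoint, the same idea would be applied to the model's input embedding matrix after the new tokens are appended to vocab.txt, followed by the continued pretraining step.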

Colab notebook: https://colab.research.google.com/drive/1CUn44M2fb4Qbat2pAvjYqsPvWLt1Novi
