CafeBERT: A Pre-Trained Language Model for Vietnamese (NAACL-2024 Findings)

CafeBERT is a pre-trained language model that achieves state-of-the-art results for Vietnamese. The name refers to coffee ("cafe"), a popular morning drink in Vietnam.

CafeBERT is a large-scale multilingual language model with strong support for Vietnamese. It is based on XLM-RoBERTa, a state-of-the-art multilingual language model, and is further pre-trained on a large Vietnamese corpus covering many domains, including Wikipedia and newspapers. CafeBERT performs strongly on the VLUE benchmark and on other tasks such as machine reading comprehension, text classification, natural language inference, and part-of-speech tagging.

The general architecture and experimental results of CafeBERT can be found in our paper:

@inproceedings{do-etal-2024-vlue,
    title = "{VLUE}: A New Benchmark and Multi-task Knowledge Transfer Learning for {V}ietnamese Natural Language Understanding",
    author = "Do, Phong  and
      Tran, Son  and
      Hoang, Phu  and
      Nguyen, Kiet  and
      Nguyen, Ngan",
    editor = "Duh, Kevin  and
      Gomez, Helena  and
      Bethard, Steven",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2024",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-naacl.15",
    pages = "211--222",
    abstract = "The success of Natural Language Understanding (NLU) benchmarks in various languages, such as GLUE for English, CLUE for Chinese, KLUE for Korean, and IndoNLU for Indonesian, has facilitated the evaluation of new NLU models across a wide range of tasks. To establish a standardized set of benchmarks for Vietnamese NLU, we introduce the first Vietnamese Language Understanding Evaluation (VLUE) benchmark. The VLUE benchmark encompasses five datasets covering different NLU tasks, including text classification, span extraction, and natural language understanding. To provide an insightful overview of the current state of Vietnamese NLU, we then evaluate seven state-of-the-art pre-trained models, including both multilingual and Vietnamese monolingual models, on our proposed VLUE benchmark. Furthermore, we present CafeBERT, a new state-of-the-art pre-trained model that achieves superior results across all tasks in the VLUE benchmark. Our model combines the proficiency of a multilingual pre-trained model with Vietnamese linguistic knowledge. CafeBERT is developed based on the XLM-RoBERTa model, with an additional pretraining step utilizing a significant amount of Vietnamese textual data to enhance its adaptation to the Vietnamese language. For the purpose of future research, CafeBERT is made publicly available for research purposes.",
}

Please CITE our paper when CafeBERT is used to help produce published results or is incorporated into other software.

Installation

Install the transformers and sentencepiece packages (the usage example also requires PyTorch):

pip install transformers
pip install sentencepiece

Example usage

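from transformers import AutoModel, AutoTokenizer
import torch

# Load the pre-trained encoder and its matching tokenizer from the Hugging Face Hub.
model = AutoModel.from_pretrained('uitnlp/CafeBERT')
tokenizer = AutoTokenizer.from_pretrained('uitnlp/CafeBERT')

# Tokenize one Vietnamese sentence and return PyTorch tensors.
encoding = tokenizer('Cà phê được trồng nhiều ở khu vực Tây Nguyên của Việt Nam.', return_tensors='pt')

# Run a forward pass without tracking gradients (inference only).
with torch.no_grad():
    output = model(**encoding)

# output.last_hidden_state has shape (batch_size, sequence_length, hidden_size).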
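
The forward pass above returns one vector per token. If a single vector per sentence is needed, one common option is to average the token vectors while ignoring padding positions. The sketch below is an assumption for illustration, not a method described in the paper: `mean_pool` and `embed_sentences` are hypothetical helper names, and mean pooling is just one of several possible pooling choices.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Clamp the token count to avoid dividing by zero for an all-padding row.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def embed_sentences(model, tokenizer, sentences):
    """Hypothetical helper: encode a batch of sentences, return one vector per sentence."""
    encoding = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoding)
    return mean_pool(output.last_hidden_state, encoding["attention_mask"])


# Small check on dummy tensors (no model download needed):
# the second "sentence" has one padding position that must be ignored.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0]],
                       [[5.0, 6.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1], [1, 0]])
pooled = mean_pool(hidden, mask)
```

With the `model` and `tokenizer` loaded as in the example usage, `embed_sentences(model, tokenizer, ['Cà phê được trồng nhiều ở khu vực Tây Nguyên của Việt Nam.'])` would return a tensor with one row per input sentence. Whether mean pooling suits a given downstream task is something to validate empirically.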
