metadata

datasets:
  - msislam/marc-code-mixed-small
language:
  - de
  - en
  - es
  - fr
metrics:
  - seqeval
widget:
  - text: Hala Madrid y nada más. It means Go Madrid and nothing more.
  - text: Hallo, Guten Tag! how are you?
  - text: >-
      Sie sind gut. How about you? Comment va ta mère? And what about your
      school? Estoy aprendiendo español. Thanks.

Code-Mixed Language Detection using XLM-RoBERTa

Description

This model detects the languages in code-mixed text, along with their boundaries, by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR). The model is fine-tuned from xlm-roberta-base.

Training Dataset

The training dataset is derived from the Multilingual Amazon Reviews Corpus. The preprocessed dataset used to train, validate, and test this model is available as msislam/marc-code-mixed-small.

Results

'DE': {'precision': 0.9870741390453328,
       'recall': 0.9883516686696866,
       'f1': 0.9877124907612713}
'EN': {'precision': 0.9901617633147289,
       'recall': 0.9914748508098892,
       'f1': 0.9908178720181748}
'ES': {'precision': 0.9912407007439404,
       'recall': 0.9912407007439404,
       'f1': 0.9912407007439406}
'FR': {'precision': 0.9872469872469872,
       'recall': 0.9871314927468414,
       'f1': 0.9871892366188945}

'overall_precision': 0.9888723454274744
'overall_recall': 0.9895702634880803
'overall_f1': 0.9892211813585232
'overall_accuracy': 0.9993651810717168
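
As a quick sanity check on the figures above, each F1 score should equal the harmonic mean of its precision and recall, assuming the standard seqeval definition. The sketch below recomputes F1 from the reported precision and recall; the helper name `f1_from_precision_recall` and the `reported` dictionary are illustrative and not part of the model or of seqeval.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the results above.
reported = {
    "DE": (0.9870741390453328, 0.9883516686696866, 0.9877124907612713),
    "EN": (0.9901617633147289, 0.9914748508098892, 0.9908178720181748),
    "ES": (0.9912407007439404, 0.9912407007439404, 0.9912407007439406),
    "FR": (0.9872469872469872, 0.9871314927468414, 0.9871892366188945),
    "overall": (0.9888723454274744, 0.9895702634880803, 0.9892211813585232),
}

for name, (precision, recall, f1) in reported.items():
    recomputed = f1_from_precision_recall(precision, recall)
    # Allow a small tolerance for floating-point rounding.
    print(name, recomputed, abs(recomputed - f1) < 1e-9)
```

For ES, precision and recall are identical, so the F1 score coincides with both.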

Usage

The model can be used as follows:

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'

# Tokenize without special tokens so that every prediction maps to a text token
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the highest-scoring label ID for each token
labels_predicted = logits.argmax(-1)

# Map label IDs to language tags using the model's own label mapping
lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]
lang_tag_predicted
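
The usage code yields one language tag per token. To recover language boundaries, consecutive tokens with the same tag can be merged into spans. The following is a minimal sketch, not part of the model's API: `normalize_tag` and `group_language_spans` are hypothetical helpers, and the sketch assumes that tags are either plain codes such as "EN" or carry an IOB-style "B-"/"I-" prefix, and that word-initial tokens start with the SentencePiece marker "▁" as in XLM-RoBERTa tokenizers.

```python
from itertools import groupby

SPIECE_MARKER = "\u2581"  # SentencePiece word-boundary marker


def normalize_tag(tag: str) -> str:
    """Drop an optional IOB prefix (assumption: tags may look like 'B-EN' or 'EN')."""
    return tag[2:] if tag[:2] in ("B-", "I-") else tag


def group_language_spans(tokens, tags):
    """Merge consecutive tokens sharing a language tag into (language, text) spans."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    pairs = zip(tokens, (normalize_tag(t) for t in tags))
    spans = []
    for lang, group in groupby(pairs, key=lambda pair: pair[1]):
        # Join subword pieces, then turn word-boundary markers back into spaces.
        text = "".join(tok for tok, _ in group).replace(SPIECE_MARKER, " ").strip()
        spans.append((lang, text))
    return spans


# Hand-written example tokens and tags, not actual model output.
example_tokens = ["\u2581Hala", "\u2581Madrid", "\u2581It", "\u2581mean", "s"]
example_tags = ["ES", "ES", "EN", "EN", "EN"]
print(group_language_spans(example_tokens, example_tags))
```

With real predictions, the token strings can be obtained by applying tokenizer.convert_ids_to_tokens to the input IDs and then passed to the helper together with lang_tag_predicted.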