---
datasets:
- msislam/marc-code-mixed-small
language:
- de
- en
- es
- fr
metrics:
- seqeval
widget:
- text: >-
    Hala Madrid y nada más. It means Go Madrid and nothing more.
- text: >-
    Hallo, Guten Tag! how are you?
- text: >-
    Sie sind gut. How about you? Comment va ta mère? And what about your school? Estoy aprendiendo español. Thanks.
---

# Code-Mixed Language Detection using XLM-RoBERTa

## Description

This model detects the languages in code-mixed text, along with their boundaries, by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR). The model is fine-tuned from [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
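
For a quick look at what "languages with their boundaries" means in practice, the model can be run through the standard `transformers` token-classification pipeline. This is a minimal sketch, not the card author's own recipe: `aggregation_strategy="simple"` merges neighboring tokens that share a predicted tag, and how cleanly the spans merge depends on the model's label scheme.

```python
from transformers import pipeline

# Token-classification pipeline; "simple" aggregation merges consecutive
# tokens with the same predicted language tag into one span.
detector = pipeline(
    "token-classification",
    model="msislam/code-mixed-language-detection-XLMRoberta",
    aggregation_strategy="simple",
)

# Each span carries a language tag plus character offsets (the boundaries)
for span in detector("Hala Madrid y nada más. It means Go Madrid and nothing more."):
    print(span["entity_group"], repr(span["word"]), span["start"], span["end"])
```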

## Training Dataset

The training dataset is based on [The Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi). The preprocessed dataset used to train, validate, and test this model is available [here](https://huggingface.co/datasets/msislam/marc-code-mixed-small).
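
The preprocessed corpus can be pulled straight from the Hub with the `datasets` library. A minimal sketch; the split name and record layout printed below are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Download the preprocessed code-mixed dataset from the Hugging Face Hub
dataset = load_dataset("msislam/marc-code-mixed-small")

print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # one record; the "train" split name is an assumption
```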

## Results

Per-language and overall token-level scores (seqeval):

```python
'DE': {'precision': 0.9870741390453328,
       'recall': 0.9883516686696866,
       'f1': 0.9877124907612713}
'EN': {'precision': 0.9901617633147289,
       'recall': 0.9914748508098892,
       'f1': 0.9908178720181748}
'ES': {'precision': 0.9912407007439404,
       'recall': 0.9912407007439404,
       'f1': 0.9912407007439406}
'FR': {'precision': 0.9872469872469872,
       'recall': 0.9871314927468414,
       'f1': 0.9871892366188945}

'overall_precision': 0.9888723454274744
'overall_recall': 0.9895702634880803
'overall_f1': 0.9892211813585232
'overall_accuracy': 0.9993651810717168
```
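
These figures follow `seqeval`'s output format. Scores of this kind can be computed with the `evaluate` library, as in the toy sketch below; the BIO-style tags are assumptions and may differ from the model's actual label scheme.

```python
import evaluate

seqeval = evaluate.load("seqeval")

# Toy gold and predicted per-token tag sequences for one sentence
references = [["B-ES", "I-ES", "I-ES", "B-EN", "I-EN", "I-EN"]]
predictions = [["B-ES", "I-ES", "I-ES", "B-EN", "B-EN", "I-EN"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"])  # per-language entries appear as results["ES"], etc.
```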

## Code

The code associated with the model can be found in this [GitHub repo](https://github.com/msishuvo/Language-Identification-in-Code-Mixed-Text-using-Large-Language-Model.git).

## Usage

The model can be used as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'

# Tokenize without special tokens so each prediction aligns with a text token
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring label ID for each token and map it to a language tag
labels_predicted = logits.argmax(-1)
lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]
print(lang_tag_predicted)
```
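
XLM-RoBERTa predicts one tag per subword, so a single word split into several pieces gets several tags. Continuing the snippet above, subword tags can be collapsed to word-level tags with the fast tokenizer's `word_ids()`; this sketch keeps the first subword's tag per word and assumes the checkpoint loads a fast tokenizer.

```python
# Map each subword back to its source word; keep the first subword's tag
word_ids = inputs.word_ids(0)
word_tags = {}
for position, word_id in enumerate(word_ids):
    if word_id is not None and word_id not in word_tags:
        word_tags[word_id] = lang_tag_predicted[position]

print(word_tags)  # word index -> predicted language tag
```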

## Limitations

The model can sometimes behave inconsistently. Known issues so far:

* Short runs of tokens from another language (typically one or two tokens, or a noun phrase) embedded in a sequence dominated by one language may be missed.
* Proper nouns and some cross-lingual tokens ("in", "me", etc.) may be predicted incorrectly.
* Predictions are also sensitive to punctuation.