---
datasets:
- msislam/marc-code-mixed-small
language:
- de
- en
- es
- fr
metrics:
- seqeval
widget:
- text: >-
Hala Madrid y nada más. It means Go Madrid and nothing more.
- text: >-
Hallo, Guten Tag! how are you?
- text: >-
Sie sind gut. How about you? Comment va ta mère? And what about your school? Estoy aprendiendo español. Thanks.
---
# Code-Mixed Language Detection using XLM-RoBERTa
## Description
This model identifies the languages in a code-mixed text, along with the boundaries between them, by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR). The model is fine-tuned from [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
## Training Dataset
The training dataset is based on [The Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi). The preprocessed dataset used to train, validate, and test this model is available [here](https://huggingface.co/datasets/msislam/marc-code-mixed-small).
## Results
```python
{
    'DE': {'precision': 0.9870741390453328,
           'recall': 0.9883516686696866,
           'f1': 0.9877124907612713},
    'EN': {'precision': 0.9901617633147289,
           'recall': 0.9914748508098892,
           'f1': 0.9908178720181748},
    'ES': {'precision': 0.9912407007439404,
           'recall': 0.9912407007439404,
           'f1': 0.9912407007439406},
    'FR': {'precision': 0.9872469872469872,
           'recall': 0.9871314927468414,
           'f1': 0.9871892366188945},
    'overall_precision': 0.9888723454274744,
    'overall_recall': 0.9895702634880803,
    'overall_f1': 0.9892211813585232,
    'overall_accuracy': 0.9993651810717168,
}
```
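As a quick sanity check on how the reported numbers relate, each F1 value is the harmonic mean of the corresponding precision and recall. The sketch below recomputes two of them from the figures above; `f1_score` is a helper name introduced here for illustration and is not part of the model's code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall copied from the results above
de_f1 = f1_score(0.9870741390453328, 0.9883516686696866)
overall_f1 = f1_score(0.9888723454274744, 0.9895702634880803)

print(f"DE f1: {de_f1:.6f}")
print(f"overall f1: {overall_f1:.6f}")
```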
## Code
The code associated with the model is available in this [GitHub repository](https://github.com/msishuvo/Language-Identification-in-Code-Mixed-Text-using-Large-Language-Model.git).
## Usage
The model can be used as follows:
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "msislam/code-mixed-language-detection-XLMRoberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'
# Skip the special tokens so that every prediction maps to a subword of the text
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

# Inference only, so gradients are not needed
with torch.no_grad():
    logits = model(**inputs).logits

# Most likely label id per token, mapped to its language tag
labels_predicted = logits.argmax(-1)
lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]
print(lang_tag_predicted)
```
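Since the model tags each token, language boundaries can be recovered by merging consecutive tokens that share a language. The following is a minimal sketch, not part of the original code: `strip_iob` and `group_language_spans` are hypothetical helpers, the sample tags are hand-written rather than real model output, and the label format is an assumption (an optional `B-`/`I-` prefix is stripped if present). Tokens are assumed to use the SentencePiece word-start marker `▁`, as XLM-RoBERTa tokenizers do.

```python
from itertools import groupby

WORD_START = "\u2581"  # SentencePiece marker placed at the start of a word

def strip_iob(tag):
    """Drop an optional IOB prefix, so 'B-ES' and 'I-ES' both become 'ES'."""
    return tag[2:] if tag[:2] in ("B-", "I-") else tag

def group_language_spans(tokens, tags):
    """Merge consecutive tokens of the same language into (language, text) spans."""
    pairs = [(strip_iob(tag), token) for token, tag in zip(tokens, tags)]
    spans = []
    for lang, group in groupby(pairs, key=lambda pair: pair[0]):
        # Concatenate subwords, then turn word-start markers back into spaces
        text = "".join(token for _, token in group).replace(WORD_START, " ").strip()
        spans.append((lang, text))
    return spans

# Hand-written illustration (assumed label format, not real model output)
tokens = ["▁Hala", "▁Madrid", ".", "▁It", "▁means", "▁Go"]
tags = ["B-ES", "I-ES", "I-ES", "B-EN", "I-EN", "I-EN"]
print(group_language_spans(tokens, tags))
```

With the usage example above, the token strings can be obtained with `tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])` and passed in together with `lang_tag_predicted`. Note that this sketch merges adjacent spans of the same language into one.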
## Limitations
The model can occasionally produce inconsistent predictions. The issues known so far are:
* The model might fail to detect a short insertion (typically 1 or 2 tokens) or a noun phrase from another language when it appears inside a sequence of one language.
* Proper nouns and some cross-lingual tokens (in, me, etc.) might be wrongly predicted.
* The prediction also depends on punctuation.