---
datasets:
- msislam/marc-code-mixed-small
language:
- de
- en
- es
- fr
metrics:
- seqeval
widget:
- text: >-
    Hala Madrid y nada más. It means Go Madrid and nothing more.
- text: >-
    Hallo, Guten Tag! how are you?
- text: >-
    Sie sind gut. How about you? Comment va ta mère? And what about your school? Estoy aprendiendo español. Thanks.
---

# Code-Mixed Language Detection using XLM-RoBERTa

## Description

This model detects the languages in code-mixed text, along with their boundaries, by classifying each token. It currently supports German (DE), English (EN), Spanish (ES), and French (FR). The model is fine-tuned from [xlm-roberta-base](https://huggingface.co/xlm-roberta-base).
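
For a quick look at what "languages with their boundaries" means in practice, the model can be run through the standard `transformers` token-classification pipeline. This is a minimal sketch, not the card author's own recipe: `aggregation_strategy="simple"` merges neighboring tokens that share a predicted tag, and how cleanly the spans merge depends on the model's label scheme.

```python
from transformers import pipeline

# Token-classification pipeline; "simple" aggregation merges consecutive
# tokens with the same predicted language tag into one span.
detector = pipeline(
    "token-classification",
    model="msislam/code-mixed-language-detection-XLMRoberta",
    aggregation_strategy="simple",
)

# Each span carries a language tag plus character offsets (the boundaries)
for span in detector("Hala Madrid y nada más. It means Go Madrid and nothing more."):
    print(span["entity_group"], repr(span["word"]), span["start"], span["end"])
```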

## Training Dataset

The training dataset is based on [The Multilingual Amazon Reviews Corpus](https://huggingface.co/datasets/amazon_reviews_multi). The preprocessed dataset used to train, validate, and test this model is available [here](https://huggingface.co/datasets/msislam/marc-code-mixed-small).
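
The preprocessed corpus can be pulled straight from the Hub with the `datasets` library. A minimal sketch; the split name and record layout printed below are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Download the preprocessed code-mixed dataset from the Hugging Face Hub
dataset = load_dataset("msislam/marc-code-mixed-small")

print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # one record; the "train" split name is an assumption
```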

## Results

Per-language and overall token-level scores (seqeval):

```python
'DE': {'precision': 0.9870741390453328,
       'recall': 0.9883516686696866,
       'f1': 0.9877124907612713}
'EN': {'precision': 0.9901617633147289,
       'recall': 0.9914748508098892,
       'f1': 0.9908178720181748}
'ES': {'precision': 0.9912407007439404,
       'recall': 0.9912407007439404,
       'f1': 0.9912407007439406}
'FR': {'precision': 0.9872469872469872,
       'recall': 0.9871314927468414,
       'f1': 0.9871892366188945}

'overall_precision': 0.9888723454274744
'overall_recall': 0.9895702634880803
'overall_f1': 0.9892211813585232
'overall_accuracy': 0.9993651810717168
```
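
These figures follow `seqeval`'s output format. Scores of this kind can be computed with the `evaluate` library, as in the toy sketch below; the BIO-style tags are assumptions and may differ from the model's actual label scheme.

```python
import evaluate

seqeval = evaluate.load("seqeval")

# Toy gold and predicted per-token tag sequences for one sentence
references = [["B-ES", "I-ES", "I-ES", "B-EN", "I-EN", "I-EN"]]
predictions = [["B-ES", "I-ES", "I-ES", "B-EN", "B-EN", "I-EN"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_f1"])  # per-language entries appear as results["ES"], etc.
```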

## Code

The code associated with the model can be found in this [GitHub repo](https://github.com/msishuvo/Language-Identification-in-Code-Mixed-Text-using-Large-Language-Model.git).

## Usage

The model can be used as follows:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")
model = AutoModelForTokenClassification.from_pretrained("msislam/code-mixed-language-detection-XLMRoberta")

text = 'Hala Madrid y nada más. It means Go Madrid and nothing more.'

# Tokenize without special tokens so each prediction aligns with a text token
inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring label ID for each token and map it to a language tag
labels_predicted = logits.argmax(-1)
lang_tag_predicted = [model.config.id2label[t.item()] for t in labels_predicted[0]]
print(lang_tag_predicted)
```
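
XLM-RoBERTa predicts one tag per subword, so a single word split into several pieces gets several tags. Continuing the snippet above, subword tags can be collapsed to word-level tags with the fast tokenizer's `word_ids()`; this sketch keeps the first subword's tag per word and assumes the checkpoint loads a fast tokenizer.

```python
# Map each subword back to its source word; keep the first subword's tag
word_ids = inputs.word_ids(0)
word_tags = {}
for position, word_id in enumerate(word_ids):
    if word_id is not None and word_id not in word_tags:
        word_tags[word_id] = lang_tag_predicted[position]

print(word_tags)  # word index -> predicted language tag
```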

## Limitations

The model can sometimes behave inconsistently. Known issues so far:

* Short runs of tokens from another language (typically one or two tokens, or a noun phrase) embedded in a sequence dominated by one language may be missed.
* Proper nouns and some cross-lingual tokens ("in", "me", etc.) may be predicted incorrectly.
* Predictions are also sensitive to punctuation.