---
library_name: transformers
tags:
- language
- detection
- classification
license: mit
datasets:
- hac541309/open-lid-dataset
pipeline_tag: text-classification
---

This is a clone of https://huggingface.co/alexneakameni/language_detection with the weights additionally exported to ONNX format.

# Language Detection Model

A **BERT-based** language detection model trained on [hac541309/open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset), which includes **121 million sentences across 200 languages**. This model is optimized for **fast and accurate** language identification in text classification tasks.

## Model Details

- **Architecture**: [BertForSequenceClassification](https://huggingface.co/transformers/model_doc/bert.html)
- **Hidden Size**: 384  
- **Number of Layers**: 4  
- **Attention Heads**: 6  
- **Max Sequence Length**: 512  
- **Dropout**: 0.1  
- **Vocabulary Size**: 50,257  
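
For reference, the numbers above correspond roughly to the following configuration. This is only a sketch: values not listed on this card (e.g. `intermediate_size`, and the exact `num_labels`) are assumptions and should be checked against the hosted `config.json`.

```python
from transformers import BertConfig, BertForSequenceClassification

# Sketch of a config matching the hyperparameters listed above.
# num_labels=200 assumes one label per language in open-lid-dataset (verify against config.json).
# intermediate_size is not stated on this card, so the BertConfig default applies here.
config = BertConfig(
    vocab_size=50_257,
    hidden_size=384,
    num_hidden_layers=4,
    num_attention_heads=6,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    num_labels=200,
)
model = BertForSequenceClassification(config)
```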

## Training Process

- **Dataset**: 
  - Used the [open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset)  
  - Split into train (90%) and test (10%)
- **Tokenizer**: A custom `BertTokenizerFast` with special tokens for `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
- **Hyperparameters**:  
  - Learning Rate: 2e-5  
  - Batch Size: 256 (training) / 512 (testing)  
  - Epochs: 1  
  - Scheduler: Cosine  
- **Trainer**: Leveraged the Hugging Face [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) with Weights & Biases for logging
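
A minimal training setup consistent with these settings might look like the sketch below. It assumes `model`, `tokenizer`, `train_dataset`, and `test_dataset` are already prepared; the output directory is illustrative and not taken from the original training script.

```python
from transformers import Trainer, TrainingArguments

# Settings mirroring the hyperparameters listed above; output_dir is a hypothetical path.
training_args = TrainingArguments(
    output_dir="language-detection-bert",
    learning_rate=2e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=512,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    report_to="wandb",  # Weights & Biases logging, as noted above
)

trainer = Trainer(
    model=model,                  # e.g. the BertForSequenceClassification sketched earlier
    args=training_args,
    train_dataset=train_dataset,  # tokenized 90% split (assumed variable name)
    eval_dataset=test_dataset,    # tokenized 10% split (assumed variable name)
    tokenizer=tokenizer,
)
trainer.train()
```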

## Evaluation

The model was evaluated on the test split. Below are the overall metrics:

- **Accuracy**: 0.969466  
- **Precision**: 0.969586  
- **Recall**: 0.969466  
- **F1 Score**: 0.969417
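
One common way to obtain accuracy together with precision, recall, and F1 of this kind is a `compute_metrics` function passed to the Trainer. The sketch below uses scikit-learn with weighted averaging; this is an assumption about the evaluation setup, not the original evaluation script.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Accuracy plus weighted precision/recall/F1 over all language labels."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```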

Detailed evaluation by script (Size is the number of supported languages written in that script):

| Script | Support | Precision | Recall | F1 Score | Size |
|--------|---------|-----------|--------|----------|------|
| Arab   | 819219  | 0.9038    | 0.9014 | 0.9023   | 21   |
| Latn   | 7924704 | 0.9678    | 0.9663 | 0.9670   | 125  |
| Ethi   | 144403  | 0.9967    | 0.9964 | 0.9966   | 2    |
| Beng   | 163983  | 0.9949    | 0.9935 | 0.9942   | 3    |
| Deva   | 423895  | 0.9495    | 0.9326 | 0.9405   | 10   |
| Cyrl   | 831949  | 0.9899    | 0.9883 | 0.9891   | 12   |
| Tibt   | 35683   | 0.9925    | 0.9930 | 0.9927   | 2    |
| Grek   | 131155  | 0.9984    | 0.9990 | 0.9987   | 1    |
| Gujr   | 86912   | 0.99999   | 0.9999 | 0.99995  | 1    |
| Hebr   | 100530  | 0.9966    | 0.9995 | 0.9981   | 2    |
| Armn   | 67203   | 0.9999    | 0.9998 | 0.9998   | 1    |
| Jpan   | 88004   | 0.9983    | 0.9987 | 0.9985   | 1    |
| Knda   | 67170   | 0.9999    | 0.9998 | 0.9999   | 1    |
| Geor   | 70769   | 0.99997   | 0.9998 | 0.9999   | 1    |
| Khmr   | 39708   | 1.0000    | 0.9997 | 0.9999   | 1    |
| Hang   | 108509  | 0.9997    | 0.9999 | 0.9998   | 1    |
| Laoo   | 29389   | 0.9999    | 0.9999 | 0.9999   | 1    |
| Mlym   | 68418   | 0.99996   | 0.9999 | 0.9999   | 1    |
| Mymr   | 100857  | 0.9999    | 0.9992 | 0.9995   | 2    |
| Orya   | 44976   | 0.9995    | 0.9998 | 0.9996   | 1    |
| Guru   | 67106   | 0.99999   | 0.9999 | 0.9999   | 1    |
| Olck   | 22279   | 1.0000    | 0.9991 | 0.9995   | 1    |
| Sinh   | 67492   | 1.0000    | 0.9998 | 0.9999   | 1    |
| Taml   | 76373   | 0.99997   | 0.9999 | 0.9999   | 1    |
| Tfng   | 41325   | 0.8512    | 0.8246 | 0.8247   | 2    |
| Telu   | 62387   | 0.99997   | 0.9999 | 0.9999   | 1    |
| Thai   | 83820   | 0.99995   | 0.9998 | 0.9999   | 1    |
| Hant   | 152723  | 0.9945    | 0.9954 | 0.9949   | 2    |
| Hans   | 92689   | 0.9893    | 0.9870 | 0.9882   | 1    |


A detailed per-script classification report is also provided in the repository for further analysis.

---

### How to Use

You can quickly load and run inference with this model using the [Transformers pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the tokenizer and fine-tuned classifier from the Hub
tokenizer = AutoTokenizer.from_pretrained("alexneakameni/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("alexneakameni/language_detection")

# Wrap both in a text-classification pipeline
language_detection = pipeline("text-classification", model=model, tokenizer=tokenizer)

text = "Hello world!"
predictions = language_detection(text)
print(predictions)  # list of {label, score} dicts
```

This will output the predicted language code or label with the corresponding confidence score.
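
Since this repository also ships ONNX weights (see the note at the top), you may be able to run inference with ONNX Runtime through Hugging Face Optimum instead of PyTorch. The snippet below is a sketch: it assumes the `optimum[onnxruntime]` extra is installed and that the ONNX export in this repo loads with the default settings; substitute this repository's Hub id for the placeholder.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "<this-repo-id>"  # replace with this repository's Hub id (the ONNX clone)

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Loads the ONNX weights and runs them with ONNX Runtime instead of PyTorch
model = ORTModelForSequenceClassification.from_pretrained(model_id)

language_detection = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(language_detection("Bonjour tout le monde!"))
```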

---

**Note**: The model’s performance may vary depending on text length, language variety, and domain-specific vocabulary. Always validate results against your own datasets for critical applications. 

For more information, see the [repository documentation](https://github.com/KameniAlexNea/learning_language). 

Thank you for using this model—feedback and contributions are welcome!