---
model_name: xlm-roberta-large Azerbaijani NER
tags:
  - NER
  - xlm-roberta
  - Azerbaijani
  - HuggingFace
license: "apache-2.0"
library_name: transformers
---

# Azerbaijani Named Entity Recognition (NER) with XLM-RoBERTa Large

**Repository on Hugging Face**: [IsmatS/xlm_roberta_large_az_ner](https://huggingface.co/IsmatS/xlm_roberta_large_az_ner)  
**Repository on GitHub**: [Named Entity Recognition](https://github.com/Ismat-Samadov/Named_Entity_Recognition)

## File Structure

```plaintext
.
├── README.md                   # Documentation for the project
├── config.json                 # Configuration file for model deployment
├── model-001.safetensors       # Model weights in Safetensors format for safe deployment
├── sentencepiece.bpe.model     # SentencePiece model for tokenization
├── special_tokens_map.json     # Map for special tokens (e.g., <pad>, <s>)
├── tokenizer.json              # JSON configuration for tokenizer
├── tokenizer_config.json       # Additional tokenizer configurations
├── xlm_roberta_large.ipynb     # Jupyter Notebook for training and experimentation
└── xlm_roberta_large.py        # Python script for training and experimentation
```

**Explanation**:
- **README.md**: Provides detailed information on the project, including setup, usage, and evaluation.
- **config.json**: Stores the model configuration, such as the architecture hyperparameters and the label mapping.
- **model-001.safetensors**: Contains model weights in a secure, efficient format.
- **sentencepiece.bpe.model**: Tokenization model used to segment sentences into subwords for `xlm-roberta-large`.
- **special_tokens_map.json**: Maps special tokens required by the tokenizer (e.g., `<pad>` for padding).
- **tokenizer.json**: Contains the serialized tokenizer (vocabulary and processing rules).
- **tokenizer_config.json**: Additional configuration settings for the tokenizer.
- **xlm_roberta_large.ipynb**: A Jupyter notebook for experimenting with and training the model.
- **xlm_roberta_large.py**: Python script for training and running evaluations outside of Jupyter.

## Project Overview

This project leverages `xlm-roberta-large`, a multilingual transformer model, fine-tuned for Azerbaijani Named Entity Recognition (NER). The model identifies various named entities, including persons, organizations, dates, etc., using a dataset specially designed for the Azerbaijani language.

## Table of Contents

1. [Setup and Dependencies](#setup-and-dependencies)
2. [Dataset](#dataset)
3. [Model Architecture](#model-architecture)
4. [Training Process](#training-process)
5. [Training Metrics and Results](#training-metrics-and-results)
6. [Evaluation and Detailed Metrics Explanation](#evaluation-and-detailed-metrics-explanation)
7. [Saving and Loading the Model](#saving-and-loading-the-model)
8. [Example Inference](#example-inference)
9. [Troubleshooting and Notes](#troubleshooting-and-notes)

---

## Setup and Dependencies

Install the required libraries:

```python
!pip install transformers datasets seqeval huggingface_hub
```

### Imports

The project requires `transformers`, `datasets`, `torch`, and `seqeval` libraries.

## Dataset

The Azerbaijani NER dataset includes entities such as **PERSON**, **LOCATION**, **ORGANISATION**, etc., and is hosted on Hugging Face.

```python
from datasets import load_dataset
dataset = load_dataset("LocalDoc/azerbaijani-ner-dataset")
```

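The code below treats the `tokens` and `ner_tags` columns as string-encoded lists, so a preprocessing function parses them into Python lists and converts the tags to integers: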

```python
import ast

def preprocess_example(example):
    # Parse the string-encoded lists into real Python lists
    example["tokens"] = ast.literal_eval(example["tokens"])
    example["ner_tags"] = list(map(int, ast.literal_eval(example["ner_tags"])))
    return example

dataset = dataset.map(preprocess_example)
```
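
To illustrate what this step does, the sketch below parses a single hypothetical row in which both columns are string-encoded lists, as the use of `ast.literal_eval` above implies. The row contents, the tag ids, and the helper name `parse_row` are made up for illustration and are not taken from the dataset.

```python
import ast

# Hypothetical raw row: both fields are strings that look like Python lists
raw_row = {
    "tokens": "['Shahla', 'Khuduyeva', 'Baku']",
    "ner_tags": "['1', '1', '2']",  # arbitrary tag ids, for illustration only
}

def parse_row(row):
    """Return a new dict with tokens as list[str] and ner_tags as list[int]."""
    return {
        "tokens": ast.literal_eval(row["tokens"]),
        "ner_tags": [int(tag) for tag in ast.literal_eval(row["ner_tags"])],
    }

parsed = parse_row(raw_row)
print(parsed["tokens"])    # a list of three strings
print(parsed["ner_tags"])  # a list of three integers
```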

## Model Architecture

The model is based on `xlm-roberta-large`, a multilingual transformer encoder, with a token-classification head on top that predicts one label per token.

### Tokenization and Label Alignment

A custom function `tokenize_and_align_labels` tokenizes input while aligning entity labels with tokens.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def tokenize_and_align_labels(example):
    tokenized_inputs = tokenizer(
        example["tokens"], truncation=True, is_split_into_words=True,
        padding="max_length", max_length=128
    )
    # Alignment code follows here...
    return tokenized_inputs
```
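
The alignment step elided above is commonly built on the `word_ids()` mapping that fast tokenizers return. The sketch below shows one such scheme on a plain list of word ids. It assumes that the first subword of each word keeps the word's label and that every other position (special tokens, padding, continuation subwords) gets `-100`, the label index that PyTorch's cross-entropy loss ignores by default. The function name `align_labels` and this labeling choice are assumptions, not necessarily what the training script does.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default

def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level labels onto subword tokens.

    word_ids has one entry per subword token: the index of the source word,
    or None for special tokens and padding.
    """
    aligned = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token or padding
        elif word_id != previous_word:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation subword
        previous_word = word_id
    return aligned

# Example: <s>, word 0 split into two subwords, word 1, </s>, one padding token
word_ids = [None, 0, 0, 1, None, None]
print(align_labels(word_ids, word_labels=[3, 0]))  # [-100, 3, -100, 0, -100, -100]
```

Inside `tokenize_and_align_labels`, `word_ids` would come from `tokenized_inputs.word_ids()` and the result would be stored as `tokenized_inputs["labels"]`.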

---

## Training Process

Training uses the Hugging Face `Trainer`, which handles the training loop, metrics computation, and model checkpointing.

### Model Initialization

```python
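from transformers import AutoModelForTokenClassification

# label_list holds the NER label names; it is not shown in this README
# and is assumed to be defined earlier in the training script
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=len(label_list)
)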
```

### Define Evaluation Metrics

Entity-level precision, recall, and F1-score, computed with `seqeval`, measure how well the model recognizes and classifies entities.

```python
from seqeval.metrics import precision_score, recall_score, f1_score, classification_report

def compute_metrics(p):
    # Metric computation code
    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
    }
```
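
The body of `compute_metrics` is elided above. A typical first step, sketched here in plain Python, is to take the arg-max of the logits at each position and drop the ignored positions before handing label strings to `seqeval`. The sketch assumes ignored positions are marked with `-100`; the helper name `decode_predictions` and the three-entry `label_list` are hypothetical.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100

def decode_predictions(
    logits: Sequence[Sequence[Sequence[float]]],
    label_ids: Sequence[Sequence[int]],
    label_list: Sequence[str],
) -> Tuple[List[List[str]], List[List[str]]]:
    """Convert per-token logits and gold label ids into lists of label strings."""
    true_labels, true_predictions = [], []
    for sent_logits, sent_labels in zip(logits, label_ids):
        gold, pred = [], []
        for token_logits, label_id in zip(sent_logits, sent_labels):
            if label_id == IGNORE_INDEX:
                continue  # skip special tokens, padding, continuation subwords
            best = max(range(len(token_logits)), key=token_logits.__getitem__)
            gold.append(label_list[label_id])
            pred.append(label_list[best])
        true_labels.append(gold)
        true_predictions.append(pred)
    return true_labels, true_predictions

label_list = ["O", "B-PERSON", "I-PERSON"]  # hypothetical label set
logits = [[[0.1, 0.2, 0.0], [0.1, 2.0, 0.3], [0.2, 0.1, 1.5], [3.0, 0.1, 0.1]]]
label_ids = [[-100, 1, 2, 0]]  # first position is ignored
gold_labels, pred_labels = decode_predictions(logits, label_ids, label_list)
print(gold_labels)  # [['B-PERSON', 'I-PERSON', 'O']]
print(pred_labels)  # [['B-PERSON', 'I-PERSON', 'O']]
```

The two returned lists have the nested list-of-strings shape that `seqeval` expects, so they can be passed as `precision_score(true_labels, true_predictions)` and likewise for recall and F1.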

### Training Configuration

```python
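from transformers import TrainingArguments

# Note: recent transformers releases name this argument eval_strategy instead of evaluation_strategy
training_args = TrainingArguments(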
    output_dir="./results", evaluation_strategy="epoch", save_strategy="epoch",
    learning_rate=2e-5, per_device_train_batch_size=128, per_device_eval_batch_size=128,
    num_train_epochs=12, weight_decay=0.005, fp16=True, logging_dir='./logs',
    save_total_limit=2, load_best_model_at_end=True, metric_for_best_model="f1", report_to="none"
)
```

---

## Training Metrics and Results

During training, the following metrics were recorded at the end of each epoch: training loss, validation loss, and precision, recall, and F1-score on the validation set.

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1-Score |
|-------|---------------|-----------------|-----------|--------|----------|
| 1     | 0.4075        | 0.2538          | 0.7689    | 0.7214 | 0.7444   |
| 2     | 0.2556        | 0.2497          | 0.7835    | 0.7245 | 0.7528   |
| 3     | 0.2144        | 0.2488          | 0.7509    | 0.7489 | 0.7499   |
| 4     | 0.1934        | 0.2571          | 0.7686    | 0.7404 | 0.7542   |
| 5     | 0.1698        | 0.2757          | 0.7458    | 0.7537 | 0.7497   |
| 6     | 0.1526        | 0.2881          | 0.7831    | 0.7284 | 0.7548   |
| 7     | 0.1443        | 0.3034          | 0.7585    | 0.7381 | 0.7481   |
| 8     | 0.1268        | 0.3113          | 0.7456    | 0.7509 | 0.7482   |
| 9     | 0.1194        | 0.3316          | 0.7393    | 0.7495 | 0.7444   |
| 10    | 0.1094        | 0.3448          | 0.7543    | 0.7372 | 0.7456   |
| 11    | 0.1029        | 0.3549          | 0.7519    | 0.7413 | 0.7466   |

Training loss falls steadily from 0.4075 to 0.1029, while validation loss reaches its minimum at epoch 3 (0.2488) and rises afterwards, a sign that the model begins to overfit. Validation F1 stays between 0.7444 and 0.7548 throughout and peaks at epoch 6.

---

## Evaluation and Detailed Metrics Explanation

After training, the model was evaluated across various entity types with the following results:

| Entity       | Precision | Recall | F1-score | Support |
|--------------|-----------|--------|----------|---------|
| ART          | 0.41      | 0.19   | 0.26     | 1828    |
| DATE         | 0.53      | 0.49   | 0.51     | 834     |
| EVENT        | 0.67      | 0.51   | 0.58     | 63      |
| FACILITY     | 0.74      | 0.68   | 0.71     | 1134    |
| LAW          | 0.62      | 0.58   | 0.60     | 1066    |
| LOCATION     | 0.81      | 0.79   | 0.80     | 8795    |
| MONEY        | 0.59      | 0.56   | 0.58     | 555     |
| ORGANISATION | 0.70      | 0.69   | 0.70     | 554     |
| PERCENTAGE   | 0.80      | 0.82   | 0.81     | 3502    |
| PERSON       | 0.90      | 0.82   | 0.86     | 7007    |
| PRODUCT      | 0.83      | 0.84   | 0.84     | 2624    |
| TIME         | 0.60      | 0.53   | 0.57     | 1584    |

### Explanation of Metrics

- **Precision**: The proportion of predicted entities that are correct. High precision means few false positives.
- **Recall**: The proportion of actual entities in the dataset that the model finds. High recall means few missed entities.
- **F1-Score**: The harmonic mean of precision and recall. It is high only when both are high, which makes it a useful single summary for NER, where false positives and missed entities both matter.
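
As a worked example of the harmonic mean, the short sketch below recomputes F1 from the precision and recall values of three rows in the per-entity table. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the per-entity table above
rows = {"PERSON": (0.90, 0.82), "LOCATION": (0.81, 0.79), "ART": (0.41, 0.19)}
f1_scores = {entity: round(f1_from(p, r), 2) for entity, (p, r) in rows.items()}
for entity, score in f1_scores.items():
    print(entity, score)
```

The results (0.86, 0.80, and 0.26) match the F1-score column. The ART row shows how a low recall pulls F1 well below the precision.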

## Saving and Loading the Model

Save the model and tokenizer for future use or further fine-tuning:

```python
save_directory = "./xlm-roberta-large"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
```

---

## Example Inference

Load the saved model and tokenizer, then run inference with the Hugging Face `pipeline` for NER.

```python
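import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModelForTokenClassification.from_pretrained(save_directory)
device = 0 if torch.cuda.is_available() else -1  # first GPU if available, otherwise CPU
nlp_ner = pipeline(
    "ner", model=model, tokenizer=tokenizer,
    aggregation_strategy="simple", device=device
)

# Example sentence
test_texts = ["Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."]

# Print each detected entity with its type and confidence score
for entity in nlp_ner(test_texts[0]):
    print(entity["entity_group"], entity["word"], f"{entity['score']:.3f}")

# evaluate_model is not defined in this README; it is assumed to be a helper
# from the training script that compares predictions with the expected labels
evaluate_model(test_texts, [["B-PERSON", "B-ORGANISATION"]])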
```

---

## Troubleshooting and Notes

- **Token Alignment**: Tokenization and label alignment must be carefully handled, especially when dealing with subwords.
- **Memory Management**: Adjust batch size if encountering GPU memory limitations.
- **Early Stopping**: Training uses early stopping with a patience of 5 epochs. It halts when validation performance has not improved for 5 consecutive evaluations, which limits overfitting.
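
The early-stopping rule can be illustrated in plain Python with the validation F1 values from the results table. This is a sketch of the general patience logic, assuming the monitored metric is validation F1 and that any strict improvement resets the counter. It is not the training script's code, and `stop_epoch` is a hypothetical helper.

```python
from typing import List, Optional

def stop_epoch(scores: List[float], patience: int) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Validation F1 per epoch, copied from the results table
f1_by_epoch = [0.7444, 0.7528, 0.7499, 0.7542, 0.7497, 0.7548,
               0.7481, 0.7482, 0.7444, 0.7456, 0.7466]
stopped_at = stop_epoch(f1_by_epoch, patience=5)
print(stopped_at)  # 11
```

With a patience of 5, the rule fires at epoch 11, five evaluations after the best F1 at epoch 6. This is consistent with the results table ending at epoch 11 even though `num_train_epochs=12`.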

This README covers setup, training, evaluation, and inference for the `xlm-roberta-large` NER model fine-tuned on Azerbaijani text, including per-epoch training metrics and per-entity evaluation results.