# Azerbaijani Named Entity Recognition (NER) with XLM-RoBERTa Large

**Repository on Hugging Face**: [IsmatS/xlm_roberta_large_az_ner](https://huggingface.co/IsmatS/xlm_roberta_large_az_ner)

**Repository on GitHub**: [Named Entity Recognition](https://github.com/Ismat-Samadov/Named_Entity_Recognition)

## Project Overview

This project fine-tunes `xlm-roberta-large`, a multilingual transformer model, for Azerbaijani Named Entity Recognition (NER). The model identifies named entities such as persons, organizations, and dates, using a dataset built specifically for the Azerbaijani language.

## Table of Contents

1. [Setup and Dependencies](#setup-and-dependencies)
2. [Dataset](#dataset)
3. [Model Architecture](#model-architecture)
4. [Training Process](#training-process)
5. [Training Metrics and Results](#training-metrics-and-results)
6. [Evaluation and Detailed Metrics Explanation](#evaluation-and-detailed-metrics-explanation)
7. [Saving and Loading the Model](#saving-and-loading-the-model)
8. [Example Inference](#example-inference)
9. [Troubleshooting and Notes](#troubleshooting-and-notes)

---

## Setup and Dependencies

Install the required libraries:

```bash
pip install transformers datasets seqeval huggingface_hub
```

### Imports

The project requires the `transformers`, `datasets`, `torch`, and `seqeval` libraries.
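
For convenience, a single import block covering everything used in the snippets below might look like this (a sketch; trim it to the steps you actually run):

```python
import ast                      # parses the string-encoded token/tag lists in the dataset
import numpy as np
import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
    pipeline,
)
from seqeval.metrics import precision_score, recall_score, f1_score
```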

## Dataset

The Azerbaijani NER dataset covers entity types such as **PERSON**, **LOCATION**, and **ORGANISATION**, and is hosted on Hugging Face.

```python
from datasets import load_dataset

dataset = load_dataset("LocalDoc/azerbaijani-ner-dataset")
```
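
To see why preprocessing is needed, it can help to inspect a raw record first (assuming the usual `train` split; the exact schema is the dataset's own):

```python
# Peek at one raw example; "tokens" and "ner_tags" arrive as string-encoded lists
print(dataset["train"][0])
```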

Since the tokens and NER tags are stored as string-encoded lists, a preprocessing function parses them into proper Python objects:

```python
import ast

def preprocess_example(example):
    # Parse the string-encoded lists back into real lists
    example["tokens"] = ast.literal_eval(example["tokens"])
    example["ner_tags"] = list(map(int, ast.literal_eval(example["ner_tags"])))
    return example

dataset = dataset.map(preprocess_example)
```

## Model Architecture

The model is based on `xlm-roberta-large`, a multilingual transformer suited to processing text in many languages, including Azerbaijani.

### Tokenization and Label Alignment

A custom function, `tokenize_and_align_labels`, tokenizes the input while aligning entity labels with the resulting subword tokens; the alignment loop below follows the standard first-subword labeling scheme.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

def tokenize_and_align_labels(example):
    tokenized_inputs = tokenizer(
        example["tokens"], truncation=True, is_split_into_words=True,
        padding="max_length", max_length=128
    )
    # Label only the first subword of each word; mask special tokens and
    # remaining subwords with -100 so the loss ignores them.
    labels, prev_word_id = [], None
    for word_id in tokenized_inputs.word_ids():
        masked = word_id is None or word_id == prev_word_id
        labels.append(-100 if masked else example["ner_tags"][word_id])
        prev_word_id = word_id
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
```
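
Applying the function across the whole dataset is then a single `map` call (a sketch; the name `tokenized_dataset` is illustrative):

```python
# Tokenize and align labels for every split in one pass
tokenized_dataset = dataset.map(tokenize_and_align_labels)
```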

---

## Training Process

Training uses the Hugging Face `Trainer`, which handles the training loop, metrics computation, and model checkpointing.

### Model Initialization

```python
from transformers import AutoModelForTokenClassification

# label_list holds the string names of the BIO tags (defined from the dataset)
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=len(label_list)
)
```
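
For reference, `label_list` would be built from the dataset's tag inventory. A hypothetical construction under the standard BIO scheme, using the entity types from the evaluation table further below (the order here is illustrative and must match the dataset's integer tag ids):

```python
entity_types = ["PERSON", "LOCATION", "ORGANISATION", "DATE", "TIME", "MONEY",
                "PERCENTAGE", "FACILITY", "PRODUCT", "EVENT", "ART", "LAW"]
# "O" for non-entity tokens, then B-/I- pairs for each type
label_list = ["O"] + [f"{p}-{t}" for t in entity_types for p in ("B", "I")]
```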

### Define Evaluation Metrics

The following metrics evaluate how accurately the model recognizes and classifies entities; `seqeval` scores them at the entity level rather than per token.

```python
import numpy as np
from seqeval.metrics import precision_score, recall_score, f1_score

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    # Drop positions masked with -100 and map ids back to tag names
    true_predictions = [[label_list[pr] for pr, la in zip(pred, lab) if la != -100]
                        for pred, lab in zip(predictions, labels)]
    true_labels = [[label_list[la] for pr, la in zip(pred, lab) if la != -100]
                   for pred, lab in zip(predictions, labels)]
    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
    }
```

### Training Configuration

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results", evaluation_strategy="epoch", save_strategy="epoch",
    learning_rate=2e-5, per_device_train_batch_size=128, per_device_eval_batch_size=128,
    num_train_epochs=12, weight_decay=0.005, fp16=True, logging_dir="./logs",
    save_total_limit=2, load_best_model_at_end=True, metric_for_best_model="f1",
    report_to="none"
)
```
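
The pieces come together in a `Trainer`. A minimal sketch, assuming the tokenized splits live in `tokenized_dataset` (an illustrative name) and wiring in the early stopping with a patience of 5 epochs mentioned in the notes at the end:

```python
from transformers import Trainer, EarlyStoppingCallback

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],       # illustrative split names
    eval_dataset=tokenized_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```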

---

## Training Metrics and Results

Metrics were recorded at each epoch: `Training Loss` on the training set, and `Validation Loss`, `Precision`, `Recall`, and `F1-Score` on the validation set.

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1-Score |
|-------|---------------|-----------------|-----------|--------|----------|
| 1     | 0.4075        | 0.2538          | 0.7689    | 0.7214 | 0.7444   |
| 2     | 0.2556        | 0.2497          | 0.7835    | 0.7245 | 0.7528   |
| 3     | 0.2144        | 0.2488          | 0.7509    | 0.7489 | 0.7499   |
| 4     | 0.1934        | 0.2571          | 0.7686    | 0.7404 | 0.7542   |
| 5     | 0.1698        | 0.2757          | 0.7458    | 0.7537 | 0.7497   |
| 6     | 0.1526        | 0.2881          | 0.7831    | 0.7284 | 0.7548   |
| 7     | 0.1443        | 0.3034          | 0.7585    | 0.7381 | 0.7481   |
| 8     | 0.1268        | 0.3113          | 0.7456    | 0.7509 | 0.7482   |
| 9     | 0.1194        | 0.3316          | 0.7393    | 0.7495 | 0.7444   |
| 10    | 0.1094        | 0.3448          | 0.7543    | 0.7372 | 0.7456   |
| 11    | 0.1029        | 0.3549          | 0.7519    | 0.7413 | 0.7466   |

Training loss falls steadily, but validation loss bottoms out at epoch 3 and rises afterwards, while validation F1 peaks at 0.7548 in epoch 6. This divergence signals overfitting in the later epochs, which is why `load_best_model_at_end` and early stopping matter here.

---

## Evaluation and Detailed Metrics Explanation

After training, the model was evaluated per entity type, with the following results:

| Entity       | Precision | Recall | F1-score | Support |
|--------------|-----------|--------|----------|---------|
| ART          | 0.41      | 0.19   | 0.26     | 1828    |
| DATE         | 0.53      | 0.49   | 0.51     | 834     |
| EVENT        | 0.67      | 0.51   | 0.58     | 63      |
| FACILITY     | 0.74      | 0.68   | 0.71     | 1134    |
| LAW          | 0.62      | 0.58   | 0.60     | 1066    |
| LOCATION     | 0.81      | 0.79   | 0.80     | 8795    |
| MONEY        | 0.59      | 0.56   | 0.58     | 555     |
| ORGANISATION | 0.70      | 0.69   | 0.70     | 554     |
| PERCENTAGE   | 0.80      | 0.82   | 0.81     | 3502    |
| PERSON       | 0.90      | 0.82   | 0.86     | 7007    |
| PRODUCT      | 0.83      | 0.84   | 0.84     | 2624    |
| TIME         | 0.60      | 0.53   | 0.57     | 1584    |

### Explanation of Metrics

- **Precision**: The proportion of predicted entities that are correct. High precision matters in NER because it keeps false positives down, so that spans labeled as entities really are entities.
- **Recall**: The proportion of actual entities in the data that the model finds. High recall matters because it keeps the model from missing entities that are present.
- **F1-Score**: The harmonic mean of precision and recall, balancing the two; a worked example follows below.
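
For instance, the PERSON row's F1 follows directly from its precision and recall: 2 × (0.90 × 0.82) / (0.90 + 0.82) ≈ 0.86, matching the table.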

## Saving and Loading the Model

Save the model and tokenizer for future use or further fine-tuning:

```python
save_directory = "./xlm-roberta-large"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
```

---

## Example Inference

Load the saved model and run it through the Hugging Face `pipeline` API for NER:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModelForTokenClassification.from_pretrained(save_directory)
device = 0 if torch.cuda.is_available() else -1
nlp_ner = pipeline(
    "ner", model=model, tokenizer=tokenizer,
    aggregation_strategy="simple", device=device
)

# Example sentence (expected entities: PERSON and ORGANISATION)
text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."
print(nlp_ner(text))
```
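
With `aggregation_strategy="simple"`, the pipeline merges subword pieces, so each prediction comes back as a dict with an `entity_group` (e.g. `PERSON`), a confidence `score`, the matched `word`, and its character `start`/`end` offsets.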

---

## Troubleshooting and Notes

- **Token Alignment**: Tokenization and label alignment must be handled carefully, especially where words split into multiple subwords (see `tokenize_and_align_labels` above).
- **Memory Management**: Reduce the batch size if you hit GPU out-of-memory errors.
- **Early Stopping**: Configured with a patience of 5 epochs, halting training automatically if validation performance plateaus.

This README covers setup, training, evaluation, and inference for the `xlm-roberta-large` NER model fine-tuned on Azerbaijani text, together with per-entity results and an explanation of the metrics used.