---
license: mit
language:
- ru
base_model:
- google-bert/bert-base-multilingual-cased
pipeline_tag: token-classification
library_name: transformers
---

# Russian Named Entity Recognition Model

## Model description

This model is a fine-tuned version of `bert-base-multilingual-cased` for Named Entity Recognition (NER) in Russian text. It tags fine-grained person-name and location entities (for example first names, cities, and streets) using the BIOLU tagging format.

## Intended uses & limitations

The model is designed to identify named entities in Russian text. It can be used for tasks such as information extraction, content analysis, and text preprocessing for downstream NLP tasks.

### How to use

Here's a simple example of how to use the model:

```python
from transformers import pipeline

ner_pipe = pipeline("ner", model="Gherman/bert-base-NER-Russian")

text = "Меня зовут Сергей Иванович из Москвы."
results = ner_pipe(text)

for result in results:
    print(f"Word: {result['word']}, Entity: {result['entity']}, Score: {result['score']:.4f}")
```
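
The pipeline call above returns one prediction per token. A minimal sketch of merging those predictions into entity spans by following the BIOLU prefixes could look like the following. It assumes each prediction carries `entity`, `start`, and `end` keys, as the Transformers token-classification pipeline returns with a fast tokenizer. `merge_biolu` is a hypothetical helper, and the sample predictions are hand-written for illustration, not actual model output.

```python
def merge_biolu(text, predictions):
    """Merge per-token BIOLU predictions into (entity_type, surface_text) pairs."""
    spans = []
    current = None  # (entity_type, start, end) of the span being built
    for pred in predictions:
        prefix, _, etype = pred["entity"].partition("-")
        if prefix == "U":
            # Single-token entity: emit it directly.
            spans.append((etype, pred["start"], pred["end"]))
            current = None
        elif prefix == "B":
            # Start a new multi-token entity.
            current = (etype, pred["start"], pred["end"])
        elif prefix in ("I", "L") and current and current[0] == etype:
            # Extend the open entity; "L" closes it.
            current = (etype, current[1], pred["end"])
            if prefix == "L":
                spans.append(current)
                current = None
        else:
            # "O" or an inconsistent tag sequence: drop any partial span.
            current = None
    return [(etype, text[start:end]) for etype, start, end in spans]


# Hand-written predictions (illustrative assumption, not real model output).
text = "Меня зовут Сергей Иванович из Москвы."
sample_predictions = [
    {"entity": "U-FIRST_NAME", "start": 11, "end": 17},
    {"entity": "U-MIDDLE_NAME", "start": 18, "end": 26},
    {"entity": "U-CITY", "start": 30, "end": 36},
]

for entity_type, surface in merge_biolu(text, sample_predictions):
    print(f"{entity_type}: {surface}")
```

If the model assigns tags per word piece rather than per word, the merged spans may need additional handling; that behavior is not documented here.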

### Limitations and bias

- The model's performance may vary depending on the domain and style of the input text.
- It may struggle with rare or complex entity names not seen during training.
- The model might exhibit biases present in the training data.

## Training data

The model was trained on [Detailed-NER-Dataset-RU by AlexKly](https://github.com/AlexKly/Detailed-NER-Dataset-RU). Check it out, the dataset is pretty good!

### Label Information

The dataset is labeled using the BIOLU format, where:

- B: Beginning token of an entity
- I: Inner token of an entity
- O: Other (non-entity) token
- L: Last token of an entity
- U: Unit token (single-token entity)

The following entity types are included in the dataset:

Location (LOC) tags:
- COUNTRY
- REGION
- CITY
- DISTRICT
- STREET
- HOUSE

Person (PER) tags:
- LAST_NAME
- FIRST_NAME
- MIDDLE_NAME

For example, a full tag might look like "B-CITY" for the beginning token of a city name, or "U-COUNTRY" for a single-token country name.
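
As a rough illustration of the tag structure described above, the sketch below splits a full tag into its position prefix and entity type and maps the type to its coarse group. `parse_tag` and the two type sets are hypothetical names introduced here; the sets simply mirror the lists above and are not read from the model's configuration.

```python
# Entity types copied from the lists above (not read from the model config).
LOC_TYPES = {"COUNTRY", "REGION", "CITY", "DISTRICT", "STREET", "HOUSE"}
PER_TYPES = {"LAST_NAME", "FIRST_NAME", "MIDDLE_NAME"}
POSITIONS = {"B", "I", "L", "U"}


def parse_tag(tag):
    """Split a BIOLU tag into (position, entity_type, coarse_group)."""
    if tag == "O":
        return ("O", None, None)
    position, _, entity_type = tag.partition("-")
    if position not in POSITIONS:
        raise ValueError(f"unknown position prefix in tag: {tag!r}")
    if entity_type in LOC_TYPES:
        group = "LOC"
    elif entity_type in PER_TYPES:
        group = "PER"
    else:
        raise ValueError(f"unknown entity type in tag: {tag!r}")
    return (position, entity_type, group)


for tag in ["B-CITY", "U-COUNTRY", "L-LAST_NAME", "O"]:
    print(tag, "->", parse_tag(tag))
```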

## Training procedure

The model was fine-tuned from the `bert-base-multilingual-cased` checkpoint using the Hugging Face Transformers library.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-5
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: AdamW (Adam with decoupled weight decay)
- lr_scheduler_type: linear
- num_epochs: 10
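
To make the `linear` scheduler setting concrete, here is a minimal sketch of how the learning rate would decay from 2e-5 to zero over training. It assumes no warmup steps, a single device, and no gradient accumulation, none of which are stated above. The dataset size of 1,600 examples is purely hypothetical, and `total_training_steps` and `linear_lr` are illustrative helpers rather than library functions.

```python
import math


def total_training_steps(num_examples, batch_size=16, num_epochs=10):
    """Number of optimizer steps, assuming one step per batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def linear_lr(step, total_steps, base_lr=2e-5):
    """Learning rate at a given step under linear decay to zero (no warmup assumed)."""
    remaining = max(0, total_steps - step)
    return base_lr * remaining / total_steps


# Hypothetical dataset size, chosen only to make the arithmetic easy to follow.
total = total_training_steps(num_examples=1600)
for step in (0, total // 2, total):
    print(f"step {step}: lr = {linear_lr(step, total):.2e}")
```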

### Framework versions

- Transformers 4.28.1
- PyTorch 1.13.0
- Datasets 2.12.0
- Tokenizers 0.13.3

## Evaluation results

The model achieves the following results on the evaluation set:

- Precision: 0.987843
- Recall: 0.988498
- F1 Score: 0.988170
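
As a quick consistency check, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the two reported values; the result agrees with the reported F1 score to six decimal places.

```python
precision = 0.987843
recall = 0.988498

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.6f}")
```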

## Ethical considerations

This model is intended for use in analyzing Russian text and should be used responsibly. Users should be aware of potential biases in the model's predictions and use the results judiciously, especially in applications that may impact individuals or groups.