Commit · 6687578
Parent(s): 05c21b6
Update README.md
README.md CHANGED
@@ -21,15 +21,12 @@ widget:
 # A Named Entity Recognition Model for Kazakh
 - The model was inspired by the [LREC 2022](https://lrec2022.lrec-conf.org/en/) paper [*KazNERD: Kazakh Named Entity Recognition Dataset*](https://aclanthology.org/2022.lrec-1.44).
 - The original repository for the paper can be found at *https://github.com/IS2AI/KazNERD*.
-##
+## KazNERD (cleaned)
 While the original dataset contained tokens denoting speech disfluencies and hesitations (parenthesised) and background noise [bracketed], this model was trained on a version of the dataset where such tokens and duplicates were removed.
-As a result, the number of sentences, tokens, and named entities (NEs) in the cleaned dataset changed.
+As a result, the number of sentences, tokens, and named entities (NEs) in the cleaned dataset changed.
 
-
-| :---: | :---: | :---: | :---: | :---: |
-
-
-
-KazNERD (Cleaned) | Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
-KazNERD (Original)| NE | 109,342 (80.20%) | 13,483 (9.89%)| 13,508 (9.91%) | 136,333 (100%) |
-KazNERD (Cleaned) | NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |
+| Unit | Train | Valid | Test | Total |
+| :---: | :---: | :---: | :---: | :---: |
+| Sentence | 88,540 (80.00%) | 11,067 (10.00%) | 11,068 (10.00%) | 110,675 (100%) |
+| Token | 1,088,461 (80.04%) | 136,021 (10.00%) | 135,426 (9.96%) | 1,359,908 (100%) |
+| NE | 106,148 (80.17%) | 13,189 (9.96%) | 13,072 (9.87%) | 132,409 (100%) |
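
As a quick sanity check on the table added by this commit, the split percentages can be recomputed from the raw counts. The snippet below is only an illustrative check with the counts copied from the new table. The helper name `split_shares` is made up for this example.

```python
# Counts copied from the cleaned-dataset table added in this commit:
# (train, valid, test, total) per unit.
counts = {
    "Sentence": (88_540, 11_067, 11_068, 110_675),
    "Token": (1_088_461, 136_021, 135_426, 1_359_908),
    "NE": (106_148, 13_189, 13_072, 132_409),
}


def split_shares(train, valid, test, total):
    """Return each split's share of the total, in percent, rounded to 2 places."""
    if train + valid + test != total:
        raise ValueError("split counts do not add up to the total")
    return tuple(round(100 * n / total, 2) for n in (train, valid, test))


shares = {unit: split_shares(*row) for unit, row in counts.items()}
for unit, (train, valid, test) in shares.items():
    print(f"{unit}: train {train:.2f}%, valid {valid:.2f}%, test {test:.2f}%")
```

The three splits sum to the stated total for every unit, and the recomputed shares match the percentages printed in the table.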