Initial model

Files changed (9)

README.md +87 -0
config.json +65 -0
merges.txt +0 -0
pytorch_model.bin +3 -0
special_tokens_map.json +1 -0
tf_model.h5 +3 -0
tokenizer.json +0 -0
tokenizer_config.json +1 -0
vocab.json +0 -0

README.md ADDED

	@@ -0,0 +1,87 @@

+---
+language: is
+license: apache-2.0
+widget:
+ - text: "Kristin manneskja getur ekki lagt frásagnir af Jesú Kristi á hilluna vegna þess að hún sé búin að lesa þær ."
+ - text: "Til hvers að kjósa flokk , sem þykist vera Jafnaðarmannaflokkur rétt fyrir kosningar , þegar að það er hægt að kjósa sannnan jafnaðarmannaflokk , sjálfan Jafnaðarmannaflokk Íslands - Samfylkinguna ."
+ - text: "Það sannaðist svo eftirminnilega á plötunni Það þarf fólk eins og þig sem kom út fyrir þremur árum , en á henni hann Fálka úr Keflavík og Gáluna , son sinn , til að útsetja lög hans og spila inn ."
+ - text: "Lögin hafa áður komið út sem aukalög á smáskífum af Hail to the Thief , en á disknum er líka myndband og fleira efni fyrir tölvur ."
+ - text: "Britney gerði honum viðvart og hann ók henni á UCLA-sjúkrahúsið í Santa Monica en það er í nágrenni hljóðversins ."
+---
+# IcelandicNER RoBERTa
+This model was fine-tuned on the MIM-GOLD-NER dataset for the Icelandic language.
+The [MIM-GOLD-NER](http://hdl.handle.net/20.500.12537/42) corpus was developed at [Reykjavik University](https://en.ru.is/) in 2018–2020 and covers eight entity types:
+- Date
+- Location
+- Miscellaneous
+- Money
+- Organization
+- Percent
+- Person
+- Time
+## Dataset Information
+|       |   Records |   B-Date |   B-Location |   B-Miscellaneous |   B-Money |   B-Organization |   B-Percent |   B-Person |   B-Time |   I-Date |   I-Location |   I-Miscellaneous |   I-Money |   I-Organization |   I-Percent |   I-Person |   I-Time |
+|:------|----------:|---------:|-------------:|------------------:|----------:|-----------------:|------------:|-----------:|---------:|---------:|-------------:|------------------:|----------:|-----------------:|------------:|-----------:|---------:|
+| Train |     39988 |     3409 |         5980 |              4351 |       729 |             5754 |         502 |      11719 |      868 |     2112 |          516 |              3036 |       770 |             2382 |          50 |       5478 |      790 |
+| Valid |      7063 |      570 |         1034 |               787 |       100 |             1078 |         103 |       2106 |      147 |      409 |           76 |               560 |       104 |              458 |           7 |        998 |      136 |
+| Test  |      8299 |      779 |         1319 |               935 |       153 |             1315 |         108 |       2247 |      172 |      483 |          104 |               660 |       167 |              617 |          10 |       1089 |      158 |
+## Evaluation
+The following table summarizes the model's scores on the test split, overall and per class.
+|     entity    | precision |  recall  | f1-score | support |
+|:-------------:|:---------:|:--------:|:--------:|:-------:|
+|      Date     |  0.961881 | 0.971759 | 0.966794 |  779.0  |
+|    Location   |  0.963047 | 0.968158 | 0.965595 |  1319.0 |
+| Miscellaneous |  0.884946 | 0.880214 | 0.882574 |  935.0  |
+|     Money     |  0.980132 | 0.967320 | 0.973684 |  153.0  |
+|  Organization |  0.924300 | 0.928517 | 0.926404 |  1315.0 |
+|    Percent    |  1.000000 | 1.000000 | 1.000000 |  108.0  |
+|     Person    |  0.978591 | 0.976413 | 0.977501 |  2247.0 |
+|      Time     |  0.965116 | 0.965116 | 0.965116 |  172.0  |
+|   micro avg   |  0.951258 | 0.952476 | 0.951866 |  7028.0 |
+|   macro avg   |  0.957252 | 0.957187 | 0.957209 |  7028.0 |
+|  weighted avg |  0.951237 | 0.952476 | 0.951849 |  7028.0 |
+## How To Use
+You can use this model with the Transformers pipeline for NER.
+### Installing requirements
+```bash
+pip install transformers
+```
+### How to predict using pipeline
+```python
+from transformers import AutoTokenizer
+from transformers import AutoModelForTokenClassification  # for PyTorch
+from transformers import TFAutoModelForTokenClassification  # for TensorFlow
+from transformers import pipeline
+
+# Load the tokenizer and the fine-tuned token-classification model from the Hub.
+model_name_or_path = "m3hrdadfi/icelandic-ner-roberta"
+tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+model = AutoModelForTokenClassification.from_pretrained(model_name_or_path)  # PyTorch
+# model = TFAutoModelForTokenClassification.from_pretrained(model_name_or_path)  # TensorFlow
+
+# Build the NER pipeline and tag one pre-tokenized example sentence.
+nlp = pipeline("ner", model=model, tokenizer=tokenizer)
+example = "Kristin manneskja getur ekki lagt frásagnir af Jesú Kristi á hilluna vegna þess að hún sé búin að lesa þær ."
+ner_results = nlp(example)
+print(ner_results)
+```
+## Questions?
+Open a GitHub issue in the [IcelandicNER](https://github.com/m3hrdadfi/icelandic-ner/issues) repository.
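
Note on the README usage snippet: the `ner` pipeline returns one prediction per token, each carrying a `B-`/`I-` tag and character offsets. The sketch below shows one way such predictions could be merged into entity spans. It is not part of this commit: `group_bio` is a hypothetical helper, the input is a hand-written toy example rather than real model output, and the rule that an `I-` token must start at most one character after the previous token is an assumption made for illustration.

```python
from typing import Dict, List


def group_bio(text: str, tokens: List[Dict]) -> List[Dict]:
    """Merge token-level B-/I- predictions into entity spans.

    Each token dict needs "entity" (e.g. "B-Person"), "start" and "end"
    (character offsets into `text`), which are the keys the Transformers
    NER pipeline emits per token.
    """
    spans: List[Dict] = []
    for tok in tokens:
        if tok["entity"] == "O":
            continue  # not part of any entity
        prefix, _, label = tok["entity"].partition("-")
        continues_previous = (
            prefix == "I"
            and spans
            and spans[-1]["label"] == label
            # Assumption: a continuation starts right after the previous
            # token, separated by at most one character (a space).
            and tok["start"] - spans[-1]["end"] <= 1
        )
        if continues_previous:
            spans[-1]["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


# Hand-written toy input (NOT real model output).
text = "Britney í Santa Monica"
tokens = [
    {"entity": "B-Person", "start": 0, "end": 7},
    {"entity": "B-Location", "start": 10, "end": 15},
    {"entity": "I-Location", "start": 16, "end": 22},
]
spans = group_bio(text, tokens)
print(spans)
```

Recent Transformers releases also accept an `aggregation_strategy` argument on the `ner` pipeline that performs a similar grouping; the helper above only makes the idea explicit.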

config.json ADDED

	@@ -0,0 +1,65 @@

+{
+  "_name_or_path": "mideind/IceBERT",
+  "architectures": [
+    "RobertaForTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "bos_token_id": 0,
+  "eos_token_id": 2,
+  "finetuning_task": "ner",
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "O",
+    "1": "B-Date",
+    "2": "B-Location",
+    "3": "B-Miscellaneous",
+    "4": "B-Money",
+    "5": "B-Organization",
+    "6": "B-Percent",
+    "7": "B-Person",
+    "8": "B-Time",
+    "9": "I-Date",
+    "10": "I-Location",
+    "11": "I-Miscellaneous",
+    "12": "I-Money",
+    "13": "I-Organization",
+    "14": "I-Percent",
+    "15": "I-Person",
+    "16": "I-Time"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "B-Date": 1,
+    "B-Location": 2,
+    "B-Miscellaneous": 3,
+    "B-Money": 4,
+    "B-Organization": 5,
+    "B-Percent": 6,
+    "B-Person": 7,
+    "B-Time": 8,
+    "I-Date": 9,
+    "I-Location": 10,
+    "I-Miscellaneous": 11,
+    "I-Money": 12,
+    "I-Organization": 13,
+    "I-Percent": 14,
+    "I-Person": 15,
+    "I-Time": 16,
+    "O": 0
+  },
+  "layer_norm_eps": 1e-05,
+  "max_position_embeddings": 514,
+  "model_type": "roberta",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 1,
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.7.0.dev0",
+  "type_vocab_size": 1,
+  "use_cache": true,
+  "vocab_size": 50000
+}
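
Note on the label maps in this config: `id2label` and `label2id` follow a regular layout, with `O` at id 0, the eight `B-` tags at ids 1 to 8 and the eight `I-` tags at ids 9 to 16, in alphabetical order of entity type. The sketch below rebuilds both maps from the entity list and checks that they are inverses of each other. `build_label_maps` and `maps_are_inverse` are hypothetical helper names introduced here for illustration; they are not part of the commit or of Transformers.

```python
from typing import Dict, List, Tuple

# Entity types as listed in the README, in the order used by config.json.
ENTITY_TYPES = [
    "Date", "Location", "Miscellaneous", "Money",
    "Organization", "Percent", "Person", "Time",
]


def build_label_maps(entity_types: List[str]) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Return (id2label, label2id) in the layout seen in config.json."""
    labels = ["O"]
    labels += [f"B-{t}" for t in entity_types]
    labels += [f"I-{t}" for t in entity_types]
    # JSON object keys are strings, so id2label is keyed by "0", "1", ...
    id2label = {str(i): label for i, label in enumerate(labels)}
    label2id = {label: i for i, label in enumerate(labels)}
    return id2label, label2id


def maps_are_inverse(id2label: Dict[str, str], label2id: Dict[str, int]) -> bool:
    """Check that every id maps to a label that maps back to the same id."""
    return len(id2label) == len(label2id) and all(
        label2id.get(label) == int(idx) for idx, label in id2label.items()
    )


id2label, label2id = build_label_maps(ENTITY_TYPES)
print(len(id2label), id2label["7"], label2id["I-Time"])
print(maps_are_inverse(id2label, label2id))
```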

merges.txt ADDED

The diff for this file is too large to render. See raw diff

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b9eb58d426b82ac19564c792f4cc1832e90354935a7976841a1a68c6b7983770
+size 495548151
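
Note on the three added lines above: they are a Git LFS pointer, not the weights themselves. Each line is a key followed by a space and a value. The sketch below parses the pointer text and reports the size of the real file; `parse_lfs_pointer` is a hypothetical helper written for illustration and is not a Git LFS or Hub API.

```python
from typing import Dict, Union

# Pointer text copied from the pytorch_model.bin diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:b9eb58d426b82ac19564c792f4cc1832e90354935a7976841a1a68c6b7983770
size 495548151
"""


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Split a Git LFS pointer into version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "oid_algo": algo,
        "oid": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


info = parse_lfs_pointer(POINTER)
print(info["oid_algo"], f"{info['size'] / 2**20:.1f} MiB")
```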

special_tokens_map.json ADDED

	@@ -0,0 +1 @@

+ {"bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true}}

tf_model.h5 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:453d2d3a42efb71ed0cb0b0fd497994e3bf9df1988ca6b69c0a96b835e566b63
+size 495746192

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1 @@

+ {"unk_token": {"content": "<unk>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "add_prefix_space": true, "errors": "replace", "sep_token": {"content": "</s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "cls_token": {"content": "<s>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "pad_token": {"content": "<pad>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "mask_token": {"content": "<mask>", "single_word": false, "lstrip": true, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "special_tokens_map_file": "/content/cache/b21a20c1d1a8c4ce0f3f9b2a311ea6fa001eaaaee064c36040b1c5885cdc73f0.cb2244924ab24d706b02fd7fcedaea4531566537687a539ebb94db511fd122a0", "name_or_path": "mideind/IceBERT"}

vocab.json ADDED

The diff for this file is too large to render. See raw diff