## A Universal Dependencies parser built on top of a Transformer language model

Scores on the pre-tokenized test data:

```
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     99.70 |     99.77 |     99.73 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     99.62 |     99.61 |     99.61 |
UPOS       |     96.99 |     96.97 |     96.98 |     97.36
XPOS       |     93.65 |     93.64 |     93.65 |     94.01
UFeats     |     91.31 |     91.29 |     91.30 |     91.65
AllTags    |     86.86 |     86.85 |     86.86 |     87.19
Lemmas     |     95.83 |     95.81 |     95.82 |     96.19
UAS        |     89.01 |     89.00 |     89.00 |     89.35
LAS        |     85.72 |     85.70 |     85.71 |     86.04
CLAS       |     81.39 |     80.91 |     81.15 |     81.34
MLAS       |     69.21 |     68.81 |     69.01 |     69.17
BLEX       |     77.44 |     76.99 |     77.22 |     77.40
```

Scores on the untokenized test data:

```
Metric     | Precision |    Recall |  F1 Score | AligndAcc
-----------+-----------+-----------+-----------+-----------
Tokens     |     99.50 |     99.66 |     99.58 |
Sentences  |    100.00 |    100.00 |    100.00 |
Words      |     99.42 |     99.50 |     99.46 |
UPOS       |     96.80 |     96.88 |     96.84 |     97.37
XPOS       |     93.48 |     93.56 |     93.52 |     94.03
UFeats     |     91.13 |     91.20 |     91.16 |     91.66
AllTags    |     86.71 |     86.78 |     86.75 |     87.22
Lemmas     |     95.66 |     95.74 |     95.70 |     96.22
UAS        |     88.76 |     88.83 |     88.80 |     89.28
LAS        |     85.49 |     85.55 |     85.52 |     85.99
CLAS       |     81.19 |     80.73 |     80.96 |     81.31
MLAS       |     69.06 |     68.67 |     68.87 |     69.16
BLEX       |     77.28 |     76.84 |     77.06 |     77.39
```
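Each F1 value above is the harmonic mean of the corresponding precision and recall. As a quick sanity check in plain Python (values copied from the UPOS and LAS rows of the first table):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# UPOS row: P = 96.99, R = 96.97
print(round(f1(96.99, 96.97), 2))  # -> 96.98

# LAS row: P = 85.72, R = 85.70
print(round(f1(85.72, 85.70), 2))  # -> 85.71
```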

To use the model, you first need to set up COMBO, which makes it possible to use word embeddings from a pre-trained transformer model ([electra-base-igc-is](https://huggingface.co/Icelandic-lt/electra-base-igc-is)):

```bash
git submodule update --init --recursive
pip install -U pip setuptools wheel
pip install --index-url https://pypi.clarin-pl.eu/simple combo==1.0.5
```

* For Python 3.9, you might need to install Cython first:

```bash
pip install -U pip cython
```

* Then you can run the model as demonstrated in `parse_file.py`.
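A rough sketch of what such a script might look like. The model path below is a placeholder, and the class and attribute names follow COMBO 1.x's documented `combo.predict` interface, so check them against your installed version:

```python
from combo.predict import COMBO

# Load the trained parser from its archive (placeholder path; use this
# repository's model file in practice).
nlp = COMBO.from_pretrained("model.tar.gz")

# Parse a single sentence and print the predicted dependency analysis.
sentence = nlp("Hún las bókina.")
for token in sentence.tokens:
    print(token.id, token.token, token.upostag, token.head, token.deprel)
```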

For more instructions, see the [COMBO repository](https://gitlab.clarin-pl.eu/syntactic-tools/combo).
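The parser emits its analyses in CoNLL-U format: ten tab-separated columns per token, `#` comment lines, and a blank line between sentences. As a minimal sketch using only the standard library (the Icelandic sample sentence and its analysis are invented for illustration), the output can be read back like this:

```python
# The ten CoNLL-U columns, in order.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

def read_conllu(text):
    """Yield one sentence at a time, each as a list of token dicts."""
    sentence = []
    for line in text.splitlines():
        if line.startswith("#"):      # comment line, e.g. "# text = ..."
            continue
        if not line.strip():          # blank line ends a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:                      # final sentence without trailing blank line
        yield sentence

# Invented sample fragment in CoNLL-U layout.
sample = (
    "# text = Hún las bókina.\n"
    "1\tHún\thún\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tlas\tlesa\tVERB\t_\t_\t0\troot\t_\t_\n"
    "3\tbókina\tbók\tNOUN\t_\t_\t2\tobj\t_\t_\n"
)

for sent in read_conllu(sample):
    by_id = {tok["id"]: tok for tok in sent}
    for tok in sent:
        head = by_id.get(tok["head"], {"form": "ROOT"})["form"]
        print(f'{tok["form"]} --{tok["deprel"]}--> {head}')
```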

The `Tokenizer/` directory is a clone of [Miðeind's tokenizer](https://github.com/icelandic-lt/Tokenizer).

The directory `transformer_models/` contains the pretrained model [electra-base-igc-is](https://huggingface.co/Icelandic-lt/electra-base-igc-is), trained by Jón Friðrik Daðason, which supplies the parser with contextual word embeddings.


## License

This project is licensed under the [Apache License 2.0](https://opensource.org/licenses/Apache-2.0).