Update Space (evaluate main: 828c6327)
README.md
CHANGED
@@ -1,12 +1,182 @@
---
title:
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
- evaluate
- metric
---

## Metric description

CoVal is a coreference evaluation tool for the [CoNLL](https://huggingface.co/datasets/conll2003) and [ARRAU](https://catalog.ldc.upenn.edu/LDC2013T22) datasets which implements the common evaluation metrics, including MUC [Vilain et al, 1995](https://aclanthology.org/M95-1005.pdf), B-cubed [Bagga and Baldwin, 1998](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.34.2578&rep=rep1&type=pdf), CEAFe [Luo et al., 2005](https://aclanthology.org/H05-1004.pdf), LEA [Moosavi and Strube, 2016](https://aclanthology.org/P16-1060.pdf) and the averaged CoNLL score (the average of the F1 values of MUC, B-cubed and CEAFe).

CoVal code was written by [`@ns-moosavi`](https://github.com/ns-moosavi), with some parts borrowed from [Deep Coref](https://github.com/clarkkev/deep-coref/blob/master/evaluation.py). The test suite is taken from the [official CoNLL code](https://github.com/conll/reference-coreference-scorers/), with additions by [`@andreasvc`](https://github.com/andreasvc) and file parsing developed by Leo Born.

## How to use

The metric takes two lists of sentences as input, one for `predictions` and one for `references`, with each sentence consisting of words in the CoNLL format (see the [Limitations and bias](#Limitations-and-bias) section below for more details on the CoNLL format).

```python
from evaluate import load

coval = load('coval')
words = ['bc/cctv/00/cctv_0005 0 0 Thank VBP (TOP(S(VP* thank 01 1 Xu_li * (V*) * -',
         'bc/cctv/00/cctv_0005 0 1 you PRP (NP*) - - - Xu_li * (ARG1*) (ARG0*) (116)',
         'bc/cctv/00/cctv_0005 0 2 everyone NN (NP*) - - - Xu_li * (ARGM-DIS*) * (116)',
         'bc/cctv/00/cctv_0005 0 3 for IN (PP* - - - Xu_li * (ARG2* * -',
         'bc/cctv/00/cctv_0005 0 4 watching VBG (S(VP*)))) watch 01 1 Xu_li * *) (V*) -',
         'bc/cctv/00/cctv_0005 0 5 . . *)) - - - Xu_li * * * -']
references = [words]
predictions = [words]
results = coval.compute(predictions=predictions, references=references)
```

It also has several optional arguments:

`keep_singletons`: After extracting all mentions from the key or system files, mentions whose corresponding coreference chain is of size one are considered singletons. The default evaluation mode includes singletons in the evaluation if they are present in the key or system files. By setting `keep_singletons=False`, all singletons in the key and system files are excluded from the evaluation.

`NP_only`: Most recent coreference resolvers only resolve NP mentions and leave out the resolution of VPs. By setting the `NP_only` option, the scorer will only evaluate the resolution of NPs.

`min_span`: By setting `min_span`, the scorer reports the results based on automatically detected minimum spans. Minimum spans are determined using the [MINA algorithm](https://arxiv.org/pdf/1906.06703.pdf).
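
A minimal sketch (continuing the snippet above) of how these options are passed to `compute`; the resulting scores depend on the data, so no outputs are shown here, and note that `min_span` additionally requires gold parse annotation in the reference lines:

```python
# Sketch: the optional arguments are plain keyword arguments of compute().
results_no_singletons = coval.compute(
    predictions=predictions,
    references=references,
    keep_singletons=False,  # exclude singleton mentions from key and system files
)
results_np_only = coval.compute(
    predictions=predictions,
    references=references,
    NP_only=True,  # only score the resolution of NP mentions
)
# min_span=True reports scores on automatically detected minimum spans and
# requires gold parse annotation in the references:
# results_min_span = coval.compute(predictions=predictions, references=references, min_span=True)
```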

## Output values

The metric outputs a dictionary with the following key-value pairs:

`mentions`: mention detection scores (recall, precision and F1), each ranging from 0 to 1.

`muc`: MUC metric, which expresses performance in terms of recall and precision, ranging from 0 to 1.

`bcub`: B-cubed metric, which is the averaged precision of all items in the distribution, ranging from 0 to 1.

`ceafe`: CEAFe (Constrained Entity Alignment F-Measure), computed by aligning reference and system entities with the constraint that a reference entity is aligned with at most one system entity. It ranges from 0 to 1.

`lea`: LEA is a Link-Based Entity-Aware metric which, for each entity, considers how important the entity is and how well it is resolved. It ranges from 0 to 1.

`conll_score`: averaged CoNLL score (the average of the F1 values of `muc`, `bcub` and `ceafe`), ranging from 0 to 100.
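
Each of `mentions`, `muc`, `bcub`, `ceafe` and `lea` is reported as three flat keys in the dictionary (`<name>/recall`, `<name>/precision`, `<name>/f1`), as the example output further below shows. A small sketch of reading the scores back out of the `results` dictionary computed in the usage snippet above:

```python
# Sketch: each metric is exposed as <name>/recall, <name>/precision and <name>/f1.
muc_f1 = results["muc/f1"]
bcub_f1 = results["bcub/f1"]
ceafe_f1 = results["ceafe/f1"]

# conll_score is the mean of these three F1 values, scaled to 0-100.
print((muc_f1 + bcub_f1 + ceafe_f1) / 3 * 100, results["conll_score"])
```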

### Values from popular papers

Given that many of the metrics returned by CoVal come from different sources, it is hard to cite reference values for all of them.

The CoNLL score is used to track progress on different datasets such as the [ARRAU corpus](https://paperswithcode.com/sota/coreference-resolution-on-the-arrau-corpus) and [CoNLL 2012](https://paperswithcode.com/sota/coreference-resolution-on-conll-2012).

## Examples

Maximal values

```python
from evaluate import load

coval = load('coval')
words = ['bc/cctv/00/cctv_0005 0 0 Thank VBP (TOP(S(VP* thank 01 1 Xu_li * (V*) * -',
         'bc/cctv/00/cctv_0005 0 1 you PRP (NP*) - - - Xu_li * (ARG1*) (ARG0*) (116)',
         'bc/cctv/00/cctv_0005 0 2 everyone NN (NP*) - - - Xu_li * (ARGM-DIS*) * (116)',
         'bc/cctv/00/cctv_0005 0 3 for IN (PP* - - - Xu_li * (ARG2* * -',
         'bc/cctv/00/cctv_0005 0 4 watching VBG (S(VP*)))) watch 01 1 Xu_li * *) (V*) -',
         'bc/cctv/00/cctv_0005 0 5 . . *)) - - - Xu_li * * * -']
references = [words]
predictions = [words]
results = coval.compute(predictions=predictions, references=references)
print(results)
# {'mentions/recall': 1.0, 'mentions/precision': 1.0, 'mentions/f1': 1.0, 'muc/recall': 1.0, 'muc/precision': 1.0, 'muc/f1': 1.0, 'bcub/recall': 1.0, 'bcub/precision': 1.0, 'bcub/f1': 1.0, 'ceafe/recall': 1.0, 'ceafe/precision': 1.0, 'ceafe/f1': 1.0, 'lea/recall': 1.0, 'lea/precision': 1.0, 'lea/f1': 1.0, 'conll_score': 100.0}
```
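
For contrast, a small sketch (not from the original card) of scoring a system output that disagrees with the reference on one coreference annotation; the exact scores depend on the disagreement, so none are claimed here:

```python
# Sketch: same reference as above, but the system output drops the coreference
# annotation on one word (last column changed from '(116)' to '-').
sys_words = list(words)
sys_words[1] = 'bc/cctv/00/cctv_0005 0 1 you PRP (NP*) - - - Xu_li * (ARG1*) (ARG0*) -'
results = coval.compute(predictions=[sys_words], references=[words])
print(results)  # recall-oriented scores are expected to fall below 1.0
```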

## Limitations and bias

This wrapper of CoVal currently only works with the [CoNLL line format](https://huggingface.co/datasets/conll2003), which has one word per line, with all the annotations for that word in columns separated by spaces (a short sketch after the table shows how these columns map onto one of the example lines):

| Column | Type                  | Description |
|:-------|:----------------------|:------------|
| 1      | Document ID           | This is a variation on the document filename. |
| 2      | Part number           | Some files are divided into multiple parts numbered as 000, 001, 002, etc. |
| 3      | Word number           | |
| 4      | Word                  | This is the token as segmented/tokenized in the Treebank. Initially the *_skel file contains the placeholder [WORD], which gets replaced by the actual token from the Treebank, which is part of the OntoNotes release. |
| 5      | Part-of-Speech        | |
| 6      | Parse bit             | This is the bracketed structure broken before the first open parenthesis in the parse, with the word/part-of-speech leaf replaced with a *. The full parse can be created by substituting the asterisk with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column. |
| 7      | Predicate lemma       | The predicate lemma is given for the rows for which we have semantic role information. All other rows are marked with a "-". |
| 8      | Predicate Frameset ID | This is the PropBank frameset ID of the predicate in Column 7. |
| 9      | Word sense            | This is the word sense of the word in Column 3. |
| 10     | Speaker/Author        | This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data. |
| 11     | Named Entities        | These columns identify the spans representing various named entities. |
| 12:N   | Predicate Arguments   | There is one column of predicate argument structure information for each predicate mentioned in Column 7. |
| N      | Coreference           | Coreference chain information encoded in a parenthesis structure. |
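
To make the column layout concrete, here is a small sketch (not part of the metric itself) that splits one of the example lines from the usage snippet into its columns; per the argument description in the implementation, only the word, POS, parse bit and final coreference columns are actually used for scoring:

```python
# Sketch: inspecting the columns of one CoNLL-formatted line
# (one of the example lines used in the snippets above).
line = 'bc/cctv/00/cctv_0005 0 1 you PRP (NP*) - - - Xu_li * (ARG1*) (ARG0*) (116)'
cols = line.split()

print(cols[0])   # Document ID:    bc/cctv/00/cctv_0005
print(cols[1])   # Part number:    0
print(cols[2])   # Word number:    1
print(cols[3])   # Word:           you
print(cols[4])   # Part-of-Speech: PRP
print(cols[5])   # Parse bit:      (NP*)
print(cols[-1])  # Coreference:    (116)
```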

## Citations

```bibtex
@InProceedings{moosavi2019minimum,
  author    = {Nafise Sadat Moosavi, Leo Born, Massimo Poesio and Michael Strube},
  title     = {Using Automatically Extracted Minimum Spans to Disentangle Coreference Evaluation from Boundary Detection},
  year      = {2019},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  publisher = {Association for Computational Linguistics},
  address   = {Florence, Italy},
}
```

```bibtex
@inproceedings{10.3115/1072399.1072405,
  author    = {Vilain, Marc and Burger, John and Aberdeen, John and Connolly, Dennis and Hirschman, Lynette},
  title     = {A Model-Theoretic Coreference Scoring Scheme},
  year      = {1995},
  isbn      = {1558604022},
  publisher = {Association for Computational Linguistics},
  address   = {USA},
  url       = {https://doi.org/10.3115/1072399.1072405},
  doi       = {10.3115/1072399.1072405},
  booktitle = {Proceedings of the 6th Conference on Message Understanding},
  pages     = {45--52},
  numpages  = {8},
  location  = {Columbia, Maryland},
  series    = {MUC6 ’95}
}
```

```bibtex
@INPROCEEDINGS{Bagga98algorithmsfor,
  author    = {Amit Bagga and Breck Baldwin},
  title     = {Algorithms for Scoring Coreference Chains},
  booktitle = {The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference},
  year      = {1998},
  pages     = {563--566}
}
```

```bibtex
@INPROCEEDINGS{Luo05oncoreference,
  author    = {Xiaoqiang Luo},
  title     = {On coreference resolution performance metrics},
  booktitle = {Proceedings of HLT/EMNLP},
  year      = {2005},
  pages     = {25--32}
}
```

```bibtex
@inproceedings{moosavi-strube-2016-coreference,
  title     = "Which Coreference Evaluation Metric Do You Trust? A Proposal for a Link-based Entity Aware Metric",
  author    = "Moosavi, Nafise Sadat and Strube, Michael",
  booktitle = "Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month     = aug,
  year      = "2016",
  address   = "Berlin, Germany",
  publisher = "Association for Computational Linguistics",
  url       = "https://www.aclweb.org/anthology/P16-1060",
  doi       = "10.18653/v1/P16-1060",
  pages     = "632--642",
}
```

## Further References

- [CoNLL 2012 Task Description](http://www.conll.cemantix.org/2012/data.html): for information on the format (section "*_conll File Format")
- [CoNLL Evaluation details](https://github.com/ns-moosavi/coval/blob/master/conll/README.md)
- [Hugging Face - Neural Coreference Resolution (Neuralcoref)](https://huggingface.co/coref/)
app.py
ADDED
@@ -0,0 +1,6 @@
```python
import evaluate
from evaluate.utils import launch_gradio_widget


module = evaluate.load("coval")
launch_gradio_widget(module)
```
coval.py
ADDED
@@ -0,0 +1,321 @@
```python
# Copyright 2020 The HuggingFace Evaluate Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" CoVal metric. """
import coval  # From: git+https://github.com/ns-moosavi/coval.git noqa: F401
import datasets
from coval.conll import reader, util
from coval.eval import evaluator

import evaluate


logger = evaluate.logging.get_logger(__name__)


_CITATION = """\
@InProceedings{moosavi2019minimum,
    author = { Nafise Sadat Moosavi, Leo Born, Massimo Poesio and Michael Strube},
    title = {Using Automatically Extracted Minimum Spans to Disentangle Coreference Evaluation from Boundary Detection},
    year = {2019},
    booktitle = {Proceedings of the 57th Annual Meeting of
        the Association for Computational Linguistics (Volume 1: Long Papers)},
    publisher = {Association for Computational Linguistics},
    address = {Florence, Italy},
}

@inproceedings{10.3115/1072399.1072405,
    author = {Vilain, Marc and Burger, John and Aberdeen, John and Connolly, Dennis and Hirschman, Lynette},
    title = {A Model-Theoretic Coreference Scoring Scheme},
    year = {1995},
    isbn = {1558604022},
    publisher = {Association for Computational Linguistics},
    address = {USA},
    url = {https://doi.org/10.3115/1072399.1072405},
    doi = {10.3115/1072399.1072405},
    booktitle = {Proceedings of the 6th Conference on Message Understanding},
    pages = {45–52},
    numpages = {8},
    location = {Columbia, Maryland},
    series = {MUC6 ’95}
}

@INPROCEEDINGS{Bagga98algorithmsfor,
    author = {Amit Bagga and Breck Baldwin},
    title = {Algorithms for Scoring Coreference Chains},
    booktitle = {In The First International Conference on Language Resources and Evaluation Workshop on Linguistics Coreference},
    year = {1998},
    pages = {563--566}
}

@INPROCEEDINGS{Luo05oncoreference,
    author = {Xiaoqiang Luo},
    title = {On coreference resolution performance metrics},
    booktitle = {In Proc. of HLT/EMNLP},
    year = {2005},
    pages = {25--32},
    publisher = {URL}
}

@inproceedings{moosavi-strube-2016-coreference,
    title = "Which Coreference Evaluation Metric Do You Trust? A Proposal for a Link-based Entity Aware Metric",
    author = "Moosavi, Nafise Sadat and
      Strube, Michael",
    booktitle = "Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2016",
    address = "Berlin, Germany",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/P16-1060",
    doi = "10.18653/v1/P16-1060",
    pages = "632--642",
}

"""

_DESCRIPTION = """\
CoVal is a coreference evaluation tool for the CoNLL and ARRAU datasets which
implements the common evaluation metrics including MUC [Vilain et al, 1995],
B-cubed [Bagga and Baldwin, 1998], CEAFe [Luo et al., 2005],
LEA [Moosavi and Strube, 2016] and the averaged CoNLL score
(the average of the F1 values of MUC, B-cubed and CEAFe)
[Denis and Baldridge, 2009a; Pradhan et al., 2011].

This wrapper of CoVal currently only works with the CoNLL line format:
The CoNLL format has one word per line with all the annotations for that word in columns separated by spaces:
Column  Type                    Description
1       Document ID             This is a variation on the document filename
2       Part number             Some files are divided into multiple parts numbered as 000, 001, 002, ... etc.
3       Word number
4       Word itself             This is the token as segmented/tokenized in the Treebank. Initially the *_skel file contains the placeholder [WORD] which gets replaced by the actual token from the Treebank which is part of the OntoNotes release.
5       Part-of-Speech
6       Parse bit               This is the bracketed structure broken before the first open parenthesis in the parse, and the word/part-of-speech leaf replaced with a *. The full parse can be created by substituting the asterisk with the "([pos] [word])" string (or leaf) and concatenating the items in the rows of that column.
7       Predicate lemma         The predicate lemma is mentioned for the rows for which we have semantic role information. All other rows are marked with a "-"
8       Predicate Frameset ID   This is the PropBank frameset ID of the predicate in Column 7.
9       Word sense              This is the word sense of the word in Column 3.
10      Speaker/Author          This is the speaker or author name where available. Mostly in Broadcast Conversation and Web Log data.
11      Named Entities          These columns identify the spans representing various named entities.
12:N    Predicate Arguments     There is one column each of predicate argument structure information for the predicate mentioned in Column 7.
N       Coreference             Coreference chain information encoded in a parenthesis structure.
More information on the format can be found here (section "*_conll File Format"): http://www.conll.cemantix.org/2012/data.html

Details on the evaluation on CoNLL can be found here: https://github.com/ns-moosavi/coval/blob/master/conll/README.md

CoVal code was written by @ns-moosavi.
Some parts are borrowed from https://github.com/clarkkev/deep-coref/blob/master/evaluation.py
The test suite is taken from https://github.com/conll/reference-coreference-scorers/
Mention evaluation and the test suite are added by @andreasvc.
Parsing CoNLL files is developed by Leo Born.
"""

_KWARGS_DESCRIPTION = """
Calculates coreference evaluation metrics.
Args:
    predictions: list of sentences. Each sentence is a list of word predictions to score in the CoNLL format.
        Each prediction is a word with its annotations as a string made of columns joined with spaces.
        Only columns 4, 5, 6 and the last column are used (word, POS, parse and coreference annotation).
        See the details on the format in the description of the metric.
    references: list of sentences. Each sentence is a list of word references to score in the CoNLL format.
        Each reference is a word with its annotations as a string made of columns joined with spaces.
        Only columns 4, 5, 6 and the last column are used (word, POS, parse and coreference annotation).
        See the details on the format in the description of the metric.
    keep_singletons: After extracting all mentions of key or system files,
        mentions whose corresponding coreference chain is of size one
        are considered as singletons. The default evaluation mode will include
        singletons in evaluations if they are included in the key or the system files.
        By setting 'keep_singletons=False', all singletons in the key and system files
        will be excluded from the evaluation.
    NP_only: Most of the recent coreference resolvers only resolve NP mentions and
        leave out the resolution of VPs. By setting the 'NP_only' option, the scorer will only evaluate the resolution of NPs.
    min_span: By setting 'min_span', the scorer reports the results based on automatically detected minimum spans.
        Minimum spans are determined using the MINA algorithm.

Returns:
    'mentions': mentions
    'muc': MUC metric [Vilain et al, 1995]
    'bcub': B-cubed [Bagga and Baldwin, 1998]
    'ceafe': CEAFe [Luo et al., 2005]
    'lea': LEA [Moosavi and Strube, 2016]
    'conll_score': averaged CoNLL score (the average of the F1 values of MUC, B-cubed and CEAFe)

Examples:

    >>> coval = evaluate.load('coval')
    >>> words = ['bc/cctv/00/cctv_0005 0 0 Thank VBP (TOP(S(VP* thank 01 1 Xu_li * (V*) * -',
    ... 'bc/cctv/00/cctv_0005 0 1 you PRP (NP*) - - - Xu_li * (ARG1*) (ARG0*) (116)',
    ... 'bc/cctv/00/cctv_0005 0 2 everyone NN (NP*) - - - Xu_li * (ARGM-DIS*) * (116)',
    ... 'bc/cctv/00/cctv_0005 0 3 for IN (PP* - - - Xu_li * (ARG2* * -',
    ... 'bc/cctv/00/cctv_0005 0 4 watching VBG (S(VP*)))) watch 01 1 Xu_li * *) (V*) -',
    ... 'bc/cctv/00/cctv_0005 0 5 . . *)) - - - Xu_li * * * -']
    >>> references = [words]
    >>> predictions = [words]
    >>> results = coval.compute(predictions=predictions, references=references)
    >>> print(results) # doctest:+ELLIPSIS
    {'mentions/recall': 1.0,[...] 'conll_score': 100.0}
"""


def get_coref_infos(
    key_lines, sys_lines, NP_only=False, remove_nested=False, keep_singletons=True, min_span=False, doc="dummy_doc"
):

    key_doc_lines = {doc: key_lines}
    sys_doc_lines = {doc: sys_lines}

    doc_coref_infos = {}

    key_nested_coref_num = 0
    sys_nested_coref_num = 0
    key_removed_nested_clusters = 0
    sys_removed_nested_clusters = 0
    key_singletons_num = 0
    sys_singletons_num = 0

    key_clusters, singletons_num = reader.get_doc_mentions(doc, key_doc_lines[doc], keep_singletons)
    key_singletons_num += singletons_num

    if NP_only or min_span:
        key_clusters = reader.set_annotated_parse_trees(key_clusters, key_doc_lines[doc], NP_only, min_span)

    sys_clusters, singletons_num = reader.get_doc_mentions(doc, sys_doc_lines[doc], keep_singletons)
    sys_singletons_num += singletons_num

    if NP_only or min_span:
        sys_clusters = reader.set_annotated_parse_trees(sys_clusters, key_doc_lines[doc], NP_only, min_span)

    if remove_nested:
        nested_mentions, removed_clusters = reader.remove_nested_coref_mentions(key_clusters, keep_singletons)
        key_nested_coref_num += nested_mentions
        key_removed_nested_clusters += removed_clusters

        nested_mentions, removed_clusters = reader.remove_nested_coref_mentions(sys_clusters, keep_singletons)
        sys_nested_coref_num += nested_mentions
        sys_removed_nested_clusters += removed_clusters

    sys_mention_key_cluster = reader.get_mention_assignments(sys_clusters, key_clusters)
    key_mention_sys_cluster = reader.get_mention_assignments(key_clusters, sys_clusters)

    doc_coref_infos[doc] = (key_clusters, sys_clusters, key_mention_sys_cluster, sys_mention_key_cluster)

    if remove_nested:
        logger.info(
            "Number of removed nested coreferring mentions in the key "
            f"annotation: {key_nested_coref_num}; and system annotation: {sys_nested_coref_num}"
        )
        logger.info(
            "Number of resulting singleton clusters in the key "
            f"annotation: {key_removed_nested_clusters}; and system annotation: {sys_removed_nested_clusters}"
        )

    if not keep_singletons:
        logger.info(
            f"{key_singletons_num:d} and {sys_singletons_num:d} singletons are removed from the key and system "
            "files, respectively"
        )

    return doc_coref_infos


def compute_score(key_lines, sys_lines, metrics, NP_only, remove_nested, keep_singletons, min_span):
    doc_coref_infos = get_coref_infos(key_lines, sys_lines, NP_only, remove_nested, keep_singletons, min_span)

    output_scores = {}
    conll = 0
    conll_subparts_num = 0

    for name, metric in metrics:
        recall, precision, f1 = evaluator.evaluate_documents(doc_coref_infos, metric, beta=1)
        if name in ["muc", "bcub", "ceafe"]:
            conll += f1
            conll_subparts_num += 1
        output_scores.update({f"{name}/recall": recall, f"{name}/precision": precision, f"{name}/f1": f1})

        logger.info(
            name.ljust(10),
            f"Recall: {recall * 100:.2f}",
            f" Precision: {precision * 100:.2f}",
            f" F1: {f1 * 100:.2f}",
        )

    if conll_subparts_num == 3:
        conll = (conll / 3) * 100
        logger.info(f"CoNLL score: {conll:.2f}")
        output_scores.update({"conll_score": conll})

    return output_scores


def check_gold_parse_annotation(key_lines):
    has_gold_parse = False
    for line in key_lines:
        if not line.startswith("#"):
            if len(line.split()) > 6:
                parse_col = line.split()[5]
                if not parse_col == "-":
                    has_gold_parse = True
                    break
                else:
                    break
    return has_gold_parse


@evaluate.utils.file_utils.add_start_docstrings(_DESCRIPTION, _KWARGS_DESCRIPTION)
class Coval(evaluate.EvaluationModule):
    def _info(self):
        return evaluate.EvaluationModuleInfo(
            description=_DESCRIPTION,
            citation=_CITATION,
            inputs_description=_KWARGS_DESCRIPTION,
            features=datasets.Features(
                {
                    "predictions": datasets.Sequence(datasets.Value("string")),
                    "references": datasets.Sequence(datasets.Value("string")),
                }
            ),
            codebase_urls=["https://github.com/ns-moosavi/coval"],
            reference_urls=[
                "https://github.com/ns-moosavi/coval",
                "https://www.aclweb.org/anthology/P16-1060",
                "http://www.conll.cemantix.org/2012/data.html",
            ],
        )

    def _compute(
        self, predictions, references, keep_singletons=True, NP_only=False, min_span=False, remove_nested=False
    ):
        allmetrics = [
            ("mentions", evaluator.mentions),
            ("muc", evaluator.muc),
            ("bcub", evaluator.b_cubed),
            ("ceafe", evaluator.ceafe),
            ("lea", evaluator.lea),
        ]

        if min_span:
            has_gold_parse = util.check_gold_parse_annotation(references)
            if not has_gold_parse:
                raise NotImplementedError("References should have gold parse annotation to use 'min_span'.")
            # util.parse_key_file(key_file)
            # key_file = key_file + ".parsed"

        score = compute_score(
            key_lines=references,
            sys_lines=predictions,
            metrics=allmetrics,
            NP_only=NP_only,
            remove_nested=remove_nested,
            keep_singletons=keep_singletons,
            min_span=min_span,
        )

        return score
```
requirements.txt
ADDED
@@ -0,0 +1,4 @@
```
# TODO: fix github to release
git+https://github.com/huggingface/evaluate.git@b6e6ed7f3e6844b297bff1b43a1b4be0709b9671
datasets~=2.0
git+https://github.com/ns-moosavi/coval.git
```