---
title: relation_extraction
datasets:
- none
tags:
- evaluate
- metric
description: >-
  This metric evaluates relation extraction predictions against references by
  computing precision, recall, and F1 scores.
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
license: apache-2.0
---
# Metric Card for relation_extraction evaluation
This metric evaluates the quality of relation extraction output by computing micro and macro precision, recall, and F1 scores over the predicted relations.
## Metric Description
This metric evaluates the model's predictions against the provided reference data and returns precision, recall, and micro/macro F1 scores, computed according to the evaluation mode specified (`strict` or `boundaries`).
## How to Use
```python
import evaluate
metric = evaluate.load("Ikala-allen/relation_extraction")
references = [
[
{"head": "phip igments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
{"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
]
]
predictions = [
[
{"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
{"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
]
]
scores = metric.compute(predictions=predictions, references=references, mode="strict", detailed_scores=False, relation_types=[])
```
### Inputs
- **predictions** (`list` of `list` of `dictionary`): Predicted relations from the model, given as one list of relation dictionaries per example.
- **references** (`list` of `list` of `dictionary`): Ground-truth (reference) relations to compare the predictions against, in the same format.
- **mode** (`str`, optional): Evaluation mode, either `strict` or `boundaries`. Defaults to `strict`. `strict` mode takes both the entity types and the entity spans of a relation into account, while `boundaries` mode only considers the entity spans (see the sketch after this list).
- **detailed_scores** (`bool`, optional): Defaults to `False`. If `True`, returns scores for each relation type separately; if `False`, returns only the overall scores.
- **relation_types** (`list`, optional): Defaults to `[]`. Relation types to consider during evaluation. If not provided, the relation types are collected from the reference data.
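For intuition, the sketch below (an assumed reading of the two modes, not the metric's actual implementation; the helper names are illustrative) shows which fields of each relation dictionary are compared:

```python
def strict_key(relation: dict) -> tuple:
    # "strict": entity spans, entity types, and the relation type must all match.
    return (relation["head"], relation["head_type"],
            relation["tail"], relation["tail_type"], relation["type"])

def boundaries_key(relation: dict) -> tuple:
    # "boundaries": entity types are ignored; only the entity spans
    # and the relation type are compared.
    return (relation["head"], relation["tail"], relation["type"])
```

Under this reading, a prediction with the correct spans but a wrong `head_type` is counted as correct in `boundaries` mode but as an error in `strict` mode.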
### Output Values
**output** (`dictionary` of `dictionaries`): A dictionary mapping each relation type (and the aggregate key `ALL`) to its scoring metrics, such as precision, recall, and F1 score. When `detailed_scores=False`, only the overall scores are returned.
- **ALL** (`dictionary`): score of total relation type
- **tp** : true positive count
- **fp** : false positive count
- **fn** : false negative count
- **p** : precision
- **r** : recall
- **f1** : micro f1 score
- **Macro_f1** : macro f1 score
- **Macro_p** : macro precision
- **Macro_r** : macro recall
- **{selected relation type}** (`dictionary`): score of selected relation type
- **tp** : true positive count
- **fp** : false positive count
- **fn** : false negative count
- **p** : precision
- **r** : recall
- **f1** : micro f1 score
Output Example:
```python
{'tp': 1, 'fp': 1, 'fn': 1, 'p': 50.0, 'r': 50.0, 'f1': 50.0, 'Macro_f1': 50.0, 'Macro_p': 50.0, 'Macro_r': 50.0}
```
Note: `Macro_f1`, `Macro_p`, `Macro_r`, `p`, `r`, and `f1` are always percentages between 0 and 100. The values of `tp`, `fp`, and `fn` depend on the number of input relations.
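For reference, the micro scores (`p`, `r`, `f1`) are computed from the pooled `tp`/`fp`/`fn` counts, while the macro scores average the per-type precision, recall, and F1. Below is a minimal sketch of that arithmetic, inferred from the outputs shown in this card rather than taken from the metric's internal code:

```python
def prf(tp: int, fp: int, fn: int) -> tuple:
    # Precision, recall, and F1 from raw counts, reported as percentages.
    p = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    r = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Micro: pool the counts over all relation types, then score once.
# Macro: score each relation type separately, then average p, r, and f1.
per_type = {"sell": (3, 1, 0), "belongs_to": (0, 0, 1)}  # (tp, fp, fn), as in Example 2 below
micro = prf(*(sum(counts) for counts in zip(*per_type.values())))  # (75.0, 75.0, 75.0)
macro = [sum(vals) / len(per_type)
         for vals in zip(*(prf(*c) for c in per_type.values()))]   # [37.5, 50.0, 42.857...]
```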
### Examples
Example 1: A single list of predictions and references.
```python
metric = evaluate.load("Ikala-allen/relation_extraction")
references = [
[
{"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
{"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
{'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
]
]
predictions = [
[
{"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
{"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
]
]
scores = metric.compute(predictions=predictions, references=references, mode="strict", detailed_scores=False, relation_types=[])
print(scores)
>>> {'tp': 1, 'fp': 1, 'fn': 2, 'p': 50.0, 'r': 33.333333333333336, 'f1': 40.0, 'Macro_f1': 25.0, 'Macro_p': 25.0, 'Macro_r': 25.0}
```
Example 2: Two or more lists of predictions and references, outputting the scores for every relation type.
```python
metric = evaluate.load("Ikala-allen/relation_extraction")
references = [
[
{"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
{"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
],
[
{'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
{'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
]
]
predictions = [
[
{"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
{"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
],
[
{'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
{'head': 'SNTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'}
]
]
scores = metric.compute(predictions=predictions, references=references, mode="boundaries", detailed_scores=True, relation_types=[])
print(scores)
>>> {'sell': {'tp': 3, 'fp': 1, 'fn': 0, 'p': 75.0, 'r': 100.0, 'f1': 85.71428571428571}, 'belongs_to': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0}, 'ALL': {'tp': 3, 'fp': 1, 'fn': 1, 'p': 75.0, 'r': 75.0, 'f1': 75.0, 'Macro_f1': 42.857142857142854, 'Macro_p': 37.5, 'Macro_r': 50.0}}
```
Example 3: Two or more lists of predictions and references, outputting per-type scores while considering only the relation type `belongs_to`.
```python
metric = evaluate.load("Ikala-allen/relation_extraction")
references = [
[
{"head": "phipigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
{"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
],
[
{'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
{'head': 'A醛賦活緊緻精華', 'tail': 'Serum', 'head_type': 'product', 'tail_type': 'category', 'type': 'belongs_to'},
]
]
predictions = [
[
{"head": "phipigments", "head_type": "product", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
{"head": "tinadaviespigments", "head_type": "brand", "type": "sell", "tail": "國際認證之色乳", "tail_type": "product"},
],
[
{'head': 'SABONTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'},
{'head': 'SNTAIWAN', 'tail': '大馬士革玫瑰有機光燦系列', 'head_type': 'brand', 'tail_type': 'product', 'type': 'sell'}
]
]
scores = metric.compute(predictions=predictions, references=references, mode="boundaries", detailed_scores=True, relation_types=["belongs_to"])
print(scores)
>>> {'belongs_to': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0}, 'ALL': {'tp': 0, 'fp': 0, 'fn': 1, 'p': 0, 'r': 0, 'f1': 0, 'Macro_f1': 0.0, 'Macro_p': 0.0, 'Macro_r': 0.0}}
```
## Limitations and Bias
This metric offers two modes, `strict` and `boundaries`, and lets you restrict evaluation to selected `relation_types`. Choose these parameters carefully, as they can significantly affect the reported F1 scores.
The entity fields (`head`, `tail`, `head_type`, `tail_type`) in a prediction must match the reference exactly, apart from case and spaces, which are disregarded. A prediction that does not match any reference is counted as a false positive (`fp`), and an unmatched reference as a false negative (`fn`).
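As a rough illustration of that matching rule (an assumed normalization, not necessarily the metric's exact code), entity strings could be compared after lowercasing and removing spaces:

```python
def normalize(entity: str) -> str:
    # Assumed normalization: comparisons ignore case and spaces,
    # per the matching rule described above (illustrative only).
    return entity.replace(" ", "").lower()

# "Phi Pigments" and "phipigments" are treated as the same entity span,
# but "phi-pigments" is not.
assert normalize("Phi Pigments") == normalize("phipigments")
assert normalize("phi-pigments") != normalize("phipigments")
```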
## Citation
```bibtex
@article{taille2020sincere,
  author = {Bruno Taillé and Vincent Guigue and Geoffrey Scoutheeten and Patrick Gallinari},
  title  = {Let's Stop Incorrect Comparisons in End-to-end Relation Extraction!},
  year   = {2020},
  url    = {https://arxiv.org/abs/2009.10684}
}
```
## Further References
This evaluation metric is adapted from
*https://github.com/btaille/sincere/blob/master/code/utils/evaluation.py*