	---
	title: Quad Match Score
	datasets:
	- SemEval2016 Task5
	tags:
	- evaluate
	- metric
	description: "Evaluates sentiment quadruple extraction results of generative models."
	sdk: gradio
	sdk_version: 3.19.1
	app_file: app.py
	pinned: false
	---

	# Metric Card for Quad Match Score

	## Metric Description
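	Evaluates the sentiment quadruple extraction results of generative models.
	- Previous metrics count only an exactly matching quadruple as a TP sample; everything else is an FP or FN sample.
	This gives most models very low scores, especially on Chinese datasets.
	Because the choice of target and opinion spans is subjective, requiring an exact match does not reflect a model's real performance.
	- This metric evaluates the four elements of a quadruple separately. Instead of requiring two quadruples to be identical, it computes their match degree. If two quadruples have a match degree of 0.6, the pair counts as 0.6 TP samples, 0.4 FP samples, and 0.4 FN samples.
	- The match degree is the weighted average of the match degrees of the four elements. The weights default to (1,1,1,1) and can be customized. The match degree is normalized to [0,1].
	- The match degree of target and opinion is the ROUGE-L score of the two strings, which lies in [0,1] (BLEU, edit distance, or similar metrics could be used instead).
	- The match degree of aspect and polarity is 1 if the two strings are identical and 0 otherwise.
	- Because the output of a generative model is not controllable, the prediction and the reference are not guaranteed to contain the same number of quadruples in the same order, so an optimal matching has to be found first.
	- If the prediction has n quadruples and the reference has m quadruples, with m>n, there are m!/(m-n)! possible one-to-one matchings.
	The one-to-one matching with the highest total match degree (the optimal matching) is used. The TP, FP, and FN counts are computed under this matching, and the counts of all samples are then summed to obtain the F1.
	- The average total match degree over all samples can also serve as a metric, but samples with few quadruples tend to score higher and inflate the overall score.
	- Advantages of this metric:\
	1. It does not require quadruples to match exactly, so it is less affected by the subjectivity of target and opinion selection.\
	2. The weights can be customized to adjust the importance of each element.\
	3. Used as a validation metric during training, it helps avoid overfitting and better reflects the model's real performance.\
	4. It also applies to tasks such as triplet extraction.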
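
The scoring steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the module's actual implementation: `rouge_l` is a character-level LCS F-measure standing in for ROUGE-L, unmatched quadruples are assumed to count as one full FP or FN each, and the names `match_degree`, `optimal_match`, and `corpus_f1` are hypothetical.

```python
from itertools import permutations


def lcs_len(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(pred, ref):
    """Character-level ROUGE-L F-measure in [0, 1] (tokenization is an assumption)."""
    lcs = lcs_len(pred, ref)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(pred), lcs / len(ref)
    return 2 * p * r / (p + r)


def match_degree(pred, ref, weights=(1, 1, 1, 1)):
    """Weighted average of element scores for (target, opinion, aspect, polarity)."""
    scores = (
        rouge_l(pred[0], ref[0]),   # target: soft match
        rouge_l(pred[1], ref[1]),   # opinion: soft match
        float(pred[2] == ref[2]),   # aspect: exact match
        float(pred[3] == ref[3]),   # polarity: exact match
    )
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def optimal_match(preds, refs, weights=(1, 1, 1, 1)):
    """Brute-force the best one-to-one matching; return fractional (tp, fp, fn)."""
    # match_degree is symmetric here, so it is safe to permute the longer side.
    short, long_ = sorted((preds, refs), key=len)
    best = max(
        sum(match_degree(a, b, weights) for a, b in zip(short, perm))
        for perm in permutations(long_, len(short))
    )
    tp = float(best)
    return tp, len(preds) - tp, len(refs) - tp


def corpus_f1(samples, weights=(1, 1, 1, 1)):
    """Sum TP, FP, FN over all (preds, refs) samples, then compute F1."""
    tp = fp = fn = 0.0
    for preds, refs in samples:
        t, p, n = optimal_match(preds, refs, weights)
        tp, fp, fn = tp + t, fp + p, fn + n
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


preds = [("food", "good", "food#taste", "pos")]
refs = [("food", "good", "food#taste", "pos"),
        ("service", "bad", "service#general", "neg")]
print(round(corpus_f1([(preds, refs)]), 4))  # 0.6667
```

The brute-force search enumerates all m!/(m-n)! matchings, which is fine for the handful of quadruples in a typical sample; an assignment solver would be a reasonable substitute for longer lists.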

	## How to Use
	```python
	import evaluate

	module = evaluate.load("yuyijiong/quad_match_score")

	# Each string holds one sample: " & " separates quadruples,
	# " | " separates the elements of one quadruple.
	predictions = ["food | good | food#taste | pos"]
	references = ["food | good | food#taste | pos & service | bad | service#general | neg"]

	result = module.compute(predictions=predictions, references=references)
	print(result)

	# Expected output:
	# {'f1 of exact match': 0.6667,
	#  'f1 of optimal match of weight (1, 1, 1, 1)': 0.6666666666666666,
	#  'score of optimal match of weight (1, 1, 1, 1)': 0.5}
	```
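
The F1 values in this example can be checked by hand. The single predicted quadruple matches the first reference quadruple exactly and the second reference quadruple is left unmatched, so, assuming an unmatched quadruple counts as one full FN:

```latex
\begin{aligned}
TP &= 1, \quad FP = 0, \quad FN = 1 \\
P &= \frac{TP}{TP + FP} = 1, \qquad R = \frac{TP}{TP + FN} = \frac{1}{2} \\
F_1 &= \frac{2PR}{P + R} = \frac{2 \cdot 1 \cdot \tfrac{1}{2}}{1 + \tfrac{1}{2}} = \frac{2}{3} \approx 0.6667
\end{aligned}
```

The reported score of 0.5 is consistent with dividing the total match degree (1) by the larger of the two quadruple counts (2), although that normalization is an inference from the output rather than something the card states.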

	### Inputs
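	- predictions (`List[str]`): Quadruples generated by the model. Each string in the list is the generated output for one sample.
	- references (`Union[List[str], List[List[str]]]`): Human-annotated quadruples. Each string in the list is the label for one sample. If an element is itself a list, it holds multiple references for that sample, and the highest score among them is used.
	- weights (`Tuple[float, float, float, float]`, optional, defaults to (1,1,1,1)): Weights of the scores for the four elements, in the order (target, opinion, aspect, polarity).
	- tuple_len (`str`, optional, defaults to "0123"): Indicates the format of the tuple; see the following mapping.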
	```
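	{'0123': 'quadruple (target | opinion | aspect | polarity)',
	'01': 'pair (target | opinion)',
	'012': 'triplet (target | opinion | aspect)',
	'013': 'triplet (target | opinion | polarity)',
	'023': 'triplet (target | aspect | polarity)',
	'23': 'pair (aspect | polarity)',
	'03': 'pair (target | polarity)',
	'13': 'pair (opinion | polarity)',
	'3': 'single element (polarity)'}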
	```
	- sep_token1 (`str`, optional, defaults to `" & "`): The token that separates the quadruples of one sample.
	- sep_token2 (`str`, optional, defaults to `" | "`): The token that separates the elements of one quadruple.
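
To make the input format concrete, here is a small sketch of how a sample string could be split into tuples using the separators and `tuple_len` described above. It is illustrative only: `parse_sample` and `FIELDS` are hypothetical names, and skipping malformed tuples is an assumption rather than the module's documented behavior.

```python
# Element names by index, following the tuple_len mapping (0..3).
FIELDS = ("target", "opinion", "aspect", "polarity")


def parse_sample(text, tuple_len="0123", sep_token1=" & ", sep_token2=" | "):
    """Split one sample string into a list of dicts keyed by element name."""
    if not text.strip():
        return []
    names = [FIELDS[int(d)] for d in tuple_len]
    tuples = []
    for chunk in text.split(sep_token1):
        units = [u.strip() for u in chunk.split(sep_token2)]
        if len(units) != len(names):
            continue  # assumption: drop tuples with the wrong number of elements
        tuples.append(dict(zip(names, units)))
    return tuples


sample = "food | good | food#taste | pos & service | bad | service#general | neg"
for quad in parse_sample(sample):
    print(quad)

# A (target | polarity) pair uses tuple_len="03".
print(parse_sample("food | pos", tuple_len="03"))
```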

	### Output Values

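	A dict containing the optimal-match F1, the average optimal-match score per sample, and the exact-match F1 (the traditional metric). All F1 values are in [0,1].

	*Example: {'f1 of exact match': 0.6667,
	'f1 of optimal match of weight (1, 1, 1, 1)': 0.6666666666666666,
	'score of optimal match of weight (1, 1, 1, 1)': 0.5}*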


	## Limitations and Bias
	Compared with traditional exact-match metrics, this metric tends to produce higher scores.

	## Citation
	Paper forthcoming.