Spaces: yuyijiong/quad_match_score


yuyijiong committed on Apr 11, 2023

Commit 6df6fc1 (1 parent: 869931b)

Update README.md

Files changed (1)

README.md +20 -4

README.md CHANGED

@@ -5,7 +5,7 @@ datasets:
 tags:
 - evaluate
 - metric
-description: "TODO: add a description here"
 sdk: gradio
 sdk_version: 3.19.1
 app_file: app.py
@@ -17,7 +17,23 @@ pinned: false
 ***Module Card Instructions:*** *Evaluates the sentiment quadruple extraction results of generative models.*
 ## Metric Description
-*Evaluates the sentiment quadruple extraction results of generative models.*
 ## How to Use
 ```python
@@ -31,7 +47,7 @@ references=["food | good | food#taste | pos & service | bad | service#general |
 result=module.compute(predictions=predictions, references=references)
 print(result)
-{'ave match score of weight (1, 1, 1, 1)': 0.375,
 'f1 score of exact match': 0.0,
 'f1 score of optimal match of weight (1, 1, 1, 1)': 0.5}
 ```
@@ -60,7 +76,7 @@ print(result)
 ### Output Values
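-*A dict containing the optimal-match F1 score, the average per-sample optimal-match score, and the exact-match F1 score; all F1 values lie in [0,1].*
 *Example: {'ave match score of weight (1, 1, 1, 1)': 0.375,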
 'f1 score of exact match': 0.0,

 tags:
 - evaluate
 - metric
+description: "Evaluates the sentiment quadruple extraction results of generative models"
 sdk: gradio
 sdk_version: 3.19.1
 app_file: app.py
 ***Module Card Instructions:*** *Evaluates the sentiment quadruple extraction results of generative models.*
 ## Metric Description
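+Evaluates the sentiment quadruple extraction results of generative models.
+- Previous metrics count only an exactly identical quadruple as a TP sample; everything else is an FP or FN sample.
+As a result, most models score very low, especially on Chinese datasets.
+Because the choice of target and opinion spans is subjective, requiring exact agreement does not reflect a model's real performance.
+- This metric evaluates the four elements of a quadruple separately. Instead of requiring two quadruples to be identical, it computes a match score between them: if two quadruples have a match score of 0.6, the pair counts as 0.6 TP, 0.4 FP, and 0.4 FN samples.
+- The match score is the weighted average of the match scores of the four elements. The weights default to (1,1,1,1) and can be customized. The match score is normalized to [0,1].
+- The match score for target and opinion is the ROUGE-L score of the two strings, in [0,1] (BLEU, edit distance, or similar measures could also be used).
+- The match score for aspect and polarity is 1 if the two strings are identical, and 0 otherwise.
+- Because the output of a generative model cannot be controlled, the prediction and the reference are not guaranteed to contain the same number of quadruples or to list them in the same order, so an optimal matching must be found first.
+- If the prediction has n quadruples and the reference has m quadruples, with m>n, there are m!/(m-n)! possible matchings.
+The one-to-one matching with the highest total match score (the optimal matching) is selected, and TP, FP, and FN are computed under it. Finally, the TP, FP, and FN counts of all samples are summed to obtain the F1 score.
+- The average total match score over all samples can also be used as a metric, but samples with few quadruples tend to score higher and inflate the overall score.
+- Advantages of this metric:\
+1. It does not require quadruples to be identical, so it is less affected by the subjectivity of target and opinion selection.\
+2. The weights can be customized to adjust the importance of each element.\
+3. Used as a validation metric during training, it helps avoid overfitting and better reflects the model's real performance.\
+4. It also applies to related tasks such as triplet extraction.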
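
The matching procedure described in the list above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function names (`lcs_len`, `rouge_l`, `quad_score`, `best_match_total`, `soft_f1`) are hypothetical, ROUGE-L is approximated here as a character-level LCS F1 (the real metric may tokenize differently or call a library), and the optimal matching is found by brute force over all one-to-one assignments.

```python
from itertools import permutations

def lcs_len(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(a, b):
    """Character-level ROUGE-L F1 (assumption: the real metric may tokenize differently)."""
    if not a or not b:
        return float(a == b)
    lcs = lcs_len(a, b)
    if lcs == 0:
        return 0.0
    p, r = lcs / len(a), lcs / len(b)
    return 2 * p * r / (p + r)

def quad_score(pred, ref, weights=(1, 1, 1, 1)):
    """Weighted average over (target, opinion, aspect, polarity), in [0, 1]."""
    parts = (
        rouge_l(pred[0], ref[0]),   # target: soft string match
        rouge_l(pred[1], ref[1]),   # opinion: soft string match
        float(pred[2] == ref[2]),   # aspect: exact match only
        float(pred[3] == ref[3]),   # polarity: exact match only
    )
    return sum(w * s for w, s in zip(weights, parts)) / sum(weights)

def best_match_total(preds, refs, weights=(1, 1, 1, 1)):
    """Total score of the best one-to-one matching, found by brute force."""
    # quad_score is symmetric here, so the shorter list can always be placed first.
    small, large = (preds, refs) if len(preds) <= len(refs) else (refs, preds)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        total = sum(quad_score(small[i], large[j], weights) for i, j in enumerate(perm))
        best = max(best, total)
    return best

def soft_f1(samples, weights=(1, 1, 1, 1)):
    """F1 from fractional TP/FP/FN summed over all (preds, refs) samples."""
    tp = fp = fn = 0.0
    for preds, refs in samples:
        matched = best_match_total(preds, refs, weights)
        tp += matched
        fp += len(preds) - matched   # unmatched share of the predictions
        fn += len(refs) - matched    # unmatched share of the references
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Illustrative quadruples (made up for this sketch).
pred = [("food", "good", "food#taste", "pos")]
ref = [("food", "very good", "food#taste", "pos")]
print(soft_f1([(pred, ref)]))
```

Brute force enumerates m!/(m-n)! assignments, which is only practical for the handful of quadruples a single sentence usually contains. For larger inputs, an assignment solver such as `scipy.optimize.linear_sum_assignment` finds the same optimum in polynomial time.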
 ## How to Use
 ```python
 result=module.compute(predictions=predictions, references=references)
 print(result)
+result={'ave match score of weight (1, 1, 1, 1)': 0.375,
 'f1 score of exact match': 0.0,
 'f1 score of optimal match of weight (1, 1, 1, 1)': 0.5}
 ```
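
The example inputs are strings in which quadruples appear to be joined by `&` and their fields by `|`. A minimal parsing sketch under that assumption is shown below; `parse_quads` is a hypothetical helper, the field order (target, opinion, aspect, polarity) is inferred from the example rather than confirmed by the module, and the input string is made up for illustration.

```python
def parse_quads(text, quad_sep="&", field_sep="|"):
    """Split 'a | b | c | d & e | f | g | h' into a list of 4-tuples."""
    quads = []
    for chunk in text.split(quad_sep):
        fields = tuple(part.strip() for part in chunk.split(field_sep))
        if len(fields) == 4:  # skip malformed quadruples rather than guessing
            quads.append(fields)
    return quads

# Hypothetical model output, not taken from the module's README.
text = "pizza | tasty | food#taste | pos & staff | rude | service#general | neg"
for target, opinion, aspect, polarity in parse_quads(text):
    print(target, opinion, aspect, polarity)
```

Dropping malformed quadruples is one possible design choice for uncontrolled generative output; a stricter evaluator could instead keep them as unmatched predictions so that they count as FP.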
 ### Output Values
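+*A dict containing the optimal-match F1 score, the average per-sample optimal-match score, and the exact-match F1 score (the traditional evaluation); all F1 values lie in [0,1].*
 *Example: {'ave match score of weight (1, 1, 1, 1)': 0.375,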
 'f1 score of exact match': 0.0,