fix results
- README.md +33 -29
- eval-results/omnieval-auto/bge-large-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
- eval-results/omnieval-auto/bge-m3_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
- eval-results/omnieval-auto/gte-qwen2-1.5b_deepseek-v2-chat/results_2023-12-08 15:46:20.425378.json +12 -12
- eval-results/omnieval-auto/gte-qwen2-1.5b_llama3-70b-instruct/results_2023-12-08 15:46:20.425378.json +12 -12
- eval-results/omnieval-auto/gte-qwen2-1.5b_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
- eval-results/omnieval-auto/gte-qwen2-1.5b_yi15-34b/results_2023-12-08 15:46:20.425378.json +11 -11
- eval-results/omnieval-auto/jina-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
- eval-results/omnieval-human/bge-large-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
- eval-results/omnieval-human/bge-m3_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
- eval-results/omnieval-human/e5-mistral-7b_qwen2-72b/results_2023-12-08 15:46:20.425378.json +11 -11
- eval-results/omnieval-human/gte-qwen2-1.5b_deepseek-v2-chat/results_2023-12-08 15:46:20.425378.json +12 -12
- eval-results/omnieval-human/gte-qwen2-1.5b_llama3-70b-instruct/results_2023-12-08 15:46:20.425378.json +12 -12
- eval-results/omnieval-human/gte-qwen2-1.5b_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
- eval-results/omnieval-human/gte-qwen2-1.5b_yi15-34b/results_2023-12-08 15:46:20.425378.json +11 -11
- eval-results/omnieval-human/jina-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
- src/about.py +24 -103
README.md CHANGED
@@ -10,36 +10,40 @@ license: apache-2.0
 short_description: Official Leaderboard for OmniEval
 ---
 
-
-
-
-
-
-
-
-
-
-
-
-    },
-    "results": {
-        "task_name": {
-            "metric_name": score,
-        },
-        "task_name2": {
-            "metric_name": score,
-        }
-    }
-}
-```
+---
+license: mit
+language:
+- zh
+- en
+base_model:
+- Qwen/Qwen2.5-7B-Instruct
+pipeline_tag: text-generation
+---
+
+# Dataset Information
 
-
+We introduce an omnidirectional and automatic RAG benchmark for the financial domain, **OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain**. Our benchmark is characterized by its multi-dimensional evaluation framework, including:
 
+1. a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios;
+2. a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations of generated instances;
+3. a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline;
+4. robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator.
 
+Useful Links: 📝 [Paper](https://arxiv.org/abs/2412.13018) • 🤗 [Hugging Face](https://huggingface.co/collections/RUC-NLPIR/omnieval-67629ccbadd3a715a080fd25) • 🧩 [Github](https://github.com/RUC-NLPIR/OmniEval)
 
-
-
-
-
+We have trained two evaluator models from Qwen2.5-7B, using the LoRA strategy and human-annotated labels, to implement model-based evaluation. Note that the hallucination evaluator is separate from the evaluator used for the other four metrics.
+
+We provide the evaluator for all metrics except hallucination in this repo.
+
+# 🌟 Citation
+```bibtex
+@misc{wang2024omnievalomnidirectionalautomaticrag,
+      title={OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain},
+      author={Shuting Wang and Jiejun Tan and Zhicheng Dou and Ji-Rong Wen},
+      year={2024},
+      eprint={2412.13018},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2412.13018},
+}
+```
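The result files fixed below all report the same two retrieval metrics (`mrr`, `map`) and ten generation metrics. For orientation, here is a minimal, illustrative sketch of how MRR and MAP are conventionally computed from binary relevance judgments; it is not OmniEval's evaluator code:

```python
# Illustrative only: the standard definitions of MRR and MAP over binary
# relevance judgments, shown to clarify the "mrr" and "map" fields below.
from typing import List

def mean_reciprocal_rank(judgments: List[List[int]]) -> float:
    """judgments[i][k] = 1 if the k-th ranked doc for query i is relevant."""
    total = 0.0
    for rels in judgments:
        for rank, rel in enumerate(rels, start=1):
            if rel:
                total += 1.0 / rank  # reciprocal rank of the first hit
                break
    return total / len(judgments)

def mean_average_precision(judgments: List[List[int]]) -> float:
    total = 0.0
    for rels in judgments:
        hits, ap = 0, 0.0
        for rank, rel in enumerate(rels, start=1):
            if rel:
                hits += 1
                ap += hits / rank  # precision at each relevant rank
        total += ap / max(hits, 1)
    return total / len(judgments)

# Two queries with top-3 judged documents each:
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))    # (1/2 + 1) / 2 = 0.75
print(mean_average_precision([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1) / 2 = 0.75
```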
eval-results/omnieval-auto/bge-large-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.3865492275960769,
+      "map": 0.37771288462347935
     },
     "generation": {
-      "em": 0.
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "em": 0.003156964263164541,
+      "f1": 0.254069117724313,
+      "rouge1": 0.25832549561659673,
+      "rouge2": 0.13269187125919746,
+      "rougeL": 0.16925453426436302,
+      "accuracy": 0.4080060613713853,
+      "completeness": 0.6048002385211687,
+      "hallucination": 0.05973250227243215,
+      "utilization": 0.5193561001042752,
+      "numerical_accuracy": 0.31237373737373736
     }
   },
   "config": {
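All seventeen result files in this commit share the schema shown above. A minimal sketch of how one of them could be loaded and flattened into a leaderboard row; the `task/metric` column-naming scheme is an illustrative assumption, not this Space's actual loading code:

```python
# Hypothetical loader sketch for the result files fixed in this commit; the
# flattened "task/metric" column scheme is an assumption for illustration.
import json

path = ("eval-results/omnieval-auto/bge-large-zh_qwen2-72b/"
        "results_2023-12-08 15:46:20.425378.json")
with open(path) as f:
    data = json.load(f)

row = {}
for task, metrics in data["results"].items():   # "retrieval", "generation"
    for metric, score in metrics.items():       # "mrr", "em", "f1", ...
        row[f"{task}/{metric}"] = score

print(row["retrieval/mrr"])  # 0.3865492275960769 after this fix
```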
eval-results/omnieval-auto/bge-m3_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.40570358210211727,
+      "map": 0.396066422528097
     },
     "generation": {
-      "em": 0.
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "em": 0.003156964263164541,
+      "f1": 0.25926214898204075,
+      "rouge1": 0.2635672919940079,
+      "rouge2": 0.13850004284564332,
+      "rougeL": 0.17457743506358883,
+      "accuracy": 0.4090794292208612,
+      "completeness": 0.609230539815091,
+      "hallucination": 0.0634184068058778,
+      "utilization": 0.52025545090956,
+      "numerical_accuracy": 0.30601370210606443
     }
   },
   "config": {
eval-results/omnieval-auto/gte-qwen2-1.5b_deepseek-v2-chat/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.44908027107799803,
+      "map": 0.4369785747358673
     },
     "generation": {
-      "em": 0.
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "em": 0.004714399966325714,
+      "f1": 0.3299866038099243,
+      "rouge1": 0.31137653557230416,
+      "rouge2": 0.17517183240145648,
+      "rougeL": 0.2279270260032969,
+      "accuracy": 0.409900239929284,
+      "completeness": 0.6072016768977392,
+      "hallucination": 0.0634046368643525,
+      "utilization": 0.519655704008222,
+      "numerical_accuracy": 0.31754059089699227
     }
   },
   "config": {
eval-results/omnieval-auto/gte-qwen2-1.5b_llama3-70b-instruct/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.44908027107799803,
+      "map": 0.4369785747358673
     },
     "generation": {
-      "em": 0.
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "em": 0.04034600328324283,
+      "f1": 0.4810416082778636,
+      "rouge1": 0.39948754207404946,
+      "rouge2": 0.23047720731140595,
+      "rougeL": 0.3235410874683177,
+      "accuracy": 0.43982826114408385,
+      "completeness": 0.5925646063170621,
+      "hallucination": 0.07924721546536935,
+      "utilization": 0.4753909254037426,
+      "numerical_accuracy": 0.3087947882736156
     }
   },
   "config": {
eval-results/omnieval-auto/gte-qwen2-1.5b_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.44908027107799803,
+      "map": 0.4369785747358673
     },
     "generation": {
-      "em": 0.
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "em": 0.004293471397903776,
+      "f1": 0.25632469571916017,
+      "rouge1": 0.26861074169895954,
+      "rouge2": 0.1444095692170222,
+      "rougeL": 0.17778126757506857,
+      "accuracy": 0.4326303826240687,
+      "completeness": 0.6255959475566151,
+      "hallucination": 0.04670259173723377,
+      "utilization": 0.5613256113256113,
+      "numerical_accuracy": 0.3292742328300049
     }
   },
   "config": {
eval-results/omnieval-auto/gte-qwen2-1.5b_yi15-34b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.44908027107799803,
+      "map": 0.4369785747358673
     },
     "generation": {
       "em": 0.0,
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "f1": 0.09576059934519404,
+      "rouge1": 0.1650998130595869,
+      "rouge2": 0.06697080375857452,
+      "rougeL": 0.05928647212536637,
+      "accuracy": 0.34019446899861094,
+      "completeness": 0.5778415961305925,
+      "hallucination": 0.059720954492111095,
+      "utilization": 0.42293577981651376,
+      "numerical_accuracy": 0.16823529411764707
     }
   },
   "config": {
eval-results/omnieval-auto/jina-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.34687460537946707,
+      "map": 0.3395167740034516
     },
     "generation": {
-      "em": 0.
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "em": 0.0040409142568506124,
+      "f1": 0.25528888107857534,
+      "rouge1": 0.2532119544207203,
+      "rouge2": 0.12795048070526135,
+      "rougeL": 0.16617984432034583,
+      "accuracy": 0.3907690364945069,
+      "completeness": 0.5980714606069667,
+      "hallucination": 0.07936304096571209,
+      "utilization": 0.5078436415070079,
+      "numerical_accuracy": 0.28370640291514837
     }
   },
   "config": {
eval-results/omnieval-human/bge-large-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.42523728170083525,
+      "map": 0.4153046697038724
     },
     "generation": {
-      "em": 0.
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "em": 0.003416856492027335,
+      "f1": 0.38699874429027187,
+      "rouge1": 0.3504002729437697,
+      "rouge2": 0.19632811311525056,
+      "rougeL": 0.24352337911354996,
+      "accuracy": 0.43251708428246016,
+      "completeness": 0.6223938223938223,
+      "hallucination": 0.07180694526191878,
+      "utilization": 0.5366863905325444,
+      "numerical_accuracy": 0.35452103849597133
     }
   },
   "config": {
eval-results/omnieval-human/bge-m3_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.4236332574031891,
+      "map": 0.41523348519362185
     },
     "generation": {
-      "em": 0.
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "em": 0.003986332574031891,
+      "f1": 0.39131580638847696,
+      "rouge1": 0.35726262162172084,
+      "rouge2": 0.20428265081202376,
+      "rougeL": 0.25173121998034476,
+      "accuracy": 0.4450455580865604,
+      "completeness": 0.6207692307692307,
+      "hallucination": 0.07088459285295841,
+      "utilization": 0.541031652989449,
+      "numerical_accuracy": 0.34715960324616774
     }
   },
   "config": {
eval-results/omnieval-human/e5-mistral-7b_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.386455960516325,
+      "map": 0.37688876233864843
     },
     "generation": {
       "em": 0.002277904328018223,
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "f1": 0.3787448936861267,
+      "rouge1": 0.34038227335702076,
+      "rouge2": 0.1898058362852231,
+      "rougeL": 0.23622836359261534,
+      "accuracy": 0.40689066059225515,
+      "completeness": 0.5954968944099379,
+      "hallucination": 0.07920792079207921,
+      "utilization": 0.5117027501462844,
+      "numerical_accuracy": 0.3050397877984085
     }
   },
   "config": {
eval-results/omnieval-human/gte-qwen2-1.5b_deepseek-v2-chat/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.45742217160212606,
+      "map": 0.4442720197418375
     },
     "generation": {
-      "em": 0.
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "em": 0.005125284738041002,
+      "f1": 0.4353357282548688,
+      "rouge1": 0.39114215500827765,
+      "rouge2": 0.2348958346329388,
+      "rougeL": 0.29164097017642365,
+      "accuracy": 0.4234054669703872,
+      "completeness": 0.60062893081761,
+      "hallucination": 0.075,
+      "utilization": 0.516044340723454,
+      "numerical_accuracy": 0.32132963988919666
     }
   },
   "config": {
eval-results/omnieval-human/gte-qwen2-1.5b_llama3-70b-instruct/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.45742217160212606,
+      "map": 0.4442720197418375
     },
     "generation": {
-      "em": 0.
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "em": 0.05125284738041002,
+      "f1": 0.5042287844817168,
+      "rouge1": 0.4252992013911242,
+      "rouge2": 0.25007376816549043,
+      "rougeL": 0.33900256076984714,
+      "accuracy": 0.4433371298405467,
+      "completeness": 0.574468085106383,
+      "hallucination": 0.11310904872389792,
+      "utilization": 0.47642607683352733,
+      "numerical_accuracy": 0.32676348547717843
     }
   },
   "config": {
eval-results/omnieval-human/gte-qwen2-1.5b_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.45742217160212606,
+      "map": 0.4442720197418375
     },
     "generation": {
-      "em": 0.
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization":
-      "numerical_accuracy": 0.
+      "em": 0.0028473804100227792,
+      "f1": 0.39189804056173694,
+      "rouge1": 0.36142455862500045,
+      "rouge2": 0.20781042503487615,
+      "rougeL": 0.2528346438884966,
+      "accuracy": 0.44760820045558086,
+      "completeness": 0.6189922480620155,
+      "hallucination": 0.061843640606767794,
+      "utilization": 0.5575686732904734,
+      "numerical_accuracy": 0.35951134380453753
     }
   },
   "config": {
eval-results/omnieval-human/gte-qwen2-1.5b_yi15-34b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.45742217160212606,
+      "map": 0.4442720197418375
     },
     "generation": {
       "em": 0.0,
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "f1": 0.15831651384807305,
+      "rouge1": 0.2195147064138981,
+      "rouge2": 0.09922121332360972,
+      "rougeL": 0.08869793021948827,
+      "accuracy": 0.3365603644646925,
+      "completeness": 0.5820836621941594,
+      "hallucination": 0.0648202710665881,
+      "utilization": 0.4234421364985163,
+      "numerical_accuracy": 0.18561001042752867
     }
   },
   "config": {
eval-results/omnieval-human/jina-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
 {
   "results": {
     "retrieval": {
-      "mrr": 0.
-      "map": 0.
+      "mrr": 0.3532839787395595,
+      "map": 0.3458285876993166
     },
     "generation": {
-      "em": 0.
-      "f1": 0.
-      "rouge1": 0.
-      "rouge2": 0.
-      "rougeL": 0.
-      "accuracy": 0.
-      "completeness": 0.
-      "hallucination": 0.
-      "utilization": 0.
-      "numerical_accuracy": 0.
+      "em": 0.003986332574031891,
+      "f1": 0.38207566850400565,
+      "rouge1": 0.3373954886971943,
+      "rouge2": 0.18428324959065878,
+      "rougeL": 0.2341310217806067,
+      "accuracy": 0.40888382687927105,
+      "completeness": 0.5930414386239249,
+      "hallucination": 0.08864426419466975,
+      "utilization": 0.516260162601626,
+      "numerical_accuracy": 0.3073351903435469
     }
   },
   "config": {
src/about.py CHANGED
@@ -43,118 +43,30 @@ TITLE = """<h1 align="center" id="space-title">🏅 OmniEval Leaderboard</h1>"""
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """
-<div align="center">OmniEval: Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain</div>
-"""
-
-# Which evaluations are you running? how can people reproduce what you have?
-LLM_BENCHMARKS_TEXT = f"""
-# <div align="center">OmniEval: Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain</div>
-
-
 <div align="center">
-
-
-<!-- <a href="https://huggingface.co/ShootingWong/OmniEval-ModelEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace%20Checkpoint-5fc372.svg></a> -->
-<!-- <a href="https://huggingface.co/ShootingWong/OmniEval-HallucinationEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace%20Checkpoint-b181d9.svg></a> -->
-<a href="https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-27b3b4></a>
-<a href="https://huggingface.co/ShootingWong/OmniEval-ModelEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-5fc372></a>
-<a href="https://huggingface.co/ShootingWong/OmniEval-HallucinationEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-b181d9></a>
-<a href="https://huggingface.co/spaces/NLPIR-RAG/OmniEval" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Leaderboard-blue></a>
-<a href="https://github.com/RUC-NLPIR/FlashRAG/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green"></a>
-<a><img alt="Static Badge" src="https://img.shields.io/badge/made_with-Python-blue"></a>
+Please contact us if you would like to submit your model to this leaderboard. Email: wangshuting@ruc.edu.cn
+如果您想将您的模型提交到此排行榜，请联系我们。邮箱：wangshuting@ruc.edu.cn
 </div>
+"""
 
-
-
-
-
-<p>
-<a href="#wrench-installation">Installation</a> |
-<!-- <a href="#sparkles-features">Features</a> | -->
-<a href="#rocket-quick-start">Quick-Start</a> |
-<a href="#bookmark-license">License</a> |
-<a href="#star2-citation">Citation</a>
-
-</p>
-
-</h4>
-
-<!--
-With FlashRAG and provided resources, you can effortlessly reproduce existing SOTA works in the RAG domain or implement your custom RAG processes and components. -->
-
-
-## 🔧 Installation
-`conda env create -f environment.yml && conda activate finrag`
-
-<!-- ## ✨ Features
-1. -->
-## 🚀 Quick-Start
-Notion:
-1. The code run path is `./OpenFinBench`
-2. We provide our auto-generated evaluation dataset in <a href="https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-27b3b4></a>
-### 1. Build the Retrieval Corpus
-```
-# cd OpenFinBench
-sh corpus_builder/build_corpus.sh # Please see the annotation inner the bash file to set parameters.
-```
-### 2. Generate Evaluation Data Samples
-1. Generate evaluation instances
-```
-# cd OpenFinBench
-sh data_generator/generate_data.sh
-```
-2. Filter (quality inspection) evaluation instances
-```
-sh data_generator/generate_data_filter.sh
-```
-### 3. Inference Your Models
-```
-# cd OpenFinBench
-sh evaluator/inference/rag_inference.sh
-```
-### 4. Evaluate Your Models
-#### (a) Rule-based Evaluation
-```
-# cd OpenFinBench
-sh evaluator/judgement/judger.sh # by setting judge_type="rule"
-```
-#### (b) Model-based Evalution
-We propose five model-based metric: accuracy, completeness, utilization, numerical_accuracy, and hallucination. We have trained two models from Qwen2.5-7B by the lora strategy and human-annotation labels to implement model-based evaluation.
-
-Note that the evaluator of hallucination is different from other four. Their model checkpoint can be load from the following huggingface links:
-1. The evaluator for hallucination metric: <a href="https://huggingface.co/ShootingWong/OmniEval-HallucinationEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-b181d9></a>
-2. The evaluator for other metric: <a href="https://huggingface.co/ShootingWong/OmniEval-ModelEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-5fc372></a>
-
-
-
-To implement model-based evaluation, you can first set up two vllm servers by the following codes:
-```
-```
-
-Then conduct the model-based evaluate using the following codes, (change the parameters inner the bash file).
-```
-sh evaluator/judgement/judger.sh
-```
-
-## 🔖 License
 
-OmniEval
+# Which evaluations are you running? how can people reproduce what you have?
+LLM_BENCHMARKS_TEXT = """
+# Leaderboard Information
 
-
-
+We introduce an omnidirectional and automatic RAG benchmark for the financial domain, **OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain**. Our benchmark is characterized by its multi-dimensional evaluation framework, including:
 
-
-## Pipeline
-1. Build corpus
-2. Data generation
-3. RAG inference
-4. Result evaluatioin
+1. a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios;
+2. a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations of generated instances;
+3. a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline;
+4. robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator.
 
-
-1. remove "baichuan"
-2. remove useless annotation -->
+Useful Links: 📝 [Paper](https://arxiv.org/abs/2412.13018) • 🤗 [Hugging Face](https://huggingface.co/collections/RUC-NLPIR/omnieval-67629ccbadd3a715a080fd25) • 🧩 [Github](https://github.com/RUC-NLPIR/OmniEval)
 
+We have trained two evaluator models from Qwen2.5-7B, using the LoRA strategy and human-annotated labels, to implement model-based evaluation. Note that the hallucination evaluator is separate from the evaluator used for the other four metrics.
+
+We provide the evaluator for all metrics except hallucination in this repo.
+
+# 🌟 Citation
 """
 
 EVALUATION_QUEUE_TEXT = """
@@ -189,4 +101,13 @@ If everything is done, check you can launch the EleutherAIHarness on your model
 
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
+@misc{wang2024omnievalomnidirectionalautomaticrag,
+      title={OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain},
+      author={Shuting Wang and Jiejun Tan and Zhicheng Dou and Ji-Rong Wen},
+      year={2024},
+      eprint={2412.13018},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2412.13018},
+}
 """
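For context on the model-based metrics referenced above: the two fine-tuned evaluator checkpoints linked from the removed badges are ShootingWong/OmniEval-ModelEvaluator and ShootingWong/OmniEval-HallucinationEvaluator. Below is a hypothetical usage sketch with 🤗 Transformers; the judge prompt and decoding settings are assumptions, and the actual judging pipeline (which serves the models via vLLM and `evaluator/judgement/judger.sh`) lives in the OmniEval GitHub repo:

```python
# Hypothetical sketch: querying one of the fine-tuned LLM evaluators directly.
# The prompt format and decoding settings are assumptions; see the OmniEval
# GitHub repo for the real judging pipeline.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Judges accuracy, completeness, utilization, and numerical_accuracy;
# hallucination uses the separate OmniEval-HallucinationEvaluator checkpoint.
ckpt = "ShootingWong/OmniEval-ModelEvaluator"

tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto")

prompt = "..."  # judge prompt built from the query, retrieved docs, and answer
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```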