zstanjj committed on
Commit
023a9be
1 Parent(s): ee70019

fix results

Files changed (17)
  1. README.md +33 -29
  2. eval-results/omnieval-auto/bge-large-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  3. eval-results/omnieval-auto/bge-m3_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  4. eval-results/omnieval-auto/gte-qwen2-1.5b_deepseek-v2-chat/results_2023-12-08 15:46:20.425378.json +12 -12
  5. eval-results/omnieval-auto/gte-qwen2-1.5b_llama3-70b-instruct/results_2023-12-08 15:46:20.425378.json +12 -12
  6. eval-results/omnieval-auto/gte-qwen2-1.5b_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  7. eval-results/omnieval-auto/gte-qwen2-1.5b_yi15-34b/results_2023-12-08 15:46:20.425378.json +11 -11
  8. eval-results/omnieval-auto/jina-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  9. eval-results/omnieval-human/bge-large-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  10. eval-results/omnieval-human/bge-m3_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  11. eval-results/omnieval-human/e5-mistral-7b_qwen2-72b/results_2023-12-08 15:46:20.425378.json +11 -11
  12. eval-results/omnieval-human/gte-qwen2-1.5b_deepseek-v2-chat/results_2023-12-08 15:46:20.425378.json +12 -12
  13. eval-results/omnieval-human/gte-qwen2-1.5b_llama3-70b-instruct/results_2023-12-08 15:46:20.425378.json +12 -12
  14. eval-results/omnieval-human/gte-qwen2-1.5b_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  15. eval-results/omnieval-human/gte-qwen2-1.5b_yi15-34b/results_2023-12-08 15:46:20.425378.json +11 -11
  16. eval-results/omnieval-human/jina-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json +12 -12
  17. src/about.py +24 -103
README.md CHANGED
@@ -10,36 +10,40 @@ license: apache-2.0
  short_description: Official Leaderboard for OmniEval
  ---
 
- # Start the configuration
-
- Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).
-
- Results files should have the following format and be stored as json files:
- ```json
- {
-     "config": {
-         "model_dtype": "torch.float16", # or torch.bfloat16 or 8bit or 4bit
-         "model_name": "path of the model on the hub: org/model",
-         "model_sha": "revision on the hub",
-     },
-     "results": {
-         "task_name": {
-             "metric_name": score,
-         },
-         "task_name2": {
-             "metric_name": score,
-         }
-     }
- }
- ```
-
- Request files are created automatically by this tool.
-
- If you encounter problems on the space, don't hesitate to restart it to remove the created eval-queue, eval-queue-bk, eval-results and eval-results-bk folders.
-
- # Code logic for more complex edits
-
- You'll find
- - the main table's column names and properties in `src/display/utils.py`
- - the logic to read all results and request files, then convert them into dataframe lines, in `src/leaderboard/read_evals.py` and `src/populate.py`
- - the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
+ ---
+ license: mit
+ language:
+ - zh
+ - en
+ base_model:
+ - Qwen/Qwen2.5-7B-Instruct
+ pipeline_tag: text-generation
+ ---
+
+ # Dataset Information
+
+ We introduce **OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain**. Our benchmark is characterized by its multi-dimensional evaluation framework, including:
+
+ 1. a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios;
+ 2. a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations of generated instances;
+ 3. a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline;
+ 4. robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator.
+
+ Useful Links: 📝 [Paper](https://arxiv.org/abs/2412.13018) • 🤗 [Hugging Face](https://huggingface.co/collections/RUC-NLPIR/omnieval-67629ccbadd3a715a080fd25) • 🧩 [GitHub](https://github.com/RUC-NLPIR/OmniEval)
+
+ We have trained two models from Qwen2.5-7B with the LoRA strategy and human-annotated labels to implement model-based evaluation. Note that the hallucination evaluator is different from the other four.
+
+ We provide the evaluator for all metrics except hallucination in this repo.
+
+ # 🌟 Citation
+ ```bibtex
+ @misc{wang2024omnievalomnidirectionalautomaticrag,
+       title={OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain},
+       author={Shuting Wang and Jiejun Tan and Zhicheng Dou and Ji-Rong Wen},
+       year={2024},
+       eprint={2412.13018},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2412.13018},
+ }
+ ```
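
This commit drops the template's results-file documentation from the README, but the files under `eval-results/` below still follow that schema: a top-level `results` dict keyed by task (`retrieval`, `generation`) holding metric-to-score maps, plus a `config` dict. A minimal sketch of flattening one file into the kind of row a leaderboard table displays (the helper name is illustrative, not from the repo):

```python
import json

def load_results_row(path: str) -> dict:
    """Flatten one OmniEval results file into a single metrics row."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    row = {}
    for task, metrics in data["results"].items():  # "retrieval", "generation"
        for metric, score in metrics.items():
            row[f"{task}/{metric}"] = score
    return row

row = load_results_row(
    "eval-results/omnieval-auto/bge-large-zh_qwen2-72b/"
    "results_2023-12-08 15:46:20.425378.json"
)
print(row["retrieval/mrr"], row["generation/accuracy"])
```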
eval-results/omnieval-auto/bge-large-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.3097634381445468,
-             "map": 0.30402197247127166
+             "mrr": 0.3865492275960769,
+             "map": 0.37771288462347935
          },
          "generation": {
-             "em": 0.0026518499810582142,
-             "f1": 0.2480828824153542,
-             "rouge1": 0.2493538725800514,
-             "rouge2": 0.1235656068292625,
-             "rougeL": 0.16098924930699862,
-             "accuracy": 0.3906427579239803,
-             "completeness": 0.5930474914396308,
-             "hallucination": 0.06504488096786783,
-             "utilization": 0.5045650189122212,
-             "numerical_accuracy": 0.28149656401119877
+             "em": 0.003156964263164541,
+             "f1": 0.254069117724313,
+             "rouge1": 0.25832549561659673,
+             "rouge2": 0.13269187125919746,
+             "rougeL": 0.16925453426436302,
+             "accuracy": 0.4080060613713853,
+             "completeness": 0.6048002385211687,
+             "hallucination": 0.05973250227243215,
+             "utilization": 0.5193561001042752,
+             "numerical_accuracy": 0.31237373737373736
          }
      },
      "config": {
eval-results/omnieval-auto/bge-m3_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.33076566906595944,
-             "map": 0.32402765500694536
+             "mrr": 0.40570358210211727,
+             "map": 0.396066422528097
          },
          "generation": {
-             "em": 0.002525571410531633,
-             "f1": 0.2524796046548042,
-             "rouge1": 0.2542055585319881,
-             "rouge2": 0.12967013110722864,
-             "rougeL": 0.16623387811734364,
-             "accuracy": 0.4025188916876574,
-             "completeness": 0.6033108522378908,
-             "hallucination": 0.07283603096410979,
-             "utilization": 0.5141388174807198,
-             "numerical_accuracy": 0.3162303664921466
+             "em": 0.003156964263164541,
+             "f1": 0.25926214898204075,
+             "rouge1": 0.2635672919940079,
+             "rouge2": 0.13850004284564332,
+             "rougeL": 0.17457743506358883,
+             "accuracy": 0.4090794292208612,
+             "completeness": 0.609230539815091,
+             "hallucination": 0.0634184068058778,
+             "utilization": 0.52025545090956,
+             "numerical_accuracy": 0.30601370210606443
          }
      },
      "config": {
eval-results/omnieval-auto/gte-qwen2-1.5b_deepseek-v2-chat/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.3406848507808225,
-             "map": 0.3337426863661236
+             "mrr": 0.44908027107799803,
+             "map": 0.4369785747358673
          },
          "generation": {
-             "em": 0.0035568464031653824,
-             "f1": 0.3226028700822056,
-             "rouge1": 0.29804464952499493,
-             "rouge2": 0.1619392409911174,
-             "rougeL": 0.21536150159516076,
-             "accuracy": 0.3783377209477247,
-             "completeness": 0.5935541629364369,
-             "hallucination": 0.06668379802132854,
-             "utilization": 0.48314821907315203,
-             "numerical_accuracy": 0.2761605035405193
+             "em": 0.004714399966325714,
+             "f1": 0.3299866038099243,
+             "rouge1": 0.31137653557230416,
+             "rouge2": 0.17517183240145648,
+             "rougeL": 0.2279270260032969,
+             "accuracy": 0.409900239929284,
+             "completeness": 0.6072016768977392,
+             "hallucination": 0.0634046368643525,
+             "utilization": 0.519655704008222,
+             "numerical_accuracy": 0.31754059089699227
          }
      },
      "config": {
eval-results/omnieval-auto/gte-qwen2-1.5b_llama3-70b-instruct/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.3406848507808225,
-             "map": 0.3337426863661236
+             "mrr": 0.44908027107799803,
+             "map": 0.4369785747358673
          },
          "generation": {
-             "em": 0.030906680136380857,
-             "f1": 0.4704248712273675,
-             "rouge1": 0.3844331865430577,
-             "rouge2": 0.21544656691735142,
-             "rougeL": 0.3082188596657867,
-             "accuracy": 0.4181714862987751,
-             "completeness": 0.586105675146771,
-             "hallucination": 0.0880543450397334,
-             "utilization": 0.45601078859491395,
-             "numerical_accuracy": 0.2751721876024926
+             "em": 0.04034600328324283,
+             "f1": 0.4810416082778636,
+             "rouge1": 0.39948754207404946,
+             "rouge2": 0.23047720731140595,
+             "rougeL": 0.3235410874683177,
+             "accuracy": 0.43982826114408385,
+             "completeness": 0.5925646063170621,
+             "hallucination": 0.07924721546536935,
+             "utilization": 0.4753909254037426,
+             "numerical_accuracy": 0.3087947882736156
          }
      },
      "config": {
eval-results/omnieval-auto/gte-qwen2-1.5b_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.3406848507808225,
-             "map": 0.3337426863661236
+             "mrr": 0.44908027107799803,
+             "map": 0.4369785747358673
          },
          "generation": {
-             "em": 0.0028412678368480867,
-             "f1": 0.2477112059712835,
-             "rouge1": 0.25666135328401396,
-             "rouge2": 0.13256084364546591,
-             "rougeL": 0.1669344569228441,
-             "accuracy": 0.40573304710190683,
-             "completeness": 0.6131668895824045,
-             "hallucination": 0.05456183245399562,
-             "utilization": 0.5346272891410885,
-             "numerical_accuracy": 0.2971301335972291
+             "em": 0.004293471397903776,
+             "f1": 0.25632469571916017,
+             "rouge1": 0.26861074169895954,
+             "rouge2": 0.1444095692170222,
+             "rougeL": 0.17778126757506857,
+             "accuracy": 0.4326303826240687,
+             "completeness": 0.6255959475566151,
+             "hallucination": 0.04670259173723377,
+             "utilization": 0.5613256113256113,
+             "numerical_accuracy": 0.3292742328300049
          }
      },
      "config": {
eval-results/omnieval-auto/gte-qwen2-1.5b_yi15-34b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.3406848507808225,
-             "map": 0.3337426863661236
+             "mrr": 0.44908027107799803,
+             "map": 0.4369785747358673
          },
          "generation": {
              "em": 0.0,
-             "f1": 0.09732568803130702,
-             "rouge1": 0.1642342072893325,
-             "rouge2": 0.06542075931397044,
-             "rougeL": 0.059256539829821125,
-             "accuracy": 0.3304375804375804,
-             "completeness": 0.5735068912710567,
-             "hallucination": 0.06555017663221248,
-             "utilization": 0.4132755170113409,
-             "numerical_accuracy": 0.175
+             "f1": 0.09576059934519404,
+             "rouge1": 0.1650998130595869,
+             "rouge2": 0.06697080375857452,
+             "rougeL": 0.05928647212536637,
+             "accuracy": 0.34019446899861094,
+             "completeness": 0.5778415961305925,
+             "hallucination": 0.059720954492111095,
+             "utilization": 0.42293577981651376,
+             "numerical_accuracy": 0.16823529411764707
          }
      },
      "config": {
eval-results/omnieval-auto/jina-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.25315906890600665,
-             "map": 0.24830681483352277
+             "mrr": 0.34687460537946707,
+             "map": 0.3395167740034516
          },
          "generation": {
-             "em": 0.0026518499810582142,
-             "f1": 0.24837825152624493,
-             "rouge1": 0.24111819423215256,
-             "rouge2": 0.11665848753826197,
-             "rougeL": 0.1558018779014647,
-             "accuracy": 0.3705644652102538,
-             "completeness": 0.5820335932813437,
-             "hallucination": 0.09210356820816695,
-             "utilization": 0.4738984364905027,
-             "numerical_accuracy": 0.24648820567187915
+             "em": 0.0040409142568506124,
+             "f1": 0.25528888107857534,
+             "rouge1": 0.2532119544207203,
+             "rouge2": 0.12795048070526135,
+             "rougeL": 0.16617984432034583,
+             "accuracy": 0.3907690364945069,
+             "completeness": 0.5980714606069667,
+             "hallucination": 0.07936304096571209,
+             "utilization": 0.5078436415070079,
+             "numerical_accuracy": 0.28370640291514837
          }
      },
      "config": {
eval-results/omnieval-human/bge-large-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.3426063022019742,
-             "map": 0.33500379650721335
+             "mrr": 0.42523728170083525,
+             "map": 0.4153046697038724
          },
          "generation": {
-             "em": 0.0017084282460136675,
-             "f1": 0.3797528411547138,
-             "rouge1": 0.3372893350582966,
-             "rouge2": 0.18329984910669803,
-             "rougeL": 0.23230144566069125,
-             "accuracy": 0.40888382687927105,
-             "completeness": 0.6021044427123928,
-             "hallucination": 0.08138173302107728,
-             "utilization": 0.5014637002341921,
-             "numerical_accuracy": 0.3100358422939068
+             "em": 0.003416856492027335,
+             "f1": 0.38699874429027187,
+             "rouge1": 0.3504002729437697,
+             "rouge2": 0.19632811311525056,
+             "rougeL": 0.24352337911354996,
+             "accuracy": 0.43251708428246016,
+             "completeness": 0.6223938223938223,
+             "hallucination": 0.07180694526191878,
+             "utilization": 0.5366863905325444,
+             "numerical_accuracy": 0.35452103849597133
          }
      },
      "config": {
eval-results/omnieval-human/bge-m3_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.3527809415337889,
-             "map": 0.3458855353075171
+             "mrr": 0.4236332574031891,
+             "map": 0.41523348519362185
          },
          "generation": {
-             "em": 0.0017084282460136675,
-             "f1": 0.38645032979631466,
-             "rouge1": 0.3467267951634575,
-             "rouge2": 0.1930581604826183,
-             "rougeL": 0.24141093461883717,
-             "accuracy": 0.4271070615034169,
-             "completeness": 0.6119287374128582,
-             "hallucination": 0.07481005260081823,
-             "utilization": 0.5400116822429907,
-             "numerical_accuracy": 0.3372093023255814
+             "em": 0.003986332574031891,
+             "f1": 0.39131580638847696,
+             "rouge1": 0.35726262162172084,
+             "rouge2": 0.20428265081202376,
+             "rougeL": 0.25173121998034476,
+             "accuracy": 0.4450455580865604,
+             "completeness": 0.6207692307692307,
+             "hallucination": 0.07088459285295841,
+             "utilization": 0.541031652989449,
+             "numerical_accuracy": 0.34715960324616774
          }
      },
      "config": {
eval-results/omnieval-human/e5-mistral-7b_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.303246013667426,
-             "map": 0.2960516324981017
+             "mrr": 0.386455960516325,
+             "map": 0.37688876233864843
          },
          "generation": {
              "em": 0.002277904328018223,
-             "f1": 0.3705164550873997,
-             "rouge1": 0.3270311806826159,
-             "rouge2": 0.17476659877087528,
-             "rougeL": 0.22225645997479143,
-             "accuracy": 0.385250569476082,
-             "completeness": 0.5877535101404057,
-             "hallucination": 0.0924956369982548,
-             "utilization": 0.4793244030285381,
-             "numerical_accuracy": 0.28622540250447226
+             "f1": 0.3787448936861267,
+             "rouge1": 0.34038227335702076,
+             "rouge2": 0.1898058362852231,
+             "rougeL": 0.23622836359261534,
+             "accuracy": 0.40689066059225515,
+             "completeness": 0.5954968944099379,
+             "hallucination": 0.07920792079207921,
+             "utilization": 0.5117027501462844,
+             "numerical_accuracy": 0.3050397877984085
          }
      },
      "config": {
eval-results/omnieval-human/gte-qwen2-1.5b_deepseek-v2-chat/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.36173120728929387,
-             "map": 0.3512338648443432
+             "mrr": 0.45742217160212606,
+             "map": 0.4442720197418375
          },
          "generation": {
-             "em": 0.0056947608200455585,
-             "f1": 0.4212862409737785,
-             "rouge1": 0.3707328288930376,
-             "rouge2": 0.21393113234607009,
-             "rougeL": 0.2719847145278759,
-             "accuracy": 0.3886674259681093,
-             "completeness": 0.5858823529411765,
-             "hallucination": 0.07893209518282066,
-             "utilization": 0.48166472642607683,
-             "numerical_accuracy": 0.27365491651205937
+             "em": 0.005125284738041002,
+             "f1": 0.4353357282548688,
+             "rouge1": 0.39114215500827765,
+             "rouge2": 0.2348958346329388,
+             "rougeL": 0.29164097017642365,
+             "accuracy": 0.4234054669703872,
+             "completeness": 0.60062893081761,
+             "hallucination": 0.075,
+             "utilization": 0.516044340723454,
+             "numerical_accuracy": 0.32132963988919666
          }
      },
      "config": {
eval-results/omnieval-human/gte-qwen2-1.5b_llama3-70b-instruct/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.36173120728929387,
-             "map": 0.3512338648443432
+             "mrr": 0.45742217160212606,
+             "map": 0.4442720197418375
          },
          "generation": {
-             "em": 0.04555808656036447,
-             "f1": 0.4907954247383474,
-             "rouge1": 0.4080491070348775,
-             "rouge2": 0.23130474174425783,
-             "rougeL": 0.3217574785678875,
-             "accuracy": 0.4216970387243736,
-             "completeness": 0.5688146380270486,
-             "hallucination": 0.11832946635730858,
-             "utilization": 0.4491869918699187,
-             "numerical_accuracy": 0.288981288981289
+             "em": 0.05125284738041002,
+             "f1": 0.5042287844817168,
+             "rouge1": 0.4252992013911242,
+             "rouge2": 0.25007376816549043,
+             "rougeL": 0.33900256076984714,
+             "accuracy": 0.4433371298405467,
+             "completeness": 0.574468085106383,
+             "hallucination": 0.11310904872389792,
+             "utilization": 0.47642607683352733,
+             "numerical_accuracy": 0.32676348547717843
          }
      },
      "config": {
eval-results/omnieval-human/gte-qwen2-1.5b_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.36173120728929387,
-             "map": 0.3512338648443432
+             "mrr": 0.45742217160212606,
+             "map": 0.4442720197418375
          },
          "generation": {
-             "em": 0.002277904328018223,
-             "f1": 0.3804001391052641,
-             "rouge1": 0.34576336184459094,
-             "rouge2": 0.1928778762677512,
-             "rougeL": 0.2383694455084706,
-             "accuracy": 0.4145785876993166,
-             "completeness": 0.598297213622291,
-             "hallucination": 0.07213496218731821,
-             "utilization": 1.13922942206655,
-             "numerical_accuracy": 0.3218694885361552
+             "em": 0.0028473804100227792,
+             "f1": 0.39189804056173694,
+             "rouge1": 0.36142455862500045,
+             "rouge2": 0.20781042503487615,
+             "rougeL": 0.2528346438884966,
+             "accuracy": 0.44760820045558086,
+             "completeness": 0.6189922480620155,
+             "hallucination": 0.061843640606767794,
+             "utilization": 0.5575686732904734,
+             "numerical_accuracy": 0.35951134380453753
          }
      },
      "config": {
eval-results/omnieval-human/gte-qwen2-1.5b_yi15-34b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.36173120728929387,
-             "map": 0.3512338648443432
+             "mrr": 0.45742217160212606,
+             "map": 0.4442720197418375
          },
          "generation": {
              "em": 0.0,
-             "f1": 0.16041349053275844,
-             "rouge1": 0.21775697114621573,
-             "rouge2": 0.09738983880706074,
-             "rougeL": 0.08775246194460379,
-             "accuracy": 0.3211845102505695,
-             "completeness": 0.5703789636504254,
-             "hallucination": 0.07665094339622641,
-             "utilization": 0.40828402366863903,
-             "numerical_accuracy": 0.162
+             "f1": 0.15831651384807305,
+             "rouge1": 0.2195147064138981,
+             "rouge2": 0.09922121332360972,
+             "rougeL": 0.08869793021948827,
+             "accuracy": 0.3365603644646925,
+             "completeness": 0.5820836621941594,
+             "hallucination": 0.0648202710665881,
+             "utilization": 0.4234421364985163,
+             "numerical_accuracy": 0.18561001042752867
          }
      },
      "config": {
eval-results/omnieval-human/jina-zh_qwen2-72b/results_2023-12-08 15:46:20.425378.json CHANGED
@@ -1,20 +1,20 @@
  {
      "results": {
          "retrieval": {
-             "mrr": 0.27484813971146543,
-             "map": 0.26924354593773725
+             "mrr": 0.3532839787395595,
+             "map": 0.3458285876993166
          },
          "generation": {
-             "em": 0.003416856492027335,
-             "f1": 0.37960439080933656,
-             "rouge1": 0.3255380867320351,
-             "rouge2": 0.1732248556904568,
-             "rougeL": 0.22591939162851002,
-             "accuracy": 0.3826879271070615,
-             "completeness": 0.5793588741204065,
-             "hallucination": 0.0897510133178923,
-             "utilization": 0.4855072463768116,
-             "numerical_accuracy": 0.2663594470046083
+             "em": 0.003986332574031891,
+             "f1": 0.38207566850400565,
+             "rouge1": 0.3373954886971943,
+             "rouge2": 0.18428324959065878,
+             "rougeL": 0.2341310217806067,
+             "accuracy": 0.40888382687927105,
+             "completeness": 0.5930414386239249,
+             "hallucination": 0.08864426419466975,
+             "utilization": 0.516260162601626,
+             "numerical_accuracy": 0.3073351903435469
          }
      },
      "config": {
src/about.py CHANGED
@@ -43,118 +43,30 @@ TITLE = """<h1 align="center" id="space-title">🏅 OmniEval Leaderboard</h1>"""
 
  # What does your leaderboard evaluate?
  INTRODUCTION_TEXT = """
- <div align="center">OmniEval: Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain</div>
- """
-
- # Which evaluations are you running? how can people reproduce what you have?
- LLM_BENCHMARKS_TEXT = f"""
- # <div align="center">OmniEval: Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain</div>
-
-
  <div align="center">
- <!-- <a href="https://arxiv.org/abs/2405.13576" target="_blank"><img src=https://img.shields.io/badge/arXiv-b5212f.svg?logo=arxiv></a> -->
- <!-- <a href="https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace%20Datasets-27b3b4.svg></a> -->
- <!-- <a href="https://huggingface.co/ShootingWong/OmniEval-ModelEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace%20Checkpoint-5fc372.svg></a> -->
- <!-- <a href="https://huggingface.co/ShootingWong/OmniEval-HallucinationEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace%20Checkpoint-b181d9.svg></a> -->
- <a href="https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-27b3b4></a>
- <a href="https://huggingface.co/ShootingWong/OmniEval-ModelEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-5fc372></a>
- <a href="https://huggingface.co/ShootingWong/OmniEval-HallucinationEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-b181d9></a>
- <a href="https://huggingface.co/spaces/NLPIR-RAG/OmniEval" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Leaderboard-blue></a>
- <a href="https://github.com/RUC-NLPIR/FlashRAG/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/LICENSE-MIT-green"></a>
- <a><img alt="Static Badge" src="https://img.shields.io/badge/made_with-Python-blue"></a>
+ Please contact us if you would like to submit your model to this leaderboard. Email: wangshuting@ruc.edu.cn
+ 如果您想将您的模型提交到此排行榜,请联系我们。邮箱:wangshuting@ruc.edu.cn
  </div>
-
- <!-- [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Leaderboard-blue)](https://huggingface.co/spaces/Vchitect/VBench_Leaderboard) -->
-
- <h4 align="center">
-
- <p>
- <a href="#wrench-installation">Installation</a> |
- <!-- <a href="#sparkles-features">Features</a> | -->
- <a href="#rocket-quick-start">Quick-Start</a> |
- <a href="#bookmark-license">License</a> |
- <a href="#star2-citation">Citation</a>
-
- </p>
-
- </h4>
-
- <!--
- With FlashRAG and provided resources, you can effortlessly reproduce existing SOTA works in the RAG domain or implement your custom RAG processes and components. -->
-
-
- ## 🔧 Installation
- `conda env create -f environment.yml && conda activate finrag`
-
- <!-- ## ✨ Features
- 1. -->
- ## 🚀 Quick-Start
- Note:
- 1. The code run path is `./OpenFinBench`
- 2. We provide our auto-generated evaluation dataset in <a href="https://huggingface.co/datasets/RUC-NLPIR/FlashRAG_datasets/" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-27b3b4></a>
- ### 1. Build the Retrieval Corpus
- ```
- # cd OpenFinBench
- sh corpus_builder/build_corpus.sh # Please see the comments inside the bash file to set parameters.
- ```
- ### 2. Generate Evaluation Data Samples
- 1. Generate evaluation instances
- ```
- # cd OpenFinBench
- sh data_generator/generate_data.sh
- ```
- 2. Filter (quality inspection) evaluation instances
- ```
- sh data_generator/generate_data_filter.sh
- ```
- ### 3. Inference Your Models
- ```
- # cd OpenFinBench
- sh evaluator/inference/rag_inference.sh
- ```
- ### 4. Evaluate Your Models
- #### (a) Rule-based Evaluation
- ```
- # cd OpenFinBench
- sh evaluator/judgement/judger.sh # by setting judge_type="rule"
- ```
- #### (b) Model-based Evaluation
- We propose five model-based metrics: accuracy, completeness, utilization, numerical_accuracy, and hallucination. We have trained two models from Qwen2.5-7B with the LoRA strategy and human-annotated labels to implement model-based evaluation.
-
- Note that the evaluator for hallucination is different from the other four. The model checkpoints can be loaded from the following Hugging Face links:
- 1. The evaluator for the hallucination metric: <a href="https://huggingface.co/ShootingWong/OmniEval-HallucinationEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-b181d9></a>
- 2. The evaluator for the other metrics: <a href="https://huggingface.co/ShootingWong/OmniEval-ModelEvaluator" target="_blank"><img src=https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-5fc372></a>
-
- To implement model-based evaluation, you can first set up two vllm servers with the following code:
- ```
- ```
-
- Then conduct the model-based evaluation using the following command (change the parameters inside the bash file):
- ```
- sh evaluator/judgement/judger.sh
- ```
-
- ## 🔖 License
-
- OmniEval is licensed under the [<u>MIT License</u>](./LICENSE).
-
- ## 🌟 Citation
- The paper is waiting to be released!
-
- <!-- # Check Infos
- ## Pipeline
- 1. Build corpus
- 2. Data generation
- 3. RAG inference
- 4. Result evaluation
-
- ## Code
- 1. remove "baichuan"
- 2. remove useless annotation -->
+ """
+
+ # Which evaluations are you running? how can people reproduce what you have?
+ LLM_BENCHMARKS_TEXT = """
+ # Leaderboard Information
+
+ We introduce **OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain**. Our benchmark is characterized by its multi-dimensional evaluation framework, including:
+
+ 1. a matrix-based RAG scenario evaluation system that categorizes queries into five task classes and 16 financial topics, leading to a structured assessment of diverse query scenarios;
+ 2. a multi-dimensional evaluation data generation approach, which combines GPT-4-based automatic generation and human annotation, achieving an 87.47% acceptance ratio in human evaluations of generated instances;
+ 3. a multi-stage evaluation system that evaluates both retrieval and generation performance, resulting in a comprehensive evaluation of the RAG pipeline;
+ 4. robust evaluation metrics derived from rule-based and LLM-based ones, enhancing the reliability of assessments through manual annotations and supervised fine-tuning of an LLM evaluator.
+
+ Useful Links: 📝 [Paper](https://arxiv.org/abs/2412.13018) • 🤗 [Hugging Face](https://huggingface.co/collections/RUC-NLPIR/omnieval-67629ccbadd3a715a080fd25) • 🧩 [GitHub](https://github.com/RUC-NLPIR/OmniEval)
+
+ We have trained two models from Qwen2.5-7B with the LoRA strategy and human-annotated labels to implement model-based evaluation. Note that the hallucination evaluator is different from the other four.
+
+ We provide the evaluator for all metrics except hallucination in this repo.
+
+ # 🌟 Citation
  """
 
  EVALUATION_QUEUE_TEXT = """
@@ -189,4 +101,13 @@ If everything is done, check you can launch the EleutherAIHarness on your model
 
  CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
  CITATION_BUTTON_TEXT = r"""
+ @misc{wang2024omnievalomnidirectionalautomaticrag,
+       title={OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain},
+       author={Shuting Wang and Jiejun Tan and Zhicheng Dou and Ji-Rong Wen},
+       year={2024},
+       eprint={2412.13018},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2412.13018},
+ }
  """