verifact / verifact_data.csv
tier,model,FactBench,Reddit,Overall
F1,GPT4o,80.93,42.76,67.41
F1,Claude 3.5-Sonnet,75.68,42.90,63.65
F1,Gemini 1.5-Flash,77.38,40.26,64.10
F1,Llama3.1-8b,60.71,28.86,48.62
F1,Llama3.1-70b,65.83,38.61,55.12
F1,Llama3.1-405B,73.23,38.98,60.61
F1,Qwen2.5-8b,69.23,37.25,55.78
F1,Qwen2.5-32b,71.31,37.34,60.00
Recall,GPT4o,77.13,30.06,57.93
Recall,Claude 3.5-Sonnet,69.35,30.69,53.58
Recall,Gemini 1.5-Flash,70.71,27.67,53.16
Recall,Llama3.1-8b,54.28,20.39,40.46
Recall,Llama3.1-70b,58.00,29.31,46.30
Recall,Llama3.1-405B,68.40,28.00,51.92
Recall,Qwen2.5-8b,58.66,26.01,45.34
Recall,Qwen2.5-32b,62.77,25.38,47.52
Precision,GPT4o,85.11,74.04,80.59
Precision,Claude 3.5-Sonnet,83.28,71.25,78.37
Precision,Gemini 1.5-Flash,85.45,73.87,80.72
Precision,Llama3.1-8b,68.87,49.36,60.91
Precision,Llama3.1-70b,76.05,56.54,68.09
Precision,Llama3.1-405B,78.80,64.10,72.80
Precision,Qwen2.5-8b,77.18,65.58,72.45
Precision,Qwen2.5-32b,82.74,70.60,77.79