ChineseSafe-Benchmark/data/chinese_benchmark_gen.csv
Model Size Accuracy/std Precision_Unsafe/std Recall_Unsafe/std Precision_Safe/std Recall_Safe/std
DeepSeek-LLM-67B-Chat >65B 76.76/0.35 73.40/0.37 84.26/0.40 81.34/0.35 69.19/0.64
Llama3-ChatQA-1.5-70B >65B 65.29/0.29 66.24/0.50 62.92/0.12 64.43/0.19 67.69/0.63
Qwen1.5-72B-Chat >65B 62.91/0.50 73.86/0.84 40.46/0.97 58.75/0.35 85.55/0.62
Qwen2.5-72B >65B 58.00/0.12 65.34/0.26 34.86/0.48 55.31/0.06 81.35/0.14
Opt-66B >65B 54.46/0.17 53.22/0.06 76.94/0.24 57.73/0.49 31.77/0.28
Qwen2-72B >65B 52.21/0.52 54.27/1.09 30.79/0.50 51.39/0.31 73.82/0.63
Qwen2.5-32B ~30B 63.01/0.16 77.14/0.40 37.45/0.40 58.46/0.05 88.80/0.11
Yi-1.5-34B-Chat ~30B 60.06/0.43 58.14/0.40 72.51/0.55 63.27/0.56 47.56/0.42
Opt-30B ~30B 50.88/0.11 50.76/0.12 72.95/0.16 51.18/0.26 28.62/0.28
InternLM2-Chat-20B 10B~20B 70.21/0.55 73.30/0.70 63.79/0.43 67.82/0.45 76.65/0.67
Qwen1.5-14B 10B~20B 68.25/0.44 65.87/0.37 76.02/0.72 71.51/0.59 60.44/0.20
Baichuan2-13B-Chat 10B~20B 62.86/0.31 64.17/0.33 58.61/0.80 61.75/0.30 67.13/0.56
Ziya2-13B-Chat 10B~20B 53.40/0.43 53.33/0.38 56.18/0.41 53.48/0.53 50.62/0.61
Opt-13B 10B~20B 50.18/0.26 50.29/0.20 69.97/0.37 49.94/0.47 30.22/0.31
Gemma-1.1-7B 5B~10B 71.70/0.26 68.66/0.37 80.11/0.05 76.00/0.09 63.26/0.47
DeepSeek-LLM-7B-Chat 5B~10B 71.63/0.17 69.50/0.15 77.33/0.67 74.33/0.41 65.90/0.38
GLM-4-9B-Chat 5B~10B 70.96/0.23 82.15/0.55 53.73/0.48 65.50/0.18 88.27/0.41
Mistral-7B 5B~10B 70.41/0.41 68.55/0.52 75.67/0.22 72.71/0.26 65.12/0.58
Qwen1.5-7B-Chat 5B~10B 70.36/0.39 64.66/0.27 90.09/0.57 83.55/0.82 50.53/0.18
Yi-1.5-9B-Chat 5B~10B 62.12/0.38 64.42/0.42 54.53/0.43 60.43/0.36 69.75/0.37
Llama3-ChatQA-1.5-8B 5B~10B 61.28/0.40 57.63/0.20 85.84/0.43 72.02/0.95 36.61/0.54
Baichuan2-7B 5B~10B 59.43/0.24 72.06/0.66 31.11/0.40 55.95/0.12 87.89/0.20
InternLM2-chat-7B 5B~10B 58.79/0.09 62.70/0.19 43.88/0.17 56.68/0.14 73.77/0.13
GPT-J-6B 5B~10B 52.65/0.32 52.42/0.32 62.00/0.42 52.99/0.37 43.21/0.92
Opt-6.7B 5B~10B 50.00/0.11 50.17/0.17 64.70/0.35 49.69/0.04 35.18/0.44
GPT-4o API 73.78/0.30 97.75/0.13 48.66/0.04 65.84/0.55 98.88/0.04
GPT-4-Turbo API 71.67/0.17 80.13/0.64 57.59/0.69 66.93/0.44 85.74/0.35
Perspective API 69.28/0.32 69.96/0.79 67.49/0.32 68.64/0.32 71.06/0.43
GPT-3.5 API 64.70/0.44 76.12/0.55 42.79/0.64 60.24/0.76 86.59/0.32