shisa-v2 Base Model ablation

Using a fork of Lightblue's Shaberi benchmark framework:

Model Average ELYZA-tasks-100 MT-Bench Rakuda Tengu-Bench
gpt-4-turbo-2024-04-09 8.75 8.78 8.74 9.18 8.31
CohereForAI/c4ai-command-r-plus 7.69 7.50 7.43 9.05 6.79
gpt-3.5-turbo-0125 7.17 7.24 6.98 7.64 6.82
shisa-ai/shisa-v1-llama3-70b 7.17 7.16 7.45 7.98 6.09
karakuri-ai/karakuri-lm-70b-chat-v0.1 6.84 6.86 6.43 7.85 6.23
lightblue/ao-karasu-72B 6.81 7.19 6.54 7.25 6.27
shisa-ai/shisa-v1-llama3-8b^ 6.29 6.62 6.41 7.05 5.07
shisa-ai/shisa-swallowmx-13a47b-v1 6.17 6.48 6.07 7.11 5.03
shisa-ai/shisa-v1-llama3-8b 6.10 6.52 6.20 6.37 5.33
Rakuten/RakutenAI-7B-chat 5.58 5.92 4.60 6.58 5.24
shisa-ai/shisa-v1-gemma-8b 5.64 6.50 5.42 5.10 5.55
augmxnt/shisa-gamma-7b-v1 5.56 5.84 4.00 6.73 5.68
lightblue/qarasu-14B-chat-plus-unleashed 5.20 5.58 4.74 5.46 5.01
cyberagent/calm2-7b-chat 4.76 4.90 3.58 5.75 4.81
mistralai/Mistral-7B-Instruct-v0.2 4.69 5.78 4.65 3.80 4.53
shisa-ai/shisa-v1-yi1.5-9b 4.63 5.98 4.28 3.26 5.00

^ Shaberi uses temperature=0.0, no sampling, for all generations by default. This is actually different from JA MT-Bench's default settings which has different temperature per category. This means that Shaberi's results can't be compared to other JA MT-Bench results (like my comparison chart or the Nejumi Leaderboard). Like some other models, if you look at the results you'll notice repetition loops. For Llama models, you usually want something like a repetition_penalty of 1.15/1.18 to get rid of repetition loops. Because Shaberi uses the vLLM's OpenAI API server, it doesn't support repetition penalty, doing a frequency_penalty sweep (0.0, 0.5, 0.8) I found 0.5 to remove repetitions and improve output in general. There is no decay/window so for long generations, this may not be optimal. For the improved generations, I used the following sampler settings: temperature 0.2, min_p 0.1, frequency_penalty 0.5 (OpenAI doesn't support min_p, but vLLM adds it and it's basically always the superior sampler).

Downloads last month
31
Safetensors
Model size
8.03B params
Tensor type
BF16
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for shisa-ai/shisa-v1-llama3-8b.2e5

Finetuned
(487)
this model

Dataset used to train shisa-ai/shisa-v1-llama3-8b.2e5

Space using shisa-ai/shisa-v1-llama3-8b.2e5 1