Thank you for your excellent contribution; however, this model seems to have some quality issues.

#1
by noneUsername - opened

Thank you for your work. Your achievements with abliterated models are plain for everyone to see. I like your models very much and have quantized several of them. This discussion is not meant to blame you or to ask you to remake this model. Let me make clear up front that its purpose is to provide some clues about the quality of this model: I am only sharing the information I have, not criticizing or making demands.
Regardless of its quality, this model is still a contribution to the community. I want to state that first.

I greatly respect your experience in making abliterated models and have recommended your models to others I know. Mistral-2501 is an outstanding model that has received a lot of attention from the community, so I suggested that others wait for the abliterated version of Mistral-2501 to come out.
Recently, I completed a W8A8 quantization of your model so that I can run it on my local machine. However, when I started to evaluate the model's performance, I reached a disappointing conclusion, similar to feedback someone I know had given me before: "The abliterated version of Mistral-2501 does not show outstanding talent in ERP tests."
I thought the matter was settled after reaching the same conclusion as others, until a community member who had reported the poor performance of the abliterated version shared an example of the original Mistral-2501 giving an excellent response in the ERP test, with multiple regenerations almost always producing similarly excellent responses. The abliterated version, in contrast, produced responses with factual deviations (in this case, the NPC could not correctly understand the player's identity impersonation).
This piqued my interest, so I began to investigate how much abliteration reduces the intelligence of the original LLM. I assumed that some loss of intelligence was the inevitable result of abliteration, and made a W8A8 version of this model for local deployment and testing.
My testing process was to take the W8A8 quantization of the original model and the W8A8 quantization of this model, sample a final response at temp=0 in the same context, and manually compare the differences between the two models' responses (a minimal sketch of this comparison is shown below).
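
Roughly, the comparison looks like the following. This is only a sketch: the model paths and the prompt are placeholders, not my actual test script.

```python
# Sketch of the A/B comparison described above: load each W8A8 checkpoint in
# vLLM, generate greedily (temperature=0) from the same context, and compare
# the outputs by hand. Paths and the prompt below are placeholders.
from vllm import LLM, SamplingParams

prompts = ["<the same roleplay context, rendered with the chat template>"]
greedy = SamplingParams(temperature=0, max_tokens=512)

for path in ["/models/Mistral-Small-24B-Instruct-2501-W8A8",
             "/models/Mistral-Small-24B-Instruct-2501-abliterated-W8A8"]:
    llm = LLM(model=path)  # in practice, run one model per process to free GPU memory
    for out in llm.generate(prompts, greedy):
        print(path, "->", out.outputs[0].text)
```
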
But even after making multiple quantized versions of this LLM, what I found in the comparison cannot simply be called "intelligence reduction", because the abliterated model made frequent low-level errors, such as confusing roles: a character would use its own title when addressing another character in dialogue. Such low-level errors never appeared in the original Mistral-2501, but appeared occasionally in the abliterated version (roughly one in every ten test cases).
Such low-level errors cannot be explained by "intelligence reduction"; it is more as if something is broken inside the LLM, so I began to suspect that there may be defects in the production of this model. Here is some further, more broadly reproducible evidence:

  1. There is an abnormal num_calibration_samples dependency when quantizing to W8A8.
    I used the standard script provided by vllm for W8A8 quantization (a sketch of the recipe is included after this list). With the hyperparameter smoothing_strength=0.86, the original model only needed 512 calibration samples to complete the quantization, while the abliterated version required 2048; trying 512 and 1024 both resulted in errors. This is something I had never encountered in previous quantizations, and it means there is a large difference between this model and the original.

  2. The standard error (Stderr) in the gsm8k evaluation from lm_eval has increased significantly.
    I examined huihui-ai/phi-4-abliterated, an abliterated model you made in the past. Its bf16 version was tested on 2x 4090D with the following command: lm_eval --model vllm --model_args pretrained="/root/autodl-tmp/phi-4-abliterated",add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16 --tasks gsm8k --num_fewshot 5 --limit 250. The results are at the top of this page: https://huggingface.co/noneUsername/huihui-ai-phi-4-abliterated-W8A8. Its Stderr values are 0.0164 and 0.0164, and the original model tested with the same command also gives Stderr values of 0.0164 and 0.0164, so the Stderr is almost unchanged before and after abliteration.
    Comparing the test results of this model (https://huggingface.co/noneUsername/Mistral-Small-24B-Instruct-2501-abliterated-W8A8-better) with those of the original model (https://huggingface.co/noneUsername/Mistral-Small-24B-Instruct-2501-W8A8), the Stderr increased from 0.0164 and 0.0176 for the original to 0.0180 and 0.0183. This is a significant difference between this model and the phi-4-abliterated you made: this model shows a clear increase in the standard error (Stderr) on the gsm8k evaluation.
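
For reference, the W8A8 recipe mentioned in point 1 is along these lines. This is only a minimal sketch based on the standard llm-compressor SmoothQuant + GPTQ INT8 example; the model path, output directory, and calibration dataset are placeholders, and the exact import paths depend on the llm-compressor version.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

# Placeholder paths, not my exact setup.
MODEL_ID = "/root/autodl-tmp/Mistral-Small-24B-Instruct-2501-abliterated"
SAVE_DIR = MODEL_ID + "-W8A8"
NUM_CALIBRATION_SAMPLES = 2048  # 512 was enough for the original model
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: chat samples rendered with the chat template, then tokenized.
ds = load_dataset("HuggingFaceH4/ultrachat_200k",
                  split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]").shuffle(seed=42)
ds = ds.map(lambda s: {"text": tokenizer.apply_chat_template(s["messages"], tokenize=False)})
ds = ds.map(lambda s: tokenizer(s["text"], max_length=MAX_SEQUENCE_LENGTH,
                                truncation=True, add_special_tokens=False),
            remove_columns=ds.column_names)

# SmoothQuant (activation smoothing) followed by GPTQ to INT8 weights/activations.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.86),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```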

The following are the gsm8k results for microsoft/phi-4 that I just tested; you can verify them with the same command:

vllm (pretrained=/root/autodl-tmp/phi-4,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter           | n-shot | Metric      | Value |   Stderr |
|-------|---------|------------------|--------|-------------|-------|----------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.928 | ± 0.0164 |
|       |         | strict-match     | 5      | exact_match | 0.928 | ± 0.0164 |

vllm (pretrained=/root/autodl-tmp/phi-4,add_bos_token=true,max_model_len=2048,tensor_parallel_size=2,dtype=bfloat16), gen_kwargs: (None), limit: 500.0, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter           | n-shot | Metric      | Value |  Stderr |
|-------|---------|------------------|--------|-------------|-------|---------|
| gsm8k | 3       | flexible-extract | 5      | exact_match | 0.922 | ± 0.012 |
|       |         | strict-match     | 5      | exact_match | 0.922 | ± 0.012 |
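
As a quick sanity check of where the Stderr column comes from: assuming lm_eval reports the standard error of the mean of exact_match over the n scored samples (with the usual n - 1 sample variance), the values above can be reproduced directly:

```python
import math

# Reproduce the Stderr column above, assuming lm_eval reports the standard
# error of the mean with the sample (n - 1) variance: sqrt(p * (1 - p) / (n - 1)).
for p, n in [(0.928, 250), (0.922, 500)]:
    stderr = math.sqrt(p * (1 - p) / (n - 1))
    print(f"p={p}, n={n}: stderr = {stderr:.4f}")
# p=0.928, n=250: stderr = 0.0164
# p=0.922, n=500: stderr = 0.0120
```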

Finally, thank you for your work. Please do not take the above as a dismissal of it. On the contrary, it is because of your long-term work that I have had the opportunity to do this kind of unique LLM research, comparison, and learning.
May we make progress together.

By the way, 2501 has only 40 hidden layers, not the 56 that is usual for models around the 20B scale. In fact, Nemo 12B has the same 40 hidden layers.
2501 may be the "fattest" LLM I have ever seen, in the sense that its layers are unusually wide for their number. I did not pay attention to this when I first quantized 2501, which led me to set the hyperparameters too high and wasted a lot of time.
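
For anyone who wants to check, the depth and width can be read straight from the configs; the repo ids below are assumed to be the standard mistralai ones:

```python
from transformers import AutoConfig

# Compare depth (num_hidden_layers) against width (hidden_size, intermediate_size).
# Repo ids are assumed; substitute local paths if the repos are gated or mirrored.
for repo in ["mistralai/Mistral-Small-24B-Instruct-2501",
             "mistralai/Mistral-Nemo-Instruct-2407"]:
    cfg = AutoConfig.from_pretrained(repo)
    print(repo, cfg.num_hidden_layers, cfg.hidden_size, cfg.intermediate_size)
```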

https://huggingface.co/huihui-ai/Dolphin3.0-R1-Mistral-24B-abliterated#note
The new model might address your question or doubts.

No, in my opinion it is completely unrelated.
I don't see how it addresses my doubts.

You can compare the original model with the ablated model to test it.

There seems to be some misunderstanding here.
The information in this discussion is meant to point out the problems of this specific model compared to the original model, not general problems with abliteration; only the problems of this particular abliterated model.
Compared to your other abliterated works, it is not as successful: it clearly damages the original model, and the damage is more obvious than in your other abliterated works.

Yes, ablation might destroy the original model.

Would you consider making a better 2501-abliterated?

Have you tested the ablated version directly, rather than the quantized version?

Would you consider making a better 2501-abliterated?

Now there are new models undergoing ablation, and we will try an updated version later.

Have you tested the ablated version directly, rather than the quantized version?

My hardware is not enough to run it, so I can only evaluate it by observing its quantization process and its lm_eval results on rented autodl machines.
This is also why my guess (that this abliterated model has defects) may not be true: I have not run a bf16 ERP test on it.
When I received community feedback that this abliterated model was significantly worse than the original, the person who gave me the feedback was not using the original model at bf16 precision either, but a GGUF Q6 quantization or something similar.
In short, it is possible that there is nothing wrong with this abliterated model, and that my W8A8 quantization and the GGUF quantization both have problems at the same time.
But I am still more inclined to believe that this abliterated model has a defect, because of the two reproducible clues provided above.
