Can you make a 2.4bpw quantization?

#1
by xldistance - opened

Thanks for quantizing the model

I think 2.8 bpw might fit in 24 GB VRAM, but I'm not able to load 3.0 bpw.

You can modify max_position_embeddings in config.json to 10000, and then the 3.0 bpw quant will load, but the reply speed is only about 3 tokens/s, which is very slow!
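
For reference, a minimal sketch of that edit (the model directory path here is hypothetical; adjust it to your setup):

```python
import json

# Path to your locally downloaded quant (hypothetical path).
config_path = "models/model-3.0bpw-exl2/config.json"

with open(config_path) as f:
    config = json.load(f)

# Lower the advertised maximum context length; most loaders size the
# KV cache from this value, so a smaller context means less VRAM used.
config["max_position_embeddings"] = 10000

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```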

Even with max_position_embeddings set to 10000, the 2.65 bpw quantization occupies more than 24 GB of video memory, so it runs very badly on a 4090.

I generally just keep the original models' configurations. You can edit the file locally if you need it different from the base.

Thank you very much!
