Jamba 1.6 IQ/imatrix quants

#749
by imi2 - opened

I'd like to request IQ models for Jamba. This model won't run in most apps; only the llama.cpp server built from a WIP branch supports it.

Here's a test from https://github.com/ggml-org/llama.cpp/pull/7531
[screenshot: test output]

After the second response, the model only outputs newlines. My test conversion of Jamba: https://huggingface.co/imi2/Jamba-1.6-Mini-GGUF
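
For anyone who wants to reproduce this, the rough idea is something like the sketch below (paths and the port are placeholders, and the server binary may still be called `server` on older states of that branch):

```bash
# fetch llama.cpp and check out the Jamba PR branch (PR #7531)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
git fetch origin pull/7531/head:jamba
git checkout jamba

# build the tools
cmake -B build
cmake --build build --config Release

# serve the converted GGUF with the branch's llama-server
./build/bin/llama-server -m /path/to/Jamba-1.6-Mini.gguf --port 8080
```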

If it needs a non-standard llama.cpp, then the chances of me doing imatrix quants are low. The llama.cpp ticket is a mess: working code, but not merged for many months? Uh oh. I am not happy to use a non-standard llama.cpp, so I would want to wait until that is merged (after which I'll happily quant all the Jamba models anybody ever wants).

@nicoboss do you have an opinion here?

@mradermacher Yup, if you prefer not to release unrecognized models, it's fine with me. I just don't have the RAM to hold the model, and crunching even a small imatrix from disk will take a few days. 🙂

imi2 changed discussion status to closed

Wow, I think we should do them. Not because of AI21-Jamba-Mini-1.6, but because of the 399B AI21-Jamba-Large-1.6 model, which I now want to try out.

I'm quite torn on how to deal with such situations. It is quite unlikely that support for this model gets merged anytime soon, as the author of the PR has abandoned it, and, judging from his GitHub activity, basically all the open-source projects he worked on. The PR is still in draft state and was never fully completed, yet it works and is what everyone has started using as the de facto standard for the Jamba series of models. If we quant it, we need to add a disclaimer that the quants only work with this specific version of llama.cpp, or a lot of users will try to run them with the latest llama.cpp. If you don't want to provide official quants for it, I think we should at least compute the imatrix so others can make imatrix quants from it.

Well, it looks like a lot of effort for something that effectively nobody will use, since none of the frontends will support it.

I think the only reasonable way for this to have a good ending is for the PR to be merged, i.e. somebody has to make it work.

As for the imatrix, I would then recommend building the branch, converting the models, quantizing to Q8_0, and computing the imatrix on that.

If the standard llama-quantize works on the resulting GGUF, we can then push the models through the machinery. But llama-imatrix, and likely the convert script, will require the branch version.
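
A rough sketch of that workflow, assuming the PR #7531 branch builds with the usual CMake flow; the model paths, the calibration file, and the convert script name are placeholders and may differ on the branch:

```bash
# on the PR #7531 branch: convert the HF model to GGUF
# (the script may still be named convert-hf-to-gguf.py on older branch states)
python convert_hf_to_gguf.py /path/to/AI21-Jamba-Large-1.6 \
    --outfile jamba-large-1.6-f16.gguf --outtype f16

# quantize to Q8_0 so the imatrix run needs less memory
./build/bin/llama-quantize jamba-large-1.6-f16.gguf jamba-large-1.6-Q8_0.gguf Q8_0

# compute the imatrix on the Q8_0 model with some calibration text
./build/bin/llama-imatrix -m jamba-large-1.6-Q8_0.gguf \
    -f calibration.txt -o jamba-large-1.6.imatrix

# if mainline llama-quantize accepts the GGUF, IQ quants could then be
# produced the usual way, e.g.:
./build/bin/llama-quantize --imatrix jamba-large-1.6.imatrix \
    jamba-large-1.6-f16.gguf jamba-large-1.6-IQ4_XS.gguf IQ4_XS
```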

@nicoboss @mradermacher No problem. If you do look into this and get the big one quantized, please share! Upload it to some random account on HF. I was able to do the imatrix quant for the small one, so it's all good.
