THE THREAD OF DOOM

#12
by jukofyork - opened

Just realised I deleted the old "thread of doom" as it was attached to the earliest alpha version of the control vectors :(

jukofyork pinned discussion

Okay, I was wondering if we crossed some sort of line.

Anyway.. the INCREDIBLY important thing I was saying before the thread disappeared was... I have a feeling it is going to be just like they say. They are going to be liberal with grants. I suspect they will target people who are using the space outside the purpose that was intended... somewhere out there, someone has all their RAW 8k videos of their cats...

Yeah, it's a pity it got deleted (I should have checked more carefully what was linked), but it was getting a bit out of hand with all that scrolling so perhaps not such a bad thing.

I'm just gonna keep up the models that people have downloaded the most and get rid of all the "experimental, but likely broken" stuff with 15 downloads as they really weren't serving much of a purpose.

Also, all the old versions of the control vectors were vastly inferior to the final version due to me figuring out how to get them working as I went along, so it's probably better to just keep up the final v3.0 ones to avoid a lot of the confusion.



It looks a lot more like I'm just uploading quality models that people like/use now at least... The creative-writer-v0.1-35b and creative-writer-v0.2-35b models will be going as soon as I get the v1.0 version uploaded, and possibly Dusk-Miqu-70B if they do set a hard limit (I still think Dark-Miqu-70B is worth keeping regardless, though).


Also, if anybody really misses any of the models I have uploaded, then I can in theory recreate them and upload a LoRA created from the delta using extract_lora.py, but I strongly suspect nobody will even notice most of them have gone... Of all the models I have created, I've only ever used Dark-Miqu-70B myself!

:( Damn there was some good info in that thread.

If you've still got Firefox tabs open somewhere, you'll be able to save some of the thread.

Unfortunately, I cleaned my browser tabs up about an hour ago.

And yeah, if people were using it as free cloud storage then it makes sense. I just think they could have gone about it better, rather than having us wake up and see the limit.

I'm curious, did your quota drop after deleting that? I wonder if all the PNG files attached there were "billed" to you.

@jukofyork I think you're good, man. If they start enforcing it, you'll get an exemption for sure.

I come across your contributions randomly all over the place, even on GitHub repos like some fine-tuning tool lol

I should probably deduplicate my quants. Often I was making one because I could not find what I was looking for, then it would turn out a few of us just happened to be making them at the same time. Then I started getting requests, so I just decided I would make a bunch. We need a Huggingverse-wide global quant dedupe...

Looks like a Q2 can be run in 48GB of RAM (and 250GB of disk space) on CPU. Wonder if that means Q4 would run in 128GB?

https://old.reddit.com/r/LocalLLaMA/comments/1hw1nze/deepseek_v3_gguf_2bit_surprisingly_works_bf16/
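As a rough sanity check on those numbers, here is a back-of-envelope sketch (assuming ~671B total / ~37B active parameters for DeepSeek-V3 and the nominal bits-per-weight of each quant type - the exact figures will vary with which tensors get bumped up):

    #include <cstdio>

    int main() {
        // DeepSeek-V3: ~671B total parameters, but only ~37B "active" per token (MoE routing).
        const double n_params = 671e9;

        // Approximate storage cost in bits-per-weight (block scales included):
        const char  *names[] = { "Q2_K", "Q4_0", "Q8_0" };
        const double bpw[]   = { 2.6,    4.5,    8.5 };

        for (int i = 0; i < 3; ++i) {
            printf("%s : ~%.0f GB\n", names[i], n_params * bpw[i] / 8.0 / 1e9);
        }
        // => roughly 220 GB / 380 GB / 710 GB, so none of these fit fully in 48 GB (or 128 GB)
        //    of RAM: the 2-bit run above must be mmap-ing the weights from disk and only paging
        //    in the experts actually routed to for each token.
        return 0;
    }

So a Q4_0 almost certainly won't fit in 128GB of RAM either - whether it still runs at a usable speed would depend on how well the per-token expert working set stays cached.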

I don't have time to look into the details at the moment but it's not just a swap file.

The DeepSeek-V3 base model is indeed a true base model (not like Qwen).

Just to make sure I'm reading that right, is Mixtral 8x22b v0.3 also a true base model?

@ChuckMcSneed

Find the llama_tensor_get_type function in src/llama-quant.cpp (currently on line 121).

Find these lines in the function:

    // for arches that share the same tensor between the token embeddings and the output, we quantize the token embeddings
    // with the quantization of the output tensor
    if (name == tn(LLM_TENSOR_OUTPUT, "weight") || (!qs.has_output && name == tn(LLM_TENSOR_TOKEN_EMBD, "weight"))) {

and insert this before it:

    if (ftype == LLAMA_FTYPE_MOSTLY_Q4_0) {
        // Use Q4_0 for all the non-shared experts' MLP tensors
        if (name.find("ffn_up_exps") != std::string::npos
            || name.find("ffn_gate_exps") != std::string::npos
            || name.find("ffn_down_exps") != std::string::npos) {
            new_type = GGML_TYPE_Q4_0;
        }
        // Use Q8_0 for everything else
        else {
            new_type = GGML_TYPE_Q8_0;
        }
    }
    else

to look like this:

static ggml_type llama_tensor_get_type(quantize_state_impl & qs, ggml_type new_type, const ggml_tensor * tensor, llama_ftype ftype) {
    const std::string name = ggml_get_name(tensor);

    // TODO: avoid hardcoded tensor names - use the TN_* constants
    const llm_arch arch = qs.model.arch;
    const auto       tn = LLM_TN(arch);

    auto use_more_bits = [](int i_layer, int n_layers) -> bool {
        return i_layer < n_layers/8 || i_layer >= 7*n_layers/8 || (i_layer - n_layers/8)%3 == 2;
    };
    const int n_expert = std::max(1, (int)qs.model.hparams.n_expert);
    auto layer_info = [n_expert] (int i_layer, int n_layer, const char * name) {
        if (n_expert > 1) {
            // Believe it or not, "experts" in the FFN of Mixtral-8x7B are not consecutive, but occasionally randomly
            // sprinkled in the model. Hence, simply dividing i_ffn_down by n_expert does not work
            // for getting the current layer as I initially thought, and we need to resort to parsing the
            // tensor name.
            if (sscanf(name, "blk.%d.", &i_layer) != 1) {
                throw std::runtime_error(format("Failed to determine layer for tensor %s", name));
            }
            if (i_layer < 0 || i_layer >= n_layer) {
                throw std::runtime_error(format("Bad layer %d for tensor %s. Must be in [0, %d)", i_layer, name, n_layer));
            }
        }
        return std::make_pair(i_layer, n_layer);
    };

    // for arches that share the same tensor between the token embeddings and the output, we quantize the token embeddings
    // with the quantization of the output tensor
    if (ftype == LLAMA_FTYPE_MOSTLY_Q4_0) {
        // Use Q4_0 for all the non-shared experts' MLP tensors
        if (name.find("ffn_up_exps") != std::string::npos
            || name.find("ffn_gate_exps") != std::string::npos
            || name.find("ffn_down_exps") != std::string::npos) {
            new_type = GGML_TYPE_Q4_0;
        }
        // Use Q8_0 for everything else
        else {
            new_type = GGML_TYPE_Q8_0;
        }
    }
    else
    if (name == tn(LLM_TENSOR_OUTPUT, "weight") || (!qs.has_output && name == tn(LLM_TENSOR_TOKEN_EMBD, "weight"))) {
    ...

Then recompile (there might be a syntax error as I have just written this without testing, but it should hopefully be easy/clear to fix if there is).

Re-quantize via llama-quantize using Q4_0 and you should see in the printout that it uses Q8_0 for everything apart from those 3 sets of tensors (which should show as Q4_0).

It should run well on CPU so long as you stick to Q8_0 and Q4_0 only, and for generation (which is memory bound) you should get a big speedup - approaching 2x, due to most of the model being made up of these 3 sets of tensors.
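To see where the "approaching 2x" comes from, here is a rough calculation (the ~95% figure for the fraction of weights living in the routed-expert tensors is an assumption, not something I've measured):

    #include <cstdio>

    int main() {
        // Approximate storage cost in bits-per-weight: Q8_0 = 8.5, Q4_0 = 4.5.
        const double q8_bpw = 8.5, q4_bpw = 4.5;

        // Assumed fraction of the model's weights in the routed experts' up/gate/down tensors.
        const double expert_frac = 0.95;

        const double mixed_bpw = expert_frac * q4_bpw + (1.0 - expert_frac) * q8_bpw;
        printf("mixed bpw = %.2f -> ~%.2fx less data read per token vs all-Q8_0\n",
               mixed_bpw, q8_bpw / mixed_bpw); // ~1.8x under these assumptions
        return 0;
    }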

You can experiment with other tensors by looking in llama-arch.cpp for the LLM_ARCH_DEEPSEEK2 tensor names:

        LLM_ARCH_DEEPSEEK2,
        {
            { LLM_TENSOR_TOKEN_EMBD,         "token_embd" },
            { LLM_TENSOR_OUTPUT_NORM,        "output_norm" },
            { LLM_TENSOR_OUTPUT,             "output" },
            { LLM_TENSOR_ATTN_NORM,          "blk.%d.attn_norm" },
            { LLM_TENSOR_ATTN_Q_A_NORM,      "blk.%d.attn_q_a_norm" },
            { LLM_TENSOR_ATTN_KV_A_NORM,     "blk.%d.attn_kv_a_norm" },
            { LLM_TENSOR_ATTN_Q,             "blk.%d.attn_q" },
            { LLM_TENSOR_ATTN_Q_A,           "blk.%d.attn_q_a" },
            { LLM_TENSOR_ATTN_Q_B,           "blk.%d.attn_q_b" },
            { LLM_TENSOR_ATTN_KV_A_MQA,      "blk.%d.attn_kv_a_mqa" },
            { LLM_TENSOR_ATTN_KV_B,          "blk.%d.attn_kv_b" },
            { LLM_TENSOR_ATTN_OUT,           "blk.%d.attn_output" },
            { LLM_TENSOR_FFN_NORM,           "blk.%d.ffn_norm" },
            { LLM_TENSOR_FFN_GATE,           "blk.%d.ffn_gate" },
            { LLM_TENSOR_FFN_UP,             "blk.%d.ffn_up" },
            { LLM_TENSOR_FFN_DOWN,           "blk.%d.ffn_down" },
            { LLM_TENSOR_FFN_GATE_INP,       "blk.%d.ffn_gate_inp" },
            { LLM_TENSOR_FFN_GATE_EXPS,      "blk.%d.ffn_gate_exps" },
            { LLM_TENSOR_FFN_DOWN_EXPS,      "blk.%d.ffn_down_exps" },
            { LLM_TENSOR_FFN_UP_EXPS,        "blk.%d.ffn_up_exps" },
            { LLM_TENSOR_FFN_GATE_INP_SHEXP, "blk.%d.ffn_gate_inp_shexp" },
            { LLM_TENSOR_FFN_GATE_SHEXP,     "blk.%d.ffn_gate_shexp" },
            { LLM_TENSOR_FFN_DOWN_SHEXP,     "blk.%d.ffn_down_shexp" },
            { LLM_TENSOR_FFN_UP_SHEXP,       "blk.%d.ffn_up_shexp" },
            { LLM_TENSOR_FFN_EXP_PROBS_B,    "blk.%d.exp_probs_b" },
        }

One thing to try would be:

    if (ftype == LLAMA_FTYPE_MOSTLY_Q4_0) {
        // Use Q4_0 for all the non-shared experts' MLP up/gate tensors
        if (name.find("ffn_up_exps") != std::string::npos
            || name.find("ffn_gate_exps") != std::string::npos) {
            new_type = GGML_TYPE_Q4_0;
        }
        // Use Q6_K for all the non-shared experts' MLP down tensors
        else if (name.find("ffn_down_exps") != std::string::npos) {
            new_type = GGML_TYPE_Q6_K;
        }
        // Use Q8_0 for everything else
        else {
            new_type = GGML_TYPE_Q8_0;
        }
    }
    else

but in general the "K" quants are slower on CPU-only systems.


You only need this hacked version of llama.cpp to run llama-quantize and can discard it afterwards (the custom GGUF will work fine in a stock build - I suggest naming it something like -q4_0_XL so you know it's different...).

If anyone wants to do this for other (non-MoE / non-CPU-targeted) models, then this is what I now use for all of mine:

    // ### JUK ###
    if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M || ftype == LLAMA_FTYPE_MOSTLY_Q6_K) {
        if (name == tn(LLM_TENSOR_OUTPUT, "weight") || name == tn(LLM_TENSOR_TOKEN_EMBD, "weight")) {
            new_type = GGML_TYPE_Q8_0;
        }
        else if (name.find("attn_k.weight") != std::string::npos || name.find("attn_v.weight") != std::string::npos) {
            new_type = GGML_TYPE_Q8_0;
        }
        else if (name.find("attn_q.weight") != std::string::npos || name.find("attn_output.weight") != std::string::npos) {
            new_type = GGML_TYPE_Q8_0;
        }
        else if (name.find("ffn_down") != std::string::npos) {
            new_type = GGML_TYPE_Q6_K;
        }
        else if (name.find("ffn_gate") != std::string::npos || name.find("ffn_up") != std::string::npos) {
            if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
                new_type = GGML_TYPE_Q4_K;
            }
            else if (ftype == LLAMA_FTYPE_MOSTLY_Q5_K_M) {
                new_type = GGML_TYPE_Q5_K;
            }
            else {
                new_type = GGML_TYPE_Q6_K;
            }
        }
        else {
            throw std::runtime_error(format("Unhandled tensor type: %s", name));
        }
    }
    else
    // ### JUK ###

as I suspect the attention-tensor logic in llama_tensor_get_type was written way before the days of long-context models, and IMO it's (really!) badly hurting a lot of the newer models... I would rather bump the up/gate MLP tensors down by 1 bit than use really low quants for the attention tensors (and for long-context story writing and coding this seems quite obvious if you try it: POV characters get mixed up much more quickly with the default attention-tensor logic of MOSTLY_Q4_K_M and even MOSTLY_Q5_K_M!).

The logic of using the smallest quant for the up/gate projections (which have large "fan-out") and then a slightly higher quant for the down projection (which has large "fan-in") also works nicely with the imatrix code: the Q6_K (and Q8_0) code that uses the imatrix.bin file you create is commented out in the llama.cpp source, so the imatrix will only be applied to the up/gate projections.


There's a lot of other stuff you can try by hacking the llama_tensor_get_type function (eg: if you find that the gap between, say, MOSTLY_Q4_K_M and MOSTLY_Q5_K_M is too big: with MOSTLY_Q4_K_M wasting several GB of VRAM but MOSTLY_Q5_K_M OOMing, etc).
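For example, here is a minimal (untested) sketch of one possible middle ground for MOSTLY_Q4_K_M, in the same spirit as the hacks above - attention tensors at Q6_K instead of Q8_0, ffn_down at Q5_K, and the (largest) up/gate tensors left at Q4_K; the exact mix is just an illustration rather than a recommendation:

    if (ftype == LLAMA_FTYPE_MOSTLY_Q4_K_M) {
        // Keep the embeddings and output head at Q8_0
        if (name == tn(LLM_TENSOR_OUTPUT, "weight") || name == tn(LLM_TENSOR_TOKEN_EMBD, "weight")) {
            new_type = GGML_TYPE_Q8_0;
        }
        // Attention tensors one notch below Q8_0 to claw back some VRAM
        else if (name.find("attn_k.weight") != std::string::npos || name.find("attn_v.weight") != std::string::npos
              || name.find("attn_q.weight") != std::string::npos || name.find("attn_output.weight") != std::string::npos) {
            new_type = GGML_TYPE_Q6_K;
        }
        // Down projection (large "fan-in") gets one notch more than the up/gate projections
        else if (name.find("ffn_down") != std::string::npos) {
            new_type = GGML_TYPE_Q5_K;
        }
        // Up/gate projections (large "fan-out") stay at Q4_K - these dominate the total size
        else if (name.find("ffn_gate") != std::string::npos || name.find("ffn_up") != std::string::npos) {
            new_type = GGML_TYPE_Q4_K;
        }
        // Anything else (norms, etc) just keeps whatever new_type was passed in
    }
    else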

I've finally figured out WTF is going on with the Cohere models and the double/single newlines:

It turns out that llama.cpp is tokenizing all double-newlines to a single 2126 token:

https://github.com/ggerganov/llama.cpp/pull/6033#issuecomment-2000227286
https://github.com/ggerganov/llama.cpp/issues/6104

whereas for some crazy reason the huggingface BPE tokenizer is tokenizing all double-newlines to a pair of 206 tokens:

https://github.com/huggingface/tokenizers/issues/1534

BUT: Only if the next character is a letter!??

If the next character is a space, number or special token it uses a single 2126 token...

So this means that I am correctly training using the huggingface BPE tokeniser (which one of the devs confirms is using the correct logic that the model was trained on), but obviously as soon as I go into llama.cpp it will tokenize the input prompt wrongly using a single 2126 token (and this explains why the 35b model was so batshit crazy with wanting to use the space-prefixed tokens at the start of each paragraph!!!).

I can't see a good fix for this, so I have just elected to completely mask out the labels and gradients for these tokens, in the same way as I do for the special tokens:

    dataset = dataset.map(lambda x: {'attention_mask': torch.ones_like(x['input_ids']), 'labels': x['input_ids']}, desc='adding attention_mask and labels')
    ########################################################
    # Zero out the labels for all the "special" Cohere tokeniser tokens (+ single/double newline tokens)
    # USE: https://huggingface.co/spaces/Xenova/the-tokenizer-playground
    # SEE: https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/main/tokenizer.json
    dataset = dataset.map(lambda x: {'labels': torch.where(
        (x['labels'] <= 7) |       # special
        (x['labels'] >= 255000) |  # special
        (x['labels'] == 206) |     # \n
        (x['labels'] == 2126),     # \n\n
        torch.full_like(x['labels'], -100), 
        x['labels']
    )}, desc='masking special tokens')
    #######################################################
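    # NOTE: the tl.store() below is the kernel's existing logit-gradient write (shown here for context
    # only) - the "zero out the gradients" block that follows it is the new code, inserted straight after.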
    tl.store(logits_ptr + col_offsets, dloss * y, mask = mask)

    ########################################################
    # Zero out the gradients for all the "special" Cohere tokeniser tokens (+ single/double newline tokens)
    # NOTE: This should freeze the probability of generating these regardless of our fine-tuning dataset
    # USE: https://huggingface.co/spaces/Xenova/the-tokenizer-playground
    # SEE: https://huggingface.co/CohereForAI/c4ai-command-r-v01/blob/main/tokenizer.json
    zero_mask = (
        (col_offsets <= 7) |       # special
        (col_offsets >= 255000) |  # special
        (col_offsets == 206) |     # \n
        (col_offsets == 2126)      # \n\n
    ) & (col_offsets < VOCAB_SIZE)
    tl.store(logits_ptr + col_offsets, 0.0, mask = zero_mask)
    ########################################################

It's not ideal, but at least now it won't slowly reduce the output-probability of the 2126 double-newline token, and the fine-tunes should hopefully be way less confused when used in llama.cpp...


I also didn't realise that the Cohere models add both the attention and MLP outputs to the residual stream in parallel, like this:

https://github.com/ggerganov/llama.cpp/pull/6033#issuecomment-1995533641


  • This likely explains why the last 2 layers of these models have huge hidden-state norms, whilst other models have only the last layer!
  • It also makes me think I should possibly be applying my "Multiplicative-LoRA" to the o_proj tensor as well as the down_proj tensor (trying this now).
  • This architecture, coupled with the use of LayerNorm instead of RMSNorm, likely means the Cohere models are really easy to Merge with mergekit!
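For anyone unfamiliar with the difference, here is a minimal single-token sketch contrasting the usual sequential block with this parallel style (layer_norm/attn/mlp are just hypothetical identity stand-ins so it compiles - not real implementations):

    #include <cstddef>
    #include <vector>

    using Vec = std::vector<float>;

    // Hypothetical stand-ins for the real sub-layers (identity functions, for illustration only).
    static Vec layer_norm(const Vec & x) { return x; }
    static Vec attn      (const Vec & x) { return x; }
    static Vec mlp       (const Vec & x) { return x; }

    // Standard (sequential) block: the MLP sees the residual stream *after* attention has written to it.
    static Vec block_sequential(Vec h) {
        const Vec a = attn(layer_norm(h));
        for (std::size_t i = 0; i < h.size(); ++i) h[i] += a[i];
        const Vec m = mlp(layer_norm(h));
        for (std::size_t i = 0; i < h.size(); ++i) h[i] += m[i];
        return h;
    }

    // Cohere-style parallel block: attention and MLP both read the *same* normed input and both
    // add their outputs onto the residual stream independently.
    static Vec block_parallel(Vec h) {
        const Vec n = layer_norm(h);
        const Vec a = attn(n);
        const Vec m = mlp(n);
        for (std::size_t i = 0; i < h.size(); ++i) h[i] += a[i] + m[i];
        return h;
    }

    int main() {
        const Vec x(8, 1.0f);
        return (block_sequential(x).size() == block_parallel(x).size()) ? 0 : 1;
    }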

https://old.reddit.com/r/LocalLLaMA/comments/1hw1nze/deepseek_v3_gguf_2bit_surprisingly_works_bf16/m5xqtji/

If someone has a Reddit account then please link Daniel to the post above - it would be really interesting to see what he can come up with!

(Quoting my earlier post above about llama.cpp tokenizing double-newlines to a single 2126 token, whereas the huggingface BPE tokenizer produces a pair of 206 tokens.)

Actually, all this seems outdated - I just did a fresh pull, ran test-tokenizer-0, and can confirm it now gives the same output as The Tokenizer Playground for the command-r models... Maybe I was using an older version from before this fix.

How are you running it (CPU I assume?) and what tokens/s are you getting?

Dual Epyc home server @ 3 t/s with Q8_0. I can't recommend buying one for this yet though; it's not really worth it.

For a few weeks I had access to 8xH200s and it was running at 8 t/s 🤣 Your 3 t/s doesn't sound bad in comparison, dollar for dollar.

@jukofyork Thanks for the advice, but how is the intelligence affected by it? I'd rather not dumb the model down. For me, intelligence > speed.

Just to make sure I'm reading that right, is Mixtral 8x22b v0.3 also a true base model?

Yes. The non-instruct one is a base model, but not a great one.

[screenshot: top token probabilities for DeepSeek-V3 instruct]
Here are the top probs for DeepSeek-V3 instruct. 42% for "K" - that's a new record for overconfidence. As you can see, DeepSeek is clearly arena-maxxing; they included fancy markdown (*) to get extra votes.

I've played around a bit more with DS. It has superior knowledge of trivia compared to Largestral (I was surprised that it got a quite obscure reference right) and is much better at solving riddles and following instructions. The writing style, however, is pure GPT-slop, and it is more difficult to break it out of it than Largestral. What's worse is that while Largestral remained relatively unslopped in the other language that I tested, DS just wrote the same shit it wrote in English, with the same sentence structure and all, which sounded very awkward and unnatural, as if the text had been pulled through Google Translate.

Same story as with the base model, for comparison:

The sun was a merciless beast, its fiery breath scorching the earth and turning the once-thriving town into a desolate wasteland. The cultist, named Elias, staggered into the ghost town, his yellow robes tattered and filthy, clinging to his sweat-soaked skin like a second layer of torment. His vision blurred, the edges of his sight darkening with every step. The arrow lodged in his shoulder throbbed with a pain that seemed to echo the rhythm of his erratic heartbeat. 

The town was eerily silent, save for the occasional gust of wind that stirred the sand and rattled the broken shutters of the abandoned buildings. Elias's parched throat screamed for water, but he knew better than to hope for such luxuries. His mind was a chaotic storm of delirium and fear. The cult had branded him a traitor, and the memory of their twisted rituals and the malevolent whispers of their leader haunted him. He clutched the small, mysterious mirror tightly in his hand, its surface cool against his feverish skin. It was his only possession, stolen in a moment of desperation from the cult's forbidden vault. He didn't fully understand its power, but he knew it was more than just a trinket.

Elias stumbled into the shadow of a crumbling building, leaning heavily against the wall as he tried to catch his breath. His thoughts were a jumbled mess, fragments of memories and hallucinations intertwining. He could hear the voice of the cult leader, a rasping, venomous whisper that slithered into his mind. 

"You think you can escape us, Elias? You are ours. Always have been, always will be."

"Shut up," Elias muttered, squeezing his eyes shut in a futile attempt to block out the voice. "I'm not going back. I won't."

The mirror in his hand seemed to pulse faintly, a subtle vibration that made his skin crawl. He glanced down at it, his reflection distorted and ghostly in the cracked surface. For a moment, he thought he saw something move within the glass, a shadowy figure that twisted and writhed. He quickly looked away, his heart pounding in his chest.

"You're losing it," he whispered to himself, his voice trembling. "You're losing your mind."

He forced himself to move, pushing away from the wall and staggering down the deserted street. The town seemed to stretch on endlessly, each building a mirror of the last, decaying and lifeless. His strength was fading fast, and he knew he needed to find shelter before the sun fully claimed him.

As he approached what appeared to be an old inn, the door creaked open on its own accord, the sound grating against the silence. Elias hesitated, his instincts screaming at him to turn and run. But where would he go? He was in no condition to travel any further. With a resigned sigh, he stepped inside, the cool darkness enveloping him like a shroud.

The interior was dimly lit by the sunlight filtering through the cracks in the boarded-up windows. The air was thick with dust and the scent of decay. Elias collapsed onto a wooden bench, his body finally giving in to the exhaustion. He set the mirror down beside him, its presence both comforting and unnerving.

As he leaned back, his mind drifted to the events that had led him here. The escape from the cult, the frantic chase through the desert, the arrow that had found its mark in his shoulder. He had thought he was free, but now, alone in this forsaken town, he couldn't shake the feeling that he was still a prisoner.

The mirror shimmered faintly, catching the dim light in a way that made it seem almost alive. Elias reached out to touch it, his fingers trembling. As soon as his skin made contact, a sharp, icy pain shot through his arm, and the world around him seemed to dissolve into darkness.

When he opened his eyes, he was no longer in the inn. He stood in a vast, desolate landscape, the sky a swirling mass of black clouds and crimson light. Before him stood a figure, cloaked in shadows, its eyes gleaming with an otherworldly intensity.

"Welcome, Elias," the figure said, its voice a deep, resonant echo that seemed to come from everywhere and nowhere at once. "You have brought the mirror. Good. There is much we need to discuss."

Elias felt a surge of fear and confusion. "Who are you? What is this place?"

The figure stepped closer, the shadows around it shifting and writhing like living things. "I am the one who has been waiting for you. And this is just the beginning."

Before Elias could respond, the world around him began to dissolve once more, and he found himself back in the inn, gasping for breath. The mirror lay before him, its surface now still and unremarkable.

He stared at it, his mind racing. What had just happened? Was it a hallucination, a product of his fevered mind? Or was it real? He didn't know, but one thing was certain: the mirror was far more dangerous than he had ever imagined.
