terrible run

#719
by mangojesus - opened

so the last radical peppering of models i floated your way... i think i was batting less than .500 for success. it was a grand display of a face palm in the end.

you think that's going to slow me down? ha.

so i think this model is already quantized? but not converted to a GGUF? is that the business?

is there any way you can get it across the line to a gguf? even if it's only in q4 quant (because that's what it looks like it is here)

https://huggingface.co/AlicanKiraz0/SenecaLLM_x_gemma-2-9b-CyberSecurity-Q4/tree/main

shooting my shot...

It's queued! :D
If it's already quantized it will fail, but I had no time to check whether it is, so it will probably fail.

You can check for progress at http://hf.tst.eu/status.html or regularly check the model
summary page at https://hf.tst.eu/model#SenecaLLM_x_gemma-2-9b-CyberSecurity-Q4-GGUF for quants to appear.

@mangojesus don't worry, we are not counting successes (nor failures) :)

it will fail because it is already quantized (in general, any Ix or Ux tensor types mean it is already quantized and llama.cpp won't handle it).
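
For reference, a quick way to check this up front is to list the tensor dtypes straight from the safetensors header, which is plain JSON. A minimal sketch; the shard name is just an example:

import json
import struct
from collections import Counter

def tensor_dtypes(path):
    # safetensors layout: an 8-byte little-endian header length, then a JSON
    # header listing every tensor's dtype, shape and data offsets.
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]
        header = json.loads(f.read(header_len))
    return Counter(v["dtype"] for k, v in header.items() if k != "__metadata__")

print(tensor_dtypes("model-00001-of-00002.safetensors"))  # example shard name
# Mostly F16/BF16/F32: a normal full-precision checkpoint.
# U8/I8 entries (or tensor names containing absmax/quant_state): already
# quantized, and llama.cpp's convert_hf_to_gguf.py will refuse it.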

@nicoboss I am sure with some transformer magic one could expand these tensors back to f16 or so. maybe even in a mostly generic way. there are quite a few models which are interesting, but have no ggufs because of this.

mradermacher changed discussion status to closed

@mradermacher - yea, i've outsourced all my personal math to a llama accountant so i don't keep score anymore. mostly i just laugh at how often i pick the difficult ones. the more i learn about this whole world, the more intriguing it is.

i had a feeling that model would fail under an automated system, as it's already been quantized but not rendered to a gguf (i only understand the macro of that statement).

if there's a protocol of how i could work the "transformer magic" on my own, i'd be happy to give it a shot, in case rendering that model into a gguf is beyond the scope of what your mass-quantizing project currently does (all respect intended).

i'm still learning about the whole process, but i'm damn good at following instructions and stubborn enough to stick it out through the most linux-y of cluster-fuckage that the average bear would tuck tail and run from.

outsourced all my personal math to a llama accountant

You are so doomed.

i had a feeling that model would fail under an automated system, as it's already been quantized but not rendered to a gguf

It's pretty easy: it boils down to llama.cpp not understanding the reduced-quality weights. There is no fundamental barrier, just a lack of code.

if there's a protocol of how i could work the "transformer magic" on my own

It would require somebody with Python knowledge and an understanding of how transformers handles the quantized tensors. There might already be code to blow them back up to full precision, but one would have to figure out how to write a script around it, or maybe there isn't code for that yet. If I had too much time I could work myself into it, but I don't :(

In the end, it might be a similar situation as with gguf vs. transformers - in theory, there is a way to convert a gguf back to hf format (the information is all there), but nobody wrote the code, and it's going to be a lot. OTOH, it might be as simple as loading the model, setting some parameters and saving a copy...

It is, however, unlikely to be a process of just following steps. Note that I know next to nothing about the transformers API or how to use it.
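
For what it's worth, the naive "load it, set a dtype, save a copy" version would look roughly like the sketch below. This is only the optimistic starting point, assuming transformers can load the checkpoint at all; the output path is an example, and the dequantization is exactly the step that may need the extra magic:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src = "AlicanKiraz0/SenecaLLM_x_gemma-2-9b-CyberSecurity-Q4"
dst = "./SenecaLLM_x_gemma-2-9b-CyberSecurity-BF16"  # example output path

# If the checkpoint loads as a regular model, this is all it takes...
model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.bfloat16)
# ...but if it comes up as a bitsandbytes-quantized model, the weights are still
# packed at this point, and that is where the real work would be.
model.save_pretrained(dst, safe_serialization=True)

tokenizer = AutoTokenizer.from_pretrained(src)
tokenizer.save_pretrained(dst)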

@mradermacher For GGUF to safetensors you can just use: https://github.com/purinnohito/gguf_to_safetensors

Unquantizing a safetensors model is exactly what we do for all the DeepSeek V3/R1 models using https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/fp8_cast_bf16.py, where we convert FP8 to BF16, so one could likely just modify this to work for any model.
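
The core of that script is just upcasting the FP8 tensors shard by shard. A very rough sketch of the idea (the real fp8_cast_bf16.py also multiplies by the stored weight_scale_inv tensors, which is omitted here; the shard name is an example):

import torch
from safetensors.torch import load_file, save_file

tensors = load_file("model-00001-of-00002.safetensors")  # example shard
out = {}
for name, tensor in tensors.items():
    if tensor.dtype == torch.float8_e4m3fn:
        # Plain upcast; the real script additionally applies the per-block scales.
        out[name] = tensor.to(torch.bfloat16)
    else:
        out[name] = tensor
save_file(out, "model-00001-of-00002-bf16.safetensors")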

I had no idea about gguf to safetensors. I knew about deepseek, and that would be a good starting point, although most quantized models don't use fp8, which is comparatively trivial to convert.

OTOH, it might be as simple as loading the model, setting some parameters and saving a copy...

@mradermacher No it is absolutely not. I spent hours trying it. I am now able to unquantize the model from Q4 over FP32 to BF16 and convert the resulting model to GGUF, but llama.cpp is unable to load it because the tensor shapes get lost during quantization, and I have yet to find a way to restore them. Another massive pain was the tokenizer.model getting removed, so you have to train a new one based on tokenizer.json. Besides that, I also had to write code to recreate model.safetensors.index.json, and then there was the entire safetensors split nonsense. Just take a look at the following code and appreciate the sheer complexity of what is supposed to be a relatively simple task:

import safetensors
import torch
import numpy as np
import os
import re
import json
import shutil
import sentencepiece as spm
from safetensors.torch import save_file

def find_model_files(folder_path):
    pattern = re.compile(r'model-\d+-of-\d+\.safetensors')
    files = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if pattern.match(f)]
    files.sort()  # Ensure files are in the correct order
    return files

def load_model_config(config_path):
    with open(config_path, "r") as f:
        return json.load(f)

def load_and_convert_tensors(part_paths):
    converted_tensors = []
    for path in part_paths:
        with safetensors.safe_open(path, framework="pt") as f:
            part_tensors = {}
            for name in f.keys():
                if all(exclusion not in name for exclusion in ['absmax', 'nested_quant_map', 'quant_map', 'quant_state']):  # Exclude quantization-specific tensors
                    tensor = f.get_tensor(name)
                    tensor_fp32 = tensor.to(torch.float32)
                    tensor_bf16 = tensor_fp32.to(torch.bfloat16)
                    part_tensors[name] = tensor_bf16
            converted_tensors.append(part_tensors)
    return converted_tensors

def save_tensors_in_parts(tensors, output_folder):
    os.makedirs(output_folder, exist_ok=True)
    index_data = {"weight_map": {}}
    num_parts = len(tensors)
    for i, part_tensors in enumerate(tensors):
        part_filename = f"model-{i+1:05d}-of-{num_parts:05d}.safetensors"
        part_path = os.path.join(output_folder, part_filename)
        save_file(part_tensors, part_path)
        for name in part_tensors.keys():
            index_data["weight_map"][name] = part_filename
    return index_data

def save_index_file(index_data, output_folder):
    index_path = os.path.join(output_folder, "model.safetensors.index.json")
    with open(index_path, "w") as f:
        json.dump(index_data, f, indent=4)

def copy_non_safetensor_files(input_folder, output_folder):
    for filename in os.listdir(input_folder):
        if not filename.endswith(".safetensors") and filename != "model.safetensors.index.json":
            src_path = os.path.join(input_folder, filename)
            dst_path = os.path.join(output_folder, filename)
            if os.path.isfile(src_path):
                shutil.copy2(src_path, dst_path)
            elif os.path.isdir(src_path):
                shutil.copytree(src_path, dst_path)

def convert_tokenizer_json_to_model(input_folder, output_folder):
    tokenizer_json_path = os.path.join(input_folder, "tokenizer.json")
    tokenizer_model_path = os.path.join(output_folder, "tokenizer.model")

    if os.path.exists(tokenizer_json_path):
        with open(tokenizer_json_path, "r") as f:
            tokenizer_data = json.load(f)

        # Create a temporary text file with sentences for training
        temp_text_file = os.path.join(output_folder, "temp_sentences.txt")
        with open(temp_text_file, "w") as f:
            for token in tokenizer_data["model"]["vocab"]:
                f.write(f"{token}\n")

        # Train the SentencePiece model with a reduced vocabulary size
        vocab_size = min(len(tokenizer_data["model"]["vocab"]), 104995)
        spm.SentencePieceTrainer.train(input=temp_text_file, model_prefix=tokenizer_model_path.replace(".model", ""), vocab_size=vocab_size)

        # Clean up the temporary file
        os.remove(temp_text_file)

def convert_4bit_to_bf16(input_folder, output_folder):
    # Delete the output folder if it already exists
    if os.path.exists(output_folder):
        shutil.rmtree(output_folder)

    # Load model configuration
    config_path = os.path.join(input_folder, "config.json")
    config = load_model_config(config_path)

    # Find all model parts in the specified folder
    input_paths = find_model_files(input_folder)

    if not input_paths:
        print("No model files found in the specified folder.")
        return

    # Load and convert the 4-bit quantized model parts
    converted_tensors = load_and_convert_tensors(input_paths)

    # Save the BF16 model parts and generate the index file
    index_data = save_tensors_in_parts(converted_tensors, output_folder)
    save_index_file(index_data, output_folder)

    # Copy non-safetensor files to the output folder
    copy_non_safetensor_files(input_folder, output_folder)

    # Convert tokenizer.json to tokenizer.model
    convert_tokenizer_json_to_model(input_folder, output_folder)

    print(f"Model converted and saved to {output_folder}")

# Example usage
input_folder_path = "./SenecaLLM_x_gemma-2-9b-CyberSecurity-Q4"
output_model_path = "./SenecaLLM_x_gemma-2-9b-CyberSecurity-BF16"
convert_4bit_to_bf16(input_folder_path, output_model_path)
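
A note on the "tensor shapes getting lost" part: with bitsandbytes 4-bit checkpoints the packed uint8 weight is stored flattened, and the original 2-D shape lives in the *.quant_state.* / *.absmax tensors that the script above filters out, so a plain dtype cast cannot recover it. An untested sketch of letting bitsandbytes do the dequantization instead, assuming the checkpoint's config.json still carries its bitsandbytes quantization_config and a GPU is available:

import torch
import bitsandbytes.functional as bnb_f
from bitsandbytes.nn import Params4bit
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "./SenecaLLM_x_gemma-2-9b-CyberSecurity-Q4", device_map="cuda:0"
)

state_dict = {}
for name, module in model.named_modules():
    weight = getattr(module, "weight", None)
    if isinstance(weight, Params4bit):
        # quant_state remembers the original shape, so dequantize_4bit restores it.
        dequantized = bnb_f.dequantize_4bit(weight.data, weight.quant_state)
        state_dict[f"{name}.weight"] = dequantized.to(torch.bfloat16).cpu()
# Non-quantized tensors (embeddings, norms, biases, ...) would still need to be
# copied over before saving with safetensors.torch.save_file.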

No it is absolutely not.

That sucks :(

Another massive pain was the tokenizer.model

Well, that is an unrelated problem - we also can't quantize many models due to tokenizer mismatches.

write code to recreate model.safetensors.index.json

That file is completely optional, you can just not generate it.

Anyway, thanks for giving it a serious try; at least we now know for certain that it's not trivial.

However, being able to come up with the correct tokenizer.model based on the other files might be something even more useful to pursue - that is the #1 reason why models fail to convert.

However, being able to come up with the correct tokenizer.model based on the other files might be something even more useful to pursue - that is the #1 reason why models fail to convert.

We can train a new tokenizer.model based on tokenizer.json by using the convert_tokenizer_json_to_model function above. This will require some testing, so just let me know the next time we encounter such a model and I will give it a try.

However, being able to come up with the correct tokenizer.model based on the other files might be something even more useful to pursue - that is the #1 reason why models fail to convert.

Oh wow I just encountered this exact case on https://huggingface.co/emilykang/Gemma_medmcqa_question_generation-pharmacology_lora/tree/main where there indeed is a tokenizer.json but no tokenizer.model and so llama.cpp refused to GGUF it. So I created my own using:

import os, json
import sentencepiece as spm

def convert_tokenizer_json_to_model(input_folder, output_folder):
    tokenizer_json_path = os.path.join(input_folder, "tokenizer.json")
    tokenizer_model_path = os.path.join(output_folder, "tokenizer.model")

    if os.path.exists(tokenizer_json_path):
        with open(tokenizer_json_path, "r") as f:
            tokenizer_data = json.load(f)

        # Create a temporary text file with sentences for training
        temp_text_file = os.path.join(output_folder, "temp_sentences.txt")
        with open(temp_text_file, "w") as f:
            for token in tokenizer_data["model"]["vocab"]:
                f.write(f"{token}\n")

        # Train the SentencePiece model with a reduced vocabulary size
        vocab_size = min(len(tokenizer_data["model"]["vocab"]), 104995)
        spm.SentencePieceTrainer.train(input=temp_text_file, model_prefix=tokenizer_model_path.replace(".model", ""), vocab_size=vocab_size)

        # Clean up the temporary file
        os.remove(temp_text_file)

convert_tokenizer_json_to_model("./", "./")
7    6 si Gemma_medmcqa_question_generation-pharmacology_lora error/1 missing tokenizer.model
7    6 si Gemma_medmcqa_question_generation-pharmacology_lora ready/static

Unfortunately the resulting model failed to load due to some out_of_range error. I nuked it for now but will look into it again in the coming days.

system_info: n_threads = 1 (n_threads_batch = 1) / 50 | CUDA : ARCHS = 890 | FORCE_MMQ = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 
compute_imatrix: tokenizing the input ..
terminate called after throwing an instance of 'std::out_of_range'
  what():  unordered_map::at

Edit: Oh wow the same happened for Gemma_medQuad_finetuned_model as well: FileNotFoundError: File not found: Gemma_medQuad_finetuned_model/tokenizer.model
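
Before the next attempt it might be worth sanity-checking whether the regenerated tokenizer.model even assigns the same ids as the original tokenizer.json; a mismatch there would plausibly explain the unordered_map::at out_of_range above. A small hypothetical check, run in the model directory:

import sentencepiece as spm
from tokenizers import Tokenizer

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
hf = Tokenizer.from_file("tokenizer.json")

sample = "Generate a pharmacology question about beta blockers."
print("sentencepiece ids:", sp.encode(sample, out_type=int))
print("tokenizer.json ids:", hf.encode(sample).ids)
# A freshly trained SentencePiece model will almost certainly produce different
# ids, so the GGUF vocab would not line up with what the model expects.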

Oh wow I just encountered this exact case on

And there are more in the queue. There is practically not a day when this does not happen. Let's find a bigger model.

(currently, there are 740 in the log, but not all of them have e.g. tokenizer.json)
