
A custom pseudo "fill in the middle" model, trained to handle varying "corruption rates" (randomized UTF-8 character substitution). Two custom GRPO reward functions were used to improve the pre-existing SFT model so that it attends to the XML styling more reliably.
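To make "corruption rate" concrete, here is a minimal sketch of randomized character substitution. It is not the code used to build the training data: the function name `corrupt_text`, the substitution range (U+0021 to U+024F), and the choice to leave newlines untouched are all assumptions made for illustration.

```python
import random
from typing import Optional


def corrupt_text(text: str, rate: float, seed: Optional[int] = None) -> str:
    """Swap roughly `rate` of the characters for random code points.

    Hypothetical helper: the replacement range below is an arbitrary choice,
    not the distribution used to train the model.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    out = []
    for ch in text:
        # Newlines are kept so the paragraph layout survives (an assumption).
        if ch != "\n" and rng.random() < rate:
            out.append(chr(rng.randint(0x21, 0x24F)))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    sample = "The quick brown fox\njumps over the lazy dog."
    for rate in (0.0, 0.1, 0.4):
        print(rate, repr(corrupt_text(sample, rate, seed=0)))
```

A rate of 0.0 returns the input unchanged and a rate of 1.0 replaces every non-newline character, so the single `rate` knob covers the whole range from intact to fully scrambled text.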

Designed to be used with the (jank, hacky, personalized) PyQT GUI tooling seen at: https://github.com/kalomaze/quest-tools


Wandb logs for this run can be found here, along with the attached RL code. The full hyperparameters are listed in the configuration .py file.

Prompt Formatting

Trained without ChatML templating. This model uses a pattern of:

  • Raw "corrupted" text at the beginning, with UTF-8 character substitution applied to parts of the input.
  • The "objective" as a Claude-style XML tag with newline separators.
  • The beginning of an "original" tag, left open for the model to continue.
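
    from typing import Dict

    # Method on the data-formatting class: builds the raw (non-ChatML) prompt.
    def _format_prompt(self, example: Dict) -> str:
        return (
            f"{example['corrupted']}\n\n"  # corrupted text comes first
            "<objective>\n"
            "gently repair the <original> content\n"
            "</objective>\n\n"
            "<original>\n"  # opening tag only; the model continues from here
        )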
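
For use outside the training class, the same pattern can be sketched as a standalone function plus a small helper that trims the completion. The `format_prompt` string matches the template above; `extract_original` is a hypothetical helper and assumes the model ends its answer with a closing `</original>` tag, which this card does not state explicitly.

```python
def format_prompt(corrupted: str) -> str:
    """Build the raw prompt: corrupted text, objective tag, open <original> tag."""
    return (
        f"{corrupted}\n\n"
        "<objective>\n"
        "gently repair the <original> content\n"
        "</objective>\n\n"
        "<original>\n"
    )


def extract_original(completion: str) -> str:
    """Keep the text before the first </original> tag (assumed stop marker).

    If the tag never appears, the whole completion is returned, stripped.
    """
    body, _, _ = completion.partition("</original>")
    return body.strip()


if __name__ == "__main__":
    prompt = format_prompt("Th3 qu!ck br0wn f#x")
    print(prompt)
    # Stand-in for a model completion; no model is called in this sketch.
    fake_completion = "The quick brown fox\n</original>\n"
    print(extract_original(fake_completion))
```

The prompt is passed to the model as plain text with no chat template applied, and `</original>` is a natural candidate for a stop string if the inference stack supports one.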

The primary utility of this model is to synthesize rejected / lower quality preference data from pre-existing SFT data (i.e., the general pretraining corpus). This is useful for teaching a reward model generalized preferences from lower quality, subtly incoherent, base model-esque completions, which are trivial to produce compared to human annotations.
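
One way to picture that workflow is the sketch below, which pairs each clean response with a degraded copy of it. Everything here is an assumption for illustration: the `prompt` / `response` input fields, the `prompt` / `chosen` / `rejected` output layout (a common convention for preference datasets), and the `degrade` callable, which stands in for "corrupt the text, then let this model repair it".

```python
from typing import Callable, Dict, Iterable, List


def build_preference_pairs(
    records: Iterable[Dict[str, str]],
    degrade: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each clean response (chosen) with a degraded copy (rejected)."""
    pairs = []
    for rec in records:
        rejected = degrade(rec["response"])
        # A perfect repair carries no preference signal, so skip it
        # (a design choice of this sketch, not something the card specifies).
        if rejected.strip() == rec["response"].strip():
            continue
        pairs.append(
            {"prompt": rec["prompt"], "chosen": rec["response"], "rejected": rejected}
        )
    return pairs


if __name__ == "__main__":
    data = [
        {"prompt": "Describe a cat.", "response": "A cat is a small feline."},
        {"prompt": "Say hi.", "response": "Hi."},
    ]
    # Toy stand-in for the corrupt-then-repair round trip.
    toy_degrade = lambda text: text.replace("feline", "felt line")
    print(build_preference_pairs(data, toy_degrade))
```

In practice `degrade` would wrap a corruption step and a call to this model, so the rejected side is a plausible but slightly off reconstruction rather than obvious noise.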

Acknowledgements

Trained on 8xH200s provided free of charge by Deepshard for research & open source experimentation. Big McThankies.

Model size: 7.62B params (BF16)

Model tree for Quest-AI/quest-corruption-7b-s375-v3-GRPO

Base model: Qwen/Qwen2.5-7B (this model is a finetune of it)