license: mit
model: mosaicml/mpt-7b

Base Model: MPT-7B

This is an experimental Hermes Lite version: it excludes the Nous Instruct training data that the Hermes model was also trained on.

Big thanks to the BitTensor foundation for the compute to attempt this experiment!

Something seems to have gone wrong during training that I cannot identify. While the model does seem improved over the base model, it does not seem to have learned nearly as much as Llama did when training Hermes.

Typically, the model would give long responses when asked, be much more contextually intelligent, and answer in a thoughtful way. However, for whatever reason (likely something to do with not training with LLM-Foundry), the model does not like longer responses and typically responds quite briefly.

I don't believe this is purely a base model issue. If it is one, I believe it comes from the interaction between the base model and the trainer: I compared this fine-tune with the MPT-7B Instruct model, which had no problem at all producing extremely long responses. If anyone has the time to investigate, please follow up with me in the community tab or on Twitter, @Teknium1!

I trained Replit 3b with the same trainer and the same settings, and its results were phenomenal. So I would love any hypotheses on what may have made this run different.

You should load the model and tokenizer like so:

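import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# MPT-7B uses the GPT-NeoX-20B tokenizer; set its padding token explicitly.
tokenizer = AutoTokenizer.from_pretrained('EleutherAI/gpt-neox-20b')
tokenizer.pad_token = "<|padding|>"
# trust_remote_code=True is needed because MPT ships its own model code.
model = AutoModelForCausalLM.from_pretrained(
    "teknium/MPT-7B-Mercury-Experimental",
    torch_dtype=torch.float16,
    device_map='auto',
    trust_remote_code=True
)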

You should pass the eos_token_id parameter to the generate function, and set skip_special_tokens=True when decoding with the tokenizer.

# `prompt` is your input text (a string).
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
generated_ids = model.generate(input_ids, max_new_tokens=512, do_sample=True, top_p=0.5, top_k=0, repetition_penalty=1.1, min_new_tokens=100, eos_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
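
The generate and decode calls above can be wrapped into one reusable helper. This is only a minimal sketch: the function name `generate_response` and its default arguments are illustrative, not part of the model's API, and it assumes `model` and `tokenizer` were loaded as shown earlier. This card does not specify a prompt template, so the prompt is passed through unchanged.

```python
def generate_response(model, tokenizer, prompt,
                      max_new_tokens=512, min_new_tokens=100):
    """Generate a reply to `prompt` and return only the newly generated text.

    Hypothetical helper: the sampling settings mirror the ones used on this
    card and are a starting point, not tuned values.
    """
    # Tokenize the prompt and move it to the same device as the model.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    generated_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        min_new_tokens=min_new_tokens,
        do_sample=True,
        top_p=0.5,
        top_k=0,
        repetition_penalty=1.1,
        eos_token_id=tokenizer.eos_token_id,
    )

    # The output contains the prompt tokens followed by the new tokens,
    # so slice off the prompt before decoding.
    new_tokens = generated_ids[0][input_ids.shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Example call (the prompt text is a placeholder):
# response = generate_response(model, tokenizer, "Explain what a tokenizer does.")
```

Keeping `min_new_tokens` as an argument makes it easy to check how the model behaves when it is not forced past its usual short replies.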

While the model is not quite where I'd like it to be, it could be useful for learning how MPT models work, and for some use cases, so it is uploaded here.