
It's um ... ok with its output.

#2
by dyoung - opened

Hello,

I gave this model a run with the GGUF Q5_K_M quantization (https://huggingface.co/TheBloke/CodeNinja-1.0-OpenChat-7B-GGUF), using ctransformers as my backend and for Python bindings.
I've only run a few tests and I'm finding that it's ok. But it's not ideal compared to others of roughly the same family in my experience. Currently, I'd prefer neural openhermes slerp (OpenHermes-2.5-neural-chat-v3-3-Slerp) over this model.

I noticed that CodeNinja was prone to keep making the same mistake, even after I pointed it out plainly. And it seems to be very brief in all of its output.

Example of it being annoyingly brief/terse:

CodeNinja

  • prompt: "GPT4 Correct User: Could you give me an example of using PUB SUB with ZMQ library in python?<|end_of_turn|>GPT4 Correct Assistant:"

  • response: """Here's a basic example of using PUB/SUB pattern with ZeroMQ (ZMQ) library in Python.

import zmq

context = zmq.Context()

publisher = context.socket(zmq.PUB)
subscriber = context.socket(zmq.SUB)
subscriber.connect("tcp://localhost:5556")
publisher.send_string("Hello, world!")

In this example, we first create a ZMQ context and two sockets: one for publishing messages (publisher), and one for subscribing to them (subscriber). We then connect the subscriber socket to a localhost address on port 5556. Finally, we send the string "Hello, world!" from the publisher socket to the subscriber socket.

This is just a basic example to get you started. You can modify and extend it according to your specific needs."""

Contrast that to ...

Open Hermes neural

  • prompt = "<|im_start|>system
    It is the month of may. You are Alex, a highly trained and helpful language model.<|im_end|>
    <|im_start|>user
    Could you give me an example of using PUB SUB with ZMQ library in python?<|im_end|>
    <|im_start|>assistant
    "

  • response: """'''Sure! Here's a simple Python script that demonstrates the Pub/Sub (Publish and Subscribe) pattern using ZeroMQ (ZMQ) library:

import zmq

context = zmq.Context()
publisher = context.socket(zmq.PUB)
subscriber1 = context.socket(zmq.SUB)
subscriber2 = context.socket(zmq.SUB)

# Bind publisher to IP address and port number
publisher.bind("tcp://*:5555")

# Connect subscribers to publisher
subscriber1.connect("tcp://localhost:5555")
subscriber2.connect("tcp://localhost:5555")

# Set filter for Subscriber 1, which will receive only messages with 'topic_a'
subscriber1.setsockopt(zmq.SUBSCRIBE, b"topic_a")

# Send message to all connected subscribers
publisher.send_multipart([b"message", b"Hello World"])

while True:
    # Receive messages from the publisher and print them at Subscriber 1 and Subscriber 2
    topic, msg = subscriber1.recv()
    print("Subscriber 1 received:", str(topic), str(msg))
    msg = subscriber2.recv()
    print("Subscriber 2 received:", str(msg))

In this example, we have a publisher that sends messages to two subscribers using the Pub/Sub pattern. The publisher is bound to IP address "localhost" and port number "5555". Subscriber 1 is set up to receive only messages with the topic "topic_a", while both subscribers will receive any other message without a specific filter.

To run this example, make sure you have ZeroMQ installed on your Python environment:

pip install pyzmq

Then, execute the script in two different terminals (one for publisher and one for each subscriber):

python publisher.py
python subscriber1.py
python subscriber2.py

When you run this code, the Publisher will send a message to all connected Subscribers, including the filtered message "topic_a" to Subscriber 1 only. You'll see the output printed on each terminal, demonstrating the Pub/Sub communication between them.'''
"""

  • tok cnt: 539 | gen time: 26.797 (sec) | tok gen rate: 20.114 (tok/sec)
  • ATTN: LLM output contains python code. You can to use !run_python_code to run it.

As for the other issue, I really wanted to share that I asked it to demonstrate something as simple as a script to list the files on a Windows desktop. It just kept trying to use the path <username>/desktop/*.* even though I had plainly pointed out, in a follow-up chat, that it was missing the root and the directories that precede the user's name.

I made sure that I got the multi-turn chat prompt template correct. I should note that I assumed back-and-forth chat capability because the example prompt template on your model card shows multi-turn chat, and because the model is based on a chat-instructed model. (To be transparent, I've yet to get familiar with the datasets you used. And I'm new to OpenChat, which I'll have fun getting familiar with over the next week or two because I'm curious.)

I figured that, as with other chat models, some back-and-forth chat would likely have helped it get it right eventually. But when that was tried and failed, it kind of painted it as pointless and frustrating to me. Other models would have gotten and understood the hint more easily. And I'm used to better success rates.

From what I saw in my first use of it, CodeNinja understands and shares enough to get the job done. But it is a bit more frustrating than my experience with other recent models.

I point these behaviors out hopefully to help with the research and quality development of LLMs. Just because it tested well doesn't mean it fits well in practical applications. Testing doesn't fully mimic real life, and you could have a model that can't really generalize outside of the tests. That being said, I'm still going to play with it for a little longer to see what I can make it do. It's an interesting idea. I'm happy to admit that I may be a bit at fault for the results I got due to some ignorance. So I'm open to constructive points sent my way.

hey @dyoung ,

thanks a lot for your comprehensive feedback! 👍

i've set up the model to be concise in its responses, aiming for quick answers without excessive detail - a trait i find bothersome in some models.

i've noticed the prompts we use significantly influence the response quality. to me, this feels like a bug, yet it's quite common in many models. 😅

as illustrated below, a slight modification to your prompt, with additional guidance, yields a better response:

image.png

about the second example you mentioned, could you share the exact prompt you used? i've experimented with various prompts and the model seems to produce accurate results, though markdown formatting appears a bit off in LM Studio.

image.png

in the above case, i first requested a python script, then a rust script to read Windows folders, and finally another python script to read macOS folder contents. so, the model is quite adept at handling full chat conversations too. 🙂

i wonder if some issues might be linked to the quantized version you're using? not sure. the model is designed to be an assistant with less verbosity. i personally prefer testing models not just with a few queries but by integrating them into my daily tasks for at least a week. benchmarks can be misleading, and i believe real-world usefulness, especially for coding tasks, is what truly counts. i've received positive feedback from the community regarding its practicality for daily work, but insights like yours are invaluable. they help me identify areas for improvement, which i'm addressing in an upcoming version. so, please continue to provide your feedback. 🙏

feel free to keep testing the model, use it for actual coding tasks, and see how it fares. also, try the q8 quantized version and compare the results. your effort in evaluating the model is greatly appreciated, thank you so much! ❤️

oh btw, just a small fyi: on windows, python accepts both backslashes \ and forward slashes / as path separators, so either would work 😃
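To make the path point concrete, here is a minimal sketch using only the standard library. The user name "alice" and the helper list_desktop are hypothetical and exist only for illustration. It shows that the two separator styles compare equal as Windows paths, that a path starting at the user name is relative (the missing root described in the first post), and one way to build an absolute desktop path from the home directory. It assumes the desktop folder lives at "Desktop" under the home directory, which is not true on every system.

```python
from pathlib import Path, PureWindowsPath

# Hypothetical user name "alice", used only for illustration.
with_forward = PureWindowsPath("C:/Users/alice/Desktop")
with_back = PureWindowsPath(r"C:\Users\alice\Desktop")
same_path = with_forward == with_back  # separator style does not matter here

# A path that starts at the user name has no drive or root, so it is relative.
relative = PureWindowsPath("alice/Desktop")
relative_is_absolute = relative.is_absolute()


def list_desktop(desktop=None):
    """Return the sorted entry names in a desktop folder.

    Assumption: when no folder is given, the desktop is "Desktop" under
    the current user's home directory. Returns an empty list if the
    folder does not exist.
    """
    folder = Path(desktop) if desktop is not None else Path.home() / "Desktop"
    if not folder.is_dir():
        return []
    return sorted(entry.name for entry in folder.iterdir())


if __name__ == "__main__":
    print("same path:", same_path)
    print("relative path is absolute:", relative_is_absolute)
    for name in list_desktop():
        print(name)
```

Building the path from Path.home() avoids hard-coding the drive letter and the "Users" directory, which is the part the model kept leaving out.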

beowolx changed discussion status to closed

Hi,
I'm only now seeing your reply, days later. I'm a bit busy with some other important things. This is a personal hobby/interest for me, so it sometimes has to go on the back burner. I'll get back to you as soon as I can. This gives me some time to think about what has been said. Thanks for getting back to me with some helpful points.

Hello,

8 bit quants are too rich for my blood right now for a serious, responsive local daily LLM coding assistant at home (8 GB VRAM). And to be honest, they are for most people for this kind of use. Most consumer graphics cards are bought by gamers. (https://store.steampowered.com/hwsurvey/Steam-Hardware-Software-Survey-Welcome-to-Steam?platform=combined) So reasonable local LLM availability at 8 bit or above for 7B is still likely highly limited for many, unless you have a model that performs well at 4 or 5 bit quantization. From what I see, this is a big reason quantization is appealing: it's bringing LLM usage within reach of everyday people.

I could run the 8 bit at home, but I would have to run it on CPU or split it across GPU and CPU, with a serious drop in token generation rate. And the latency would likely make it unusable for me. I might have to try it anyway just to see how it goes for this case.
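As a rough back-of-the-envelope check of the point above, the sketch below estimates the size of the weights alone as parameters times bits per weight. It is only an approximation under stated assumptions: it uses the nominal bit width (real GGUF quant formats store somewhat more than the nominal figure per weight), it counts 1 GB as 1e9 bytes, and it ignores the KV cache and other runtime buffers, which also need VRAM. The function name weight_size_gb is made up for this example.

```python
def weight_size_gb(n_params, bits_per_weight):
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes).

    Assumption: every weight is stored at exactly bits_per_weight bits.
    """
    return n_params * bits_per_weight / 8 / 1e9


VRAM_GB = 8  # the 8 GB card mentioned above

if __name__ == "__main__":
    for bits in (4, 5, 8):
        size = weight_size_gb(7e9, bits)
        headroom = VRAM_GB - size
        print(f"7B at {bits} bit: ~{size:.1f} GB of weights, "
              f"~{headroom:.1f} GB left on an {VRAM_GB} GB card")
```

By this estimate a 7B model at 8 bit needs about 7 GB for weights alone, which leaves very little of an 8 GB card for context and buffers, while 5 bit leaves a few GB of headroom.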

Also, sadly I'm not able to shell out what would be needed for a new system that could support a 7B model or larger above 5 bit quantization. My rig is a bit aged, and I would need a whole new one to make more VRAM worthwhile and not a waste.

I do have access to better hardware, but only on a time-limited basis: an hour or two each day. That will not be enough for a serious daily driver, which is what I'm aiming for. There is the privacy concern as well when using it. But I do, and can, use the limited-access hardware for testing, learning and curiosity, as long as it's with things I don't mind possibly being shared.

Also, I'm not sure about renting or paying for a hosting service. It still feels a bit expensive, per token or per hour, for serious heavy daily use. (But not for quick and short testing.) And using a product while I'm a product to someone because I generate data does, to be honest, make my skin crawl a bit. Unfortunately I do have to use services like this once in a while. That's another reason I want an at-home solution.

From what I can tell, TheBloke has a really long history of quality, successful quantizations of others' models. This could be a bad quantization batch, but that's pretty rare from him, or at least with the models I've picked to run from him historically. It could be that the model just doesn't quantize to that low of a bit rate very well. That seems odd to me, since the Mistral family seems to have been fine under quantization to these levels. I've run a lot of 5 bit quants of Mistral-family models at home for a decent while now, with output and gen rates I'm happy with.

As for the desktop file listing example, it's a personal bench/perf/eval test I do with local models and their current setup to see how they respond.
I don't have the input prompt anymore that I used for the desktop directory listing example. Nor do I have the model's responses to that prompt from that time.
Today I tried to load the same model and configs that I used to start this discussion, to replicate it with my best guess as to what I may have used for that prompt. I was hoping it would generate output similar to what I remember when it failed. But today I'm not able to get the same experience (with the same question/input) I first had with my many prompts with CodeNinja last week.
I did have to reload the same 5 bit quant GGUF model from TheBloke for CodeNinja (he does have several forms of the 5 bit quants, and I made sure I pulled the same one). I checked the upload date in his repo for that file and it's dated long before my first use last week. So it looks to be the same GGUF 5 bit quant model I used at the start of this discussion. The difference is likely due to my guessed prompt being a bit off from the original prompt that was giving me issues.

Like you, I was able to get the same model and setup that I used to start this discussion to provide better examples after explicitly asking it for more details. I tried this today. It's good to have it working decently with the 5 bit quant, and it's something to remember as a quirk of CodeNinja if I use it. I'd gotten used to other LLMs providing more steps by default. There are pros and cons to long vs. terse. I think you should default to terse to test your ideas, but keep the model capable of long-form output when needed. Not doing so risks cutting out some of the model's ability to reason its way to decent solutions when addressing something outside of its understanding and/or the memorized domain in its latent space. It's like when humans tackle something new and reach a decent solution using creative reasoning, their old knowledge, and newly observed details. Chain of Thought and Tree of Thought reasoning illustrate my point.

A question about the gist link (https://gist.github.com/beowolx/b219466681c02ff67baf8f313a3ad817) (LM Studio Preset) from your model card. Is that what was used for the code generated as a reply to my 1st entry in this discussion? I likely can replicate those generation configs with my custom setup and terminal UI.

I like working with the models raw in a terminal for user input and output when prototyping. It seems simpler and reasonable enough for my use. There is less boilerplate and fewer 3rd party libraries needed to get prototypes going when coding new ideas. That's why Python bindings for LLM inference were important for me, and why I chose marella's C/C++ take on Hugging Face's transformers library (ctransformers). (You get some of the performance boost that compiled languages offer.) It seems to be the sweet spot for me with how I work and understand related things at the moment.

Also, I would like to see what your community is saying about this model, to see for myself and get a feel for how they are using it differently from how I am, so I can get new ideas and concepts to learn from.

I also checked the code in this discussion, both the Python ZMQ code OpenHermes generated on my end and the Python ZMQ code CodeNinja generated on your end, and they both got important parts wrong. As in, both scripts will throw an error and not be able to do the reasonable task asked of them. As this is getting long, I'll leave the details out for now. Maybe I'll point those out later if needed. I don't judge heavily or expect models to get everything right on the first try. And mitigation and correction after a model makes a code mistake/error can be another discussion later if needed.

Another question I have: there aren't any obvious examples near your model card showing how to prime CodeNinja with a pre-prompt/system prompt. Though I see something in the LM Studio preset in the gist link mentioned above, suggesting that it's likely a needed part of working with the model. (As I've come to expect from most models.) What, if anything, is the model trained to expect for a priming pre-prompt/system prompt? My quick guess currently is: "{pre_prompt} GPT4 Correct User: {user_input} <|end_of_turn|>GPT4 Correct Assistant:".
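For reference, the turn format used in the first post, and the guess above, can be sketched as a small helper. The function name build_prompt and its arguments are hypothetical. The turn labels and the <|end_of_turn|> marker are copied from the prompt shown earlier in this discussion. The placement of pre_prompt simply mirrors the guess above and is an unconfirmed assumption, not a documented format for this model.

```python
END_OF_TURN = "<|end_of_turn|>"

# Turn labels as they appear in the prompt from the first post.
LABELS = {
    "user": "GPT4 Correct User",
    "assistant": "GPT4 Correct Assistant",
}


def build_prompt(turns, pre_prompt=""):
    """Build a multi-turn prompt string.

    turns: list of (role, text) pairs, role being "user" or "assistant".
    pre_prompt: optional text placed before the first turn. Assumption:
    this placement follows the guess in this thread and is not confirmed
    by the model card.
    """
    parts = [f"{LABELS[role]}: {text}{END_OF_TURN}" for role, text in turns]
    prompt = "".join(parts) + f"{LABELS['assistant']}:"
    if pre_prompt:
        prompt = f"{pre_prompt} {prompt}"
    return prompt


if __name__ == "__main__":
    history = [
        ("user", "Write a script that lists the files on my desktop."),
        ("assistant", "Here is a first attempt."),
        ("user", "The path is missing the drive and the Users directory."),
    ]
    print(build_prompt(history))
```

Keeping earlier turns in the list, including the correction, is what gives the model a chance to act on a follow-up hint.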

Hello,

I stand corrected!

I was not fully aware of how memory/data size management differs between local LLM backends. Some are more efficient than others. I discovered this when giving llama-cpp-python a test run for the first time on the higher-end, time-limited hardware I have access to.

I was aware of llama-cpp-python and its parent lib, but had steered away because GPU setup was harder than what ctransformers offered. And I think ctransformers was more OS agnostic than other Python bindings with GPU model loading. (I run multiple OSes at home for my needs: Windows and Linux.) It seemed more friendly and reasonable enough for what I needed at the time.

But today I was able to fully load a 10.7B 4 bit quant model onto my 8 GB GPU using llama-cpp-python. That's something I didn't think possible based on my past experience with ctransformers and transformers. It also had token gen rates that were reasonable. A bit slow, but reasonable, which was expected. Still very impressed!

Getting Nvidia's CUDA Toolkit set up so I could do GPU loading and inference was a pain. I'll have to battle test it to see how it holds up across the many ways I like to use LLMs. I may not be out of the woods yet with it working for my needs. There could be a few surprises or adjustments.

ctransformers has been very slow to catch up with the recent excitement in the LLM world. llama-cpp seems to have a very active and adaptable community. I wanted to try the new stuff, so I took the leap to try the llama-cpp family of libraries. I'm glad that I did. It was worth the pain so far.

I'll be giving your 8 bit more attention.
