C++ onnxruntime+cuda behaves weirdly with cuda/cuda-int4-rtn-block-32 and cuda/cuda-fp16 models

#3
by idruker - opened

Dear model developers

I've observed weird behavior when running C++ onnxruntime compiled with CUDA against your cuda/cuda-int4-rtn-block-32 and cuda/cuda-fp16 ONNX models.

Steps to reproduce

  • compile my test application with onnxruntime 1.18.0 + CUDA
  • load Llama-3.2-3B-Instruct-ONNX/cuda/cuda-int4-rtn-block-32/model.onnx or Llama-3.2-3B-Instruct-ONNX/cuda/cuda-fp16/model.onnx
  • run a conversation in which prompts and responses are appended to a chat history using the chat template:
    1. create the prompt (P1) 'You are a pirate chatbot who always responds in pirate speak!'
    2. wait for a response (R11)
    3. create the prompt (P2) 'user Who are you?'
    4. wait for a response (R12)
  • delete the entire chat history and reset the kv-cache, inputs, and outputs, then repeat the exact same conversation:
    1. create the prompt (P1) 'You are a pirate chatbot who always responds in pirate speak!'
    2. wait for a response (R21)
    3. create the prompt (P2) 'user Who are you?'
    4. wait for a response (R22)

The expectation is that R21 == R11 and R22 == R12. In fact, the response R22 is very different from R12 and looks like garbage!
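
The setup looks roughly like the sketch below (simplified; the model path and the KV-cache dimensions are placeholders, and the real code reads the actual input names and shapes from the session):

```cpp
#include <onnxruntime_cxx_api.h>

#include <array>
#include <cstdint>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "llama-repro");

  // Session created once with the CUDA execution provider on device 0.
  Ort::SessionOptions opts;
  OrtCUDAProviderOptions cuda_opts{};
  opts.AppendExecutionProvider_CUDA(cuda_opts);
  Ort::Session session(env, ORT_TSTR("cuda/cuda-fp16/model.onnx"), opts);

  // Before the second pass the chat history is cleared and every
  // past_key_values.* input is replaced by a fresh tensor with
  // past_seq_len == 0, i.e. the KV cache is rebuilt from scratch.
  // Shape: [batch, num_kv_heads, past_seq_len, head_dim]; the 8 and 128
  // are assumptions for Llama-3.2-3B and should be taken from the model.
  Ort::AllocatorWithDefaultOptions allocator;
  std::array<int64_t, 4> empty_kv_shape{1, 8, 0, 128};
  Ort::Value empty_past = Ort::Value::CreateTensor<Ort::Float16_t>(
      allocator, empty_kv_shape.data(), empty_kv_shape.size());

  // Pass 1: feed P1, decode R11; feed P2 (with history), decode R12.
  // Reset:  clear history, replace all past_key_values inputs as above.
  // Pass 2: feed P1, decode R21; feed P2, decode R22.
  // R21 matches R11, but R22 differs wildly from R12 when CUDA is used.
  (void)session;
  (void)empty_past;
  return 0;
}
```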

Observations:

  • The issue does not occur if the same code is compiled and run without CUDA.
  • The issue does not occur if the CPU execution provider is used instead of the CUDA one, even though onnxruntime is compiled with CUDA (see the sketch after this list).
  • The issue does not occur if the model is cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4
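
To be clear about the second observation, the CUDA and CPU runs use the same CUDA-enabled build; the only difference is whether the CUDA provider is appended, roughly like this (sketch, placeholder model path):

```cpp
#include <onnxruntime_cxx_api.h>

// Same CUDA-enabled build of onnxruntime; whether the run uses the CUDA EP
// or falls back to the default CPU EP depends only on this flag.
Ort::Session MakeSession(Ort::Env& env, const ORTCHAR_T* model_path,
                         bool use_cuda) {
  Ort::SessionOptions opts;
  if (use_cuda) {
    OrtCUDAProviderOptions cuda_opts{};
    opts.AppendExecutionProvider_CUDA(cuda_opts);
  }
  // Without the CUDA provider the session runs on the CPU EP, and the
  // responses stay identical across the two passes.
  return Ort::Session(env, model_path, opts);
}
```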

Questions

  • is there a bug in the way CUDA flavors of LLM models are prepared?
  • is there a bug in onnxruntime+CUDA?
