ollama q4_0 model starts generating garbage for longer generations
It generates fine for short passages, and then generates garbage and doesn’t stop, like it’s not stopping on an end token or something.
Details:
$ sha256sum Phi-3-mini-4k-instruct-q4.gguf
4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e Phi-3-mini-4k-instruct-q4.gguf
$ ollama --version
ollama version is 0.1.32
$ ollama show phi3mini --modelfile
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM phi3mini:latest
FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """<|user|>
{{.Prompt}}<|end|>
<|assistant|>"""
PARAMETER num_ctx 4096
PARAMETER stop "<|end|>"
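To reproduce, I'm hitting the local server roughly like this (a minimal sketch against ollama's /api/generate endpoint; the model name and prompt are just placeholders):
```python
import json

import requests  # assumes the requests package is installed

# Stream a long generation from the local ollama server; on longer outputs
# the tail is where the generation degrades and never hits a stop token.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi3mini",  # placeholder: whatever name the model was created under
        "prompt": "Write a FastAPI service that batches prediction requests every 500ms.",
        "stream": True,
    },
    stream=True,
)

text = ""
for line in resp.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    text += chunk.get("response", "")
    if chunk.get("done"):
        break

print(text[-2000:])  # inspect the tail of the generation
```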
Here’s an example of what it generates when it goes crazy:
Let's extend the previous code to include an endpoint with simulated batching every 500ms:
```python
import asyncio
from datetime import datetime, timedelta
from fastapi import FastAPI, HTTPException
# (other imports remain unchanged)
app = FastAPI()
request_queue: asyncio.Queue[List[str]] = asyncio.Queue(maxsize=10) # Adjust size based on expected load.
@app
.post("/predict")
async def make_prediction(data: List[str]):
request_queue.put_nowait(data)
# Simulated periodic processing every 500ms, adjust the batching interval as needed.
async def process_requests():
while True:
# Get and process a batch of requests in an asynchronous manner using `asyncio` tasks.
async with websockets.connect("ws://localhost:8789/batch-predict") as websocket: # This should be replaced by your real WS endpoint setup.
start_time = datetime.now()
while True:
batch_data = [request_queue.get_nowait() for _ in range(min(10, request_queue.qsize())] # Adjust the number of requests processed count
with `asyncio`.
results = await asyncio.sleep(500ms: this implementation could be implemented using aiopsy.gathering task executor (forces to terminate
the code snippet of Llamasyncio. #python in python code for request processing and logging into an async requests, use oflaglio.
python asyncio. Usefulness andltwo using PyTensor(synchronously execute aio androboticsoloctionary models. Forward-like this model loading model serve
itermainlylog, yout0starter, butler toyfake 500s APIrioting requests. Assumeley processing of Llammingtonskilling serves with async functioning Python that
is serving a server to usecase `asyncecklinglingways, with the model operations every andltwoeither to run asynchronous, buttelenco forge anfake using
FastLlama2 andpy andlsusible models. Use of python server in-like service. You'll200sizeset modeling thatpsyolo use case youtio Python code:
tokens, butter to deployments and import a module loadable (simulate functionallys
requests,pytorch requests for a single modelinglaterkilling operation usages of python asyntockets using functions. The fast-likewheelitelemplates withs
atlas andlambatches andapipeckakes topsend0pinglingly andlatency to performers andkills, theks to beacon for a to usekautrial models andlsaying
Modeltockets, forgeams.pylatercs tops attengure asyntofakelemplexes,mymodel orsnakexilexlinglysight �languages ands20pinglexeskylexaxhing andphasesetypes
andheis_model andsane asserowley andorikaero to beckeryd2lambaker tobifacessimalyticskiles,coclampiveh-apiarubilasticlysomerpiplecoskillingly-
h-requests 0 `tocargoilis you -0
pift, pyrosting and the model
modellary toloud in thehing a new json
requests an pre-
a quickerndreaming to
to beck to
mika server
customs
you to
request -subronning
p larime, tolless no model �ary to the model the,l to b forked as an
toclaiming its requests a request and0
datheques
d20oric:s r-
server:
pyck as itoi
to
toches, for the
to serve
the model to
`API
requests
to �responses
custom
your' ayou
py `mops and batch (10ero
0-binary or no-
to
:
```
Also happens with the official phi3 ollama model at https://ollama.com/library/phi3
In case it's relevant: I'm running inference on 2x3090s on Ubuntu 20.04.6.
Same issue with LlamaEdge.
It also happens with the fp16 weights.
Fixed with:
python3 gguf-py/scripts/gguf-set-metadata.py Phi-3-mini-4k-instruct-q4.gguf tokenizer.ggml.eos_token_id 32007
gguf-py is part of ggerganov/llama.cpp.
Adapted from: https://www.reddit.com/r/LocalLLaMA/comments/1cb6cuu/comment/l0we43q
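If you want to check what the file has before and after patching, here's a minimal sketch assuming gguf-py's GGUFReader API (this mirrors how gguf-set-metadata.py accesses fields internally):
```python
from gguf import GGUFReader  # gguf-py, shipped with llama.cpp

# Read the GGUF metadata and print the current EOS token id.
# The patch above writes 32007 (Phi-3's <|end|> token).
reader = GGUFReader("Phi-3-mini-4k-instruct-q4.gguf", "r")
field = reader.get_field("tokenizer.ggml.eos_token_id")
eos_id = int(field.parts[field.data[0]][0])
print("tokenizer.ggml.eos_token_id =", eos_id)
```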
That seems to help a little, but I'm still seeing corruption in generation.
Hi there, I'm so sorry this happened. It's an issue we've seen with a few models when the context limit is hit. We're working on a longer-term fix for better handling of cases where the context limit is reached.
In the meantime, I've updated some of the runtime parameters here: https://ollama.com/library/phi3 (you can re-pull with ollama pull phi3; it should be fast to re-pull).
For those using the Modelfile, adding a num_keep parameter will help:
PARAMETER num_keep 16
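If you'd rather not edit the Modelfile, the same parameters can also be passed per request via the API's options field (a sketch; the model name is a placeholder):
```python
import requests  # assumes the requests package is installed

# Pass num_keep (and num_ctx) as per-request options instead of baking them
# into the Modelfile. num_keep controls how many tokens from the start of the
# context are retained when the context window fills up and gets shifted.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "phi3",  # or your local phi3mini build
        "prompt": "Explain what num_keep does in one paragraph.",
        "options": {"num_keep": 16, "num_ctx": 4096},
        "stream": False,
    },
)
print(resp.json()["response"])
```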
Thanks. I'm still seeing the issue when I hit the context limit with the latest Modelfile. But I guess that's because the last few messages are still overwhelming the context. Thanks for the hard work. Love your work!
I am wondering if you can try this:
FROM ./Phi-3-mini-4k-instruct-q4.gguf
TEMPLATE """<|user|>
{{.Prompt}}<|end|>
<|assistant|>"""
PARAMETER stop "<|end|>"
PARAMETER stop "<|endoftext|>"
PARAMETER num_ctx 4096
The official phi3 ollama Modelfile is currently:
FROM /usr/share/ollama/.ollama/models/blobs/sha256-4fed7364ee3e0c7cb4fe0880148bfdfcd1b630981efa0802a6b62ee52e7da97e
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>
"""
PARAMETER num_keep 4
PARAMETER stop "<|user|>"
PARAMETER stop "<|assistant|>"
PARAMETER stop "<|system|>"
PARAMETER stop "<|end|>"
PARAMETER stop "<|endoftext|>"