
Why do you set `use_cache=False`? Removing it will speed up generation

#8
by borzunov - opened

Hi,

I wonder why you set `use_cache=False` in config.json?

As far as I understand, this gives results identical to `use_cache=True` for autoregressive models but runs the O(n^3) generation algorithm instead of the O(n^2) one (i.e., it re-runs the entire prefix for every new token). I think you can significantly speed up generation for this model by removing this line from the config.
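To illustrate (a rough sketch assuming the standard Transformers `generate()` API; `your-model-id` below is a placeholder, not necessarily this repo's id), you can time generation with and without the cache:

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-model-id"  # placeholder: substitute the actual model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
inputs = tokenizer("Hello, world", return_tensors="pt")

# With use_cache=False every step re-runs attention over the whole prefix;
# with use_cache=True past key/value states are reused, so each step only
# processes the newly generated token.
for use_cache in (False, True):
    start = time.time()
    model.generate(**inputs, max_new_tokens=64, use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.time() - start:.2f}s")
```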

borzunov changed discussion title from Why do you set `use_cache=False`? to Why do you set `use_cache=False`? Removing it will speed up generation

It is needed during training. In my opinion, you can change it to True for inference.
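For example (a sketch assuming the standard Transformers API; `your-model-id` is a placeholder for this repo's id):

```python
from transformers import AutoModelForCausalLM

# Override the value from config.json at load time so inference uses the KV cache.
model = AutoModelForCausalLM.from_pretrained("your-model-id", use_cache=True)

# Or flip it on an already-loaded model before calling generate().
model.config.use_cache = True
```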
