
The model is not optimized for inference

#7
by Imran1 - opened

I load the model for inference on 2x H100 GPUs, but it is very slow even with Flash Attention.
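For reference, a minimal sketch of this kind of two-GPU Transformers setup (the `model_id` and generation arguments below are placeholders and assumptions, not the poster's actual code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: substitute the actual checkpoint from this model page.
model_id = "nvidia/<model-name>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # bf16 halves memory vs fp32
    device_map="auto",                        # shard layers across both H100s (needs `accelerate`)
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Note that `device_map="auto"` places layers across the GPUs sequentially, so during generation only one GPU is active at a time; a tensor-parallel serving stack such as vLLM or text-generation-inference is usually much faster for multi-GPU inference.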

Can you share your code?
