Damien Benveniste committed on
Commit
ae23345
·
1 Parent(s): a959d74
Files changed (1) hide show
  1. app.py +1 -1
app.py CHANGED
@@ -16,7 +16,7 @@ engine = AsyncLLMEngine.from_engine_args(
16
  max_num_batched_tokens=512, # Reduced for T4
17
  max_num_seqs=16, # Reduced for T4
18
  gpu_memory_utilization=0.85, # Slightly increased, adjust if needed
19
- max_model_len=4096, # Phi-3-mini-4k context length
20
  quantization='awq', # Enable quantization if supported by the model
21
  enforce_eager=True, # Disable CUDA graph
22
  dtype='half', # Use half precision
 
16
  max_num_batched_tokens=512, # Reduced for T4
17
  max_num_seqs=16, # Reduced for T4
18
  gpu_memory_utilization=0.85, # Slightly increased, adjust if needed
19
+ max_model_len=512, # Reduced from 4096 (Phi-3-mini-4k max) to fit T4 memory
20
  quantization='awq', # Enable quantization if supported by the model
21
  enforce_eager=True, # Disable CUDA graph
22
  dtype='half', # Use half precision