title: README
emoji: 💻
colorFrom: purple
colorTo: blue
sdk: static
pinned: false

The Future of AI is Open

Neural Magic helps developers accelerate deep learning performance using automated model compression technologies and inference engines. Download our compression-aware inference engines and open-source tools for fast model inference.

  • vLLM: A high-throughput and memory-efficient inference engine for LLMs
  • llm-compressor: HF-native library for applying quantization and sparsity algorithms to LLMs for optimized deployment with vLLM
  • DeepSparse: Inference runtime offering accelerated performance on CPUs and APIs to integrate ML into your application

NM Workflow

In this profile we provide accurate model checkpoints, compressed with state-of-the-art methods and ready to run in vLLM. Available quantization schemes include W4A16, W8A16, W8A8 (int8 and fp8), and many more! If you would like help quantizing a model, or would like to request a new checkpoint, please open an issue at https://github.com/vllm-project/llm-compressor.
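For readers new to the scheme labels: a name such as W4A16 is commonly read as "4-bit weights, 16-bit activations". The sketch below is only an illustration of that naming convention, not part of vLLM or llm-compressor. The helpers `parse_scheme` and `approx_weight_gib` and the 8-billion parameter count are hypothetical, and the size estimate ignores scales, zero points, and any layers left unquantized.

```python
import re
from typing import NamedTuple


class Scheme(NamedTuple):
    """Bit widths implied by a label such as 'W4A16' (hypothetical helper type)."""
    weight_bits: int
    activation_bits: int


def parse_scheme(name: str) -> Scheme:
    """Split a 'W<bits>A<bits>' label into weight and activation bit widths."""
    match = re.fullmatch(r"W(\d+)A(\d+)", name.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized scheme label: {name!r}")
    return Scheme(int(match.group(1)), int(match.group(2)))


def approx_weight_gib(num_params: float, scheme: Scheme) -> float:
    """Rough weight-only footprint in GiB.

    Assumption: every parameter is stored at `weight_bits`; real checkpoints
    also carry scales and may keep some layers at higher precision.
    """
    return num_params * scheme.weight_bits / 8 / 2**30


if __name__ == "__main__":
    # 8e9 is an illustrative parameter count, not a specific model.
    for label in ("W4A16", "W8A16", "W8A8"):
        scheme = parse_scheme(label)
        print(label, scheme, f"{approx_weight_gib(8e9, scheme):.2f} GiB")
```

Under these assumptions, W4A16 roughly halves the weight footprint relative to the 8-bit weight schemes, while W8A8 additionally quantizes activations. The label alone does not distinguish int8 from fp8, so that detail has to come from the checkpoint's own description.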