🏅 Quantized Embeddings are here! Unlike model quantization, embedding quantization is a post-processing step for embeddings that converts e.g. float32 embeddings to binary or int8 embeddings. This saves 32x or 4x memory & disk space, and these embeddings are much easier to compare!

Our results show 25-45x speedups in retrieval compared to full-size embeddings, while keeping 96% of the performance!
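As an illustration, here is a minimal sketch of that post-processing step using the quantize_embeddings helper from Sentence Transformers (the model name and texts are placeholder choices; see the blogpost below for the full walkthrough):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Placeholder model and texts, chosen only for illustration.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
embeddings = model.encode([
    "Embedding quantization is a post-processing step.",
    "It trades a little accuracy for far cheaper storage and retrieval.",
])  # float32: 4 bytes per dimension

# float32 -> binary: 1 bit per dimension, i.e. 32x less memory & disk space.
binary_embeddings = quantize_embeddings(embeddings, precision="binary")

# float32 -> int8: 1 byte per dimension, i.e. 4x less memory & disk space.
# The value ranges are estimated from this tiny batch here; in practice you
# would calibrate them on a larger corpus.
int8_embeddings = quantize_embeddings(embeddings, precision="int8")

print(embeddings.dtype, embeddings.nbytes)                # float32
print(binary_embeddings.dtype, binary_embeddings.nbytes)  # int8, bits packed 8 per byte
print(int8_embeddings.dtype, int8_embeddings.nbytes)      # int8
```

Binary embeddings can then be compared with fast Hamming distance, which is where much of the retrieval speedup comes from.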
Learn more about it in our blogpost in collaboration with mixedbread.ai: https://huggingface.co/blog/embedding-quantization
Or try out our demo where we use quantized embeddings to let you search all of Wikipedia (yes, 41,000,000 texts) in 1 second on a CPU Space: sentence-transformers/quantized-retrieval