license: mit
tags:
- sentence-embeddings
- endpoints-template
- optimum
library_name: generic
Optimized and Quantized sentence-transformers/all-MiniLM-L6-v2 with a custom pipeline.py
This repository implements a custom sentence-embeddings task for 🤗 Inference Endpoints, with inference accelerated by 🤗 Optimum. The code for the customized pipeline is in pipeline.py.
The section How to create your own optimized and quantized model explains how the model was converted and optimized. It is based on the blog post Accelerate Sentence Transformers with Hugging Face Optimum, and it also shows how to create and test your own custom pipeline. A notebook is included as well.
To deploy this model as an Inference Endpoint, you have to select Custom as the task so that the pipeline.py file is used. Double-check that it is selected.
Expected request payload
{
  "inputs": "The sky is a blue today and not gray"
}
Below is an example of how to run a request using Python and requests.
Run Request
import requests as r

ENDPOINT_URL = ""  # URL of your Inference Endpoint
HF_TOKEN = ""  # your Hugging Face access token


def predict(document_string: str = None):
    # send the sentence as the "inputs" field of the JSON payload
    payload = {"inputs": document_string}
    response = r.post(
        ENDPOINT_URL, headers={"Authorization": f"Bearer {HF_TOKEN}"}, json=payload
    )
    return response.json()


prediction = predict(
    document_string="The sky is a blue today and not gray"
)
Expected output
{'embeddings': [[-0.021580450236797333,
0.021715054288506508,
0.00979710929095745,
-0.0005379787762649357,
0.04682469740509987,
-0.013600599952042103,
...
}
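Because the handler L2-normalizes each embedding before returning it, the dot product of two returned vectors equals their cosine similarity. A minimal sketch, assuming responses shaped like the output above; `cosine_from_responses` is a hypothetical helper name, and the short vectors are made-up illustration values, not real model output:

```python
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def cosine_from_responses(resp_a: Dict, resp_b: Dict) -> float:
    """Cosine similarity of the first embedding in each response.

    Assumes both responses look like {"embeddings": [[...]]} and that the
    vectors are already unit length, as the handler's normalize step suggests.
    """
    return dot(resp_a["embeddings"][0], resp_b["embeddings"][0])


# Made-up unit vectors standing in for two real endpoint responses.
fake_a = {"embeddings": [[0.6, 0.8]]}
fake_b = {"embeddings": [[0.8, 0.6]]}
similarity = cosine_from_responses(fake_a, fake_b)
print(round(similarity, 2))
```

If the vectors were not normalized, the dot product would have to be divided by the product of the two vector norms to get the cosine similarity.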
How to create your own optimized and quantized model
Steps:
1. Convert model to ONNX
2. Optimize & quantize model with Optimum
3. Create Custom Handler for Inference Endpoints
Helpful links:
Setup & Installation
%%writefile requirements.txt
optimum[onnxruntime]==1.3.0
mkl-include
mkl
Install requirements
!pip install -r requirements.txt
1. Convert model to ONNX
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
from pathlib import Path
model_id="sentence-transformers/all-MiniLM-L6-v2"
onnx_path = Path(".")
# load vanilla transformers and convert to onnx
model = ORTModelForFeatureExtraction.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)
2. Optimize & quantize model with Optimum
from optimum.onnxruntime import ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import OptimizationConfig, AutoQuantizationConfig
# create ORTOptimizer and define optimization configuration
optimizer = ORTOptimizer.from_pretrained(model_id, feature=model.pipeline_task)
optimization_config = OptimizationConfig(optimization_level=99) # enable all optimizations
# apply the optimization configuration to the model
optimizer.export(
    onnx_model_path=onnx_path / "model.onnx",
    onnx_optimized_model_output_path=onnx_path / "model-optimized.onnx",
    optimization_config=optimization_config,
)
# create ORTQuantizer and define quantization configuration
dynamic_quantizer = ORTQuantizer.from_pretrained(model_id, feature=model.pipeline_task)
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# apply the quantization configuration to the model
model_quantized_path = dynamic_quantizer.export(
    onnx_model_path=onnx_path / "model-optimized.onnx",
    onnx_quantized_model_output_path=onnx_path / "model-quantized.onnx",
    quantization_config=dqconfig,
)
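To check what the optimization and quantization steps did to the files on disk, the sizes of the exported models can be compared. This is a minimal sketch using only the standard library; `onnx_sizes_mb` is a hypothetical helper, the file names are the ones used in the export calls above, and files that do not exist are skipped rather than reported:

```python
from pathlib import Path
from typing import Dict, Iterable


def onnx_sizes_mb(directory: Path, names: Iterable[str]) -> Dict[str, float]:
    """Return the size in MB of each named file that exists in `directory`."""
    sizes = {}
    for name in names:
        file_path = directory / name
        if file_path.is_file():
            sizes[name] = file_path.stat().st_size / (1024 * 1024)
    return sizes


# File names taken from the export steps above; missing files are skipped.
sizes = onnx_sizes_mb(
    Path("."), ["model.onnx", "model-optimized.onnx", "model-quantized.onnx"]
)
for name, size_mb in sizes.items():
    print(f"{name}: {size_mb:.2f} MB")
```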
3. Create Custom Handler for Inference Endpoints
%%writefile pipeline.py
from typing import Dict, List, Any
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch.nn.functional as F
import torch


# copied from the model card
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


class PreTrainedPipeline():
    def __init__(self, path=""):
        # load the optimized and quantized model
        self.model = ORTModelForFeatureExtraction.from_pretrained(path, file_name="model-quantized.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(path)

    def __call__(self, data: Any) -> Dict[str, List[List[float]]]:
        """
        Args:
            data (:obj:`dict`):
                includes the input data and the parameters for the inference.
        Return:
            A :obj:`dict` whose "embeddings" key holds the list of embeddings of the inference inputs.
        """
        inputs = data.get("inputs", data)

        # tokenize the input
        encoded_inputs = self.tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')
        # run the model
        outputs = self.model(**encoded_inputs)
        # perform mean pooling over the non-padding tokens
        sentence_embeddings = mean_pooling(outputs, encoded_inputs['attention_mask'])
        # L2-normalize the embeddings
        sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
        # postprocess the prediction
        return {"embeddings": sentence_embeddings.tolist()}
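The mean_pooling function in the handler averages the token embeddings while ignoring padding positions. Here is a dependency-free sketch of the same arithmetic for a single sentence, using made-up two-dimensional token vectors; the name `masked_mean` is hypothetical and not part of pipeline.py:

```python
from typing import List


def masked_mean(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            summed[i] += vector[i] * keep
    # Mirror the clamp(min=1e-9) in the handler to avoid dividing by zero.
    count = max(float(sum(attention_mask)), 1e-9)
    return [value / count for value in summed]


# Two real tokens followed by one padding token (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector does not affect the result because its mask entry is 0, which is why the handler passes the tokenizer's attention mask into the pooling step.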
Test custom pipeline
from pipeline import PreTrainedPipeline
# init handler
my_handler = PreTrainedPipeline(path=".")
# prepare sample payload
request = {"inputs": "I am quite excited how this will turn out"}
# test the handler
%timeit my_handler(request)
Results
1.55 ms ± 2.04 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
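%timeit is an IPython magic, so it only works in a notebook or IPython shell. In a plain Python script, a rough equivalent can be sketched with time.perf_counter. `time_call` is a hypothetical helper, and the lambda below is only a stand-in workload for the handler call, so its timings say nothing about the model:

```python
import time
from statistics import mean
from typing import Callable, List


def time_call(fn: Callable[[], object], repeats: int = 7, loops: int = 1000) -> List[float]:
    """Return the mean seconds per call for each of `repeats` runs."""
    per_call = []
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(loops):
            fn()
        per_call.append((time.perf_counter() - start) / loops)
    return per_call


# Stand-in workload; replace with `lambda: my_handler(request)` once the
# handler from the test snippet above is loaded.
timings = time_call(lambda: sum(range(100)), repeats=3, loops=100)
print(f"{mean(timings) * 1e3:.4f} ms per call (mean of {len(timings)} runs)")
```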