Optimize inference speed
Can ONNX optimization be applied to improve inference speed?
Yes. There are some open-source ONNX models on Hugging Face, for example: https://huggingface.co/aapot/bge-m3-onnx
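For anyone who wants to try that route, here is a minimal sketch of running such an ONNX export with onnxruntime. The local file name model.onnx, the input names, and the assumption that the first output holds the last hidden state are mine, not taken from the linked repo, so check its README before relying on this.

```python
# Minimal sketch: dense embeddings from a bge-m3 ONNX export with onnxruntime.
# Assumptions (check the ONNX repo's README): the exported file is "model.onnx"
# and its first output is the last hidden state of shape (batch, seq_len, hidden).
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

encoded = tokenizer(
    ["What is BGE M3?", "Definition of BM25"],
    padding=True, truncation=True, return_tensors="np",
)
# Feed only the inputs that the exported graph actually declares.
onnx_input_names = {i.name for i in session.get_inputs()}
feeds = {name: array for name, array in encoded.items() if name in onnx_input_names}

outputs = session.run(None, feeds)
# bge-m3's dense embedding is the [CLS] token, L2-normalized.
dense = outputs[0][:, 0]
dense = dense / np.linalg.norm(dense, axis=1, keepdims=True)
print(dense.shape)
```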
@CoolWP I am the maintainer of https://github.com/michaelfeil/infinity - bge-m3 is compatible, and infinity will accelerate your inference speed on GPU by around 2-3x using async tokenization, fp16, flash attention, torch nested tensors, and torch.compile.
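For readers who just want to call a running infinity server, here is a minimal sketch against its OpenAI-compatible /embeddings route. The port 7997 and the exact response layout are assumptions on my side; see the infinity README for the actual startup command and defaults.

```python
# Minimal sketch: query an infinity server that is already running with BAAI/bge-m3.
# Assumptions: the server listens on localhost:7997 (check your startup flags) and
# exposes an OpenAI-compatible /embeddings endpoint as described in the infinity README.
import requests

resp = requests.post(
    "http://localhost:7997/embeddings",
    json={"model": "BAAI/bge-m3", "input": ["What is BGE M3?", "Definition of BM25"]},
    timeout=30,
)
resp.raise_for_status()
data = resp.json()
# OpenAI-style response: one entry per input sentence under "data".
vectors = [item["embedding"] for item in data["data"]]
print(len(vectors), len(vectors[0]))
```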
@michaelfeil Hi, nice project! I have 2 questions:
- Will it accelerate CPU inference?
- On GPU, will it reduce VRAM usage, or are only performance optimizations supported?
I'm running low on VRAM.
It will roughly halve VRAM usage by using fp16 precision, and can dispatch e.g. memory-efficient attention. If you go for the full sequence length, I would suggest limiting the batch size in infinity to 8.
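A minimal sketch of that suggestion using infinity's Python API. The EngineArgs field names below reflect my reading of the infinity README and may differ between versions, so treat them as assumptions and check the current docs or `infinity_emb --help`.

```python
# Minimal sketch of the batch-size / fp16 suggestion above via infinity's Python API.
# Field names (model_name_or_path, dtype, batch_size, engine) are assumptions taken
# from my reading of the infinity README; verify against the version you install.
import asyncio
from infinity_emb import AsyncEmbeddingEngine, EngineArgs

engine = AsyncEmbeddingEngine.from_args(
    EngineArgs(
        model_name_or_path="BAAI/bge-m3",
        engine="torch",
        dtype="float16",   # roughly halves VRAM vs. float32
        batch_size=8,      # keep small when embedding full-length (8192-token) inputs
    )
)

async def main():
    async with engine:
        embeddings, usage = await engine.embed(
            sentences=["What is BGE M3?", "Definition of BM25"]
        )
        print(len(embeddings), usage)

asyncio.run(main())
```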
You can also run ONNX inference (there is no ONNX version of this model at this point in time), which will give you best-in-class acceleration on Intel / AMD CPUs.
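If you want to try the CPU/ONNX route yourself, here is a minimal sketch using optimum's onnxruntime integration to export the checkpoint on the fly and take the [CLS] token as the dense embedding, assuming the XLM-RoBERTa export path handles this model.

```python
# Minimal sketch: export BAAI/bge-m3 to ONNX with optimum and run it on CPU.
# export=True converts the checkpoint on the fly; for repeated use you would
# save the exported model once and reload it from disk.
import torch
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

model_id = "BAAI/bge-m3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)

inputs = tokenizer(["What is BGE M3?"], padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)
# Dense embedding = L2-normalized [CLS] token of the last hidden state.
dense = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], dim=-1)
print(dense.shape)
```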
@CoolWP Hi!
I'm trying infinity with BAAI/bge-m3, but I'm only getting the embedding results, and I suspect the rerank endpoint will not work for getting the scores... Is there any way to get the model's scores?
ex:
{
'colbert': [0.7796499729156494, 0.4621465802192688, 0.4523794651031494, 0.7898575067520142],
'sparse': [0.195556640625, 0.00879669189453125, 0.0, 0.1802978515625],
'dense': [0.6259765625, 0.347412109375, 0.349853515625, 0.67822265625],
'sparse+dense': [0.482503205537796, 0.23454029858112335, 0.2332356721162796, 0.5122477412223816],
'colbert+sparse+dense': [0.6013619303703308, 0.3255828022956848, 0.32089319825172424, 0.6232916116714478]
}
It would be very useful, because in my opinion this feature is the most relevant one for this great multilingual model; maybe it could be exposed through the rerank endpoint.
regards
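For context on the example scores above: they match the shape of the output of FlagEmbedding's BGEM3FlagModel.compute_score, which returns dense, sparse, colbert, and combined scores per sentence pair. Here is a minimal sketch of that path outside infinity, with the parameter names and mode weights taken from the bge-m3 model card.

```python
# Minimal sketch: multi-mode scores (dense / sparse / colbert / combined) for
# sentence pairs with FlagEmbedding, outside of infinity. Parameter names follow
# the BAAI/bge-m3 model card; the mode weights are the card's example values.
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)

sentence_pairs = [
    ["What is BGE M3?",
     "BGE M3 is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction."],
    ["What is BGE M3?",
     "BM25 is a bag-of-words retrieval function that ranks documents by the query terms appearing in each document."],
]

scores = model.compute_score(
    sentence_pairs,
    max_passage_length=128,
    weights_for_different_modes=[0.4, 0.2, 0.4],  # dense, sparse, colbert
)
# Returns a dict with 'colbert', 'sparse', 'dense', 'sparse+dense' and
# 'colbert+sparse+dense' lists, one score per pair, like the example above.
print(scores)
```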