How do you suggest using ColBERT vectors?
Because of the ColBERT concept, we get one vector per token in our sentences.
So if my question is n tokens long, I get vectors of shape (n+1, 1024).
And if I have M documents, I get M sets of vectors of shape (?, 1024), where ? depends on the number of tokens in each document.
Thinking quickly, I would have said: mean-pool my vectors so the question embedding has shape (1, 1024) and the documents form an (M, 1024) matrix, then do a cosine-similarity search or similar.
But I read on your GitHub: "Different from other embedding models using mean pooling, BGE uses the last hidden state of [cls] as the sentence embedding: sentence_embeddings = model_output[0][:, 0]. If you use mean pooling, there will be a significant decrease in performance. Therefore, make sure to use the correct method to obtain sentence vectors. You can refer to the usage method we provide."
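For reference, this is how I get the dense vector following that quote (a minimal sketch with HuggingFace transformers; the checkpoint name and input text are just examples):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Checkpoint name is illustrative; any BGE model with [cls] pooling works the same way.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-large-en-v1.5")

inputs = tokenizer(["my question"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    model_output = model(**inputs)

# Last hidden state of the [cls] token, as the README says -- not mean pooling.
sentence_embeddings = model_output[0][:, 0]
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, dim=-1)
```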
For the dense vector I understand, but for the ColBERT vectors I don't know how to get them. Could you explain?
Thanks,
The ColBERT score is computed differently from the dense score. You can refer to https://github.com/FlagOpen/FlagEmbedding/blob/master/FlagEmbedding/bge_m3.py#L90 for the method used to compute the ColBERT score.
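For illustration, here is a minimal standalone sketch of the ColBERT-style MaxSim scoring (simplified, not the exact library code; the linked file is authoritative and additionally handles batching and attention masks):

```python
import torch

def colbert_score(q_reps: torch.Tensor, p_reps: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score for one query/passage pair.

    q_reps: (num_query_tokens, dim) per-token query vectors
    p_reps: (num_passage_tokens, dim) per-token passage vectors
    """
    # Normalize so the dot product is cosine similarity.
    q = torch.nn.functional.normalize(q_reps, dim=-1)
    p = torch.nn.functional.normalize(p_reps, dim=-1)
    token_scores = q @ p.T  # (num_q, num_p) pairwise token similarities
    # For each query token, take its best-matching passage token...
    best_per_query_token = token_scores.max(dim=-1).values
    # ...then average over the query tokens.
    return best_per_query_token.mean()
```

So there is no pooling at all: you keep every token vector and rank each of your M documents with this score. You can also let the library handle both encoding and scoring, e.g. (a sketch assuming the BGEM3FlagModel API from this repo):

```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3")
q = model.encode(["my question"], return_colbert_vecs=True)
d = model.encode(["a document"], return_colbert_vecs=True)
print(model.colbert_score(q["colbert_vecs"][0], d["colbert_vecs"][0]))
```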