How do you suggest using ColBERT vectors?

by EquinoxElahin - opened

With ColBERT vectors, we get one vector for each token in a sentence.
So if my question has n tokens, I get vectors of shape (n+1, 1024).
And if I have m documents, I get m sets of vectors of shape (?, 1024), where ? depends on the number of tokens in each document.

My first thought was to mean-pool the token vectors, giving a question embedding of shape (1, 1024) and document embeddings of shape (m, 1024), and then run a cosine similarity search or something similar.

But I read on your GitHub: "Different from other embedding models using mean pooling, BGE uses the last hidden state of [cls] as the sentence embedding: sentence_embeddings = model_output[0][:, 0]. If you use mean pooling, there will be a significant decrease in performance. Therefore, make sure to use the correct method to obtain sentence vectors. You can refer to the usage method we provide."
For the dense vector I understand, but for the ColBERT vectors I don't know how to get one. Could you explain?


Beijing Academy of Artificial Intelligence org

The ColBERT score is computed differently from the dense score: the token vectors are not pooled into a single sentence vector. You can refer to for the method to compute the ColBERT score.
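
As a rough illustration of the difference, here is a minimal NumPy sketch of late-interaction (MaxSim) scoring in the style of ColBERT: each query token is matched to its most similar document token, and the per-token maxima are combined into one score per document. The function names `l2_normalize` and `maxsim_score`, the random stand-in vectors, and the choice to average (rather than sum) over query tokens are assumptions made for this example, not a copy of the library's code, so check the referenced method for the exact normalization.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def maxsim_score(query_vecs, doc_vecs):
    # query_vecs: (n_query_tokens, dim), doc_vecs: (n_doc_tokens, dim)
    q = l2_normalize(np.asarray(query_vecs, dtype=np.float64))
    d = l2_normalize(np.asarray(doc_vecs, dtype=np.float64))
    sim = q @ d.T                      # (n_query_tokens, n_doc_tokens)
    best_per_query_token = sim.max(axis=1)
    # Assumption: average over query tokens so scores stay in [-1, 1].
    return float(best_per_query_token.mean())

# Random stand-ins for real token vectors (hypothetical data).
rng = np.random.default_rng(0)
query = rng.normal(size=(5, 1024))
docs = [rng.normal(size=(n, 1024)) for n in (12, 30, 7)]
# A fourth document that contains the query's token vectors verbatim.
docs.append(np.vstack([query, rng.normal(size=(4, 1024))]))

scores = [maxsim_score(query, d) for d in docs]
best = int(np.argmax(scores))  # index of the highest-scoring document
```

Documents can have different token counts because the maximum is taken over the document-token axis, so no pooling of the (?, 1024) matrices is needed before scoring.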
