You can also use sentence-transformers and HuggingFace transformers to generate dense embeddings. Refer to baai_general_embedding for details.
The documentation also includes the following "Using HuggingFace Transformers" example:
Using HuggingFace Transformers
With the transformers package, you can use the model like this: First, you pass your input through the transformer model, then you select the last hidden state of the first token (i.e., [CLS]) as the sentence embedding.
from transformers import AutoTokenizer, AutoModel
import torch
# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For s2p (short query to long passage) retrieval tasks, add an instruction to each query (do not add an instruction to passages).
# `instruction` and `queries` are not defined in this snippet, so the line stays commented out:
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, cls pooling.
sentence_embeddings = model_output[0][:, 0]
# Normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
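The s2p note in the example leaves `instruction` and `queries` undefined, and the pooling step is easier to see in isolation. The following is a minimal sketch of both, not the library's own helper code. The function names are hypothetical, the instruction string is assumed to be the one published for the Chinese BGE v1.5 models (check the model card), and random tensors stand in for `model_output[0]` so nothing is downloaded.

```python
import torch

# Assumption: query instruction for the Chinese BGE v1.5 models; verify against the model card.
QUERY_INSTRUCTION = "为这个句子生成表示以用于检索相关文章："


def add_query_instruction(queries, instruction=QUERY_INSTRUCTION):
    """Prefix each short query with the retrieval instruction; passages are left unchanged."""
    return [instruction + q for q in queries]


def cls_pool_and_normalize(last_hidden_state):
    """Take the first-token ([CLS]) vector of each sequence and L2-normalize it."""
    cls_vectors = last_hidden_state[:, 0]  # shape (batch, hidden_size)
    return torch.nn.functional.normalize(cls_vectors, p=2, dim=1)


def score(query_embeddings, passage_embeddings):
    """Dot products of unit-length vectors are cosine similarities."""
    return query_embeddings @ passage_embeddings.T


# Stand-ins for model_output[0]: random hidden states of shape
# (batch, seq_len, hidden_size). A hidden size of 1024 is assumed here.
torch.manual_seed(0)
fake_query_states = torch.randn(2, 8, 1024)
fake_passage_states = torch.randn(3, 8, 1024)

q_emb = cls_pool_and_normalize(fake_query_states)
p_emb = cls_pool_and_normalize(fake_passage_states)
scores = score(q_emb, p_emb)  # one row per query, one column per passage

print(add_query_instruction(["样例查询"]))
print(scores.shape)
```

In real use, the list returned by `add_query_instruction` would be passed to the tokenizer in place of the raw queries, while passages are tokenized without the prefix. Because the embeddings are normalized, ranking passages by `scores` is equivalent to ranking them by cosine similarity.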