🏅 Quantized Embeddings are here! Unlike model quantization, embedding quantization is a post-processing step for embeddings that converts e.g. float32 embeddings to binary or int8 embeddings. This saves 32x or 4x memory & disk space, and these embeddings are much easier to compare!

Our results show 25-45x speedups in retrieval compared to full-size embeddings, while keeping 96% of the performance!
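As an illustration, here is a minimal sketch of that post-processing step using the quantize_embeddings helper from Sentence Transformers (the model name and texts are placeholder choices; see the blogpost below for the full walkthrough):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Placeholder model and texts, chosen only for illustration.
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
embeddings = model.encode([
    "Embedding quantization is a post-processing step.",
    "It trades a little accuracy for far cheaper storage and retrieval.",
])  # float32: 4 bytes per dimension

# float32 -> binary: 1 bit per dimension, i.e. 32x less memory & disk space.
binary_embeddings = quantize_embeddings(embeddings, precision="binary")

# float32 -> int8: 1 byte per dimension, i.e. 4x less memory & disk space.
# The value ranges are estimated from this tiny batch here; in practice you
# would calibrate them on a larger corpus.
int8_embeddings = quantize_embeddings(embeddings, precision="int8")

print(embeddings.dtype, embeddings.nbytes)                # float32
print(binary_embeddings.dtype, binary_embeddings.nbytes)  # int8, bits packed 8 per byte
print(int8_embeddings.dtype, int8_embeddings.nbytes)      # int8
```

Binary embeddings can then be compared with fast Hamming distance, which is where much of the retrieval speedup comes from.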
Learn more about it in our blogpost in collaboration with mixedbread.ai: https://huggingface.co/blog/embedding-quantization
Or try out our demo where we use quantized embeddings to let you search all of Wikipedia (yes, 41,000,000 texts) in 1 second on a CPU Space: sentence-transformers/quantized-retrieval