---
base_model: BAAI/bge-m3
tags:
- datadreamer
- datadreamer-0.46.0
- synthetic
- sentence-transformers
- feature-extraction
- sentence-similarity
library_name: sentence-transformers
pipeline_tag: sentence-similarity
---

Given a *document*, this retrieval embedding model helps retrieve *instruction templates* from [FineTemplates](https://huggingface.co/datasets/fineinstructions/finetemplates) relevant to various chunks / sections of a document or an entire document.

**Note:** This retrieval embedding is symmetric, so it can also be used in the other direction: given the [`compatible_document_description` of an instruction template](https://huggingface.co/datasets/fineinstructions/finetemplates), it can retrieve documents relevant to that template.
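
As a rough illustration of what "symmetric" means in practice, the sketch below uses made-up 3-dimensional vectors in place of real embeddings. The names `doc_vecs`, `template_vecs`, and `top_match` are hypothetical, and cosine similarity is an assumed scoring function rather than something this model card specifies. The point is only that one routine can serve both directions when both sides share an embedding space.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_match(query_vec, candidates):
    """Return the key of the candidate whose vector is most similar to query_vec."""
    return max(candidates, key=lambda name: cosine(query_vec, candidates[name]))


# Toy vectors standing in for real embeddings (illustrative values only).
doc_vecs = {
    "pigeon_facts": [0.9, 0.1, 0.0],
    "tax_guide": [0.0, 0.2, 0.9],
}
template_vecs = {
    "animal_trivia_template": [0.8, 0.2, 0.1],
    "finance_howto_template": [0.1, 0.1, 0.9],
}

# Document -> template
print(top_match(doc_vecs["pigeon_facts"], template_vecs))
# Template description -> document
print(top_match(template_vecs["finance_howto_template"], doc_vecs))
```

With real embeddings, the same lookup would run over vectors produced by this model for documents and for template descriptions.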

## Requirements
```
datasets
faiss
huggingface_hub
numpy
pandas
sentence_transformers
```
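
Before running the usage example, it may help to confirm that these modules are importable in the current environment. The snippet below is a minimal sketch using only the standard library; `REQUIRED_MODULES` and `missing_modules` are hypothetical names introduced here. Note that it checks import names, and the corresponding distribution names on PyPI can differ (for example, FAISS is commonly installed as `faiss-cpu`).

```python
import importlib.util

# Import names taken from the requirements list above.
REQUIRED_MODULES = [
    "datasets",
    "faiss",
    "huggingface_hub",
    "numpy",
    "pandas",
    "sentence_transformers",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All requirements are importable.")
```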

## Simple Usage Example

```python
import importlib.util
import json
from huggingface_hub import hf_hub_download


def download_and_import_module(module_name, variable):
    # Download `<module_name>.py` from the model repo, load it as a module,
    # and return the requested attribute from it.
    module = importlib.util.module_from_spec(
        importlib.util.spec_from_file_location(
            module_name,
            hf_hub_download(
                repo_id="fineinstructions/instruction_template_retrieval_embedding",
                filename=f"{module_name}.py",
            ),
        )
    )
    module.__spec__.loader.exec_module(module)
    return getattr(module, variable)


# Import the retriever helper class
InstructionTemplateRetriever = download_and_import_module("instruction_template_retriever", "InstructionTemplateRetriever")

# Prepare an example document
EXAMPLE_DOC = """
Title: Surprising Facts about Pigeons
Submitted On: September 24, 2008

Fact 1:
During World War I, a homing pigeon named Cher Ami played a critical role in saving nearly 200 soldiers who were trapped behind enemy lines.
Despite being injured by enemy fire, Cher Ami managed to deliver a crucial message that led to their rescue. For this act of bravery, the
French government awarded the pigeon the Croix de Guerre, a military medal of honor. Cher Ami became a symbol of courage and the extraordinary
utility of pigeons in wartime communication.

Fact 2:
Pigeons possess impressive cognitive abilities, one of the most surprising being their capacity for self-recognition in mirrors. This
trait is rare in the animal kingdom and is often considered a marker of higher intelligence. Experiments have shown that pigeons can distinguish
themselves from other birds when looking into a mirror, suggesting a level of self-awareness previously thought to be unique to primates and a
few other animals.

Fact 3:
Thanks to centuries of selective breeding, there are now more than 300 recognized breeds of domestic pigeon. These range from show pigeons with
elaborate feather patterns and head crests to performance breeds used in tumbling and racing. The sheer variety reflects the bird’s long history
as a companion species to humans.

Fact 4:
The Ancient Romans were known for their elaborate grooming rituals, and pigeons played an unexpected role in their beauty routines. Specifically,
they used pigeon droppings as a bleaching agent to style and lighten their hair. This unusual practice was part of the broader Roman obsession with
fashion and appearance, demonstrating how even the most unexpected materials found a place in early cosmetic treatments.
"""


# Retrieve instruction templates relevant to different chunks / sections of a document
retriever = InstructionTemplateRetriever(
    coverage_chunks=4, sigma=0.05, alpha=1.0  # Ensure instruction templates cover information in the document with 4 chunks/sections
)
print(json.dumps(retriever.search(document=EXAMPLE_DOC), indent=4))

# ******************************************************
# Retrieval results look like:
# ******************************************************

# Instruction Templates for Entire Document:
# - "What's something <fi>a few word description of something remarkable or noteworthy</fi> you can tell me"

# Instruction Templates for Chunk 1/4 of the Document:
# - "write a <fi>a few word description of the type of message</fi> for <fi>a significant achievement or milestone</fi>"

# Instruction Templates for Chunk 2/4 of the Document:
# - "how are <fi>a type of organism or entity</fi> so <fi>exceptionally strong or notable in some way</fi>?"

# Instruction Templates for Chunk 3/4 of the Document:
# - "what are the common <fi>a type of organism, creature, or entity</fi>?"

# Instruction Templates for Chunk 4/4 of the Document:
# - "how did <fi>a group of people</fi> <fi>perform a common practice or activity</fi>"

# ******************************************************
# Increasing diversity:
# -----------------------
# Use the `reweight` parameter to increase diversity
# in instruction length, like so:
# `print(json.dumps(retriever.search(document=EXAMPLE_DOC, reweight=True), indent=4))`
# ******************************************************

# ******************************************************
# Documentation:
# -----------------------
# You can read the full documentation of the `InstructionTemplateRetriever.search` method
# by opening the instruction_template_retriever.py file here:
# https://huggingface.co/fineinstructions/instruction_template_retrieval_embedding/tree/main
# ******************************************************
```
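
The retrieved templates shown above mark their fillable slots with `<fi>...</fi>` tags. As one possible way to work with them afterwards, the sketch below lists the slot descriptions in a template and fills the slots in order. The helpers `placeholder_descriptions` and `fill_template` are hypothetical names introduced for illustration and are not part of the retriever's API.

```python
import re

# Matches one <fi>...</fi> slot and captures the description inside it.
FI_PATTERN = re.compile(r"<fi>(.*?)</fi>")


def placeholder_descriptions(template):
    """Return the description text of each <fi>...</fi> slot, in order."""
    return FI_PATTERN.findall(template)


def fill_template(template, values):
    """Replace each <fi>...</fi> slot, in order, with the matching value.

    `values` must supply at least one entry per slot.
    """
    remaining = iter(values)
    return FI_PATTERN.sub(lambda match: next(remaining), template)


template = "how did <fi>a group of people</fi> <fi>perform a common practice or activity</fi>"
print(placeholder_descriptions(template))
print(fill_template(template, ["the Ancient Romans", "lighten their hair"]))
```

Here the values are written by hand from Fact 4 of the example document; choosing them automatically is outside the scope of this sketch.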

---

This model was trained on a synthetic dataset with [DataDreamer 🤖💤](https://datadreamer.dev). The synthetic dataset card and model card can be found [here](datadreamer.json). The training arguments can be found [here](training_args.json).