Given a document, this retrieval embedding model retrieves instruction templates from FineTemplates that are relevant to individual chunks / sections of the document or to the document as a whole.

Note: This retrieval embedding is symmetric, so it can also be used in the other direction: to retrieve documents relevant to the compatible_document_description of an instruction template.
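
To illustrate the note above, here is a minimal sketch of two-way retrieval. It assumes that "symmetric" means documents and template descriptions are embedded into the same vector space and compared with the same similarity function. The vectors and names below are hypothetical toy values, not real model outputs, and cosine similarity is an assumed choice of metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_match(query_vec, candidates):
    """Return the name of the candidate whose vector is most similar to query_vec."""
    return max(candidates, key=lambda name: cosine_similarity(query_vec, candidates[name]))


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
document_vecs = {
    "pigeon_facts": [0.9, 0.1, 0.0],
    "tax_guide": [0.0, 0.2, 0.9],
}
template_vecs = {
    "animal_question_template": [0.8, 0.2, 0.1],
    "finance_question_template": [0.1, 0.1, 0.9],
}

# Direction 1: document -> most similar template.
best_template = top_match(document_vecs["pigeon_facts"], template_vecs)

# Direction 2: template -> most similar document, using the same function.
best_document = top_match(template_vecs["animal_question_template"], document_vecs)

print(best_template, best_document)
```

Because the same similarity function serves both directions, no separate query encoder or document encoder is needed in this sketch.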

Requirements

datasets
faiss
huggingface_hub
numpy
pandas
sentence_transformers
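
Before running the example, it can help to confirm that the requirements are importable. The sketch below assumes the names listed above are the import names; the corresponding package names on your installer may differ. `find_missing_modules` is a hypothetical helper, not part of this repository.

```python
import importlib.util

# Import names taken from the requirements list above (assumed to be import names).
REQUIRED_MODULES = [
    "datasets",
    "faiss",
    "huggingface_hub",
    "numpy",
    "pandas",
    "sentence_transformers",
]


def find_missing_modules(module_names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```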

Simple Usage Example

import importlib.util  # `import importlib` alone does not guarantee the `util` submodule is loaded
import json
from huggingface_hub import hf_hub_download


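def download_and_import_module(module_name, variable):
    """Download `<module_name>.py` from the model repo and return `variable` from it."""
    # Fetch the helper file from the Hugging Face Hub
    module_path = hf_hub_download(
        repo_id="fineinstructions/instruction_template_retrieval_embedding",
        filename=f"{module_name}.py",
    )
    # Load the downloaded file as a module and execute it
    spec = importlib.util.spec_from_file_location(module_name, module_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, variable)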


# Import the retriever helper class
InstructionTemplateRetriever = download_and_import_module("instruction_template_retriever", "InstructionTemplateRetriever")

# Prepare an example document
EXAMPLE_DOC = """
Title: Surprising Facts about Pigeons
Submitted On: September 24, 2008

Fact 1:
During World War I, a homing pigeon named Cher Ami played a critical role in saving nearly 200 soldiers who were trapped behind enemy lines.
Despite being injured by enemy fire, Cher Ami managed to deliver a crucial message that led to their rescue. For this act of bravery, the 
French government awarded the pigeon the Croix de Guerre, a military medal of honor. Cher Ami became a symbol of courage and the extraordinary
utility of pigeons in wartime communication.

Fact 2:
Pigeons possess impressive cognitive abilities, one of the most surprising being their capacity for self-recognition in mirrors. This
trait is rare in the animal kingdom and is often considered a marker of higher intelligence. Experiments have shown that pigeons can distinguish
themselves from other birds when looking into a mirror, suggesting a level of self-awareness previously thought to be unique to primates and a
few other animals.

Fact 3:
Thanks to centuries of selective breeding, there are now more than 300 recognized breeds of domestic pigeon. These range from show pigeons with
elaborate feather patterns and head crests to performance breeds used in tumbling and racing. The sheer variety reflects the bird’s long history
as a companion species to humans.

Fact 4:
The Ancient Romans were known for their elaborate grooming rituals, and pigeons played an unexpected role in their beauty routines. Specifically,
they used pigeon droppings as a bleaching agent to style and lighten their hair. This unusual practice was part of the broader Roman obsession with
fashion and appearance, demonstrating how even the most unexpected materials found a place in early cosmetic treatments.
"""


# Retrieve instruction templates relevant to different chunks / sections of the document
retriever = InstructionTemplateRetriever(
    # Use 4 chunks/sections so the retrieved templates cover the information across the document
    coverage_chunks=4, sigma=0.05, alpha=1.0
)
print(json.dumps(retriever.search(document=EXAMPLE_DOC), indent=4))

# ******************************************************
# Retrieval results look like:
# ******************************************************

# Instruction Templates for Entire Document:
#    - "What's something <fi>a few word description of something remarkable or noteworthy</fi> you can tell me"

# Instruction Templates for Chunk 1/4 of the Document:
#    - "write a <fi>a few word description of the type of message</fi> for <fi>a significant achievement or milestone</fi>"

# Instruction Templates for Chunk 2/4 of the Document:
#    - "how are <fi>a type of organism or entity</fi> so <fi>exceptionally strong or notable in some way</fi>?"

# Instruction Templates for Chunk 3/4 of the Document:
#    - "what are the common <fi>a type of organism, creature, or entity</fi>?"

# Instruction Templates for Chunk 4/4 of the Document:
#    - "how did <fi>a group of people</fi> <fi>perform a common practice or activity</fi>"

# ******************************************************
# Increasing diversity:
# -----------------------
# Pass `reweight=True` to increase diversity
# in the length of the retrieved instructions, like so:
# `print(json.dumps(retriever.search(document=EXAMPLE_DOC, reweight=True), indent=4))`
# ******************************************************

# ******************************************************
# Documentation:
# -----------------------
# You can read the full documentation of the `InstructionTemplateRetriever.search` method
# by opening the instruction_template_retriever.py file here:
# https://huggingface.co/fineinstructions/instruction_template_retrieval_embedding/tree/main
# ******************************************************

This model was trained on a synthetic dataset using DataDreamer 🤖💤. The synthetic dataset card and model card can be found here. The training arguments can be found here.

Model size: 568M params (Safetensors, tensor type BF16)

Base model: BAAI/bge-m3 (this model is a fine-tune)