---
base_model: BAAI/bge-m3
tags:
- datadreamer
- datadreamer-0.46.0
- synthetic
- sentence-transformers
- feature-extraction
- sentence-similarity
library_name: sentence-transformers
pipeline_tag: sentence-similarity
---
Given a *document*, this retrieval embedding model helps retrieve *instruction templates* from [FineTemplates](https://huggingface.co/datasets/fineinstructions/finetemplates) relevant to various chunks / sections of a document or an entire document.
**Note:** This retrieval embedding is symmetric, so it can also be used in the other direction: to retrieve documents relevant to the [`compatible_document_description` of an instruction template](https://huggingface.co/datasets/fineinstructions/finetemplates).
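Because both sides are embedded into the same space and scored with the same similarity function, one set of vectors can serve both retrieval directions. The sketch below illustrates the idea with hand-made toy vectors and plain-Python cosine similarity. The vectors, the labels, and the `cosine` and `rank` helpers are illustrative assumptions and not part of this repository; real embeddings would come from this model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank(query, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    return sorted(
        candidates,
        key=lambda name: cosine(query, candidates[name]),
        reverse=True,
    )


# Toy 3-d vectors standing in for real embeddings (hypothetical values).
documents = {
    "pigeon_facts": [0.9, 0.1, 0.0],
    "tax_guide": [0.0, 0.2, 0.9],
}
template_descriptions = {
    "animal_trivia": [0.8, 0.3, 0.1],
    "finance_howto": [0.1, 0.1, 0.9],
}

# Direction 1: document -> template descriptions.
templates_for_doc = rank(documents["pigeon_facts"], template_descriptions)
# Direction 2: template description -> documents.
docs_for_template = rank(template_descriptions["finance_howto"], documents)

print(templates_for_doc)
print(docs_for_template)
```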
## Requirements
```
datasets
faiss
huggingface_hub
numpy
pandas
sentence_transformers
```
## Simple Usage Example
```python
import importlib.util
import json

from huggingface_hub import hf_hub_download


def download_and_import_module(module_name, variable):
    # Download {module_name}.py from the model repo, load it as a module,
    # and return the requested attribute from it
    module = importlib.util.module_from_spec(
        importlib.util.spec_from_file_location(
            module_name,
            hf_hub_download(
                repo_id="fineinstructions/instruction_template_retrieval_embedding",
                filename=f"{module_name}.py",
            ),
        )
    )
    module.__spec__.loader.exec_module(module)
    return getattr(module, variable)

# Import the retriever helper class
InstructionTemplateRetriever = download_and_import_module("instruction_template_retriever", "InstructionTemplateRetriever")
# Prepare an example document
EXAMPLE_DOC = """
Title: Surprising Facts about Pigeons
Submitted On: September 24, 2008
Fact 1:
During World War I, a homing pigeon named Cher Ami played a critical role in saving nearly 200 soldiers who were trapped behind enemy lines.
Despite being injured by enemy fire, Cher Ami managed to deliver a crucial message that led to their rescue. For this act of bravery, the
French government awarded the pigeon the Croix de Guerre, a military medal of honor. Cher Ami became a symbol of courage and the extraordinary
utility of pigeons in wartime communication.
Fact 2:
Pigeons possess impressive cognitive abilities, one of the most surprising being their capacity for self-recognition in mirrors. This
trait is rare in the animal kingdom and is often considered a marker of higher intelligence. Experiments have shown that pigeons can distinguish
themselves from other birds when looking into a mirror, suggesting a level of self-awareness previously thought to be unique to primates and a
few other animals.
Fact 3:
Thanks to centuries of selective breeding, there are now more than 300 recognized breeds of domestic pigeon. These range from show pigeons with
elaborate feather patterns and head crests to performance breeds used in tumbling and racing. The sheer variety reflects the bird’s long history
as a companion species to humans.
Fact 4:
The Ancient Romans were known for their elaborate grooming rituals, and pigeons played an unexpected role in their beauty routines. Specifically,
they used pigeon droppings as a bleaching agent to style and lighten their hair. This unusual practice was part of the broader Roman obsession with
fashion and appearance, demonstrating how even the most unexpected materials found a place in early cosmetic treatments.
"""
# Retrieve instruction templates relevant to different chunks / sections of a document
retriever = InstructionTemplateRetriever(
    coverage_chunks=4, sigma=0.05, alpha=1.0  # Ensure instruction templates cover information in the document with 4 chunks/sections
)
print(json.dumps(retriever.search(document=EXAMPLE_DOC), indent=4))
# ******************************************************
# Retrieval results look like:
# ******************************************************
# Instruction Templates for Entire Document:
# - "What's something <fi>a few word description of something remarkable or noteworthy</fi> you can tell me"
# Instruction Templates for Chunk 1/4 of the Document:
# - "write a <fi>a few word description of the type of message</fi> for <fi>a significant achievement or milestone</fi>"
# Instruction Templates for Chunk 2/4 of the Document:
# - "how are <fi>a type of organism or entity</fi> so <fi>exceptionally strong or notable in some way</fi>?"
# Instruction Templates for Chunk 3/4 of the Document:
# - "what are the common <fi>a type of organism, creature, or entity</fi>?"
# Instruction Templates for Chunk 4/4 of the Document:
# - "how did <fi>a group of people</fi> <fi>perform a common practice or activity</fi>"
# ******************************************************
# Increasing diversity:
# -----------------------
# Pass `reweight=True` to increase the diversity of
# instruction lengths among the retrieved templates, like so:
# `print(json.dumps(retriever.search(document=EXAMPLE_DOC, reweight=True), indent=4))`
# ******************************************************
# ******************************************************
# Documentation:
# -----------------------
# You can read the full documentation of the `InstructionTemplateRetriever.search` method
# by opening the instruction_template_retriever.py file here:
# https://huggingface.co/fineinstructions/instruction_template_retrieval_embedding/tree/main
# ******************************************************
```
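The `Chunk i/4` labels in the sample output refer to sections of the document. As a rough mental model only, the sketch below splits a text into `n` contiguous sections of near-equal word count. This is an assumption made for illustration: `split_into_sections` is a hypothetical helper, and the retriever's own chunking (which also takes `sigma` and `alpha`) is defined in `instruction_template_retriever.py` and may work quite differently.

```python
def split_into_sections(text, n):
    """Split text into n contiguous sections of near-equal word count."""
    words = text.split()
    base, extra = divmod(len(words), n)
    sections, start = [], 0
    for i in range(n):
        # The first `extra` sections take one additional word each.
        end = start + base + (1 if i < extra else 0)
        sections.append(" ".join(words[start:end]))
        start = end
    return sections


# Hypothetical toy input, not the EXAMPLE_DOC from the usage example.
sections = split_into_sections("one two three four five six seven eight nine ten", 4)
for i, section in enumerate(sections, start=1):
    print(f"Chunk {i}/{len(sections)}: {section}")
```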
---
This model was trained on a synthetic dataset using [DataDreamer 🤖💤](https://datadreamer.dev). The synthetic dataset card and model card can be found [here](datadreamer.json). The training arguments can be found [here](training_args.json).