---
base_model: BAAI/bge-m3
tags:
- datadreamer
- datadreamer-0.46.0
- synthetic
- sentence-transformers
- feature-extraction
- sentence-similarity
library_name: sentence-transformers
pipeline_tag: sentence-similarity
---

Given a *document*, this retrieval embedding model retrieves *instruction templates* from [FineTemplates](https://huggingface.co/datasets/fineinstructions/finetemplates) that are relevant to individual chunks / sections of the document or to the document as a whole.

**Note:** This retrieval embedding is symmetric, so it can also be used to retrieve documents relevant to the [`compatible_document_description` of an instruction template](https://huggingface.co/datasets/fineinstructions/finetemplates).
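
To illustrate what symmetric retrieval means in practice, here is a minimal sketch using NumPy only. The vectors are made-up stand-ins for embeddings of documents and template descriptions (they are not produced by this model), and `cosine_similarity_matrix` is a hypothetical helper written for this example. The point is that one similarity matrix serves both directions: rows rank descriptions for each document, and its transpose ranks documents for each description.

```python
import numpy as np


def cosine_similarity_matrix(a, b):
    """Return the pairwise cosine similarity between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


# Toy stand-ins for embeddings (assumption: real embeddings would come from the model)
doc_embeddings = np.array([[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]])
description_embeddings = np.array(
    [[0.9, 0.2, 0.0], [0.1, 0.8, 0.3], [0.0, 0.1, 1.0]]
)

# One matrix of shape (num_documents, num_descriptions)
scores = cosine_similarity_matrix(doc_embeddings, description_embeddings)

# Document -> most similar template description
best_description_per_doc = scores.argmax(axis=1)

# Template description -> most similar document (same scores, transposed)
best_doc_per_description = scores.T.argmax(axis=1)

print(best_description_per_doc.tolist())
print(best_doc_per_description.tolist())
```

Because both sides are embedded into the same space, no separate query or passage encoder is needed for the reverse direction in this sketch.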

## Requirements
```
datasets
faiss
huggingface_hub
numpy
pandas
sentence_transformers
```

## Simple Usage Example

```python
import importlib.util  # `import importlib` alone does not guarantee the `util` submodule is loaded
import json
from huggingface_hub import hf_hub_download


def download_and_import_module(module_name, variable):
    module = importlib.util.module_from_spec(
        importlib.util.spec_from_file_location(
            module_name,
            hf_hub_download(
                repo_id="fineinstructions/instruction_template_retrieval_embedding",
                filename=f"{module_name}.py",
            ),
        )
    )
    module.__spec__.loader.exec_module(module)
    return getattr(module, variable)


# Import the retriever helper class
InstructionTemplateRetriever = download_and_import_module("instruction_template_retriever", "InstructionTemplateRetriever")

# Prepare an example document
EXAMPLE_DOC = """
Title: Surprising Facts about Pigeons
Submitted On: September 24, 2008

Fact 1:
During World War I, a homing pigeon named Cher Ami played a critical role in saving nearly 200 soldiers who were trapped behind enemy lines.
Despite being injured by enemy fire, Cher Ami managed to deliver a crucial message that led to their rescue. For this act of bravery, the 
French government awarded the pigeon the Croix de Guerre, a military medal of honor. Cher Ami became a symbol of courage and the extraordinary
utility of pigeons in wartime communication.

Fact 2:
Pigeons possess impressive cognitive abilities, one of the most surprising being their capacity for self-recognition in mirrors. This
trait is rare in the animal kingdom and is often considered a marker of higher intelligence. Experiments have shown that pigeons can distinguish
themselves from other birds when looking into a mirror, suggesting a level of self-awareness previously thought to be unique to primates and a
few other animals.

Fact 3:
Thanks to centuries of selective breeding, there are now more than 300 recognized breeds of domestic pigeon. These range from show pigeons with
elaborate feather patterns and head crests to performance breeds used in tumbling and racing. The sheer variety reflects the bird’s long history
as a companion species to humans.

Fact 4:
The Ancient Romans were known for their elaborate grooming rituals, and pigeons played an unexpected role in their beauty routines. Specifically,
they used pigeon droppings as a bleaching agent to style and lighten their hair. This unusual practice was part of the broader Roman obsession with
fashion and appearance, demonstrating how even the most unexpected materials found a place in early cosmetic treatments.
"""


# Retrieve instruction templates relevant to different chunks / sections of a document
retriever = InstructionTemplateRetriever(
    coverage_chunks=4, sigma=0.05, alpha=1.0    # Ensure instruction templates cover information in the document with 4 chunks/sections
)
print(json.dumps(retriever.search(document=EXAMPLE_DOC), indent=4))

# ******************************************************
# Retrieval results look like:
# ******************************************************

# Instruction Templates for Entire Document:
#    - "What's something <fi>a few word description of something remarkable or noteworthy</fi> you can tell me"

# Instruction Templates for Chunk 1/4 of the Document:
#    - "write a <fi>a few word description of the type of message</fi> for <fi>a significant achievement or milestone</fi>"

# Instruction Templates for Chunk 2/4 of the Document:
#    - "how are <fi>a type of organism or entity</fi> so <fi>exceptionally strong or notable in some way</fi>?"

# Instruction Templates for Chunk 3/4 of the Document:
#    - "what are the common <fi>a type of organism, creature, or entity</fi>?"

# Instruction Templates for Chunk 4/4 of the Document:
#    - "how did <fi>a group of people</fi> <fi>perform a common practice or activity</fi>"

# ******************************************************
# Increasing diversity:
# -----------------------
# You can use the `reweight` parameter to increase
# diversity in instruction length like so:
# `print(json.dumps(retriever.search(document=EXAMPLE_DOC, reweight=True), indent=4))`
# ******************************************************

# ******************************************************
# Documentation:
# -----------------------
# You can read the full documentation of the `InstructionTemplateRetriever.search` method
# by opening the instruction_template_retriever.py file here:
# https://huggingface.co/fineinstructions/instruction_template_retrieval_embedding/tree/main
# ******************************************************
```

---
This model was trained on a synthetic dataset using [DataDreamer 🤖💤](https://datadreamer.dev). The synthetic dataset card and model card can be found [here](datadreamer.json). The training arguments can be found [here](training_args.json).