---
datasets:
- aorogat/QueryBridge
- aorogat/Questions_to_Tagged_Questions_Prompts
license: apache-2.0
base_model:
- meta-llama/Meta-Llama-3-8B
pipeline_tag: token-classification
tags:
- Question Answering
- Knowledge Graphs
- DBPedia
- torchtune
---

# Model Overview

This model is a fine-tuned version of `Meta-Llama-3-8B`, trained on the [QueryBridge dataset](https://huggingface.co/datasets/aorogat/QueryBridge). We used **Low-Rank Adaptation (LoRA)** to train it to tag the components of a question with the tags listed in the table below. The demo video linked below shows what a tagged question looks like and how we visualize it after converting it to a graph representation.

The tagged questions in the QueryBridge dataset are designed to train language models to understand the components and structure of a question effectively. By annotating questions with specific tags such as `<qt>`, `<p>`, `<o>`, and `<s>`, we provide a detailed breakdown of each question's elements, which aids the model in grasping the roles of different components.

<a href="https://youtu.be/J_N-6m8fHz0">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/664adb4a691370727c200af0/sDfp7DiYrGKvH58KdXOIY.png" alt="Training Model with Tagged Questions" width="400" height="300" />
</a>

# Tags Used in Tagged Questions

| Tag   | Description |
|-------|-------------|
| `<qt>` | **Question Type**: Tags the keywords or phrases that denote the type of question being asked, such as 'What', 'Who', 'How many', etc. This tag helps determine the type of SPARQL query to generate. Example: In "What is the capital of Canada?", the tag `<qt>What</qt>` indicates that the question is asking for an entity retrieval. |
| `<o>`  | **Object Entities**: Tags entities that are objects in the question. These are usually noun phrases referring to the entities being described or queried. Example: In "What is the capital of Canada?", the term 'Canada' is tagged as `<o>Canada</o>`. |
| `<s>`  | **Subject Entities**: Tags entities that are subjects in Yes-No questions. This tag is used exclusively for questions that can be answered with 'Yes' or 'No'. Example: In "Is Ottawa the capital of Canada?", the entity 'Ottawa' is tagged as `<s>Ottawa</s>`. |
| `<p>`  | **Predicates**: Tags predicates that represent relationships or attributes in the knowledge graph. Predicates can be verb phrases or noun phrases that describe how entities are related. Example: In "What is the capital of Canada?", the phrase 'is the capital of' is tagged as `<p>is the capital of</p>`. |
| `<cc>` | **Coordinating Conjunctions**: Tags conjunctions that connect multiple predicates or entities in complex queries. These include words like 'and', 'or', and 'nor'. They influence how the SPARQL query combines conditions. Example: In "Who is the CEO and founder of Apple Inc?", the conjunction 'and' is tagged as `<cc>and</cc>`. |
| `<off>`| **Offsets**: Tags specific terms that indicate position or order in a sequence, such as 'first', 'second', etc. These are used in questions asking for ordinal positions. Example: In "What is the second largest country?", the word 'second' is tagged as `<off>second</off>`. |
| `<t>`  | **Entity Types**: Tags that describe the type or category of the entities involved in the question. This can include types like 'person', 'place', 'organization', etc. Example: In "Which film directed by Garry Marshall?", the type 'film' might be tagged as `<t>film</t>`. |
| `<op>` | **Operators**: Tags operators used in questions that involve comparisons or calculations, such as 'greater than', 'less than', 'more than'. Example: In "Which country has a population greater than 50 million?", the operator 'greater than' is tagged as `<op>greater than</op>`. |
| `<ref>`| **References**: Tags in questions that refer back to previously mentioned entities or concepts. These can indicate cycles or self-references in queries. Example: In "Who is the CEO of the company founded by himself?", the word 'himself' is tagged as `<ref>himself</ref>`. |
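
To make the tag scheme concrete, here is a minimal sketch of how a tagged question could be split back into `(tag, text)` pairs. The helper names `parse_tagged_question` and `strip_tags` are hypothetical, and the example string is assembled from the individual examples in the table rather than taken from real model output. The sketch assumes tags are not nested.

```python
import re

# Tag names taken from the table above.
TAGS = ("qt", "o", "s", "p", "cc", "off", "t", "op", "ref")

# Matches <tag>text</tag>; the backreference \1 forces the closing tag to match.
TAG_PATTERN = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(TAGS), re.S)


def parse_tagged_question(tagged):
    """Return (tag, text) pairs in the order they appear (hypothetical helper)."""
    return [(m.group(1), m.group(2).strip()) for m in TAG_PATTERN.finditer(tagged)]


def strip_tags(tagged):
    """Remove the tags and return the plain question text (hypothetical helper)."""
    return TAG_PATTERN.sub(lambda m: m.group(2), tagged)


# Illustrative string built from the table's examples, not real model output.
example = "<qt>What</qt> <p>is the capital of</p> <o>Canada</o>?"
print(parse_tagged_question(example))
print(strip_tags(example))
```

Stripping the tags and comparing the result with the original question is a cheap sanity check that the model only added tags and did not alter the wording.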



# How to use the model?
There are two main steps:

## 1- Download the model from Hugging Face
You can run the model with TorchTune commands. I have provided the necessary Python code below to automate the process. Follow these steps to get started:
- Download the fine-tuned version, including the `meta_model_0.pt` file and the tokenizer (see the `Files and versions` tab on this page).
- Save the model file in the following directory: `/home/USERNAME/Meta-Llama-3-8B/`

## 2- Using the model

<details>
  <summary>Steps</summary>

- **Note:** Replace each `USERNAME` with your username.

### Step 1: Create a Configuration File
First, save a file named `custom_generation_config_bigModel.yaml` in `/home/USERNAME/` with the following content:

```yaml
# Config for running the InferenceRecipe in generate.py to generate output from an LLM

# Model arguments
model:
  _component_: torchtune.models.llama3.llama3_8b

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /home/USERNAME/Meta-Llama-3-8B/
  checkpoint_files: [
    meta_model_0.pt
  ]
  output_dir: /home/USERNAME/Meta-Llama-3-8B/
  model_type: LLAMA3

device: cuda
dtype: bf16

seed: 1234

# Tokenizer arguments
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /home/USERNAME/Meta-Llama-3-8B/original/tokenizer.model

# Generation arguments; defaults taken from gpt-fast
prompt: "### Instruction: \nYou are a powerful model trained to convert questions to tagged questions. Use the tags as follows: \n<qt> to surround question keywords like 'What', 'Who', 'Which', 'How many', 'Return' or any word that represents requests. \n<o> to surround entities as an object like person name, place name, etc. It must be a noun or a noun phrase. \n<s> to surround entities as a subject like person name, place name, etc. The difference between <s> and <o>, <s> only appear in yes/no questions as in the training data you saw before. \n<cc> to surround coordinating conjunctions that connect two or more phrases like 'and', 'or', 'nor', etc. \n<p> to surround predicates that may be an entity attribute or a relationship between two entities. It can be a verb phrase or a noun phrase. The question must contain at least one predicate. \n<off> for offset in questions asking for the second, third, etc. For example, the question 'What is the second largest country?', <off> will be located as follows. 'What is the <off>second</off> largest country?' \n<t> to surround entity types like person, place, etc. \n<op> to surround operators that compare quantities or values, like 'greater than', 'more than', etc. \n<ref> to indicate a reference within the question that requires a cycle to refer back to an entity (e.g., 'Who is the CEO of a company founded by himself?' where 'himself' would be tagged as <ref>himself</ref>). \nInput: Which films directed by a dirctor died in 2014 and starring both Julia Roberts and Richard Gere?\nResponse:"
max_new_tokens: 100
temperature: 0.6
top_k: 1

quantizer: null
```

### Step 2: Set Up the Environment
Create and activate a virtual environment:

```bash
python3 -m venv /home/USERNAME/myenv
source /home/USERNAME/myenv/bin/activate
```

Install TorchTune with:
```bash
pip install torchtune
```

### Step 3: Create the Python File
Next, create a Python file called `command.py` with the following content:

```python
import subprocess
import os
import re
import shlex  # For safely handling command line arguments

def _create_config_file(question):
    # Path to the template and output config file
    template_path = "/home/USERNAME/custom_generation_config_bigModel.yaml"
    output_path = "/tmp/dynamic_generation.yaml"
    
    # Load the template from the file
    with open(template_path, 'r') as file:
        config_template = file.read()

    # Replace the placeholder in the template with the actual question
    updated_prompt = config_template.replace("Input: Which films directed by a dirctor died in 2014 and starring both Julia Roberts and Richard Gere?", f"Input: {question}")
    maxLen = int(1.3*len(question))
    print(f"maxLen: {maxLen}")
    updated_prompt = updated_prompt.replace("max_new_tokens: 100", f"max_new_tokens: {maxLen}")

    # Write the updated configuration to a new file
    with open(output_path, 'w') as file:
        file.write(updated_prompt)
    
    print(f"Configuration file created at: {output_path}")

def get_tagged_question(question):
    # Define the path to the virtual environment's activation script
    activate_env = "/home/USERNAME/myenv/bin/activate"

    # Create configuration file with the question
    _create_config_file(question)

    print('get_tagged_question')
    
    # Command to run within the virtual environment
    command = f"tune run generate --config /tmp/dynamic_generation.yaml"
    
    # Full command to activate the environment and run your command
    full_command = f"source {activate_env} && {command}"
    
    # Run the full command in a shell
    response = None  # Returned as-is if the command fails or no response is found
    try:
        result = subprocess.run(full_command, shell=True, check=True, text=True, capture_output=True, executable="/bin/bash")
        print("Command output:", result.stdout)
        print("Command error output:", result.stderr)

        # The recipe may log to either stream, so search both
        output = result.stdout + result.stderr
        # Extract the echoed input and the generated response
        input_match = re.search(r'Input: (.*?)(?=Response:)', output, re.S)
        response_match = re.search(r'Response: (.*)', output)

        # Only call .group() after confirming that both patterns matched
        if input_match and response_match:
            response = response_match.group(1).strip()
            print("Input Question: ", question)
            print("Extracted Response: ", response)
        else:
            print("Input or Response not found in the output.")

    except subprocess.CalledProcessError as e:
        print("An error occurred:", e.stderr)
    return response

if __name__ == "__main__":
    # Call the function with a sample question
    get_tagged_question("Who is the president of largest country in Africa?")
```
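
The response-extraction step of the script can be tried on its own, without a GPU or a model download. The sketch below isolates it in a hypothetical helper, `extract_response`, and runs it on a made-up log string. The real output of `tune run generate` will contain different surrounding lines, so treat the sample text as an assumption.

```python
import re


def extract_response(output):
    """Return the text after the first 'Response:' marker, or None if absent.

    Hypothetical helper mirroring the regex used in command.py.
    """
    # '.' does not match newlines, so the capture stops at the end of the line.
    match = re.search(r"Response:[ \t]*(.*)", output)
    if match is None:
        return None
    text = match.group(1).strip()
    return text or None


# Made-up log text for illustration; real torchtune output will differ.
sample_log = (
    "INFO: some log line\n"
    "Input: What is the capital of Canada?\n"
    "Response: <qt>What</qt> <p>is the capital of</p> <o>Canada</o>?\n"
    "INFO: another log line\n"
)

print(extract_response(sample_log))
print(extract_response("no marker here"))
```

Returning `None` when the marker is missing lets the caller distinguish a failed generation from an empty tagging.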

### Step 4: Run the Script
To run the script and generate tagged questions, execute the following command in your terminal:

```bash
python command.py
```
</details>





# How We Fine-Tuned the Model

We fine-tuned the `Meta-Llama-3-8B` model in two key steps: preparing the dataset and running the fine-tuning process.

### 1- Prepare the Dataset

For this fine-tuning, we utilized the [QueryBridge dataset](https://huggingface.co/datasets/aorogat/QueryBridge), specifically the pairs of questions and their corresponding tagged questions. However, before we can use this dataset, it is necessary to convert the data into instruct prompts suitable for fine-tuning the model. You can find these prompts at [this link](https://huggingface.co/datasets/aorogat/Questions_to_Tagged_Questions_Prompts). Download the prompts and save them in the directory: `/home/YOUR_USERNAME/data`
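
As a rough illustration of this conversion, the sketch below turns one question/tagged-question pair into an Alpaca-style record with `instruction`, `input`, and `output` fields. The instruction wording, the helper name `to_alpaca_record`, and the assumption that the published prompt files follow exactly this layout are all hypothetical; check the linked prompts for the real format.

```python
import json

# Hypothetical instruction text; the published prompts define the real wording.
INSTRUCTION = "Convert the question to a tagged question."


def to_alpaca_record(question, tagged_question):
    """Build one record with Alpaca-style field names (hypothetical helper)."""
    return {
        "instruction": INSTRUCTION,
        "input": question,
        "output": tagged_question,
    }


# Illustrative pair assembled from the tag table, not a row copied from the dataset.
pairs = [
    (
        "What is the capital of Canada?",
        "<qt>What</qt> <p>is the capital of</p> <o>Canada</o>?",
    )
]

records = [to_alpaca_record(question, tagged) for question, tagged in pairs]
for record in records:
    # One JSON object per line keeps the output easy to stream and inspect.
    print(json.dumps(record, ensure_ascii=False))
```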

### 2- Fine-Tune the Model

To fine-tune the `Meta-Llama-3-8B` model, we leveraged [Torchtune](https://pytorch.org/torchtune/stable/index.html). Follow these steps to complete the process:


<details>
  <summary>Steps</summary>


### Step 1: Download the Model
Begin by downloading the model with the following command. Replace `<ACCESS TOKEN>` with your actual Hugging Face token and adjust the output directory as needed:

```bash
tune download \
  meta-llama/Meta-Llama-3-8B \
  --output-dir /home/YOUR_USERNAME/Meta-Llama-3-8B \
  --hf-token <ACCESS TOKEN>
```

### Step 2: Prepare the Configuration File
Next, you need to set up a configuration file. Start by downloading the default configuration:

```bash
tune cp llama3/8B_lora_single_device custom_config.yaml
```
Then, open `custom_config.yaml` and update it as follows:

```yaml
# Config for single device LoRA finetuning in lora_finetune_single_device.py
# using a Llama3 8B model
#
# Ensure the model is downloaded using the following command before launching:
#   tune download meta-llama/Meta-Llama-3-8B --output-dir /tmp/Meta-Llama-3-8B --hf-token <HF_TOKEN>
#
# To launch on a single device, run this command from the root directory:
#   tune run lora_finetune_single_device --config llama3/8B_lora_single_device
#
# You can add specific overrides through the command line. For example,
# to override the checkpointer directory, use:
#   tune run lora_finetune_single_device --config llama3/8B_lora_single_device checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
#
# This config is for training on a single device.

# Model Arguments
model:
  _component_: torchtune.models.llama3.lora_llama3_8b
  lora_attn_modules: ['q_proj', 'v_proj']
  apply_lora_to_mlp: False
  apply_lora_to_output: False
  lora_rank: 8
  lora_alpha: 16

# Tokenizer
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /home/YOUR_USERNAME/Meta-Llama-3-8B/original/tokenizer.model

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /home/YOUR_USERNAME/Meta-Llama-3-8B/original/
  checkpoint_files: [
    consolidated.00.pth
  ]
  recipe_checkpoint: null
  output_dir: /home/YOUR_USERNAME/Meta-Llama-3-8B/
  model_type: LLAMA3
resume_from_checkpoint: False

# Dataset and Sampler
dataset:
  _component_: torchtune.datasets.instruct_dataset
  split: train
  source: /home/YOUR_USERNAME/data
  template: AlpacaInstructTemplate
  train_on_input: False
seed: null
shuffle: True
batch_size: 1

# Optimizer and Scheduler
optimizer:
  _component_: torch.optim.AdamW
  weight_decay: 0.01
  lr: 3e-4
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100

loss:
  _component_: torch.nn.CrossEntropyLoss

# Training
epochs: 1
max_steps_per_epoch: null
gradient_accumulation_steps: 64
compile: False

# Logging
output_dir: /home/YOUR_USERNAME/lora_finetune_output
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: ${output_dir}
log_every_n_steps: null

# Environment
device: cuda
dtype: bf16
enable_activation_checkpointing: True

# Profiler (disabled)
profiler:
  _component_: torchtune.utils.profiler
  enabled: False
```
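
To get a feel for what the LoRA settings in this config imply, the following back-of-the-envelope sketch counts the trainable adapter parameters and the effective batch size. The architecture figures (embedding size 4096, 32 layers, 32 attention heads, 8 key/value heads) are assumptions taken from the public Llama 3 8B description, not from this card. The LoRA and batch values are read from the config above.

```python
# Architecture figures assumed from the public Llama 3 8B description.
EMBED_DIM = 4096
NUM_LAYERS = 32
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = EMBED_DIM // NUM_HEADS

# Values read from the fine-tuning config.
LORA_RANK = 8
LORA_ALPHA = 16
BATCH_SIZE = 1
GRAD_ACCUM_STEPS = 64


def lora_params(in_dim, out_dim, rank):
    """A LoRA adapter adds two matrices: A (rank x in_dim) and B (out_dim x rank)."""
    return rank * (in_dim + out_dim)


# Only q_proj and v_proj receive adapters (lora_attn_modules in the config).
q_proj = lora_params(EMBED_DIM, NUM_HEADS * HEAD_DIM, LORA_RANK)
v_proj = lora_params(EMBED_DIM, NUM_KV_HEADS * HEAD_DIM, LORA_RANK)
trainable = NUM_LAYERS * (q_proj + v_proj)

# The adapter update is scaled by alpha / rank before being added to the frozen weight.
scaling = LORA_ALPHA / LORA_RANK

# Gradients are accumulated over several micro-batches before each optimizer step.
effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS

print(f"trainable LoRA parameters: {trainable:,}")
print(f"LoRA scaling (alpha / rank): {scaling}")
print(f"examples per optimizer step: {effective_batch}")
```

Under these assumptions the adapters add roughly 3.4 million trainable parameters, a small fraction of the 8B base model, and each optimizer step sees 64 examples.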

### Step 3: Run the Finetuning Process
After configuring the file, you can start the finetuning process with the following command:

```bash
tune run lora_finetune_single_device --config /home/YOUR_USERNAME/.../custom_config.yaml
```

The new model can be found in the `/home/YOUR_USERNAME/Meta-Llama-3-8B/` directory.

</details>