---
library_name: transformers
tags: []
---

### Model Description

This model splits a long, compound question into separate questions where possible.

The output is a comma-separated list of questions.

**Input:** Where is IISc Located, what is GDP?, How can we utilize the power of wind mill and what is photosynthesis?

**Output:** `['Where is IISc Located, what is GDP?, How can we utilize the power of wind mill, and what is photosynthesis?']`

- **Developed by:** Geerath Bhat
- **Funded by [optional]:** Geerath Bhat
- **Shared by [optional]:** Geerath Bhat
- **Model type:** Fine-tuned Instruct LLM
- **Language(s) (NLP):** English
- **License:** []
- **Finetuned from model [optional]:** []

## Uses

Use this model to split long context data into separate, meaningful parts.

## Bias, Risks, and Limitations

The model may not work on very complex inputs, although we have measured its performance and it does well on most complex tasks.

[More Information Needed]

### Recommendations

Give the model a complex, nested question and it will separate the questions or context into meaningful parts.

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

```
import string

import nltk
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Sentence tokenizer data used by nltk.sent_tokenize
nltk.download('punkt')
nltk.download('punkt_tab')
```

### Get the tokenizer

```
model_checkpoint = "Geerath/context-seperator"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
```

### Load the model

```
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

# Inputs longer than this many tokens are truncated by the tokenizer call
max_input_length = 512
```

### Inference

```
prompt = """
You are given a query that combines multiple questions into a single string. Your task is to break down this combined query into individual questions, ensuring each question is clear and stands alone."""
text = """
Where is IISc Located, what is GDP?, How can we utilize the power of wind mill and what is photosynthesis?
"""

inputs = [prompt + text]

inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors="pt")
# Beam search combined with sampling, so the output can vary between runs
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10, max_length=512)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
# Split the decoded string into sentences
predicted_title = nltk.sent_tokenize(decoded_output.strip())

print(predicted_title)
```

**Result:** `['Where is IISc Located, what is GDP?, How can we utilize the power of wind mill, and what is photosynthesis?']`

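The result above is a list holding one comma-separated string. If individual questions are needed downstream, a small post-processing step can split that string. The following is a minimal sketch, not part of the model or this card: `split_questions` is a hypothetical helper, and it assumes that commas only ever separate questions, which will not hold for questions that contain commas themselves.

```python
import re


def split_questions(decoded: str) -> list[str]:
    """Split a comma-separated string of questions into a list (hypothetical helper)."""
    # Split on commas, also consuming a '?' that directly precedes the comma
    parts = re.split(r"\??\s*,\s*", decoded.strip())
    questions = []
    for part in parts:
        # Drop a leading "and" left over from the final list item
        part = re.sub(r"^and\s+", "", part.strip(), flags=re.IGNORECASE)
        part = part.rstrip("?").strip()
        if part:
            # Capitalize the first letter and end every item with one '?'
            questions.append(part[0].upper() + part[1:] + "?")
    return questions


result = "Where is IISc Located, what is GDP?, How can we utilize the power of wind mill, and what is photosynthesis?"
for question in split_questions(result):
    print(question)
```

Splitting on commas is the simplest choice given the output format described in this card; a more robust splitter would need to know how the training targets were delimited.
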
### Training Data

Custom dataset generated using multiple LLMs.

### Training Procedure

Fine-tuned T5 on the custom dataset.

#### Preprocessing [optional]

```
max_input_length = 512
max_target_length = 512

# Assumption: `prefix` is the instruction prompt from the inference example;
# this card does not define it.
prefix = prompt

def clean_text(text):
    # Split into sentences, then split each sentence further on newlines
    sentences = nltk.sent_tokenize(text.strip())
    sentences_cleaned = [s for sent in sentences for s in sent.split("\n")]
    # Keep only non-empty lines that end in punctuation (drops title-like lines)
    sentences_cleaned_no_titles = [sent for sent in sentences_cleaned
                                   if len(sent) > 0 and
                                   sent[-1] in string.punctuation]
    text_cleaned = "\n".join(sentences_cleaned_no_titles)
    return text_cleaned


def preprocess_data(examples):
    texts_cleaned = [text for text in examples["input"]]
    inputs = [prefix + text for text in texts_cleaned]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # text_target tokenizes the labels; it replaces the deprecated
    # tokenizer.as_target_tokenizer() context manager
    labels = tokenizer(text_target=examples["output"], max_length=max_target_length,
                       truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```

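To make the filtering rule inside `clean_text` concrete, here is a simplified sketch that applies only the "keep lines ending in punctuation" step, without NLTK. `keep_punctuated_lines` is a hypothetical name used for illustration, and unlike the original it strips whitespace from each line before checking it.

```python
import string


def keep_punctuated_lines(text: str) -> list[str]:
    """Keep non-empty lines whose last character is punctuation (illustrative only)."""
    lines = [line.strip() for line in text.split("\n")]
    return [line for line in lines if line and line[-1] in string.punctuation]


sample = "Title\nWhat is GDP?\n\nWhere is IISc located?"
# The title-like first line and the empty line are dropped
print(keep_punctuated_lines(sample))
```
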
#### Training Hyperparameters

```
from transformers import Seq2SeqTrainingArguments

batch_size = 16
args = Seq2SeqTrainingArguments(
    model_dir,  # output directory; its value is not given in this card
    evaluation_strategy="steps",  # named eval_strategy in recent transformers releases
    eval_steps=100,
    logging_strategy="steps",
    logging_steps=100,
    save_strategy="steps",
    save_steps=200,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",
    #push_to_hub=True
)
```

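The arguments above select the best checkpoint by `rouge1`, but this card does not show the metric function that was used. As a rough illustration of what ROUGE-1 measures, the sketch below computes a unigram-overlap F1 score in plain Python. It is an assumption-laden simplification (whitespace tokenization, lowercasing, no stemming), and `rouge1_f1` is a hypothetical name; actual training would more likely rely on an established ROUGE implementation.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection counts each shared token at most min(count) times
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(rouge1_f1("where is iisc", "where is gdp"))
```
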
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Custom test data was prepared for evaluation.

#### Summary

The model takes text as input and returns a list of text segments separated by commas.

Example:

**Input:** Where is IISc Located, what is GDP?, How can we utilize the power of wind mill and what is photosynthesis?

**Output:** `['Where is IISc Located, what is GDP?, How can we utilize the power of wind mill, and what is photosynthesis?']`