---
library_name: transformers
tags: []
---
### Model Description
This model splits long, combined questions into individual questions where possible.
## The output is a comma-separated list of questions.
### Input -> Where is IISc Located, what is GDP?, How can we utilize the power of wind mill and what is photosynthesis?
### Output -> ['Where is IISc Located, what is GDP?, How can we utilize the power of wind mill, and what is photosynthesis?']
- **Developed by:** Geerath Bhat
- **Funded by [optional]:** Geerath Bhat
- **Shared by [optional]:** Geerath Bhat
- **Model type:** Fine-tuned Instruct LLM
- **Language(s) (NLP):** English
- **License:** []
- **Finetuned from model [optional]:** T5
## Uses
We can use this model to separate long-context data into separate, meaningful parts.
## Bias, Risks, and Limitations
The model may fail in very complex situations, but we have measured its performance and it handles most complex tasks well.
### Recommendations
Give the model a complex, nested question and it will separate the questions or context into meaningful parts.
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
## How to Get Started with the Model
```
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk
import string

# Sentence-tokenizer resources used to post-process the model output
nltk.download('punkt')
nltk.download('punkt_tab')
```
# Get the tokenizer
```
model_checkpoint = "Geerath/context-seperator"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
```
# Load the model
```
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
max_input_length = 512
```
## Inference
```
# Instruction prompt prepended to the combined query
prompt = """
You are given a query that combines multiple questions into a single string. Your task is to break down this combined query into individual questions, ensuring each question is clear and stands alone."""
text = """
Where is IISc Located, what is GDP?, How can we utilize the power of wind mill and what is photosynthesis?
"""

# Tokenize the prompt + query and generate the separated questions
inputs = [prompt + text]
inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10, max_length=512)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]

# Split the decoded text into sentences and print the result
predicted_title = nltk.sent_tokenize(decoded_output.strip())
print(predicted_title)
```
## Result - ['Where is IISc Located, what is GDP?, How can we utilize the power of wind mill, and what is photosynthesis?']
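Since the questions come back as a single comma-separated string, you may want to split them into individual questions. The snippet below is a minimal post-processing sketch, not part of the model itself; the comma-based splitting heuristic is an assumption and may need adjusting for your data.
```
import re

def split_questions(decoded_output):
    # Naive heuristic: split on commas (optionally followed by "and")
    # and re-attach a question mark to each fragment.
    parts = re.split(r",\s*(?:and\s+)?", decoded_output.strip())
    return [p.strip().rstrip("?") + "?" for p in parts if p.strip()]

print(split_questions(decoded_output))
# e.g. ['Where is IISc Located?', 'what is GDP?',
#       'How can we utilize the power of wind mill?', 'what is photosynthesis?']
```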
### Training Data
Custom dataset generated using multiple LLMs.
### Training Procedure
Fine-tuned T5 on a custom dataset.
#### Preprocessing [optional]
```
max_input_length = 512
max_target_length = 512

def clean_text(text):
    # Split the text into sentences and keep only lines ending in punctuation
    sentences = nltk.sent_tokenize(text.strip())
    sentences_cleaned = [s for sent in sentences for s in sent.split("\n")]
    sentences_cleaned_no_titles = [sent for sent in sentences_cleaned
                                   if len(sent) > 0 and
                                   sent[-1] in string.punctuation]
    text_cleaned = "\n".join(sentences_cleaned_no_titles)
    return text_cleaned

def preprocess_data(examples):
    # clean_text is defined above but not applied here; inputs are used as-is.
    # "prefix" is the instruction prefix prepended to every training input.
    texts_cleaned = [text for text in examples["input"]]
    inputs = [prefix + text for text in texts_cleaned]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["output"], max_length=max_target_length,
                           truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
```
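The preprocessing function above expects examples with `input` and `output` columns. Below is a minimal sketch of how it could be applied with the Hugging Face `datasets` library; the example rows, the train/validation split, and the use of the inference prompt as the training `prefix` are assumptions, since the actual dataset is not public.
```
from datasets import DatasetDict, Dataset

# Hypothetical rows in the "input"/"output" format assumed by preprocess_data
raw_datasets = DatasetDict({
    "train": Dataset.from_dict({
        "input": ["Where is IISc Located, what is GDP?"],
        "output": ["Where is IISc Located?, what is GDP?"],
    }),
    "validation": Dataset.from_dict({
        "input": ["How can we utilize the power of wind mill and what is photosynthesis?"],
        "output": ["How can we utilize the power of wind mill?, what is photosynthesis?"],
    }),
})

prefix = prompt  # assumption: the inference prompt is also the training prefix
tokenized_datasets = raw_datasets.map(
    preprocess_data, batched=True, remove_columns=["input", "output"]
)
```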
#### Training Hyperparameters
```
from transformers import Seq2SeqTrainingArguments

batch_size = 16
model_dir = "context-seperator-checkpoints"  # placeholder local output directory

args = Seq2SeqTrainingArguments(
    model_dir,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_strategy="steps",
    logging_steps=100,
    save_strategy="steps",
    save_steps=200,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",
    # push_to_hub=True
)
```
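These arguments would typically be passed to a `Seq2SeqTrainer`. The sketch below shows one possible setup under the assumptions from the preprocessing sketch above; `compute_metrics` is a hypothetical function that must return a `rouge1` score for `metric_for_best_model="rouge1"` to take effect.
```
from transformers import DataCollatorForSeq2Seq, Seq2SeqTrainer

# Dynamically pads inputs and labels within each batch
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,  # hypothetical ROUGE metric function
)
trainer.train()
```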
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
A custom test set was prepared.
#### Summary
The model takes a text input and returns the questions as a comma-separated list.
Example - Input -> Where is IISc Located, what is GDP?, How can we utilize the power of wind mill and what is photosynthesis?
Output -> ['Where is IISc Located, what is GDP?, How can we utilize the power of wind mill, and what is photosynthesis?']