---
library_name: transformers
tags: []
---

### Model Description

This model splits a long, compound question into separate questions where possible. The output is a comma-separated list of questions.

Example:

- **Input:** Where is IISc Located, what is GDP?, How can we utilize the power of wind mill and what is photosynthesis?
- **Output:** ['Where is IISc Located, what is GDP?, How can we utilize the power of wind mill, and what is photosynthesis?']


- **Developed by:** Geerath Bhat
- **Funded by:** Geerath Bhat
- **Shared by:** Geerath Bhat
- **Model type:** Fine-tuned instruct LLM (sequence-to-sequence)
- **Language(s) (NLP):** English
- **License:** [More Information Needed]
- **Finetuned from model:** T5


## Uses

This model can be used to separate long-context data into separate, meaningful parts.


## Bias, Risks, and Limitations

The model may not work in very complex situations, but we have measured its performance and it performs well on most complex tasks.

[More Information Needed]

### Recommendations

Give the model a complex, nested question and it will separate the questions or context into meaningful parts.

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import nltk
import string

nltk.download('punkt')
nltk.download('punkt_tab')
```

### Get the tokenizer

```python
model_checkpoint = "Geerath/context-seperator"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
```

### Load the model
```python
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

max_input_length = 512
```
### Inference
```python
prompt = """
You are given a query that combines multiple questions into a single string. Your task is to break down this combined query into individual questions, ensuring each question is clear and stands alone."""
text = """
Where is IISc Located, what is GDP?, How can we utilize the power of wind mill and what is photosynthesis?
"""

inputs = [prompt + text]

inputs = tokenizer(inputs, max_length=max_input_length, truncation=True, return_tensors="pt")
output = model.generate(**inputs, num_beams=8, do_sample=True, min_length=10, max_length=512)
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
predicted_title = nltk.sent_tokenize(decoded_output.strip())

print(predicted_title)
```
Result: `['Where is IISc Located, what is GDP?, How can we utilize the power of wind mill, and what is photosynthesis?']`
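The printed result is a single stringified list. If you want each question as its own element, one possible post-processing step is sketched below; the `split_questions` helper is hypothetical (not part of the model or its API) and assumes the output format shown above:

```python
import ast
import re

def split_questions(raw_output):
    """Parse the model's printed output (a stringified one-element list)
    and heuristically split it into individual questions."""
    text = ast.literal_eval(raw_output)[0]
    # Heuristics: split at '?,' boundaries and at ', and ' conjunctions
    parts = re.split(r"\?,\s*|,\s*and\s+", text)
    return [p.strip().rstrip("?") + "?" for p in parts if p.strip()]

raw = "['Where is IISc Located, what is GDP?, How can we utilize the power of wind mill, and what is photosynthesis?']"
print(split_questions(raw))
# ['Where is IISc Located, what is GDP?', 'How can we utilize the power of wind mill?', 'what is photosynthesis?']
```

Note that the first element still contains two comma-joined questions: the heuristic only splits at explicit `?,` or `, and` boundaries, so it inherits whatever granularity the model's output provides.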

### Training Data

Custom dataset generated using multiple LLMs.

### Training Procedure

Fine-tuned T5 on the custom dataset described above.
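For illustration, a training record presumably looks like the following. The field names `input` and `output` match what the preprocessing function below reads, but this exact pair is a made-up example, not taken from the dataset:

```python
# Hypothetical training record (field names follow preprocess_data's
# examples["input"] / examples["output"] access pattern)
record = {
    "input": "Where is IISc Located and what is GDP?",
    "output": "Where is IISc Located?, What is GDP?",
}
print(record["output"])
```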

#### Preprocessing
```python
max_input_length = 512
max_target_length = 512

def clean_text(text):
  sentences = nltk.sent_tokenize(text.strip())
  sentences_cleaned = [s for sent in sentences for s in sent.split("\n")]
  # Keep only sentences that end in punctuation (drops titles/headings)
  sentences_cleaned_no_titles = [sent for sent in sentences_cleaned
                                 if len(sent) > 0 and
                                 sent[-1] in string.punctuation]
  text_cleaned = "\n".join(sentences_cleaned_no_titles)
  return text_cleaned

# `prefix` is the instruction prepended to every input (the same
# instruction used as `prompt` at inference time)
prefix = """
You are given a query that combines multiple questions into a single string. Your task is to break down this combined query into individual questions, ensuring each question is clear and stands alone."""

def preprocess_data(examples):
  texts_cleaned = [clean_text(text) for text in examples["input"]]
  inputs = [prefix + text for text in texts_cleaned]
  model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
  # Tokenize the targets in target mode
  with tokenizer.as_target_tokenizer():
    labels = tokenizer(examples["output"], max_length=max_target_length,
                       truncation=True)

  model_inputs["labels"] = labels["input_ids"]
  return model_inputs
```
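The `clean_text` helper keeps only sentence lines that end in punctuation. A stdlib-only sketch of the same filtering idea (without NLTK's sentence tokenizer; `clean_text_simple` is an illustrative name, not part of this repo) behaves like this:

```python
import string

def clean_text_simple(text):
    # Split on newlines and keep only lines ending in punctuation,
    # mirroring clean_text's title-dropping filter
    lines = [line.strip() for line in text.strip().split("\n")]
    kept = [line for line in lines if line and line[-1] in string.punctuation]
    return "\n".join(kept)

sample = "Some Title\nWhere is IISc Located?\nWhat is GDP?"
print(clean_text_simple(sample))
# Where is IISc Located?
# What is GDP?
```

The line "Some Title" is dropped because it does not end in punctuation, which is the heuristic the real `clean_text` uses to filter out headings.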

#### Training Hyperparameters
```python
batch_size = 16
args = Seq2SeqTrainingArguments(
    model_dir,
    evaluation_strategy="steps",
    eval_steps=100,
    logging_strategy="steps",
    logging_steps=100,
    save_strategy="steps",
    save_steps=200,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=10,
    predict_with_generate=True,
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="rouge1",
    #push_to_hub=True
)
```

## Evaluation



### Testing Data, Factors & Metrics

#### Testing Data

A custom-prepared test set.


#### Summary

The model takes text as input and gives as output a list of text separated by commas.

Example:

- **Input:** Where is IISc Located, what is GDP?, How can we utilize the power of wind mill and what is photosynthesis?
- **Output:** ['Where is IISc Located, what is GDP?, How can we utilize the power of wind mill, and what is photosynthesis?']