prince-canuma committed
Commit a3a6e67 · 1 Parent(s): 421f0a0
Add struct output

README.md CHANGED
@@ -104,6 +104,54 @@ and the film's themes and messages are timeless.
 I highly recommend it to anyone who enjoys a well-crafted and emotionally engaging story.
 ```
 
+### Structured Output
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+tokenizer = AutoTokenizer.from_pretrained("prince-canuma/Damysus-2.7B-Chat")
+model = AutoModelForCausalLM.from_pretrained("prince-canuma/Damysus-2.7B-Chat").to("cuda")
+
+inputs = tokenizer.apply_chat_template(
+    [
+        {"content": "You are a Robot that ONLY outputs JSON. Use this structure: {'entities': [{'type':..., 'name':..,}]}.", "role": "system"},
+        {"content": """Extract the entities of type 'technology' and 'file_type' in JSON format from the following passage: AI is a transformative
+force in document processing employing technologies such as Machine Learning (ML), Natural Language Processing (NLP) and
+Optical Character Recognition (OCR) to understand, interpret, and summarize text. These technologies enhance accuracy,
+increase efficiency, and allow you and your company to process high volumes of data in a short amount of time.
+For instance, you can easily extract key points and summarize a large PDF document (i.e., 500 pages) in just a few seconds.""",
+         "role": "user"},
+    ], add_generation_prompt=True, return_tensors="pt",
+).to("cuda")
+
+outputs = model.generate(inputs, do_sample=False, max_new_tokens=256)
+
+input_length = inputs.shape[1]
+print(tokenizer.batch_decode(outputs[:, input_length:], skip_special_tokens=True)[0])
+```
+
+Output:
+```json
+{
+  "entities": [
+    {
+      "type": "technology",
+      "name": "Machine Learning (ML)"
+    },
+    {
+      "type": "technology",
+      "name": "Natural Language Processing (NLP)"
+    },
+    {
+      "type": "technology",
+      "name": "Optical Character Recognition (OCR)"
+    },
+    {
+      "type": "file_type",
+      "name": "PDF"
+    }
+  ]
+}
+```
 ## Training Details
 
 ### Training Data
@@ -113,8 +161,6 @@ I used [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset, a
 In the course of this study, the [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset was used, representing a meticulously curated subset derived from the broader OpenOrca dataset. This release provides an efficient means of reaching performance on-par with using larger slices of the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), while only including ~500k GPT-4 completions.
 
 
-
-
 Subsequently, two distinct subsets were crafted, comprising 102,000 and 1,000 samples, denoted as:
 
 - [prince-canuma/SmallOrca](https://huggingface.co/datasets/prince-canuma/SmallOrca)