prince-canuma committed
Commit a3a6e67
1 Parent(s): 421f0a0

Add struct output

Files changed (1):
  1. README.md +48 -2
README.md CHANGED
@@ -104,6 +104,54 @@ and the film's themes and messages are timeless.
  I highly recommend it to anyone who enjoys a well-crafted and emotionally engaging story.
  ```
 
+ ### Structured Output
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("prince-canuma/Damysus-2.7B-Chat")
+ model = AutoModelForCausalLM.from_pretrained("prince-canuma/Damysus-2.7B-Chat")
+
+ inputs = tokenizer.apply_chat_template(
+     [
+         {"content": "You are a Robot that ONLY outputs JSON. Use this structure: {'entities': [{'type': ..., 'name': ...}]}.", "role": "system"},
+         {"content": """Extract the entities of type 'technology' and 'file_type' in JSON format from the following passage: AI is a transformative
+ force in document processing, employing technologies such as Machine Learning (ML), Natural Language Processing (NLP) and
+ Optical Character Recognition (OCR) to understand, interpret, and summarize text. These technologies enhance accuracy,
+ increase efficiency, and allow you and your company to process high volumes of data in a short amount of time.
+ For instance, you can easily extract key points and summarize a large PDF document (i.e., 500 pages) in just a few seconds.""",
+          "role": "user"},
+     ],
+     add_generation_prompt=True,
+     return_tensors="pt",
+ ).to("cuda")
+
+ outputs = model.generate(inputs, do_sample=False, max_new_tokens=256)
+
+ input_length = inputs.shape[1]
+ print(tokenizer.batch_decode(outputs[:, input_length:], skip_special_tokens=True)[0])
+ ```
+
+ Output:
+ ```json
+ {
+   "entities": [
+     {
+       "type": "technology",
+       "name": "Machine Learning (ML)"
+     },
+     {
+       "type": "technology",
+       "name": "Natural Language Processing (NLP)"
+     },
+     {
+       "type": "technology",
+       "name": "Optical Character Recognition (OCR)"
+     },
+     {
+       "type": "file_type",
+       "name": "PDF"
+     }
+   ]
+ }
+ ```
  ## Training Details
 
  ### Training Data
@@ -113,8 +161,6 @@ I used [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset, a
  In the course of this study, the [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset was used, representing a meticulously curated subset derived from the broader OpenOrca dataset. This release provides an efficient means of reaching performance on-par with using larger slices of the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), while only including ~500k GPT-4 completions.
 
 
-
-
  Subsequently, two distinct subsets were crafted, comprising 102,000 and 1,000 samples, denoted as:
 
  - [prince-canuma/SmallOrca](https://huggingface.co/datasets/prince-canuma/SmallOrca)
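
Since the model is only prompted, not constrained, to emit JSON, its output can occasionally fail to parse, so downstream code should validate it before use. A minimal sketch using Python's standard `json` module; the `raw` string below stands in for the decoded generation from the Structured Output example above, and the empty-dict fallback is one assumed recovery strategy (retrying the generation is another):

```python
import json

# Stand-in for the decoded model generation from the Structured Output example.
raw = """{
  "entities": [
    {"type": "technology", "name": "Machine Learning (ML)"},
    {"type": "technology", "name": "Natural Language Processing (NLP)"},
    {"type": "technology", "name": "Optical Character Recognition (OCR)"},
    {"type": "file_type", "name": "PDF"}
  ]
}"""

try:
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
except json.JSONDecodeError:
    data = {"entities": []}  # assumed fallback; retrying generation also works

# Filter entities by type once the structure is confirmed valid.
technologies = [e["name"] for e in data["entities"] if e["type"] == "technology"]
print(technologies)
```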