prince-canuma committed
Commit a3a6e67
1 Parent(s): 421f0a0

Add struct output

Files changed (1):
  1. README.md +48 -2
README.md CHANGED
@@ -104,6 +104,54 @@ and the film's themes and messages are timeless.
  I highly recommend it to anyone who enjoys a well-crafted and emotionally engaging story.
  ```
 
+ ### Structured Output
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("prince-canuma/Damysus-2.7B-Chat")
+ model = AutoModelForCausalLM.from_pretrained("prince-canuma/Damysus-2.7B-Chat")
+
+ inputs = tokenizer.apply_chat_template(
+     [
+         {"content": "You are a Robot that ONLY outputs JSON. Use this structure: {'entities': [{'type': ..., 'name': ...}]}.", "role": "system"},
+         {"content": """Extract the entities of type 'technology' and 'file_type' in JSON format from the following passage: AI is a transformative
+ force in document processing, employing technologies such as Machine Learning (ML), Natural Language Processing (NLP) and
+ Optical Character Recognition (OCR) to understand, interpret, and summarize text. These technologies enhance accuracy,
+ increase efficiency, and allow you and your company to process high volumes of data in a short amount of time.
+ For instance, you can easily extract key points and summarize a large PDF document (i.e., 500 pages) in just a few seconds.""",
+          "role": "user"},
+     ],
+     add_generation_prompt=True,
+     return_tensors="pt",
+ ).to("cuda")
+
+ outputs = model.generate(inputs, do_sample=False, max_new_tokens=256)
+
+ input_length = inputs.shape[1]
+ print(tokenizer.batch_decode(outputs[:, input_length:], skip_special_tokens=True)[0])
+ ```
+
+ Output:
+ ```json
+ {
+   "entities": [
+     {
+       "type": "technology",
+       "name": "Machine Learning (ML)"
+     },
+     {
+       "type": "technology",
+       "name": "Natural Language Processing (NLP)"
+     },
+     {
+       "type": "technology",
+       "name": "Optical Character Recognition (OCR)"
+     },
+     {
+       "type": "file_type",
+       "name": "PDF"
+     }
+   ]
+ }
+ ```
  ## Training Details
 
  ### Training Data
@@ -113,8 +161,6 @@ I used [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset, a
  In the course of this study, the [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset was used, representing a meticulously curated subset derived from the broader OpenOrca dataset. This release provides an efficient means of reaching performance on-par with using larger slices of the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), while only including ~500k GPT-4 completions.
 
 
-
-
  Subsequently, two distinct subsets were crafted, comprising 102,000 and 1,000 samples, denoted as:
 
  - [prince-canuma/SmallOrca](https://huggingface.co/datasets/prince-canuma/SmallOrca)
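
Since the model is only prompted, not constrained, to emit JSON, its output can occasionally fail to parse, so downstream code should validate it before use. A minimal sketch using Python's standard `json` module; the `raw` string below stands in for the decoded generation from the Structured Output example above, and the empty-dict fallback is one assumed recovery strategy (retrying the generation is another):

```python
import json

# Stand-in for the decoded model generation from the Structured Output example.
raw = """{
  "entities": [
    {"type": "technology", "name": "Machine Learning (ML)"},
    {"type": "technology", "name": "Natural Language Processing (NLP)"},
    {"type": "technology", "name": "Optical Character Recognition (OCR)"},
    {"type": "file_type", "name": "PDF"}
  ]
}"""

try:
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
except json.JSONDecodeError:
    data = {"entities": []}  # assumed fallback; retrying generation also works

# Filter entities by type once the structure is confirmed valid.
technologies = [e["name"] for e in data["entities"] if e["type"] == "technology"]
print(technologies)
```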