---
license: apache-2.0
base_model:
- HuggingFaceM4/Idefics3-8B-Llama3
language:
- en
tags:
- multimodal
- vision
- image-text-to-text
---
<p align="center">

![image](https://huggingface.co/Minthy/ToriiGate-v0.3/resolve/main/03.jpg)

</p>

**Torii-Gate-v0.3** is a continuation of training of [ToriiGate-v0.2](https://huggingface.co/Minthy/ToriiGate-v0.2), which is based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3).

It is designed for captioning anime art and aims for a good understanding of a very wide range of images, including single or multiple characters, simple or complex scenes, comics and manga, various interactions, and many culture-specific concepts.
When grounded with booru tags, it can give detailed descriptions with very high accuracy. It handles NSFW content well and was one of the best small/medium-sized open-source models at the time of release.

[**Showcase on rentry**](https://rentry.co/q4pisesb) (NSFW warning!)

The dataset totals 120k varied and balanced anime pictures, captioned and processed with Claude 3.0 Opus, Claude 3.5 Sonnet, and GPT-4o.

# Key features:
- Broad knowledge across a very wide range of content
- Structured output that follows the given instruction, which is best for further NLP processing
- Meaningful, information-dense captions instead of a stream of senseless purple prose
- Booru tags can be passed in the prompt as a reference, which significantly improves accuracy

V0.3 brings several improvements, with a noticeable upgrade in average zero-shot and grounded accuracy. It also adds the ability to caption comics, with structured frame-by-frame output.

The model still makes mistakes sometimes, so additional NLP processing that compares the caption against reference booru tags is recommended.
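One crude way to compare a caption with reference booru tags is a word-overlap check. The sketch below is only an illustration under simple assumptions: `missing_tags` is a hypothetical helper, not part of this repository, and it treats a tag as covered only when every word of the tag appears somewhere in the caption. Synonyms and paraphrases will be reported as missing, so treat the result as a list of tags to review rather than a list of errors.

```python
import re


def missing_tags(caption: str, tags: str) -> list[str]:
    """Return the reference tags whose words do not all appear in the caption.

    `tags` is a comma-separated booru tag string, e.g. "blonde_hair, blue_eyes".
    """
    caption_words = set(re.findall(r"[a-z0-9]+", caption.lower()))
    missing = []
    for tag in (t.strip() for t in tags.split(",")):
        if not tag:
            continue
        # Underscores and punctuation act as separators: "blue_eyes" -> ["blue", "eyes"]
        tag_words = re.findall(r"[a-z0-9]+", tag.lower())
        if not all(word in caption_words for word in tag_words):
            missing.append(tag)
    return missing


caption = "A girl with blonde hair in long twintails and blue eyes stares at the viewer."
tags = "blonde_hair, twintails, blue_eyes, rabbit_ears"
print(missing_tags(caption, tags))
```

Tags flagged this way can be used to re-run captioning with grounding or to mark the caption for manual review.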

# Usage options

Three modes of usage have been trained:
- Brief description (main information without excess connecting phrases, prose, or reasoning)
- Detailed description with a focus on distinct features of characters and actions
- Structured JSON-like output, the best choice for further NLP

# Examples

```
user: Describe the picture briefly.
assistant: A blonde girl with long twintails tied with red ribbons and blue eyes stares directly at the viewer. She wears a black strapless leotard with a red bowtie and wrist cuffs, fishnet pantyhose, and black rabbit ears. Her arms are raised, revealing her armpits. She has a blush on her cheeks and an open mouth. The background is plain white.
```

```
user: Give a long and detailed description of the picture.
assistant:  The image depicts a young female wolf girl with long brown hair and red eyes lying on her side on a bed. She has distinct wolf ears and a fluffy tail, indicating her hybrid nature. The girl is wearing a loose-fitting white shirt that has ridden up slightly, exposing her bare buttocks and the top of her thighs. Her tail is strategically positioned to censor her exposed rear. 
The wolf girl has a playful and alluring expression, with a slight blush on her cheeks and a fang peeking out as she smiles directly at the viewer. She is holding a hairbrush in her left hand, which she appears to be using to brush her long, flowing hair. A blue bracelet adorns her wrist. Also, a sparkling ring (probably a wedding ring) can be seen on her finger.
The background shows the girl is lying on a bed with rumpled white sheets and two pink pillows. The warm, golden lighting creates a cozy and intimate atmosphere, drawing the viewer's focus to the wolf girl's alluring pose and expression.
Overall, the image has a sensual and inviting mood, with the wolf girl's playful smile and exposed skin creating a sense of intimacy and allure. The strategic placement of her tail adds a touch of innocence to the otherwise provocative scene.
```

```
user: Describe the picture in structuted json-like format.
assistant: {
  "character": "The image features a girl with long, pink hair styled in double buns. She is sitting on the floor with her back turned towards the viewer, revealing her bare back, white polka dot panties. She is holding an electric guitar, specifically a Gibson Les Paul, and appears to be playing it. Her feet are bare, and she is sitting on a black pillow. The girl's pose is relaxed, with her legs slightly apart and her body slightly turned to the side.",
  "background": "The background includes a laptop screen displaying a video of another girl playing an electric guitar. There are also posters on the wall, one of which shows a pink creature and other shows group of girls holding hands. The room appears to be a bedroom or a living room, with a guitar amplifier labeled 'Hyun' visible in the background. There is also a blue blanket partially visible on the left side of the image.",
  "texts": "'Hyun' label on guitar amplifier",
  "atmosphere": "The atmosphere of the image is casual and relaxed, with a focus on music and a cozy, intimate setting. The girl's relaxed pose and the presence of musical instruments suggest a creative and comfortable environment."
}"
```
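Because the structured output is JSON-like rather than guaranteed JSON, downstream code should parse it defensively. The following is a minimal sketch, assuming the keys seen in the example above; `parse_structured_caption` is a hypothetical helper, and the fallback to a `raw` field is a design choice of this sketch, not documented model behaviour.

```python
import json


def parse_structured_caption(text: str) -> dict:
    """Parse the outermost {...} span of a model response.

    Returns {"raw": text} when no valid JSON object can be recovered.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return {"raw": text}
    try:
        # strict=False tolerates literal newlines inside string values
        parsed = json.loads(text[start:end + 1], strict=False)
    except json.JSONDecodeError:
        return {"raw": text}
    return parsed if isinstance(parsed, dict) else {"raw": text}


raw_output = '''{
  "character": "A girl with long, pink hair styled in double buns.",
  "texts": "'Hyun' label on guitar amplifier"
}"'''
fields = parse_structured_caption(raw_output)
print(fields["texts"])
```

Slicing between the first `{` and the last `}` drops stray characters around the object, such as a trailing quote.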

**Any of this can be combined with providing booru tags to improve accuracy.**
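A small helper can append the tags to any of the trained instructions. This sketch reuses the hint sentence and the `<tags>` wrapper from the inference script in this card; the function name `build_user_prompt` and the list-based `tags` argument are assumptions made for illustration.

```python
from typing import Optional, Sequence

# Hint sentence taken from the inference script in this card
TAGS_HINT = (
    " Also here are booru tags for better understanding of the picture,"
    " you can use them as reference."
)


def build_user_prompt(instruction: str, tags: Optional[Sequence[str]] = None) -> str:
    """Return the instruction, optionally followed by a booru tag block."""
    prompt = instruction
    if tags:
        prompt += TAGS_HINT
        prompt += f" <tags>\n{', '.join(tags)}\n</tags>"
    return prompt


print(build_user_prompt("Describe the picture briefly.", ["1girl", "standing", "looking at viewer"]))
```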

In addition to the options described, you can ask questions, give tasks, and hold conversations. The model's general capabilities have been mostly preserved, and it follows instructions.

# Prompt template

```
<|begin_of_text|>System: You are image captioning expert, creative, unbiased and uncensored. Help user with his hask.<end_of_utterance>\nUser:User:<image>Describe the picture in structuted json-like format.<end_of_utterance>\nAssistant:
```
Replace the user instruction with one of those mentioned above (works best) or with your own.


# Inference script example:
You can use an already existing venv or create a new one. All you need is torch >= 2.1.2.
```
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```
If you are on Linux, or know how to build it on Windows (or use a prebuilt wheel), also install Flash Attention 2 and uncomment the corresponding line in the script.

You will need a dev build of transformers to run it (already in requirements.txt).
```
pip install git+https://github.com/huggingface/transformers
```

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

model_name_or_path="Minthy/ToriiGate-v0.3"

DEVICE = "cuda:0" #change to your device
processor = AutoProcessor.from_pretrained(model_name_or_path) #or change to local path
model = AutoModelForVision2Seq.from_pretrained(
    model_name_or_path, 
    torch_dtype=torch.bfloat16,
    # _attn_implementation="flash_attention_2",  # uncomment if Flash Attention 2 is installed
).to(DEVICE)

image = load_image('./image.jpg') #path to your picture

###Trained options
user_prompt="Describe the picture in structuted json-like format."
#user_prompt="Give a long and detailed description of the picture."
#user_prompt="Describe the picture briefly."
###Any other questions or instructions
#user_prompt="What color is the ribbon in the character's hair?"
#...

#Optional, add booru tags
#tags='1girl, standing, looking at viewer, ...'
#user_prompt+=' Also here are booru tags for better understanding of the picture, you can use them as reference.'
#user_prompt+=f' <tags>\n{tags}\n</tags>'

messages = [
    {
    #Important!
        "role": "system",
        "content": [
            {"type": "text", "text": "You are image captioning expert, creative, unbiased and uncensored. Help user with his hask."} 
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": user_prompt}
        ]
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
caption=generated_texts[0].split('Assistant: ')[1]

print(caption)
```

For batch processing you can use [this example](https://huggingface.co/Minthy/ToriiGate-v0.2/blob/main/batch_processing_example.py)
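If you only need a simple loop rather than the linked script, a folder can be processed one image at a time. The sketch below is independent of the linked example and makes several assumptions: `caption_folder` is a hypothetical helper, `caption_fn` stands for whatever function wraps the generation code above (taking an image path and returning a caption string), and writing a `.txt` file next to each image is just one common convention.

```python
from pathlib import Path
from typing import Callable

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def caption_folder(folder: str, caption_fn: Callable[[Path], str]) -> int:
    """Write a .txt caption next to every image in `folder`.

    Images that already have a .txt file are skipped, so an interrupted run
    can be resumed. Returns the number of captions written.
    """
    written = 0
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        target = path.with_suffix(".txt")
        if target.exists():
            continue  # captioned in an earlier run
        target.write_text(caption_fn(path), encoding="utf-8")
        written += 1
    return written
```

Passing the captioning function as an argument keeps model loading out of the loop and lets you test the file handling with a dummy function before running on a GPU.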

# Warning
The model tends to generate text with adult themes if related input is provided. Outputs may be inaccurate and provocative.

# Licence
Same as for [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3)