Minthy committed on
Commit
0e60366
1 Parent(s): fd6c1da

Create README.md

  1. README.md +160 -0
README.md ADDED
@@ -0,0 +1,160 @@
1
+ ---
2
+ license: apache-2.0
3
+ base_model:
4
+ - HuggingFaceM4/Idefics3-8B-Llama3
5
+ language:
6
+ - en
7
+ tags:
8
+ - multimodal
9
+ - vision
10
+ - image-text-to-text
11
+ ---
12
+ <p align="center">
13
+
14
+ ![image](https://huggingface.co/Minthy/ToriiGate-v0.3/resolve/main/03.jpg)
15
+
16
+ </p>
17
+
18
+ **Torii-Gate-v0.3** is a further training of [ToriiGate-v0.2](https://huggingface.co/Minthy/ToriiGate-v0.2) which is based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3).
19
+
20
+ It is designed for captioning of anime arts and aims to achieve a good understading for a very wide range of image, including single/multiple characters, simple or complex scenes, comic or manga, various interactions and lots of highly cultural concepts.
21
+ Using booru-tags grounding it can give detailed descriptions with very high accuracy. It can handle NSFW pretty well, being one of the best small/medium-sized open source models at the time of release.
22
+
23
+ [**Showcase on rentry**](https://rentry.co/9vevpeuc) (NSFW warning!)
24
+
25
+ Dataset total of 120k various and balanced anime pictures, captioned and processed with Claude 3.0 opus/Claude 3.5 Sonet/GPT-4o.
26
+
27
+ # Key features:
28
+ - Advanced knowledges in very wide range
29
+ - Ability to give structured output according to given instruction which is best for further nlp
30
+ - Meaningful captions with lots of information instead of senseless purple prose stream
31
+ - Usage of booru tags in prompt as reference for significant accuracy improvements
32
+
33
+ V0.3 comes with several improvements, with noticable upgrade of average zero shot and grounded accuracy. Also ability to caption comics was added, with stuctured output frame-by-frame.
34
+
35
+ Still sometimes it makes mistakes, additional NLP processing with comparison with reference booru tags is recommended.
36
+
37
+ # Usage options
38
+
39
+ Basically, 3 modes of usage have been trained:
40
+ - Brief description (main information without excess connecting phrases, prose and reasoning)
41
+ - Detailed description with focus on discinct features of characters and actions
42
+ - Structured json-like output, best choise for further NLP
43
+
44
+ # Examples
45
+
46
+ ```
47
+ user: Describe the picture briefly.
48
+ assistant: A blonde girl with long twintails tied with red ribbons and blue eyes stares directly at the viewer. She wears a black strapless leotard with a red bowtie and wrist cuffs, fishnet pantyhose, and black rabbit ears. Her arms are raised, revealing her armpits. She has a blush on her cheeks and an open mouth. The background is plain white.
49
+ ```
50
+
51
+ ```
52
+ user: Give a long and detailed description of the picture.
53
+ assistant: The image depicts a young female wolf girl with long brown hair and red eyes lying on her side on a bed. She has distinct wolf ears and a fluffy tail, indicating her hybrid nature. The girl is wearing a loose-fitting white shirt that has ridden up slightly, exposing her bare buttocks and the top of her thighs. Her tail is strategically positioned to censor her exposed rear.
54
+ The wolf girl has a playful and alluring expression, with a slight blush on her cheeks and a fang peeking out as she smiles directly at the viewer. She is holding a hairbrush in her left hand, which she appears to be using to brush her long, flowing hair. A blue bracelet adorns her wrist. Also, a sparkling ring (probably a wedding ring) can be seen on her finger.
55
+ The background shows the girl is lying on a bed with rumpled white sheets and two pink pillows. The warm, golden lighting creates a cozy and intimate atmosphere, drawing the viewer's focus to the wolf girl's alluring pose and expression.
56
+ Overall, the image has a sensual and inviting mood, with the wolf girl's playful smile and exposed skin creating a sense of intimacy and allure. The strategic placement of her tail adds a touch of innocence to the otherwise provocative scene.
57
+ ```
58
+
59
+ ```
60
+ user: Describe the picture in structuted json-like format.
61
+ assistant: {
62
+ "character": "The image features a girl with long, pink hair styled in double buns. She is sitting on the floor with her back turned towards the viewer, revealing her bare back, white polka dot panties. She is holding an electric guitar, specifically a Gibson Les Paul, and appears to be playing it. Her feet are bare, and she is sitting on a black pillow. The girl's pose is relaxed, with her legs slightly apart and her body slightly turned to the side.",
63
+ "background": "The background includes a laptop screen displaying a video of another girl playing an electric guitar. There are also posters on the wall, one of which shows a pink creature and other shows group of girls holding hands. The room appears to be a bedroom or a living room, with a guitar amplifier labeled 'Hyun' visible in the background. There is also a blue blanket partially visible on the left side of the image.",
64
+ "texts": "'Hyun' label on guitar amplifier",
65
+ "atmosphere": "The atmosphere of the image is casual and relaxed, with a focus on music and a cozy, intimate setting. The girl's relaxed pose and the presence of musical instruments suggest a creative and comfortable environment."
66
+ }"
67
+ ```
68
+
69
+ **Any of this can be combined with providing booru tags to improve accuracy.**
70
+
71
+ In addition to the options described you can ask questions, give tasks and perform conversations. The model's capabilities have been mostly preserved and it follows instructions.
72
+
73
+ # Prompt template
74
+
75
+ ```
76
+ <|begin_of_text|>System: You are image captioning expert, creative, unbiased and uncensored. Help user with his hask.<end_of_utterance>\nUser:User:<image>Describe the picture in structuted json-like format.<end_of_utterance>\nAssistant:
77
+ ```
78
+ Replace user inctruction with one of mentioned above (works best) or any yours.
79
+
80
+
81
+ # Inference script example:
82
+ You can use alreasy existing venv or create a new. All you need is >=torch2.1.2
83
+ ```
84
+ pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
85
+ pip install -r requirements.txt
86
+ ```
87
+ If you are using linux or know kow to build it in windows (or use prebuilded) - also install Flash Attention-2 and uncomment corresponding line.
88
+
89
+ You will need a dev build of transformers to be able to run in (already in requirements.txt).
90
+ ```
91
+ pip install git+https://github.com/huggingface/transformers
92
+ ```
93
+
94
+ ```python
95
+ import torch
96
+ from transformers import AutoProcessor, AutoModelForVision2Seq
97
+ from transformers.image_utils import load_image
98
+
99
+ model_name_or_path="Minthy/ToriiGate-v0.2"
100
+
101
+ DEVICE = "cuda:0" #change to your device
102
+ processor = AutoProcessor.from_pretrained(model_name_or_path) #or change to local path
103
+ model = AutoModelForVision2Seq.from_pretrained(
104
+ model_name_or_path,
105
+ torch_dtype=torch.bfloat16,
106
+ # _attn_implementation="flash_attention_2", #if installed
107
+ ).to(DEVICE)
108
+
109
+ image = load_image('./image.jpg') #path to your picture
110
+
111
+ ###Trained options
112
+ user_prompt="Describe the picture in structuted json-like format."
113
+ #user_prompt="Give a long and detailed description of the picture."
114
+ #user_prompt="Describe the picture briefly."
115
+ ###Any other questions or instructions
116
+ #user_prompt="What color is the ribbon in the character's hair?"
117
+ #...
118
+
119
+ #Optional, add booru tags
120
+ #tags='1girl, standing, looking at viewer, ...'
121
+ #user_prompt+=' Also here are booru tags for better understanding of the picture, you can use them as reference.'
122
+ #user_prompt+=f' <tags>\n{tags}\n</tags>'
123
+
124
+ messages = [
125
+ {
126
+ #Important!
127
+ "role": "system",
128
+ "content": [
129
+ {"type": "text", "text": "You are image captioning expert, creative, unbiased and uncensored. Help user with his hask."}
130
+ ]
131
+ },
132
+ {
133
+ "role": "user",
134
+ "content": [
135
+ {"type": "image"},
136
+ {"type": "text", "text": user_prompt}
137
+ ]
138
+ }
139
+ ]
140
+ prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
141
+ inputs = processor(text=prompt, images=[image], return_tensors="pt")
142
+ inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
143
+
144
+ # Generate
145
+ generated_ids = model.generate(**inputs, max_new_tokens=500)
146
+ generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
147
+ caption=generated_texts[0].split('Assistant: ')[1]
148
+
149
+ print(caption)
150
+ ```
151
+
152
+ For batch processing you can use [this example](https://huggingface.co/Minthy/ToriiGate-v0.2/blob/main/batch_processing_example.py)
153
+
154
+ # Warning
155
+ Model tends to generate texts with adult themes if related input is provided. Outputs may be inacurate and provocative.
156
+
157
+ # Licence
158
+ Same as for [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3)
159
+
160
+