akhilpmohan
committed on
Update README.md
README.md
CHANGED
@@ -1,3 +1,94 @@
---
base_model:
- meta-llama/Llama-3.1-8B-Instruct
- google/siglip-so400m-patch14-384
tags:
- captioning
---
# Model Card for Llama JoyCaption Alpha Two

[GitHub](https://github.com/fpgaminer/joycaption)

JoyCaption is an image captioning Visual Language Model (VLM) being built from the ground up as a free, open, and uncensored model for the community to use in training diffusion models.

Key Features:
- **Free and Open**: It will be released for free with open weights and no restrictions and, just like [bigASP](https://www.reddit.com/r/StableDiffusion/comments/1dbasvx/the_gory_details_of_finetuning_sdxl_for_30m/), will come with training scripts and lots of juicy details on how it gets built.
- **Uncensored**: Equal coverage of SFW and NSFW concepts. No "cylindrical shaped object with a white substance coming out on it" here.
- **Diversity**: All are welcome here. Do you like digital art? Photoreal? Anime? Furry? JoyCaption is for everyone. Pains are being taken to ensure broad coverage of image styles, content, ethnicity, gender, orientation, etc.
- **Minimal Filtering**: JoyCaption is trained on large swathes of images so that it can understand almost all aspects of our world. Almost. Illegal content will never be tolerated in JoyCaption's training.


## Motivation

Automated descriptive captions enable the training and finetuning of diffusion models on a wider range of images, since trainers no longer need to find images that already have associated text or write the descriptions themselves. They also improve the quality of generations produced by Text-to-Image models trained on them (ref: DALL-E 3 paper). But to date, the community has been stuck with ChatGPT, which is expensive and heavily censored, or with alternative models like CogVLM, which are weaker than ChatGPT and perform abysmally outside of the SFW domain.

I'm building JoyCaption to help fill this gap by performing near or on par with GPT-4o in captioning images, while being free, unrestricted, and open.


## How to Get Started with the Model

Please see the [GitHub repository](https://github.com/fpgaminer/joycaption) for more details.

Example usage:

```python
import torch
import torch.amp
import torchvision.transforms.functional as TVF
from PIL import Image
from transformers import AutoTokenizer, LlavaForConditionalGeneration


IMAGE_PATH = "image.jpg"
PROMPT = "Write a long descriptive caption for this image in a formal tone."
MODEL_NAME = "fancyfeast/llama-joycaption-alpha-two-hf-llava"


# Load JoyCaption
# bfloat16 is the native dtype of the LLM used in JoyCaption (Llama 3.1)
# device_map=0 loads the model into the first GPU
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
llava_model = LlavaForConditionalGeneration.from_pretrained(MODEL_NAME, torch_dtype="bfloat16", device_map=0)
llava_model.eval()

with torch.no_grad():
    # Load and preprocess image
    # Normally you would use the Processor here, but the image module's processor
    # has some buggy behavior and a simple resize in Pillow yields higher quality results
    image = Image.open(IMAGE_PATH)

    if image.size != (384, 384):
        image = image.resize((384, 384), Image.LANCZOS)

    image = image.convert("RGB")
    pixel_values = TVF.pil_to_tensor(image)

    # Normalize the image
    pixel_values = pixel_values / 255.0
    pixel_values = TVF.normalize(pixel_values, [0.5], [0.5])
    pixel_values = pixel_values.to(torch.bfloat16).unsqueeze(0)

    # Build the conversation
    convo = [
        {
            "role": "system",
            "content": "You are a helpful image captioner.",
        },
        {
            "role": "user",
            "content": PROMPT,
        },
    ]

    # Format the conversation
    convo_string = tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=True)

    # Tokenize the conversation
    convo_tokens = tokenizer.encode(convo_string, add_special_tokens=False, truncation=False)

    # Repeat the image tokens
    input_tokens = []
    for token in convo_tokens:
        if token == llava_model.config.image_token_index:
            input_tokens.extend([llava_model.config.image_token_index] * llava_model.config.image_seq_length)
        else:
            input_tokens.append(token)

    input_ids = torch.tensor(input_tokens, dtype=torch.long).unsqueeze(0)
    attention_mask = torch.ones_like(input_ids)

    # Generate the caption
    generate_ids = llava_model.generate(
        input_ids=input_ids.to('cuda'),
        pixel_values=pixel_values.to('cuda'),
        attention_mask=attention_mask.to('cuda'),
        max_new_tokens=300,
        do_sample=True,
        suppress_tokens=None,
        use_cache=True,
    )[0]

    # Trim off the prompt
    generate_ids = generate_ids[input_ids.shape[1]:]

    # Decode the caption
    caption = tokenizer.decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    caption = caption.strip()
    print(caption)
```
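
Two steps in the example above are easy to misread: the "Repeat the image tokens" loop and the image normalization. The sketch below isolates both in plain Python so they can be checked without a GPU or model download. The function names `expand_image_tokens` and `normalize_pixel` and the toy token IDs are hypothetical, chosen only for illustration; the real values come from `llava_model.config.image_token_index` and `llava_model.config.image_seq_length`.

```python
from typing import List


def expand_image_tokens(tokens: List[int], image_token_id: int, image_seq_length: int) -> List[int]:
    """Replace each image placeholder token with image_seq_length copies of itself."""
    expanded: List[int] = []
    for token in tokens:
        if token == image_token_id:
            # One placeholder becomes image_seq_length slots in the input sequence
            expanded.extend([image_token_id] * image_seq_length)
        else:
            expanded.append(token)
    return expanded


def normalize_pixel(value: int) -> float:
    """Map an 8-bit channel value in [0, 255] to [-1.0, 1.0]."""
    scaled = value / 255.0       # mirrors: pixel_values / 255.0
    return (scaled - 0.5) / 0.5  # mirrors: TVF.normalize(pixel_values, [0.5], [0.5])


# Toy values for illustration only (not the model's real token IDs or lengths)
toy_tokens = [1, 7, 99, 8, 2]
print(expand_image_tokens(toy_tokens, image_token_id=99, image_seq_length=3))
# prints [1, 7, 99, 99, 99, 8, 2]
print([normalize_pixel(v) for v in (0, 255)])
# prints [-1.0, 1.0]
```

Because the placeholder is expanded before `generate` is called, `input_ids` and `attention_mask` in the full example already have the final sequence length, which is why the prompt can later be trimmed with `input_ids.shape[1]`.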