File size: 3,062 Bytes
320db5c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
55ae3f5
 
 
 
 
 
 
 
 
 
 
 
 
 
c119e42
55ae3f5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
f6978bd
55ae3f5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
a9cab05
55ae3f5
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
---
license: apache-2.0
base_model:
- allenai/Molmo-72B-0924
pipeline_tag: image-text-to-text
library_name: transformers
---

Quantized with NF4 double quantization from [allenai/Molmo-72B-0924](https://huggingface.co/allenai/Molmo-72B-0924) using BitsAndBytes.

Vision backbone modules were not quantized to NF4 (though they are still FP16), and need to be run in FP32 at the moment (layer norm precision loss issue), and should be offloaded to CPU or you'll run out of memory on 48 GB VRAM.

This model just *barely* fits in 48 GB (tested on 2 x 3090, and gets about 6 tok/s). It probably doesn't have a very high max sequence length, but at least it works.

For 2 cards with 24 GB VRAM, this requires a very specific device map to work. For single cards with 48 GB VRAM, I imagine it works much more smoothly.

Example usage for image captioning with 2 x 24 GB VRAM GPUs:
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig, StopStringCriteria
from PIL import Image
import time

# For 2 x 24 GB. If using 1 x 48 GB or more (lucky you), you can just use device_map="auto"
device_map = {
    "model.vision_backbone": "cpu", # Seems to be required to not run out of memory at 48 GB
    "model.transformer.wte": 0,
    "model.transformer.ln_f": 0,
    "model.transformer.ff_out": 1,
}
# For 2 x 24 GB, this works for *only* 38 or 39. Any higher or lower and it'll either only work for 1 token of output or fail completely.
switch_point = 38 # layer index to switch to second GPU
device_map |= {f"model.transformer.blocks.{i}": 0 for i in range(0, switch_point)}
device_map |= {f"model.transformer.blocks.{i}": 1 for i in range(switch_point, 80)}

model_name = "SeanScripts/Molmo-72B-0924-nf4"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    use_safetensors=True,
    device_map=device_map,
    trust_remote_code=True, # Required for Molmo at the moment.
)
model.model.vision_backbone.float() # vision backbone needs to be in FP32 for this

processor = AutoProcessor.from_pretrained(
    model_name,
    trust_remote_code=True, # Required for Molmo at the moment.
)

torch.cuda.empty_cache()

image = Image.open("test.png")
inputs = processor.process(images=image, text="Caption this image.")
inputs = {k: v.to("cuda:0").unsqueeze(0) for k,v in inputs.items()}
prompt_tokens = inputs["input_ids"].size(1)
print("Prompt tokens:", prompt_tokens)

t0 = time.time()
output = model.generate_from_batch(
    inputs,
    generation_config=GenerationConfig(
        max_new_tokens=256,
    ),
    stopping_criteria=[StopStringCriteria(tokenizer=processor.tokenizer, stop_strings=["<|endoftext|>"])],
    tokenizer=processor.tokenizer,
)
t1 = time.time()
total_time = t1 - t0
generated_tokens = output.size(1) - prompt_tokens
time_per_token = generated_tokens/total_time
print(f"Generated {generated_tokens} tokens in {total_time:.3f} s ({time_per_token:.3f} tok/s)")

response = processor.tokenizer.decode(output[0, prompt_tokens:], skip_special_tokens=True)
print(response)

torch.cuda.empty_cache()
```