
Missing steps

#8
by Martins6 - opened

Thanks for the awesome project and for sharing the weights! You guys rock!

In the llava module, the load_pretrained_model function has the following line:

model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)

However, the LlavaQwenForCausalLM class it calls can't be found. I know this may be a llava problem, but maybe you can point to a solution? Otherwise, your code is currently unusable.
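For reference, a quick check of whether the class is importable at all; the module path below is an assumption based on the LLaVA-NeXT repo layout and may differ in your llava version:

# Assumed import path from the LLaVA-NeXT repo; adjust if your llava version differs.
try:
    from llava.model.language_model.llava_qwen import LlavaQwenForCausalLM
    print("LlavaQwenForCausalLM found:", LlavaQwenForCausalLM)
except ImportError as exc:
    print("Class not importable; llava is probably missing or too old:", exc)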

Martins6 changed discussion status to closed

Guys, I inspected it further. It seems there are just a few missing steps.

First, I had to install all of these packages; it would be nice to document this (a quick import check is sketched after the list):

"accelerate>=1.0.1",
"av>=13.1.0",
"boto3>=1.35.46",
"decord>=0.6.0",
"einops>=0.6.0",
"flash-attn",
"llava",
"open-clip-torch>=2.28.0",
"transformers>=4.45.2",

Second, the load_pretrained_model function simply failed when loading the Qwen model.
I had to write a new function to load everything that was necessary:

import torch
from transformers import AutoTokenizer

# These imports assume the LLaVA-NeXT repo layout; paths may differ across llava versions.
from llava.model.language_model.llava_qwen import LlavaQwenForCausalLM
from llava.constants import (
    DEFAULT_IMAGE_PATCH_TOKEN,
    DEFAULT_IM_START_TOKEN,
    DEFAULT_IM_END_TOKEN,
)


def load_model():
    model_name = "llava_qwen"
    device_map = "auto"

    model_path = "lmms-lab/LLaVA-Video-7B-Qwen2"
    attn_implementation = None  # or "flash_attention_2" if flash-attn is installed
    kwargs = {"device_map": "auto", "torch_dtype": torch.float16}

    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model = LlavaQwenForCausalLM.from_pretrained(
        model_path,
        low_cpu_mem_usage=True,
        attn_implementation=attn_implementation,
        **kwargs,
    )

    image_processor = None
    if "llava" in model_name.lower():
        # Register the extra multimodal tokens the model config expects.
        mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
        mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
        if mm_use_im_patch_token:
            tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
        if mm_use_im_start_end:
            tokenizer.add_tokens(
                [DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True
            )
        model.resize_token_embeddings(len(tokenizer))

        # Load the vision tower explicitly; this is the step that was silently skipped.
        vision_tower = model.get_vision_tower()
        if not vision_tower.is_loaded:
            vision_tower.load_model(device_map=device_map)
        if device_map != "auto":
            vision_tower.to(device="cuda", dtype=torch.float16)
        image_processor = vision_tower.image_processor

    return model, tokenizer, image_processor
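For completeness, here is a minimal sketch (assembled by me, loosely following the model card's inference example) of how the returned objects can be used for video inference; the llava helpers (tokenizer_image_token, conv_templates, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN), the "qwen_1_5" template name, and the "video.mp4" path are assumptions that may differ in your setup:

import copy

import numpy as np
import torch
from decord import VideoReader, cpu

# These helpers follow the model card's example; names may vary between llava versions.
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import tokenizer_image_token

model, tokenizer, image_processor = load_model()
model.eval()

# Sample 16 frames uniformly from a local video ("video.mp4" is a placeholder path).
vr = VideoReader("video.mp4", ctx=cpu(0), num_threads=1)
frame_idx = np.linspace(0, len(vr) - 1, 16, dtype=int).tolist()
frames = vr.get_batch(frame_idx).asnumpy()

# Preprocess the frames with the vision tower's image processor.
video = image_processor.preprocess(frames, return_tensors="pt")["pixel_values"]
video = video.to(model.device, dtype=torch.float16)

# Build the prompt with the conversation template used in the model card ("qwen_1_5").
conv = copy.deepcopy(conv_templates["qwen_1_5"])
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nPlease describe this video in detail.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=[video],
        modalities=["video"],
        do_sample=False,
        max_new_tokens=512,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())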
Martins6 changed discussion status to open
Martins6 changed discussion title from LlavaQwenForCausalLM not found. to tokenizer_image_token not working
Martins6 changed discussion title from tokenizer_image_token not working to Missing steps

Hi, I successfully ran the inference code with the 7B model, but encountered an issue when switching to the 32B model. Have you experienced any problems running the 32B model?

hey @RachelZhou , I don't have enough compute to test that :/
But if I do, I'll report back to you! I hope the tips I gave here help you out with your project.

hey @RachelZhou , I did try it and got some buggy results too. I don't have the traceback, unfortunately. But the 72B model runs super smoothly! Hope it helps!

I could run the original code once I ensured flash-attn was successfully installed!
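In case it helps, a small sketch (my own, not from the repo) for picking the attention implementation based on whether flash-attn is importable:

import importlib.util

# Use flash attention only when the flash_attn package is actually importable.
if importlib.util.find_spec("flash_attn") is not None:
    attn_implementation = "flash_attention_2"
else:
    attn_implementation = None
print("Using attn_implementation:", attn_implementation)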

Thank you for sharing your experience!!

Hi @Martins6 I’m currently using an NVIDIA A100-SXM4-40GB GPU to run the 72B model but have found it insufficient for the task. I’m curious to know which GPU(s) or resources you are using for this model.

An AWS EC2 g5.48xlarge instance! G5 instances feature up to 8 NVIDIA A10G Tensor Core GPUs and second-generation AMD EPYC processors.
