Inference on two H100s doesn't work

#5
by Maverick17 - opened

Hi,

Inference doesn't work for me on two H100s, even with the code you've provided:

Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)

OpenGVLab org

Hi, it seems that the image is not placed on the correct GPU. You can try moving it to the right device.

I could fix it like this:

FIX: inv_freq_expanded is on the CPU, which causes the matrix multiplication to fail.

@torch.no_grad()
def rot_embed_forward_fix(self, x, position_ids):
    if "dynamic" in self.rope_type:
        self._dynamic_frequency_update(position_ids, device=x.device)

    # Core RoPE block
    inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device) # FIX
    position_ids_expanded = position_ids[:, None, :].float()
    # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
    device_type = x.device.type
    device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
    with torch.autocast(device_type=device_type, enabled=False):
        freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos()
        sin = emb.sin()

    # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
    cos = cos * self.attention_scaling
    sin = sin * self.attention_scaling

    return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)

    self.model = AutoModel.from_pretrained(
       path,
       torch_dtype=torch.bfloat16,
       load_in_8bit=True, 
       load_in_4bit=False,
       low_cpu_mem_usage=True,
       trust_remote_code=True,
       device_map=device_map
    ).eval()

    if '40B' in self.model_name or '76B' in self.model_name:
       self.model.language_model.model.rotary_emb.__class__.forward = rot_embed_forward_fix 

My issue was that inv_freq_expanded was on the CPU. The only way I could find to fix this was to override model.language_model.model.rotary_emb.forward with the function above.

For me this is necessary for the Llama-family language models (the 40B and 76B models). I am running on P40s.
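A note on the override: assigning to `rotary_emb.__class__.forward` replaces the method on the class, so every instance of that class picks up the patch, not only the one that was indexed. Here is a minimal pure-Python sketch of that behaviour. `Rotary`, `patched_forward`, and the returned strings are hypothetical stand-ins for illustration, not transformers code. Binding with `types.MethodType` is shown as an alternative in case only a single instance should change.

```python
import types


class Rotary:
    """Hypothetical stand-in for a rotary embedding module."""

    def forward(self, x):
        return f"original({x})"


def patched_forward(self, x):
    """Hypothetical replacement, analogous to rot_embed_forward_fix."""
    return f"patched({x})"


a, b = Rotary(), Rotary()

# Per-instance patch: only `a` changes, `b` keeps the original method.
a.forward = types.MethodType(patched_forward, a)
per_instance = (a.forward(1), b.forward(1))

# Class-level patch, as in the snippet above: every instance without its
# own override now resolves `forward` to the patched function.
del a.forward
a.__class__.forward = patched_forward
class_level = (a.forward(1), b.forward(1))

print(per_instance)
print(class_level)
```

The class-level form is convenient when a model holds several rotary embedding modules of the same class, but it also affects any other model loaded in the same process that uses that class.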

@HondaVfr800 @czczup I am unable to replicate your code. I am defining a custom chat model so that I can also use the model with langchain. I am running on 4 Nvidia A10Gs. Does the following look correct?

from typing import Any, List, Optional
from langchain_core.language_models import BaseChatModel
from langchain_core.messages import AIMessage, BaseMessage
from langchain_core.outputs import ChatGeneration, ChatResult
import torch
import math
from transformers import AutoModel, AutoTokenizer



@torch.no_grad()
def rot_embed_forward_fix(self, x, position_ids):
    if "dynamic" in self.rope_type:
        self._dynamic_frequency_update(position_ids, device=x.device)

    # Core RoPE block
    inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1).to(x.device) # FIX
    position_ids_expanded = position_ids[:, None, :].float()
    # Force float32 (see https://github.com/huggingface/transformers/pull/29285)
    device_type = x.device.type
    device_type = device_type if isinstance(device_type, str) and device_type != "mps" else "cpu"
    with torch.autocast(device_type=device_type, enabled=False):
        freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
        emb = torch.cat((freqs, freqs), dim=-1)
        cos = emb.cos()
        sin = emb.sin()

    # Advanced RoPE types (e.g. yarn) apply a post-processing scaling factor, equivalent to scaling attention
    cos = cos * self.attention_scaling
    sin = sin * self.attention_scaling

    return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)

class CustomChatModel(BaseChatModel):
    model : Any = None
    tokenizer : Any = None
    generation_config : dict = None
    def __init__(self, model_path: str, model_name):
        super().__init__()
        self.model = AutoModel.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            load_in_8bit=True,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
            device_map=self.split_model(model_name)
        ).eval()

        self.model.language_model.model.rotary_emb.__class__.forward = rot_embed_forward_fix
        
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_path, 
            trust_remote_code=True, 
            use_fast=False
        )

        self.generation_config = dict(max_new_tokens=1024, do_sample=True, temperature=0.001)

    def _generate(
        self,
        messages: List[BaseMessage],
        stop: Optional[List[str]] = None,
        run_manager: Optional[Any] = None,
        **kwargs: Any
    ) -> ChatResult:
        """Override the _generate method to implement the chat model logic.
        Args:
            messages (List[BaseMessage]): The list of messages to generate responses from.
            stop (Optional[List[str]], optional): The list of stop words. Defaults to None.
            run_manager (Optional[Any], optional): The run manager. Defaults to None.
        returns:
            ChatResult: The chat result as a LangChain object to be used by parsers.
        """
        prompt = messages[-1].content

        response = self.model.chat(
            self.tokenizer, 
            None, 
            prompt, 
            self.generation_config
        )
        message = AIMessage(content=response)

        generation = ChatGeneration(message=message)
        return ChatResult(generations=[generation])

    def chat(self, pixel_values,prompt,generation_config: Optional[dict] = None, num_patches_list = None) -> str:
        """ Generate a response to a multimodal input.
        Args:
            pixel_values (torch.Tensor): Pixel values of the input image.
            prompt (str): The prompt to generate a response from.
            generation_config (Optional[dict], optional): The generation config. Defaults to None.
        Returns:
            str: The generated response.
        """
        if generation_config is None: generation_config = self.generation_config

        if num_patches_list is None:
            return self.model.chat(self.tokenizer, pixel_values, prompt, generation_config)
        else:
            return self.model.chat(self.tokenizer, pixel_values, prompt, generation_config, num_patches_list=num_patches_list)
        
    def split_model(self,model_name):
        device_map = {}
        world_size = torch.cuda.device_count()
        num_layers = {
            'InternVL2-1B': 24, 'InternVL2-2B': 24, 'InternVL2-4B': 32, 'InternVL2-8B': 32,
            'InternVL2-26B': 48, 'InternVL2-40B': 60, 'InternVL2-Llama3-76B': 80}[model_name]
        # Since the first GPU will be used for ViT, treat it as half a GPU.
        num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
        num_layers_per_gpu = [num_layers_per_gpu] * world_size
        num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
        layer_cnt = 0
        for i, num_layer in enumerate(num_layers_per_gpu):
            for j in range(num_layer):
                device_map[f'language_model.model.layers.{layer_cnt}'] = i
                layer_cnt += 1
        device_map['vision_model'] = 0
        device_map['mlp1'] = 0
        device_map['language_model.model.tok_embeddings'] = 0
        device_map['language_model.model.embed_tokens'] = 0
        device_map['language_model.output'] = 0
        device_map['language_model.model.norm'] = 0
        device_map['language_model.lm_head'] = 0
        device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

        return device_map
    
    @property
    def _llm_type(self) -> str:
        """Get the type of language model used by this chat model."""
        return "InternVL2-40B"

# Usage example
if __name__ == "__main__":
    # Example instantiation and usage
    model_path = "./supply/InternVL2-2B" # Path to the model
    # model_name must be a key of the num_layers table in split_model
    model = CustomChatModel(model_path=model_path, model_name="InternVL2-2B")
    
    # Example conversation
    prompt = "Hello, who are you?"
    output = model.invoke(prompt)
    print(output)
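
To make the layer split in `split_model` easier to check by hand, the sketch below reproduces only its arithmetic, with the GPU count passed in as a parameter instead of read from `torch.cuda.device_count()`. The helper name `layers_per_gpu` is hypothetical; the formula is copied from the method above.

```python
import math


def layers_per_gpu(num_layers: int, world_size: int) -> list:
    """Reproduce the per-GPU layer counts computed in split_model."""
    # GPU 0 also hosts the ViT, so it counts as half a GPU.
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    counts = [per_gpu] * world_size
    counts[0] = math.ceil(counts[0] * 0.5)
    return counts


print(layers_per_gpu(80, 4))  # InternVL2-Llama3-76B on 4 GPUs
print(layers_per_gpu(60, 4))  # InternVL2-40B on 4 GPUs
```

For 80 layers on 4 GPUs this gives 12 layers on GPU 0 and 23 on each of the others. Because of the two ceilings the counts can add up to more than `num_layers` (81 here), so the loop in `split_model` can emit a device-map key for a layer index that does not exist in the model.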
OpenGVLab org

We have not explored supporting langchain yet. We also welcome contributions from the community. Are there any problems running this code?

OpenGVLab org

Hello, thank you for your feedback. Could you please let me know which version of Transformers you are using? I have seen related issues where this error occurs when using newer versions of Transformers, such as 4.44.0. If you downgrade to 4.37.2, the issue can be resolved.
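
A quick way to check which side of that version gap an environment is on is sketched below. It only compares the installed version against 4.37.2, the version reported above as working; the exact release where the behaviour changed is not stated in this thread, so treat the comparison as a hint rather than a precise boundary. The helper names are hypothetical.

```python
import re
from importlib import metadata

KNOWN_GOOD = "4.37.2"  # version reported in this thread as working


def version_tuple(version: str) -> tuple:
    """Parse the leading numeric components, e.g. '4.45.0.dev0' -> (4, 45, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_newer_than_known_good(installed: str, known_good: str = KNOWN_GOOD) -> bool:
    """Return True if `installed` is strictly newer than `known_good`."""
    return version_tuple(installed) > version_tuple(known_good)


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        if is_newer_than_known_good(installed):
            print(f"transformers {installed} is newer than {KNOWN_GOOD}; "
                  "the rotary embedding error may apply")
        else:
            print(f"transformers {installed} is not newer than {KNOWN_GOOD}")
```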

Hello, I was able to resolve the issues I was facing by using the rotary encoding fix above and by setting the following environment variable:
Set PYTORCH_CUDA_ALLOC_CONF=backend:cudaMallocAsync

Additionally, running torch.cuda.empty_cache() after each inference was extremely useful for making sure we didn't run out of GPU memory.
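
For anyone who prefers to do both steps from Python, here is a minimal sketch. It assumes the variable is set before the first CUDA allocation (setting it before importing torch is the simplest way to ensure that), and it overwrites any existing value of the variable. `run_and_release` is a hypothetical helper name; it skips the cache call when torch or CUDA is not available.

```python
import os

# Must be set before the first CUDA allocation; doing it before
# importing torch is the simplest way to guarantee that.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "backend:cudaMallocAsync"


def run_and_release(infer, *args, **kwargs):
    """Call `infer`, then release cached GPU memory if torch and CUDA are present."""
    try:
        return infer(*args, **kwargs)
    finally:
        try:
            import torch
        except ImportError:
            torch = None
        if torch is not None and torch.cuda.is_available():
            torch.cuda.empty_cache()
```

With the class from earlier in this thread, a call would look like `run_and_release(llm.chat, pixel_values, prompt)`.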

Finally, this code is fully operational as part of a langchain agent and is simple to use. For example, the following code creates a custom model and then instantiates an output-fixing parser (https://python.langchain.com/v0.1/docs/modules/model_io/output_parsers/types/output_fixing/):
llm = CustomChatModel(model_path, model_name="InternVL2-40B")
fix_parser = OutputFixingParser.from_llm(parser=parser, llm=llm)

If you would like me to make a community contribution, kindly direct me to the best way to do so.

OpenGVLab org

Thank you very much for the feedback.

czczup changed discussion status to closed
