Target_module of this phi-3-small model

#3
by hackint0sh - opened

After loading the model, use

for name, module in model.named_modules():
    print(name)

to list the module names of the layers.

For this model, the relevant modules are [ up_proj, down_proj ].
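
If the goal is LoRA fine-tuning (which is where target_modules usually comes up), here is a minimal sketch of how these names would be used with peft, assuming that is the intent; the r / lora_alpha values are placeholders, not recommendations:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-small-128k-instruct", trust_remote_code=True
)

# Attach LoRA adapters to the MLP projections listed above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()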

Microsoft org

I'm sorry, but would it be possible to clarify a bit more on the question or provide some additional context ? I'm not sure I understand the issue.

import transformers

model_name = "microsoft/Phi-3-small-128k-instruct"  # Replace with your desired Phi-3-Small variant
# Phi-3-Small is a decoder-only (causal) LM and ships custom modeling code,
# so load it with AutoModelForCausalLM and trust_remote_code=True.
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
)

for name, module in model.named_modules():
    print(name)

Running this code prints every module in the model, including the per-layer submodules. For Phi-3-Small you can expect output along the lines of:

model.layers.0.mlp.up_proj
model.layers.0.mlp.down_proj
... # other modules in the model

This shows that the key layer modules in Phi-3-Small are named up_proj and down_proj. Consult the Phi-3 documentation for a detailed explanation of their roles within the model's architecture.
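
If you would rather not eyeball the full printout, a small helper like this (my own sketch, reusing the model loaded above) collects just the unique leaf names of the Linear modules, which is handy when picking target_modules:

import torch.nn as nn

# Collect the unique leaf names of every nn.Linear in the model.
linear_leaf_names = sorted(
    {name.split(".")[-1] for name, module in model.named_modules() if isinstance(module, nn.Linear)}
)
print(linear_leaf_names)  # should include 'up_proj' and 'down_proj'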

Microsoft org

That is accurate.
up_proj and down_proj are part of the MLP layer with a GEGLU activation (https://arxiv.org/pdf/2002.05202).
See this line.
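
For reference, a GEGLU MLP splits the up projection into a value half and a gate half, applies GELU to the gate, and multiplies the two before the down projection. A minimal sketch of the idea (my own illustration, not the exact Phi-3-Small implementation; the dimensions are placeholders):

import torch
import torch.nn as nn
import torch.nn.functional as F

class GegluMLP(nn.Module):
    """Sketch of an MLP block with GEGLU activation: down_proj(value * GELU(gate))."""
    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        # up_proj produces both halves at once: [value | gate]
        self.up_proj = nn.Linear(hidden_size, 2 * intermediate_size)
        self.down_proj = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x):
        value, gate = self.up_proj(x).chunk(2, dim=-1)
        return self.down_proj(value * F.gelu(gate))

mlp = GegluMLP(hidden_size=8, intermediate_size=16)
print(mlp(torch.randn(2, 8)).shape)  # torch.Size([2, 8])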

I get a runtime error when running inference with device_map="auto". Does it only work with a single GPU for inference?
This problem only happens with small; medium and mini work just fine. :shrug:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1! (when checking argument for argument tensors in method wrapper_CUDA_cat)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)
model_id = "microsoft/Phi-3-small-8k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, 
    torch_dtype="auto", 
    trust_remote_code=True,
    device_map="auto",
)
assert torch.cuda.is_available(), "This model needs a GPU to run ..."
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)

generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]['generated_text'])
Microsoft org
•
edited May 24, 2024

Huh, interesting.
For some reason, it seems like the pipeline allocated the model on one GPU and the input tensors on another (one on "cuda:0", the other on "cuda:1").
I'd say it might be better to explicitly control the device placement, just to avoid any confusion. Copying from the README below:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "microsoft/Phi-3-small-8k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    trust_remote_code=True,
)
assert torch.cuda.is_available(), "This model needs a GPU to run ..."
device = torch.cuda.current_device()  # <----- Explicitly specifying the device to send the model to
model = model.to(device)  # <----- Send the model to the particular device
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving an 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=device  # <----- Also tell the pipeline to use the same device while creating the input tensors
)

Let me know if this fixes the issue.

By multi-GPU inference, do you want to do data-parallel inference, or tensor slicing?
Data parallelism can be done by running the script with any launcher of your choice (torchrun/deepspeed/mpi); just set the current device correctly based on the local rank, and that should work imo.
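
For concreteness, a minimal sketch of the data-parallel case (my own example, not from the README): each process launched by torchrun picks its GPU from LOCAL_RANK and runs generation on its own slice of the prompts. The script name and prompt split are illustrative.

import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Launch with: torchrun --nproc_per_node=<num_gpus> infer.py
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
device = torch.cuda.current_device()

model_id = "microsoft/Phi-3-small-8k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", trust_remote_code=True
).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device)

# Each rank handles its own slice of the prompts (simple round-robin split).
prompts = [f"Prompt {i}" for i in range(8)]
my_prompts = prompts[local_rank::int(os.environ.get("WORLD_SIZE", 1))]
for p in my_prompts:
    out = pipe([{"role": "user", "content": p}], max_new_tokens=50, return_full_text=False)
    print(f"[rank {local_rank}] {out[0]['generated_text']}")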

Tensor slicing is a separate problem: it is hard to give more info without knowing how you want to do the tensor slicing.

Thanks. Assigning both the pipeline and the model to the same device works.
I'm still not sure why setting device_map="auto" only fails for small but not for medium or mini?
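
For anyone who wants to dig into this, one thing worth checking (my own suggestion) is how accelerate actually sharded the model, which is recorded in hf_device_map after loading with device_map="auto":

from transformers import AutoModelForCausalLM

# Re-load with device_map="auto" and inspect how the modules were placed across GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-small-8k-instruct",
    torch_dtype="auto",
    trust_remote_code=True,
    device_map="auto",
)
print(model.hf_device_map)  # e.g. {'model.embed_tokens': 0, ..., 'lm_head': 1}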

I have tried this on an A10G with the following code:

model_id = "microsoft/Phi-3-small-128k-instruct"
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # load the model with flash-attention support
    torch_dtype=torch.bfloat16,
    device_map=None
)
model = AutoModelForCausalLM.from_pretrained(model_id, **model_kwargs)
assert torch.cuda.is_available(), "This model needs a GPU to run ..."
device = torch.cuda.current_device()  
model = model.to(device)  
tokenizer = AutoTokenizer.from_pretrained(model_id)

The code still throws the error:

AssertionError: Flash Attention is not available, but is needed for dense attention

@hackint0sh Hi there! The inference code (here) assumes that flash-attn is installed.

Run pip install flash-attn to fix the error.

$ pip install flash-attn

Cheers!
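
If the plain install fails while building the wheel, the flash-attn project's own install notes suggest disabling build isolation (this assumes a CUDA toolkit and a matching PyTorch are already present in the environment):

$ pip install flash-attn --no-build-isolation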

nguyenbh changed discussion status to closed


Running pip install flash-attn doesn't work for me:
Traceback (most recent call last):
  File "/home/ubuntu/Multimodal-Uncertainty-Quantification/playground/construct_graph2.py", line 24, in <module>
    model = AutoModelForCausalLM.from_pretrained("numind/NuExtract-large", torch_dtype=torch.bfloat16, trust_remote_code=True)
  File "/home/ubuntu/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 559, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ubuntu/miniconda3/envs/llava/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3788, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 903, in __init__
    self.model = Phi3SmallModel(config)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 745, in __init__
    self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 745, in <listcomp>
    self.layers = nn.ModuleList([Phi3SmallDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 651, in __init__
    self.self_attn = Phi3SmallSelfAttention(config, layer_idx)
  File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/numind/NuExtract-large/fc8e001871f4a6be8e6079093b33de334a2316c9/modeling_phi3_small.py", line 218, in __init__
    assert is_flash_attention_available, "Flash Attention is not available, but is needed for dense attention"
AssertionError: Flash Attention is not available, but is needed for dense attention

Hi @tpadhi1! 🤗 The error message is shown when this code block fails, which implies that the following code snippet will raise an ImportError in your environment:

import flash_attn
if int(flash_attn.__version__.split('.')[0]) < 2:
    from flash_attn.flash_attn_interface import (
        flash_attn_func,
        flash_attn_unpadded_kvpacked_func as flash_attn_varlen_kvpacked_func,
        )

    # rename `max_seqlen`
    def flash_attn_varlen_qkvpacked_func(qkv, cu_seqlens, max_seqlen, dropout_p=0.0, **kwargs):
        return flash_attn_func(qkv, cu_seqlens, dropout_p=dropout_p, max_s=max_seqlen, **kwargs)

else:
    from flash_attn.flash_attn_interface import (
        flash_attn_varlen_kvpacked_func,
    )
    from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input
is_flash_attention_available = True

Can you run the above code? It should raise an exception, which will help you narrow down the root cause.
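
If it is easier, here is a quick way (my own sketch) to surface the underlying import problem directly, without loading the model at all:

# Surface the underlying ImportError instead of the assertion in the modeling code.
try:
    import flash_attn
    print("flash-attn is importable, version:", flash_attn.__version__)
except ImportError as e:
    print("flash-attn import failed:", e)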
