abrooks9944 committed
Commit b04f90b · verified · 1 Parent(s): 32d223b

Use Negative Feature Layer Indices


There is a misalignment in the feature layers currently being used between transformers and vLLM (the current values are correct for vLLM and off by one for transformers). In transformers, the first entry of the hidden states is the input embedding ([here](https://github.com/huggingface/transformers/blob/main/src/transformers/models/siglip/modeling_siglip.py#L900)). In vLLM, however, this is not the case for the way the hidden states pool is formed ([here](https://github.com/alex-jw-brooks/vllm/blob/main/vllm/model_executor/models/clip.py#L253)).

In other words, the hidden states array in `transformers` contains 28 entries:
`[emb, h0, h1, ..., h26]`

while the hidden states pool in `vLLM` contains 27 entries:
`[h0, h1, ..., h26]`

The config reflects the correct values for what is used in vLLM, but is off by one in transformers. Both projects support negative indexing into the hidden states (with offset handling in vLLM, since only layers up to the deepest feature layer needed are loaded) - this PR changes the vision feature layers to use negative indices, which fixes the misalignment in transformers without changing the output in vLLM (no code changes needed).
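A minimal sketch of why the negative indices resolve to the same layers in both projects (plain Python lists standing in for the hidden-state pools; the `emb`/`hN` names are illustrative, not actual API identifiers):

```python
# Illustrative hidden-state pools for a 27-layer vision tower.
# transformers prepends the input embedding; vLLM does not.
transformers_pool = ["emb"] + [f"h{i}" for i in range(27)]  # 28 entries
vllm_pool = [f"h{i}" for i in range(27)]                    # 27 entries

# The old positive indices are only correct for the vLLM pool:
old_layers = [3, 7, 15, 26]
print([vllm_pool[i] for i in old_layers])          # ['h3', 'h7', 'h15', 'h26']
print([transformers_pool[i] for i in old_layers])  # ['h2', 'h6', 'h14', 'h25'] - off by one

# Negative indices count from the end, so they skip past the
# extra embedding entry and select the same layers in both pools:
new_layers = [-24, -20, -12, -1]
assert [vllm_pool[i] for i in new_layers] == [transformers_pool[i] for i in new_layers]
print([vllm_pool[i] for i in new_layers])          # ['h3', 'h7', 'h15', 'h26']
```

Because the mismatch is a single extra entry at the *front* of the transformers array, indexing from the *back* makes the two layouts agree without any code changes.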

I will also submit a PR to vLLM to add the embeddings to the hidden state pool if all hidden states are requested from the visual encoder.

Files changed (1): config.json (+4 -4)
config.json CHANGED

```diff
@@ -113,10 +113,10 @@
   "model_type": "llava_next",
   "use_image_newline_parameter": true,
   "vision_feature_layer": [
-    3,
-    7,
-    15,
-    26
+    -24,
+    -20,
+    -12,
+    -1
   ],
   "vision_feature_select_strategy": "full",
   "text_config": {
```