Clarifications on how to use YaRN

#5
by Downtown-Case - opened

I'm trying to implement YaRN for Qwen 2.5 in a longer context framework and wrap my head around the transformers implementation here:

https://github.com/huggingface/transformers/blob/2e24ee4dfa39cc0bc264b89edbccc373c8337086/src/transformers/modeling_rope_utils.py#L163

The documentation mentions we are supposed to add this to the config for >32K usage with Qwen 2.5:

{
  ...,
  "rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
  }
}

But it appears the transformers implementation doesn't actually read original_max_position_embeddings, but rather max_position_embeddings in the plain model config.

So let's say we want to run Qwen2.5 at 64K context in plain HF transformers... what exactly do I set? Do I change max_position_embeddings to 64K, or leave it at 32K and let the framework "override" it? Because that's what it's going to read when computing the YaRN scaling factors: https://github.com/huggingface/transformers/blob/2e24ee4dfa39cc0bc264b89edbccc373c8337086/src/transformers/modeling_rope_utils.py#L192

And... is this factor somehow dynamic? I don't see any trigger in transformers that makes it recompute the scale.

I think it's supposed to be like that.
Here's an example from the old Llama 2 models:
https://huggingface.co/NousResearch/Yarn-Llama-2-13b-128k/blob/main/config.json

Qwen org

Unfortunately, I don't think it's currently possible to use transformers for 128K, as YaRN is not supported in transformers. Please use vLLM.

It is though; it's right in the code block I linked?

I ported the same code to exllama, and it seems to work.

Oh, sorry, I missed that. I don't remember us implementing YaRN for Qwen2, but thanks to the HF staff, who are so helpful, it is indeed supported now (since transformers>=4.45.0).

That part of the configuration in the README/model card was originally meant for vLLM, which reads original_max_position_embeddings. But I think using it should also be okay for transformers.

Based on the code you linked, the setting should be the following for transformers:

{
  ...,
  "max_position_embeddings": 32768,
  "rope_scaling": {
    "factor": 4.0,
    "type": "yarn"
  }
}

(max_position_embeddings is already 32768 in config.json.)
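For anyone who prefers to patch the config programmatically, here is a minimal sketch of the setting above, applied to a plain dict. The `with_yarn` helper and the trimmed `base_config` are hypothetical illustrations, not part of transformers or the Qwen repo; a real config.json has many more keys.

```python
import copy
import json


def with_yarn(config: dict, factor: float) -> dict:
    """Return a copy of a config.json dict with YaRN rope scaling enabled.

    Hypothetical helper for illustration only.
    """
    patched = copy.deepcopy(config)
    # max_position_embeddings is left at its pretrained value (32768 here),
    # since the linked transformers code reads it as the original length.
    patched["rope_scaling"] = {"factor": float(factor), "type": "yarn"}
    return patched


# Hypothetical, heavily trimmed stand-in for a Qwen2.5 config.json.
base_config = {"model_type": "qwen2", "max_position_embeddings": 32768}
yarn_config = with_yarn(base_config, factor=4.0)
print(json.dumps(yarn_config, indent=2))
```

The original dict is left untouched, so the unscaled config can still be used for short-context runs.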

So let's say we want to run Qwen2.5 at 64K context in plain HF transformers... what exactly do I set? Do I change max_position_embeddings to 64K, or leave it at 32K and let the framework "override" it?

What matters for Qwen2 with YaRN is the rope scaling factor. We've tested factor=4, and the context length can be extended to 128K, but accuracy at shorter lengths may degrade. For 64K support, you may also need to set factor=4, but factor=2 may be okay too, depending on your evaluation results.
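As a rough sanity check on those numbers: the nominal window is the original length times the factor, so the smallest factor that covers a target length is just the ratio. The `min_yarn_factor` helper below is a hypothetical illustration of that arithmetic only; it says nothing about accuracy, which still has to be evaluated.

```python
def min_yarn_factor(target_len: int, original_len: int = 32768) -> float:
    """Smallest factor whose scaled window (original_len * factor) covers target_len.

    Hypothetical helper; a larger factor may still work better in practice.
    """
    return max(1.0, target_len / original_len)


factor_64k = min_yarn_factor(64 * 1024)    # 65536 / 32768
factor_128k = min_yarn_factor(128 * 1024)  # 131072 / 32768
print(factor_64k, factor_128k)
```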

And... is this factor somehow dynamic? I don't see any trigger in transformers that makes it recompute the scale.

"Dynamic" and "static" are kind of vague here. YaRN is static in the sense that the scaling is done in the same manner for all sequence lengths, so it can be precomputed and cached. It is not like DynamicNTK, where the scaling is different for different sequence lengths.
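To make the static/dynamic distinction concrete, here is a pure-Python sketch modeled on the YaRN and dynamic NTK functions in the linked modeling_rope_utils.py. It is not the library code itself. The values dim=128 and base=1,000,000 are assumptions for illustration (they are not stated in this thread), and beta_fast=32 / beta_slow=1 are the defaults I believe the linked code uses.

```python
import math


def yarn_inv_freq(dim, base, original_max_pos, factor, beta_fast=32.0, beta_slow=1.0):
    """YaRN inverse frequencies. Note there is no seq_len argument: the result
    depends only on fixed config values, so it can be computed once and cached."""

    def correction_dim(num_rotations):
        # Dimension index at which a frequency completes num_rotations
        # rotations over the original context window.
        return dim * math.log(original_max_pos / (num_rotations * 2 * math.pi)) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001  # avoid division by zero in the ramp

    inv_freq = []
    for i in range(dim // 2):
        extrapolated = 1.0 / base ** (2 * i / dim)  # unscaled RoPE frequency
        interpolated = extrapolated / factor        # position-interpolated frequency
        # ramp = 0 keeps the original frequency, ramp = 1 fully interpolates.
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        inv_freq.append(interpolated * ramp + extrapolated * (1.0 - ramp))

    attention_factor = 0.1 * math.log(factor) + 1.0
    return inv_freq, attention_factor


def dynamic_ntk_base(dim, base, original_max_pos, factor, seq_len):
    """Dynamic NTK: the RoPE base is recomputed from the current seq_len."""
    seq_len = max(seq_len, original_max_pos)
    return base * ((factor * seq_len / original_max_pos) - (factor - 1)) ** (dim / (dim - 2))


# Assumed values for illustration: dim=128, base=1e6.
inv_freq, attn = yarn_inv_freq(dim=128, base=1_000_000.0, original_max_pos=32768, factor=4.0)
ntk_base_32k = dynamic_ntk_base(128, 1_000_000.0, 32768, 4.0, seq_len=32768)
ntk_base_64k = dynamic_ntk_base(128, 1_000_000.0, 32768, 4.0, seq_len=65536)
print(inv_freq[0], inv_freq[-1], attn, ntk_base_32k, ntk_base_64k)
```

In this sketch the highest-frequency dimensions are left unscaled and the lowest-frequency ones are divided by the factor, regardless of how long the input is. The dynamic NTK base, by contrast, changes once seq_len exceeds the original window.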
