HuggingFace version does NOT use efficient MLA caching
As the title suggests, the version of this model provided in modeling_deepseek.py
does NOT make use of the efficient MLA caching mechanism pioneered by DeepSeek V2 and V3.
The relevant code is here: https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/modeling_deepseek.py#L810
```python
if past_key_value is not None:
    cache_kwargs = {"sin": sin, "cos": cos}  # Specific to RoPE models
    key_states, value_states = past_key_value.update(
        key_states, value_states, self.layer_idx, cache_kwargs
    )
```
Notice that the full-fat keys and values are cached after up-projection, rather than caching `compressed_kv` (or `self.kv_a_layernorm(compressed_kv)`) and `k_pe`, as is done in the native implementation in the DeepSeek repo.
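To make the difference concrete, here is a rough back-of-the-envelope comparison of what each scheme stores per layer. The dimensions are assumptions loosely based on the published DeepSeek-V3-style config (128 heads, `kv_lora_rank=512`, `qk_rope_head_dim=64`, etc.), so treat the exact ratio as illustrative only:

```python
import torch

# Assumed sizes, loosely based on the DeepSeek-V3/R1 config
batch, seq_len = 1, 1024
num_heads = 128
qk_nope_head_dim, qk_rope_head_dim = 128, 64
v_head_dim = 128
kv_lora_rank = 512

# What the HF port caches per layer: full up-projected keys and values, per head
full_keys = torch.empty(batch, num_heads, seq_len, qk_nope_head_dim + qk_rope_head_dim)
full_values = torch.empty(batch, num_heads, seq_len, v_head_dim)
full_cache_elems = full_keys.numel() + full_values.numel()

# What MLA caching stores per layer: the shared compressed latent plus the
# decoupled RoPE key, neither of which carries a per-head dimension
compressed_kv = torch.empty(batch, seq_len, kv_lora_rank)
k_pe = torch.empty(batch, seq_len, qk_rope_head_dim)
mla_cache_elems = compressed_kv.numel() + k_pe.numel()

print(f"full KV cache:    {full_cache_elems:,} elements per layer")
print(f"MLA latent cache: {mla_cache_elems:,} elements per layer")
print(f"ratio:            {full_cache_elems / mla_cache_elems:.1f}x")
```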
I imagine this was done because of the way HF handles the `Cache` abstraction: the devs porting this for HF presumably didn't want to deal with incompatibilities between typical MHA caching and MLA.
However, by storing `k_pe` as the keys and `compressed_kv` as the values, we would be able to use efficient MLA caching AND support cache-managed rotary embeddings. Additionally, by fiddling with the config class we can 'trick' the special `Cache` variants, like Static or Sink caches, into pre-allocating the correct tensor shapes, which would otherwise be incompatible with MLA.