Since Transformers are everywhere, it's helpful to understand RoPE (Rotary Position Embedding). Token order matters, and RoPE encodes it by rotating the query and key vectors by an angle that depends on each token's position, so the model can tell which token comes first, second, and so on, and how far apart any two tokens are.
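To make the rotation idea concrete, here is a minimal NumPy sketch of vanilla RoPE (the function name and shapes are illustrative, not taken from any particular library): the vectors are split into 2D pairs and each pair is rotated by a position-dependent angle, so the dot product between a rotated query and key depends on the relative distance between their tokens.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Rotate pairs of dimensions in x by position-dependent angles (vanilla RoPE)."""
    # x: (seq_len, dim) with dim even; positions: (seq_len,)
    dim = x.shape[-1]
    # One frequency per dimension pair, decaying geometrically as in the RoPE paper
    freqs = base ** (-np.arange(0, dim, 2) / dim)          # (dim/2,)
    angles = positions[:, None] * freqs[None, :]           # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                        # split into 2D pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                     # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Queries and keys are rotated before attention, so q·k depends on relative position
q = rope_rotate(np.random.randn(6, 8), np.arange(6))
k = rope_rotate(np.random.randn(6, 8), np.arange(6))
```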
Here are 8 RoPE variants, each suited to a different use case:
4. Multimodal RoPE (MRoPE) -> Qwen2.5-VL Technical Report (2502.13923) Decomposes the positional embedding into 3 components (temporal, height, and width) so that positional features are aligned across modalities: text, images, and videos. See the sketch after this list.
8. XPos (Extrapolatable Position Embedding) -> https://huggingface.co/papers/2212.10 Introduces an exponential decay factor into the rotation, scaling queries and keys in opposite directions so attention stays stable on long sequences. A sketch follows below.
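Here is a hedged NumPy sketch of the MRoPE idea from item 4 (the function name and the even three-way split of dimensions are assumptions for illustration, not the exact Qwen2.5-VL layout): different groups of rotary frequencies are driven by temporal, height, and width indices, and for plain text all three indices coincide, so it reduces to ordinary 1D RoPE.

```python
import numpy as np

def mrope_rotate(x, pos_t, pos_h, pos_w, base=10000.0):
    """Sketch of Multimodal RoPE: rotary frequency pairs are split into three
    groups, each rotated by a different position index (temporal, height, width).
    The even split into thirds is a simplification, not the model's real layout."""
    dim = x.shape[-1]
    n_pairs = dim // 2
    freqs = base ** (-np.arange(0, dim, 2) / dim)                 # (n_pairs,)
    section = n_pairs // 3
    # Each frequency pair is assigned one of the three position components
    positions = np.concatenate([
        np.tile(pos_t[:, None], (1, section)),
        np.tile(pos_h[:, None], (1, section)),
        np.tile(pos_w[:, None], (1, n_pairs - 2 * section)),
    ], axis=1)                                                    # (seq_len, n_pairs)
    angles = positions * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Text tokens: all three indices are identical, so MRoPE behaves like 1D RoPE.
# Image patches: pos_h / pos_w follow the patch grid while pos_t stays constant.
```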
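And a matching sketch of the XPos idea from item 8 (the function names are made up for the example, and gamma / scale_base are placeholder values rather than tuned constants): the usual RoPE rotation is combined with a per-pair decay factor applied with opposite signs to queries and keys, so attention between distant tokens is smoothly attenuated and long-sequence behavior stays stable.

```python
import numpy as np

def xpos_scale(positions, dim, gamma=0.4, scale_base=512):
    """Per-pair exponential decay factors layered on top of RoPE (XPos-style)."""
    # One scale per rotary pair, between gamma/(1+gamma) and 1
    zeta = (np.arange(0, dim, 2) / dim + gamma) / (1 + gamma)     # (dim/2,)
    # Queries get zeta**(n/scale_base), keys the inverse, so q·k picks up a
    # decay that depends only on the distance between the two positions
    return zeta[None, :] ** (positions[:, None] / scale_base)

def xpos_rotate(x, positions, sign=+1, base=10000.0, gamma=0.4, scale_base=512):
    """RoPE rotation with the XPos decay applied (sign=+1 for queries, -1 for keys)."""
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    angles = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    scale = xpos_scale(positions, dim, gamma, scale_base) ** sign
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = (x1 * cos - x2 * sin) * scale
    out[:, 1::2] = (x1 * sin + x2 * cos) * scale
    return out

q = xpos_rotate(np.random.randn(6, 8), np.arange(6), sign=+1)
k = xpos_rotate(np.random.randn(6, 8), np.arange(6), sign=-1)
```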