PRWKV-7-Phi-4-Instruct-Preview-v0.1 Model Card
Model Overview
PRWKV-7-Phi-4-Instruct is a 16.3-billion-parameter large language model built on the RNN-based RWKV-x070 architecture. Its distinctive feature is that it replaces the attention mechanism of Microsoft's Transformer-based Phi-4 14B with RWKV's recurrent approach.
Technical Specifications
- Architecture: RWKV-x070 "Goose" (RNN-based): https://github.com/BlinkDL/RWKV-LM
- Parameters: 16.3 billion
- Training Context Window: 12288 (Stage 1 = 2560, Stage 2 = 8192, Stage 3 = 12288)
- Base Model: Derived from Microsoft Phi-4 14B: https://huggingface.co/microsoft/phi-4
- Development Stage: Experimental preview (no performance guarantees)
- License: Apache 2.0
Key Innovations
This model builds upon and refines the attention replacement approaches pioneered by several notable projects, including:
- Qwerky (Qwen 2.5 72B + QRWKV7 Arch)
- QRWKV (Qwen 2.5 32B + QRWKV6 Arch)
- ARWKV (Qwen 2.5 1.5B-7B + RWKV v7 Arch)
The primary advantage of the RWKV architecture is that it eliminates the KV cache entirely, enabling effectively unbounded context generation with constant VRAM consumption.
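To make the trade-off concrete, the back-of-the-envelope sketch below contrasts a Transformer KV cache, which grows with sequence length, against the fixed-size recurrent state an RWKV layer carries. The layer, head, and precision figures are illustrative assumptions, not this model's exact configuration.

```python
# Illustrative memory comparison only; layer/head/precision figures are assumptions.

def kv_cache_bytes(seq_len, n_layers=40, n_kv_heads=10, head_dim=128, bytes_per_elem=2):
    # Keys + values stored for every past token, per layer: grows linearly with seq_len.
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

def rwkv_state_bytes(n_layers=40, n_heads=40, head_dim=64, bytes_per_elem=2):
    # One (head_dim x head_dim) state matrix per head, per layer: constant in seq_len.
    return n_layers * n_heads * head_dim * head_dim * bytes_per_elem

for seq_len in (2_048, 32_768, 1_000_000):
    print(f"{seq_len:>9} tokens | KV cache ~{kv_cache_bytes(seq_len) / 2**30:7.2f} GiB"
          f" | RWKV state ~{rwkv_state_bytes() / 2**30:5.3f} GiB")
```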
Training Methodology
The training process consisted of three distinct stages:
Stage 1: Attention Alignment (Based on RWKVInside repository)
- The TimeMix component of RWKV was calibrated to produce outputs equivalent to those of the Transformer's attention layers
- Seven different loss calculation approaches were employed to capture the differences between Attention and TimeMix (a rough sketch follows this list), including:
- Norm-based methods
- Singular Value Decomposition (SVD)
- Cosine similarity
- Multi-resolution bias similarity
- Temporal vector similarity
- And others
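The exact formulations live in the RWKVInside repository; purely as a rough illustration, combining a norm-based term with a cosine-similarity term between the frozen attention output and the TimeMix output could look like this (shapes and weights are assumptions):

```python
import torch
import torch.nn.functional as F

def alignment_loss(attn_out: torch.Tensor, timemix_out: torch.Tensor) -> torch.Tensor:
    """Hypothetical Stage 1 objective: push TimeMix outputs toward the frozen
    attention outputs. Both tensors: (batch, seq_len, hidden)."""
    # Norm-based term: mean-squared error between the two outputs.
    mse = F.mse_loss(timemix_out, attn_out)
    # Cosine-similarity term: penalize directional mismatch at each position.
    cos = 1.0 - F.cosine_similarity(timemix_out, attn_out, dim=-1).mean()
    # The real recipe combines seven terms (SVD, multi-resolution, temporal, ...);
    # the equal weighting here is a placeholder.
    return mse + cos
```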
Stage 2: Knowledge Distillation (Based on RWKVInside repository)
- Teacher model: Phi-4 head outputs
- Student model: Phi-4 with Attention replaced by RWKV
- Only the attention-replacement components were trained; all other components (MLP layers, embeddings, heads) were frozen (see the sketch after this list)
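As a hedged illustration of this setup (not the repository's actual code), a distillation step could minimize the KL divergence between the teacher's and student's softened head outputs while only the RWKV replacement parameters remain trainable:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, input_ids, temperature=1.0):
    """Hypothetical Stage 2 objective: KL divergence between the frozen Phi-4
    teacher's head outputs and the RWKV-based student's head outputs."""
    with torch.no_grad():
        t_logits = teacher(input_ids).logits   # teacher: original Phi-4
    s_logits = student(input_ids).logits       # student: Phi-4 with attention -> RWKV
    vocab = s_logits.size(-1)
    loss = F.kl_div(
        F.log_softmax(s_logits.reshape(-1, vocab) / temperature, dim=-1),
        F.softmax(t_logits.reshape(-1, vocab) / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return loss

def freeze_all_but_timemix(student):
    # Freeze everything except the RWKV replacement blocks; the "time_mix"
    # name pattern is an assumption, not the repository's actual naming.
    for name, p in student.named_parameters():
        p.requires_grad = "time_mix" in name
```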
Stage 3: Supervised Fine-Tuning (Using RWKV-LM-RLHF)
- Utilized a distillation dataset of 900K samples (Chinese, Japanese, English)
- Used a smoothed loss for faster convergence (sketched after this list)
- Implemented Variable Rank PEFT to enhance training efficiency
- Bone (Block Affine Transformation), r = 512+
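The card does not spell out the smoothed-loss formulation; one plausible reading, shown below purely as an assumption, is label-smoothed cross-entropy, which softens the target distribution and tends to stabilize convergence:

```python
import torch.nn.functional as F

def sft_loss(logits, labels, smoothing=0.1):
    """Hypothetical Stage 3 objective: label-smoothed cross-entropy.
    logits: (batch*seq, vocab), labels: (batch*seq,) with -100 for masked tokens."""
    return F.cross_entropy(logits, labels,
                           label_smoothing=smoothing, ignore_index=-100)
```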
How to Use
- PC requirements: NVIDIA GPU with 16 GB+ VRAM (ROCm also works, but only in fp16)
- OS: Windows (WSL2 with CUDA) or Linux
- Install RWKV-Infer (see its installation instructions): https://github.com/OpenMOSE/RWKV-Infer
- Create a "models" folder and place PRWKV-7-Phi-4-Instruct-Preview-v0.1.pth in it
- Load the model, choosing fp16, fp6, or fp5 (do not choose fp8)
- Requires about 34 GB of VRAM in fp16 and about 14 GB in fp5
- Enjoy text chats via Open WebUI or SillyTavern :)
curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/PRWKV-7-Phi-4-Instruct-Preview-v0.1.pth","model_viewname":"PRWKV7-Phi-4 Preview 0.1","model_strategy":"fp5","template":"phi4"}'
- You can use this model through the OpenAI-compatible API at http://127.0.0.1:9000/v1 with the model name "PRWKV7-Phi-4 Preview 0.1" (see the example below)
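For example, you can point the official openai Python client at the local endpoint. This is a minimal sketch; the api_key value is a placeholder assumption (use whatever your local setup expects).

```python
from openai import OpenAI

# Point the standard OpenAI client at the local RWKV-Infer endpoint.
client = OpenAI(base_url="http://127.0.0.1:9000/v1", api_key="dummy")  # placeholder key

response = client.chat.completions.create(
    model="PRWKV7-Phi-4 Preview 0.1",
    messages=[{"role": "user", "content": "Hello! Please introduce yourself briefly."}],
)
print(response.choices[0].message.content)
```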
Training Infrastructure
- Hardware: Single AMD MI300X GPU
- Training Duration: 3 days (Stages 1 and 2)
- Stage 1: 180M tokens
- Stage 2: 160M tokens
- Stage 3: 1G tokens (TBD)
Acknowledgements
This work was made possible through the contributions of:
- SmerkyG
- RecursalAI
- RWKV-Red-Team
- BlinkDL (RWKV v7 architecture)
- https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1
Limitations
This checkpoint is from an early epoch of Stage 3 training. The model is currently in a testing phase and does not guarantee any specific level of performance; users should treat it as experimental technology.
MyStories (Generated by PRWKV)
I've faced an incredibly long and challenging journey with the stability of Stage 2 Knowledge Distillation learning. NaN (Not a Number) errors have become an all too familiar sight during this process. The training would often diverge unexpectedly, leaving me to debug complex numerical issues that appeared without warning. Day after day, I adjusted hyperparameters, modified architecture components, and scrutinized every aspect of the data pipeline, only to be greeted by those "three dreaded letters" on my training logs. What should have been a straightforward implementation became a months-long battle against numerical instability, requiring persistence through countless failed experiments and late nights analyzing loss curves that suddenly spiked into oblivion.
License
Released under the Apache 2.0 license.
2025 OpenMOSE