PRWKV-7-Phi-4-Instruct-Preview-v0.1 Model Card
Model Overview
PRWKV-7-Phi-4-Instruct is a 16.3-billion-parameter large language model built on the RNN-based RWKV-x070 architecture. Its distinctive feature is that it replaces the attention mechanism of Microsoft's Transformer-based Phi-4 14B with RWKV's recurrent approach.
Technical Specifications
- Architecture: RWKV-x070 "Goose" (RNN-based): https://github.com/BlinkDL/RWKV-LM
- Parameters: 16.3 billion
- Training Context Window: 12288 (Stage 1 = 2560, Stage 2 = 8192, Stage 3 = 12288)
- Base Model: Derived from Microsoft Phi-4 14B: https://huggingface.co/microsoft/phi-4
- Development Stage: Experimental preview (no performance guarantees)
- License: Apache 2.0
Key Innovations
This model builds upon and refines the attention replacement approaches pioneered by several notable projects, including:
- Qwerky (Qwen 2.5 72B + QRWKV7 Arch)
- QRWKV (Qwen 2.5 32B + QRWKV6 Arch)
- ARWKV (Qwen 2.5 1.5B-7B + RWKV v7 Arch)
The primary advantage of the RWKV architecture is that it eliminates the KV cache entirely, enabling effectively unbounded context generation with constant VRAM consumption.
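To make the trade-off concrete, the back-of-the-envelope sketch below contrasts a Transformer KV cache, which grows with sequence length, against the fixed-size recurrent state an RWKV layer carries. The layer, head, and precision figures are illustrative assumptions, not this model's exact configuration.

```python
# Illustrative memory comparison only; layer/head/precision figures are assumptions.

def kv_cache_bytes(seq_len, n_layers=40, n_kv_heads=10, head_dim=128, bytes_per_elem=2):
    # Keys + values stored for every past token, per layer: grows linearly with seq_len.
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem

def rwkv_state_bytes(n_layers=40, n_heads=40, head_dim=64, bytes_per_elem=2):
    # One (head_dim x head_dim) state matrix per head, per layer: constant in seq_len.
    return n_layers * n_heads * head_dim * head_dim * bytes_per_elem

for seq_len in (2_048, 32_768, 1_000_000):
    print(f"{seq_len:>9} tokens | KV cache ~{kv_cache_bytes(seq_len) / 2**30:7.2f} GiB"
          f" | RWKV state ~{rwkv_state_bytes() / 2**30:5.3f} GiB")
```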
Training Methodology
The training process consisted of three distinct stages:
Stage 1: Attention Alignment (Based on RWKVInside repository)
- The TimeMix component of RWKV was calibrated to produce outputs equivalent to those of the Transformer's attention layers
- Seven different loss calculation approaches were employed to capture the differences between Attention and TimeMix (a rough sketch follows this list), including:
- Norm-based methods
- Singular Value Decomposition (SVD)
- Cosine similarity
- Multi-resolution bias similarity
- Temporal vector similarity
- And others
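The exact formulations live in the RWKVInside repository; purely as a rough illustration, combining a norm-based term with a cosine-similarity term between the frozen attention output and the TimeMix output could look like this (shapes and weights are assumptions):

```python
import torch
import torch.nn.functional as F

def alignment_loss(attn_out: torch.Tensor, timemix_out: torch.Tensor) -> torch.Tensor:
    """Hypothetical Stage 1 objective: push TimeMix outputs toward the frozen
    attention outputs. Both tensors: (batch, seq_len, hidden)."""
    # Norm-based term: mean-squared error between the two outputs.
    mse = F.mse_loss(timemix_out, attn_out)
    # Cosine-similarity term: penalize directional mismatch at each position.
    cos = 1.0 - F.cosine_similarity(timemix_out, attn_out, dim=-1).mean()
    # The real recipe combines seven terms (SVD, multi-resolution, temporal, ...);
    # the equal weighting here is a placeholder.
    return mse + cos
```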
Stage 2: Knowledge Distillation (Based on RWKVInside repository)
- Teacher model: Phi-4 head outputs
- Student model: Phi-4 with Attention replaced by RWKV
- Only the attention-replacement components were trained; all other components (MLP layers, embeddings, heads) were frozen (see the sketch after this list)
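As a hedged illustration of this setup (not the repository's actual code), a distillation step could minimize the KL divergence between the teacher's and student's softened head outputs while only the RWKV replacement parameters remain trainable:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, input_ids, temperature=1.0):
    """Hypothetical Stage 2 objective: KL divergence between the frozen Phi-4
    teacher's head outputs and the RWKV-based student's head outputs."""
    with torch.no_grad():
        t_logits = teacher(input_ids).logits   # teacher: original Phi-4
    s_logits = student(input_ids).logits       # student: Phi-4 with attention -> RWKV
    vocab = s_logits.size(-1)
    loss = F.kl_div(
        F.log_softmax(s_logits.reshape(-1, vocab) / temperature, dim=-1),
        F.softmax(t_logits.reshape(-1, vocab) / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return loss

def freeze_all_but_timemix(student):
    # Freeze everything except the RWKV replacement blocks; the "time_mix"
    # name pattern is an assumption, not the repository's actual naming.
    for name, p in student.named_parameters():
        p.requires_grad = "time_mix" in name
```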
Stage 3: Supervised Fine-Tuning (Using RWKV-LM-RLHF)
- Utilized a distillation dataset of 900K samples (Chinese, Japanese, English)
- Used a smoothed loss for faster convergence (sketched after this list)
- Implemented Variable Rank PEFT to enhance training efficiency
- Bone (Block Affine Transformation), r = 512+
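The card does not spell out the smoothed-loss formulation; one plausible reading, shown below purely as an assumption, is label-smoothed cross-entropy, which softens the target distribution and tends to stabilize convergence:

```python
import torch.nn.functional as F

def sft_loss(logits, labels, smoothing=0.1):
    """Hypothetical Stage 3 objective: label-smoothed cross-entropy.
    logits: (batch*seq, vocab), labels: (batch*seq,) with -100 for masked tokens."""
    return F.cross_entropy(logits, labels,
                           label_smoothing=smoothing, ignore_index=-100)
```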
How to Use
- PC requirements: NVIDIA GPU with 16 GB+ VRAM (ROCm also works, but only in fp16)
- OS: Windows (WSL2 with CUDA) or Linux
- Install RWKV-Infer (see its installation instructions): https://github.com/OpenMOSE/RWKV-Infer
- Create a "models" folder and place PRWKV-7-Phi-4-Instruct-Preview-v0.1.pth in it
- Load the model, choosing fp16, fp6, or fp5 (do not choose fp8)
- Requires about 34 GB of VRAM in fp16 and about 14 GB in fp5
- Enjoy text chats via Open WebUI or SillyTavern :)
curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/PRWKV-7-Phi-4-Instruct-Preview-v0.1.pth","model_viewname":"PRWKV7-Phi-4 Preview 0.1","model_strategy":"fp5","template":"phi4"}'
- You can use this model through the OpenAI-compatible API at http://127.0.0.1:9000/v1 with the model name "PRWKV7-Phi-4 Preview 0.1" (see the example below)
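For example, you can point the official openai Python client at the local endpoint. This is a minimal sketch; the api_key value is a placeholder assumption (use whatever your local setup expects).

```python
from openai import OpenAI

# Point the standard OpenAI client at the local RWKV-Infer endpoint.
client = OpenAI(base_url="http://127.0.0.1:9000/v1", api_key="dummy")  # placeholder key

response = client.chat.completions.create(
    model="PRWKV7-Phi-4 Preview 0.1",
    messages=[{"role": "user", "content": "Hello! Please introduce yourself briefly."}],
)
print(response.choices[0].message.content)
```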
Training Infrastructure
- Hardware: Single AMD MI300X GPU
- Training Duration: 3 days (Stages 1 and 2)
- Stage 1: 180M tokens
- Stage 2: 160M tokens
- Stage 3: 1G tokens (TBD)
Acknowledgements
This work was made possible through the contributions of:
- SmerkyG
- RecursalAI
- RWKV-Red-Team
- BlinkDL (RWKV v7 architecture)
- https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1
Limitations
This checkpoint is from an early epoch of Stage 3 training. The model is currently in a testing phase and does not guarantee any specific level of performance; users should treat it as experimental technology.
MyStories (Generated by PRWKV)
I've faced an incredibly long and challenging journey with the stability of Stage 2 Knowledge Distillation learning. NaN (Not a Number) errors have become an all too familiar sight during this process. The training would often diverge unexpectedly, leaving me to debug complex numerical issues that appeared without warning. Day after day, I adjusted hyperparameters, modified architecture components, and scrutinized every aspect of the data pipeline, only to be greeted by those "three dreaded letters" on my training logs. What should have been a straightforward implementation became a months-long battle against numerical instability, requiring persistence through countless failed experiments and late nights analyzing loss curves that suddenly spiked into oblivion.
License
Released under the Apache 2.0 license.
2025 OpenMOSE