About rope theta #20
opened by spcmxxd
Dear author,

I notice that in class RotaryEmbedding, base = base * self.rope_ratio, where base == 10000 and rope_ratio == 10000.

May I ask whether glm-4-9b-chat-1m was trained with these same values, i.e., base = 10000 * 10000 = 10^8, so that the training and inference stages use identical hyperparameter values?
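For context, here is a minimal sketch of how such a rope_ratio would scale the RoPE base when computing inverse frequencies. The helper name rope_inv_freq is hypothetical (the actual ChatGLM code does this inside the RotaryEmbedding class); it only illustrates the standard RoPE frequency formula with the scaled base:

```python
import math

def rope_inv_freq(dim, base=10000.0, rope_ratio=1.0):
    """Inverse frequencies for rotary embeddings (standard RoPE formula),
    with the base scaled by rope_ratio as in the snippet discussed above."""
    base = base * rope_ratio  # e.g. 10000 * 10000 = 1e8 for rope_ratio == 10000
    return [1.0 / (base ** (2 * i / dim)) for i in range(dim // 2)]

# A larger effective base slows the per-position rotation, which is how
# long-context variants stretch the usable context length.
freqs_default = rope_inv_freq(64)                     # effective base 1e4
freqs_scaled = rope_inv_freq(64, rope_ratio=10000.0)  # effective base 1e8
```

With rope_ratio == 10000 every non-trivial frequency shrinks relative to the default, so positions rotate much more slowly; the question above is whether this scaled base was also used during training.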