About rope theta

#20
by spcmxxd - opened

Dear author,

I noticed that in the RotaryEmbedding class, `base = base * self.rope_ratio`, where base == 10000 and rope_ratio == 10000.

May I ask whether glm-4-9b-chat-1m was also trained with this value at the training stage, i.e., base = 10000 * 10000 = 10^8? That way the training and inference stages would use the same hyperparameter value.
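For context, here is a minimal sketch (hypothetical helper, not the repo's actual code) of how multiplying the base by `rope_ratio` changes the RoPE inverse frequencies — with rope_ratio == 10000, the effective base becomes 10^8, which stretches the rotary wavelengths and is a common way to support very long contexts:

```python
import math

def rope_inv_freq(dim: int, base: float = 10000.0, rope_ratio: float = 10000.0):
    # Mirrors the line in question: effective base = 10000 * 10000 = 1e8.
    base = base * rope_ratio
    # Standard RoPE inverse frequencies: base^(-2i/dim) for i in 0..dim/2-1.
    return [base ** (-(2 * i) / dim) for i in range(dim // 2)]

freqs = rope_inv_freq(64)
# The highest frequency is always 1.0; the lowest shrinks as the base grows,
# so positions stay distinguishable over much longer sequences.
```

With a larger effective base, the slowest-rotating dimensions complete far fewer rotations over the context window, which is why long-context variants typically raise rope theta rather than keep the original 10000.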
