Model converted by the transformers' pt_to_tf CLI. All converted model outputs and hidden layers were validated against their PyTorch counterparts.

Maximum crossload output difference=1.312e-02; Maximum crossload hidden layer difference=1.495e-01;
Maximum conversion output difference=1.312e-02; Maximum conversion hidden layer difference=1.495e-01;

CAUTION: The maximum admissible error was manually increased to 0.9!

These look a little high, no? They are larger than the v1 differences in https://huggingface.co/openai/whisper-large/discussions/5
cc @joaogante @Rocketknight1 @amyeroberts @ArthurZ

Max output differences:
List of maximum output differences above the threshold (5e-05):                                                         
past_key_values[0][2]: 3.088e-03                                                                                        
past_key_values[0][3]: 2.923e-03                                                                                        
past_key_values[1][2]: 5.472e-03                                                                                        
past_key_values[1][3]: 2.577e-03                                                                                        
past_key_values[2][2]: 5.428e-03                                                                                        
past_key_values[2][3]: 2.440e-03                                                                                        
past_key_values[3][2]: 5.165e-03                                                                                        
past_key_values[3][3]: 5.683e-03                                                                                        
past_key_values[4][2]: 3.817e-03                                                                                        
past_key_values[4][3]: 2.737e-03                                                                                        
past_key_values[5][2]: 3.044e-03                                                                                        
past_key_values[5][3]: 2.732e-03                                                                                        
past_key_values[6][2]: 4.379e-03                                                                                        
past_key_values[6][3]: 3.248e-03                                                                                        
past_key_values[7][2]: 5.574e-03                                                                                        
past_key_values[7][3]: 3.467e-03                                                                                        
past_key_values[8][2]: 4.893e-03                                                                                        
past_key_values[8][3]: 2.842e-03                                                                                        
past_key_values[9][2]: 4.357e-03                                                                                        
past_key_values[9][3]: 2.044e-03                                                                                        
past_key_values[10][2]: 4.426e-03                                                                                       
past_key_values[10][3]: 3.944e-03                                                                                       
past_key_values[11][2]: 6.138e-03                                                                                       
past_key_values[11][3]: 2.681e-03                                                                                       
past_key_values[12][2]: 5.786e-03
past_key_values[12][3]: 2.861e-03
past_key_values[13][2]: 7.667e-03
past_key_values[13][3]: 3.061e-03
past_key_values[14][2]: 5.735e-03
past_key_values[14][3]: 3.063e-03
past_key_values[15][2]: 5.144e-03
past_key_values[15][3]: 3.372e-03
past_key_values[16][2]: 6.787e-03
past_key_values[16][3]: 3.290e-03
past_key_values[17][2]: 4.978e-03
past_key_values[17][3]: 3.473e-03
past_key_values[18][2]: 1.110e-02
past_key_values[18][3]: 5.202e-03
past_key_values[19][2]: 6.616e-03
past_key_values[19][3]: 4.259e-03
past_key_values[20][2]: 5.492e-03
past_key_values[20][3]: 3.165e-03
past_key_values[21][2]: 6.696e-03
past_key_values[21][3]: 3.471e-03
past_key_values[22][2]: 4.457e-03
past_key_values[22][3]: 2.330e-03
past_key_values[23][2]: 6.278e-03
past_key_values[23][3]: 4.703e-03
past_key_values[24][2]: 4.363e-03
past_key_values[24][3]: 4.692e-03
past_key_values[25][2]: 5.248e-03
past_key_values[25][3]: 3.060e-03
past_key_values[26][2]: 4.492e-03
past_key_values[26][3]: 4.331e-03
past_key_values[27][2]: 7.060e-03
past_key_values[27][3]: 3.399e-03
past_key_values[28][2]: 8.746e-03
past_key_values[28][3]: 3.638e-03
past_key_values[29][2]: 7.522e-03
past_key_values[29][3]: 3.332e-03
past_key_values[30][2]: 9.381e-03
past_key_values[30][3]: 2.854e-03
past_key_values[31][2]: 1.312e-02
past_key_values[31][3]: 3.715e-03

List of maximum hidden layer differences above the threshold (5e-05):
decoder_hidden_states[28]: 8.774e-05
decoder_hidden_states[29]: 9.155e-05
decoder_hidden_states[30]: 9.346e-05
decoder_hidden_states[31]: 8.392e-05
encoder_last_hidden_state: 3.584e-03
encoder_hidden_states[11]: 5.317e-05
encoder_hidden_states[12]: 6.831e-05
encoder_hidden_states[13]: 7.272e-05
encoder_hidden_states[14]: 9.775e-05
encoder_hidden_states[15]: 1.304e-04
encoder_hidden_states[16]: 1.554e-04
encoder_hidden_states[17]: 2.334e-04
encoder_hidden_states[18]: 4.997e-04
encoder_hidden_states[19]: 6.695e-04
encoder_hidden_states[20]: 1.229e-03
encoder_hidden_states[21]: 2.126e-03
encoder_hidden_states[22]: 2.824e-03
encoder_hidden_states[23]: 3.500e-03
encoder_hidden_states[24]: 4.278e-03
encoder_hidden_states[25]: 4.837e-03
encoder_hidden_states[26]: 6.305e-03
encoder_hidden_states[27]: 6.945e-03
encoder_hidden_states[28]: 1.120e-02
encoder_hidden_states[29]: 1.467e-02
encoder_hidden_states[30]: 3.117e-02
encoder_hidden_states[31]: 1.495e-01
encoder_hidden_states[32]: 3.584e-03
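
For context, each figure in the lists above is presumably a maximum absolute elementwise difference between the PyTorch tensor and its TensorFlow counterpart. The snippet below is a minimal sketch of that comparison using NumPy. The helper names `max_abs_diff` and `above_threshold` are made up for illustration and are not taken from the pt_to_tf CLI; only the 5e-05 threshold comes from the report.

```python
import numpy as np


def max_abs_diff(pt_array, tf_array):
    """Largest elementwise absolute difference between two same-shaped arrays."""
    pt_array = np.asarray(pt_array, dtype=np.float64)
    tf_array = np.asarray(tf_array, dtype=np.float64)
    if pt_array.shape != tf_array.shape:
        raise ValueError(f"shape mismatch: {pt_array.shape} vs {tf_array.shape}")
    return float(np.max(np.abs(pt_array - tf_array)))


def above_threshold(named_pairs, threshold=5e-5):
    """Return {name: difference} for every pair whose difference exceeds the threshold.

    `named_pairs` maps a tensor name to a (pytorch_array, tensorflow_array) tuple.
    This is a hypothetical helper, not part of transformers.
    """
    flagged = {}
    for name, (pt_array, tf_array) in named_pairs.items():
        diff = max_abs_diff(pt_array, tf_array)
        if diff > threshold:
            flagged[name] = diff
    return flagged
```

Feeding it the outputs of both models as NumPy arrays would yield a dictionary shaped like the lists above, with one entry per tensor that exceeds the threshold.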

Hey @sanchit-gandhi -- with v1, we ran a comparison on the validation set of librispeech_asr. Since both frameworks generated the same output, we decided to merge despite the significant numerical difference. I'd suggest doing the same here.

Nevertheless, it would be nice to pinpoint the source of the difference in the future, since it could matter in some applications!
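
A rough way to start narrowing it down is to walk the per-layer hidden states from both frameworks and report the first layer whose difference exceeds the threshold. In the report above, the encoder first crosses 5e-05 at encoder_hidden_states[11] and grows to 1.495e-01 at encoder_hidden_states[31], so the earliest offending layer is the natural place to look. The function below is only a sketch under assumptions: `first_divergent_layer` is a hypothetical helper, and it expects the hidden states to have already been converted to NumPy arrays.

```python
import numpy as np


def first_divergent_layer(pt_states, tf_states, threshold=5e-5):
    """Return (layer_index, max_abs_difference) for the first layer above the threshold.

    `pt_states` and `tf_states` are sequences of same-shaped arrays, one per layer.
    Returns None when every layer stays within the threshold.
    """
    for index, (pt_layer, tf_layer) in enumerate(zip(pt_states, tf_states)):
        pt_layer = np.asarray(pt_layer, dtype=np.float64)
        tf_layer = np.asarray(tf_layer, dtype=np.float64)
        diff = float(np.max(np.abs(pt_layer - tf_layer)))
        if diff > threshold:
            return index, diff
    return None
```

Because differences accumulate from layer to layer, the first layer flagged is not necessarily where the mismatch originates, only where it becomes visible at the chosen threshold.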

Awesome, thanks for the details @joaogante! Running inference on LibriSpeech test-clean now. Will report back with the PT and TF WER results.

PT: 2.87% WER
TF: 2.86% WER
=> LGTM!
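
For readers unfamiliar with the metric: WER (word error rate) is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The snippet below is a minimal pure-Python sketch for a single utterance. It is not the evaluation script used for the numbers above, and it assumes the text has already been normalised; `word_error_rate` is a name chosen for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref_words)
```

A corpus-level figure such as the ones reported here is normally the total number of errors over the total number of reference words across all utterances, rather than an average of per-utterance rates.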

Awesome, LGTM!

(following the criteria applied to v1, we can merge this one)

sanchit-gandhi changed pull request status to merged

Thanks a lot for adding this!
