yangxiaoyu6 committed
Commit · f147fc2
Parent(s): fdcf711
add files
This view is limited to 50 files because it contains too many changes. See raw diff.
- inference_audio_tagging/log-decode-epoch-10-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-12-10-11-40 +47 -0
- inference_audio_tagging/log-decode-epoch-10-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-12-10-14-57 +47 -0
- inference_audio_tagging/log-decode-epoch-5-avg-1-use-averaged-model-2024-08-10-21-04-27 +8 -0
- inference_audio_tagging/log-decode-epoch-5-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-10-21-06-53 +46 -0
- inference_audio_tagging/log-decode-epoch-6-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-11-15-24-14 +45 -0
- inference_audio_tagging/log-decode-epoch-7-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-11-15-20-27 +48 -0
- inference_audio_tagging/log-decode-epoch-8-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-11-15-13-44 +48 -0
- inference_audio_tagging/log-decode-epoch-9-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-12-10-18-06 +48 -0
- inference_audio_tagging/log-decode-epoch-9-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-12-10-21-05 +43 -0
- inference_audio_tagging/log-decode-iter-164000-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-12-14-27-35 +45 -0
- inference_audio_tagging/log-decode-iter-212000-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-13-10-36-05 +44 -0
- inference_audio_tagging/log-decode-iter-220000-avg-2-use-averaged-model-chunk-size-16-left-context-frames-256-2024-08-13-16-38-15 +49 -0
- inference_audio_tagging/log-decode-iter-220000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-13-16-07-33 +3 -0
- inference_audio_tagging/log-decode-iter-220000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-13-16-07-59 +40 -0
- inference_audio_tagging/log-decode-iter-224000-avg-2-use-averaged-model-2024-08-17-12-43-23 +52 -0
- inference_audio_tagging/log-decode-iter-224000-avg-2-use-averaged-model-2024-08-17-12-48-01 +46 -0
- inference_audio_tagging/log-decode-iter-256000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-14-09-26-56 +47 -0
- inference_audio_tagging/log-decode-iter-272000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-14-15-36-47 +47 -0
- inference_audio_tagging/log-decode-iter-312000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-15-10-22-43 +49 -0
- inference_audio_tagging/log-decode-iter-332000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-16-14-25-49 +48 -0
- inference_audio_tagging/log-decode-iter-332000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-17-12-37-47 +41 -0
- inference_audio_tagging/log-decode-iter-360000-avg-3-chunk-size-16-left-context-frames-128-2024-08-19-14-45-34 +47 -0
- inference_audio_tagging/log-decode-iter-360000-avg-3-chunk-size-32-left-context-frames-256-2024-08-19-14-42-23 +53 -0
- inference_audio_tagging/log-decode-iter-360000-avg-4-chunk-size-16-left-context-frames-128-2024-08-19-14-48-48 +45 -0
- inference_audio_tagging/log-decode-iter-360000-avg-4-chunk-size-32-left-context-frames-256-2024-08-18-01-34-58 +43 -0
- inference_audio_tagging/log-decode-iter-360000-avg-4-chunk-size-32-left-context-frames-256-2024-08-19-14-45-34 +54 -0
- inference_audio_tagging/log-decode-iter-360000-avg-4-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-18-01-39-49 +54 -0
- inference_audio_tagging/log-decode-iter-360000-avg-5-chunk-size-16-left-context-frames-128-2024-08-19-14-51-57 +49 -0
- inference_audio_tagging/log-decode-iter-360000-avg-5-chunk-size-32-left-context-frames-256-2024-08-19-14-48-47 +46 -0
- inference_audio_tagging/log-decode-iter-364000-avg-3-chunk-size-16-left-context-frames-128-2024-08-19-14-34-47 +47 -0
- inference_audio_tagging/log-decode-iter-364000-avg-3-chunk-size-32-left-context-frames-256-2024-08-19-14-32-47 +42 -0
- inference_audio_tagging/log-decode-iter-364000-avg-3-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-18-19-03-18 +43 -0
- inference_audio_tagging/log-decode-iter-364000-avg-4-chunk-size-16-left-context-frames-128-2024-08-19-14-38-35 +45 -0
- inference_audio_tagging/log-decode-iter-364000-avg-4-chunk-size-32-left-context-frames-256-2024-08-19-14-34-47 +48 -0
- inference_audio_tagging/log-decode-iter-364000-avg-5-chunk-size-16-left-context-frames-128-2024-08-19-14-42-23 +48 -0
- inference_audio_tagging/log-decode-iter-364000-avg-5-chunk-size-32-left-context-frames-256-2024-08-19-14-38-35 +47 -0
- inference_audio_tagging/log-decode-iter-368000-avg-3-chunk-size-16-left-context-frames-128-2024-08-19-14-25-19 +45 -0
- inference_audio_tagging/log-decode-iter-368000-avg-3-chunk-size-32-left-context-frames-256-2024-08-19-14-18-53 +47 -0
- inference_audio_tagging/log-decode-iter-368000-avg-3-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-18-19-00-07 +50 -0
- inference_audio_tagging/log-decode-iter-368000-avg-4-chunk-size-16-left-context-frames-128-2024-08-19-14-29-01 +42 -0
- inference_audio_tagging/log-decode-iter-368000-avg-4-chunk-size-32-left-context-frames-256-2024-08-19-14-25-19 +46 -0
- inference_audio_tagging/log-decode-iter-368000-avg-5-chunk-size-16-left-context-frames-128-2024-08-19-14-32-47 +57 -0
- inference_audio_tagging/log-decode-iter-368000-avg-5-chunk-size-32-left-context-frames-256-2024-08-19-14-29-01 +44 -0
- inference_audio_tagging/log-decode-iter-372000-avg-3-chunk-size-16-left-context-frames-128-2024-08-19-14-12-05 +41 -0
- inference_audio_tagging/log-decode-iter-372000-avg-3-chunk-size-32-left-context-frames-256-2024-08-19-14-08-21 +46 -0
- inference_audio_tagging/log-decode-iter-372000-avg-3-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-18-18-56-51 +49 -0
- inference_audio_tagging/log-decode-iter-372000-avg-4-chunk-size-16-left-context-frames-128-2024-08-19-14-15-56 +47 -0
- inference_audio_tagging/log-decode-iter-372000-avg-4-chunk-size-32-left-context-frames-256-2024-08-19-14-12-05 +48 -0
- inference_audio_tagging/log-decode-iter-372000-avg-5-chunk-size-16-left-context-frames-128-2024-08-19-14-18-53 +49 -0
- inference_audio_tagging/log-decode-iter-372000-avg-5-chunk-size-32-left-context-frames-256-2024-08-19-14-15-56 +48 -0
inference_audio_tagging/log-decode-epoch-10-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-12-10-11-40
ADDED
@@ -0,0 +1,47 @@
2024-08-12 10:11:40,255 INFO [inference_audio_tagging.py:316] Evaluation started
2024-08-12 10:11:40,255 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 10, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 
'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'epoch-10-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256'}
2024-08-12 10:11:40,256 INFO [inference_audio_tagging.py:324] About to create model
2024-08-12 10:11:40,806 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 9 (excluded) to 10
2024-08-12 10:12:01,769 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
2024-08-12 10:12:01,769 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
2024-08-12 10:12:01,990 INFO [kd_datamodule.py:570] About to create dev dataset
2024-08-12 10:12:02,474 INFO [kd_datamodule.py:591] About to create dev dataloader
2024-08-12 10:12:11,994 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
2024-08-12 10:12:18,898 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
2024-08-12 10:12:18,991 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6207, 3.9234, 4.4274, 4.5322], device='cuda:0')
2024-08-12 10:12:26,299 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
2024-08-12 10:12:33,409 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
2024-08-12 10:12:40,229 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
2024-08-12 10:12:46,578 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
2024-08-12 10:12:52,856 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
2024-08-12 10:12:58,843 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
2024-08-12 10:13:00,729 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7741, 2.0148, 1.5138, 1.3013, 1.4477, 1.2348, 1.9011, 1.7445],
device='cuda:0')
2024-08-12 10:13:05,303 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
2024-08-12 10:13:11,451 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
2024-08-12 10:13:17,708 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
2024-08-12 10:13:24,114 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
2024-08-12 10:13:28,133 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1392, 3.9165, 3.3835, 3.6333], device='cuda:0')
2024-08-12 10:13:30,436 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
2024-08-12 10:13:36,728 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
2024-08-12 10:13:43,131 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
2024-08-12 10:13:49,434 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
2024-08-12 10:13:55,467 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
2024-08-12 10:14:01,522 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
2024-08-12 10:14:07,784 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
2024-08-12 10:14:12,975 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5382, 2.0938, 1.9094, 1.8908], device='cuda:0')
2024-08-12 10:14:13,619 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
2024-08-12 10:14:17,324 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.6812, 1.8236, 1.8123, 1.7040, 2.2912, 1.7958, 1.8224, 1.7346],
device='cuda:0')
2024-08-12 10:14:19,775 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
2024-08-12 10:14:25,558 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
2024-08-12 10:14:30,124 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([1.7760e-03, 1.3354e-02, 4.7700e-03, 3.5204e+00, 1.5936e-04, 1.8017e-02,
3.4849e-02, 4.3207e-02], device='cuda:0')
2024-08-12 10:14:31,554 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
2024-08-12 10:14:37,334 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
2024-08-12 10:14:38,811 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.0466, 1.6653, 1.6963, 1.3708], device='cuda:0')
2024-08-12 10:14:43,398 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
2024-08-12 10:14:49,515 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
2024-08-12 10:14:50,226 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
2024-08-12 10:14:52,454 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.44177527860417753
2024-08-12 10:14:52,454 INFO [inference_audio_tagging.py:456] Done
inference_audio_tagging/log-decode-epoch-10-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-12-10-14-57
ADDED
@@ -0,0 +1,47 @@
2024-08-12 10:14:57,731 INFO [inference_audio_tagging.py:316] Evaluation started
2024-08-12 10:14:57,731 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 10, 'iter': 0, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 
'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'epoch-10-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256'}
2024-08-12 10:14:57,732 INFO [inference_audio_tagging.py:324] About to create model
2024-08-12 10:14:58,114 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 8 (excluded) to 10
2024-08-12 10:15:11,306 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
2024-08-12 10:15:11,306 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
2024-08-12 10:15:11,355 INFO [kd_datamodule.py:570] About to create dev dataset
2024-08-12 10:15:11,815 INFO [kd_datamodule.py:591] About to create dev dataloader
2024-08-12 10:15:21,716 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
2024-08-12 10:15:28,437 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
2024-08-12 10:15:35,519 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
2024-08-12 10:15:42,520 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
2024-08-12 10:15:43,881 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9017, 3.0960, 3.2667, 2.7275], device='cuda:0')
2024-08-12 10:15:49,124 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
2024-08-12 10:15:55,419 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
2024-08-12 10:16:01,748 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
2024-08-12 10:16:03,426 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.0454, 1.5283, 1.9360, 2.2070], device='cuda:0')
2024-08-12 10:16:07,586 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
2024-08-12 10:16:13,086 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0915, 3.8474, 3.4336, 3.7420], device='cuda:0')
2024-08-12 10:16:13,834 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
2024-08-12 10:16:19,515 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5161, 2.1007, 2.0855, 2.0830], device='cuda:0')
2024-08-12 10:16:20,246 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
2024-08-12 10:16:26,316 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
2024-08-12 10:16:26,914 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1140, 1.4474, 2.0416, 2.3627], device='cuda:0')
2024-08-12 10:16:32,809 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
2024-08-12 10:16:39,087 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
2024-08-12 10:16:45,509 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
2024-08-12 10:16:51,997 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
2024-08-12 10:16:58,375 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
2024-08-12 10:17:04,209 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
2024-08-12 10:17:10,624 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
2024-08-12 10:17:13,947 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([9.0344e-04, 4.5948e-02, 4.8250e-03, 3.5204e+00, 1.6742e-04, 2.8113e-02,
3.7402e-02, 3.4932e-02], device='cuda:0')
2024-08-12 10:17:16,461 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
2024-08-12 10:17:22,462 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
2024-08-12 10:17:23,123 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7014e-04, 4.5074e-02, 1.0580e-03, 3.5204e+00, 4.0545e-04, 2.5738e-02,
4.1491e-02, 4.1274e-02], device='cuda:0')
2024-08-12 10:17:28,770 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
2024-08-12 10:17:34,776 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
2024-08-12 10:17:40,912 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
2024-08-12 10:17:44,851 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.4611, 3.1433, 3.3871, 3.2476], device='cuda:0')
2024-08-12 10:17:47,031 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
2024-08-12 10:17:53,061 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
2024-08-12 10:17:59,397 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
2024-08-12 10:17:59,939 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
2024-08-12 10:18:01,501 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.44082753134216507
2024-08-12 10:18:01,501 INFO [inference_audio_tagging.py:456] Done
inference_audio_tagging/log-decode-epoch-5-avg-1-use-averaged-model-2024-08-10-21-04-27
ADDED
@@ -0,0 +1,8 @@
2024-08-10 21:04:27,097 INFO [inference_audio_tagging.py:316] Evaluation started
2024-08-10 21:04:27,097 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 5, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 
'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'epoch-5-avg-1-use-averaged-model'}
2024-08-10 21:04:27,097 INFO [inference_audio_tagging.py:324] About to create model
2024-08-10 21:04:27,595 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 4 (excluded) to 5
2024-08-10 21:04:45,144 INFO [inference_audio_tagging.py:421] Number of model parameters: 65577734
2024-08-10 21:04:45,144 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
2024-08-10 21:04:45,341 INFO [kd_datamodule.py:570] About to create dev dataset
2024-08-10 21:04:45,833 INFO [kd_datamodule.py:591] About to create dev dataloader
inference_audio_tagging/log-decode-epoch-5-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-10-21-06-53
ADDED
@@ -0,0 +1,46 @@
2024-08-10 21:06:53,334 INFO [inference_audio_tagging.py:316] Evaluation started
2024-08-10 21:06:53,334 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 5, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 
'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'epoch-5-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256'}
2024-08-10 21:06:53,334 INFO [inference_audio_tagging.py:324] About to create model
2024-08-10 21:06:53,765 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 4 (excluded) to 5
2024-08-10 21:07:03,306 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
2024-08-10 21:07:03,306 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
2024-08-10 21:07:03,363 INFO [kd_datamodule.py:570] About to create dev dataset
2024-08-10 21:07:03,826 INFO [kd_datamodule.py:591] About to create dev dataloader
2024-08-10 21:07:12,438 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
2024-08-10 21:07:24,866 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
2024-08-10 21:07:33,963 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7425, 3.1152, 3.3191, 3.0266], device='cuda:0')
2024-08-10 21:07:39,156 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
2024-08-10 21:07:52,806 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
2024-08-10 21:08:03,825 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9742, 3.3658, 2.1983, 3.7544], device='cuda:0')
2024-08-10 21:08:05,316 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
2024-08-10 21:08:17,742 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
2024-08-10 21:08:29,006 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.7768, 3.0038, 3.0061, 2.6028], device='cuda:0')
2024-08-10 21:08:29,036 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
2024-08-10 21:08:40,262 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
2024-08-10 21:08:51,591 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
2024-08-10 21:09:03,078 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
2024-08-10 21:09:11,147 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
2024-08-10 21:09:14,762 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
2024-08-10 21:09:18,655 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
2024-08-10 21:09:22,273 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
2024-08-10 21:09:26,096 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
2024-08-10 21:09:29,932 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
2024-08-10 21:09:33,726 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
2024-08-10 21:09:34,866 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.5211, 1.7122, 1.4975, 1.5531, 2.1921, 1.3644, 1.6474, 1.6538],
device='cuda:0')
2024-08-10 21:09:37,540 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
2024-08-10 21:09:41,363 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([0.0139, 0.0385, 0.0424, 3.5204, 0.0321, 0.0829, 0.0493, 0.0689],
device='cuda:0')
2024-08-10 21:09:41,432 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
2024-08-10 21:09:45,250 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
2024-08-10 21:09:49,266 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
2024-08-10 21:09:49,607 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8457, 1.4846, 1.5774, 1.2241, 0.7602, 1.3123, 1.8451, 0.9581],
device='cuda:0')
2024-08-10 21:09:52,782 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
2024-08-10 21:09:56,559 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
2024-08-10 21:10:00,346 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
2024-08-10 21:10:04,126 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
2024-08-10 21:10:07,863 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
2024-08-10 21:10:08,313 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
2024-08-10 21:10:10,319 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4008754292161127
2024-08-10 21:10:10,319 INFO [inference_audio_tagging.py:456] Done
inference_audio_tagging/log-decode-epoch-6-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-11-15-24-14
ADDED
@@ -0,0 +1,45 @@
2024-08-11 15:24:14,841 INFO [inference_audio_tagging.py:316] Evaluation started
2024-08-11 15:24:14,841 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 6, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 
'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'epoch-6-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256'}
2024-08-11 15:24:14,841 INFO [inference_audio_tagging.py:324] About to create model
2024-08-11 15:24:15,258 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 5 (excluded) to 6
2024-08-11 15:24:29,581 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
2024-08-11 15:24:29,581 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
2024-08-11 15:24:29,643 INFO [kd_datamodule.py:570] About to create dev dataset
2024-08-11 15:24:30,125 INFO [kd_datamodule.py:591] About to create dev dataloader
2024-08-11 15:24:40,013 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
2024-08-11 15:24:43,556 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1601, 3.8407, 3.4532, 3.5441], device='cuda:0')
2024-08-11 15:24:48,321 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
2024-08-11 15:24:56,816 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
2024-08-11 15:25:05,395 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
2024-08-11 15:25:12,803 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
2024-08-11 15:25:20,121 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
2024-08-11 15:25:26,700 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
2024-08-11 15:25:26,802 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7413, 2.0136, 1.4408, 1.3566, 1.2551, 1.3010, 1.7183, 1.5829],
device='cuda:0')
2024-08-11 15:25:33,345 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
2024-08-11 15:25:40,287 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
2024-08-11 15:25:47,012 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
2024-08-11 15:25:53,941 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
2024-08-11 15:26:01,227 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
2024-08-11 15:26:08,582 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
2024-08-11 15:26:15,915 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
2024-08-11 15:26:20,418 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6425, 3.1317, 3.2957, 3.1463], device='cuda:0')
2024-08-11 15:26:22,843 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
2024-08-11 15:26:28,829 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
2024-08-11 15:26:35,783 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
2024-08-11 15:26:43,378 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
2024-08-11 15:26:51,057 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
2024-08-11 15:26:56,415 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.4594, 1.7878, 1.6870, 1.6532], device='cuda:0')
2024-08-11 15:26:57,967 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
2024-08-11 15:27:04,525 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
2024-08-11 15:27:11,516 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
2024-08-11 15:27:18,993 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
2024-08-11 15:27:22,306 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.5913, 1.4571, 1.6978, 1.5156, 2.1868, 1.4900, 1.6061, 1.6160],
device='cuda:0')
2024-08-11 15:27:26,009 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
2024-08-11 15:27:32,881 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
2024-08-11 15:27:34,182 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8435, 1.4508, 1.4836, 1.3242], device='cuda:0')
2024-08-11 15:27:40,232 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
2024-08-11 15:27:40,903 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
2024-08-11 15:27:42,503 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.41741617776389317
2024-08-11 15:27:42,504 INFO [inference_audio_tagging.py:456] Done
inference_audio_tagging/log-decode-epoch-7-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-11-15-20-27
ADDED
@@ -0,0 +1,48 @@
2024-08-11 15:20:27,267 INFO [inference_audio_tagging.py:316] Evaluation started
2024-08-11 15:20:27,268 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 7, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 
'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'epoch-7-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256'}
2024-08-11 15:20:27,270 INFO [inference_audio_tagging.py:324] About to create model
2024-08-11 15:20:27,717 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 6 (excluded) to 7
2024-08-11 15:20:47,015 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
2024-08-11 15:20:47,016 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
2024-08-11 15:20:47,071 INFO [kd_datamodule.py:570] About to create dev dataset
2024-08-11 15:20:47,598 INFO [kd_datamodule.py:591] About to create dev dataloader
2024-08-11 15:20:56,311 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
2024-08-11 15:21:06,498 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
2024-08-11 15:21:16,779 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
2024-08-11 15:21:26,230 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
2024-08-11 15:21:28,017 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5407, 3.0696, 3.1057, 3.1079], device='cuda:0')
2024-08-11 15:21:34,135 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5935, 1.8986, 1.4909, 1.2528], device='cuda:0')
2024-08-11 15:21:34,959 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
2024-08-11 15:21:43,139 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
2024-08-11 15:21:50,882 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
2024-08-11 15:21:58,716 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
2024-08-11 15:22:06,384 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
2024-08-11 15:22:13,984 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
2024-08-11 15:22:15,250 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6088, 3.9562, 4.3177, 4.4562], device='cuda:0')
2024-08-11 15:22:17,126 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9640, 1.5064, 1.5592, 1.4950], device='cuda:0')
2024-08-11 15:22:21,539 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
2024-08-11 15:22:29,096 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
2024-08-11 15:22:36,863 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
2024-08-11 15:22:42,426 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1636, 3.8079, 3.5103, 3.5051], device='cuda:0')
2024-08-11 15:22:43,827 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
2024-08-11 15:22:51,019 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
2024-08-11 15:22:58,088 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
2024-08-11 15:23:04,721 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
2024-08-11 15:23:11,793 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
2024-08-11 15:23:12,462 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8357, 1.6009, 1.5953, 1.1134, 0.9491, 1.4577, 1.9169, 1.1942],
device='cuda:0')
2024-08-11 15:23:14,999 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5790, 1.9387, 1.5552, 1.2077], device='cuda:0')
2024-08-11 15:23:17,011 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.4937, 1.9604, 1.7352, 1.5111], device='cuda:0')
2024-08-11 15:23:18,794 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
2024-08-11 15:23:25,621 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
2024-08-11 15:23:32,548 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
2024-08-11 15:23:39,407 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
2024-08-11 15:23:42,423 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8686, 1.4719, 1.6162, 1.1380, 0.9494, 1.6360, 1.9117, 1.1741],
device='cuda:0')
2024-08-11 15:23:46,260 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
2024-08-11 15:23:53,498 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
2024-08-11 15:24:00,613 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
2024-08-11 15:24:07,709 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
2024-08-11 15:24:08,590 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
2024-08-11 15:24:10,145 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.42866893372274933
2024-08-11 15:24:10,146 INFO [inference_audio_tagging.py:456] Done
inference_audio_tagging/log-decode-epoch-8-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-11-15-13-44
ADDED
@@ -0,0 +1,48 @@
1 |
+
2024-08-11 15:13:44,653 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-11 15:13:44,653 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 8, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 
'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'epoch-8-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-11 15:13:44,653 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-11 15:13:45,135 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 7 (excluded) to 8
|
5 |
+
2024-08-11 15:14:05,910 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-11 15:14:05,910 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-11 15:14:06,138 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-11 15:14:06,617 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-11 15:14:18,260 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-11 15:14:36,585 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-11 15:14:53,945 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6263, 2.0387, 1.5764, 1.3913], device='cuda:0')
|
12 |
+
2024-08-11 15:14:54,023 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
13 |
+
2024-08-11 15:15:11,053 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
14 |
+
2024-08-11 15:15:25,693 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
15 |
+
2024-08-11 15:15:41,421 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
16 |
+
2024-08-11 15:15:54,607 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
17 |
+
2024-08-11 15:16:08,467 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
18 |
+
2024-08-11 15:16:15,667 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5002, 3.1119, 3.3308, 3.2054], device='cuda:0')
|
19 |
+
2024-08-11 15:16:22,780 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
20 |
+
2024-08-11 15:16:36,637 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
21 |
+
2024-08-11 15:16:51,540 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
22 |
+
2024-08-11 15:17:07,100 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
23 |
+
2024-08-11 15:17:08,776 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0024, 3.4377, 2.2585, 3.7320], device='cuda:0')
|
24 |
+
2024-08-11 15:17:21,959 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
25 |
+
2024-08-11 15:17:37,405 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
26 |
+
2024-08-11 15:17:40,360 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9895, 2.4353, 1.8017, 1.2165], device='cuda:0')
|
27 |
+
2024-08-11 15:17:49,766 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
28 |
+
2024-08-11 15:17:59,838 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.6536, 1.5882, 1.5707, 1.6327, 2.2921, 1.4276, 1.6744, 1.6394],
|
29 |
+
device='cuda:0')
|
30 |
+
2024-08-11 15:18:03,817 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
31 |
+
2024-08-11 15:18:09,157 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5139, 1.8761, 1.7587, 1.8108], device='cuda:0')
|
32 |
+
2024-08-11 15:18:17,827 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
33 |
+
2024-08-11 15:18:31,472 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
34 |
+
2024-08-11 15:18:36,644 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1484, 3.8326, 3.4643, 3.5841], device='cuda:0')
|
35 |
+
2024-08-11 15:18:44,416 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
36 |
+
2024-08-11 15:18:47,474 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7679, 2.0467, 1.5614, 1.4747, 1.3512, 1.4268, 1.5992, 1.6495],
|
37 |
+
device='cuda:0')
|
38 |
+
2024-08-11 15:18:57,655 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
39 |
+
2024-08-11 15:19:12,042 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
40 |
+
2024-08-11 15:19:25,744 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
41 |
+
2024-08-11 15:19:38,224 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
42 |
+
2024-08-11 15:19:53,270 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
43 |
+
2024-08-11 15:19:57,666 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6340, 1.7626, 1.6098, 1.0846], device='cuda:0')
|
44 |
+
2024-08-11 15:20:06,554 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
45 |
+
2024-08-11 15:20:19,412 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
46 |
+
2024-08-11 15:20:20,710 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
47 |
+
2024-08-11 15:20:22,201 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.43377530668679676
|
48 |
+
2024-08-11 15:20:22,201 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-epoch-9-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-12-10-18-06
ADDED
@@ -0,0 +1,48 @@
1 |
+
2024-08-12 10:18:06,439 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-12 10:18:06,440 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 9, 'iter': 0, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 
'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'epoch-9-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-12 10:18:06,440 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-12 10:18:06,863 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 8 (excluded) to 9
|
5 |
+
2024-08-12 10:18:13,335 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-12 10:18:13,336 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-12 10:18:13,381 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-12 10:18:13,816 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-12 10:18:24,130 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-12 10:18:30,658 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-12 10:18:36,947 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8196, 3.1959, 3.1312, 2.6847], device='cuda:0')
|
12 |
+
2024-08-12 10:18:37,871 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
13 |
+
2024-08-12 10:18:44,328 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
14 |
+
2024-08-12 10:18:50,645 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
15 |
+
2024-08-12 10:18:56,981 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
16 |
+
2024-08-12 10:18:57,475 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0384, 4.8802, 4.9236, 4.9786], device='cuda:0')
|
17 |
+
2024-08-12 10:19:02,934 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
18 |
+
2024-08-12 10:19:08,884 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
19 |
+
2024-08-12 10:19:14,912 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
20 |
+
2024-08-12 10:19:21,086 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5199, 2.0271, 1.8606, 1.8901], device='cuda:0')
|
21 |
+
2024-08-12 10:19:21,179 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
22 |
+
2024-08-12 10:19:27,351 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
23 |
+
2024-08-12 10:19:29,837 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5859, 3.8614, 4.3719, 4.4977], device='cuda:0')
|
24 |
+
2024-08-12 10:19:33,790 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
25 |
+
2024-08-12 10:19:38,524 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.1208, 3.8876, 3.3335, 3.5077], device='cuda:0')
|
26 |
+
2024-08-12 10:19:40,065 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
27 |
+
2024-08-12 10:19:46,228 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
28 |
+
2024-08-12 10:19:51,989 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6210, 3.9008, 4.3919, 4.5136], device='cuda:0')
|
29 |
+
2024-08-12 10:19:52,454 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
30 |
+
2024-08-12 10:19:58,573 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
31 |
+
2024-08-12 10:20:04,634 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
32 |
+
2024-08-12 10:20:10,590 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
33 |
+
2024-08-12 10:20:16,501 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
34 |
+
2024-08-12 10:20:19,792 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7290, 2.0262, 1.4975, 1.3461, 1.4410, 1.3872, 1.8411, 1.5780],
|
35 |
+
device='cuda:0')
|
36 |
+
2024-08-12 10:20:20,061 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9230, 2.5583, 1.7227, 1.1983], device='cuda:0')
|
37 |
+
2024-08-12 10:20:22,545 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6147, 1.8226, 1.8562, 1.0958], device='cuda:0')
|
38 |
+
2024-08-12 10:20:22,647 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
39 |
+
2024-08-12 10:20:28,498 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
40 |
+
2024-08-12 10:20:34,731 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
41 |
+
2024-08-12 10:20:38,132 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.4780, 3.1288, 3.3483, 3.2593], device='cuda:0')
|
42 |
+
2024-08-12 10:20:40,597 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
43 |
+
2024-08-12 10:20:46,563 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
44 |
+
2024-08-12 10:20:52,618 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
45 |
+
2024-08-12 10:20:58,621 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
46 |
+
2024-08-12 10:20:59,278 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
47 |
+
2024-08-12 10:21:00,810 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.43811300958522703
|
48 |
+
2024-08-12 10:21:00,810 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-epoch-9-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-12-10-21-05
ADDED
@@ -0,0 +1,43 @@
1 |
+
2024-08-12 10:21:05,435 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-12 10:21:05,435 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 9, 'iter': 0, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 3, 
'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'epoch-9-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-12 10:21:05,435 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-12 10:21:05,827 INFO [inference_audio_tagging.py:403] Calculating the averaged model over epoch range from 7 (excluded) to 9
|
5 |
+
2024-08-12 10:21:18,080 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-12 10:21:18,080 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-12 10:21:18,134 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-12 10:21:18,568 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-12 10:21:28,336 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-12 10:21:34,989 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-12 10:21:42,042 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
12 |
+
2024-08-12 10:21:49,178 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
13 |
+
2024-08-12 10:21:55,388 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
14 |
+
2024-08-12 10:22:01,463 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
15 |
+
2024-08-12 10:22:07,392 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
16 |
+
2024-08-12 10:22:11,856 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6005, 1.8020, 1.6668, 1.1235], device='cuda:0')
|
17 |
+
2024-08-12 10:22:13,512 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
18 |
+
2024-08-12 10:22:19,404 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
19 |
+
2024-08-12 10:22:25,554 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
20 |
+
2024-08-12 10:22:32,028 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
21 |
+
2024-08-12 10:22:38,483 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
22 |
+
2024-08-12 10:22:45,012 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
23 |
+
2024-08-12 10:22:51,279 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
24 |
+
2024-08-12 10:22:52,402 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0512, 4.8597, 4.9187, 4.9602], device='cuda:0')
|
25 |
+
2024-08-12 10:22:57,600 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
26 |
+
2024-08-12 10:23:03,912 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
27 |
+
2024-08-12 10:23:10,067 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
28 |
+
2024-08-12 10:23:16,195 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
29 |
+
2024-08-12 10:23:20,841 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7589, 1.9783, 1.3725, 1.4213, 1.2875, 1.1779, 1.5035, 1.4281],
|
30 |
+
device='cuda:0')
|
31 |
+
2024-08-12 10:23:22,141 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
32 |
+
2024-08-12 10:23:28,129 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
33 |
+
2024-08-12 10:23:34,098 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
34 |
+
2024-08-12 10:23:36,128 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9530, 1.5972, 1.5369, 1.3486], device='cuda:0')
|
35 |
+
2024-08-12 10:23:37,022 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5155, 2.0749, 1.8400, 1.8845], device='cuda:0')
|
36 |
+
2024-08-12 10:23:40,175 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
37 |
+
2024-08-12 10:23:46,520 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
38 |
+
2024-08-12 10:23:52,921 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
39 |
+
2024-08-12 10:23:59,163 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
40 |
+
2024-08-12 10:24:05,466 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
41 |
+
2024-08-12 10:24:06,017 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
42 |
+
2024-08-12 10:24:07,562 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.43657015586534526
|
43 |
+
2024-08-12 10:24:07,562 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-164000-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-12-14-27-35
ADDED
@@ -0,0 +1,45 @@
1 |
+
2024-08-12 14:27:35,488 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-12 14:27:35,488 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 164000, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-164000-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-12 14:27:35,488 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-12 14:27:36,042 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-160000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-164000.pt
|
5 |
+
2024-08-12 14:27:57,684 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-12 14:27:57,685 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-12 14:27:57,782 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-12 14:27:58,246 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-12 14:28:06,699 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-12 14:28:14,927 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-12 14:28:23,808 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
12 |
+
2024-08-12 14:28:32,452 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
13 |
+
2024-08-12 14:28:40,155 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
14 |
+
2024-08-12 14:28:48,045 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
15 |
+
2024-08-12 14:28:55,605 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
16 |
+
2024-08-12 14:29:03,165 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
17 |
+
2024-08-12 14:29:10,757 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
18 |
+
2024-08-12 14:29:18,153 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
19 |
+
2024-08-12 14:29:25,666 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
20 |
+
2024-08-12 14:29:33,373 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
21 |
+
2024-08-12 14:29:34,901 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0051, 3.2733, 2.2210, 3.7081], device='cuda:0')
|
22 |
+
2024-08-12 14:29:41,161 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
23 |
+
2024-08-12 14:29:45,025 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.3682, 3.1002, 3.3222, 3.2444], device='cuda:0')
|
24 |
+
2024-08-12 14:29:48,876 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
25 |
+
2024-08-12 14:29:50,480 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5391, 1.9336, 2.1516, 2.0519], device='cuda:0')
|
26 |
+
2024-08-12 14:29:55,910 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
27 |
+
2024-08-12 14:29:58,088 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0913, 3.9217, 3.4885, 3.6534], device='cuda:0')
|
28 |
+
2024-08-12 14:30:03,489 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
29 |
+
2024-08-12 14:30:10,250 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.9576e-04, 2.9941e-03, 4.8557e-03, 3.5204e+00, 3.8472e-03, 4.8063e-02,
|
30 |
+
5.0845e-02, 3.9885e-02], device='cuda:0')
|
31 |
+
2024-08-12 14:30:11,917 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
32 |
+
2024-08-12 14:30:19,354 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
33 |
+
2024-08-12 14:30:26,967 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
34 |
+
2024-08-12 14:30:29,084 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9446, 3.1948, 3.3669, 2.9388], device='cuda:0')
|
35 |
+
2024-08-12 14:30:34,228 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
36 |
+
2024-08-12 14:30:41,498 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
37 |
+
2024-08-12 14:30:48,767 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
38 |
+
2024-08-12 14:30:56,037 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
39 |
+
2024-08-12 14:30:56,363 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9286, 3.0788, 3.2374, 2.7937], device='cuda:0')
|
40 |
+
2024-08-12 14:31:03,398 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
41 |
+
2024-08-12 14:31:10,763 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
42 |
+
2024-08-12 14:31:18,231 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
43 |
+
2024-08-12 14:31:18,933 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
44 |
+
2024-08-12 14:31:20,816 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.444418861729792
|
45 |
+
2024-08-12 14:31:20,816 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-212000-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-13-10-36-05
ADDED
@@ -0,0 +1,44 @@
1 |
+
2024-08-13 10:36:05,896 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-13 10:36:05,896 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 212000, 'avg': 1, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-212000-avg-1-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-13 10:36:05,896 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-13 10:36:06,339 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-208000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-212000.pt
|
5 |
+
2024-08-13 10:36:27,989 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-13 10:36:27,989 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-13 10:36:28,201 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-13 10:36:28,659 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-13 10:36:39,402 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-13 10:36:44,200 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7366, 1.8888, 2.0098, 1.9449], device='cuda:0')
|
11 |
+
2024-08-13 10:36:48,060 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
12 |
+
2024-08-13 10:36:57,618 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
13 |
+
2024-08-13 10:37:07,856 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
14 |
+
2024-08-13 10:37:10,361 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9014, 2.4892, 1.9077, 1.8023], device='cuda:0')
|
15 |
+
2024-08-13 10:37:17,958 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
16 |
+
2024-08-13 10:37:27,554 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
17 |
+
2024-08-13 10:37:36,254 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
18 |
+
2024-08-13 10:37:43,662 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1169, 1.8231, 1.6520, 1.5892], device='cuda:0')
|
19 |
+
2024-08-13 10:37:45,083 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
20 |
+
2024-08-13 10:37:49,393 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0414, 4.8483, 4.9327, 4.9859], device='cuda:0')
|
21 |
+
2024-08-13 10:37:53,141 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
22 |
+
2024-08-13 10:37:57,384 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([8.3769e-04, 9.1416e-04, 1.2606e-03, 3.5204e+00, 3.6819e-03, 2.5346e-02,
|
23 |
+
2.6215e-02, 2.9305e-02], device='cuda:0')
|
24 |
+
2024-08-13 10:38:00,784 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
25 |
+
2024-08-13 10:38:08,339 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
26 |
+
2024-08-13 10:38:16,663 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
27 |
+
2024-08-13 10:38:24,733 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
28 |
+
2024-08-13 10:38:33,104 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
29 |
+
2024-08-13 10:38:41,076 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
30 |
+
2024-08-13 10:38:48,677 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
31 |
+
2024-08-13 10:38:56,703 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
32 |
+
2024-08-13 10:39:04,545 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
33 |
+
2024-08-13 10:39:11,322 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
34 |
+
2024-08-13 10:39:19,003 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
35 |
+
2024-08-13 10:39:23,160 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5210, 2.1105, 2.2562, 2.1534], device='cuda:0')
|
36 |
+
2024-08-13 10:39:26,862 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
37 |
+
2024-08-13 10:39:33,744 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
38 |
+
2024-08-13 10:39:41,116 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
39 |
+
2024-08-13 10:39:49,025 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
40 |
+
2024-08-13 10:39:56,124 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
41 |
+
2024-08-13 10:40:03,383 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
42 |
+
2024-08-13 10:40:03,859 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
43 |
+
2024-08-13 10:40:05,502 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.44835602392952234
|
44 |
+
2024-08-13 10:40:05,502 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-220000-avg-2-use-averaged-model-chunk-size-16-left-context-frames-256-2024-08-13-16-38-15
ADDED
@@ -0,0 +1,49 @@
1 |
+
2024-08-13 16:38:15,225 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-13 16:38:15,225 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 220000, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-220000-avg-2-use-averaged-model-chunk-size-16-left-context-frames-256'}
|
3 |
+
2024-08-13 16:38:15,226 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-13 16:38:15,623 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-212000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-220000.pt
|
5 |
+
2024-08-13 16:38:31,486 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-13 16:38:31,486 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-13 16:38:31,569 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-13 16:38:32,010 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-13 16:38:40,688 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-13 16:38:42,298 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7358, 1.4018, 1.6890, 1.1992, 1.1008, 1.5992, 2.0602, 1.1234],
|
11 |
+
device='cuda:0')
|
12 |
+
2024-08-13 16:38:47,591 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
13 |
+
2024-08-13 16:38:54,867 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
14 |
+
2024-08-13 16:38:57,240 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9784, 3.8372, 3.3383, 3.5979], device='cuda:0')
|
15 |
+
2024-08-13 16:39:01,526 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
16 |
+
2024-08-13 16:39:05,561 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7384, 1.4156, 1.7056, 1.2035, 1.0885, 1.6827, 2.0982, 1.0984],
|
17 |
+
device='cuda:0')
|
18 |
+
2024-08-13 16:39:08,159 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
19 |
+
2024-08-13 16:39:10,909 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9219, 3.1374, 3.0633, 2.8094], device='cuda:0')
|
20 |
+
2024-08-13 16:39:14,359 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
21 |
+
2024-08-13 16:39:20,472 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
22 |
+
2024-08-13 16:39:26,484 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
23 |
+
2024-08-13 16:39:32,680 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
24 |
+
2024-08-13 16:39:39,083 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
25 |
+
2024-08-13 16:39:45,616 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
26 |
+
2024-08-13 16:39:52,215 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
27 |
+
2024-08-13 16:39:58,395 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
28 |
+
2024-08-13 16:40:04,818 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
29 |
+
2024-08-13 16:40:10,983 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
30 |
+
2024-08-13 16:40:16,843 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
31 |
+
2024-08-13 16:40:17,104 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1991e-05, 1.2486e-03, 3.9054e-03, 3.5658e+00, 1.2145e-02, 2.3230e-02,
|
32 |
+
1.1746e-02, 3.2749e-02], device='cuda:0')
|
33 |
+
2024-08-13 16:40:22,759 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
34 |
+
2024-08-13 16:40:25,176 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5530, 3.8580, 4.3436, 4.4815], device='cuda:0')
|
35 |
+
2024-08-13 16:40:28,693 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
36 |
+
2024-08-13 16:40:35,020 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
37 |
+
2024-08-13 16:40:41,175 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
38 |
+
2024-08-13 16:40:47,216 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
39 |
+
2024-08-13 16:40:53,468 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
40 |
+
2024-08-13 16:40:59,506 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
41 |
+
2024-08-13 16:41:05,595 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
42 |
+
2024-08-13 16:41:07,660 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.4657, 2.0241, 2.2700, 2.0440], device='cuda:0')
|
43 |
+
2024-08-13 16:41:09,754 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0016, 3.8464, 3.3515, 3.6829], device='cuda:0')
|
44 |
+
2024-08-13 16:41:11,654 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
45 |
+
2024-08-13 16:41:12,824 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9717, 3.7848, 3.3097, 3.6519], device='cuda:0')
|
46 |
+
2024-08-13 16:41:17,859 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
47 |
+
2024-08-13 16:41:18,401 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
48 |
+
2024-08-13 16:41:19,853 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.45038568052483735
|
49 |
+
2024-08-13 16:41:19,853 INFO [inference_audio_tagging.py:456] Done
|
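The averaging step recorded in these logs ("Calculating the averaged model over iteration checkpoints from checkpoint-212000.pt (excluded) to checkpoint-220000.pt") amounts to combining the weights of the most recent iteration checkpoints before decoding. A minimal sketch of plain checkpoint averaging, assuming PyTorch checkpoints that store the weights under a "model" key (as icefall checkpoints typically do) and placeholder file names; this is not the exact icefall routine:

import torch

def average_checkpoints(paths):
    # Element-wise mean of the "model" state_dicts stored in each checkpoint.
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    for k in avg:
        avg[k] /= len(paths)
    return avg

# The two checkpoints implied by --iter 220000 --avg 2 (placeholder paths):
# averaged = average_checkpoints(["checkpoint-216000.pt", "checkpoint-220000.pt"])
# model.load_state_dict(averaged, strict=False)

With --iter 220000 and --avg 2 this corresponds to combining the checkpoints at 216000 and 220000 iterations; icefall's use_averaged_model path works from running parameter averages stored inside the checkpoints rather than re-reading each one, so the sketch above only illustrates the idea.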
inference_audio_tagging/log-decode-iter-220000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-13-16-07-33
ADDED
@@ -0,0 +1,3 @@
1 |
+
2024-08-13 16:07:33,100 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-13 16:07:33,100 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 220000, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-220000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-13 16:07:33,101 INFO [inference_audio_tagging.py:324] About to create model
|
inference_audio_tagging/log-decode-iter-220000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-13-16-07-59
ADDED
@@ -0,0 +1,40 @@
1 |
+
2024-08-13 16:07:59,024 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-13 16:07:59,025 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 220000, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-220000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-13 16:07:59,025 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-13 16:07:59,513 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-212000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-220000.pt
|
5 |
+
2024-08-13 16:08:28,031 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-13 16:08:28,031 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-13 16:08:28,157 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-13 16:08:28,715 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-13 16:08:35,468 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-13 16:08:49,416 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-13 16:08:57,889 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1973, 1.5526, 1.9500, 2.3412], device='cuda:0')
|
12 |
+
2024-08-13 16:09:04,949 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
13 |
+
2024-08-13 16:09:17,809 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
14 |
+
2024-08-13 16:09:31,179 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
15 |
+
2024-08-13 16:09:44,307 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
16 |
+
2024-08-13 16:09:54,946 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
17 |
+
2024-08-13 16:09:57,934 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9298, 3.7535, 3.2980, 3.7303], device='cuda:0')
|
18 |
+
2024-08-13 16:10:02,091 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
19 |
+
2024-08-13 16:10:09,581 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
20 |
+
2024-08-13 16:10:17,048 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
21 |
+
2024-08-13 16:10:24,829 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
22 |
+
2024-08-13 16:10:32,639 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
23 |
+
2024-08-13 16:10:38,069 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
24 |
+
2024-08-13 16:10:41,787 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
25 |
+
2024-08-13 16:10:45,569 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
26 |
+
2024-08-13 16:10:49,444 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
27 |
+
2024-08-13 16:10:53,233 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
28 |
+
2024-08-13 16:10:56,985 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
29 |
+
2024-08-13 16:11:00,855 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
30 |
+
2024-08-13 16:11:04,582 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
31 |
+
2024-08-13 16:11:08,663 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
32 |
+
2024-08-13 16:11:12,482 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
33 |
+
2024-08-13 16:11:16,270 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
34 |
+
2024-08-13 16:11:20,073 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
35 |
+
2024-08-13 16:11:23,726 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6751, 3.9433, 4.4075, 4.5739], device='cuda:0')
|
36 |
+
2024-08-13 16:11:24,016 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
37 |
+
2024-08-13 16:11:27,824 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
38 |
+
2024-08-13 16:11:28,274 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
39 |
+
2024-08-13 16:11:29,941 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.45195257883130513
|
40 |
+
2024-08-13 16:11:29,941 INFO [inference_audio_tagging.py:456] Done
|
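Each of these runs ends with a line such as "mAP for audioset eval is: 0.45195…", i.e. mean average precision over the 527 AudioSet event classes computed from the collected logits. A minimal sketch of that metric, assuming scikit-learn and placeholder logits/labels arrays of shape [num_cuts, 527]; this is not the script's exact code:

import numpy as np
from sklearn.metrics import average_precision_score

def audioset_map(logits: np.ndarray, labels: np.ndarray) -> float:
    # Sigmoid turns per-class logits into scores; mAP is the macro average
    # of per-class average precision over the 527 event classes.
    scores = 1.0 / (1.0 + np.exp(-logits))
    return float(average_precision_score(labels, scores, average="macro"))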
inference_audio_tagging/log-decode-iter-224000-avg-2-use-averaged-model-2024-08-17-12-43-23
ADDED
@@ -0,0 +1,52 @@
1 |
+
2024-08-17 12:43:23,663 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-17 12:43:23,664 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-clean', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 224000, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-224000-avg-2-use-averaged-model'}
|
3 |
+
2024-08-17 12:43:23,664 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-17 12:43:24,005 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-216000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-224000.pt
|
5 |
+
2024-08-17 12:43:44,405 INFO [inference_audio_tagging.py:421] Number of model parameters: 65577734
|
6 |
+
2024-08-17 12:43:44,405 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-17 12:43:44,461 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-17 12:43:44,877 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-17 12:43:51,163 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-17 12:44:01,995 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8713, 3.6317, 3.3175, 2.2232, 1.5292, 3.9032, 3.4263, 1.3087],
|
11 |
+
device='cuda:0')
|
12 |
+
2024-08-17 12:44:02,041 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
13 |
+
2024-08-17 12:44:06,557 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.7718, 5.4808, 5.8920, 5.7851], device='cuda:0')
|
14 |
+
2024-08-17 12:44:12,645 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.6673, 5.4388, 5.5596, 5.5677], device='cuda:0')
|
15 |
+
2024-08-17 12:44:13,304 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
16 |
+
2024-08-17 12:44:24,934 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
17 |
+
2024-08-17 12:44:34,114 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
18 |
+
2024-08-17 12:44:44,486 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9751, 5.7112, 5.8767, 5.9406], device='cuda:0')
|
19 |
+
2024-08-17 12:44:44,575 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
20 |
+
2024-08-17 12:44:53,934 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
21 |
+
2024-08-17 12:45:02,227 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
22 |
+
2024-08-17 12:45:11,660 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
23 |
+
2024-08-17 12:45:12,417 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9600, 3.7429, 3.9020, 2.5979, 2.6934, 3.1868, 3.1089, 3.6281],
|
24 |
+
device='cuda:0')
|
25 |
+
2024-08-17 12:45:12,920 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6521e-04, 1.2800e-01, 1.1459e-03, 4.7469e-02, 6.7336e-09, 2.3094e-01,
|
26 |
+
3.4149e-02, 3.9689e-02], device='cuda:0')
|
27 |
+
2024-08-17 12:45:15,161 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5506, 4.4263, 4.4228, 3.6450], device='cuda:0')
|
28 |
+
2024-08-17 12:45:21,540 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
29 |
+
2024-08-17 12:45:22,091 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.5398, 4.4905, 0.5230, 4.5585], device='cuda:0')
|
30 |
+
2024-08-17 12:45:30,603 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
31 |
+
2024-08-17 12:45:39,737 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
32 |
+
2024-08-17 12:45:49,431 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
33 |
+
2024-08-17 12:45:59,198 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
34 |
+
2024-08-17 12:46:07,822 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
35 |
+
2024-08-17 12:46:15,828 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2153, 4.6042, 4.0496, 3.9721], device='cuda:0')
|
36 |
+
2024-08-17 12:46:16,360 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
37 |
+
2024-08-17 12:46:17,968 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8466, 4.1769, 4.9373, 3.8322], device='cuda:0')
|
38 |
+
2024-08-17 12:46:25,907 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
39 |
+
2024-08-17 12:46:36,157 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
40 |
+
2024-08-17 12:46:45,099 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
41 |
+
2024-08-17 12:46:50,766 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9510, 3.7431, 3.8880, 2.5511, 2.5861, 3.1840, 3.1143, 3.6279],
|
42 |
+
device='cuda:0')
|
43 |
+
2024-08-17 12:46:53,611 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
44 |
+
2024-08-17 12:47:03,010 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
45 |
+
2024-08-17 12:47:12,878 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
46 |
+
2024-08-17 12:47:19,001 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
47 |
+
2024-08-17 12:47:24,348 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
48 |
+
2024-08-17 12:47:30,461 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
49 |
+
2024-08-17 12:47:36,543 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
50 |
+
2024-08-17 12:47:37,272 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
51 |
+
2024-08-17 12:47:38,654 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.00691143397780095
|
52 |
+
2024-08-17 12:47:38,654 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-224000-avg-2-use-averaged-model-2024-08-17-12-48-01
ADDED
@@ -0,0 +1,46 @@
1 |
+
2024-08-17 12:48:01,618 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-17 12:48:01,618 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-clean', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 224000, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': False, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-224000-avg-2-use-averaged-model'}
|
3 |
+
2024-08-17 12:48:01,618 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-17 12:48:01,966 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-216000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-224000.pt
|
5 |
+
2024-08-17 12:48:05,206 INFO [inference_audio_tagging.py:421] Number of model parameters: 65577734
|
6 |
+
2024-08-17 12:48:05,206 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-17 12:48:05,253 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-17 12:48:05,664 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-17 12:48:11,858 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-17 12:48:19,317 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-17 12:48:26,916 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
12 |
+
2024-08-17 12:48:33,866 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
13 |
+
2024-08-17 12:48:40,924 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
14 |
+
2024-08-17 12:48:47,522 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
15 |
+
2024-08-17 12:48:54,118 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
16 |
+
2024-08-17 12:49:00,564 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
17 |
+
2024-08-17 12:49:02,476 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2148, 2.8000, 3.8232, 4.1030], device='cuda:0')
|
18 |
+
2024-08-17 12:49:06,932 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
19 |
+
2024-08-17 12:49:08,285 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.9429, 5.7100, 5.8488, 5.9095], device='cuda:0')
|
20 |
+
2024-08-17 12:49:13,280 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
21 |
+
2024-08-17 12:49:20,088 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
22 |
+
2024-08-17 12:49:26,874 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
23 |
+
2024-08-17 12:49:33,568 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
24 |
+
2024-08-17 12:49:40,057 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
25 |
+
2024-08-17 12:49:46,304 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
26 |
+
2024-08-17 12:49:52,588 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
27 |
+
2024-08-17 12:49:58,863 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
28 |
+
2024-08-17 12:50:05,083 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
29 |
+
2024-08-17 12:50:06,261 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.8057, 5.5924, 5.9777, 5.9412], device='cuda:0')
|
30 |
+
2024-08-17 12:50:07,625 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.7783, 4.0281, 4.8420, 4.0611], device='cuda:0')
|
31 |
+
2024-08-17 12:50:11,477 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
32 |
+
2024-08-17 12:50:17,910 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
33 |
+
2024-08-17 12:50:24,172 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
34 |
+
2024-08-17 12:50:28,935 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8829, 3.4486, 3.6587, 2.9680, 3.7674, 3.2865, 3.6840, 3.7479],
|
35 |
+
device='cuda:0')
|
36 |
+
2024-08-17 12:50:30,309 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
37 |
+
2024-08-17 12:50:36,560 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
38 |
+
2024-08-17 12:50:42,639 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
39 |
+
2024-08-17 12:50:45,167 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.2130, 2.8556, 3.8183, 4.1412], device='cuda:0')
|
40 |
+
2024-08-17 12:50:47,648 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([0.0018, 0.1121, 0.0003, 0.0496, 0.0003, 0.2586, 0.0050, 0.1315],
|
41 |
+
device='cuda:0')
|
42 |
+
2024-08-17 12:50:48,883 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
43 |
+
2024-08-17 12:50:55,217 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
44 |
+
2024-08-17 12:50:55,751 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
45 |
+
2024-08-17 12:50:57,161 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.007296872691144849
|
46 |
+
2024-08-17 12:50:57,161 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-256000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-14-09-26-56
ADDED
@@ -0,0 +1,47 @@
1 |
+
2024-08-14 09:26:56,233 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-14 09:26:56,234 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 256000, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-256000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-14 09:26:56,234 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-14 09:26:56,708 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-248000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-256000.pt
|
5 |
+
2024-08-14 09:27:19,571 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-14 09:27:19,571 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-14 09:27:19,817 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-14 09:27:20,313 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-14 09:27:33,098 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-14 09:27:41,509 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-14 09:27:47,829 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7219, 1.5406, 1.6569, 1.2398, 1.2325, 1.7129, 2.1135, 1.2056],
|
12 |
+
device='cuda:0')
|
13 |
+
2024-08-14 09:27:50,282 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
14 |
+
2024-08-14 09:27:58,670 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
15 |
+
2024-08-14 09:28:02,259 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2015, 1.8229, 2.0339, 2.4653], device='cuda:0')
|
16 |
+
2024-08-14 09:28:04,323 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9585, 3.2798, 2.2416, 3.8481], device='cuda:0')
|
17 |
+
2024-08-14 09:28:06,358 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
18 |
+
2024-08-14 09:28:13,823 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
19 |
+
2024-08-14 09:28:20,770 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
20 |
+
2024-08-14 09:28:27,690 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
21 |
+
2024-08-14 09:28:35,157 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
22 |
+
2024-08-14 09:28:42,186 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
23 |
+
2024-08-14 09:28:49,611 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
24 |
+
2024-08-14 09:28:57,094 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
25 |
+
2024-08-14 09:29:04,581 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
26 |
+
2024-08-14 09:29:06,795 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0301, 3.9043, 3.3810, 3.6431], device='cuda:0')
|
27 |
+
2024-08-14 09:29:12,305 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
28 |
+
2024-08-14 09:29:19,831 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
29 |
+
2024-08-14 09:29:20,937 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5099, 2.1304, 2.2948, 2.1047], device='cuda:0')
|
30 |
+
2024-08-14 09:29:26,753 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
31 |
+
2024-08-14 09:29:33,765 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
32 |
+
2024-08-14 09:29:40,913 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
33 |
+
2024-08-14 09:29:48,072 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
34 |
+
2024-08-14 09:29:54,987 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
35 |
+
2024-08-14 09:30:02,187 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
36 |
+
2024-08-14 09:30:09,244 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
37 |
+
2024-08-14 09:30:13,333 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5243, 2.1675, 2.2982, 2.2210], device='cuda:0')
|
38 |
+
2024-08-14 09:30:15,962 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
39 |
+
2024-08-14 09:30:23,079 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
40 |
+
2024-08-14 09:30:25,585 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9858, 3.3134, 3.3545, 3.0993], device='cuda:0')
|
41 |
+
2024-08-14 09:30:30,000 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
42 |
+
2024-08-14 09:30:37,032 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
43 |
+
2024-08-14 09:30:37,165 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7819, 1.5824, 1.6678, 1.2826, 1.1834, 1.7101, 2.2432, 1.2719],
|
44 |
+
device='cuda:0')
|
45 |
+
2024-08-14 09:30:37,743 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
46 |
+
2024-08-14 09:30:39,579 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.45542812467339416
|
47 |
+
2024-08-14 09:30:39,579 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-272000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-14-15-36-47
ADDED
@@ -0,0 +1,47 @@
1 |
+
2024-08-14 15:36:47,183 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-14 15:36:47,183 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': 'a6c2f7a4-dirty', 'icefall-git-date': 'Thu Aug 8 16:21:21 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 272000, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-272000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-14 15:36:47,184 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-14 15:36:47,652 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-264000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-272000.pt
|
5 |
+
2024-08-14 15:37:08,888 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-14 15:37:08,889 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-14 15:37:08,942 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-14 15:37:09,436 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-14 15:37:19,448 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-14 15:37:27,216 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-14 15:37:27,850 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0499, 4.8285, 4.9338, 4.9839], device='cuda:0')
|
12 |
+
2024-08-14 15:37:35,567 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
13 |
+
2024-08-14 15:37:43,489 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
14 |
+
2024-08-14 15:37:50,345 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
15 |
+
2024-08-14 15:37:57,953 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
16 |
+
2024-08-14 15:38:04,666 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
17 |
+
2024-08-14 15:38:11,383 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
18 |
+
2024-08-14 15:38:18,421 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
19 |
+
2024-08-14 15:38:25,117 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
20 |
+
2024-08-14 15:38:32,571 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
21 |
+
2024-08-14 15:38:39,768 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
22 |
+
2024-08-14 15:38:46,601 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
23 |
+
2024-08-14 15:38:53,831 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
24 |
+
2024-08-14 15:39:00,890 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
25 |
+
2024-08-14 15:39:07,469 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
26 |
+
2024-08-14 15:39:08,344 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1708, 1.6763, 1.7162, 1.7789], device='cuda:0')
|
27 |
+
2024-08-14 15:39:13,867 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
28 |
+
2024-08-14 15:39:20,501 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
29 |
+
2024-08-14 15:39:22,509 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7966, 1.5144, 1.7197, 1.2377, 1.1744, 1.8064, 2.2673, 1.2229],
|
30 |
+
device='cuda:0')
|
31 |
+
2024-08-14 15:39:26,620 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9911, 3.3313, 2.1503, 3.7904], device='cuda:0')
|
32 |
+
2024-08-14 15:39:27,425 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
33 |
+
2024-08-14 15:39:34,416 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
34 |
+
2024-08-14 15:39:39,372 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7683, 1.8626, 1.8632, 1.7903, 2.3835, 1.8736, 1.8106, 1.8140],
|
35 |
+
device='cuda:0')
|
36 |
+
2024-08-14 15:39:41,105 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
37 |
+
2024-08-14 15:39:43,680 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0456, 4.8189, 4.9261, 4.9940], device='cuda:0')
|
38 |
+
2024-08-14 15:39:47,825 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
39 |
+
2024-08-14 15:39:54,390 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
40 |
+
2024-08-14 15:40:01,038 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
41 |
+
2024-08-14 15:40:01,530 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.0675, 2.5988, 2.2924, 2.0476], device='cuda:0')
|
42 |
+
2024-08-14 15:40:07,616 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0205, 3.3193, 2.3530, 3.8492], device='cuda:0')
|
43 |
+
2024-08-14 15:40:07,716 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
44 |
+
2024-08-14 15:40:14,253 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
45 |
+
2024-08-14 15:40:14,889 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
46 |
+
2024-08-14 15:40:16,705 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.45548497993926523
|
47 |
+
2024-08-14 15:40:16,706 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-312000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-15-10-22-43
ADDED
@@ -0,0 +1,49 @@
1 |
+
2024-08-15 10:22:43,400 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-15 10:22:43,400 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-clean', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 312000, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-312000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-15 10:22:43,400 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-15 10:22:43,846 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-304000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-312000.pt
|
5 |
+
2024-08-15 10:23:06,205 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-15 10:23:06,206 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-15 10:23:06,464 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-15 10:23:06,996 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-15 10:23:19,647 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-15 10:23:27,230 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-15 10:23:30,896 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7672, 2.1088, 1.8151, 1.3543, 1.5991, 1.5252, 2.0788, 1.7250],
|
12 |
+
device='cuda:0')
|
13 |
+
2024-08-15 10:23:35,819 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
14 |
+
2024-08-15 10:23:43,687 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
15 |
+
2024-08-15 10:23:44,763 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1439, 3.0719, 3.2177, 3.1074], device='cuda:0')
|
16 |
+
2024-08-15 10:23:51,148 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
17 |
+
2024-08-15 10:23:55,200 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0180, 3.8565, 3.3195, 3.7030], device='cuda:0')
|
18 |
+
2024-08-15 10:23:57,870 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
19 |
+
2024-08-15 10:24:00,524 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1575, 1.6678, 1.8357, 1.6334], device='cuda:0')
|
20 |
+
2024-08-15 10:24:04,865 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
21 |
+
2024-08-15 10:24:05,071 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0013, 3.1210, 3.1729, 3.0276], device='cuda:0')
|
22 |
+
2024-08-15 10:24:11,941 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
23 |
+
2024-08-15 10:24:19,034 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
24 |
+
2024-08-15 10:24:26,155 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
25 |
+
2024-08-15 10:24:33,596 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
26 |
+
2024-08-15 10:24:40,792 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
27 |
+
2024-08-15 10:24:46,500 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0305, 4.7928, 4.9185, 4.9848], device='cuda:0')
|
28 |
+
2024-08-15 10:24:47,957 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
29 |
+
2024-08-15 10:24:55,405 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
30 |
+
2024-08-15 10:25:02,506 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
31 |
+
2024-08-15 10:25:09,386 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
32 |
+
2024-08-15 10:25:16,294 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
33 |
+
2024-08-15 10:25:23,187 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
34 |
+
2024-08-15 10:25:30,148 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
35 |
+
2024-08-15 10:25:31,105 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6344, 3.8733, 4.4210, 4.4999], device='cuda:0')
|
36 |
+
2024-08-15 10:25:36,816 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
37 |
+
2024-08-15 10:25:42,041 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7904, 2.1233, 1.6198, 1.2424, 1.5676, 1.5375, 2.0214, 1.8020],
|
38 |
+
device='cuda:0')
|
39 |
+
2024-08-15 10:25:43,395 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
40 |
+
2024-08-15 10:25:50,326 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
41 |
+
2024-08-15 10:25:50,909 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8078, 2.0394, 1.8046, 1.2229, 1.5979, 1.5856, 1.9052, 1.8364],
|
42 |
+
device='cuda:0')
|
43 |
+
2024-08-15 10:25:57,145 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
44 |
+
2024-08-15 10:26:03,897 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
45 |
+
2024-08-15 10:26:10,467 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
46 |
+
2024-08-15 10:26:17,276 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
47 |
+
2024-08-15 10:26:18,066 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
48 |
+
2024-08-15 10:26:19,786 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4582703411376689
|
49 |
+
2024-08-15 10:26:19,786 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-332000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-16-14-25-49
ADDED
@@ -0,0 +1,48 @@
1 |
+
2024-08-16 14:25:49,442 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-16 14:25:49,442 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-clean', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 332000, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-332000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-16 14:25:49,442 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-16 14:25:49,787 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-324000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-332000.pt
|
5 |
+
2024-08-16 14:31:43,155 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-16 14:31:43,155 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-16 14:31:43,198 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-16 14:31:43,604 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-16 14:32:38,552 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-16 14:40:41,249 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-16 14:47:27,967 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.9663, 4.7687, 4.7854, 4.8755], device='cuda:0')
|
12 |
+
2024-08-16 14:48:51,786 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
13 |
+
2024-08-16 14:56:42,074 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
14 |
+
2024-08-16 15:05:07,149 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
15 |
+
2024-08-16 15:13:44,624 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
16 |
+
2024-08-16 15:21:58,951 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
17 |
+
2024-08-16 15:23:45,923 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([7.3986e-04, 5.7475e-03, 8.6311e-04, 3.5204e+00, 1.3223e-04, 2.9355e-02,
|
18 |
+
3.6493e-02, 4.3895e-02], device='cuda:0')
|
19 |
+
2024-08-16 15:32:10,740 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
20 |
+
2024-08-16 15:37:12,969 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0175, 3.8674, 3.4225, 3.7672], device='cuda:0')
|
21 |
+
2024-08-16 15:40:31,337 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
22 |
+
2024-08-16 15:48:32,448 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
23 |
+
2024-08-16 15:50:13,329 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0558, 4.7987, 4.9565, 5.0084], device='cuda:0')
|
24 |
+
2024-08-16 15:57:00,541 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
25 |
+
2024-08-16 16:05:23,095 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
26 |
+
2024-08-16 16:08:44,937 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0177, 3.1595, 3.1460, 3.0573], device='cuda:0')
|
27 |
+
2024-08-16 16:09:35,496 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6537, 3.8775, 4.4542, 4.5006], device='cuda:0')
|
28 |
+
2024-08-16 16:13:47,961 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
29 |
+
2024-08-16 16:22:24,691 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
30 |
+
2024-08-16 16:30:38,291 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
31 |
+
2024-08-16 16:39:04,705 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
32 |
+
2024-08-16 16:41:35,878 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7632, 1.9858, 2.1763, 2.0882], device='cuda:0')
|
33 |
+
2024-08-16 16:47:30,653 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
34 |
+
2024-08-16 16:55:56,017 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
35 |
+
2024-08-16 17:04:22,786 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
36 |
+
2024-08-16 17:12:48,439 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
37 |
+
2024-08-16 17:21:14,846 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
38 |
+
2024-08-16 17:29:39,075 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
39 |
+
2024-08-16 17:38:03,640 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
40 |
+
2024-08-16 17:46:28,142 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
41 |
+
2024-08-16 17:47:20,087 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8109, 1.9455, 2.0168, 1.9513, 2.3885, 1.8640, 2.0212, 1.8049],
|
42 |
+
device='cuda:0')
|
43 |
+
2024-08-16 17:54:54,141 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
44 |
+
2024-08-16 17:56:31,504 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5979, 1.8924, 2.2459, 1.0960], device='cuda:0')
|
45 |
+
2024-08-16 18:03:21,491 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
46 |
+
2024-08-16 18:03:41,745 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
47 |
+
2024-08-16 18:03:43,122 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4552332877710252
|
48 |
+
2024-08-16 18:03:43,122 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-332000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-17-12-37-47
ADDED
@@ -0,0 +1,41 @@
1 |
+
2024-08-17 12:37:47,555 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-17 12:37:47,555 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-clean', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 332000, 'avg': 2, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-332000-avg-2-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-17 12:37:47,555 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-17 12:37:47,901 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-324000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-332000.pt
|
5 |
+
2024-08-17 12:37:50,772 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-17 12:37:50,772 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-17 12:37:50,811 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-17 12:37:51,225 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-17 12:37:56,590 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-17 12:38:01,392 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-17 12:38:02,783 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6883, 3.9024, 4.4877, 4.5186], device='cuda:0')
|
12 |
+
2024-08-17 12:38:06,506 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
13 |
+
2024-08-17 12:38:08,387 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1443, 1.8318, 1.8563, 1.8249], device='cuda:0')
|
14 |
+
2024-08-17 12:38:11,325 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
15 |
+
2024-08-17 12:38:15,773 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
16 |
+
2024-08-17 12:38:20,275 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
17 |
+
2024-08-17 12:38:24,107 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
18 |
+
2024-08-17 12:38:28,251 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
19 |
+
2024-08-17 12:38:32,438 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
20 |
+
2024-08-17 12:38:36,571 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
21 |
+
2024-08-17 12:38:40,969 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
22 |
+
2024-08-17 12:38:45,564 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
23 |
+
2024-08-17 12:38:49,969 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
24 |
+
2024-08-17 12:38:54,342 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
25 |
+
2024-08-17 12:38:58,610 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
26 |
+
2024-08-17 12:39:02,725 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
27 |
+
2024-08-17 12:39:07,051 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
28 |
+
2024-08-17 12:39:12,158 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
29 |
+
2024-08-17 12:39:12,720 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1644, 1.8228, 1.9461, 1.7229], device='cuda:0')
|
30 |
+
2024-08-17 12:39:17,824 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
31 |
+
2024-08-17 12:39:19,936 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1940, 1.9554, 1.8000, 1.8779], device='cuda:0')
|
32 |
+
2024-08-17 12:39:23,738 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
33 |
+
2024-08-17 12:39:30,370 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
34 |
+
2024-08-17 12:39:36,975 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
35 |
+
2024-08-17 12:39:43,096 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
36 |
+
2024-08-17 12:39:49,069 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
37 |
+
2024-08-17 12:39:55,515 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
38 |
+
2024-08-17 12:40:01,454 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
39 |
+
2024-08-17 12:40:01,892 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
40 |
+
2024-08-17 12:40:03,224 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4552332877710252
|
41 |
+
2024-08-17 12:40:03,224 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-360000-avg-3-chunk-size-16-left-context-frames-128-2024-08-19-14-45-34
ADDED
@@ -0,0 +1,47 @@
1 |
+
2024-08-19 14:45:34,875 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:45:34,876 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 360000, 'avg': 3, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '128', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-360000-avg-3-chunk-size-16-left-context-frames-128'}
|
3 |
+
2024-08-19 14:45:34,876 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:45:35,228 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt']
|
5 |
+
2024-08-19 14:45:43,991 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:45:43,992 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:45:44,047 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:45:44,455 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:45:53,758 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:46:00,193 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:46:07,986 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
12 |
+
2024-08-19 14:46:10,269 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1514, 2.1762, 2.1618, 2.1262], device='cuda:0')
|
13 |
+
2024-08-19 14:46:11,725 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6205, 3.4734, 3.0762, 3.3875], device='cuda:0')
|
14 |
+
2024-08-19 14:46:15,534 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
15 |
+
2024-08-19 14:46:22,333 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
16 |
+
2024-08-19 14:46:28,843 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
17 |
+
2024-08-19 14:46:35,601 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
18 |
+
2024-08-19 14:46:41,891 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3648, 1.7014, 1.7502, 1.7427, 1.9580, 1.5974, 1.7750, 1.7042],
|
19 |
+
device='cuda:0')
|
20 |
+
2024-08-19 14:46:42,376 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2173, 1.6039, 2.2085, 0.8856], device='cuda:0')
|
21 |
+
2024-08-19 14:46:42,477 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
22 |
+
2024-08-19 14:46:43,922 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1960, 2.1279, 2.1224, 2.0989], device='cuda:0')
|
23 |
+
2024-08-19 14:46:48,986 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
24 |
+
2024-08-19 14:46:55,527 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
25 |
+
2024-08-19 14:47:02,284 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
26 |
+
2024-08-19 14:47:09,271 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
27 |
+
2024-08-19 14:47:09,375 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6740, 3.0526, 2.2697, 3.3074], device='cuda:0')
|
28 |
+
2024-08-19 14:47:16,076 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
29 |
+
2024-08-19 14:47:23,036 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
30 |
+
2024-08-19 14:47:29,679 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
31 |
+
2024-08-19 14:47:36,153 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
32 |
+
2024-08-19 14:47:37,549 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4560, 2.1751, 1.9414, 1.7934], device='cuda:0')
|
33 |
+
2024-08-19 14:47:42,504 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
34 |
+
2024-08-19 14:47:49,034 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
35 |
+
2024-08-19 14:47:55,583 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
36 |
+
2024-08-19 14:48:02,075 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
37 |
+
2024-08-19 14:48:08,554 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
38 |
+
2024-08-19 14:48:15,241 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
39 |
+
2024-08-19 14:48:22,194 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
40 |
+
2024-08-19 14:48:28,657 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
41 |
+
2024-08-19 14:48:35,192 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
42 |
+
2024-08-19 14:48:39,479 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([9.1739e-05, 4.8120e-03, 2.0679e-04, 3.8312e+00, 4.0365e-03, 2.8302e-02,
|
43 |
+
2.6785e-02, 4.6582e-03], device='cuda:0')
|
44 |
+
2024-08-19 14:48:41,359 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
45 |
+
2024-08-19 14:48:41,959 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
46 |
+
2024-08-19 14:48:43,343 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006251230414977811
|
47 |
+
2024-08-19 14:48:43,343 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-360000-avg-3-chunk-size-32-left-context-frames-256-2024-08-19-14-42-23
ADDED
@@ -0,0 +1,53 @@
1 |
+
2024-08-19 14:42:23,690 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:42:23,690 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 360000, 'avg': 3, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-360000-avg-3-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-19 14:42:23,691 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:42:24,067 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt']
|
5 |
+
2024-08-19 14:42:32,927 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:42:32,927 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:42:32,991 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:42:33,432 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:42:41,462 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:42:48,535 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:42:55,917 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
12 |
+
2024-08-19 14:43:03,016 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
13 |
+
2024-08-19 14:43:04,902 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5716, 2.0453, 2.2604, 1.1945], device='cuda:0')
|
14 |
+
2024-08-19 14:43:09,776 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
15 |
+
2024-08-19 14:43:14,346 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0205, 3.8998, 3.4281, 3.7416], device='cuda:0')
|
16 |
+
2024-08-19 14:43:16,399 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
17 |
+
2024-08-19 14:43:23,134 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
18 |
+
2024-08-19 14:43:29,596 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
19 |
+
2024-08-19 14:43:31,478 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7931, 1.8428, 1.8062, 1.7894, 2.3366, 1.7605, 1.9416, 1.8438],
|
20 |
+
device='cuda:0')
|
21 |
+
2024-08-19 14:43:31,591 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0214, 4.8166, 4.9611, 5.0007], device='cuda:0')
|
22 |
+
2024-08-19 14:43:36,213 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
23 |
+
2024-08-19 14:43:38,347 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1516, 1.7325, 1.8580, 1.7148], device='cuda:0')
|
24 |
+
2024-08-19 14:43:42,694 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
25 |
+
2024-08-19 14:43:49,658 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
26 |
+
2024-08-19 14:43:56,830 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
27 |
+
2024-08-19 14:44:03,455 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
28 |
+
2024-08-19 14:44:06,156 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0192, 3.1303, 3.1324, 3.0366], device='cuda:0')
|
29 |
+
2024-08-19 14:44:10,307 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
30 |
+
2024-08-19 14:44:16,799 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
31 |
+
2024-08-19 14:44:23,622 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
32 |
+
2024-08-19 14:44:29,871 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
33 |
+
2024-08-19 14:44:36,267 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
34 |
+
2024-08-19 14:44:37,604 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.8891e-06, 4.9229e-03, 4.8567e-05, 3.5204e+00, 4.6982e-03, 3.4259e-02,
|
35 |
+
2.6114e-02, 3.3677e-02], device='cuda:0')
|
36 |
+
2024-08-19 14:44:43,053 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
37 |
+
2024-08-19 14:44:44,546 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5299, 2.1826, 2.2746, 2.2623], device='cuda:0')
|
38 |
+
2024-08-19 14:44:45,653 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0366, 3.9335, 3.4611, 3.7657], device='cuda:0')
|
39 |
+
2024-08-19 14:44:47,839 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8047, 1.4974, 1.7249, 1.1864, 1.3694, 1.8371, 2.3136, 1.4361],
|
40 |
+
device='cuda:0')
|
41 |
+
2024-08-19 14:44:49,739 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
42 |
+
2024-08-19 14:44:56,209 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
43 |
+
2024-08-19 14:44:56,928 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8004, 1.8721, 1.9542, 1.9382, 2.3510, 1.7640, 1.9537, 1.9715],
|
44 |
+
device='cuda:0')
|
45 |
+
2024-08-19 14:45:02,376 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
46 |
+
2024-08-19 14:45:08,928 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
47 |
+
2024-08-19 14:45:15,687 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
48 |
+
2024-08-19 14:45:22,100 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
49 |
+
2024-08-19 14:45:23,711 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0311, 3.1373, 3.2597, 3.0196], device='cuda:0')
|
50 |
+
2024-08-19 14:45:28,416 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
51 |
+
2024-08-19 14:45:28,920 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
52 |
+
2024-08-19 14:45:30,246 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.0062545950171461395
|
53 |
+
2024-08-19 14:45:30,246 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-360000-avg-4-chunk-size-16-left-context-frames-128-2024-08-19-14-48-48
ADDED
@@ -0,0 +1,45 @@
1 |
+
2024-08-19 14:48:48,058 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:48:48,058 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 360000, 'avg': 4, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '128', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-360000-avg-4-chunk-size-16-left-context-frames-128'}
|
3 |
+
2024-08-19 14:48:48,059 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:48:48,410 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-348000.pt']
|
5 |
+
2024-08-19 14:48:58,446 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:48:58,446 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:48:58,496 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:48:58,902 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:49:06,022 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:49:11,859 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8919, 1.9659, 2.0038, 2.1935], device='cuda:0')
|
11 |
+
2024-08-19 14:49:13,344 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
12 |
+
2024-08-19 14:49:21,006 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
13 |
+
2024-08-19 14:49:28,009 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
14 |
+
2024-08-19 14:49:34,459 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
15 |
+
2024-08-19 14:49:41,170 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
16 |
+
2024-08-19 14:49:47,885 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
17 |
+
2024-08-19 14:49:51,827 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1873, 2.0366, 2.1540, 2.0702], device='cuda:0')
|
18 |
+
2024-08-19 14:49:54,232 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
19 |
+
2024-08-19 14:50:00,747 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
20 |
+
2024-08-19 14:50:07,337 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
21 |
+
2024-08-19 14:50:13,841 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
22 |
+
2024-08-19 14:50:20,806 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
23 |
+
2024-08-19 14:50:27,310 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
24 |
+
2024-08-19 14:50:34,044 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
25 |
+
2024-08-19 14:50:40,653 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
26 |
+
2024-08-19 14:50:46,903 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5513, 2.9677, 2.8968, 2.8970], device='cuda:0')
|
27 |
+
2024-08-19 14:50:46,931 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
28 |
+
2024-08-19 14:50:53,515 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
29 |
+
2024-08-19 14:50:59,946 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
30 |
+
2024-08-19 14:51:06,601 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
31 |
+
2024-08-19 14:51:12,349 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3881, 2.1953, 1.9652, 1.7687], device='cuda:0')
|
32 |
+
2024-08-19 14:51:13,064 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
33 |
+
2024-08-19 14:51:19,160 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
34 |
+
2024-08-19 14:51:25,290 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
35 |
+
2024-08-19 14:51:31,650 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
36 |
+
2024-08-19 14:51:38,098 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
37 |
+
2024-08-19 14:51:40,540 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6309, 3.5018, 3.1097, 3.4175], device='cuda:0')
|
38 |
+
2024-08-19 14:51:41,961 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6175, 3.4913, 3.0783, 3.3693], device='cuda:0')
|
39 |
+
2024-08-19 14:51:44,551 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4590, 1.8796, 1.6307, 1.2486, 1.4482, 1.4584, 1.7977, 1.6899],
|
40 |
+
device='cuda:0')
|
41 |
+
2024-08-19 14:51:44,603 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
42 |
+
2024-08-19 14:51:51,191 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
43 |
+
2024-08-19 14:51:51,812 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
44 |
+
2024-08-19 14:51:53,187 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006266084996028279
|
45 |
+
2024-08-19 14:51:53,188 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-360000-avg-4-chunk-size-32-left-context-frames-256-2024-08-18-01-34-58
ADDED
@@ -0,0 +1,43 @@
1 |
+
2024-08-18 01:34:58,522 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-18 01:34:58,522 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-clean', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 360000, 'avg': 4, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-360000-avg-4-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-18 01:34:58,522 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-18 01:34:58,865 INFO [inference_audio_tagging.py:353] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-348000.pt']
|
5 |
+
2024-08-18 01:35:35,256 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-18 01:35:35,256 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-18 01:35:35,314 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-18 01:35:35,729 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-18 01:35:43,527 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-18 01:35:51,124 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-18 01:35:59,309 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
12 |
+
2024-08-18 01:36:06,896 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
13 |
+
2024-08-18 01:36:13,999 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
14 |
+
2024-08-18 01:36:20,749 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1785, 1.8529, 1.8660, 1.7641], device='cuda:0')
|
15 |
+
2024-08-18 01:36:21,235 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
16 |
+
2024-08-18 01:36:28,472 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
17 |
+
2024-08-18 01:36:35,149 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
18 |
+
2024-08-18 01:36:37,877 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6286, 3.8694, 4.4968, 4.5054], device='cuda:0')
|
19 |
+
2024-08-18 01:36:41,495 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
20 |
+
2024-08-18 01:36:47,970 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
21 |
+
2024-08-18 01:36:55,169 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
22 |
+
2024-08-18 01:37:02,020 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
23 |
+
2024-08-18 01:37:08,797 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
24 |
+
2024-08-18 01:37:15,562 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
25 |
+
2024-08-18 01:37:22,782 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
26 |
+
2024-08-18 01:37:29,342 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
27 |
+
2024-08-18 01:37:36,148 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
28 |
+
2024-08-18 01:37:40,107 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6814, 3.9547, 4.4801, 4.5148], device='cuda:0')
|
29 |
+
2024-08-18 01:37:43,143 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
30 |
+
2024-08-18 01:37:49,886 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
31 |
+
2024-08-18 01:37:55,346 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0398, 3.0654, 3.0929, 2.9929], device='cuda:0')
|
32 |
+
2024-08-18 01:37:56,427 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
33 |
+
2024-08-18 01:38:03,027 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
34 |
+
2024-08-18 01:38:04,033 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7050, 2.1893, 2.1578, 1.9861], device='cuda:0')
|
35 |
+
2024-08-18 01:38:09,464 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
36 |
+
2024-08-18 01:38:16,149 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
37 |
+
2024-08-18 01:38:22,460 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
38 |
+
2024-08-18 01:38:29,247 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
39 |
+
2024-08-18 01:38:31,084 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6627, 3.9136, 4.4766, 4.5106], device='cuda:0')
|
40 |
+
2024-08-18 01:38:35,727 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
41 |
+
2024-08-18 01:38:36,248 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
42 |
+
2024-08-18 01:38:37,753 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.455545715973715
|
43 |
+
2024-08-18 01:38:37,753 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-360000-avg-4-chunk-size-32-left-context-frames-256-2024-08-19-14-45-34
ADDED
@@ -0,0 +1,54 @@
1 |
+
2024-08-19 14:45:34,887 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:45:34,887 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 360000, 'avg': 4, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-360000-avg-4-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-19 14:45:34,888 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:45:35,238 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-348000.pt']
|
5 |
+
2024-08-19 14:45:45,479 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:45:45,480 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:45:45,535 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:45:45,980 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:45:53,101 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:46:00,182 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:46:07,987 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
12 |
+
2024-08-19 14:46:15,539 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
13 |
+
2024-08-19 14:46:22,265 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1371, 1.7378, 1.7845, 1.8883], device='cuda:0')
|
14 |
+
2024-08-19 14:46:22,352 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
15 |
+
2024-08-19 14:46:25,502 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.6610, 2.1686, 2.1478, 1.9660], device='cuda:0')
|
16 |
+
2024-08-19 14:46:28,919 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
17 |
+
2024-08-19 14:46:35,602 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
18 |
+
2024-08-19 14:46:42,463 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
19 |
+
2024-08-19 14:46:46,242 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0424, 3.2116, 3.2670, 2.9530], device='cuda:0')
|
20 |
+
2024-08-19 14:46:48,985 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
21 |
+
2024-08-19 14:46:49,069 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9900, 3.8753, 3.4286, 3.7620], device='cuda:0')
|
22 |
+
2024-08-19 14:46:55,027 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([1.6708e-04, 5.6033e-03, 2.9184e-05, 3.5204e+00, 3.5862e-03, 3.6500e-02,
|
23 |
+
3.1549e-02, 3.9373e-02], device='cuda:0')
|
24 |
+
2024-08-19 14:46:55,524 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
25 |
+
2024-08-19 14:47:02,284 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
26 |
+
2024-08-19 14:47:09,265 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
27 |
+
2024-08-19 14:47:16,076 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
28 |
+
2024-08-19 14:47:22,974 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7983, 1.6149, 1.6895, 1.2183, 1.3283, 1.7911, 2.3341, 1.3121],
|
29 |
+
device='cuda:0')
|
30 |
+
2024-08-19 14:47:23,024 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
31 |
+
2024-08-19 14:47:29,610 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8131, 2.1109, 1.6780, 1.4145, 1.7030, 1.5408, 1.8389, 1.7901],
|
32 |
+
device='cuda:0')
|
33 |
+
2024-08-19 14:47:29,691 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
34 |
+
2024-08-19 14:47:36,177 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
35 |
+
2024-08-19 14:47:42,457 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
36 |
+
2024-08-19 14:47:44,237 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0170, 3.9432, 3.4828, 3.7634], device='cuda:0')
|
37 |
+
2024-08-19 14:47:46,176 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8165, 1.8768, 1.9821, 1.8578, 2.3580, 1.8195, 1.9136, 1.7514],
|
38 |
+
device='cuda:0')
|
39 |
+
2024-08-19 14:47:46,688 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6568, 3.8928, 4.4199, 4.5319], device='cuda:0')
|
40 |
+
2024-08-19 14:47:49,052 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
41 |
+
2024-08-19 14:47:51,829 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5969, 1.9597, 2.3921, 1.1524], device='cuda:0')
|
42 |
+
2024-08-19 14:47:55,583 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
43 |
+
2024-08-19 14:48:02,074 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
44 |
+
2024-08-19 14:48:08,554 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
45 |
+
2024-08-19 14:48:15,244 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
46 |
+
2024-08-19 14:48:22,204 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
47 |
+
2024-08-19 14:48:28,655 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
48 |
+
2024-08-19 14:48:33,773 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([1.6894e-05, 6.8658e-03, 6.2284e-04, 3.5204e+00, 4.7326e-03, 4.2937e-02,
|
49 |
+
3.4770e-02, 3.7411e-02], device='cuda:0')
|
50 |
+
2024-08-19 14:48:35,186 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
51 |
+
2024-08-19 14:48:41,471 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
52 |
+
2024-08-19 14:48:41,955 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
53 |
+
2024-08-19 14:48:43,284 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006256989198255971
|
54 |
+
2024-08-19 14:48:43,284 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-360000-avg-4-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-18-01-39-49
ADDED
@@ -0,0 +1,54 @@
1 |
+
2024-08-18 01:39:49,975 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-18 01:39:49,976 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-clean', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 360000, 'avg': 4, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-360000-avg-4-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-18 01:39:49,976 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-18 01:39:50,332 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-344000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt
|
5 |
+
2024-08-18 01:40:01,756 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-18 01:40:01,756 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-18 01:40:01,808 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-18 01:40:02,230 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-18 01:40:10,968 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-18 01:40:11,521 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7337, 2.0828, 2.1756, 1.9740], device='cuda:0')
|
11 |
+
2024-08-18 01:40:18,386 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
12 |
+
2024-08-18 01:40:19,938 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7543, 1.9684, 1.8354, 1.4377, 1.7168, 1.5420, 1.9884, 1.8234],
|
13 |
+
device='cuda:0')
|
14 |
+
2024-08-18 01:40:20,286 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9981, 3.3801, 2.3584, 3.8499], device='cuda:0')
|
15 |
+
2024-08-18 01:40:20,953 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.0392, 2.5494, 2.4251, 2.1908], device='cuda:0')
|
16 |
+
2024-08-18 01:40:26,366 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
17 |
+
2024-08-18 01:40:30,167 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0064, 4.7680, 4.8375, 4.8829], device='cuda:0')
|
18 |
+
2024-08-18 01:40:33,811 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
19 |
+
2024-08-18 01:40:40,998 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
20 |
+
2024-08-18 01:40:46,318 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0041, 3.0130, 3.2056, 2.9115], device='cuda:0')
|
21 |
+
2024-08-18 01:40:47,612 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
22 |
+
2024-08-18 01:40:52,258 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1503, 2.9979, 3.2428, 3.1529], device='cuda:0')
|
23 |
+
2024-08-18 01:40:54,290 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
24 |
+
2024-08-18 01:40:56,674 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8160, 1.6457, 1.7980, 1.2207, 1.4842, 1.8967, 2.3137, 1.4885],
|
25 |
+
device='cuda:0')
|
26 |
+
2024-08-18 01:41:00,317 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
27 |
+
2024-08-18 01:41:06,766 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
28 |
+
2024-08-18 01:41:13,142 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
29 |
+
2024-08-18 01:41:19,754 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
30 |
+
2024-08-18 01:41:24,980 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.5964, 2.1139, 2.1582, 2.0165], device='cuda:0')
|
31 |
+
2024-08-18 01:41:25,595 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6624, 3.8644, 4.4833, 4.5140], device='cuda:0')
|
32 |
+
2024-08-18 01:41:26,620 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
33 |
+
2024-08-18 01:41:32,848 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0198, 3.0941, 3.1419, 3.0465], device='cuda:0')
|
34 |
+
2024-08-18 01:41:33,066 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
35 |
+
2024-08-18 01:41:39,372 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
36 |
+
2024-08-18 01:41:39,898 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6113, 2.0692, 2.3331, 1.0845], device='cuda:0')
|
37 |
+
2024-08-18 01:41:45,902 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
38 |
+
2024-08-18 01:41:52,347 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
39 |
+
2024-08-18 01:41:58,460 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
40 |
+
2024-08-18 01:42:04,963 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
41 |
+
2024-08-18 01:42:09,474 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0545, 4.8139, 4.9753, 5.0198], device='cuda:0')
|
42 |
+
2024-08-18 01:42:11,440 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
43 |
+
2024-08-18 01:42:17,792 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
44 |
+
2024-08-18 01:42:24,111 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2696, 2.0030, 2.1678, 2.3193], device='cuda:0')
|
45 |
+
2024-08-18 01:42:24,353 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
46 |
+
2024-08-18 01:42:30,968 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
47 |
+
2024-08-18 01:42:31,984 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1311, 1.8622, 1.9359, 1.7651], device='cuda:0')
|
48 |
+
2024-08-18 01:42:37,397 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
49 |
+
2024-08-18 01:42:43,606 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
50 |
+
2024-08-18 01:42:49,851 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
51 |
+
2024-08-18 01:42:56,088 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
52 |
+
2024-08-18 01:42:56,655 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
53 |
+
2024-08-18 01:42:58,030 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.45901194132149314
|
54 |
+
2024-08-18 01:42:58,030 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-360000-avg-5-chunk-size-16-left-context-frames-128-2024-08-19-14-51-57
ADDED
@@ -0,0 +1,49 @@
1 |
+
2024-08-19 14:51:57,406 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-19 14:51:57,407 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 360000, 'avg': 5, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '128', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-360000-avg-5-chunk-size-16-left-context-frames-128'}
|
3 |
+
2024-08-19 14:51:57,407 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-19 14:51:57,769 INFO [inference_audio_tagging.py:353] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-348000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-344000.pt']
|
5 |
+
2024-08-19 14:52:10,585 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:52:10,586 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:52:10,630 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:52:11,041 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:52:18,592 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:52:24,764 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:52:26,766 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2125, 1.8090, 2.3108, 1.2153], device='cuda:0')
|
12 |
+
2024-08-19 14:52:30,769 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5696, 2.9937, 2.9685, 2.8535], device='cuda:0')
|
13 |
+
2024-08-19 14:52:31,395 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
14 |
+
2024-08-19 14:52:37,734 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
15 |
+
2024-08-19 14:52:43,727 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
16 |
+
2024-08-19 14:52:49,872 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
17 |
+
2024-08-19 14:52:55,423 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
18 |
+
2024-08-19 14:53:01,056 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
19 |
+
2024-08-19 14:53:06,864 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
20 |
+
2024-08-19 14:53:07,469 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8985, 1.8141, 1.8623, 1.5981], device='cuda:0')
|
21 |
+
2024-08-19 14:53:12,652 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
22 |
+
2024-08-19 14:53:18,617 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
23 |
+
2024-08-19 14:53:24,530 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
24 |
+
2024-08-19 14:53:25,341 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1692, 2.0479, 2.1348, 2.0785], device='cuda:0')
|
25 |
+
2024-08-19 14:53:30,300 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1873, 2.0157, 2.1240, 1.7883], device='cuda:0')
|
26 |
+
2024-08-19 14:53:30,393 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
27 |
+
2024-08-19 14:53:36,066 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
28 |
+
2024-08-19 14:53:41,423 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
29 |
+
2024-08-19 14:53:43,965 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4776, 2.1499, 2.0038, 1.8490], device='cuda:0')
|
30 |
+
2024-08-19 14:53:46,975 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
31 |
+
2024-08-19 14:53:47,823 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3570, 2.1628, 1.9386, 1.6751], device='cuda:0')
|
32 |
+
2024-08-19 14:53:52,570 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
33 |
+
2024-08-19 14:53:57,916 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
34 |
+
2024-08-19 14:54:02,383 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3778, 2.1888, 1.9201, 1.7523], device='cuda:0')
|
35 |
+
2024-08-19 14:54:03,419 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
36 |
+
2024-08-19 14:54:08,727 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
37 |
+
2024-08-19 14:54:14,225 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
38 |
+
2024-08-19 14:54:14,362 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6209, 4.4041, 4.5290, 4.6020], device='cuda:0')
|
39 |
+
2024-08-19 14:54:16,634 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8594, 1.7356, 1.7612, 1.6396], device='cuda:0')
|
40 |
+
2024-08-19 14:54:19,574 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
41 |
+
2024-08-19 14:54:23,142 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6166, 3.5221, 3.0759, 3.3663], device='cuda:0')
|
42 |
+
2024-08-19 14:54:25,062 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
43 |
+
2024-08-19 14:54:27,826 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5530, 2.8278, 2.8895, 2.7994], device='cuda:0')
|
44 |
+
2024-08-19 14:54:30,586 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
45 |
+
2024-08-19 14:54:36,176 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
46 |
+
2024-08-19 14:54:41,704 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
47 |
+
2024-08-19 14:54:42,214 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
48 |
+
2024-08-19 14:54:43,610 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.44927865231736125
|
49 |
+
2024-08-19 14:54:43,610 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-360000-avg-5-chunk-size-32-left-context-frames-256-2024-08-19-14-48-47
ADDED
@@ -0,0 +1,46 @@
1 |
+
2024-08-19 14:48:47,744 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:48:47,744 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 360000, 'avg': 5, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-360000-avg-5-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-19 14:48:47,744 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:48:48,092 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-348000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-344000.pt']
|
5 |
+
2024-08-19 14:49:06,636 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:49:06,637 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:49:06,684 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:49:07,169 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:49:13,203 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:49:18,052 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:49:18,763 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([0.0244, 0.0282, 0.0241, 3.5494, 0.0284, 0.0602, 0.0513, 0.0545],
|
12 |
+
device='cuda:0')
|
13 |
+
2024-08-19 14:49:20,345 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6155, 2.0666, 2.3997, 1.1235], device='cuda:0')
|
14 |
+
2024-08-19 14:49:21,258 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9216, 2.9430, 3.1584, 2.9089], device='cuda:0')
|
15 |
+
2024-08-19 14:49:21,647 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
16 |
+
2024-08-19 14:49:28,003 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
17 |
+
2024-08-19 14:49:34,467 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
18 |
+
2024-08-19 14:49:41,169 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
19 |
+
2024-08-19 14:49:47,900 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
20 |
+
2024-08-19 14:49:54,160 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
21 |
+
2024-08-19 14:50:00,754 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
22 |
+
2024-08-19 14:50:07,339 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
23 |
+
2024-08-19 14:50:13,842 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
24 |
+
2024-08-19 14:50:20,235 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5506, 2.1299, 2.3226, 2.0764], device='cuda:0')
|
25 |
+
2024-08-19 14:50:20,811 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
26 |
+
2024-08-19 14:50:27,311 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
27 |
+
2024-08-19 14:50:34,048 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
28 |
+
2024-08-19 14:50:40,608 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
29 |
+
2024-08-19 14:50:46,926 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
30 |
+
2024-08-19 14:50:53,513 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
31 |
+
2024-08-19 14:50:59,899 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7945, 1.9688, 2.0106, 1.8597, 2.3245, 1.8897, 1.9754, 1.8306],
|
32 |
+
device='cuda:0')
|
33 |
+
2024-08-19 14:50:59,959 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
34 |
+
2024-08-19 14:51:06,587 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1381, 2.9582, 3.1499, 3.0275], device='cuda:0')
|
35 |
+
2024-08-19 14:51:06,611 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
36 |
+
2024-08-19 14:51:13,067 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
37 |
+
2024-08-19 14:51:19,159 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
38 |
+
2024-08-19 14:51:25,231 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5388, 2.2308, 2.3270, 2.2773], device='cuda:0')
|
39 |
+
2024-08-19 14:51:25,294 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
40 |
+
2024-08-19 14:51:31,656 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
41 |
+
2024-08-19 14:51:38,104 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
42 |
+
2024-08-19 14:51:44,599 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
43 |
+
2024-08-19 14:51:51,216 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
44 |
+
2024-08-19 14:51:51,812 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
45 |
+
2024-08-19 14:51:53,153 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006257999753186118
|
46 |
+
2024-08-19 14:51:53,153 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-364000-avg-3-chunk-size-16-left-context-frames-128-2024-08-19-14-34-47
ADDED
@@ -0,0 +1,47 @@
1 |
+
2024-08-19 14:34:47,224 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:34:47,225 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 364000, 'avg': 3, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '128', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-364000-avg-3-chunk-size-16-left-context-frames-128'}
|
3 |
+
2024-08-19 14:34:47,225 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:34:47,613 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt']
|
5 |
+
2024-08-19 14:34:56,304 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:34:56,305 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:34:56,353 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:34:56,766 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:35:06,318 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:35:14,267 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:35:23,128 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
12 |
+
2024-08-19 14:35:32,184 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
13 |
+
2024-08-19 14:35:40,394 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
14 |
+
2024-08-19 14:35:48,026 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
15 |
+
2024-08-19 14:35:52,021 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9803, 1.9685, 2.0641, 2.2098], device='cuda:0')
|
16 |
+
2024-08-19 14:35:55,491 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
17 |
+
2024-08-19 14:36:03,748 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
18 |
+
2024-08-19 14:36:08,310 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8355, 1.6421, 1.7883, 1.6789], device='cuda:0')
|
19 |
+
2024-08-19 14:36:11,499 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
20 |
+
2024-08-19 14:36:19,675 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
21 |
+
2024-08-19 14:36:28,136 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
22 |
+
2024-08-19 14:36:36,038 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
23 |
+
2024-08-19 14:36:44,560 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
24 |
+
2024-08-19 14:36:53,008 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
25 |
+
2024-08-19 14:36:58,643 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3743, 2.1593, 1.8810, 1.6905], device='cuda:0')
|
26 |
+
2024-08-19 14:37:00,965 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
27 |
+
2024-08-19 14:37:09,112 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
28 |
+
2024-08-19 14:37:17,025 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
29 |
+
2024-08-19 14:37:17,770 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4347, 2.1896, 1.9714, 1.8076], device='cuda:0')
|
30 |
+
2024-08-19 14:37:17,878 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.3084, 3.5573, 4.0724, 4.1450], device='cuda:0')
|
31 |
+
2024-08-19 14:37:21,122 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1756, 1.9711, 2.0506, 1.9309], device='cuda:0')
|
32 |
+
2024-08-19 14:37:24,860 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
33 |
+
2024-08-19 14:37:32,708 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
34 |
+
2024-08-19 14:37:38,467 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3568, 1.4320, 1.5865, 1.1970, 1.2469, 1.6081, 1.7005, 1.2170],
|
35 |
+
device='cuda:0')
|
36 |
+
2024-08-19 14:37:40,489 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
37 |
+
2024-08-19 14:37:44,222 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6274, 4.3992, 4.5260, 4.5700], device='cuda:0')
|
38 |
+
2024-08-19 14:37:48,360 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
39 |
+
2024-08-19 14:37:52,958 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.5496, 2.7513, 2.8661, 2.7740], device='cuda:0')
|
40 |
+
2024-08-19 14:37:56,240 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
41 |
+
2024-08-19 14:38:04,376 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
42 |
+
2024-08-19 14:38:12,156 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
43 |
+
2024-08-19 14:38:20,607 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
44 |
+
2024-08-19 14:38:28,407 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
45 |
+
2024-08-19 14:38:29,021 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
46 |
+
2024-08-19 14:38:30,413 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006232222201802086
|
47 |
+
2024-08-19 14:38:30,413 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-364000-avg-3-chunk-size-32-left-context-frames-256-2024-08-19-14-32-47
ADDED
@@ -0,0 +1,42 @@
1 |
+
2024-08-19 14:32:47,861 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:32:47,861 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 364000, 'avg': 3, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-364000-avg-3-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-19 14:32:47,861 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:32:48,218 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt']
|
5 |
+
2024-08-19 14:32:53,092 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:32:53,092 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:32:53,134 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:32:53,541 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:33:00,512 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:33:06,399 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:33:11,398 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
12 |
+
2024-08-19 14:33:14,562 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
13 |
+
2024-08-19 14:33:19,204 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
14 |
+
2024-08-19 14:33:22,139 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
15 |
+
2024-08-19 14:33:25,798 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
16 |
+
2024-08-19 14:33:29,203 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
17 |
+
2024-08-19 14:33:32,699 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
18 |
+
2024-08-19 14:33:35,605 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
19 |
+
2024-08-19 14:33:38,971 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
20 |
+
2024-08-19 14:33:43,165 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
21 |
+
2024-08-19 14:33:47,517 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
22 |
+
2024-08-19 14:33:51,973 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6355, 3.7327, 4.4487, 4.4651], device='cuda:0')
|
23 |
+
2024-08-19 14:33:52,074 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
24 |
+
2024-08-19 14:33:53,403 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.0962, 2.9522, 3.2029, 3.1077], device='cuda:0')
|
25 |
+
2024-08-19 14:33:56,328 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
26 |
+
2024-08-19 14:34:00,122 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
27 |
+
2024-08-19 14:34:04,113 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
28 |
+
2024-08-19 14:34:08,151 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
29 |
+
2024-08-19 14:34:11,965 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
30 |
+
2024-08-19 14:34:12,346 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9678, 2.4082, 2.2589, 2.0819], device='cuda:0')
|
31 |
+
2024-08-19 14:34:16,055 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9855, 3.4370, 2.6414, 3.7209], device='cuda:0')
|
32 |
+
2024-08-19 14:34:16,140 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
33 |
+
2024-08-19 14:34:19,988 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
34 |
+
2024-08-19 14:34:24,045 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
35 |
+
2024-08-19 14:34:24,872 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2082, 1.9709, 1.9551, 1.7361], device='cuda:0')
|
36 |
+
2024-08-19 14:34:28,262 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
37 |
+
2024-08-19 14:34:32,336 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
38 |
+
2024-08-19 14:34:36,369 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
39 |
+
2024-08-19 14:34:40,091 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
40 |
+
2024-08-19 14:34:40,490 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
41 |
+
2024-08-19 14:34:41,873 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006229590512837757
|
42 |
+
2024-08-19 14:34:41,873 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-364000-avg-3-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-18-19-03-18
ADDED
@@ -0,0 +1,43 @@
1 |
+
2024-08-18 19:03:18,570 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-18 19:03:18,570 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-clean', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 364000, 'avg': 3, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-364000-avg-3-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-18 19:03:18,571 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-18 19:03:18,945 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt
|
5 |
+
2024-08-18 19:03:32,259 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-18 19:03:32,259 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-18 19:03:32,301 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-18 19:03:32,707 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-18 19:03:40,483 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-18 19:03:48,097 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-18 19:03:56,101 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
12 |
+
2024-08-18 19:04:00,075 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.6474, 2.0869, 2.1492, 1.9388], device='cuda:0')
|
13 |
+
2024-08-18 19:04:03,506 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
14 |
+
2024-08-18 19:04:10,508 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
15 |
+
2024-08-18 19:04:15,209 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1043, 2.9839, 3.2047, 3.0724], device='cuda:0')
|
16 |
+
2024-08-18 19:04:17,075 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
17 |
+
2024-08-18 19:04:23,406 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
18 |
+
2024-08-18 19:04:30,112 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
19 |
+
2024-08-18 19:04:36,556 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
20 |
+
2024-08-18 19:04:43,240 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
21 |
+
2024-08-18 19:04:50,188 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
22 |
+
2024-08-18 19:04:57,151 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
23 |
+
2024-08-18 19:05:03,466 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
24 |
+
2024-08-18 19:05:10,247 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
25 |
+
2024-08-18 19:05:16,629 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
26 |
+
2024-08-18 19:05:23,092 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
27 |
+
2024-08-18 19:05:27,807 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6507, 3.9187, 4.4839, 4.5067], device='cuda:0')
|
28 |
+
2024-08-18 19:05:29,383 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
29 |
+
2024-08-18 19:05:35,591 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
30 |
+
2024-08-18 19:05:41,794 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
31 |
+
2024-08-18 19:05:47,845 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
32 |
+
2024-08-18 19:05:54,012 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
33 |
+
2024-08-18 19:05:57,406 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0563, 4.7753, 4.9644, 4.9934], device='cuda:0')
|
34 |
+
2024-08-18 19:05:58,050 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0229, 3.8622, 3.4058, 3.7227], device='cuda:0')
|
35 |
+
2024-08-18 19:06:00,419 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
36 |
+
2024-08-18 19:06:06,562 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
37 |
+
2024-08-18 19:06:12,574 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
38 |
+
2024-08-18 19:06:18,491 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
39 |
+
2024-08-18 19:06:21,520 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7121, 2.0879, 2.2363, 2.0286], device='cuda:0')
|
40 |
+
2024-08-18 19:06:24,656 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
41 |
+
2024-08-18 19:06:25,342 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
42 |
+
2024-08-18 19:06:26,661 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.45885937050539577
|
43 |
+
2024-08-18 19:06:26,661 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-364000-avg-4-chunk-size-16-left-context-frames-128-2024-08-19-14-38-35
ADDED
@@ -0,0 +1,45 @@
1 |
+
2024-08-19 14:38:35,412 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:38:35,412 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 364000, 'avg': 4, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '128', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-364000-avg-4-chunk-size-16-left-context-frames-128'}
|
3 |
+
2024-08-19 14:38:35,413 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:38:35,793 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt']
|
5 |
+
2024-08-19 14:38:47,100 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:38:47,100 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:38:47,159 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:38:47,566 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:38:55,903 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:39:03,927 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:39:13,729 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
12 |
+
2024-08-19 14:39:22,353 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
13 |
+
2024-08-19 14:39:28,604 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4347, 1.7949, 1.5499, 1.3042, 1.4440, 1.3978, 1.6042, 1.6124],
|
14 |
+
device='cuda:0')
|
15 |
+
2024-08-19 14:39:30,298 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
16 |
+
2024-08-19 14:39:38,923 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
17 |
+
2024-08-19 14:39:43,823 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6632, 3.5067, 3.0989, 3.3890], device='cuda:0')
|
18 |
+
2024-08-19 14:39:46,775 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
19 |
+
2024-08-19 14:39:54,924 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
20 |
+
2024-08-19 14:40:02,489 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
21 |
+
2024-08-19 14:40:10,445 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
22 |
+
2024-08-19 14:40:18,965 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
23 |
+
2024-08-19 14:40:27,248 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
24 |
+
2024-08-19 14:40:34,971 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
25 |
+
2024-08-19 14:40:42,819 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
26 |
+
2024-08-19 14:40:48,455 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5522, 2.9160, 2.9495, 2.7956], device='cuda:0')
|
27 |
+
2024-08-19 14:40:50,698 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
28 |
+
2024-08-19 14:40:57,985 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
29 |
+
2024-08-19 14:41:06,043 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
30 |
+
2024-08-19 14:41:09,939 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5455, 2.8782, 2.9039, 2.6925], device='cuda:0')
|
31 |
+
2024-08-19 14:41:14,041 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
32 |
+
2024-08-19 14:41:23,683 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
33 |
+
2024-08-19 14:41:39,293 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
34 |
+
2024-08-19 14:41:45,615 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
35 |
+
2024-08-19 14:41:48,849 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3731, 1.6633, 1.7346, 1.6947, 1.9589, 1.6482, 1.6501, 1.6803],
|
36 |
+
device='cuda:0')
|
37 |
+
2024-08-19 14:41:51,741 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
38 |
+
2024-08-19 14:41:58,383 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6547, 3.1267, 2.2467, 3.3458], device='cuda:0')
|
39 |
+
2024-08-19 14:41:58,502 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
40 |
+
2024-08-19 14:42:04,947 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
41 |
+
2024-08-19 14:42:11,195 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
42 |
+
2024-08-19 14:42:17,523 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
43 |
+
2024-08-19 14:42:18,056 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
44 |
+
2024-08-19 14:42:19,421 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006236495399567375
|
45 |
+
2024-08-19 14:42:19,421 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-364000-avg-4-chunk-size-32-left-context-frames-256-2024-08-19-14-34-47
ADDED
@@ -0,0 +1,48 @@
1 |
+
2024-08-19 14:34:47,216 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:34:47,217 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 364000, 'avg': 4, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-364000-avg-4-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-19 14:34:47,217 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:34:47,575 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt']
|
5 |
+
2024-08-19 14:34:58,365 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:34:58,366 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:34:58,420 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:34:58,827 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:35:06,039 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:35:12,717 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7617, 2.0990, 1.7541, 1.4904, 1.6853, 1.4827, 1.9573, 1.6755],
|
11 |
+
device='cuda:0')
|
12 |
+
2024-08-19 14:35:14,212 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
13 |
+
2024-08-19 14:35:23,127 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
14 |
+
2024-08-19 14:35:32,172 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
15 |
+
2024-08-19 14:35:40,395 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
16 |
+
2024-08-19 14:35:48,019 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
17 |
+
2024-08-19 14:35:50,567 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8112, 1.6813, 1.6980, 1.2072, 1.3818, 1.8065, 2.3209, 1.4003],
|
18 |
+
device='cuda:0')
|
19 |
+
2024-08-19 14:35:55,483 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
20 |
+
2024-08-19 14:36:03,747 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
21 |
+
2024-08-19 14:36:11,499 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
22 |
+
2024-08-19 14:36:19,647 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
23 |
+
2024-08-19 14:36:28,128 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
24 |
+
2024-08-19 14:36:29,104 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6562, 3.8948, 4.4724, 4.5516], device='cuda:0')
|
25 |
+
2024-08-19 14:36:36,033 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
26 |
+
2024-08-19 14:36:44,549 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
27 |
+
2024-08-19 14:36:53,001 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
28 |
+
2024-08-19 14:37:00,959 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
29 |
+
2024-08-19 14:37:04,351 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9878, 3.3651, 2.3755, 3.8272], device='cuda:0')
|
30 |
+
2024-08-19 14:37:09,110 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
31 |
+
2024-08-19 14:37:17,027 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
32 |
+
2024-08-19 14:37:17,120 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5955, 2.0017, 2.4220, 1.1255], device='cuda:0')
|
33 |
+
2024-08-19 14:37:24,856 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
34 |
+
2024-08-19 14:37:32,701 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
35 |
+
2024-08-19 14:37:40,488 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
36 |
+
2024-08-19 14:37:48,339 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
37 |
+
2024-08-19 14:37:56,239 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
38 |
+
2024-08-19 14:37:58,750 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2597, 2.0263, 2.2071, 2.4330], device='cuda:0')
|
39 |
+
2024-08-19 14:38:03,777 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7964, 1.6602, 1.7114, 1.2576, 1.4105, 1.8427, 2.3414, 1.3989],
|
40 |
+
device='cuda:0')
|
41 |
+
2024-08-19 14:38:04,366 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
42 |
+
2024-08-19 14:38:12,151 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
43 |
+
2024-08-19 14:38:19,664 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0328, 3.9045, 3.4568, 3.7544], device='cuda:0')
|
44 |
+
2024-08-19 14:38:20,605 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
45 |
+
2024-08-19 14:38:28,401 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
46 |
+
2024-08-19 14:38:29,128 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
47 |
+
2024-08-19 14:38:30,491 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006231987240028731
|
48 |
+
2024-08-19 14:38:30,491 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-364000-avg-5-chunk-size-16-left-context-frames-128-2024-08-19-14-42-23
ADDED
@@ -0,0 +1,48 @@
1 |
+
2024-08-19 14:42:23,889 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:42:23,889 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 364000, 'avg': 5, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '128', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-364000-avg-5-chunk-size-16-left-context-frames-128'}
|
3 |
+
2024-08-19 14:42:23,889 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:42:24,237 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-348000.pt']
|
5 |
+
2024-08-19 14:42:36,686 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:42:36,687 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:42:36,744 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:42:37,154 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:42:43,480 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:42:48,522 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:42:55,908 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
12 |
+
2024-08-19 14:43:03,016 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
13 |
+
2024-08-19 14:43:09,776 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
14 |
+
2024-08-19 14:43:16,401 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
15 |
+
2024-08-19 14:43:23,134 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
16 |
+
2024-08-19 14:43:29,580 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
17 |
+
2024-08-19 14:43:33,507 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8280, 1.6160, 1.7732, 1.6550], device='cuda:0')
|
18 |
+
2024-08-19 14:43:36,211 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
19 |
+
2024-08-19 14:43:42,697 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
20 |
+
2024-08-19 14:43:49,667 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
21 |
+
2024-08-19 14:43:55,363 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4175, 2.1509, 1.9796, 1.7901], device='cuda:0')
|
22 |
+
2024-08-19 14:43:56,833 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
23 |
+
2024-08-19 14:44:03,451 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
24 |
+
2024-08-19 14:44:10,296 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
25 |
+
2024-08-19 14:44:16,797 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
26 |
+
2024-08-19 14:44:20,982 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5955, 3.4789, 2.9974, 3.3493], device='cuda:0')
|
27 |
+
2024-08-19 14:44:23,743 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
28 |
+
2024-08-19 14:44:29,868 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
29 |
+
2024-08-19 14:44:36,272 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
30 |
+
2024-08-19 14:44:43,052 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
31 |
+
2024-08-19 14:44:49,733 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
32 |
+
2024-08-19 14:44:56,200 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
33 |
+
2024-08-19 14:45:02,376 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
34 |
+
2024-08-19 14:45:06,471 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3490, 1.3917, 1.5056, 1.1746, 1.2445, 1.6057, 1.6763, 1.1689],
|
35 |
+
device='cuda:0')
|
36 |
+
2024-08-19 14:45:08,859 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4424, 1.8441, 1.5231, 1.2765, 1.3688, 1.3618, 1.7145, 1.6043],
|
37 |
+
device='cuda:0')
|
38 |
+
2024-08-19 14:45:08,937 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
39 |
+
2024-08-19 14:45:09,650 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([7.0687e-05, 5.2157e-03, 3.7882e-04, 3.8312e+00, 4.6391e-03, 3.1489e-02,
|
40 |
+
3.2639e-02, 4.8207e-03], device='cuda:0')
|
41 |
+
2024-08-19 14:45:10,317 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3607, 1.7372, 1.7374, 1.7325, 2.0404, 1.6311, 1.7996, 1.6355],
|
42 |
+
device='cuda:0')
|
43 |
+
2024-08-19 14:45:15,690 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
44 |
+
2024-08-19 14:45:22,099 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
45 |
+
2024-08-19 14:45:28,409 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
46 |
+
2024-08-19 14:45:28,917 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
47 |
+
2024-08-19 14:45:30,292 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.0062510653291932596
|
48 |
+
2024-08-19 14:45:30,292 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-364000-avg-5-chunk-size-32-left-context-frames-256-2024-08-19-14-38-35
ADDED
@@ -0,0 +1,47 @@
1 |
+
2024-08-19 14:38:35,415 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:38:35,415 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 364000, 'avg': 5, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-364000-avg-5-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-19 14:38:35,415 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:38:35,787 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-348000.pt']
|
5 |
+
2024-08-19 14:38:47,219 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:38:47,220 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:38:47,267 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:38:47,678 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:38:55,312 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:39:00,586 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7967, 1.9408, 1.9796, 1.8336, 2.3214, 1.9258, 1.9736, 1.8454],
|
11 |
+
device='cuda:0')
|
12 |
+
2024-08-19 14:39:03,916 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
13 |
+
2024-08-19 14:39:13,726 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
14 |
+
2024-08-19 14:39:19,056 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1612, 1.8195, 1.9205, 1.8113], device='cuda:0')
|
15 |
+
2024-08-19 14:39:21,640 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0320, 3.1448, 3.2061, 3.0798], device='cuda:0')
|
16 |
+
2024-08-19 14:39:22,348 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
17 |
+
2024-08-19 14:39:30,296 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
18 |
+
2024-08-19 14:39:38,937 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
19 |
+
2024-08-19 14:39:41,202 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1001, 2.9282, 3.1899, 3.0417], device='cuda:0')
|
20 |
+
2024-08-19 14:39:46,765 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
21 |
+
2024-08-19 14:39:51,768 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0363, 3.2175, 3.1634, 3.0310], device='cuda:0')
|
22 |
+
2024-08-19 14:39:54,920 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
23 |
+
2024-08-19 14:40:02,623 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
24 |
+
2024-08-19 14:40:04,744 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8900, 2.4573, 2.1441, 2.0718], device='cuda:0')
|
25 |
+
2024-08-19 14:40:10,442 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
26 |
+
2024-08-19 14:40:18,974 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
27 |
+
2024-08-19 14:40:27,240 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
28 |
+
2024-08-19 14:40:34,987 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
29 |
+
2024-08-19 14:40:42,813 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
30 |
+
2024-08-19 14:40:50,617 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
31 |
+
2024-08-19 14:40:58,010 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
32 |
+
2024-08-19 14:41:06,038 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
33 |
+
2024-08-19 14:41:14,088 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
34 |
+
2024-08-19 14:41:23,669 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
35 |
+
2024-08-19 14:41:26,144 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5605, 2.1645, 2.3992, 2.1785], device='cuda:0')
|
36 |
+
2024-08-19 14:41:32,644 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2658, 2.1449, 2.2484, 2.4850], device='cuda:0')
|
37 |
+
2024-08-19 14:41:38,514 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0365, 4.7835, 4.9560, 4.9926], device='cuda:0')
|
38 |
+
2024-08-19 14:41:39,230 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
39 |
+
2024-08-19 14:41:45,610 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
40 |
+
2024-08-19 14:41:51,720 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
41 |
+
2024-08-19 14:41:58,470 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
42 |
+
2024-08-19 14:42:04,915 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
43 |
+
2024-08-19 14:42:11,189 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
44 |
+
2024-08-19 14:42:17,519 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
45 |
+
2024-08-19 14:42:18,062 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
46 |
+
2024-08-19 14:42:19,427 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006243430153884536
|
47 |
+
2024-08-19 14:42:19,427 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-368000-avg-3-chunk-size-16-left-context-frames-128-2024-08-19-14-25-19
ADDED
@@ -0,0 +1,45 @@
1 |
+
2024-08-19 14:25:19,228 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:25:19,228 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 368000, 'avg': 3, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '128', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-368000-avg-3-chunk-size-16-left-context-frames-128'}
|
3 |
+
2024-08-19 14:25:19,228 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:25:19,583 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt']
|
5 |
+
2024-08-19 14:25:29,150 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:25:29,150 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:25:29,219 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:25:29,637 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:25:38,253 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:25:40,375 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1525, 1.9659, 2.1218, 1.9768], device='cuda:0')
|
11 |
+
2024-08-19 14:25:43,894 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1649, 2.0681, 2.1322, 1.8978], device='cuda:0')
|
12 |
+
2024-08-19 14:25:44,853 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
13 |
+
2024-08-19 14:25:54,105 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
14 |
+
2024-08-19 14:26:02,646 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
15 |
+
2024-08-19 14:26:07,909 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6329, 4.4165, 4.5361, 4.5914], device='cuda:0')
|
16 |
+
2024-08-19 14:26:07,950 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6510, 3.5226, 3.0833, 3.3628], device='cuda:0')
|
17 |
+
2024-08-19 14:26:11,120 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
18 |
+
2024-08-19 14:26:18,872 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
19 |
+
2024-08-19 14:26:26,424 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
20 |
+
2024-08-19 14:26:34,443 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
21 |
+
2024-08-19 14:26:42,400 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
22 |
+
2024-08-19 14:26:49,897 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
23 |
+
2024-08-19 14:26:57,716 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
24 |
+
2024-08-19 14:27:05,611 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
25 |
+
2024-08-19 14:27:08,737 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6169, 3.4644, 2.9327, 3.3376], device='cuda:0')
|
26 |
+
2024-08-19 14:27:13,217 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
27 |
+
2024-08-19 14:27:21,283 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
28 |
+
2024-08-19 14:27:28,811 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
29 |
+
2024-08-19 14:27:32,743 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1736, 1.9807, 2.1279, 1.8548], device='cuda:0')
|
30 |
+
2024-08-19 14:27:36,740 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
31 |
+
2024-08-19 14:27:41,668 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.3245, 3.5692, 4.1190, 4.1438], device='cuda:0')
|
32 |
+
2024-08-19 14:27:44,558 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
33 |
+
2024-08-19 14:27:52,330 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
34 |
+
2024-08-19 14:28:00,057 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
35 |
+
2024-08-19 14:28:07,912 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
36 |
+
2024-08-19 14:28:12,396 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6277, 3.4423, 3.0098, 3.3641], device='cuda:0')
|
37 |
+
2024-08-19 14:28:15,535 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
38 |
+
2024-08-19 14:28:23,082 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
39 |
+
2024-08-19 14:28:30,742 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
40 |
+
2024-08-19 14:28:38,865 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
41 |
+
2024-08-19 14:28:46,868 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
42 |
+
2024-08-19 14:28:54,426 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
43 |
+
2024-08-19 14:28:55,041 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
44 |
+
2024-08-19 14:28:56,421 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006218727938345955
|
45 |
+
2024-08-19 14:28:56,422 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-368000-avg-3-chunk-size-32-left-context-frames-256-2024-08-19-14-18-53
ADDED
@@ -0,0 +1,47 @@
1 |
+
2024-08-19 14:18:53,213 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:18:53,213 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 368000, 'avg': 3, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-368000-avg-3-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-19 14:18:53,216 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:18:53,578 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt']
|
5 |
+
2024-08-19 14:19:00,988 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:19:00,989 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:19:01,066 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:19:01,519 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:19:09,860 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:19:23,768 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:19:34,897 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.5176, 2.2308, 2.0595, 1.8177], device='cuda:0')
|
12 |
+
2024-08-19 14:19:38,635 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.8871, 2.9483, 3.2412, 2.9302], device='cuda:0')
|
13 |
+
2024-08-19 14:19:39,374 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
14 |
+
2024-08-19 14:19:50,754 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.6563, 1.8522, 1.9505, 1.8813, 2.3946, 1.8699, 1.9026, 1.9199],
|
15 |
+
device='cuda:0')
|
16 |
+
2024-08-19 14:19:55,582 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
17 |
+
2024-08-19 14:20:04,487 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5288, 2.0842, 2.4669, 2.1669], device='cuda:0')
|
18 |
+
2024-08-19 14:20:09,189 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
19 |
+
2024-08-19 14:20:22,784 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
20 |
+
2024-08-19 14:20:37,391 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
21 |
+
2024-08-19 14:20:52,302 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
22 |
+
2024-08-19 14:21:06,011 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
23 |
+
2024-08-19 14:21:21,388 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
24 |
+
2024-08-19 14:21:35,198 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
25 |
+
2024-08-19 14:21:51,025 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
26 |
+
2024-08-19 14:22:06,468 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
27 |
+
2024-08-19 14:22:21,210 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
28 |
+
2024-08-19 14:22:36,412 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
29 |
+
2024-08-19 14:22:51,522 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
30 |
+
2024-08-19 14:23:05,341 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
31 |
+
2024-08-19 14:23:17,750 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
32 |
+
2024-08-19 14:23:18,856 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7121, 2.2222, 2.1049, 1.9988], device='cuda:0')
|
33 |
+
2024-08-19 14:23:31,182 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
34 |
+
2024-08-19 14:23:43,431 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
35 |
+
2024-08-19 14:23:52,115 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1725, 1.7983, 1.8844, 1.8923], device='cuda:0')
|
36 |
+
2024-08-19 14:23:56,642 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
37 |
+
2024-08-19 14:24:07,328 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1799, 1.7652, 1.9000, 1.7795], device='cuda:0')
|
38 |
+
2024-08-19 14:24:09,982 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
39 |
+
2024-08-19 14:24:24,078 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
40 |
+
2024-08-19 14:24:38,499 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
41 |
+
2024-08-19 14:24:53,730 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([1.5249e-06, 4.1799e-03, 1.3520e-19, 3.5204e+00, 8.0398e-04, 3.8541e-02,
|
42 |
+
2.7948e-02, 3.6437e-02], device='cuda:0')
|
43 |
+
2024-08-19 14:24:53,798 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
44 |
+
2024-08-19 14:25:08,346 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
45 |
+
2024-08-19 14:25:09,339 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
46 |
+
2024-08-19 14:25:10,750 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.00620927883604464
|
47 |
+
2024-08-19 14:25:10,750 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-368000-avg-3-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-18-19-00-07
ADDED
@@ -0,0 +1,50 @@
1 |
+
2024-08-18 19:00:07,974 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-18 19:00:07,974 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-clean', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 368000, 'avg': 3, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-368000-avg-3-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-18 19:00:07,974 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-18 19:00:08,376 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt
|
5 |
+
2024-08-18 19:00:19,842 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-18 19:00:19,843 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-18 19:00:19,889 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-18 19:00:20,297 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-18 19:00:29,872 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-18 19:00:33,446 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7318, 2.1787, 2.1106, 2.0037], device='cuda:0')
|
11 |
+
2024-08-18 19:00:37,296 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
12 |
+
2024-08-18 19:00:44,690 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
13 |
+
2024-08-18 19:00:51,509 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7361, 1.6049, 1.8740, 1.2133, 1.4232, 1.8366, 2.1996, 1.4228],
|
14 |
+
device='cuda:0')
|
15 |
+
2024-08-18 19:00:51,559 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
16 |
+
2024-08-18 19:00:58,013 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
17 |
+
2024-08-18 19:01:00,711 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0094, 3.9035, 3.4622, 3.7351], device='cuda:0')
|
18 |
+
2024-08-18 19:01:03,730 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7991, 1.9880, 1.9187, 1.8920, 2.3395, 1.8599, 1.9845, 1.8652],
|
19 |
+
device='cuda:0')
|
20 |
+
2024-08-18 19:01:04,371 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
21 |
+
2024-08-18 19:01:08,650 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1469, 1.9987, 1.9741, 1.8064], device='cuda:0')
|
22 |
+
2024-08-18 19:01:10,700 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
23 |
+
2024-08-18 19:01:17,069 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
24 |
+
2024-08-18 19:01:23,519 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
25 |
+
2024-08-18 19:01:29,789 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
26 |
+
2024-08-18 19:01:32,007 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9806, 3.3670, 2.4537, 3.8093], device='cuda:0')
|
27 |
+
2024-08-18 19:01:36,604 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
28 |
+
2024-08-18 19:01:43,465 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
29 |
+
2024-08-18 19:01:50,113 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
30 |
+
2024-08-18 19:01:56,528 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
31 |
+
2024-08-18 19:01:58,050 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5313, 2.0734, 2.3348, 1.9903], device='cuda:0')
|
32 |
+
2024-08-18 19:02:02,943 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
33 |
+
2024-08-18 19:02:09,071 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
34 |
+
2024-08-18 19:02:15,385 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
35 |
+
2024-08-18 19:02:19,906 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([1.6153e-06, 3.7259e-03, 4.8937e-15, 3.5204e+00, 2.1927e-03, 3.7322e-02,
|
36 |
+
2.6767e-02, 3.4430e-02], device='cuda:0')
|
37 |
+
2024-08-18 19:02:21,822 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
38 |
+
2024-08-18 19:02:26,071 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6837, 3.8403, 4.4889, 4.5387], device='cuda:0')
|
39 |
+
2024-08-18 19:02:28,154 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
40 |
+
2024-08-18 19:02:34,163 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
41 |
+
2024-08-18 19:02:40,481 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
42 |
+
2024-08-18 19:02:46,545 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
43 |
+
2024-08-18 19:02:52,933 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
44 |
+
2024-08-18 19:02:59,353 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
45 |
+
2024-08-18 19:03:05,980 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
46 |
+
2024-08-18 19:03:09,862 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9757, 3.3193, 2.3843, 3.8652], device='cuda:0')
|
47 |
+
2024-08-18 19:03:12,256 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
48 |
+
2024-08-18 19:03:12,969 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
49 |
+
2024-08-18 19:03:14,270 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4585850749931198
|
50 |
+
2024-08-18 19:03:14,270 INFO [inference_audio_tagging.py:456] Done
|
inference_audio_tagging/log-decode-iter-368000-avg-4-chunk-size-16-left-context-frames-128-2024-08-19-14-29-01
ADDED
@@ -0,0 +1,42 @@
1 |
+
2024-08-19 14:29:01,269 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:29:01,269 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 368000, 'avg': 4, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '128', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-368000-avg-4-chunk-size-16-left-context-frames-128'}
|
3 |
+
2024-08-19 14:29:01,270 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:29:01,650 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt']
|
5 |
+
2024-08-19 14:29:14,057 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:29:14,057 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:29:14,121 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:29:14,537 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:29:23,678 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:29:32,126 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:29:41,305 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.5643, 2.2426, 2.1013, 1.9517], device='cuda:0')
|
12 |
+
2024-08-19 14:29:41,350 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
13 |
+
2024-08-19 14:29:49,688 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
14 |
+
2024-08-19 14:29:57,396 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
15 |
+
2024-08-19 14:30:05,184 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
16 |
+
2024-08-19 14:30:13,185 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
17 |
+
2024-08-19 14:30:20,743 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
18 |
+
2024-08-19 14:30:28,600 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
19 |
+
2024-08-19 14:30:36,274 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
20 |
+
2024-08-19 14:30:38,643 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4441, 2.1935, 1.9507, 1.7613], device='cuda:0')
|
21 |
+
2024-08-19 14:30:42,646 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1770, 2.0252, 2.0653, 1.9952], device='cuda:0')
|
22 |
+
2024-08-19 14:30:43,878 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
23 |
+
2024-08-19 14:30:51,728 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
24 |
+
2024-08-19 14:30:59,080 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
25 |
+
2024-08-19 14:31:07,028 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
26 |
+
2024-08-19 14:31:14,477 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
27 |
+
2024-08-19 14:31:22,136 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
28 |
+
2024-08-19 14:31:23,710 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2327, 1.7746, 2.3049, 1.0523], device='cuda:0')
|
29 |
+
2024-08-19 14:31:29,475 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
30 |
+
2024-08-19 14:31:36,442 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6456, 2.9976, 2.1081, 3.3537], device='cuda:0')
|
31 |
+
2024-08-19 14:31:37,157 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
32 |
+
2024-08-19 14:31:44,597 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
33 |
+
2024-08-19 14:31:52,596 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
34 |
+
2024-08-19 14:32:02,846 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
35 |
+
2024-08-19 14:32:11,854 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
36 |
+
2024-08-19 14:32:20,885 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
37 |
+
2024-08-19 14:32:30,102 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
38 |
+
2024-08-19 14:32:36,121 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
39 |
+
2024-08-19 14:32:41,668 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
40 |
+
2024-08-19 14:32:42,206 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
41 |
+
2024-08-19 14:32:43,587 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006227004686049486
|
42 |
+
2024-08-19 14:32:43,587 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-368000-avg-4-chunk-size-32-left-context-frames-256-2024-08-19-14-25-19
ADDED
@@ -0,0 +1,46 @@
1 |
+
2024-08-19 14:25:19,223 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:25:19,223 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 368000, 'avg': 4, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-368000-avg-4-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-19 14:25:19,223 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:25:19,578 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt']
|
5 |
+
2024-08-19 14:25:30,195 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:25:30,196 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:25:30,245 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:25:30,657 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:25:37,339 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:25:44,852 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:25:46,900 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8176, 2.0766, 2.0546, 1.9663, 2.4135, 1.9287, 2.0256, 1.9964],
|
12 |
+
device='cuda:0')
|
13 |
+
2024-08-19 14:25:54,113 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
14 |
+
2024-08-19 14:26:00,694 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6444, 3.9001, 4.4793, 4.4818], device='cuda:0')
|
15 |
+
2024-08-19 14:26:02,648 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
16 |
+
2024-08-19 14:26:11,164 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
17 |
+
2024-08-19 14:26:15,328 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6349, 1.9431, 2.3711, 1.2170], device='cuda:0')
|
18 |
+
2024-08-19 14:26:18,914 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
19 |
+
2024-08-19 14:26:26,414 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
20 |
+
2024-08-19 14:26:34,443 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
21 |
+
2024-08-19 14:26:42,397 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
22 |
+
2024-08-19 14:26:49,857 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
23 |
+
2024-08-19 14:26:57,794 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
24 |
+
2024-08-19 14:27:05,596 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
25 |
+
2024-08-19 14:27:13,213 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
26 |
+
2024-08-19 14:27:21,216 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
27 |
+
2024-08-19 14:27:28,814 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
28 |
+
2024-08-19 14:27:36,667 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
29 |
+
2024-08-19 14:27:42,505 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6224, 1.9755, 2.4989, 1.1186], device='cuda:0')
|
30 |
+
2024-08-19 14:27:44,557 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
31 |
+
2024-08-19 14:27:52,327 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
32 |
+
2024-08-19 14:28:00,062 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
33 |
+
2024-08-19 14:28:06,249 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8319, 1.9345, 2.0021, 1.8261, 2.3631, 1.8293, 2.0056, 1.8702],
|
34 |
+
device='cuda:0')
|
35 |
+
2024-08-19 14:28:07,913 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
36 |
+
2024-08-19 14:28:12,395 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0234, 3.8515, 3.4006, 3.7693], device='cuda:0')
|
37 |
+
2024-08-19 14:28:15,535 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
38 |
+
2024-08-19 14:28:23,086 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
39 |
+
2024-08-19 14:28:30,461 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6135, 2.0137, 2.4261, 1.1304], device='cuda:0')
|
40 |
+
2024-08-19 14:28:30,745 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
41 |
+
2024-08-19 14:28:38,859 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
42 |
+
2024-08-19 14:28:46,869 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
43 |
+
2024-08-19 14:28:54,424 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
44 |
+
2024-08-19 14:28:55,025 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
45 |
+
2024-08-19 14:28:56,385 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006220715757695241
|
46 |
+
2024-08-19 14:28:56,385 INFO [inference_audio_tagging.py:457] Done
|
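Note on the reported metric: the "mAP for audioset eval" figure logged by inference_audio_tagging.py is the usual AudioSet measure, i.e. the macro average of per-class average precision over the 527 event classes. A minimal sketch of how such a value can be computed from pooled clip-level logits and multi-hot labels is given below; the array names and the use of sklearn are illustrative assumptions, not the script's actual code path.

import numpy as np
from sklearn.metrics import average_precision_score

def audioset_map(logits: np.ndarray, labels: np.ndarray) -> float:
    """Hedged sketch: macro-averaged AP over event classes.

    logits, labels: hypothetical arrays of shape (num_clips, num_events),
    with labels being multi-hot {0, 1}.
    """
    scores = 1.0 / (1.0 + np.exp(-logits))  # per-class sigmoid probabilities
    per_class_ap = [
        average_precision_score(labels[:, c], scores[:, c])
        for c in range(labels.shape[1])
        if labels[:, c].any()  # skip classes absent from the eval subset
    ]
    return float(np.mean(per_class_ap))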
inference_audio_tagging/log-decode-iter-368000-avg-5-chunk-size-16-left-context-frames-128-2024-08-19-14-32-47
ADDED
@@ -0,0 +1,57 @@
1 |
+
2024-08-19 14:32:47,852 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:32:47,852 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 368000, 'avg': 5, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '128', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-368000-avg-5-chunk-size-16-left-context-frames-128'}
|
3 |
+
2024-08-19 14:32:47,852 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:32:48,218 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt']
|
5 |
+
2024-08-19 14:32:53,759 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:32:53,760 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:32:53,806 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:32:54,217 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:33:00,498 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:33:00,931 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5943, 3.4421, 3.0372, 3.3559], device='cuda:0')
|
11 |
+
2024-08-19 14:33:06,388 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
12 |
+
2024-08-19 14:33:11,383 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
13 |
+
2024-08-19 14:33:12,907 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2069, 1.7886, 2.2269, 1.0601], device='cuda:0')
|
14 |
+
2024-08-19 14:33:14,364 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
15 |
+
2024-08-19 14:33:18,800 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1870, 1.9781, 2.0873, 1.8613], device='cuda:0')
|
16 |
+
2024-08-19 14:33:18,866 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
17 |
+
2024-08-19 14:33:21,786 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
18 |
+
2024-08-19 14:33:25,792 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
19 |
+
2024-08-19 14:33:29,093 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
20 |
+
2024-08-19 14:33:30,540 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4494, 1.7598, 1.4758, 1.2946, 1.3815, 1.3906, 1.7218, 1.5629],
device='cuda:0')
|
22 |
+
2024-08-19 14:33:32,244 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5713, 2.9163, 2.9554, 2.8707], device='cuda:0')
|
23 |
+
2024-08-19 14:33:32,655 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
24 |
+
2024-08-19 14:33:33,188 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9002, 1.8060, 1.8553, 1.6741], device='cuda:0')
|
25 |
+
2024-08-19 14:33:34,194 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9500, 1.8734, 2.0038, 2.2145], device='cuda:0')
|
26 |
+
2024-08-19 14:33:35,526 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
27 |
+
2024-08-19 14:33:38,722 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
28 |
+
2024-08-19 14:33:39,532 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3425, 1.4109, 1.4935, 1.1735, 1.2567, 1.5711, 1.6564, 1.2132],
device='cuda:0')
|
30 |
+
2024-08-19 14:33:39,538 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2171e-04, 4.9535e-03, 2.8742e-10, 3.8312e+00, 1.8223e-03, 2.3840e-02,
2.8728e-02, 5.2500e-03], device='cuda:0')
|
32 |
+
2024-08-19 14:33:42,237 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8981, 1.7659, 1.9977, 2.0965], device='cuda:0')
|
33 |
+
2024-08-19 14:33:43,027 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
34 |
+
2024-08-19 14:33:47,516 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
35 |
+
2024-08-19 14:33:51,293 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4292, 1.7730, 1.5874, 1.2930, 1.3599, 1.3755, 1.7398, 1.5881],
device='cuda:0')
|
37 |
+
2024-08-19 14:33:51,824 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6048, 3.4811, 3.0138, 3.3330], device='cuda:0')
|
38 |
+
2024-08-19 14:33:52,053 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
39 |
+
2024-08-19 14:33:55,384 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.3339, 3.5566, 4.0646, 4.1476], device='cuda:0')
|
40 |
+
2024-08-19 14:33:56,502 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
41 |
+
2024-08-19 14:34:00,307 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
42 |
+
2024-08-19 14:34:04,284 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
43 |
+
2024-08-19 14:34:08,212 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
44 |
+
2024-08-19 14:34:11,938 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
45 |
+
2024-08-19 14:34:16,070 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
46 |
+
2024-08-19 14:34:19,588 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.5738, 2.7414, 2.8793, 2.7299], device='cuda:0')
|
47 |
+
2024-08-19 14:34:19,984 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
48 |
+
2024-08-19 14:34:24,029 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
49 |
+
2024-08-19 14:34:28,158 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5385, 2.8928, 2.7668, 2.8395], device='cuda:0')
|
50 |
+
2024-08-19 14:34:28,173 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
51 |
+
2024-08-19 14:34:32,308 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
52 |
+
2024-08-19 14:34:36,252 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
53 |
+
2024-08-19 14:34:37,864 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.3222, 3.5968, 4.1039, 4.1566], device='cuda:0')
|
54 |
+
2024-08-19 14:34:39,922 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
55 |
+
2024-08-19 14:34:40,327 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
56 |
+
2024-08-19 14:34:41,689 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006231160227763539
|
57 |
+
2024-08-19 14:34:41,689 INFO [inference_audio_tagging.py:457] Done
|
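The "averaging [...]" line above (inference_audio_tagging.py:354) combines the last few iteration checkpoints before evaluation. A minimal sketch of plain parameter averaging is shown below, assuming each .pt file stores a "model" state_dict; the function name and key layout are illustrative, not the icefall implementation itself.

import torch

def average_checkpoints(paths):
    """Hedged sketch: element-wise mean of model parameters across checkpoints."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")["model"]  # assumed checkpoint layout
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}  # pass to model.load_state_dict() afterwards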
inference_audio_tagging/log-decode-iter-368000-avg-5-chunk-size-32-left-context-frames-256-2024-08-19-14-29-01
ADDED
@@ -0,0 +1,44 @@
1 |
+
2024-08-19 14:29:01,259 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:29:01,260 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 368000, 'avg': 5, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-368000-avg-5-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-19 14:29:01,260 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:29:01,677 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-352000.pt']
|
5 |
+
2024-08-19 14:29:22,221 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:29:22,221 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:29:22,269 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:29:22,690 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:29:29,529 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:29:33,032 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:29:41,336 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
12 |
+
2024-08-19 14:29:49,677 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
13 |
+
2024-08-19 14:29:57,385 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
14 |
+
2024-08-19 14:29:57,999 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0491, 4.7704, 4.9560, 5.0059], device='cuda:0')
|
15 |
+
2024-08-19 14:30:05,174 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
16 |
+
2024-08-19 14:30:13,169 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
17 |
+
2024-08-19 14:30:19,322 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1285, 2.9087, 3.1937, 3.0400], device='cuda:0')
|
18 |
+
2024-08-19 14:30:20,733 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
19 |
+
2024-08-19 14:30:28,579 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
20 |
+
2024-08-19 14:30:35,625 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1850, 1.9687, 1.9993, 1.8701], device='cuda:0')
|
21 |
+
2024-08-19 14:30:36,250 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
22 |
+
2024-08-19 14:30:43,871 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
23 |
+
2024-08-19 14:30:51,721 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
24 |
+
2024-08-19 14:30:59,066 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
25 |
+
2024-08-19 14:31:07,022 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
26 |
+
2024-08-19 14:31:07,528 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9835, 3.4115, 2.4777, 3.7442], device='cuda:0')
|
27 |
+
2024-08-19 14:31:14,465 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
28 |
+
2024-08-19 14:31:22,128 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
29 |
+
2024-08-19 14:31:29,463 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
30 |
+
2024-08-19 14:31:37,221 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
31 |
+
2024-08-19 14:31:44,580 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
32 |
+
2024-08-19 14:31:52,590 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
33 |
+
2024-08-19 14:32:02,842 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
34 |
+
2024-08-19 14:32:11,908 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
35 |
+
2024-08-19 14:32:20,871 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
36 |
+
2024-08-19 14:32:28,190 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0080, 3.8692, 3.4249, 3.7496], device='cuda:0')
|
37 |
+
2024-08-19 14:32:28,226 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7870, 1.6932, 1.7481, 1.2202, 1.3390, 1.8823, 2.3075, 1.4565],
device='cuda:0')
|
39 |
+
2024-08-19 14:32:30,020 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
40 |
+
2024-08-19 14:32:36,114 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
41 |
+
2024-08-19 14:32:41,657 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
42 |
+
2024-08-19 14:32:42,145 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
43 |
+
2024-08-19 14:32:43,494 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006224042046449931
|
44 |
+
2024-08-19 14:32:43,494 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-372000-avg-3-chunk-size-16-left-context-frames-128-2024-08-19-14-12-05
ADDED
@@ -0,0 +1,41 @@
1 |
+
2024-08-19 14:12:05,414 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:12:05,414 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 372000, 'avg': 3, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '128', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-372000-avg-3-chunk-size-16-left-context-frames-128'}
|
3 |
+
2024-08-19 14:12:05,415 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:12:05,784 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-372000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt']
|
5 |
+
2024-08-19 14:12:13,254 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:12:13,254 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:12:13,306 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:12:13,715 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:12:21,321 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:12:29,378 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:12:31,203 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3209, 1.5506, 1.6857, 1.2290, 1.2352, 1.6203, 1.7398, 1.2019],
device='cuda:0')
|
13 |
+
2024-08-19 14:12:38,301 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
14 |
+
2024-08-19 14:12:46,951 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
15 |
+
2024-08-19 14:12:55,329 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
16 |
+
2024-08-19 14:13:03,410 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
17 |
+
2024-08-19 14:13:11,629 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
18 |
+
2024-08-19 14:13:12,027 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.5805, 2.7576, 2.8950, 2.7807], device='cuda:0')
|
19 |
+
2024-08-19 14:13:19,358 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
20 |
+
2024-08-19 14:13:27,850 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
21 |
+
2024-08-19 14:13:34,196 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.5852, 2.7204, 2.8903, 2.7450], device='cuda:0')
|
22 |
+
2024-08-19 14:13:35,108 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
23 |
+
2024-08-19 14:13:43,586 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
24 |
+
2024-08-19 14:13:52,148 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
25 |
+
2024-08-19 14:14:00,180 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
26 |
+
2024-08-19 14:14:08,254 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
27 |
+
2024-08-19 14:14:16,330 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
28 |
+
2024-08-19 14:14:24,335 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
29 |
+
2024-08-19 14:14:32,477 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
30 |
+
2024-08-19 14:14:40,330 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
31 |
+
2024-08-19 14:14:48,482 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
32 |
+
2024-08-19 14:14:56,390 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
33 |
+
2024-08-19 14:15:04,740 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
34 |
+
2024-08-19 14:15:12,643 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
35 |
+
2024-08-19 14:15:20,024 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
36 |
+
2024-08-19 14:15:29,935 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
37 |
+
2024-08-19 14:15:39,708 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
38 |
+
2024-08-19 14:15:49,119 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
39 |
+
2024-08-19 14:15:49,774 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
40 |
+
2024-08-19 14:15:51,166 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006220058080454957
|
41 |
+
2024-08-19 14:15:51,166 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-372000-avg-3-chunk-size-32-left-context-frames-256-2024-08-19-14-08-21
ADDED
@@ -0,0 +1,46 @@
1 |
+
2024-08-19 14:08:21,865 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:08:21,866 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 372000, 'avg': 3, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-372000-avg-3-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-19 14:08:21,866 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:08:22,249 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-372000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt']
|
5 |
+
2024-08-19 14:08:31,225 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:08:31,226 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:08:31,293 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:08:31,728 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:08:39,013 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:08:46,459 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:08:54,855 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
12 |
+
2024-08-19 14:09:03,606 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
13 |
+
2024-08-19 14:09:11,121 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
14 |
+
2024-08-19 14:09:18,803 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
15 |
+
2024-08-19 14:09:26,226 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
16 |
+
2024-08-19 14:09:29,146 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9780, 3.3728, 2.4395, 3.8253], device='cuda:0')
|
17 |
+
2024-08-19 14:09:34,036 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
18 |
+
2024-08-19 14:09:38,586 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0453, 3.1485, 3.2636, 2.9689], device='cuda:0')
|
19 |
+
2024-08-19 14:09:41,402 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
20 |
+
2024-08-19 14:09:49,587 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
21 |
+
2024-08-19 14:09:57,098 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
22 |
+
2024-08-19 14:10:04,843 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
23 |
+
2024-08-19 14:10:13,169 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
24 |
+
2024-08-19 14:10:21,269 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
25 |
+
2024-08-19 14:10:27,812 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7314, 2.1459, 1.7188, 1.3760, 1.7430, 1.5790, 1.8994, 1.8905],
device='cuda:0')
|
27 |
+
2024-08-19 14:10:29,499 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
28 |
+
2024-08-19 14:10:33,140 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.0314, 2.4548, 2.3920, 2.1893], device='cuda:0')
|
29 |
+
2024-08-19 14:10:37,669 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
30 |
+
2024-08-19 14:10:45,986 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
31 |
+
2024-08-19 14:10:54,297 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
32 |
+
2024-08-19 14:10:54,853 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5327, 2.1670, 2.4655, 2.2192], device='cuda:0')
|
33 |
+
2024-08-19 14:11:02,430 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
34 |
+
2024-08-19 14:11:10,371 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
35 |
+
2024-08-19 14:11:10,592 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2003, 1.9148, 2.1743, 2.3338], device='cuda:0')
|
36 |
+
2024-08-19 14:11:17,372 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0017, 3.3896, 2.4983, 3.8090], device='cuda:0')
|
37 |
+
2024-08-19 14:11:18,276 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
38 |
+
2024-08-19 14:11:26,496 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
39 |
+
2024-08-19 14:11:34,603 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
40 |
+
2024-08-19 14:11:38,797 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0235, 3.0048, 3.1163, 3.0398], device='cuda:0')
|
41 |
+
2024-08-19 14:11:42,826 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
42 |
+
2024-08-19 14:11:50,501 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
43 |
+
2024-08-19 14:11:58,896 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
44 |
+
2024-08-19 14:11:59,346 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
45 |
+
2024-08-19 14:12:00,699 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.00620196050590261
|
46 |
+
2024-08-19 14:12:00,700 INFO [inference_audio_tagging.py:457] Done
|
inference_audio_tagging/log-decode-iter-372000-avg-3-use-averaged-model-chunk-size-32-left-context-frames-256-2024-08-18-18-56-51
ADDED
@@ -0,0 +1,49 @@
1 |
+
2024-08-18 18:56:51,545 INFO [inference_audio_tagging.py:316] Evaluation started
|
2 |
+
2024-08-18 18:56:51,545 INFO [inference_audio_tagging.py:318] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-clean', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 372000, 'avg': 3, 'use_averaged_model': True, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-372000-avg-3-use-averaged-model-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-18 18:56:51,546 INFO [inference_audio_tagging.py:324] About to create model
|
4 |
+
2024-08-18 18:56:51,949 INFO [inference_audio_tagging.py:384] Calculating the averaged model over iteration checkpoints from multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt (excluded) to multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-372000.pt
|
5 |
+
2024-08-18 18:57:05,885 INFO [inference_audio_tagging.py:421] Number of model parameters: 66139654
|
6 |
+
2024-08-18 18:57:05,885 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-18 18:57:05,937 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-18 18:57:06,345 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-18 18:57:14,907 INFO [inference_audio_tagging.py:286] Processed 60 cuts already.
|
10 |
+
2024-08-18 18:57:22,353 INFO [inference_audio_tagging.py:286] Processed 660 cuts already.
|
11 |
+
2024-08-18 18:57:30,302 INFO [inference_audio_tagging.py:286] Processed 1260 cuts already.
|
12 |
+
2024-08-18 18:57:37,936 INFO [inference_audio_tagging.py:286] Processed 1860 cuts already.
|
13 |
+
2024-08-18 18:57:44,606 INFO [inference_audio_tagging.py:286] Processed 2460 cuts already.
|
14 |
+
2024-08-18 18:57:49,725 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7316, 2.2162, 2.2331, 2.0539], device='cuda:0')
|
15 |
+
2024-08-18 18:57:50,817 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0457, 3.9214, 3.4020, 3.7208], device='cuda:0')
|
16 |
+
2024-08-18 18:57:51,775 INFO [inference_audio_tagging.py:286] Processed 3060 cuts already.
|
17 |
+
2024-08-18 18:57:58,063 INFO [inference_audio_tagging.py:286] Processed 3660 cuts already.
|
18 |
+
2024-08-18 18:58:02,756 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0312, 3.8963, 3.4236, 3.7549], device='cuda:0')
|
19 |
+
2024-08-18 18:58:03,988 INFO [inference_audio_tagging.py:286] Processed 4260 cuts already.
|
20 |
+
2024-08-18 18:58:10,392 INFO [inference_audio_tagging.py:286] Processed 4860 cuts already.
|
21 |
+
2024-08-18 18:58:17,420 INFO [inference_audio_tagging.py:286] Processed 5460 cuts already.
|
22 |
+
2024-08-18 18:58:23,898 INFO [inference_audio_tagging.py:286] Processed 6060 cuts already.
|
23 |
+
2024-08-18 18:58:30,898 INFO [inference_audio_tagging.py:286] Processed 6660 cuts already.
|
24 |
+
2024-08-18 18:58:37,541 INFO [inference_audio_tagging.py:286] Processed 7260 cuts already.
|
25 |
+
2024-08-18 18:58:41,459 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6547, 3.8374, 4.4411, 4.5599], device='cuda:0')
|
26 |
+
2024-08-18 18:58:44,345 INFO [inference_audio_tagging.py:286] Processed 7860 cuts already.
|
27 |
+
2024-08-18 18:58:50,930 INFO [inference_audio_tagging.py:286] Processed 8460 cuts already.
|
28 |
+
2024-08-18 18:58:52,718 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0363, 3.1148, 3.2213, 3.0300], device='cuda:0')
|
29 |
+
2024-08-18 18:58:57,298 INFO [inference_audio_tagging.py:286] Processed 9060 cuts already.
|
30 |
+
2024-08-18 18:59:03,463 INFO [inference_audio_tagging.py:286] Processed 9660 cuts already.
|
31 |
+
2024-08-18 18:59:09,117 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1536, 2.9653, 3.1950, 3.0706], device='cuda:0')
|
32 |
+
2024-08-18 18:59:09,963 INFO [inference_audio_tagging.py:286] Processed 10260 cuts already.
|
33 |
+
2024-08-18 18:59:16,573 INFO [inference_audio_tagging.py:286] Processed 10860 cuts already.
|
34 |
+
2024-08-18 18:59:23,033 INFO [inference_audio_tagging.py:286] Processed 11460 cuts already.
|
35 |
+
2024-08-18 18:59:24,919 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7958, 2.0946, 1.7705, 1.4153, 1.7191, 1.5723, 1.7873, 1.7513],
device='cuda:0')
|
37 |
+
2024-08-18 18:59:29,457 INFO [inference_audio_tagging.py:286] Processed 12060 cuts already.
|
38 |
+
2024-08-18 18:59:30,848 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6787, 3.8668, 4.4425, 4.5342], device='cuda:0')
|
39 |
+
2024-08-18 18:59:35,763 INFO [inference_audio_tagging.py:286] Processed 12660 cuts already.
|
40 |
+
2024-08-18 18:59:39,036 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8164, 1.9244, 2.0591, 2.0073, 2.4492, 1.9407, 1.9982, 1.9435],
device='cuda:0')
|
42 |
+
2024-08-18 18:59:41,621 INFO [inference_audio_tagging.py:286] Processed 13260 cuts already.
|
43 |
+
2024-08-18 18:59:48,282 INFO [inference_audio_tagging.py:286] Processed 13860 cuts already.
|
44 |
+
2024-08-18 18:59:54,691 INFO [inference_audio_tagging.py:286] Processed 14460 cuts already.
|
45 |
+
2024-08-18 18:59:59,160 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0088, 3.8764, 3.4298, 3.7547], device='cuda:0')
|
46 |
+
2024-08-18 19:00:00,841 INFO [inference_audio_tagging.py:286] Processed 15060 cuts already.
|
47 |
+
2024-08-18 19:00:01,455 INFO [inference_audio_tagging.py:287] Finish collecting audio logits
|
48 |
+
2024-08-18 19:00:03,119 INFO [inference_audio_tagging.py:454] mAP for audioset eval is: 0.4579585731070496
|
49 |
+
2024-08-18 19:00:03,120 INFO [inference_audio_tagging.py:456] Done
|
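This run differs from the preceding ones in that it sets use_averaged_model, and the log reports "Calculating the averaged model over iteration checkpoints from ... (excluded) to ...". That suggests the model is derived from running parameter averages stored in the start and end checkpoints rather than by averaging the checkpoint weights directly. A hedged sketch of the interval formula commonly used for this is below; the "model_avg" and "batch_idx_train" keys are assumptions inferred from the logged parameters, not verified against the script.

import torch

def averaged_model_over_interval(start_ckpt: str, end_ckpt: str):
    """Hedged sketch: average over (start, end] from cumulative running parameter averages."""
    start = torch.load(start_ckpt, map_location="cpu")
    end = torch.load(end_ckpt, map_location="cpu")
    n_start, n_end = start["batch_idx_train"], end["batch_idx_train"]  # assumed keys
    scale = 1.0 / (n_end - n_start)
    return {
        k: (end["model_avg"][k].float() * n_end
            - start["model_avg"][k].float() * n_start) * scale
        for k in end["model_avg"]
    }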
inference_audio_tagging/log-decode-iter-372000-avg-4-chunk-size-16-left-context-frames-128-2024-08-19-14-15-56
ADDED
@@ -0,0 +1,47 @@
1 |
+
2024-08-19 14:15:56,517 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:15:56,517 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 372000, 'avg': 4, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '128', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-372000-avg-4-chunk-size-16-left-context-frames-128'}
|
3 |
+
2024-08-19 14:15:56,518 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:15:56,887 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-372000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt']
|
5 |
+
2024-08-19 14:16:05,234 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:16:05,234 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:16:05,291 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:16:05,725 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:16:13,383 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:16:21,825 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:16:21,903 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7421e-05, 2.5388e-03, 1.0623e-04, 3.8312e+00, 2.4914e-04, 2.2624e-02,
|
12 |
+
2.5287e-02, 6.7430e-03], device='cuda:0')
|
13 |
+
2024-08-19 14:16:29,814 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
14 |
+
2024-08-19 14:16:35,868 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3338, 2.2583, 1.9410, 1.7865], device='cuda:0')
|
15 |
+
2024-08-19 14:16:37,234 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
16 |
+
2024-08-19 14:16:42,648 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
17 |
+
2024-08-19 14:16:47,972 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9276, 1.8515, 1.9909, 2.1603], device='cuda:0')
|
18 |
+
2024-08-19 14:16:48,836 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
19 |
+
2024-08-19 14:17:00,476 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
20 |
+
2024-08-19 14:17:11,747 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
21 |
+
2024-08-19 14:17:22,472 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
22 |
+
2024-08-19 14:17:29,860 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
23 |
+
2024-08-19 14:17:31,936 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6247, 3.4854, 3.0251, 3.3613], device='cuda:0')
|
24 |
+
2024-08-19 14:17:36,457 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
25 |
+
2024-08-19 14:17:42,010 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
26 |
+
2024-08-19 14:17:46,499 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
27 |
+
2024-08-19 14:17:51,195 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
28 |
+
2024-08-19 14:17:54,880 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9487, 1.8452, 1.9134, 2.1173], device='cuda:0')
|
29 |
+
2024-08-19 14:17:55,706 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
30 |
+
2024-08-19 14:17:57,369 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.6698, 2.1566, 2.1341, 1.9105], device='cuda:0')
|
31 |
+
2024-08-19 14:18:00,678 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
32 |
+
2024-08-19 14:18:05,373 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
33 |
+
2024-08-19 14:18:10,120 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
34 |
+
2024-08-19 14:18:14,880 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
35 |
+
2024-08-19 14:18:19,516 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.5887, 2.7958, 2.8635, 2.7193], device='cuda:0')
|
36 |
+
2024-08-19 14:18:19,729 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
37 |
+
2024-08-19 14:18:24,233 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
38 |
+
2024-08-19 14:18:26,361 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.2114, 1.7660, 2.2590, 1.0550], device='cuda:0')
|
39 |
+
2024-08-19 14:18:28,438 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
40 |
+
2024-08-19 14:18:30,902 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5168, 2.8308, 2.8739, 2.8247], device='cuda:0')
|
41 |
+
2024-08-19 14:18:33,166 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
42 |
+
2024-08-19 14:18:37,631 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
43 |
+
2024-08-19 14:18:42,283 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
44 |
+
2024-08-19 14:18:46,500 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
45 |
+
2024-08-19 14:18:46,906 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
46 |
+
2024-08-19 14:18:48,401 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006230305719515452
|
47 |
+
2024-08-19 14:18:48,402 INFO [inference_audio_tagging.py:457] Done
|
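Note: the run above scores an unweighted average of the last four --iter checkpoints (the "averaging [...]" line, with 'avg': 4) rather than a single snapshot. A minimal sketch of that kind of state-dict averaging, assuming equal weights and icefall-style checkpoints that store their weights under a "model" key; the recipe's own averaging helper may differ:

from pathlib import Path
from typing import List

import torch


def average_state_dicts(ckpt_paths: List[Path]) -> dict:
    """Equal-weight average of the "model" state dicts of several checkpoints."""
    avg = None
    for path in ckpt_paths:
        # Assumption: icefall-style checkpoints, i.e. weights stored under "model".
        state = torch.load(path, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(ckpt_paths) for k, v in avg.items()}


# Hypothetical usage mirroring the "averaging [...]" log line (exp dir shortened).
exp_dir = Path("multi_KD/exp_causal1_...")  # placeholder, not the real experiment dir
ckpts = [exp_dir / f"checkpoint-{it}.pt" for it in (372000, 368000, 364000, 360000)]
# model.load_state_dict(average_state_dicts(ckpts), strict=False)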
inference_audio_tagging/log-decode-iter-372000-avg-4-chunk-size-32-left-context-frames-256-2024-08-19-14-12-05
ADDED
@@ -0,0 +1,48 @@
1 |
+
2024-08-19 14:12:05,409 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:12:05,409 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 372000, 'avg': 4, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-372000-avg-4-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-19 14:12:05,410 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:12:05,778 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-372000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt']
|
5 |
+
2024-08-19 14:12:16,291 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:12:16,292 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:12:16,364 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:12:16,777 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:12:23,258 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:12:29,402 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:12:38,306 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
12 |
+
2024-08-19 14:12:44,929 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9726, 3.4110, 2.6548, 3.7477], device='cuda:0')
|
13 |
+
2024-08-19 14:12:46,962 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
14 |
+
2024-08-19 14:12:47,780 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9649, 3.4329, 2.6452, 3.7194], device='cuda:0')
|
15 |
+
2024-08-19 14:12:55,330 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
16 |
+
2024-08-19 14:13:03,329 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
17 |
+
2024-08-19 14:13:11,627 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
18 |
+
2024-08-19 14:13:12,183 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([5.0437, 4.8382, 4.9500, 5.0010], device='cuda:0')
|
19 |
+
2024-08-19 14:13:19,286 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
20 |
+
2024-08-19 14:13:27,813 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
21 |
+
2024-08-19 14:13:34,981 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9675, 3.3711, 2.3085, 3.8170], device='cuda:0')
|
22 |
+
2024-08-19 14:13:35,056 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
23 |
+
2024-08-19 14:13:38,462 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9353, 3.4064, 2.3712, 3.7478], device='cuda:0')
|
24 |
+
2024-08-19 14:13:43,450 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
25 |
+
2024-08-19 14:13:52,154 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
26 |
+
2024-08-19 14:13:55,327 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5453e-04, 1.1201e-03, 3.4938e-09, 3.5204e+00, 1.1338e-03, 3.6365e-02,
|
27 |
+
2.2804e-02, 4.0654e-02], device='cuda:0')
|
28 |
+
2024-08-19 14:14:00,083 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.6101, 2.0320, 2.4429, 1.0841], device='cuda:0')
|
29 |
+
2024-08-19 14:14:00,183 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
30 |
+
2024-08-19 14:14:08,247 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
31 |
+
2024-08-19 14:14:16,329 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
32 |
+
2024-08-19 14:14:24,328 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
33 |
+
2024-08-19 14:14:32,477 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
34 |
+
2024-08-19 14:14:40,331 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
35 |
+
2024-08-19 14:14:48,479 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
36 |
+
2024-08-19 14:14:56,389 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
37 |
+
2024-08-19 14:14:58,137 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9765, 3.4381, 2.5538, 3.6984], device='cuda:0')
|
38 |
+
2024-08-19 14:15:01,610 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8102, 1.6126, 1.8653, 1.2080, 1.4627, 1.9411, 2.3336, 1.4421],
|
39 |
+
device='cuda:0')
|
40 |
+
2024-08-19 14:15:04,739 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
41 |
+
2024-08-19 14:15:12,639 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
42 |
+
2024-08-19 14:15:20,031 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
43 |
+
2024-08-19 14:15:29,936 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
44 |
+
2024-08-19 14:15:39,787 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
45 |
+
2024-08-19 14:15:49,115 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
46 |
+
2024-08-19 14:15:49,773 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
47 |
+
2024-08-19 14:15:51,158 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006216533105513755
|
48 |
+
2024-08-19 14:15:51,158 INFO [inference_audio_tagging.py:457] Done
|
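Note: each of these logs ends with a single macro mAP over the 527 AudioSet event classes ('num_events': 527). A rough sketch of how such a figure can be computed from the collected logits and multi-hot targets with scikit-learn; the script's actual metric code is not shown in these logs and may differ:

import numpy as np
from sklearn.metrics import average_precision_score


def audioset_map(scores: np.ndarray, targets: np.ndarray) -> float:
    """Macro mAP: per-class average precision, unweighted mean over classes.

    scores:  (num_cuts, num_events) logits or probabilities
    targets: (num_cuts, num_events) multi-hot labels
    """
    return float(average_precision_score(targets, scores, average="macro"))


# Toy example with shapes matching the eval run above (~15k cuts, 527 events).
rng = np.random.default_rng(0)
scores = rng.normal(size=(15060, 527))
targets = (rng.random((15060, 527)) < 0.01).astype(int)
print(audioset_map(scores, targets))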
inference_audio_tagging/log-decode-iter-372000-avg-5-chunk-size-16-left-context-frames-128-2024-08-19-14-18-53
ADDED
@@ -0,0 +1,49 @@
1 |
+
2024-08-19 14:18:53,217 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:18:53,217 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 372000, 'avg': 5, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '16', 'left_context_frames': '128', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-372000-avg-5-chunk-size-16-left-context-frames-128'}
|
3 |
+
2024-08-19 14:18:53,217 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:18:53,579 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-372000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt']
|
5 |
+
2024-08-19 14:19:08,540 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:19:08,541 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:19:08,598 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:19:09,029 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:19:16,991 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:19:23,759 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:19:39,339 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
12 |
+
2024-08-19 14:19:55,616 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
13 |
+
2024-08-19 14:20:01,345 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5254e-05, 4.1481e-03, 1.7270e-15, 3.8312e+00, 1.6097e-04, 3.0191e-02,
|
14 |
+
2.2083e-02, 5.1140e-03], device='cuda:0')
|
15 |
+
2024-08-19 14:20:04,543 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4100, 2.1848, 1.9585, 1.7504], device='cuda:0')
|
16 |
+
2024-08-19 14:20:09,187 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
17 |
+
2024-08-19 14:20:22,769 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
18 |
+
2024-08-19 14:20:37,387 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
19 |
+
2024-08-19 14:20:52,311 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
20 |
+
2024-08-19 14:21:06,227 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
21 |
+
2024-08-19 14:21:09,012 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.1949, 2.0553, 2.0628, 2.0304], device='cuda:0')
|
22 |
+
2024-08-19 14:21:21,230 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
23 |
+
2024-08-19 14:21:35,306 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
24 |
+
2024-08-19 14:21:51,084 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
25 |
+
2024-08-19 14:21:52,161 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.5870, 2.7855, 2.9124, 2.7665], device='cuda:0')
|
26 |
+
2024-08-19 14:22:06,465 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
27 |
+
2024-08-19 14:22:21,057 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
28 |
+
2024-08-19 14:22:36,413 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
29 |
+
2024-08-19 14:22:51,482 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
30 |
+
2024-08-19 14:23:05,484 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
31 |
+
2024-08-19 14:23:10,156 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.3550, 1.6440, 1.6494, 1.6352, 1.9975, 1.6017, 1.7393, 1.7340],
|
32 |
+
device='cuda:0')
|
33 |
+
2024-08-19 14:23:17,842 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
34 |
+
2024-08-19 14:23:31,291 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
35 |
+
2024-08-19 14:23:37,179 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.7553e-06, 1.1953e-03, 1.8276e-05, 3.8312e+00, 1.4163e-03, 2.7707e-02,
|
36 |
+
2.0762e-02, 4.5919e-03], device='cuda:0')
|
37 |
+
2024-08-19 14:23:43,565 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
38 |
+
2024-08-19 14:23:56,631 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
39 |
+
2024-08-19 14:24:02,973 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.5841, 2.0529, 2.0782, 1.8686], device='cuda:0')
|
40 |
+
2024-08-19 14:24:09,979 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
41 |
+
2024-08-19 14:24:24,176 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
42 |
+
2024-08-19 14:24:32,577 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.6308, 4.3681, 4.5373, 4.5872], device='cuda:0')
|
43 |
+
2024-08-19 14:24:38,496 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
44 |
+
2024-08-19 14:24:53,778 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
45 |
+
2024-08-19 14:24:57,843 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.4009, 2.1757, 1.9792, 1.7281], device='cuda:0')
|
46 |
+
2024-08-19 14:25:08,374 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
47 |
+
2024-08-19 14:25:09,328 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
48 |
+
2024-08-19 14:25:10,657 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006232581206165736
|
49 |
+
2024-08-19 14:25:10,657 INFO [inference_audio_tagging.py:457] Done
|
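Note: the decode configuration (iter, avg, chunk size, left-context frames) is encoded in each log file name, and the final mAP line is the figure to compare across them. A small, assumed-standalone helper for tabulating those results from a directory of such logs; file names that follow other schemes (e.g. the use-averaged-model variants) are simply skipped:

import re
from pathlib import Path

SUFFIX_RE = re.compile(
    r"log-decode-iter-(?P<iter>\d+)-avg-(?P<avg>\d+)"
    r"-chunk-size-(?P<chunk>\d+)-left-context-frames-(?P<left>\d+)"
)
MAP_RE = re.compile(r"mAP for audioset eval is: (?P<map>[0-9.]+)")


def collect_results(log_dir: Path) -> list:
    """Pair each decode configuration with the mAP reported at the end of its log."""
    rows = []
    for log in sorted(log_dir.glob("log-decode-*")):
        cfg = SUFFIX_RE.search(log.name)
        if cfg is None:
            continue  # file name follows a different naming scheme; skip it
        result = MAP_RE.search(log.read_text())
        if result is None:
            continue  # run did not reach the final mAP line
        rows.append({**cfg.groupdict(), "mAP": float(result.group("map"))})
    return rows


# e.g. collect_results(Path("inference_audio_tagging"))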
inference_audio_tagging/log-decode-iter-372000-avg-5-chunk-size-32-left-context-frames-256-2024-08-19-14-15-56
ADDED
@@ -0,0 +1,48 @@
1 |
+
2024-08-19 14:15:56,511 INFO [inference_audio_tagging.py:317] Evaluation started
|
2 |
+
2024-08-19 14:15:56,512 INFO [inference_audio_tagging.py:319] {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'valid_interval': 3000, 'feature_dim': 80, 'subsampling_factor': 4, 'warm_step': 2000, 'env_info': {'k2-version': '1.24.3', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': 'e400fa3b456faf8afe0ee5bfe572946b4921a3db', 'k2-git-date': 'Sat Jul 15 04:21:50 2023', 'lhotse-version': '1.16.0', 'torch-version': '2.0.1+cu117', 'torch-cuda-available': True, 'torch-cuda-version': '11.7', 'python-version': '3.9', 'icefall-git-branch': 'multi_KD_with_wenet', 'icefall-git-sha1': '0d2af1df-dirty', 'icefall-git-date': 'Wed Aug 14 17:27:16 2024', 'icefall-path': '/xy/mnt/yangxiaoyu/workspace/icefall_multi_KD', 'k2-path': '/root/anaconda3/lib/python3.9/site-packages/k2/__init__.py', 'lhotse-path': '/root/anaconda3/lib/python3.9/site-packages/lhotse/__init__.py', 'hostname': 'NGK_xiaoyu'}, 'epoch': 30, 'iter': 372000, 'avg': 5, 'use_averaged_model': False, 'exp_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16'), 'trained_with_distillation': True, 'trained_with_multitask': False, 'freeze_encoder': False, 'num_events': 527, 'eval_subset': 'eval', 'vocab_size': 500, 'blank_id': 0, 'context_size': 2, 'do_audio_tagging': True, 'use_encoder_projection': True, 'encoder_projection_dim': 2560, 'freezing_encoder_layer_index': '-1', 'freeze_encoder_steps': -1, 'save_logits': False, 'num_encoder_layers': '2,2,3,4,3,2', 'downsampling_factor': '1,2,4,8,4,2', 'feedforward_dim': '512,768,1024,1536,1024,768', 'num_heads': '4,4,4,8,4,4', 'encoder_dim': '192,256,384,512,384,256', 'query_head_dim': '32', 'value_head_dim': '12', 'pos_head_dim': '4', 'pos_dim': 48, 'encoder_unmasked_dim': '192,192,256,256,256,192', 'cnn_module_kernel': '31,31,15,15,15,31', 'decoder_dim': 512, 'joiner_dim': 512, 'causal': True, 'chunk_size': '32', 'left_context_frames': '256', 'use_transducer': True, 'use_ctc': False, 'speaker_input_idx': 2, 'whisper_dim': 1280, 'use_task_id': False, 'num_codebooks': 32, 'mvq_kd_layer_idx': -1, 'use_subsampled_output': True, 'delta_t': 0, 'full_libri': True, 'mini_libri': False, 'use_libriheavy': False, 'libriheavy_subset': 'small', 'use_librispeech': False, 'use_wenetspeech': False, 'use_audioset': False, 'audioset_subset': 'balanced', 'use_voxceleb': False, 'voxceleb_subset': 'vox1', 'use_fma': False, 'fma_subset': 'large', 'manifest_dir': PosixPath('data/fbank_LS_Vox_AS_fma'), 'max_duration': 300, 'bucketing_sampler': True, 'num_buckets': 30, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'drop_last': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'enable_audioset': False, 'use_musan_separately': False, 'input_strategy': 'PrecomputedFeatures', 'drop_features': False, 'return_audio': False, 'use_beats': True, 'use_ecapa': False, 'use_whisper': True, 'whisper_mvq': False, 'beats_ckpt': 'data/models/BEATs/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt', 'whisper_version': 'small.en', 'use_mert': False, 'lm_vocab_size': 500, 'lm_epoch': 7, 'lm_avg': 1, 'lm_exp_dir': None, 'rnn_lm_embedding_dim': 2048, 'rnn_lm_hidden_dim': 2048, 'rnn_lm_num_layers': 
3, 'rnn_lm_tie_weights': True, 'transformer_lm_exp_dir': None, 'transformer_lm_dim_feedforward': 2048, 'transformer_lm_encoder_dim': 768, 'transformer_lm_embedding_dim': 768, 'transformer_lm_nhead': 8, 'transformer_lm_num_layers': 16, 'transformer_lm_tie_weights': True, 'res_dir': PosixPath('multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/inference_audio_tagging'), 'suffix': 'iter-372000-avg-5-chunk-size-32-left-context-frames-256'}
|
3 |
+
2024-08-19 14:15:56,512 INFO [inference_audio_tagging.py:325] About to create model
|
4 |
+
2024-08-19 14:15:56,887 INFO [inference_audio_tagging.py:354] averaging ['multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-372000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-368000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-364000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-360000.pt', 'multi_KD/exp_causal1_delta6KD_LS1_5fold+wenetspech0_0fold+as_unbalanced1+vox_1_vox2_base_lr_0.045_use_beats_1_scale_1.0_use_ecapa_1_layer_2_scale_10.0_1_scale_1.0_specaug0_musan0_with_task_ID_stop_early1_share_asr1_md1500_amp_bf16/checkpoint-356000.pt']
|
5 |
+
2024-08-19 14:16:18,814 INFO [inference_audio_tagging.py:422] Number of model parameters: 66139654
|
6 |
+
2024-08-19 14:16:18,815 INFO [kd_datamodule.py:912] About to get the audioset eval cuts.
|
7 |
+
2024-08-19 14:16:18,866 INFO [kd_datamodule.py:570] About to create dev dataset
|
8 |
+
2024-08-19 14:16:19,277 INFO [kd_datamodule.py:591] About to create dev dataloader
|
9 |
+
2024-08-19 14:16:26,256 INFO [inference_audio_tagging.py:287] Processed 60 cuts already.
|
10 |
+
2024-08-19 14:16:33,842 INFO [inference_audio_tagging.py:287] Processed 660 cuts already.
|
11 |
+
2024-08-19 14:16:37,383 INFO [inference_audio_tagging.py:287] Processed 1260 cuts already.
|
12 |
+
2024-08-19 14:16:40,931 INFO [inference_audio_tagging.py:287] Processed 1860 cuts already.
|
13 |
+
2024-08-19 14:16:44,468 INFO [inference_audio_tagging.py:287] Processed 2460 cuts already.
|
14 |
+
2024-08-19 14:16:48,829 INFO [inference_audio_tagging.py:287] Processed 3060 cuts already.
|
15 |
+
2024-08-19 14:17:00,415 INFO [inference_audio_tagging.py:287] Processed 3660 cuts already.
|
16 |
+
2024-08-19 14:17:11,660 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.9579, 3.2629, 2.3329, 3.7608], device='cuda:0')
|
17 |
+
2024-08-19 14:17:11,737 INFO [inference_audio_tagging.py:287] Processed 4260 cuts already.
|
18 |
+
2024-08-19 14:17:20,508 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0389, 3.1006, 3.2634, 3.0996], device='cuda:0')
|
19 |
+
2024-08-19 14:17:22,397 INFO [inference_audio_tagging.py:287] Processed 4860 cuts already.
|
20 |
+
2024-08-19 14:17:29,985 INFO [inference_audio_tagging.py:287] Processed 5460 cuts already.
|
21 |
+
2024-08-19 14:17:36,411 INFO [inference_audio_tagging.py:287] Processed 6060 cuts already.
|
22 |
+
2024-08-19 14:17:36,764 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.0898, 2.4269, 2.4887, 2.2428], device='cuda:0')
|
23 |
+
2024-08-19 14:17:41,853 INFO [inference_audio_tagging.py:287] Processed 6660 cuts already.
|
24 |
+
2024-08-19 14:17:46,288 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.0986, 2.9629, 3.2274, 3.0579], device='cuda:0')
|
25 |
+
2024-08-19 14:17:46,298 INFO [inference_audio_tagging.py:287] Processed 7260 cuts already.
|
26 |
+
2024-08-19 14:17:51,192 INFO [inference_audio_tagging.py:287] Processed 7860 cuts already.
|
27 |
+
2024-08-19 14:17:55,474 INFO [inference_audio_tagging.py:287] Processed 8460 cuts already.
|
28 |
+
2024-08-19 14:17:57,825 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5376, 2.2437, 2.4409, 2.2191], device='cuda:0')
|
29 |
+
2024-08-19 14:18:00,616 INFO [inference_audio_tagging.py:287] Processed 9060 cuts already.
|
30 |
+
2024-08-19 14:18:01,867 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([3.5178, 2.0555, 2.2890, 2.1063], device='cuda:0')
|
31 |
+
2024-08-19 14:18:05,215 INFO [inference_audio_tagging.py:287] Processed 9660 cuts already.
|
32 |
+
2024-08-19 14:18:10,142 INFO [inference_audio_tagging.py:287] Processed 10260 cuts already.
|
33 |
+
2024-08-19 14:18:11,221 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.8100, 1.6149, 1.8980, 1.2209, 1.4218, 1.9370, 2.3120, 1.4507],
|
34 |
+
device='cuda:0')
|
35 |
+
2024-08-19 14:18:14,890 INFO [inference_audio_tagging.py:287] Processed 10860 cuts already.
|
36 |
+
2024-08-19 14:18:19,821 INFO [inference_audio_tagging.py:287] Processed 11460 cuts already.
|
37 |
+
2024-08-19 14:18:24,157 INFO [inference_audio_tagging.py:287] Processed 12060 cuts already.
|
38 |
+
2024-08-19 14:18:27,048 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0205, 3.8945, 3.4242, 3.6845], device='cuda:0')
|
39 |
+
2024-08-19 14:18:28,394 INFO [inference_audio_tagging.py:287] Processed 12660 cuts already.
|
40 |
+
2024-08-19 14:18:30,906 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([4.0363, 3.0922, 3.2624, 3.1036], device='cuda:0')
|
41 |
+
2024-08-19 14:18:32,667 INFO [inference_audio_tagging.py:287] Processed 13260 cuts already.
|
42 |
+
2024-08-19 14:18:35,975 INFO [zipformer.py:1877] name=None, attn_weights_entropy = tensor([2.9094, 2.3659, 2.2842, 2.1428], device='cuda:0')
|
43 |
+
2024-08-19 14:18:37,287 INFO [inference_audio_tagging.py:287] Processed 13860 cuts already.
|
44 |
+
2024-08-19 14:18:41,790 INFO [inference_audio_tagging.py:287] Processed 14460 cuts already.
|
45 |
+
2024-08-19 14:18:46,440 INFO [inference_audio_tagging.py:287] Processed 15060 cuts already.
|
46 |
+
2024-08-19 14:18:46,827 INFO [inference_audio_tagging.py:288] Finish collecting audio logits
|
47 |
+
2024-08-19 14:18:48,226 INFO [inference_audio_tagging.py:455] mAP for audioset eval is: 0.006222875499287261
|
48 |
+
2024-08-19 14:18:48,226 INFO [inference_audio_tagging.py:457] Done
|