lhotse: Loading sampler state dict error

I am resuming training of the WenetSpeech model by specifying --start-batch 24000; the command I used is listed below:

python3 ./pruned_transducer_stateless5/train.py \
  --start-batch 24000 \
  --use-fp16 True \
  --lang-dir data/lang_char \
  --exp-dir pruned_transducer_stateless5/exp_L_streaming \
  --world-size 8 \
  --num-epochs 15 \
  --start-epoch 1 \
  --max-duration 140 \
  --valid-interval 3000 \
  --model-warm-step 3000 \
  --save-every-n 2000 \
  --average-period 1000 \
  --training-subset L \
  --dynamic-chunk-training True \
  --causal-convolution True \
  --short-chunk-size 25 \
  --num-left-chunks 4

The call to train_dl.sampler.load_state_dict(sampler_state_dict) takes about three hours, and then I get the following error while loading:
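For context, the resume path works by checkpointing the sampler's position in the shuffled cut stream and replaying it on restart. The sketch below is a minimal, self-contained illustration of that state_dict()/load_state_dict() pattern — it is illustrative only, not the actual lhotse or icefall implementation, and the ToySampler class and its fields are invented for this example. It also hints at why resuming a very large corpus can be slow: restoring the position may require fast-forwarding through everything already consumed.

```python
# Minimal sketch of the sampler checkpoint/resume pattern
# (illustrative only; NOT the actual lhotse/icefall code).
import random


class ToySampler:
    """Yields shuffled indices; supports checkpoint/resume via state dicts."""

    def __init__(self, num_items, seed=42):
        self.num_items = num_items
        self.seed = seed
        self.position = 0  # how many items have been consumed so far

    def state_dict(self):
        # Save only what is needed to reproduce the stream position.
        return {"seed": self.seed, "position": self.position}

    def load_state_dict(self, state):
        self.seed = state["seed"]
        self.position = state["position"]

    def __iter__(self):
        # Rebuild the same shuffled order from the seed, then skip items
        # consumed before the checkpoint.  With a large corpus, this
        # fast-forward step is what makes resuming take a long time.
        rng = random.Random(self.seed)
        order = list(range(self.num_items))
        rng.shuffle(order)
        for idx in order[self.position:]:
            self.position += 1
            yield idx


# Consume a few "batches", checkpoint, then resume in a fresh sampler.
s1 = ToySampler(10)
it = iter(s1)
first = [next(it) for _ in range(4)]
ckpt = s1.state_dict()

s2 = ToySampler(10)
s2.load_state_dict(ckpt)
rest = list(iter(s2))

# Resuming continues exactly where the checkpoint left off: no item is
# repeated and none is skipped.
assert sorted(first + rest) == list(range(10))
```

The real DynamicBucketingSampler state also carries bucketing and epoch information, so its restore step does considerably more work than this toy, but the contract (consume, checkpoint, restore, continue without duplicates) is the same one the resume in train.py relies on.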

2022-08-08 16:47:14,334 INFO [train.py:943] (0/8) Training started
2022-08-08 16:47:14,334 INFO [train.py:943] (7/8) Training started
2022-08-08 16:47:14,334 INFO [train.py:953] (7/8) Device: cuda:7
2022-08-08 16:47:14,335 INFO [train.py:943] (5/8) Training started
2022-08-08 16:47:14,336 INFO [train.py:953] (5/8) Device: cuda:5
2022-08-08 16:47:14,337 INFO [train.py:953] (0/8) Device: cuda:0
2022-08-08 16:47:14,340 INFO [train.py:943] (6/8) Training started
2022-08-08 16:47:14,340 INFO [train.py:953] (6/8) Device: cuda:6
2022-08-08 16:47:14,340 INFO [train.py:943] (1/8) Training started
2022-08-08 16:47:14,341 INFO [train.py:953] (1/8) Device: cuda:1
2022-08-08 16:47:14,341 INFO [train.py:943] (4/8) Training started
2022-08-08 16:47:14,341 INFO [train.py:953] (4/8) Device: cuda:4
2022-08-08 16:47:14,342 INFO [train.py:943] (2/8) Training started
2022-08-08 16:47:14,342 INFO [train.py:953] (2/8) Device: cuda:2
2022-08-08 16:47:14,342 INFO [train.py:943] (3/8) Training started
2022-08-08 16:47:14,343 INFO [train.py:953] (3/8) Device: cuda:3
2022-08-08 16:47:16,459 INFO [lexicon.py:176] (2/8) Loading pre-compiled data/lang_char/Linv.pt
2022-08-08 16:47:16,476 INFO [lexicon.py:176] (6/8) Loading pre-compiled data/lang_char/Linv.pt
2022-08-08 16:47:16,547 INFO [lexicon.py:176] (0/8) Loading pre-compiled data/lang_char/Linv.pt
2022-08-08 16:47:16,552 INFO [lexicon.py:176] (3/8) Loading pre-compiled data/lang_char/Linv.pt
2022-08-08 16:47:16,554 INFO [lexicon.py:176] (4/8) Loading pre-compiled data/lang_char/Linv.pt
2022-08-08 16:47:16,555 INFO [lexicon.py:176] (1/8) Loading pre-compiled data/lang_char/Linv.pt
2022-08-08 16:47:16,563 INFO [lexicon.py:176] (7/8) Loading pre-compiled data/lang_char/Linv.pt
2022-08-08 16:47:16,576 INFO [lexicon.py:176] (5/8) Loading pre-compiled data/lang_char/Linv.pt
2022-08-08 16:47:16,686 INFO [train.py:969] (2/8) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'feature_dim': 80, 'subsampling_factor': 4, 'env_info': {'k2-version': '1.17', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7dcabf85e8bf06984c4abab0400ef1322b5ff3df', 'k2-git-date': 'Tue Aug 2 21:22:39 2022', 'lhotse-version': '1.5.0.dev+git.08a613a.clean', 'torch-version': '1.12.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'f24b76e-dirty', 'icefall-git-date': 'Sat Aug 6 18:33:43 2022', 'icefall-path': '/home/storage04/zhuangweiji/workspace/kaldi2/icefall', 'k2-path': '/home/storage04/zhuangweiji/tools/anaconda3/envs/k2-py39-cuda10.2-torch1.12/lib/python3.9/site-packages/k2-1.17.dev20220803+cuda10.2.torch1.12.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/home/storage04/zhuangweiji/workspace/kaldi2/lhotse/lhotse/__init__.py', 'hostname': 'tj1-asr-train-v100-01.kscn', 'IP address': '10.38.10.45'}, 'world_size': 8, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'start_epoch': 1, 'start_batch': 24000, 'exp_dir': PosixPath('pruned_transducer_stateless5/exp_L_streaming'), 'lang_dir': PosixPath('data/lang_char'), 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 2000, 'keep_last_k': 30, 'average_period': 1000, 'use_fp16': True, 'valid_interval': 3000, 'model_warm_step': 3000, 'num_encoder_layers': 24, 'dim_feedforward': 1536, 'nhead': 8, 'encoder_dim': 384, 'decoder_dim': 512, 'joiner_dim': 512, 'dynamic_chunk_training': True, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 140, 
'bucketing_sampler': True, 'num_buckets': 300, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2022-08-08 16:47:16,686 INFO [train.py:971] (2/8) About to create model
2022-08-08 16:47:16,697 INFO [train.py:969] (6/8) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'feature_dim': 80, 'subsampling_factor': 4, 'env_info': {'k2-version': '1.17', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7dcabf85e8bf06984c4abab0400ef1322b5ff3df', 'k2-git-date': 'Tue Aug 2 21:22:39 2022', 'lhotse-version': '1.5.0.dev+git.08a613a.clean', 'torch-version': '1.12.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'f24b76e-dirty', 'icefall-git-date': 'Sat Aug 6 18:33:43 2022', 'icefall-path': '/home/storage04/zhuangweiji/workspace/kaldi2/icefall', 'k2-path': '/home/storage04/zhuangweiji/tools/anaconda3/envs/k2-py39-cuda10.2-torch1.12/lib/python3.9/site-packages/k2-1.17.dev20220803+cuda10.2.torch1.12.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/home/storage04/zhuangweiji/workspace/kaldi2/lhotse/lhotse/__init__.py', 'hostname': 'tj1-asr-train-v100-01.kscn', 'IP address': '10.38.10.45'}, 'world_size': 8, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'start_epoch': 1, 'start_batch': 24000, 'exp_dir': PosixPath('pruned_transducer_stateless5/exp_L_streaming'), 'lang_dir': PosixPath('data/lang_char'), 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 2000, 'keep_last_k': 30, 'average_period': 1000, 'use_fp16': True, 'valid_interval': 3000, 'model_warm_step': 3000, 'num_encoder_layers': 24, 'dim_feedforward': 1536, 'nhead': 8, 'encoder_dim': 384, 'decoder_dim': 512, 'joiner_dim': 512, 'dynamic_chunk_training': True, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 140, 
'bucketing_sampler': True, 'num_buckets': 300, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2022-08-08 16:47:16,698 INFO [train.py:971] (6/8) About to create model
2022-08-08 16:47:16,771 INFO [train.py:969] (4/8) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'feature_dim': 80, 'subsampling_factor': 4, 'env_info': {'k2-version': '1.17', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7dcabf85e8bf06984c4abab0400ef1322b5ff3df', 'k2-git-date': 'Tue Aug 2 21:22:39 2022', 'lhotse-version': '1.5.0.dev+git.08a613a.clean', 'torch-version': '1.12.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'f24b76e-dirty', 'icefall-git-date': 'Sat Aug 6 18:33:43 2022', 'icefall-path': '/home/storage04/zhuangweiji/workspace/kaldi2/icefall', 'k2-path': '/home/storage04/zhuangweiji/tools/anaconda3/envs/k2-py39-cuda10.2-torch1.12/lib/python3.9/site-packages/k2-1.17.dev20220803+cuda10.2.torch1.12.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/home/storage04/zhuangweiji/workspace/kaldi2/lhotse/lhotse/__init__.py', 'hostname': 'tj1-asr-train-v100-01.kscn', 'IP address': '10.38.10.45'}, 'world_size': 8, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'start_epoch': 1, 'start_batch': 24000, 'exp_dir': PosixPath('pruned_transducer_stateless5/exp_L_streaming'), 'lang_dir': PosixPath('data/lang_char'), 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 2000, 'keep_last_k': 30, 'average_period': 1000, 'use_fp16': True, 'valid_interval': 3000, 'model_warm_step': 3000, 'num_encoder_layers': 24, 'dim_feedforward': 1536, 'nhead': 8, 'encoder_dim': 384, 'decoder_dim': 512, 'joiner_dim': 512, 'dynamic_chunk_training': True, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 140, 
'bucketing_sampler': True, 'num_buckets': 300, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2022-08-08 16:47:16,772 INFO [train.py:971] (4/8) About to create model
2022-08-08 16:47:16,775 INFO [train.py:969] (3/8) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'feature_dim': 80, 'subsampling_factor': 4, 'env_info': {'k2-version': '1.17', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7dcabf85e8bf06984c4abab0400ef1322b5ff3df', 'k2-git-date': 'Tue Aug 2 21:22:39 2022', 'lhotse-version': '1.5.0.dev+git.08a613a.clean', 'torch-version': '1.12.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'f24b76e-dirty', 'icefall-git-date': 'Sat Aug 6 18:33:43 2022', 'icefall-path': '/home/storage04/zhuangweiji/workspace/kaldi2/icefall', 'k2-path': '/home/storage04/zhuangweiji/tools/anaconda3/envs/k2-py39-cuda10.2-torch1.12/lib/python3.9/site-packages/k2-1.17.dev20220803+cuda10.2.torch1.12.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/home/storage04/zhuangweiji/workspace/kaldi2/lhotse/lhotse/__init__.py', 'hostname': 'tj1-asr-train-v100-01.kscn', 'IP address': '10.38.10.45'}, 'world_size': 8, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'start_epoch': 1, 'start_batch': 24000, 'exp_dir': PosixPath('pruned_transducer_stateless5/exp_L_streaming'), 'lang_dir': PosixPath('data/lang_char'), 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 2000, 'keep_last_k': 30, 'average_period': 1000, 'use_fp16': True, 'valid_interval': 3000, 'model_warm_step': 3000, 'num_encoder_layers': 24, 'dim_feedforward': 1536, 'nhead': 8, 'encoder_dim': 384, 'decoder_dim': 512, 'joiner_dim': 512, 'dynamic_chunk_training': True, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 140, 
'bucketing_sampler': True, 'num_buckets': 300, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2022-08-08 16:47:16,775 INFO [train.py:971] (3/8) About to create model
2022-08-08 16:47:16,775 INFO [train.py:969] (1/8) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'feature_dim': 80, 'subsampling_factor': 4, 'env_info': {'k2-version': '1.17', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7dcabf85e8bf06984c4abab0400ef1322b5ff3df', 'k2-git-date': 'Tue Aug 2 21:22:39 2022', 'lhotse-version': '1.5.0.dev+git.08a613a.clean', 'torch-version': '1.12.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'f24b76e-dirty', 'icefall-git-date': 'Sat Aug 6 18:33:43 2022', 'icefall-path': '/home/storage04/zhuangweiji/workspace/kaldi2/icefall', 'k2-path': '/home/storage04/zhuangweiji/tools/anaconda3/envs/k2-py39-cuda10.2-torch1.12/lib/python3.9/site-packages/k2-1.17.dev20220803+cuda10.2.torch1.12.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/home/storage04/zhuangweiji/workspace/kaldi2/lhotse/lhotse/__init__.py', 'hostname': 'tj1-asr-train-v100-01.kscn', 'IP address': '10.38.10.45'}, 'world_size': 8, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'start_epoch': 1, 'start_batch': 24000, 'exp_dir': PosixPath('pruned_transducer_stateless5/exp_L_streaming'), 'lang_dir': PosixPath('data/lang_char'), 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 2000, 'keep_last_k': 30, 'average_period': 1000, 'use_fp16': True, 'valid_interval': 3000, 'model_warm_step': 3000, 'num_encoder_layers': 24, 'dim_feedforward': 1536, 'nhead': 8, 'encoder_dim': 384, 'decoder_dim': 512, 'joiner_dim': 512, 'dynamic_chunk_training': True, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 140, 
'bucketing_sampler': True, 'num_buckets': 300, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2022-08-08 16:47:16,776 INFO [train.py:971] (1/8) About to create model
2022-08-08 16:47:16,776 INFO [train.py:969] (0/8) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'feature_dim': 80, 'subsampling_factor': 4, 'env_info': {'k2-version': '1.17', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7dcabf85e8bf06984c4abab0400ef1322b5ff3df', 'k2-git-date': 'Tue Aug 2 21:22:39 2022', 'lhotse-version': '1.5.0.dev+git.08a613a.clean', 'torch-version': '1.12.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'f24b76e-dirty', 'icefall-git-date': 'Sat Aug 6 18:33:43 2022', 'icefall-path': '/home/storage04/zhuangweiji/workspace/kaldi2/icefall', 'k2-path': '/home/storage04/zhuangweiji/tools/anaconda3/envs/k2-py39-cuda10.2-torch1.12/lib/python3.9/site-packages/k2-1.17.dev20220803+cuda10.2.torch1.12.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/home/storage04/zhuangweiji/workspace/kaldi2/lhotse/lhotse/__init__.py', 'hostname': 'tj1-asr-train-v100-01.kscn', 'IP address': '10.38.10.45'}, 'world_size': 8, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'start_epoch': 1, 'start_batch': 24000, 'exp_dir': PosixPath('pruned_transducer_stateless5/exp_L_streaming'), 'lang_dir': PosixPath('data/lang_char'), 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 2000, 'keep_last_k': 30, 'average_period': 1000, 'use_fp16': True, 'valid_interval': 3000, 'model_warm_step': 3000, 'num_encoder_layers': 24, 'dim_feedforward': 1536, 'nhead': 8, 'encoder_dim': 384, 'decoder_dim': 512, 'joiner_dim': 512, 'dynamic_chunk_training': True, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 140, 
'bucketing_sampler': True, 'num_buckets': 300, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2022-08-08 16:47:16,776 INFO [train.py:971] (0/8) About to create model
2022-08-08 16:47:16,787 INFO [train.py:969] (7/8) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'feature_dim': 80, 'subsampling_factor': 4, 'env_info': {'k2-version': '1.17', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7dcabf85e8bf06984c4abab0400ef1322b5ff3df', 'k2-git-date': 'Tue Aug 2 21:22:39 2022', 'lhotse-version': '1.5.0.dev+git.08a613a.clean', 'torch-version': '1.12.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'f24b76e-dirty', 'icefall-git-date': 'Sat Aug 6 18:33:43 2022', 'icefall-path': '/home/storage04/zhuangweiji/workspace/kaldi2/icefall', 'k2-path': '/home/storage04/zhuangweiji/tools/anaconda3/envs/k2-py39-cuda10.2-torch1.12/lib/python3.9/site-packages/k2-1.17.dev20220803+cuda10.2.torch1.12.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/home/storage04/zhuangweiji/workspace/kaldi2/lhotse/lhotse/__init__.py', 'hostname': 'tj1-asr-train-v100-01.kscn', 'IP address': '10.38.10.45'}, 'world_size': 8, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'start_epoch': 1, 'start_batch': 24000, 'exp_dir': PosixPath('pruned_transducer_stateless5/exp_L_streaming'), 'lang_dir': PosixPath('data/lang_char'), 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 2000, 'keep_last_k': 30, 'average_period': 1000, 'use_fp16': True, 'valid_interval': 3000, 'model_warm_step': 3000, 'num_encoder_layers': 24, 'dim_feedforward': 1536, 'nhead': 8, 'encoder_dim': 384, 'decoder_dim': 512, 'joiner_dim': 512, 'dynamic_chunk_training': True, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 140, 
'bucketing_sampler': True, 'num_buckets': 300, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2022-08-08 16:47:16,788 INFO [train.py:971] (7/8) About to create model
2022-08-08 16:47:16,800 INFO [train.py:969] (5/8) {'best_train_loss': inf, 'best_valid_loss': inf, 'best_train_epoch': -1, 'best_valid_epoch': -1, 'batch_idx_train': 0, 'log_interval': 50, 'reset_interval': 200, 'feature_dim': 80, 'subsampling_factor': 4, 'env_info': {'k2-version': '1.17', 'k2-build-type': 'Release', 'k2-with-cuda': True, 'k2-git-sha1': '7dcabf85e8bf06984c4abab0400ef1322b5ff3df', 'k2-git-date': 'Tue Aug 2 21:22:39 2022', 'lhotse-version': '1.5.0.dev+git.08a613a.clean', 'torch-version': '1.12.0', 'torch-cuda-available': True, 'torch-cuda-version': '10.2', 'python-version': '3.9', 'icefall-git-branch': 'master', 'icefall-git-sha1': 'f24b76e-dirty', 'icefall-git-date': 'Sat Aug 6 18:33:43 2022', 'icefall-path': '/home/storage04/zhuangweiji/workspace/kaldi2/icefall', 'k2-path': '/home/storage04/zhuangweiji/tools/anaconda3/envs/k2-py39-cuda10.2-torch1.12/lib/python3.9/site-packages/k2-1.17.dev20220803+cuda10.2.torch1.12.0-py3.9-linux-x86_64.egg/k2/__init__.py', 'lhotse-path': '/home/storage04/zhuangweiji/workspace/kaldi2/lhotse/lhotse/__init__.py', 'hostname': 'tj1-asr-train-v100-01.kscn', 'IP address': '10.38.10.45'}, 'world_size': 8, 'master_port': 12354, 'tensorboard': True, 'num_epochs': 15, 'start_epoch': 1, 'start_batch': 24000, 'exp_dir': PosixPath('pruned_transducer_stateless5/exp_L_streaming'), 'lang_dir': PosixPath('data/lang_char'), 'initial_lr': 0.003, 'lr_batches': 5000, 'lr_epochs': 6, 'context_size': 2, 'prune_range': 5, 'lm_scale': 0.25, 'am_scale': 0.0, 'simple_loss_scale': 0.5, 'seed': 42, 'print_diagnostics': False, 'save_every_n': 2000, 'keep_last_k': 30, 'average_period': 1000, 'use_fp16': True, 'valid_interval': 3000, 'model_warm_step': 3000, 'num_encoder_layers': 24, 'dim_feedforward': 1536, 'nhead': 8, 'encoder_dim': 384, 'decoder_dim': 512, 'joiner_dim': 512, 'dynamic_chunk_training': True, 'causal_convolution': True, 'short_chunk_size': 25, 'num_left_chunks': 4, 'manifest_dir': PosixPath('data/fbank'), 'max_duration': 140, 
'bucketing_sampler': True, 'num_buckets': 300, 'concatenate_cuts': False, 'duration_factor': 1.0, 'gap': 1.0, 'on_the_fly_feats': False, 'shuffle': True, 'return_cuts': True, 'num_workers': 2, 'enable_spec_aug': True, 'spec_aug_time_warp_factor': 80, 'enable_musan': True, 'training_subset': 'L', 'blank_id': 0, 'vocab_size': 5537}
2022-08-08 16:47:16,800 INFO [train.py:971] (5/8) About to create model
2022-08-08 16:47:17,336 INFO [train.py:975] (2/8) Number of model parameters: 97487351
2022-08-08 16:47:17,337 INFO [checkpoint.py:112] (2/8) Loading checkpoint from pruned_transducer_stateless5/exp_L_streaming/checkpoint-24000.pt
2022-08-08 16:47:17,349 INFO [train.py:975] (6/8) Number of model parameters: 97487351
2022-08-08 16:47:17,350 INFO [checkpoint.py:112] (6/8) Loading checkpoint from pruned_transducer_stateless5/exp_L_streaming/checkpoint-24000.pt
2022-08-08 16:47:17,449 INFO [train.py:975] (0/8) Number of model parameters: 97487351
2022-08-08 16:47:17,450 INFO [train.py:975] (3/8) Number of model parameters: 97487351
2022-08-08 16:47:17,450 INFO [checkpoint.py:112] (3/8) Loading checkpoint from pruned_transducer_stateless5/exp_L_streaming/checkpoint-24000.pt
2022-08-08 16:47:17,450 INFO [train.py:975] (1/8) Number of model parameters: 97487351
2022-08-08 16:47:17,450 INFO [checkpoint.py:112] (1/8) Loading checkpoint from pruned_transducer_stateless5/exp_L_streaming/checkpoint-24000.pt
2022-08-08 16:47:17,453 INFO [train.py:975] (4/8) Number of model parameters: 97487351
2022-08-08 16:47:17,454 INFO [checkpoint.py:112] (4/8) Loading checkpoint from pruned_transducer_stateless5/exp_L_streaming/checkpoint-24000.pt
2022-08-08 16:47:17,472 INFO [train.py:975] (7/8) Number of model parameters: 97487351
2022-08-08 16:47:17,473 INFO [checkpoint.py:112] (7/8) Loading checkpoint from pruned_transducer_stateless5/exp_L_streaming/checkpoint-24000.pt
2022-08-08 16:47:17,475 INFO [train.py:975] (5/8) Number of model parameters: 97487351
2022-08-08 16:47:17,476 INFO [checkpoint.py:112] (5/8) Loading checkpoint from pruned_transducer_stateless5/exp_L_streaming/checkpoint-24000.pt
2022-08-08 16:47:17,856 INFO [checkpoint.py:112] (0/8) Loading checkpoint from pruned_transducer_stateless5/exp_L_streaming/checkpoint-24000.pt
2022-08-08 16:47:19,757 INFO [checkpoint.py:131] (0/8) Loading averaged model
2022-08-08 16:47:23,825 INFO [train.py:990] (0/8) Using DDP
2022-08-08 16:47:24,070 INFO [train.py:990] (4/8) Using DDP
2022-08-08 16:47:24,094 INFO [train.py:990] (2/8) Using DDP
2022-08-08 16:47:24,104 INFO [train.py:990] (7/8) Using DDP
2022-08-08 16:47:24,207 INFO [train.py:990] (1/8) Using DDP
2022-08-08 16:47:24,213 INFO [train.py:990] (6/8) Using DDP
2022-08-08 16:47:24,314 INFO [train.py:990] (3/8) Using DDP
2022-08-08 16:47:24,378 INFO [train.py:990] (5/8) Using DDP
2022-08-08 16:47:25,337 INFO [train.py:998] (0/8) Loading optimizer state dict
2022-08-08 16:47:25,343 INFO [train.py:998] (2/8) Loading optimizer state dict
2022-08-08 16:47:25,344 INFO [train.py:998] (1/8) Loading optimizer state dict
2022-08-08 16:47:25,344 INFO [train.py:998] (5/8) Loading optimizer state dict
2022-08-08 16:47:25,344 INFO [train.py:998] (6/8) Loading optimizer state dict
2022-08-08 16:47:25,345 INFO [train.py:998] (3/8) Loading optimizer state dict
2022-08-08 16:47:25,345 INFO [train.py:998] (7/8) Loading optimizer state dict
2022-08-08 16:47:25,345 INFO [train.py:998] (4/8) Loading optimizer state dict
2022-08-08 16:47:26,359 INFO [train.py:1006] (0/8) Loading scheduler state dict
2022-08-08 16:47:26,360 INFO [asr_datamodule.py:415] (0/8) About to get train cuts
2022-08-08 16:47:26,367 INFO [asr_datamodule.py:424] (0/8) About to get dev cuts
2022-08-08 16:47:26,369 INFO [asr_datamodule.py:347] (0/8) About to create dev dataset
2022-08-08 16:47:26,411 INFO [train.py:1006] (6/8) Loading scheduler state dict
2022-08-08 16:47:26,411 INFO [asr_datamodule.py:415] (6/8) About to get train cuts
2022-08-08 16:47:26,414 INFO [asr_datamodule.py:424] (6/8) About to get dev cuts
2022-08-08 16:47:26,415 INFO [asr_datamodule.py:347] (6/8) About to create dev dataset
2022-08-08 16:47:26,494 INFO [train.py:1006] (2/8) Loading scheduler state dict
2022-08-08 16:47:26,495 INFO [asr_datamodule.py:415] (2/8) About to get train cuts
2022-08-08 16:47:26,494 INFO [train.py:1006] (4/8) Loading scheduler state dict
2022-08-08 16:47:26,495 INFO [asr_datamodule.py:415] (4/8) About to get train cuts
2022-08-08 16:47:26,498 INFO [asr_datamodule.py:424] (2/8) About to get dev cuts
2022-08-08 16:47:26,498 INFO [asr_datamodule.py:424] (4/8) About to get dev cuts
2022-08-08 16:47:26,499 INFO [asr_datamodule.py:347] (2/8) About to create dev dataset
2022-08-08 16:47:26,499 INFO [asr_datamodule.py:347] (4/8) About to create dev dataset
2022-08-08 16:47:26,569 INFO [train.py:1006] (7/8) Loading scheduler state dict
2022-08-08 16:47:26,570 INFO [asr_datamodule.py:415] (7/8) About to get train cuts
2022-08-08 16:47:26,574 INFO [asr_datamodule.py:424] (7/8) About to get dev cuts
2022-08-08 16:47:26,575 INFO [asr_datamodule.py:347] (7/8) About to create dev dataset
2022-08-08 16:47:26,593 INFO [train.py:1006] (3/8) Loading scheduler state dict
2022-08-08 16:47:26,593 INFO [asr_datamodule.py:415] (3/8) About to get train cuts
2022-08-08 16:47:26,595 INFO [asr_datamodule.py:424] (3/8) About to get dev cuts
2022-08-08 16:47:26,596 INFO [asr_datamodule.py:347] (3/8) About to create dev dataset
2022-08-08 16:47:27,086 INFO [asr_datamodule.py:368] (0/8) About to create dev dataloader
2022-08-08 16:47:27,087 INFO [asr_datamodule.py:214] (0/8) About to get Musan cuts
2022-08-08 16:47:27,199 INFO [asr_datamodule.py:368] (6/8) About to create dev dataloader
2022-08-08 16:47:27,200 INFO [asr_datamodule.py:214] (6/8) About to get Musan cuts
2022-08-08 16:47:27,213 INFO [asr_datamodule.py:368] (2/8) About to create dev dataloader
2022-08-08 16:47:27,214 INFO [asr_datamodule.py:214] (2/8) About to get Musan cuts
2022-08-08 16:47:27,218 INFO [asr_datamodule.py:368] (4/8) About to create dev dataloader
2022-08-08 16:47:27,220 INFO [asr_datamodule.py:214] (4/8) About to get Musan cuts
2022-08-08 16:47:27,282 INFO [asr_datamodule.py:368] (7/8) About to create dev dataloader
2022-08-08 16:47:27,283 INFO [asr_datamodule.py:214] (7/8) About to get Musan cuts
2022-08-08 16:47:27,324 INFO [asr_datamodule.py:368] (3/8) About to create dev dataloader
2022-08-08 16:47:27,325 INFO [asr_datamodule.py:214] (3/8) About to get Musan cuts
2022-08-08 16:47:27,440 INFO [train.py:1006] (5/8) Loading scheduler state dict
2022-08-08 16:47:27,440 INFO [asr_datamodule.py:415] (5/8) About to get train cuts
2022-08-08 16:47:27,443 INFO [asr_datamodule.py:424] (5/8) About to get dev cuts
2022-08-08 16:47:27,444 INFO [asr_datamodule.py:347] (5/8) About to create dev dataset
2022-08-08 16:47:27,501 INFO [train.py:1006] (1/8) Loading scheduler state dict
2022-08-08 16:47:27,501 INFO [asr_datamodule.py:415] (1/8) About to get train cuts
2022-08-08 16:47:27,505 INFO [asr_datamodule.py:424] (1/8) About to get dev cuts
2022-08-08 16:47:27,506 INFO [asr_datamodule.py:347] (1/8) About to create dev dataset
2022-08-08 16:47:28,157 INFO [asr_datamodule.py:368] (5/8) About to create dev dataloader
2022-08-08 16:47:28,158 INFO [asr_datamodule.py:214] (5/8) About to get Musan cuts
2022-08-08 16:47:28,218 INFO [asr_datamodule.py:368] (1/8) About to create dev dataloader
2022-08-08 16:47:28,219 INFO [asr_datamodule.py:214] (1/8) About to get Musan cuts
2022-08-08 16:47:29,614 INFO [asr_datamodule.py:221] (0/8) Enable MUSAN
2022-08-08 16:47:29,614 INFO [asr_datamodule.py:246] (0/8) Enable SpecAugment
2022-08-08 16:47:29,614 INFO [asr_datamodule.py:247] (0/8) Time warp factor: 80
2022-08-08 16:47:29,615 INFO [asr_datamodule.py:259] (0/8) Num frame mask: 10
2022-08-08 16:47:29,615 INFO [asr_datamodule.py:272] (0/8) About to create train dataset
2022-08-08 16:47:29,615 INFO [asr_datamodule.py:300] (0/8) Using DynamicBucketingSampler.
2022-08-08 16:47:29,736 INFO [asr_datamodule.py:221] (2/8) Enable MUSAN
2022-08-08 16:47:29,737 INFO [asr_datamodule.py:246] (2/8) Enable SpecAugment
2022-08-08 16:47:29,737 INFO [asr_datamodule.py:247] (2/8) Time warp factor: 80
2022-08-08 16:47:29,737 INFO [asr_datamodule.py:259] (2/8) Num frame mask: 10
2022-08-08 16:47:29,737 INFO [asr_datamodule.py:272] (2/8) About to create train dataset
2022-08-08 16:47:29,737 INFO [asr_datamodule.py:300] (2/8) Using DynamicBucketingSampler.
2022-08-08 16:47:29,738 INFO [asr_datamodule.py:221] (6/8) Enable MUSAN
2022-08-08 16:47:29,738 INFO [asr_datamodule.py:246] (6/8) Enable SpecAugment
2022-08-08 16:47:29,738 INFO [asr_datamodule.py:247] (6/8) Time warp factor: 80
2022-08-08 16:47:29,739 INFO [asr_datamodule.py:259] (6/8) Num frame mask: 10
2022-08-08 16:47:29,739 INFO [asr_datamodule.py:272] (6/8) About to create train dataset
2022-08-08 16:47:29,739 INFO [asr_datamodule.py:300] (6/8) Using DynamicBucketingSampler.
2022-08-08 16:47:29,792 INFO [asr_datamodule.py:221] (4/8) Enable MUSAN
2022-08-08 16:47:29,792 INFO [asr_datamodule.py:246] (4/8) Enable SpecAugment
2022-08-08 16:47:29,792 INFO [asr_datamodule.py:247] (4/8) Time warp factor: 80
2022-08-08 16:47:29,792 INFO [asr_datamodule.py:259] (4/8) Num frame mask: 10
2022-08-08 16:47:29,792 INFO [asr_datamodule.py:272] (4/8) About to create train dataset
2022-08-08 16:47:29,792 INFO [asr_datamodule.py:300] (4/8) Using DynamicBucketingSampler.
2022-08-08 16:47:29,854 INFO [asr_datamodule.py:221] (7/8) Enable MUSAN
2022-08-08 16:47:29,855 INFO [asr_datamodule.py:246] (7/8) Enable SpecAugment
2022-08-08 16:47:29,855 INFO [asr_datamodule.py:247] (7/8) Time warp factor: 80
2022-08-08 16:47:29,855 INFO [asr_datamodule.py:259] (7/8) Num frame mask: 10
2022-08-08 16:47:29,855 INFO [asr_datamodule.py:272] (7/8) About to create train dataset
2022-08-08 16:47:29,855 INFO [asr_datamodule.py:300] (7/8) Using DynamicBucketingSampler.
2022-08-08 16:47:29,930 INFO [asr_datamodule.py:221] (3/8) Enable MUSAN
2022-08-08 16:47:29,930 INFO [asr_datamodule.py:246] (3/8) Enable SpecAugment
2022-08-08 16:47:29,930 INFO [asr_datamodule.py:247] (3/8) Time warp factor: 80
2022-08-08 16:47:29,930 INFO [asr_datamodule.py:259] (3/8) Num frame mask: 10
2022-08-08 16:47:29,930 INFO [asr_datamodule.py:272] (3/8) About to create train dataset
2022-08-08 16:47:29,930 INFO [asr_datamodule.py:300] (3/8) Using DynamicBucketingSampler.
2022-08-08 16:47:30,668 INFO [asr_datamodule.py:221] (5/8) Enable MUSAN
2022-08-08 16:47:30,668 INFO [asr_datamodule.py:246] (5/8) Enable SpecAugment
2022-08-08 16:47:30,668 INFO [asr_datamodule.py:247] (5/8) Time warp factor: 80
2022-08-08 16:47:30,669 INFO [asr_datamodule.py:259] (5/8) Num frame mask: 10
2022-08-08 16:47:30,669 INFO [asr_datamodule.py:272] (5/8) About to create train dataset
2022-08-08 16:47:30,669 INFO [asr_datamodule.py:300] (5/8) Using DynamicBucketingSampler.
2022-08-08 16:47:31,057 INFO [asr_datamodule.py:221] (1/8) Enable MUSAN
2022-08-08 16:47:31,057 INFO [asr_datamodule.py:246] (1/8) Enable SpecAugment
2022-08-08 16:47:31,057 INFO [asr_datamodule.py:247] (1/8) Time warp factor: 80
2022-08-08 16:47:31,057 INFO [asr_datamodule.py:259] (1/8) Num frame mask: 10
2022-08-08 16:47:31,058 INFO [asr_datamodule.py:272] (1/8) About to create train dataset
2022-08-08 16:47:31,058 INFO [asr_datamodule.py:300] (1/8) Using DynamicBucketingSampler.
2022-08-08 16:47:33,049 INFO [asr_datamodule.py:316] (0/8) About to create train dataloader
2022-08-08 16:47:33,050 INFO [asr_datamodule.py:333] (0/8) Loading sampler state dict
2022-08-08 16:47:33,268 INFO [asr_datamodule.py:316] (2/8) About to create train dataloader
2022-08-08 16:47:33,269 INFO [asr_datamodule.py:333] (2/8) Loading sampler state dict
2022-08-08 16:47:33,270 INFO [asr_datamodule.py:316] (6/8) About to create train dataloader
2022-08-08 16:47:33,271 INFO [asr_datamodule.py:333] (6/8) Loading sampler state dict
2022-08-08 16:47:33,339 INFO [asr_datamodule.py:316] (4/8) About to create train dataloader
2022-08-08 16:47:33,340 INFO [asr_datamodule.py:333] (4/8) Loading sampler state dict
2022-08-08 16:47:33,390 INFO [asr_datamodule.py:316] (7/8) About to create train dataloader
2022-08-08 16:47:33,392 INFO [asr_datamodule.py:333] (7/8) Loading sampler state dict
2022-08-08 16:47:33,545 INFO [asr_datamodule.py:316] (3/8) About to create train dataloader
2022-08-08 16:47:33,547 INFO [asr_datamodule.py:333] (3/8) Loading sampler state dict
2022-08-08 16:47:34,213 INFO [asr_datamodule.py:316] (5/8) About to create train dataloader
2022-08-08 16:47:34,214 INFO [asr_datamodule.py:333] (5/8) Loading sampler state dict
2022-08-08 16:47:34,730 INFO [asr_datamodule.py:316] (1/8) About to create train dataloader
2022-08-08 16:47:34,731 INFO [asr_datamodule.py:333] (1/8) Loading sampler state dict
Traceback (most recent call last):
  File "/home/storage04/zhuangweiji/workspace/kaldi2/icefall/egs/wenetspeech/ASR/./pruned_transducer_stateless5/train.py", line 1204, in <module>
    main()
  File "/home/storage04/zhuangweiji/workspace/kaldi2/icefall/egs/wenetspeech/ASR/./pruned_transducer_stateless5/train.py", line 1195, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/storage04/zhuangweiji/tools/anaconda3/envs/k2-py39-cuda10.2-torch1.12/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/storage04/zhuangweiji/tools/anaconda3/envs/k2-py39-cuda10.2-torch1.12/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/storage04/zhuangweiji/tools/anaconda3/envs/k2-py39-cuda10.2-torch1.12/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 2 terminated with the following error:
Traceback (most recent call last):
  File "/home/storage04/zhuangweiji/tools/anaconda3/envs/k2-py39-cuda10.2-torch1.12/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/storage04/zhuangweiji/workspace/kaldi2/icefall/egs/wenetspeech/ASR/pruned_transducer_stateless5/train.py", line 1042, in run
    train_dl = wenetspeech.train_dataloaders(
  File "/home/storage04/zhuangweiji/workspace/kaldi2/icefall/egs/wenetspeech/ASR/pruned_transducer_stateless5/asr_datamodule.py", line 334, in train_dataloaders
    train_dl.sampler.load_state_dict(sampler_state_dict)
  File "/home/storage04/zhuangweiji/workspace/kaldi2/lhotse/lhotse/dataset/sampling/dynamic_bucketing.py", line 174, in load_state_dict
    self._fast_forward()
  File "/home/storage04/zhuangweiji/workspace/kaldi2/lhotse/lhotse/dataset/sampling/dynamic_bucketing.py", line 190, in _fast_forward
    next(self)
  File "/home/storage04/zhuangweiji/workspace/kaldi2/lhotse/lhotse/dataset/sampling/base.py", line 261, in __next__
    batch = self._next_batch()
  File "/home/storage04/zhuangweiji/workspace/kaldi2/lhotse/lhotse/dataset/sampling/dynamic_bucketing.py", line 232, in _next_batch
    batch = next(self.cuts_iter)
StopIteration

lhotse version: '1.5.0.dev+git.08a613a.clean'
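For context on the traceback: lhotse restores sampler progress by "fast-forwarding", i.e. `load_state_dict` replays `next(self)` once per batch recorded in the checkpoint. The toy sampler below is a hypothetical sketch (not lhotse's actual implementation) that reproduces the failure mode: if the current cut set yields fewer batches than the saved counter (for example because the dataset changed since the checkpoint was written), the replay loop runs the iterator dry and the `StopIteration` escapes, exactly as in the traceback above.

```python
# Hypothetical minimal sampler mimicking the fast-forward restore
# pattern seen in DynamicBucketingSampler.load_state_dict.

class ToySampler:
    def __init__(self, cuts, batch_size=2):
        self.cuts = cuts
        self.batch_size = batch_size
        self.num_batches_consumed = 0
        self._it = None

    def __iter__(self):
        self._it = iter(self.cuts)
        return self

    def __next__(self):
        # Collect up to batch_size cuts; an empty batch ends iteration.
        batch = []
        for _ in range(self.batch_size):
            try:
                batch.append(next(self._it))
            except StopIteration:
                break
        if not batch:
            raise StopIteration
        self.num_batches_consumed += 1
        return batch

    def load_state_dict(self, state):
        # Fast-forward: replay one next() per previously consumed batch.
        # If the current data produces fewer batches than the checkpoint
        # recorded, StopIteration propagates out of this loop.
        iter(self)
        for _ in range(state["num_batches_consumed"]):
            next(self)


# Consume a 10-cut dataset fully: 5 batches of size 2.
sampler = ToySampler(list(range(10)))
iter(sampler)
while True:
    try:
        next(sampler)
    except StopIteration:
        break
state = {"num_batches_consumed": sampler.num_batches_consumed}

# Restoring that state against a smaller cut set reproduces the crash.
smaller = ToySampler(list(range(4)))
try:
    smaller.load_state_dict(state)
    crashed = False
except StopIteration:
    crashed = True
```

Running this, `crashed` ends up `True`: the smaller dataset only yields two batches, so replaying five recorded batches exhausts the iterator mid-restore.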

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 16 (5 by maintainers)

Most upvoted comments

I think I finally realized what the issue is… please try again with this PR https://github.com/lhotse-speech/lhotse/pull/854

You will need to start a new training for the fix to kick in, as the existing checkpoints are already “corrupted” (unless you manually edit kept_batches/kept_cuts).

Maybe he switched datasets? If that’s the case he might want to comment out the part where it loads the sampler state dict.
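Rather than deleting the call outright, one option is to make the restore best-effort. The helper below is a hypothetical sketch (the function name and the call site in `asr_datamodule.py`'s `train_dataloaders` are assumptions inferred from the traceback, not icefall's actual API): it skips the sampler restore instead of crashing when the saved progress no longer matches the current cut set.

```python
# Hypothetical guard around train_dl.sampler.load_state_dict(...):
# fall back to a fresh sampler when the checkpointed progress
# points past the end of the current dataset.

def maybe_load_sampler_state(sampler, sampler_state_dict):
    """Return True if the state was restored, False if it was skipped."""
    if sampler_state_dict is None:
        return False
    try:
        sampler.load_state_dict(sampler_state_dict)
        return True
    except StopIteration:
        # The saved state records more consumed batches than the
        # current cuts can produce (e.g. the training subset changed);
        # start this epoch's sampling from scratch instead of dying.
        return False
```

Note this silently discards the resume point when the data has changed, which is usually what you want after switching datasets but hides genuine checkpoint corruption, so logging a warning in the `except` branch would be reasonable.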