dolly: ConnectionResetError: [Errno 104] Connection reset by peer

I am running the sample training script with:

  • g5.24xlarge
  • cpu offload set in ds_z3_bf16_config.json
  • num_gpus to 4
  • train and eval batch size = 4 (instead of 8)
  • logging_steps=100, eval_steps=1000, save_steps=2000
  • output directories:
Local Output Dir: /dolly/local_training/dolly__2023-04-10T00:45:05
DBFS Output Dir: /dolly/output/dolly__2023-04-10T00:45:05
Tensorboard Display Dir: /dolly/local_training/dolly__2023-04-10T00:45:05/runs

and got the error messages below. It looks like the training itself nearly finished and then crashed at the very end.

Is there any way to avoid this error?

Thanks.
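For reference, here is a rough sketch of the settings above expressed as Hugging Face TrainingArguments; the dolly training script's actual argument names and defaults may differ, the DeepSpeed config path is an assumption, and num_gpus=4 is passed to the deepspeed launcher separately rather than here:

    # Hypothetical sketch only; the real dolly trainer builds its own arguments.
    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="/dolly/local_training/dolly__2023-04-10T00:45:05",
        per_device_train_batch_size=4,   # reduced from 8
        per_device_eval_batch_size=4,    # reduced from 8
        logging_steps=100,
        eval_steps=1000,
        save_steps=2000,
        evaluation_strategy="steps",
        save_strategy="steps",
        bf16=True,
        deepspeed="ds_z3_bf16_config.json",  # ZeRO-3 + CPU optimizer offload (path assumed)
    )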

---------------------------------------------------------------------------
The Python process exited with an unknown exit code.

The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.
---------------------------------------------------------------------------
Last messages on stderr:
Sun Apr  9 11:04:28 2023 Connection to spark from PID  2322
Sun Apr  9 11:04:28 2023 Initialized gateway on port 38899
Sun Apr  9 11:04:28 2023 Connected to spark.
2023-04-09 11:04:32.509624: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-04-09 11:04:47 INFO [training.trainer] Loading tatsu-lab/alpaca dataset
2023-04-09 11:04:49 WARNING [datasets.builder] Using custom data configuration tatsu-lab--alpaca-715f206eec35a791
2023-04-09 11:04:50 INFO [training.trainer] Found 52002 rows
2023-04-09 11:04:56 INFO [training.trainer] Loading tokenizer for EleutherAI/gpt-j-6B
2023-04-09 11:19:38 INFO [root] Exception while sending command.
Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 503, in send_command
    self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 506, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending
2023-04-09 11:20:13 INFO [root] Exception while sending command.
Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 503, in send_command
    self.socket.sendall(command.encode("utf-8"))
ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1038, in send_command
    response = connection.send_command(command)
  File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/clientserver.py", line 506, in send_command
    raise Py4JNetworkError(
py4j.protocol.Py4JNetworkError: Error while sending
---------------------------------------------------------------------------
Last messages on stdout:
ameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   curriculum_enabled_legacy .... False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   curriculum_params_legacy ..... False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   data_efficiency_enabled ...... False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   dataloader_drop_last ......... False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   disable_allgather ............ False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   dump_state ................... False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   dynamic_loss_scale_args ...... None

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_enabled ........... False

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_gas_boundary_resolution  1

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_layer_name ........ bert.encoder.layer

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_layer_num ......... 0

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_max_iter .......... 100

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_stability ......... 1e-06

[2023-04-10 00:02:17,956] [INFO] [config.py:1012:print]   eigenvalue_tol ............... 0.01

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   eigenvalue_verbose ........... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   elasticity_enabled ........... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   flops_profiler_config ........ {

    "enabled": false, 

    "profile_step": 1, 

    "module_depth": -1, 

    "top_modules": 1, 

    "detailed": true, 

    "output_file": null

}

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   fp16_auto_cast ............... None

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   fp16_enabled ................. False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   fp16_master_weights_and_gradients  False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   global_rank .................. 0

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   grad_accum_dtype ............. None

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   gradient_accumulation_steps .. 1

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   gradient_clipping ............ 1.0

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   gradient_predivide_factor .... 1.0

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   initial_dynamic_scale ........ 1

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   load_universal_checkpoint .... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   loss_scale ................... 1.0

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   memory_breakdown ............. False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   monitor_config ............... <deepspeed.monitor.config.DeepSpeedMonitorConfig object at 0x7fe5c7fd2760>

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   nebula_config ................ {

    "enabled": false, 

    "persistent_storage_path": null, 

    "persistent_time_interval": 100, 

    "num_of_version_in_retention": 2, 

    "enable_nebula_load": true, 

    "load_path": null

}

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   optimizer_legacy_fusion ...... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   optimizer_name ............... adamw

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   optimizer_params ............. {'lr': 1e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   pld_enabled .................. False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   pld_params ................... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   prescale_gradients ........... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   scheduler_name ............... WarmupLR

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   scheduler_params ............. {'warmup_min_lr': 0, 'warmup_max_lr': 1e-05, 'warmup_num_steps': 0}

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   sparse_attention ............. None

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   sparse_gradients_enabled ..... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   steps_per_print .............. 2000

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   train_batch_size ............. 16

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   train_micro_batch_size_per_gpu  4

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   use_node_local_storage ....... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   wall_clock_breakdown ......... False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   world_size ................... 4

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   zero_allow_untested_optimizer  False

[2023-04-10 00:02:17,957] [INFO] [config.py:1012:print]   zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=16777216 allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=DeepSpeedZeroOffloadOptimizerConfig(device='cpu', nvme_path=None, buffer_count=4, pin_memory=True, pipeline=False, pipeline_read=False, pipeline_write=False, fast_init=False) sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=15099494 param_persistence_threshold=40960 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False

[2023-04-10 00:02:17,958] [INFO] [config.py:1012:print]   zero_enabled ................. True

[2023-04-10 00:02:17,958] [INFO] [config.py:1012:print]   zero_optimization_stage ...... 3

[2023-04-10 00:02:17,958] [INFO] [config.py:997:print_user_config]   json = {

    "bf16": {

        "enabled": true

    }, 

    "optimizer": {

        "type": "AdamW", 

        "params": {

            "lr": 1e-05, 

            "betas": [0.9, 0.999], 

            "eps": 1e-08, 

            "weight_decay": 0.0

        }

    }, 

    "scheduler": {

        "type": "WarmupLR", 

        "params": {

            "warmup_min_lr": 0, 

            "warmup_max_lr": 1e-05, 

            "warmup_num_steps": 0

        }

    }, 

    "zero_optimization": {

        "stage": 3, 

        "overlap_comm": true, 

        "contiguous_gradients": true, 

        "sub_group_size": 1.000000e+09, 

        "reduce_bucket_size": 1.677722e+07, 

        "stage3_prefetch_bucket_size": 1.509949e+07, 

        "stage3_param_persistence_threshold": 4.096000e+04, 

        "stage3_max_live_parameters": 1.000000e+09, 

        "stage3_max_reuse_distance": 1.000000e+09, 

        "stage3_gather_16bit_weights_on_model_save": true, 

        "offload_optimizer": {

            "device": "cpu", 

            "pin_memory": true

        }

    }, 

    "gradient_accumulation_steps": 1, 

    "gradient_clipping": 1.0, 

    "steps_per_print": 2.000000e+03, 

    "train_batch_size": 16, 

    "train_micro_batch_size_per_gpu": 4, 

    "wall_clock_breakdown": false

}

Using /root/.cache/torch_extensions/py39_cu117 as PyTorch extensions root...

No modifications detected for re-loaded extension module utils, skipping build step...

Loading extension module utils...

Time to load utils op: 0.00030159950256347656 seconds

Attempting to resume from /dolly/local_training/dolly__2023-04-09T11:19:57/checkpoint-2000

[2023-04-10 00:02:17,962] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from /dolly/local_training/dolly__2023-04-09T11:19:57/checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt...

[2023-04-10 00:02:21,172] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loaded checkpoint from /dolly/local_training/dolly__2023-04-09T11:19:57/checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt.

[2023-04-10 00:02:21,173] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from /dolly/local_training/dolly__2023-04-09T11:19:57/checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt...

[2023-04-10 00:02:21,202] [INFO] [torch_checkpoint_engine.py:23:load] [Torch] Loaded checkpoint from /dolly/local_training/dolly__2023-04-09T11:19:57/checkpoint-2000/global_step2000/zero_pp_rank_0_mp_rank_00_model_states.pt.

[2023-04-10 00:02:21,222] [INFO] [torch_checkpoint_engine.py:21:load] [Torch] Loading checkpoint from /dolly/local_training/dolly__2023-04-09T11:19:57/checkpoint-2000/global_step2000/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt...

[2023-04-10 00:11:10,776] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 5962

[2023-04-10 00:11:10,777] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 5963

About this issue

  • State: closed
  • Created a year ago
  • Comments: 16 (1 by maintainers)

Most upvoted comments

Yeah, that actually worked. The errors are not actually from the training process; it's some weird issue in the Databricks notebook that complains there, though it can be ignored. (And it needs to be fixed!)
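For anyone who wants to confirm the run really did finish despite the py4j noise, a minimal sanity check is to list and load the saved model from the DBFS output dir. This is only a sketch: the /dbfs mount prefix and the exact contents of the output directory are assumptions about this setup.

    import os
    from transformers import AutoModelForCausalLM

    # DBFS output dir from the run above; from driver-local Python on Databricks
    # it is typically reachable under the /dbfs mount (assumption).
    output_dir = "/dbfs/dolly/output/dolly__2023-04-10T00:45:05"

    # Expect config.json, tokenizer files, and model weights to be present.
    print(sorted(os.listdir(output_dir)))

    # Optionally load the weights to be sure they are intact
    # (a 6B model needs substantial RAM/GPU memory).
    model = AutoModelForCausalLM.from_pretrained(output_dir)
    print(f"Loaded model with {model.num_parameters():,} parameters")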