DeepSpeed: DeepSpeed is slower than FSDP
Describe the bug I am still familiarizing myself with DeepSpeed, so here is a n00b question. I wrapped my model with DeepSpeed and am seeing good ZeRO-2 performance. However, when I switch to ZeRO-3, the all-gathers do not overlap with compute and are very fragmented, even though the default parameters look reasonable. How can I learn more about why the all-gathers are so fragmented, and how can I make them less granular?
DeepSpeedEngine configuration:
activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
amp_enabled .................. False
amp_params ................... False
autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
bfloat16_enabled ............. True
checkpoint_parallel_write_pipeline False
checkpoint_tag_validation_enabled True
checkpoint_tag_validation_fail False
comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f09382c8a90>
communication_data_type ...... None
compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
curriculum_enabled_legacy .... False
curriculum_params_legacy ..... False
data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
data_efficiency_enabled ...... False
dataloader_drop_last ......... False
disable_allgather ............ False
dump_state ................... False
dynamic_loss_scale_args ...... None
eigenvalue_enabled ........... False
eigenvalue_gas_boundary_resolution 1
eigenvalue_layer_name ........ bert.encoder.layer
eigenvalue_layer_num ......... 0
eigenvalue_max_iter .......... 100
eigenvalue_stability ......... 1e-06
eigenvalue_tol ............... 0.01
eigenvalue_verbose ........... False
elasticity_enabled ........... False
flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
fp16_auto_cast ............... None
fp16_enabled ................. False
fp16_master_weights_and_gradients False
global_rank .................. 0
grad_accum_dtype ............. None
gradient_accumulation_steps .. 4
gradient_clipping ............ 1.0
gradient_predivide_factor .... 1.0
graph_harvesting ............. False
hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
initial_dynamic_scale ........ 1
load_universal_checkpoint .... False
loss_scale ................... 1.0
memory_breakdown ............. False
mics_hierarchial_params_gather False
mics_shard_size .............. -1
monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
optimizer_legacy_fusion ...... False
optimizer_name ............... None
optimizer_params ............. None
pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
pld_enabled .................. False
pld_params ................... False
prescale_gradients ........... False
scheduler_name ............... None
scheduler_params ............. None
seq_parallel_communication_data_type torch.float32
sparse_attention ............. None
sparse_gradients_enabled ..... False
steps_per_print .............. 1
train_batch_size ............. 1024
train_micro_batch_size_per_gpu 8
use_data_before_expert_parallel_ False
use_node_local_storage ....... False
wall_clock_breakdown ......... False
weight_quantization_config ... None
world_size ................... 32
zero_allow_untested_optimizer False
zero_config ..................
stage=3
contiguous_gradients=True
reduce_scatter=True
reduce_bucket_size=500000000
use_multi_rank_bucket_allreduce=True
allgather_partitions=True
allgather_bucket_size=500000000
overlap_comm=True
load_from_fp32_weights=True
elastic_checkpoint=False
offload_param=None
offload_optimizer=None
sub_group_size=1,000,000,000
cpu_offload_param=None
cpu_offload_use_pin_memory=None
cpu_offload=None
prefetch_bucket_size=50,000,000
param_persistence_threshold=100,000
model_persistence_threshold=sys.maxsize
max_live_parameters=1,000,000,000
max_reuse_distance=1,000,000,000
gather_16bit_weights_on_model_save=False
stage3_gather_fp16_weights_on_model_save=False
ignore_unused_parameters=True
legacy_stage1=False
round_robin_gradients=False
zero_hpz_partition_size=1
zero_quantized_weights=False
zero_quantized_nontrainable_weights=False
zero_quantized_gradients=False
mics_shard_size=-1
mics_hierarchical_params_gather=False
memory_efficient_linear=True
pipeline_loading_checkpoint=False
override_module_apply=True
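For the original question about diagnosing the fragmented all-gathers: below is a minimal sketch, assuming DeepSpeed's `comms_logger` config section and `deepspeed.comm.log_summary()` as described in the communication-logging docs (key names and behavior may vary by version), of how one might see each collective call along with its message size and latency:

```python
# Sketch: per-collective visibility into ZeRO-3 all-gathers.
# "comms_logger" and deepspeed.comm.log_summary() come from DeepSpeed's
# communication-logging docs; exact keys/behavior may differ across versions.
import deepspeed
import deepspeed.comm as dist

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3, "overlap_comm": True},
    "wall_clock_breakdown": True,   # per-step forward/backward/step timing
    "comms_logger": {               # log every collective call
        "enabled": True,
        "verbose": True,            # print each op as it is issued
        "prof_all": True,           # profile all collectives, not a subset
        "debug": False,
    },
}

# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config,
#                                                model_parameters=model.parameters())
# ... run a few training steps, then:
# dist.log_summary()  # aggregated table of collectives, message sizes, latencies
```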
About this issue
- State: closed
- Created 5 months ago
- Comments: 16 (5 by maintainers)
Hi @halilakin, really nice chatting with you yesterday. Given that we are now clear about all the collective communication calls and the performance is OK, I will close this issue for now. I have already noted the feature you requested. Thx a ton.
Thanks for the help @GuanhuaWang!
I was able to get ZeRO to 95% of the MFU I got with FSDP for my network, primarily by marking leaf modules and tuning the ZeRO-3 parameters better. The remaining gap appears to come from ZeRO partitioning everything in my model. There are parts of my model that should not be partitioned at all, since they are compute-heavy but memory-light, yet there seems to be no straightforward way to exclude parts of a network from ZeRO partitioning.
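For context, the ZeRO-3 parameters referred to above are presumably the stage-3 bucketing and persistence knobs that also appear in the zero_config dump at the top of the thread. The snippet below is an illustrative sketch; the values are placeholders, not recommendations:

```python
# Illustrative sketch (placeholder values) of the ZeRO-3 knobs that most directly
# control all-gather granularity and parameter reuse; the keys match the fields
# shown in the zero_config dump above.
zero3_tuning = {
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        # Larger prefetch bucket -> fewer, bigger all-gathers launched ahead of use.
        "stage3_prefetch_bucket_size": 500_000_000,
        # Parameters below this size stay materialized on every rank
        # (they are never re-gathered during forward/backward).
        "stage3_param_persistence_threshold": 1_000_000,
        # Upper bounds on how many gathered parameters may stay live, and how far
        # ahead a gathered parameter may be reused before being re-partitioned.
        "stage3_max_live_parameters": 1_000_000_000,
        "stage3_max_reuse_distance": 1_000_000_000,
    }
}
```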
Thanks @tohtana. Let me test the leaf module thoroughly in combination with other flags today and update the thread.
Hi @halilakin, thank you for investigating the issue! I think we can do the same as FSDP in theory using the leaf module feature. You can specify the transformer layer class as a leaf module; the feature has been merged into the master branch. If it doesn't work, we may have an issue with communication synchronization, and our team can also check that part.
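A minimal sketch of how the leaf module feature is typically used, assuming `deepspeed.utils.set_z3_leaf_modules` from the master branch at the time (import path and signature may differ in other releases); `MyTransformerLayer` stands in for the model's real layer class:

```python
import torch.nn as nn
from deepspeed.utils import set_z3_leaf_modules  # available on master / recent releases

# Placeholder layer class standing in for the model's real transformer block.
class MyTransformerLayer(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = nn.Linear(dim, dim)

    def forward(self, x):
        return self.ff(x)

model = nn.Sequential(*(MyTransformerLayer() for _ in range(4)))

# Mark each transformer layer as a ZeRO-3 "leaf": its parameters are then gathered
# as one unit instead of hooking and all-gathering every submodule separately.
set_z3_leaf_modules(model, [MyTransformerLayer])

# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config,
#                                                model_parameters=model.parameters())
```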
Thanks for the quick response @tohtana. I've stopped using the leaf module feature for now until I fully understand all the knobs and finish reading the code. I will update the thread with more information, but it's quite possible that I am not setting all the parameters correctly.