pytorch-lightning: Multiple GPUs per node can fail silently with KubeflowEnvironment
🐛 Bug
If the user submits a DDP job to a Kubeflow environment with multiple GPUs per node, following the multi-GPU docs and passing the right args (num_nodes and devices), one of the following happens:
- WORLD_SIZE and RANK are set to the total number of processes -> the job gets stuck because creates_processes_externally=True doesn't let DDP launch the other processes.
- WORLD_SIZE and RANK are set to the total number of nodes -> the job starts with only local rank 0 of each node participating in distributed training. The major issue here, apart from the idle GPUs, is that DDPStrategy still works as if everything were correct and passes the "right" number of replicas to the distributed sampler:
...
self.cluster_environment.set_global_rank(self.node_rank * self.num_processes + self.local_rank)
self.cluster_environment.set_world_size(self.num_nodes * self.num_processes)
So each local-rank-0 GPU gets a 1/num_processes share of the data, on the assumption that the other (idle) GPUs are processing the rest, while training is actually done only on the subset of the dataset assigned to local rank 0 of each node. The user is unaware of this, since they assume they passed devices/gpus and num_nodes to the Trainer correctly.
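To make the second failure mode concrete, here is a hypothetical run (the replica/GPU counts and Trainer arguments below are illustrative, not taken from the report):

# Hypothetical job: a PyTorchJob with 2 replicas (pods), 4 GPUs each.
from pytorch_lightning import Trainer

trainer = Trainer(accelerator="gpu", devices=4, num_nodes=2, strategy="ddp")

# DDPStrategy computes world_size = num_nodes * num_processes = 2 * 4 = 8 and
# configures the DistributedSampler with 8 replicas. The PytorchJob operator,
# however, exports WORLD_SIZE=2 and RANK=<pod index>, and KubeflowEnvironment
# keeps reading those instead of the values set by the strategy, so only local
# rank 0 of each pod trains: 2 active processes, each consuming one of the 8
# sampler shards, while the remaining 6 shards are silently skipped.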
To Reproduce
N/A (it's how KubeflowEnvironment works)
Expected behavior
I'm not sure if this is the expected behavior. I am using Google Vertex AI, which runs Kubeflow under the hood. When a PyTorch Lightning job is submitted to Vertex, PyTorch Lightning automatically selects KubeflowEnvironment as the cluster environment.
Please let me know if the expectation is to have a separate cluster environment class for something like Vertex AI. I'd be happy to create a PR to add the new env. But the reasons why I decided to report this as a bug are:
- KubeflowEnvironment has two very specific requirements: a. nodes with a single GPU and b. manual creation of the processes. Neither of these requirements is related to or enforced by Kubeflow. They are also not mentioned in the docs, so the user wouldn't know about them until they look at the code.
- The detect method of KubeflowEnvironment can be used for any Kubernetes env, and the rest of its methods basically implement a special case of LightningEnvironment where the user has to run the processes manually (roughly sketched below).
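For context, the detection boils down to checking a handful of environment variables, none of which are Kubeflow-specific; here is a paraphrase (the exact variable set may differ between versions, and the function name is made up):

import os

# Roughly what KubeflowEnvironment.detect() checks: KUBERNETES_PORT is present
# in any pod, and MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK are set by any
# PytorchJob-style launcher, so this matches generic Kubernetes torch jobs too.
def looks_like_kubeflow() -> bool:
    required = {"KUBERNETES_PORT", "MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK"}
    return required.issubset(os.environ)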
cc @awaelchli
About this issue
- State: open
- Created 2 years ago
- Reactions: 2
- Comments: 21 (10 by maintainers)
The PytorchJob operator sets WORLD_SIZE to the total number of replicas by default (here and here), which is different from what torch and Lightning expect. So KubeflowEnvironment should let DDPStrategy set global_rank / world_size and create processes externally if needed. Updating the following methods would be enough to make KubeflowEnvironment a generic env that's compatible with the Trainer args and multi-GPU clusters. That said, this would make it very similar to LightningEnvironment; not sure if that's a problem.
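A rough sketch of what overriding those methods could look like (illustrative only, not the exact code proposed in the thread; the class name is made up, and it assumes the ClusterEnvironment interface with world_size / set_world_size, global_rank / set_global_rank, creates_processes_externally, local_rank, and node_rank):

import os

from pytorch_lightning.plugins.environments import KubeflowEnvironment


class RelaxedKubeflowEnvironment(KubeflowEnvironment):
    """Hypothetical variant: treat PytorchJob's RANK/WORLD_SIZE as node-level
    values and let DDPStrategy set the per-process rank and world size."""

    def __init__(self) -> None:
        super().__init__()
        self._world_size = 1
        self._global_rank = 0

    @property
    def creates_processes_externally(self) -> bool:
        # Only child processes spawned by Lightning have LOCAL_RANK set, so the
        # main process in each pod is allowed to launch one process per device.
        return "LOCAL_RANK" in os.environ

    def world_size(self) -> int:
        return self._world_size

    def set_world_size(self, size: int) -> None:
        # Accept the value computed by DDPStrategy (num_nodes * num_processes)
        # instead of reading PytorchJob's replica count from WORLD_SIZE.
        self._world_size = size

    def global_rank(self) -> int:
        return self._global_rank

    def set_global_rank(self, rank: int) -> None:
        self._global_rank = rank

    def local_rank(self) -> int:
        return int(os.environ.get("LOCAL_RANK", 0))

    def node_rank(self) -> int:
        # PytorchJob's RANK counts replicas (pods), which Lightning calls nodes.
        return int(os.environ["RANK"])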
Thanks for the context @awaelchli, this is very helpful. I'll start the investigation. I apologize, but I'm new to this kind of work, and I appreciate any help around this. I love the sound of the strong community, very cool. We are Google partners, and if we managed to make this work on Vertex AI it would be amazing. So many projects we could use it on!
Was this issue ever resolved? Was the PR created and merged? I think I'm running into the same issue: running Lightning on Vertex AI on 2 nodes with 4 GPUs each, and the training hangs. Thanks
@neggert yes, LOCAL_RANK would be set for subprocesses spun up by PL. And what you said about the PyTorchJob's assumption makes sense; it's just that ideally KubeflowEnvironment and LightningEnvironment should interpret num_nodes the same way.
@awaelchli I'll send a PR with the proposed changes soon. Thank you both!
Hey all, thanks for pointing this out and sorry for the late response. I think there's an assumption built into PyTorchJob that the user will only be running one process per pod. If you want more processes, you can spin up more pods. There's a slight terminology mismatch because Lightning sees each Kubernetes pod as a separate node. So you can use multiple GPUs per node, they just need to be in separate pods. num_nodes is really num_pods in Kubernetes.
I guess this one-process-per-pod thing is just an assumption, not a hard requirement, so there's no reason that Lightning couldn't spin up multiple processes per pod. Sounds like the proposed changes would allow you to do that. The other option would be to assert that devices == num_nodes in Kubeflow.
I've been out of the game for a while, so my memory is a little rusty. Is the idea that LOCAL_RANK would be set for a sub-process spun up by Lightning? PyTorchJob itself will never set LOCAL_RANK. If that's the case, I think the changes proposed by @RamtinRassoli will accomplish what you want.
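For reference, the workaround under the current one-process-per-pod assumption would look roughly like this (the counts below are illustrative):

# Hypothetical setup with KubeflowEnvironment as it is today: request 8
# PytorchJob replicas (pods) with 1 GPU each, so every pod is a "node".
from pytorch_lightning import Trainer

trainer = Trainer(accelerator="gpu", devices=1, num_nodes=8, strategy="ddp")

# The PytorchJob operator exports WORLD_SIZE=8 and RANK=<replica index>, which
# already match the per-process world size and rank that DDP expects, so no
# extra processes need to be launched inside a pod.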