NVFlare: [BUG] FedOpt algorithm not working as expected in cifar10 example

Describe the bug The FedOpt algorithm does not work as expected in the cifar10 example when I change the model from the pre-existing ModerateCNN to another model such as MobileNetV2 or ResNet-18. The accuracy of the global model either does not increase or increases far too slowly with FedOpt, while the other algorithms work just fine even after changing the model.

To Reproduce

  1. Add the new model to `cifar10_nets.py`:

```python
import torch.nn as nn
from torchvision import models


class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        model = models.mobilenet_v2(weights='DEFAULT')
        model.classifier = nn.Sequential(
            nn.Dropout(0.4),
            nn.Linear(1280, 10),
        )
        self.model = model

    def forward(self, x):
        return self.model(x)
```

  2. Import the new model and swap it in for the existing one in `cifar10_learner.py`

  3. Launch the example with `./run_simulator.sh cifar10_fedopt 0.1 8 8`

  4. See the results in TensorBoard with `tensorboard --logdir=/tmp/nvflare/sim_cifar10`, under the section `val_acc_global_model`

Expected behavior Based on the algorithm proposed in Reddi, Sashank, et al. "Adaptive federated optimization." arXiv preprint arXiv:2003.00295 (2020), I expect FedOpt with an SGD server optimizer, lr = 1.0 and no scheduler to obtain the same performance as FedAvg, and to obtain better results when changing the optimizer and adding a scheduler.
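To spell out why: with a server-side SGD optimizer at lr = 1.0 and no momentum, the FedOpt update on the pseudo-gradient collapses to plain averaging. A tiny numerical sketch of that equivalence (the variable names are mine, not from the example):

```python
# Why server-side SGD with lr = 1.0 and no momentum should reduce to FedAvg:
# FedOpt treats (w_global - average of client weights) as a pseudo-gradient
# and takes one optimizer step on the server.
import torch

w_global = torch.tensor([1.0, 2.0, 3.0])           # current global weights
client_weights = [torch.tensor([0.8, 2.2, 2.9]),   # weights returned by clients
                  torch.tensor([1.2, 1.8, 3.3])]

w_avg = torch.stack(client_weights).mean(dim=0)    # FedAvg aggregate
pseudo_grad = w_global - w_avg                     # FedOpt pseudo-gradient
lr = 1.0
w_new = w_global - lr * pseudo_grad                # one server-side SGD step

assert torch.allclose(w_new, w_avg)                # identical to FedAvg when lr == 1
```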

Screenshots [Screenshot from 2023-04-28 10-09-15: Purple = FedAvg, Pink = FedOpt]

Desktop (please complete the following information):

  • OS: Ubuntu 22.04
  • Python Version: 3.10
  • NVFlare Version: 2.3.0

Thanks in advance!


Most upvoted comments

Okay, I was able to reproduce the behavior. It has to do with the batch norm layers of these more complex models. When updating the global model using SGD on the server, the batch norm statistics (running mean and variance) are stored as buffers rather than parameters, so they are not included in self.model.named_parameters() and the optimizer never updates them.
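A quick way to see this (a sketch, not code from the example): the running statistics show up in named_buffers() but not in named_parameters(), so an optimizer built from model.parameters() never sees them.

```python
# Batch norm running statistics are buffers, not parameters, so an optimizer
# constructed from model.parameters() never updates them.
from torchvision import models

model = models.mobilenet_v2()

param_names = {name for name, _ in model.named_parameters()}
buffer_names = {name for name, _ in model.named_buffers()}

print(any("running_mean" in n for n in param_names))   # False
print(any("running_mean" in n for n in buffer_names))  # True
```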

The FedOpt paper also uses group norm instead of batch norm to avoid these kinds of issues: "We train a modified ResNet-18 on both datasets, where the batch normalization layers are replaced by group normalization layers (Wu & He, 2018). We use two groups in each group normalization layer. As shown by Hsieh et al. (2019), group normalization can lead to significant gains in accuracy over batch normalization in federated settings."
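For reference, a sketch (not part of the cifar10 example) of how a group-norm ResNet-18 along those lines could be built with torchvision, using two groups per normalization layer as in the paper:

```python
# ResNet-18 with group norm (2 groups) instead of batch norm, in the spirit of
# the FedOpt paper; this avoids server-side updates of running statistics entirely.
import torch.nn as nn
from torchvision import models


def group_norm(num_channels: int) -> nn.GroupNorm:
    return nn.GroupNorm(num_groups=2, num_channels=num_channels)


# torchvision's ResNet accepts a custom norm_layer factory
model = models.resnet18(num_classes=10, norm_layer=group_norm)
```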

I provided a workaround for this issue that updates the batch norm parameters using FedAvg and applies the FedOpt optimizer only to the trainable parameters of the global model: https://github.com/NVIDIA/NVFlare/pull/1851
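Conceptually, the workaround boils down to something like the sketch below. This is only an illustration of the idea, not the code from the PR; the function name and arguments are invented, and server_opt is assumed to be an optimizer (e.g. SGD) constructed over global_model.parameters().

```python
# Sketch of the workaround idea: FedAvg for the batch-norm buffers, a FedOpt
# (server optimizer) step for the trainable parameters. Not the actual PR code.
import torch


def update_global(global_model, client_state_dicts, server_opt):
    # element-wise average of the clients' state dicts (plain FedAvg)
    avg = {
        key: torch.stack([sd[key].float() for sd in client_state_dicts]).mean(dim=0)
        for key in client_state_dicts[0]
    }

    # 1) copy the averaged buffers (running_mean / running_var / ...) directly,
    #    since the server optimizer never touches them
    with torch.no_grad():
        for name, buf in global_model.named_buffers():
            buf.copy_(avg[name])

    # 2) FedOpt step on the trainable parameters: use (global - average) as the
    #    pseudo-gradient and let the server optimizer take one step
    server_opt.zero_grad()
    for name, param in global_model.named_parameters():
        param.grad = param.detach() - avg[name]
    server_opt.step()
```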

I have not tried many settings, but lr = 1, no momentum and no scheduler, which should be identical to FedAvg, gets stuck around 0.1 accuracy. There are no errors in the logs, and the same script runs fine when using FedAvg/FedProx.

Edit:

What is interesting is that the training loss and the local model accuracies are correct (yellow line is FedAvg, orange is the FedOpt equivalent):

[images: training loss and local-model accuracy curves]

Meanwhile the global model malfunctions, but somehow that is not propagated to the next round’s local models?

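One explanation that would fit the batch-norm theory above: during local training the model is in train mode, so batch norm normalizes with the statistics of the current batch and never reads the stale running buffers; only the eval-mode validation of the global model is affected. A minimal illustration (an assumption about what is going on, not code from the example):

```python
# In train mode BN uses the batch statistics; in eval mode it uses the stored
# running buffers. Stale buffers therefore only hurt eval-mode validation.
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(1)             # running_mean = 0, running_var = 1 at init
x = torch.randn(256, 1) * 3 + 5    # features with mean ~5, std ~3

bn.train()
y_train = bn(x)                    # normalized with batch stats -> mean ~0

bn.eval()
bn.running_mean.zero_()            # pretend the server never updated the buffers
bn.running_var.fill_(1.0)
y_eval = bn(x)                     # normalized with stale buffers -> mean stays ~5

print(y_train.mean().item(), y_eval.mean().item())
```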

Hi @holgerroth, I can confirm there is problematic behaviour when using anything other than the ModerateCNN and SimpleCNN. Global model validation metrics get stuck at 0.1 from the first round of aggregation.

Yes @holgerroth, I tried using momentum with different values and I also tried without it. Even though the results changed and some values gave better results than others, they were still poor: I reached at most about 0.5 accuracy, which is quite low compared with the other algorithms. I also noticed that it works fine with SimpleCNN or other models built from scratch; the problems appear when using pretrained CNNs. Hope this can help!

This is a graph from TensorBoard that also contains the other experiments: Screenshot from 2023-05-01 09-33-08 Pink = SCAFFOLD, Dark Grey = FedProx, Yellow = FedAvg, Purple = FedOpt. All the experiments were run with the same model and configuration: 20 rounds of FL with 4 local epochs for each of the 4 clients involved in each experiment. The FedOpt experiment is worse than the one I posted before due to a different scheduler.

Thanks for the support!