apex: Data parallel error with O2 and not O1

When using O2, nn.DataParallel does not work:

RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

However, with O1 everything works just fine.

import torch
import torch.nn as nn
import torch.optim as optim
from apex import amp

model = GeneralVae(encoder, decoder, rep_size=500).cuda()
optimizer = optim.Adam(model.parameters(), lr=LR)
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
if data_para and torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)
    model = model.cuda()

loss_picture = customLoss()

val_losses = []
train_losses = []

def train(epoch):
    train_loader_food = generate_data_loader(train_root, get_batch_size(epoch), int(rampDataSize * data_size))
    print("Epoch {}: batch_size {}".format(epoch, get_batch_size(epoch)))
    model.train()
    train_loss = 0
    loss = None
    for batch_idx, (data, _, aff) in enumerate(train_loader_food):
        data = data[0].cuda(0)

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 19
  • Comments: 32 (4 by maintainers)

Most upvoted comments

@mcarilli still seeing this issue. Any idea when the support for O2 + DataParallel will kick in?

Thanks

Historically we only test with DistributedDataParallel because performance tends to be better, but the dataset sharing issue raised by @seongwook-ham in https://github.com/NVIDIA/apex/issues/269 is a compelling use case. @ptrblck and I will look into it. Current to-do list is better fused optimizers, checkpointing, sparse gradients, and then DataParallel, so it may be a couple weeks before I can give it undivided attention.

I’m running into the same error for O0, O2, O3: RuntimeError: Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 1 does not equal 0 (while checking arguments for cudnn_convolution)

O1 is working as expected.

Right now I’m working hard on native Pytorch support for mixed precision, which will accommodate DistributedDataParallel, DataParallel, and model parallel training, targeting the 1.5 release. Apex as a source for mixed precision is not a future-proof path; it’s annoying for people to install something separate. If Apex helps, that’s great, but the sooner we get something that’s packaged and tested as a native component of Pytorch, the better. If Apex does not work for you currently, my best advice is to wait for the upstream support. See https://github.com/NVIDIA/apex/issues/269#issuecomment-566841593.
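For reference, the native support mentioned here eventually shipped as torch.cuda.amp (autocast and GradScaler). A minimal sketch of that path, with MyModel, loader, and criterion as placeholder names:

import torch
import torch.nn as nn

model = nn.DataParallel(MyModel().cuda())        # MyModel is a placeholder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()             # dynamic loss scaling

for data, target in loader:                      # placeholder loader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():              # ops run in float16 where it is safe
        output = model(data.cuda())
        loss = criterion(output, target.cuda())  # placeholder criterion
    scaler.scale(loss).backward()                # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                       # unscales gradients, skips the step on inf/nan
    scaler.update()

Note that on some PyTorch releases the autocast state is thread-local, so with DataParallel the docs recommend applying autocast inside the model’s forward method rather than around the call.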

Same problem with opt_level O0 and nn.DataParallel. I tried all the above suggestions and they did not work.

Same issue with DataParallel and O2; with O1 I get a CUDA out-of-memory error, while the float32 version (without amp) fits.

Workaround:

# Assumes: import torch, import apex.amp; args.devices is the list of GPU ids
model = apex.amp.initialize(torch.nn.Sequential(model), opt_level='O2')[0]
model = torch.nn.DataParallel(model, device_ids=args.devices)

# Re-patch forward on the DataParallel wrapper: cast inputs to the O2 model dtype,
# run the original forward, then cast outputs back (cast_model_outputs if set, else float32).
_opts = apex.amp._amp_state.opt_properties.options
_cast_in = lambda t: t.to(_opts['cast_model_type'])
_cast_out = lambda t: t.to(_opts['cast_model_outputs'] if _opts.get('cast_model_outputs') is not None else torch.float32)

def patched_forward(*args, old_fwd=model.forward, **kwargs):
    args = apex.amp._initialize.applier(args, _cast_in)
    kwargs = apex.amp._initialize.applier(kwargs, _cast_in)
    return apex.amp._initialize.applier(old_fwd(*args, **kwargs), _cast_out)

model.forward = patched_forward

In the case of DataParallel, forward must be patched after the DataParallel(…) call.
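A quick illustration of what the patch does (the tensor shape is arbitrary): callers keep passing float32 tensors, the replicas receive inputs cast to the O2 model dtype, and the gathered output is cast back.

x = torch.randn(8, 3, 224, 224).cuda()   # float32 on the caller's side
y = model(x)                             # patched forward casts inputs to float16 for the replicas
print(y.dtype)                           # torch.float32 (or cast_model_outputs if that option is set)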

In my case, torch.nn.parallel.DistributedDataParallel doesn’t work except with O1.

I find that the old API (FP16_Optimizer) works well with nn.DataParallel. If you need DataParallel, you could use it with FP16_Optimizer and model.half().
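For illustration, a rough sketch of that older path (apex.fp16_utils.FP16_Optimizer; MyModel, loader, and loss_fn are placeholders, and keyword arguments may differ across apex versions):

import torch
import torch.nn as nn
import torch.optim as optim
from apex.fp16_utils import FP16_Optimizer

model = nn.DataParallel(MyModel().cuda().half())                # MyModel is a placeholder
optimizer = optim.Adam(model.parameters(), lr=1e-4)
optimizer = FP16_Optimizer(optimizer, dynamic_loss_scale=True)  # keeps fp32 master weights

for data, target in loader:                                     # placeholder loader
    output = model(data.cuda().half())
    loss = loss_fn(output, target.cuda())                       # placeholder loss
    optimizer.zero_grad()
    optimizer.backward(loss)                                    # replaces loss.backward(); applies loss scaling
    optimizer.step()

apex.fp16_utils also provides network_to_half, which keeps BatchNorm layers in float32 if plain model.half() gives you trouble.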

In general we strongly recommend DistributedDataParallel (either torch.nn.parallel.DistributedDataParallel or apex.parallel.DistributedDataParallel) over DataParallel, because global interpreter lock sharing within a single process is not great for performance. Currently, I don’t test with or claim to support DataParallel. If you are open to trying DistributedDataParallel, I have a simple example showing proper DDP initialization and launch. The Imagenet example also shows DDP use along with distributed data sampling.
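Not the example referred to above, but a minimal sketch of DDP initialization with amp, assuming one process per GPU launched via python -m torch.distributed.launch --nproc_per_node=NUM_GPUS train.py (MyModel is a placeholder):

import argparse
import torch
import torch.distributed as dist
from apex import amp

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)   # supplied by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)                      # one GPU per process
dist.init_process_group(backend='nccl', init_method='env://')

model = MyModel().cuda()                                    # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])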

That being said, I don’t think DataParallel is fundamentally incompatible with Amp control flow. I see one potential problem with your code above: you are calling .cuda on the model after it’s been returned from amp.initialize. You should be doing things in the following order:

model.cuda() # Cuda-ing your model should occur before the call to amp.initialize
model, optimizer = amp.initialize(model, optimizer)
model = nn.DataParallel(model)

Try this and let me know if it works.

The fact that cuda-ing your model should occur before amp.initialize is a general truth, independent of DataParallel or DistributedDataParallel. However, I can’t really set up hard checks for that, because people may legitimately want part of their model to remain on the CPU.
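Applied to the snippet from the issue, the suggested ordering would look roughly like this (GeneralVae, encoder, decoder, LR, and data_para come from the original post):

model = GeneralVae(encoder, decoder, rep_size=500).cuda()   # cuda() before amp.initialize
optimizer = optim.Adam(model.parameters(), lr=LR)
model, optimizer = amp.initialize(model, optimizer, opt_level='O2')
if data_para and torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model)                          # no extra model.cuda() after this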

Same here, at least for O2. O1 does work.

Yes, the new API with O1 is significantly slower than the old API with FP16_Optimizer and half() in the nn.DataParallel case: 1.41 it/s vs 2.4 it/s (FP32 in the same setting is 1.2 it/s). Also, in the apex DistributedDataParallel case, O1 with Adam is significantly slower than O2 with FusedAdam: 2.9 it/s vs 4.58 it/s. This test is based on a modified version of https://github.com/huggingface/pytorch-pretrained-BERT, run on a machine with a 6950X, 3x 2080 Ti, and 1x Titan RTX. When tested on a DGX-1 (8x V100 32GB, NVLink), the results are similar.