DiffuSeq: does the multi-GPU training actually duplicate data on each GPU?

Hello.

I find that the DataLoader constructed in diffuseq/text_datasets.py does not use PyTorch's DistributedSampler: https://github.com/Shark-NLP/DiffuSeq/blob/bea43e1fd0a954486bc36ad62f2a71dcb2bd300a/diffuseq/text_datasets.py#L47

As a result, the data is actually duplicated on each GPU, e.g., in func:forward_backward in train_util.py: https://github.com/Shark-NLP/DiffuSeq/blob/bea43e1fd0a954486bc36ad62f2a71dcb2bd300a/train_util.py#L235

In other words, each GPU processes the same data, which makes distributed training pointless.

Is my conjecture correct?

Just FYI: the training script train_run.py in Diffusion-LM's repo uses transformers' run_clm.py, in which a DistributedSampler is used inside the Trainer.

About this issue

  • State: closed
  • Created a year ago
  • Comments: 19 (7 by maintainers)

Most upvoted comments

Hi,

Good question!

We follow the training script script/run_train.py in Diffusion-LM's repo. The train_run.py you mentioned uses run_clm.py to train the classifier, not the LM itself.

It's true that we "split data per GPU" when we do sampling; that's because we only want to iterate over each test case once, in order. However, when training, we set shuffle=True, which means each GPU gets a different batch of data. It functions in the same way as using DistributedSampler.

I understand that you implemented the data loader with the shuffle=True keyword in the load_data_text function in diffuseq/text_datasets.py. However, this works improperly, for the reasons below.

In torch.utils.data.DataLoader, when shuffle=True is set, the DataLoader creates a torch.utils.data.RandomSampler instead of a torch.utils.data.SequentialSampler. In this case, without the generator keyword, RandomSampler creates a new torch.Generator() instance whose seed is drawn from torch._C._default_generator (see the torch source code), and the state of that default generator is determined by the transformers set_seed(args.seed) call in train.py. Since every process calls set_seed with the same value, every process ends up with the same data order, even with shuffle=True.
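As a quick standalone illustration (not DiffuSeq code; the dataset and sizes here are made up), two processes that call the same global seeding routine and then build a shuffle=True DataLoader without a generator end up with exactly the same batch order:

import torch
from torch.utils.data import DataLoader, TensorDataset

def first_batch(global_seed):
    # roughly what transformers' set_seed(args.seed) does to the CPU default generator
    torch.manual_seed(global_seed)
    dataset = TensorDataset(torch.arange(100))
    # shuffle=True -> RandomSampler, which draws its seed from the default generator
    loader = DataLoader(dataset, batch_size=4, shuffle=True)
    return next(iter(loader))[0].tolist()

# "rank 0" and "rank 1" simulated as two runs with the same seed: identical first batches
print(first_batch(102))
print(first_batch(102))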

My PR is intended to fix this issue: by making only the DataLoader generator's seed differ per rank, every other random operation still runs with the same seed in each process.
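The core idea, as a minimal standalone sketch (it assumes the LOCAL_RANK environment variable set by torch.distributed.launch --use_env identifies the process; the actual change is in the PR and in Step 1 below):

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

rank = int(os.environ.get("LOCAL_RANK", "0"))

# dedicated generator for shuffling only; the default generator stays identical across ranks
gen = torch.Generator()
gen.manual_seed(102 + rank)

loader = DataLoader(TensorDataset(torch.arange(100)),
                    batch_size=4, shuffle=True, generator=gen)
print(rank, next(iter(loader))[0].tolist())  # shuffling now differs per rank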

@Dawn-LX By default, torch random operations such as torch.randn use the default random generator torch._C._default_generator (that is the CPU one; per-GPU generators also exist), unless the generator keyword is passed. My intent (for the PR) was to make the DataLoader use another generator instead of the default one. Meanwhile, in the function load_model_emb, when the local rank equals 0 the function calls torch.nn.init.normal_(model.weight), and this is a random operation that uses the default generator. (Model initialization is also a random process, but it is the same in all processes and doesn't create a gap.)

I tested the outputs and confirmed that the current code returns the same data on different processes. Since process 0 consumes the default generator while initializing the random embedding (load_model_emb), the data on process 0 was different; however, the other processes' data were identical in the output.

python -m torch.distributed.launch --nproc_per_node=4 --master_port=12233 --use_env run_train.py --diff_steps 2000 --lr 0.0001 --learning_steps 140000 --save_interval 20000 --seed 102 --noise_schedule sqrt --hidden_dim 128 --bsz 2048 --microbatch 64 --dataset dialogue --data_dir datasets/CommonsenseConversation --vocab bert --seq_len 128 --schedule_sampler lossaware --notes dialogue
  • modified train.py
"""
Train a diffusion model on images.
"""

import argparse
import json, torch, os
import numpy as np
from diffuseq.utils import dist_util, logger
from diffuseq.text_datasets import load_data_text
from diffuseq.step_sample import create_named_schedule_sampler
from basic_utils import (
    load_defaults_config,
    create_model_and_diffusion,
    args_to_dict,
    add_dict_to_argparser,
    load_model_emb,
    load_tokenizer
)
from train_util import TrainLoop
from transformers import set_seed
import wandb

### custom your wandb setting here ###
# os.environ["WANDB_API_KEY"] = ""
os.environ["WANDB_MODE"] = "offline"

def create_argparser():
    defaults = dict()
    defaults.update(load_defaults_config())
    parser = argparse.ArgumentParser()
    add_dict_to_argparser(parser, defaults) # update latest args according to argparse
    return parser

def main():
    args = create_argparser().parse_args()
    set_seed(args.seed) 
    dist_util.setup_dist()
    logger.configure()
    logger.log("### Creating data loader...")

    tokenizer = load_tokenizer(args)
    model_weight, tokenizer = load_model_emb(args, tokenizer)

    data = load_data_text(
        batch_size=args.batch_size,
        seq_len=args.seq_len,
        data_args = args,
        loaded_vocab=tokenizer,
        model_emb=model_weight # use model's weights as init
    )
    # debug: synchronize all ranks, print the first sample each rank receives, then exit
    from torch.distributed import barrier, get_rank
    barrier()
    print(get_rank(), next(data)[1]['input_ids'][0])
    import sys
    sys.exit(0)

if __name__ == "__main__":
    main()
  • Outputs
1 tensor([  101,  1045,  2219,  2019,  5641, 14007, 16419,  1012,   102,   102,
          101,  1018,  7787,  1029,  4638,  1012,  3194,  1999,  2392,  1029,
         4638,  1012,  8134,  1037, 18851,  1012,  2039, 22994,  2063,  1012,
          102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0])
2 tensor([  101,  1045,  2219,  2019,  5641, 14007, 16419,  1012,   102,   102,
          101,  1018,  7787,  1029,  4638,  1012,  3194,  1999,  2392,  1029,
         4638,  1012,  8134,  1037, 18851,  1012,  2039, 22994,  2063,  1012,
          102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0])
0 tensor([  101, 25742,  5092,  2497,  2034,  4836,  2385,  2086,  3283,  1010,
         2021,  4268,  2024,  3652,  2039,  2007,  1996,  9476,  2157,  2085,
         1012,  4268,  3065,  2031,  1037,  3109,  1997,  1037, 11142,  2166,
         1012,   102,   102,   101,  2026,  2048,  2095,  2214,  2365,  7425,
         2015, 25742,  5092,  2497,  1012,   102,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0])
3 tensor([  101,  1045,  2219,  2019,  5641, 14007, 16419,  1012,   102,   102,
          101,  1018,  7787,  1029,  4638,  1012,  3194,  1999,  2392,  1029,
         4638,  1012,  8134,  1037, 18851,  1012,  2039, 22994,  2063,  1012,
          102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0])

@Dawn-LX It doesn't mean that process 0 uses another seed. In the original code all processes use the same seed; however, process 0 performs ONLY 1 MORE RANDOM OPERATION before the RandomSampler is initialized, and that single extra random operation produces a different random output (for the dataloader's seed) even with the same seed.
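Here is a tiny standalone demonstration of that effect (not DiffuSeq code; torch.nn.init.normal_ on a throwaway tensor stands in for the embedding initialization that only rank 0 performs in load_model_emb):

import torch
from torch.utils.data import DataLoader, TensorDataset

def first_batch(extra_random_op):
    torch.manual_seed(102)
    if extra_random_op:
        # one extra draw from the default generator before the sampler is built
        torch.nn.init.normal_(torch.empty(10, 8))
    loader = DataLoader(TensorDataset(torch.arange(100)), batch_size=4, shuffle=True)
    return next(iter(loader))[0].tolist()

print(first_batch(False))  # what ranks 1..N-1 see (all identical to each other)
print(first_batch(True))   # what rank 0 sees (a different order)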


Thank you very much! I get it!

Hi

@kdha0727 You're right. I just tested the case with --nproc_per_node=2. When the number of processes is greater than 2, the duplication problem does exist. In this situation, wouldn't using DistributedSampler be a more convenient implementation than manually setting random seeds? I will fix this soon.

Yes, my method is fine for infinite loops; however, considering more general cases, DistributedSampler would be a more compact solution. Thank you for reviewing!
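For reference, a sketch of what a DistributedSampler-based load_data_text could look like (this is not the fix actually merged, just one possible shape; it assumes the process group has already been initialized by dist_util.setup_dist and reuses the existing get_corpus, TextDataset, and infinite_loader helpers from diffuseq/text_datasets.py):

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def load_data_text_distributed(batch_size, seq_len, deterministic=False, data_args=None,
                               model_emb=None, split='train', loaded_vocab=None, loop=True):
    training_data = get_corpus(data_args, seq_len, split=split, loaded_vocab=loaded_vocab)
    dataset = TextDataset(training_data, data_args, model_emb=model_emb)

    # each rank gets a disjoint ~1/world_size shard of the dataset
    sampler = DistributedSampler(
        dataset,
        num_replicas=dist.get_world_size(),
        rank=dist.get_rank(),
        shuffle=not deterministic,
    )

    data_loader = DataLoader(
        dataset,
        batch_size=batch_size,   # per-GPU batch size
        sampler=sampler,         # sampler and shuffle are mutually exclusive
        num_workers=0,
    )
    if loop:
        return infinite_loader(data_loader)
    return iter(data_loader)

(When looping forever with infinite_loader, one would normally also call sampler.set_epoch(...) each time the loader wraps around so the shuffling changes between passes; that detail is omitted here.)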

If you want to keep using the same base seed on every process, you can consider alternative code like the one below:

Step 1. Change this function to my code below.

https://github.com/Shark-NLP/DiffuSeq/blob/bea43e1fd0a954486bc36ad62f2a71dcb2bd300a/diffuseq/text_datasets.py#L11

def load_data_text(
    batch_size, 
    seq_len, 
    deterministic=False, 
    data_args=None, 
    model_emb=None,
    split='train', 
    loaded_vocab=None,
    loop=True,
    seed=None,  # ADD THIS
):
    training_data = get_corpus(data_args, seq_len, split=split, loaded_vocab=loaded_vocab)

    dataset = TextDataset(
        training_data,
        data_args,
        model_emb=model_emb
    )

    if seed is not None:
        batch_generator = torch.Generator()
        # offset the seed by the process's local rank so each GPU shuffles differently
        # (requires `import os`; LOCAL_RANK is set by torch.distributed.launch --use_env)
        batch_generator.manual_seed(hash(seed) + int(os.environ.get("LOCAL_RANK", "0")))
    else:
        batch_generator = None

    data_loader = DataLoader(
        dataset,
        batch_size=batch_size,  # 20,
        # drop_last=True,
        shuffle=not deterministic,
        num_workers=0,
        generator=batch_generator,  # ADDED
    )
    if loop:
        return infinite_loader(data_loader)
    else:
        # print(data_loader)
        return iter(data_loader)

Step 2. Add seed argument in training script.

https://github.com/Shark-NLP/DiffuSeq/blob/bea43e1fd0a954486bc36ad62f2a71dcb2bd300a/train.py#L44

line 44~63

    data = load_data_text(
        batch_size=args.batch_size,
        seq_len=args.seq_len,
        data_args = args,
        loaded_vocab=tokenizer,
        model_emb=model_weight,  # use model's weights as init
        seed=args.seed
    )
    next(data)

    data_valid = load_data_text(
        batch_size=args.batch_size,
        seq_len=args.seq_len,
        data_args=args,
        split='valid',
        deterministic=True,
        loaded_vocab=tokenizer,
        model_emb=model_weight,  # use the same embedding weight as the training data
        seed=args.seed
    )

Thank you for your reply! But I am still confused.

  1. The script/run_train.py in Diffusion-LM's repo also doesn't use DistributedSampler, which means it also has the problem of data being duplicated on each GPU.
  2. Setting shuffle=True has nothing to do with "each GPU gets a different batch of data".

To verify my point, we can turn infinite_loader off and see how many batch iterations the loader actually runs. Say a single-GPU training script has a dataloader of 800 iterations. Then for 4-GPU training, the dataloader (with DistributedSampler) will run 200 iterations per GPU (for the same batch size). Note that with DistributedSampler & DistributedDataParallel, the batch size of the dataloader is directly the batch size on each GPU.

But with the existing multi-GPU training script, the data is duplicated on each GPU, and it will still run 800 iterations per GPU in 4-GPU training.
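The arithmetic can be checked without launching distributed training at all (a standalone sketch with made-up sizes; DistributedSampler accepts explicit num_replicas/rank, so no process group is needed):

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(51200))            # 51200 samples

single_gpu = DataLoader(dataset, batch_size=64, shuffle=True)
print(len(single_gpu))                                   # 800 iterations on one GPU

sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=True)
per_rank = DataLoader(dataset, batch_size=64, sampler=sampler)
print(len(per_rank))                                     # 200 iterations per GPU with 4 GPUs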
