DALI: Wrong accuracy with external data in a distributed setting
Hello,
First of all, many thanks for the great library.
I’m using the custom DALI dataloader object below, which in turn leverages an external source, ExternalInputIterator. The latter reads the files {split}_taskset.txt (containing paths to images) and {split}_taskset_labels.txt (containing the corresponding labels), where split is either train or val. I’m running that code in a distributed setting using Horovod.
My issue: when training a ResNet-50 model on the ImageNet dataset (blurred version), I get a top-1 accuracy that is too high (above 80%), which makes me think something is wrong with the data pipeline. Do you see anything fishy in the code below?
I added a “wrapper” to yield the data returned by the DALIGenericIterator.
What I tried: the accuracy achieved with the exact same dataset passed to fn.readers.file (without using an external source) is correct (about 75.3%). So the issue must come from my ExternalInputIterator.
import math

import numpy as np

import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.pytorch import DALIGenericIterator


class ExternalInputIterator:
    def __init__(self, data_dir, split, batch_size, shard_id, num_shards):
        self.batch_size = batch_size
        self.shard_id = shard_id
        self.num_shards = num_shards

        with open(f"{data_dir}/{split}_taskset.txt", 'r') as f:
            self.files = [line.rstrip() for line in f if line.strip()]
        with open(f"{data_dir}/{split}_taskset_labels.txt", 'r') as f:
            self.labels = [line.rstrip() for line in f if line.strip()]

        # Shard the dataset: this rank only keeps the slice [inf, sup).
        self.data_set_len = len(self.files)
        inf = self.data_set_len * shard_id // num_shards
        sup = self.data_set_len * (shard_id + 1) // num_shards
        self.files = np.array(self.files[inf:sup])
        self.labels = np.array(self.labels[inf:sup])

        self.n = len(self.files)
        self.full_iterations = self.n // batch_size
        self.iterations = math.ceil(self.n / batch_size)

    def __iter__(self):
        self.i = 0
        # Shuffle this rank's shard at the beginning of every epoch.
        perm = np.random.permutation(len(self.files))
        self.files = self.files[perm]
        self.labels = self.labels[perm]
        return self

    def __next__(self):
        if self.i >= self.n:
            self.__iter__()
            raise StopIteration

        batch_files = []
        batch_labels = []
        for _ in range(self.batch_size):
            sample_idx = self.i % self.n
            with open(self.files[sample_idx], 'rb') as f:
                batch_files.append(np.frombuffer(f.read(), dtype=np.uint8))
            batch_labels.append(np.int64([self.labels[sample_idx]]))
            self.i += 1
        return batch_files, batch_labels
class DaliDataLoader(object):
    def __init__(self, data_dir, batch_size, num_workers=1,
                 device_id=0, shard_id=0, num_shards=1, precision=32,
                 training=True, **kwargs):
        self.batch_size = batch_size
        # `cuda` is a flag defined elsewhere in the script (True when running on GPU).
        decoder_device, device = ("mixed", "gpu") if cuda else ("cpu", "cpu")
        crop_size = 224
        val_size = 256
        img_type = types.FLOAT16 if precision == 16 else types.FLOAT

        # Ask nvJPEG to preallocate memory for the biggest sample in ImageNet
        # for CPU and GPU to avoid reallocations at runtime.
        device_memory_padding = 211025920 if decoder_device == 'mixed' else 0
        host_memory_padding = 140544512 if decoder_device == 'mixed' else 0
        # Ask the HW NVJPEG decoder to allocate memory ahead of time for the
        # biggest image in the dataset to avoid reallocations at runtime.
        preallocate_width_hint = 5980 if decoder_device == 'mixed' else 0
        preallocate_height_hint = 6430 if decoder_device == 'mixed' else 0

        split = 'train' if training else 'val'
        self.external_data = ExternalInputIterator(
            data_dir, split, batch_size, shard_id, num_shards)

        pipeline = Pipeline(batch_size, num_workers, device_id)
        with pipeline:
            # I get the expected accuracy using the reader directly:
            # inputs, target = fn.readers.file(file_root=data_dir,
            #                                  shard_id=shard_id,
            #                                  num_shards=num_shards,
            #                                  random_shuffle=training,
            #                                  pad_last_batch=True,
            #                                  name="Reader")
            inputs, target = fn.external_source(
                source=self.external_data, num_outputs=2)
            if training:
                images = fn.decoders.image_random_crop(
                    inputs,
                    device=decoder_device,
                    output_type=types.RGB,
                    device_memory_padding=device_memory_padding,
                    host_memory_padding=host_memory_padding,
                    preallocate_width_hint=preallocate_width_hint,
                    preallocate_height_hint=preallocate_height_hint,
                    random_aspect_ratio=[0.8, 1.25],
                    random_area=[0.1, 1.0],
                    num_attempts=100)
                images = fn.resize(images,
                                   device=device,
                                   resize_x=crop_size,
                                   resize_y=crop_size,
                                   interp_type=types.INTERP_TRIANGULAR)
                mirror = fn.random.coin_flip(probability=0.5)
            else:
                images = fn.decoders.image(inputs,
                                           device=decoder_device,
                                           output_type=types.RGB)
                images = fn.resize(images,
                                   device=device,
                                   size=val_size,
                                   mode="not_smaller",
                                   interp_type=types.INTERP_TRIANGULAR)
                mirror = False

            images = fn.crop_mirror_normalize(
                images.gpu() if cuda else images,
                dtype=img_type,
                output_layout="CHW",
                crop=(crop_size, crop_size),
                mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
                mirror=mirror)
            if cuda:
                target = target.gpu()
            pipeline.set_outputs(images, target)

        self.iterator = DALIGenericIterator(
            pipeline,
            ["x", "y"],
            auto_reset=True
        )

    def __len__(self):
        return self.external_data.full_iterations

    def __iter__(self):
        for token in self.iterator:
            x = token[0]['x']
            y = token[0]['y'].squeeze().long()
            yield x, y
This is how I create my data loaders.
train_loader = DaliDataLoader(
    args.train_dir, allreduce_batch_size,
    device_id=hvd.local_rank(), shard_id=hvd.rank(),
    num_shards=hvd.size(), training=True)
val_loader = DaliDataLoader(
    args.val_dir, args.val_batch_size,
    device_id=hvd.local_rank(), shard_id=hvd.rank(),
    num_shards=hvd.size(), training=False)
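For completeness, the loaders are then consumed as plain iterables of (x, y) batches. The actual training loop is not shown here, so the following is only a rough sketch; model, criterion, optimizer and num_epochs are placeholders, and Horovod-specific details are omitted.

import torch

# Rough sketch of how the wrapper is consumed: it already yields (x, y)
# tuples on the right device, so it is used like any PyTorch data loader.
for epoch in range(num_epochs):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    print(f"epoch {epoch}: top-1 accuracy {100.0 * correct / total:.2f}%")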
What am I doing wrong?
Thanks!
Hey @JanuszL,
You’re right. I got confused once again. The ids I attached to the files were not sharded correctly, resulting in a mismatch between the actual samples and their ids. I can confirm that each id (and thus each sample) is seen exactly once.
Your hypothesis is really promising though, and my first test is encouraging. Indeed, without pre-shuffling the dataset, each train/val shard always contains the same contiguous subset of the samples, because the shard slice is taken before any shuffling happens. I’ll run a full experiment to confirm that this is what causes my issue.
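For reference, pre-shuffling before sharding could look roughly like this. This is only a sketch: shard_with_preshuffle and its seed argument are made up for illustration, and the only real requirement is that the seed be identical on every Horovod rank so all ranks agree on one global permutation.

import numpy as np

def shard_with_preshuffle(files, labels, shard_id, num_shards, seed=0):
    # Shuffle the *full* dataset first, using a seed shared by all ranks,
    # so every rank computes the same global permutation.
    files = np.asarray(files)
    labels = np.asarray(labels)
    perm = np.random.default_rng(seed).permutation(len(files))
    files, labels = files[perm], labels[perm]

    # Only then take the contiguous slice for this shard, as in the original code.
    inf = len(files) * shard_id // num_shards
    sup = len(files) * (shard_id + 1) // num_shards
    return files[inf:sup], labels[inf:sup]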
Hi @thomas-bouvier,
I’m happy that the file reader works as expected. In the case of the external source, you need to make sure that in each epoch all samples from the dataset are returned and that there are no repetitions. Printing or dumping the loaded file names to a file would probably be the best way to verify that now.
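One way to implement that check (a sketch only; check_epoch_coverage and the per-rank log files are hypothetical, not part of DALI): have each rank append the file names it serves during one epoch to its own log, for example from ExternalInputIterator.__next__, then compare the concatenation of all per-rank logs against the full file list.

from collections import Counter

def check_epoch_coverage(per_rank_log_paths, full_file_list):
    # Count how often each file name was served across all ranks in one epoch.
    seen = Counter()
    for log_path in per_rank_log_paths:
        with open(log_path) as f:
            seen.update(line.rstrip() for line in f if line.strip())

    expected = Counter(full_file_list)
    missing = expected - seen      # samples that were never returned
    repeated = seen - expected     # extra occurrences (duplicates)
    print(f"missing: {sum(missing.values())}, duplicated: {sum(repeated.values())}")
    return not missing and not repeated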