DALI: Wrong accuracy with external data in a distributed setting
Hello,
First of all, many thanks for the great library.
I’m using the custom DALI dataloader object below, which in turn leverages an external source, ExternalInputIterator. The latter reads the files {split}_taskset.txt (containing paths to images) and {split}_taskset_labels.txt (containing the corresponding labels), where split is either train or val. I’m running that code in a distributed setting using Horovod.
My issue: when training a ResNet-50 model on the ImageNet dataset (blurred version), I get a top-1 accuracy that is too high (above 80%), which makes me think something is wrong with the data pipeline. Do you see anything fishy in the code below?
I added a “wrapper” to yield the data returned by the DALIGenericIterator.
What I tried: the accuracy achieved with the exact same dataset passed to fn.readers.file (without using an external source) is correct (about 75.3%). So the issue must come from my ExternalInputIterator.
import math

import numpy as np

import nvidia.dali.fn as fn
import nvidia.dali.types as types
from nvidia.dali.pipeline import Pipeline
from nvidia.dali.plugin.pytorch import DALIGenericIterator


class ExternalInputIterator:
    def __init__(self, data_dir, split, batch_size, shard_id, num_shards):
        self.batch_size = batch_size
        self.shard_id = shard_id
        self.num_shards = num_shards

        with open(f"{data_dir}/{split}_taskset.txt", 'r') as f:
            self.files = [line.rstrip() for line in f if line.strip()]
        with open(f"{data_dir}/{split}_taskset_labels.txt", 'r') as f:
            self.labels = [line.rstrip() for line in f if line.strip()]

        # Shard the dataset: this rank only keeps the slice [inf, sup).
        self.data_set_len = len(self.files)
        inf = self.data_set_len * shard_id // num_shards
        sup = self.data_set_len * (shard_id + 1) // num_shards
        self.files = np.array(self.files[inf:sup])
        self.labels = np.array(self.labels[inf:sup])

        self.n = len(self.files)
        self.full_iterations = self.n // batch_size
        self.iterations = math.ceil(self.n / batch_size)

    def __iter__(self):
        self.i = 0
        # Shuffle this rank's shard at the beginning of every epoch.
        perm = np.random.permutation(len(self.files))
        self.files = self.files[perm]
        self.labels = self.labels[perm]
        return self

    def __next__(self):
        if self.i >= self.n:
            self.__iter__()
            raise StopIteration

        batch_files = []
        batch_labels = []
        for _ in range(self.batch_size):
            sample_idx = self.i % self.n
            with open(self.files[sample_idx], 'rb') as f:
                batch_files.append(np.frombuffer(f.read(), dtype=np.uint8))
            batch_labels.append(np.int64([self.labels[sample_idx]]))
            self.i += 1
        return batch_files, batch_labels
class DaliDataLoader(object):
    def __init__(self, data_dir, batch_size, num_workers=1,
                 device_id=0, shard_id=0, num_shards=1, precision=32,
                 training=True, **kwargs):
        self.batch_size = batch_size
        # `cuda` is a flag defined elsewhere in the script (True when running on GPU).
        decoder_device, device = ("mixed", "gpu") if cuda else ("cpu", "cpu")
        crop_size = 224
        val_size = 256
        img_type = types.FLOAT16 if precision == 16 else types.FLOAT

        # Ask nvJPEG to preallocate memory for the biggest sample in ImageNet
        # for CPU and GPU to avoid reallocations at runtime.
        device_memory_padding = 211025920 if decoder_device == 'mixed' else 0
        host_memory_padding = 140544512 if decoder_device == 'mixed' else 0
        # Ask the HW NVJPEG decoder to allocate memory ahead of time for the
        # biggest image in the dataset to avoid reallocations at runtime.
        preallocate_width_hint = 5980 if decoder_device == 'mixed' else 0
        preallocate_height_hint = 6430 if decoder_device == 'mixed' else 0

        split = 'train' if training else 'val'
        self.external_data = ExternalInputIterator(
            data_dir, split, batch_size, shard_id, num_shards)

        pipeline = Pipeline(batch_size, num_workers, device_id)
        with pipeline:
            # I get the expected accuracy using the reader directly:
            # inputs, target = fn.readers.file(file_root=data_dir,
            #                                  shard_id=shard_id,
            #                                  num_shards=num_shards,
            #                                  random_shuffle=training,
            #                                  pad_last_batch=True,
            #                                  name="Reader")
            inputs, target = fn.external_source(
                source=self.external_data, num_outputs=2)
            if training:
                images = fn.decoders.image_random_crop(
                    inputs,
                    device=decoder_device,
                    output_type=types.RGB,
                    device_memory_padding=device_memory_padding,
                    host_memory_padding=host_memory_padding,
                    preallocate_width_hint=preallocate_width_hint,
                    preallocate_height_hint=preallocate_height_hint,
                    random_aspect_ratio=[0.8, 1.25],
                    random_area=[0.1, 1.0],
                    num_attempts=100)
                images = fn.resize(images,
                                   device=device,
                                   resize_x=crop_size,
                                   resize_y=crop_size,
                                   interp_type=types.INTERP_TRIANGULAR)
                mirror = fn.random.coin_flip(probability=0.5)
            else:
                images = fn.decoders.image(inputs,
                                           device=decoder_device,
                                           output_type=types.RGB)
                images = fn.resize(images,
                                   device=device,
                                   size=val_size,
                                   mode="not_smaller",
                                   interp_type=types.INTERP_TRIANGULAR)
                mirror = False

            images = fn.crop_mirror_normalize(
                images.gpu() if cuda else images,
                dtype=img_type,
                output_layout="CHW",
                crop=(crop_size, crop_size),
                mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                std=[0.229 * 255, 0.224 * 255, 0.225 * 255],
                mirror=mirror)
            if cuda:
                target = target.gpu()
            pipeline.set_outputs(images, target)

        self.iterator = DALIGenericIterator(
            pipeline,
            ["x", "y"],
            auto_reset=True
        )

    def __len__(self):
        return self.external_data.full_iterations

    def __iter__(self):
        for token in self.iterator:
            x = token[0]['x']
            y = token[0]['y'].squeeze().long()
            yield x, y
This is how I create my data loaders.
train_loader = DaliDataLoader(
    args.train_dir, allreduce_batch_size,
    device_id=hvd.local_rank(), shard_id=hvd.rank(),
    num_shards=hvd.size(), training=True)
val_loader = DaliDataLoader(
    args.val_dir, args.val_batch_size,
    device_id=hvd.local_rank(), shard_id=hvd.rank(),
    num_shards=hvd.size(), training=False)
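For completeness, the loaders are then consumed as plain iterables of (x, y) batches. The actual training loop is not shown here, so the following is only a rough sketch; model, criterion, optimizer and num_epochs are placeholders, and Horovod-specific details are omitted.

import torch

# Rough sketch of how the wrapper is consumed: it already yields (x, y)
# tuples on the right device, so it is used like any PyTorch data loader.
for epoch in range(num_epochs):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    print(f"epoch {epoch}: top-1 accuracy {100.0 * correct / total:.2f}%")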
What am I doing wrong?
Thanks!
Hey @JanuszL,
You’re right. I got confused once again. The ids I attached to the files were not sharded correctly, resulting in a mismatch between the actual samples and their ids. I can confirm that each id (and thus each sample) is seen exactly once.
Your hypothesis is really promising though, and my first test is encouraging. Indeed, without pre-shuffling the dataset, each train/val shard always contains the same contiguous subset of the samples, because the shard slice is taken before any shuffling happens. I’ll run a full experiment to confirm that this is what causes my issue.
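For reference, pre-shuffling before sharding could look roughly like this. This is only a sketch: shard_with_preshuffle and its seed argument are made up for illustration, and the only real requirement is that the seed be identical on every Horovod rank so all ranks agree on one global permutation.

import numpy as np

def shard_with_preshuffle(files, labels, shard_id, num_shards, seed=0):
    # Shuffle the *full* dataset first, using a seed shared by all ranks,
    # so every rank computes the same global permutation.
    files = np.asarray(files)
    labels = np.asarray(labels)
    perm = np.random.default_rng(seed).permutation(len(files))
    files, labels = files[perm], labels[perm]

    # Only then take the contiguous slice for this shard, as in the original code.
    inf = len(files) * shard_id // num_shards
    sup = len(files) * (shard_id + 1) // num_shards
    return files[inf:sup], labels[inf:sup]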
Hi @thomas-bouvier,
I’m happy that the file reader works as expected. In the case of the external source, you need to make sure that in each epoch all samples from the dataset are returned and that there are no repetitions. Printing or dumping the loaded file names to a file would probably be the best way to verify that now.
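One way to implement that check (a sketch only; check_epoch_coverage and the per-rank log files are hypothetical, not part of DALI): have each rank append the file names it serves during one epoch to its own log, for example from ExternalInputIterator.__next__, then compare the concatenation of all per-rank logs against the full file list.

from collections import Counter

def check_epoch_coverage(per_rank_log_paths, full_file_list):
    # Count how often each file name was served across all ranks in one epoch.
    seen = Counter()
    for log_path in per_rank_log_paths:
        with open(log_path) as f:
            seen.update(line.rstrip() for line in f if line.strip())

    expected = Counter(full_file_list)
    missing = expected - seen      # samples that were never returned
    repeated = seen - expected     # extra occurrences (duplicates)
    print(f"missing: {sum(missing.values())}, duplicated: {sum(repeated.values())}")
    return not missing and not repeated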