img2dataset: Downloader is not producing full set of expected outputs
Heya, I was trying to download the LAION400M dataset and noticed that I am not getting the full set of data for some reason.
Any tips on debugging further?
TL;DR - I was expecting ~12M files to be downloaded, but the successes in the *_stats.json files indicate only ~2M files were actually downloaded.
For example - I recently tried to download this dataset in a distributed manner on EMR:
I applied some light NSFW filtering on it to produce a new parquet:
# rest of the script is redacted, but there is some code before this to normalize the NSFW column to make filtering more convenient
sampled_df = df[df["NSFW"] == "unlikely"]
sampled_df.reset_index(inplace=True)
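For completeness, a minimal sketch of how the filtered frame could be written back out before handing it to img2dataset. The paths, the use of pandas, and the lowercasing are assumptions, not the redacted script:

import pandas as pd

# hypothetical input path; the real preprocessing script is redacted above
df = pd.read_parquet("laion400m_raw.parquet")
# the exact normalization is redacted; lowercasing is only an assumption
df["NSFW"] = df["NSFW"].str.lower()
sampled_df = df[df["NSFW"] == "unlikely"]
sampled_df.reset_index(inplace=True, drop=True)
# write the filtered metadata that img2dataset will consume
sampled_df.to_parquet("part00000.parquet", index=False)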
Verified its row count is ~12M samples:
import glob
import json
from pyarrow.parquet import ParquetDataset

files = glob.glob("*.parquet")
d = {}
for file in files:
    d[file] = 0
    dataset = ParquetDataset(file)
    for piece in dataset.pieces:
        d[file] += piece.get_metadata().num_rows
print(json.dumps(d, indent=2, sort_keys=True))
{
"part00000.parquet": 12026281
}
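Side note: newer pyarrow releases removed ParquetDataset.pieces; a rough equivalent that only reads the parquet footers (a sketch, assuming pyarrow.parquet.read_metadata is available) would be:

import glob
import pyarrow.parquet as pq

# sum row counts straight from the file footers, without reading any data
total = sum(pq.read_metadata(f).num_rows for f in glob.glob("*.parquet"))
print(total)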
Ran the download, then copied the stats files out of the output S3 bucket:
aws s3 cp \
    s3://path/to/s3/download/ . \
    --exclude "*" \
    --include "*.json" \
    --recursive
Ran this script to get the total count of images downloaded:
import json
import glob

files = glob.glob("/path/to/json/files/*.json")
count = {}
successes = {}
for file in files:
    with open(file) as f:
        j = json.load(f)
        count[file] = j["count"]
        successes[file] = j["successes"]

rate = 100 * sum(successes.values()) / sum(count.values())
print(f"Success rate: {rate}. From {sum(successes.values())} / {sum(count.values())}")
which gave me the following output:
Success rate: 56.15816066896948. From 1508566 / 2686281
The high error rate here is not of major concern: I was running with a low worker node count for experimentation, so we hit a lot of DNS issues (I'll switch to a Knot resolver later). Here is the breakdown of statuses:
unknown url type: '21nicrmo2' 1.0
<urlopen error [errno 22] invalid argument> 1.0
encoding with 'idna' codec failed (unicodeerror: label empty or too long) 1.0
http/1.1 401.2 unauthorized\r\n 4.0
<urlopen error no host given> 5.0
<urlopen error unknown url type: "https> 11.0
incomplete read 14.0
<urlopen error [errno 101] network is unreachable> 38.0
<urlopen error [errno 104] connection reset by peer> 75.0
[errno 104] connection reset by peer 92.0
opencv 354.0
<urlopen error [errno 113] no route to host> 448.0
remote end closed connection without response 472.0
<urlopen error [errno 111] connection refused> 1144.0
encoding issue 2341.0
timed out 2850.0
<urlopen error timed out> 4394.0
the read operation timed out 4617.0
image decoding error 5563.0
ssl 6174.0
http error 62670.0
<urlopen error [errno -2] name or service not known> 1086446.0
success 1508566.0
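For reference, a sketch of one way to aggregate such a breakdown from the per-shard stats files, assuming each json carries an error-count mapping under a key such as "status_dict" (that key name is an assumption and may differ between img2dataset versions):

from collections import Counter
import glob
import json

totals = Counter()
for path in glob.glob("/path/to/json/files/*_stats.json"):
    with open(path) as f:
        stats = json.load(f)
    # assumed: "status_dict" maps error message -> count for that shard
    totals.update(stats.get("status_dict", {}))

for status, n in sorted(totals.items(), key=lambda kv: kv[1]):
    print(status, n)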
I also noticed there were only 270 json files produced, but given that each shard should contain 10,000 images, I expected ~1,200 json files to be produced. Not sure where this discrepancy is coming from
> ls
00000_stats.json 00051_stats.json 01017_stats.json 01066_stats.json 01112_stats.json 01157_stats.json
00001_stats.json 00052_stats.json 01018_stats.json 01067_stats.json 01113_stats.json 01159_stats.json
...
> ls -l | wc -l
270
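To pin down exactly which shards are missing, a quick sketch (the expected shard count of ~1,203 is approximate, derived from 12,026,281 rows at 10,000 samples per shard; adjust the path):

import glob
import os

present = {
    int(os.path.basename(p).split("_")[0])
    for p in glob.glob("/path/to/json/files/*_stats.json")
}
expected = set(range(1203))  # ~12,026,281 rows / 10,000 per shard
missing = sorted(expected - present)
print(f"{len(missing)} shards missing; first few: {missing[:10]}")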
About this issue
- State: closed
- Created 2 years ago
- Comments: 33 (15 by maintainers)
There are several options to surface them, but I'm not sure I can think of something clean; feel free to try things. With Spark it's fairly usual to look at the executor logs.
Your credentials error is likely the problem.
There are two ways to solve it. One is to find the root cause and fix that credentials problem. The other is to implement the retry I'm mentioning above; assuming this is a temporary problem, the second try should work.
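As a concrete example of the second option, re-running the same job should only redo the missing shards if incremental mode is enabled. A sketch of the Python API call, assuming your img2dataset version exposes retries and incremental_mode (check the signature of your installed version), with column names taken from the LAION-400M metadata:

from img2dataset import download

# Assumption: incremental_mode="incremental" skips shards whose stats file
# already exists, so a re-run only retries the shards that are still missing.
download(
    url_list="s3://path/to/part00000.parquet",
    output_folder="s3://path/to/s3/download/",
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_format="webdataset",
    number_sample_per_shard=10000,
    retries=2,                       # per-image retry for transient network failures
    incremental_mode="incremental",  # assumption: supported by your version
)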