img2dataset: Downloader is not producing full set of expected outputs

Heya, I was trying to download the LAION400M dataset and noticed that I am not getting the full set of data for some reason.

Any tips on debugging further?

TL;DR - I was expecting ~12M images to be downloaded, but the *_stats.json files only account for ~2.7M attempts, of which ~1.5M succeeded

For example - I recently tried to download this dataset in a distributed manner on EMR:

https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/dataset/part-00000-5b54c5d5-bbcf-484d-a2ce-0d6f73df1a36-c000.snappy.parquet

I applied some light NSFW filtering on it to produce a new parquet:

# rest of the script is redacted, but there is some code before this to normalize the NSFW row to make filtering more convenient
sampled_df = df[df["NSFW"] == "unlikely"].reset_index(drop=True)  # drop=True: don't carry the old index along as a column
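The filtered frame then gets written back out for img2dataset to consume; a minimal version of that step, assuming pandas with a parquet engine (pyarrow) installed:

# write the filtered rows to a new parquet; filename matches the count check below
sampled_df.to_parquet("part00000.parquet", index=False)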

Verified its row count is ~12M samples:

import glob
import json
from pyarrow.parquet import ParquetDataset

files = glob.glob("*.parquet")

d = {}

# sum row counts from each file's footer metadata (no data pages are read)
for file in files:
    d[file] = 0
    dataset = ParquetDataset(file)
    for piece in dataset.pieces:
        d[file] += piece.get_metadata().num_rows

print(json.dumps(d, indent=2, sort_keys=True))
{
  "part00000.parquet": 12026281
}
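(Side note: dataset.pieces is deprecated in newer pyarrow; the same footer-only row count can be read per file, e.g.:

from pyarrow.parquet import ParquetFile

# num_rows comes from the parquet footer, so nothing is decoded
print(ParquetFile("part00000.parquet").metadata.num_rows)
)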

Ran the download with img2dataset on EMR. For context, a minimal invocation of this kind looks like the sketch below (flag values are illustrative placeholders, not my exact settings):
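img2dataset \
	--url_list part00000.parquet \
	--input_format parquet \
	--url_col URL \
	--caption_col TEXT \
	--output_format webdataset \
	--output_folder s3://path/to/s3/download/ \
	--distributor pyspark \
	--image_size 256 \
	--number_sample_per_shard 10000

Then scanned over the output s3 bucket for the stats files: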

aws s3 cp \
	s3://path/to/s3/download/ . \
	--exclude "*" \
	--include "*.json" \
	--recursive

Ran this script to total up attempted and successful downloads across the stats files:

import json
import glob

files = glob.glob("/path/to/json/files/*.json")

count = {}      # urls attempted per stats file
successes = {}  # images actually downloaded per stats file

for file in files:
    with open(file) as f:
        j = json.load(f)
        count[file] = j["count"]
        successes[file] = j["successes"]

rate = 100 * sum(successes.values()) / sum(count.values())
print(f"Success rate: {rate}. From {sum(successes.values())} / {sum(count.values())}")

which gave me the following output:

Success rate: 56.15816066896948. From 1508566 / 2686281

The high error rate here is not a major concern; I was running with a low worker-node count for experimentation, so we hit a lot of DNS failures (I'll set up a Knot Resolver later).
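For reference, a breakdown like the one below can be aggregated from the same stats files; a minimal sketch, assuming each *_stats.json carries a status_dict mapping error message to count (recent img2dataset versions write one):

import glob
import json
from collections import Counter

totals = Counter()
for file in glob.glob("/path/to/json/files/*.json"):
    with open(file) as f:
        # status_dict: per-shard success/error counts keyed by message
        totals.update(json.load(f).get("status_dict", {}))

for status, n in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{status:<75} {n}")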

unknown url type: '21nicrmo2'                                                      1.0
<urlopen error [errno 22] invalid argument>                                        1.0
encoding with 'idna' codec failed (unicodeerror: label empty or too long)          1.0
http/1.1 401.2 unauthorized\r\n                                                    4.0
<urlopen error no host given>                                                      5.0
<urlopen error unknown url type: "https>                                          11.0
incomplete read                                                                   14.0
<urlopen error [errno 101] network is unreachable>                                38.0
<urlopen error [errno 104] connection reset by peer>                              75.0
[errno 104] connection reset by peer                                              92.0
opencv                                                                           354.0
<urlopen error [errno 113] no route to host>                                     448.0
remote end closed connection without response                                    472.0
<urlopen error [errno 111] connection refused>                                  1144.0
encoding issue                                                                  2341.0
timed out                                                                       2850.0
<urlopen error timed out>                                                       4394.0
the read operation timed out                                                    4617.0
image decoding error                                                            5563.0
ssl                                                                             6174.0
http error                                                                     62670.0
<urlopen error [errno -2] name or service not known>                         1086446.0
success                                                                      1508566.0

I also noticed that only 270 json files were produced, but since each shard should contain 10,000 images, ~12M rows should yield ~1,200 shards and therefore ~1,200 stats files. Not sure where this discrepancy is coming from.

> ls
00000_stats.json  00051_stats.json  01017_stats.json  01066_stats.json  01112_stats.json  01157_stats.json
00001_stats.json  00052_stats.json  01018_stats.json  01067_stats.json  01113_stats.json  01159_stats.json
...
> ls -l | wc -l 
270
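To narrow down which shards never produced a stats file, a quick set difference works (assuming shard ids are contiguous from 0; expected_shards is just rows / samples per shard):

import glob

expected_shards = 1203  # ceil(12026281 / 10000)
present = {int(name.split("_")[0]) for name in glob.glob("*_stats.json")}
missing = sorted(set(range(expected_shards)) - present)
print(f"{len(missing)} shards missing; first few: {missing[:10]}")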

Most upvoted comments

There are several options to surface them, but I'm not sure I can think of something clean; feel free to try things. With Spark it's kind of usual to look at the executor logs.

Your credentials error is likely the problem.

There are two ways to solve it. One is to find the root cause and fix that credentials problem. The other is to implement the retry I'm mentioning above; assuming this is a temporary problem, the second try should work.
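For illustration, the retry being suggested could look like this generic backoff wrapper (a hypothetical sketch, not img2dataset's actual code; upload_shard is a made-up stand-in for the failing s3 write):

import time

def with_retries(fn, attempts=3, base_delay=1.0):
    # call fn(), retrying with exponential backoff on any exception
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# usage: with_retries(lambda: upload_shard(shard_path))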