img2dataset: Downloader is not producing full set of expected outputs
Heya, I was trying to download the LAION400M dataset and noticed that I am not getting the full set of data for some reason.
Any tips on debugging further?
TL;DR - I was expecting ~12M files to be downloaded, but the successes in the *_stats.json files indicate only ~2M files were actually downloaded.
For example - I recently tried to download this dataset in a distributed manner on EMR:
I applied some light NSFW filtering on it to produce a new parquet:
# rest of the script is redacted, but there is some code before this to normalize the NSFW column to make filtering more convenient
sampled_df = df[df["NSFW"] == "unlikely"]
sampled_df.reset_index(inplace=True)
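For completeness, a minimal sketch of how the filtered frame could be written back out before handing it to img2dataset. The paths, the use of pandas, and the lowercasing are assumptions, not the redacted script:

import pandas as pd

# hypothetical input path; the real preprocessing script is redacted above
df = pd.read_parquet("laion400m_raw.parquet")
# the exact normalization is redacted; lowercasing is only an assumption
df["NSFW"] = df["NSFW"].str.lower()
sampled_df = df[df["NSFW"] == "unlikely"]
sampled_df.reset_index(inplace=True, drop=True)
# write the filtered metadata that img2dataset will consume
sampled_df.to_parquet("part00000.parquet", index=False)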
Verified its row count is ~12M samples:
import glob
import json
from pyarrow.parquet import ParquetDataset

files = glob.glob("*.parquet")
d = {}
for file in files:
    d[file] = 0
    dataset = ParquetDataset(file)
    for piece in dataset.pieces:
        d[file] += piece.get_metadata().num_rows
print(json.dumps(d, indent=2, sort_keys=True))
{
"part00000.parquet": 12026281
}
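Side note: newer pyarrow releases removed ParquetDataset.pieces; a rough equivalent that only reads the parquet footers (a sketch, assuming pyarrow.parquet.read_metadata is available) would be:

import glob
import pyarrow.parquet as pq

# sum row counts straight from the file footers, without reading any data
total = sum(pq.read_metadata(f).num_rows for f in glob.glob("*.parquet"))
print(total)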
Ran the download, then copied the stats files out of the output S3 bucket:
aws s3 cp \
    s3://path/to/s3/download/ . \
    --exclude "*" \
    --include "*.json" \
    --recursive
Ran this script to get the total count of images downloaded:
import json
import glob

files = glob.glob("/path/to/json/files/*.json")
count = {}
successes = {}
for file in files:
    with open(file) as f:
        j = json.load(f)
        count[file] = j["count"]
        successes[file] = j["successes"]

rate = 100 * sum(successes.values()) / sum(count.values())
print(f"Success rate: {rate}. From {sum(successes.values())} / {sum(count.values())}")
which gave me the following output:
Success rate: 56.15816066896948. From 1508566 / 2686281
The high error rate here is not of major concern: I was running with a low worker node count for experimentation, so we hit a lot of DNS issues (I'll switch to a Knot resolver later). Here is the breakdown of statuses:
unknown url type: '21nicrmo2' 1.0
<urlopen error [errno 22] invalid argument> 1.0
encoding with 'idna' codec failed (unicodeerror: label empty or too long) 1.0
http/1.1 401.2 unauthorized\r\n 4.0
<urlopen error no host given> 5.0
<urlopen error unknown url type: "https> 11.0
incomplete read 14.0
<urlopen error [errno 101] network is unreachable> 38.0
<urlopen error [errno 104] connection reset by peer> 75.0
[errno 104] connection reset by peer 92.0
opencv 354.0
<urlopen error [errno 113] no route to host> 448.0
remote end closed connection without response 472.0
<urlopen error [errno 111] connection refused> 1144.0
encoding issue 2341.0
timed out 2850.0
<urlopen error timed out> 4394.0
the read operation timed out 4617.0
image decoding error 5563.0
ssl 6174.0
http error 62670.0
<urlopen error [errno -2] name or service not known> 1086446.0
success 1508566.0
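For reference, a sketch of one way to aggregate such a breakdown from the per-shard stats files, assuming each json carries an error-count mapping under a key such as "status_dict" (that key name is an assumption and may differ between img2dataset versions):

from collections import Counter
import glob
import json

totals = Counter()
for path in glob.glob("/path/to/json/files/*_stats.json"):
    with open(path) as f:
        stats = json.load(f)
    # assumed: "status_dict" maps error message -> count for that shard
    totals.update(stats.get("status_dict", {}))

for status, n in sorted(totals.items(), key=lambda kv: kv[1]):
    print(status, n)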
I also noticed there were only 270 json files produced, but given that each shard should contain 10,000 images, I expected ~1,200 json files to be produced. Not sure where this discrepancy is coming from
> ls
00000_stats.json 00051_stats.json 01017_stats.json 01066_stats.json 01112_stats.json 01157_stats.json
00001_stats.json 00052_stats.json 01018_stats.json 01067_stats.json 01113_stats.json 01159_stats.json
...
> ls -l | wc -l
270
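To pin down exactly which shards are missing, a quick sketch (the expected shard count of ~1,203 is approximate, derived from 12,026,281 rows at 10,000 samples per shard; adjust the path):

import glob
import os

present = {
    int(os.path.basename(p).split("_")[0])
    for p in glob.glob("/path/to/json/files/*_stats.json")
}
expected = set(range(1203))  # ~12,026,281 rows / 10,000 per shard
missing = sorted(expected - present)
print(f"{len(missing)} shards missing; first few: {missing[:10]}")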
About this issue
- State: closed
- Created 2 years ago
- Comments: 33 (15 by maintainers)
There are several options to surface them, but I'm not sure I can think of something clean; feel free to try things. With Spark it's fairly usual to look at the executor logs.
Your credentials error is likely the problem.
There are two ways to solve it. One is to find the root cause and fix that credentials problem. The other is to implement the retry I'm mentioning above; assuming this is a temporary problem, the second try should work.
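As a concrete example of the second option, re-running the same job should only redo the missing shards if incremental mode is enabled. A sketch of the Python API call, assuming your img2dataset version exposes retries and incremental_mode (check the signature of your installed version), with column names taken from the LAION-400M metadata:

from img2dataset import download

# Assumption: incremental_mode="incremental" skips shards whose stats file
# already exists, so a re-run only retries the shards that are still missing.
download(
    url_list="s3://path/to/part00000.parquet",
    output_folder="s3://path/to/s3/download/",
    input_format="parquet",
    url_col="URL",
    caption_col="TEXT",
    output_format="webdataset",
    number_sample_per_shard=10000,
    retries=2,                       # per-image retry for transient network failures
    incremental_mode="incremental",  # assumption: supported by your version
)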