pixivpy: Unable to Save All Images (Possibly Concurrency-Related)
So I have about 25,000 likes/bookmarks on Pixiv that I'm trying to download (about 59,000 images). Starting from project initialization, it takes almost 48 hours to download them all.
I have two issues:
1. Images AND illustrations are being downloaded out of order. (The expected order is the order I added them to my bookmarks, newest to oldest, and the page order within each illustration album, p0-p100.)
- Usually it works right (screenshot omitted).
- Sometimes it doesn't work (screenshot omitted).
Sorting by any of the file dates (created, modified) never yields the expected order; my 5th-ever liked illustration shows up as my 2nd most recent download.
2. The bigger issue, some images are straight up not downloading.
My SQLite database gives me the correct count (I know it's correct because it omits invisible works, my profile hasn't liked any ugoira, and my R-18 blocking settings are disabled):
My images directory has a lower number:
Running a comparison program gives me the missing filenames; here are a few:
100259613_p0.jpg doesn't exist in the directory
100728200_p19.jpg doesn't exist in the directory
100728200_p48.jpg doesn't exist in the directory
101820708_p0.jpg doesn't exist in the directory
102787429_p0.jpg doesn't exist in the directory
104981105_p0.jpg doesn't exist in the directory
104981105_p1.jpg doesn't exist in the directory
105850627_p0.jpg doesn't exist in the directory
23580508_p0.jpg doesn't exist in the directory
28564463_p0.jpg doesn't exist in the directory
32759252_p17.jpg doesn't exist in the directory
35812504_p0.jpg doesn't exist in the directory
38319983_p12.jpg doesn't exist in the directory
40090211_p0.jpg doesn't exist in the directory
44923009_p0.png doesn't exist in the directory
45713411_p0.jpg doesn't exist in the directory
All of the missing files are from visible illustrations. Sometimes, within a series, it will "skip" or fail to download one image and then proceed normally afterwards (my directory has 100728200_p0-18 and p20-47, but p19 is missing).
I know you can't see what my program looks like, so your diagnosis is limited (I can share my source code), but I have a feeling it's related to:
```python
with ThreadPoolExecutor(max_workers=5) as executor:
    for index_1, illust in enumerate(illusts_list):
        executor.submit(api.download, url, path=saved_images_path, name=f'{media_file_name}')
```
My questions are the following:
- Is this just how concurrency works? Does it do things a little out of order?
- Is this a compromise I have to accept in exchange for not being rate-limited?
- If I want files to be downloaded in order, do I have to forgo concurrency?
- Why does it not download certain images? There’s no pattern among the ones it missed.
- What can I do?
About this issue
- State: closed
- Created a year ago
- Comments: 20 (3 by maintainers)
The key point of downloading an image is to send the header Referer: https://app-api.pixiv.net/, so it's easy to implement your own download function. This reminds me that I've wanted to reinvent the wheel here before. The resulting code is as follows (you may need to modify some syntax depending on your Python version).
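A minimal sketch of such a function, assuming the requests library is available and that the Referer header is the only extra requirement (the function and parameter names here are illustrative, not part of pixivpy):

```python
import os
import requests

def download_image(url: str, save_dir: str = ".") -> str:
    """Download a single Pixiv image, sending the required Referer header."""
    headers = {"Referer": "https://app-api.pixiv.net/"}
    filename = os.path.join(save_dir, os.path.basename(url))
    response = requests.get(url, headers=headers, stream=True, timeout=30)
    response.raise_for_status()
    with open(filename, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    return filename
```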
Hmm, I don't use VS Code so I'm not sure about this. It looks like it's running in debug mode or something similar. Maybe check your execution config.
A relevant Stack Overflow answer: https://stackoverflow.com/questions/54519728/how-to-prevent-visual-studio-code-python-debugger-from-stopping-on-handled-excep
It’s compatible with concurrency (very generic, actually).
Adding a prefix is quite simple; you just need to use the prefix parameter.
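For example, a sketch that reuses the api and saved_images_path names from the question and assumes a hypothetical image_urls list, using a zero-padded index as the prefix so the on-disk sort order matches the bookmark order:

```python
# Prefix each file with its bookmark index so an alphabetical sort of the
# directory reproduces the order the works were bookmarked in.
for index, url in enumerate(image_urls):
    api.download(url, prefix=f'{index:05d}_', path=saved_images_path)
```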
download() does not provide implementations for retry or file-length checks. You can read the Content-Length header of the response (see "Get file size from 'Content-Length' value" on Stack Overflow) and check whether it equals the size of the downloaded image file. From what we have observed so far there are no rate limits on downloading, and since switching away from concurrency would only slow your downloads down, it doesn't make sense that you'd run into new rate limits.
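A sketch of what that check could look like, assuming the requests library and that the image host answers a HEAD request with the same Referer header; verify_download and download_with_retry are hypothetical helper names, not pixivpy API:

```python
import os
import requests

REFERER = {"Referer": "https://app-api.pixiv.net/"}

def verify_download(url: str, local_path: str) -> bool:
    """Compare the server-reported Content-Length with the size on disk."""
    head = requests.head(url, headers=REFERER, timeout=30)
    expected = int(head.headers.get("Content-Length", -1))
    actual = os.path.getsize(local_path) if os.path.exists(local_path) else -2
    return expected == actual

def download_with_retry(api, url, path, name, retries=3):
    """Re-download until the file exists and its size matches Content-Length."""
    for _ in range(retries):
        api.download(url, path=path, name=name)
        if verify_download(url, os.path.join(path, name)):
            return True
    return False
```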
Both issues can be caused by concurrency.
Exceptions raised by tasks passed to executor.submit() are not raised directly (they only surface when you call result() on the returned Future), so failures can easily be missed. It could also be an intermittent network issue. You can wrap the download function to add more logging for debugging, e.g.:
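A minimal sketch of that idea, reusing api and saved_images_path from the question and assuming a hypothetical url_name_pairs iterable of (url, filename) tuples:

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO)

def logged_download(url, path, name):
    """Wrap api.download so failures are logged instead of silently lost."""
    try:
        api.download(url, path=path, name=name)
        logging.info("downloaded %s", name)
        return True
    except Exception:
        logging.exception("failed to download %s", name)
        return False

futures = {}
with ThreadPoolExecutor(max_workers=5) as executor:
    for url, name in url_name_pairs:  # hypothetical iterable of (url, filename)
        futures[executor.submit(logged_download, url, saved_images_path, name)] = name

# Calling result() is what surfaces anything that went wrong inside submit().
failed = [name for fut, name in futures.items() if not fut.result()]
print("failed downloads:", failed)
```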