pixivpy: Unable to Save All Images (Possibly Concurrency-Related)

So I have about 25,000 likes/bookmarks on Pixiv that I’m trying to download (about 59,000 images). Downloading them all takes almost 48 hours from project initialization.

I have two issues:

1. Images AND illustrations are being downloaded out of order. (The expected order is the one I added them to my bookmarks in, from last to first, and within an illustration album, p0-p100.)

  • Usually it works right: Screenshot_5 - GOOD
  • Sometimes it doesn’t work: Screenshot_5 - BAD

Sorting by any date field (created, modified) never yields the expected order. My 5th-ever liked illustration shows up as my 2nd most recent download.


2. The bigger issue: some images are straight up not downloading. My SQLite database gives me the correct count (I know it’s correct because it omits invisible works, my profile hasn’t liked any ugoira, and my R-18 blocking settings are disabled): Screenshot_3. My images directory has a lower count: Screenshot_4. Running a comparison program gives me the missing filenames; here are a few:

100259613_p0.jpg doesn't exist in the directory
100728200_p19.jpg doesn't exist in the directory
100728200_p48.jpg doesn't exist in the directory
101820708_p0.jpg doesn't exist in the directory 
102787429_p0.jpg doesn't exist in the directory 
104981105_p0.jpg doesn't exist in the directory 
104981105_p1.jpg doesn't exist in the directory 
105850627_p0.jpg doesn't exist in the directory 
23580508_p0.jpg doesn't exist in the directory
28564463_p0.jpg doesn't exist in the directory
32759252_p17.jpg doesn't exist in the directory
35812504_p0.jpg doesn't exist in the directory 
38319983_p12.jpg doesn't exist in the directory
40090211_p0.jpg doesn't exist in the directory
44923009_p0.png doesn't exist in the directory
45713411_p0.jpg doesn't exist in the directory

All of the missing files are from visible illustrations. Sometimes in a series it’ll “skip” or fail to download an image and proceed normally afterwards (100728200_p0-18 and p20-47 are in my directory).
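
For reference, my comparison program boils down to roughly the following (a sketch; the database, table, and column names here are placeholders for my actual schema):

import sqlite3
from pathlib import Path

# Compare the filenames the database says should exist against the files
# actually present on disk. 'pixiv.db', 'illustrations', 'file_name', and
# 'images' are hypothetical names standing in for my real ones.
db = sqlite3.connect('pixiv.db')
expected = {row[0] for row in db.execute('SELECT file_name FROM illustrations')}
on_disk = {p.name for p in Path('images').iterdir()}

for missing in sorted(expected - on_disk):
    print(f"{missing} doesn't exist in the directory")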


I know you guys can’t see what my program looks like, so your diagnosis is limited (I can share my source code). But I have a feeling it’s related to:

with ThreadPoolExecutor(max_workers=5) as executor:
    for index_1, illust in enumerate(illusts_list):
        executor.submit(api.download, url, path=saved_images_path, name=f'{media_file_name}')

My questions are the following:

  • Is this just how concurrency works? Does it do things a little out of order?
    • Is this a compromise I have to accept in exchange for no rate limits?
    • If I want files to be downloaded in order, do I have to forgo concurrency?
  • Why does it not download certain images? There’s no pattern among the ones it missed.
    • What can I do?


Most upvoted comments

The key point of downloading images is to send the header Referer: https://app-api.pixiv.net/. With that, it’s easy to implement your own download function.

This reminds me that I’ve wanted to reinvent this wheel before. The resulting code is as follows (you may need to adjust some syntax for your Python version).

import shutil
from collections.abc import Container
from functools import wraps
from pathlib import Path
from tempfile import NamedTemporaryFile
from typing import Type

import requests


def retry(num: int, retryable: Container[Type[Exception]] | None = None):
    """Makes function retryable.
    
    :param num: Maximum execution number.
    :param retryable: Optional, a collection of retryable exception classes.
    """
    def decorator(func):
        @wraps(func)
        def decorated_func(*args, **kwargs):
            error_count = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as ex:
                    if retryable is not None and ex.__class__ not in retryable:
                        raise
                    error_count += 1
                    if error_count >= num:
                        raise

        return decorated_func

    return decorator


@retry(3)
def download(url: str, path: str | Path, headers: dict[str, str] | None = None, force=False):
    if isinstance(path, str):
        path = Path(path)
    if path.exists() and not force:
        return

    # Stream into a temporary file first, so a failed or partial download
    # never leaves a truncated file at the final path.
    with requests.get(url, headers=headers, timeout=15, stream=True) as response, \
            NamedTemporaryFile() as temp_file:

        response.raise_for_status()

        downloaded_length = 0
        for chunk in response.iter_content(chunk_size=1024):
            if not chunk:
                continue
            temp_file.write(chunk)
            downloaded_length += len(chunk)

        # Compare against Content-Length (when the server provides it) to
        # catch truncated transfers; the retry decorator will try again.
        content_length = response.headers.get('Content-Length')
        if content_length is not None and downloaded_length != int(content_length):
            raise RuntimeError('Incorrect file size!')

        # Only move the file into place once it is complete and verified.
        temp_file.seek(0)
        path.parent.mkdir(exist_ok=True, parents=True)
        with path.open('wb') as output_file:
            shutil.copyfileobj(temp_file, output_file)


def main():
    download(
        'https://i.pximg.net/img-original/img/2023/03/09/04/00/01/106042736_p0.jpg',
        Path('download') / '106042736_p0.jpg',
        headers={'Referer': 'https://app-api.pixiv.net/'},
    )


if __name__ == '__main__':
    main()

Hmm, I don’t use VS Code so I’m not sure about this. It looks like it’s running in debug mode or something similar. Maybe check your launch configuration.

Some stackoverflow answer: https://stackoverflow.com/questions/54519728/how-to-prevent-visual-studio-code-python-debugger-from-stopping-on-handled-excep

It’s compatible with concurrency (very generic, actually).
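
For example, it can be dropped straight into an executor like the one in the question; a minimal sketch (jobs is a hypothetical iterable of (image URL, filename) pairs):

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

with ThreadPoolExecutor(max_workers=5) as executor:
    for url, name in jobs:  # hypothetical: (image URL, filename) pairs
        executor.submit(
            download,  # the download() defined above
            url,
            Path('download') / name,
            headers={'Referer': 'https://app-api.pixiv.net/'},
        )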

Adding a prefix is quite simple; you just need to use the prefix parameter.
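
For example, a sketch based on the loop from the question (inside the same with ThreadPoolExecutor(...) block): a zero-padded bookmark index keeps files sorted in bookmark order on disk, no matter what order the workers finish in.

for index_1, illust in enumerate(illusts_list):
    executor.submit(
        api.download, url,
        path=saved_images_path,
        prefix=f'{index_1:05d}_',  # zero-padded so lexicographic order matches numeric order
        name=f'{media_file_name}',
    )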

download() does not provide retry or file-length checks out of the box.

  1. For retry, you can refer to @Xdynix 's code above, or use a try…except outside download().
  2. For the Content-Length header, see “Get file size from Content-Length value”, and check whether that length equals the downloaded image’s file size. A combined sketch follows this list.

From what we’ve observed so far, there are no rate limits on downloading. And since switching away from concurrency would only slow down your downloading, it doesn’t make sense that you’d run into new rate limits.

Both issues can be caused by concurrency.

  1. The executor doesn’t guarantee the order of task execution. A worker picks up a new task without waiting for prior tasks to complete, and there is no guarantee on task pickup order. That way the workers never get stuck behind one particular slow task.
    • So if the order matters, then concurrency is not suitable.
  2. Errors raised inside the function passed to executor.submit() are not raised directly (they are captured in the returned Future) and can easily be missed. It could be some intermittent network issue. You can wrap the download function to add more logging for debugging (see also the future-collecting sketch after this list). E.g.:
    def download_with_logging(api, *args, **kwargs):
        try:
            api.download(*args, **kwargs)
        except Exception as e:
            print(e)  # Or any logging method you like, like `logging` module.
    # ...
    executor.submit(download_with_logging, api, url, ...)
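
And if you’d rather have failures surface than scroll past in a log, keep the Future objects and call result(), which re-raises any exception from the worker thread. A sketch (jobs is a hypothetical iterable of (URL, filename) pairs):

from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {
        executor.submit(api.download, url, path=saved_images_path, name=name): name
        for url, name in jobs  # hypothetical: (URL, filename) pairs
    }
    for future in as_completed(futures):
        try:
            future.result()  # re-raises any exception raised in the worker
        except Exception as ex:
            print(f'{futures[future]} failed: {ex}')  # collect these for a retry pass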