instaloader: How to scrape posts for large IG accounts in python and avoid 429 Too Many Requests

I am trying to download the most recent 10 posts from profiles on Instagram using the Python Instaloader package. Some of these profiles are quite large and have a lot of likes and comments. For these posts, I keep getting a 429: Too Many Requests error. I understand that Instagram has a limit of 200 requests per hour and have read up on Instaloader's troubleshooting page, as well as scoured the depths of GitHub issues (including #774, #1006, #944, #802, #822, etc.) and Stack Overflow. Unfortunately, I'm having trouble finding a solution in general, and especially for Python, as it seems most people use this tool on the command line.

First, I understand Instagram's limit is 200 requests per hour. What does this mean? What constitutes a request? If a post has 2,000 likes, is each like a request, meaning it would take 10 hours to get 2,000 likes without hitting the 429 limit?

Second, I am wondering how I can most efficiently abide by these limitations. I have been trying to use the RateController to set custom scraping intervals that stay within the limit, such as in the code block below, but I keep getting a 429.

```python
import random
import time

from instaloader import Instaloader, RateController

class MyRateController(RateController):
    def sleep(self, secs):
        # ignore the suggested wait time and sleep a random 30-120 s instead
        wait_time = random.uniform(30, 120)
        time.sleep(wait_time)

    def count_per_sliding_window(self, query_type):
        return 20
```

To instantiate the class, I call: `L = Instaloader(rate_controller=lambda ctx: MyRateController(ctx))`

My implementation makes sense to me because the documentation says that `count_per_sliding_window` "return[s] how many requests of the given type can be done within a sliding window of 11 minutes." So, if I set the value to 20, this looks to me to be 20 requests every 11 minutes, or about 110 requests an hour, which is less than 200. Unfortunately I still get continuous errors as shown below:

```
Too many queries in the last time. Need to wait 614 seconds, until 22:31.
Too many queries in the last time. Need to wait 566 seconds, until 22:31.
Too many queries in the last time. Need to wait 488 seconds, until 22:31.
Too many queries in the last time. Need to wait 426 seconds, until 22:31.
```
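For reference, the arithmetic behind my expectation can be written out directly. This is just my own reasoning expressed as code, not anything Instaloader exposes:

```python
# Convert a per-sliding-window request budget into an implied hourly rate.
WINDOW_MINUTES = 11  # the sliding-window length from the Instaloader docs

def requests_per_hour(count_per_window: int) -> float:
    """Hourly rate implied by `count_per_window` requests per 11-minute window."""
    return count_per_window * 60 / WINDOW_MINUTES

print(round(requests_per_hour(20), 1))  # → 109.1, well under the reported 200/hour cap
```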

Generally my intended process is the sequence below:

  • create instaloader instance
  • load session file
  • profile.get_posts()
  • for post in posts:
    • post.get_likes()
    • post.get_comments()

I also commonly get the error message shown below instead of the one previously shown:

```
JSON Query to graphql/query: 429 Too Many Requests [retrying; skip with ^C]
Number of requests within last 10/11/20/22/30/60 minutes grouped by type:
 other: 1 1 1 1 1 1
 37479f2b8209594dde7facb0d904896a: 1 1 1 1 1 1
 2b0673e0dc4580674a88d426fe00ea90: 1 1 1 1 1 1
 1cb6ec562846122743b61e492c85999f: 1 1 1 1 1 1
Instagram responded with HTTP error "429 - Too Many Requests". Please do not run multiple instances of Instaloader in parallel or within short sequence. Also, do not use any Instagram App while Instaloader is running. The request will be retried in 666 seconds, at 18:02.
```

Ultimately my goal is to answer two questions:

  1. What constitutes a request using this API?
  2. How can I retrieve data for likes and comments for Instagram accounts with a lot of engaged followers in Python? The pseudocode I showed above works well for smaller accounts, but I keep getting 429 errors for larger ones. I would love to find a way to programmatically grab the largest chunk of likes/comments possible at a time, then wait the minimum amount of time before grabbing again, and so on until I have all of the information I need.
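As a library-agnostic illustration of that grab-a-chunk-then-wait pattern (a plain Python sketch: `fetch_iterator` stands in for a lazy iterator like the one `post.get_likes()` returns, and the chunk size and delay are made-up placeholders, not Instagram's real limits):

```python
import itertools
import time

def drain_in_chunks(fetch_iterator, chunk_size=50, delay_seconds=60):
    """Pull items from a lazy iterator in fixed-size chunks, pausing between chunks.

    `fetch_iterator` is treated as opaque; in practice it could be any
    paginated iterator, with the sleep acting as a crude back-off.
    """
    collected = []
    while True:
        chunk = list(itertools.islice(fetch_iterator, chunk_size))
        if not chunk:
            break  # iterator exhausted
        collected.extend(chunk)
        time.sleep(delay_seconds)  # wait before asking for the next chunk
    return collected

# Example with a dummy iterator standing in for real likers:
likers = iter(range(120))
print(len(drain_in_chunks(likers, chunk_size=50, delay_seconds=0)))  # → 120
```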

I sincerely apologize for my poor code formatting on here; this is the first time I've posted a GitHub issue. Please let me know if anything can be clarified further. Thank you so much for your time! @aandergr @Thammus

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 4
  • Comments: 20

Most upvoted comments

Huh? I see the code, but I don’t see any hint of which file this code should be inside, or how it integrates with instaloader. It’s obviously not a standalone python program, and it doesn’t seem to import anything from instaloader.

@estatistics Sorry for answering after so long. Here’s the code which I have used!

```python
import time      # for timing and adding a random delay between chunks
import random
import itertools
import tqdm      # for keeping track of progress
from concurrent.futures import ThreadPoolExecutor


def thread_executor(profiles, threads=30, chunk_size=100):
    """profiles is the list to process, threads is the size of the thread
    pool and chunk_size is how many items to process between sleeps."""

    global result_whole
    with ThreadPoolExecutor(max_workers=threads) as exe, tqdm.tqdm(total=len(profiles)) as prog:
        result_whole = []
        start = time.time()
        # chunker is a helper (not shown here) that yields successive
        # chunk_size-item slices of the list
        for chunk in chunker(chunk_size, profiles):
            start_chunk = time.time()
            # sleep for a random delay of 30 to 50 seconds
            sl = round(random.uniform(30, 50), 3)
            print('sleeping for', sl)
            time.sleep(sl)

            # get_user_meta is the function which gets one profile's details
            # and returns a response, which is then collected in result_whole
            result = exe.map(get_user_meta, chunk, [prog] * len(chunk))
            result_whole.extend(list(result))

            end_chunk = time.time()
            print('chunk time', end_chunk - start_chunk, "seconds")

        end = time.time()
        print(f"taken time for {len(profiles)} profiles", end - start, "seconds")
```
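The code above relies on a `chunker` helper that was not posted. A minimal version, assuming it simply yields successive fixed-size slices of the input list (the original definition may differ), might look like:

```python
def chunker(size, items):
    """Yield successive `size`-item slices of `items`.

    Assumed shape of the helper referenced above; the original was not shown.
    """
    for start in range(0, len(items), size):
        yield items[start:start + size]

# e.g. splitting a list of 250 profiles into chunks of 100:
print([len(c) for c in chunker(100, list(range(250)))])  # → [100, 100, 50]
```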