twarc: conversations module "sleep" inconsistencies
Hi there, I feel like I’m running into a similar issue as described here: https://twittercommunity.com/t/inconsistent-rate-limit-academic-research-full-archive-search/162928/18?u=igorbrigadir
and here: https://github.com/DocNow/twarc/pull/578.
I too am fetching all tweets related to a conversation id using the twarc2 command:
Twarc2 conversations --archive input_conversation_ids.txt output_conversation_tweets.jsonl
But, I’m finding that it is doing far fewer than the 300 requests per 15 minutes that my academic Twitter API account has been allotted.
I’m using the latest version of twarc 2.8.2
Here is some of the log that I’m seeing (Notice that at 10:14:24 it stops…and then doesn’t restart until 11:23:27 for no clear reason.
2022-01-01 10:04:32,148 INFO fetching conversation 290902007206772736
2022-01-01 10:04:32,148 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:290902007206772736', 'max_results': 100}}
2022-01-01 10:04:32,174 WARNING rate limit exceeded: sleeping 591.8250887393951 secs
2022-01-01 10:14:24,003 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:290902007206772736', 'max_results': 100}}
2022-01-01 10:14:24,117 INFO Retrieved an empty page of results.
2022-01-01 10:14:24,117 INFO No more results for search conversation_id:290902007206772736.
2022-01-01 11:23:27,550 INFO fetching conversation 290902009131966464
2022-01-01 11:23:27,551 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:290902009131966464', 'max_results': 100}}
2022-01-01 11:23:28,605 INFO Retrieved an empty page of results.
2022-01-01 11:23:28,605 INFO No more results for search conversation_id:290902009131966464.
2022-01-01 11:23:28,607 INFO fetching conversation 290902010599964672
I’m also getting a lot of these warnings about "overlong sleep interval"s:
2022-01-01 15:53:56,688 WARNING Detected overlong sleep interval - is your system clock accurate? An accurate system time is needed to calculate how long to sleep for, and data collection might be slowed.
2022-01-01 15:53:56,688 WARNING rate limit exceeded: sleeping 901 secs
2022-01-01 16:08:57,693 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:291011973909467136', 'max_results': 100}}
2022-01-01 16:08:57,795 INFO Retrieved an empty page of results.
2022-01-01 16:08:57,795 INFO No more results for search conversation_id:291011973909467136.
2022-01-01 16:08:57,795 INFO fetching conversation 291018004416839680
2022-01-01 16:08:57,796 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:291018004416839680', 'max_results': 100}}
2022-01-01 16:08:57,820 WARNING Detected overlong sleep interval - is your system clock accurate? An accurate system time is needed to calculate how long to sleep for, and data collection might be slowed.
2022-01-01 16:08:57,820 WARNING rate limit exceeded: sleeping 901 secs
2022-01-01 16:23:58,834 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:291018004416839680', 'max_results': 100}}
2022-01-01 16:23:58,967 INFO Retrieved an empty page of results.
2022-01-01 16:23:58,967 INFO No more results for search conversation_id:291018004416839680.
2022-01-01 16:23:58,968 INFO fetching conversation 291062323647500288
2022-01-01 16:23:58,968 INFO getting ('https://api.twitter.com/2/tweets/search/all',) {'params': {'expansions': 'author_id,in_reply_to_user_id,referenced_tweets.id,referenced_tweets.id.author_id,entities.mentions.username,attachments.poll_ids,attachments.media_keys,geo.place_id', 'tweet.fields': 'attachments,author_id,context_annotations,conversation_id,created_at,entities,geo,id,in_reply_to_user_id,lang,public_metrics,text,possibly_sensitive,referenced_tweets,reply_settings,source,withheld', 'user.fields': 'created_at,description,entities,id,location,name,pinned_tweet_id,profile_image_url,protected,public_metrics,url,username,verified,withheld', 'media.fields': 'alt_text,duration_ms,height,media_key,preview_image_url,type,url,width,public_metrics', 'poll.fields': 'duration_minutes,end_datetime,id,options,voting_status', 'place.fields': 'contained_within,country,country_code,full_name,geo,id,name,place_type', 'start_time': '2006-03-21T00:00:00+00:00', 'end_time': None, 'query': 'conversation_id:291062323647500288', 'max_results': 100}}
2022-01-01 16:23:58,992 WARNING Detected overlong sleep interval - is your system clock accurate? An accurate system time is needed to calculate how long to sleep for, and data collection might be slowed.
2022-01-01 16:23:58,992 WARNING rate limit exceeded: sleeping 901 secs
Any help would be appreciated as at this rate, I won’t put this dataset together within any reasonable amount of time.
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 21 (11 by maintainers)
Here’s the log after roughly 1000 requests. So far, only seeing the non-900 second warnings every 300 requests. Not seeing 10 second warnings. Not seeing back-to-back long sleeps.
So far looking good…😃 sample_log (twarc-dev1).log
Oh wait, looking more closely at the log, it is definitely a bug, we’re just making too many calls… responses with no results are the most notable place, probably because they’re returning extremely fast.
I wonder if it would be usrful to add some logging statements to make some of the logic more traceable? Then @jonlee112 could run it and share relevant snippets of the log?
I’ve just merged the fix for this, and v2.8.3 should be out in a few minutes.
Thanks for all your help debugging!
@SamHames Will reset tokens and restart now.
Huh, that’s interesting, I’m not sure how to interpret that yet. That definitely indicates we’re tripping up the 1r/s rate limit, but I’m not sure how, and also why we’re not getting the >900 second sleep windows…
Would it be difficult for you to reset your bearer token and try the process again? (You can reset the token through the twitter developer console, and run
twarc2 configureagain to enter the new token)This is the correct behaviour, you can make 300 requests to that endpoint every 15 minutes and twarc should be stopping once it hits that point.
@edsu @igorbrigadir - do you think we should have exponential backoff on 1r/s retries? https://github.com/DocNow/twarc/blob/rate_limit_handling/twarc/decorators2.py#L51
Just taking a peek at the headers for the rate limits (Script here for reference.), putting a couple of notes for myself for tomorrow:
x-rate-limit-limitandx-rate-limit-remainingdon’t reflect the API limit of the endpoint until you hit that limit? The default values for me are 3000/3000 calls, until I hit the 15 minute window, when it dropped to the correct value of 300/0. This means that we can’t actually usex-rate-limit-remainingfor anything, as the values aren’t accurate until we have a 429 error anyway.x-rate-limit-resetalways refers to the 15 minute window, even when the 1r/s limit has been hit. I think we can disambiguate the two by looking atx-rate-limit-remaining- it will be 0 when we’ve hit a 15 minute window, and non-zero when we’ve hit the 1r/s window? I need to do some confirmation work here.Looking at this, and reflecting on the fact that we’re looking at estimated sleep times greater than 15 minutes makes me think we need to look at the time difference calculation, as there’s nothing in the headers that should be causing a problem. I will probably add a check for the 1 r/s rate limit and update the logging in that section of code and see if we can shake anything loose there.
I’ve just started taking a look at this in more detail - thinking out loud there’s two things I’m going to look at:
@edsu Sorry didn’t directly respond Ed - I was running a rehydrator app for another dataset but I stopped/closed that yesterday and retried the twarc2 conversations command and was still seeing the same sleep behaviors.