GetOldTweets3: HTTP Error, Gives 404 but the URL is working
Hi, I had a script running over the past few weeks and earlier today it stopped working. I keep receiving HTTPError 404, but the link provided in the error still brings me to a valid page.
Code is (all mentioned variables are established, and the error specifically happens in the Manager when I check via debugging):
tweetCriteria = got.manager.TweetCriteria().setQuerySearch(term)\
    .setMaxTweets(max_count)\
    .setSince(begin_timeframe)\
    .setUntil(end_timeframe)
scraped_tweets = got.manager.TweetManager.getTweets(tweetCriteria)
The error message is the standard 404 error "An error occured during an HTTP request: HTTP Error 404: Not Found. Try to open in browser:" followed by the valid link.
As I have changed nothing locally, I am wondering whether something has happened with my configuration more than anything else, but I'm also wondering if others are experiencing this.
About this issue
- Original URL
- State: open
- Created 4 years ago
- Reactions: 74
- Comments: 144
Links to this issue
Commits related to this issue
- Disabled the Twitter side because of Mottl/GetOldTweets3#98. — committed to ituethoslab/navcom-data-downloader by xmacex 4 years ago
Unfortunately the Twitter API does not fully meet our needs, because we need full-history search without any limitations. You can only search 5,000 tweets per month with the Twitter API.
I hope GetOldTweets starts working again as soon as possible, otherwise I cannot complete my master's thesis.
I used the query search below and it returns the links of the tweets.
I obtained the tweet_id values and then used tweepy to extract the tweets, as I needed more attributes (this may not be the best way to do it):
Note that tweet_ids is a list of 100 tweet ids.
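The two-step workflow described above (collect IDs with snscrape, then hydrate them with tweepy) can be sketched roughly as follows. This is a minimal illustration, not the commenter's exact code: the batching helper and placeholder credentials are assumptions, and statuses_lookup is the tweepy v3 name (tweepy v4 renames it lookup_statuses).

```python
def chunks(ids, size=100):
    """Split a list of tweet IDs into batches of at most `size`
    (statuses_lookup accepts at most 100 IDs per call)."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

def hydrate(tweet_ids, api):
    """Look up full tweet objects, one batch of <=100 IDs at a time."""
    tweets = []
    for batch in chunks(tweet_ids):
        tweets.extend(api.statuses_lookup(batch, tweet_mode="extended"))
    return tweets

# Usage (requires a Twitter developer account; keys are placeholders):
#   import tweepy
#   auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
#   auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
#   api = tweepy.API(auth, wait_on_rate_limit=True)
#   tweets = hydrate(tweet_ids, api)
```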
@burakoglakci thanks for sharing your experience and work with us!! It is really appreciated and it helped me a lot. I want to ask: what would the query string be (using snscrape) if we want to get tweets by longitude and latitude as well, and how can we find the geo-location of any city/country on Twitter? Thanks in advance 😃
For those who are still struggling to download tweets as CSV with snscrape, this works absolutely fine for me. Configuration: Windows 7 SP1 (64-bit), Python 3.8.6. pip3.8 install git+https://github.com/JustAnotherArchivist/snscrape.git Write this code in a new Jupyter Notebook and make sure it is using the Python 3.8.6 kernel.
Using code from the above comments.
For those of you who have been using snscrape: can you post any code examples doing a simple query search in a script rather than the console? The lack of documentation is making this more trial and error as I learn the modules.
So far the only method of scraping tweets that still seems to work is snscrape's jsonl mode. A comment in this Twint issue explains how to do this. Please note you will need Python 3.8 and the latest development version of snscrape. This doesn't export the .json result to .csv, though. For that I used an online converter at first; later I used the pandas library in Python for the conversion.
I tried replacing https://twitter.com/i/search/timeline with https://twitter.com/search?. 404 error is gone but now there is 400 bad request error.
Yes, refer to my article as I mentioned above where I cover the basics of using snscrape instead because GetOldTweets3 is basically obsolete due to changes in Twitter’s API https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af
In regards to your specific use case, with snscrape you just put whatever query you want inside the quotes in the TwitterSearchScraper method and adjust the since and until operators to whatever time range you want. I created a code snippet for you below. You can take out the i > 500 check if you don't want to restrict the number of tweets and just want every single tweet.
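As a rough sketch of that pattern: the keyword, dates, and 500-tweet cap below are illustrative values, not the commenter's exact snippet, and running it requires the development version of snscrape on Python 3.8.

```python
def build_query(term, since, until):
    """Compose a Twitter search query with since/until operators."""
    return f"{term} since:{since} until:{until}"

def scrape(term, since, until, max_tweets=500):
    """Yield up to max_tweets (id, date, content) tuples via snscrape."""
    import snscrape.modules.twitter as sntwitter  # dev version from GitHub
    query = build_query(term, since, until)
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= max_tweets:  # remove this check to collect every matching tweet
            break
        yield tweet.id, tweet.date, tweet.content
```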
@DV777 Hi!
https://medium.com/@jcldinco/downloading-historical-tweets-using-tweet-ids-via-snscrape-and-tweepy-5f4ecbf19032 you can get any tweet objects you want using the method described here. I created a script for my own work, and I share it below. I hope it’s useful 😃 You must have a Twitter developer account to use this method.
Hey! For the ones struggling to use snscrape, I put together a little library to download tweets using snscrape/tweepy according to customizable queries. Although it’s still a work in progress, check this repo if you want to give it a try 😃
I don't recommend using Tweepy with snscrape; it's not really efficient, since you're basically scraping twice. When you scrape with snscrape there's a tweet object you can interact with that has a lot of information and will cover most use cases. I wouldn't recommend using tweepy's api.statuses_lookup unless you need specific information only offered through tweepy.
For those still unsure about using snscrape I did write an article for scraping with snscrape that I hope clears up any confusion about using that library, there’s also python scripts and Jupyter notebooks I’ve created to build off of. I also have a picture in the article showing all the information accessible in snscrape’s tweet object. https://medium.com/better-programming/how-to-scrape-tweets-with-snscrape-90124ed006af
Thank you so much @sufyanhamid, I'm happy it helped. As far as I know, the bounding-box query cannot be run in snscrape as it can in the Twitter Streaming API. You can use the geocode query instead, as in the Twitter REST API. Ex.
With this query, you can collect tweets within 5 miles of the point coordinate you specify. As far as I know, you can go up to 15 miles.
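The geocode query described above can be composed like this. A minimal sketch: the keyword, coordinates, and radius are illustrative, and the operator form geocode:lat,lon,radius is the one the thread uses.

```python
def geocode_query(term, lat, lon, radius_mi):
    """Search `term` within `radius_mi` miles of a point coordinate,
    using the geocode: search operator."""
    return f"{term} geocode:{lat},{lon},{radius_mi}mi"

# e.g. tweets about covid within 5 miles of a point
print(geocode_query("covid", 38.04, -84.5, 5))
```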
This really works. Many thanks. Just keep in mind that using snscrape may return too many results, so it is better to limit the number of tweet IDs using --max-results.
Same issue here. I think this is because Twitter has removed the https://twitter.com/i/search/timeline endpoint.
Unfortunately I have the same problem; I hope we find a solution as soon as possible.
Here is debug enabled. It shows the actual URL being called, and it seems that Twitter has removed the /i/search/timeline endpoint. 😦
@DV777 Yes, the parameters attached to tweepy apply to tweets that have already been scraped. On snscrape, if you remove the filter:replies parameter, you can get replies. You can also collect retweets by removing the filter:links parameter, but it mostly collects the links of the main tweet. I don't know if there's a way to get the number of likes with snscrape.
@Niehaus A query like this works, I hope it works for you:
import snscrape.modules.twitter as sntwitter
import csv

maxTweets = 3000

csvFile = open('place_result.csv', 'a', newline='', encoding='utf8')
csvWriter = csv.writer(csvFile)
csvWriter.writerow(['id', 'date', 'tweet'])

for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:@burakoglakci + since:2015-12-02 until:2020-11-05 -filter:links -filter:replies').get_items()):
    if i > maxTweets:
        break
    csvWriter.writerow([tweet.id, tweet.date, tweet.content])
csvFile.close()
I'm having the exact same problem. When I remove the date filter it works, but when I have it (exactly as in the quoted code), I get no results. Anyone else having this issue or know how to solve it? @burakoglakci it's not clear to me how the changes you made in the code would solve this problem.
Edit: I think I figured it out. There was simply a small error in the quoted code: you have to put a space before the 'since'.
With snscrape, this works:
snscrape --jsonl twitter-search "from:barackobama since:2015-09-10 until:2015-09-12" > baracktweets.json
or
snscrape twitter-search "from:barackobama since:2015-09-10 until:2015-09-12" > baracktweets.txt
Explanation from the developer: twitter-user is actually just a wrapper around twitter-search using the search term from:username (plus code to extract user information from the profile page)
Thank you very much! It worked!! Thank you once again and I feel grateful for your help! 😃
You can get the results by running a code like this:
snscrape --jsonl twitter-search "YOURSEARCHQUERY @USERTODLFROM #HASHTAGTODLFROM since:2020-09-01 until:2020-09-25" > mytweets.json
I then ran the .json file through a tiny piece of Python code to get my .csv, which is enough for me right now. You might want to check out the other answers if you're looking for something more elegant with more info.
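A minimal sketch of that jsonl-to-csv step, using only the standard library: the file names are illustrative, and the field names (id, date, content) match the tweet attributes used elsewhere in this thread. The pandas equivalent mentioned earlier would be pd.read_json("mytweets.json", lines=True).to_csv("mytweets.csv").

```python
import csv
import json

def jsonl_to_csv(jsonl_path, csv_path, fields=("id", "date", "content")):
    """Write one CSV row per JSON line, keeping only the chosen fields."""
    with open(jsonl_path, encoding="utf8") as src, \
         open(csv_path, "w", newline="", encoding="utf8") as dst:
        writer = csv.writer(dst)
        writer.writerow(fields)
        for line in src:
            tweet = json.loads(line)
            writer.writerow([tweet.get(f) for f in fields])
```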
@HuifangYeo, if you really need to get data from Twitter, try the Twitter API; I am using it like this:
I did it like that, but you have to limit the number of tweets, otherwise you will get error 429. I also tried twint, but it is not working currently, so right now I think the best approach is to use this. I use rate limiting, waiting 15 minutes every 15 calls to the Twitter API; this works well, but if you try to pull a lot of data, Twitter will give you another error 429. I hope this can help you, and good luck. 😃
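The pacing described above (wait 15 minutes after every 15 API calls) can be sketched generically like this. The batch size and pause are the commenter's figures; note that tweepy's wait_on_rate_limit=True can handle this automatically.

```python
import time

def paced(calls, per_batch=15, pause=15 * 60, sleep=time.sleep):
    """Yield results from `calls`, sleeping `pause` seconds after
    every `per_batch` items to stay under the rate limit."""
    for i, call in enumerate(calls, start=1):
        yield call
        if i % per_batch == 0:
            sleep(pause)

# Usage: for tweet in paced(api_calls): ...
```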
Maybe it has something to do with this: https://blog.twitter.com/developer/en_us/topics/tips/2020/understanding-the-new-tweet-payload.html
So is there any way to get historical tweets for a hashtag? Like the most popular hashtag for the word ripple, for example, from 2015?
Tweepy has a depth limit of one week, and I tried GOT but I have the same issue as here (404). Does anyone have another solution for building a database of historical tweets? 😃
Thanks!
Thanks for your help @burakoglakci, I'd be lost without this. The thing is, when collecting a timeline I do not get the retweets, replies and likes of the account I am scraping, and I guess these parameters apply to tweets that have already been scraped. I tried to find a way to scrape the full activity of an account, but it seems quite hard. For example, even using the following code:
I do not get the retweets / replies / likes made by the account, only its own created tweets. Is there a way to scrape the whole thing? Would you have a list of the additional parameters I could add to the scraping? Also, I do have Twitter API keys; the problem is that tweepy and the Twitter API only let me collect a maximum of 3000 tweets when scraping an account's timeline, at least when I was using it in 2019. Is this still the case?
First, use snscrape to collect the tweets you want, including tweet IDs and links. You can save your tweets in a CSV or TXT file.
Then collect the tweet objects using this code. The code I share here is based on tweepy: it queries using tweet IDs, then finds and collects the objects you want (likes, retweets).
Change from:@Username to keywords or #hashtag to search by keyword as opposed to username.
Thanks to all who made this code available! smooth program and helpful for current project!
Arizona USA id: a612c69b44b2e5da
Florida USA id: 4ec01c9dbc693497
To find these IDs, you have to run a geocode query on Twitter. Ex. geocode:34.684879,-111.699645,1mi. These coordinates let you search a point location in Arizona; you can use any map service to get coordinates. Then click on the content of a tweet that appears as a result of this query. You will see Arizona, USA as the place name on the tweet content (if not, try another tweet). After clicking on the place name, you will see the place ID in the link in the search bar.
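Once found, the place IDs above can be dropped into a search query with the place: operator shown in the search bar. A minimal sketch; the keyword and helper function are illustrative, only the IDs come from the comment.

```python
# Place IDs quoted from the comment above.
PLACE_IDS = {
    "Arizona, USA": "a612c69b44b2e5da",
    "Florida, USA": "4ec01c9dbc693497",
}

def place_query(term, place_name):
    """Search `term` restricted to a known place ID."""
    return f"{term} place:{PLACE_IDS[place_name]}"
```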
@Woolwit Thanks for sharing the additional attributes of a tweet. Could you also share the code/query for getting the number of likes, retweets, and comments? Thanks in advance.
Anyone have a tip for getting all the tweets in an individual’s timeline? Have managed to get user tweets (thank you @burakoglakci for your example) but would like to get the tweets the user retweets as well (tweet.retweetedTweet didn’t get it). And for any other noobish coders out there, just in case this helps.
Yes, it is a matter of indentation; it happened to me as well. The "if i > maxTweets:" needs to start an indented block, with "break" indented under it. The "csvWriter.writerow(...)" call needs to be aligned with the "if i > maxTweets:", and "csvFile.close()" is outside the loop and needs to be aligned with the "for i, tweet in enumerate(...)".
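For reference, the loop with the indentation described above looks like this. A plain generator stands in for sntwitter.TwitterSearchScraper(...).get_items() so the shape can be checked without a network call; swap in the real scraper for actual use.

```python
import csv

def fake_get_items():
    """Stand-in for the snscrape tweet stream (illustrative data only)."""
    Tweet = type("Tweet", (), {})
    for n in range(5):
        t = Tweet()
        t.id, t.date, t.content = n, "2020-01-01", f"tweet {n}"
        yield t

maxTweets = 3
csvFile = open("result.csv", "w", newline="", encoding="utf8")
csvWriter = csv.writer(csvFile)
csvWriter.writerow(["id", "date", "tweet"])
for i, tweet in enumerate(fake_get_items()):
    if i >= maxTweets:  # the if and its break are indented inside the for
        break
    csvWriter.writerow([tweet.id, tweet.date, tweet.content])  # aligned with the if
csvFile.close()  # back at top level, aligned with the for
```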
When it comes to the scraping of likes/retweets, I did not find any easy way to do it with snscrape. I have useed tweepy. Here is the link I have followed: https://medium.com/@jcldinco/downloading-historical-tweets-using-tweet-ids-via-snscrape-and-tweepy-5f4ecbf19032
Note, you need to request the Twitter Developer role because you need all the keys.
Hope it helps!
Hi!
I am using Python 3.8.6. When I run this query:
('from:@JoeBiden + since:2020-01-01 until:2020-11-10 -filter:links -filter:replies').get_items()):
I collected 901 tweets.
Hello! I am using the last snscrape query, but it is not working for me. I am using @joebiden from 2020-01-01 and I am getting a weird output with just 1 tweet. I am a Mac user, if that matters. I really do not know what is going on. I literally copy-paste the code and change the handle, but it does not work. Any hints? Thank you so much!
@sbif
Use this query with snscrape:
import snscrape.modules.twitter as sntwitter
import csv

maxTweets = 3000

csvFile = open('place_result.csv', 'a', newline='', encoding='utf8')
csvWriter = csv.writer(csvFile)
csvWriter.writerow(['id', 'date', 'tweet'])

for i, tweet in enumerate(sntwitter.TwitterSearchScraper('from:@BillGates + since:2015-12-02 until:2020-11-05 -filter:links -filter:replies').get_items()):
    if i > maxTweets:
        break
    csvWriter.writerow([tweet.id, tweet.date, tweet.content])
csvFile.close()
@burakoglakci Can you please help me with the query to get tweets from a specific user?
@bensilver95 @Niehaus
Absolutely, our queries are working. The code I added in the previous post was not displayed correctly. If you want to add a location filter to your query,
you can run this query; with it, you can collect tweets about covid shared from the state of Kentucky. Querying shorter date ranges, as with GOT, can yield better results, because for queries that match too many tweets Twitter can stop responding.
@TamiresMonteiroCD @WelXingz @ahsanspark @Atoxal @SophieChowZZY
I think I solved the problem. I made a few changes to the lines. I collect tweets using a word and location filter. I’m using Python 3.8.6 on Windows 10 and it works fine right now.
This intermittent failure seems to be related to the random choice of user agent in TweetManager.py, where user_agent = random.choice(TweetManager.user_agents ...). I believe that a loop scanning the user-agent list with exception handling would solve this problem.
Not sure why, but I had the same problem. I replaced tweet.renderedContent with tweet.content and it works!
Add lang:en (without quotes) inside the query string; note the leading space when concatenating. Example:
for i, tweet in enumerate(sntwitter.TwitterSearchScraper(keyword + ' lang:en').get_items()):
Yes, you can add or remove filters as per your need.
Can you please tell me how to get tweets with multiple keywords in a search query, like "Jobs AND (unemployment OR government)"? @ppival
@irwanOyong I was having the same issue, the reason is I wasn’t using the development version of snscrape. Be sure to install it with
pip3 install git+https://github.com/JustAnotherArchivist/snscrape.git
Once I did that it worked like @ppival said it should.
What an excellent opportunity to write a chapter about politics of APIs in the context of research! 😅 Your supervisor will have references for literature I am sure (and depending on your field), but you can look at publications from the Digital Methods Initiative at the University of Amsterdam, including people like Anne Helmond.
Found this in issues for Twint: https://github.com/twintproject/twint/pull/917#issuecomment-697361036
Worked for me
This was in the issues for taspinar/twitterscraper, which also stopped working recently:
https://github.com/taspinar/twitterscraper/issues/344
I see! I'm fairly new to scraping, but I'm working on an end-of-course thesis about sentiment analysis and could really use some newer tweets to help me out.
I've been tinkering with GOT3's code a bit and got it to read the HTML of the search timeline, although it's mostly unformatted. Like I said, I have little experience with scraping, so I'm really struggling to format it correctly. However, I will note my changes, for reference and for someone with more experience to pick up if they so wish:
updated user_agents (updated with the ones used by TWINT);
updated endpoint (/search?)
some updates to the URL structure:
Edit: Forgot to say this. Sometimes the application gives me a 400 Bad Request; I run it again and it outputs the HTML as described above.
I forked and created a branch to allow a user-specified UA, using samples from my current browser doesn’t fix the problem.
I notice the search and referrer URL shown in --debug output (https://twitter.com/i/search/timeline) returns a 404 error.
EDIT: The URL used for the internal search and the one shown in the exception message aren't the same…