instagram-php-scraper: Can't retrieve user medias

Calling getMediasByUserId returns an error. The response body is: {"message": "forbidden", "status": "fail"}

Is there a way to get around this?

About this issue

State: closed
Created 6 years ago
Reactions: 9
Comments: 109 (27 by maintainers)

Links to this issue

How can I get a user's media from Instagram without authenticating as a user? - Stack Overflow

Most upvoted comments

Looks like this cat and mouse game can’t be finished if we continue to discuss these updates and fixes in public…

350d on Apr 12, 2018

How X-Instagram-GIS is calculated has already been explained above, but the cat-and-mouse game continues, and today the hash looks like this in PHP: $gis = md5(join(':', array( $page_data['rhx_gis'], json_encode($variables) )));
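
For anyone not working in PHP, the same calculation can be sketched in Python. This is a minimal sketch, assuming rhx_gis has already been read from the page data; the function name gis_signature is made up for illustration. Compact separators are used because PHP's json_encode emits no spaces, but the two encoders can still differ on some inputs (json_encode escapes "/" by default), so the string that is hashed should be the same string that is sent as the variables parameter.

```python
import hashlib
import json


def gis_signature(rhx_gis: str, variables: dict) -> str:
    """Return md5("<rhx_gis>:<variables as compact JSON>") as a hex string."""
    # Compact separators mimic the space-free output of PHP's json_encode.
    payload = json.dumps(variables, separators=(",", ":"))
    return hashlib.md5(f"{rhx_gis}:{payload}".encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # "abc" is a placeholder, not a real rhx_gis value.
    print(gis_signature("abc", {"id": "25025320", "first": 12, "after": ""}))
```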

welcomemax on Apr 17, 2018

Looks like the hash of x-instagram-gis no longer includes the user-agent string.

jeff-an on Apr 12, 2018

#328 Need to test… But this solution works for me 😌

rhcarlosweb on Apr 9, 2018

Guys, I am feeling paranoid. I was trying to find the rate limit of the GraphQL endpoint without a logged-in user, using Postman and a browser. Eventually I hit the limit and my IP was blocked. The strange part is that my phone was blocked too: whenever I try to scroll down any Instagram profile in mobile Chrome's incognito mode, I get a 429… and the facts are:

I haven’t used Instagram web on my phone for ages
I am using 3G, so my laptop and phone should have different IPs
I have the Facebook and Instagram apps installed on my phone

Can it be that they have the MAC addresses of all my devices? Or can someone explain what’s happening…

I have destroyed my phone and laptop, and moving to the mountains ))

mnatsakanyank on Apr 18, 2018

@kenjones91 omg, they’re closing more up. This will kill my site. 🙁

fattony80 on Apr 12, 2018

@andrewyoo confirmed, it works. The only thing to keep in mind: request and parse the tokens (rhx_gis and csrftoken) with the same user agent you use for the other requests. Another problem is rate limits; they look like per-IP limits.

350d on Apr 10, 2018

It would be perfect to integrate this into the project as an optional way to parse things.

Below is simple code that parses the HTML and gets the media data from the embedded JS (Instagram.php):

    /**
     * Get medias shared by user (HTML parser)
     *
     * @param     string    $username
     * @return    Media[]
     * @throws    InstagramException
     */
    public function getMediasSharedByAccount($username)
    {
      $response = Request::get(Endpoints::getAccountPageLink($username));
      if (static::HTTP_NOT_FOUND === $response->code) {
          throw new InstagramNotFoundException('Account with given username does not exist.');
      }
      if (static::HTTP_OK !== $response->code) {
          throw new InstagramException('Response code is ' . $response->code . '. Body: ' . static::getErrorBody($response->body) . ' Something went wrong. Please report issue.');
      }
      $regex = '/window\._sharedData.*<\/script>/';
      // Bail out if the page no longer embeds window._sharedData.
      if (!preg_match($regex, $response->raw_body, $data)) {
        return [];
      }
      $data = $data[0];
      // Strip the JS assignment and closing tag so only the JSON remains.
      $data = str_replace('window._sharedData = ', '', $data);
      $data = str_replace(';</script>', '', $data);
      $data = json_decode($data, true, 512, JSON_BIGINT_AS_STRING);
      $nodes = $data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'] ?? [];
      if (empty($nodes)) {
        return [];
      }
      $medias = [];
      foreach ($nodes as $mediaArray) {
        $medias[] = Media::create($mediaArray['node']);
      }
      return $medias;
    }
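
The same extraction step can be sketched outside the library, here in Python. This is only a sketch under the assumption that the profile page still embeds `window._sharedData = {...};</script>` with the key path used in the PHP above; the name extract_timeline_nodes is hypothetical.

```python
import json
import re

# Capture the JSON object assigned to window._sharedData.
SHARED_DATA_RE = re.compile(
    r"window\._sharedData\s*=\s*(\{.*?\});</script>", re.DOTALL
)


def extract_timeline_nodes(html: str) -> list:
    """Return the media nodes embedded in a profile page, or [] if absent."""
    match = SHARED_DATA_RE.search(html)
    if match is None:
        return []
    data = json.loads(match.group(1))
    try:
        user = data["entry_data"]["ProfilePage"][0]["graphql"]["user"]
        edges = user["edge_owner_to_timeline_media"]["edges"]
    except (KeyError, IndexError, TypeError):
        # The page layout changed or this is not a profile page.
        return []
    return [edge["node"] for edge in edges]
```

Returning an empty list on any missing key keeps the caller from crashing when the page layout changes, which this thread suggests happens often.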

ZipDriver on Apr 16, 2018

@knissophiliac you can get rhx_gis just from a GET on ‘https://www.instagram.com’. Also it’s pretty much returned in the html of most pages.

@350d It’s still working for me…

andrewyoo on Apr 10, 2018

@andrewyoo x-instagram-gis is calculated from csrf_token, rhx_gis, window.navigator.userAgent and the variables from the API call. Here is my refactored hashing function:

function gishash(n,r,t){function e(n,r){var t=(65535&n)+(65535&r);return(n>>16)+(r>>16)+(t>>16)<<16|65535&t}function o(n,r,t,o,u,c){return e((f=e(e(r,n),e(o,c)))<<(a=u)|f>>>32-a,t);var f,a}function u(n,r,t,e,u,c,f){return o(r&t|~r&e,n,r,u,c,f)}function c(n,r,t,e,u,c,f){return o(r&e|t&~e,n,r,u,c,f)}function f(n,r,t,e,u,c,f){return o(r^t^e,n,r,u,c,f)}function a(n,r,t,e,u,c,f){return o(t^(r|~e),n,r,u,c,f)}function i(n,r){var t,o,i,h,g;n[r>>5]|=128<<r%32,n[14+(r+64>>>9<<4)]=r;var v=1732584193,d=-271733879,l=-1732584194,A=271733878;for(t=0;t<n.length;t+=16)o=v,i=d,h=l,g=A,d=a(d=a(d=a(d=a(d=f(d=f(d=f(d=f(d=c(d=c(d=c(d=c(d=u(d=u(d=u(d=u(d,l=u(l,A=u(A,v=u(v,d,l,A,n[t],7,-680876936),d,l,n[t+1],12,-389564586),v,d,n[t+2],17,606105819),A,v,n[t+3],22,-1044525330),l=u(l,A=u(A,v=u(v,d,l,A,n[t+4],7,-176418897),d,l,n[t+5],12,1200080426),v,d,n[t+6],17,-1473231341),A,v,n[t+7],22,-45705983),l=u(l,A=u(A,v=u(v,d,l,A,n[t+8],7,1770035416),d,l,n[t+9],12,-1958414417),v,d,n[t+10],17,-42063),A,v,n[t+11],22,-1990404162),l=u(l,A=u(A,v=u(v,d,l,A,n[t+12],7,1804603682),d,l,n[t+13],12,-40341101),v,d,n[t+14],17,-1502002290),A,v,n[t+15],22,1236535329),l=c(l,A=c(A,v=c(v,d,l,A,n[t+1],5,-165796510),d,l,n[t+6],9,-1069501632),v,d,n[t+11],14,643717713),A,v,n[t],20,-373897302),l=c(l,A=c(A,v=c(v,d,l,A,n[t+5],5,-701558691),d,l,n[t+10],9,38016083),v,d,n[t+15],14,-660478335),A,v,n[t+4],20,-405537848),l=c(l,A=c(A,v=c(v,d,l,A,n[t+9],5,568446438),d,l,n[t+14],9,-1019803690),v,d,n[t+3],14,-187363961),A,v,n[t+8],20,1163531501),l=c(l,A=c(A,v=c(v,d,l,A,n[t+13],5,-1444681467),d,l,n[t+2],9,-51403784),v,d,n[t+7],14,1735328473),A,v,n[t+12],20,-1926607734),l=f(l,A=f(A,v=f(v,d,l,A,n[t+5],4,-378558),d,l,n[t+8],11,-2022574463),v,d,n[t+11],16,1839030562),A,v,n[t+14],23,-35309556),l=f(l,A=f(A,v=f(v,d,l,A,n[t+1],4,-1530992060),d,l,n[t+4],11,1272893353),v,d,n[t+7],16,-155497632),A,v,n[t+10],23,-1094730640),l=f(l,A=f(A,v=f(v,d,l,A,n[t+13],4,681279174),d,l,n[t],11,-358537222),v,d,n[t+3],16,-722521979),A,v,n[t+6],23,76029189),l=f(l,A=
f(A,v=f(v,d,l,A,n[t+9],4,-640364487),d,l,n[t+12],11,-421815835),v,d,n[t+15],16,530742520),A,v,n[t+2],23,-995338651),l=a(l,A=a(A,v=a(v,d,l,A,n[t],6,-198630844),d,l,n[t+7],10,1126891415),v,d,n[t+14],15,-1416354905),A,v,n[t+5],21,-57434055),l=a(l,A=a(A,v=a(v,d,l,A,n[t+12],6,1700485571),d,l,n[t+3],10,-1894986606),v,d,n[t+10],15,-1051523),A,v,n[t+1],21,-2054922799),l=a(l,A=a(A,v=a(v,d,l,A,n[t+8],6,1873313359),d,l,n[t+15],10,-30611744),v,d,n[t+6],15,-1560198380),A,v,n[t+13],21,1309151649),l=a(l,A=a(A,v=a(v,d,l,A,n[t+4],6,-145523070),d,l,n[t+11],10,-1120210379),v,d,n[t+2],15,718787259),A,v,n[t+9],21,-343485551),v=e(v,o),d=e(d,i),l=e(l,h),A=e(A,g);return[v,d,l,A]}function h(n){var r,t="",e=32*n.length;for(r=0;r<e;r+=8)t+=String.fromCharCode(n[r>>5]>>>r%32&255);return t}function g(n){var r,t=[];for(t[(n.length>>2)-1]=void 0,r=0;r<t.length;r+=1)t[r]=0;var e=8*n.length;for(r=0;r<e;r+=8)t[r>>5]|=(255&n.charCodeAt(r/8))<<r%32;return t}function v(n){var r,t,e="";for(t=0;t<n.length;t+=1)r=n.charCodeAt(t),e+="0123456789abcdef".charAt(r>>>4&15)+"0123456789abcdef".charAt(15&r);return e}function d(n){return unescape(encodeURIComponent(n))}function l(n){return h(i(g(r=d(n)),8*r.length));var r}return v(l(r+":"+t+":"+window.navigator.userAgent+":"+n))}

Call this function like this: gishash("{\"id\":\"5821462185\",\"first\":40,\"after\":\"\"}", rhx_gis, csrf_token). rhx_gis and csrf_token can be parsed from the source of any embed page (CORS is available on these links).

I’ve tried to achieve this in JavaScript, but here is the problem: I can’t set these custom headers because of the allow-origin restriction on custom headers on Instagram’s side. I guess this is not a problem in PHP, though.
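
The minified function above is hard to read, so here is a readable equivalent sketched in Python. It relies on two things visible in this thread: the final expression joins rhx_gis, csrf_token, the user agent and the variables with ":", and the hashing itself is plain MD5 over the UTF-8 bytes. The name legacy_gis_hash and the explicit user_agent parameter (standing in for window.navigator.userAgent) are illustrative assumptions.

```python
import hashlib


def legacy_gis_hash(variables_json: str, rhx_gis: str,
                    csrf_token: str, user_agent: str) -> str:
    """Hex MD5 of "rhx_gis:csrf_token:user_agent:variables"."""
    message = ":".join([rhx_gis, csrf_token, user_agent, variables_json])
    return hashlib.md5(message.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Placeholder tokens; real values come from the page source.
    variables = '{"id":"5821462185","first":40,"after":""}'
    print(legacy_gis_hash(variables, "rhx", "csrf", "Mozilla/5.0"))
```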

350d on Apr 10, 2018

@raiym @rhcarlosweb @gthedev hi! I don’t know PHP well enough to help with this one, but maybe the quick hotfix would be:

Change the current URL in https://github.com/postaddictme/instagram-php-scraper/blob/master/src/InstagramScraper/Endpoints.php from: ACCOUNT_MEDIAS = 'https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first={count}&after={max_id}'; to: ACCOUNT_MEDIAS = 'https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables={"id":"{user_id}","first":{count},"after":"{max_id}"}';
Send the following cookies with the request. I just checked: the cookies I retrieved yesterday still work (one day later) and from different clients, without needing to fetch new ones before each request. So the automated-browser part can be omitted for now:

[
    {
        "domain": "www.instagram.com",
        "httpOnly": false,
        "name": "rur",
        "path": "/",
        "secure": false,
        "value": "PRN"
    },
    {
        "domain": "www.instagram.com",
        "httpOnly": false,
        "name": "ig_vw",
        "path": "/",
        "secure": false,
        "value": "1038"
    },
    {
        "domain": "www.instagram.com",
        "expiry": 1554672942.248612,
        "httpOnly": false,
        "name": "csrftoken",
        "path": "/",
        "secure": true,
        "value": "ObRXje2ByOUmAnxqPaoFsD0CHvBEK8dQ"
    },
    {
        "domain": "www.instagram.com",
        "expiry": 2153943342.248646,
        "httpOnly": false,
        "name": "mid",
        "path": "/",
        "secure": false,
        "value": "WsqLMgALAAFkkaMz9rbL568BCU5N"
    },
    {
        "domain": "www.instagram.com",
        "httpOnly": false,
        "name": "ig_vh",
        "path": "/",
        "secure": false,
        "value": "532"
    },
    {
        "domain": "www.instagram.com",
        "httpOnly": false,
        "name": "ig_pr",
        "path": "/",
        "secure": false,
        "value": "2.5"
    }
]
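
A cookie export in this shape can be turned into a single Cookie header value with a few lines of Python. This is a minimal sketch assuming each entry has "name" and "value" keys as in the list above; cookie_header is a hypothetical helper name.

```python
import json


def cookie_header(cookie_export: str) -> str:
    """Build a Cookie header value from a JSON list of cookie objects."""
    cookies = json.loads(cookie_export)
    # Only name and value are sent; domain, path, expiry stay client-side.
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


if __name__ == "__main__":
    export = '[{"name": "rur", "value": "PRN"}, {"name": "ig_pr", "value": "2.5"}]'
    print(cookie_header(export))
```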

Maybe this is not the final solution, but at least media queries will work (for some time 😅)

myrs on Apr 8, 2018

I made a solution for this one, but in Python, using an automated browser to retrieve the cookies and the new URL. I really don’t know what a PHP implementation would look like, but these are the steps:

Get cookies with an automated browser
Make the request with these cookies and the new URL: 'https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables={"id":"<user_id>","first":<items_to_retrieve>,"after":"<end_cursor>"}' where <end_cursor> is either blank or the end_cursor from the previous request, and <items_to_retrieve> is the page size (Instagram web uses 12; I tested successfully with 20).

Disclaimer 1: no authorization needed! Disclaimer 2: I actually reused the same cookies several times and it worked. The expiry seems to be set to one year. But I don’t know whether Instagram will catch the use of the same cookies from many different clients if they are hardcoded into this scraper!

Python implementation:

# ! Error handling is omitted for clarity
import requests
from selenium import webdriver

media_url = 'https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables={"id":"%s","first":20,"after":"%s"}'
browser = webdriver.Chrome()

# first load https://instagram.com to obtain cookies
browser.get('https://instagram.com')
browser_cookies = browser.get_cookies()
browser.quit()

# copy the browser cookies into a requests session
session = requests.Session()
for cookie in browser_cookies:
    session.cookies.set(cookie['name'], cookie['value'])

# get response as JSON (TLS verification is left at its default, enabled)
# > using id '25025320' - profile of Instagram for this example
response = session.get(media_url % ('25025320', '')).json()
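
The snippet above fetches a single page. Following end_cursor across pages could look roughly like the sketch below. It assumes the response nests the timeline under data.user.edge_owner_to_timeline_media and carries a page_info object with has_next_page and end_cursor; that response shape is an assumption, not something confirmed in this thread. build_media_url and iter_media are hypothetical helpers, and the HTTP call is passed in as fetch_json so the paging logic stays independent of how cookies are obtained.

```python
import json
from urllib.parse import quote

QUERY_HASH = "42323d64886122307be10013ad2dcc44"  # value quoted in this thread


def build_media_url(user_id: str, first: int = 20, after: str = "") -> str:
    """Build the GraphQL URL with URL-encoded JSON variables."""
    variables = json.dumps(
        {"id": user_id, "first": first, "after": after}, separators=(",", ":")
    )
    return (
        "https://www.instagram.com/graphql/query/"
        f"?query_hash={QUERY_HASH}&variables={quote(variables)}"
    )


def iter_media(fetch_json, user_id: str, first: int = 20, max_pages: int = 3):
    """Yield media nodes page by page; fetch_json(url) must return a dict."""
    after = ""
    for _ in range(max_pages):
        payload = fetch_json(build_media_url(user_id, first, after))
        timeline = payload["data"]["user"]["edge_owner_to_timeline_media"]
        for edge in timeline["edges"]:
            yield edge["node"]
        page_info = timeline["page_info"]  # assumed response shape
        if not page_info.get("has_next_page"):
            break
        after = page_info["end_cursor"]
```

With the session from the snippet above, a call could look like iter_media(lambda url: session.get(url).json(), '25025320'). The max_pages cap is there because of the rate limits discussed elsewhere in this thread.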

myrs on Apr 8, 2018

@Scottzonn something like this, and it still works 😃 …

function getMediasFromURL($username, $count = 12)
{
	$medias = array();
	$doc = new DOMDocument();
	// Suppress parser warnings caused by non-strict HTML in the page.
	@$doc->loadHTML(file_get_contents('https://www.instagram.com/'.$username.'/'));
	$jsNodes = $doc->getElementsByTagName("script");
	$jsNodeTmp = "";
	// Find the script tag that assigns window._sharedData.
	foreach($jsNodes as $node){
		if(strpos($node->nodeValue,"window._sharedData")!==false){
			$jsNodeTmp = $node->nodeValue;
			break;
		}
	}
	if($jsNodeTmp){
		// Strip the variable name, "=" and trailing ";" to leave bare JSON.
		$jsNodeTmp = trim(str_replace("window._sharedData","",$jsNodeTmp)," ;=");
		$json = json_decode($jsNodeTmp);
		$jsonMedia = $json->entry_data->ProfilePage[0]->graphql->user->edge_owner_to_timeline_media->edges;
		// Keep at most $count items.
		$medias = array_slice($jsonMedia, 0, $count);
	}
	return $medias;
}

But in this case the scraper is not necessary… Hope a real solution will be found.

zaivst on Apr 16, 2018

I just solved the problem by parsing the HTML page of the account and then taking the JSON from the JavaScript. Yes, it is just 12 medias, but it works 😃 I “love” Instagram more and more )))

zaivst on Apr 15, 2018

The medias seem to work now with the latest changes and a logged-in account, but it quickly hits the rate limit. Body: message => rate limited; status => fail;

Does anyone know more details on the limits? What is the limit, is it based on user/ip/both?

The maximum is 200 requests per hour! Details here: https://stackoverflow.com/questions/49585077/instagram-api-limit-reduced-to-200-from-5000
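
One way to stay under such a ceiling is to space requests evenly. The sketch below is a simple Python throttle, assuming the 200-per-hour figure from the linked question also applies to the endpoint you are calling, which this thread does not confirm. Throttle is a hypothetical class; the clock and sleep functions are injectable so it can be tested without waiting.

```python
import time


class Throttle:
    """Space calls at least 3600 / max_per_hour seconds apart."""

    def __init__(self, max_per_hour=200, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Call wait() before each request. At 200 per hour the spacing works out to 18 seconds between requests.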

kenjones91 on Apr 13, 2018

https://www.instagram.com/vasiliizaikovskii/?__a=1 this works!

This doesn’t work today.
Got 403 status. 😦

kenjones91 on Apr 12, 2018

@footniko I used the following headers which are working for anonymous crawling:

{
        "x-instagram-gis" => gistoken, 
        "cookie" => "csrftoken=#{csrf_token}",
        "user-agent" => 'user agent string'
}

gistoken is calculated with @350d’s function
csrftoken is sent in the cookie
the user-agent string is also passed so that the gistoken can be recalculated and compared on the server side.

andrewyoo on Apr 10, 2018

I’ve just realized that x-instagram-gis is just an md5 hash 😀

350d on Apr 10, 2018

Strange, because I tested with a blank value, $this->userSession['ig_pr'] = "";, and it works too…

Hm… maybe Instagram just expects this cookie name, whatever the value. Setting it to some random value, e.g. 42, works fine too!

But yes, when ig_pr is not present, it returns a 403.

Nice user private data protection system, anyway 😅

myrs on Apr 9, 2018