instagram-php-scraper: Can't retrieve user medias

Calling getMediasByUserId returns an error; the returned body is: {"message": "forbidden", "status": "fail"}

Is there a way to get around this?

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 9
  • Comments: 109 (27 by maintainers)

Most upvoted comments

Looks like this cat and mouse game can’t be finished if we continue to discuss these updates and fixes in public…

It has already been explained above how X-Instagram-GIS is calculated, but the cat-and-mouse game continues, and today the hash is computed like this in PHP: $gis = md5(join(':', array( $page_data['rhx_gis'], json_encode($variables) )));

Looks like the hash of x-instagram-gis no longer includes the user-agent string.
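For anyone not working in PHP, the one-liner above can be sketched in Python as follows. This is only a rough equivalent: the function name gis_signature is made up for illustration, and it assumes the signed string has to match the variables JSON sent in the request exactly. PHP's json_encode also escapes "/" as "\/", which json.dumps does not, so values containing slashes would hash differently.

```python
import hashlib
import json


def gis_signature(rhx_gis, variables):
    """Return md5("<rhx_gis>:<compact JSON of variables>") as a hex string."""
    # PHP's json_encode emits no whitespace between tokens, so use compact separators.
    payload = json.dumps(variables, separators=(",", ":"))
    return hashlib.md5(f"{rhx_gis}:{payload}".encode("utf-8")).hexdigest()
```

The result would be sent as the x-instagram-gis header, with the same compact JSON used as the variables query parameter.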

#328 Need to test… But this solution works for me 😌

Guys, I am feeling paranoid. I was trying to find the rate limit of the GraphQL endpoint without a logged-in user, using Postman and a browser. Eventually I hit the limit and my IP was blocked. The strange part is that my phone was blocked too: whenever I try to scroll down any Instagram profile in mobile Chrome's incognito mode, I get a 429… and the facts are:

  1. I haven’t used Instagram web on my phone for ages
  2. I am using 3G, so my laptop and phone should have different IPs
  3. I have the Facebook and Instagram apps installed on my phone

Can it be that they have the MAC addresses of all my devices? Or can someone explain what’s happening…

I have destroyed my phone and laptop, and moving to the mountains ))

@kenjones91 omg, they’re closing more up. This will kill my site. 🙁

@andrewyoo confirmed, it works. The only thing you need to keep in mind: request and parse the tokens (rhx_gis and csrftoken) with the same user agent you use for the other requests. Another problem is rate limits; it looks like they are applied per IP.

It would be perfect to integrate this into the project, as an optional way of parsing.

Below is simple code to perform HTML parsing and get media data from JS (Instagram.php):

    /**
     * Get medias shared by user (HTML parser)
     *
     * @param     string    $username
     * @return    Media[]
     * @throws    InstagramException
     */
    public function getMediasSharedByAccount($username)
    {
      $response = Request::get(Endpoints::getAccountPageLink($username));
      if (static::HTTP_NOT_FOUND === $response->code) {
          throw new InstagramNotFoundException('Account with given username does not exist.');
      }
      if (static::HTTP_OK !== $response->code) {
          throw new InstagramException('Response code is ' . $response->code . '. Body: ' . static::getErrorBody($response->body) . ' Something went wrong. Please report issue.');
      }
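      // Capture only the JSON assigned to window._sharedData (non-greedy, up to the closing tag).
      if (!preg_match('/window\._sharedData\s*=\s*(\{.+?\});<\/script>/', $response->raw_body, $matches)) {
          throw new InstagramException('Could not find window._sharedData in the account page.');
      }
      $data = json_decode($matches[1], true, 512, JSON_BIGINT_AS_STRING);
      // Guard the whole nested path so a changed page structure yields an empty result, not a notice.
      if (empty($data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'])) {
          return [];
      }
      $nodes = $data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges'];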
      $medias = [];
      foreach ($nodes as $mediaArray) {
        $medias[] = Media::create($mediaArray['node']);
      }
      return $medias;
    }

@knissophiliac you can get rhx_gis with just a GET on ‘https://www.instagram.com’. It’s also returned in the HTML of most pages.
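A minimal Python sketch of pulling rhx_gis out of a fetched page. It assumes the value is a top-level key of the window._sharedData JSON (as $page_data['rhx_gis'] in the PHP snippet earlier in the thread suggests), and the helper name extract_rhx_gis is hypothetical. It only parses HTML that you have already downloaded:

```python
import json
import re

# Assumption: the page embeds "window._sharedData = {...};</script>" on a script tag.
_SHARED_DATA = re.compile(r"window\._sharedData\s*=\s*(\{.*?\});</script>", re.DOTALL)


def extract_rhx_gis(html):
    """Return the rhx_gis value found in window._sharedData, or None if absent."""
    match = _SHARED_DATA.search(html)
    if match is None:
        return None
    try:
        shared = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    return shared.get("rhx_gis")
```

As noted elsewhere in the thread, the page should be fetched with the same user agent that later requests will use.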

@350d It’s still working for me…

@andrewyoo x-instagram-gis is calculated from csrf_token, rhx_gis, window.navigator.userAgent and the variables from the API call. Here is my refactored hashing function:

function gishash(n,r,t){function e(n,r){var t=(65535&n)+(65535&r);return(n>>16)+(r>>16)+(t>>16)<<16|65535&t}function o(n,r,t,o,u,c){return e((f=e(e(r,n),e(o,c)))<<(a=u)|f>>>32-a,t);var f,a}function u(n,r,t,e,u,c,f){return o(r&t|~r&e,n,r,u,c,f)}function c(n,r,t,e,u,c,f){return o(r&e|t&~e,n,r,u,c,f)}function f(n,r,t,e,u,c,f){return o(r^t^e,n,r,u,c,f)}function a(n,r,t,e,u,c,f){return o(t^(r|~e),n,r,u,c,f)}function i(n,r){var t,o,i,h,g;n[r>>5]|=128<<r%32,n[14+(r+64>>>9<<4)]=r;var v=1732584193,d=-271733879,l=-1732584194,A=271733878;for(t=0;t<n.length;t+=16)o=v,i=d,h=l,g=A,d=a(d=a(d=a(d=a(d=f(d=f(d=f(d=f(d=c(d=c(d=c(d=c(d=u(d=u(d=u(d=u(d,l=u(l,A=u(A,v=u(v,d,l,A,n[t],7,-680876936),d,l,n[t+1],12,-389564586),v,d,n[t+2],17,606105819),A,v,n[t+3],22,-1044525330),l=u(l,A=u(A,v=u(v,d,l,A,n[t+4],7,-176418897),d,l,n[t+5],12,1200080426),v,d,n[t+6],17,-1473231341),A,v,n[t+7],22,-45705983),l=u(l,A=u(A,v=u(v,d,l,A,n[t+8],7,1770035416),d,l,n[t+9],12,-1958414417),v,d,n[t+10],17,-42063),A,v,n[t+11],22,-1990404162),l=u(l,A=u(A,v=u(v,d,l,A,n[t+12],7,1804603682),d,l,n[t+13],12,-40341101),v,d,n[t+14],17,-1502002290),A,v,n[t+15],22,1236535329),l=c(l,A=c(A,v=c(v,d,l,A,n[t+1],5,-165796510),d,l,n[t+6],9,-1069501632),v,d,n[t+11],14,643717713),A,v,n[t],20,-373897302),l=c(l,A=c(A,v=c(v,d,l,A,n[t+5],5,-701558691),d,l,n[t+10],9,38016083),v,d,n[t+15],14,-660478335),A,v,n[t+4],20,-405537848),l=c(l,A=c(A,v=c(v,d,l,A,n[t+9],5,568446438),d,l,n[t+14],9,-1019803690),v,d,n[t+3],14,-187363961),A,v,n[t+8],20,1163531501),l=c(l,A=c(A,v=c(v,d,l,A,n[t+13],5,-1444681467),d,l,n[t+2],9,-51403784),v,d,n[t+7],14,1735328473),A,v,n[t+12],20,-1926607734),l=f(l,A=f(A,v=f(v,d,l,A,n[t+5],4,-378558),d,l,n[t+8],11,-2022574463),v,d,n[t+11],16,1839030562),A,v,n[t+14],23,-35309556),l=f(l,A=f(A,v=f(v,d,l,A,n[t+1],4,-1530992060),d,l,n[t+4],11,1272893353),v,d,n[t+7],16,-155497632),A,v,n[t+10],23,-1094730640),l=f(l,A=f(A,v=f(v,d,l,A,n[t+13],4,681279174),d,l,n[t],11,-358537222),v,d,n[t+3],16,-722521979),A,v,n[t+6],23,76029189),l=f(l,A=
f(A,v=f(v,d,l,A,n[t+9],4,-640364487),d,l,n[t+12],11,-421815835),v,d,n[t+15],16,530742520),A,v,n[t+2],23,-995338651),l=a(l,A=a(A,v=a(v,d,l,A,n[t],6,-198630844),d,l,n[t+7],10,1126891415),v,d,n[t+14],15,-1416354905),A,v,n[t+5],21,-57434055),l=a(l,A=a(A,v=a(v,d,l,A,n[t+12],6,1700485571),d,l,n[t+3],10,-1894986606),v,d,n[t+10],15,-1051523),A,v,n[t+1],21,-2054922799),l=a(l,A=a(A,v=a(v,d,l,A,n[t+8],6,1873313359),d,l,n[t+15],10,-30611744),v,d,n[t+6],15,-1560198380),A,v,n[t+13],21,1309151649),l=a(l,A=a(A,v=a(v,d,l,A,n[t+4],6,-145523070),d,l,n[t+11],10,-1120210379),v,d,n[t+2],15,718787259),A,v,n[t+9],21,-343485551),v=e(v,o),d=e(d,i),l=e(l,h),A=e(A,g);return[v,d,l,A]}function h(n){var r,t="",e=32*n.length;for(r=0;r<e;r+=8)t+=String.fromCharCode(n[r>>5]>>>r%32&255);return t}function g(n){var r,t=[];for(t[(n.length>>2)-1]=void 0,r=0;r<t.length;r+=1)t[r]=0;var e=8*n.length;for(r=0;r<e;r+=8)t[r>>5]|=(255&n.charCodeAt(r/8))<<r%32;return t}function v(n){var r,t,e="";for(t=0;t<n.length;t+=1)r=n.charCodeAt(t),e+="0123456789abcdef".charAt(r>>>4&15)+"0123456789abcdef".charAt(15&r);return e}function d(n){return unescape(encodeURIComponent(n))}function l(n){return h(i(g(r=d(n)),8*r.length));var r}return v(l(r+":"+t+":"+window.navigator.userAgent+":"+n))}

Call the function like this: gishash("{\"id\":\"5821462185\",\"first\":40,\"after\":\"\"}", rhx_gis, csrf_token). rhx_gis and csrf_token can be parsed from the source of any embed page (CORS is available on these links).

I’ve tried to achieve this via JavaScript, but here is the problem: I can’t set these custom headers because of the allow-origin restriction on custom headers on Instagram’s side. I guess this is not a problem in PHP.
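Since the minified function appears to be a plain MD5 over "rhx_gis:csrf_token:userAgent:variables" (a later comment in this thread says as much), a server-side equivalent can be sketched in Python. The name gis_hash and the explicit user_agent parameter are illustrative; the browser version reads window.navigator.userAgent instead:

```python
import hashlib


def gis_hash(variables_json, rhx_gis, csrf_token, user_agent):
    """Return md5("<rhx_gis>:<csrf_token>:<user_agent>:<variables_json>") in hex."""
    raw = ":".join([rhx_gis, csrf_token, user_agent, variables_json])
    # The JS version UTF-8 encodes the string before hashing, so do the same here.
    return hashlib.md5(raw.encode("utf-8")).hexdigest()
```

The user_agent passed here has to be the same string that is sent in the request's user-agent header.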

@raiym @rhcarlosweb @gthedev hi! I really don’t know PHP to help with this one, but maybe the quick hotfix would be:

  1. Change the actual URL in https://github.com/postaddictme/instagram-php-scraper/blob/master/src/InstagramScraper/Endpoints.php from: ACCOUNT_MEDIAS = 'https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first={count}&after={max_id}'; to: ACCOUNT_MEDIAS = 'https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables={"id":"{user_id}","first":{count},"after":"{max_id}"}';
  2. Send the following cookies with the request. I just checked: the cookies I retrieved yesterday still work (one day on) and from different clients, without needing to fetch new ones before each request. So the automated browser part might be omitted for now:
[
    {
        "domain": "www.instagram.com",
        "httpOnly": false,
        "name": "rur",
        "path": "/",
        "secure": false,
        "value": "PRN"
    },
    {
        "domain": "www.instagram.com",
        "httpOnly": false,
        "name": "ig_vw",
        "path": "/",
        "secure": false,
        "value": "1038"
    },
    {
        "domain": "www.instagram.com",
        "expiry": 1554672942.248612,
        "httpOnly": false,
        "name": "csrftoken",
        "path": "/",
        "secure": true,
        "value": "ObRXje2ByOUmAnxqPaoFsD0CHvBEK8dQ"
    },
    {
        "domain": "www.instagram.com",
        "expiry": 2153943342.248646,
        "httpOnly": false,
        "name": "mid",
        "path": "/",
        "secure": false,
        "value": "WsqLMgALAAFkkaMz9rbL568BCU5N"
    },
    {
        "domain": "www.instagram.com",
        "httpOnly": false,
        "name": "ig_vh",
        "path": "/",
        "secure": false,
        "value": "532"
    },
    {
        "domain": "www.instagram.com",
        "httpOnly": false,
        "name": "ig_pr",
        "path": "/",
        "secure": false,
        "value": "2.5"
    }
]

Maybe this is not a final solution, but at least media queries will work (for some time 😅)
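To illustrate the URL change from step 1, here is a small Python sketch that builds the query_hash URL with the variables JSON percent-encoded. The helper name build_media_url is hypothetical, and the query_hash is simply the value quoted above, which may stop working at any time:

```python
import json
from urllib.parse import urlencode

# Value quoted in this thread; Instagram may rotate it.
QUERY_HASH = "42323d64886122307be10013ad2dcc44"


def build_media_url(user_id, count=12, after=""):
    """Build the GraphQL media URL with URL-encoded JSON variables."""
    variables = json.dumps(
        {"id": str(user_id), "first": count, "after": after},
        separators=(",", ":"),
    )
    query = urlencode({"query_hash": QUERY_HASH, "variables": variables})
    return "https://www.instagram.com/graphql/query/?" + query
```

The cookies listed above would then be sent along with a GET request to the returned URL.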

I made a solution for this one, but in Python, using an automated browser to retrieve the cookies and the new URL. I really don’t know what a PHP implementation would look like, but these are the steps:

  1. Get cookies with an automated browser
  2. Make a request with these cookies and the new URL: 'https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables={"id":"<user_id>","first":<items_to_retrieve>,"after":"<end_cursor>"}', where <end_cursor> is either blank or the end_cursor from the previous request, and <items_to_retrieve> is the page size (Instagram web uses 12; I tested successfully with 20).

Disclaimer 1: no authorization needed! Disclaimer 2: I actually reused the same cookies several times and it worked. The expiry seems to be set to one year. But I don’t know whether Instagram will catch the use of the same cookies from many different clients if they are hardcoded into this scraper!

Python implementation:

# ! Error handling is omitted for clarity
import requests
from selenium import webdriver

media_url = 'https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables={"id":"%s","first":20,"after":"%s"}'
browser = webdriver.Chrome()

# first get https://instagram.com to obtain cookies
browser.get('https://instagram.com')
browser_cookies = browser.get_cookies()

# set a session with cookies
session = requests.Session()
for cookie in browser_cookies:
    c = {cookie['name']: cookie['value']}
    session.cookies.update(c)

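# get response as JSON (TLS verification is left enabled, which is the requests default)
# > using id '25025320' - profile of Instagram for this example
response = session.get(media_url % ('25025320', '')).json()

# close the automated browser once the cookies have been copied
browser.quit()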

@Scottzonn something like this still works 😃 …

function getMediasFromURL($username, $count = 12)
{
	$medias = array();
	// Instagram serves HTML5, so keep libxml's parser warnings out of the output.
	libxml_use_internal_errors(true);
	$doc = new DOMDocument();
	$doc->loadHTML(file_get_contents('https://www.instagram.com/'.$username.'/'));
	$jsNodes = $doc->getElementsByTagName("script");
	$jsNodeTmp = "";
	foreach($jsNodes as $node){
		if(strpos($node->nodeValue,"window._sharedData")!==false){
			$jsNodeTmp = $node->nodeValue;
			break;
		}
	}
	if($jsNodeTmp){
		// Strip the "window._sharedData = " prefix and the trailing ";" to leave bare JSON.
		$jsNodeTmp = trim(str_replace("window._sharedData","",$jsNodeTmp)," ;=");
		$json = json_decode($jsNodeTmp);
		// isset() is false for invalid JSON or a changed page structure, so $medias stays empty.
		if(isset($json->entry_data->ProfilePage[0]->graphql->user->edge_owner_to_timeline_media->edges)){
			$jsonMedia = $json->entry_data->ProfilePage[0]->graphql->user->edge_owner_to_timeline_media->edges;
			$medias = array_slice($jsonMedia, 0, $count);
		}
	}
	return $medias;
}

But in this case the scraper isn’t really necessary… I hope a real solution will be found.

I just solved the problem by parsing the HTML page of the account and then taking the JSON from the JavaScript. Yes, it is just 12 medias, but it works 😃 I “love” Instagram more and more )))

The medias seem to work now with the latest changes and a logged-in account; however, it quickly hits the limit. Body: message => rate limited; status => fail;

Does anyone know more details on the limits? What is the limit, is it based on user/ip/both?

The maximum is 200 requests per hour! See the details here: https://stackoverflow.com/questions/49585077/instagram-api-limit-reduced-to-200-from-5000
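If the 200-per-hour figure quoted above does apply to your requests (nobody in this thread has confirmed how the limit is counted), a client-side throttle is one way to stay under it. A minimal sliding-window sketch in Python; the class name HourlyThrottle and the injectable clock and sleep parameters are illustrative choices, not part of any library:

```python
import time
from collections import deque


class HourlyThrottle:
    """Block so that at most `limit` calls start within any `window` seconds."""

    def __init__(self, limit=200, window=3600.0, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self.clock = clock
        self.sleep = sleep
        self.calls = deque()  # start times of recent calls, oldest first

    def wait(self):
        now = self.clock()
        # Forget calls that have already left the sliding window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.limit:
            # Sleep until the oldest recorded call expires, then drop it.
            self.sleep(self.window - (now - self.calls[0]))
            now = self.clock()
            self.calls.popleft()
        self.calls.append(now)
```

Call throttle.wait() before each request. Since the limit looks like it is per IP, all workers behind the same address would need to share one throttle.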

https://www.instagram.com/vasiliizaikovskii/?__a=1 this works!

This doesn’t work today.
Got 403 status. 😦

@footniko I used the following headers which are working for anonymous crawling:

{
        "x-instagram-gis" => gistoken, 
        "cookie" => "csrftoken=#{csrf_token}",
        "user-agent" => 'user agent string'
}
  1. gistoken is calculated with @350d’s function
  2. csrftoken is sent in the cookie
  3. the user-agent string is also passed, so that gistoken can be recalculated and compared on the server side.
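The same header set, written as a Python dict that could be passed to an HTTP client such as requests. The function name build_anonymous_headers is hypothetical, and gis_token is assumed to have been computed already with the hashing function discussed above:

```python
def build_anonymous_headers(gis_token, csrf_token, user_agent):
    """Assemble the three headers described above for an anonymous request."""
    return {
        "x-instagram-gis": gis_token,
        "cookie": f"csrftoken={csrf_token}",
        # Must be the same string that was used when computing gis_token.
        "user-agent": user_agent,
    }
```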

I’ve just realized that x-instagram-gis is just an md5 hash 😀

Strange, because I tested with a blank value, $this->userSession['ig_pr'] = "";, and it works too…

Hmm… maybe Instagram just expects this cookie name, no matter the value. Setting it to some random value, e.g. 42, works fine too!

But yes, when ig_pr is not present, it returns a 403.

Nice user private data protection system, anyway 😅