requests-html: render() triggers website protections

Hi

I’ve just started working with requests-html, and render() seems to be triggering a website protection mechanism; I’m not sure why. The URL below loads a number of .js scripts with the data I’m trying to access. The first r.html.text below just returns “Loading…” for this data, as the JS scripts haven’t run yet. After r.html.render(), the page rejects the request. Any advice on what is going wrong here and how to get around it would be very much appreciated. Code below:

`from requests_html import HTMLSession

session = HTMLSession()

r = session.get("https://register.epo.org/application?number=EP16190441&lng=en&tab=federated")

print(r.html.text)

r.html.render()

print(r.html.text)`

About this issue

  • State: open
  • Created 5 years ago
  • Comments: 15 (1 by maintainers)

Most upvoted comments

Hi @LaurT. I’m facing a similar issue to yours, and @Ecript’s solution worked perfectly for me! With a small addition I managed to get it working for your page as well.

By investigating the underlying traffic I found that the quotation marks around 'Testing' in session = HTMLSession(browser_args=["--no-sandbox", "--user-agent='Testing'"]) were being included in the User-Agent header of the GET requests. Removing the quotation marks did the trick.

The code below prints your webpage including javascript loaded content.

from requests_html import HTMLSession
session = HTMLSession(browser_args=["--no-sandbox", '--user-agent=Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1'])
r = session.get('https://register.epo.org/application?number=EP16190441&lng=en&tab=federated')
# This is necessary for your webpage in particular because it takes around 13 seconds for the page to load.
r.html.render(timeout=15)
print(r.html.text)
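
The quoting problem described above can be reproduced without a browser. This is only a sketch: the helper name user_agent_flag is made up for illustration, and the explanation assumes that browser_args reaches Chromium as a plain argument list without passing through a shell, which is consistent with the quotes showing up in the headers. On a shell command line the quotes would be consumed; inside a Python list they are part of the value.

```python
import shlex


def user_agent_flag(user_agent):
    """Build the Chromium flag, dropping any quotes wrapped around the value.

    Hypothetical helper, not part of requests-html.
    """
    return "--user-agent=" + user_agent.strip("'\"")


# Typed in a shell, the quotes are consumed before Chromium sees the argument.
shell_argv = shlex.split("chromium --user-agent='Testing'")
print(shell_argv[1])            # --user-agent=Testing

# In a Python list there is no shell, so the quotes stay in the value.
quoted = "--user-agent='Testing'"
print(quoted.split("=", 1)[1])  # 'Testing'

# The helper gives the unquoted form that worked in the comment above.
print(user_agent_flag("'Testing'"))  # --user-agent=Testing
```

Spaces inside the user agent string are fine for the same reason: each list element is handed over as one argument.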

This is fantastic, thank you! Would you be able to post the code that you got working? I’m getting a series of errors (see below) when trying to run render() with those arguments. I assume I’m doing something silly, but I have tried a number of different permutations and just can’t get it to work:

from requests_html import HTMLSession
url = "https://register.epo.org/application?number=EP16190441&lng=en&tab=federated"
Testing = "Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"
session = HTMLSession(browser_args=["--no-sandbox", "--user-agent='Testing'"])
r = session.get(url)
r.html.render()
print(r.html.text)

Traceback (most recent call last):
  File "…test.py", line 8, in <module>
    r.html.render()
  File "…\AppData\Roaming\Python\Python37\site-packages\requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "C:\Program Files\Python37\Lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "…\AppData\Roaming\Python\Python37\site-packages\requests_html.py", line 537, in _async_render
    await page.close()
  File "~\AppData\Roaming\Python\Python37\site-packages\pyppeteer\page.py", line 1465, in close
    {'targetId': self._target._targetId})
pyppeteer.errors.NetworkError: Protocol error Target.closeTarget: Target closed.

Okay, I figured out how to change the user agent without altering the base code.

from requests_html import HTMLSession
# Sets the user agent to whatever you choose
session = HTMLSession(browser_args=["--no-sandbox", "--user-agent='Testing'"])
r = session.get(url)
r.html.render()

The “no-sandbox” option is passed in by default, so you need to include it explicitly to make sure it still gets through when you override the browser_args argument. Those two arguments are passed to the Chromium session that is created when render() is called, not before. Until you call render() it will still use the default Chromium user agent unless you change it using the method I outlined in previous comments.
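
A rough sketch of the two points above, with assumptions called out: build_browser_args and apply_user_agent are hypothetical helper names, not part of requests-html. The first re-adds --no-sandbox whenever browser_args is overridden. The second sets the User-Agent header for plain session.get() calls, on the assumption that HTMLSession exposes a requests-style headers mapping and that this header is separate from the Chromium flag used at render() time. A stand-in object is used so the sketch runs without requests-html or Chromium installed.

```python
from types import SimpleNamespace

FIREFOX_UA = "Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"


def build_browser_args(user_agent, extra_args=()):
    """Return Chromium flags suitable for HTMLSession(browser_args=...).

    --no-sandbox is re-added because overriding browser_args replaces
    the default list instead of extending it.
    """
    args = ["--no-sandbox", f"--user-agent={user_agent}"]
    args.extend(a for a in extra_args if a not in args)
    return args


def apply_user_agent(session, user_agent):
    """Set the header sent by plain session.get() calls, before render()."""
    session.headers["User-Agent"] = user_agent
    return session


# Stand-in for a real session: only the .headers mapping is needed here.
fake_session = SimpleNamespace(headers={})
apply_user_agent(fake_session, FIREFOX_UA)

print(build_browser_args(FIREFOX_UA))
print(fake_session.headers["User-Agent"])

# With the real library this would presumably look like:
#   session = HTMLSession(browser_args=build_browser_args(FIREFOX_UA))
#   apply_user_agent(session, FIREFOX_UA)
```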


I tested this solution with your original issue, and it fixed the problem!

I traded out the default Chromium user agent with a different one I found here: https://developers.whatismybrowser.com/useragents/explore/software_name/firefox/

I used the user agent Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 and executed the code in your original comment, and that fixed your issue. I was loading content from the page that needed JavaScript to render. The site was looking at the user agent to determine whether you were a bot.
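
To illustrate the kind of check a site might run, here is a purely hypothetical server-side filter. The actual rules used by the site in question are unknown; the marker list and the function name looks_automated are assumptions. Headless Chromium builds typically identify themselves with a HeadlessChrome token, and the headless string below is only an illustrative example of that shape.

```python
# Hypothetical markers; a real site may check entirely different things.
SUSPICIOUS_MARKERS = ("headlesschrome", "python-requests")


def looks_automated(user_agent):
    """Return True if the user agent contains a known automation marker."""
    ua = user_agent.lower()
    return any(marker in ua for marker in SUSPICIOUS_MARKERS)


# Illustrative headless user agent (version numbers are an example only).
headless_ua = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) HeadlessChrome/71.0.3542.0 Safari/537.36"
)
firefox_ua = "Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"

print(looks_automated(headless_ua))  # True
print(looks_automated(firefox_ua))   # False
```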

This raises the question of whether the default user agent in the source should be changed to something else, perhaps one from a more recent device, but I think that depends on how many of these issues crop up. I’ll leave it to Kenneth to decide that.

Best of luck!