requests-html: render() triggers website protections
Hi
I’ve just started working with requests-html, and render() seems to be triggering a website protection mechanism, and I’m not sure why. The below URL loads a number of .js scripts with data I’m trying to access. For the first r.html.text below, this data just returns “Loading…” as the js scripts haven’t yet run. After r.html.render(), the page rejects the request. Any advice on what is going wrong here and how to circumvent it would be very much appreciated - code below:
```python
from requests_html import HTMLSession

session = HTMLSession()
r = session.get("https://register.epo.org/application?number=EP16190441&lng=en&tab=federated")
print(r.html.text)  # shows "Loading…" placeholders: the JS hasn't run yet
r.html.render()
print(r.html.text)
```
About this issue
- State: open
- Created 5 years ago
- Comments: 15 (1 by maintainers)
Hi @LaurT. I'm facing a similar issue to yours, and @Ecript's solution worked perfectly for me! With a small addition I managed to get it working for your page as well.
By investigating the underlying traffic I found that the quotation marks around `'Testing'` in `session = HTMLSession(browser_args=["--no-sandbox", "--user-agent='Testing'"])` were being included in the GET request headers. Removing the quotation marks did the trick. The code below prints your webpage including JavaScript-loaded content.
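To illustrate the pitfall: each `browser_args` entry is handed verbatim to Chromium, so quotes inside `--user-agent=...` become part of the header value itself. A sketch of the fix (the `fetch_rendered` helper name is mine, not from the thread; running it requires requests-html and its bundled Chromium):

```python
# Quotes inside a browser_args value are sent literally, so the server
# would see the header as:  User-Agent: 'Testing'
bad_args = ["--no-sandbox", "--user-agent='Testing'"]

# Correct: no extra quotes around the value.
good_args = ["--no-sandbox", "--user-agent=Testing"]

def fetch_rendered(url, browser_args):
    """Hypothetical helper: render a JS-heavy page with custom Chromium args.

    The import is deferred so the quoting logic above can be checked
    without requests-html installed.
    """
    from requests_html import HTMLSession  # needs requests-html + Chromium
    session = HTMLSession(browser_args=browser_args)
    r = session.get(url)
    r.html.render()  # Chromium launches here, with browser_args applied
    return r.html.text

if __name__ == "__main__":
    print(fetch_rendered(
        "https://register.epo.org/application?number=EP16190441&lng=en&tab=federated",
        good_args,
    ))
```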
This is fantastic thank you! Would you be able to post the code that you got working? I’m getting a series of errors (see below) when trying to run render() with those arguments. I assume I’m doing something silly, but have tried a number of different permutations and just can’t get it to work:
```
Traceback (most recent call last):
  File "…test.py", line 8, in <module>
    r.html.render()
  File "…\AppData\Roaming\Python\Python37\site-packages\requests_html.py", line 598, in render
    content, result, page = self.session.loop.run_until_complete(self._async_render(url=self.url, script=script, sleep=sleep, wait=wait, content=self.html, reload=reload, scrolldown=scrolldown, timeout=timeout, keep_page=keep_page))
  File "C:\Program Files\Python37\Lib\asyncio\base_events.py", line 584, in run_until_complete
    return future.result()
  File "…\AppData\Roaming\Python\Python37\site-packages\requests_html.py", line 537, in _async_render
    await page.close()
  File "~\AppData\Roaming\Python\Python37\site-packages\pyppeteer\page.py", line 1465, in close
    {'targetId': self._target._targetId})
pyppeteer.errors.NetworkError: Protocol error Target.closeTarget: Target closed.
```
Okay, I figured out how to change the user agent without altering the base code.
The `--no-sandbox` option is passed in by default, so you include it there to make sure it still makes it through when you override the `browser_args` argument. Those two arguments get passed to the Chromium session that is created when `render()` is called, not before. Until you call `render()`, it will still use the default Chromium user agent unless you change it using the method I outlined in previous comments. I tested this solution with your original issue, and it fixed the problem!
I traded out the default Chromium user agent with a different one I found here: https://developers.whatismybrowser.com/useragents/explore/software_name/firefox/
I used the user agent `Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1` and executed the code in your original comment, and that fixed your issue. I was loading content from the page that needed JavaScript to render. The site was looking at the user agent to determine whether you were a bot.

This raises the question of whether the default user agent in the source should be changed to something more recent, but I think that depends on how many of these issues crop up. I'll leave it to Kenneth to decide.
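For completeness, a minimal version of that working setup might look like the sketch below (the URL is the one from the original post; whether it succeeds still depends on the site's bot detection, and the `main` wrapper is my addition):

```python
# Spoof a Firefox user agent so the site's bot check passes.
# Note: no quotation marks around the UA value (see earlier comments).
FIREFOX_UA = "Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"

def main():
    from requests_html import HTMLSession  # deferred: needs requests-html + Chromium
    session = HTMLSession(
        browser_args=["--no-sandbox", f"--user-agent={FIREFOX_UA}"]
    )
    r = session.get("https://register.epo.org/application?number=EP16190441&lng=en&tab=federated")
    r.html.render()     # Chromium launches here with the args above
    print(r.html.text)  # now includes JavaScript-loaded content

if __name__ == "__main__":
    main()
```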
Best of luck!