grab: Python 3.5 - Unable to build DOM tree.

  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:79801)
  File "src/lxml/parser.pxi", line 1799, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:116219)
  File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116494)
  File "src/lxml/parser.pxi", line 1700, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115040)
  File "src/lxml/parser.pxi", line 1040, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109165)
  File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103404)
  File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105058)
  File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:103967)
  File "<string>", line None
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1

The traceback is preceded by this parser output:

encoding error : input conversion failed due to input error, bytes 0x21 0x00 0x00 0x00
encoding error : input conversion failed due to input error, bytes 0x44 0x00 0x00 0x00
I/O error : encoder error
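
One possible reading of these byte sequences (an assumption, not something confirmed in this thread): each looks like a single ASCII character stored in a 4-byte little-endian encoding such as UTF-32LE, which would fit the start of a `<!DOCTYPE` declaration. A quick check in Python:

```python
# Bytes copied from the two "encoding error" lines above.
reported = [
    bytes([0x21, 0x00, 0x00, 0x00]),
    bytes([0x44, 0x00, 0x00, 0x00]),
]

# Assumption: the parser input was in a 4-byte little-endian
# encoding (UTF-32LE). This is a guess based on the byte layout.
decoded = [chunk.decode('utf-32-le') for chunk in reported]
print(decoded)
```

The two sequences decode to `!` and `D`. If that reading is right, the failure would be in the encoding conversion of the input rather than in the page content itself.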

Example:

from grab.spider import Spider, Task


class Scraper(Spider):
    def task_generator(self):
        urls = [
            'https://au.linkedin.com/directory/people-a/',
            'https://www.linkedin.com/directory/people-a/',
        ]
        for url in urls:
            yield Task('url', url=url)

    def task_url(self, grab, task):
        # The XPath query makes grab build the DOM tree with lxml,
        # which is where the XMLSyntaxError above is raised.
        links = grab.doc('//div[@class="columns"]//ul/li[@class="content"]/a')


bot = Scraper()
bot.run()

This happens only on some pages; perhaps lxml fails to detect the correct encoding.

About this issue

  • State: closed
  • Created 8 years ago
  • Comments: 23 (15 by maintainers)

Most upvoted comments

Solution (assuming you’re using a virtualenv):

Install libxml2 and libxslt using brew.

Uninstall lxml:

pip uninstall lxml

Reinstall lxml with statically linked dependencies:

STATIC_DEPS=true pip install lxml --no-cache-dir
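
After reinstalling, it may help to check which libxml2 version lxml reports. The sketch below uses the version constants exposed by `lxml.etree`; the helper name `lxml_build_info` is made up for illustration. It does not prove that the build is statically linked. It only shows the versions lxml was compiled against and is running with.

```python
def lxml_build_info():
    """Return lxml/libxml2 version tuples, or None if lxml is not installed."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        'lxml': etree.LXML_VERSION,
        'libxml2 (runtime)': etree.LIBXML_VERSION,
        'libxml2 (compiled against)': etree.LIBXML_COMPILED_VERSION,
    }


print(lxml_build_info())
```

A mismatch between the runtime and compiled-against versions would suggest lxml is still picking up a different system library.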

https://github.com/oiwn/grab-reproduce

Running this code produces:

(grab) ➜ oiwn@mylaptop  ~/projects/grab-reproduce git:(master) ✗ python github.py
/Users/oiwn/.virtualenvs/grab/lib/python3.5/site-packages/grab/deprecated.py:250: GrabDeprecationWarning: The `Grab.response` attribute is deprecated. Use `Grab.doc` instead.
  warn('The `Grab.response` attribute is deprecated. '
http://localhost:8000/showcases/virtual-reality
http://localhost:8000/showcases/software-defined-radio
http://localhost:8000/showcases/tools-for-open-source
http://localhost:8000/showcases/open-source-integrations
http://localhost:8000/showcases/serverless-architecture
http://localhost:8000/showcases/emoji
http://localhost:8000/showcases/web-application-frameworks
http://localhost:8000/showcases/hacking-minecraft
http://localhost:8000/showcases/web-accessibility
http://localhost:8000/showcases/github-browser-extensions
http://localhost:8000/showcases/great-for-new-contributors
http://localhost:8000/showcases/productivity-tools
http://localhost:8000/showcases/javascript-game-engines
http://localhost:8000/showcases/projects-that-power-github-for-mac
http://localhost:8000/showcases/game-engines
(grab) ➜ oiwn@mylaptop  ~/projects/grab-reproduce git:(master) ✗ python --version
Python 3.5.2

Additional info:

http://louistiao.me/posts/installing-lxml-on-mac-osx-1011-inside-a-virtualenv-with-pip/
http://lxml.de/build.html#building-lxml-on-macos-x

@rickwargo @Alex-Just

Maybe report this upstream (to lxml)?

It’s probably a bug in lxml: https://bugs.launchpad.net/lxml/+bug/1538213

There is an option in grab to fix “special entities”: https://github.com/lorien/grab/blob/3d094a3984bade85266f0adfd2f5e341dce37347/grab/base.py#L169

Perhaps, as a temporary workaround, emoji could be removed with a regexp before building the DOM tree with lxml.
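
A minimal sketch of that workaround for Python 3, assuming that "emoji" can be approximated as any code point outside the Basic Multilingual Plane. The function name `strip_non_bmp` is hypothetical and not part of grab or lxml. This is a blunt filter: it also drops other non-BMP characters (for example rare CJK ideographs) and leaves emoji that live inside the BMP untouched.

```python
import re

# Assumption: treat every code point above U+FFFF as "emoji".
_NON_BMP = re.compile('[\U00010000-\U0010FFFF]')


def strip_non_bmp(text):
    """Remove all characters outside the Basic Multilingual Plane."""
    return _NON_BMP.sub('', text)


# The decoded page body would be cleaned like this before parsing.
cleaned = strip_non_bmp('<p>ok \U0001F600 done</p>')
print(cleaned)
```

The cleaned string could then be handed to the parser in place of the original body; whether this actually avoids the error on the affected pages has not been verified here.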

Interesting… It works with Python 2.7.13 but fails with 3.4.3, 3.5.2, and 3.6.0.