grab: Python 3.5 - Unable to build DOM tree.
File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:79801)
File "src/lxml/parser.pxi", line 1799, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:116219)
File "src/lxml/parser.pxi", line 1819, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:116494)
File "src/lxml/parser.pxi", line 1700, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:115040)
File "src/lxml/parser.pxi", line 1040, in lxml.etree._BaseParser._parseUnicodeDoc (src/lxml/lxml.etree.c:109165)
File "src/lxml/parser.pxi", line 573, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:103404)
File "src/lxml/parser.pxi", line 683, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:105058)
File "src/lxml/parser.pxi", line 613, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:103967)
File "<string>", line None
lxml.etree.XMLSyntaxError: switching encoding: encoder error, line 1, column 1
With preceding:
encoding error : input conversion failed due to input error, bytes 0x21 0x00 0x00 0x00
encoding error : input conversion failed due to input error, bytes 0x44 0x00 0x00 0x00
I/O error : encoder error
Example:
from grab.spider import Spider, Task

class Scraper(Spider):
    def task_generator(self):
        urls = [
            'https://au.linkedin.com/directory/people-a/',
            'https://www.linkedin.com/directory/people-a/',
        ]
        for url in urls:
            yield Task('url', url=url)

    def task_url(self, grab, task):
        # Running the XPath query is what makes lxml build the DOM tree.
        links = grab.doc('//div[@class="columns"]//ul/li[@class="content"]/a')

bot = Scraper()
bot.run()
This happens on some pages; perhaps lxml fails to detect the correct encoding.
About this issue
- State: closed
- Created 8 years ago
- Comments: 23 (15 by maintainers)
Commits related to this issue
- Add lxml test for issue #199 — committed to lorien/grab by lorien 6 years ago
- Add HTML file for lxml test for issue #199 — committed to lorien/grab by lorien 6 years ago
- Merge pull request #332 from lorien/issue_199_macos_lxml Add lxml test for issue #199 — committed to lorien/grab by lorien 6 years ago
Solution (assuming you’re using virtualenv):
- install libxml2 and libxslt using brew
- uninstall lxml
- install lxml with statically linked dependencies
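After reinstalling, it may help to confirm which libxml2 and libxslt the rebuilt lxml actually reports. Below is a minimal sketch that reads the version tuples exposed by lxml.etree. The helper name describe_lxml_build is made up for illustration and is not part of grab or lxml.

```python
def describe_lxml_build():
    """Return the versions lxml reports, or None if lxml is not importable."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        'lxml': etree.LXML_VERSION,
        # "compiled" is what lxml was built against, "runtime" is what is loaded now.
        'libxml2 compiled': etree.LIBXML_COMPILED_VERSION,
        'libxml2 runtime': etree.LIBXML_VERSION,
        'libxslt compiled': etree.LIBXSLT_COMPILED_VERSION,
        'libxslt runtime': etree.LIBXSLT_VERSION,
    }


info = describe_lxml_build()
if info is None:
    print('lxml is not installed in this environment')
else:
    for name, version in info.items():
        print(name, version)
```

Running this inside the virtualenv before and after the reinstall shows whether the libxml2 version in use actually changed.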
Reproduction: https://github.com/oiwn/grab-reproduce
Results of running that code:
Additional info:
- http://louistiao.me/posts/installing-lxml-on-mac-osx-1011-inside-a-virtualenv-with-pip/
- http://lxml.de/build.html#building-lxml-on-macos-x
@rickwargo @Alex-Just
maybe report to upstream (lxml)?
It’s probably a bug in lxml: https://bugs.launchpad.net/lxml/+bug/1538213
There is an option in grab to fix “special entities”: https://github.com/lorien/grab/blob/3d094a3984bade85266f0adfd2f5e341dce37347/grab/base.py#L169
Perhaps, as a temporary workaround, emoji could be removed with a regexp before lxml builds the DOM tree.
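A rough sketch of that regexp idea, assuming emoji really are the trigger: it drops every character outside the Basic Multilingual Plane, which covers most emoji but also removes other rare symbols. The function name strip_non_bmp is hypothetical, and how to hook it into grab is not shown here.

```python
import re

# Any code point from U+10000 upwards, i.e. outside the Basic Multilingual Plane.
# Most emoji live in this range; this is an approximation, not an emoji list.
NON_BMP_RE = re.compile('[\U00010000-\U0010FFFF]')


def strip_non_bmp(text):
    """Return text with all non-BMP characters removed."""
    return NON_BMP_RE.sub('', text)


sample = 'Hello \U0001F600 world'
print(strip_non_bmp(sample))
```

The cleaned string would then be passed to the parser in place of the raw page body. On Python 3 every str holds full code points, so the character range above behaves the same on 3.4, 3.5 and 3.6.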
Interesting… It works with 2.7.13 and doesn’t work with 3.4.3, 3.5.2, 3.6.0