ArchiveBox: Intermittent network response dropping when building and executing inside docker
It seems like Pocket RSS feeds are not being parsed correctly: fragments of the XML / HTML tags end up in the extracted links. Here’s how to reproduce this:
docker-compose exec archivebox /bin/archive http://getpocket.com/users/*[redacted]/feed/all
I created a Pocket account with two links in it. The corresponding RSS feed that gets downloaded looks like this:
<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
>
<channel>
<title>My Reading List: Read and Unread</title>
<description>Items I've saved to read</description>
<link>http://readitlaterlist.com/users/*[redacted]/feed/all</link>
<atom:link href="http://readitlaterlist.com/users/*[redacted]/feed/all" rel="self" type="application/rss+xml" />
<item>
<title><![CDATA[Trump Agrees to Reopen Government for 3 Weeks in Surprise Retreat From Wall]]></title>
<category>Unread</category>
<link>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</link>
<guid>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</guid>
<pubDate>Fri, 25 Jan 2019 16:21:38 -0600</pubDate>
</item>
<item>
<title><![CDATA[Neue Passwort-Leaks: Insgesamt 2,2 Milliarden Accounts betroffen]]></title>
<category>Unread</category>
<link>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</link>
<guid>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
<pubDate>Fri, 25 Jan 2019 16:20:07 -0600</pubDate>
</item>
</channel>
</rss>
Instead of the two <link>s, the software now tries to pull in 10 links and seems to mess up the URLs:
[▶] [2019-01-25 22:30:05] Updating files for 10 links in archive...
[+] [2019-01-25 22:30:09] "https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>"
https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
> /data/archive/1548455383 (new)
> favicon
> wget
Got wget response code 8:
https://www.heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html%3c/guid%3e:
2019-01-25 22:30:12 ERROR 404: Not Found.
Some resources were skipped: 404 Not Found
Run to see full output:
cd /data/archive/1548455383;
wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548455410 --page-requisites --user-agent="ArchiveBox/544de6831 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
> pdf
> screenshot
> dom
> archive_org
Failed: Exception BadQueryException: Illegal character in path at index 110: https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
Run to see full output:
curl --location --head --max-time 60 --get https://web.archive.org/save/https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
> git
√ index.json
√ index.html
(Note the </guid> at the end of the URL wget is trying to download.)
In the end, no links could be saved:
[√] [2019-01-25 22:35:50] Update of 10 links complete (5.75 min)
- 10 entries skipped
- 44 entries updated
- 16 errors
Latest stable version.
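A minimal sketch of one possible cause, as a guess only: I have not checked the ArchiveBox source, so the assumption here is that the feed is not recognized as RSS and falls through to a plain-text scan for URL-like strings. The names `FEED`, `naive_urls`, and `item_links` are made up for this illustration, and the feed is a trimmed copy of the one above. A whitespace-delimited URL regex finds exactly 10 matches in this feed (the four xmlns URIs, the channel link, the atom:link href, and a link plus guid per item) and keeps the trailing closing tags, which matches the symptoms in the log. Parsing the document as XML and reading only each item's <link> gives the two expected URLs.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the feed from this report; the user segment stays redacted.
FEED = """<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
>
<channel>
<title>My Reading List: Read and Unread</title>
<link>http://readitlaterlist.com/users/*[redacted]/feed/all</link>
<atom:link href="http://readitlaterlist.com/users/*[redacted]/feed/all" rel="self" type="application/rss+xml" />
<item>
<category>Unread</category>
<link>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</link>
<guid>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</guid>
</item>
<item>
<category>Unread</category>
<link>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</link>
<guid>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
</item>
</channel>
</rss>
"""

# Hypothetical plain-text fallback: grab anything that looks like a URL up to
# the next whitespace. This is an assumption, not ArchiveBox's actual code.
naive_urls = re.findall(r"https?://\S+", FEED)

# XML-aware extraction: only the <link> child of each <item>.
root = ET.fromstring(FEED.encode("utf-8"))
item_links = [item.findtext("link") for item in root.iter("item")]

print(len(naive_urls))   # 10, including namespace URIs and both guids
print(naive_urls[-1])    # the heise.de URL with "</guid>" still attached
print(item_links)        # the two article URLs, with no tag fragments
```

If that guess is right, the fix would be on the parser-selection side (making sure this feed reaches an RSS/XML parser) rather than in wget or the archive.org submission, which are just receiving the already-broken URL.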
About this issue
- State: closed
- Created 5 years ago
- Comments: 15 (7 by maintainers)
So… even more… I changed puppeteer for puppeteer-core (a version of Puppeteer that doesn’t download Chromium by default) in the Dockerfile, because we’re installing Chromium separately anyway. This at first failed as well:
There seems to be something going on either with my network connection or the npm servers. I tried again:
This finally did work. Not sure about the tarball errors.
Back to the original purpose of the ticket, pocket feeds not being properly imported: I tried the same RSS feed and this time my two links were parsed / downloaded correctly; screenshot, html, pdf confirmed and working.
Thanks again for your support and this project. Love it and I think it’s very important. You might want to consider puppeteer-core.
OK, I did more digging. I edited the Dockerfile to include a RUN npm cache clean --force before the puppeteer installation (now step 8 instead of 7), but no luck there either:
I then reduced the Dockerfile to the bare minimum to see if that would give me any clue:
But still no luck (this time it errored out at the same spot):
So the error must be within the npm package of puppeteer?!