ArchiveBox: Intermittent network response dropping when building and executing inside docker

It seems that Pocket RSS feeds are not being parsed correctly: fragments of the XML / HTML tags end up in the extracted links. Here’s how to reproduce this:

docker-compose exec archivebox /bin/archive http://getpocket.com/users/*[redacted]/feed/all

I created a Pocket account with two links in it. The corresponding RSS feed that is downloaded looks like this:

<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:wfw="http://wellformedweb.org/CommentAPI/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:atom="http://www.w3.org/2005/Atom"
    >

<channel>

<title>My Reading List: Read and Unread</title>
<description>Items I've saved to read</description>
<link>http://readitlaterlist.com/users/*[redacted]/feed/all</link>
<atom:link href="http://readitlaterlist.com/users/*[redacted]/feed/all" rel="self" type="application/rss+xml" />


<item>
<title><![CDATA[Trump Agrees to Reopen Government for 3 Weeks in Surprise Retreat From Wall]]></title>
<category>Unread</category>
<link>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</link>
<guid>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</guid>
<pubDate>Fri, 25 Jan 2019 16:21:38 -0600</pubDate>
</item>
<item>
<title><![CDATA[Neue Passwort-Leaks: Insgesamt 2,2 Milliarden Accounts betroffen]]></title>
<category>Unread</category>
<link>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</link>
<guid>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
<pubDate>Fri, 25 Jan 2019 16:20:07 -0600</pubDate>
</item>
</channel>

</rss>

Instead of the two <link>s, the software now tries to pull in 10 links and seems to mangle the URLs:

[▶] [2019-01-25 22:30:05] Updating files for 10 links in archive...
[+] [2019-01-25 22:30:09] "https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>"
    https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
    > /data/archive/1548455383 (new)
      > favicon
      > wget
        Got wget response code 8:
          https://www.heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html%3c/guid%3e:
          2019-01-25 22:30:12 ERROR 404: Not Found.
        Some resources were skipped: 404 Not Found
        Run to see full output:
            cd /data/archive/1548455383;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent --restrict-file-names=unix --timeout=60 --warc-file=warc/1548455410 --page-requisites --user-agent="ArchiveBox/544de6831 (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
      > pdf
      > screenshot
      > dom
      > archive_org
        Failed: Exception BadQueryException: Illegal character in path at index 110: https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
        Run to see full output:
            curl --location --head --max-time 60 --get https://web.archive.org/save/https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
      > git
      √ index.json
      √ index.html

(Note the </guid> at the end of the URL that wget is trying to download.)
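To illustrate how a closing tag could end up glued to a URL, here is a minimal Python sketch. It is not ArchiveBox's actual parser; the function names `item_links` and `naive_links` and the shortened feed are assumptions made for illustration. It contrasts reading each item's <link> with an XML parser against scanning the raw text with a greedy URL regex, which is one plausible way to get output like the log above.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Shortened copy of the feed from this report (user name replaced by "example").
FEED = """<?xml version="1.0" encoding="UTF-8"?><rss version="2.0">
<channel>
<title>My Reading List: Read and Unread</title>
<link>http://readitlaterlist.com/users/example/feed/all</link>
<item>
<title><![CDATA[First article]]></title>
<link>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</link>
<guid>https://nytimes.com/2019/01/25/us/politics/trump-shutdown-deal.html</guid>
</item>
<item>
<title><![CDATA[Second article]]></title>
<link>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</link>
<guid>https://heise.de/security/meldung/Neue-Passwort-Leaks-Insgesamt-2-2-Milliarden-Accounts-betroffen-4287538.html</guid>
</item>
</channel>
</rss>
"""


def item_links(feed_xml: str) -> List[str]:
    """Return the <link> text of every <item>, using a real XML parser."""
    root = ET.fromstring(feed_xml.encode("utf-8"))
    links = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    return links


def naive_links(text: str) -> List[str]:
    """Hypothetical plain-text fallback: grab anything that looks like a URL.

    \\S+ runs until the next whitespace, so a closing tag that directly
    follows the URL is captured as part of it.
    """
    return re.findall(r"https?://\S+", text)


if __name__ == "__main__":
    print(item_links(FEED))   # one clean URL per <item>
    print(naive_links(FEED))  # more entries, several ending in </link> or </guid>
```

The XML-based version returns one URL per item, while the regex version also picks up the channel link and the <guid> lines, each with a trailing tag attached.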

In the end, no links could be saved:

[√] [2019-01-25 22:35:50] Update of 10 links complete (5.75 min)
    - 10 entries skipped
    - 44 entries updated
    - 16 errors

Latest stable version.

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 15 (7 by maintainers)

Most upvoted comments

There's more: I swapped puppeteer for puppeteer-core (a version of Puppeteer that doesn’t download Chromium by default) in the Dockerfile, since we install Chromium separately anyway. At first this failed as well:

Step 7/15 : RUN npm i puppeteer-core
 ---> Running in 4cfe4c562904
npm ERR! code EPROTO
npm ERR! errno EPROTO
npm ERR! request to https://registry.npmjs.org/rimraf failed, reason: write EPROTO 139977009982336:error:14094410:SSL routines:ssl3_read_bytes:sslv3 alert handshake failure:../deps/openssl/openssl/ssl/record/rec_layer_s3.c:1407:SSL alert number 40
npm ERR!

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2019-01-28T05_08_25_248Z-debug.log
ERROR: Service 'archivebox' failed to build: The command '/bin/sh -c npm i puppeteer-core' returned a non-zero code: 1

There seems to be something going on either with my network connection or the npm servers. I tried again:

Step 7/15 : RUN npm i puppeteer-core
 ---> Running in e1b3a79eaf9c
npm WARN tarball tarball data for es6-promise@^4.0.3 (sha512-n6wvpdE43VFtJq+lUDYDBFUwV8TZbuGXLV4D6wKafg13ldznKsyEvatubnmUe31zcvelSzOHF+XbaT+Bl9ObDg==) seems to be corrupted. Trying one more time.
npm WARN tarball tarball data for puppeteer-core@latest (sha512-JTsJKCQdrk1RqEGZN3l2TyW7Rhy7GWRRzd3PftYyA3B35l0t0lLU+gdF7czemnpSVVMiAgHpM1Uk/iO6jLreMA==) seems to be corrupted. Trying one more time.

> puppeteer-core@1.11.0 install /node_modules/puppeteer-core
> node install.js

npm WARN saveError ENOENT: no such file or directory, open '/package.json'
npm notice created a lockfile as package-lock.json. You should commit this file.
npm WARN enoent ENOENT: no such file or directory, open '/package.json'
npm WARN !invalid#1 No description
npm WARN !invalid#1 No repository field.
npm WARN !invalid#1 No README data
npm WARN !invalid#1 No license field.

+ puppeteer-core@1.11.0
added 43 packages from 22 contributors and audited 50 packages in 14.773s
found 0 vulnerabilities

 ---> 7538b1c16fbc

This finally did work. Not sure about the tarball errors.
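Since the second attempt succeeded without any change, the failure looks transient. A minimal Python sketch of a retry wrapper follows, assuming the failing step is simply a command that exits non-zero. The helper name `run_with_retries` and its parameters are hypothetical and not part of ArchiveBox or npm.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=3, delay=2.0):
    """Run cmd until it exits with code 0; return the attempt number that succeeded.

    Waits delay * attempt seconds between tries (a simple linear backoff).
    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            time.sleep(delay * attempt)
    raise RuntimeError(f"{cmd!r} failed after {attempts} attempts")


# Example usage (not executed here):
#   run_with_retries(["npm", "i", "puppeteer-core"])
```

Retrying only hides the symptom. If the connection keeps dropping responses, the underlying network problem still needs to be found.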

Back to the original purpose of the ticket, Pocket feeds not being imported properly: I tried the same RSS feed again, and this time my two links were parsed and downloaded correctly. Screenshot, HTML, and PDF are all confirmed working.

Thanks again for your support and this project. I love it and I think it’s very important. You might want to consider puppeteer-core.

OK, I did more digging. I edited the Dockerfile to add a RUN npm cache clean --force step before the puppeteer installation (which is now step 8 instead of 7), but that did not help either:


 ---> 73357b1217dc
Removing intermediate container 84e268fc0b12
Step 6/16 : RUN chmod +x /usr/local/bin/dumb-init
 ---> Running in bd8430cbfbf9
 ---> dcaaf479c297
Removing intermediate container bd8430cbfbf9
Step 7/16 : RUN npm cache clean --force
 ---> Running in 7d6f8353ba49
npm WARN using --force I sure hope you know what you are doing.
 ---> 22fe375cd41a
Removing intermediate container 7d6f8353ba49
Step 8/16 : RUN npm i puppeteer
 ---> Running in 8a5d51af5bac
npm ERR! Unexpected end of JSON input while parsing near '...s/extract-zip":"^1.6.'

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2019-01-28T04_26_16_469Z-debug.log

I then reduced the Dockerfile to the bare minimum to see if that would give me any clue:

FROM node:11-slim
LABEL maintainer="Nick Sweeting <archivebox-git@sweeting.me>"

# RUN apt-get update \
#    && apt-get install -yq --no-install-recommends \
#        git wget curl youtube-dl gnupg2 libgconf-2-4 python3 python3-pip \
#    && rm -rf /var/lib/apt/lists/*

# Install latest chrome package and fonts to support major charsets (Chinese, Japanese, Arabic, Hebrew, Thai and a few others)
RUN apt-get update && apt-get install -y wget --no-install-recommends \
    && wget -q -O - https://dl-ssl.google.com/linux/linux_signing_key.pub | apt-key add - \
    && sh -c 'echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list' \
    && apt-get update \
    && apt-get install -y google-chrome-unstable fonts-ipafont-gothic fonts-wqy-zenhei fonts-thai-tlwg fonts-kacst ttf-freefont \
      --no-install-recommends \
    && rm -rf /var/lib/apt/lists/* \
    && rm -rf /src/*.deb

# It's a good idea to use dumb-init to help prevent zombie chrome processes.
#ADD https://github.com/Yelp/dumb-init/releases/download/v1.2.0/dumb-init_1.2.0_amd64 /usr/local/bin/dumb-init
#RUN chmod +x /usr/local/bin/dumb-init

# Do a npm clean
#RUN npm cache clean --force

# Install puppeteer so it's available in the container.
RUN npm i puppeteer

# Add user so we don't need --no-sandbox.
#RUN groupadd -r pptruser && useradd -r -g pptruser -G audio,video pptruser \
#    && mkdir -p /home/pptruser/Downloads \
#    && chown -R pptruser:pptruser /home/pptruser \
#    && chown -R pptruser:pptruser /node_modules

# Install the ArchiveBox repository and pip requirements
#RUN git clone https://github.com/pirate/ArchiveBox /home/pptruser/app \
#    && mkdir -p /data \
#    && chown -R pptruser:pptruser /data \
#    && ln -s /data /home/pptruser/app/archivebox/output \
#    && ln -s /home/pptruser/app/bin/archivebox /bin/archive \
#    && chown -R pptruser:pptruser /home/pptruser/app/archivebox
#    # && pip3 install -r /home/pptruser/app/archivebox/requirements.txt

VOLUME /data

ENV LANG=C.UTF-8 \
    LANGUAGE=en_US:en \
    LC_ALL=C.UTF-8 \
    PYTHONIOENCODING=UTF-8 \
    CHROME_SANDBOX=False \

But it still failed, again at the same spot:

Step 4/10 : RUN npm i puppeteer
 ---> Running in cdcae8339d94
npm ERR! Unexpected end of JSON input while parsing near '...s/extract-zip":"^1.6.'

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2019-01-28T04_35_30_363Z-debug.log
ERROR: Service 'archivebox' failed to build: The command '/bin/sh -c npm i puppeteer' returned a non-zero code: 1

So the error must be within the npm package of puppeteer?!
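An "Unexpected end of JSON input" message is also what a parser reports when a response is cut off partway, which would fit a dropped network response rather than a broken package. The Python sketch below shows the same class of failure. The metadata document and the helper name `parse_registry_response` are invented for illustration and are not the real registry payload.

```python
import json

# Invented stand-in for a package metadata document.
COMPLETE = '{"name": "example-package", "dependencies": {"example-dep": "^1.0.0"}}'

# Simulate a response whose last bytes never arrived.
TRUNCATED = COMPLETE[:-6]


def parse_registry_response(body: str) -> dict:
    """Parse a JSON body, reporting where the input stopped if it is incomplete."""
    try:
        return json.loads(body)
    except json.JSONDecodeError as exc:
        tail = body[-24:]
        raise ValueError(f"incomplete or invalid JSON near {tail!r}") from exc


if __name__ == "__main__":
    print(parse_registry_response(COMPLETE)["name"])
    try:
        parse_registry_response(TRUNCATED)
    except ValueError as err:
        print(err)
```

The complete document parses cleanly, while the truncated copy fails at the point where the data ends, much like the npm message that quotes the last fragment it received.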