getpapers: PDF corrupt when downloaded with getpapers, O.K. when downloaded directly

I am totally new to getpapers, so this may be just a misunderstanding on my part of some of the inner workings of the ContentMine toolchain (even though it seems to be pretty straightforward…).

Here is the problem: when I use getpapers to download a paper, say with the command:

getpapers --query 'PMCID:"PMC5293196" JOURNAL:"PLOS ONE"' --outdir network-analysis -p

I get:

info: Searching using eupmc API
info: Found 1 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 4.5.3.2 vs. 5.0.1 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
info: Downloading fulltext PDF files
Downloading files [==============================] 100% (1/1) [0.0s elapsed, eta 0.0]
info: All downloads succeeded!

which looks pretty good to me (the warnings seem to be harmless in my case). Everything is there and the PDF seems to be in

PMC5293196/fulltext.pdf

However, when I try to open it in acroread, the reader tells me that the file is corrupt, that some font is missing etc. - and the paper is practically unreadable!

If I try to download it directly from the URL

http://europepmc.org/articles/PMC5293196?pdf=render

which is found in the eupmc_results.json as the download URL for this ID, all is OK and the paper is displayed with absolutely no errors…

There is indeed a difference in the MD5 sums of the two versions:

md5sum PMC5293196/fulltext.pdf 
b815dff20f24ddbb035b4bfc3743d7f5  PMC5293196/fulltext.pdf
md5sum PMC5293196.pdf
5771a39765664bedb8ad8815b7980921  PMC5293196.pdf

The latter (PMC5293196.pdf) is the ‘good’ one, downloaded directly from http://europepmc.org/articles/PMC5293196?pdf=render with curl, the former (PMC5293196/fulltext.pdf) is the ‘bad’ one downloaded by getpapers.

It is not a locale issue (seems to me), because the same happens to a user with

LANG=en_US
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE=C

and to a user with a non-US locale.

I have nodejs 4.4.6 and getpapers is installed with

npm install --global getpapers
npm WARN deprecated node-uuid@1.4.7: use uuid module instead
npm WARN deprecated tough-cookie@2.2.2: ReDoS vulnerability parsing Set-Cookie https://nodesecurity.io/advisories/130
/usr/bin/getpapers -> /usr/lib/node_modules/getpapers/bin/getpapers.js
getpapers@0.4.12 /usr/lib/node_modules/getpapers
├── progress@1.1.8
├── version_compare@0.0.3
├── commander@2.7.1 (graceful-readlink@1.0.1)
├── chalk@1.0.0 (escape-string-regexp@1.0.5, ansi-styles@2.2.1, supports-color@1.3.1, strip-ansi@2.0.1, has-ansi@1.0.3)
├── mkdirp@0.5.1 (minimist@0.0.8)
├── sanitize-filename@1.6.1 (truncate-utf8-bytes@1.0.2)
├── got@2.9.2 (lowercase-keys@1.0.0, timed-out@2.0.0, prepend-http@1.0.4, object-assign@2.1.1, is-stream@1.1.0, infinity-agent@2.0.3, statuses@1.3.1, nested-error-stacks@1.0.2, read-all-stream@2.2.0, duplexify@3.5.0)
├── matched@0.4.4 (fs-exists-sync@0.1.0, is-valid-glob@0.3.0, arr-union@3.1.0, async-array-reduce@0.2.1, extend-shallow@2.0.1, has-glob@0.1.1, glob@7.1.1, lazy-cache@2.0.2, resolve-dir@0.1.1)
├── winston@1.0.2 (cycle@1.0.3, stack-trace@0.0.9, eyes@0.1.8, isstream@0.1.2, async@1.0.0, pkginfo@0.3.1, colors@1.0.3)
├── restler@3.4.0 (yaml@0.2.3, qs@1.2.0, iconv-lite@0.2.11, xml2js@0.4.0)
├── lodash@3.10.1
├── xml2js@0.4.17 (sax@1.2.2, xmlbuilder@4.2.1)
├── requestretry@1.12.0 (extend@3.0.0, when@3.7.8, request@2.80.0, lodash@4.17.4)
└── crossref@0.1.2 (got@5.1.0, request@2.65.0)

What is going on here? Can you please shed some light?

About this issue

Original URL
State: closed
Created 7 years ago
Comments: 26 (12 by maintainers)

Commits related to this issue

fix #152 - set downloads to be binary — committed to ContentMine/getpapers by deleted user 7 years ago

Most upvoted comments

PMC5293196.pdf fulltext.pdf

I attach the two versions (“good” and “bad”) for PMC5293196.

PMC5293196.pdf: the good one, i.e. the one you get with curl. fulltext.pdf: the bad one, i.e. the one you get with getpapers.

If you open the two files in a pure text editor (say, in vi), you will notice that the differences are in parts like stream objects, i.e. in the binary parts. This explains why the PLOS icon does not appear in okular with the bad file. It is something that has to do with the encoding of the binary blobs of a PDF file.

Besides, the bad file is almost double in size, indicating that its binary parts have been encoded in a “bloating” way - base64?

I think I’m pretty close…Now it’s your turn, dear @blahah ! 😃

sedimentation-fault on Apr 9, 2017