puppeteer: PDF Generation hanging for documents with many large images

Steps to reproduce

Tell us about your environment:

  • Puppeteer version: 0.13.0
  • Platform / OS version: OSX 10.13.1
  • Node versions: tested with v8.6.0, v8.9.0 & v9.2.0
  • URLs (if applicable): N/A

What steps will reproduce the problem?

The code below isolates a problem discovered while using Puppeteer to generate PDF reports. While Puppeteer has been fantastic so far, I have come across a problem with pages containing a large number of images.

index.html:

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <title>Test App</title>
  </head>
  <body>
    <div id="root"></div>
  </body>
</html>

pdfImages.js:

const puppeteer = require('puppeteer');

const IMAGE_URIS = [
  // A list of 1949 image URIs as strings
  // The images used range in size between 1KB to 1.7MB with an average size of 200KB.
];

const PDF_OPTIONS = {
  path: `${__dirname}/output.pdf`,
  format: 'A4',
  printBackground: true,
  landscape: false,
  margin: {top: '10mm', left: '10mm', right: '10mm', bottom: '15mm'},
};
const IMAGE_LOADING_TIMEOUT = 60 * 1000 * 5; // 5 minutes

function addImagesToPage(imageUriList) {
  const root = document.getElementById('root');
  imageUriList.forEach((imageUri) => {
    const div = document.createElement('div');
    const img = new Image();
    img.src = imageUri;
    img.style = 'max-width: 20vw; max-height: 20vh;';
    div.appendChild(img);
    root.appendChild(div);
  });
}

function waitForAllImagesToCompleteLoading() {
  const allImagesInDocument = Array.from(document.getElementsByTagName('img'));
  return allImagesInDocument.every((img) => img.complete);
}

let browser;
puppeteer.launch({headless: true})
  .then((newBrowser) => {
    browser = newBrowser;
    return browser.newPage();
  })
  .then((page) => {
    return page.goto(`file://${__dirname}/index.html`)
    .then(() => page.evaluate(addImagesToPage, IMAGE_URIS))
    .then(() => page.waitForFunction(waitForAllImagesToCompleteLoading, {timeout: IMAGE_LOADING_TIMEOUT}))
    .then(() => page.pdf(PDF_OPTIONS))
    .then(() => page.close());
  })
  .then(() => browser.close());

I run this using: env DEBUG="puppeteer:*" node --max-old-space-size=16384 pdfImages.js

What is the expected result? Running this code should generate a PDF.

Please note that with 1100 images and --max-old-space-size=8174, the code runs without a problem.

What happens instead? Running the code with the command above causes it to hang at the PDF generation stage. Please see HungPdfPrint.log for the logs produced when this happens.

The code consistently hangs when both of the following hold:

  • --max-old-space-size is 8174 (8GB) or 16384 (16GB), and
  • the number of images is 1200, 1500 or 1949.

When the --max-old-space-size flag is removed, the code crashes with an out-of-heap-memory error. See this log: OutOfMemoryCrash.log

Again, with 1100 images and --max-old-space-size=8174, the code runs without a problem.

About this issue

  • State: closed
  • Created 7 years ago
  • Reactions: 19
  • Comments: 27

Most upvoted comments

I’m having the same issue taking PDF snapshots of HTML containing images where the combined image size is more than 200MB: with 193MB of images it works, but over 200MB it hangs.

@Mgonand the only way you’re getting this to work reliably is by creating a batch of PDF files and then joining them into a final file (I used pdftk). I use Puppeteer to generate a series of 10-image PDFs and then join them into one final file, roughly as sketched below.

It seems puppeteer has a really hard time managing a lot of large, concurrent downloads.
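For reference, a rough sketch of that batching approach. The chunk size, file names, and the pdftk invocation are illustrative and not taken from the setup above; the image-loading check reuses the same idea as the repro code:

const {execFileSync} = require('child_process');
const puppeteer = require('puppeteer');

const CHUNK_SIZE = 10; // images per intermediate PDF (illustrative)

async function renderInBatches(imageUris, outputPath) {
  const browser = await puppeteer.launch({headless: true});
  const page = await browser.newPage();
  const parts = [];

  for (let i = 0; i < imageUris.length; i += CHUNK_SIZE) {
    const chunk = imageUris.slice(i, i + CHUNK_SIZE);
    const html = chunk
        .map((uri) => `<img src="${uri}" style="max-width: 20vw; max-height: 20vh;">`)
        .join('');
    await page.setContent(html);
    // Wait until every image in this small chunk has finished loading.
    await page.waitForFunction(() => Array.from(document.images).every((img) => img.complete));
    const partPath = `${__dirname}/part-${parts.length}.pdf`;
    await page.pdf({path: partPath, format: 'A4'});
    parts.push(partPath);
  }

  await browser.close();
  // Equivalent to: pdftk part-0.pdf part-1.pdf ... cat output final.pdf
  execFileSync('pdftk', [...parts, 'cat', 'output', outputPath]);
}

Each intermediate PDF stays small, so no single page.pdf() call has to buffer hundreds of megabytes at once.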

I’m afraid that in my case, it is not because of the /dev/shm Docker problem (although we are running Docker with the --shm-size flag).

The code in the original post was run outside of Docker and still experienced the same issue.

I also found that galvez’s solution worked. We had a 27-page PDF that would not export no matter how much time we gave it, but after cutting it into 10-page chunks it exported in around 22 seconds. Thanks for the tip!

Also, instead of using pdftk try using Ghostscript. It takes a little bit longer but compresses the PDFs without a loss in quality. Some of our PDFs were approaching 100MB, so that was a nice side-effect.

gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile=output.pdf 0.pdf 1.pdf 2.pdf

@Mgonand I think you are getting this error because Puppeteer converts the PDF buffer to a string; the relevant code is around https://github.com/GoogleChrome/puppeteer/blob/cf8c62e835b565feb9d19639a3832a59d9b07aca/lib/PipeTransport.js#L48. Node.js has a string length limit of roughly 1GB, so if your resulting PDF is larger than that, Puppeteer will crash.

Here is a Stack Overflow answer explaining this: https://stackoverflow.com/a/44540335

I hope the team will make a change to fix this.
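If you want to see where that ceiling sits on your own machine, a minimal sketch (plain Node, not Puppeteer code; requires Node 8.2 or newer for buffer.constants):

const {constants} = require('buffer');

// The longest string V8 will allocate; roughly 2^28 to 2^30 characters
// depending on the Node/V8 version.
console.log(constants.MAX_STRING_LENGTH);

// Exceeding it throws the same error reported further down in this thread.
try {
  'a'.repeat(constants.MAX_STRING_LENGTH + 1);
} catch (err) {
  console.log(err.message); // "Invalid string length"
}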

Unfortunately that code is quite complicated (several hundred lines of code). There’s no browser or standard library logic to help out (as far as I know); you have to write it all yourself. The gist:

  1. Render all of the React components to one long “page”
  2. Measure the height of each fully-rendered component
  3. Calculate which components can fit on a page, and which components are taller than one page and need to be split across pages
  4. Re-render all of the components, giving them instructions for how to split themselves apart (if necessary)
  5. Count the number of page elements and capture the PDF

To implement this kind of logic, every component has to be “PDF aware” and be able to render itself based on the common PDF props we may pass into it. (For instance, a table might need to render only 20 rows at a time but have multiple copies on the same page to get every row.) It’s complicated, but these PDFs are user-defined, so we have to handle any case.

Short of this level of application logic, I’m not quite sure how you could get the level of control you need to be able to split and count pages. 🤷
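As a very rough illustration of steps 2 and 3 only, here is a browser-side sketch. Every name in it (the '.pdf-block' selector, PAGE_HEIGHT_PX) is a made-up placeholder, and the real logic described above is far more involved:

// Measure rendered blocks and group them into pages by accumulated height.
const PAGE_HEIGHT_PX = 1050; // approximate usable height of one A4 page at 96 dpi

function groupBlocksIntoPages() {
  const blocks = Array.from(document.querySelectorAll('.pdf-block'));
  const pages = [[]];
  let usedHeight = 0;

  for (const block of blocks) {
    const height = block.getBoundingClientRect().height;
    if (usedHeight + height > PAGE_HEIGHT_PX && pages[pages.length - 1].length > 0) {
      pages.push([]); // start a new page
      usedHeight = 0;
    }
    // A block taller than a full page still has to split itself apart,
    // which is the "PDF aware" re-render step described above.
    pages[pages.length - 1].push(block);
    usedHeight += height;
  }
  return pages;
}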

We actually created the pdf-page item for the exact reason you mention: we didn’t know how long the PDF would be otherwise. Our application requires that we know how long the PDF is, so we write application logic to break long tables into multiple small tables across pages. If that’s not doable for you, then maybe the new createPDFStream() function could help solve the issue of large PDFs for you.
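For the streaming route, a minimal sketch looks something like the following. Note that createPDFStream() only exists in newer Puppeteer releases, and whether it resolves to a Node Readable or a web ReadableStream depends on the version, so check the docs for the one you are on (this sketch assumes a Node Readable):

const fs = require('fs');
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(`file://${__dirname}/index.html`);

  // Stream the PDF to disk instead of buffering the whole document in memory.
  const pdfStream = await page.createPDFStream({format: 'A4'});
  pdfStream.pipe(fs.createWriteStream(`${__dirname}/output.pdf`));
  await new Promise((resolve, reject) => {
    pdfStream.on('end', resolve);
    pdfStream.on('error', reject);
  });

  await browser.close();
})();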

Any updates on this? I am currently experiencing the same behavior. With any set of images whose total size is greater than 200MB, the PDF is never generated.

Puppeteer version: 1.6.0

I tried to launch with the pipe option active and I get this error:

RangeError: Invalid string length
    at Pipe._dispatch (/Users/marcosgonzalez/Proyectos/ProMIR/printService/node_modules/puppeteer/lib/Pipe.js:47:38)
    at Socket.Pipe._eventListeners.helper.addEventListener.buffer (/Users/marcosgonzalez/Proyectos/ProMIR/printService/node_modules/puppeteer/lib/Pipe.js:29:64)
    at emitOne (events.js:115:13)
    at Socket.emit (events.js:210:7)
    at addChunk (_stream_readable.js:264:12)
    at readableAddChunk (_stream_readable.js:251:11)
    at Socket.Readable.push (_stream_readable.js:209:10)
    at Pipe.onread (net.js:587:20)

I tried to launch with args: ['--disable-dev-shm-usage'] and nothing changed.

Any ideas?

Hi, I’ve figured out what’s happening in my case. I’m running my project in a Docker container, and by default it runs with a small /dev/shm shared memory space (64MB). This is what was causing my Chrome crashes. Raising this value to 2GB solved my problem. This problem is described here. I hope this can help somebody with the same problem.
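For anyone else hitting the /dev/shm variant of this, the two usual workarounds are a larger shared-memory segment for the container or telling Chromium not to use /dev/shm at all. A sketch (adjust the sizes and flags to your own setup; as noted earlier in the thread, the flag alone did not help in every reported case):

const puppeteer = require('puppeteer');

// Option A (container level): give Docker a larger shared memory segment,
// e.g. `docker run --shm-size=2gb ...`.
// Option B (browser level): have Chromium write to /tmp instead of /dev/shm.
(async () => {
  const browser = await puppeteer.launch({
    args: ['--disable-dev-shm-usage'],
  });
  // ... generate the PDF as usual ...
  await browser.close();
})();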