gatsby: [gatsby-source-wordpress] Large WordPress site causing extremely slow build time (stuck at 'source and transform nodes')

Description

gatsby develop hangs on source and transform nodes after querying a large WordPress installation (~9000 posts, ~35 pages).

Are there any guides as to what’s too big for Gatsby to handle in this regard?

Environment

  System:
    OS: macOS High Sierra 10.13.6
    CPU: x64 Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
    Shell: 3.2.57 - /bin/bash
  Binaries:
    Node: 8.10.0 - ~/n/bin/node
    Yarn: 1.5.1 - ~/n/bin/yarn
    npm: 5.6.0 - ~/n/bin/npm
  Browsers:
    Chrome: 67.0.3396.99
    Safari: 11.1.2
  npmPackages:
    gatsby: ^1.9.273 => 1.9.273
    gatsby-image: ^1.0.54 => 1.0.54
    gatsby-link: ^1.6.45 => 1.6.45
    gatsby-plugin-google-analytics: ^1.0.27 => 1.0.31
    gatsby-plugin-postcss-sass: ^1.0.22 => 1.0.22
    gatsby-plugin-react-helmet: ^2.0.10 => 2.0.11
    gatsby-plugin-react-next: ^1.0.11 => 1.0.11
    gatsby-plugin-resolve-src: 1.1.3 => 1.1.3
    gatsby-plugin-sharp: ^1.6.48 => 1.6.48
    gatsby-plugin-svgr: ^1.0.1 => 1.0.1
    gatsby-source-filesystem: ^1.5.39 => 1.5.39
    gatsby-source-wordpress: ^2.0.93 => 2.0.93
    gatsby-transformer-sharp: ^1.6.27 => 1.6.27
  npmGlobalPackages:
    gatsby-cli: 1.1.58

edit: Just want to reiterate: this is not something easily fixable by deleting .cache/, node_modules/, etc. If that resolves your problem, you weren’t experiencing this issue.

About this issue

  • Original URL
  • State: closed
  • Created 6 years ago
  • Reactions: 19
  • Comments: 156 (84 by maintainers)

Most upvoted comments

Guys, I managed to fix this by running createRemoteFileNode requests in serial instead of parallel.

Yeah the issue is really based on the fact that createRemoteFileNode uses concurrency of 200 which is too much for most WP servers. I have my images on CloudFront and was hitting some rate limits there.

I tried fixing the issue with a branched version of the source plugin for a while, but the issue really isn’t in gatsby-source-wordpress; it is in gatsby-source-filesystem. Ideally, consumers of the createRemoteFileNode function would be able to pass in concurrency there. Then plugins could make the concurrency option available in their configs. I still would like to do a PR to address this issue!

The solution I have been using is just a simple script to modify the code inside node_modules. Really quite fragile and not ideal, but it is a simple hack to modify the concurrency directly. It uses shelljs, so it is supposed to work for Windows users as well (haven’t tried).

#!/usr/bin/env node
const path = require('path');
const shell = require('shelljs');

const FILE_PATH = path.resolve(
  __dirname,
  // add path to your root dir here,
  'node_modules',
  'gatsby-source-filesystem/create-remote-file-node.js'
);

shell.sed('-i', 'concurrent: 200', 'concurrent: 20', FILE_PATH);

Hello,

I managed to add tracing using the steps outlined here: https://www.gatsbyjs.org/docs/performance-tracing/. Unfortunately it did not provide much info, as it simply confirmed that source and transform nodes was indeed taking quite long.

I have however done some of my own debugging on the issue after seeing some non-deterministic behavior involving images. When running either the develop or build script, I would get cases where not all of the images would be downloaded and the localFile nodes would not complete. After digging into the code, I have determined that there seems to be an issue here:

https://github.com/gatsbyjs/gatsby/blob/ad142af473fc8dc8555a5cf23a0dfca42fcbbe90/packages/gatsby-source-wordpress/src/normalize.js#L483-L506

For me createRemoteFileNode was failing due to server timeout errors, and it defaults to returning null. I had to add some logging to createRemoteFileNode as well to determine this and get the actual server responses. Since these nodes don’t complete and don’t have IDs, they don’t get registered in the cache. The tmp files are deleted and the gatsby-source-filesystem cache was left incomplete. For whatever reason (I haven’t looked that far yet), upon running the build script again the source-filesystem cache was then deleted, probably because the script detects that the filesystem is invalid or incomplete. It was this process that was, for me, creating a loop and causing errors on future builds, as the filesystem never completes.

I’m working on a fix that seems to alleviate some of the issues, at least regarding large amounts of images. When the develop or build script succeeds in downloading all of the images the first time, the cache subsequently is not deleted, and then the build process happens quite rapidly because the images are properly cached by gatsby-source-filesystem! My build went from 15 minutes down to 1 minute.

I’m not sure whether this is related to builds that have large amounts of posts. My issue was directly related to downloading 1.6 GB of image data.

This is my first time working with source plugins for Gatsby, so if anyone has any thoughts or advice I would appreciate it! I should be able to post my repo later today; I am working on getting it to use my local version of gatsby-source-filesystem without complications.

Looks like this is a quirky issue. Here is my experience with it:

  • ❌ I saw this issue on macOS High Sierra (using iTerm)
  • ✅ I started using GATSBY_CONCURRENT_DOWNLOAD=50 gatsby develop and the issue went away (this was the case for a couple weeks)
  • ❌ I upgraded to Mojave and upgraded my global Gatsby installation to 2.7.47, then started seeing the issue again (using iTerm)
  • ❌ Tried changing GATSBY_CONCURRENT_DOWNLOAD to 5
  • ❌ Tried blowing away .cache and node_modules
  • ❌ Tried resizing the iTerm window while running gatsby develop (both with 50 and 5)
  • ❌ Ran GATSBY_CONCURRENT_DOWNLOAD=50 gatsby develop in “Terminal” app, not in iTerm
  • ✅ Two weeks later tried using GATSBY_CONCURRENT_DOWNLOAD=50 gatsby develop in iTerm and resized the window a couple times during the process and it worked.

Prematurely thought I had it running with that last one but then it hung. Hopefully this helps others. Still seems like this isn’t quite nailed down but we’re getting there slowly but surely.

Update: Today this worked for me. Not sure if it’s because I resized the iTerm window at the right point in the process or because I watched it go from 93% all the way to 100% but something was different this time.

I had the same problem and I managed to solve it by resizing the terminal window.

Please refer to last comments on #4666.

I was experiencing the same issue in a gatsby+wordpress integration. The build would stall forever in the onCreateNode API where I was using createRemoteFileNode.

Solution: I updated the gatsby-source-filesystem from 2.0.4 to 2.1.8 and added GATSBY_CONCURRENT_DOWNLOAD=50 to my environment variables.

@pieh Yes, that would probably be the place to apply this logic. The throttling for me was a way to approach this and diagnose the issue, so I agree that createRemoteFileNode should be able to handle this on its own.

Particularly problematic, however, is the current behavior of silently swallowing errors and returning null. In my opinion there should be some communication about either the failure or success of the operation. I think createRemoteFileNode could be made more robust with the following functionality.

  1. Eagerly create connections
  2. If there are errors from the server begin to throttle and/or retry if needed
  3. Set some sane defaults for throttling/retrying
  4. Create an entry point for adjusting throttling/retrying
  5. Reject a promise if for some reason the node is unable to be processed.

I can also say that I played around with timeout values here https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L135-L141. Although that increased the probability of a successful response I still had to add handling in order to ensure a successful response.

Most likely the correct entry point for this logic would be here.

https://github.com/gatsbyjs/gatsby/blob/master/packages/gatsby-source-filesystem/src/create-remote-file-node.js#L259-L269

Where, if tasks are failing, they would be retried and, if they keep failing, finally rejected.
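A retry-with-exponential-backoff wrapper along those lines might look like the sketch below. This is an illustration only; withRetries, retries, and baseDelayMs are made-up names, not part of gatsby-source-filesystem:

```javascript
// Sketch: retry a promise-returning task with exponential backoff,
// rejecting only after all attempts fail. `task` could be one
// createRemoteFileNode-style download.
const withRetries = (task, { retries = 3, baseDelayMs = 500 } = {}) =>
  new Promise((resolve, reject) => {
    const attempt = n =>
      task().then(resolve, err => {
        if (n >= retries) return reject(err)
        // back off: baseDelayMs, then 2x, 4x, ...
        setTimeout(() => attempt(n + 1), baseDelayMs * 2 ** n)
      })
    attempt(0)
  })
```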

Hello,

Following up on my comment from a few days ago. Here is my repo.

https://github.com/njmyers/byalejandradesign.com.git

I am using a monorepo in this project so here are some steps if you want to run the repository locally.

  1. Ensure you have the latest version of Yarn 1.12.3
  2. Clone the plugin branch git clone https://github.com/njmyers/byalejandradesign.com.git -b wordpress-plugin
  3. Run yarn && yarn bootstrap
  4. Navigate to the gatsby folder so you can look just at that folder cd packages/web
  5. Run yarn develop or yarn build-web. It should complete successfully the first time, and subsequent runs of the same command will result in much quicker builds! Source and transform nodes takes 222s for me, whereas it was taking three times that earlier and/or not completing.
  6. If you want to see what is actually happening during source and transform, look in your file browser at /packages/web/.cache/gatsby-source-filesystem; you will see that the files are being created there.

I rewrote the downloadMediaFiles function completely. You can see that file at this link https://github.com/njmyers/byalejandradesign.com/blob/wordpress-plugin/packages/gatsby-source-wordpress/src/download-media-files.js

It is probably more verbose than it needs to be, but I had to do this in order to figure out everything that is happening. The functionality that I changed is adding a promise rejection when createRemoteFileNode returns null. I then use a function downloadRunner to throttle the requests so that they don’t all hit the server at once, as well as a retry on promise rejections. I added a 200ms throttle between each createRemoteFileNode request. I’m sure this value could be tweaked, and some of this might be better suited to adding to createRemoteFileNode directly.
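As a rough illustration of the downloadRunner idea described above (the names and the 200ms delay come from this comment; this is not the plugin’s actual code):

```javascript
// Sketch: run download tasks one at a time with a fixed delay between
// requests, instead of hitting the server with everything at once.
// Each entry in `tasks` is a function returning a promise, e.g. one
// createRemoteFileNode call.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))

async function downloadRunner(tasks, delayMs = 200) {
  const results = []
  for (const task of tasks) {
    results.push(await task()) // serial: wait for this download to finish
    await sleep(delayMs)       // throttle before starting the next one
  }
  return results
}
```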

If anyone is curious, the WP install is an EC2 micro instance while the images are behind a CloudFront distribution. Personally I never had any issues with getting posts; my issue was with getting images, and I believe that most of the issues people are having are due to this.

I use createRemoteFileNode to fetch remote images and I experience this same problem: download gets stuck at around 680/780ish.

In createRemoteFileNode, there is a listener to downloadProgress event that was added in https://github.com/sindresorhus/got/releases/tag/v8.0.0 but gatsby-source-filesystem uses got 7.1.0.

I tried upgrading got to the latest version 9.2.2 and could now successfully download all images.

Add this in package.json:

  "resolutions": {
    "got": "^9.2.2"
  }

Just wanted to echo that this isn’t only a WP source issue — was hitting the same problem with gatsby-source-prismic. Reducing the concurrency of source-filesystem with @njmyers’ hack fixed it for me, so I’m guessing it was a rate limiting/overload issue.

Agree that if nothing else the concurrency of source-filesystem should be configurable.

I had the same issue, stuck on “source and transform nodes”. After a lot of console.logs my problem ended up being time out issues with retrieving media files from wordpress. The problem wasn’t the server not being able to handle it, but rather cloudflare rate limiting and throwing timeouts after about 350 requests.

I bypassed cloudflare, went straight to the vps and I’m no longer seeing “source and transform nodes”, and my build finishes.

@TylerBarnes ya here’s a repo where I was seeing it. I haven’t touched it in a little bit.


Side note: How do you handle a situation where you clone a Gatsby site with an older version of Gatsby than what is currently installed by the CLI?

#532314892 @bradydowling:

Not sure if it’s because I resized the iTerm window at the right point

While experiencing the same issue, I resized my iTerm window and bam – it suddenly continued, as well. I don’t know if this is a wild coincidence, or…

Is anyone willing to share their repo with me and credentials so I can give this one a spin and try to find the problem?

Feel free to send me a private mail at ward@gatsbyjs.com

Have you tried setting GATSBY_CONCURRENT_DOWNLOAD to a lower number? By default it’s set to 200.

Linux/mac: GATSBY_CONCURRENT_DOWNLOAD=5 gatsby build

Windows: set GATSBY_CONCURRENT_DOWNLOAD=5 && gatsby build (note: setx writes a persistent value that only takes effect in new shells; set applies to the current session)
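If you want one command that works across shells, one option (not mentioned in this thread, so treat it as a suggestion) is the cross-env package in a package.json script:

```json
{
  "scripts": {
    "develop": "cross-env GATSBY_CONCURRENT_DOWNLOAD=5 gatsby develop"
  }
}
```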

We haven’t been able to deploy for 2 days now. Can someone help point me in the right direction as to where this is occurring in the code so I can try and find a solution?

Have you upgraded your got nested dependency of the gatsby-source-filesystem to use at least version 9.4.0?

If not, you should add:

  "resolutions": {
    "gatsby-source-filesystem/got": "9.4.0"
  }

in your Gatsby project’s package.json. Then remove node_modules and your yarn.lock file and install again.

Note: This resolutions feature only works for yarn. npm has not implemented this yet.

No just the file downloading and caching part. createRemoteFileNode would then just call this package and get back a promise that’d resolve when the file was downloaded (or returned from the cache).
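The interface being described could be as small as a per-URL promise cache. A minimal sketch, assuming hypothetical names (downloadFile, fetcher); a real package would also persist files to disk:

```javascript
// Sketch: deduplicate in-flight downloads and cache completed ones,
// so every caller just gets back the same promise per URL.
const inFlight = new Map()

function downloadFile(url, fetcher) {
  if (!inFlight.has(url)) {
    // start one download per URL; later callers reuse the same promise
    inFlight.set(url, fetcher(url))
  }
  return inFlight.get(url)
}
```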

I tweeted out asking for help so hopefully someone will jump on this soon 😃

https://twitter.com/gatsbyjs/status/1027079401287102465

@bradydowling I’m looking at adding request retries with exponential backoff as well as adding an optional setting for max requests per second for cases where that doesn’t work well enough.

Partially; the Jobs API (#19831) should fix this caching problem.

So fwiw I replaced the got.stream() bits with a dumb raw downloader (string concatenation mangles binary data, but it’s fine for timing purposes):

    let r = ""
    console.time("$$ Fetch time for " + url)
    require("http").get(url, res =>
      res
        .on("data", m => (r += m))
        .on("end", () => {
          console.timeEnd("$$ Fetch time for " + url)
          resolve(r)
        })
    )
$ Started actually fetching http://topazandsapphire.com/wp-content/uploads/2016/05/IMG_5260.jpg
$$ Fetch time for http://topazandsapphire.com/wp-content/uploads/2016/09/TRAVEL-LEISURE-2-copy.png: 1003.535ms
$ Started actually fetching http://topazandsapphire.com/wp-content/uploads/2016/05/International-Travel-Topaz-Sapphire.png
$$ Fetch time for http://topazandsapphire.com/wp-content/uploads/2016/09/IMG_4606.jpg: 3174.126ms
$ Started actually fetching http://topazandsapphire.com/wp-content/uploads/2016/05/Brunch-Topaz-Sapphire-2.png
$$ Fetch time for http://topazandsapphire.com/wp-content/uploads/2016/09/IMG_4647.jpg: 9521.157ms
$ Started actually fetching http://topazandsapphire.com/wp-content/uploads/2016/05/IMG_6978.jpg
$$ Fetch time for http://topazandsapphire.com/wp-content/uploads/2016/05/International-Travel-Topaz-Sapphire.png: 3611.910ms

So yes, I’m pretty sure the long delays (in this case at least) are caused by download. So perhaps our best bet is to improve the feedback while waiting for a download 😃

Interesting. This is just a standard WordPress personal blog hosted on EC2 so it’s not like it’s a gigantic install. Perhaps this is because all these requests are overloading the host. Or, I’m no WordPress expert, but perhaps there’s some sort of standard WP rate limit on REST API calls that can happen? I’m also going with the assumption that this behavior isn’t unique to this site.

In addition to using GATSBY_CONCURRENT_DOWNLOAD=5, add the following code to your gatsby-node.js file

const ChildProcess = require('child_process')
const fs = require('fs-extra')
const path = require('path')

// Internationalization
exports.onPostBuild = () => {
  ChildProcess.execSync("ps aux | grep jest | grep -v grep | awk '{print $2}' | xargs kill")
  console.log('Copying locales')
  fs.copySync(path.join(__dirname, '/src/locales'), path.join(__dirname, '/public/locales'))
}

I also had the same issue. I resolved it with :

rm -r node_modules/ 
rm -r .cache
sudo chown -R login:login . 
fuser -k 8000/tcp 
yarn 
gatsby build
gatsby develop

Hope it can help

Hello! I’m experiencing the same issue with a source plugin I’m creating (unrelated to WordPress), when downloading 1000+ images from an API. It hangs almost always at the end of the process.

Setting GATSBY_CONCURRENT_DOWNLOAD didn’t solve it. I tried 50, 20, 5, no luck.

I get a collection of sizes from the API, and I was using the largest image, but changed it to the smallest one and doesn’t fix it either.

It’s hard to identify why it fails at this point, the only thing I get is source and transform nodes and then silence forever.

It would be awesome to have a debugging mechanism for this.

I had the same issue, stuck on “source and transform nodes”. After a lot of console.logs my problem ended up being time out issues with retrieving media files from wordpress. The problem wasn’t the server not being able to handle it, but rather cloudflare rate limiting and throwing timeouts after about 350 requests.

I bypassed cloudflare, went straight to the vps and I’m no longer seeing “source and transform nodes”, and my build finishes.

This was exactly my issue. Netlify was building very fast, less than 2 mins, with only about 30 posts and around 500 source images. Locally it wasn’t ever completing; simply switching the Cloudflare record to DNS only solved the issue immediately.

It seems to be related to massive images and slow internet connections. Netlify was able to build the site but my local connection was not, as it is only 1 MB/s download, which caused it to time out after 30s and fail on the large image.

@anagstef looks to be working much better! Thanks for the tip!

The output is very verbose when building with this version of got. Do you know if there’s any way to remove this?

When running gatsby develop, is there a way to keep local cache instead of fetching remote data each time the command is launched ?

@anagstef thanks very much for the tip! I’ll try this and report back.

@njmyers Sure!

A quick overview of what’s going on:

My website currently runs with ~1940 image files, maybe WordPress’s fault for creating multiple image files multiple times. If I use a vanilla gatsby-source-wordpress, the issue appears as expected (there’s a “vanilla” build I made yesterday evening on another build environment, which returns the same issue we’re discussing here; that build works and compiles when all the image files are returned). By using your plugin (replacing all the files inside node_modules/gatsby-source-wordpress; correct me if I’m wrong on this), gatsby develop returns the following:

TypeError: Cannot read property 'wordpress_parent' of undefined

  - normalize.js:287 entities.map.e
    [amazingtec]/[gatsby-source-wordpress]/normalize.js:287:11

  - Array.map

  - normalize.js:286 Object.exports.mapElementsToParent.entities [as mapElementsToParent]
    [amazingtec]/[gatsby-source-wordpress]/normalize.js:286:12

  - gatsby-node.js:134 Object.exports.sourceNodes
    [amazingtec]/[gatsby-source-wordpress]/gatsby-node.js:134:24


warning The gatsby-source-wordpress plugin has generated no Gatsby nodes. Do you need it?
success source and transform nodes — 299.757 s
success building schema — 10.192 s

After a quick while, it outputs:

'Cannot query field "allWordpressPage" on type "Query". Did you mean "allSitePage"?',
    locations: [ [Object] ] } ]
error UNHANDLED REJECTION

  TypeError: Cannot read property 'allWordpressPage' of undefined

  - gatsby-node.js:54 graphql.then.result
    C:/Projects/amztec-gtby/amazingtec/gatsby-node.js:54:36

PS: this was a vanilla build of gatsby-source-wordpress that was “converted” to yours by replacing the files, as I said above. I think the fact that it can’t query the pages is related to no nodes being generated. Also want to note that this build is identical to my vanilla one, which works when this issue doesn’t appear.

Also want to note that adding routes appears to cause the same initial problem for me (I wanted to avoid pages that aren’t related or that return errors due to multiple protection layers over WordPress). I just don’t know if the routes I’ve listed are correct, or if I’m missing something after.

I’m very happy with your reply, this issue is currently being a huge setback to my project and I’m glad that you’re still up on this issue. Thanks a lot!

Hi all. I updated the local plugin version that I was using for a site that had this issue. I think it’s a better implementation, as it uses ‘better-queue’ before ‘createRemoteFileNode’ and passes in the ‘concurrentRequests’ parameter. It’s a little bit redundant, as ‘createRemoteFileNode’ already uses a queue, but regardless this version seems to be working well with the recent Gatsby upgrades and gives feedback on the progress of the files. I will try to get a PR together for this. Sorry for the delays; I know I said I would work on this earlier but have been quite busy!

https://github.com/njmyers/byalejandradesign.com/blob/wordpress-plugin/packages/gatsby-source-wordpress/src/download-media-files.js
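For readers who don’t want to pull in better-queue, the core idea (a concurrency-capped task runner) can be sketched in plain JavaScript; this illustrates the concept only and is not the plugin’s implementation:

```javascript
// Sketch: run promise-returning tasks with at most `limit` in flight,
// the same throttling idea better-queue's `concurrent` option provides.
function runWithConcurrency(tasks, limit) {
  return new Promise((resolve, reject) => {
    const results = new Array(tasks.length)
    let next = 0
    let active = 0
    let finished = 0
    const startMore = () => {
      while (active < limit && next < tasks.length) {
        const i = next++
        active++
        tasks[i]().then(result => {
          results[i] = result
          active--
          finished++
          finished === tasks.length ? resolve(results) : startMore()
        }, reject)
      }
    }
    tasks.length === 0 ? resolve(results) : startMore()
  })
}
```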

Updating got solved all issues for me, too.

@dustinhorton Yes it should be sitting there for quite some time too if you have a lot of images. My fork will throw unhandled promise rejection if a remote file fails to download. That is why I would like to be able to have some mechanism to properly handle this scenario.

I think I read on another thread that there was talk of integrating some sort of progress manager, since this would provide feedback about plugin status.

If you look in your OS file system under project-root/.cache/gatsby-source-filesystem you should be able to see all the images that are being downloaded. In my case it is almost 400 images now, so it does take quite some time. However, before using my forked version, the plugin would silently fail on an error and then never progress, causing the issue where source and transform would run for hours…

Do you have a repo? I would love to be able to try it on another site as so far I have only tested it in a real life situation on my site.

Would a PR to only fix gatsby-source-wordpress be accepted, then extract the fix afterwards? Having trouble using @njmyers forked plugin as-is, and it seems like it’s a huge improvement.

It could be nice actually to extract out the file downloading piece to its own package that focuses on this problem of downloading and caching remote files.

I upgraded to “got”: “^9.2.2” and now it’s working, hooray!

Believe I’ve cleaned up all the 404s. Will try to build tonight. Thanks all.