next.js: Next.js application running in a Google Cloud Run instance shows intermittent slow responses, from 30 seconds up to more than 1 minute

Verify canary release

  • I verified that the issue exists in the latest Next.js canary release

Provide environment information

Operating System:
      Platform: darwin
      Arch: arm64
      Version: Darwin Kernel Version 22.3.0: Mon Jan 30 20:38:37 PST 2023; root:xnu-8792.81.3~2/RELEASE_ARM64_T6000
    Binaries:
      Node: 16.18.1
      npm: 8.19.2
      Yarn: N/A
      pnpm: N/A
    Relevant packages:
      next: 12.3.0
      eslint-config-next: 12.3.0
      react: 18.2.0
      react-dom: 18.2.0

Which area(s) of Next.js are affected? (leave empty if unsure)

Data fetching (gS(S)P, getInitialProps), Middleware / Edge (API routes, runtime)

Link to the code that reproduces this issue

https://codesandbox.io/p/sandbox/eager-gagarin-9quspu?file=%2Fpages%2Fironsession.ts&selection=[{"endColumn"%3A20%2C"endLineNumber"%3A8%2C"startColumn"%3A20%2C"startLineNumber"%3A8}]

To Reproduce

  1. Deploy a Next.js application to Google Cloud Run.

  2. Send requests to the application and observe the response time.

Describe the Bug

I have a Next.js application running in a Google Cloud Run instance, and I’m experiencing intermittent slow responses. Sometimes the response time is around 30 seconds, but other times it can be more than 1 minute. This behaviour is not consistent and happens randomly. Right now, the only known fact is that while traffic is constant the instance responds very quickly, but whenever traffic drops off, the issue resurfaces.

The problem never happens when executing the same Docker image from a local workstation.

I have reached out to Google Cloud support, and they have informed me that the latencies are originating from the code execution.

Environment:

•	Next.js version: 12.3.0
•	Operating system:  node:16-alpine
•	Deployment platform: Google Cloud Run

Dockerfile:

# Install dependencies only when needed
FROM node:16-alpine AS deps

# Install Alpine packages
# To understand why libc6-compat might be needed, check:
# https://github.com/nodejs/docker-node/tree/b4117f9333da4138b03a546ec926ef50a31506c3#nodealpine
RUN apk add --no-cache libc6-compat

# Working directory
WORKDIR /app

# Install dependencies based on the preferred package manager
# With more than one source file, the destination must be a directory ending in "/"
COPY package.json package-lock.json ./
RUN npm ci

################################################################################
# Rebuild the source code only when needed
FROM node:16-alpine AS builder

# Working directory
WORKDIR /app

# Copy from dependencies prepared files
COPY --from=deps /app/node_modules ./node_modules

# Copy application files
COPY . .

# If using npm comment out above and use below instead
RUN npm run build

################################################################################
# Production image, copy all the files and run next
FROM node:16-alpine AS runner

# Image Build arguments
# ARG takes its default value as NAME=value
ARG GIT_HASH='NA'
ARG GIT_BRANCH='NA'

# Add system user
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs

# Copy from builder prepared files
# Public files
COPY --from=builder /app/public /app/public
# Automatically leverage output traces to reduce image size
# https://nextjs.org/docs/advanced-features/output-file-tracing
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone /app/
COPY --from=builder --chown=nextjs:nodejs /app/.next/static /app/.next/static

# User to run all commands as
USER nextjs

# Working directory
WORKDIR /app

# Environment variables
ENV NODE_ENV production
ENV PORT 3000
# Next.js collects completely anonymous telemetry data about general usage.
# Learn more here: https://nextjs.org/telemetry.
ENV NEXT_TELEMETRY_DISABLED 0
ENV BACKEND_URL https://api.example.com
ENV BROWSER none
ENV FAST_REFRESH true
# ... some other sensitive env vars here ...

# Exposed ports
EXPOSE 3000

# Container command
CMD ["node", "server.js"]

Additional information: It’s worth noting that I’m not doing any API requests in the backend of my Next.js application. I’m using Iron Session for authentication, as recommended in the Next.js documentation (link: https://nextjs.org/docs/authentication).

Any help or guidance on how to resolve this issue, debugging, or possible root causes would be greatly appreciated. Thank you!

Expected Behavior

The application should respond in a timely manner, with response times consistently under 10 seconds.

Which browser are you using? (if relevant)

No response

How are you deploying your application? (if relevant)

Google Cloud Run

About this issue

  • State: open
  • Created a year ago
  • Comments: 24 (9 by maintainers)

Most upvoted comments

After a ton of debugging, it turned out to be a disk I/O-heavy operation at the start of the service.

Please explain… Does Next do a disk operation at server start? (Like creating files or something like that?)

It was specific to my website because all my blog entries are in markdown that needs to be read and parsed.

@Fredkiss3 Here is the minimal project to reproduce the problem: https://github.com/MiltiadisKoutsokeras/nextjs_issue_48539

You can check the test application deployed in Google Cloud Run here.

@Fredkiss3 I will post it here next week as I am on the road now. Thank you.

UPDATE: We have raised the issue on Google Cloud support and it is escalated. I will notify on the conclusion.

The current status is the following:

  1. Cloud Run is an auto-scaling container runtime and can launch multiple instances of the same Docker container to serve a user session. The only limits are the min/max number of instances you configure: min controls how many instances stay “hot” waiting for HTTP requests, and max sets the upper limit.
  2. We have seen in the logs that a single HTTP index page request is not served by a single Cloud Run instance. The resources the index page needs may be served by multiple Cloud Run instances. This is confirmed by the logs, where each request is mapped to an instance UUID.
  3. Splitting a single browser request for the index page across multiple instances has the side effect of paying the application startup delay MULTIPLE times, and things get even worse if the instances come from containers launched because of the request (application startup + container delay).
  4. As I have shown above, a single local Docker container can have a delay of up to 5 seconds or more. In the Cloud Run environment this can be multiplied across the launched instances, raising the index page response delay to 1 minute or more.
  5. The delay between calling node server.js and executing the first line of the application’s TypeScript code accounts for the majority of the total delay. I have tried creating a minimal Next app without any extra resources (images, CSS, etc.), and it can still take up to 2 seconds between launching the node server and taking control of the execution in TypeScript.
  6. A single instance seems to serve any request after the application launch within milliseconds.
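
One way to check points 2, 3 and 6 from the application side is to tag every log line with a per-process ID and a request counter, so the logs show which requests were the first ones served by a freshly started instance. The following is only a rough sketch: `createInstanceTracker`, `RequestTrace` and `traceRequest` are hypothetical names, not part of Next.js or Cloud Run, and the ID is a random string rather than the instance UUID that Cloud Run reports.

```typescript
// Hypothetical helper; nothing here is a Next.js or Cloud Run API.
interface RequestTrace {
  instanceId: string;        // random ID generated once per process
  requestNumber: number;     // 1 for the first request this process serves
  coldStart: boolean;        // true only for that first request
  msSinceModuleLoad: number; // time elapsed since this module was loaded
}

// The clock is injectable so the helper can be exercised without waiting.
function createInstanceTracker(now: () => number = Date.now): () => RequestTrace {
  const instanceId = Math.random().toString(36).slice(2, 10);
  const loadedAt = now();
  let served = 0;

  return function trace(): RequestTrace {
    served += 1;
    return {
      instanceId,
      requestNumber: served,
      coldStart: served === 1,
      msSinceModuleLoad: now() - loadedAt,
    };
  };
}

// Create the tracker once at module scope so it lives as long as the process.
const traceRequest = createInstanceTracker();

// Inside a request handler (assumed), log one line per request:
console.log(JSON.stringify(traceRequest()));
```

If the slow responses line up with `coldStart: true` entries from several different `instanceId` values, that would support the theory that one page load pays the startup delay on more than one instance.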

Any advice on how to bring down the delay between launching the Docker container and running the application code is highly appreciated. It is a black box for us.

I am astonished by how a simple and small in scope application can be so slow under cloud environments. 😦

It does not seem I can attach it directly, give me some moments to upload it to a network share.

I’m gonna sound stupid, but what about console.log or console.time (& console.timeEnd) to validate your hypothesis?
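
A minimal sketch of that idea, assuming the slow part can be wrapped in a function: `timed` and `loadStartupData` are hypothetical names, and `loadStartupData` is just a stand-in for whatever the application does before it can serve requests. In Node.js, `performance.now()` counts milliseconds from the start of the process, so logging it on the first line of application code gives a rough figure for the gap after `node server.js`.

```typescript
// Rough figure for the time between process start and this line (Node.js).
console.log(`ms since process start: ${performance.now().toFixed(0)}`);

// Wrap any synchronous step in console.time / console.timeEnd.
// The finally block makes sure the timer is printed even if fn throws.
function timed<T>(label: string, fn: () => T): T {
  console.time(label);
  try {
    return fn();
  } finally {
    console.timeEnd(label);
  }
}

// Hypothetical stand-in for startup work (reading files, parsing, etc.).
function loadStartupData(): number {
  let sum = 0;
  for (let i = 0; i < 1_000_000; i += 1) {
    sum += i % 7;
  }
  return sum;
}

const startupResult = timed("startup: loadStartupData", loadStartupData);
console.log(`startup step finished, result = ${startupResult}`);
```

The timer output goes to stdout, so it should show up in the container logs next to the request entries.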

Since the issue only happens on GCP it is kinda hard to help you debug because not everyone can easily deploy to GCP.