prisma: Connection Limit Exceeded and CPU Burnout Issues

Bug description

As part of the team on a critical project, I am reporting two urgent issues that require immediate attention.

Connection Limit Exceeded: Our application has encountered instances where the connection limit has been exceeded, significantly impacting the user experience. It is crucial to address this matter urgently to ensure the seamless operation of the system.

CPU Burnout: Additionally, our server is experiencing CPU burnout, leading to performance degradation and potential downtime. We must promptly resolve this problem to ensure the stability and reliability of our project.

As we prepare for significant growth, it is essential to address these challenges to maintain a smooth user experience and ensure the overall success of the project. I will create detailed issue reports for each problem and keep everyone informed about the progress.

Technical Details: Our project uses Prisma ORM with a GraphQL API and is built on Node.js/TypeScript. We are running on GCP Kubernetes (k8s) with six pods, each with 24 GB of RAM and 8 CPUs. After careful consideration, we have set a connection pool limit of 16 with a 20-second timeout. Despite these settings, we are still exceeding the connection pool limit, leading to unresponsive queries and extended response times.
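For context, in Prisma these pool settings are passed as query parameters on the datasource URL; a minimal sketch with placeholder host and credentials (the real connection string is not part of this report):

# Placeholder host/credentials; connection_limit and pool_timeout mirror the values above (pool_timeout is in seconds)
DATABASE_URL="mysql://app_user:password@10.0.0.5:3306/app_db?connection_limit=16&pool_timeout=20"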

Additionally, we have noticed instances where the Database CPU experiences burnout, negatively affecting overall performance. We have attached graphs for reference.

In the past, we encountered heap memory issues, which improved after removing a large Promise.all call. However, there is still room for further stability improvements.
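For context on the kind of change involved, here is a minimal, illustrative sketch of running work in small batches instead of one unbounded Promise.all; the batch size and the worker callback are made up, not our actual code:

// Illustrative only: at most `batchSize` promises (and their result sets)
// are alive at any time, instead of one Promise.all over every item.
async function runInBatches<T, R>(
  items: T[],
  batchSize: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(worker))));
  }
  return results;
}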

We highly appreciate your immediate attention to these issues to ensure the smooth functioning of our project. Your support is vital as we work towards resolving these challenges.

[Screenshots attached: six database graphs, captured 2023-07-18 between 1:50 PM and 1:52 PM]

How to reproduce

NA

Expected behavior

  • The application should be able to handle an increasing number of users without encountering connection limit issues or experiencing CPU burnout.

Prisma information

schema.prisma

generator client {
  provider        = "prisma-client-js"
  previewFeatures = ["metrics"]
}

datasource db {
  provider = "mysql"
  url      = env("DATABASE_URL")
}

enum enumName {} 

model modelName {} 

...
...
...

model modelTwoName {}

utils.ts

import { PrismaClient, User } from '@prisma/client'; // User model type is referenced in Context below

export interface Context {
  user: () => Promise<User | null | undefined>
  prisma: PrismaClient
  request: any
}

...
...
...
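The PrismaClient instance itself is presumably created in the elided part of utils.ts. As a point of reference for the connection discussion below, the commonly recommended pattern is a single shared client per Node.js process; a minimal sketch (an illustration, not the elided code):

import { PrismaClient } from '@prisma/client';

// One PrismaClient means one connection pool per pod.
// Creating a new client per request would multiply pools and exhaust connections.
export const prisma = new PrismaClient();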

genre.ts

import { Context } from '../../utils'
import { GraphQLResolveInfo } from 'graphql'
import { genreReturnType } from '../ReturnType'
import { errorName } from '../../error/error'

export const genreMethods = {
    query: {
        async genre(_parent, args, ctx: Context, _info: GraphQLResolveInfo) {
            try {
                const item = await ctx.prisma.genre.findUnique({ ...args, include: genreReturnType })
                return item
            } catch (err) {
                console.log(err)
                throw err
            }
        },
    }
}

export default genreMethods

Environment & setup

  • Docker Image: node:16-alpine
  • Database: MySQL/CloudSQL
  • Node.js version: 16.16.0

Prisma Version

4.1.0

About this issue

  • Original URL
  • State: closed
  • Created a year ago
  • Comments: 16 (8 by maintainers)

Most upvoted comments

@janpio Thank you for the update. Yes, I understand that there might not be a straightforward solution for this heap issue, considering the various possible reasons behind it. It seems that one of the causes could be queries being overloaded with data to fetch.

Regarding the original problem, the situation is now generally under control. It's a positive outcome that the database CPU is no longer overwhelmed and the connection pool is handling the workload effectively.

With the successful resolution of the problem, I am going to close this issue with thanks 😃

We have an endpoint called ‘AllProjects’ that also retrieves the users associated with each project (a many-to-many relation). When the frontend calls this endpoint for 20 projects, it currently generates 66 underlying queries to retrieve users, which involve many-to-many joins with the ‘genre’ and ‘musical instrument’ tables. As a result, 5000 calls to the ‘AllProjects’ endpoint in an hour lead to 64,000-74,000 calls to the ‘users’ endpoint, and this number grows proportionally for further queries. The cause is our use of Prisma's ‘include’ feature in the GraphQL resolvers: the more data we include, the more queries are generated. We are currently optimizing this to reduce the query load as much as possible.

Fewer queries returning less data is always better, so really make sure you need all of that data (all of the fields being returned, not just the rows), and then make sure all the remaining queries are fast by using indexes. Any query you can avoid cannot cause a problem, and any query that returns less will allow your servers to handle more requests.
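As an illustration of that advice (model and field names here are hypothetical, not taken from the actual schema), a resolver can ask Prisma for only the columns the frontend needs via a nested select instead of a broad include:

import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// Hypothetical 'AllProjects'-style query: two columns per project and two
// columns per related user, instead of include-ing whole related rows.
async function listProjects() {
  return prisma.project.findMany({
    take: 20,
    select: {
      id: true,
      name: true,
      users: {
        select: { id: true, displayName: true },
      },
    },
  });
}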

[Screenshot: query overview, 2023-07-21 at 4:37 PM]

This new screenshot of the queries being executed already looks much better. The throughput of queries in the connection pool should be seriously unblocked with that, as they are all much faster than before.


Now, this is interesting. We are not entirely sure what will happen if we increase the connection pool size, as there is no clear documentation available on the topic. Currently, we have set the maximum connections to 1024, and it seems to be working fine. However, I would like to understand what would be an optimal size for a large pool of connections.

That is one of the big questions of database usage - and it depends on a million different considerations.

Generally, we recommend starting with a pretty low number (hence our really conservative default) and only playing with it if the traffic volume you get cannot be handled - as is the case for you! Then you can increase it step by step and observe what that does to your app. With a bigger pool, Prisma can open and then use more connections at a time. This comes at a cost - opening a connection is slow, and managing connections takes CPU and memory - so you have to watch the resource usage of your pods. A higher connection pool limit/size should lead to more resources being used. But when those connections are open, you have more “pipes” through which you can execute queries, which makes a “connection pool timeout” less likely.
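One way to observe what a bigger pool does to the app is the metrics preview feature that is already enabled in the schema above; a minimal sketch (the interval and logging are illustrative, and exact gauge names can vary between Prisma versions):

import { PrismaClient } from '@prisma/client';

const prisma = new PrismaClient();

// Illustrative: dump Prisma's connection-pool gauges (open/busy/idle connections,
// query wait times) once a minute while experimenting with connection_limit.
setInterval(async () => {
  const metrics = await prisma.$metrics.json();
  console.log(JSON.stringify(metrics.gauges, null, 2));
}, 60_000);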

Can we use Prisma DataProxy with a long-running server instead of a serverless setup, and would it be beneficial for performance?

You can absolutely use it with a long-running server as well. But since your database can handle a lot of connections, and your application servers also seem to be pretty happy and underutilized, I would not necessarily recommend that route right now. I think you can get this under control with simpler measures.


max_execution_time seems to be very helpful.

It's one of those things that can really protect you from some bad queries (like the ones you had in the overview before) killing your whole application. It's like a kill switch that prevents bad queries from causing too much trouble. (Unless of course these queries are the core functionality of your app - but then you are motivated to fix them quickly as well 😆)

Let me share an interesting thing with you: we had a conversation with the Google team, and they mentioned that all options are available to us in MySQL. However, we discovered that query caching is only accessible to Enterprise Plus users (we are on Enterprise, not Enterprise Plus). So instead we updated our database version from 8.0.26 to 8.0.35, and to our surprise, everything started functioning smoothly. CPU usage dropped to just 0.5%, and query response times improved to milliseconds. Honestly, I can't pinpoint the exact reason for this improvement, but we should investigate further to understand the underlying cause. It's essential to figure it out, as the issue could come back in the future.

Wow, that is pretty amazing. We would be really interested to hear from you when/if you figure out what was going on with 8.0.26. Mayyyybe some index was accidentally not being used before and this got fixed - that would be the simplest explanation really. For now I am just happy for you! 🥳

You see, 5000 project queries triggered 79000 user queries. We have also asked our frontend team to analyze this in detail. Additionally, we are currently working on setting up New Relic for our backend on Kubernetes (k8s). [Screenshots: query overview, 2023-07-21 at 4:37 PM and 4:38 PM]

Both very good ideas.

(From my own experience: Be careful with New Relic, using an APM like that is usually so impressive that you will want to use it on every project in the future and optimize all the things it points out 😄. Flipside is that the cost can be pretty heavy sometimes.)

Some questions for clarification:

Is this normal production traffic or some load testing?

All the graphs you shared are from the database, correct? What does CPU look like on the pods/servers?

Connection Limit Exceeded: Our application has encountered instances where the connection limit has been exceeded, significantly impacting the user experience. It is crucial to address this matter urgently to ensure the seamless operation of the system.

What kind of connection limit exactly? From Prisma or your database? What error message exactly did you get?

CPU Burnout: Additionally, our server is experiencing CPU burnout, leading to performance degradation and potential downtime. We must promptly resolve this problem to ensure the stability and reliability of our project.

Do I understand correctly that you have 1 database server, and that the available resources on that one are being fully utilized while you are running your application? How many connections and queries should that database server be able to handle?

Our project uses Prisma ORM with a GraphQL API and is built on Node.js/TypeScript. We are running on GCP Kubernetes (k8s) with six pods, each with 24 GB of RAM and 8 CPUs. After careful consideration, we have set a connection pool limit of 16 with a 20-second timeout.

So you are running 6 “servers” that are each running your application? The connection pool of Prisma is set to 16 connections (with 20 second timeout) on each of those “servers”?

What does the database set up and configuration look like? What CPU and memory? How many connections does your database server allow at a time?

Despite these settings, we are still exceeding the connection pool limit, leading to unresponsive queries and extended response times.

What exactly do you mean by that? The same “connection limit exceeded” or something else? What is the exact error message you are getting?

Additionally, we have noticed instances where the Database CPU experiences burnout, negatively affecting overall performance. We have attached graphs for reference.

Is this the same “CPU Burnout” described above or something else?

In the past, we encountered heap memory issues, which improved after removing a large Promise.all call. However, there is still room for further stability improvements.

How many promises/queries were you trying to execute in the past? How are you running them now instead?

[Screenshot: query overview table, 2023-07-18 at 1:50 PM]

If I read this table correctly, you have a small number of queries that produce most of the load on your system. They all run for a full second or (a lot!) longer. They are frequently called, and return a lot of rows.

Do you know which code triggers these queries? Do these queries have appropriate indexes defined or do they really have to be that slow? Can you share the relevant schema, Prisma Client query and then the full generated SQL query for any query >1s? Do they really need to return that many rows? Could these query results maybe be cached, even in a local variable? That would avoid having to run that query that often and in a stateful Node server would be very efficient - no external caching tool required.
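To illustrate the "cached in a local variable" idea, a minimal sketch with a made-up 60-second freshness window (the genre list is just an example of a small, rarely changing result):

import { PrismaClient } from '@prisma/client';

// Illustrative in-process cache: the first call runs the Prisma query,
// later calls within the TTL reuse the result instead of hitting MySQL.
let cachedGenres: { data: unknown[]; fetchedAt: number } | null = null;
const TTL_MS = 60_000; // hypothetical 60-second freshness window

export async function getAllGenres(prisma: PrismaClient) {
  if (cachedGenres && Date.now() - cachedGenres.fetchedAt < TTL_MS) {
    return cachedGenres.data;
  }
  const data = await prisma.genre.findMany();
  cachedGenres = { data, fetchedAt: Date.now() };
  return data;
}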

Finally, I agree with everything @Jolg42 wrote. The old version especially makes this hard, as we have certainly improved a lot of things around whatever you are doing with Prisma Client in your app over the last year. And running EXPLAIN on those queries to understand why they are so slow makes a lot of sense.

Hi @kashif-trailfive, I noticed you are using a relatively old version; 4.1.0 is now 1 year old. Did you consider upgrading, maybe? (Are there any blockers to upgrading the Prisma version?)

On the performance side, we introduced jsonProtocol in 4.11.0 and made it generally available in 5.0.0, many users sent us positive feedback, though this is not about database performance.

I also noticed that your top SELECT query returns on average 10k rows. You could try running these queries with EXPLAIN and see what is slow and maybe discover a missing index https://dev.mysql.com/doc/refman/8.0/en/explain.html

Thank you for your suggestion, which was very helpful. Yes, I think the table index somehow got blocked or encountered an issue, but now it’s back in action. We will investigate the indexing in more detail, as we are currently relying on manual indexes in Prisma.
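For reference, a manual index in a Prisma schema is declared with @@index; a minimal, hypothetical sketch (model and field names are illustrative, not from the actual schema):

model Project {
  id        Int      @id @default(autoincrement())
  name      String
  createdAt DateTime @default(now())

  // Composite index for a frequent filter + sort on (name, createdAt)
  @@index([name, createdAt])
}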

@janpio Thanks for your reply, it’s very helpful!


If I was you @kashif-trailfive, I would do 3 things now:

  1. Really focus on understanding the top queries (or even just start with the top one) and make them disappear from that list. I can't think of any reason why a query would need to return on average >10,000 results and take on average 12 seconds to do so - and still be useful in an app.

    Can you share the full SQL of those top queries? Usually it is possible to work back from a SQL query to the Prisma Client query that generates it. From the leading underscore of the table _UserFavoriteGenre I can already tell that it is part of an implicit many-to-many relation in Prisma - indeed the one between User and Genre you shared above (see the schema sketch after this list). Why would such a query ever return >10,000 results? Do you know where you run a Prisma Client query that queries that relation?

  2. Your database connections are bored. CPU and memory of your pods also seem to be bored. On the other hand, you are getting connection pool timeouts. That error indicates a capacity and saturation problem: the queries an application tries to perform cannot be performed in the available time (the pool timeout), so you get an error. This happens very quickly if you have some queries that are very slow, as they “hog” connections for a long time, the queue gets too long - and the error explodes in your face.

    What happens if you double or triple the connection limit of each pod? If CPU and memory are fine, you should be able to use a lot more of those available 10,000 connections from your database! (Unfortunately, the slow queries will probably still take over all connections in the connection pool sooner or later, so you might need to increase further - or do the following.)

  3. What happens if you configure your database server to kill queries that take longer than xxx ms? That could ensure that no query can hog any connection for too long (MySQL has a max_execution_time setting for this). Note that this will of course make parts of your application explode - but then you can figure out where exactly that happens and why those queries are so slow.
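For context on the _UserFavoriteGenre table mentioned in point 1, this is roughly the kind of schema that produces it: an implicit many-to-many relation named "UserFavoriteGenre" (the fields shown are illustrative, not the actual schema):

model User {
  id             Int     @id @default(autoincrement())
  // Implicit many-to-many relation; Prisma manages a hidden join table
  // named after the relation with a leading underscore: _UserFavoriteGenre
  favoriteGenres Genre[] @relation("UserFavoriteGenre")
}

model Genre {
  id    Int    @id @default(autoincrement())
  name  String
  users User[] @relation("UserFavoriteGenre")
}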

So to recap, in our opinion you have some bad database queries that kill your database CPU, and also block the Prisma connection pool and lead to timeouts there.

Note: A colleague mentioned that you could check Query Insights to see the CPU consumption of a specific query: https://cloud.google.com/sql/docs/mysql/using-query-insights#specific-query