fastapi: Gunicorn on Google Cloud Run gets a 504 error status (Upstream Request Timeout)
Why does my Gunicorn service always return a 504 error status code when I open my Cloud Run URL for the first time (it takes about 15 seconds), after which the URL opens without any error? And after I leave it without traffic for about 30-60 minutes, it returns the 504 error again. Is my Gunicorn dead/shut down? When I check my Cloud Run log, Gunicorn logs a "Shutting down" message, so I think it was stopped. I need to keep my Gunicorn always on; how can I do that?
At startup my code needs to load machine learning models: each pickle file is about 100 MB, and in my case I load 6 pickle files (600 MB+). I use FastAPI for my API code.
This is how I load my pickles:
# Load all models
@app.on_event("startup")
async def load_model():
    # Path files
    pathfile_model = os.path.join("modules", "model/")
    pathfile_data = os.path.join("modules", "data/")
    start_time = time.time()
    # Load models
    usedcar.price_engine_4w = {}
    top5_brand = ["honda", "toyota", "nissan", "suzuki", "daihatsu"]
    for i in top5_brand:
        with open(pathfile_model + f'{i}_all_in_one.pkl', 'rb') as file:
            usedcar.price_engine_4w[i] = pickle.load(file)
    with open(pathfile_model + 'ex_Top5_all_in_one.pkl', 'rb') as file:
        usedcar.price_engine_4w['non'] = pickle.load(file)
    # Load dataset match
    with open(pathfile_data + settings.DATA_LIST) as path:
        usedcar.list_match_seva = pd.read_csv(path)
    elapsed_time = time.time() - start_time
    print("======================================")
    print("INFO : Model loaded Succesfully")
    print("MODEL :", usedcar.price_engine_4w)
    print("ELAPSED MODEL TIME : ", elapsed_time)
This is how my main.py runs:
if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080, log_level="info", loop="asyncio")
This is my Dockerfile :
FROM python:3.8-slim-buster
RUN apt-get update --fix-missing
RUN DEBIAN_FRONTEND=noninteractive apt-get install -y libgl1-mesa-dev python3-pip git
RUN mkdir /usr/src/app
WORKDIR /usr/src/app
COPY ./requirements.txt /usr/src/app/requirements.txt
RUN pip3 install -U setuptools
RUN pip3 install --upgrade pip
RUN pip3 install -r ./requirements.txt --use-feature=2020-resolver
COPY . /usr/src/app
CMD exec gunicorn --bind :8080 --workers 2 --threads 4 main:app --worker-class uvicorn.workers.UvicornH11Worker --preload --timeout 60 --worker-tmp-dir /dev/shm
These are my requirements for uvicorn and gunicorn:
fastapi
fastapi-utils
uvicorn[standard]
gunicorn
This is my Cloud Run Log :
2021-02-15 14:31:54.346 WIT [2021-02-15 07:31:54 +0000] [1] [INFO] Handling signal: term
2021-02-15 14:31:54.385 WIT [2021-02-15 07:31:54 +0000] [11] [INFO] Shutting down
2021-02-15 14:31:54.386 WIT [2021-02-15 07:31:54 +0000] [12] [INFO] Shutting down
2021-02-15 14:31:54.486 WIT [2021-02-15 07:31:54 +0000] [11] [INFO] Waiting for application shutdown.
2021-02-15 14:31:54.486 WIT [2021-02-15 07:31:54 +0000] [11] [INFO] Application shutdown complete.
2021-02-15 14:31:54.486 WIT [2021-02-15 07:31:54 +0000] [12] [INFO] Waiting for application shutdown.
2021-02-15 14:31:54.486 WIT [2021-02-15 07:31:54 +0000] [11] [INFO] Finished server process [11]
2021-02-15 14:31:54.487 WIT [2021-02-15 07:31:54 +0000] [11] [INFO] Worker exiting (pid: 11)
2021-02-15 14:31:54.487 WIT ======================================
2021-02-15 14:31:54.487 WIT INFO : Model loaded Succesfully
2021-02-15 14:31:54.487 WIT ELAPSED MODEL TIME : 13.514873743057251
2021-02-15 14:31:54.487 WIT INFO : Master Data Updated Succesfully
2021-02-15 14:31:54.487 WIT ELAPSED DATABASE TIME : 0.5247213840484619
2021-02-15 14:31:54.487 WIT ======================================
2021-02-15 14:31:54.487 WIT [2021-02-15 07:31:54 +0000] [12] [INFO] Application shutdown complete.
2021-02-15 14:31:54.487 WIT [2021-02-15 07:31:54 +0000] [12] [INFO] Finished server process [12]
2021-02-15 14:31:54.487 WIT [2021-02-15 07:31:54 +0000] [12] [INFO] Worker exiting (pid: 12)
As we can see from my Cloud Run log, my Gunicorn was shut down suddenly.
And this is my error:
[Screenshot: 504 Upstream Request Timeout error page]
After I looked around, I tried a few things:
- --worker-tmp-dir /dev/shm: I added this because I thought my Docker container might be blocking the workers, so this option should rule out blocking from the container, but it still returns a 504 status. (Source 1, Source 2)
- --preload: I used this because I thought Cloud Run needed to save some RAM to start Gunicorn faster, so that if Gunicorn shuts down, the page would load faster on the next request, but it still has no effect. (Source)
- workers=2, threads=4, graceful_timeout=100: but it still makes my Cloud Run shut down.
Thank you
About this issue
- Original URL
- State: closed
- Created 3 years ago
- Comments: 20 (9 by maintainers)
Glad to read that, good job 👍
@frankie567 Hello frankie, I'm so sorry for my late update. So, this is my update:
- Load your model lazily in a dependency: I've done this, so my script now loads the model only when it's needed, and it was very useful. HTTP requests are much faster than before, and it avoids the 504 error too (surprisingly 😄).
- Use joblib instead of pickle: I've done this, and the result was unexpected: my model now loads faster than with pickle. (It's crazy 🤣)
- About maximum_instances: I've set my maximum instances to 50. From the story that you shared, that was so crazy lol, $72,000 is way too much; I can't even imagine that amount of money.
Thank you so much @frankie567
I'm not a specialist, but here are two things worth trying:
Load your model lazily in a dependency
Instead of loading your model at startup, try to load it on the first prediction query. You can wrap this in a FastAPI dependency:
I've replaced your actual loading logic with time.sleep to simulate loading time. The first prediction will be slow because it has to load the model for the first time, but subsequent predictions will be faster because the model stays in memory (until the container turns idle). Container startup is now instantaneous, because it has nothing to do.
Notice that I've defined the dependency as a synchronous function (not async). Since your loading logic performs blocking I/O, it's better to define it like this, because FastAPI will then run it in the external threadpool (doc: https://fastapi.tiangolo.com/async/?h=techn#path-operation-functions). It means it won't block your main loop while the model is loading.
Use joblib instead of pickle
joblib is a very good library for persisting objects to disk. It may prove more efficient than standard pickle, especially for objects containing large NumPy arrays.
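As a sketch (the file name and the dummy model holding a large array are illustrative), swapping pickle for joblib only changes the dump/load calls:

```python
import os
import tempfile

import joblib
import numpy as np

# Dummy stand-in for a trained model holding large numeric arrays.
model = {"weights": np.arange(1_000_000, dtype=np.float64)}

# joblib's dump/load mirror pickle's API but are optimized for objects
# containing big NumPy arrays; compress trades a little CPU for file size.
path = os.path.join(tempfile.gettempdir(), "model.joblib")
joblib.dump(model, path, compress=3)
restored = joblib.load(path)
```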
About maximum_instances
Cloud Run has an autoscaling feature. It means that if you experience very high traffic for any reason (or a bug in your code causes the server to loop), Google will create new containers until maximum_instances is reached. With a limit of 1000, you can run up a bill of thousands of dollars without even noticing! Relevant story about this: https://www.theregister.com/2020/12/10/google_cloud_over_run/
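If you deploy with the gcloud CLI, capping autoscaling is a single flag (SERVICE_NAME and the value of 50 are illustrative):

```shell
# Cap the number of containers Cloud Run may create for this service.
gcloud run services update SERVICE_NAME --max-instances=50
```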
Could you give us your Cloud Run configuration (or the command you use to deploy your service)? By default, Cloud Run instances start with 256 MB of RAM, so given the size of your models, I suspect you're running out of memory.
@frankie567 No problem frankie, thank you so much for answering all of my questions. I really appreciate it. I will close my question now.
Sorry, I saw your question and then forgot to answer.
Basically, Cloud Run scales by creating new instances based on the number of incoming requests. So if you experience a traffic spike, it'll be able to handle it.
However, creating an instance takes time (a "cold start"), which can add latency before it can answer requests. If you set minimum_instances to 4/10/X, there will always be 4/10/X containers ready to serve requests even if there are no requests to handle. Of course, you are billed for those instances. Unless you have a service with very high traffic, or you expect an enormous traffic spike because a TV show will talk about your company, I don't think this option will be helpful for you.
Official docs: https://cloud.google.com/run/docs/configuring/min-instances
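Assuming the gcloud CLI, keeping warm instances around is again one flag (SERVICE_NAME and the count are illustrative; remember these instances are billed even when idle):

```shell
# Keep one container warm at all times to avoid cold starts.
gcloud run services update SERVICE_NAME --min-instances=1
```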
@rudi101101 So, how did it go? 🙂
With Gunicorn/Uvicorn, the first prediction triggers the model load. Be careful with the timeout if you choose this approach.
@frankie567 God thank you Frankie, I will try it first. Soon, I will give you the result.
Well, theoretically, even after being shut down because it didn't receive traffic for a certain period of time, it should cold-start and run the startup event again without any issue.
Now, I don’t really understand why you get a timeout error on subsequent start. Random thought: make sure you don’t have any open resources or background tasks pending in your router that could prevent a proper shutdown of the container.
Yes, you could set a minimum instance count, but it obviously incurs costs (by the way, you really should set the maximum instances parameter if you don't want very bad surprises; the default is 1000 🤯).
Yes, the default is 5 minutes. Should be sufficient for you but, you know, computers 😅
Just to be sure we understand what's happening:
Is that so?
Thanks! May I suggest increasing the Cloud Run request timeout:
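Assuming the gcloud CLI, that would look something like this (SERVICE_NAME is a placeholder; the value is in seconds, and 300 seconds is the default):

```shell
# Raise the per-request timeout so slow first requests can finish
# before Cloud Run gives up and returns a 504.
gcloud run services update SERVICE_NAME --timeout=900
```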
It clearly states that a 504 error is triggered when the timeout is reached.