azure-functions-python-worker: MemoryErrors Began Recently, Breaking our Python Function App

Investigative information

Timestamp: 2019/04/26 @ 3pm
Function App name: [PRIVATE] Contact if needed.
Function name(s) (as appropriate): [PRIVATE]
Core Tools version: 2.6.1048

Function App Overview

We built a system that processes audio files. Each input file is a compressed format (m4a and mp3 are the most common). The main processing function uses ffmpeg, which is included as a static binary, to decompress the audio file into raw samples for processing. While the system was able to handle unpacking 90mins of audio, we found that the processing time involved for complete unpacking was unreasonable, causing the entire function to take ~4mins to process. Running this process locally showed that the Python process was responsible for allocating a maximum of ~1.5-2GB of memory.

We were able to reduce this dramatically by streaming the decompression process. However, the streaming had some processing overhead such that “shorter” audio files actually took longer to process overall with the streaming approach. We determined that ~10mins in length would be a fine length to switch modes.

To make the most of each Function App instance, we make use of the Python multiprocessing module (via ProcessPoolExecutor). We use a single Process for processing audio data with ffmpeg (which itself is multi-threaded). We then use numCPUs - 1 Processes to handle downloading audio files to prepare for the processing phase. In this way, we can download multiple audio files to the function at a time while one audio file is being processed. Attempting to process multiple files simultaneously results in worse performance overall (we suspect that the multiple instances of ffmpeg have threads that fight each other for time on the CPU cores).

Things Broke On Us

This system worked with no issues as of March 24th, 2019: we could queue up as many audio files of whatever size we liked and the system would churn through them. Happy with the system, we moved on to another task. As our Function App is on a Linux Consumption Plan, we left everything “as-is”, expecting to return to it at a later date for integration into a larger system.

That later date was yesterday and we were surprised to find that something changed in the intervening month and broke our Function App. All of a sudden our Function App would stop responding to requests. Looking into the Monitor and Live Metrics Stream views, we noticed several things:

Monitor:
1. Doesn’t show all logs anymore.
Live Metrics Stream:
1. Only shows “1 Server Online”.
2. Server ID in the Sample Telemetry stream is “@(none)”.
3. The “Servers” section at the bottom shows no useful information (and only ever one server instance).

Previously we could rely on the Monitor section to show us logs of both successful and unsuccessful runs of the functions in our Function App. This is no longer the case.

Previously we could use the Live Metrics Stream to show when more than one server instance had spawned to handle the request load. Unless that capability is currently gone, this feature is broken.

What We’ve Discovered

Luckily for us, the Live Metrics Stream appears to work and provided us with a sampling of log messages that we could use to attempt to identify the source of the issue. By watching this stream and inspecting the Errors, we found the following:

The first error to occur is always caused by a MemoryError Exception in Python. This comes from a call to np.ascontiguousarray which is used to convert the audio samples from 16bit integers to 32bit float values. For large audio files, the array size here can be hundreds of MiBs in size (if not over 1GiB).
Every subsequent run of the function fails to start the ffmpeg process. It looks like the system is getting an OSError, indicating that it was “trying to execute a non-existent file.”
Once the Function App was shut down (from a timeout) and restarted by adding a request, the instances would again be able to access ffmpeg and process happily (when provided audio files at “the right size”).
If lots of audio file processing requests hit the Function App at once and the system breaks due to the MemoryError, then ~10mins from the break the timeout will occur and messages may begin to process again (usually succeeding without issue).

Certain audio file sizes can trigger the MemoryError to occur 100% of the time. In some circumstances, it seems that smaller audio files can trigger the MemoryError, though it is likely related to the memory state caused by simultaneously downloading files.

Questions

This leads us to the following questions:

What changed with memory handling in Function Apps [possibly Python-specific] to cause this issue?
How do we fix/work around this issue?

Repro steps

Reproduction is proprietary, but we have two ways to do it:

Unpack a 10min long audio file into a function with 32bit samples (stereo). (Specifically: with ffmpeg via the audioread module.)
Use our Function App to process a queue of multiple (~50) smaller (~3:30) audio files.

Expected behavior

The Function App continues to process audio files without interruption.

Actual behavior

The Function App fails due to a MemoryError and suddenly goes into a “Broken” state where the included ffmpeg static binary “goes missing”.

Known workarounds

No known workarounds at this time.

Related information

Triggers/Bindings Used:

HTTP - Used to add requests to the processing Queue.
Queue Storage - Used to trigger audio file processing.

About this issue

Original URL
State: closed
Created 5 years ago
Reactions: 3
Comments: 15 (1 by maintainers)

Most upvoted comments

Update - The updated memory config is available in East Asia now. It will be deployed to rest of the regions tomorrow.

Currently consumption sku doesn’t support > 1.5 GB workloads. App service plan function apps should though…

balag0 on May 15, 2019

We expect to have an additional ~600MB available to the container.

Please notify us, @balag0, when the memory allowance has been increased to 1.5 GB and Azure Functions becomes usable for us again. Or why not make it 3 GB, like λ has? 😉

j08lue on May 13, 2019

We expect to have an additional ~600MB available to the container.

I will try to add a wiki page to the host repo soon once I have more information. The limits will be configured such that the function app will be able to use all of 1.5 GB (minus whatever the functions runtime takes). We will account for the 400MB and adjust the overall limit accordingly if required.

The limit of 1.5 GB is just for the commit size which is the same limit as what we already have on Windows consumption sku. Is your concern that there is 0 swap space available? Even if we enabled swap space the overall limit of 1.5 GB will apply for that also.

On windows skus apps should have atleast 500MB of temp storage available to them as documented at https://github.com/projectkudu/kudu/wiki/Understanding-the-Azure-App-Service-file-system You can expect to see the same limits for Linux also.

balag0 on May 1, 2019

Thanks for the detailed information @ericdrobinson (got the name right this time 😃 ). This was very helpful.

Turns out this was due to a bad configuration setting on our side which limited the amount of memory allocated to the container. Unfortunately it would be sometime next time before we can fix this.

Once the configuration is fixed, you should start seeing more available memory on function startup.

As for the memory usage being so high at startup, I see the following values with a different (may be simpler) function app Proc Mem: rss=33562624, vms=664834048 So it is possible additional memory usage you see is due to the nature of the function app itself.

Even with paged memory, the app still cannot exceed the 1.5 GB limit. So swap memory being 0 shouldn’t matter as long as you get 1.5 GB (less memory taken by Functions runtime process).

balag0 on May 1, 2019

I’m back with more information.

I decided to take a look at what’s going on with the memory in the function environment. I added the psutil module and wrote a simple informational function:

def log_context_stats():
    """
    Retrieves and logs certain stats about the operating context for
    this process.
    """

    proc = psutil.Process(os.getpid())
    mem_info = proc.memory_info()

    stats = f"""Function Invocation Stats:
            Num CPUs: {os.cpu_count()}
            Virt Mem: {psutil.virtual_memory()}
            Swap Mem: {psutil.swap_memory()}
            Proc Mem: rss={mem_info.rss}, vms={mem_info.vms}
            """

    logging.info(stats)

I then called this function in two locations:

Immediately upon function invocation.
Immediately after catching the initial MemoryError.

Here is the output* from location 1:

Function Invocation Stats:
Num CPUs: 2
Virt Mem: svmem(
    total=1,039,429,632,
    available=240,713,728,
    percent=76.8,
    used=641,794,048,
    free=75,522,048,
    active=683,995,136,
    inactive=177,491,968,
    buffers=15,876,096,
    cached=306,237,440,
    shared=253,952,
    slab=67,407,872)
Swap Mem: sswap(total=0, used=0, free=0, percent=0, sin=0, sout=0)
Proc Mem:
    rss=144,080,896,
    vms=1,122,570,240

And here is the output* from location 2:

Function Invocation Stats:
Num CPUs: 2
Virt Mem: svmem(
    total=1,039,429,632, 
    available=39,194,624, 
    percent=96.2, 
    used=893,681,664, 
    free=79,187,968, 
    active=860,884,992, 
    inactive=22,732,800, 
    buffers=413,696, 
    cached=66,146,304, 
    shared=253,952, 
    slab=40,701,952) 
Swap Mem: sswap(total=0, used=0, free=0, percent=0, sin=0, sout=0) 
Proc Mem: 
    rss=304,037,888, 
    vms=1,570,119,680

(* Newlines and thousands separators added to numbers for readability.)

First, I should note that the swap_memory is showing 0 for everything. This might be due to the fact that the function is run in a Docker container and special settings are needed to access that information in these environments. Alternatively, it could be that someone turned off swap.

Second, what I found extremely startling is that a fresh run of the function showed only ~240MB of virtual_memory available. When the MemoryError was hit, the available memory had dropped by almost exactly 200MB (to ~39MB). Similarly, the process’ virtual memory starts at ~1.1GB and is already at ~1.57GB by the time the app runs out of memory.

It is perhaps valuable to recall at this point that the audio file being processed is just over the 100MiB line that causes the memory error to trigger when fully decompressed. I have no idea if the audio file is just large enough to run us out of available memory with array allocations/reallocations or if there is something otherwise special about the ~100MiB number.

Regardless, there is much about this state of affairs that is disconcerting.

Why is there only ~240MB available when the function starts up fresh?
Perhaps more telling/pertinent, is the process’ virtual memory. Why does it start at over 1GB?
Is swap still enabled? Even if we ran out of RAM, I would expect the process to slow down as it shuffled memory over to disk (of which there should be plenty of space). I had previously assumed that it was swap memory that saved us with our massive files back in March.

While having no swap memory might help explain the situation we’re experiencing, it is pretty shocking to see such high memory usage numbers on a clean start of a relatively simple function app instance.

I hope this information is helpful in tracking down the source of the issue.

ericdrobinson on Apr 30, 2019