azure-functions-host: Redesign how we log worker process errors

In the current implementation of how we handle worker process errors, we log a message as an error based on if the message contains the words “error”, “fail” or “severe”, this is fragile and does not cover all cases on when an issue has occurred i.e. exceptions etc.

# Method: ParseErrorMessageAndLog
 if ((msg.IndexOf("error", StringComparison.OrdinalIgnoreCase) > -1) ||
     (msg.IndexOf("fail", StringComparison.OrdinalIgnoreCase) > -1) ||
     (msg.IndexOf("severe", StringComparison.OrdinalIgnoreCase) > -1))
{
    LogError(msg);
}

Link to code

It’s worth noting that this method (ParseErrorMessageAndLog) is only used by the error callback:

private void OnErrorDataReceived(object sender, DataReceivedEventArgs e)
{
    if (e.Data != null)
    {
        ParseErrorMessageAndLog(e.Data);
    }
}

Maybe it is safe to assume that all messages coming from the worker through the error event listener Process.ErrorDataReceived += (sender, e) => OnErrorDataReceived(sender, e); should be logged as errors, or at least warnings?

Following tasks were identified as next steps to resolve this workitem

  • Function host: Review the option of exposing UseStdErrorStreamForErrorsOnly setting to users under a more descriptive app setting value
  • Extension Authors: Look into the possibility of intercepting the stdError stream, identify log level through language specific constructs and pass that info via a log prefix.
    • Investigation task on dotnet worker (#1116)
  • Identify what logs we are missing by introducing worker faults at different points of execution

About this issue

  • Original URL
  • State: open
  • Created 2 years ago
  • Comments: 17 (17 by maintainers)

Most upvoted comments

We rarely use the stdout/stderr streams in the first place - we prioritize sending logs over the rpc channel. That being said, users can write directly to stdout/stderr themselves, so I would be hesitant to change this behavior (worker logs should be prefixed with “LanguageWorkerConsoleLog” and user logs are not).

Fwiw, I think the existing default is the “correct” behavior, although perhaps not the most intuitive behavior. Stderr should be assumed to be any diagnostic data, not just error data. I would advocate for improving the error detection (add exception to the list?) and somehow pressuring both users & workers to send proper rpc messages instead where they have to explicitly choose the log level

Have spoken to several language owners on this topic and it appears that this will be beneficial in catching issues that have been missed in our kusto logs. Will need to have a further discussion on usage of RpcLog vs stderr/stdout. We also discussed changing this to be a capability instead of a worker.config setting.

@fabiocav if we feel this is a breaking change, we can punt this down the road, otherwise we can push to next sprint to start making changes.

A few references pointing to stderr being used for diagnostic data, although I agree the name is terrible:

This came up a lot when I worked on VS Code tooling that referenced a cli (usually core tools). Here is theoretical output for the command func --version:

WARNING: You are on an old version of core tools. Please upgrade now %%Cool Functions lightning bolt Logo%% INFO: We are collecting telemetry. Use a setting to turn it off 1.0.0

It’s really important that “1.0.0” is the only output that goes to “stdout” for anyone parsing the result of func --version (aka our VS Code extension). That means all the other stuff has to go to “stderr” even though not all of it is an error. This situation is the main reason that many cli’s use “stderr” for diagnostic data because they want to provide parse-able output and diagnostic data and you need two different streams to do that. I understand the host doesn’t fit neatly into this “parseable output for a cli” scenario, but just explaining why I think using stderr for diagnostic data is a lot more common that you’d think based on the name.

It doesn’t look like any of the language workers are utilizing this in their worker.config files. The default value for UseStdErrorStreamForErrorsOnly is false, meaning logs from stderr are written with info level by default unless the message contains the keywords “error” , “severe” or “fail”, then the logs are written with error level.

@pragnagopa from my understanding, Java is the only language that writes to the stderr stream for all msgs and so only Java should have UseStdErrorStreamForErrorsOnly == false and it should be true for all other languages?

cc: @ejizba, @Francisco-Gamino, @kaibocai, @kshyju, @vrdmr - FYI, this might be impacting how errors from workers are logged and depending on logging behaviour, you maybe need to update your worker.config file

Example

Current behaviour -> an error in the process logs as info

info

Behaviour with UseStdErrorStreamForErrorsOnly set to true

error

Ah, that’s great! We’ll validate and close this issue once we do. We’ll work with the language worker owners to update this where appropriate.

I just realized - we did add the capability - https://github.com/Azure/azure-functions-host/blob/dev/src/WebJobs.Script/Workers/Rpc/RpcWorkerDescription.cs#L82 - so worker should be able to specify if stderr stream should be considered as error logs only - @liliankasem / @fabiocav - please give that a try