powertools-lambda-python: Bug: AWS Powertools Logger does not display utf-8 encoded (e.g. Japanese) characters consistently with print() or native logging library

Expected Behaviour

{"level":"INFO","location":"<module>:8","message":"スコビルデモ2","timestamp":"2023-12-08 15:21:28,502+0000","service":"service_undefined"}
スコビルデモ2
INFO:root:スコビルデモ2

Current Behaviour

{"level":"INFO","location":"<module>:8","message":"\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12","timestamp":"2023-12-08 15:21:28,502+0000","service":"service_undefined"}
スコビルデモ2
INFO:root:スコビルデモ2

Code snippet

import logging

from aws_lambda_powertools import Logger

japanese_string = "\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12"

# Powertools Logger: message comes out \u-escaped
logger = Logger()
logger.info(japanese_string)

# print(): characters display correctly
print(japanese_string)

# Native logging: characters also display correctly
logging.basicConfig(level=logging.INFO)
logging.info(japanese_string)

Possible Solution

Use a custom formatter via Logger(logger_formatter=formatter). I'd need to do some research into what the correct formatter would be to fix this issue, but something like the following does seem to work:

from typing import Any

from aws_lambda_powertools.logging.formatter import LambdaPowertoolsFormatter


class LocalPlainFormatter(LambdaPowertoolsFormatter):
    # pylint: disable = no-self-use, too-few-public-methods
    """Converts the JSON structured data to a plain logging format."""

    def serialize(self, log: dict[Any, Any]) -> str:
        """Converts the JSON structured data to a plain logging format.

        :param log: A JSON log record.
        :return: Plain logging format version of the input `log`.
        """
        timestamp = log["timestamp"]
        levelname = log["level"]
        func_name_location = log["location"]
        message = log["message"]
        exception_name = log.get("exception_name", "")
        traceback = log.get("exception", "")
        local_context = log.get("local_context", "")
        if exception_name or traceback:
            exception_info = f" - {exception_name} - {traceback}"
        else:
            exception_info = ""

        return f"{levelname} — {timestamp} —  {func_name_location} — {message}{local_context}" \
               f"{exception_info}"

This can work around the issue, at the expense of losing the structured logging.

I also tried:

import json

def my_serializer(x):
    return json.dumps(x, ensure_ascii=False)

logger = Logger(json_serializer=my_serializer)

which I thought might fix the issue, but no matter what I pass to json_serializer I get the same output.
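For context, the escaping itself comes straight from json.dumps, which escapes non-ASCII characters by default; a minimal stdlib illustration (independent of Powertools):

```python
import json

japanese_string = "スコビルデモ2"

# Default: non-ASCII characters are escaped to \uXXXX sequences
print(json.dumps({"message": japanese_string}))
# {"message": "\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12"}

# With ensure_ascii=False the characters pass through unescaped
print(json.dumps({"message": japanese_string}, ensure_ascii=False))
# {"message": "スコビルデモ2"}
```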

Steps to Reproduce

  • aws_lambda_powertools-2.29.0
  • python 3.11

Powertools for AWS Lambda (Python) version

latest

AWS Lambda function runtime

3.10

Packaging format used

Lambda Layers

Debugging logs

No response

About this issue

  • Original URL
  • State: closed
  • Created 7 months ago
  • Comments: 21 (15 by maintainers)

Most upvoted comments

@bml1g12 could you try the latest version and let us know, please? 2.29.1 on pip install and Lambda Layer version 53

  • arn:aws:lambda:{region}:017000801446:layer:AWSLambdaPowertoolsPythonV2:53
  • arn:aws:lambda:{region}:017000801446:layer:AWSLambdaPowertoolsPythonV2-Arm64:53

@bml1g12 Verdict – we’re going with option 3… changing the default. We did our due diligence as much as possible and consider this to be a long standing bug.

The main reasons (but not limited to these) why we've decided to flip the default:

  1. ensure_ascii=False only affects non-ASCII characters (~99% of the use cases at the moment), and there is no performance impact when toggling it. Python 2 vs Python 3 would have been a different story (unicode obj vs str).
  2. There isn't any performance overhead in toggling this feature (hyperfine test), largely because it skips the checks typically done to detect non-ASCII characters that need to be escaped.
  3. Log analytics tools don’t offer auto-deserialization of encoded non-ascii characters, making search troublesome at best. This blocks customers from using Logger not only with CloudWatch Logs but also Datadog and others. This alone is a big reason to change it.
  4. TypeError shared earlier is related to a non-JSON serializable type which is already handled today via the default method.
  5. We’ve added an extra test to cover both all non-ascii characters (3.7+) and edge cases like field names being non-ascii.
  6. Our destination is standard output, not HTTP requests, which makes it less likely to break any deserialization on the consumer side. Quite the opposite: the current default creates a usability issue today (point 3). To be extra sure, we also checked that Loguru (a popular logging framework) uses ensure_ascii=False as its default too.
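Points 1 and 2 can be sanity-checked with a quick stdlib micro-benchmark (this is not the maintainers' hyperfine run, just an illustrative timeit sketch with a made-up payload):

```python
import json
import timeit

# Hypothetical log record with repeated non-ASCII content
payload = {"message": "スコビルデモ2" * 10, "level": "INFO"}

# Time both serialization modes over many iterations
t_escaped = timeit.timeit(lambda: json.dumps(payload), number=20_000)
t_raw = timeit.timeit(lambda: json.dumps(payload, ensure_ascii=False), number=20_000)

print(f"ensure_ascii=True:  {t_escaped:.3f}s")
print(f"ensure_ascii=False: {t_raw:.3f}s")
```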

Great, thank you @bml1g12 for gathering everyone’s consensus (say hello to team on our behalf!) – We’re running some additional tests and benchmark today for option 3.

Did additional research over the weekend in the CPython code, Loguru, the Requests library, the old CPython bug tracker, and the Python 2 vs Python 3 migration changes, to get more comfortable with option 3. If our benchmark and additional tests work out nicely today, we'd be comfortable with the risk and would provide a quick way to roll back if we must; if we do need to roll back, we'll make it the default in v3.

I think I found a solution, but we'd need some form of mutation testing to put my mind at ease: using ensure_ascii=False plus our internal JSON Encoder default method, possibly combined with a singledispatch.
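A minimal sketch of what that combination might look like, assuming a type-dispatched fallback for non-JSON-serializable values (names like _json_default and serialize are hypothetical, not Powertools internals):

```python
import json
from datetime import datetime
from functools import singledispatch


@singledispatch
def _json_default(obj):
    # Generic fallback for any type json cannot serialize natively
    return str(obj)


@_json_default.register
def _(obj: set):
    # Sets are not JSON serializable; emit them as sorted lists
    return sorted(obj)


@_json_default.register
def _(obj: datetime):
    # Datetimes become ISO 8601 strings
    return obj.isoformat()


def serialize(log: dict) -> str:
    # ensure_ascii=False keeps non-ASCII characters readable,
    # while default= handles non-serializable types per registered type
    return json.dumps(log, ensure_ascii=False, default=_json_default)


print(serialize({"message": {"スコビルデモ2"}}))
# {"message": ["スコビルデモ2"]}
```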

I’ll ping Nate from the SDK team who’s got experience with this ~6 years ago in the requests module back in the days.

That should solve the edge cases since we’ve had this working for ~4 years and only discovered it now.

On Fri, 8 Dec 2023 at 20:57, Benjamin Lowe @.***> wrote:

Just to note, probably needs no explanation, but I gave the example snippet as japanese_string = "\u30b9\u30b3\u30d3\u30eb\u30c7\u30e2\uff12", but it's the same result if we provide japanese_string = "スコビルデモ2", which of course would be the more likely example (users providing strings naturally in their native language for logging).

is someone intentionally creating a Unicode-encoded log line to export it maybe?

But as you say, it's possible some users are already relying on the default behaviour intentionally to output Unicode-escaped strings. I think it's unlikely, but it's definitely possible.

It seems like a breaking change to me, so in my mind it would need to be 1 or 2. Thanks for the invite to discuss; we use this library a lot in our company (great work, team!), so I'll discuss with my colleagues on Monday to see if they have any thoughts on the best way to proceed, and get back to you via this issue. Async is probably easiest for me.

I'm also a little unclear why:

japanese_set = {"スコビルデモ2"}
logger = Logger()
logger.info(japanese_set)

Does not throw an error when

json.dumps(japanese_set)

Does?
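For what it's worth, plain json.dumps raises because sets are not JSON serializable, and the difference disappears as soon as a default= fallback is supplied, which Powertools' formatter presumably does internally (an assumption here, not confirmed from the source):

```python
import json

japanese_set = {"スコビルデモ2"}

# Plain json.dumps has no fallback for sets, so it raises TypeError
try:
    json.dumps(japanese_set)
except TypeError as exc:
    print(exc)  # Object of type set is not JSON serializable

# With a default= fallback the set is coerced to a string instead of raising
print(json.dumps(japanese_set, default=str))
```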

— Reply to this email directly or view it on GitHub: https://github.com/aws-powertools/powertools-lambda-python/issues/3474#issuecomment-1847769131

@leandrodamascena got to jump on a meeting I can’t reschedule, could you take over, pretty please? Need to do some extensive exploratory testing besides what we have just in case.

There could be data loss when converting to str and then reading it back too… hmmm, tricky. If we can't be 99% sure, I'd add it under a flag and document the behaviour.