azure-functions-durable-python: `context.current_utc_datetime` intermittently evaluates to `None`
To be honest, I’m not too sure how to replicate this reliably, so at this stage I’m mostly looking for some help on how to nail this down.
I’m currently facing an issue where, in deployed durable functions (I don’t really see this locally), `context.current_utc_datetime` would sometimes evaluate to `None`. I have many places in the orchestrator where I record timestamps to a database entry, so this is something I use often. The failure can happen at any one of the many evaluations of `context.current_utc_datetime` throughout the run, and I can’t find any rhyme or reason as to what causes it.
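Roughly, the pattern looks like the sketch below. The activity name and payload are made up for illustration; the real point is that each timestamp comes from `context.current_utc_datetime` and is handed to an activity that writes it to the database entry:

```python
import azure.durable_functions as df


def orchestrator_function(context: df.DurableOrchestrationContext):
    # The orchestration clock; this is the value that intermittently
    # evaluates to None on the deployed app.
    started_at = context.current_utc_datetime

    # "RecordTimestamp" is a hypothetical activity that persists the
    # timestamp to the database entry.
    yield context.call_activity("RecordTimestamp", started_at.isoformat())

    # ...several more reads of context.current_utc_datetime during the run...


main = df.Orchestrator.create(orchestrator_function)
```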
I’ve noticed that repeatedly evaluating `context.current_utc_datetime` in a loop would eventually return a valid timestamp, so I’ve monkey-patched this hack into the orchestrator:
```python
from datetime import datetime
from time import sleep

# Defined inside the orchestrator function so that `context` is in scope.
def get_current_utc_datetime() -> datetime:
    curr_time = context.current_utc_datetime
    while not curr_time:  # keep polling until it stops evaluating to None
        sleep(0.05)
        curr_time = context.current_utc_datetime
    return curr_time
```
However, today I’m seeing several orchestrator runs that just block on this hacky loop, seemingly without end, for over 30 minutes, which is much longer than I thought an orchestrator is allowed to run. I’ve had to manually stop the entire deployed Function App in order to try again with another orchestrator instance, but so far every attempt has hit the same problem.
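Not a fix, just a more defensive sketch of the same workaround: here the context is passed in explicitly and there is an arbitrary 10-second deadline, so the helper raises instead of blocking the orchestrator run indefinitely.

```python
from datetime import datetime
from time import monotonic, sleep

import azure.durable_functions as df


def get_current_utc_datetime(
    context: df.DurableOrchestrationContext, timeout_s: float = 10.0
) -> datetime:
    # Same polling workaround as above, but give up after `timeout_s`
    # seconds (an arbitrary deadline) instead of blocking forever.
    deadline = monotonic() + timeout_s
    curr_time = context.current_utc_datetime
    while not curr_time:
        if monotonic() > deadline:
            raise RuntimeError(
                f"current_utc_datetime still None after {timeout_s:.1f}s"
            )
        sleep(0.05)
        curr_time = context.current_utc_datetime
    return curr_time
```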
This is all with beta 11, within the last couple of weeks.
About this issue
- State: closed
- Created 4 years ago
- Comments: 15
@timtylin,
Our whole team is fairly active on our GitHub repos, so your feedback about @davidmrdavid’s work on this issue is noted 🥇.
It sounds like we should close this issue, but I would recommend opening a separate issue for those weird gaps you are seeing, and we can take a look at those. We should have our internal telemetry all wired up now, so if you give us a timestamp, and ideally the orchestration instance ID, for those weird gaps (along with as much information about the orchestration as you feel comfortable sharing publicly), that would help us diagnose that issue and see if there are some easy tweaks in the meantime.
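For reference, both the instance ID and the orchestration clock are available on the context object, so a guarded log line along these lines is usually enough for us to find the run in telemetry (the activity name and message format below are just illustrative):

```python
import logging

import azure.durable_functions as df


def orchestrator_function(context: df.DurableOrchestrationContext):
    # is_replaying keeps the line from being logged again on every replay;
    # instance_id and current_utc_datetime are properties of the context.
    if not context.is_replaying:
        logging.info(
            "orchestration %s at %s: about to write to CosmosDB",
            context.instance_id,
            context.current_utc_datetime,
        )
    # "WriteToCosmos" stands in for the activity that does the single write.
    yield context.call_activity("WriteToCosmos", {"example": "payload"})


main = df.Orchestrator.create(orchestrator_function)
```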
Hi @davidmrdavid
I’ve been stress-testing over the weekend and so far I haven’t seen it return `None`, so I’m happy to say that this no longer happens. I do wonder if it has uncovered some other underlying issue, as I’m still seeing some unexplained long gaps (>100s) between successive timestamps, and the only thing in between is an Activity that does a single CosmosDB write. At first I thought it was a concurrency issue (I’ve set max Activity concurrency to 1), but this is the only orchestrator running at that time, so I’m still a bit puzzled as to how these delays happen.
Thank you very much for resolving the original issue with such a quick turnaround. I just wish there were some way for me to leave you a great internal review 👏