python-o365: oauth token race condition?

I have run into a fairly interesting race condition problem using oauth tokens stored on the filesystem. At least it has the hallmarks of a race condition…

I use O365 in a (Docker) containerized application I’ve developed that manages file encryption & SFTP transfers for my company. My application runs various download & upload jobs daily via cron, and sends out status e-mails when the jobs have completed. I have wrapped O365 in a separate module I created that takes care of all its initialization and allows my other modules to easily send preset e-mails depending on the job. Each job calls its own instance of this O365 wrapper.

I have 2 jobs that run at the same time in the morning. The jobs take slightly different amounts of time to complete, differing on the order of milliseconds. When they're done, they both call the O365 wrapper, which initializes O365 and sends the desired e-mails. By the time these jobs run, the last token was used over 60 minutes prior, so O365 has to fetch new OAuth tokens through Graph.

Here's where things get funky. Since the calls to O365 come within a few milliseconds of one another, the first call doesn't have time to fetch and save a new token before the second call checks for one. So both O365 calls think they need to get new tokens, which they both do (successfully). Both jobs then send a POST request to the Graph sendMail endpoint with my e-mail content, and after about 20 seconds I get a 503 Service Unavailable error on both requests.

To make this a bit easier to understand, here is a very rough, simplified & scrubbed version of my logs showing what's happening:

2019-11-06 07:00:01,236 - job1 - INFO - Job 1 Finishes
2019-11-06 07:00:01,238 - O365.connection:_internal_request:618 - INFO - Requesting (POST) URL: https://graph.microsoft.com/v1.0/users/myemail@myemail.com/sendMail
2019-11-06 07:00:01,238 - O365.connection:_internal_request:619 - INFO - Request parameters: {...job 1 graph json...}
2019-11-06 07:00:01,238 - O365.connection:_internal_request:634 - INFO - Oauth Token is expired, fetching a new token
2019-11-06 07:00:01,246 - job2 - INFO - Job 2 Finishes
2019-11-06 07:00:01,248 - O365.connection:_internal_request:618 - INFO - Requesting (POST) URL: https://graph.microsoft.com/v1.0/users/myemail@myemail.com/sendMail
2019-11-06 07:00:01,248 - O365.connection:_internal_request:619 - INFO - Request parameters: {...job 2 graph json...}
2019-11-06 07:00:01,248 - O365.connection:_internal_request:634 - INFO - Oauth Token is expired, fetching a new token
2019-11-06 07:00:01,669 - O365.connection:_internal_request:637 - INFO - New oauth token fetched
2019-11-06 07:00:01,669 - O365.connection:_internal_request:618 - INFO - Requesting (POST) URL: https://graph.microsoft.com/v1.0/users/myemail@myemail.com/sendMail
2019-11-06 07:00:01,669 - O365.connection:_internal_request:619 - INFO - Request parameters: {...job 1 graph json...}
2019-11-06 07:00:01,734 - O365.connection:_internal_request:637 - INFO - New oauth token fetched
2019-11-06 07:00:01,734 - O365.connection:_internal_request:618 - INFO - Requesting (POST) URL: https://graph.microsoft.com/v1.0/users/myemail@myemail.com/sendMail
2019-11-06 07:00:01,735 - O365.connection:_internal_request:619 - INFO - Request parameters: {...job 2 graph json...}
2019-11-06 07:00:24,354 - O365.connection:_internal_request:662 - DEBUG - Server Error: 503 Server Error: Service Unavailable for url: https://graph.microsoft.com/v1.0/users/myemail@myemail.com/sendMail
2019-11-06 07:00:24,582 - O365.connection:_internal_request:662 - DEBUG - Server Error: 503 Server Error: Service Unavailable for url: https://graph.microsoft.com/v1.0/users/myemail@myemail.com/sendMail

I suspect the issue is that the first job sends its sendMail POST while the Graph server is still handling the second job's request for a new token. The MS documentation says a 503 means the service is unavailable or busy, but I'm not positive that's the cause; the MS servers are definitely not down. I wanted to see if you had encountered this or something similar before, and had any suggestions.
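One mitigation I've been considering, independent of the token race itself: since 503 is documented as a transient "busy" response, the sendMail POST could be retried with exponential backoff. A rough sketch of that idea (`post_with_retry` and the `send` callable are hypothetical placeholders I made up, not part of the O365 API):

```python
import random
import time


def post_with_retry(send, retries=3, base_delay=1.0):
    """Call send() (a callable returning an HTTP status code) and
    retry transient 503 responses with exponential backoff + jitter."""
    for attempt in range(retries + 1):
        status = send()
        if status != 503 or attempt == retries:
            # Success, a non-retryable status, or out of attempts.
            return status
        # Back off base_delay, 2*base_delay, 4*base_delay, ... with jitter
        # so two jobs retrying at once don't collide again.
        time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

This wouldn't fix the duplicate token fetch, but it would stop a single busy 503 from failing the status e-mail outright.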

The easiest solution is to just have the jobs run a minute apart, which gives enough time to fetch and read new tokens. Or I thought of maybe using a token file name unique to each job? Currently, I use the same token file for everything.
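A third option I've sketched out: serialize the refresh itself with an exclusive lock file next to the token file, so only one process can be fetching a token at a time. This is a minimal sketch under my own assumptions (the paths and the `refresh` callable are hypothetical, and it doesn't handle a stale lock left by a crashed process):

```python
import os
import time

TOKEN_PATH = "o365_token.txt"          # assumed shared token file
LOCK_PATH = TOKEN_PATH + ".lock"       # sentinel lock file next to it


def with_token_lock(refresh, timeout=30.0):
    """Run refresh() while holding an exclusive cross-process lock.

    O_CREAT | O_EXCL makes the lock-file creation atomic: if the file
    already exists, another process holds the lock, so we wait.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError("could not acquire token lock")
            time.sleep(0.05)
    try:
        return refresh()
    finally:
        os.close(fd)
        os.unlink(LOCK_PATH)
```

The second job would block briefly on the lock, then could re-read the token file and find the fresh token the first job just saved instead of fetching its own.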

Any input would be appreciated. Thanks!

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 20 (20 by maintainers)

Most upvoted comments

Pushed the discussed token.py changes to #356 and removed changes to Connection