cloudquery: Investigation request: Investigate Azure plugin high memory usage
We got reports of syncing the `azure_compute_skus` compute table failing even with 8GB of memory and a reduced concurrency of 1000.
We should investigate whether this can be improved. One thing to note is that the Azure client pre-initializes some data needed for multiplexing, such as `SubscriptionsObjects`, `ResourceGroups` and `registeredNamespaces`.
Example config:
```yaml
kind: source
spec:
  name: "azure"
  path: "cloudquery/azure"
  version: "v6.0.0"
  destinations: ["postgres"]
  tables: ["azure_compute_skus"]
```
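The reduced concurrency mentioned in the report is the top-level `concurrency` option of the source spec; a minimal sketch of the same config with it set (1000 is the value from the report above, not a recommendation):

```yaml
kind: source
spec:
  name: "azure"
  path: "cloudquery/azure"
  version: "v6.0.0"
  destinations: ["postgres"]
  tables: ["azure_compute_skus"]
  # Scheduler concurrency for the source; 1000 is the reduced value from the report above.
  concurrency: 1000
```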
About this issue
- State: closed
- Created a year ago
- Reactions: 4
- Comments: 24 (15 by maintainers)
Commits related to this issue
- fix(website): Use lower concurrency setting in Azure configuration (#9339). Summary: Follow up on https://github.com... — committed to cloudquery/cloudquery by erezrokah a year ago
- fix(azure): Recommend service principal authentication method (#9355). Summary: Related to https://github.com/cloudquery/cloudquery/issues/9269. This is pending users confirmation that cha... — committed to cloudquery/cloudquery by erezrokah a year ago
- fix(azure): Remove redundant `SingleSubscriptionMultiplex` (#9365). Summary: Somehow related to https://github.com/cloudquery/cloudquery/issues/9269 as it should save some memory and perfor... — committed to cloudquery/cloudquery by erezrokah a year ago
- feat(azure-auth): Log `DefaultAzureCredential` credentials errors (#9363). Summary: Related to https://github.com/cloudquery/cloudquery/issues/9269. When we create the Azure credentials vi... — committed to cloudquery/cloudquery by erezrokah a year ago
- feat(azure): Discover resource groups and namespaces in parallel (#9382). Summary: Related to https://github.com/cloudquery/cloudquery/issues/9269. I'm trying to reproduce https://github.co... — committed to cloudquery/cloudquery by erezrokah a year ago
- feat(website-azure): Update Azure docs to skip some tables, remove concurrency (#9424). Summary: A follow up to https://github.com/cloudquery/cloudquery/pull/9355, and related to https://gi... — committed to cloudquery/cloudquery by erezrokah a year ago
- feat(azure): Add discovery concurrency setting (#9811). Summary: This is meant to address https://github.com/cloudquery/cloudquery/issues/9269#issuecomment-1485086439. On accounts with man... — committed to cloudquery/cloudquery by erezrokah a year ago
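For the last commit above (#9811), the setting presumably lives in the plugin-specific part of the spec; a hedged sketch, assuming it is exposed as a `discovery_concurrency` option (the name and value here are assumptions based on the PR title, so check the Azure plugin docs for a release that includes #9811):

```yaml
kind: source
spec:
  name: "azure"
  path: "cloudquery/azure"
  version: "v6.1.0" # example; use a release that includes #9811
  destinations: ["postgres"]
  tables: ["azure_compute_skus"]
  spec:
    # Assumed plugin-level option from #9811; verify the exact name and default in the docs.
    discovery_concurrency: 100
```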
That’s great news @cberge908, happy to hear that. I think I’ll close this issue then, as our next action item is waiting for the Azure team to fix https://github.com/Azure/azure-sdk-for-go/issues/20470 (hopefully soon) and https://github.com/Azure/azure-sdk-for-go/issues/19356 (probably not soon).
Interesting. The Azure SDK has default retry logic that includes 503 status codes, so it should handle such intermittent errors. If it happens again can you open an issue? We can override the default retry logic and maybe add longer timeouts.
Please comment if you have any more issues regarding memory and I can re-open.
Hi @erezrokah - sorry for the late reply, I just returned to the office after the Easter season. We’ve done another run with the new version and 5 jobs in parallel (with 1 table in each job). We’re now close to 2k subscriptions, and all 5 jobs completed successfully within 5-10 minutes. None of them aborted prematurely. I have not read all the logs in detail, but a quick spot-check did not surface any 403s.
Only one job had to be repeated; this was probably due to an error on the MS side (503, Service Unavailable). On the second attempt it completed successfully as well.
Update 2
Using the second authentication method described here https://www.cloudquery.io/docs/plugins/sources/azure/overview#authentication-with-environment-variables seems to reduce memory usage significantly and allows the Azure plugin to finish syncing with the default concurrency on our internal account. For people subscribed to this issue, it would be great if you could test it as well.
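For reference, that second method amounts to providing service-principal credentials through the standard Azure identity environment variables before running the sync (the values below are placeholders):

```
AZURE_TENANT_ID=<tenant-id>
AZURE_CLIENT_ID=<service-principal-client-id>
AZURE_CLIENT_SECRET=<client-secret>
```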
Happy Easter from our group!
Also, happy Easter 🥚 🎉
Thanks @cberge908, that’s very helpful. Sorry for the late reply, I was out of office for a couple of days. I’ll run my tests with a much higher number of subscriptions and see where that lands me.
Hi @erezrokah - we had 5 parallel jobs, each querying all subscriptions (~1.8k). Every job was limited to one table and skipped all dependent tables. When limiting the subscriptions per job, everything went smoothly and the table contents were fetched within a few seconds. But without limiting to a few subscriptions, the 403 error occurred after roughly 30s.
Interestingly, if you run only one job with one table for all subscriptions, writing starts fairly quickly, you don’t see any 403s at all, and the job completes successfully.
When the jobs that reported the 403 error are run limited to specific subscriptions, they all complete within a few seconds. So it still looks like the error only occurs when no subscription limit is set and more than one such job runs in parallel.
Hope that helps to sum it all up.
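For anyone wanting to reproduce the per-job subscription limiting described above, a sketch assuming the plugin exposes a `subscriptions` list in its nested spec (check the Azure plugin docs for the exact field name); the IDs are placeholders:

```yaml
kind: source
spec:
  name: "azure"
  path: "cloudquery/azure"
  version: "v6.0.0"
  destinations: ["postgres"]
  tables: ["azure_compute_skus"]
  spec:
    # Limit this job to a handful of subscription IDs (placeholder values).
    subscriptions:
      - "00000000-0000-0000-0000-000000000001"
      - "00000000-0000-0000-0000-000000000002"
```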
The Azure team confirmed it; see https://github.com/Azure/azure-sdk-for-go/issues/20470#issuecomment-1485809476.
(I still don’t have any updates on the 403 errors.)
I would also expect to get an HTTP 429 when hitting a rate limit, instead of the 403. Maybe the high volume of API requests is doing strange things to the Azure API 😉
I’ve done a test limiting all 5 jobs to only 10 subscriptions; they all succeed within 30-40s. That’s a good indicator that it might be related to the high volume we’re querying in parallel.
Hey @erezrokah - we had a chance to test v6.1.0 last night and the performance of the discovery has massively improved thanks to #9382!
When doing a run against all subscriptions, our discovery period would average ~120-160 minutes before starting to write to the destination. From our single test yesterday, that now seems to be >10 minutes!
We also saw much more predictable run times (down from 24+ hours to ~2 hours) and memory usage when excluding:
["azure_compute_skus", "azure_*_definitions", "azure_authorization_role_assignments"]
Thanks @tomtom215, sharing what I found so far: it looks like that, while we’re using the paginated API for that table, it returns all items in a single page (28,859 items on our subscription). Since we do that for every subscription we discover in parallel (based on the concurrency option), all that memory adds up. Still looking at how this can be optimized.
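For a rough sense of scale (the per-item size here is an assumption, not a measurement): at roughly 1 KB per decoded SKU entry, 28,859 items comes to about 30 MB per subscription, so fetching, say, 100 subscriptions concurrently would hold around 3 GB of API responses in memory at once before anything is written to the destination.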
Hi @erezrokah - correct, the only change was to skip the SKUs table. I sent over our findings from the run separately, as promised.