cloudquery: Investigation request: Investigate Azure plugin high memory usage
We got reports of syncing the `azure_compute_skus` compute table failing even with 8GB of memory and a reduced concurrency of 1000.
We should investigate whether this can be improved. One thing to note is that the Azure client pre-initializes some data needed for multiplexing, such as `SubscriptionsObjects`, `ResourceGroups` and `registeredNamespaces`.
Example config:
```yaml
kind: source
spec:
  name: "azure"
  path: "cloudquery/azure"
  version: "v6.0.0"
  destinations: ["postgres"]
  tables: ["azure_compute_skus"]
```
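The reduced concurrency mentioned in the report is the top-level `concurrency` option of the source spec; a minimal sketch of the same config with it set (1000 is the value from the report above, not a recommendation):

```yaml
kind: source
spec:
  name: "azure"
  path: "cloudquery/azure"
  version: "v6.0.0"
  destinations: ["postgres"]
  tables: ["azure_compute_skus"]
  # Scheduler concurrency for the source; 1000 is the reduced value from the report above.
  concurrency: 1000
```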
About this issue
- State: closed
- Created a year ago
- Reactions: 4
- Comments: 24 (15 by maintainers)
Commits related to this issue
- fix(website): Use lower concurrency setting in Azure configuration (#9339). Summary: Follow up on https://github.com... — committed to cloudquery/cloudquery by erezrokah a year ago
- fix(azure): Recommend service principal authentication method (#9355). Summary: Related to https://github.com/cloudquery/cloudquery/issues/9269. This is pending users confirmation that cha... — committed to cloudquery/cloudquery by erezrokah a year ago
- fix(azure): Remove redundant `SingleSubscriptionMultiplex` (#9365). Summary: Somehow related to https://github.com/cloudquery/cloudquery/issues/9269 as it should save some memory and perfor... — committed to cloudquery/cloudquery by erezrokah a year ago
- feat(azure-auth): Log `DefaultAzureCredential` credentials errors (#9363). Summary: Related to https://github.com/cloudquery/cloudquery/issues/9269. When we create the Azure credentials vi... — committed to cloudquery/cloudquery by erezrokah a year ago
- feat(azure): Discover resource groups and namespaces in parallel (#9382). Summary: Related to https://github.com/cloudquery/cloudquery/issues/9269. I'm trying to reproduce https://github.co... — committed to cloudquery/cloudquery by erezrokah a year ago
- feat(website-azure): Update Azure docs to skip some tables, remove concurrency (#9424). Summary: A follow up to https://github.com/cloudquery/cloudquery/pull/9355, and related to https://gi... — committed to cloudquery/cloudquery by erezrokah a year ago
- feat(azure): Add discovery concurrency setting (#9811). Summary: This is meant to address https://github.com/cloudquery/cloudquery/issues/9269#issuecomment-1485086439. On accounts with man... — committed to cloudquery/cloudquery by erezrokah a year ago
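For the last commit above (#9811), the setting presumably lives in the plugin-specific part of the spec; a hedged sketch, assuming it is exposed as a `discovery_concurrency` option (the name and value here are assumptions based on the PR title, so check the Azure plugin docs for a release that includes #9811):

```yaml
kind: source
spec:
  name: "azure"
  path: "cloudquery/azure"
  version: "v6.1.0" # example; use a release that includes #9811
  destinations: ["postgres"]
  tables: ["azure_compute_skus"]
  spec:
    # Assumed plugin-level option from #9811; verify the exact name and default in the docs.
    discovery_concurrency: 100
```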
That’s great news @cberge908, happy to hear that. I think I’ll close this issue then, as our next action item is waiting for the Azure team to fix https://github.com/Azure/azure-sdk-for-go/issues/20470 (hopefully soon) and https://github.com/Azure/azure-sdk-for-go/issues/19356 (probably not soon).
Interesting. The Azure SDK has default retry logic that includes 503 status codes, so it should handle such intermittent errors. If it happens again can you open an issue? We can override the default retry logic and maybe add longer timeouts.
Please comment if you have any more issues regarding memory and I can re-open.
Hi @erezrokah - sorry for the late reply, I just returned to the office after the Easter season. We’ve done another run with the new version and 5 jobs in parallel (with 1 table in each job). We’re now close to 2k subscriptions, and all 5 jobs completed successfully within 5-10 minutes. None of them aborted prematurely. I have not read all the logs in detail, but a quick spot-check did not surface any 403s.
Only one job had to be repeated; this was probably due to an error on the MS side (503, Service Unavailable). On the second attempt it completed successfully as well.
Update 2
Using the second authentication method described here https://www.cloudquery.io/docs/plugins/sources/azure/overview#authentication-with-environment-variables seems to reduce memory usage significantly and allows the Azure plugin to finish syncing with the default concurrency on our internal account. For people subscribed to this issue, it would be great if you could test it as well.
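For reference, that second method amounts to providing service-principal credentials through the standard Azure identity environment variables before running the sync (the values below are placeholders):

```
AZURE_TENANT_ID=<tenant-id>
AZURE_CLIENT_ID=<service-principal-client-id>
AZURE_CLIENT_SECRET=<client-secret>
```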
Happy Easter from our group!
Also, happy Easter 🥚 🎉
Thanks @cberge908, that’s very helpful. Sorry for the late reply, I was out of office for a couple of days. I’ll run my tests with a much higher number of subscriptions and see where that lands me.
Hi @erezrokah - we had 5 parallel jobs, each querying all subscriptions (~1.8k). Every job was limited to one table and skipped all dependent tables. When limiting the subscriptions per job, everything went smoothly and the table contents were fetched within a few seconds. But without limiting to a few subscriptions, the 403 error occurred after roughly 30s.
Interestingly, if you run only one job with one table for all subscriptions, writing starts fairly quickly, you don’t see any 403s at all, and the job completes successfully.
When the jobs that reported the 403 error are run limited to specific subscriptions, they all complete within a few seconds. So it still looks like the error only occurs when no subscription limit is set and more than one such job runs in parallel.
Hope that helps to sum it all up.
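For anyone wanting to reproduce the per-job subscription limiting described above, a sketch assuming the plugin exposes a `subscriptions` list in its nested spec (check the Azure plugin docs for the exact field name); the IDs are placeholders:

```yaml
kind: source
spec:
  name: "azure"
  path: "cloudquery/azure"
  version: "v6.0.0"
  destinations: ["postgres"]
  tables: ["azure_compute_skus"]
  spec:
    # Limit this job to a handful of subscription IDs (placeholder values).
    subscriptions:
      - "00000000-0000-0000-0000-000000000001"
      - "00000000-0000-0000-0000-000000000002"
```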
The Azure team confirmed it; see https://github.com/Azure/azure-sdk-for-go/issues/20470#issuecomment-1485809476.
(I still don’t have any updates on the 403 errors.)
I would also expect to get an HTTP 429 when hitting a rate limit, instead of the 403. Maybe the high volume of API requests is doing strange things to the Azure API 😉
I’ve done a test limiting all 5 jobs to only 10 subscriptions; they all succeed within 30-40s. That’s a good indicator that it might be related to the high volume we’re querying in parallel.
Hey @erezrokah - we had a chance to test v6.1.0 last night and the performance of the discovery has massively improved thanks to #9382!
When doing a run against all subscriptions, our discovery period would average ~120-160 minutes before starting to write to the destination. From our single test yesterday, that now seems to be >10 minutes!
We also saw much more predictable run times (down from 24+ hours to ~2 hours) and memory usage when excluding:
["azure_compute_skus", "azure_*_definitions", "azure_authorization_role_assignments"]
Thanks @tomtom215, sharing what I found so far: it looks like that, while we’re using the paginated API for that table, it returns all items in a single page (28,859 items on our subscription). Since we do that for every subscription we discover in parallel (based on the concurrency option), all that memory adds up. Still looking at how this can be optimized.
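For a rough sense of scale (the per-item size here is an assumption, not a measurement): at roughly 1 KB per decoded SKU entry, 28,859 items comes to about 30 MB per subscription, so fetching, say, 100 subscriptions concurrently would hold around 3 GB of API responses in memory at once before anything is written to the destination.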
Hi @erezrokah - correct, the only change was to skip the SKUs table. I sent over our findings from the run separately, as promised.