prometheus: Azure SD stopped working for VMSS instances after upgrade from 2.41 > 2.48
What did you do?
Azure SD stopped working for VMSS instances after upgrade from 2.41 > 2.48.
- job_name: tmp
  azure_sd_configs:
    - subscription_id: 'xxx-xxx-xxx-xxx'
      authentication_method: ManagedIdentity
      resource_group: 'xxx'
      proxy_from_environment: true
It was working fine with 2.41 for both VM and VMSS instances, but after the upgrade it discovers only regular VMs. No instance within a VM Scale Set is discovered, and the following warning is logged (as many warnings as you have VMSS):
ts=2023-12-04T17:19:13.235Z caller=azure.go:391 level=warn component="discovery manager scrape"
discovery=azure config=tmp msg="Network interface does not exist"
name=/subscriptions/xxx-xxx-xxx-xxx/resourceGroups/xxx/providers/Microsoft.Compute/virtualMachineScaleSets/nomad-main-xxx-vmss/virtualMachines/282/networkInterfaces/primary.nic
err="network interface does not exist"
Moreover, I checked the resource named in the warning by following the path /subscriptions/xxx-xxx-xxx-xxx/resourceGroups/xxx/providers/Microsoft.Compute/virtualMachineScaleSets/nomad-main-xxx-vmss/virtualMachines/282/networkInterfaces/primary.nic and it can be viewed without any problem.
What did you expect to see?
Azure SD working as previously.
What did you see instead? Under which circumstances?
Missing VMSS instances in Targets. Only VMs are present.
System information
No response
Prometheus version
No response
Prometheus configuration file
No response
Alertmanager version
No response
Alertmanager configuration file
No response
Logs
No response
About this issue
- Original URL
- State: closed
- Created 7 months ago
- Comments: 15 (14 by maintainers)
Nice, this commit is not in 2.49. I will create a separate issue for it.
Thanks! Closing this particular issue; we will cherry-pick the fix onto 2.49.
Hmm, it partially works for me. VMSS discovery works but I don’t have everything discovered because it says there are duplicate registrations on SD:
Not sure why: either something else changed or it is my specific config with relabeling. I am still debugging and will let you know.
@daniel-resdiary amazing, the potential fix is merged to main. Before closing this issue, can somebody else confirm that it helps, e.g. @roman-vynar? You will need to use the main branch, e.g. the “prom/prometheus:main” docker image (EDIT: the latest “main” tag includes the fix now).
Once confirmed, we can close this and consider cherry-picking the fix for the 2.49.0 release (rc.0 is out for now) 🤗
I think I have a fix for this: https://github.com/prometheus/prometheus/pull/13283 This is my first time contributing to any open source project, so please let me know if I’ve missed anything.
I’ll take a look at this and see if I can get a fix. I don’t know how to write a test for it, though, as it needs to call GET on a real Azure NIC…
I’ve done a little bit of further testing on this. It looks like it would be possible to use the *arm.ResourceID returned by newAzureResourceFromID (ultimately from arm.ParseResourceID(id)) to determine whether a given NIC ID is a VM NIC or a VMSS NIC.
The ResourceType on that *arm.ResourceID is different in each case. It looks like it returns:
- Microsoft.Compute/virtualMachineScaleSets/virtualMachines/networkInterfaces for a VMSS NIC
- Microsoft.Network/networkInterfaces for a VM NIC
Looks like this is breaking because getNetworkInterfaceByID uses *armnetwork.InterfacesClient.Get() to return all Network Interfaces.
*armnetwork.InterfacesClient.Get() only works with Virtual Machine interfaces.
To get VMSS instance interfaces it needs to use *armnetwork.InterfacesClient.GetVirtualMachineScaleSetNetworkInterface().
VMSS Network Interfaces have a very different Resource ID format to VM Network Interfaces, which is why there are two functions to return these.
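To make the distinction concrete, here is a minimal sketch of how a NIC resource ID could be classified so the caller knows which of the two client functions to use. It is not the code from the merged fix: it uses only the standard library and splits the ID string by hand instead of calling arm.ParseResourceID, and the names nicKind, vmssNIC and classifyNIC are hypothetical. The extracted fields are the ones that, as far as I can tell, GetVirtualMachineScaleSetNetworkInterface() expects (resource group, scale set name, VM index, NIC name).

```go
package main

import (
	"fmt"
	"strings"
)

// nicKind says which ID layout a NIC resource ID follows (hypothetical type).
type nicKind int

const (
	nicUnknown nicKind = iota
	nicVM              // Microsoft.Network/networkInterfaces
	nicVMSS            // Microsoft.Compute/virtualMachineScaleSets/virtualMachines/networkInterfaces
)

// vmssNIC holds the parts of a VMSS NIC ID (hypothetical helper type).
type vmssNIC struct {
	ResourceGroup string
	ScaleSet      string
	InstanceID    string
	NICName       string
}

// classifyNIC splits the ID into path segments and matches the two layouts
// described in the thread. This is a hand-rolled approximation, not the
// Azure SDK parser.
func classifyNIC(id string) (nicKind, vmssNIC) {
	p := strings.Split(strings.Trim(id, "/"), "/")

	// VMSS layout, 12 segments:
	// subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/
	// virtualMachineScaleSets/<vmss>/virtualMachines/<index>/networkInterfaces/<nic>
	if len(p) == 12 &&
		strings.EqualFold(p[5], "Microsoft.Compute") &&
		strings.EqualFold(p[6], "virtualMachineScaleSets") &&
		strings.EqualFold(p[8], "virtualMachines") &&
		strings.EqualFold(p[10], "networkInterfaces") {
		return nicVMSS, vmssNIC{
			ResourceGroup: p[3],
			ScaleSet:      p[7],
			InstanceID:    p[9],
			NICName:       p[11],
		}
	}

	// VM layout, 8 segments:
	// subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/networkInterfaces/<nic>
	if len(p) == 8 &&
		strings.EqualFold(p[5], "Microsoft.Network") &&
		strings.EqualFold(p[6], "networkInterfaces") {
		return nicVM, vmssNIC{}
	}
	return nicUnknown, vmssNIC{}
}

func main() {
	id := "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.Compute/" +
		"virtualMachineScaleSets/nomad-main-xxx-vmss/virtualMachines/282/networkInterfaces/primary.nic"
	kind, parts := classifyNIC(id)
	fmt.Println(kind == nicVMSS, parts.ScaleSet, parts.InstanceID, parts.NICName)
}
```

With a classification like this, a VMSS NIC would be routed to GetVirtualMachineScaleSetNetworkInterface() and a VM NIC to Get(); anything unrecognised is better reported than guessed at.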
This used to work (before https://github.com/prometheus/prometheus/pull/11860) because getNetworkInterfaceByID used to just call GET on the NIC ID path. Doing it that way meant it didn’t matter that the ID was formatted differently for each ID type.
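For illustration, the old behaviour can be sketched as treating the NIC ID as an opaque URL path. This is a simplified stand-in, not the pre-#11860 Prometheus code: nicURL is a hypothetical helper, and the base URL and api-version used in main are placeholders rather than real values.

```go
package main

import (
	"fmt"
	"net/url"
)

// nicURL builds a GET URL by using the NIC resource ID verbatim as the path.
// Because the ID is never parsed, VM and VMSS NIC IDs are handled identically.
func nicURL(baseURL, nicID, apiVersion string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u.Path = nicID // opaque: no assumption about the ID layout
	q := url.Values{}
	q.Set("api-version", apiVersion)
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	// Placeholder endpoint and API version, for illustration only.
	base, version := "https://management.example", "placeholder-version"
	ids := []string{
		"/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.Network/networkInterfaces/nic1",
		"/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.Compute/virtualMachineScaleSets/vmss1/virtualMachines/282/networkInterfaces/primary.nic",
	}
	for _, id := range ids {
		u, err := nicURL(base, id, version)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(u)
	}
}
```

The trade-off is that a typed client call needs the ID broken into named parameters, which is where the two ID formats start to matter.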