prometheus: Azure SD stopped working for VMSS instances after upgrade from 2.41 > 2.48

What did you do?

Azure SD stopped working for VMSS instances after upgrade from 2.41 > 2.48.

  - job_name: tmp
    azure_sd_configs:
      - subscription_id: 'xxx-xxx-xxx-xxx'
        authentication_method: ManagedIdentity
        resource_group: 'xxx'
        proxy_from_environment: true

It worked fine with 2.41 for both VM and VMSS instances, but after the upgrade it discovers only regular VMs. No instance within a VM Scale Set is discovered, and the following error is returned (one error for each VMSS you have):

ts=2023-12-04T17:19:13.235Z caller=azure.go:391 level=warn component="discovery manager scrape" discovery=azure config=tmp msg="Network interface does not exist" name=/subscriptions/xxx-xxx-xxx-xxx/resourceGroups/xxx/providers/Microsoft.Compute/virtualMachineScaleSets/nomad-main-xxx-vmss/virtualMachines/282/networkInterfaces/primary.nic err="network interface does not exist"

Moreover, I checked the resource named in the error by following the path /subscriptions/xxx-xxx-xxx-xxx/resourceGroups/xxx/providers/Microsoft.Compute/virtualMachineScaleSets/nomad-main-xxx-vmss/virtualMachines/282/networkInterfaces/primary.nic and it can be viewed without any problem.

What did you expect to see?

Azure SD working as previously.

What did you see instead? Under which circumstances?

Missing VMSS instances in Targets. Only VMs are present.

About this issue

State: closed
Created 7 months ago
Comments: 15 (14 by maintainers)

Most upvoted comments

Nice, this commit is not in 2.49. I will create a separate issue for it.

Thanks! Closing this particular issue and will cherry-pick the fix onto 2.49

bwplotka on Dec 19, 2023

Hmm, it partially works for me. VMSS discovery works but I don’t have everything discovered because it says there are duplicate registrations on SD:

ts=2023-12-13T14:04:52.744Z caller=refresh.go:96 level=error component="discovery manager scrape" discovery=azure config=mqa_nginx msg="Unable to register metrics" err="failed to register metric: duplicate metrics collector registration attempted"
ts=2023-12-13T14:04:52.744Z caller=consul.go:349 level=error component="discovery manager scrape" discovery=consul config=mqa_golang msg="Unable to register metrics" err="failed to register metric: duplicate metrics collector registration attempted"
ts=2023-12-13T14:04:52.745Z caller=refresh.go:96 level=error component="discovery manager scrape" discovery=azure config=cadvisor msg="Unable to register metrics" err="failed to register metric: duplicate metrics collector registration attempted"

Not sure why: either something else changed, or it is my specific config with relabeling. I am still debugging and will let you know.

roman-vynar on Dec 13, 2023

@daniel-resdiary amazing, a potential fix is merged to main. Before closing this issue, can somebody else confirm this helps, e.g. @roman-vynar? You will need to use the main branch, e.g. the “prom/prometheus:main” docker image (EDIT: Its latest “main” tag includes the fix now).

Once confirmed we can close this AND we can consider cherry-picking the fix for 2.49.0 release (rc.0 is out for now)🤗

bwplotka on Dec 13, 2023

I think I have a fix for this: https://github.com/prometheus/prometheus/pull/13283 This is my first time contributing to any open source project, so please let me know if I’ve missed anything.

daniel-resdiary on Dec 12, 2023

I’ll take a look at this and see if I can get a fix. I don’t know how to write a test for it, though, as it needs to call GET on a real Azure NIC…

daniel-resdiary on Dec 12, 2023

I’ve done a little bit of further testing on this. It looks like it would be possible to use the *arm.ResourceID returned by newAzureResourceFromID (ultimately from arm.ParseResourceID(id)) to determine whether a given NIC ID is a VM NIC or a VMSS NIC.

The ResourceType on that *arm.ResourceID is different in each case. It looks like it returns:

Microsoft.Compute/virtualMachineScaleSets/virtualMachines/networkInterfaces for a VMSS NIC
Microsoft.Network/networkInterfaces for a VM NIC
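
The distinction above can be sketched roughly as follows. This is a stdlib-only approximation, not the SDK's code: `resourceTypeOf` and `isVMSSNIC` are hypothetical helpers that mimic the ResourceType reported by arm.ParseResourceID by taking the provider namespace and every alternating type segment that follows it. A real fix would presumably read the ResourceType from the parsed *arm.ResourceID instead.

```go
package main

import (
	"fmt"
	"strings"
)

// vmssNICType is the resource type reported for a VMSS NIC, as quoted above.
const vmssNICType = "Microsoft.Compute/virtualMachineScaleSets/virtualMachines/networkInterfaces"

// resourceTypeOf derives "Namespace/type[/subtype...]" from an Azure resource ID.
// Hypothetical stand-in for the ResourceType from arm.ParseResourceID.
func resourceTypeOf(id string) (string, error) {
	parts := strings.Split(strings.Trim(id, "/"), "/")
	// Locate the last "providers" segment; the namespace follows it.
	idx := -1
	for i, p := range parts {
		if strings.EqualFold(p, "providers") {
			idx = i
		}
	}
	if idx < 0 || idx+2 >= len(parts) {
		return "", fmt.Errorf("no provider segment in resource ID %q", id)
	}
	types := []string{parts[idx+1]} // provider namespace
	// After the namespace, segments alternate: type, name, type, name, ...
	for i := idx + 2; i < len(parts); i += 2 {
		types = append(types, parts[i])
	}
	return strings.Join(types, "/"), nil
}

// isVMSSNIC reports whether the ID points at a scale set instance NIC.
func isVMSSNIC(id string) (bool, error) {
	t, err := resourceTypeOf(id)
	if err != nil {
		return false, err
	}
	return strings.EqualFold(t, vmssNICType), nil
}

func main() {
	ids := []string{
		"/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.Compute/virtualMachineScaleSets/nomad-main-xxx-vmss/virtualMachines/282/networkInterfaces/primary.nic",
		"/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.Network/networkInterfaces/vm1-nic",
	}
	for _, id := range ids {
		t, err := resourceTypeOf(id)
		vmss, _ := isVMSSNIC(id)
		fmt.Println(t, vmss, err)
	}
}
```

The comparison is case-insensitive here on the assumption that Azure resource IDs may vary in casing; the placeholder IDs are the redacted ones from this issue plus an invented VM NIC name.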

daniel-resdiary on Dec 11, 2023

Looks like this is breaking because getNetworkInterfaceByID uses *armnetwork.InterfacesClient.Get() to return all Network Interfaces. *armnetwork.InterfacesClient.Get() only works with Virtual Machine interfaces.
To get VMSS instance interfaces it needs to use *armnetwork.InterfacesClient.GetVirtualMachineScaleSetNetworkInterface().

VMSS Network Interfaces have a very different Resource ID format to VM Network Interfaces, which is why there are two functions to return these.

This used to work (before https://github.com/prometheus/prometheus/pull/11860) because getNetworkInterfaceByID used to just call GET on the NIC ID path. Doing it that way meant it didn’t matter that the ID was formatted differently for each ID type.
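
To illustrate the dispatch this implies, here is a minimal sketch under stated assumptions. The `nicClient` interface, `fakeClient`, `segment`, and `getNIC` are all hypothetical names invented for this example; the two interface methods stand in for *armnetwork.InterfacesClient.Get() and *armnetwork.InterfacesClient.GetVirtualMachineScaleSetNetworkInterface(), and their real signatures are not reproduced here. The point is only that a VMSS NIC ID carries extra path segments (scale set name and instance index) that the VMSS call needs.

```go
package main

import (
	"fmt"
	"strings"
)

// nicClient is a hypothetical abstraction over the two SDK lookups.
type nicClient interface {
	GetVMNIC(resourceGroup, nicName string) (string, error)
	GetVMSSNIC(resourceGroup, scaleSet, vmIndex, nicName string) (string, error)
}

// segment returns the path element following key (case-insensitive).
// Simplification: it does not guard against a resource name equal to a key.
func segment(parts []string, key string) (string, bool) {
	for i := 0; i+1 < len(parts); i++ {
		if strings.EqualFold(parts[i], key) {
			return parts[i+1], true
		}
	}
	return "", false
}

// getNIC picks the VM or VMSS lookup based on the shape of the NIC ID.
func getNIC(c nicClient, id string) (string, error) {
	parts := strings.Split(strings.Trim(id, "/"), "/")
	rg, okRG := segment(parts, "resourceGroups")
	nic, okNIC := segment(parts, "networkInterfaces")
	if !okRG || !okNIC {
		return "", fmt.Errorf("not a network interface ID: %q", id)
	}
	if scaleSet, isVMSS := segment(parts, "virtualMachineScaleSets"); isVMSS {
		vmIndex, ok := segment(parts, "virtualMachines")
		if !ok {
			return "", fmt.Errorf("VMSS NIC ID without instance index: %q", id)
		}
		return c.GetVMSSNIC(rg, scaleSet, vmIndex, nic)
	}
	return c.GetVMNIC(rg, nic)
}

// fakeClient echoes which lookup was chosen, so no Azure access is needed.
type fakeClient struct{}

func (fakeClient) GetVMNIC(rg, nic string) (string, error) {
	return fmt.Sprintf("vm:%s/%s", rg, nic), nil
}

func (fakeClient) GetVMSSNIC(rg, scaleSet, vmIndex, nic string) (string, error) {
	return fmt.Sprintf("vmss:%s/%s/%s/%s", rg, scaleSet, vmIndex, nic), nil
}

func main() {
	id := "/subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.Compute/virtualMachineScaleSets/nomad-main-xxx-vmss/virtualMachines/282/networkInterfaces/primary.nic"
	fmt.Println(getNIC(fakeClient{}, id))
}
```

Hiding the SDK client behind a small interface like this is also one possible way to unit test the branching without calling GET on a real Azure NIC.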

daniel-resdiary on Dec 11, 2023

May relate to #11860

arukiidou on Dec 11, 2023