kubernetes: Azure provider triggers VMSS manual upgrade which restarts all instances at the same time

What happened:

We have a v1.10.6 cluster hosted in Azure (deployed with https://github.com/Azure/acs-engine) that uses a VMSS for the agent pool. We decided to change the VM Size of the VMSS instances for cost efficiency. The plan was:

  • Change the VMSS VM Size
  • Double the number of instances (the new instances being created with the new VM Size)
  • Drain the old instances
  • Upgrade the old instances to the new VM Size
  • Scale down to the original number of nodes
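
For reference, the intended procedure can be sketched as a small script that only builds the commands without running them. This is one possible reading of the plan, not what was actually executed: it assumes the Azure CLI (`az vmss update`, `az vmss scale`, `az vmss update-instances`) and `kubectl drain`, handles old instances one at a time, and uses a hypothetical VM Size and hypothetical instance ID / node name pairs. It does not protect against the cloud provider behaviour described in this report.

```python
import shlex


def resize_plan(resource_group, vmss, new_size, original_count, old_instances):
    """Build the planned steps as argv lists; nothing is executed here.

    old_instances: list of (vmss_instance_id, kubernetes_node_name) pairs,
    supplied by the caller (hypothetical input).
    """
    target = ["--resource-group", resource_group, "--name", vmss]
    steps = [
        # 1. Change the VM Size on the scale set model.
        ["az", "vmss", "update", *target, "--set", f"sku.name={new_size}"],
        # 2. Double the number of instances.
        ["az", "vmss", "scale", *target, "--new-capacity", str(2 * original_count)],
    ]
    for instance_id, node in old_instances:
        # 3. Drain one old node ...
        steps.append(["kubectl", "drain", node, "--ignore-daemonsets"])
        # 4. ... then upgrade only that instance to the new VM Size.
        steps.append(
            ["az", "vmss", "update-instances", *target, "--instance-ids", instance_id]
        )
    # 5. Scale down to the original number of nodes.
    steps.append(["az", "vmss", "scale", *target, "--new-capacity", str(original_count)])
    return steps


if __name__ == "__main__":
    # Size and node names below are made-up placeholders.
    plan = resize_plan(
        "k8s4-EUW-dev-RG",
        "k8s4-euw-dev-agent-vmss",
        "Standard_D4s_v3",
        2,
        [("0", "old-node-0"), ("1", "old-node-1")],
    )
    for argv in plan:
        print(shlex.join(argv))
```

Nodes drained this way stay cordoned until `kubectl uncordon` is run; the plan above does not mention that step, so it is left out of the sketch.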

What actually happened:

  • We changed the VMSS VM Size
  • We doubled the number of instances
  • The Kubernetes Azure cloud provider triggered a manualupgrade of the VMSS, which upgraded every node not matching the new VM Size
  • All hosted services went down because all the “old nodes” were stopped at the same time for the upgrade

Here is the Azure audit log entry showing that the Service Principal used by the Kubernetes cluster's Azure provider triggered the manual upgrade:

{
    "authorization": {
        "action": "Microsoft.Compute/virtualMachineScaleSets/manualupgrade/action",
        "scope": "/subscriptions/<AZURE_SUBSCRIPTION_ID>/resourceGroups/k8s4-EUW-dev-RG/providers/Microsoft.Compute/virtualMachineScaleSets/k8s4-euw-dev-agent-vmss"
    },
    "caller": "<SOME_ID>",
    "channels": "Operation",
    "claims": {
        "aud": "https://management.core.windows.net/",
        "iss": "https://sts.windows.net/<AZURE_TENANT_ID>/",
        "iat": "1543571495",
        "nbf": "1543571495",
        "exp": "1543575395",
        "aio": "42RgYPg3Vf3czm1PjRPl7kz8ZFs/EwA=",
        "appid": "<K8S_AZURE_SERVICE_PRINCIPAL>",
        "appidacr": "1",
        "http://schemas.microsoft.com/identity/claims/identityprovider": "https://sts.windows.net/<AZURE_TENANT_ID>/",
        "http://schemas.microsoft.com/identity/claims/objectidentifier": "<SOME_ID>",
        "http://schemas.xmlsoap.org/ws/2005/05/identity/claims/nameidentifier": "<SOME_ID>",
        "http://schemas.microsoft.com/identity/claims/tenantid": "<AZURE_TENANT_ID>",
        "uti": "dOjT0Ob3B0OV9ckOFFXjAQ",
        "ver": "1.0"
    },
    "correlationId": "c0ec14e9-0d68-4eed-ab97-6043a2dc50fd",
    "description": "",
    "eventDataId": "00263b46-028d-498b-9264-ba94e20c1b84",
    "eventName": {
        "value": "EndRequest",
        "localizedValue": "End request"
    },
    "category": {
        "value": "Administrative",
        "localizedValue": "Administrative"
    },
    "eventTimestamp": "2018-11-30T10:09:31.1754255Z",
    "id": "/subscriptions/<AZURE_SUBSCRIPTION_ID>/resourceGroups/k8s4-EUW-dev-RG/providers/Microsoft.Compute/virtualMachineScaleSets/k8s4-euw-dev-agent-vmss/events/00263b46-028d-498b-9264-ba94e20c1b84/ticks/636791693711754255",
    "level": "Informational",
    "operationId": "c0ec14e9-0d68-4eed-ab97-6043a2dc50fd",
    "operationName": {
        "value": "Microsoft.Compute/virtualMachineScaleSets/manualupgrade/action",
        "localizedValue": "Manual Upgrade Virtual Machine Scale Set"
    },
    "resourceGroupName": "k8s4-EUW-dev-RG",
    "resourceProviderName": {
        "value": "Microsoft.Compute",
        "localizedValue": "Microsoft.Compute"
    },
    "resourceType": {
        "value": "Microsoft.Compute/virtualMachineScaleSets",
        "localizedValue": "Microsoft.Compute/virtualMachineScaleSets"
    },
    "resourceId": "/subscriptions/<AZURE_SUBSCRIPTION_ID>/resourceGroups/k8s4-EUW-dev-RG/providers/Microsoft.Compute/virtualMachineScaleSets/k8s4-euw-dev-agent-vmss",
    "status": {
        "value": "Accepted",
        "localizedValue": "Accepted"
    },
    "subStatus": {
        "value": "Accepted",
        "localizedValue": "Accepted (HTTP Status Code: 202)"
    },
    "submissionTimestamp": "2018-11-30T10:09:50.1545164Z",
    "subscriptionId": "<AZURE_SUBSCRIPTION_ID>",
    "properties": {
        "statusCode": "Accepted",
        "serviceRequestId": "03abebf2-7aaf-4195-8e21-fb9c2f48a861"
    },
    "relatedEvents": []
}
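
To find such events in bulk, a short filter over exported log entries may help. The sketch below assumes the entries are available as a JSON array of objects shaped like the one above (only `authorization.action`, `claims.appid`, `eventTimestamp` and `resourceId` are read); the function name `manual_upgrades_by` is made up for illustration.

```python
import json

MANUAL_UPGRADE = "Microsoft.Compute/virtualMachineScaleSets/manualupgrade/action"


def manual_upgrades_by(entries, app_id):
    """Return (timestamp, resourceId) for manual upgrades issued by app_id."""
    hits = []
    for entry in entries:
        action = entry.get("authorization", {}).get("action", "")
        caller_app = entry.get("claims", {}).get("appid")
        # Compare the action case-insensitively, the application ID exactly.
        if action.lower() == MANUAL_UPGRADE.lower() and caller_app == app_id:
            hits.append((entry.get("eventTimestamp"), entry.get("resourceId")))
    return hits


# Trimmed copy of the entry above plus one unrelated, made-up entry.
SAMPLE = """
[
  {
    "authorization": {
      "action": "Microsoft.Compute/virtualMachineScaleSets/manualupgrade/action"
    },
    "claims": {"appid": "<K8S_AZURE_SERVICE_PRINCIPAL>"},
    "eventTimestamp": "2018-11-30T10:09:31.1754255Z",
    "resourceId": "/subscriptions/<AZURE_SUBSCRIPTION_ID>/resourceGroups/k8s4-EUW-dev-RG/providers/Microsoft.Compute/virtualMachineScaleSets/k8s4-euw-dev-agent-vmss"
  },
  {
    "authorization": {"action": "Microsoft.Compute/virtualMachineScaleSets/write"},
    "claims": {"appid": "<SOME_OTHER_APP>"},
    "eventTimestamp": "2018-11-30T10:00:00Z",
    "resourceId": "example"
  }
]
"""

if __name__ == "__main__":
    for when, resource in manual_upgrades_by(
        json.loads(SAMPLE), "<K8S_AZURE_SERVICE_PRINCIPAL>"
    ):
        print(when, resource)
```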

Environment:

  • Kubernetes version (use kubectl version): v1.10.6
  • Cloud provider or hardware configuration: Azure

/kind bug

About this issue

  • State: closed
  • Created 6 years ago
  • Reactions: 12
  • Comments: 16 (12 by maintainers)

Most upvoted comments

The ETA for the new VMSS APIs is mid-April, and this issue will be fixed once they are available.