kubernetes: AWS does not fail when provisioning a volume encrypted with an inaccessible KMS key.
This is a continuation of #48438 and #48936.
What happened: When a StorageClass refers to a key that is not accessible to Kubernetes (i.e. the AWS API returns AccessDeniedException), the AWS plugin still tries to provision a volume, under the assumption that while Kubernetes can't access the key, other components may be able to access it, create the volume, and attach it to a node later.
The related code is here: checkEncryptionKey checks only for NotFoundException. Other errors are just logged and provisioning continues.
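For reference, a minimal sketch of what a stricter check could look like, assuming the aws-sdk-go v1 KMS client. The function name, logging, and error messages are illustrative only, not the actual cloud-provider code:

```go
package awsebs

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/kms"
)

// checkEncryptionKeyStrict is a hypothetical variant of the plugin's key check.
// Today only NotFoundException aborts provisioning; this sketch also treats
// AccessDeniedException as fatal, so no PV would be created for a volume that
// is going to disappear.
func checkEncryptionKeyStrict(kmsClient *kms.KMS, keyID string) error {
	_, err := kmsClient.DescribeKey(&kms.DescribeKeyInput{KeyId: aws.String(keyID)})
	if err == nil {
		return nil
	}
	if awsErr, ok := err.(awserr.Error); ok {
		switch awsErr.Code() {
		case kms.ErrCodeNotFoundException:
			return fmt.Errorf("KMS key %s not found: %v", keyID, err)
		case "AccessDeniedException":
			return fmt.Errorf("KMS key %s is not accessible to Kubernetes: %v", keyID, err)
		}
	}
	// Everything else keeps the current behavior: log and continue provisioning.
	log.Printf("could not verify KMS key %s: %v", keyID, err)
	return nil
}
```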
The problem is that when AWS really can't access the key, it still provisions a volume (ec2.CreateVolume returns success and a valid AWS EBS volume), but this EBS volume disappears after a few seconds. Kubernetes therefore does not know that anything went wrong and creates a PV for the returned EBS volume. It later fails to run a pod that refers to such a PV, because the underlying volume does not exist.
What you expected to happen: Kubernetes must not create a PV in this case and must report an error during provisioning of such a volume.
Ideas:
1. According to the AWS documentation, the only signal AWS produces is an event in Amazon CloudWatch Events saying that a key could not be used for encryption. The Kubernetes AWS cloud provider is not subscribed to these events and does not get this notification. We would need a new permission for controller-manager to read the events, plus new code to filter them and report back to the user.
2. The Kubernetes AWS cloud provider could wait for X seconds and re-check that the created volume still exists and did not disappear because of an inaccessible key (see the sketch after this list). What would be a reliable value of X?
3. The Kubernetes volume plugin for AWS could return both a warning ("I failed to check the encryption key, the created volume may not work") that would be propagated to the user and the suspicious volume at the same time, but the API between volume plugins and Kubernetes allows only one of the two: either an error or a volume. It could be extended, but the user experience would still be suboptimal.
4. Kubernetes could send an error to the user during volume attach (!) saying that the volume does not exist, please check your KMS key permissions. Again, the UX is quite bad here.
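A minimal sketch of what idea 2 could look like, again assuming the aws-sdk-go v1 EC2 client. The retry count and interval are placeholders, since choosing a reliable X is exactly the open question:

```go
package awsebs

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/ec2"
)

// verifyVolumeStillExists is a hypothetical post-CreateVolume check: poll for
// a short time and fail provisioning if the freshly created volume vanished or
// went into the error state, which is what happens when the KMS key turns out
// to be inaccessible.
func verifyVolumeStillExists(ec2Client *ec2.EC2, volumeID string) error {
	const (
		attempts = 5               // placeholder; "X" from idea 2
		interval = 3 * time.Second // placeholder; "X" from idea 2
	)
	for i := 0; i < attempts; i++ {
		time.Sleep(interval)
		out, err := ec2Client.DescribeVolumes(&ec2.DescribeVolumesInput{
			VolumeIds: []*string{aws.String(volumeID)},
		})
		if err != nil {
			if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == "InvalidVolume.NotFound" {
				return fmt.Errorf("volume %s disappeared after creation, check your KMS key permissions", volumeID)
			}
			return err
		}
		for _, v := range out.Volumes {
			if aws.StringValue(v.State) == ec2.VolumeStateError {
				return fmt.Errorf("volume %s went into error state, check your KMS key permissions", volumeID)
			}
		}
	}
	return nil
}
```

The provisioner would call something like this right after CreateVolume and create the PV only when it returns nil; the obvious downside is that it adds roughly attempts × interval of latency to every encrypted-volume provision.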
/kind bug
/sig aws
About this issue
- Original URL
- State: closed
- Created 6 years ago
- Comments: 15 (15 by maintainers)
@randomvariable, thanks a lot for the detailed investigation. I'll try to find some time to implement option 2 even before CSI. Users are very innovative in setting up their cluster permissions wrong, and the messaging to the user in this particular issue is very confusing.