terraform-provider-proxmox: Unable to clone vm, looks like a timeout

https://github.com/Telmate/terraform-provider-proxmox/blob/24df605f0a4602fa3f5d231b5770ae68479f8641/proxmox/resource_vm_qemu.go#L783

I’m using Proxmox 6.3-2 and provider 2.9.0

proxmox_vm_qemu.resource-name: Still creating... [2m30s elapsed]
╷
│ Error: vm locked, could not obtain config
│
│   with proxmox_vm_qemu.resource-name,
│   on hosts.tf line 1, in resource "proxmox_vm_qemu" "resource-name":
│    1: resource "proxmox_vm_qemu" "resource-name" {
│
╵

But in the Proxmox logs I can see the clone still making progress: transferred: 8306819 bytes remaining: 2178941 bytes total: 10485760 bytes progression: 79.22 %
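For reference, here is a minimal sketch of the kind of resource that triggers this; all names and values below are placeholders rather than my actual config:

resource "proxmox_vm_qemu" "resource-name" {
  name        = "resource-name"
  target_node = "pve"            # placeholder node name
  clone       = "template-name"  # placeholder template to clone from
  full_clone  = true             # full clones take longer and are more likely to hit the timeout
}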

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 22

Most upvoted comments

Reproduced on v2.9.10.

Tried adjusting the resource create timeouts and it didn’t seem to have any effect. Here are several test runs from after changing those timeouts; I did not change anything at all between runs:

  • Test 1: at the 5m50s mark, got a vm locked, could not obtain config error
  • Test 2: at the 5m mark, got Error: file provisioner error because the file provisioner could not connect (“timeout - last error: dial tcp 192.168.1.78:22: connect: no route to host”)
  • Test 3: Creation complete after 6m21s
  • Test 4: vm locked, could not obtain config
  • Test 5: terraform destroy thought there was nothing to clean up, so the next creation failed with Error: 500 can't lock file '/var/lock/qemu-server/lock-135.conf' - got timeout
  • Test 6: Creation complete after 6m17s
  • Test 7: Creation complete after 6m9s
  • Test 8: vm locked, could not obtain config
  • Test 9: terraform destroy thought there was nothing to clean up, and the next run failed with Error: file provisioner error because the file provisioner could not connect (“timeout - last error: dial tcp 192.168.1.78:22: connect: no route to host”)
  • Test 10: Creation complete after 6m28s

So that gives some concrete data on the different kinds of errors that can come up and how often they appear. The destroy step incorrectly thinking there’s nothing to destroy might be a separate issue; if I can dig up more details on that, I’ll open a new ticket if one does not already exist. I suspect all the other failures are related to this timeout problem.

People seem to be mentioning setting both pm_timeout and PM_TIMEOUT to work around this issue. In case anyone in the future is confused about which is the correct environment variable to use, it is PM_TIMEOUT. It is referred to as pm_timeout in the documentation because that is the name of the provider argument it sets (similar to how the pm_api_url argument is set by the PM_API_URL environment variable).
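Concretely, either export PM_TIMEOUT=600 in the shell before running terraform, or set the provider argument directly. A minimal sketch (the API URL is a placeholder):

provider "proxmox" {
  pm_api_url = "https://proxmox.example.com:8006/api2/json"  # placeholder
  pm_timeout = 600  # seconds; the same setting PM_TIMEOUT controls
}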

Similar to what others are reporting, I found that setting PM_TIMEOUT=600 seems to make everything completely stable. I’ve redeployed several times in a row without any failures, so this seems like a solid workaround.

In conclusion, I very much look forward to the fix with proper Go-style waits!

Update: I just got an error, The plugin.(*GRPCProvider).ApplyResourceChange request was cancelled., with PM_TIMEOUT set to 600 (it happened at 5m20s). So apparently either this workaround isn’t bulletproof, or there’s an additional issue causing trouble. Also, after that error the VM still existed in Proxmox, but terraform thought there was nothing to destroy. The workaround for that is to just delete the VM manually in Proxmox.

Update 2: The VM is being cloned from server_1 (which is where the template is located) and deployed onto server_2 (for no particular reason). If I change the Terraform file to set target_node = "server_1", the PM_TIMEOUT workaround appears to be much more stable (maybe even 100% stable?). In my case the backing store is a Ceph cluster that is available on both servers. I mention this because it will probably affect the fix (it needs to wait for the VM to be on the right node, not just for the clone operation to complete).
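In other words, the change that stabilized things was just pinning the clone to the node that holds the template. A sketch (node and template names stand in for my actual ones):

resource "proxmox_vm_qemu" "resource-name" {
  target_node = "server_1"       # the node that holds the template, instead of "server_2"
  clone       = "template-name"  # placeholder template name
  # ...
}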

PVE 7.2-7; plugin v2.9.11; cloud-init image size is 2252M.

Got a timeout every time when pm_parallel > 3:

create full clone of drive ide2 (zfsimage:vm-9000-cloudinit)
trying to acquire lock...
 OK
create full clone of drive scsi0 (zfsimage:base-9000-disk-0)
trying to acquire lock...
trying to acquire lock...
 OK
TASK ERROR: clone failed: can't lock file '/var/lock/pve-manager/pve-storage-zfsimage' - got timeout

Exporting PM_TIMEOUT and setting pm_timeout don’t seem to have any effect.
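For reference, a minimal sketch of the provider block that avoids the storage-lock timeout by capping concurrency (the API URL is a placeholder):

provider "proxmox" {
  pm_api_url  = "https://proxmox.example.com:8006/api2/json"  # placeholder
  pm_parallel = 3  # anything above 3 hit the ZFS storage lock timeout in my tests
}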

I struggled with this issue for a while and was able to fix it temporarily by raising the pm_timeout parameter of the provider. For 3 proxmox_vm_qemu clones, setting it to 600 was enough.

This issue is still present in 2.9.1, 2.9.2 and 2.9.3. If a Proxmox clone task takes longer than 5 minutes and 20 seconds, terraform will try to send the config file, which gives the error:

Error: vm locked, could not obtain config.

A simple fix is to go back to 2.8.0, which still works.
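If you go that route, a sketch of pinning the provider version (assuming the Telmate registry source):

terraform {
  required_providers {
    proxmox = {
      source  = "Telmate/proxmox"
      version = "2.8.0"
    }
  }
}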