tcmu-runner: tcmu_rbd_handle_timedout_cmd:669 rbd/rbd.vmware01: Timing out cmd.
During our nightly Ceph scrubbing window, our ESXi hosts regularly (almost every night) lose the iSCSI connection and basically lock up completely (a hard reset is needed).
Looking at the logs, I found the following in the tcmu-runner logs appearing before the connection is lost.
Oct 18 02:03:29 node1.ceph tcmu-runner[1650]: tcmu_rbd_handle_timedout_cmd:669 rbd/rbd.vmware01: Timing out cmd.
Oct 18 02:03:29 node1.ceph tcmu-runner[1650]: tcmu_notify_conn_lost:176 rbd/rbd.vmware01: Handler connection lost (lock state 1)
Oct 18 02:03:29 node1.ceph tcmu-runner[1650]: 2017-10-18 02:03:29.987 1650 [ERROR] tcmu_rbd_handle_timedout_cmd:669 rbd/rbd.vmware01: Timing out cmd.
Oct 18 02:03:29 node1.ceph tcmu-runner[1650]: 2017-10-18 02:03:29.987 1650 [ERROR] tcmu_notify_conn_lost:176 rbd/rbd.vmware01: Handler connection lost (lock state 1)
Oct 18 02:03:29 node1.ceph tcmu-runner[1650]: 2017-10-18 02:03:29.988 1650 [INFO] tgt_port_grp_recovery_thread_fn:253: Disabled iscsi/iqn.2017-01.com.domain.ceph:iscsigw/tpgt_1.
Oct 18 02:03:29 node1.ceph tcmu-runner[1650]: tgt_port_grp_recovery_thread_fn:253: Disabled iscsi/iqn.2017-01.com.domain.ceph:iscsigw/tpgt_1.
Oct 18 02:03:30 node1.ceph tcmu-runner[1650]: tgt_port_grp_recovery_thread_fn:278: Enabled iscsi/iqn.2017-01.com.domain.ceph:iscsigw/tpgt_1.
Oct 18 02:03:30 node1.ceph tcmu-runner[1650]: 2017-10-18 02:03:30.154 1650 [INFO] tgt_port_grp_recovery_thread_fn:278: Enabled iscsi/iqn.2017-01.com.domain.ceph:iscsigw/tpgt_1.
Oct 18 02:03:33 node1.ceph tcmu-runner[1650]: tcmu_rbd_lock:446 rbd/rbd.vmware01: Acquired exclusive lock.
Oct 18 02:03:33 node1.ceph tcmu-runner[1650]: 2017-10-18 02:03:33.841 1650 [WARN] tcmu_rbd_lock:446 rbd/rbd.vmware01: Acquired exclusive lock.
System Info: Running on the first 3 of our Ceph OSD nodes
- CentOS 7.4
- Kernel 4.13.4 with the patches from the release notes
- tcmu-runner 1.3.0-rc4
- Ceph 12.2.1
Just an FYI,
I am doing some experimenting here with different timers/settings that might work around the issue.
Will try to get back to you by Thurs/Fri.
Hey,
Sorry. I am still testing.
Right now to work around the timeout issue I think we just need to increase these timers:
/etc/ceph/iscsi-gateway.cfg
by adding
[target]
osd_op_timeout = N-in-seconds
and then restarting the gw.
It looks like a lot of the time the first timer is just a little too short, so setting it to around 60 - 90 seconds would probably be long enough that we do not fire off the error handler when we do not need to. #2 is probably OK for now, but dropping it down a little might help when we see the OSD crashes. However, dropping it too low is going to cause a lot of warnings in the logs, so I would leave it for now. And I am not yet sure about #3 when #1 and #2 are adjusted. #3 should be at least as long as #1 plus some settle/adjustment time. Because the entire host hung, we never saw whether the guest OS started to see errors too, so I would leave this one for now.
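As a rough sketch (the 60 here is just an example picked from the 60 - 90 second range above, not a tested recommendation), the file would end up looking something like this:

[target]
# only fail the command and kick off the reconnect/retry handling after 60 seconds
osd_op_timeout = 60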
I think we are also hitting something like this:
https://kb.vmware.com/s/article/2113956?language=en_US
There are so many ATS miscompare errors in the logs. We do not see the vmkernel errors mentioned in the KB, but I think it is worth trying. Note that for HA we do not support completely disabling ATS, because we do not support distributed reservations yet, so you should only disable ATS for heartbeats by running (ignore the VMFS versions in the command name):
esxcli system settings advanced set -i 0 -o /VMFS3/UseATSForHBOnVMFS5
Note that I am still testing this stuff though. You can try this out if this is not a production system.
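If you want to check what the host currently has before changing it (and to know what to set it back to afterwards), something like this should show it:

# print the current and default values for the ATS heartbeat option
esxcli system settings advanced list -o /VMFS3/UseATSForHBOnVMFS5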
Yes, correct, sharing the LUN between multiple hosts is not supported right now.
It is the next item on the TODO after this release. Maybe a month.
It might explain it. It does not explain why commands are taking so long initially, but once commands start taking a long time (longer than the rados osd op timeout of 30 seconds) paths will go down temporarily. So if commands started getting slow on all paths at the same time then all paths could go down temporarily. However, from the log snippet it looked like we were recovering really quickly so we should not have got into the situation where you had to hard reset.
Here is what we are seeing from the target logs in the original message:
// This log message means a command took longer than 30 seconds to complete. We did not think it would take more than 30 secs to complete a command even with a scrub in progress, so the code assumes it might be a problem with the connection to the ceph cluster. We are going to fail the command and ask for a retry while we try to reconnect the rbd image to the ceph cluster. (Note that in your case this assumption might be wrong, and for some reason it really is taking the OSD longer than 30 secs to complete the command.)
Oct 18 02:03:29 node1.ceph tcmu-runner[1650]: tcmu_rbd_handle_timedout_cmd:669 rbd/rbd.vmware01: Timing out cmd.
// Here we disable the iscsi target port while we try to reopen the image.
Oct 18 02:03:29 node1.ceph tcmu-runner[1650]: tgt_port_grp_recovery_thread_fn:253: Disabled iscsi/iqn.2017-01.com.domain.ceph:iscsigw/tpgt_1.
// We reopened the image and are now re-enabling the iscsi target port. It looks like it was not an issue with the connection to the ceph cluster, because we were able to reopen the image in 1 second.
Oct 18 02:03:30 node1.ceph tcmu-runner[1650]: tgt_port_grp_recovery_thread_fn:278: Enabled iscsi/iqn.2017-01.com.domain.ceph:iscsigw/tpgt_1.
// ESX re-logged into the iscsi target port pretty quickly (3 seconds) and started to send IO to this path. This path grabbed the rbd lock so it could execute IO.
Oct 18 02:03:33 node1.ceph tcmu-runner[1650]: tcmu_rbd_lock:446 rbd/rbd.vmware01: Acquired exclusive lock.
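If you want to watch for this whole sequence live on a gateway node, something like the following should work (assuming tcmu-runner is the systemd unit name, as the journal prefix above suggests):

# follow tcmu-runner and only show the timeout/recovery related messages
journalctl -f -u tcmu-runner | grep -E 'Timing out cmd|conn_lost|recovery_thread'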
If OSDs are going down while a scrub is in progress, that could explain why commands are taking a long time; we might not be detecting this and failing over to other OSDs in time. On your OSDs, in /etc/ceph/ceph.conf, did you set these settings? (This is mentioned in the doc I referenced, so if you did not have the doc initially you may have missed it.)

[osd]
osd heartbeat grace = 20
osd heartbeat interval = 5
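To double-check what a running OSD is actually using (in case the conf change never got picked up on restart), you can query the daemon through its admin socket; osd.0 is just an example ID, run it on the node hosting that OSD:

# show the heartbeat values the daemon is currently running with
ceph daemon osd.0 config show | grep osd_heartbeat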
These messages mean ESX’s command timer has fired. It sent an abort, but the command completed shortly after that (so this just means nothing was permanently hung). Depending on the command type, that timer could be 8 - 15 seconds. So during a scrub, if IO is going to take a long time, then these are expected. If OSDs are crashing, we would also expect them. If you are just doing normal IO and/or just opening the storage part of the GUI, we should not see them. When we see these on the target side, we want to check whether the ESX vmkernel log mentions anything about APD and/or failovers. See below for info on how to increase the ESX logging for this and the scrub issue.
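On the ESX side, a quick way to scan for those events around the scrub window (assuming the default ESXi log location) is:

# look for all-paths-down, failover and abort messages
grep -iE 'apd|failover|abort' /var/log/vmkernel.log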
Oh yeah, for the path selection policy info it would just be in the
esxcli storage nmp device list
command output. For your device naa.60014059731a7108ea74de29e8c1c55f you should see this for the type and policy:
Storage Array Type: VMW_SATP_ALUA
Path Selection Policy: VMW_PSP_MRU
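If you only want that one device, you should be able to limit the output with the -d option, e.g.:

esxcli storage nmp device list -d naa.60014059731a7108ea74de29e8c1c55f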
There was nothing interesting in here. There is one failed command during probing, but that is expected because we do not support it. ESX will ignore it.
Can you enable some extra logging? It will give more info about what is happening when we see those ABORT messages on the target.
In the Software Advanced Settings screen in the GUI under Scsi, set:
Scsi.LogCmdErrors 1
Scsi.LogScsiAborts 1
Scsi.LogMPCmdErrors 1
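If the CLI is easier than the GUI, the same options should be settable the same way as the ATS command earlier; the /Scsi/... paths here are just my guess at the esxcli form of the option names above:

esxcli system settings advanced set -i 1 -o /Scsi/LogCmdErrors
esxcli system settings advanced set -i 1 -o /Scsi/LogScsiAborts
esxcli system settings advanced set -i 1 -o /Scsi/LogMPCmdErrors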
Could you also tell me the value for
Scsi.FailVMOnAPD
that you have.
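If it is easier, that one should also be readable from the CLI using the same option-path convention as above:

esxcli system settings advanced list -o /Scsi/FailVMOnAPD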
No problem.