crate: Transport response handler not found
CrateDB version: 4.1.2
Environment description:
- OS: CentOS 7 (latest), OpenJDK 11.0.7
- Data nodes: 8 CPU, 64 GB RAM
- Node makeup: 48 data, 3 master, 2 ingest, 2 query
- The 48 data nodes are split across 2 availability zones (24 per zone)
Problem description: Our cluster health occasionally gets stuck in yellow and requires us to restart CrateDB on the affected nodes for the health to return to green. We have a Nagios check that runs the ALTER CLUSTER command, which usually resolves the problem; however, some cases require manual intervention.
We typically see shards stay unassigned until we run ALTER CLUSTER REROUTE RETRY FAILED (see the sketch after the log excerpts below). Some logs from a related issue, #9748:
shard has exceeded the maximum number of retries [20] on failed allocation attempts - manually execute 'alter cluster....' [unassigned_info[[reason=ALLOCATION_FAILED], at ..... failed to create shard, failure IOException[failed to obtain in-memory shard lock]...
[WARN ][o.e.i.c.IndicesClusterStateService] [hostname][[namespace..partitioned.tablename.someuuid][1]] marking and sending shard failed due to [failed to create shard] java.io.IOException: failed to obtain in-memory shard lock
at org.elasticsearch.index.IndexService.createShard(IndexService.java:358)
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:440)
at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:112)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:551)
...
[INFO ][o.e.i.s.TransportNodesListShardStoreMetaData] [hostname][namespace..partitioned.tablename.someuuid][1]: failed to obtain shard lock
org.elasticsearch.env.ShardLockObtainFailedException: [namespace..partitioned.tablename.someuuid][1]: obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [read metadata snapshot]
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:748)
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:663)
at org.elasticsearch.index.Store.readMetadataSnapshot(Store.java:443)
....
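For reference, a minimal sketch of inspecting the stuck allocations and retrying them; this assumes the sys.allocations system table available in CrateDB 4.x:

```sql
-- Show shards the allocator gave up on, with the allocator's explanation.
SELECT table_schema, table_name, shard_id, current_state, explanation
FROM sys.allocations
WHERE current_state = 'UNASSIGNED';

-- Reset the failed-allocation retry counter so these shards are attempted again.
ALTER CLUSTER REROUTE RETRY FAILED;
```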
AFTER running the retry command, shards get stuck in the RELOCATING state and the following log message is emitted at a very high rate:
[WARN ][o.e.t.TransportService][node]Transport response handler not found of id [9285317]
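While this is happening, the stuck shards can be watched from SQL; a minimal sketch, assuming the sys.shards columns documented for CrateDB 4.x:

```sql
-- List shards currently stuck in RELOCATING, including the node involved
-- in the relocation (relocating_node).
SELECT schema_name, table_name, id AS shard_id, routing_state, relocating_node
FROM sys.shards
WHERE routing_state = 'RELOCATING';
```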
About this issue
- State: closed
- Created 4 years ago
- Reactions: 1
- Comments: 17 (8 by maintainers)
Commits related to this issue
- Fix peer recovery request handling, don't sent response twice Remove duplicated response sending. Follow up of #9131. Relates #10306. — committed to crate/crate by seut 4 years ago
- Fix peer recovery request handling, don't sent response twice Remove duplicated response sending. Follow up of #9131. Relates #10306. (cherry picked from commit 74801432fcba62befbe23cde12224e1c3d7f... — committed to crate/crate by seut 4 years ago
We’ve finally found the issue related to the Transport handler not found ... log entries, see https://github.com/crate/crate/pull/10797. Thank you for reporting, it was indeed an issue.

@seut I will get it to you via my colleague @rene-stiams.
@seut appreciate the info, we’ll give unassigned.node_left.delayed_timeout a try and see if it yields any improvements. One thing to note is that as the cluster begins moving shards around, the load inherently increases across a majority of the data nodes, which can cause widespread impact. We certainly understand that network latency/load need to be taken into account, but I think our issue is currently related to high load due to high IOWAIT.
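For reference, a minimal sketch of raising that setting per table in CrateDB; the table name and the 15m value are placeholders, not values from this thread:

```sql
-- Give a node that dropped out more time to rejoin before its shards are
-- reallocated elsewhere (default is 1m). Applies to the table and its partitions.
ALTER TABLE doc.my_table
SET ("unassigned.node_left.delayed_timeout" = '15m');
```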