scylla-operator: storage_io_error while deploying cluster

Describe the bug

Hi, I’m having issues getting my Scylla cluster to run on my Kubernetes 1.18.0 cluster using rook-ceph as the storage backend. I keep getting this error when the first node starts up:

[shard 0] commitlog - Exception in segment reservation: storage_io_error (Storage I/O error: 4: Interrupted system call)

Sometimes the first node starts up fine and the second node throws the same error instead. I set up some nodeSelectors to deploy the Scylla cluster on nodes that I’m certain don’t have disk issues.

To Reproduce

Steps to reproduce the behavior:

  1. Create Operator
  2. Create the Cluster
  3. See error

Expected behavior

I expect the Scylla pods to deploy one by one.
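For reference, a minimal sketch of the kind of Cluster manifest involved here (apiVersion, kind and field layout are approximate for the v0.1.x operator; the storage class name and node label are placeholders, not the actual values from this cluster):

# Approximate sketch for scylla-operator v0.1.x; field names may differ slightly,
# and the storageClassName / node label values are placeholders.
apiVersion: scylla.scylladb.com/v1alpha1
kind: Cluster
metadata:
  name: simple-cluster
  namespace: scylla
spec:
  version: 3.2.1
  developerMode: true                  # matches the "512 MiB per shard ... developer mode" warning in the logs
  datacenter:
    name: dc1
    racks:
      - name: rack1
        members: 3
        storage:
          capacity: 100Gi
          storageClassName: rook-ceph-block        # placeholder: the rook-ceph StorageClass in use
        placement:
          nodeAffinity:                            # stands in for the nodeSelectors mentioned above
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: example.com/scylla-node # hypothetical label
                      operator: In
                      values: ["true"]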

Logs

INFO  2020-04-06 19:06:27,464 [shard 0] init - Scylla version 3.2.1-0.20200122.e3e301906d5 starting.
WARN  2020-04-06 19:06:27,464 [shard 0] init - Only 512 MiB per shard; this is below the recommended minimum of 1 GiB/shard; continuing since running in developer mode
INFO  2020-04-06 19:06:27,465 [shard 0] init - starting prometheus API server
INFO  2020-04-06 19:06:27,465 [shard 0] init - creating tracing
INFO  2020-04-06 19:06:27,465 [shard 0] init - creating snitch
INFO  2020-04-06 19:06:27,465 [shard 0] init - determining DNS name
INFO  2020-04-06 19:06:27,465 [shard 0] init - starting API server
INFO  2020-04-06 19:06:27,466 [shard 0] init - Scylla API server listening on 127.0.0.1:10000 ...
INFO  2020-04-06 19:06:27,466 [shard 0] init - initializing storage service
INFO  2020-04-06 19:06:27,466 [shard 0] init - starting per-shard database core
WARN  2020-04-06 19:06:27,467 [shard 0] init - I/O Scheduler is not properly configured! This is a non-supported setup, and performance is expected to be unpredictably bad.
 Reason found: none of --max-io-requests, --io-properties and --io-properties-file are set.
To properly configure the I/O Scheduler, run the scylla_io_setup utility shipped with Scylla.
INFO  2020-04-06 19:06:27,467 [shard 0] init - creating data directories
INFO  2020-04-06 19:06:27,477 [shard 0] init - creating commitlog directory
Generating public/private rsa key pair.
Your identification has been saved in /etc/ssh/ssh_host_rsa_key.
Your public key has been saved in /etc/ssh/ssh_host_rsa_key.pub.
The key fingerprint is:
SHA256:vOn+hcc7vqswkf7dcrc8RPKde75ZxynIQAd8BKRY1mc root@simple-cluster-mn8-statefulset-0
The key's randomart image is:
+---[RSA 4096]----+
|       o++o.     |
|      + .o.E     |
|     . . .+.     |
|       ....  . . |
|        S.    + o|
|       . +oo.  =o|
|        * .o+...=|
|       . + +oo+o*|
|       .o.+o**.B*|
+----[SHA256]-----+
Could not load host key: /etc/ssh/ssh_host_ecdsa_key
Connecting to http://localhost:10000
Starting the JMX server
INFO  2020-04-06 19:06:30,608 [shard 0] init - creating hints directories
JMX is enabled to receive remote connections on port: 7199
INFO  2020-04-06 19:06:32,220 [shard 0] init - verifying directories
Traceback (most recent call last):
  File "/opt/scylladb/scripts/libexec/scylla-housekeeping", line 197, in <module>
    args.func(args)
  File "/opt/scylladb/scripts/libexec/scylla-housekeeping", line 123, in check_version
    current_version = sanitize_version(get_api('/storage_service/scylla_release_version'))
  File "/opt/scylladb/scripts/libexec/scylla-housekeeping", line 82, in get_api
    return get_json_from_url("http://" + api_address + path)
  File "/opt/scylladb/scripts/libexec/scylla-housekeeping", line 74, in get_json_from_url
    retval = result.get(timeout=5)
  File "/opt/scylladb/python3/lib64/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f9d599de4d0>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
INFO  2020-04-06 19:06:37,466 [shard 0] database - Populating Keyspace system_schema
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF computed_columns id=cc7c7069-3740-33c1-92a4-c3de78dbd2c4 version=2b8c4439-de76-31e0-807f-3b7290a975d7
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF view_virtual_columns id=08843b63-45dc-3be2-9798-a0418295cfaa version=c777531c-15f7-326f-8ebe-39fd0265c8c9
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF dropped_columns id=5e7583b5-f3f4-3af1-9a39-b7e1d6f5f11f version=7426bc6c-4c2f-3200-8ad8-4329610ed59a
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF indexes id=0feb57ac-311f-382f-ba6d-9024d305702f version=99c40462-8687-304e-abe3-2bdbef1f25aa
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF functions id=96489b79-80be-3e14-a701-66a0b9159450 version=329ed804-55b3-3eee-ad61-d85317b96097
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF columns id=24101c25-a2ae-3af7-87c1-b40ee1aca33f version=d33236d4-9bdd-3c09-abf0-a0bc5edc2526
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF scylla_tables id=5d912ff1-f759-3665-b2c8-8042ab5103dd version=16b55508-a81a-3b90-9a0d-f58f5f833864
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF tables id=afddfb9d-bc1e-3068-8056-eed6c302ba09 version=b6240810-eeb7-36d5-9411-43b2d68dddab
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF keyspaces id=abac5682-dea6-31c5-b535-b3d6cffd0fb6 version=e79ca8ba-6556-3f7d-925a-7f20cf57938c
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF triggers id=4df70b66-6b05-3251-95a1-32b54005fd48 version=582d7071-1ef0-37c8-adc6-471a13636139
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF views id=9786ac1c-dd58-3201-a7cd-ad556410c985 version=5b58bb47-96e7-3f57-accf-0bfca4dbbc6e
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF types id=5a8b1ca8-6602-3f77-a045-9273d308917a version=de51b2ce-5e4d-3b7d-a75f-2204332ce8d1
INFO  2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF aggregates id=924c5587-2e3a-345b-b10c-12f37c1ba895 version=4b53e92c-0368-3d5c-b959-2ec1bfd1a59f
WARN  2020-04-06 19:06:37,756 [shard 0] storage_service - Shutting down communications due to I/O errors until operator intervention
WARN  2020-04-06 19:06:37,756 [shard 0] storage_service - Commitlog error: std::system_error (error system:4, Interrupted system call)
INFO  2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: starts
INFO  2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: shutdown rpc and cql server done
INFO  2020-04-06 19:06:37,756 [shard 0] gossip - gossip is already stopped
INFO  2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: stop_gossiping done
INFO  2020-04-06 19:06:37,756 [shard 0] storage_service - messaging_service stopped
INFO  2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: shutdown messaging_service done
INFO  2020-04-06 19:06:37,756 [shard 0] storage_service - stream_manager stopped
INFO  2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: shutdown stream_manager done
INFO  2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: auth shutdown
INFO  2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: done
WARN  2020-04-06 19:06:39,492 [shard 0] commitlog - Exception in segment reservation: storage_io_error (Storage I/O error: 4: Interrupted system call)

Environment:

  • Platform: Generic, local cluster
  • Kubernetes version: 1.18.0
  • Scylla version: 3.2.1
  • Scylla-operator version: yanniszark/scylla-operator:v0.1.4

Additional context

I had updated the cluster from 1.15 -> 1.16 -> 1.17 -> 1.18 when this first started happening, but on a second test cluster, where I performed the same Kubernetes upgrade procedure, Scylla was working fine.

About this issue

  • State: closed
  • Created 4 years ago
  • Comments: 28 (12 by maintainers)

Most upvoted comments

@dahankzter I opened this issue yesterday and it looks somewhat similar. I have a reproducer: https://github.com/scylladb/scylla/issues/6381

@dahankzter can we prevent/warn about this kind of deployment? At least until we figure out how we can support it.

@MicDeDuiwel note that you’ll be experiencing double replication and poor performance running on top of Ceph, since each of Scylla’s replicas will itself be replicated by Ceph. So if each layer has a replication factor of 3, you end up with an overall replication factor of 9. The recommendation is to work with local volumes.
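For anyone following the local-volume recommendation, a minimal sketch of what that looks like in manifest terms, assuming manually created local PersistentVolumes (the capacity, disk path and node name below are placeholders):

# Minimal local-volume sketch; /mnt/disks/ssd1 and worker-1 are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-ssd
provisioner: kubernetes.io/no-provisioner    # local PVs are created by hand (or by the local static provisioner)
volumeBindingMode: WaitForFirstConsumer      # bind only once the Scylla pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: scylla-local-pv-0
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-ssd
  local:
    path: /mnt/disks/ssd1                    # placeholder mount point of the local disk
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["worker-1"]           # placeholder node name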

An update from our side:

After a lot of digging and reaching out, we came to the realisation that our Rook deployment must be at fault. We had many issues with Rook and CephFS and decided to switch to Ceph block storage. This solved our issue; Scylla deployments are working just fine now.
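For reference, “Ceph block storage” here means an RBD-backed StorageClass rather than CephFS; a sketch modelled on Rook’s example manifests from that era is below (pool name, replica count and secret names are the upstream example values, not necessarily what was used in this cluster):

# Sketch based on Rook's example RBD StorageClass (circa Rook v1.2/1.3);
# pool, replica count and secret names are the upstream examples, adjust as needed.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com      # <rook operator namespace>.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete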