scylla-operator: storage_io_error while deploying cluster
Describe the bug Hi, I’m having issues getting my scylla cluster to run on my kubernetes 1.18.0 cluster using rook-ceph as the storage backend. I keep getting this error when the first node starts up:
[shard 0] commitlog - Exception in segment reservation: storage_io_error (Storage I/O error: 4: Interrupted system call)
Or sometimes the first node starts up and then the second node throws the same error. I set up some nodeSelectors to deploy the Scylla cluster on nodes that I'm certain don't have disk issues (a sketch of such a manifest follows the steps below).
To Reproduce
Steps to reproduce the behavior:
- Create Operator
- Create the Cluster
- See error
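For reference, the kind of Cluster manifest involved looks roughly like the sketch below. This is a minimal reconstruction based on the scylla-operator v0.1.x examples, not the exact manifest from this report; the storage class name and the node label used for the nodeSelector-style placement are placeholders.

```yaml
# Sketch of a scylla.scylladb.com/v1alpha1 Cluster manifest as used with
# scylla-operator v0.1.x. Field names follow the operator's bundled examples;
# the storage class name and node label are placeholders, not the values
# from this report.
apiVersion: scylla.scylladb.com/v1alpha1
kind: Cluster
metadata:
  name: simple-cluster
  namespace: scylla
spec:
  version: 3.2.1
  datacenter:
    name: dc1
    racks:
      - name: rack1
        members: 3
        storage:
          capacity: 10Gi
          storageClassName: rook-cephfs      # placeholder: the rook-ceph backed class
        resources:
          requests:
            cpu: 1
            memory: 1Gi
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: storage-health    # placeholder label used to pin pods to known-good nodes
                      operator: In
                      values: ["ok"]
```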
Expected behavior
I expect the Scylla pods to deploy one by one.
Logs
INFO 2020-04-06 19:06:27,464 [shard 0] init - Scylla version 3.2.1-0.20200122.e3e301906d5 starting.
WARN 2020-04-06 19:06:27,464 [shard 0] init - Only 512 MiB per shard; this is below the recommended minimum of 1 GiB/shard; continuing since running in developer mode
INFO 2020-04-06 19:06:27,465 [shard 0] init - starting prometheus API server
INFO 2020-04-06 19:06:27,465 [shard 0] init - creating tracing
INFO 2020-04-06 19:06:27,465 [shard 0] init - creating snitch
INFO 2020-04-06 19:06:27,465 [shard 0] init - determining DNS name
INFO 2020-04-06 19:06:27,465 [shard 0] init - starting API server
INFO 2020-04-06 19:06:27,466 [shard 0] init - Scylla API server listening on 127.0.0.1:10000 ...
INFO 2020-04-06 19:06:27,466 [shard 0] init - initializing storage service
INFO 2020-04-06 19:06:27,466 [shard 0] init - starting per-shard database core
WARN 2020-04-06 19:06:27,467 [shard 0] init - I/O Scheduler is not properly configured! This is a non-supported setup, and performance is expected to be unpredictably bad.
Reason found: none of --max-io-requests, --io-properties and --io-properties-file are set.
To properly configure the I/O Scheduler, run the scylla_io_setup utility shipped with Scylla.
INFO 2020-04-06 19:06:27,467 [shard 0] init - creating data directories
INFO 2020-04-06 19:06:27,477 [shard 0] init - creating commitlog directory
Generating public/private rsa key pair.
Your identification has been saved in /etc/ssh/ssh_host_rsa_key.
Your public key has been saved in /etc/ssh/ssh_host_rsa_key.pub.
The key fingerprint is:
SHA256:vOn+hcc7vqswkf7dcrc8RPKde75ZxynIQAd8BKRY1mc root@simple-cluster-mn8-statefulset-0
The key's randomart image is:
+---[RSA 4096]----+
| o++o. |
| + .o.E |
| . . .+. |
| .... . . |
| S. + o|
| . +oo. =o|
| * .o+...=|
| . + +oo+o*|
| .o.+o**.B*|
+----[SHA256]-----+
Could not load host key: /etc/ssh/ssh_host_ecdsa_key
Connecting to http://localhost:10000
Starting the JMX server
INFO 2020-04-06 19:06:30,608 [shard 0] init - creating hints directories
JMX is enabled to receive remote connections on port: 7199
INFO 2020-04-06 19:06:32,220 [shard 0] init - verifying directories
Traceback (most recent call last):
File "/opt/scylladb/scripts/libexec/scylla-housekeeping", line 197, in <module>
args.func(args)
File "/opt/scylladb/scripts/libexec/scylla-housekeeping", line 123, in check_version
current_version = sanitize_version(get_api('/storage_service/scylla_release_version'))
File "/opt/scylladb/scripts/libexec/scylla-housekeeping", line 82, in get_api
return get_json_from_url("http://" + api_address + path)
File "/opt/scylladb/scripts/libexec/scylla-housekeeping", line 74, in get_json_from_url
retval = result.get(timeout=5)
File "/opt/scylladb/python3/lib64/python3.7/multiprocessing/pool.py", line 657, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '<multiprocessing.pool.ExceptionWithTraceback object at 0x7f9d599de4d0>'. Reason: 'TypeError("cannot serialize '_io.BufferedReader' object")'
INFO 2020-04-06 19:06:37,466 [shard 0] database - Populating Keyspace system_schema
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF computed_columns id=cc7c7069-3740-33c1-92a4-c3de78dbd2c4 version=2b8c4439-de76-31e0-807f-3b7290a975d7
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF view_virtual_columns id=08843b63-45dc-3be2-9798-a0418295cfaa version=c777531c-15f7-326f-8ebe-39fd0265c8c9
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF dropped_columns id=5e7583b5-f3f4-3af1-9a39-b7e1d6f5f11f version=7426bc6c-4c2f-3200-8ad8-4329610ed59a
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF indexes id=0feb57ac-311f-382f-ba6d-9024d305702f version=99c40462-8687-304e-abe3-2bdbef1f25aa
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF functions id=96489b79-80be-3e14-a701-66a0b9159450 version=329ed804-55b3-3eee-ad61-d85317b96097
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF columns id=24101c25-a2ae-3af7-87c1-b40ee1aca33f version=d33236d4-9bdd-3c09-abf0-a0bc5edc2526
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF scylla_tables id=5d912ff1-f759-3665-b2c8-8042ab5103dd version=16b55508-a81a-3b90-9a0d-f58f5f833864
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF tables id=afddfb9d-bc1e-3068-8056-eed6c302ba09 version=b6240810-eeb7-36d5-9411-43b2d68dddab
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF keyspaces id=abac5682-dea6-31c5-b535-b3d6cffd0fb6 version=e79ca8ba-6556-3f7d-925a-7f20cf57938c
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF triggers id=4df70b66-6b05-3251-95a1-32b54005fd48 version=582d7071-1ef0-37c8-adc6-471a13636139
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF views id=9786ac1c-dd58-3201-a7cd-ad556410c985 version=5b58bb47-96e7-3f57-accf-0bfca4dbbc6e
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF types id=5a8b1ca8-6602-3f77-a045-9273d308917a version=de51b2ce-5e4d-3b7d-a75f-2204332ce8d1
INFO 2020-04-06 19:06:37,466 [shard 0] database - Keyspace system_schema: Reading CF aggregates id=924c5587-2e3a-345b-b10c-12f37c1ba895 version=4b53e92c-0368-3d5c-b959-2ec1bfd1a59f
WARN 2020-04-06 19:06:37,756 [shard 0] storage_service - Shutting down communications due to I/O errors until operator intervention
WARN 2020-04-06 19:06:37,756 [shard 0] storage_service - Commitlog error: std::system_error (error system:4, Interrupted system call)
INFO 2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: starts
INFO 2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: shutdown rpc and cql server done
INFO 2020-04-06 19:06:37,756 [shard 0] gossip - gossip is already stopped
INFO 2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: stop_gossiping done
INFO 2020-04-06 19:06:37,756 [shard 0] storage_service - messaging_service stopped
INFO 2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: shutdown messaging_service done
INFO 2020-04-06 19:06:37,756 [shard 0] storage_service - stream_manager stopped
INFO 2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: shutdown stream_manager done
INFO 2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: auth shutdown
INFO 2020-04-06 19:06:37,756 [shard 0] storage_service - Stop transport: done
WARN 2020-04-06 19:06:39,492 [shard 0] commitlog - Exception in segment reservation: storage_io_error (Storage I/O error: 4: Interrupted system call)
Environment:
- Platform: Generic, local cluster
- Kubernetes version: 1.18.0
- Scylla version: 3.2.1
- Scylla-operator version: yanniszark/scylla-operator:v0.1.4
Additional context
I had upgraded the cluster from 1.15 -> 1.16 -> 1.17 -> 1.18 when this first started happening, but on a second test cluster, on which I performed the same Kubernetes upgrade procedure, Scylla was working fine.
About this issue
- State: closed
- Created 4 years ago
- Comments: 28 (12 by maintainers)
@dahankzter I opened this issue yesterday and it looks somewhat similar. I have a reproducer: https://github.com/scylladb/scylla/issues/6381
@dahankzter can we prevent or at least warn about this kind of deployment? At least until we can figure out how to support it.
@MicDeDuiwel note that you’ll be experiencing double replication and poor performance running on top of ceph, since each of Scylla’s replicas will be replicated by ceph. So if each layer has a replication factor of 3, you end up with an overall replication factor of 9. The recommendation is to work with local volumes.
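The "local volumes" recommendation is typically implemented with pre-provisioned local PersistentVolumes and a no-provisioner StorageClass rather than a network filesystem. A minimal sketch using plain Kubernetes objects; names, capacity, the disk path and the node name are placeholders:

```yaml
# Minimal sketch of a local-volume setup: a no-provisioner StorageClass plus
# one pre-created PersistentVolume per node. Names, capacity, the disk path
# and the node name below are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-scylla
provisioner: kubernetes.io/no-provisioner   # local PVs are statically provisioned
volumeBindingMode: WaitForFirstConsumer     # bind only once the pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: scylla-data-node1
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-scylla
  local:
    path: /mnt/disks/scylla                 # locally attached disk on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: ["node1"]             # the node that owns this disk
```

With this layout each Scylla pod writes to a disk on the node it runs on, so the data is replicated only once, by Scylla itself.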
An update from our side:
After a lot of digging and reaching out we came to the realisation that our rook deployment must be at fault. We had many issues with rook and cephfs and decided to switch to ceph block storage. This solved our issue; scylla deployments are working just fine now.
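For completeness, "ceph block storage" here means RBD-backed volumes through the Rook CSI driver instead of CephFS. A sketch of the usual Rook 1.x block pool and StorageClass is below; the pool name, replication size and secret locations follow the stock Rook examples of that era and are not the exact manifests from this deployment.

```yaml
# Sketch of a Rook/Ceph block (RBD) pool and StorageClass, per the Rook 1.x
# CSI examples. Values are the stock example defaults, not the exact
# manifests from this deployment.
apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/fstype: ext4
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
```

The Scylla Cluster manifest's storageClassName then points at rook-ceph-block instead of the CephFS-backed class.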