ClickHouse: LXC containers: Crashes after update/hardware server restart

Hello!

I’ve been having this problem for about a year. After a ClickHouse update and/or a reboot of the hardware server (ClickHouse is installed in an LXC container), I get a bunch of random crashes. Initially they appear right at startup; some time later they appear only under fairly high load (multiple parallel queries); eventually they disappear until the next reboot/update. memcheck reports no RAM errors (no ECC, though), and there are no random crashes or errors in any other software (10+ LXC containers and KVM guests are running, and ZFS uses most of the RAM).

2021.12.15 16:04:10.923799 [ 9351 ] {} <Fatal> BaseDaemon: ########################################
2021.12.15 16:04:10.923890 [ 9351 ] {} <Fatal> BaseDaemon: (version 21.8.12.29 (official build), build id: 89CB735EABD0B424DF213861E4D0FD666E2A0CF1) (from thread 8643) (no query) Received signal Segmentation fault (11)
2021.12.15 16:04:10.923946 [ 9351 ] {} <Fatal> BaseDaemon: Address: 0x7f3300007f1c Access: read. Address not mapped to object.
2021.12.15 16:04:10.923970 [ 9351 ] {} <Fatal> BaseDaemon: Stack trace: 0x10d2b0b0 0x10d5dd2e 0x10f39120 0x10f3dd57 0x10c8d6d0 0x10052768 0x10054797 0x10055514 0x9024b1f 0x9028403 0x7f33ac9d6ea7 0x7f33ac8f5def
2021.12.15 16:04:10.924078 [ 9351 ] {} <Fatal> BaseDaemon: 1. DB::MergeTreeData::getDataPartsVector(std::initializer_list<DB::IMergeTreeDataPart::State> const&, std::__1::vector<DB::IMergeTreeDataPart::State, std::__1::allocator<DB::IMergeTreeDataPart::State> >*, bool) const @ 0x10d2b0b0 in /usr/bin/clickhouse
2021.12.15 16:04:10.924136 [ 9351 ] {} <Fatal> BaseDaemon: 2. DB::MergeTreeDataMergerMutator::selectPartsToMerge(DB::FutureMergedMutatedPart&, bool, unsigned long, std::__1::function<bool (std::__1::shared_ptr<DB::IMergeTreeDataPart const> const&, std::__1::shared_ptr<DB::IMergeTreeDataPart const> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*)> const&, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>*) @ 0x10d5dd2e in /usr/bin/clickhouse
2021.12.15 16:04:10.924200 [ 9351 ] {} <Fatal> BaseDaemon: 3. DB::StorageMergeTree::selectPartsToMerge(std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, std::__1::shared_ptr<DB::RWLockImpl::LockHolderImpl>&, std::__1::unique_lock<std::__1::mutex>&, bool, DB::SelectPartsDecision*) @ 0x10f39120 in /usr/bin/clickhouse
2021.12.15 16:04:10.924225 [ 9351 ] {} <Fatal> BaseDaemon: 4. DB::StorageMergeTree::scheduleDataProcessingJob(DB::IBackgroundJobExecutor&) @ 0x10f3dd57 in /usr/bin/clickhouse
2021.12.15 16:04:10.924258 [ 9351 ] {} <Fatal> BaseDaemon: 5. DB::IBackgroundJobExecutor::backgroundTaskFunction() @ 0x10c8d6d0 in /usr/bin/clickhouse
2021.12.15 16:04:10.924851 [ 9351 ] {} <Fatal> BaseDaemon: 6. DB::BackgroundSchedulePoolTaskInfo::execute() @ 0x10052768 in /usr/bin/clickhouse
2021.12.15 16:04:10.924882 [ 9351 ] {} <Fatal> BaseDaemon: 7. DB::BackgroundSchedulePool::threadFunction() @ 0x10054797 in /usr/bin/clickhouse
2021.12.15 16:04:10.924900 [ 9351 ] {} <Fatal> BaseDaemon: 8. ? @ 0x10055514 in /usr/bin/clickhouse
2021.12.15 16:04:10.924933 [ 9351 ] {} <Fatal> BaseDaemon: 9. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x9024b1f in /usr/bin/clickhouse
2021.12.15 16:04:10.924951 [ 9351 ] {} <Fatal> BaseDaemon: 10. ? @ 0x9028403 in /usr/bin/clickhouse
2021.12.15 16:04:10.924988 [ 9351 ] {} <Fatal> BaseDaemon: 11. start_thread @ 0x8ea7 in /lib/x86_64-linux-gnu/libpthread-2.31.so
2021.12.15 16:04:10.925012 [ 9351 ] {} <Fatal> BaseDaemon: 12. clone @ 0xfddef in /lib/x86_64-linux-gnu/libc-2.31.so
2021.12.15 16:04:11.037995 [ 9351 ] {} <Fatal> BaseDaemon: Checksum of the binary: A8C5BDC5B60DE1251EAACB4D0E110F95, integrity check passed.
2021.12.15 16:04:30.981745 [ 8604 ] {} <Fatal> Application: Child process was terminated by signal 11.

and

2021.12.15 16:05:02.470615 [ 9544 ] {} <Fatal> BaseDaemon: ########################################
2021.12.15 16:05:02.470699 [ 9544 ] {} <Fatal> BaseDaemon: (version 21.8.12.29 (official build), build id: 89CB735EABD0B424DF213861E4D0FD666E2A0CF1) (from thread 9440) (no query) Received signal Segmentation fault (11)
2021.12.15 16:05:02.470740 [ 9544 ] {} <Fatal> BaseDaemon: Address: NULL pointer. Access: read. Unknown si_code.
2021.12.15 16:05:02.470783 [ 9544 ] {} <Fatal> BaseDaemon: Stack trace: 0x10d50d30 0x10d2b832 0x10f3ae37 0x10f3ddf0 0x10c8d6d0 0x10052768 0x10054797 0x10055514 0x9024b1f 0x9028403 0x7f91d1000ea7 0x7f91d0f1fdef
2021.12.15 16:05:02.470937 [ 9544 ] {} <Fatal> BaseDaemon: 1. std::__1::back_insert_iterator<std::__1::vector<std::__1::shared_ptr<DB::IMergeTreeDataPart const>, std::__1::allocator<std::__1::shared_ptr<DB::IMergeTreeDataPart const> > > > std::__1::__merge<DB::MergeTreeData::LessDataPart&, boost::multi_index::detail::bidir_node_iterator<boost::multi_index::detail::ordered_index_node<boost::multi_index::detail::null_augment_policy, boost::multi_index::detail::index_node_base<std::__1::shared_ptr<DB::IMergeTreeDataPart const>, std::__1::allocator<std::__1::shared_ptr<DB::IMergeTreeDataPart const> > > > >, std::__1::__wrap_iter<std::__1::shared_ptr<DB::IMergeTreeDataPart const>*>, std::__1::back_insert_iterator<std::__1::vector<std::__1::shared_ptr<DB::IMergeTreeDataPart const>, std::__1::allocator<std::__1::shared_ptr<DB::IMergeTreeDataPart const> > > > >(boost::multi_index::detail::bidir_node_iterator<boost::multi_index::detail::ordered_index_node<boost::multi_index::detail::null_augment_policy, boost::multi_index::detail::index_node_base<std::__1::shared_ptr<DB::IMergeTreeDataPart const>, std::__1::allocator<std::__1::shared_ptr<DB::IMergeTreeDataPart const> > > > >, boost::multi_index::detail::bidir_node_iterator<boost::multi_index::detail::ordered_index_node<boost::multi_index::detail::null_augment_policy, boost::multi_index::detail::index_node_base<std::__1::shared_ptr<DB::IMergeTreeDataPart const>, std::__1::allocator<std::__1::shared_ptr<DB::IMergeTreeDataPart const> > > > >, std::__1::__wrap_iter<std::__1::shared_ptr<DB::IMergeTreeDataPart const>*>, std::__1::__wrap_iter<std::__1::shared_ptr<DB::IMergeTreeDataPart const>*>, std::__1::back_insert_iterator<std::__1::vector<std::__1::shared_ptr<DB::IMergeTreeDataPart const>, std::__1::allocator<std::__1::shared_ptr<DB::IMergeTreeDataPart const> > > >, DB::MergeTreeData::LessDataPart&) @ 0x10d50d30 in /usr/bin/clickhouse
2021.12.15 16:05:02.470997 [ 9544 ] {} <Fatal> BaseDaemon: 2. DB::MergeTreeData::getDataPartsVector(std::initializer_list<DB::IMergeTreeDataPart::State> const&, std::__1::vector<DB::IMergeTreeDataPart::State, std::__1::allocator<DB::IMergeTreeDataPart::State> >*, bool) const @ 0x10d2b832 in /usr/bin/clickhouse
2021.12.15 16:05:02.471037 [ 9544 ] {} <Fatal> BaseDaemon: 3. DB::StorageMergeTree::selectPartsToMutate(std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, std::__1::shared_ptr<DB::RWLockImpl::LockHolderImpl>&) @ 0x10f3ae37 in /usr/bin/clickhouse
2021.12.15 16:05:02.471064 [ 9544 ] {} <Fatal> BaseDaemon: 4. DB::StorageMergeTree::scheduleDataProcessingJob(DB::IBackgroundJobExecutor&) @ 0x10f3ddf0 in /usr/bin/clickhouse
2021.12.15 16:05:02.471093 [ 9544 ] {} <Fatal> BaseDaemon: 5. DB::IBackgroundJobExecutor::backgroundTaskFunction() @ 0x10c8d6d0 in /usr/bin/clickhouse
2021.12.15 16:05:02.471116 [ 9544 ] {} <Fatal> BaseDaemon: 6. DB::BackgroundSchedulePoolTaskInfo::execute() @ 0x10052768 in /usr/bin/clickhouse
2021.12.15 16:05:02.471137 [ 9544 ] {} <Fatal> BaseDaemon: 7. DB::BackgroundSchedulePool::threadFunction() @ 0x10054797 in /usr/bin/clickhouse
2021.12.15 16:05:02.471156 [ 9544 ] {} <Fatal> BaseDaemon: 8. ? @ 0x10055514 in /usr/bin/clickhouse
2021.12.15 16:05:02.471182 [ 9544 ] {} <Fatal> BaseDaemon: 9. ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) @ 0x9024b1f in /usr/bin/clickhouse
2021.12.15 16:05:02.471201 [ 9544 ] {} <Fatal> BaseDaemon: 10. ? @ 0x9028403 in /usr/bin/clickhouse
2021.12.15 16:05:02.471228 [ 9544 ] {} <Fatal> BaseDaemon: 11. start_thread @ 0x8ea7 in /lib/x86_64-linux-gnu/libpthread-2.31.so
2021.12.15 16:05:02.471254 [ 9544 ] {} <Fatal> BaseDaemon: 12. clone @ 0xfddef in /lib/x86_64-linux-gnu/libc-2.31.so
2021.12.15 16:05:02.582671 [ 9544 ] {} <Fatal> BaseDaemon: Checksum of the binary: A8C5BDC5B60DE1251EAACB4D0E110F95, integrity check passed.
2021.12.15 16:05:22.491631 [ 9394 ] {} <Fatal> Application: Child process was terminated by signal 11.

Crash reporting is enabled.

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 24 (9 by maintainers)

Most upvoted comments

If k8s pods use LXC then it is better to avoid this, yes.

However, if you are using a recent version of ClickHouse then you are safe: the percpu arena will be disabled automatically.

narenas_total_get() returns 8 (equal to the container’s core count), and ind varies up to 11 from start to start.

Nice debugging, great!

A workaround is to set the LXC core count equal to the physical core count.
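
For example, with an LXD-managed container (hypothetical container name mycontainer; assuming a 12-core host), that could look like:

  # Give the container as many cores as the host physically has, so the CPU
  # indices the kernel reports never exceed jemalloc's arena count:
  lxc config set mycontainer limits.cpu 12
  lxc restart mycontainer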

The thing is that ClickHouse uses percpu arenas. You can also disable percpu arenas (a persistent way to apply this is sketched after the notes below):

  • ln -s "percpu_arena:disabled,confirm_conf:true,abort_conf:true" /etc/malloc.conf
  • export MALLOC_CONF="percpu_arena:disabled,confirm_conf:true,abort_conf:true"

Here:

  • confirm_conf:true - prints the config that has been accepted
  • abort_conf:true - aborts if an option or value is unknown
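
A minimal sketch of applying the environment-variable variant persistently, assuming the stock clickhouse-server systemd unit:

  # Open a drop-in override for the service:
  systemctl edit clickhouse-server
  # ...then add in the editor that opens:
  #   [Service]
  #   Environment="MALLOC_CONF=percpu_arena:disabled,confirm_conf:true,abort_conf:true"
  systemctl restart clickhouse-server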

In the meantime I will take a closer look at this issue. BTW, do you have LXD?

Upd. All is clear now: malloc_getcpu returns the physical index of the processor, which can be greater than the available core count.
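
A quick way to observe this mismatch from inside the container (a sketch; field 39 of /proc/self/stat is the CPU number the task last ran on):

  # Core count the container sees (what jemalloc sizes its per-CPU arena array by):
  nproc
  # Physical CPU index the kernel reports for the current process; on a host with
  # more cores than the container is allotted, this can be >= the nproc value:
  awk '{print $39}' /proc/self/stat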

A proper fix may be on the LXC side, providing a correct mapping (extremely complicated, I guess), or taking the physical core count into account.
