ccls: ccls hangs ps commands and then no responses in gvim

Observed behavior

I try to use ccls on my extremely large proprietary codebase which I am unable to share.

After a short amount of time, ccls ends up in a state that is very similar to what is described here: https://rachelbythebay.com/w/2014/10/27/ps/

My “ps -ef” command hangs when it reaches “ccls”. Also, if I run “cat /proc/<pid>/cmdline”, it hangs. I am using gvim 8.1.328 with the “autozimu/LanguageClient-neovim” plugin. When I try commands like “hover” or “goto definition” in gvim, I get no response.

I do not have these issues when using “cquery” on the same codebase with the same compile_commands.json and the same set of source files. I am able to browse with “cquery” but not with “ccls” because of this freeze.

If I use ccls on a much smaller sample project, it works fine and I don’t get this hang.

I do not have root access on the machine where I do the source code development.

I encountered these issues with ccls originally in September 2018 and was using cquery instead because of this problem. I decided to try ccls again and am still getting this issue.

If I quit gvim, ccls still hangs my ps commands. A simple “kill <pid>” won’t stop ccls; I need to run “kill -9 <pid>” to unhang ps.

Expected behavior

Should not get a hang on “ps -ef” when I use ccls, and should not get a hang when I do “cat /proc/<pid>/cmdline”.

Steps to reproduce

  1. gvim “my file”
  2. ccls starts indexing and eventually hangs the ps tools

System information

  • ccls version (git describe --tags):

$ git describe --tags
0.20181225.8

  • OS:

$ uname -a
Linux hostname 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

  • Editor:

gvim 8.1.328

  • Language client (and version):

autozimu/LanguageClient-neovim

I don’t know how to get its version; I installed it last fall. It works with cquery.

About this issue

  • Original URL
  • State: closed
  • Created 5 years ago
  • Comments: 27 (11 by maintainers)

Most upvoted comments

I’m able to make the uninterruptible sleep go away completely by changing the locks on g_index_mutex in pipeline.cc! I changed every std::shared_lock on this mutex to std::unique_lock, and also changed the std::lock_guard to std::unique_lock (not sure that was necessary, but I did it anyway).

Now I don’t go into uninterruptible sleep at all anymore! And I’m using all 72 threads!

So I think there are bugs in ccls around the use of std::shared_lock in sections of code that write to shared data structures. std::shared_lock should only protect reads, but ccls takes it around writes!

This is causing data corruption and somehow this leads to uninterruptible sleep on my system. Perhaps for others it manifests as “Out of Memory”, because who knows what problems corrupted data structures can lead to!

So I think ccls should fix this bug. It also explains why I see this issue with ccls and not cquery: ccls is not taking proper locks around data structures accessed by multiple threads.
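
To make the distinction concrete, here is a minimal, hypothetical sketch (not ccls code; the map and function names are made up) of the rule I mean: readers of data guarded by a std::shared_mutex take std::shared_lock, while any thread that writes must take std::unique_lock.

    // Minimal, hypothetical illustration (not ccls code) of the locking rule:
    // many readers may share the mutex, but any write needs exclusive ownership.
    #include <cstdio>
    #include <mutex>
    #include <shared_mutex>
    #include <string>
    #include <thread>
    #include <unordered_map>
    #include <vector>

    std::shared_mutex mtx;
    std::unordered_map<std::string, int> path2entry_index;  // stand-in for the shared map

    int Lookup(const std::string &path) {
      std::shared_lock lock(mtx);  // read-only access: a shared lock is fine
      auto it = path2entry_index.find(path);
      return it == path2entry_index.end() ? -1 : it->second;
    }

    void Insert(const std::string &path, int id) {
      std::unique_lock lock(mtx);  // mutation: must hold the lock exclusively
      path2entry_index[path] = id; // under std::shared_lock this write would race
    }

    int main() {
      std::vector<std::thread> workers;
      for (int i = 0; i < 8; i++)
        workers.emplace_back([i] { Insert("file" + std::to_string(i) + ".cc", i); });
      for (auto &t : workers)
        t.join();
      std::printf("file3.cc -> %d\n", Lookup("file3.cc"));
    }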

I believe I found one of the root causes of the uninterruptible sleep! By fixing this bug, ccls no longer goes into uninterruptible sleep right away. However, it still goes into uninterruptible sleep much later, but I think I have peeled the first layer of the onion on this problem! There is probably another data structure that is not properly protected and is causing the uninterruptible sleep later on.

From pipeline.cc, in Indexer_Parse(), around lines 322-325:

        if (entry.id >= 0) {
          std::shared_lock lock2(project->mtx);
          project->root2folder[entry.root].path2entry_index[path] = entry.id;
        }

I have caught MANY MANY threads trying to write to path2entry_index at the same time! I have attached 2 screenshots from the same core dump that show this.

[Screenshots attached: ccls_bug2, ccls_bug]

When I change every lock on project->mtx to std::unique_lock, I can delay the uninterruptible sleep for a very long time. It does still happen, though, so I think there is probably another data structure in ccls that is not properly protected.

Since I have 72 CPU cores and 72 threads, I have many more threads operating in parallel than other users, which is probably why I see this problem and others do not. Reducing the number of threads also makes it less likely that multiple threads access the data structure at the same time.

I’m not sure how much more that sort of trial-and-error testing will help find the problem; the failure seems pretty random to me and so doesn’t really point in a particular direction.

Has anyone considered building ccls with TSAN (ThreadSanitizer) just to see if it finds anything? Something like that seems more promising, if the results are usable.

Great finding! Thank you for pinpointing this incorrect usage of shared_mutex. It was caused by:

Commit 579df478eb203d311b8e33ba4f4bc7bd23e87efc
Date:   Fri Dec 21 01:05:23 2018 -0800

    Extend .ccls

I couldn’t remember why I switched to std::shared_lock at some point, and have amended it … 😅

The main thread (sometimes a reader, sometimes a writer) and the indexer threads (always writers) access root2folder concurrently. Since there can only be one reader, I shall change the std::shared_mutex to std::mutex.

The race condition was difficult to trigger because the time spent in the critical section is short. With 72 threads, contention becomes likely.

        IndexUpdate update = IndexUpdate::CreateDelta(nullptr, prev.get());
        on_indexed->PushBack(std::move(update),
          request.mode != IndexMode::NonInteractive);
        if (entry.id >= 0) {
          std::shared_lock lock2(project->mtx);     ///////////// should use std::lock_guard
          project->root2folder[entry.root].path2entry_index[path] = entry.id;
        }
      }
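
For reference, a sketch of what the corrected write could look like under that change: project->mtx as a plain std::mutex, held via std::lock_guard around the write. The surrounding types here are simplified stand-ins to make the fragment compile, not the actual ccls definitions or the actual patch.

    // Hypothetical, simplified stand-ins for the ccls types, only to show the
    // locking change: a plain std::mutex held exclusively while writing.
    #include <mutex>
    #include <string>
    #include <unordered_map>

    struct Folder {
      std::unordered_map<std::string, int> path2entry_index;
    };

    struct Project {
      std::mutex mtx;  // was std::shared_mutex
      std::unordered_map<std::string, Folder> root2folder;
    };

    void RecordEntry(Project *project, const std::string &root,
                     const std::string &path, int entry_id) {
      if (entry_id >= 0) {
        std::lock_guard<std::mutex> lock2(project->mtx);  // exclusive: this is a write
        project->root2folder[root].path2entry_index[path] = entry_id;
      }
    }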

This commit is included in 0.20181225.8, so I should make a new patch release, 0.20181225.9.

This is causing data corruption and somehow this leads to uninterruptible sleep on my system.

I hope this is the exact issue that leads to “Out of Memory”. I’m still very suspicious about why it does that… Did you get a chance to dump /proc/$PID/task/$task/stack when the threads were wreaking havoc in the std::shared_lock-guarded region? (Never mind if you didn’t.) Thanks so much for locating this!