rustup: Rustup (including proxies) is not safe for concurrent use

(Edited to capture all the details that have emerged over time)

Recovering from this bug:

Usually just running rustup component remove NAME && rustup component add NAME will fix things. Sometimes the entire toolchain will need to be removed and re-added. In rare cases uninstalling rustup entirely will be needed.

User model of Rustup

Rustup may be run as three different commands:

  • rustup-init to install rustup (& by default a toolchain)
  • rustup to explicitly query or modify an installation (including rustup itself and one or more toolchains)
  • as a proxy for rustc, cargo, etc. (& which can implicitly trigger installation, upgrade, or modification of a toolchain, e.g. through toolchain files)

Locking in Rustup

Rustup gets run concurrently in two very different contexts: within a single (machine, user), it may be run concurrently by the user, or an IDE, or both, to perform tasks ranging from toolchain installation to component addition to documentation opening. All of these require multiple reads of the rustup data with a consistent view; some of them require writes as well.

Rustup may also be run across machines that share a single rustup install - (machineA, userA) + (machineB, userA) - and in this case the same set of operations may take place, with the same requirements.

Whatever consistency solution we adopt should ideally cover both use cases, without requiring manual lock administration (as lockdir-style solutions do) or additional running network daemons.

Proxies

Rustup has one set of proxies shared across all toolchains; the proxies are held open while any process from a toolchain is running - e.g. IDEs hold rls open for extended periods.

We need a lock to prevent concurrent attempts to install new proxies, and we need a notification mechanism back to the running proxies so they can be told to exit when an update is required (because of presumed limitations on replacing in-use files on Windows, though recent changes may mean we can avoid this).
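
The locking half could look something like this - a minimal sketch only, assuming an advisory file lock via the fs2 crate (discussed further down); the lock-file name and helper are illustrative, not rustup’s actual layout:

```rust
// Sketch: serialise proxy (re)installation behind one advisory file lock.
// Lock-file name and helper are illustrative, not rustup's real code.
use std::fs::OpenOptions;
use std::io;
use std::path::Path;

use fs2::FileExt; // adds lock_exclusive()/unlock() to std::fs::File

fn with_proxy_install_lock<F, T>(rustup_home: &Path, f: F) -> io::Result<T>
where
    F: FnOnce() -> io::Result<T>,
{
    let lock_path = rustup_home.join("proxy-install.lock");
    let lock_file = OpenOptions::new().create(true).write(true).open(&lock_path)?;

    // Block until no other rustup invocation is installing proxies.
    lock_file.lock_exclusive()?;
    let result = f();
    // The lock is also released when the file is closed; unlock explicitly for clarity.
    let _ = lock_file.unlock();
    result
}
```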

Toolchains

We have many toolchains in one rustup installation: stable, beta, nightly, dated nightly versions, and custom versions. Adding a toolchain adds a directory and a hash file; we need a lock to prevent collisions when attempting to move the directory into place. Deleting a toolchain does a recursive rm in-place, which also needs a lock to prevent other rustup invocations from presuming that the toolchain is still installed while the deletion takes place (or perhaps we need to rename-then-delete, though that can run into race conditions with virus scanners, especially if the toolchain was just installed). Further, permitting deletions at any point will require notifications to running rls proxy processes from that toolchain to cause them to shut down, or the .exe is likely not deletable on Windows.
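
The rename-then-delete variant mentioned above could look roughly like this (a sketch; the directory layout and rename suffix are illustrative, and Windows would additionally need a retry loop for files held open by scanners):

```rust
// Sketch: remove a toolchain by renaming it aside first, so other rustup
// invocations never observe a half-deleted toolchain directory.
use std::fs;
use std::io;
use std::path::Path;

fn remove_toolchain(toolchains_dir: &Path, name: &str) -> io::Result<()> {
    let live = toolchains_dir.join(name);
    // Renaming within the same directory stays on one filesystem, so the
    // rename is atomic: readers either see the toolchain or they don't.
    let doomed = toolchains_dir.join(format!(".{}.removing", name));
    fs::rename(&live, &doomed)?;

    // The slow recursive delete now happens under a name nobody resolves.
    fs::remove_dir_all(&doomed)
}
```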

Components

Components such as rls are added into a toolchain directory, and also involve writing to a metadata file within the toolchain tree itself. This needs to be locked to avoid corruption/dropped writes. As with toolchains, we need proxy notification for component removal, as well as a way to make sure that a component being removed does not have new instances spawned between the start and the completion of the removal.
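
A sketch of what the locked metadata update might look like, using write-then-rename so a crash never leaves a truncated file (the lock-file and metadata file names are illustrative, and the fs2 crate is assumed for the advisory lock):

```rust
// Sketch: serialise component changes per toolchain and update the metadata
// file atomically. File names are illustrative.
use std::fs::{self, File, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

use fs2::FileExt;

fn update_component_metadata(toolchain_dir: &Path, new_contents: &str) -> io::Result<()> {
    // One lock file per toolchain serialises add/remove of components.
    let lock = OpenOptions::new()
        .create(true)
        .write(true)
        .open(toolchain_dir.join(".components.lock"))?;
    lock.lock_exclusive()?;

    // Write to a temp file, fsync, then rename over the old metadata so a
    // crash mid-write can never drop or corrupt existing entries.
    let tmp = toolchain_dir.join("components.tmp");
    let dest = toolchain_dir.join("components");
    let mut f = File::create(&tmp)?;
    f.write_all(new_contents.as_bytes())?;
    f.sync_all()?;
    fs::rename(&tmp, &dest)?;

    lock.unlock()
}
```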

Downloads

We download packages for rustup itself, toolchains and additional components for toolchains, and (a corner case) custom installer executables for toolchains. We also download digital signature metadata files.

The same file can be downloaded by two different rustup invocations for two different reasons. For instance, downloading nightly and a dated nightly for today will download the same file(s).

We used to leak partially downloaded files, and recently started deleting all download dir contents when rustup finished running; this is now causing errors, since concurrent invocations delete each other’s in-progress downloads.

We need some mechanism to deal with leaks, but also to permit concurrent rustup executions to download files without interruption. Possibly a date-based or locking-based mechanism would be sufficient.
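
A sketch of the date-based option: prune only downloads older than some cutoff rather than wiping the directory, so a concurrent invocation’s in-flight partial downloads survive (the cutoff and the use of mtime are illustrative choices):

```rust
// Sketch: age-based cleanup of the downloads dir instead of deleting everything.
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime};

fn prune_stale_downloads(download_dir: &Path, max_age: Duration) -> io::Result<()> {
    let now = SystemTime::now();
    for entry in fs::read_dir(download_dir)? {
        let entry = entry?;
        // Use the last modification time as a cheap proxy for "still being written".
        let modified = entry.metadata()?.modified()?;
        if let Ok(age) = now.duration_since(modified) {
            if age > max_age {
                // Ignore races: another invocation may have pruned it already.
                let _ = fs::remove_file(entry.path());
            }
        }
    }
    Ok(())
}
```

Called at the end of a run with a cutoff of, say, a few days, this leaks nothing long-term while never touching files another process wrote in the last moments.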

Network file systems & rustup

Linux handles SMB mounts and locking on that filesystem well, at least per my reading of the source - a rustup dir on an SMB-mounted file system using regular POSIX locks should have those locks reflected as SMB locks.

NFS is well known for having poor lock behaviour when the network services are not running or are firewalled; the underlying syscalls are only non-blocking on the file descriptor itself, and EWOULDBLOCK is defined as the lock already being held, not as the OS being unable to determine whether the lock is held…

  [EWOULDBLOCK]
      The file is locked and the LOCK_NB option was specified.

So it is likely that we will see bug reports of rustup hanging when ~/.rustup is on NFS and the NFS server’s lock RPC service is misconfigured or flaky. I think this can be mitigated by emitting an NFS-specific log message, once per process, when taking a lock out on NFS; with a config option to disable that for users who are confident they don’t need it… and a bug reporting hint to tell us they have disabled it.
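
For example (a sketch only; the env variable name is made up here, and fs2-style locking is assumed):

```rust
// Sketch: take the rustup lock, warning once if we end up waiting, so a hung
// NFS lock daemon is at least visible. The env var name is hypothetical.
use std::env;
use std::fs::File;
use std::io;

use fs2::FileExt;

fn lock_with_nfs_hint(lock_file: &File) -> io::Result<()> {
    match lock_file.try_lock_exclusive() {
        Ok(()) => Ok(()),
        Err(_) => {
            // Either another rustup holds the lock, or (on NFS) the lock RPC
            // service is unreachable and the blocking call below may hang.
            if env::var_os("RUSTUP_NO_NFS_LOCK_WARNING").is_none() {
                eprintln!(
                    "warning: waiting for a lock on the rustup home; if it is on \
                     NFS and the server's lock service is down, this may hang"
                );
            }
            lock_file.lock_exclusive()
        }
    }
}
```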

Locks and notifications

OS locks typically don’t have callback notifications built in; polling is one answer, or some form of lightweight message bus (e.g. zmq) with clear rules tied into the lock management layer. We have to be careful about race conditions though: in particular, sending notifications before or after the relevant operation as appropriate.
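
A sketch of the polling option, assuming a generation file that an updater bumps (via write-temp-then-rename, while holding the lock) whenever running proxies should exit; the file name and poll interval are illustrative:

```rust
// Sketch: poll-based "please exit" signal for long-running proxies such as rls.
use std::fs;
use std::path::Path;
use std::thread;
use std::time::Duration;

fn read_generation(rustup_home: &Path) -> u64 {
    fs::read_to_string(rustup_home.join("generation"))
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(0)
}

/// Blocks until an updater bumps the generation file, then returns so the
/// proxy can exit and release its executables for replacement/removal.
fn wait_for_shutdown_request(rustup_home: &Path, started_at: u64) {
    loop {
        if read_generation(rustup_home) != started_at {
            return;
        }
        thread::sleep(Duration::from_secs(5));
    }
}
```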

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 17
  • Comments: 21 (16 by maintainers)

Most upvoted comments

@rbtcollins OK. Just to clarify my own understanding of the issue as @matklad described it, if the update crashes while it is happening, it could leave the system in an indeterminate state, which is enough of a problem that we want some method of avoiding it completely, correct? If that is true, below is a first stab at solving the two problems. Every process is required to follow the steps below, or everything fails.

  1. Create a new temporary file in /tmp. My current understanding is that this is not guaranteed to be atomic, but that doesn’t matter; the only requirement is that each process tries to create its own file.
  2. Each process then tries to atomically rename their own temporary file to some well-known name, e.g. /tmp/rustup_lock_file. Bonus points if the contents of the file include a short readme so that anyone wondering what this file is can get an idea as to what it’s for. It may even contain a URL pointing to this issue, or something similar.
  3. If the rename fails, it means that either some other running process owns the lock, or the other process crashed while holding the lock. In most cases it will mean the former, and as long as everything is proceeding as expected nobody will look at this file; it will either be cleared at the end of the second phase (see below), or on the next reboot of the machine, which is when the system will reset the contents of /tmp.
  4. If the rename succeeds, then that process now owns the lock. It now starts the second stage actions.
    1. It looks for a file within ~/.rustup named state_machine. If this file does not exist, then the last process to do any work succeeded, and the system is in a clean state. Since the process owns the lock, it can now safely create the state_machine file, and update its contents with status messages, state information, etc., etc., etc. safely. Once all work is done, the process first deletes the state_machine file, and then deletes the /tmp/rustup_lock_file, confirming that the former is truly committed before executing the latter.
    2. If the file does exist, it means that the last process to be doing updates failed somehow, leaving the directory in an unclean state. The state written within the file could be used to indicate what steps succeeded, and what steps need to be rolled back or finished, or whatever. Regardless, cleanup steps are initiated.
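
Roughly, in code, that could look like the sketch below. One caveat: std::fs::rename silently replaces an existing destination on Unix, so this sketch uses create_new (O_EXCL on the well-known path) as the atomic test-and-set instead of a rename; the file names are illustrative.

```rust
// Sketch of the lock + state_machine idea; not rustup's actual code.
use std::fs::{self, OpenOptions};
use std::io::{self, Write};
use std::path::Path;

fn run_locked(lock_path: &Path, state_path: &Path) -> io::Result<()> {
    // Atomic "create if absent": exactly one process succeeds; an
    // AlreadyExists error means another process owns (or crashed while
    // holding) the lock.
    let mut lock = OpenOptions::new()
        .write(true)
        .create_new(true)
        .open(lock_path)?;
    writeln!(lock, "rustup lock file; see the rustup concurrency issue for details")?;

    // If state_machine already exists, the previous lock owner crashed
    // mid-update: recover or roll back before doing any new work.
    if state_path.exists() {
        recover_from(state_path)?;
    }

    fs::write(state_path, "step: starting\n")?;
    do_the_actual_work(state_path)?;

    // Commit order matters: remove the state file first, then release the lock.
    fs::remove_file(state_path)?;
    fs::remove_file(lock_path)
}

// Placeholders for whatever recovery and real work are needed.
fn recover_from(_state: &Path) -> io::Result<()> { Ok(()) }
fn do_the_actual_work(_state: &Path) -> io::Result<()> { Ok(()) }
```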

There are a few advantages to the above algorithm:

  • There is only one lock, so the time it takes to grab or release the lock should be small relative to the amount of work the rest of the system is doing.
  • It doesn’t preclude the use of threads internal to a process, so we’re not forcing single threading, just single processing.
  • If the process crashes, then cleaning up is simply a matter of rebooting the system. Since that seems to be the go-to fix for everything computer related, it won’t be a surprise to anyone to find out that’s what they should do. The advantage is that if the process crashed due to something weird in the computer itself unrelated to rustup, rebooting might clear that issue out, letting the process continue to completion after reboot.
  • The state_machine file gives a hint as to where the process was when it failed. This could be useful when debugging what’s going on, and for general awareness of the current state of the process (e.g., tail -f ~/.rustup/state_machine is a quick and dirty way of watching the current progress of the process).

The one big disadvantage is that this is not a truly transactional algorithm. I don’t yet know how to do that, but maybe some of the database literature would help with that?

rustup needs to have concurrency support bolted on in general. Need to extract cargo’s flocking code.

Not actively working on a PR, but it is on my radar.

We’ve been running into this on the Libra team: developers using CLion or IDEA quite often manage to invoke rustup concurrently by accident when we check in toolchain file updates. This results in a borked toolchain install that is missing rustc, but it fixes itself after a manual run of rustup toolchain install $TOOLCHAIN.

@rbtcollins Are you working on a PR to fix this now? If so, that would be awesome.

Further thoughts: we need to mutually exclude operations on toolchains: removal and upgrade affect the entire toolchain, as do operations on components (adding/removing components). I think locking the toolchain’s directory is probably the right control point.

OK, so it looks like cargo has problems on NFS, which I was worried about:

cifsfs supports oplocks, so SMBfs should lock properly:

fs2, which cargo uses, does use LOCK_NB with flock: flock(file, libc::LOCK_EX | libc::LOCK_NB) (https://tantivy-search.github.io/tantivy/src/fs2/unix.rs.html#36)

questions:

  • would lockf with F_TLOCK instead still fail in the same way? It is defined as never blocking, but network file systems … need to check the code (see the sketch after this list).
  • Open file description locks still don’t have a timeout field in the lock descriptor that calls operate on, so there’s no way to tell NFS that you don’t want to wait for that shitty shitty network. There is ENOLCK to signal that locking over the network failed, but the question is how long the network takes to realise it.
  • https://docs.oracle.com/cd/E19253-01/816-4555/rfsrefer-9/index.html has a client side retransmit timer of 15 seconds, which suggests ELONGTIME.
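
To make the first question above concrete, here’s what a non-blocking lockf(F_TLOCK) probe looks like on Unix - a sketch in plain libc FFI, not what fs2 or cargo actually do; whether the NFS client returns from it promptly is exactly the open question:

```rust
// Sketch: non-blocking lockf probe. The file must be open for writing.
use std::fs::File;
use std::io;
use std::os::unix::io::AsRawFd;

fn try_lockf(file: &File) -> io::Result<bool> {
    // F_TLOCK takes the lock if free, otherwise fails immediately with
    // EACCES/EAGAIN instead of blocking; ENOLCK means the locking service
    // itself (e.g. NFS lockd) failed.
    let rc = unsafe { libc::lockf(file.as_raw_fd(), libc::F_TLOCK, 0) };
    if rc == 0 {
        return Ok(true);
    }
    let err = io::Error::last_os_error();
    match err.raw_os_error() {
        Some(libc::EACCES) | Some(libc::EAGAIN) => Ok(false), // already held
        _ => Err(err), // includes ENOLCK
    }
}
```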

I’m inclined to provide a UI warning about NFS when taking the lock, with an env variable to silence the warning for folks who know things work properly in their environment (using the NFS sniffing logic from cargo); but then assume that the fs is working properly.

No: https://github.com/rust-lang/cargo/blob/master/src/cargo/util/flock.rs

I’ve also absolutely dropped the ball on this one 😃

@matklad did that fancy flock code get extracted?

Yes, currently we have no locking on the toolchain directories IIRC.

Hi, is this still an issue? It might explain some problems I’ve been having with rustup.

Oh there’s a tricky bit here in that rustup is reentrant. Can’t just use a simple global lock.

Where exactly does reentrancy happen?

I don’t yet understand how rustup works, but it looks like adding a file lock to the Transaction should do the trick, unless transactions can nest and deadlock each other?

@matklad ooh awesome. Can you extract cargo’s fancy flock code to its own crate to share with rustup?