invoke: ctx.run fails due to bad command exit code

Scenario

I stumbled upon this issue when using fabric2, where I had a task that executes alembic to upgrade/downgrade multiple local databases. When executing the task, it would sometimes (seemingly at random) fail with the following exception.

Please note the following is only happening when pty is set to False.
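
For reference, a minimal sketch of how the pty flag is toggled per call (the echo command is just a placeholder; pty=False is also invoke's default):

from invoke import run

run("echo hello", pty=False)  # the code path affected by this issue
run("echo hello", pty=True)   # the pty code path, which is not affected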

Version of invoke used: 1.3.0.

Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed

Reproduction of the bug

I replicated the fabric task as a pure invoke task, shown below. When executing the following code I randomly get the error mentioned above. Executing the task multiple times shows that the error occurs on different iterations each run.

Code

from invoke import task


@task
def upgrade(ctx):
    for index in range(1, 10):
        try:
            print("Run upgrade {}".format(index))
            ctx.run("./env/bin/alembic upgrade head")
        except Exception as exc:
            print(exc)
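
As an aside, the exception printed here is invoke's UnexpectedExit. A hypothetical variant of the task that inspects the Result instead of catching, using run()'s warn=True flag:

from invoke import task


@task
def upgrade_warn(ctx):
    for index in range(1, 10):
        # warn=True makes run() return a Result instead of raising
        # UnexpectedExit when the exit code is nonzero (or, as here, None).
        result = ctx.run("./env/bin/alembic upgrade head", warn=True)
        if not result.ok:
            print("Run {} failed, exit code: {!r}".format(index, result.exited))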

First run

$ ./env/bin/invoke upgrade
Run upgrade 1
Run upgrade 2
Run upgrade 3
Run upgrade 4
Run upgrade 5
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 6
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 7
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 8
Run upgrade 9

Second run

$ ./env/bin/invoke upgrade
Run upgrade 1
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 2
Run upgrade 3
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 4
Run upgrade 5
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 6
Encountered a bad command exit code!

Command: './env/bin/alembic upgrade head'

Exit code: None

Stdout: already printed

Stderr: already printed


Run upgrade 7
Run upgrade 8
Run upgrade 9

Investigation

After investigating the issue with one of my colleagues, we may have identified the cause; however, this is an assumption rather than a certainty. It seems that the threads spawned for the Runners access the Local.returncode method before the process has had a chance to finish, without waiting on the process.
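
This is easy to demonstrate with subprocess directly: Popen.returncode stays None until the child has been reaped via poll() or wait(), which would explain the Exit code: None above. A minimal, self-contained sketch:

import subprocess
import sys
import time

# Spawn a child that exits immediately.
proc = subprocess.Popen([sys.executable, "-c", "pass"])
time.sleep(0.5)          # the child has almost certainly exited by now...
print(proc.returncode)   # ...but returncode is still None: nothing reaped it
print(proc.wait())       # wait() reaps the child and records the code: 0
print(proc.returncode)   # now populated: 0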

I’ve tried changing the code of returncode by adding a Popen.wait() call, which seems to fix the issue I had:

    def returncode(self):
        if self.using_pty:
            # No subprocess.returncode available; use WIFEXITED/WIFSIGNALED to
            # determine which of WEXITSTATUS / WTERMSIG to use.
            # TODO: is it safe to just say "call all WEXITSTATUS/WTERMSIG and
            # return whichever one of them is nondefault"? Probably not?
            # NOTE: doing this in an arbitrary order should be safe since only
            # one of the WIF* methods ought to ever return True.
            code = None
            if os.WIFEXITED(self.status):
                code = os.WEXITSTATUS(self.status)
            elif os.WIFSIGNALED(self.status):
                code = os.WTERMSIG(self.status)
                # Match subprocess.returncode by turning signals into negative
                # 'exit code' integers.
                code = -1 * code
            return code
            # TODO: do we care about WIFSTOPPED? Maybe someday?
        else:
            # Ensure the subprocess has actually been reaped before asking
            # for its exit code; otherwise returncode may still be None.
            self.process.wait(timeout=3)
            return self.process.returncode
With that change applied, the task runs to completion:

$ ./env/bin/invoke upgrade
Run upgrade 1
Run upgrade 2
Run upgrade 3
Run upgrade 4
Run upgrade 5
Run upgrade 6
Run upgrade 7
Run upgrade 8
Run upgrade 9

Please consider the following pull request: https://github.com/pyinvoke/invoke/pull/661

About this issue

  • State: closed
  • Created 5 years ago
  • Reactions: 2
  • Comments: 24 (14 by maintainers)

Most upvoted comments

I can confirm the fix works.

Using a similar ‘a to z’ tasks file, I tested on Windows 10 with various releases of invoke and multiple Python versions.

py.exe -3.8 -m pip install --upgrade invoke==1.2.0
del out.txt && py.exe -3.8 -m invoke a b c d e f g h i j k l m n o p q r s t u v w x y z && py.exe -3.8 -c "import invoke; print(invoke.__version__, open('out.txt', 'rt').read()[-3:].strip())"

Results (invoke version, followed by the last line of out.txt; ‘z’ means all 26 tasks ran in order):

1.2.0 z
1.3.0 b
1.4.0 c
1.4.0 z

Where the latter 1.4.0 is git master. These results are reproducible on Python 2.7, 3.4, 3.5, 3.6, 3.7, and 3.8.

Thanks so much for invoke, and for looking into this!

1.3.1 and 1.4.1 are out!

@bitprophet master works (8b4206ef) on Windows. I built a regression test that fails 100% of the time for me. I’m working on cleaning it up and I’ll post it in a new repo for you to look at. But basically, it’s a task file with 26 tasks (a through z) and a call to invoke a b c ... z. Each task outputs its name (e.g. task ‘a’ outputs ‘a’) into a shared file. Since the tasks from the command-line invocation appear to execute in order, the last line in the file should be ‘z’.
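
A minimal sketch of what such a tasks file could look like (the helper, the task generation, and the file name are illustrative; the actual regression repo may differ):

from invoke import task


def _make_task(name):
    @task(name=name)
    def _t(ctx):
        # Append this task's name to the shared output file.
        with open("out.txt", "at") as f:
            f.write(name + "\n")
    return _t


# Generate 26 tasks named 'a' through 'z' at module level so invoke's
# collection loader picks them up.
for _name in "abcdefghijklmnopqrstuvwxyz":
    globals()[_name] = _make_task(_name)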

I’ve found iterating on that test 5 times will consistently produce the failure.

I also built a search (bisector-esque) and it looks like 75c4d5eb68c9 (8-Jul-2019) introduced the problem.

But the latest on master (8b4206ef) is working great on my Windows box.

Travis seems to confirm: the regular tests all still pass, and the new regression checker does too.

Just merged up through master - I’d love it if someone else (especially a Windows user) could grab one of the branches (1.3, 1.4, or master) and confirm it’s good for them as well. I’ll release after that happens.

git bisect confirms the issue was introduced in 75c4d5eb68c9bd0165de564be9aa11f02b1f88d3, aka the commit that both closed subprocess stdin and modified what it means for an IO worker thread to be “dead”.

The latter is the usual culprit for this class of issue, so again, not surprising. I independently tried tweaking just the subprocess-stdin-closing part, and that doesn’t seem to make a difference, other than making the issue worse when it’s commented out.

I tried reverting just the ExceptionHandlingThread.is_dead change and poof, the issue immediately disappears 😞

This is why I really hate it when that code changes…


I don’t see any material in/related to #552 that clearly explains why this change was directly relevant to the stdin blocking problem. My suspicion is that the OP in that ticket (and I clearly didn’t catch this or it wasn’t fully relevant at the time) got turned around while debugging and thought the dead-thread checker was part & parcel of the problem.

(It occurs to me today, rereading the code, that this may be partly due to the use of “dead” here to mean “a thread that died unexpectedly”…it’s a bit confusing. The fact that the “fixed” tests in that PR had to be “fixed” to keep passing should have been a sign to both of us that they were failing correctly - that’s on me.)


With a sample of one of the triggers for the issue that #552 fixes (Invoke never telling its subprocess about mirrored stdin closing), echo Hi | python -c "from invoke import run; run('cat')":

  • Invoke 1.2.0, which lacks #552’s fix, hangs forever - as expected
  • Invoke 1.3.0, with #552’s fix, works fine, as expected
  • Invoke 1.3.0, with the ExceptionHandlingThread.is_dead change reverted (and no other changes), still works just fine.

So I think for now, this is probably the culprit, and reverting it seems unlikely to completely undo the good work of #552. I’m willing to revert it and then wait to see whether there’s a true corner case I’m missing that the is_dead tweak actually fixes.

Sadly I don’t dev on Windows so I’ve no offhand idea why it’s so much worse there, but I’m relatively confident any fix that kills the issue on Unixes will work there as well. Will definitely be looking for confirmation of any final fix from y’all Windows users 👍

Thanks @flazzarini, #661 also solves the same issue we had in our GitHub Actions CI workflows (https://github.com/arduino/arduino-cli/pull/416), in both Windows and Linux environments.

I hope that this PR will be merged ASAP 😸