pytest-xdist: pytest hangs while running tests

Hello everyone, last night I left my test suite running until this morning, but I noticed that it hadn't finished. Looking at htop, I saw some strange processes that had been running for a really long time, as shown in this screenshot:

Looking at the test output, I saw that the last tests run were all handled by the same worker, gw2 (there were 4 workers running). Since there were 3 processes running import sys;exec(eval(sys.stdin.readline())) (the bootstrap command execnet uses to spawn each worker) for 13+ hours, I think those 3 workers were just stuck somehow.

About this issue

  • State: open
  • Created 7 years ago
  • Reactions: 8
  • Comments: 30 (12 by maintainers)

Most upvoted comments

@nicoddemus I’m wondering if xdist should identify all currently running tests and their phases whenever a node exceeds a predetermined timeframe.

An even further expansion of this might be printing stack traces.
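In the meantime, the standard library's faulthandler module can approximate this from the outside. A minimal sketch (the signal choice is my own, and this is POSIX-only): load this via a conftest.py picked up by the workers, then send SIGUSR1 to a stuck worker's PID to make it print every thread's stack to stderr.

```python
# conftest.py (sketch) -- let a hung xdist worker be inspected from outside.
# Once registered, `kill -USR1 <worker-pid>` makes that worker print the
# current stack of all its threads to stderr without killing it.
# POSIX-only; SIGUSR1 is an arbitrary choice, not anything xdist defines.
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)
```

This avoids touching xdist itself: you get a stack trace on demand from exactly the worker that looks stuck.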

The timeout plugin tends to be a great help for hangups.

@JacobCallahan shoot me more details on the work channel.

I’m facing a very weird issue. When I run my tests in parallel, only 1 browser opens even though I execute pytest -n 2. After the first test succeeds, pytest just hangs and doesn’t even time out. Has anyone faced this issue?

py.test -vv -n 2
============================= test session starts =============================
platform darwin -- Python 3.7.3, pytest-5.4.1, py-1.10.0, pluggy-0.13.1 -- ///*
cachedir: .pytest_cache
rootdir: //*, inifile: pytest.ini
plugins: xdist-1.31.0, allure-pytest-2.8.40, repeat-0.9.1, forked-1.1.3, timeout-1.4.2
[gw0] darwin Python 3.7.3 cwd: //*
[gw1] darwin Python 3.7.3 cwd: /**/*
collected 8 items
[gw0] Python 3.7.3 (default, Apr 24 2020, 18:51:23) -- [Clang 11.0.3 (clang-1103.0.32.62)]
[gw1] Python 3.7.3 (default, Apr 24 2020, 18:51:23) -- [Clang 11.0.3 (clang-1103.0.32.62)]
gw0 [8] / gw1 [8]
scheduling tests via MarkBasedScheduler

test1 [gw1] [ 12%] PASSED test1

It will get stuck after this.

Hey @telles-simbiose and @BlackHobbiT,

Did you manage to make it work? We suffer from the same issue. The test run hangs at 93%. All the workers look busy, and only killing the stuck worker process in Task Manager lets the run continue. When that worker is killed, the report for its test is lost as well.

Thanks

Try pytest-timeout: running pytest --timeout=<timetoskip> will kill the hanging test and move execution along for you.
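As a concrete sketch of that (the 300-second value is illustrative, not from this thread), pytest-timeout can also be configured once in pytest.ini instead of on every command line:

```ini
; pytest.ini (illustrative value)
[pytest]
; Abort any test still running after 300 seconds instead of hanging forever.
timeout = 300
```

Individual slow tests can still override this with the @pytest.mark.timeout(...) marker the plugin provides.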

Yeah, I suffer from this issue sometimes. --fulltrace shows where it locks up:

platform linux -- Python 3.6.1, pytest-5.0.1, py-1.5.3, pluggy-0.12.0 -- /usr/bin/python3.6
cachedir: .pytest_cache
rootdir: /home/BlackHobbiT/path/to/test
inifile: pytest.ini
plugins: forked-1.0.2, xdist-1.29.0, allure-pytest-2.7.0
[gw0] linux Python 3.6.1 cwd: /home/BlackHobbiT/path/to/test
[gw0] Python 3.6.1 (default, Sep 7 2017, 16:36:03) -- [GCC 6.3.0 20170406]


config = <_pytest.config.Config object at 0x7f1cac7e4da0>, doit = <function _main at 0x7f1caccb0f28>

    def wrap_session(config, doit):
        """Skeleton command line program"""
        session = Session(config)
        session.exitstatus = ExitCode.OK
        initstate = 0
        try:
            try:
                config._do_configure()
                initstate = 1
                config.hook.pytest_sessionstart(session=session)
                initstate = 2
>               session.exitstatus = doit(config, session) or 0

/usr/local/lib/python3.6/dist-packages/_pytest/main.py:213: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

config = <_pytest.config.Config object at 0x7f1cac7e4da0>, session = <Session session exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=16>

    def _main(config, session):
        """ default command line protocol for initialization, session,
        running tests and reporting. """
        config.hook.pytest_collection(session=session)
>       config.hook.pytest_runtestloop(session=session)

/usr/local/lib/python3.6/dist-packages/_pytest/main.py:257: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_HookCaller 'pytest_runtestloop'>, args = (), kwargs = {'session': <Session session exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=16>}
notincall = set()

    def __call__(self, *args, **kwargs):
        if args:
            raise TypeError("hook calling supports only keyword arguments")
        assert not self.is_historic()
        if self.spec and self.spec.argnames:
            notincall = (
                set(self.spec.argnames) - set(["__multicall__"]) - set(kwargs.keys())
            )
            if notincall:
                warnings.warn(
                    "Argument(s) {} which are declared in the hookspec "
                    "can not be found in this hook call".format(tuple(notincall)),
                    stacklevel=2,
                )
>       return self._hookexec(self, self.get_hookimpls(), kwargs)

/usr/local/lib/python3.6/dist-packages/pluggy/hooks.py:289: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_pytest.config.PytestPluginManager object at 0x7f1caee2ab00>, hook = <_HookCaller 'pytest_runtestloop'>
methods = [<HookImpl plugin_name='main', plugin=<module '_pytest.main' from '/usr/local/lib/python3.6/dist-packages/_pytest/main...1cab5f7f98>>, <HookImpl plugin_name='logging-plugin', plugin=<_pytest.logging.LoggingPlugin object at 0x7f1cab5ce2b0>>]
kwargs = {'session': <Session session exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=16>}

    def _hookexec(self, hook, methods, kwargs):
        # called from all hookcaller instances.
        # enable_tracing will set its own wrapping function at self._inner_hookexec
>       return self._inner_hookexec(hook, methods, kwargs)

/usr/local/lib/python3.6/dist-packages/pluggy/manager.py:87: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

hook = <_HookCaller 'pytest_runtestloop'>
methods = [<HookImpl plugin_name='main', plugin=<module '_pytest.main' from '/usr/local/lib/python3.6/dist-packages/_pytest/main...1cab5f7f98>>, <HookImpl plugin_name='logging-plugin', plugin=<_pytest.logging.LoggingPlugin object at 0x7f1cab5ce2b0>>]
kwargs = {'session': <Session session exitstatus=<ExitCode.OK: 0> testsfailed=0 testscollected=16>}

    self._inner_hookexec = lambda hook, methods, kwargs: hook.multicall(
        methods,
        kwargs,
>       firstresult=hook.spec.opts.get("firstresult") if hook.spec else False,
    )

/usr/local/lib/python3.6/dist-packages/pluggy/manager.py:81: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xdist.dsession.DSession object at 0x7f1cab5f7f98>

    def pytest_runtestloop(self):
        self.sched = self.config.hook.pytest_xdist_make_scheduler(
            config=self.config, log=self.log
        )
        assert self.sched is not None
    
        self.shouldstop = False
        while not self.session_finished:
>           self.loop_once()

/usr/local/lib/python3.6/dist-packages/xdist/dsession.py:115: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <xdist.dsession.DSession object at 0x7f1cab5f7f98>

    def loop_once(self):
        """Process one callback from one of the workers."""
        while 1:
            if not self._active_nodes:
                # If everything has died stop looping
                self.triggershutdown()
                raise RuntimeError("Unexpectedly no active workers available")
            try:
>               eventcall = self.queue.get(timeout=2.0)

/usr/local/lib/python3.6/dist-packages/xdist/dsession.py:129: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <queue.Queue object at 0x7f1cab5f76d8>, block = True, timeout = 2.0

    def get(self, block=True, timeout=None):
        '''Remove and return an item from the queue.
    
        If optional args 'block' is true and 'timeout' is None (the default),
        block if necessary until an item is available. If 'timeout' is
        a non-negative number, it blocks at most 'timeout' seconds and raises
        the Empty exception if no item was available within that time.
        Otherwise ('block' is false), return an item if one is immediately
        available, else raise the Empty exception ('timeout' is ignored
        in that case).
        '''
        with self.not_empty:
            if not block:
                if not self._qsize():
                    raise Empty
            elif timeout is None:
                while not self._qsize():
                    self.not_empty.wait()
            elif timeout < 0:
                raise ValueError("'timeout' must be a non-negative number")
            else:
                endtime = time() + timeout
                while not self._qsize():
                    remaining = endtime - time()
                    if remaining <= 0.0:
                        raise Empty
>                   self.not_empty.wait(remaining)

/usr/lib/python3.6/queue.py:173: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Condition(<unlocked _thread.lock object at 0x7f1cab4f2120>, 0)>, timeout = 1.9999943980947137

    def wait(self, timeout=None):
        """Wait until notified or until a timeout occurs.
    
        If the calling thread has not acquired the lock when this method is
        called, a RuntimeError is raised.
    
        This method releases the underlying lock, and then blocks until it is
        awakened by a notify() or notify_all() call for the same condition
        variable in another thread, or until the optional timeout occurs. Once
        awakened or timed out, it re-acquires the lock and returns.
    
        When the timeout argument is present and not None, it should be a
        floating point number specifying a timeout for the operation in seconds
        (or fractions thereof).
    
        When the underlying lock is an RLock, it is not released using its
        release() method, since this may not actually unlock the lock when it
        was acquired multiple times recursively. Instead, an internal interface
        of the RLock class is used, which really unlocks it even when it has
        been recursively acquired several times. Another internal interface is
        then used to restore the recursion level when the lock is reacquired.
    
        """
        if not self._is_owned():
            raise RuntimeError("cannot wait on un-acquired lock")
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
                waiter.acquire()
                gotit = True
            else:
                if timeout > 0:
>                   gotit = waiter.acquire(True, timeout)
E                   KeyboardInterrupt

/usr/lib/python3.6/threading.py:299: KeyboardInterrupt
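For what it’s worth, the frames above show where the controller sits during the hang: DSession.loop_once blocks in queue.get(timeout=2.0) waiting for an event from a worker, and on each timeout simply loops again. A stripped-down sketch of that polling pattern (not xdist’s actual code; the short timeout and loop count are just for illustration):

```python
# Minimal sketch of the polling loop the traceback above is stuck in:
# the controller repeatedly waits (with a timeout) for an event from the
# worker queue. If no worker ever reports back, every get() raises Empty
# and the loop spins forever.
import queue

q = queue.Queue()  # events from workers would be put() here
events = 0
for _ in range(3):  # xdist instead loops until session_finished
    try:
        q.get(timeout=0.01)  # xdist uses timeout=2.0
        events += 1
    except queue.Empty:
        pass  # no worker event yet; loop again
print(events)  # prints 0 -- nothing ever arrives, mirroring the hang
```

So the controller itself is healthy; it is waiting on workers that will never send another event, which is why only killing the stuck worker unblocks the run.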