luigi: A task can run twice if it fails to create output (without error)

Hello,

While working with your wonderful tool, I ran into this issue: if for any reason the target is not created (or if the task does not create any target at all), Luigi tries to re-run it:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import luigi

class BadlyCodedTask(luigi.Task):

    def output(self):
        return luigi.LocalTarget("BadlyCodedTask")

    def run(self):
        # Deliberately writes nothing to self.output(), so the target never exists.
        print("*" * 60)
        print("BadlyCodedTask.run")


class RunAll(luigi.Task):

    def requires(self):
        return BadlyCodedTask()

    def output(self):
        return luigi.LocalTarget("RunAll")

    def run(self):
        with self.output().open('w') as out_file:
            out_file.write("Done")


if __name__ == '__main__':
    luigi.run(main_task_cls=RunAll)

This produces:

lexman@dsksrv-13:~/dev/pano_forge/tests_luigi$ ./test_reduced.py --local-scheduler
DEBUG: Checking if RunAll() is complete
INFO: Scheduled RunAll() (PENDING)
DEBUG: Checking if BadlyCodedTask() is complete
INFO: Scheduled BadlyCodedTask() (PENDING)
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 10660] Worker Worker(salt=978716908, host=dsksrv-13, username=lexman, pid=10660) running   BadlyCodedTask()
************************************************************
BadlyCodedTask.run
INFO: [pid 10660] Worker Worker(salt=978716908, host=dsksrv-13, username=lexman, pid=10660) done      BadlyCodedTask()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 10660] Worker Worker(salt=978716908, host=dsksrv-13, username=lexman, pid=10660) running   RunAll()
ERROR: [pid 10660] Worker Worker(salt=978716908, host=dsksrv-13, username=lexman, pid=10660) failed    RunAll()
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/luigi/worker.py", line 288, in _run_task
    raise RuntimeError('Unfulfilled %s at run time: %s' % (deps, ', '.join(missing)))
RuntimeError: Unfulfilled dependency at run time: BadlyCodedTask()
DEBUG: Checking if RunAll() is complete
INFO: Scheduled RunAll() (PENDING)
DEBUG: Checking if BadlyCodedTask() is complete
INFO: Scheduled BadlyCodedTask() (PENDING)
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 2
INFO: [pid 10660] Worker Worker(salt=978716908, host=dsksrv-13, username=lexman, pid=10660) running   BadlyCodedTask()
************************************************************
BadlyCodedTask.run
INFO: [pid 10660] Worker Worker(salt=978716908, host=dsksrv-13, username=lexman, pid=10660) done      BadlyCodedTask()
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 10660] Worker Worker(salt=978716908, host=dsksrv-13, username=lexman, pid=10660) running   RunAll()
ERROR: [pid 10660] Worker Worker(salt=978716908, host=dsksrv-13, username=lexman, pid=10660) failed    RunAll()
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/luigi/worker.py", line 288, in _run_task
    raise RuntimeError('Unfulfilled %s at run time: %s' % (deps, ', '.join(missing)))
RuntimeError: Unfulfilled dependency at run time: BadlyCodedTask()
DEBUG: Asking scheduler for work...
INFO: Done

Shouldn’t Luigi ensure that the task is run only once, even if it fails?

About this issue

  • State: closed
  • Created 10 years ago
  • Comments: 15 (8 by maintainers)

Most upvoted comments

Hello @davidljung, if you’re looking for a framework that checks the outputs, have a look at [tuttle](http://github.com/lexman/tuttle). It also checks that the data you’ve already produced has not changed (edited by hand, for example) and reprocesses the data in that case.

The problem is that you need to store the state somewhere. The design philosophy of Luigi, as Elias mentioned, is that the “worker knows best”. That’s why you typically implement this check by relying on the file system. In fact, you can implement it any way you want by overriding the complete() method. The default implementation of complete() calls output() and checks whether the output exists.

So you have full control over this behavior; you just need to be explicit about where the state is stored.

The benefit of this approach is that it’s also explicit what to do if you want to re-run something – just delete the file (or whatever marker you use, could be a row in a database or anything) and then re-run the dependency graph.