luigi: Parallel Tasks Fail

Hi there, I am using luigi on a Kubernetes cluster and trying to run 500 (or more) parallel jobs, but the workflow freezes and never completes, with some tasks showing as failed in the GUI. When I click on a failed task to check the error message, the error box is empty. This is the workflow I am running:

import luigi
from luigi.contrib.kubernetes import KubernetesJobTask

class PerlPi(KubernetesJobTask):
    name = "pi"
    index = luigi.Parameter()
    max_retrials = 3

    @property
    def spec_schema(self):
        return {
            "containers": [
                {
                    "name": "pi",
                    "image": "perl",
                    "command": ["sh", "-c", "perl -Mbignum=bpi -wle 'print bpi(2000)' > /work/pi"+str(self.index)],
                    "volumeMounts": [{
                        "mountPath": "/work",
                        "name": "shared-volume",
                        "subPath": "jupyter/pi"
                    }]
                }
            ],
            "volumes": [{
                "name": "shared-volume",
                "persistentVolumeClaim": {
                    "claimName": "galaxy-pvc"
                 }
            }]
        }
    
    def output(self):
        target = "pi/pi"+str(self.index)
        return luigi.LocalTarget(target)
    
class ManyJobs(luigi.WrapperTask):
    def requires(self):
        # Generate 500 PerlPi jobs (0, 1, ..., 499)
        for i in range(500):
            yield PerlPi(index=i)
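One thing worth double-checking (an assumption about your setup, not something visible from the snippet alone): output() returns the relative path pi/pi<index>, while the container writes to /work/pi<index> on the PVC. If the luigi worker does not see the file at the exact path output() names, tasks can sit forever looking incomplete. A minimal sketch of a path helper, assuming the worker pod also mounts the same claim at /work:

```python
import os

# Assumption: the luigi worker mounts the shared PVC at this path too.
SHARED_MOUNT = "/work"

def pi_output_path(index):
    """Absolute path the Perl container writes to.

    Using the same absolute path in both the container command and in
    luigi.LocalTarget avoids a relative-path mismatch.
    """
    return os.path.join(SHARED_MOUNT, "pi" + str(index))
```

output() could then return luigi.LocalTarget(pi_output_path(self.index)), so completion checks look at the file the job actually wrote.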

The commands I use are:

$> export PYTHONPATH=./
$> luigi --module many_jobs ManyJobs --scheduler-host luigi.default --workers 500

Could you please assist?

Many thanks, Noureddin

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 15 (3 by maintainers)

Most upvoted comments

I think the problem is that Luigi breaks down because a separate process is spawned for each worker, so with 500 workers there are probably network timeouts and everything starts to fail. Luigi is not meant to massively parallelize jobs; in my opinion this thread can be closed. Maybe someone from the Spotify community can further confirm this.
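If one process per worker really is the bottleneck, a common workaround is to cap the worker pool well below the task count and let the central scheduler queue the rest, optionally submitting tasks in batches. A sketch of the batching arithmetic (pure stdlib; the luigi.build call in the comment is a hypothetical adapted to the module above, not a tested command):

```python
def chunked(indices, batch_size):
    """Split a list of task indices into fixed-size batches."""
    return [indices[i:i + batch_size]
            for i in range(0, len(indices), batch_size)]

# 500 tasks submitted in 10 batches of 50, with a modest worker pool each time.
batches = chunked(list(range(500)), 50)

# For each batch one could then call (hypothetical, adjust to your setup):
#   luigi.build([PerlPi(index=i) for i in batch],
#               workers=50, scheduler_host="luigi.default")
```

With 50 workers per batch, only 50 worker processes exist at any moment instead of 500, which keeps connection counts to the scheduler and the Kubernetes API manageable.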