filprofiler: NameError when running Apache Beam pipelines under Fil

Hi @itamarst! Fil looks like a great project, and I’ve been eagerly giving it a try - but have run across a small issue when trying to use it to debug memory issues in Apache Beam pipelines.

Version information

Fil: 2021.5.0
Python: 3.8.9 (default, Apr 3 2021, 01:50:09) [Clang 12.0.0 (clang-1200.0.32.29)]

Here’s a minimal reproducible example that breaks when running under fil-profiler run, but otherwise works just fine:

import math
import apache_beam as beam


class MyProcessor(beam.DoFn):
    def process(self, element):
        return math.exp(element)


def main():
    with beam.Pipeline() as pipeline:
        pipeline | beam.Create([1, 2, 3]) | beam.ParDo(MyProcessor())


if __name__ == "__main__":
    main()

The printed traceback shows that the code is invoked directly through Fil all the way from main() (i.e., no subprocesses are involved), but at runtime the globals in the function’s scope can’t be resolved:

< many frames omitted >
  File "apache_beam/runners/common.py", line 1315, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "/Users/psobot/Library/Python/3.8/lib/python/site-packages/future/utils/__init__.py", line 446, in raise_with_traceback
    raise exc.with_traceback(traceback)
  File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 582, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "example.py", line 7, in process
    return math.exp(element)
NameError: name 'math' is not defined [while running 'ParDo(MyProcessor)']

It looks like Fil and Apache Beam might both be doing something with globals that keeps them from playing nicely together.
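To illustrate the kind of globals mismatch that could produce this error, here is a minimal sketch. It is not Fil's or Beam's actual code, and it is not a confirmed diagnosis. The helper names `load_script` and `rebuild_with_globals` are hypothetical. The sketch only shows that a function resolves module-level names through its own `__globals__` dict, so the same code object attached to a different dict loses access to `math`:

```python
import types

# Stand-in for the user's script; the "example.py" filename is only a label.
SCRIPT = """
import math

def process(element):
    return math.exp(element)
"""


def load_script(source):
    """Execute the script in a fresh namespace, roughly as a script runner might."""
    namespace = {"__name__": "__main__"}
    exec(compile(source, "example.py", "exec"), namespace)
    return namespace


def rebuild_with_globals(func, new_globals):
    """Attach the same code object to a different globals dict (hypothetical helper)."""
    return types.FunctionType(func.__code__, new_globals, func.__name__)


namespace = load_script(SCRIPT)
process = namespace["process"]
print(process(0))  # works: 'math' is found in the namespace the script ran in

# The same code, looked up against a globals dict that never saw 'import math'.
detached = rebuild_with_globals(process, {})
error_message = None
try:
    detached(0)
except NameError as exc:
    error_message = str(exc)
print(error_message)
```

If something along these lines is happening, the function body itself is fine and the problem is only which namespace it ends up bound to.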

About this issue

  • State: closed
  • Created 3 years ago
  • Comments: 22

Most upvoted comments

This is now available as part of release 2021.8.0.

I also, by the way, now have a first pass of a profiler that can run on production jobs, if you’d be interested in playing with it.

The fix will be included in the next release; I’ll post here when it’s available.

Oops. At this point I’m using Python’s runpy module for both cases, and in theory it’s supposed to match Python’s normal behavior, but perhaps not in practice. I’ll take a look, but it’ll probably be a week or two before I have time to look at this (and it sounds like you have a workaround, so it shouldn’t be a blocker).
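For readers unfamiliar with runpy, here is a rough sketch of running a script as `__main__` through `runpy.run_path`. It is not Fil's implementation. The `run_as_main` helper, the temporary `example.py` file, and the `RESULT` variable are made up for illustration:

```python
import os
import runpy
import tempfile

# Hypothetical script contents, written to a temporary file for the demo.
SCRIPT = """
import math

def process(element):
    return math.exp(element)

if __name__ == "__main__":
    RESULT = process(0)
"""


def run_as_main(source):
    """Write the source to a temp file and execute it as if it were __main__."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.py")
        with open(path, "w") as f:
            f.write(source)
        # run_path returns a copy of the globals the script ended up with.
        return runpy.run_path(path, run_name="__main__")


script_globals = run_as_main(SCRIPT)
print(script_globals["RESULT"])
```

The runpy documentation notes that functions and classes defined by the executed code are not guaranteed to work correctly after a runpy function has returned, which is one place where behavior can differ from running `python example.py` directly.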