filprofiler: NameError when running Apache Beam pipelines under Fil
Hi @itamarst! Fil looks like a great project, and I’ve been eagerly giving it a try - but have run across a small issue when trying to use it to debug memory issues in Apache Beam pipelines.
Version information
Fil: 2021.5.0
Python: 3.8.9 (default, Apr 3 2021, 01:50:09) [Clang 12.0.0 (clang-1200.0.32.29)]
Here’s a minimal reproducible example that breaks when run under fil-profiler run, but otherwise works just fine:
import math
import apache_beam as beam


class MyProcessor(beam.DoFn):
    def process(self, element):
        return math.exp(element)


def main():
    with beam.Pipeline() as pipeline:
        pipeline | beam.Create([1, 2, 3]) | beam.ParDo(MyProcessor())


if __name__ == "__main__":
    main()
The traceback shows that the code is invoked directly through Fil all the way from main() (i.e., no subprocesses are involved), but at runtime the globals in scope of the function can’t be resolved:
< many frames omitted >
  File "apache_beam/runners/common.py", line 1315, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "/Users/psobot/Library/Python/3.8/lib/python/site-packages/future/utils/__init__.py", line 446, in raise_with_traceback
    raise exc.with_traceback(traceback)
  File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 582, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "example.py", line 7, in process
    return math.exp(element)
NameError: name 'math' is not defined [while running 'ParDo(MyProcessor)']
It seems that Fil and Apache Beam might both be doing something with globals that keeps them from playing nicely together.
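For illustration, the same NameError message can be reproduced without Fil or Beam. The sketch below rests on one fact about Python: a function resolves global names through its `__globals__` dictionary at call time. The `rebuild_with_globals` helper and the `process_local_import` variant are hypothetical names introduced here; nothing in this sketch is claimed to be what Fil or Beam actually does internally.

```python
import builtins
import math
import types


def process(element):
    # `math` is a global name: it is looked up in process.__globals__ on each call.
    return math.exp(element)


def process_local_import(element):
    # The import runs inside the function, so no global `math` is needed.
    import math
    return math.exp(element)


def rebuild_with_globals(func, new_globals):
    """Hypothetical helper: same code object, different global namespace."""
    return types.FunctionType(func.__code__, new_globals, func.__name__)


# A namespace in which `import math` never ran.
bare_globals = {"__builtins__": builtins}

try:
    rebuild_with_globals(process, bare_globals)(0)
    error_message = None
except NameError as exc:
    error_message = str(exc)

# The function-local import does not depend on the global namespace.
local_import_result = rebuild_with_globals(process_local_import, bare_globals)(0)

print(error_message)
print(local_import_result)
```

If the function ends up bound to a namespace that lacks the module-level imports, the global lookup fails even though the source file clearly imports `math`. Importing inside the method avoids the lookup altogether, which may be usable as a stopgap, though it is not confirmed to be the workaround referred to in this thread.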
About this issue
- State: closed
- Created 3 years ago
- Comments: 22
This is now available as part of release 2021.8.0.
By the way, I also now have a first pass of a profiler that can run on production jobs, if you’d be interested in playing with it.
The fix will be included in the next release; I’ll post here when it’s available.
Oops. At this point I’m using Python’s runpy module for both cases, and in theory it’s supposed to match Python’s normal behavior, but perhaps not in practice. I’ll take a look, but it’ll probably be a week or two before I have time to look at this (and it sounds like you have a workaround, so it shouldn’t be a blocker).
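As a rough sketch of what running a script through runpy looks like, the example below writes a small stand-in script to a temporary directory and executes it with `runpy.run_path`. The script contents, the file name, and the `RESULT` variable are assumptions made for the illustration; this is not Fil’s launcher code, and the Beam pipeline is left out so the sketch stays self-contained.

```python
import os
import runpy
import tempfile

# Hypothetical stand-in for the user's script; it is not the Beam example itself.
SCRIPT = '''
import math

def process(element):
    return math.exp(element)

if __name__ == "__main__":
    RESULT = process(0)
'''

with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = os.path.join(tmp_dir, "example.py")
    with open(script_path, "w") as script_file:
        script_file.write(SCRIPT)

    # run_name="__main__" makes the `if __name__ == "__main__":` guard fire,
    # as it would under `python example.py`.
    script_globals = runpy.run_path(script_path, run_name="__main__")

# run_path returns the globals dictionary produced by executing the script.
print(script_globals["__name__"])
print("math" in script_globals, "process" in script_globals)
```

The runpy documentation cautions that functions and classes defined by the executed code are not guaranteed to work correctly after a runpy function has returned. Whether that caveat is related to the NameError reported here is not established by this thread.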