textract: `UnboundLocalError: local variable 'pipe' referenced before assignment`

text = textract.process(file, method='pdfminer')

Error:

    UnboundLocalError                         Traceback (most recent call last)
    <ipython-input-8-e7fe7b1fc2d1> in <module>()
    ----> 1 text = textract.process(file, method='pdfminer')

    ~/.local/lib/python3.6/site-packages/textract/parsers/__init__.py in process(filename, encoding, extension, **kwargs)
         75
         76     parser = filetype_module.Parser()
    ---> 77     return parser.process(filename, encoding, **kwargs)
         78
         79

    ~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in process(self, filename, encoding, **kwargs)
         44         # output encoding
         45         # http://nedbatchelder.com/text/unipain/unipain.html#35
    ---> 46         byte_string = self.extract(filename, **kwargs)
         47         unicode_string = self.decode(byte_string)
         48         return self.encode(unicode_string, encoding)

    ~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract(self, filename, method, **kwargs)
         29
         30         elif method == 'pdfminer':
    ---> 31             return self.extract_pdfminer(filename, **kwargs)
         32         elif method == 'tesseract':
         33             return self.extract_tesseract(filename, **kwargs)

    ~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract_pdfminer(self, filename, **kwargs)
         46     def extract_pdfminer(self, filename, **kwargs):
         47         """Extract text from pdfs using pdfminer."""
    ---> 48         stdout, _ = self.run(['pdf2txt.py', filename])
         49         return stdout
         50

    ~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in run(self, args)
         94         # pipe.wait() ends up hanging on large files. using
         95         # pipe.communicate appears to avoid this issue
    ---> 96         stdout, stderr = pipe.communicate()
         97
         98         # if pipe is busted, raise an error (unlike Fabric)

    UnboundLocalError: local variable 'pipe' referenced before assignment

_Originally posted by @SatyaRamGV in https://github.com/deanmalmgren/textract/issue_comments#issuecomment-439043876_

About this issue

  • State: open
  • Created 6 years ago
  • Reactions: 3
  • Comments: 17 (3 by maintainers)

Most upvoted comments

@SatyaRamGV I tried versions textract==1.6.1, textract==1.6.2, and textract==1.6.3, and all of them throw this error. I'm on Windows 10 and have enough memory to perform this task, but I still get the same error.

    Traceback (most recent call last):
      File "<ipython-input-2-c969b65ffa97>", line 1, in <module>
        text = textract.process(r"C:..\docs\Mortgage Security Agreement\Closed End PA MTG 5000.39.pdf", method='pdfminer')
      File "C:..\venv\lib\site-packages\textract\parsers\__init__.py", line 77, in process
        return parser.process(filename, encoding, **kwargs)
      File "C:..\venv\lib\site-packages\textract\parsers\utils.py", line 46, in process
        byte_string = self.extract(filename, **kwargs)
      File "C:..\venv\lib\site-packages\textract\parsers\pdf_parser.py", line 31, in extract
        return self.extract_pdfminer(filename, **kwargs)
      File "C:..\venv\lib\site-packages\textract\parsers\pdf_parser.py", line 48, in extract_pdfminer
        stdout, _ = self.run(['pdf2txt.py', filename])
      File "C:..\venv\lib\site-packages\textract\parsers\utils.py", line 96, in run
        stdout, stderr = pipe.communicate()
    UnboundLocalError: local variable 'pipe' referenced before assignment

I think I know where this comes from: this bit of code in ShellParser:

        # run a subprocess and put the stdout and stderr on the pipe object
        try:
            pipe = subprocess.Popen(
                args,
                stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            )
        except OSError as e:
            if e.errno == errno.ENOENT:
                # File not found.
                # This is equivalent to getting exitcode 127 from sh
                raise exceptions.ShellError(
                    ' '.join(args), 127, '', '',
                )

…coupled with forking issues on Unix: https://stackoverflow.com/questions/5306075/python-memory-allocation-error-using-subprocess-popen

Since the out-of-memory error is an OSError, it gets caught by the except block, but because its errno is not ENOENT it is silently swallowed. The program then tries to continue, but the assignment to pipe never happened, so pipe is undefined, hence the UnboundLocalError.
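The failure mode can be reproduced without textract or a real subprocess. The following is only a minimal sketch of the same control flow: `run_buggy`, `fail_with_enomem`, and the `opener` parameter are hypothetical names invented for illustration, and the ENOMEM error is simulated rather than produced by an actual failed fork.

```python
import errno


def run_buggy(opener):
    """Mimic the try/except shape of the snippet above (hypothetical helper)."""
    try:
        pipe = opener()
    except OSError as e:
        if e.errno == errno.ENOENT:
            raise RuntimeError("command not found")
        # Any other OSError falls through here and is silently dropped.
    # If opener() raised, `pipe` was never bound, so this line fails.
    return pipe


def fail_with_enomem():
    # Stand-in for subprocess.Popen failing because memory ran out.
    raise OSError(errno.ENOMEM, "Cannot allocate memory")


try:
    run_buggy(fail_with_enomem)
except UnboundLocalError as exc:
    # The original OSError is gone; only the misleading error surfaces.
    print(type(exc).__name__)
```

The caller never sees the original OSError, which is why the traceback points at `pipe.communicate()` instead of the real cause.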

This could be alleviated by adding a bare raise after the errno check, at least to make it clearer what the actual error is. I could submit a PR if necessary?
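A rough sketch of what that change might look like, not the actual textract code: the `ShellError` class below is a simplified stand-in for `exceptions.ShellError`, and the `popen` parameter is an assumption added only so the behaviour can be exercised without spawning a process.

```python
import errno
import subprocess


class ShellError(Exception):
    """Simplified, hypothetical stand-in for textract's exceptions.ShellError."""

    def __init__(self, command, exit_code, stdout, stderr):
        super().__init__(command, exit_code, stdout, stderr)
        self.command = command
        self.exit_code = exit_code


def run(args, popen=subprocess.Popen):
    """Run a command; `popen` is injectable here purely for illustration."""
    try:
        pipe = popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # File not found: equivalent to exit code 127 from sh.
            raise ShellError(' '.join(args), 127, '', '')
        # Proposed addition: re-raise every other OSError unchanged,
        # so the real cause is reported instead of UnboundLocalError.
        raise
    return pipe.communicate()


def fake_popen_enomem(*args, **kwargs):
    # Simulated failure; a real one would come from the OS.
    raise OSError(errno.ENOMEM, "Cannot allocate memory")


try:
    run(['pdf2txt.py', 'example.pdf'], popen=fake_popen_enomem)
except OSError as exc:
    print(exc.errno == errno.ENOMEM)
```

With the bare `raise`, the caller gets the underlying OSError and its errno, which at least makes the actual problem visible.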