textract: `UnboundLocalError: local variable 'pipe' referenced before assignment`
text = textract.process(file, method='pdfminer')
Error:
UnboundLocalError                         Traceback (most recent call last)
<ipython-input-8-e7fe7b1fc2d1> in <module>()
----> 1 text = textract.process(file, method='pdfminer')

~/.local/lib/python3.6/site-packages/textract/parsers/__init__.py in process(filename, encoding, extension, **kwargs)
     75
     76     parser = filetype_module.Parser()
---> 77     return parser.process(filename, encoding, **kwargs)
     78
     79

~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in process(self, filename, encoding, **kwargs)
     44         # output encoding
     45         # http://nedbatchelder.com/text/unipain/unipain.html#35
---> 46         byte_string = self.extract(filename, **kwargs)
     47         unicode_string = self.decode(byte_string)
     48         return self.encode(unicode_string, encoding)

~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract(self, filename, method, **kwargs)
     29
     30         elif method == 'pdfminer':
---> 31             return self.extract_pdfminer(filename, **kwargs)
     32         elif method == 'tesseract':
     33             return self.extract_tesseract(filename, **kwargs)

~/.local/lib/python3.6/site-packages/textract/parsers/pdf_parser.py in extract_pdfminer(self, filename, **kwargs)
     46     def extract_pdfminer(self, filename, **kwargs):
     47         """Extract text from pdfs using pdfminer."""
---> 48         stdout, _ = self.run(['pdf2txt.py', filename])
     49         return stdout
     50

~/.local/lib/python3.6/site-packages/textract/parsers/utils.py in run(self, args)
     94         # pipe.wait() ends up hanging on large files. using
     95         # pipe.communicate appears to avoid this issue
---> 96         stdout, stderr = pipe.communicate()
     97
     98         # if pipe is busted, raise an error (unlike Fabric)

UnboundLocalError: local variable 'pipe' referenced before assignment
_Originally posted by @SatyaRamGV in https://github.com/deanmalmgren/textract/issue_comments#issuecomment-439043876_
About this issue
- State: open
- Created 6 years ago
- Reactions: 3
- Comments: 17 (3 by maintainers)
@SatyaRamGV I tried textract==1.6.1, textract==1.6.2, and textract==1.6.3. All of these versions throw this error. I'm on Windows 10. I have enough memory to perform this task, but I still get the same error.
Traceback (most recent call last):
  File "<ipython-input-2-c969b65ffa97>", line 1, in <module>
    text = textract.process(r"C:..\docs\Mortgage Security Agreement\Closed End PA MTG 5000.39.pdf", method='pdfminer')
  File "C:..\venv\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "C:..\venv\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:..\venv\lib\site-packages\textract\parsers\pdf_parser.py", line 31, in extract
    return self.extract_pdfminer(filename, **kwargs)
  File "C:..\venv\lib\site-packages\textract\parsers\pdf_parser.py", line 48, in extract_pdfminer
    stdout, _ = self.run(['pdf2txt.py', filename])
  File "C:..\venv\lib\site-packages\textract\parsers\utils.py", line 96, in run
    stdout, stderr = pipe.communicate()
UnboundLocalError: local variable 'pipe' referenced before assignment
I think I know where this comes from: this bit of code in ShellParser:
…coupled with forking issues on Unix: https://stackoverflow.com/questions/5306075/python-memory-allocation-error-using-subprocess-popen
Since the out-of-memory error is an `OSError`, it gets caught in the `except` block, but then eaten; the program tries to continue, but since the assignment to `pipe` failed, it's not defined, hence the error message.

This could be alleviated by adding a bare `raise` after the `errno` check, at least to make it clearer what the actual error is. I could submit a PR if necessary?
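
For illustration, here is a minimal sketch of the pattern described above. It is not textract's actual source: `run_swallowing`, `run_reraising`, the local `ShellError` class, and the injectable `popen` parameter are all hypothetical names, used so the failure can be reproduced without really running out of memory.

```python
import errno
import subprocess


class ShellError(Exception):
    """Hypothetical stand-in for the library's own shell error type."""


def run_swallowing(args, popen=subprocess.Popen):
    """Sketch of the problematic pattern: only ENOENT is handled."""
    try:
        pipe = popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as e:
        if e.errno == errno.ENOENT:
            # Command not found: translate into a dedicated error.
            raise ShellError(' '.join(args))
        # Any other OSError (e.g. ENOMEM) is silently dropped here,
        # so execution continues with `pipe` never assigned.

    # Raises UnboundLocalError if Popen failed with a non-ENOENT OSError.
    stdout, stderr = pipe.communicate()
    return stdout, stderr


def run_reraising(args, popen=subprocess.Popen):
    """Same sketch with the suggested fix: a bare `raise` after the check."""
    try:
        pipe = popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as e:
        if e.errno == errno.ENOENT:
            raise ShellError(' '.join(args))
        # Re-raise everything else so the real cause stays visible.
        raise

    stdout, stderr = pipe.communicate()
    return stdout, stderr


def failing_popen(*args, **kwargs):
    """Hypothetical fake that simulates an out-of-memory failure."""
    raise OSError(errno.ENOMEM, "Cannot allocate memory")
```

Calling `run_swallowing(['pdf2txt.py', 'some.pdf'], popen=failing_popen)` ends in the `UnboundLocalError` from this issue, while `run_reraising` with the same arguments surfaces the original `OSError` instead.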