OCRmyPDF: Segmentation fault when using pipes
Describe the bug When running ocrmypdf through podman/docker I sometimes (#864) experience segmentation faults and the container hangs indefinitely. The output file is empty.
To Reproduce The following loop reproduces the failure. Because the behavior is non-deterministic, it may take many iterations, or even several runs of the whole loop, to trigger.
for i in $(seq 0 100); do
  podman run --rm -i ocrmypdf --verbose -rcd --jbig2-lossy -l deu - - <tmp.pdf >out.pdf
done
The issue is reproducible even with all of the options omitted. The resulting log is:
ocrmypdf 12.6.0.post6+g42713b77.d20211012
Running: ['tesseract', '--list-langs']
stdout/stderr = List of available languages (7):
chi_sim
deu
eng
fra
osd
por
spa
Running: ['unpaper', '--version']
Found unpaper 6.1
Running: ['tesseract', '--version']
Found tesseract 4.1.1
Running: ['gs', '--version']
Found gs 9.53.3
reading file from standard input
os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/stdin, /tmp/ocrmypdf.io.yzr1_6f6/origin.pdf)
Using Tesseract OpenMP thread limit 3
1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=jpeggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.yzr1_6f6/origin.pdf']
1 Rotating output by 0
1 Running: ['tesseract', '-l', 'osd', '--psm', '0', '/tmp/ocrmypdf.io.yzr1_6f6/000001_rasterize_preview.jpg', 'stdout']
1 page is facing ⇧, confidence 7.23 - no change
1 Rasterize with pnggray, rotation 0
1 Running: ['gs', '-dQUIET', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pnggray', '-dFirstPage=1', '-dLastPage=1', '-r150.000000x150.000000', '-o', '-', '-sstdout=%stderr', '-dAutoRotatePages=/None', '-f', '/tmp/ocrmypdf.io.yzr1_6f6/origin.pdf']
1 Rotating output by 0
1 Running: ['unpaper', '-v', '--dpi', '150.0', '--layout', 'none', '--mask-scan-size', '100', '--no-border-align', '--no-mask-center', '--no-grayfilter', '--no-blackfilter', '--no-deskew', '/tmp/tmpmqv67lqw/input.pnm', '/tmp/tmpmqv67lqw/output.pgm']
1 stdout/stderr = [image2 @ 0x55a80053afc0] Using AVStream.codec to pass codec parameters to muxers is deprecated, use AVStream.codecpar instead.
[image2 @ 0x55a80053afc0] Encoder did not produce proper pts, making some up.
unpaper 6.1
License GPLv2: GNU GPL version 2.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
-------------------------------------------------------------------------------
Processing sheet #1: /tmp/tmpmqv67lqw/input.pnm -> /tmp/tmpmqv67lqw/output.pgm
input-file for sheet 1: /tmp/tmpmqv67lqw/input.pnm
output-file for sheet 1: /tmp/tmpmqv67lqw/output.pgm
sheet size: 1232x1718
...
noise-filter ... deleted 47 clusters.
blur-filter... deleted 0 pixels.
writing output.
1 resolution (150.01239999999999, 150.01239999999999)
1 convert
1 PIL format = PNG
1 imgformat = PNG
1 input dpi = 150 x 150
1 rotation = 0°
1 input colorspace = L
1 width x height = 1232px x 1718px
1 read_images() embeds a PNG
1 convert done
1 Running: ['tesseract', '-l', 'deu', '-c', 'textonly_pdf=1', '/tmp/ocrmypdf.io.yzr1_6f6/000001_ocr.png', '/tmp/ocrmypdf.io.yzr1_6f6/000001_ocr_tess', 'pdf', 'txt']
1 Emplacement update
1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0
1 Grafting
1 Page rotation: (content, auto) -> page = (0, 0) -> 0
Postprocessing...
os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/graft_layers.pdf, /tmp/ocrmypdf.io.yzr1_6f6/fix_docinfo.pdf)
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.yzr1_6f6/fix_docinfo.pdf', '/tmp/ocrmypdf.io.yzr1_6f6/pdfa.ps']
GPL Ghostscript 9.53.3 (2020-10-01)
Copyright (C) 2020 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/}MetadataDate'}
Treating 18 as an optimization candidate
XrefExt(xref=18, ext='.png')
Optimizable images: JPEGs: 0 PNGs: 1
Treating 18 as an optimization candidate
Optimizable images: JBIG2 groups: (0,)
Optimize ratio: 1.00 savings: 0.0%
os.symlink(/tmp/ocrmypdf.io.yzr1_6f6/optimize.opt.pdf, /tmp/ocrmypdf.io.yzr1_6f6/optimize.pdf)
/tmp/ocrmypdf.io.yzr1_6f6/optimize.pdf -> -
Output sent to stdout
dmesg yields:
[21719.464718] conmon[91767]: segfault at 111d000 ip 00007fcf434cf980 sp 00007ffc7f66d4e8 error 4 in libc.so.6[7fcf43380000+176000]
[21719.464741] Code: d7 c1 85 c0 75 a4 48 81 ea 80 00 00 00 0f 86 07 01 00 00 48 ff c7 89 f9 48 83 cf 7f 83 e1 7f 48 01 ca 0f 1f 84 00 00 00 00 00 <c5> fd 74 4f 01 c5 fd 74 57 21 c5 fd 74 5f 41 c5 fd 74 67 61 c5 ed
(Always the same location in libc)
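To make "the same location in libc" concrete, the numbers in the dmesg line can be decoded with plain shell arithmetic. This is only a sketch: decode_segfault is a hypothetical helper name, the bit meanings are those of the x86 page-fault error code, and the offset is relative to the start of the reported mapping, which is not necessarily the same as a symbol address in the libc file. Note that the faulting process is conmon, not ocrmypdf.

```shell
#!/bin/sh
# decode_segfault IP BASE SIZE ERR
# Hypothetical helper: decodes the fields of a kernel "segfault at ..." line.
decode_segfault() {
    # Offset of the faulting instruction from the start of the mapping.
    off=$(( $1 - $2 ))
    if [ "$off" -ge 0 ] && [ "$off" -lt $(( $3 )) ]; then
        inside=yes
    else
        inside=no
    fi
    # x86 page-fault error code: bit 0 = protection violation (else page
    # not present), bit 1 = write (else read), bit 2 = user mode (else kernel).
    if [ $(( $4 & 1 )) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $(( $4 & 2 )) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $(( $4 & 4 )) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    printf 'offset=0x%x inside_mapping=%s\n' "$off" "$inside"
    printf 'error %d: %s-mode %s, %s\n' "$4" "$mode" "$access" "$cause"
}

# Values copied from the dmesg line above: ip, mapping base, mapping size, error.
decode_segfault 0x00007fcf434cf980 0x7fcf43380000 0x176000 4
```

For the line above this prints offset=0x14f980 inside_mapping=yes and error 4: user-mode read, page not present, i.e. conmon read from an unmapped address (0x111d000) while executing code inside libc.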
Exchanging >out.pdf with tee out.pdf, I could at some point see strange characters being emitted after %%EOF (?); most of the time, however, it hangs before that.
Example file The example file is attached in encrypted form. tmp.pdf.gpg.zip
Expected behavior The output file should be correct and the tool should not hang.
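The expected behavior can be turned into an automated check so that a failing iteration is caught instead of being overwritten by the next one. The following is a rough sketch, not part of the original report: check_pdf_output and repro_loop are hypothetical helper names, the 300-second limit is an arbitrary assumption, and the header and %%EOF tests are only a cheap plausibility check, not PDF validation. timeout(1) here only bounds the podman client process; it is an assumption that this is enough to get past the hang.

```shell
#!/bin/sh
# check_pdf_output FILE
# Hypothetical helper: prints a one-word verdict, returns 0 only for "ok".
check_pdf_output() {
    # An empty or missing file is the failure described in this report.
    [ -s "$1" ] || { echo "empty"; return 1; }
    # A PDF file starts with the "%PDF-" header.
    [ "$(head -c 5 "$1")" = "%PDF-" ] || { echo "no-header"; return 1; }
    # The %%EOF marker should appear near the end of the file.
    tail -c 1024 "$1" | grep -aq '%%EOF' || { echo "no-eof"; return 1; }
    echo "ok"
}

# repro_loop [RUNS]
# Hypothetical helper: repeats the piped invocation and stops at the
# first run that exits non-zero, times out, or leaves a bad out.pdf.
repro_loop() {
    runs=${1:-100}
    for i in $(seq 1 "$runs"); do
        # Exit status 124 from timeout means the time limit was hit.
        timeout 300 podman run --rm -i ocrmypdf - - <tmp.pdf >out.pdf
        status=$?
        verdict=$(check_pdf_output out.pdf)
        if [ "$status" -ne 0 ] || [ "$verdict" != "ok" ]; then
            echo "run $i failed: exit=$status output=$verdict"
            return 1
        fi
    done
    echo "all $runs runs produced a plausible PDF"
}
```

After sourcing the file, repro_loop 100 runs the same piped command as the loop in the report (with the options omitted) and keeps the failing out.pdf in place for inspection.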
System
- OS: Fedora 35
- OCRmyPDF Version: 12.6.0.post6+g42713b77.d20211012, but reproducible just as well with jbarlow83/ocrmypdf:v13.2.0, jbarlow83/ocrmypdf:v13.1.1 and jbarlow83/ocrmypdf:v13.1.0
- How did you install ocrmypdf? podman pull jbarlow83/ocrmypdf
About this issue
- State: closed
- Created 2 years ago
- Comments: 20
ocrmypdf does not behave much differently. Frankly, if the error shown here is actually the one described in https://issueexplorer.com/issue/containers/conmon/251, there are two unchecked pointer dereferences, so pretty much anything is possible, including getting the host to execute arbitrary code produced by the container. I wouldn't use podman for anything until this is fixed.