tesseract: terminate called after throwing an instance of 'std::bad_alloc'

Hello,

First, thanks for your work. I am trying to run Tesseract 4, but I am hitting this error:

Info in bmfCreate: Generating pixa of bitmap fonts from string
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

Steps to reproduce (with a Dockerfile):

FROM ubuntu
RUN apt-get update && apt-get install -y \
	autoconf \
	automake \
	libtool \
	autoconf-archive \
	pkg-config \
	libpng12-dev \
	libjpeg8-dev \
	libtiff5-dev \
	zlib1g-dev \
	libicu-dev \
	libpango1.0-dev \
	libcairo2-dev \
	git \
	curl && \
	rm -rf /var/lib/apt/lists/*

RUN curl http://www.leptonica.org/source/leptonica-1.74.1.tar.gz -o leptonica-1.74.1.tar.gz && \
	tar -zxvf leptonica-1.74.1.tar.gz && \
	cd leptonica-1.74.1 && ./configure && make && make install && \
	cd .. && rm -rf leptonica*

RUN git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git && \
	cd tesseract && \
	./autogen.sh && \
	./configure --enable-debug && \
	LDFLAGS="-L/usr/local/lib" CFLAGS="-I/usr/local/include" make && \
	make install && \
	ldconfig && \
	make training && \
	make training-install && \
	cd .. && rm -rf tesseract

# Get basic traineddata
RUN curl https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata > eng.traineddata && \
	mv eng.traineddata /usr/local/share/tessdata/

RUN curl https://github.com/tesseract-ocr/tessdata/raw/master/fra.traineddata > fra.traineddata && \
	mv fra.traineddata /usr/local/share/tessdata/

Then:

docker build -t tesseract4 .
docker run tesseract4
docker run -t -i tesseract4 /bin/bash
mkdir test
cd test
curl http://tleyden-misc.s3.amazonaws.com/blog_images/ocr_test.png > test.png
tesseract test.png out

Can someone explain to me what is happening?

For information, I have 2471 megabytes of memory remaining.

Thanks in advance

About this issue

  • State: closed
  • Created 7 years ago
  • Comments: 33 (3 by maintainers)

Most upvoted comments

curl https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata > eng.traineddata does not fetch the expected data file. Without -L, curl does not follow the redirect and saves an HTML redirection page instead:

<html><body>You are being <a href="https://raw.githubusercontent.com/tesseract-ocr/tessdata/master/eng.traineddata">redirected</a>.</body></html>

Use curl -LO https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata (and similarly for the other languages); with that, Tesseract in Docker works for me. With the bad data file, I get an error message:

# tesseract ocr_test.png out -l bad
Info in bmfCreate: Generating pixa of bitmap fonts from string
Error opening data file /usr/local/share/bad.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'bad'
Tesseract couldn't load any languages!
Could not initialize tesseract.
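
To catch this kind of broken download before Tesseract ever sees the file, the fetch can be followed by a quick sanity check. The following is only a minimal sketch: `looks_like_html` and `fetch_traineddata` are hypothetical helper names, not part of Tesseract or curl, and the check simply assumes that a real traineddata file never contains `<html` in its first 512 bytes.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract or curl): succeeds if the
# first 512 bytes of the file contain HTML markup, which suggests that a
# redirect page was saved instead of binary traineddata.
looks_like_html() {
    head -c 512 "$1" | grep -qi '<html'
}

# Hypothetical wrapper: download with redirects followed, then refuse to
# keep the file if it turns out to be an HTML page.
fetch_traineddata() {
    url=$1
    dest=$2
    # -L follows redirects, -f fails on HTTP errors,
    # -sS hides the progress meter but still prints errors.
    curl -fsSL "$url" -o "$dest" || return 1
    if looks_like_html "$dest"; then
        echo "error: $dest is an HTML page, not traineddata" >&2
        rm -f "$dest"
        return 1
    fi
}
```

In the Dockerfile above, each download step could then call something like `fetch_traineddata https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata /usr/local/share/tessdata/eng.traineddata`, so that the build fails at that step instead of producing an image with an unusable data file.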