trafilatura: Celery error with v1.2.1: ValueError: signal only works in main thread

With version 1.2.1 it is not possible to run trafilatura extraction inside an asynchronous task runner such as Celery. The call fails here: https://github.com/adbar/trafilatura/blob/1bb5fee6a4812e53b6597053c25efde995174d79/trafilatura/core.py#L982 It would be better to expose HAS_SIGNAL as a configuration variable rather than a hardcoded value.

celery_1      |     text = trafilatura.extract(
celery_1      |   File "/usr/local/lib/python3.8/site-packages/trafilatura/core.py", line 982, in extract
celery_1      |     signal(SIGALRM, timeout_handler)
celery_1      |   File "/usr/local/lib/python3.8/signal.py", line 47, in signal
celery_1      |     handler = _signal.signal(_enum_to_int(signalnum), _enum_to_int(handler))
celery_1      | ValueError: signal only works in main thread
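
The traceback above comes down to a CPython restriction: signal handlers can only be installed from the main thread, and worker threads (as used by some Celery pools or web servers) are not the main thread. The following is a minimal standalone sketch that reproduces the same ValueError without trafilatura or Celery; the function names are made up for illustration.

```python
import signal
import threading

# SIGALRM is what the traceback shows; fall back to SIGINT on platforms
# that lack SIGALRM (an assumption made only to keep this sketch portable).
SIGNUM = getattr(signal, "SIGALRM", signal.SIGINT)


def try_install_handler(results):
    """Try to register a no-op handler and record what happened."""
    try:
        signal.signal(SIGNUM, lambda signum, frame: None)
        results.append("ok")
    except ValueError as exc:
        # Raised by CPython when called outside the main thread.
        results.append(str(exc))


def run_in_worker_thread():
    """Call try_install_handler from a non-main thread and return its result."""
    results = []
    worker = threading.Thread(target=try_install_handler, args=(results,))
    worker.start()
    worker.join()
    return results[0]


if __name__ == "__main__":
    print(run_in_worker_thread())
```

The exact wording of the message differs slightly between Python versions, but it always mentions the main thread.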

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 17 (7 by maintainers)

Most upvoted comments

If you are still struggling with this issue then here is my solution:

    from trafilatura.settings import use_config
    import trafilatura

    config = use_config()
    config.set("DEFAULT", "EXTRACTION_TIMEOUT", "0")
    downloaded = trafilatura.fetch_url('https://the-URL-you-want-to-extract')
    output = trafilatura.extract(downloaded, config=config)

I was struggling with this in a Flask app, where the settings were not being picked up from the cfg file (I don't know why and didn't have time to investigate).

Hi @mikii121, please use the latest version and a specially crafted settings file:

  • setting EXTRACTION_TIMEOUT to 0 will disable signal
  • extract(downloaded, settingsfile="myfile.cfg")

For more see extraction settings.
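
One way to produce such a settings file is to start from a complete set of defaults and change only the timeout. This is a sketch using only the standard library's configparser; `render_no_timeout_settings` is a hypothetical helper, not part of trafilatura. It assumes the settings file is read on its own, so it should contain every option and not just the one being changed.

```python
from configparser import ConfigParser
from io import StringIO


def render_no_timeout_settings(config: ConfigParser) -> str:
    """Set EXTRACTION_TIMEOUT to 0 on the given config (modified in place)
    and return the whole configuration as .cfg text."""
    config.set("DEFAULT", "EXTRACTION_TIMEOUT", "0")
    buffer = StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Stand-in defaults for demonstration only. With trafilatura installed,
    # the real defaults could be obtained from
    # trafilatura.settings.use_config() (the call used earlier in this thread).
    demo = ConfigParser()
    demo["DEFAULT"] = {"EXTRACTION_TIMEOUT": "30", "MIN_OUTPUT_SIZE": "1"}
    with open("myfile.cfg", "w", encoding="utf-8") as handle:
        handle.write(render_no_timeout_settings(demo))
```

Note that configparser writes option names in lower case by default; lookups are case-insensitive, so the file reads back the same way.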

Hi @alex-bender, thanks for your feedback. Can you try something similar to this solution?

If it is too much of a problem, I could make the use of signal optional. Is anyone else here experiencing the same problem?
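
For illustration, making signal use optional could look roughly like the sketch below: fall back to running without a timeout whenever the alarm-based mechanism is unavailable. This is not trafilatura's actual implementation; `can_use_signal_timeout` and `run_with_optional_timeout` are hypothetical names.

```python
import signal
import threading


def can_use_signal_timeout() -> bool:
    """SIGALRM timeouts need a platform that has SIGALRM and the main thread."""
    return (
        hasattr(signal, "SIGALRM")
        and threading.current_thread() is threading.main_thread()
    )


def run_with_optional_timeout(func, seconds: int):
    """Run func(), enforcing a timeout only when it is safe to do so.

    A timeout of 0 (or less) disables the alarm, mirroring the
    EXTRACTION_TIMEOUT = 0 convention mentioned in this thread.
    """
    if seconds <= 0 or not can_use_signal_timeout():
        return func()  # no timeout protection in this case

    def _on_timeout(signum, frame):
        raise TimeoutError(f"call exceeded {seconds} seconds")

    previous = signal.signal(signal.SIGALRM, _on_timeout)
    signal.alarm(seconds)
    try:
        return func()
    finally:
        signal.alarm(0)  # cancel any pending alarm
        if previous is not None:
            signal.signal(signal.SIGALRM, previous)  # restore old handler
```

The trade-off is that calls made from worker threads lose timeout protection entirely, so a slow extraction there would have to be bounded some other way, for example by the task queue's own time limits.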

Confirming that using the call below fixed the ValueError:

extract(downloaded, settingsfile="myfile.cfg")

Yes, it seems to work. Thanks all.

On Mon, Aug 1, 2022 at 2:25 PM Adrien Barbaresi @.***> wrote:

@spirovskib https://github.com/spirovskib Does it work now and can I close the issue?

– Spirovski Bozidar

Thanks both, I’ll try it over the weekend and post results.

Hi @adbar. Thanks for your answer. It works for me. 😉

I have a similar setup; the only difference is the absence of --master --processes 4 --threads 2, so I am going to try that. Will let you know.