nautobot: Job results can be very slow or fail to display due to thousands of `ensure_git_repository` calls

Environment

  • Python version: 3.10
  • Nautobot version: 1.2.8

Steps to Reproduce

  1. Configure Git-backed Jobs
  2. Call them thousands of times
  3. (probably not related, but to be complete on the history: disable the Jobs provider for that Git source, then add the same Git source again with another slug and enable the Jobs provider)
  4. Navigate to the Extras > Job Results page

Expected Behavior

Job results are displayed within a few seconds at most

Observed Behavior

Navigation fails: the reverse proxy returns 502 Bad Gateway after 60s or more, or, if there is too much traffic, the container itself is considered unhealthy and a reverse proxy such as Traefik can remove it as a backend, resulting in 404.

Additional information

(maybe this heading could be added to the bug template by default?)

In the container logs (of the nautobot container, not the celery_worker one), I’ve noticed thousands of these lines:

Attaching to nautobot_nautobot_1
nautobot_1       |   Repository successfully refreshed
nautobot_1       | 10:56:00.816 INFO    nautobot.jobs :
nautobot_1       |   Repository successfully refreshed
nautobot_1       | 10:56:00.889 INFO    nautobot.jobs :
nautobot_1       |   Repository successfully refreshed
[..]

I’ve noticed this in the past (https://github.com/nautobot/nautobot/issues/744) but it seemed harmless at the time; today, with way more JobResults, the page cannot be displayed at all.

The ensure_git_repository call seems to be made for each JobResult related to a Git job, maybe limited to the MAX_PAGE_SIZE count or the user’s page count, but I’m not sure since there are really a lot of log lines.

This can be critical: when the page is requested multiple times, it can cause the healthcheck to fail or time out, and it may be possible to DoS the instance this way.

In the Docker daemon logs themselves, it’s also possible to notice:

# journalctl -fu docker
Mar 17 10:47:13 my_hostname dockerd[12283]: time="2022-03-17T10:47:13.700931230+01:00" level=warning msg="Health check for container <my_container_id> error: context deadline exceeded
[..]

I tried to find where this call originates from and noted the following (each additional > indicates a caller of the line above):

ensure_git_repository
>pull_git_repository_and_refresh_data
>>enqueue_pull_git_repository_and_refresh_data
>Jobs._get_job_source_paths
>>Jobs.get_jobs
>>>get_job_classpaths
>>>get_job
>>>>JobResult.related_object()
>>>JobListView.get()
>>>JobView.get()

The path I think is problematic is JobResult.related_object() > Jobs.get_job > Jobs.get_jobs > Jobs._get_job_source_paths > ensure_git_repository.

Maybe Jobs.get_jobs could be cached (per request?) to deduplicate the ensure_git_repository calls? Or maybe ensure_git_repository should not be called here at all (updating the job list only on Git repository creation/sync, with a beat task or watcher monitoring changes for local jobs)? And shouldn’t ensure_git_repository run only on the Celery worker?
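
For the first idea, a minimal sketch of per-process deduplication, assuming a simple hand-rolled TTL cache (the ttl_cache decorator below is hypothetical, not part of Nautobot):

import functools
import time


def ttl_cache(ttl_seconds=5):
    """Cache a zero-argument function's result in-process for ttl_seconds."""
    def decorator(func):
        state = {"value": None, "expires": 0.0}

        @functools.wraps(func)
        def wrapper():
            now = time.monotonic()
            if now >= state["expires"]:
                state["value"] = func()
                state["expires"] = now + ttl_seconds
            return state["value"]

        return wrapper

    return decorator


# Hypothetical usage: one Git refresh per 5-second window per process,
# instead of one per JobResult row rendered.
# @ttl_cache(ttl_seconds=5)
# def get_jobs():
#     ...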

About this issue

  • State: closed
  • Created 2 years ago
  • Comments: 21 (21 by maintainers)

Most upvoted comments

Thinking about this from a different angle: as you pointed out, the main critical path here is JobResultTable > JobResult.related_object() > Jobs.get_job > Jobs.get_jobs > Jobs._get_job_source_paths > ensure_git_repository. That is a lot of unnecessary work, because in the case where related_object is a Job, all we actually use to render the table column is its class_path; actually loading the Job classes into memory is completely unnecessary processing.

I wonder therefore if we should revisit how this table column is calculated/rendered so that it only calls record.related_object in the case where we can determine in advance that related_object is not a Job, something like:

from django.contrib.contenttypes.models import ContentType
from django.urls import reverse


def job_creator_link(value, record):
    """
    Get a link to the related object, if any, associated with the given JobResult record.
    """
    # record.related_object is potentially slow if the related object is a Job class,
    # as it needs to actually (re)load the Job class into memory. That's unnecessary
    # computation as we don't actually need the class itself, just its class_path which is already
    # available as record.name on the JobResult itself. So save some trouble:
    if record.obj_type == ContentType.objects.get(app_label="extras", model="job"):
        return reverse("extras:job", kwargs={"class_path": record.name})

    # If it's not a Job class, maybe it's something like a GitRepository, which we can look up cheaply:
    related_object = record.related_object
    if related_object:
        return related_object.get_absolute_url()
    return None

@u1735067 We revised this again to hybridize the approaches here, ensuring that the caching is in place for Git-based jobs and that related_object is only retrieved once per row if it’s not a Job class. Could you please give the latest version of this a try?

One thing, however: we updated the base for this fix from develop (v1.2.x) to next (v1.3.0-beta.x), as we are about to release v1.3.0. So you will need to run database migrations as well, OR just hot-fix your v1.2.x install with the updated code. Thanks in advance!

We moved the cached call down to _get_job_source_paths instead of get_jobs.

Oh right, I looked quickly and missed that, sorry! I just tested and there are no errors 🙂

_get_job_source_paths is what was being called in get_jobs; it ultimately calls ensure_git_repository and several other things that should be safe to cache per worker.

And the case where a repository is removed but the cached _get_job_source_paths result still returns its path (for at most 60s) seems to be handled silently by pkgutil.iter_modules, which ignores non-existent paths.

Modules are still loaded every time get_jobs is called, but this should be a fraction of the call time compared to the Git repository validation.
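
For illustration, a minimal sketch of what that looks like (assumptions mine, not the actual Nautobot patch), using the same cacheops @cached decorator that appears later in this thread; the function body is elided and only marks where ensure_git_repository fits:

from cacheops import cached


@cached(timeout=60)
def _get_job_source_paths():
    """
    Resolve the directories that get_jobs() should scan for job modules.

    The expensive part, calling ensure_git_repository() for every Git
    repository that provides jobs, now runs at most once per 60-second
    window per cache key instead of on every page render.
    """
    paths = []
    # ... append the local JOBS_ROOT, then for each job-providing GitRepository,
    #     refresh it via ensure_git_repository() and append its jobs directory.
    return paths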

This is still a large part of the time, sadly. I tested the TTFB (time to first byte) with 1063 Git JobResults:

  • 50/page
    • file_cache on _get_job_source_paths & get_job_classpaths
      • 2.42s / 2.30s / 2.56s / 2.46s
      • -> With a common page size, the UI still feels slow ☹
    • file_cache.cached (thread key) on get_jobs
      • 0.509s / 0.501s / 0.475s / 0.477s
  • 500/page
    • file_cache on _get_job_source_paths & get_job_classpaths
      • 25.79s / 23.88s / 23.7s / 23.41s
    • file_cache.cached (thread key) on get_jobs
      • 4.91s / 4.08s / 4.07s / 4.05s
  • 1000/page
    • file_cache on _get_job_source_paths & get_job_classpaths
      • 46.63s / 46.11s / 46.82s / 46.52s
      • -> Launching multiple refreshes of the job results page can make the instance unavailable to the healthcheck
    • file_cache.cached (thread key) on get_jobs
      • 10.05s / 8.20s / 8.14s / 9.11s / 8.79s

@jathanism thank you, I was able to test but result was the same as in https://github.com/nautobot/nautobot/issues/1515#issuecomment-1071235072, still some errors.

At first I thought it was a concurrency issue with cacheops, but after digging I found it was because threads don’t share loaded modules: the thread running get_jobs (https://github.com/nautobot/nautobot/blob/v1.2.10/nautobot/extras/jobs.py#L826) is fine, but the others, reading from the cache, are unable to access the referenced classes.
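
To illustrate the failure mode with a reduced example (mine, not Nautobot code): pickle stores classes by reference, so a cached reference only resolves in a process that has the defining module loaded:

import pickle


class DynamicJob:  # stand-in for a Job class loaded from a Git repository
    pass


data = pickle.dumps(DynamicJob)  # pickled by reference (module + qualified name)

del DynamicJob  # simulate a worker that never loaded the job module

# pickle.loads(data)  # raises AttributeError: Can't get attribute 'DynamicJob'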

To work around this, I tried using a custom per-thread caching key:

# https://github.com/Suor/django-cacheops/blob/5.1/cacheops/simple.py#L53
# https://github.com/Suor/django-cacheops/blob/5.1/cacheops/utils.py#L89
# https://docs.python.org/3/library/threading.html
# https://github.com/python/cpython/blob/3.10/Lib/threading.py
def _cacheops_func_cache_key_per_thread(func, args, kwargs, extra=None):
    """
    Calculate cache key based on func, arguments and current thread
    """
    import threading
    from cacheops.utils import json, md5hex, obj_key

    current_thread = threading.current_thread()
    thread_factors = [current_thread.name, current_thread.ident]
    if hasattr(current_thread, 'native_id'):
        thread_factors.append(current_thread.native_id)
    
    factors = [func, args, kwargs, extra, thread_factors]
    key = md5hex(json.dumps(factors, sort_keys=True, default=obj_key))

    # Debug
    #import logging
    #logging.getLogger("nautobot.jobs").warning('>>>>>>>>>>>>>>>>>>>>>>>>> {} {} {}'.format(threading.current_thread(), thread_factors, key))
    
    return key


@cached(timeout=60, key_func=_cacheops_func_cache_key_per_thread)
def get_jobs():
    [...]

This forces each thread to run get_jobs at least once. Also, with get_jobs in cache, there’s no need to cache ensure_git_repository nor get_job_classpaths. This seems to work correctly for now.

Note that with uWSGI, the Thread.ident and Thread.native_id values seem to be the same across threads, so the name has to be used to distinguish them (https://github.com/unbit/uwsgi/blob/2.0.20/plugins/python/python_plugin.c#L1323):

16:51:55.781 WARNING nautobot.jobs :
  >>>>>>>>>>>>>>>>>>>>>>>>> <_MainThread(uWSGIWorker1Core0, started 140496645598024)> ['uWSGIWorker1Core0', 140496645598024, 1] db4624064d96aa320366075470e4e29f
16:51:56.621 WARNING nautobot.jobs :
  >>>>>>>>>>>>>>>>>>>>>>>>> <_MainThread(uWSGIWorker3Core0, started 140496645598024)> ['uWSGIWorker3Core0', 140496645598024, 1] e252e4d784901cdd1a90584dba336abe
16:51:57.225 WARNING nautobot.jobs :
  >>>>>>>>>>>>>>>>>>>>>>>>> <_MainThread(uWSGIWorker3Core0, started 140496645598024)> ['uWSGIWorker3Core0', 140496645598024, 1] e252e4d784901cdd1a90584dba336abe
16:51:57.799 WARNING nautobot.jobs :
  >>>>>>>>>>>>>>>>>>>>>>>>> <_MainThread(uWSGIWorker2Core0, started 140496645598024)> ['uWSGIWorker2Core0', 140496645598024, 1] 963f6d84537803f7b1b7ac6d318eb132
16:51:58.278 WARNING nautobot.jobs :
  >>>>>>>>>>>>>>>>>>>>>>>>> <_MainThread(uWSGIWorker2Core0, started 140496645598024)> ['uWSGIWorker2Core0', 140496645598024, 1] 963f6d84537803f7b1b7ac6d318eb132

I’m not sure this is enough to cover all uWSGI configurations however.
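
A possible simplification (my suggestion, not something tried in this thread): assuming each uWSGI worker is a separate process, keying the cache on os.getpid() avoids relying on thread names at all:

import os


def _cacheops_func_cache_key_per_process(func, args, kwargs, extra=None):
    """Like the per-thread key above, but keyed on the worker process ID."""
    from cacheops.utils import json, md5hex, obj_key

    factors = [func, args, kwargs, extra, os.getpid()]
    return md5hex(json.dumps(factors, sort_keys=True, default=obj_key))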

And regarding this earlier note:

Note that this workaround seems to cause issues when a job form is refreshed rapidly (and maybe in other places): some custom job fields, like selections, can appear unpopulated, while the job code populates them correctly …

The empty selection error was because I used a generator to populate a ChoiceVar field; this wasn’t correct, as only the first read would return choices. Adding caching just made this visible.

Also, Nautobot seems to import jobs as top-level modules (https://github.com/nautobot/nautobot/blob/v1.2.10/nautobot/extras/jobs.py#L826); couldn’t this be an issue with a job file named sys.py, for example? Maybe the jobs could be imported as a package named nautobot.jobs.{slug.replace('-', '_')} (automatically generating the __init__.py if not provided)?
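
A rough sketch of that namespacing idea (purely illustrative; load_job_module and the nautobot_jobs prefix are hypothetical names):

import importlib.util


def load_job_module(file_path, repo_slug, module_name):
    """Load a job file under a prefixed module name instead of its bare filename."""
    # e.g. "nautobot_jobs.my_repo.sys" instead of plain "sys"
    qualified_name = f"nautobot_jobs.{repo_slug.replace('-', '_')}.{module_name}"
    spec = importlib.util.spec_from_file_location(qualified_name, file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module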


Some notes for playing with pickle (https://docs.python.org/3/library/pickle.html, https://docs.python.org/3/library/pickletools.html):

import cacheops, pickle, pickletools
pickle.loads(cacheops.redis.redis_client.get('c:963f6d84537803f7b1b7ac6d318eb132'))
pickletools.dis(cacheops.redis.redis_client.get('c:963f6d84537803f7b1b7ac6d318eb132'))

@u1735067 If you wouldn’t mind downloading and trying the fix in #1585 that would be super!

I was able to access the page using this temporary workaround (caching the Jobs.get_jobs() call for 5 seconds):

docker-compose exec --user root nautobot sed -ie '/@cached(timeout=5)  # TMP/d;s/def get_jobs/@cached(timeout=5)  # TMP\ndef get_jobs/' /usr/local/lib/python3.10/site-packages/nautobot/extras/jobs.py
docker-compose restart nautobot

(For some reason, my per-user page count was set to 1000, which is why there were so many things to refresh; still, that shouldn’t happen.)