nautobot: Jobs results can be very slow / unable to display due to thousands of `ensure_git_repository` calls
Environment
- Python version: 3.10
- Nautobot version: 1.2.8
Steps to Reproduce
- Configure Git backed Jobs
- Call them thousands of times
- (Probably not related, but for completeness: disable the Jobs provider for that Git source, then add the same Git source again with another slug and enable the Jobs provider)
- Navigate to the Extra > Job results page
Expected Behavior
Job results are displayed within a few seconds at most
Observed Behavior
Navigation fails: the reverse proxy returns 502 Bad Gateway after 60 s or more, or, if there is too much traffic, the container itself is considered unhealthy and a reverse proxy like Traefik can remove it as a backend, resulting in 404s.
Additional information
(Maybe this heading could be added by default to the bug template?)
In the container logs (of the nautobot container, not the celery_worker one), I’ve noticed thousands of these lines:
```
Attaching to nautobot_nautobot_1
nautobot_1 | Repository successfully refreshed
nautobot_1 | 10:56:00.816 INFO nautobot.jobs :
nautobot_1 | Repository successfully refreshed
nautobot_1 | 10:56:00.889 INFO nautobot.jobs :
nautobot_1 | Repository successfully refreshed
[..]
```
I’ve noticed this in the past (https://github.com/nautobot/nautobot/issues/744) but it seemed harmless at the time; today, with far more JobResults, the page cannot be displayed at all.
The `ensure_git_repository` call seems to be made for each JobResult related to a Git-based job, maybe limited to MAX_PAGE_SIZE or the user’s page count, but I’m not sure since there are really a lot of logs.
This can be critical: when called many times, it can cause the healthcheck to fail or time out, and it may be possible to DoS the instance this way.
In the logs of the Docker daemon itself, it’s also possible to notice:
```
# journalctl -fu docker
Mar 17 10:47:13 my_hostname dockerd[12283]: time="2022-03-17T10:47:13.700931230+01:00" level=warning msg="Health check for container <my_container_id> error: context deadline exceeded
[..]
```
I tried to find where this call originates from and noted the following (callers are indented beneath their callees):
- `ensure_git_repository`
  - `pull_git_repository_and_refresh_data`
    - `enqueue_pull_git_repository_and_refresh_data`
  - `Jobs._get_job_source_paths`
    - `Jobs.get_jobs`
      - `get_job_classpaths`
      - `get_job`
        - `JobResult.related_object()`
      - `JobListView.get()`
      - `JobView.get()`
The path I think is problematic is `JobResult.related_object()` > `Jobs.get_job` > `Jobs.get_jobs` > `Jobs._get_job_source_paths` -> `ensure_git_repository`.
Maybe `Jobs.get_jobs` could be cached (per request?) to deduplicate the `ensure_git_repository` calls, as sketched below? Or maybe `ensure_git_repository` should not be called here at all (only update the list on Git repository creation/sync, with a beat task or watcher monitoring changes to local jobs)? And shouldn’t `ensure_git_repository` run only on the Celery worker?
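As a rough sketch of what such deduplication could look like (illustrative only: apart from `ensure_git_repository` itself and the repository `slug` field, the wrapper name and per-thread bookkeeping are assumptions, and a true per-request cache would also need to be reset, e.g. from middleware):

```python
# Illustrative sketch, not Nautobot code: refresh each Git repository at most once
# per thread, so a burst of JobResult rows doesn't trigger repeated git pulls.
import threading

from nautobot.extras.datasources.git import ensure_git_repository  # real helper

_refreshed = threading.local()

def ensure_git_repository_once(repository_record, **kwargs):
    """Call ensure_git_repository() at most once per repository per thread."""
    seen = getattr(_refreshed, "slugs", None)
    if seen is None:
        seen = _refreshed.slugs = set()
    if repository_record.slug not in seen:
        ensure_git_repository(repository_record, **kwargs)
        seen.add(repository_record.slug)
```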
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Comments: 21 (21 by maintainers)
Commits related to this issue
- Add release-note for #1515 — committed to nautobot/nautobot by bryanculver 2 years ago
- Add release-note for #1515 — committed to nautobot/nautobot by bryanculver 2 years ago
Thinking about this from a different angle - as you pointed out, the main critical path here is `JobResultTable` > `JobResult.related_object()` > `Jobs.get_job` > `Jobs.get_jobs` > `Jobs._get_job_source_paths` -> `ensure_git_repository`, which is a lot of unnecessary work when all we’re actually using to render the table column, in the case where `related_object` is a Job, is its `class_path` (actually loading the Jobs into memory, etc. is a completely unnecessary bit of processing). I wonder therefore if we should revisit how this table column is calculated/rendered so that it only calls `record.related_object` in the case where we can determine in advance that `related_object` is not a Job, something like the sketch below:
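The code block from that comment isn’t preserved here; a hedged reconstruction of the idea (the `obj_type` and `name` fields come from the `JobResult` model, but the column class and the content-type check are illustrative, not the actual fix that landed):

```python
# Illustrative sketch only: skip the expensive related_object resolution when the
# JobResult points at a Job class, since the stored name (the job class_path) is
# all the column needs to render in that case.
import django_tables2 as tables

class JobResultRelatedObjectColumn(tables.Column):
    def render(self, record):
        # Assumed check: Job-backed results use the "extras | job" content type and
        # record.name already holds the class_path, so don't load the job class.
        if record.obj_type.app_label == "extras" and record.obj_type.model == "job":
            return record.name
        related = record.related_object  # only resolved for real database objects
        return str(related) if related else record.name
```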
@u1735067 We revised this again to hybridize the approach here, asserting that the caching is in place for Git-based jobs and that `related_object` is only retrieved once per row if it’s not a Job class. If you could please give the latest for this a chance? One thing however: we updated the base for this fix from `develop` (v1.2.x) to `next` (v1.3.0-beta.x), as we are about to release v1.3.0. So you will need to run database migrations as well, OR just hot-fix your v1.2.x install with the updated code. Thanks in advance!

Oh right, I looked quickly and missed that, sorry! I just tested and there are no errors, yes 🙂
And the case where a repo is removed but `_get_job_source_paths` still returns its path (for max 60 s) seems to be handled silently by `pkgutil.iter_modules`, which ignores non-existent paths.

This is still a large part of the time sadly; I tested the TTFB (time to first byte) with 1063 Git JobResults:
- `file_cache` on `_get_job_source_paths` & `get_job_classpaths`: 2.42s / 2.30s / 2.56s / 2.46s
- `file_cache.cached` (thread key) on `get_jobs`: 0.509s / 0.501s / 0.475s / 0.477s
- `file_cache` on `_get_job_source_paths` & `get_job_classpaths`: 25.79s / 23.88s / 23.7s / 23.41s
- `file_cache.cached` (thread key) on `get_jobs`: 4.91s / 4.08s / 4.07s / 4.05s
- `file_cache` on `_get_job_source_paths` & `get_job_classpaths`: 46.63s / 46.11s / 46.82s / 46.52s
- `file_cache.cached` (thread key) on `get_jobs`: 10.05s / 8.20s / 8.14s / 9.11s / 8.79s

@jathanism thank you, I was able to test but the result was the same as in https://github.com/nautobot/nautobot/issues/1515#issuecomment-1071235072, still some errors.
First I thought it was a concurrency issue with `cacheops`, but by digging I found it was because threads don’t share loaded modules, so the thread running `get_jobs` (https://github.com/nautobot/nautobot/blob/v1.2.10/nautobot/extras/jobs.py#L826) is fine, but the others, reading the cache, are unable to access the referenced classes. To work around this, I tried to use a custom per-thread caching key:
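The patch itself isn’t preserved here; a rough approximation of the idea, using a plain in-process memo instead of the `cacheops` `file_cache.cached` decorator that was actually used, keyed on the thread name for the uWSGI reason noted just below:

```python
# Rough approximation of the workaround, not the exact cacheops-based patch:
# memoize get_jobs() per worker thread, keyed on the thread *name* because under
# uWSGI Thread.ident / Thread.native_id can look identical across threads.
import threading
import time

from nautobot.extras.jobs import get_jobs  # real Nautobot helper

_CACHE = {}            # {thread_name: (expires_at, jobs_dict)}
_CACHE_TIMEOUT = 60.0  # seconds; illustrative value

def get_jobs_per_thread():
    key = threading.current_thread().name
    now = time.monotonic()
    hit = _CACHE.get(key)
    if hit is not None and hit[0] > now:
        return hit[1]
    jobs = get_jobs()  # each thread imports the job modules itself at least once
    _CACHE[key] = (now + _CACHE_TIMEOUT, jobs)
    return jobs
```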
This forces each thread to run `get_jobs` at least once. Also, with `get_jobs` in the cache, there’s no need to cache `ensure_git_repository` nor `get_job_classpaths`. This seems to work correctly for now.

Note that with uWSGI, the `Thread.ident` and `Thread.native_id` values seem to be the same across threads, so to distinguish threads the `name` has to be used (https://github.com/unbit/uwsgi/blob/2.0.20/plugins/python/python_plugin.c#L1323), as in the sketch above. I’m not sure this is enough to cover all uWSGI configurations, however.
And about the empty selection error: it was because I used a generator to populate a `ChoiceVar` field. This wasn’t correct, as only the first read would return choices; adding caching just made it visible.
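A minimal illustration of that pitfall (the job and choices are made up; only `ChoiceVar` and `Job` are Nautobot’s actual classes):

```python
# A generator is exhausted after one pass, so only the first form render would see
# any choices; a list (or tuple) can be iterated every time the form is built.
from nautobot.extras.jobs import ChoiceVar, Job

SLUGS = ["site-a", "site-b"]

bad_choices = ((slug, slug) for slug in SLUGS)    # empty after the first read
good_choices = [(slug, slug) for slug in SLUGS]   # safe to reuse

class ExampleJob(Job):
    target = ChoiceVar(choices=good_choices)
```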
Also, Nautobot seems to import jobs as top-level modules (https://github.com/nautobot/nautobot/blob/v1.2.10/nautobot/extras/jobs.py#L826); couldn’t this be an issue with a job file named `sys.py`, for example? Maybe the jobs could be imported as a package named `nautobot.jobs.{slug.replace('-', '_')}` (automatically generating the `__init__.py` if not provided)?

Some notes to play with `pickle` (https://docs.python.org/3/library/pickle.html, https://docs.python.org/3/library/pickletools.html):
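The notes themselves aren’t preserved here; the kind of exploration those two modules allow looks roughly like this (generic standard-library usage, not tied to Nautobot’s internals):

```python
# Generic pickle/pickletools poking, e.g. to see how a cached value such as a job
# class_path would be serialized by a pickle-based cache backend.
import pickle
import pickletools

data = pickle.dumps({"class_path": "local/my_jobs/MyJob"})
pickletools.dis(data)      # readable dump of the pickle opcodes
print(pickle.loads(data))  # round-trip check
```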
@u1735067 If you wouldn’t mind downloading and trying the fix in #1585, that would be super!
I was able to access the page using a temporary workaround: caching the `Jobs.get_jobs()` call for 5 seconds. (For some reason, my user page count was set to 1000, that’s why there were so many things to refresh; still, that shouldn’t happen.)
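The workaround code isn’t included here; a hypothetical sketch of that kind of short-lived cache (a time-bucketed `lru_cache`, almost certainly not the exact patch used):

```python
# Hypothetical sketch, not the actual workaround: share one get_jobs() result among
# all calls landing in the same 5-second window, so rendering a large JobResult page
# triggers at most one scan of the Git-backed job sources.
import time
from functools import lru_cache

from nautobot.extras.jobs import get_jobs  # real Nautobot helper

@lru_cache(maxsize=4)
def _get_jobs_for_bucket(bucket):
    return get_jobs()

def get_jobs_cached_5s():
    return _get_jobs_for_bucket(int(time.monotonic() // 5))
```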