nautobot: Jobs results can be very slow / unable to display due to thousands of `ensure_git_repository` calls
Environment
- Python version: 3.10
- Nautobot version: 1.2.8
Steps to Reproduce
- Configure Git backed Jobs
- Call them thousands of times
- (probably not related, but for completeness: disable the Jobs provider for that Git source, then add the same Git source again with another slug and enable the Jobs provider)
- Navigate to the Extra > Job results page
Expected Behavior
Job results are displayed within a few seconds at most
Observed Behavior
Navigation fails: the reverse proxy returns 502 Bad Gateway after 60 s or more, or, if there's too much traffic, the container itself is considered unhealthy and a reverse proxy like Traefik can remove it as a backend, resulting in a 404.
Additional information
(maybe this heading could be added to the bug template by default?)
In the container logs (of the nautobot container, not the celery_worker one), I've noticed thousands of these lines:
Attaching to nautobot_nautobot_1
nautobot_1 | Repository successfully refreshed
nautobot_1 | 10:56:00.816 INFO nautobot.jobs :
nautobot_1 | Repository successfully refreshed
nautobot_1 | 10:56:00.889 INFO nautobot.jobs :
nautobot_1 | Repository successfully refreshed
[..]
I’ve noticed this in the past (https://github.com/nautobot/nautobot/issues/744) but it seemed harmless at the time; today, with way more JobResults, it’s impossible for the page to display.
The ensure_git_repository call seems to be made for each JobResult related to a Git job, maybe limited to the MAX_PAGE_SIZE count or the user's page count, but I'm not sure since there are really a lot of logs.
This can be critical, because when this happens multiple times it can cause the healthcheck to fail or time out, and it may be possible to DoS the instance this way.
In the Docker (itself) logs, it’s also possible to notice:
# journalctl -fu docker
Mar 17 10:47:13 my_hostname dockerd[12283]: time="2022-03-17T10:47:13.700931230+01:00" level=warning msg="Health check for container <my_container_id> error: context deadline exceeded
[..]
I tried to find where this call originates from and noted:
ensure_git_repository
>pull_git_repository_and_refresh_data
>>enqueue_pull_git_repository_and_refresh_data
>Jobs._get_job_source_paths
>>Jobs.get_jobs
>>>get_job_classpaths
>>>get_job
>>>>JobResult.related_object()
>>>JobListView.get()
>>>JobView.get()
The path I think is problematic is JobResult.related_object() > Jobs.get_job > Jobs.get_jobs > Jobs._get_job_source_paths -> ensure_git_repository.
Maybe Jobs.get_jobs could be cached (per request?) to deduplicate ensure_git_repository calls? Or maybe ensure_git_repository should not be called at all here (updating the list only on Git repository creation/sync? A beat task/watcher could monitor changes for local jobs). And shouldn't ensure_git_repository run only on the Celery worker?
About this issue
- State: closed
- Created 2 years ago
- Comments: 21 (21 by maintainers)
Commits related to this issue
- Add release-note for #1515 — committed to nautobot/nautobot by bryanculver 2 years ago
Thinking about this from a different angle - as you pointed out, the main critical path here is JobResultTable > JobResult.related_object() > Jobs.get_job > Jobs.get_jobs > Jobs._get_job_source_paths -> ensure_git_repository, which is a lot of unnecessary work when all we're using to actually render the table column, in the case where related_object is a Job, is its class_path (actually loading the Jobs into memory, etc. is a completely unnecessary bit of processing). I wonder therefore if we should revisit how this table column is calculated/rendered so that it only calls record.related_object in the case where we can determine in advance that related_object is not a Job, something like:
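For illustration, a minimal sketch of that idea, assuming django-tables2 and that JobResult.name already holds the job class_path for Job results (the column and content-type checks here are illustrative, not the fix that was merged):

```python
import django_tables2 as tables


class JobResultTable(tables.Table):
    related_object = tables.Column(verbose_name="Related object")

    def render_related_object(self, record):
        # Assumption: for Job runs, JobResult.name already stores the job's
        # class_path and obj_type points at the "job" content type, so the
        # column can be rendered without importing any job code (no get_jobs(),
        # no ensure_git_repository()).
        if record.obj_type and record.obj_type.model == "job":
            return record.name
        # Anything else (e.g. a GitRepository sync) falls back to the
        # potentially expensive related_object lookup, once per row.
        related = record.related_object
        return str(related) if related is not None else record.name
```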
@u1735067 We revised this again to hybridize the approach here, to assert that the caching is in place for Git-based jobs and that the related_object is only retrieved once per row if it's not a Job class. If you could please give the latest for this a chance?

One thing however: we updated the base for this fix from develop (v1.2.x) to next (v1.3.0-beta.x), as we are about to release v1.3.0. So you will need to run database migrations as well, OR just hot-fix your v1.2.x install with the updated code. Thanks in advance!

Oh right, I looked quickly and missed that, sorry! I just tested and there are no errors 🙂
And the case where the repo is removed but _get_job_source_paths still returns its path (for max 60s) seems to be handled silently by pkgutil.iter_modules, which ignores non-existent paths.
This is still a large part of the time, sadly. I tested the TTFB (time to first byte) with 1063 Git JobResults:
| Caching approach | TTFB samples |
| --- | --- |
| file_cache on _get_job_source_paths & get_job_classpaths | 2.42s / 2.30s / 2.56s / 2.46s |
| file_cache.cached (thread key) on get_jobs | 0.509s / 0.501s / 0.475s / 0.477s |
| file_cache on _get_job_source_paths & get_job_classpaths | 25.79s / 23.88s / 23.7s / 23.41s |
| file_cache.cached (thread key) on get_jobs | 4.91s / 4.08s / 4.07s / 4.05s |
| file_cache on _get_job_source_paths & get_job_classpaths | 46.63s / 46.11s / 46.82s / 46.52s |
| file_cache.cached (thread key) on get_jobs | 10.05s / 8.20s / 8.14s / 9.11s / 8.79s |
@jathanism thank you, I was able to test, but the result was the same as in https://github.com/nautobot/nautobot/issues/1515#issuecomment-1071235072: still some errors.
First I thought it was a concurrency issue with cacheops, but by digging I found it was because threads don't share loaded modules, so the thread running get_jobs (https://github.com/nautobot/nautobot/blob/v1.2.10/nautobot/extras/jobs.py#L826) is fine, but the other threads, reading from the cache, are unable to access the referenced classes. To work around this, I tried to use a custom per-thread caching key:
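Roughly the idea, as a stand-in sketch (a plain dict keyed by thread rather than the exact cacheops file_cache patch I applied; the monkey-patch location is also illustrative):

```python
import threading
import time

from nautobot.extras import jobs as jobs_module

_original_get_jobs = jobs_module.get_jobs
_per_thread_cache = {}  # {thread_key: (timestamp, jobs_dict)}
_TTL = 60  # seconds


def _cached_get_jobs():
    key = threading.get_ident()  # per-thread cache key (but see the uWSGI note below)
    entry = _per_thread_cache.get(key)
    if entry is None or time.monotonic() - entry[0] > _TTL:
        # Each thread runs the real get_jobs() at least once, so it imports the
        # job modules itself and can resolve the classes it gets back later.
        _per_thread_cache[key] = (time.monotonic(), _original_get_jobs())
    return _per_thread_cache[key][1]


jobs_module.get_jobs = _cached_get_jobs
```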
This forces each thread to run get_jobs at least once. Also, with get_jobs in the cache, there's no need to cache ensure_git_repository nor get_job_classpaths. This seems to work correctly for now.

Note that with uWSGI, the Thread.ident and Thread.native_id values seem to be the same across threads, so to distinguish threads the name has to be used (https://github.com/unbit/uwsgi/blob/2.0.20/plugins/python/python_plugin.c#L1323):
have to be used (https://github.com/unbit/uwsgi/blob/2.0.20/plugins/python/python_plugin.c#L1323):I’m not sure this is enough to cover all uWSGI configurations however.
And about the empty selection error: it was because I used a generator to populate a ChoiceVar field. This wasn't correct, as only the first read would return choices; adding caching just made it visible.
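A contrived sketch of that pitfall (the job and variable names are made up):

```python
from nautobot.extras.jobs import ChoiceVar, Job

COLORS = ["red", "green", "blue"]


class ExampleJob(Job):
    # Broken: a generator is exhausted after its first iteration, so only the
    # first render of the job form sees any choices; once the job class is
    # cached and re-rendered, the ChoiceVar comes back empty.
    # color = ChoiceVar(choices=((c, c.title()) for c in COLORS))

    # Correct: materialize the choices into a list of (value, label) tuples.
    color = ChoiceVar(choices=[(c, c.title()) for c in COLORS])

    def run(self, data, commit):
        self.log_success(message=f"Selected {data['color']}")
```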
Also, Nautobot seems to import jobs as top-level modules (https://github.com/nautobot/nautobot/blob/v1.2.10/nautobot/extras/jobs.py#L826); couldn't this be an issue with a job file named sys.py, for example? Maybe the jobs could be imported as a package named nautobot.jobs.{slug.replace('-', '_')} (automatically generating the __init__.py if not provided)?

Some notes to play with pickle (https://docs.python.org/3/library/pickle.html, https://docs.python.org/3/library/pickletools.html):
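For instance, to see what a cached entry actually contains (generic stdlib usage, not Nautobot-specific):

```python
import pickle
import pickletools


class Dummy:
    """Classes are pickled by reference (module + qualified name), not by value,
    so whoever unpickles a cached jobs dict must be able to import the module."""


payload = pickle.dumps(Dummy)
pickletools.dis(payload)  # human-readable opcode dump of the pickle stream
print(len(pickletools.optimize(payload)))  # size after dropping unused PUT opcodes
```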
@u1735067 If you wouldn't mind downloading and trying the fix in #1585, that would be super!

I was able to access the page using this temporary workaround (caching the Jobs.get_jobs() call for 5 seconds):
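Roughly (the patch point and the use of cacheops' file_cache here are illustrative, not the exact code I ran):

```python
from cacheops import file_cache

from nautobot.extras import jobs as jobs_module

_original_get_jobs = jobs_module.get_jobs


# Cache the expensive jobs discovery for 5 seconds, so a page full of
# JobResults triggers ensure_git_repository at most once per 5 s per process
# instead of once per row.
@file_cache.cached(timeout=5)
def _cached_get_jobs():
    return _original_get_jobs()


jobs_module.get_jobs = _cached_get_jobs
```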
call):(For some reason, my user page count was set to 1000, that’s why there were so many things to refresh; still that shouldn’t happen)