harbor: Periodic GC job logs are missing after first job run
We’re running Harbor v2.1.3, with jobservice configured to store job logs in the database.
We have a GC job scheduled to run once a week. The job executes and finishes successfully, but no log is stored in the database, and we get an “internal server error” (and a 404 in the core log) when trying to retrieve it in the UI or via the API.
Job status (it does show “finished” and the correct update date):

Trying to get the log in the UI, we get this:

We are 99.9% (😄) sure we saw the job log on the first scheduled run after the job was configured, but we seem to be hitting this issue on the subsequent job runs.
As an aside, I’m also curious what the expected UX is here: since this is a periodic job, it has the same ID (105 in our case) every time it runs, so are we only able to access the log of the last run?
jobservice log when the job last ran:
2021-02-17T00:30:00Z [INFO] [/jobservice/worker/cworker/c_worker.go:76]: Job incoming: {"name":"IMAGE_GC","id":"7d97f3a11a9a6d1eaa4217fe","t":1613521800,"args":null}
2021-02-17T00:30:00Z [INFO] [/jobservice/runner/redis.go:67]: Start to run periodical job execution: 7d97f3a11a9a6d1eaa4217fe@1613521800
2021-02-17T00:30:15Z [INFO] [/jobservice/runner/redis.go:133]: Job 'IMAGE_GC:7d97f3a11a9a6d1eaa4217fe' exit with success
core log when the job last ran:
2021-02-17T00:30:00Z [INFO] [/core/service/notifications/admin/handler.go:98]: received admin job status update event: job-105, job_uuid-7d97f3a11a9a6d1eaa4217fe, status-running, revision-1613521708
2021-02-17T00:30:15Z [INFO] [/core/service/notifications/admin/handler.go:98]: received admin job status update event: job-105, job_uuid-7d97f3a11a9a6d1eaa4217fe, status-finished, revision-1613521708
jobservice log when trying to retrieve the job log:
2021-02-17T15:52:02Z [ERROR] [/jobservice/api/handler.go:288]: Serve http request 'GET /api/v1/jobs/7d97f3a11a9a6d1eaa4217fe/executions?page_number=1&page_size=100' error: 404 {"code":10010,"message":"object is not found","details":"7d97f3a11a9a6d1eaa4217fe"}
core log when trying to retrieve the job log:
2021-02-17T15:52:02Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"INTERNAL_SERVER_ERROR","message":"http error: code 404, message {\"code\":10010,\"message\":\"object is not found\",\"details\":\"7d97f3a11a9a6d1eaa4217fe\"}"}]}
admin_job table entry:
coredb=# select * from admin_job where id=105;
id | job_name | job_kind | cron_str | status | job_uuid | creation_time | update_time | deleted | revision | status_code | job_parameters
-----+----------+----------+-----------------------------------------+----------+--------------------------+---------------------+----------------------------+---------+------------+-------------+---------------------------------------------------------------------------------------------------------------------------------------
105 | IMAGE_GC | Periodic | {"type":"Custom","cron":"0 30 0 * * 3"} | finished | 7d97f3a11a9a6d1eaa4217fe | 2021-02-03 15:14:09 | 2021-02-17 00:30:15.104082 | f | 1613521708 | 3 | {"delete_untagged":true,"redis_url_reg":"redis+sentinel://rfs-harbor-redis:26379/mymaster/2?idle_timeout_seconds=30","time_window":2}
Tried to retrieve the log entry directly from the job_log table:
coredb=# select * from job_log where job_uuid='7d97f3a11a9a6d1eaa4217fe';
log_id | job_uuid | creation_time | content
--------+----------+---------------+---------
(0 rows)
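To rule out the DB logger never writing at all (as opposed to just this job’s log being missing), a quick check is whether job_log contains any rows. A minimal sketch in Go, assuming the lib/pq driver and direct access to coredb (the connection string below is a placeholder):

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres driver, assumed available
)

func main() {
	// Placeholder DSN; point it at the Harbor core database ("coredb" above).
	db, err := sql.Open("postgres", "host=localhost dbname=coredb user=postgres sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Does the DB logger persist anything at all, or is only this job's log missing?
	var total int
	var latest sql.NullTime
	err = db.QueryRow(`SELECT count(*), max(creation_time) FROM job_log`).Scan(&total, &latest)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("job_log rows: %d, most recent entry: %v\n", total, latest)
}
```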
Combing through the core & jobservice logs, I couldn’t find any error or other indication that storing the job log failed.
@dkulchinsky
We have located the root cause of this issue: the job stats data kept in Redis has an expiration time, which is hardcoded to 1 day. That means the data is gone one day later, with no way to retrieve it. However, the core service needs to query it when fetching the log; since the data has already expired, this error is raised.
The expiration time is by design; its purpose is to limit Redis storage usage, as a large amount of job data can be generated in a short time.
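If you want to confirm this on your side, you can check whether any job-stats keys for that job UUID still exist in the jobservice Redis and what TTL they carry. A minimal sketch using go-redis, assuming the sentinel endpoint shown in the GC parameters above (the DB index and key layout used by jobservice may differ in your deployment):

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/go-redis/redis/v8" // assumed client library
)

func main() {
	ctx := context.Background()

	// Sentinel endpoint taken from the GC job parameters above; the DB index
	// used by jobservice may be different in your deployment.
	rdb := redis.NewFailoverClient(&redis.FailoverOptions{
		MasterName:    "mymaster",
		SentinelAddrs: []string{"rfs-harbor-redis:26379"},
		DB:            2,
	})
	defer rdb.Close()

	// Look for any keys referencing the job UUID and report their TTLs.
	// Missing keys, or a TTL around 24h, would match the hardcoded one-day
	// expiration described above. KEYS is fine for a one-off check, but avoid
	// running it against a busy production Redis.
	keys, err := rdb.Keys(ctx, "*7d97f3a11a9a6d1eaa4217fe*").Result()
	if err != nil {
		log.Fatal(err)
	}
	if len(keys) == 0 {
		fmt.Println("no keys left for this job UUID (stats likely expired)")
	}
	for _, k := range keys {
		ttl, err := rdb.TTL(ctx, k).Result()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s -> TTL %v\n", k, ttl)
	}
}
```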
In 2.2.x (the latest patch is currently 2.2.2), we launch GC jobs in a different way, so this issue does not occur there. A possible way forward is to upgrade your Harbor to 2.2.2 with harbor helm v1.6.2.
As it’s very hard to fix in the 2.1 branch, we’d prefer not to introduce a fix into 2.1.x (the expiration is by design, and it has been fixed in 2.2.x). An upgrade is probably the best option now.
@dkulchinsky
Sorry for the late response, and thanks a lot for providing the related info. We’ll take a look and see if we can find the concrete root cause and fix it ASAP.
@wy65701436 @ywk253100 let’s have an offline discussion later.