harbor: harbor-core sporadically throws HTTP 5xx - database performance related

We are running Harbor v2.2.1, using AWS RDS as the database. Some specs for our Harbor deployment:

  • 300 Repositories
  • 800 artifacts per Repository (approximately)

We are observing high database usage generated by Harbor. We enabled PostgreSQL slow query logging to gather some insights on the database usage. The following queries are reported with durations over 10 seconds:

| 1622621735000 | 2021-06-02 08:15:35 UTC:10.1.15.1(1331):harbor@core_db:[21602]:LOG:  duration: 13985.656 ms  execute <unnamed>: SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE    | mgmt-harbor-serverless-prod.1 |
| 1622621735000 | 2021-06-02 08:15:35 UTC:10.1.17.254(33891):harbor@core_db:[21663]:LOG:  duration: 14040.258 ms  execute <unnamed>: SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE | mgmt-harbor-serverless-prod.1 |
| 1622621735000 | 2021-06-02 08:15:35 UTC:10.1.11.177(40627):harbor@core_db:[21719]:LOG:  duration: 13736.308 ms  execute <unnamed>: SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE | mgmt-harbor-serverless-prod.1 |

These are FOR UPDATE queries, which cause the rows retrieved by the SELECT statement to be locked as though for update.

The database locking graph confirms the locking issue, showing 400+ locked tables:

[Screenshot: database locking graph, 2021-06-02 12:23 PM]
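For anyone digging into a similar situation, a query along the following lines shows which sessions are waiting on a lock and which sessions are holding it, which is usually more telling than the raw lock count. This is only a sketch using plain PostgreSQL catalog views (pg_stat_activity and pg_blocking_pids(), available since PostgreSQL 9.6), nothing Harbor-specific:

-- Sessions currently waiting on a lock, paired with the sessions blocking them
SELECT blocked.pid                 AS blocked_pid,
       left(blocked.query, 80)     AS blocked_query,
       blocking.pid                AS blocking_pid,
       left(blocking.query, 80)    AS blocking_query,
       now() - blocked.query_start AS blocked_for
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
ORDER BY blocked_for DESC;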

When this happens, Harbor sporadically throws HTTP 5xx errors on PUT requests, so image uploads fail for clients:

"PUT /v2/beat/[...] HTTP/1.1" 500 66 "-" "  8516 13.609 [harbor-harbor-core-http] [] 100.97.80.52:8080 66 13.612 500 90bf9ce9edddd7d33ccd7014dae4a47e
"PUT /v2/beat/[...] HTTP/1.1" 500 66 "-" " 2269 6.295 [harbor-harbor-core-http] [] 100.97.81.139:8080 66 6.296 500 9fa9d68459840391c48486e08061fbed
"PUT /v2/beat/[...] HTTP/1.1" 500 66 "-" "[...] 561 30.762 [harbor-harbor-core-http] [] 100.97.81.139:8080 66 30.760 500 2cc59940a45cec7fd5b372fe74c88f5c
"PUT /v2/beat/[...] /manifests/[...] HTTP/1.1" 500 66 "-" [...] " 3309 6.366 [harbor-harbor-core-http] [] 100.97.81.40:8080 66 6.364 500
"PUT /v2/beat/[...] [...]blobs/uploads/cb4b12f8-98f8-43a1-9e27-04bfe1d46f80? HTTP/1.1" 500 66 "-" "[...] 1584 23.598 [harbor-harbor-core-http] [] 100.97.80.52:8080 66 23.600 500

At the same time, Harbor throws errors that suggest issues with the database:

[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /api/v2.0/projects/beat/repositories/[...]/artifacts request in transaction failed: driver: bad connection"}]} 
[ERROR] [/lib/orm/orm.go:86]: commit transaction failed: driver: bad connection
[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /api/v2.0/projects/beat/repositories/[...]/artifacts request in transaction failed: driver: bad connection"}]} 
[ERROR] [/lib/orm/orm.go:86]: commit transaction failed: driver: bad connection
[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /v2/beat/[....]/manifests/[...] request in transaction failed: driver: bad connection"}]} 
[ERROR] [/lib/orm/orm.go:78]: rollback transaction failed: driver: bad connection
[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: driver: bad connection"}]} 
[ERROR] [/server/middleware/blob/put_manifest.go:63][middleware="blob" requestID="07fa6cc169437ec9272eebdab971e216"]: get project failed, error: driver: bad connection
[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: deal with /api/v2.0/projects/beat/repositories/[...]/artifacts request in transaction failed: driver: bad connection"}]} 
[ERROR] [/lib/orm/orm.go:86]: commit transaction failed: driver: bad connection
[ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: driver: bad connection"}]} 
[ERROR] [/lib/orm/orm.go:78]: rollback transaction failed: driver: bad connection
[ERROR] [/controller/quota/controller.go:299][requestID="c5be4a76ee93e8f79beaf362a309f1de"]: failed to calculate quota usage for project 37, error: driver: bad connection

The database specs have been upgraded to cope with this. The instance currently runs with 16 vCPUs and 122 GB of RAM, dedicated to Harbor.

Could this suggest an issue with the way Harbor manages the database?

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 10
  • Comments: 27 (11 by maintainers)

Most upvoted comments

Hey @heww, thanks for your input.

According to your logs, can you confirm whether you had push actions going on while deleting all untagged artifacts?

During the cleanup performed yesterday, which lasted 337 minutes, there were push actions being performed. But in today's specific logs that we've shared (2021-06-08 07:28:25 UTC), there was no cleanup action ongoing.

Could you provide the count of the records in the quota table?

core_db=> select COUNT(*) from quota;
 count
-------
     7
(1 row)

Taking about 13985.656 ms to execute a SELECT ... FOR UPDATE statement is not normal. Can you confirm that there isn't too much pushing to your Harbor instance?

Nginx-Ingress HTTP PUT requests forwarded to Harbor, last 24h: [Screenshot: 2021-06-08 2:33 PM]

Nginx-Ingress HTTP PUT requests forwarded to Harbor, last 7d: [Screenshot: 2021-06-08 2:33 PM]

How much CPU and memory is Postgres using when you hit this problem?

The problem is observed when we get more than 250-300 table locks, which happens when Harbor runs SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE on the database:

[Screenshot: database lock count graph, 2021-06-08 2:15 PM]

DB CPU/memory usage in the same time window as the above graph seems nominal: [Screenshot: 2021-06-08 2:15 PM]

This is a 16 vCPU / 122 GB RAM RDS DB instance, dedicated to Harbor.
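One standard PostgreSQL setting that can make episodes like this easier to attribute (not Harbor-specific, and on RDS it has to be changed through the DB parameter group rather than ALTER SYSTEM) is log_lock_waits: when enabled, any session that waits on a lock for longer than deadlock_timeout (1 second by default) writes a log line naming the process that is blocking it.

SHOW log_lock_waits;    -- off by default
SHOW deadlock_timeout;  -- lock waits longer than this get logged once log_lock_waits is on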


Just to make sure we are on the same page, the locking issue is now mitigated with https://github.com/goharbor/harbor/issues/15048#issuecomment-855767153 as proposed.

We now have a new issue concerning the database, with duplicate keys (unique_artifact), which seems to have come up after the cleanup of untagged artifacts. Would you advise opening a new issue about this?

According to the logs, the images beat/repo_name:phobos and beat/repo_name:deimos have the same digest sha256:e1aee8bb773a4615023387b261d1e93feef5e6ed2d903004813032a01464bad1

When pushing images with the same digest into the same repository, it may hit a bug.

@phoinixgrr Thanks for your input. I’ll create a PR to fix it

@heww
Calling the API endpoint /api/internal/switchquota actually fixed our extreme locking! Awesome. It seems to be working as a workaround in our case. Question: should this indicate an issue with the way Harbor manages the database?


Since then, we've performed a cleanup on our largest project by deleting all untagged artifacts (using a Tag Retention rule), and a new kind of HTTP 500 error is now thrown when pushes happen. We've managed to drill down through the logs to identify this new issue.

  1. The Docker client gets an HTTP 500 Internal Server Error from Harbor when pushing: Jun 8 07:28:25 ip-10-30-2-156 dockerd[541]: level=info msg="Attempting next endpoint for push after error: received unexpected HTTP status: 500 Internal Server Error"

  2. Nginx-ingress confirms the 500 thrown by Harbor:

2021-06-08 07:28:25	192.168.5.196 - - [08/Jun/2021:07:28:25 +0000] "PUT /v2/beat/[....] HTTP/1.1" 500 66 "-" "docker/[...]  git-commit/[...]-aws os/linux arch/amd64 UpstreamClient(Docker-Client[...])" 6844 0.460 [harbor-harbor-core-http] [] 100.97.81.117:8080 66 0.460 500 95ef3bdd3ff37311192eb8eea8bf9927
  3. Harbor logs suggest database-related issues: 2021-06-08T07:28:25Z [ERROR] [/lib/http/error.go:54]: {"errors":[{"code":"UNKNOWN","message":"unknown: pq: current transaction is aborted, commands ignored until end of transaction block"}]}

  4. PostgreSQL logs show duplicate key / unique constraint violations:

2021-06-08 07:28:25 UTC:10.1.20.13(16391):harbor@core_db:[2307]:ERROR: duplicate key value violates unique constraint "unique_artifact"
2021-06-08 07:28:25 UTC:10.1.20.13(16391):harbor@core_db:[2307]:DETAIL:  Key (repository_id, digest)=(311, sha256:1c45e91bc8eefd723d37e3ea7def56de111d617349f47abc89181243ba060c76) already exists.
2021-06-08 07:28:25 UTC:10.1.20.13(16391):harbor@core_db:[2307]:STATEMENT:  INSERT INTO "artifact" ("type", "media_type", "manifest_media_type", "project_id", "repository_id", "repository_name", "digest", "size", "icon", "push_time", "pull_time", "extra_attrs", "annotations") VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13) RETURNING "id"
2021-06-08 07:28:25 UTC:10.1.20.13(16391):harbor@core_db:[2307]:ERROR: current transaction is aborted, commands ignored until end of transaction block

In conclusion, we've managed to work around our main issue concerning the extreme database locking. Unfortunately, it seems the cleanup performed on the untagged artifacts has introduced new database issues related to duplicate keys (unique_artifact).
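(In case it helps to confirm the database side of this: the conflicting row can be looked up directly with the key from the DETAIL line in the PostgreSQL log above. This is just a hand-written check, not something Harbor runs itself; the column names are taken from the INSERT statement in that log.)

SELECT "id", "repository_name", "digest", "push_time"
FROM "artifact"
WHERE "repository_id" = 311
  AND "digest" = 'sha256:1c45e91bc8eefd723d37e3ea7def56de111d617349f47abc89181243ba060c76';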

@phoinixgrr any news?

I just remembered that I ran into similar trouble when there were not enough IOPS, because of my SATA disks.

@paulliss There is no observed iowait on any of the subsystems, neither on Harbor itself nor on the RDS database. Storage is performant.

Other than that, things are not looking good. When this extreme locking on the database occurs, Harbor starts throwing 5xx errors on image pushes. The following query has been identified as producing the locking:

SELECT "id", "reference", "reference_id", "hard", "creation_time", "update_time" FROM "quota" WHERE "reference" = $1 AND "reference_id" = $2 FOR UPDATE 

Hello, if you do not use the quota feature, you can disable it via the API to mitigate your errors; see the Swagger API.

@aladjadj Quotas are reported as disabled (-1) from the API:

[
  {
    "id": 100,
    "ref": {
      "id": 464,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-03-19T10:14:01.046001Z",
    "update_time": "2021-03-19T10:21:04.23482Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 5607868043
    }
  },
  {
    "id": 67,
    "ref": {
      "id": 167,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-03-03T11:03:59.948701Z",
    "update_time": "2021-03-24T09:33:17.246179Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 80959601
    }
  },
  {
    "id": 45,
    "ref": {
      "id": 56,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-01-27T14:10:31.390346Z",
    "update_time": "2021-03-17T09:22:06.392402Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 998161523
    }
  },
  {
    "id": 44,
    "ref": {
      "id": 55,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-01-27T14:09:10.222079Z",
    "update_time": "2021-03-17T09:22:06.365389Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 0
    }
  },
  {
    "id": 41,
    "ref": {
      "id": 43,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-01-20T11:05:34.104598Z",
    "update_time": "2021-03-17T09:22:06.349697Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 53979915
    }
  },
  {
    "id": 37,
    "ref": {
      "id": 37,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-01-19T12:27:09.413684Z",
    "update_time": "2021-06-07T07:57:46.678234Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 4569325322201
    }
  },
  {
    "id": 36,
    "ref": {
      "id": 36,
      "name": "------",
      "owner_name": "admin"
    },
    "creation_time": "2021-01-19T12:01:19.862108Z",
    "update_time": "2021-04-20T06:46:28.082002Z",
    "hard": {
      "storage": -1
    },
    "used": {
      "storage": 119763371
    }
  }
]
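(For completeness, the same limits can be cross-checked straight from the quota table; the columns below are the ones from the FOR UPDATE statement quoted earlier, and the hard values should show the same -1 storage limits that the API reports.)

SELECT "id", "reference", "reference_id", "hard" FROM "quota";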
