distribution: docker push fails with 500 Internal Server Error: s3aws Path not found for /data files

$ docker version
Client:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Tue Apr 26 23:44:17 2016
 OS/Arch:      windows/amd64

Server:
 Version:      1.11.1
 API version:  1.23
 Go version:   go1.5.4
 Git commit:   5604cbe
 Built:        Wed Apr 27 00:34:20 2016
 OS/Arch:      linux/amd64

$ docker info
Containers: 1
 Running: 1
 Paused: 0
 Stopped: 0
Images: 3
Server Version: 1.11.1
Storage Driver: aufs
 Root Dir: /mnt/sda1/var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 20
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 4.4.8-boot2docker
Operating System: Boot2Docker 1.11.1 (TCL 7.0); HEAD : 7954f54 - Wed Apr 27 16:36:45 UTC 2016
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 995.9 MiB
Name: default
ID: TGKQ:DJSI:CGK6:7W7K:ZY45:CI7A:ALAL:JCTO:CLU2:557X:CZ2F:TXFN
Docker Root Dir: /mnt/sda1/var/lib/docker
Debug mode (client): false
Debug mode (server): true
 File Descriptors: 99
 Goroutines: 160
 System Time: 2016-09-05T15:09:27.171017908Z
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
 provider=virtualbox

$ docker exec registrytestdist registry -v
registry github.com/docker/distribution 49c1a62

Command used to run the registry:

$ docker run -d -p 5000:5000 --name registrytestdist -v $(pwd)/config.yml:/etc/docker/registry/config.yml distribution/registry:master

$ cat config.yml
version: 0.1
log:
  level: debug
  fields:
    service: registry
storage:
  s3:
    bucket: fimregistry2
    accesskey: ACCESSKEY
    secretkey: SECRETKEY
    region: NOTAPPLICABLE  # I'm using an S3-compatible storage
    regionendpoint: http://S3ENDPOINT  # the endpoint is vdctest.os-eu-mad-1.instantservers.telefonica.com, noted just for referencing the logs
    secure: false
    v4auth: false
http:
  addr: :5000
  headers:
    X-Content-Type-Options: [nosniff]

When pushing to the registry I get a 500 Internal Server Error after a series of retries.

The storage backend is S3-compatible (v2auth).

$ docker push localhost:5000/mynginx
The push refers to a repository [localhost:5000/mynginx]
69ecf026ff94: Pushing [==================================================>] 3.584 kB
d7953e5e5bba: Retrying in 1 second
2f71b45e4e25: Retrying in 1 second
received unexpected HTTP status: 500 Internal Server Error

Looking at the logs for the 500 error, I find a series of lines (these are the only lines with http.response.status=500), all identical except for a different _uploads/<UUID>/data path:

$ docker logs registrytestdist | grep -E 'http.response.status=500'
… (showing just one of the lines)
time="2016-09-05T14:37:47.322255381Z" level=error msg="response completed with error" err.code=unknown err.detail="s3aws: Path not found: /docker/registry/v2/repositories/mynginx/_uploads/d370ec21-002f-498c-8846-f662e685a802/data" err.message="unknown error" go.version=go1.6.3 http.request.host="localhost:5000" http.request.id=cdd506bc-bad4-458b-9927-68f1141aba52 http.request.method=PATCH http.request.remoteaddr="172.17.0.1:60282" http.request.uri="/v2/mynginx/blobs/uploads/d370ec21-002f-498c-8846-f662e685a802?_state=eLoVFasLp7JeVkFrgL5rdze-bGOplIZs1FWnYU4cztt7Ik5hbWUiOiJteW5naW54IiwiVVVJRCI6ImQzNzBlYzIxLTAwMmYtNDk4Yy04ODQ2LWY2NjJlNjg1YTgwMiIsIk9mZnNldCI6MCwiU3RhcnRlZEF0IjoiMjAxNi0wOS0wNVQxNDozNzo0Ny4wMDIwOTQ2M1oifQ%3D%3D" http.request.useragent="docker/1.11.1 go/go1.5.4 git-commit/5604cbe kernel/4.4.8-boot2docker os/linux arch/amd64 UpstreamClient(Docker-Client/1.11.1 (windows))" http.response.contenttype="application/json; charset=utf-8" http.response.duration=33.558892ms http.response.status=500 http.response.written=191 instance.id=ac7b315f-71ee-4bad-8019-3e53c5b3dc03 service=registry vars.name=mynginx vars.uuid=d370ec21-002f-498c-8846-f662e685a802 version=49c1a62
…

As I could check in the S3-compatible storage (based on Caringo Swarm), these lines correspond to multipart uploads that appear to be incomplete (a one-to-one correspondence).

List of multipart parts in the object storage:
… (showing just one of the items)
{"content_type":"application/caringo-multipart-id", "name":"f13393c026932a573587fdc4a5cd0480", "x_multipart_object_meta":"docker/registry/v2/repositories/mynginx/_uploads/d370ec21-002f-498c-8846-f662e685a802/data", "hash":"3138f83abc20715e948f2b2c669d2f7a", "last_modified":"2016-09-05T14:37:47.055100Z", "x_multipart_content_type_meta":"application/octet-stream"},
…

Any help will be welcome. Thanks a lot. Best

About this issue

  • Original URL
  • State: open
  • Created 8 years ago
  • Comments: 26 (7 by maintainers)

Most upvoted comments

In my case this error was caused by setting the regionendpoint to the bucket subdomain hostname rather than the region hostname.

regionendpoint: "https://s3.us-east-2.amazonaws.com" ✔️
regionendpoint: "https://my-bucket.s3.amazonaws.com" ❌

I got this error too (v2.7.1). After hours of digging around, I think I have found the root cause. As @LeoQuote said, it's all about the consistency model.

AWS S3 used to be eventually consistent for PutObject, GetObject, DeleteObjects, and ListObjects (which the s3 driver uses), meaning that sometimes you can't read an object right after it's been created. This isn't a problem anymore, because AWS S3 now delivers strong read-after-write and list consistency (https://aws.amazon.com/cn/s3/consistency/).

But I'm using another S3-compatible service that doesn't guarantee read-after-write consistency for ListObjects, which is used by #Stat and #Move in the S3 driver. Those are called by #moveBlob, and that's what causes this line to return an error (message: unknown error completing upload: %v).

I think that at the time the S3 driver was written we didn't have strong consistency, so the driver had to deal with this type of error. Even now, there are still a bunch of S3-compatible services that don't deliver read-after-write consistency for list operations.

@LeoQuote has proposed a PR (https://github.com/distribution/distribution/pull/3278) that adds retry logic to work around this problem; alternatively, we could limit the use of ListObjects in some methods like #Stat (not sure whether that's feasible).
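Not the code from that PR, just a minimal self-contained sketch of the retry idea; retryOnNotFound and errPathNotFound are hypothetical names standing in for the driver's real helpers and its path-not-found error:

package main

import (
	"errors"
	"fmt"
	"time"
)

// errPathNotFound stands in for the "s3aws: Path not found" error that the
// S3 driver surfaces when a freshly written key is missing from an
// eventually consistent ListObjects response.
var errPathNotFound = errors.New("s3aws: path not found")

// retryOnNotFound re-runs op a few times with a short delay when it fails
// with a not-found error, giving the backend time to become consistent.
func retryOnNotFound(op func() error, attempts int, delay time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(); err == nil || !errors.Is(err, errPathNotFound) {
			return err
		}
		time.Sleep(delay)
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	// Example: wrap the Stat-style lookup that fails right after an upload.
	err := retryOnNotFound(func() error { return errPathNotFound }, 3, 500*time.Millisecond)
	fmt.Println(err)
}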

For now I think adding retries is the most practical way to fix this, but do we have better ideas? I'm happy to write the fix because I'm very annoyed by this error 😭 (it affects my CI/CD pipeline).

Hi @sergeyfd, I am glad that worked for you! I am currently in the process of releasing our example script, but it is being delayed because we must follow some legal and quality procedures before making it public. I’ll share the GitHub URL once it is ready!

Regards.

We have exactly the same issue with the same type of storage. What's interesting is that it used to work all right for a couple of months and started to break just a couple of days ago.

Sorry for the delay; we are sharing our current cleanup application, written in Go, here:

https://github.com/adidas/s3-upload-cleaner

@airadier Could you share the script?

Finally we got to the root cause of our issue. Our S3 storage backend was a Dell EMC Elastic Cloud Storage, so not official AWS S3. Apparently the Docker registry's S3 backend starts a multipart upload via the S3 API (using the S3 SDK library), and once the upload is finished it calls the CompleteMultipartUpload method of the API. However, if the push is interrupted, the multipart upload still exists and is never aborted. There is even a TODO comment in the code about this.

With our storage backend, it looks like having thousands of incomplete multipart uploads was causing misbehaviour when the ListMultipartUploads method of the S3 API was called: it would randomly reply with an empty list where there should have been a multipart upload created just a few milliseconds earlier. This is what caused the "path not found" error.

The problem also exists on official AWS S3, since you get charged for existing, unfinished multipart uploads unless you configure a cleanup rule (see https://aws.amazon.com/blogs/aws/s3-lifecycle-management-update-support-for-multipart-uploads-and-delete-markers/).
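On AWS, that cleanup rule is a bucket lifecycle rule that aborts incomplete multipart uploads. A minimal aws-sdk-go sketch, assuming the fimregistry2 bucket from this issue and an arbitrary 7-day window:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Region and credentials come from the environment or shared AWS config.
	svc := s3.New(session.Must(session.NewSession()))

	// Abort any multipart upload still incomplete 7 days after it was
	// initiated. The rule ID and the 7-day window are arbitrary examples.
	_, err := svc.PutBucketLifecycleConfiguration(&s3.PutBucketLifecycleConfigurationInput{
		Bucket: aws.String("fimregistry2"), // placeholder bucket name from this issue
		LifecycleConfiguration: &s3.BucketLifecycleConfiguration{
			Rules: []*s3.LifecycleRule{{
				ID:     aws.String("abort-incomplete-multipart-uploads"),
				Status: aws.String("Enabled"),
				Filter: &s3.LifecycleRuleFilter{Prefix: aws.String("")},
				AbortIncompleteMultipartUpload: &s3.AbortIncompleteMultipartUpload{
					DaysAfterInitiation: aws.Int64(7),
				},
			}},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}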

Our solution was a cleanup script, run multiple times, that lists the existing multipart uploads for every path (or prefix, really) and aborts them. After several runs with erratic behaviour and inconsistent responses, more than 5000 incomplete multipart uploads were cleaned up, and the S3 API of our backend started to work correctly. We now run the cleanup script nightly and have seen no errors so far.
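For reference, a minimal aws-sdk-go sketch of that kind of cleanup (not the code of the tool linked above); the bucket and prefix are taken from the example config and logs in this issue and would need to be adjusted:

package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	// Placeholder values; point these at your registry's bucket and layout.
	bucket := "fimregistry2"
	prefix := "docker/registry/v2/repositories/"

	// Region, endpoint and credentials come from the environment or shared config.
	svc := s3.New(session.Must(session.NewSession()))

	in := &s3.ListMultipartUploadsInput{
		Bucket: aws.String(bucket),
		Prefix: aws.String(prefix),
	}
	for {
		out, err := svc.ListMultipartUploads(in)
		if err != nil {
			log.Fatal(err)
		}
		for _, u := range out.Uploads {
			// Abort every incomplete multipart upload under the prefix.
			_, err := svc.AbortMultipartUpload(&s3.AbortMultipartUploadInput{
				Bucket:   aws.String(bucket),
				Key:      u.Key,
				UploadId: u.UploadId,
			})
			if err != nil {
				log.Printf("abort %s (%s): %v", *u.Key, *u.UploadId, err)
				continue
			}
			log.Printf("aborted %s (%s)", *u.Key, *u.UploadId)
		}
		if out.IsTruncated == nil || !*out.IsTruncated {
			break
		}
		// Continue from where the previous page left off.
		in.KeyMarker = out.NextKeyMarker
		in.UploadIdMarker = out.NextUploadIdMarker
	}
}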