distribution: Swift eventual consistency problems?

Hey @ojacques @nevermosby, thanks a lot for integrating Swift support into this project. We’re running on Rackspace and using Cloud Files (Rackspace’s installation of Swift) as our storage driver.

When pushing layers, we often notice errors like:

» docker push [image_name]
The push refers to a repository [image_name] (len: 1)
6db1d48e9f8d: Image successfully pushed
6165b10a9c5b: Image successfully pushed
4c9dbc176128: Image successfully pushed
1a0202a44aea: Image successfully pushed
574ba1962306: Image successfully pushed
b45b44f4dace: Image successfully pushed
a2173b937aaf: Pushing [==================================================>] 4.028 MB/4.028 MB
digest invalid: provided digest did not match uploaded content

Retrying the docker push a few seconds later works. This error is so common that I was able to reproduce it on my first attempt, and most 10-layer images take 2-3 tries before they are successfully pushed.

It looks like there previously was a hack to retry in layerwriter.go for this purpose, but that was removed in a refactor. Since then, retry logic has crept back into the S3 driver, albeit for a different reason (slow S3 connections).

Anyway, I bring this up to ask whether you guys would consider adding similar retry logic, if you buy that the problem is Swift’s eventual consistency. Do you guys see this problem at HP?

I can bumble around in Go myself if you guys think this is worthwhile, but I feel it would be faster if you guys are down to build it.

Anyhow, cheers and thanks again for adding Swift integration to this project.

cc @sds @z3usy

About this issue

  • State: closed
  • Created 9 years ago
  • Comments: 69 (22 by maintainers)

Most upvoted comments

To everyone seeing infrequent errors that might be related to eventual consistency: you might want to give the patches I produced over the last few weeks a try (#1578, #1605 and #1650). I applied them on top of the 2.4.0 release to use in our own production environment here at SAP, and pushed the image to the Docker Hub as majewsky/registry:2.4.0.p2. Feel free to docker pull it and check it out. I will keep that image updated until all patches are merged into an upstream release.

The only error I’m seeing in our production environment with this version is blob upload unknown during docker push. Debug traces indicate that in this particular case the following happens: a layer is uploaded, but not all Swift nodes have the uploaded layer immediately. The registry checks with Swift whether the upload is complete, and one of the storage nodes replies “yes”. But when it later tries to access that layer, another storage node replies “not found”.
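For illustration only (this is not the code from the patches above): server-side tolerance for this race essentially boils down to retrying a not-found result for a short while before giving up. A minimal Go sketch, where the error value, helper name, attempt count and backoff are placeholders rather than anything taken from the registry:

package main

import (
    "errors"
    "fmt"
    "time"
)

// errNotFound stands in for whatever "path not found" error the storage
// driver returns; the name is a placeholder for this sketch.
var errNotFound = errors.New("not found")

// retryOnNotFound re-runs op a few times when it fails with a not-found
// error, giving eventually consistent Swift replicas a chance to catch up.
func retryOnNotFound(op func() error, attempts int, delay time.Duration) error {
    var err error
    for i := 0; i < attempts; i++ {
        if err = op(); err == nil || !errors.Is(err, errNotFound) {
            return err
        }
        time.Sleep(delay)
        delay *= 2 // back off a little longer before each further attempt
    }
    return fmt.Errorf("object still not visible after %d attempts: %w", attempts, err)
}

func main() {
    // Hypothetical usage: wrap the call that checks whether the just-uploaded
    // layer is visible before the registry reports the upload as complete.
    err := retryOnNotFound(func() error { return errNotFound }, 3, 100*time.Millisecond)
    fmt.Println(err)
}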

I don’t have a good idea just yet how to thoroughly fix this (except by finishing up #1229), but I’m confident that in this particular case, a simple re-run of docker push should pick up from where it left off (the layer was fully uploaded after all), and hopefully go through.

After much time spent debugging the “blob unknown” error: the problem is particularly visible with the Docker 1.9.1 client, which does not implement any retry, unlike clients 1.10 and up. So when the registry with the Swift driver (here 2.5.0, but the bug is also visible on 2.5.1 and 2.6.0-rc1) misses a layer and returns an HTTP 404, the client also gets a 404. You can mitigate the problem by putting an nginx proxy in front that catches the 404 errors and retries; I have successfully implemented that solution to compensate for the missing client-side retry. The best fix, as mentioned earlier in this thread, would still be to implement a number of retries at the server level. But at least now you have all the ingredients to properly reproduce the bug.
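The nginx configuration itself is not shown in this thread; as an illustration of the same retry-on-404 idea, here is a minimal reverse-proxy sketch in Go. The upstream address, retry count and delay are assumptions, and only idempotent GET/HEAD requests are handled:

package main

import (
    "io"
    "log"
    "net/http"
    "time"
)

// upstream is the registry this sketch fronts; the address is a placeholder.
const upstream = "http://registry:5000"

// retryingHandler forwards idempotent requests to the registry and retries
// upstream 404 responses a couple of times, mirroring the nginx workaround
// described above for eventual-consistency hiccups.
func retryingHandler(w http.ResponseWriter, r *http.Request) {
    if r.Method != http.MethodGet && r.Method != http.MethodHead {
        http.Error(w, "only GET/HEAD are proxied in this sketch", http.StatusNotImplemented)
        return
    }

    const attempts = 3
    var resp *http.Response
    for i := 0; i < attempts; i++ {
        req, err := http.NewRequest(r.Method, upstream+r.RequestURI, nil)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        req.Header = r.Header.Clone()

        resp, err = http.DefaultClient.Do(req)
        if err != nil {
            http.Error(w, err.Error(), http.StatusBadGateway)
            return
        }
        if resp.StatusCode != http.StatusNotFound || i == attempts-1 {
            break
        }
        resp.Body.Close()
        time.Sleep(500 * time.Millisecond) // give the Swift replicas time to converge
    }

    // Relay the final upstream response to the client.
    for k, vs := range resp.Header {
        for _, v := range vs {
            w.Header().Add(k, v)
        }
    }
    w.WriteHeader(resp.StatusCode)
    io.Copy(w, resp.Body)
    resp.Body.Close()
}

func main() {
    http.HandleFunc("/", retryingHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}

A real deployment would of course also have to pass the non-idempotent upload requests through untouched.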

I think the pragmatic solution is my suggestion above:

One thing which would help would be to create single objects rather than DLOs for small uploads. Swift can have objects up to 5GB. Assuming I’m reading the code correctly, it seems to segment objects into 20MB chunks and store everything as a DLO (sketched below).

That will work very well for all objects < 5GB which is nearly all of them I think.

This isn’t perfect but it will make the situation loads better. It is also 100% backwards compatible.

Dynamic Large Objects and Static Large Objects are a mess in Swift if you ask me… They rely on consistent directory listings and on the user making sure all the parts are present, so they are best avoided IMHO.
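A sketch of the single-object-versus-DLO decision suggested above; the objectStore interface, the names and the demo values are hypothetical and not the actual storage driver API:

package main

import (
    "fmt"
    "io"
    "strings"
)

// objectStore is a hypothetical abstraction over the Swift calls the driver
// would make; it is not the real storage driver interface.
type objectStore interface {
    PutObject(name string, r io.Reader) error                     // one plain PUT
    PutSegmented(name string, r io.Reader, chunkSize int64) error // DLO-style upload
}

const (
    singleObjectLimit = 5 << 30  // Swift's 5 GB ceiling for a single object
    segmentSize       = 20 << 20 // the 20 MB chunk size mentioned above
)

// writeBlob stores blobs below the 5 GB limit as plain objects and only falls
// back to a segmented (DLO) upload for larger or unknown-size payloads, so
// small layers never depend on eventually consistent segment listings.
func writeBlob(store objectStore, name string, r io.Reader, size int64) error {
    if size >= 0 && size <= singleObjectLimit {
        return store.PutObject(name, r)
    }
    return store.PutSegmented(name, r, segmentSize)
}

// fakeStore just prints what it would do, so the sketch is runnable.
type fakeStore struct{}

func (fakeStore) PutObject(name string, r io.Reader) error {
    fmt.Println("single PUT:", name)
    return nil
}

func (fakeStore) PutSegmented(name string, r io.Reader, chunkSize int64) error {
    fmt.Printf("segmented PUT: %s (%d byte chunks)\n", name, chunkSize)
    return nil
}

func main() {
    blob := strings.NewReader("layer data")
    _ = writeBlob(fakeStore{}, "blobs/sha256/ab/abc.../data", blob, blob.Size())
}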

@majewsky and @ncw Thanks a lot for the info.

Yesterday I got a chance to have a look at the Swift backend logs to see whether there were any errors when the push was happening. There are no errors on the Swift backend. The issue I found for the zero-byte layer is that it doesn’t have a corresponding segment: I checked the Swift logs and found that Swift never received a PUT request for that layer’s segment.
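To check for that condition yourself, one option is to HEAD the layer’s manifest object directly against the Swift API: for a DLO manifest, Swift reports the combined size of all segments as the Content-Length, so a zero length means the segment PUTs never made it. A rough sketch, where the storage URL, token, container and object path are placeholders for your own installation:

package main

import (
    "fmt"
    "log"
    "net/http"
    "os"
)

func main() {
    // Placeholders: adjust the storage URL, token and object path to your setup.
    storageURL := os.Getenv("SWIFT_STORAGE_URL") // e.g. https://swift.example.com/v1/AUTH_account
    token := os.Getenv("SWIFT_TOKEN")
    object := "docker-registry/files/docker/registry/v2/blobs/sha256/ab/abc.../data"

    req, err := http.NewRequest(http.MethodHead, storageURL+"/"+object, nil)
    if err != nil {
        log.Fatal(err)
    }
    req.Header.Set("X-Auth-Token", token)

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    resp.Body.Close()

    // A DLO manifest carries an X-Object-Manifest header; its Content-Length
    // is the total size of the segments it points to.
    fmt.Printf("status=%d manifest=%q content-length=%d\n",
        resp.StatusCode, resp.Header.Get("X-Object-Manifest"), resp.ContentLength)
    if resp.StatusCode == http.StatusOK && resp.ContentLength == 0 {
        fmt.Println("layer looks empty: its segment PUTs probably never arrived")
    }
}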

Hello,

We also faced a similar kind of issue. Pushing a Docker image is successful, but when you pull the same image you get “filesystem verification failed” for a layer. We checked the Swift backend and found that the layer which failed is zero bytes. Pushing the same image again says the layer is already present, but if you delete the layer from the Swift backend and then try to push the image again, it succeeds. I am using a load balancer, registry 2.5, Redis and Swift. It would be great if someone has a fix for this issue.