source-controller: "SSH could not read data: Error waiting on socket" when using libgit2

https://cloud-native.slack.com/archives/CLAJ40HV3/p1625133279255100

This is reported in the logs:

{
  "level": "error",
  "ts": "2021-07-01T07:44:10.656Z",
  "logger": "controller-runtime.manager.controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "redacted",
  "namespace": "flux-system",
  "error": "unable to clone 'ssh://git@github.com/redacted/redacted.git', error: SSH could not read data: Error waiting on socket"
}

… though apparently not all the time, as

After adding [update markers], Image Automation controller started to update files for me.

Source controller reportedly manages to clone the repo (all the time?) when set to use libgit2, and changing to an RSA key didn’t stop the error messages. EDIT: no, not all the time – source-controller also fails intermittently, indicating that the problem is in the code in source-controller/pkg that source-controller and image-automation-controller both use.

About this issue

  • State: closed
  • Created 3 years ago
  • Reactions: 11
  • Comments: 108 (39 by maintainers)

Most upvoted comments

Hello Flux maintainers!

Thanks a lot for your great work!

I wanted to let you know that this issue can be closed: it is solved by the latest Flux 0.31.0 release. I have already tested that release, and it seems that I no longer experience the issue in any of my environments.

@Nosmoht occurrences of this issue tend to be intermittent and resolve themselves. We have a few changes that should be released soon, which should decrease their likelihood. Once they are merged/released I will share on this thread.

Same problem here, after upgrading to v0.27.3

Folks,

We are aware this issue is hitting people and are working on solving it upstream (hence the label). The error seems to be transient rather than permanent (and annoying, of which I am well aware).

Until things move upstream, commenting that you are running into it is not helpful if no new information is shared (even when newer releases come out). Given that the issue has not been closed, it should not be expected to have been solved (but if it magically is, please do comment).

This would greatly help reduce noise in my (and likely others') notifications. 🙏🙇

@uderik @ronald-hadrian @coover-anovaa @jakubhajek @mkoertgen @mksony would you mind trying out our experimental Managed Transport to confirm whether it fixes the issues you experienced?

This error (error waiting on socket) should be fixed by the new approach which reuses SSH connections instead of attempting to open new SSH connections. A few of our upstream dependencies have issues with concurrent SSH connections, including the golang crypto library, hence the new approach was pursued.

Please give it a test and let us know how you get on. So far the results have been quite positive and the overall reliability seems to have improved.

With the release of Flux v0.26.0, we would like to kindly ask folks with issues to update to the latest image releases. Since we changed our build process around libgit2 for the source-controller and image-automation-controller, we have observed some of the issues as described here to have vanished (from EKS at least).

Best to subscribe to this rather than more people saying it’s happening to them (it’s just notifying all of us with no benefit): https://github.com/libgit2/git2go/pull/870

@Nosmoht We are aiming to have a release done between the end of this week and beginning of next.

@uderik thanks again for sharing. I managed to reproduce it, and in my environment the changes seem to have fixed the problem. The new version has some changes we recently merged into main, which recover from git2go/libgit2 panics. Therefore, if this happens again you would see errors in the logs, but no crashes/restarts.

The PR is now updated and a new image created for source-controller: ghcr.io/fluxcd/source-controller:rc-6d517589

Please let me know how you get on.

xref: https://github.com/fluxcd/source-controller/pull/713#issuecomment-1125027191

UPDATE: replaced the image with an official source-controller release candidate.

I had the same issue and tried the new version with EXPERIMENTAL_GIT_TRANSPORT set to true, and now there are no errors for me anymore either 💪

source-controller: 0.22.5
image-automation-controller: 0.21.3
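
For readers wondering how the EXPERIMENTAL_GIT_TRANSPORT setting mentioned above might be applied, here is a minimal sketch as a Kustomize patch. It assumes a standard `flux bootstrap` layout (a `flux-system/kustomization.yaml` that lists `gotk-components.yaml` and `gotk-sync.yaml`) and that each controller Deployment already has an `env` list on its first container; neither assumption comes from this thread, so adjust to your own setup.

```yaml
# flux-system/kustomization.yaml (assumed bootstrap layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # Append the env var to both controllers that clone over SSH.
  - target:
      kind: Deployment
      name: "(source-controller|image-automation-controller)"
    patch: |
      - op: add
        path: /spec/template/spec/containers/0/env/-
        value:
          name: EXPERIMENTAL_GIT_TRANSPORT
          value: "true"
```

The value is quoted because Kubernetes env values must be strings. The setting only has an effect on controller versions that ship the experimental transport, such as the ones listed above.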

@uderik would you mind testing the versions below and confirming whether that fixes your issue?

- source-controller: 
quay.io/paulinhu/source-controller:v0.24.4-cacheless@sha256:61930cad1da900f209b396f20c2f7740ff32b5cf1bb4ab7892200790c00a5f4b

- image-automation-controller:
quay.io/paulinhu/image-automation-controller:v0.22.2-cacheless@sha256:87823667cfc4c6e395d996ceaee92a1b5059a8950884f2c3aa49488dcbed81f5

For a user with a similar issue, these images resolved the problem. This is related to the in-flight PR: https://github.com/fluxcd/source-controller/pull/713

Test images based on version https://github.com/fluxcd/source-controller/commit/830771fc0ac93eeba1ef86d4768e2cc68d236197.

Hi, no errors for almost 2 days

In my test environment, the fixes in PR #636 seem to resolve all intermittent issues with SSH when using the experimental transport. This is now merged and released to both source-controller and image-automation-controller.

For information on how to test: https://github.com/fluxcd/source-controller/issues/636#issuecomment-1080789920

Update: The fix is now released, so I updated the link above to point to the comment containing the official versions.

@aholbreich go to where your Flux repo is, find the YAML for the GitRepository resource, and under spec add gitImplementation: libgit2. That should change it.
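
As an illustration of that suggestion, a GitRepository manifest with the field set might look roughly like the sketch below. The resource name, namespace, URL, branch, and secret name are placeholders, not values from this thread, and the `apiVersion` shown is the one current around the time of the discussion; check which API version your cluster serves, since `gitImplementation` is not present in every version of the API.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: my-repo            # placeholder
  namespace: flux-system
spec:
  interval: 1m
  url: ssh://git@github.com/example-org/example-repo.git   # placeholder
  ref:
    branch: main           # placeholder
  secretRef:
    name: my-repo-ssh      # placeholder: Secret holding the SSH key and known_hosts
  gitImplementation: libgit2   # the alternative value is go-git
```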

I would like to mention that our git repo is in fact Atlassian Bitbucket.

We’re also on Bitbucket, and seeing the same issue.

Hi @Nosmoht, we are in the process of getting some improvements merged, which should fix this. Could you try this image ghcr.io/fluxcd/image-automation-controller:rc-48bcca59 and confirm whether it solves the issue for you?

@uderik @Nosmoht here’s the release candidate for source controller: ghcr.io/fluxcd/source-controller:rc-4b3e0f9a

Can you please give it a go and let me know whether it resolved your issues?

@pjbgf on two clusters (us-east-1 and ap-southeast-1) I see the same issue: many errors with transport close (potentially due to a timeout) and no updates. On another cluster (eu-central-1) I see pod restarts with this last error:

[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x16f19c0]
runtime stack:
runtime.throw({0x1da3c46, 0xc0008dc8f8})
        runtime/panic.go:1198 +0x71
runtime.sigpanic()

Maybe it has something to do with the location of the clusters; our GitLab is located in eu-central-1.

The GitRepository interval is 1min and the timeout 2min.

See the attached pod crash log: pod_crash.log

Update: log for the first issue: image-controller.log

Still encountering this issue after updating to 0.28.0

It alternates with unable to clone: transport closed

Hi guys,

I fixed this by temporarily downgrading image-automation-controller from v0.17.1 to 0.14.1, removing .git from repo url and by using 3m for git timeout. One of these changes or a combination of them made the error disappear on 2 clusters.
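
The two GitRepository-side changes described in that comment (no .git suffix on the URL and a 3m timeout) could be expressed roughly as follows. This is only a sketch of where those settings live: all names and the URL are placeholders, the `apiVersion` is an assumption about the cluster, and the comment itself notes it is unclear which of the changes actually helped.

```yaml
apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: GitRepository
metadata:
  name: my-repo            # placeholder
  namespace: flux-system
spec:
  interval: 1m
  timeout: 3m              # longer timeout for Git operations such as clone
  url: ssh://git@github.com/example-org/example-repo   # placeholder; no trailing .git
  ref:
    branch: main           # placeholder
  secretRef:
    name: my-repo-ssh      # placeholder
```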

Have created an upstream issue to ask for guidance now that our code should be close to picture perfect: https://github.com/libgit2/git2go/issues/851

Haven’t had time to test yet; will try next week.

The above changes are now available in image-automation-controller v0.16.1

ghcr.io/fluxcd/image-automation-controller:v0.16.1

Now got an image-automation-controller image as well, based on https://github.com/fluxcd/image-automation-controller/pull/239: docker.io/hiddeco/image-automation-controller:sc-git-update-9ef9856@sha256:53f3d91172b198d8b916d3aefa4dd62867b2c678fb4ca919c61974617

@rjhenry thanks a lot! 🙇🥇 Did a quick port, and the image details can be found at https://github.com/fluxcd/image-automation-controller/pull/238

Hey everyone! After upgrading FluxCD2 to the latest version I am facing the same issue 😦 I checked the gitRepository CRD, and the default gitImplementation is go-git. I’ve tried changing it to libgit2, but I still get errors:

{
  "level": "error",
  "ts": "2021-09-07T11:35:38.559Z",
  "logger": "controller-runtime.manager.controller.imageupdateautomation",
  "msg": "Reconciler error",
  "reconciler group": "image.toolkit.fluxcd.io",
  "reconciler kind": "ImageUpdateAutomation",
  "name": "image-update-automation-myrepo",
  "namespace": "test-ns",
  "error": "unable to clone ssh://git@github.com/repo/myrepo, error: SSH could not read data: Error waiting on socket"
}

Any thoughts on how to fix this? Thanks 🙏

For the record, this occurs with both the libgit2 and go-git gitImplementations on the gitRepository resource; I’ve switched back to go-git as the source controller seemed much happier with that.