source-controller: "SSH could not read data: Error waiting on socket" when using libgit2
https://cloud-native.slack.com/archives/CLAJ40HV3/p1625133279255100
This is reported in the logs:
{
"level": "error",
"ts": "2021-07-01T07:44:10.656Z",
"logger": "controller-runtime.manager.controller.imageupdateautomation",
"msg": "Reconciler error",
"reconciler group": "image.toolkit.fluxcd.io",
"reconciler kind": "ImageUpdateAutomation",
"name": "redacted",
"namespace": "flux-system",
"error": "unable to clone 'ssh://git@github.com/redacted/redacted.git', error: SSH could not read data: Error waiting on socket"
}
… though apparently not all the time, as
After adding [update markers], Image Automation controller started to update files for me.
Source controller reportedly manages to clone the repo (all the time?) when set to use libgit2, and changing to an RSA key didn’t stop the error messages. EDIT: no, not all the time – source-controller also fails intermittently, indicating that the problem is in the code in source-controller/pkg that source-controller and image-automation-controller both use.
About this issue
- State: closed
- Created 3 years ago
- Reactions: 11
- Comments: 108 (39 by maintainers)
Hello Flux maintainers,
Thanks a lot for your great work!
I wanted to let you know that this issue can be closed; it is solved with the latest Flux 0.31.0 release. I have tested the latest release and it seems that I no longer experience the issue in any of my environments.
@Nosmoht occurrences of this issue tend to be intermittent and resolve themselves. We have a few changes that should be released soon, which should decrease their likelihood. Once they are merged/released I will share on this thread.
Same problem here, after upgrading to v0.27.3
Folks,
We are aware this issue is hitting people and are working and attempting to solve it upstream (hence the label). The error seems to not be permanent, but rather transient (and annoying, of which I am much aware).
Until things move upstream, commenting that you are running into it is not helpful if no new information is shared (even if newer releases happen). Given that the issue has not been closed, it is not expected to have been solved (but if it magically is, please do comment).
This would greatly help reduce noise in my (and likely others') notifications. 🙏🙇
@uderik @ronald-hadrian @coover-anovaa @jakubhajek @mkoertgen @mksony would you mind trying out our experimental Managed Transport to confirm whether it fixes the issues you experienced?
This error (error waiting on socket) should be fixed by the new approach, which reuses SSH connections instead of attempting to open new SSH connections. A few of our upstream dependencies have issues with concurrent SSH connections, including the golang crypto library, hence the new approach was pursued.
Please give it a test and let us know how you get on. So far the results have been quite positive and the overall reliability seems to have improved.
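To illustrate the idea of reusing connections rather than opening a new one per Git operation, here is a minimal Go sketch of a per-address connection cache. It is not the Managed Transport implementation: `conn`, `dialFunc` and `connCache` are hypothetical names, and the dial function is a stand-in for a real SSH handshake.

```go
package main

import (
	"fmt"
	"sync"
)

// conn stands in for an established SSH connection (hypothetical type).
type conn struct {
	addr string
}

// dialFunc opens a new connection to addr. A real implementation would
// perform the SSH handshake here; this sketch only counts calls.
type dialFunc func(addr string) (*conn, error)

// connCache hands out one shared connection per address instead of
// dialing on every Git operation.
type connCache struct {
	mu    sync.Mutex
	dial  dialFunc
	conns map[string]*conn
}

func newConnCache(dial dialFunc) *connCache {
	return &connCache{dial: dial, conns: make(map[string]*conn)}
}

// get returns the cached connection for addr, dialing only on a cache miss.
// The mutex serializes dialing, so concurrent callers never open
// connections to the same address at the same time.
func (c *connCache) get(addr string) (*conn, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cn, ok := c.conns[addr]; ok {
		return cn, nil
	}
	cn, err := c.dial(addr)
	if err != nil {
		return nil, err
	}
	c.conns[addr] = cn
	return cn, nil
}

// drop evicts a connection (for example after it failed) so that the
// next get dials again.
func (c *connCache) drop(addr string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.conns, addr)
}

func main() {
	dials := 0
	cache := newConnCache(func(addr string) (*conn, error) {
		dials++ // only called while the cache mutex is held
		return &conn{addr: addr}, nil
	})

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, err := cache.get("github.com:22"); err != nil {
				fmt.Println("dial failed:", err)
			}
		}()
	}
	wg.Wait()
	fmt.Println("dials:", dials) // prints "dials: 1"
}
```

Ten concurrent callers share a single dial in this sketch, which is the property the comment above describes: fewer concurrent SSH handshakes against the same host.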
With the release of Flux v0.26.0, we would like to kindly ask folks with issues to update to the latest image releases. Since we changed our build process around libgit2 for the source-controller and image-automation-controller, we have observed some of the issues as described here to have vanished (from EKS at least).
Best to subscribe to this rather than more people saying it’s happening to them (it’s just notifying all of us with no benefit): https://github.com/libgit2/git2go/pull/870
@Nosmoht We are aiming to have a release done between the end of this week and beginning of next.
@uderik thanks again for sharing. I managed to reproduce it, and in my environment the changes seem to have fixed the problem. The new version has some changes we recently merged into main, which recover from git2go/libgit2 panics. Therefore, if this happens again you would see errors in the logs, but no crashes/restarts.
The PR is now updated and a new image created for source-controller: ghcr.io/fluxcd/source-controller:rc-6d517589
Please let me know how you get on.
xref: https://github.com/fluxcd/source-controller/pull/713#issuecomment-1125027191
UPDATE: changed the image with an official source-controller release candidate.
I had the same issue and tried the new version with EXPERIMENTAL_GIT_TRANSPORT set to true, and now there are no errors for me anymore as well 💪
@uderik would you mind testing the versions below and confirming whether that fixes your issue?
A user with a similar issue had it resolved by this. It is related to the in-flight PR: https://github.com/fluxcd/source-controller/pull/713
Test images based on version https://github.com/fluxcd/source-controller/commit/830771fc0ac93eeba1ef86d4768e2cc68d236197.
Hi, no errors for almost 2 days
In my test environment, it seems that the fixes in PR #636 resolve all intermittent issues with SSH when using the experimental transport. This is now merged and released to both source-controller and image-automation-controller.
For information on how to test: https://github.com/fluxcd/source-controller/issues/636#issuecomment-1080789920
Update: The fix is now released, so I updated the link above to point to the comment containing the official versions.
@aholbreich go to where your flux repo is, find the GitRepository YAML for the resource, and under spec add gitImplementation: libgit2. That should change it.
We’re also on Bitbucket, and seeing the same issue.
Hi @Nosmoht, we are in the process of getting some improvements merged, which should fix this. Could you try this image ghcr.io/fluxcd/image-automation-controller:rc-48bcca59 and confirm whether it solves the issue for you?
@uderik @Nosmoht here’s the release candidate for source controller: ghcr.io/fluxcd/source-controller:rc-4b3e0f9a
Can you please give it a go and let me know whether it resolved your issues?
@pjbgf on two clusters (us-east-1 and ap-southeast-1) I see the same issue: many errors with "transport close (potentially due to a timeout)" and no updates. On another cluster (eu-central-1) I see pod restarts with last error:
Maybe it has something to do with the location of the clusters; the GitLab location is eu-central-1.
gitrepo interval 1min, timeout 2min
See the attached pod crash log: pod_crash.log
Update: log with the first issue: image-controller.log
Still encountering this issue after updating to 0.28.0
It alternates with "unable to clone: transport closed".
Hi guys,
I fixed this by temporarily downgrading image-automation-controller from v0.17.1 to v0.14.1, removing .git from the repo URL, and using 3m for the git timeout. One of these changes or a combination of them made the error disappear on 2 clusters.
Have created an upstream issue to ask for guidance now that our code should be close to picture perfect: https://github.com/libgit2/git2go/issues/851
Haven’t had time to test yet; will try next week.
The above changes are now available in image-automation-controller v0.16.1
Now got an image-automation-controller image as well, based on https://github.com/fluxcd/image-automation-controller/pull/239:
docker.io/hiddeco/image-automation-controller:sc-git-update-9ef9856@sha256:53f3d91172b198d8b916d3aefa4dd62867b2c678fb4ca919c61974617
@rjhenry thanks a lot! 🙇🥇 Did a quick port, and the image details can be found at https://github.com/fluxcd/image-automation-controller/pull/238
Hey everyone! After upgrading FluxCD2 to the latest version I faced the same issue 😦 I checked the gitRepository CRD, and by default the go-git gitImplementation is on. I’ve tried changing to libgit2, but still have errors:
Any thoughts on how to fix this? Thanks 🙏
For the record, this occurs with both libgit2 and go-git gitImplementations on the gitRepository resource; I’ve switched back to go-git as the source controller seemed much happier with that.