gitoxide: *Sometimes* `gix fetch` gets stuck in negotiation with `ssh://` remotes (hosted by `gitea`)
Current behavior
This happens on some repos sporadically (using v0.29.0, but it has happened long before that).
When you run `gix fetch` it gets stuck in the negotiation phase, seemingly forever. I tend to stop it after a few seconds, but I seem to remember it staying there for a few minutes.
Cancelling it with CTRL+C and rerunning the command causes the same behaviour.
Running `git fetch` seems to fix the repo, and `gix fetch` works again afterwards.
This is the error displayed after sending CTRL+C:
```
Error: An IO error occurred when talking to the server

Caused by:
    Broken pipe (os error 32)
```
Expected behavior
`gix fetch` should either work or time out the negotiation after a reasonable amount of time (a few seconds to a minute).
Steps to reproduce
???
I see it happening on my self-hosted gitea repos relatively often (roughly once every two weeks), but I have no idea how to reproduce it.
If you have any idea how I could go about diagnosing the issue, I'll make sure to keep it in mind for the next time it happens. For now this is all I have.
About this issue
- Original URL
- State: closed
- Created 9 months ago
- Comments: 51 (50 by maintainers)
Commits related to this issue
- fix: assure we flush before turning a writer into a reader. (#1061) Otherwise it might be that the write-end still isn't flushed, so the receiver didn't get the message it's waiting on, which would ... — committed to Byron/gitoxide by Byron 9 months ago
- fix: V1 negotiation won't hang anymore (#1061) The logic previously tried to estimate when a pack can be expected, and when a NAK is the end of a block, or the beginning of a pack. This can be known... — committed to Byron/gitoxide by Byron 8 months ago
Verified that `main` fixes the issue c:

Thanks for all of this!

Can you try once more from this PR? It contains adjustments to the logic to work with more test-cases, and I can only hope that it also still covers your case.
That's incredible! A small change with huge effect! I can now just hope that the test-coverage is as good as I think, or else something else might break (at least the blast radius is limited to V1).
Alright, the PR is in flight and I hope it will be smooth sailing from now on.
Thanks a lot for trying!
This means I am puzzled as to where the `done` could have gone. Rust sends it into a pipe that should connect to `git-upload-pack`, which has just sent ACK and NAK and would now proceed to read the next packetline. That should be `done`, and then the pack is sent. But that clearly doesn't happen.
Maybe something else is happening here, somehow. What confuses me is that `done` is sent in the second round, which would mean that it finished parsing the first response. But according to the logic here, with `client_expects_pack=false` and `saw_ready=true`, we'd get `false` for the filter, which would then try to read past the last `NAK`, which should make it stall right there. Thus it wouldn't get to send `done` at all in the second round.

In any case, the way I understand the code in `git-upload-pack`, a logic change seems in order. Can you try it with this patch? It passes the test-suite, so that's a start (and it will hang if I butcher it too much).
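For illustration, here is a minimal hypothetical sketch of the kind of decision being discussed (the names are made up, this is not gitoxide's actual code). The fix that eventually landed rests on the observation that whether a `NAK` is followed by the pack does not need to be estimated, it only depends on whether the client has already sent `done`:

```rust
/// Hypothetical helper, not gitoxide's internals: in protocol V1 a `NAK`
/// either terminates one round of ACK/NAK lines or is the last line before
/// the pack arrives, and which of the two it is depends only on whether the
/// client has already sent `done`.
fn nak_precedes_pack(client_sent_done: bool) -> bool {
    // Before `done`: a NAK merely ends the current negotiation round and the
    // client keeps sending `have` lines.
    // After `done`: the server's final ACK/NAK block is followed by the pack.
    client_sent_done
}

fn main() {
    assert!(!nak_precedes_pack(false)); // still negotiating
    assert!(nak_precedes_pack(true)); // read the pack next
}
```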
Here is the log with `git-upload-pack`'s output:

log
`gitoxide.tracePacket=1` seems to print binary characters (`\0`), which makes them a pain to deal with (`grep` and `diff` need the `--text` flag or they refuse to print to stdout). Here are the logs:
git.log
gix.log
Packetline support is cooking in this PR and should be merged soon.
Then I'd need the output of `GIT_TRACE_PACKET=1 git fetch` and `gix -c gitoxide.tracePacket=1 --trace fetch` for comparison. If redaction of the output is happening, please be sure to apply the same "function" to both outputs - it's fine to remove everything except the negotiation part, it should be distinct enough.

My expectation is that both interactions should be very similar if not the same, so there is probably some difference that explains the deadlock. My hope is that this is something obvious, like a protocol error of some sort, while the negotiation commit-walk is exactly the same.
To be sure that it's (probably) not a bug on the server side, I have checked the code of `gitea` and believe that they also just run `git` under the hood, i.e. `git-upload-pack`.

Another test we can try is to locally host the repo, in the state it is in on the server, using `git daemon`, turn up `git`'s own tracing, add it as a remote to the local clone that is in a state that hangs, and see what that yields - usually that will provide additional information about the state of the `git-upload-pack` process.

But one step at a time.
PS: I also implemented auto-strict, so `-c x=foo` will not quietly ignore obvious errors anymore.

You could automatically set it to the `strict` version of the desired mode when CLI overrides are detected - I think that's a great idea. It should just be a couple of lines if you want to try it.

That would be very strange, it shouldn't need two negotiation rounds if just one commit is missing. It would definitely be interesting to see what happens when using HTTPS (as a stateless variant of the protocol). I guess once there is something that reproduces the issue, it could also be reproduced on GitHub, which could then be used to compare both HTTPS and SSH.
Never mind the rambling above though, I think once packetline tracing is available, the issue will clear up quickly.
How many negotiation rounds did you get with `skipping`? Just one, is my prediction.

(Concurrent editing) In theory, that's a feature, and it's intentionally lenient there. This makes it easy to change, and maybe it should change. If you want to change it to strict mode, please be my guest.
I am looking into adding tracing support similar to `GIT_TRACE_PACKET` now.

A bit more context on when it happens (I'm not 100% sure this is the pattern, because it happens so infrequently).
I have two computers with the same git repos (the ones that get stuck, from my self-hosted gitea instance). I have a preference for one PC, so I leave the other alone for a while. When I return to the other computer, the repos sometimes get stuck.
So how I think the issue could be reproduced is: leave a clone untouched for a while so it falls behind the remote, then run `gix fetch` against the `ssh` origin.

I'll see if I can reproduce this like that.
Thanks so much, I forgot that it's possible to interrupt and then shut down the application normally, showing the trace.
We see that it hangs in round two, which probably means it blocks while sending, or it blocks while receiving a reply, maybe because what was sent didn't get flushed, which would make it a local problem. Since I pretty much trust the negotiation by now, I'd think it might be something silly like a flush that wasn't performed.
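That suspicion matches the fix that was later committed ("assure we flush before turning a writer into a reader"). Below is a minimal, generic sketch of that failure mode with a stand-in child process; it is not gitoxide's transport code, just an illustration of how an unflushed buffered writer in front of a pipe leaves the other side waiting:

```rust
// Generic sketch of the suspected failure mode (not gitoxide's actual
// transport code): requests to a child process (think `ssh ... git-upload-pack`)
// go through a buffered writer. If that buffer isn't flushed before switching
// to reading the reply, both sides wait on each other and the fetch hangs.
use std::io::{BufRead, BufReader, BufWriter, Write};
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    let mut child = Command::new("cat") // stand-in for the remote helper (Unix)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    let mut writer = BufWriter::new(child.stdin.take().unwrap());
    let mut reader = BufReader::new(child.stdout.take().unwrap());

    writer.write_all(b"0009done\n")?;
    // Without this flush, the request may still sit in the local buffer while
    // we block below, waiting for a reply the child never received.
    writer.flush()?;

    let mut reply = String::new();
    reader.read_line(&mut reply)?;
    println!("got: {reply}");

    drop(writer); // close stdin so the child can exit
    child.wait()?;
    Ok(())
}
```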
Using `gix --trace fetch` (private repo, can't upload the .git repository) (using gix 0.30.0):

Killing it after 7s or 1min seems to make no difference to the trace output.
I will make a backup of this repo in case you have a fix you'd like to test.
Yes, that would be optimal.
Thanks for helping me to make `gix` better!

There is also a `--trace` option, but right now it only prints at the end of an invocation, which doesn't happen during hangs. Having an alternative trace-mode that is instant might alleviate this (a rough sketch follows below), even though I don't think it would reveal that much.

There is a light at the end of the tunnel though, as it's definitely planned to offer a built-in native `ssh` client as a transport as well, instead of forwarding to the `ssh` binary. Once that is in place, and if it fixes the issue, it's clear that the cause of this issue is something about how `gix` communicates with the `ssh` process via pipes, which is really easy to get wrong without noticing, as many tests never reach certain thresholds that may cause these bugs to appear.
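For the `--trace` remark above, here is a rough sketch of what an instant trace-mode could look like, assuming the `tracing` and `tracing-subscriber` crates as dependencies (an assumption for illustration, not gitoxide's actual plumbing): events are written to stderr as they happen, so interrupting a hung fetch still leaves the trace up to that point.

```rust
// Sketch only: an "instant" trace mode built on the `tracing` ecosystem
// (crates `tracing` and `tracing-subscriber` assumed), not gitoxide's actual
// implementation. Events are printed to stderr immediately instead of being
// collected and printed at the end of the invocation.
fn init_instant_tracing() {
    tracing_subscriber::fmt()
        .with_writer(std::io::stderr) // emit each event as it happens
        .with_max_level(tracing::Level::TRACE)
        .init();
}

fn main() {
    init_instant_tracing();
    tracing::trace!("negotiation round started"); // visible even if we hang later
}
```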