podman: pasta: timeout again (rawhide)

Resurrecting #17598. Pasta test timing out again. Seen in rawhide with passt-0^20230908.g05627dc-1.fc40-x86_64:

[+1596s] ok 556 podman networking with pasta(1) - TCP/IPv4 small transfer, loopback
[+1726s] not ok 557 podman networking with pasta(1) - TCP/IPv4 large transfer, tap
         # (from function `die' in file test/system/helpers.bash, line 747,
         #  from function `run_podman' in file test/system/helpers.bash, line 384,
         #  from function `pasta_test_do' in file test/system/505-networking-pasta.bats, line 248,
         #  in test file test/system/505-networking-pasta.bats, line 700)
         #   `pasta_test_do' failed
         #
         # [20:56:52.117914919] $ /var/tmp/go/src/github.com/containers/podman/bin/podman rm -t 0 --all --force --ignore
         # [20:56:52.226806205] 42b6b5a167b84848f4b09d240ee31f69736ad905a0761bf4b53a4539b2879ceb
         #
         # [20:56:52.238402603] $ /var/tmp/go/src/github.com/containers/podman/bin/podman ps --all --external --format {{.ID}} {{.Names}}
         #
         # [20:56:52.314645311] $ /var/tmp/go/src/github.com/containers/podman/bin/podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
         # [20:56:52.386168270] quay.io/libpod/testimage:20221018 f5a99120db64
         # 1+0 records in
         # 1+0 records out
         # 10485760 bytes (10 MB, 10 MiB) copied, 0.0408792 s, 257 MB/s
         #
         # [20:56:52.693255830] $ /var/tmp/go/src/github.com/containers/podman/bin/podman run --net=pasta -p [10.128.0.120]:5294:5294/tcp quay.io/libpod/testimage:20221018 sh -c for port in $(seq 5294 5294); do                              socat -u TCP4-LISTEN:${port},bind=[10.128.0.120] EXEC:md5sum &                          done; wait
         # [20:59:02.709781235] timeout: sending signal TERM to command ?/var/tmp/go/src/github.com/containers/podman/bin/podman?
         # timeout: sending signal KILL to command ?/var/tmp/go/src/github.com/containers/podman/bin/podman?
         # [20:59:02.716663367] [ rc=137 (** EXPECTED 0 **) ]

This is f40 rawhide, not f39; and it is an in-progress PR. Podman is not yet testing f40 rawhide, so we are unlikely to see this flake often… until that PR merges.

About this issue

Original URL
State: open
Created 9 months ago
Comments: 35 (2 by maintainers)

Commits related to this issue

CI VMs: bump to f39 + f38 ...from f38 + f37. Requires one minor e2e test change, to handle an error logging change in conmon 2.1.8. Also, this is important, requires crun-1.9.1 because of a kernel ... — committed to edsantiago/libpod by edsantiago 9 months ago
CI VMs: bump to f39 + f38 ...from f38 + f37. Requires one minor e2e test change, to handle an error logging change in conmon 2.1.8. Also, this is important, requires crun-1.9.1 because of a kernel ... — committed to edsantiago/libpod by edsantiago 9 months ago
CI VMs: bump to f39 + f38 ...from f38 + f37. Requires one minor e2e test change, to handle an error logging change in conmon 2.1.8. Also, this is important, requires crun-1.9.1 because of a kernel ... — committed to edsantiago/libpod by edsantiago 9 months ago
CI VMs: bump to f39 + f38 ...from f38 + f37. Requires one minor e2e test change, to handle an error logging change in conmon 2.1.8. Also, this is important, requires crun-1.9.1 because of a kernel ... — committed to edsantiago/libpod by edsantiago 9 months ago
CI VMs: bump to f39 + f38 ...from f38 + f37. Requires one minor e2e test change, to handle an error logging change in conmon 2.1.8. Also, this is important, requires crun-1.9.1 because of a kernel ... — committed to edsantiago/libpod by edsantiago 9 months ago
CI VMs: bump to f39 + f38 ...from f38 + f37. Requires one minor e2e test change, to handle an error logging change in conmon 2.1.8. Also, this is important, requires crun-1.9.1 because of a kernel ... — committed to edsantiago/libpod by edsantiago 9 months ago
CI VMs: bump to f39 + f38 ...from f38 + f37. Requires one minor e2e test change, to handle an error logging change in conmon 2.1.8. Also, this is important, requires crun-1.9.1 because of a kernel ... — committed to edsantiago/libpod by edsantiago 9 months ago
CI VMs: bump to f39 + f38 ...from f38 + f37. Requires one minor e2e test change, to handle an error logging change in conmon 2.1.8. Also, this is important, requires crun-1.9.1 because of a kernel ... — committed to edsantiago/libpod by edsantiago 9 months ago
tcp: Don't use TCP_WINDOW_CLAMP On the L2 tap side, we see TCP headers and know the TCP window that the ultimate receiver is advertising. In order to avoid unnecessary buffering within passt/pasta (... — committed to AkihiroSuda/passt-mirror by dgibson 8 months ago

Most upvoted comments

Who wants me to spin up new VMs and cycle them into CI juuuuuust before a long holiday break? Anyone? Anyone?

I can remind you on Friday evening. Today… I don’t know, it wouldn’t feel professional enough.

sbrivio-rh on Dec 19, 2023

Thanks for the update!

Who wants me to spin up new VMs and cycle them into CI juuuuuust before a long holiday break? Anyone? Anyone?

edsantiago on Dec 19, 2023

This table might help. This is my new little thing that displays package versions on all VM builds. Find the passt column. You can see that some OSes have newer pasta, but some still have old (1107). If run rootless, this pasta test would fail on that OS.

Ah, yes, for Fedora 39 I waited a couple of weeks before submitting this update – there’s always the risk of obsoleting previous updates which are still in testing. This has been in stable for just 11 hours how.

I suppose it should be sorted on Debian too in a bit – 0.0~git20231204.b86afe3-1 was accepted into unstable 10 days ago.

sbrivio-rh on Dec 19, 2023

Thanks. I ended up running my above reproducer in a 1mt, it ran fine for the lifetime of the VM. And each time I checked, it was O(5s), not 30. Your code LGTM, thank you!

This is now included in passt-0^20231110.g5ec3634-1.fc40 for Fedora Rawhide. Debian update is still pending.

sbrivio-rh on Nov 14, 2023

With @edsantiago 's assistance, I’ve now reproduced the problem both on a VM he set up, and one I set up. I’ve simplified the reproducing case down a bit, but still don’t have a lot in the way of clues about what’s going on. Because the client/sending socat appears to be finished in the stuck case, I’m guessing it’s (another) race in FIN handling.

I expect to continue digging into this tomorrow and make some more substantial progress.

dgibson on Oct 17, 2023