podman: pasta: timeout again (rawhide)

Resurrecting #17598. Pasta test timing out again. Seen in rawhide with passt-0^20230908.g05627dc-1.fc40-x86_64:

[+1596s] ok 556 podman networking with pasta(1) - TCP/IPv4 small transfer, loopback
[+1726s] not ok 557 podman networking with pasta(1) - TCP/IPv4 large transfer, tap
         # (from function `die' in file test/system/helpers.bash, line 747,
         #  from function `run_podman' in file test/system/helpers.bash, line 384,
         #  from function `pasta_test_do' in file test/system/505-networking-pasta.bats, line 248,
         #  in test file test/system/505-networking-pasta.bats, line 700)
         #   `pasta_test_do' failed
         #
         # [20:56:52.117914919] $ /var/tmp/go/src/github.com/containers/podman/bin/podman rm -t 0 --all --force --ignore
         # [20:56:52.226806205] 42b6b5a167b84848f4b09d240ee31f69736ad905a0761bf4b53a4539b2879ceb
         #
         # [20:56:52.238402603] $ /var/tmp/go/src/github.com/containers/podman/bin/podman ps --all --external --format {{.ID}} {{.Names}}
         #
         # [20:56:52.314645311] $ /var/tmp/go/src/github.com/containers/podman/bin/podman images --all --format {{.Repository}}:{{.Tag}} {{.ID}}
         # [20:56:52.386168270] quay.io/libpod/testimage:20221018 f5a99120db64
         # 1+0 records in
         # 1+0 records out
         # 10485760 bytes (10 MB, 10 MiB) copied, 0.0408792 s, 257 MB/s
         #
         # [20:56:52.693255830] $ /var/tmp/go/src/github.com/containers/podman/bin/podman run --net=pasta -p [10.128.0.120]:5294:5294/tcp quay.io/libpod/testimage:20221018 sh -c for port in $(seq 5294 5294); do                              socat -u TCP4-LISTEN:${port},bind=[10.128.0.120] EXEC:md5sum &                          done; wait
         # [20:59:02.709781235] timeout: sending signal TERM to command ?/var/tmp/go/src/github.com/containers/podman/bin/podman?
         # timeout: sending signal KILL to command ?/var/tmp/go/src/github.com/containers/podman/bin/podman?
         # [20:59:02.716663367] [ rc=137 (** EXPECTED 0 **) ]

This is f40 rawhide, not f39; and it is an in-progress PR. Podman is not yet testing f40 rawhide, so we are unlikely to see this flake often… until that PR merges.

About this issue

  • Original URL
  • State: open
  • Created 9 months ago
  • Comments: 35 (2 by maintainers)

Commits related to this issue

Most upvoted comments

Who wants me to spin up new VMs and cycle them into CI juuuuuust before a long holiday break? Anyone? Anyone?

I can remind you on Friday evening. Today… I don’t know, it wouldn’t feel professional enough.

Thanks for the update!

Who wants me to spin up new VMs and cycle them into CI juuuuuust before a long holiday break? Anyone? Anyone?

This table might help. This is my new little thing that displays package versions on all VM builds. Find the passt column. You can see that some OSes have newer pasta, but some still have old (1107). If run rootless, this pasta test would fail on that OS.

Ah, yes, for Fedora 39 I waited a couple of weeks before submitting this update – there’s always the risk of obsoleting previous updates which are still in testing. This has been in stable for just 11 hours how.

I suppose it should be sorted on Debian too in a bit – 0.0~git20231204.b86afe3-1 was accepted into unstable 10 days ago.

Thanks. I ended up running my above reproducer in a 1mt, it ran fine for the lifetime of the VM. And each time I checked, it was O(5s), not 30. Your code LGTM, thank you!

This is now included in passt-0^20231110.g5ec3634-1.fc40 for Fedora Rawhide. Debian update is still pending.

With @edsantiago 's assistance, I’ve now reproduced the problem both on a VM he set up, and one I set up. I’ve simplified the reproducing case down a bit, but still don’t have a lot in the way of clues about what’s going on. Because the client/sending socat appears to be finished in the stuck case, I’m guessing it’s (another) race in FIN handling.

I expect to continue digging into this tomorrow and make some more substantial progress.