openssl: New EOF detection breaks session resumption

We want to do session resumption.

If we follow the documentation of SSL_shutdown:

In case the application wants to be able to resume the session, it is recommended to do a complete shutdown procedure (bidirectional close_notify alerts). […] The shutdown is not yet finished: the close_notify was sent but the peer did not send it back yet. Call SSL_read() to do a bidirectional shutdown.

Consider the following pseudocode:

// obtain SSL_SESSION via callback during SSL_connect
...

// bidirectional shutdown
if (SSL_shutdown(ssl) == 0)
    SSL_read(ssl, NULL, 0);

// try to resume session via new ssl object
...

Due to new EOF detection (db943f43a60d1b5b1277e4b5317e8f288e7a0a3a), the call to SSL_read fails here and invalidates the session (SSLfatal) breaking resumption.

Version: OpenSSL 1.1.1e and current master (7e06a6758bef584deabc9cb4b0d21b3e664b25c9) Proposed fix: If bidirectional shutdown is supposed to work correctly, a zero-length EOF in SSL_read should not be a fatal error.

/cc @kleest

About this issue

  • Original URL
  • State: open
  • Created 4 years ago
  • Reactions: 1
  • Comments: 81 (61 by maintainers)

Commits related to this issue

Most upvoted comments

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn’t.

IMO there is no doubt that this is a bug. However - all bug fixes are behaviour changes. Normally we hope that the behaviour change introduced by a bug fix into a stable branch is desirable because no one wants the old incorrect behaviour. However, every now and then we come across a bug like this one. The old behaviour has been there for so long that other software has been written to expect the incorrect behaviour.

Since 1.1.1 is supposed to be stable, and this has broken stuff, it seems that the correct solution is to revert it. However, 3.0 is a major release. We are trying to keep breaking changes in 3.0 to an absolute minimum, but we do not rule them out entirely. Software authors should reasonably expect to have to test that their software still works when upgrading to 3.0, and might have to make some minor changes in some cases. So, it still seems to me entirely appropriate to fix this bug in that release.

That said this has highlighted a couple of related problems which should be fixed in 3.0:

  • Session resumption breaking was unexpected - so we should look to fix that
  • s_server should also be fixed

It’s entirely possible that there are other related fixes that we should do to minimise the impact. So it would be good if we could identify those.

I agree that bugfixes are behavior changes, and that 3.0 is a major release. It would be unfortunate if, for example, it wasn’t possible to buid and USE a FIPS-capable curl, for example, or set up a FIPS-capable nginx web server.

Please consider these things when you decide if the EOF detection stays in 3.0

I think the fix should stay in 3.0. Migration to 3.0 requires at least a rebuild/test of the code, and most applications should continue to work, and those that don’t can be remediated to not be oblivious to truncation attacks.

The issue with introducing this into 1.1.1e is that applications tested with previous 1.1.x releases would now without additional build/test cycles be suddenly exposed to a library that presents a potentially unexpected and not entirely uncommon end-of-session behaviour. That’s a higher level of risk than for recertification (build/test) of the application for 3.0, likely on a newer overall OS release, …

Which is not to say that we couldn’t discuss providing backwards-compatibility crutches for 3.0, but I’d like to know more about actual, rather than hypothetical impact. Which applications and which corner cases are running into this?

PR for the 1.1.1 revert: #11400

I suppose so. A related issue to solve in master is also the #11388

Does this imply that we need to do a 1.1.1f soon?

I would like to suggest adding an option to SSL_set_options to switch between the old and new EOF handling, which would be set to “old” by default. This would solve all EOF related issues for software relying on the old behavior while keeping the option for anyone concerned about #10880.

@beldmit This is not just a s_server problem. We found broken resumption using nginx as well.

@kroeckx While you are right that the server should send a close notify, many servers don’t. So “fix the server” is not something we can do. I did a quick test against the Alexa Top 50 and the rate of successful resumptions using bidirectional shutdown went from 70% using 1.1.1d to 30% using 1.1.1e. If SSL_read is going to signal an error, at least don’t invalidate the session while doing so.

As a general note, this kind of breaking change is not something I would have expected in a bugfix release.

Alright, then the SSL_shutdown documentation should be updated to reflect that in 3.0.

I did that in #11531

I still think there should be an option to ignore EOF for protocols that already handle truncation well.

I think that’s a good suggestion. It’s probably easier for applications to use that than to avoid the SSL_read() call. We should consider adding that in 1.1.1-stable.

It is a problem that OpenSSL seems to lack a culture of trying to minimize the amount of user visible changes.

I think this claim is ill-informed; the culture is to minimize the amount of user-visible changes. We’re all human, though, and sometimes things slip through. As 1.1.1f has illustrated, we are capable of reacting and fixing it when the wrong thing happened.

I think a couple of things are being conflated here, leading to some confusion.

  1. I think we can all agree that unexpected EOF should be a fatal error in an ideal world, because RFC 8446 is very clear about the close_notify requirement.
  2. The documentation is accurate if the server properly closes the TLS connection.

The main problem is how misbehaving servers are handled. If they are treated as a fatal SSL error, then the session should be invalidated according to the RFC:

Servers and clients MUST forget the secret values and keys established in failed connections, with the exception of the PSKs associated with session tickets, which SHOULD be discarded if possible.

This would unfortunately turn all connections to currently non-conforming servers into failed connections, breaking session resumption and causing lots of errors. Assuming that the server applications cannot be fixed in the near future, there is no perfect solution to the problem but changing this behavior in a major release instead seems reasonable to me.

The “poorly-implemented HTTP/1.1 servers” are still out there and are being used. How common? Impossible to say.

I agree that bugfixes are behavior changes, and that 3.0 is a major release. It would be unfortunate if, for example, it wasn’t possible to buid and USE a FIPS-capable curl, for example, or set up a FIPS-capable nginx web server.

Please consider these things when you decide if the EOF detection stays in 3.0

Can we at least do the revert in 1.1.1 soon? I can prepare PR for that.

The server-sides causing these problems will not just fix themselves in the mean-time so they will just show up later rather than sooner if you revert in 1.1.1 and leave enabled in v3. Postponing the decision will not help. It is either a bug or it isn’t.

If you keep it in 1.1.1 (no comment on that), please add a big CHANGES item that says this is broken and is changing in the next release, and point to a FAQ entry that explains what to do.

I am afraid that this is the most reasonable way for 1.1.1.

And still in the master it would be worth allowing the resumption after the unexpected EOF unless we have a very good reason not to do that.

In practice I’m not sure how many people would actually use that option. I’m wondering whether a better solution is to keep this fix in master, but revert it in 1.1.1 and just document it as a known bug.

I’m currently unsure if we should revert this or not.

This seems like the best solution for the near future until we can find a better way to handle the EOF issue.

And it seems non-trivial for code that actually knows all data was received and can’t avoid calling SSL_read() to ignore the error.

We would love to ignore the error, except that SSL_read now calls SSLfatal, which invalidates the session attached to it and makes it not resumable. We have no direct control over the validity of the session and cannot predict if a server will send a close notify or not.

If this problem persists, we can only switch to unidirectional shutdown and hope that the issue gets fixed in a future release.

We can consider revering it in 1.1.1 but keeping it in master/3.0. But I’m currently still unsure what to do, we want to encourage people to fix this. And it seems non-trivial for code that actually knows all data was received and can’t avoid calling SSL_read() to ignore the error.