runtime: Database-agnostic way to detect transient database errors

Provider tracking issues

This new API has been merged, following are tracking issues for implementation in the different providers:

Summary

A recurring theme for database programming has been building automated retry policies, but System.Data doesn’t currently allow database drivers to expose whether a database error can be expected to be transient or not. This leads to database-specific logic for detecting transient exceptions being built again and again outside of the driver.

API Proposal

class DbException
{
    public virtual bool IsTransient => false;
}

Notes

  • IsTransient should be optimistic, i.e. return true in all cases where a simple retry of the operation - without any other change - may be successful. We prefer to err on the side of transience, since the downside is only a few extra retries.
  • In at least some cases, if an operation transiently fails within a transaction, the entire transaction must be rolled back and retried (including all previous operations). This would be out of scope of this proposal: DbException would only inform the consumers that the exception is transient, but the actual handling (rolling back, reissuing previous commands) would be the consumer’s responsibility.

Resources and examples

Old notes on SupportsTransienceDetection (removed from proposal)

  • SupportsTransienceDetection can be used to signal that IsTransient has been implemented by a driver, and can be expected to be true where relevant. Note that in general it should be fine to just check IsTransient directly, as it would default to false in any case.
  • It’s a bit problematic for to be on DbException:
    • This makes it hard to do a simple check on startup (one needs an actual DbException instance thrown from the driver).
    • This feature flag would ideally be in a general metadata/feature discovery facility (see #28763), but we don’t have that at the moment.
    • We could put this flag on DbProviderFactory, but that would turn it into a general place for feature flags as opposed to a factory (current flags such as CanCreateDataAdapter are about factory methods on the type, so much more appropriate).
    • As the problem isn’t very acute, having this property on DbException is probably OK.

/cc @David-Engel @cheenamalhotra @bgrainger @bricelam @ajcvickers @stephentoub @terrajobst @mgravell @FransBouma

Edit history

Date Modification
2020-04-10 API proposal
2020-04-18 Removed SupportsTransienceDetection

About this issue

  • Original URL
  • State: closed
  • Created 4 years ago
  • Reactions: 5
  • Comments: 50 (42 by maintainers)

Commits related to this issue

Most upvoted comments

While working on SqlClient, I realized that the classification of Transient errors is a combination of recommendations offered by the database and the nature of the customer workload and the customer’s own understanding of their infrastructure running the applications and databases.

A driver can classify an error as transient, if it is recommended. And such recommendations are usually made for a very small subset of errors. At least this is what my observation is, while interacting with SqlClient customers.

For many other errors, it is the application that decides whether it should be transient or not.

Consider SqlConnection.Open() in SqlClient, it retries on a bunch of errors while connecting to Azure SQL DB. But those errors are returned by the server. However there are many cases in which customers want to retry on networking specific errors, which may be caused by broken connections or caused by failure in TCP paths. Some customers want to have retries on such errors and some customers want to detect these ASAP and have no retries.

Similar arguments go for Command Execution where the customers want to retry on some errors and fail fast on some errors. E.g. Deadlock is a problem that some customer applications want to retry on, because in their workload it is transient and for some customer applications, deadlocks need investigation.

Basically the known transient errors in SqlClient are much fewer and this proposal makes more sense with a mechanism via which the customers can tell us if they want a particular error to be treated as transient and retried on, vs errors that we fail fast with.

Otherwise the client’s may not be authoritatively able to classify errors as transient and in case of client considering errors as transient may be undesirable.

One of the asks from a few customers of SqlClient is to be able to specify what errors the applications want to consider transient and have the client retry internally on those errors.

I wonder if the above situations apply for other drivers, they probably might.

An API ecosystem where customers have the ability to specify what errors should be treated transient would be lot more powerful.

Let me know if you guys think that the IsTransient property proposed here makes sense as-is (as well as SupportsTransienceDetection).

@roji I think IsTransient is a good first step. I’m on the same page as everyone else here. SupportsTransienceDetection adds a small amount of value. I could take it or leave it.

For reference, we have SqlClient issue #416 that is related to transient errors and retry functionality at the SqlClient level. There’s some good background information. The main goal is to reduce the code cost for existing/legacy applications to reliably migrate to a cloud model.

The things that SqlClient can retry on are necessarily limited based on the limited context that the ADO.NET provider has about what the application is doing–as @FransBouma points out. This is why, in my opinion, resiliency is something that should, in general, happen at a higher level than the ADO.NET provider.

EF (both EF6 and EF Core) have more of this context available, and hence are able to do a better job of retrying. For example, EF can retry a failure in getting results back by buffering the results and never giving anything to the application until all the data has been obtained from the database. This ensures that the application won’t get duplicates or other inconsistent data if the query has to be re-run. Likewise, EF is has integration with transactions, such that entire transaction can also be retried on transitive failures.

I don’t know the full details of why transient error handling was added to SqlClient, especially given that it was already available in EF6 at the time it was added to SqlClient. I suspect the reasoning was something like this:

  • SQL Azure was unusable without retries. (I’d like to think this isn’t the case now–5 years ago, it certainly was.)
  • Not everyone is using EF
  • Therefore SqlClient needs to add these retries for SQL Azure to be usable outside of EF.

It has also become clear over the years that there is no rigid definition of what is a “transient error”, and we have updated EF several times based on new data from SQL Azure. We also have a mechanism for applications to modify the the list of errors that are considered transient.

The value of this feature, to me, is to allow the list of errors considered transient to be coupled to the ADO.NET provider, rather than something that the application or EF Core needs to create. Likewise, it no longer becomes a SQL Server-specific list in EF Core. So, to me, this is more about providing some provider-specific information to EF or the application so that it can choose what to do.

Thanks @David-Engel, and thanks for the link too. I posted a comment there with insights from here on why attempting full resiliency within the database driver may not be feasible, hopefully that’s helpful…

Transient errors and retries aren’t only connection open / reconnect oriented, i.e. if you have issued 3 commands and the 4th fails due to error 1205 (result streaming) you have to retry all 4, if they’re in the same transaction. I have no idea if SqlClient’s connectionstring retry logic takes care of that (I doubt it).

Hence it’s a nice idea, but to get full resilience you really have to perform the retry logic at the app level. (IMHO)

The PG logic is pretty simple (and somewhat extreme) - any error puts the ongoing transaction in a failed state, and you can’t do anything aside from rolling the transaction back (or rolling back to a savepoint).

But I’m not sure that IsTransient should mean the same thing as “does it terminate the transaction”… I was only thinking of this as an indication that a retry may be successful (even if that retry involves rolling back the transaction and replaying statements). In other words, I’d expect an ORM/retry policy to check this flag, and if it’s true, roll back the ongoing transaction (if any) and re-attempt.

It’s true that an error may be transient but not cause transaction termination, and so rolling back the entire transaction to retry may be overkill (it could be possible to just retry the latest command). That doesn’t seem terribly important to me as transient exceptions aren’t supposed to be that common for it to matter (am I wrong?).

Does this make sense to everyone? Am I missing something?

I think what you describe is very sensible and in practice it likely leads to redoing the whole transaction again anyway, but I think the key difference with e.g. a UC violation in the middle of a transaction is that the error is outside of the scope of the data, i.e. there’s nothing to ‘correct’ on the client / app side; the only thing it can do is retry the whole unit of work.

An error in the data, violating a UC, an FK, retrying the same Unit of work will lead to the same error, as the error is inside the scope of the data.

Would that help determining when the IsTransient flag should be set or not?

@roji Our application does not need to distinguish between a network issue, the server being down, and the database being down. It needs to know that those errors do not depend on the database commands that it was running in the transaction.

I agree that such a feature is not in scope for this issue, but I mentioned it as an example of difficulty in migrating to DbException.IsTransient from application-side recognition of error numbers.

@KalleOlaviNiemitalo detecting that the database/server is down is notoriously difficult (e.g. a network issue could indicate the server is down (or not)), am curious what you’re current logic for this is…

But either way this seems to be orthogonal (and so out of scope) for this feature, which is only about identifying transient errors - or do you see things otherwise?

The PG logic is pretty simple (and somewhat extreme) - any error puts the ongoing transaction in a failed state, and you can’t do anything aside from rolling the transaction back (or rolling back to a savepoint).

But I’m not sure that IsTransient should mean the same thing as “does it terminate the transaction”… I was only thinking of this as an indication that a retry may be successful (even if that retry involves rolling back the transaction and replaying statements). In other words, I’d expect an ORM/retry policy to check this flag, and if it’s true, roll back the ongoing transaction (if any) and re-attempt.

It’s true that an error may be transient but not cause transaction termination, and so rolling back the entire transaction to retry may be overkill (it could be possible to just retry the latest command). That doesn’t seem terribly important to me as transient exceptions aren’t supposed to be that common for it to matter (am I wrong?).

Does this make sense to everyone? Am I missing something?

Video

  • Makes sense as proposed.
namespace System.Data.Common
{
    public partial class DbException
    {
        public virtual bool IsTransient => false;
    }
}

@roji I’ll hold off until API review is complete. Is there a time frame for when API reviews are completed in general?

@cheenamalhotra

@roji I don’t think this should be touched upon by other drivers/use-cases if not needed. Also I’m not sure if we want to introduce this DbExcpetion at all, and not SqlError instead, as this is clearly Azure SQL specific. Unless I’m missing info about other .NET Data providers handling transient errors.

From my point of view, transient error recognition and resilience/retry is something that affects all databases, not just Azure SQL - any database can have an intermittent networking issue, or some other transient problem causing it to momentarily fail. I’ve definitely received requests for this for Npgsql, and IsTransient is in fact implemented there. The goal here is simply to surface the transience information to the user (and/or to any resilience layer they’re using à la Polly) so that operations can be retried.

My point above was that including resilience logic inside an ADO driver (as opposed to just exposing IsTransient) makes sense to me mainly for Azure SQL (at least back in the day), since as @ajcvickers said, “SQL Azure was unusable without retries”.

However giving precise information to users from DbException like ErrorCode holds more value IMO so client apps can decide which error codes to flag as transient along with the list of “transient” errors that driver would maintain (ref #34798).

The problem with exposing ErrorCode on DbException (as opposed to SqlException) is that there isn’t a database-agnostic concept of what an error actually is (see this comment)… SQL Server’s error code is an int, but PostgreSQL’s is a string. Even if the CLR type were the same, the actual error codes are completely different, so an API consumer wouldn’t be able to actually do anything with the code: the moment they’d want to test for anything, they’d be forced to recognize database-specific error codes, and so they may as well just code against SqlException, NpgsqlException, etc.

I’m curious, do non-.NET SQL Server drivers implement the same kind of resilience logic as SqlClient because of Azure SQL? @David-Engel do you have any info here?

@roji SqlClient is the most feature-rich in terms of this type of functionality, but yes, other drivers do implement resilience logic. ODBC has automatic connection open retry logic, for example and automatic reconnection of idle connections which have been disconnected by network devices. Saurabh touched on what I believe is the biggest use case for these features: existing/legacy applications which were built on the assumption of local resources where “blips” rarely occurred (whether that’s a network hiccup, fail over event, provisioning delay, etc) and where it is burdensome to make code changes. Enabling those to be migrated to cloud databases without suffering from those cloud “blips” has very high value to customers.

Yeah, I think that makes sense.

I admit that for Npgsql (where IsTransient already exists) I haven’t received requests to tweak its meaning. However, if for a particular driver it makes sense to expose a configuration interface (via connection strings or via a special API), which affects what the driver returns via InTransient, then it’s by all means free to do so (again, this would be specific to that driver since errors aren’t agnostic).

I’ll just remark that in that case, instead of configuring the driver to modify its IsTransient values, a user could just implement their logic wherever the IsTransient is being consulted (since things like Polly are almost always configurable). That is, say I want to treat error code X as transient, although by default the driver doesn’t treat it as such; I could modify my external retry strategy (which is under my direct control) to recognize both IsTransient and code X. But it may indeed still make sense to influence IsTransient in the driver in some scenarios.

(so all good)

Is SupportsTransienceDetection really necessary? All I’m interested in as a consumer of the exception is whether it’s a transient error. Now I have a list of transient error codes per database in my transient error recovery strategy classes and that’s not very flexible (if error codes are added I have to maintain this list plus I have to look up these codes for all databases supported). So I just want to know ‘is this exception a transient error?’ If so, use the strategy, if not, throw the exception further up like a normal exception.

If IsTransient is false, whether that’s because it’s not a transient error or because the driver/ado.net provider doesn’t support this, the net end result is the same: the recovery strategy is bypassed and the exception is rethrown as a normal exception.

Perhaps I’m overlooking a use case tho 😃