DnsClient.NET: Intermittent "Record reader index out of sync." error on SRV record Resolve.

We’re using MongoDB.Driver, which uses DnsClient to resolve the SRV record of our MongoDB Atlas cluster.

The code runs in a Kubernetes cluster in AKS (Azure Kubernetes Service). On a newly created cluster, we get lots of connection errors with this stack trace:

A timeout occured after 30000ms selecting a server using CompositeServerSelector 
{ Selectors = MongoDB.Driver.MongoClient+AreSessionsSupportedServerSelector, LatencyLimitingServerSelector{ AllowedLatencyRange = 00:00:00.0150000 } }. 
Client view of cluster state is { ClusterId : \"1\", ConnectionMode : \"ReplicaSet\", Type : \"ReplicaSet\", State : \"Disconnected\", Servers : [], DnsMonitorException : \"DnsClient.DnsResponseException: Unhandled exception ---> 
System.InvalidOperationException: Record reader index out of sync.
   at DnsClient.DnsRecordFactory.GetRecord(ResourceRecordInfo info)
   at DnsClient.DnsMessageHandler.GetResponseMessage(ArraySegment`1 responseData)
   at DnsClient.DnsUdpMessageHandler.Query(IPEndPoint server, DnsRequestMessage request, TimeSpan timeout)
   at DnsClient.LookupClient.ResolveQuery(IReadOnlyCollection`1 servers, DnsMessageHandler handler, DnsRequestMessage request, Boolean useCache, LookupClientAudit continueAudit)
   --- End of inner exception stack trace ---
   at DnsClient.LookupClient.ResolveQuery(IReadOnlyCollection`1 servers, DnsMessageHandler handler, DnsRequestMessage request, Boolean useCache, LookupClientAudit continueAudit)
   at DnsClient.LookupClient.QueryInternal(IReadOnlyCollection`1 servers, DnsQuestion question, Boolean useCache)
   at DnsClient.LookupClient.Query(String query, QueryType queryType, QueryClass queryClass)
   at MongoDB.Driver.Core.Misc.DnsClientWrapper.ResolveSrvRecords(String service, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Clusters.DnsMonitor.Monitor()\" 

I’ve made a small test program that uses DnsClient directly, and it seems the SRV lookup fails from time to time.

Any clues on what’s going on?

About this issue

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 25 (8 by maintainers)

Most upvoted comments

The fix is now released on NuGet https://www.nuget.org/packages/DnsClient/1.3.0

@scarreno @scadorel

If you upgrade MongoDB.Driver to 2.10.3, it now references MongoDB.Driver.Core@2.10.0, which in turn references DnsClient@1.3.1.

Thanks @JoeSainsburys. The custom dns solution works for us!

@MichaCo I’ve run the same test as before (this time using your new beta version) on a Windows 10 machine, on a Linux (CentOS 7) machine outside of AKS, and in an Alpine-based container (mcr.microsoft.com/dotnet/core/sdk:3.1.100-alpine3.10) within AKS. I now get a consistent response when querying for the SRV record, and I don’t get any errors.

To echo others, thanks for taking the time to put in a fix for this. I’m looking forward to the next release of DnsClient: my problem was with MongoDB, and it looks like their package just needs a version of DnsClient greater than 1.2.0, so I won’t even have to wait for them to update 😃

The DNS subsystem in AKS has probably always returned OPT records, but at some point it started sending them with a body. The bug was that the library didn’t read that body and so didn’t advance the reader past it, which left the parser out of sync with the response.

In general, it is totally valid for the OPT record to have a data body, and it was clearly a bug in 1.2.0 of this library not reading it.
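To make the failure mode concrete, here is a minimal Python sketch of a record reader. It is not the DnsClient code: the `Reader` class, `read_record`, and the `read_opt_body` switch are made up for illustration, owner names are simplified to the root name (a single zero byte), and the sanity check is only modeled on what the error message suggests. The point is that every record carries an RDLENGTH, and a reader that ignores the body of a record type has to skip those bytes anyway.

```python
import struct

OPT = 41  # record type code of the EDNS OPT pseudo-record


class Reader:
    """Tiny cursor over a byte buffer (hypothetical helper)."""

    def __init__(self, data):
        self.data = data
        self.index = 0

    def read(self, n):
        chunk = self.data[self.index:self.index + n]
        self.index += n
        return chunk


def build_record(rtype, rdata, rclass=4096, ttl=0):
    # Root owner name (one zero byte), then TYPE, CLASS, TTL, RDLENGTH, RDATA.
    # For OPT, the CLASS field carries the UDP payload size instead.
    return b"\x00" + struct.pack("!HHIH", rtype, rclass, ttl, len(rdata)) + rdata


def read_record(reader, read_opt_body):
    reader.read(1)  # owner name; simplified to the root name only
    rtype, _rclass, _ttl, rdlength = struct.unpack("!HHIH", reader.read(10))
    start = reader.index
    if rtype == OPT and not read_opt_body:
        rdata = b""  # buggy path: assumes OPT never has a body
    else:
        rdata = reader.read(rdlength)  # always consume RDLENGTH bytes
    # Sanity check: the cursor must sit exactly at the end of the record.
    if reader.index != start + rdlength:
        raise RuntimeError("Record reader index out of sync.")
    return rtype, rdata


if __name__ == "__main__":
    empty_opt = build_record(OPT, b"")
    opt_with_body = build_record(OPT, bytes(20))
    print(read_record(Reader(empty_opt), read_opt_body=False))
    print(read_record(Reader(opt_with_body), read_opt_body=True))
```

With an empty OPT body both paths behave the same, which would explain why the problem only shows up once a resolver starts attaching data to the OPT record; with a non-empty body, the `read_opt_body=False` path raises.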

Some background about OPT records: There are some experimental features in DNS which use the key/value body of OPT to transport random stuff… Maybe something started using the OPT records to transport some additional information. The type code of the data used in the example above isn’t registered, so it is clearly experimental - meaning, I have no idea what the data is about or where it comes from.

You can find all known codes here: https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-parameters-11

The code from the example above is in that 65001-65534 group.
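For anyone who wants to inspect such a payload themselves, the OPT body is a sequence of options, each with a 16-bit code, a 16-bit length, and then that many bytes of data (RFC 6891). The sketch below walks that layout and flags codes in the 65001-65534 range, which the IANA registry lists as reserved for local/experimental use. The function names and the sample values (code 65001 with a 16-byte payload) are made up for illustration; they are not the actual data from the example above.

```python
import struct


def parse_edns_options(rdata):
    """Split OPT RDATA into a list of (option_code, option_data) tuples."""
    options = []
    offset = 0
    while offset < len(rdata):
        if offset + 4 > len(rdata):
            raise ValueError("truncated option header")
        code, length = struct.unpack_from("!HH", rdata, offset)
        offset += 4
        if offset + length > len(rdata):
            raise ValueError("option data runs past the end of RDATA")
        options.append((code, rdata[offset:offset + length]))
        offset += length
    return options


def is_local_or_experimental(code):
    # 65001-65534 is reserved for local/experimental use in the IANA registry.
    return 65001 <= code <= 65534


if __name__ == "__main__":
    # Hypothetical body: one option, 4-byte header + 16 bytes of data.
    sample = struct.pack("!HH", 65001, 16) + bytes(16)
    for code, data in parse_edns_options(sample):
        print(code, len(data), is_local_or_experimental(code))
```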

@MichaCo I’ve just tested with the beta and it works perfectly in our app. Thanks for taking the time to look into it and fix it 💪

@fbeauche turns out this is (most likely) the same bug as #55 and will be fixed with 1.3.0. Using your example data I was able to reproduce the issue and confirm that the fix is in the next release.

The OPT record in your example payload contains 20 bytes, and version 1.2.0 of DnsClient doesn’t read those bytes, which results in a pretty unfortunate bug.

Very interesting,

@JasonJhuboo First of all, thanks for trying to reproduce the issue. That at least narrows it down a bit to AKS-related issues with DNS resolution.

I tried to find out more and found that there are actually a bunch of DNS issues tracked on the AKS GitHub repo, https://aka.ms/aks/io-throttle-issue for example. A couple of other issues are referenced in there.

From what I’ve found so far, it could be an issue with core-dns and/or Alpine images. There is at least one known issue that may be related to this: https://github.com/Azure/AKS/issues/667

Not sure yet if I can do anything to improve the situation/stability, I’ll have a closer look over the weekend I guess.