Back at the start of July, one of our customers reported that an application was intermittently extremely slow or failed completely. Originally the customer thought that it was a firewalling problem, but we identified a DNS problem as the cause - the DNS server run by the LEA (local education authority) was taking a few orders of magnitude longer than you'd expect to respond to AAAA record lookups for two domains that were used by the app, and eventually responded with a failure.
A quick explanation is probably in order here: when a client needs to connect to a web server on the internet, it has to convert the domain name (e.g. www.example.com) into the numerical IP address(es) of the server(s). It does this through the Domain Name System (DNS). Typically a (modern) client requests both "A" records and "AAAA" records from a DNS server - "A" records list the (legacy) IPv4 addresses for the web server, whilst "AAAA" records list its (newer) IPv6 addresses. A web server may not have both IPv4 and IPv6 addresses, but the DNS server still has to produce a successful response to tell the client this.
Importantly, even if the client only has a legacy IPv4 internet connection, it may not know that it can't contact a server using IPv6 until it actually tries, so it will usually still ask for "AAAA" records so that it can get an address and try it. Also, whether or not the DNS server has an IPv6 connection is irrelevant - if AAAA records exist it is required to reply with them, and if they don't it's required to reply saying they don't; the LEA DNS server was doing neither.
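To make the difference between "there are no such records" and "something went wrong" a bit more concrete, here is a minimal Python sketch that classifies a DNS reply from its 12-byte header. The function name and the outcome labels are mine, purely for illustration; the header layout and response codes come from RFC 1035. It is a simplification: it treats any answer record as an answer, whereas a real resolver would also check what type of record came back.

```python
import struct

# Response codes from the DNS header (RFC 1035, section 4.1.1).
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def classify_reply(reply: bytes) -> str:
    """Classify a raw DNS reply using only its 12-byte header."""
    if len(reply) < 12:
        return "malformed"
    # Header: ID, flags, then question/answer/authority/additional counts.
    _ident, flags, _qd, ancount, _ns, _ar = struct.unpack("!6H", reply[:12])
    rcode = flags & 0x000F  # the response code is the low four bits of the flags
    if rcode != 0:
        return RCODE_NAMES.get(rcode, f"RCODE {rcode}")
    # NOERROR with zero answers is the "name exists, no such records" reply.
    return "answer" if ancount > 0 else "no data"


# Hand-built headers standing in for real replies (0x8180 = response, NOERROR).
print(classify_reply(struct.pack("!6H", 1, 0x8180, 1, 2, 0, 0)))  # answer
print(classify_reply(struct.pack("!6H", 1, 0x8180, 1, 0, 1, 0)))  # no data
print(classify_reply(struct.pack("!6H", 1, 0x8182, 1, 0, 0, 0)))  # SERVFAIL
```

The first two outcomes are both perfectly good replies from the client's point of view. A failure code such as SERVFAIL, or no reply at all, is the case a client can't do much with.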
The LEA were informed that their DNS server was breaking when queried for the AAAA records, and they replied saying they had pinged a few things and done some nslookups (presumably only for the A record!) and they couldn't see a problem.
We ran more tests - looking up the A records worked fine (successful response in 9 milliseconds), looking up the AAAA records failed (failure response in 17 seconds) and looking up the AAAA records through a different DNS server worked fine (successful response in 23 milliseconds). We even gave them transcripts of the tests so that they would know exactly how we tested it and would be able to reproduce it themselves.
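For anyone wanting to run this sort of comparison without any special tools, the following is a rough Python sketch of a timed lookup. It is not the transcript we sent; the function names, the 20 second timeout and the addresses in the usage note are all illustrative assumptions. It builds a bare-bones DNS query by hand, sends it over UDP to a named server, and reports the response code and how long the reply took.

```python
import socket
import struct
import time

QTYPE_A = 1      # IPv4 address records
QTYPE_AAAA = 28  # IPv6 address records


def build_query(name: str, qtype: int, ident: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet for one name and record type."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!6H", ident, 0x0100, 1, 0, 0, 0)
    # The name is encoded as length-prefixed labels ending in a zero byte.
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!2H", qtype, 1)  # class IN


def timed_query(server: str, name: str, qtype: int,
                timeout: float = 20.0, port: int = 53):
    """Return (response code, or None if nothing usable arrived; seconds taken)."""
    packet = build_query(name, qtype)
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server, port))
            reply, _addr = sock.recvfrom(4096)
        except OSError:  # covers timeouts as well as network errors
            return None, time.monotonic() - start
    elapsed = time.monotonic() - start
    if len(reply) < 12:
        return None, elapsed
    # A fuller version would also check that the reply ID matches the query.
    flags = struct.unpack("!H", reply[2:4])[0]
    return flags & 0x000F, elapsed
```

Calling `timed_query("192.0.2.53", "www.example.com", QTYPE_A)`, then the same with `QTYPE_AAAA`, then the AAAA lookup again against a second server gives the same three-way comparison (192.0.2.53 is a documentation placeholder address, not the LEA's server). A response code of 0 means success and `None` means nothing usable came back before the timeout. The A and AAAA queries differ only in the record type field and both go to the same port.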
The LEA responded with words to the effect of "well no one else has reported this problem", so we ran the same tests from a different school within the same area and demonstrated that they had the same issues. Again, we sent transcripts of the tests to the LEA (* see footnote).
The LEA then started asking whether the school was using a transparent proxy and what the school's internal domain name is - none of this is relevant to the problem being reported. We weren't reporting problems with the transparent proxy, or any of the school's internal servers, we were specifically reporting a problem with the LEA's DNS server.
We did some further investigation and got more detail on which DNS lookups were failing, and sent this to the LEA together with more transcripts of tests and an offer to work with them to help. Rather than asking for our help, the LEA closed the ticket as "resolved", but provided no explanation. We reran the tests and sent them another transcript demonstrating that nothing had been fixed.
The problem was originally reported at the start of the summer holidays. Two months later the new term started - still the problems weren't fixed, still the LEA hadn't taken us up on our offers to help them (for free!) and now it transpires a lot more domains are affected than we originally investigated. It's causing really serious problems for the school, so the school started banging heads together and someone from the LEA actually called us. I explain the problem yet again and he goes off saying he needs to look up some more information.
Then they start talking about transparent proxying again, and again I have to point out that we are reporting a problem with the DNS server and that this has nothing to do with the transparent proxy. Again, I send them an email describing the problem, providing transcripts of tests, etc. LEA techie tells the school that I didn't send any information and that I just forwarded his email back to him - I'm a bit stunned about this since it means that (1) he has never seen an email with inline comments before, and (2) he didn't read past the first line of the email. So the email gets resent to him.
The LEA reply with some screenshots of some tests they have done which they say show that there's no problem:
- They logged into the leased line router and queried the network interface statistics that show no line errors.
- They pinged a few machines.
- They tracerouted to somewhere.
The LEA suggests that this is happening because they don't provide IPv6 connectivity (as mentioned above, whether or not IPv6 is available doesn't actually change anything from a DNS perspective - clients still look up AAAA records and DNS servers are still expected to reply).
Now they say they've poked lots of holes in their firewall because they "have no information on what port AAAA records would be using" (errm, 53, the same as every other DNS request in the world?!) and could we retest - unsurprisingly it's still broken.
As far as I can see:
- They haven't actually run the tests (which we've told them how to run!) to try and reproduce the problem. They've tested a few other things that were never a problem to begin with.
- They don't understand enough about DNS (which is an extremely fundamental internet protocol) to diagnose the issues - they seem to have entered a "change something at random and see if it fixes it" phase instead of trying to get to the root of the problem.
- They are completely out of their depth - if they want to run a reliable WAN, they need someone who is actually qualified to administer a network. That means someone who understands how to reproduce problems, use debugging tools such as Wireshark, etc.
- They haven't handled this in a timely way at all - they had the whole of summer to investigate, and didn't actually start looking at anything in earnest until after the start of term.
I have spent literally hours on this problem, mostly repeating the same explanations and tests over and over (although strictly speaking this isn't "our problem", diagnosing and liaising with the LEA is something we're handling as part of the customer's advanced support contract, so we're not really being paid by the LEA for this level of hand-holding). I honestly can't see them resolving this problem until they reproduce it themselves and do some proper diagnostics.
Footnote
As mentioned, part of the LEA's defense is basically "no one else has reported a problem" - now, not looking into a problem because it isn't affecting many people is a pretty crummy attitude to begin with, but there are reasons why some people would be affected and some not.
Fundamentally, how services such as DNS are expected to behave is defined by standards. These boil down to rules like "when a client sends a request like this, the server must send a response like that". Software that relies on these services is written to expect them to follow the rules laid out by the standards, and there is no standard set of rules saying how to handle a service that is breaking the rules - it is extremely difficult to draw up a standard explaining how to deal with something breaking the standards, simply because there are so many ways the standards could be broken!
So you may have two different pieces of software that do basically the same job, call them A and B. In an environment where everything is sticking to the rules, they both work equally well since this behaviour is standardised. However, if some service isn't sticking to the standards then they will often handle this differently - maybe software A still works fine, but software B breaks. In a different situation the roles may be reversed, with software B working ok.
So it's possible for a real problem, such as this, to go unreported simply because a lot of people happen to be using software that, by chance, isn't badly affected by the broken service. It's also possible for problems to go unreported because people write off the problem as "software A is broken" and so don't report the issue to the operator of the broken server.
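To illustrate how two pieces of software could fare so differently against the same slow AAAA responses, here is a toy Python model. Both clients are hypothetical and don't describe any particular application: one waits for the AAAA reply before asking for the A record, the other asks for both at once and only gives the AAAA reply a short grace period once the A reply is in (loosely in the spirit of the "Happy Eyeballs" approach). The 50 millisecond grace period is an assumption; the reply times are borrowed from the tests described earlier.

```python
def sequential_client(a_secs: float, aaaa_secs: float) -> float:
    """Hypothetical software A: waits for the AAAA reply, then asks for A."""
    return aaaa_secs + a_secs


def parallel_client(a_secs: float, aaaa_secs: float, grace: float = 0.05) -> float:
    """Hypothetical software B: asks for both at once and, once the A reply
    is in, waits at most `grace` seconds more for the AAAA reply."""
    both_done = max(a_secs, aaaa_secs)
    return min(both_done, a_secs + grace)


# Reply times in seconds, borrowed from the tests described earlier.
healthy = {"a_secs": 0.009, "aaaa_secs": 0.023}
broken = {"a_secs": 0.009, "aaaa_secs": 17.0}

for label, times in (("healthy", healthy), ("broken", broken)):
    print(f"{label}: sequential {sequential_client(**times):.3f}s, "
          f"parallel {parallel_client(**times):.3f}s")
```

Against a well-behaved server both models are ready to connect within a few tens of milliseconds, so nobody would notice a difference. Against the slow server the first one stalls for around 17 seconds per lookup while the second barely notices, which is exactly the sort of situation where only some users would ever report a problem.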