Sunday, 24 July 2016

Wifi doorbell project

We've got a wireless doorbell which is fairly unreliable.  Also, I can't hear it from my office, and the 433MHz transmitter built into the doorbell button doesn't have enough range for me to put a sounder in the office.  The unreliability could probably be fixed by just buying a new wireless doorbell, but that wouldn't help with the range problem.

The advent of the ESP8266 microcontrollers opened up a more interesting approach.  These are programmable microcontrollers with a built-in wifi interface.  They are tiny and cost only about a pound each.  So my plan was: build a new doorbell push from scratch which talks to the wifi network.  The old doorbell has an electromechanical sounder, which I'm reusing by replacing the driver circuit.

Doorbell push

ESP-01 on stripboard
The first job was the doorbell push.  I'm using an ESP-01 module for this because it's tiny, and it's running off a pair of AAA cells, which will provide a 3.0v supply when new.  In theory, I could use a buck-boost regulator to maintain a 3.3v supply to the ESP throughout the life of the batteries, but the quiescent current draw of the regulator would probably drain the battery fairly quickly since this circuit is going to spend most of its life asleep.  Although the ESP8266 is supposed to run on 3.3v, it's apparently ok down to around 2.3v, so I'm running it directly off the batteries with no regulator.  The ESP8266's wifi radio has some reasonably high power demands, so there's a 100µF capacitor across the supply rails to absorb any high current bursts which the batteries may not be up to supplying.

The theory of operation is as follows: the ESP-01 module is put in deep sleep mode when idle.  The button is wired between the ESP's reset pin and ground, so when someone pushes the button to ring the bell, it pulls reset low and wakes the device from deep sleep.  The ESP-01 has an on-board power LED which is always on and would drain the batteries quite quickly, so I've cut the track supplying power to that LED - the module now draws 20µA in deep sleep mode, so I expect the batteries to last a while.  When the ESP resets, it immediately associates with the wifi network and makes two HTTP requests - one directly to the sounder (more on this later) and the other to my home server.  A PHP script on the server sends out "doorbell rang" instant messages to both myself and Mel via XMPP - this solves the problem of hearing the bell in the office, since the XMPP notification pops up on my workstation and phone.
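
The firmware only needs to be a few dozen lines.  Something along these lines (a minimal sketch using the Arduino core for the ESP8266 - the SSID, hostnames and PHP endpoint here are placeholders, not the real ones):

// Minimal sketch of the push's firmware (Arduino core for the ESP8266).
// SSID, passphrase, hostnames and paths are placeholders, not the real ones.
#include <ESP8266WiFi.h>
#include <ESP8266HTTPClient.h>

ADC_MODE(ADC_VCC);   // let ESP.getVcc() report the supply voltage

void setup() {
  WiFi.begin("my-ssid", "my-passphrase");
  while (WiFi.status() != WL_CONNECTED)
    delay(10);                          // wait for association + DHCP

  String vcc = String(ESP.getVcc());    // battery voltage in mV, sent with each request

  WiFiClient client;
  HTTPClient http;

  // Fire the sounder directly...
  http.begin(client, "http://sounder.local/ring");
  http.GET();
  http.end();

  // ...and tell the home server, which sends the XMPP notifications.
  String url = "http://homeserver.local/doorbell.php?vcc=";
  url += vcc;
  http.begin(client, url);
  http.GET();
  http.end();

  // Back to sleep until the button pulls the reset pin low again.
  ESP.deepSleep(0);
}

void loop() {}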

Everything just fits!
The button I bought contains an LED ring which I'm using to give the person ringing the bell feedback - the ring lights up when the wifi connection has been established (including DHCP, etc.) and flashes after the HTTP requests have successfully completed.  There's a minor complication: the ESP-01 only makes two GPIO pins available, and they both have to be pulled high when the ESP boots up - in fact the ESP-01 module has integrated pullups for this reason.  So the LED is wired between the positive power rail and GPIO-0.  This results in inverted logic - to turn the LED on, GPIO-0 has to be set low; to turn it off, GPIO-0 has to be either high or in high impedance (input) mode.  Once the HTTP requests have completed and the LED has been flashed, the ESP goes back into deep sleep mode.
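
In code, the inverted drive looks something like this (a fragment of my own for illustration - the helper names aren't from the project source):

// GPIO-0 sinks current from the LED ring, so the logic is inverted.
static const int LED_PIN = 0;

void ledOn()  { pinMode(LED_PIN, OUTPUT); digitalWrite(LED_PIN, LOW); } // drive low: LED lights
void ledOff() { pinMode(LED_PIN, INPUT); }  // high impedance; the integrated pull-up keeps the pin high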

The web requests that the button makes include the current battery voltage, so I can figure out when to change the batteries.  It's now been in use for about 15 weeks and the reported battery voltage has fallen from about 3.182v to 3.053v, which I think is pretty reasonable.

The finished doorbell push, on the door frame


Sounder

As mentioned, I'm re-purposing the old electromechanical sounder by replacing the old circuit board and driving the solenoid directly.  The old sounder was powered by 3 C cells.  Unlike the button, the sounder will need to stay connected to the wifi all the time, so batteries aren't really an option.  I've opted to use a USB power supply, and with no need for the batteries there's now a lot of space inside the sounder for my new circuit.

I'm using an ESP-12F in the sounder - it's a bit overkill really, but I wanted an output that didn't have to be pulled up on boot.  The mini-USB port is connected to an LM1117T-3.3 linear regulator to provide the 3.3v supply.  The solenoid is driven by an RFD14N05L field effect transistor, and I've included a fairly chunky capacitor to help with the high power requirements of the solenoid.  Of course there's also a reverse biased diode in parallel with the solenoid to absorb the back-EMF (although no magic smoke was released before I bothered adding it).

The whole thing is soldered up onto stripboard and just pushed into the battery compartment of the sounder (having removed the battery contacts).

The sounder connects to the wifi network as soon as it is powered up and sits there waiting for an HTTP request.  Making a request to "/" returns a status page; requesting "/ring" causes it to fire the striker and make the classic "ding-dong".
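
A minimal sketch of that, again using the Arduino core (the GPIO number and strike timing here are illustrative guesses rather than the real values):

#include <ESP8266WiFi.h>
#include <ESP8266WebServer.h>

const int STRIKER_PIN = 12;          // gate of the MOSFET driving the solenoid (pin number is a guess)
ESP8266WebServer server(80);

void handleStatus() {
  server.send(200, "text/plain", "doorbell sounder: up for " + String(millis() / 1000) + "s");
}

void handleRing() {
  digitalWrite(STRIKER_PIN, HIGH);   // pull the striker in ("ding")
  delay(50);                         // strike time is a guess
  digitalWrite(STRIKER_PIN, LOW);    // release it ("dong")
  server.send(200, "text/plain", "ok");
}

void setup() {
  pinMode(STRIKER_PIN, OUTPUT);
  WiFi.begin("my-ssid", "my-passphrase");
  while (WiFi.status() != WL_CONNECTED)
    delay(10);
  server.on("/", handleStatus);
  server.on("/ring", handleRing);
  server.begin();
}

void loop() {
  server.handleClient();
}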
The old electromechanical sounder with an ESP-12F installed

Problems to be aware of

Receiving a notification on your phone to say the doorbell is being rung when you're a few hundred miles away is infuriating because you know you're going to have to drive up to the postal depot to collect a parcel when you get home. :)

Source code

Source code for the project is available in Subversion:

Friday, 15 July 2016

Adventures in broken apps

It's been a week of frustration, but also success.  We have a perennial problem with broken apps, which work ok on the average home network but hit problems as soon as you try to secure things.  Although there is a good argument for keeping school networks fairly permissive, realistically schools can't just turn off their firewall and let everyone at it - not only would the school be failing in its duty of care, it would be a security nightmare too.

We're frequently tasked with getting some badly behaved app to work on a school network.  Unfortunately, when something works elsewhere but breaks when connected to the school network, the firewall/web filter is often regarded by staff as "broken", even when we can clearly see that the app is the one doing broken things.  Still, we like to keep our customers happy and be as helpful as possible.

We routinely spend a lot of time diagnosing problems and sending debugging information to the app vendors, and there are a few vendors that are thankful for our input and will work with us to improve their software.  However, I think it's fair to say that the vast majority of app vendors are completely uninterested in fixing bugs in their software.  This attitude is unfortunately prevalent across all kinds of suppliers - from small suppliers right up to the likes of Microsoft, Apple and Google.  In fact, we no longer submit bug reports to Apple because collecting data for them uses a huge amount of our engineering time and they have never fixed any of the bugs we've reported.  A good example of this recently was CloudMosa, who responded within 24 hours of our bug report, explaining that they weren't going to fix the bugs that we reported in their Puffin Academy app.  WhatsApp have been similarly unhelpful with problems we reported, stating "WhatsApp is not designed to be used with proxy configurations or restricted networks, and we cannot provide support for these network configurations."  What a cop-out!

So with the app developers refusing to properly support their own software, our customers have nowhere left to turn and it is often down to us to do our best to work around the flaws in the applications.  A lot of this comes down to collecting as much information as possible about each connection so we can automatically turn security features on and off on our systems to work around incompatibilities.

We do things like snooping on TLS handshakes - when a device sets up an encrypted connection, it is supposed to include information such as a "server name indication", and we can spot that and use the presented server name to decide whether or not it's safe to intercept and analyse the encrypted data.  Some apps aren't compatible with interception, so when we see known problem apps we avoid intercepting their connections.  Unfortunately, every so often you find an app that doesn't bother to include this data, and there's no way for the system to know if it's ok to intercept the connection.

In the second half of this week we developed some new code for the web filter to actively call out to the remote server in these situations.  When the web filter sees an encrypted connection that has no server name indication, it can now connect to the web server, retrieve the certificate and use information in it to figure out what to do next.  We're expecting this to help a lot with the problem apps.  The results of each callout are cached to reduce the impact on the web servers.  This is currently going through testing to make sure it won't cause any problems.
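
The callout itself is nothing exotic - it boils down to completing a normal TLS handshake and reading the certificate's subject (and subject alternative names).  In OpenSSL terms it looks roughly like this standalone sketch (not the actual filter code; error handling omitted):

#include <openssl/ssl.h>
#include <openssl/x509.h>
#include <sys/socket.h>
#include <netdb.h>
#include <unistd.h>
#include <cstdio>

// Connect to host:443, complete a TLS handshake and print the certificate's
// subject - roughly what the filter's callout does before deciding whether
// a connection with no SNI is safe to intercept.  Sketch only; no error handling.
int main(int argc, char **argv) {
  const char *host = argc > 1 ? argv[1] : "www.example.com";

  addrinfo hints{}, *res = nullptr;
  hints.ai_socktype = SOCK_STREAM;
  getaddrinfo(host, "443", &hints, &res);
  int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
  connect(fd, res->ai_addr, res->ai_addrlen);
  freeaddrinfo(res);

  SSL_CTX *ctx = SSL_CTX_new(TLS_client_method());
  SSL *ssl = SSL_new(ctx);
  SSL_set_fd(ssl, fd);
  SSL_connect(ssl);

  X509 *cert = SSL_get_peer_certificate(ssl);
  char subject[256];
  X509_NAME_oneline(X509_get_subject_name(cert), subject, sizeof(subject));
  printf("subject: %s\n", subject);   // in practice the SANs get checked too

  X509_free(cert);
  SSL_shutdown(ssl);
  SSL_free(ssl);
  SSL_CTX_free(ctx);
  close(fd);
  return 0;
}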

Another perennial frustration is Skype - it has always been a real pain to make work reliably and securely.  We've spent a lot of time this week poring over network traffic dumps and testing.  There are numerous problems with the Skype protocol, which boil down to:
  1. It makes connections on TCP port 443 (which therefore look like HTTPS), but which aren't actually HTTPS, or even TLS.  These connections can go to any IP address, so we can't trivially poke holes in the firewall for them.  They get picked up by the transparent proxy, which treats them as encrypted HTTPS connections and therefore fails to handle them, since they aren't actually HTTPS.
  2. It makes real HTTPS connections carrying WebSockets requests.  Unfortunately we don't yet support WebSockets and as Skype doesn't bother to include a server name indication we can't pre-emptively decide not to intercept them.
  3. It sends peer-to-peer voice and video media over UDP using any port numbers between 1024 and 65535.  Since it's peer-to-peer, this traffic can be directed at any IP address on the internet.  Official advice is to just allow that through your firewall - if you do that you may as well not even bother to have a firewall in the first place!
  4. All of Skype's traffic is encrypted so it's almost impossible to figure out what it's actually trying to do and what went wrong when it fails.
  5. If something goes wrong, Skype just breaks in one way or another and provides no indication of what actually went wrong.  The Android version of Skype could output some debugging data to Android's standard debugging log, but it doesn't.  The PC version of Skype can be told to produce a debug log, but the log is encrypted so that only Microsoft developers can read it (gee, thanks for nothing, Microsoft!).
Fortunately, it turns out that the not-HTTPS-that-looks-like-HTTPS traffic (1) isn't needed if Skype can successfully set up the peer-to-peer UDP connections (3), so we think we can ignore that problem.

It doesn't actually seem to matter too much if the WebSockets connections (2) fail, and in any case they should be handled by the web filter's new TLS callout system described above.

So we're left with the UDP traffic, which can go to any IP address on any port (3).  This one is a real problem - blindly allowing all of this traffic would also allow a whole load of other stuff such as VPNs, games, etc.  So we've been playing with the nDPI deep packet inspection library and nDPI-Netfilter.

Normally, firewalling is done based on just the information in the packet headers, such as the source and destination addresses.  Deep packet inspection examines all of the data associated with the connection, including the payload, in an attempt to identify what protocol is being used.  We seem to have got this working pretty reliably now.  The sticking point is that the deep packet inspection system needs to see a few packets before it can identify the protocol - usually you'd allow or refuse the connection immediately, but for DPI to work you have to allow all connections for a while and then terminate any that you don't want to allow.  We're finding that allowing the first 10 kilobytes seems to work reasonably well - after that we chop any connections that haven't been identified as Skype.
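
The bookkeeping behind that is conceptually simple.  As an illustration only (the real rules live in nDPI-Netfilter, not in code like this), the per-connection decision looks something like:

#include <cstdint>

enum class Verdict { Accept, KeepWatching, Drop };

struct ConnState {
  uint64_t bytes_seen = 0;
  bool identified_as_skype = false;   // set once the DPI engine recognises the protocol
};

// Called for every packet on an otherwise-unmatched UDP flow.
Verdict classify(ConnState &c, uint32_t packet_len, bool dpi_says_skype) {
  c.bytes_seen += packet_len;
  if (dpi_says_skype)
    c.identified_as_skype = true;

  if (c.identified_as_skype)
    return Verdict::Accept;           // known Skype media - let it flow
  if (c.bytes_seen < 10 * 1024)
    return Verdict::KeepWatching;     // give DPI up to ~10KB to make up its mind
  return Verdict::Drop;               // never identified - chop the connection
}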

Of course, all this was massively complicated by the fact that, unbeknownst to us, Skype had a bug which made video unreliable - we found that out on Wednesday when Microsoft released a new version to address the problem.  But not before we had spent a lot of time trying to figure out what was going wrong (did I mention that Skype problems are almost impossible to debug, because absolutely everything, including the debug log, is encrypted so you can't examine it?).

The original intention was to implement deep packet inspection in the new firewall system which we are developing, but by popular demand we've backported this to the existing firewall.  There is currently no user interface to set up the Skype DPI rules, but we can manually set them up for customers on demand for the time being.

Anyway, a moderately successful week - we're still testing the Skype rules, but they should be available Real Soon Now™.

Wednesday, 18 May 2016

Queen's Speech

I'm reading through the BBC's summary of the Queen's speech.  I have to say a lot of this seems a bit ill thought out...

Digital Economy Bill (UK-wide)

  • Every UK household will have legal right to a fast broadband connection
  • Minimum speed of 10Mbps to be guaranteed through Broadband Universal Service Obligation
  • Properties in the "remotest areas" may have to contribute to cost of installation
Starts off sounding ok, but then we get to the last point and I wonder how this is much different from the current situation.  Telcos are already building out fast broadband in non-remote areas, and if you're in a remote area you can already pay (sometimes a lot) to have fast broadband installed.

I wonder whether they are putting cost restrictions on that or accelerating the time scales, because on the surface this doesn't seem to change much...
  • Right to automatic compensation when broadband service goes down
Great, but who's going to be made to pay the compensation?  If ISPs utilising BT Wholesale's infrastructure are expected to hand out compensation for BT's problems, that seems grossly unfair unless BT are made to reimburse those ISPs.  BT needs an incentive to fix problems with their network, rather than the (often quite small) ISPs who rely on it being punished.
  • Companies must get consent before sending promotional spam emails, with fines for transgressors
I really don't understand this one - we have already had exactly this law for 13 years, in the form of the Privacy and Electronic Communications (EC Directive) Regulations 2003 ("PECR").  The problem is that the regulator rarely does much to enforce the law.  Passing a new law that basically says exactly what an existing law already says isn't going to help anything - either the regulator needs to be incentivised to take action against companies who are in breach of the regulations, or the recipients of spam need to be empowered to sue the spammers themselves.
The current situation is that recipients of spam who sue spammers have to argue that the spam has caused them a material loss which must be reimbursed rather than being able to impose a punitive fine.  Spammers will often argue in court that the financial cost of a single spam is trivial and the courts have sometimes agreed and let them off the hook.  This could easily be fixed by implementing legislation that awards the recipient a fixed amount per spam.
  • All websites containing pornographic images to require age verification for access
This strikes me as a waste of time and will end up much like the pointless cookie law.  Any website with user-submitted content (i.e. anything that *could* be pornographic) will implement an annoying popup age verification page.  And of course, anyone under age will just click through it anyway.

Education for All Bill (Mainly England only)

  • Powers to convert under-performing schools in "unviable" local authorities to academies
  • Goal of making every school an academy but no compulsion to do so
...Despite there being no evidence that academies do any better than other schools...

Counter-Extremism and Safeguarding Bill (England and Wales)

 We can't have an official government announcement without the usual "scare everyone shitless so they let us do what we like" section, but...
  • Ofcom to have power to regulate internet-streamed material from outside EU
How's that going to work?  Are Ofcom going to be able to force ISPs to censor traffic without a court order?  Sounds concerning.  To be clear, ISPs are already required to censor stuff if a court tells them to.  So this just sounds like it's designed to short-circuit due process, justified with "otherwise the terrorists will kill you all" and "think of the children" as usual.

Intellectual Property Bill (UK-wide)

  • Exempting lawyers and other professional advisers from liability for threatening legal action in certain cases
Allowing people to bully folks with empty threats sounds like a bad plan to me - we already see this stuff time and time again with US patent law, so removing penalties for doing so in the UK doesn't seem smart.  Call me cynical, but maybe this is because the government is run by lawyers?

Investigatory Powers Bill

  • Overhaul of laws governing how the state gathers and retains private communications or other forms of data to combat crime
  • Broadband and mobile phone providers will be compelled to hold a year's worth of communications data
  • Creation of new Investigatory Powers Commissioner
I'm not sure anything more needs to be said on this - the IP bill has been in discussion for ages and basically works by treating everyone in the UK as a criminal and spying on everything they do, on the off-chance that a crime needs to be pinned on them later on.  The government has received an overwhelming amount of evidence from experts explaining why the bill is largely unworkable, will massively erode the liberties of law-abiding citizens and undermine the British software industry, whilst doing precisely nothing to help the authorities stop criminals.  But the evidence has been completely ignored and the bill steam-rollered through regardless, on a fast track designed to reduce debate.

Bill of Rights (Subject to consultation)

  • Plans for a British bill of rights to replace the Human Rights Act will be published in "due course" and subject to consultation
Replacing the Human Rights Act just scares the crap out of me.  The only reason for doing so is if you want to avoid giving some people their human rights...

Friday, 22 January 2016

Performance improvements

I've been doing quite a lot of work on improving the performance of our Iceni web filter.  An increasing number of schools are now getting internet connections exceeding 100Mbps, and there was a noticeable drop-off of performance at higher throughputs.

We had previously identified a number of areas which could have been acting as performance bottlenecks - amongst these were reducing the number of memory copies by replacing linear buffers with ring buffers (using mmap() tricks to present the rings as linear memory), and replacing some of the content analysis code.

However, in testing, we found that the CPU didn't seem to be the limiting factor.  Even when going flat-out, there was plenty of spare CPU time, and we just weren't seeing the throughput we expected.  This was surprising - we had thought that performance was being harmed by inefficiencies, but if that were the case, we'd expect to see all of the CPUs pegged.

Eventually this was narrowed down to a locking bug.  The software is multithreaded - this means that there are effectively multiple copies (threads) of the program running at the same time, all accessing the same data in memory.  The data is "locked" while a thread is accessing it so that another thread doesn't come along and change it.

It's safe to have lots of threads reading a piece of data at the same time, but we definitely don't want a thread to change that data while other threads are accessing it.  So threads can take either a "nonexclusive lock" or an "exclusive lock" - multiple threads can hold nonexclusive locks at the same time, but while a thread holds an exclusive lock, no other thread can acquire a lock (either exclusive or nonexclusive).

So when a thread asks for an exclusive lock, two things have to happen:
  1. It has to wait for all of the other locks (exclusive and nonexclusive) to be released.
  2. No new locks must be acquired by other threads while it waits.
In our code, when a thread is waiting for an exclusive lock, it sets an "exclusive" flag, and once the lock has been acquired, that flag is cleared.  Threads trying to acquire nonexclusive locks check this flag, so new nonexclusive locks are inhibited while an exclusive lock is pending.

The problem arose when multiple threads were waiting for exclusive locks at the same time - they would all set the "exclusive" flag, but the first one to successfully get the lock would clear it again, even though other threads were still waiting for exclusive locks.  Nonexclusive locks were therefore no longer inhibited, and the threads trying to get exclusive locks would be competing with them.  The result was that it could take an extremely long time (frequently hundreds of milliseconds) to acquire an exclusive lock!

The fix was simple - the "exclusive" flag was replaced with a counter, which is incremented when a thread is waiting for an exclusive lock and decremented again when it acquires the lock.
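
In outline, the corrected scheme looks something like this (a simplified sketch of the idea, not the filter's actual lock implementation):

#include <condition_variable>
#include <mutex>

// Simplified reader/writer lock illustrating the fix: a count of threads
// waiting for the exclusive lock, instead of a single flag.
class RwLock {
  std::mutex m;
  std::condition_variable cv;
  int readers = 0;            // nonexclusive holders
  int writers_waiting = 0;    // was a boolean "exclusive" flag - that was the bug
  bool writer_active = false;

public:
  void lock_shared() {
    std::unique_lock<std::mutex> lk(m);
    // New nonexclusive locks are inhibited while any writer is waiting or active.
    cv.wait(lk, [&] { return writers_waiting == 0 && !writer_active; });
    ++readers;
  }
  void unlock_shared() {
    std::unique_lock<std::mutex> lk(m);
    if (--readers == 0) cv.notify_all();
  }
  void lock() {
    std::unique_lock<std::mutex> lk(m);
    ++writers_waiting;                       // incremented while waiting...
    cv.wait(lk, [&] { return readers == 0 && !writer_active; });
    --writers_waiting;                       // ...decremented once the lock is acquired
    writer_active = true;
  }
  void unlock() {
    std::unique_lock<std::mutex> lk(m);
    writer_active = false;
    cv.notify_all();
  }
};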

This fix has been rolled out to a number of customers and the improvement in user experience has been striking - not only can significantly higher throughput be achieved, but the responsiveness of websites is noticeably better, even at low throughputs.

Going forward

We've already done a lot of the content analysis code improvements, although there are still some more to come.  The main thing now is replacing the linear buffers with ring buffers, which will reduce the amount of memory copying needed.  This involves a rip-out-and-replace job on the ICAP protocol interface and finite state machine.  The new code is looking a lot neater and much easier to understand, and is giving us the opportunity to better optimise memory usage based on what we've learnt in the years since the original was written.

The new work revolves around a neat ring buffer library I wrote a few months back - using mmap() tricks, the buffer is presented to the rest of the application as linear memory.  Not having to worry about data crossing the start/end of the ring simplifies things a lot, and this library has already been used in anger elsewhere in Iceni, so it can be treated as well tested.
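
The trick is to map the same physical buffer twice, back to back in virtual memory, so that a read or write running off the end of the ring lands seamlessly back at the start.  Roughly (a Linux-specific sketch with error handling omitted - the real library does rather more):

#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Allocate a ring buffer of 'size' bytes (must be a multiple of the page size)
// that appears twice in a row in virtual memory, so wrap-around never needs
// special handling.  memfd_create() needs a recentish kernel/glibc; shm_open()
// or a tmpfs file works just as well.
char *alloc_ring(size_t size) {
  int fd = memfd_create("ring", 0);                 // anonymous backing "file"
  ftruncate(fd, size);

  // Reserve 2*size of address space, then map the same file into both halves.
  char *base = static_cast<char *>(
      mmap(nullptr, 2 * size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
  mmap(base,        size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
  mmap(base + size, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED, fd, 0);
  close(fd);

  // Writing past base+size-1 now shows up at base+0: the ring looks linear.
  return base;
}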

Monday, 21 December 2015

Finding memory over-usage

Memory leaks are a common problem when writing software - this is where the software has asked the operating system (OS) for a lump of memory, but has forgotten to tell the OS when it finishes using that memory.

Tracking down memory leaks can be a real pain, but software such as Valgrind Memcheck helps - Valgrind lets you run the software (very very slowly!) and keeps track of when memory is allocated and freed.  When the software exits, Valgrind gives you a summary of memory that still hasn't been freed.

However, there is another problem very similar to a memory leak: memory that has been allocated, but is kept in use by your software for far longer than necessary.  Eventually the software will release the memory again, so technically this isn't a "leak", but the effect is very similar.  Since the software is holding onto memory allocations for a long time, it may require a lot more of the computer's memory than you would expect.  On shutdown, the software would usually release all of these memory allocations, so the usual method of tracking down leaks (i.e. seeing what hasn't been released at the end) isn't going to work.
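
A contrived example of the kind of thing that causes it (not from our code): a cache that only ever grows, so every entry stays reachable right up until shutdown:

#include <map>
#include <string>
#include <vector>

// Contrived example: every response body ever seen is kept in this cache and
// only released at shutdown.  Valgrind sees no leak at exit, but the process's
// memory usage grows for its whole lifetime.
std::map<std::string, std::vector<char>> response_cache;

void remember(const std::string &url, const std::vector<char> &body) {
  response_cache[url] = body;   // never evicted, never trimmed
}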

But we can still use Valgrind, and rather than wait until the program exits we can get a list of currently allocated memory at any point while the program is running.  Obviously this includes everything that has been allocated, not just the "problem" allocations.  However, if the problem allocations amount to a very significant amount of memory, this tends to stick out like a sore thumb.

First of all, run the program to be debugged inside Valgrind with the --vgdb=full command line argument:
valgrind --vgdb=full --track-fds=yes --tool=memcheck --leak-check=full --show-reachable=yes --leak-resolution=high --num-callers=40 ./some_executable

In another window, we can then fire up the gdb debugger:
gdb some_executable
At the gdb prompt, the following will connect to the running process and obtain the list of allocations:
target remote | vgdb
set logging file /tmp/allocations.txt
set logging on
monitor leak_check full reachable any
The memory allocations will be logged to /tmp/allocations.txt.  You can do this several times while the program is running and then compare the outputs. 

[Edit:  You can use vgdb directly, instead of gdb: "vgdb leak_check full reachable any"]

Wednesday, 2 December 2015

Debugging IPSEC

Today I had occasion to do some IPSEC debugging.  There were multiple tunnels between two endpoints (using Openswan's "leftsubnets={...}" and "rightsubnets={...}" declarations) and one of the tunnels was reported as being dead.  The remaining tunnels were carrying a lot of traffic and couldn't be shut down.

Usually I'd ping some stuff on the other side of the VPN and tcpdump the connection to see if the pings are being encapsulated.  But with multiple tunnels between the same endpoints, just filtering the traffic by source and destination address doesn't really help - you'd end up capturing traffic for all of the tunnels.

Each tunnel has its own pair of identifiers (SPIs) - one for each direction.  So the first thing to do is run "ip xfrm policy show", which gives you something like:
# ip xfrm policy show
src <subnet-A>/24 dst <subnet-B>/24
    dir out priority 2343 ptype main
    tmpl src <gateway-A> dst <gateway-B>
        proto esp reqid 16389 mode tunnel
src <subnet-B>/24 dst <subnet-A>/24
    dir fwd priority 2343 ptype main
    tmpl src <gateway-B> dst <gateway-A>
        proto esp reqid 16389 mode tunnel
src <subnet-B>/24 dst <subnet-A>/24
    dir in priority 2343 ptype main
    tmpl src <gateway-B> dst <gateway-A>
        proto esp reqid 16389 mode tunnel
(The output will actually be much longer if you've got multiple tunnels.)  From this we can extract the reqid for the tunnel we care about - 16389 in this case.

Now we do "ip xfrm state show":
# ip xfrm state show
src <gateway-A> dst <gateway-B>
    proto esp spi 0x01234567 reqid 16389 mode tunnel
    replay-window 32 flag 20
src <gateway-B> dst <gateway-A>
    proto esp spi 0x89abcdef reqid 16389 mode tunnel
    replay-window 32 flag 20

(Again, the output will be much longer than this.)  Now look up the reqid and make a note of the associated SPIs (0x01234567 and 0x89abcdef).

Finally, we can ask tcpdump to show us only the traffic that matches those SPIs (the SPI is the first 32-bit field of the ESP header, which starts at byte 20 of the packet when there are no IP options):
# tcpdump -n -i internet 'ip[20:4] = 0x01234567' or 'ip[20:4] = 0x89abcdef'
In this case I found that the encapsulated traffic was being transmitted ok, but we weren't receiving any from the other end.  It turned out the other side's firewall was dropping their traffic.

Thursday, 17 September 2015

Ranting about LEA Network Administrators

I'm getting increasingly tired of the network administrators at a certain LEA.  I'm going to venture that they aren't really qualified to run the LEA's WAN...

Back at the start of July, one of our customers reported that an application was intermittently extremely slow or failed completely.  Originally the customer thought that it was a firewalling problem, but we identified a DNS problem as the cause - the LEA's DNS server was taking a few orders of magnitude longer than you'd expect to respond to AAAA record lookups for two domains used by the app, and eventually responding with a failure.

A quick explanation is probably in order here: when a client needs to connect to a web server on the internet, it has to convert the server's domain name into its numerical IP address(es).  It does this through the Domain Name System (DNS).  Typically a (modern) client requests both "A" records and "AAAA" records from a DNS server - "A" records list the (legacy) IPv4 addresses for the web server, whilst "AAAA" records list its (newer) IPv6 addresses.  A web server may not have both IPv4 and IPv6 addresses, but the DNS server still has to produce a successful response to tell the client this.

Importantly, even if the client only has a legacy IPv4 internet connection, it may not know that it can't contact a server over IPv6 until it actually tries, so it will usually still ask for "AAAA" records so that it can get an address and try it.  Also, whether or not the DNS server itself has an IPv6 connection is irrelevant - if AAAA records exist, it is required to reply with them, and if they don't, it's required to reply saying they don't; the LEA's DNS server was doing neither.
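
This is easy to demonstrate with a small test program of my own (nothing to do with the affected app) that times an A lookup and an AAAA lookup for the same name through the system resolver; the hostname below is a placeholder:

#include <netdb.h>
#include <sys/socket.h>
#include <chrono>
#include <cstdio>

// Time an A lookup and an AAAA lookup for the same name, roughly reproducing
// what the affected app does when it connects.
static void timed_lookup(const char *host, int family, const char *label) {
  addrinfo hints{}, *res = nullptr;
  hints.ai_family = family;          // AF_INET asks for A records, AF_INET6 for AAAA
  hints.ai_socktype = SOCK_STREAM;

  auto start = std::chrono::steady_clock::now();
  int rc = getaddrinfo(host, "443", &hints, &res);
  auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                std::chrono::steady_clock::now() - start).count();

  printf("%s: %s after %lld ms\n", label, rc == 0 ? "ok" : gai_strerror(rc), (long long)ms);
  if (res) freeaddrinfo(res);
}

int main() {
  timed_lookup("app.example.com", AF_INET,  "A   ");
  timed_lookup("app.example.com", AF_INET6, "AAAA");
}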

The LEA were informed that their DNS server was breaking when queried for the AAAA records; they replied saying they had pinged a few things and done some nslookups (presumably only for the A records!) and they couldn't see a problem.

We ran more tests - looking up the A records worked fine (successful response in 9 milliseconds), looking up the AAAA records failed (failure response in 17 seconds), and looking up the AAAA records through a different DNS server worked fine (successful response in 23 milliseconds).  We even gave them transcripts of the tests so that they would know exactly how we tested it and would be able to reproduce it themselves.

The LEA responded with words to the effect of "well no one else has reported this problem", so we ran the same tests from a different school within the same area and demonstrated that they had the same issues.  Again, we sent transcripts of the tests to the LEA (* see footnote).

The LEA then started asking whether the school was using a transparent proxy and what the school's internal domain name is - none of this is relevant to the problem being reported.  We weren't reporting problems with the transparent proxy, or any of the school's internal servers, we were specifically reporting a problem with the LEA's DNS server.

We did some further investigation and got more detail on which DNS lookups were failing, and sent this to the LEA together with more transcripts of tests and an offer to work with them to help.  Rather than asking for our help, the LEA closed the ticket as "resolved", with no explanation.  We reran the tests and sent them another transcript demonstrating that nothing had been fixed.

The problem was originally reported at the start of the summer holidays.  Two months later the new term started - still the problems weren't fixed, still the LEA hadn't taken us up on our offers to help them (for free!), and now it transpires a lot more domains are affected than we originally investigated.  It's causing really serious problems for the school, so the school started banging heads together and someone from the LEA actually called us.  I explain the problem yet again and he goes off saying he needs to look up some more information.

Then they start talking about transparent proxying again, and again I have to point out that we are reporting a problem with the DNS server and that this has nothing to do with the transparent proxy.  Again, I send them an email describing the problem, providing transcripts of tests, etc.  LEA techie tells the school that I didn't send any information and that I just forwarded his email back to him - I'm a bit stunned about this since it means that (1) he has never seen an email with inline comments before, and (2) he didn't read past the first line of the email.  So the email gets resent to him.

The LEA reply with some screenshots of some tests they have done which they say show that there's no problem:
  • They logged into the leased line router and queried the network interface statistics that show no line errors.
  • They pinged a few machines.
  • They tracerouted to somewhere.
i.e. they didn't test the thing we actually reported being faulty.

The LEA suggests that this is happening because they don't provide IPv6 connectivity (as mentioned above, whether or not IPv6 is available doesn't actually change anything from a DNS perspective - clients still look up AAAA records and DNS servers are still expected to reply).

Now they say they've poked lots of holes in their firewall because they "have no information on what port AAAA records would be using" (errm, 53, the same as every other DNS request in the world?!) and could we retest - unsurprisingly it's still broken.

As far as I can see:
  1. They haven't actually run the tests (which we've told them how to run!) to try and reproduce the problem.  They've tested a few other things that were never a problem to begin with.
  2. They don't understand enough about DNS (which is an extremely fundamental internet protocol) to diagnose the issues - they seem to have entered a "change something at random and see if it fixes it" phase instead of trying to get to the root of the problem.
  3. They are completely out of their depth - if they want to run a reliable WAN, they need someone who is actually qualified to administer a network.  That means someone who understands how to reproduce problems, use debugging tools such as Wireshark, etc.
  4. They haven't handled this in a timely way at all - they had the whole of summer to investigate, and didn't actually start looking at anything in earnest until after the start of term.

I have spent literally hours on this problem, mostly repeating the same explanations and tests over and over (although strictly speaking this isn't "our problem", diagnosing and liaising with the LEA is something we're handling as part of the customer's advanced support contract, so we're not really being paid by the LEA for this level of hand-holding).  I honestly can't see them resolving this problem until they reproduce it themselves and do some proper diagnostics.


As mentioned, part of the LEA's defence is basically "no one else has reported a problem" - now, not looking into a problem because it isn't affecting many people is a pretty crummy attitude to begin with, but there are also reasons why some people would be affected and some not.

Fundamentally, how services such as DNS are expected to behave is defined by standards.  These boil down to rules like "when a client sends a request like this, the server must send a response like that".  Software that relies on these services is written to expect them to follow the rules laid out by the standards, and there is no standard set of rules saying how to handle a service that is breaking the rules - it is extremely difficult to draw up a standard explaining how to deal with something breaking the standards, simply because there are so many ways the standards could be broken!

So you may have two different pieces of software that do basically the same job, call them A and B.  In an environment where everything is sticking to the rules, they both work equally well since this behaviour is standardised.  However, if some service isn't sticking to the standards then they will often handle this differently - maybe software A still works fine, but software B breaks.  In a different situation the roles may be reversed, with software B working ok.

So it's possible for a real problem, such as this, to go unreported simply because a lot of people happen to be using software that, by chance, isn't badly affected by the broken service.  It's also possible for problems to go unreported because people write off the problem as "software A is broken" and so don't report the issue to the operator of the broken server.