Thursday 11 September 2014

Diagnosing Sharepoint Breakage

Every so often you get a proper puzzle to solve, and this morning was one of those times. One of our customers reported that they were unable to contact the Microsoft Sharepoint servers through their proxy server. A quick test on my test system confirmed the same issue, so we spent about 3 hours delving right into the nitty gritty to figure out what was going on.

The proxy was reporting "connection reset by peer" during the TLS handshake.  TLS (Transport Layer Security) is the cryptographic protocol used to secure HTTPS web sites, and TLS problems tend to be a pain since the OpenSSL library usually doesn't give especially verbose error messages.  It was clear this wasn't going to be a trivial problem to solve, so we immediately disabled HTTPS interception for the Sharepoint site to get it up and running again.  The customer confirmed that this had resolved the issue, which took the pressure off a bit but raised a question: why does it work when the browser negotiates the encryption, but not when the proxy does?

The first port of call was to capture some network traffic and load it into Wireshark for analysis.  This showed that the proxy was sending a TLS "Client Hello" handshake and the server was returning a TCP ACK, but no TLS response.  30 seconds later the server tore down the connection with a TCP RST.  The ACK confirms that the server got the "Client Hello", and you'd usually expect the response to be sent in the same packet as the ACK, so it looked like the packet wasn't being dropped by intermediate network hops - the server simply never sent a handshake response.

Time to make things simpler - instead of using the proxy server, let's ask OpenSSL to connect directly:
openssl s_client -showcerts -connect 157.55.229.87:443
This failed in the same way when we tried it on the test server, but succeeded when run on my Fedora workstation.  Comparing the network traffic between the working and non-working tests showed that the most obvious difference was that the non-working handshake presented a few more ciphers for the server to choose from - maybe one of those extra ciphers was confusing the Sharepoint server.

We tried adjusting the list of cipher suites, but each time we tried we found that the request succeeded and we couldn't pin down anything specific that would break it.  We needed to start with the broken handshake and edit it bit by bit until it started working - that would let us figure out specifically what needed to change to make it work.

So we took the captured network traffic and dumped it out as hex:
tcpdump -r capture.pcap -x > capture.hex
We're not interested in the TCP layer stuff, so the first three packets can be ignored (SYN, SYN ACK, ACK) - these are the normal TCP three-way handshake.  The next packet contains the "Client Hello" which we're interested in, but it also contains the Ethernet, IP and TCP headers.  Using Wireshark it's trivial to identify the start of the payload, and we just trimmed everything before that off the hex dump.

Now to replay it and make sure it still fails:
(sed -e 's/#.*$//' capture.hex | xxd -r -p ; sleep 5) | nc 157.55.229.87 443
The sed bit at the start just strips off anything after a # so we can put comments in the hex file.  xxd converts it back into binary and we used nc to connect to the web server and send the data.
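
To illustrate the commented hex file format without touching the network, here is a minimal sketch.  The function names clean_hex and octet_count are invented for this example, and the two sample lines are a made-up fragment rather than the real capture.

```bash
#!/usr/bin/env bash
# Strip "#" comments and all whitespace from an annotated hex dump read on
# stdin, leaving one contiguous string of hex digits.
clean_hex() {
    sed -e 's/#.*$//' | tr -d ' \t\n'
}

# Count the octets described by an annotated hex dump read on stdin.
octet_count() {
    local hex
    hex=$(clean_hex)
    echo $(( ${#hex} / 2 ))
}

# A made-up two-line fragment in the same style as our capture.hex.
printf '%s\n' \
    '16 03 01   # handshake record header' \
    '01 02      # record length: 258' | clean_hex
echo
# prints 1603010102
```

Counting octets before and after each edit is a cheap way to catch a hex digit that was deleted by accident.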

We checked the traffic in Wireshark - it all looked as expected and the web server still didn't respond.  So far so good.

Again, using Wireshark we can identify the various parts of the packet, and set about modifying them.  Of interest are four headers indicating the length of various sections - the TLS Record Layer has an overall length header, within that there is the "Client Hello" data which has its own length header, and within the "Client Hello" are a cipher suite list and an extension list, which again have their own headers indicating their respective lengths.  The record, cipher suite list and extension list length headers are each 16 bits long, so can contain a value of up to 65535; the "Client Hello" handshake length header is 24 bits long.
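
Wireshark shows these fields directly, but they can also be read straight out of the hex.  The following is a sketch only: hex_field and client_hello_lengths are names made up here, the input is assumed to be a single TLS record holding a complete "Client Hello" as one contiguous hex string, and the sample in the usage note is synthetic, not our capture.  Per the TLS specification the handshake length field is three octets wide, while the other three are two octets each.

```bash
#!/usr/bin/env bash
# Read `count` octets starting at octet `offset` of a hex string and print
# them as a decimal number (big-endian, as TLS length fields are).
hex_field() {
    local hex=$1 offset=$2 count=$3
    echo $(( 16#${hex:offset*2:count*2} ))
}

# Print the four length fields of a TLS record carrying a Client Hello.
client_hello_lengths() {
    local hex=$1
    # Octets 0-4 are the record header, 5-8 the handshake header, 9-10 the
    # client version and 11-42 the random, so the session ID length is octet 43.
    local sid_len cs_pos cs_len comp_pos comp_len ext_pos
    sid_len=$(hex_field "$hex" 43 1)
    cs_pos=$(( 44 + sid_len ))                 # cipher suite list length
    cs_len=$(hex_field "$hex" "$cs_pos" 2)
    comp_pos=$(( cs_pos + 2 + cs_len ))        # compression methods length
    comp_len=$(hex_field "$hex" "$comp_pos" 1)
    ext_pos=$(( comp_pos + 1 + comp_len ))     # extension list length
    echo "record=$(hex_field "$hex" 3 2)"
    echo "handshake=$(hex_field "$hex" 6 3)"
    echo "cipher_suites=$cs_len"
    echo "extensions=$(hex_field "$hex" "$ext_pos" 2)"
}
```

For example, with a tiny synthetic "Client Hello" (all-zero random, two cipher suites, one 5 octet extension):

```bash
sample="1603010036010000320303$(printf '%064d' 0)000004c02f003501000005ff01000100"
client_hello_lengths "$sample"
# record=54
# handshake=50
# cipher_suites=4
# extensions=5
```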

As mentioned, we were interested in the cipher suites - in particular the extra ones that were presented in the broken handshake but not in the working one.  So we set about removing them one by one - each cipher suite is 16 bits long, so removing it involves deleting it from the cipher suite list, and then reducing the cipher suite length, "Client Hello" length and TLS record length headers by 2 each.
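
We did these edits by hand in a text editor, but the same steps could be scripted.  This is a hedged sketch rather than the tool we used: to_hex and remove_cipher_suite are hypothetical names, and the input is assumed to be one TLS record containing a complete "Client Hello", given as a contiguous lowercase hex string.

```bash
#!/usr/bin/env bash
# Print a number as `octets` octets of zero-padded lowercase hex.
to_hex() {
    local value=$1 digits=$(( $2 * 2 ))
    printf "%0${digits}x" "$value"
}

# Remove the first occurrence of one cipher suite (4 hex digits) and reduce
# the record, handshake and cipher suite list lengths by 2 octets each.
remove_cipher_suite() {
    local hex=$1 suite=$2
    local sid_len=$(( 16#${hex:86:2} ))       # session ID length, octet 43
    local cs_at=$(( (44 + sid_len) * 2 ))     # hex offset of the list length
    local cs_len=$(( 16#${hex:cs_at:4} ))     # list length in octets
    local list=${hex:cs_at+4:cs_len*2}
    local rest=${hex:cs_at+4+cs_len*2}        # compression methods, extensions

    # Walk the list one suite at a time so a match can only start on a
    # suite boundary.
    local i kept="" found=0
    for (( i = 0; i < ${#list}; i += 4 )); do
        if [ "$found" -eq 0 ] && [ "${list:i:4}" = "$suite" ]; then
            found=1
        else
            kept+=${list:i:4}
        fi
    done
    if [ "$found" -eq 0 ]; then
        echo "cipher suite $suite not present" >&2
        return 1
    fi

    local rec_len=$(( 16#${hex:6:4} - 2 ))
    local hs_len=$(( 16#${hex:12:6} - 2 ))
    printf '%s%s%s%s%s%s%s%s\n' \
        "${hex:0:6}" "$(to_hex "$rec_len" 2)" \
        "${hex:10:2}" "$(to_hex "$hs_len" 3)" \
        "${hex:18:cs_at-18}" "$(to_hex $(( cs_len - 2 )) 2)" \
        "$kept" "$rest"
}
```

The output can be piped through xxd -r -p and nc in the same way as the hand-edited file.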

Each time we removed a cipher suite, we replayed the data to the server and looked to see what happened.  After removing two cipher suites, the server suddenly started responding with a "Server Hello"!  We put these ciphers back and removed two others to see if it was specifically one of those ciphers confusing the server, but that didn't break anything again - the server was still happy.

The broken handshake that we started out with had a TLS record length of 258 octets and removing two ciphers (16 bits each) reduced it to 254 - a number that will fit in a single octet, whereas 258 requires two octets.  So we tried adding all the ciphers back in and removing one of the records from the extensions list (5 octets) instead.  Again, the server responded and was happy.
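
The arithmetic is easy to check.  The sketch below prints each record length as the two octets of its 16 bit header (length_octets is a name made up for this example); the failing length is the only one of the three with a non-zero high octet.  What the server actually does with that octet is something we can only guess at.

```bash
#!/usr/bin/env bash
# Print a length as the two octets of a 16 bit TLS length header.
length_octets() {
    printf '%02x %02x\n' $(( $1 >> 8 )) $(( $1 & 0xff ))
}

length_octets 258              # original handshake:      01 02
length_octets $(( 258 - 4 ))   # two cipher suites fewer: 00 fe
length_octets $(( 258 - 5 ))   # one 5 octet extension fewer: 00 fd
```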

So there we go.  It looks like Microsoft's Sharepoint server has a bug in it that breaks any client that tries to handshake with a TLS record more than 255 octets long.  Evidently the proxy presents a larger selection of cipher suites to the server than most web browsers, so it works fine from the browser but not from the proxy.

We have contacted Microsoft.  I have no idea whether we've reached the right department, but hopefully it will get passed on to the right people.
