We had previously identified a number of areas that could have been acting as performance bottlenecks. Among the planned improvements were a reduction in the number of memory copies, by replacing linear buffers with ring buffers (using mmap() tricks to present the rings as linear memory), and the replacement of some of the content analysis code.
However, in testing, we found that the CPU didn't seem to be the limiting factor. Even when going flat-out, there was plenty of spare CPU time, and we still weren't seeing the performance we expected. This was puzzling - if inefficient code were the culprit, we'd expect to see all of the CPUs pegged.
Eventually this was narrowed down to a locking bug. The software is multithreaded - this means that there are effectively multiple copies (threads) of the program running at the same time, all accessing the same data in memory. The data is "locked" while a thread is accessing it so that another thread doesn't come along and change it.
It's safe to have lots of threads reading a piece of data at the same time, but we definitely don't want a thread to change that data while other threads are accessing it. So threads can take either a "nonexclusive lock" or an "exclusive lock" - multiple threads can hold nonexclusive locks at the same time, but if a thread holds an exclusive lock, no other thread can acquire a lock (either exclusive or nonexclusive).
So when a thread asks for an exclusive lock, two things have to happen:
- It has to wait for all of the other locks (exclusive and nonexclusive) to be released.
- Other threads must be prevented from acquiring new locks while it waits.
The problem arose when multiple threads were waiting for exclusive locks at the same time - they would all set the "exclusive" flag, but the first one to successfully get the lock would clear it again, even though other threads were still waiting for exclusive locks. Nonexclusive locks were therefore no longer inhibited, and the threads trying to get exclusive locks would be competing with them. The result was that it could take an extremely long time (frequently hundreds of milliseconds) to acquire an exclusive lock!
The fix was simple - the "exclusive" flag was replaced with a counter, which is incremented when a thread is waiting for an exclusive lock and decremented again when it acquires the lock.
This fix has been rolled out to a number of customers and the user experience improvement has been striking - not only can significantly higher throughput be achieved, but the responsiveness of websites is noticeably better, even at low throughputs.
Going forward
We've already done a lot of the content analysis code improvements, although there are still some more to come. The main task now is replacing the linear buffers with ring buffers, which will reduce the amount of memory copying needed. This involves a rip-out-and-replace job on the ICAP protocol interface and finite state machine. The new code is looking a lot neater and much easier to understand, and it's giving us the opportunity to better optimise memory usage based on what we've learnt in the years since the original was written.

The new work revolves around a neat ring buffer library I wrote a few months back - using mmap() tricks, the buffer is presented to the rest of the application as linear memory. Not having to worry about data crossing the start/end of the ring simplifies things a lot, and the library has already been used in anger elsewhere in Iceni, so it can be treated as well tested.