Steve's Blog: September 2013

Thursday, 26 September 2013

Problems with the CBL

We've recently started having a lot of problems with the Composite Blocking List (CBL). This is supposed to be a list of IP addresses that are known to be sending spam emails, so its employed by people running email servers to automatically reject connections from these senders to reduce the volume of spam email they are getting. The contents of the CBL is also aggregated into other third party block lists, such as Spamhaus's Zen list.

Block lists are a pretty fundamental part of most anti-spam systems, and in general they are a good idea. Unfortunately, the way IP addresses are added to the CBL seems to be very questionable to me - they run "honeypot" servers and if you're caught connecting to one of these servers then you get added to the block list. This makes a lot of sense when the honeypot is just detecting people sending spam email. Unfortunately the CBL's honeypot also looks for people making web requests to it, and this is the problem.

I'll give an example of a couple of typical small office networks:
1. An email smarthost and web proxy are operated on the same server. All the workstations are firewalled off from direct internet access, so they have to use the (authenticated) smarthost to send email and the proxy to access the web.
or:
2. Everything on the network sits behind a single router that does NAT. The router is set to firewall off SMTP so the only machine that can send email is the mail server, but the workstations either have unrestricted web access, or go via a proxy server that also sits behind the same router.

In both of these example networks, the outgoing email and the web traffic comes from the same IP address... And I'm sure the problem is immediately obvious: someone plugs a virus-infected machine into the network, which starts making web requests, the IP address ends up on the CBL and suddenly no one in the office can send email. So anyone using the CBL to reject email, is rejecting email from any network that has had a virus infection, irrespective of whether that infection could have actually sent spam. Anyone who has run a network for a while (especially one full of Windows laptops) knows that virus infections happen all the time.

Recently we seem to be having a lot of customers being hit by the ZBot virus, and ending up with all their email being blocked by people using the CBL because of this.

One solution is to move the email traffic onto a different IP address to the web traffic. In some cases this isn't too hard, but in others the customer may be using an ISP who will only provide them with a single IP address, so implementing this would mean changing ISP.

We could reconfigure everyone to use their ISP's email smarthosts for outgoing email, but we don't routinely do this because they add another possible point of failure, seem to be forever getting black listed themselves since you're sharing them with potentially badly-behaved people, and in my experience ISPs frequently seem to configure them in crazy ways that causes unexpected breakage.

I've been in contact with CBL, suggesting that they could make a list of the honeypot domains available to us. This would allow us to set our customers' proxy servers to block the connections (avoiding being added to the CBL), and also automatically alert the administrator to the virus infection. Unfortunately they say they can't do this due to the rapidly changing nature of the honeypot servers - this seems like a solved problem to me though, they could easily distribute a rapidly changing list of domains using DNS.

Friday, 20 September 2013

Shiny new templates!

As I've mentioned before, our Iceni server product is getting a big redesign at the moment - most of that work is centred around the user interface, but there are a few fundamental changes to the way some of the backend services work too.

Anyway, up until now I've been working to a UI design that I threw together - I'm completely unartistic, so that wasn't great, but it did allow me to get on with writing the new code. The new UI is built using the Smarty templating engine, so I expected to be able to apply a better design at a later date without too much trouble. I didn't realise how little work it would actually be though - its only taken me about two days to completely replace the temporary interface with Sarah's new design.

She gave me the new design as a flat graphic, so the first job was to reproduce this as a web page, using HTML 5 and CSS. A web page needs to be able to re-flow all the text and graphics to fit different sized web browsers, etc. which obviously the flat graphic prototype doesn't do, so it isn't a dead-simple job. It also needs to be flexible enough to allow the layout to adjust to display different types of data. It took about a day to figure out all the HTML and CSS, which I thought was pretty good going.

Today I've been going through the Iceni user interface code and replacing the old templates with the new design - a surprising amount of that is actually just a straight replacement. Of course, a few bits and pieces needed tweaking to make them fit in with the new stuff, but I'm honestly quite impressed that I've got all that done today.

Here's a sneak-peak - there are a few missing graphics and some of the colours need fixing to properly integrate, but all in all I think its come along well.

Tuesday, 17 September 2013

Filtering Google Searches

Google have now suddenly, seemingly without warning, started redirecting people to the HTTPS website. They've been doing this with logged in users since forever, but now they've started sending people who aren't logged in to the encrypted version of the search engine.

Today we caught wind of one local authority who have notified all their schools that they are no longer able to filter Google searches because of this change. This is interesting because none of this affects our filtering systems at all - our Iceni servers are able to force users into the unencrypted version of Google and therefore filter it. We can also turn SafeSearch on for all users automatically. In fact, our systems have been able to do this for years, which is quite reassuring - even though Google seems to have caught some filtering providers off guard, our software has proven robust in this respect and we've not had to do anything.

Monday, 16 September 2013

New web filtering module

Today started with fixing a few bugs in the new web filter code: firstly, it wasn't always telling Squid to decrypt SSL connections when necessary, and secondly the overrides didn't appear to be working at all. Both of these were logic errors - the first one actually already existed in the old code too, although it doesn't seem to manifest itself. The overrides problem was down to setting the initial filter state incorrectly - previously it hadn't shown up because we didn't aggregate the states of the individual filters until after we'd run all of them at least once (which caused the states to be set right), but now we can aggregate the states before running some of the filters and so the incorrect initial states broke things.

I've also been adding a new filtering module to the web filter - this will scan content for URIs and look them up in our URI categorisation database. The idea is that websites often link to similar sites, so aggregating some of the information we have about linked sites should allow better categorisation. Most of the web filter is very modular, and adding a new type of filter should be easy. However, the content filter code is quite old and the newer more modular design has kind of been shoe-horned around it. Eventually that code will need refactoring to bring it in line with the rest of the filter, but that's a lot of work for very little gain, so I'm not going to get into doing that just yet. The new filtering module needs access to all the buffers the content filter currently maintains, so for now the new filtering module is going to be integrated into that existing code. Its half-done now anyway, and happily extracting URIs from the page content, so it shouldn't be too hard to look those URIs up and feed that data into the categorisation engine.

While all this development has been going on, we've also been trying to tackle an odd issue with Apple's "Find my iPhone" service. We've got customers using this service with no problem, but it just doesn't seem to be working on one of our customer's networks. The odd thing is that we can see regular requests to gs-loc.apple.com from the "locationd" service on these devices, and they seem to be succeeding just fine. But whilst Find my iPhone says the devices are online, it says it has no location for them. Its quite frustrating, and of course Apple don't provide any information on their protocols, so debugging the issue is a case of reverse engineering it all. So that's another one to investigate some more tomorrow.

Tuesday, 10 September 2013

Auto-updates

As mentioned in the previous post, we currently use a system built on top of SubVersion to deliver updates to the filtering criteria to clients every night. This is being replaced by a PostgreSQL backed system. We have quite a lot of criteria that the filters use to classify content, so it isn't sensible for all the clients to download the whole lot every night. Instead, we produce deltas - each update is tagged with a revision number and when a client wants new data it tells us what revision its already got and the server just sends the changes. Today I've been trying to figure out the best way to generate deltas from the new database tables.

Initially I set about trying to build SQL that would automatically generate a list of additions and deletions between any two revisions. However, this became unreasonably complex: we produce some of the filtering criteria in-house, and bolster them with data from some third parties. Internally, we have to keep the data from all these sources separate so that we can apply updates to them, but it all gets consolidated for the clients. The various sources can give us duplicate data, and ensuring the deltas behaved sensibly in the face of the duplicates was problematic.

So my second method to try involved running SQL queries to build tables of the "old" data and the "new" data, and then join them together in such a way as to produce the delta. This did work for small amounts of data, but once I loaded all of our standard filtering criteria into the database it ground to a halt and an analysis of how the database was handling the query suggested it was unfixable, so another idea was needed.

In the end, I've used two completely separate SQL queries - one to generate the "old" data and one to generate the "new" data. These are sorted by ID and then the PHP code loops through the two tables in parallel, comparing the IDs to see where records have been removed and added. Although I was hesitant to shift so much data between PostgreSQL and PHP, it does actually seem to work well. I suspect I could write some plpgsql code to do the same job on the PostgreSQL side, but honestly I don't think its necessary with the performance I've seen so far.

Now, the filtering criteria are bundled up into a gzipped xml file, which makes an initial "from-clean" download of around 8MB (obviously the deltas thereafter are much smaller). This is a lot better than the 40-odd MB download the old system does on a fresh install!

Next job is to add some extra types of data that we have to auto-update, which should be relatively trivial since I can just reuse the library code I've now written. Then I need to build client-side updater system to interface with all this. Once all that's done, I'll have our full filtering database on my test machine and can really start to test out the new web filter code to make sure its stable.

Friday, 6 September 2013

Busy week!

Well, its the end of a busy week. It has been the start of term for a lot of schools, so the phone's been ringing off the hook as problems are discovered with maintenance that has happened over the summer. Its pretty much the same every year, so hopefully it will start to settle down next week as usual.

Still, despite that, I've got all the logging stuff done for the web filter, although it needs a thorough testing, and I've been working on the auto-updater for filtering data today.

Our existing system uses SubVersion to store criteria used for filtering web requests - every night we incorporate new data into SubVersion, and then all of the customer machines make web requests for updates and are sent unified diffs, with which they update their local databases. This has turned out to be a little hard to maintain and not as scalable as we would like, so its all being moved into a PostgreSQL database. Today I've been working on some PHP code to generate deltas from the database in response to auto-update requests from the customers. There's quite a lot of filtering data - too much to want to send it all to every customer each night, so instead we will be sending deltas as XML.

But its reached that time on a Friday when my concentration has run out, so that's a job to finish next week!

Tuesday, 3 September 2013

Fun with PostgreSQL

We're currently doing a big overhaul of our Iceni Server product. This is basically a pluggable architecture that allows us to offer various modules, such as web filtering, mail serving/filtering, telephone exchanges, etc.

Today's job has been to work on the logging subsystem within the web filter. Up until now, the web filter has logged information about web accesses to a text file, and then we have a little daemon sitting there reading the text log and stuffing the data into a PostgreSQL database so that it can be searched and analysed. One of the efficiency improvements I wanted to make was to get rid of the daemon and have the web filter process talk to the database directly.

This isn't entirely straight forward. For one thing, the web filter is massively multithreaded, and the old text file system worked by each thread doing its own log writing. This wasn't going to work for a database - communication with the database will have more latency than a file, and we could end up with a massive locking overhead slowing the whole thing down. So much of the existing logging code has been reorganised so that the individual threads just add log entries to an internal queue and a single dedicated thread handles both writing the traditional text file, and talking to the database. I got all that stuff working yesterday, and it seems to be pretty good.

A second problem, that we've previously had with the old system, is that there's a *lot* of log data going into the database - a quick check on a server at one of my busier customers shows the database is over half a terabyte. With this number of records, things can get decreasingly speedy, so we implemented database partitioning quite a long time ago to limit the amount of data in each table. This works by having a set of master tables that remain permanently empty, and then every so often creating a bunch of sub tables using PostgreSQL's inheritance system and adding check constraints on them. This massively speeds up both inserts and searches. So I've been porting the database partitioning idea to the new logging code. This has taken some thought in order to deal with a few corner cases that have caught us out before - things like when the server clock is wrong and gets changed to an earlier time, which means the new database records start failing some of the check constraints on the live tables.

Not quite got that all finished today, but there's not much left on the logging side of things so that should get done tomorrow...

So it begins...

I've decided to start writing a blog. The idea is:

Give my customers some insight into what's going on behind the scenes. Often customers only get to see the finished product, which takes a long time to appear - when we're doing major development they can get tired of being told "real soon now" with no visible sign of progress. So I'm hoping that I can provide some insight into how things are progressing when we have big product development projects going on.
Provide a platform for interesting technical discussion, both for the public and for my customers. I hope to discuss some of the technical challenges we face on a day to day basis, which may turn out to be interesting and helpful to people, and invoke feedback as to whether I'm using good approaches to solve problems.
I'd prefer comments I make to not be read as an official statement from the company. These aren't official announcements, there's no guarantee that anything I mention here will make it to market, etc.

I've chosen to make this a personal blog, rather than officially tied to my business, Opendium, because I don't want to end up talking entirely about work stuff. I'll be posting general "stuff I find interesting" stuff too.