Tuesday, 10 September 2013


As mentioned in the previous post, we currently use a system built on top of SubVersion to deliver updates to the filtering criteria to clients every night.  This is being replaced by a PostgreSQL backed system.  We have quite a lot of criteria that the filters use to classify content, so it isn't sensible for all the clients to download the whole lot every night.  Instead, we produce deltas - each update is tagged with a revision number and when a client wants new data it tells us what revision its already got and the server just sends the changes.  Today I've been trying to figure out the best way to generate deltas from the new database tables.

Initially I set about trying to build SQL that would automatically generate a list of additions and deletions between any two revisions.  However, this became unreasonably complex: we produce some of the filtering criteria in-house, and bolster them with data from some third parties.  Internally, we have to keep the data from all these sources separate so that we can apply updates to them, but it all gets consolidated for the clients.  The various sources can give us duplicate data, and ensuring the deltas behaved sensibly in the face of the duplicates was problematic.

So my second method to try involved running SQL queries to build tables of the "old" data and the "new" data, and then join them together in such a way as to produce the delta.  This did work for small amounts of data, but once I loaded all of our standard filtering criteria into the database it ground to a halt and an analysis of how the database was handling the query suggested it was unfixable, so another idea was needed.

In the end, I've used two completely separate SQL queries - one to generate the "old" data and one to generate the "new" data.  These are sorted by ID and then the PHP code loops through the two tables in parallel, comparing the IDs to see where records have been removed and added.  Although I was hesitant to shift so much data between PostgreSQL and PHP, it does actually seem to work well.  I suspect I could write some plpgsql code to do the same job on the PostgreSQL side, but honestly I don't think its necessary with the performance I've seen so far.

Now, the filtering criteria are bundled up into a gzipped xml file, which makes an initial "from-clean" download of around 8MB (obviously the deltas thereafter are much smaller).  This is a lot better than the 40-odd MB download the old system does on a fresh install!

Next job is to add some extra types of data that we have to auto-update, which should be relatively trivial since I can just reuse the library code I've now written.  Then I need to build client-side updater system to interface with all this.  Once all that's done, I'll have our full filtering database on my test machine and can really start to test out the new web filter code to make sure its stable.

No comments:

Post a Comment