We're doing some modernisation work at the moment, and part of that is moving our products to a CentOS 8 operating system. Our Opendium UTM appliance includes a firewall with a friendly web-based user interface, and the back end is built upon iptables. The end user never sees the iptables bit, of course - they just set up firewall policies based on their users, groups and rule bundles.
Deep packet inspection is done in user space, but for performance reasons most of the other decision making is implemented as iptables rules. Those rules can get pretty complicated, amounting to thousands of iptables rules.

CentOS 8 has moved away from iptables, switching instead to nftables, which is claimed to improve performance, amongst other things. This shouldn't be a big deal - there's an adaptation layer that allows nftables rules to be manipulated in exactly the same way as iptables rules, so in theory there's no need for big changes to our software.
We have plans to move more of the decision making into user space to reduce the complexity of the in-kernel rules, but we don't want to do that right now, so being able to port over the existing system with little modification is great... Except it didn't work.
iptables configurations consist of "chains", where each chain contains a list of rules. A rule is just some criteria, and an action that will be carried out if those criteria match the network traffic. Actions can be things like "ACCEPT" and "DROP" (which allow or disallow the network traffic respectively), or can be a "go to" or "jump" action that points at another chain.
So we can view iptables configurations as a directed graph, with "chains" as vertices and "go to" and "jump" rules as edges. Cycles are not allowed.
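To make the model concrete, here is a minimal sketch in C of chains as vertices and jump/goto rules as edges. The type and field names are made up for illustration - these are not the kernel's own structures:

struct chain;

/* A rule: some match criteria (elided here) plus a verdict.  Jump and
 * goto verdicts carry a pointer to another chain - these pointers are
 * the edges of the graph. */
struct rule {
        enum { RULE_ACCEPT, RULE_DROP, RULE_JUMP, RULE_GOTO } verdict;
        struct chain *target;   /* set only for RULE_JUMP / RULE_GOTO */
};

/* A chain: a named list of rules - a vertex of the graph. */
struct chain {
        const char   *name;
        struct rule  *rules;
        unsigned int  nrules;
};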
Problem 1: "Too many links"
When I tried to load the iptables rules into the Linux kernel, they were rejected with the error "Too many links". It's not a particularly helpful error, but some poking around revealed that nftables has a fixed-size jump stack of 16 frames (NFT_JUMP_STACK_SIZE in the kernel headers), which means that your rules can only jump between chains to a maximum depth of 16.
We do have some pretty complicated rules, but they shouldn't go more than 16 jumps deep, so what's going on?
Delving into the kernel, we find this horribleness in nft_immediate.c, and similar code in nft_lookup.c:
static int nft_immediate_validate(const struct nft_ctx *ctx,
                                  const struct nft_expr *expr,
                                  const struct nft_data **d)
{
        const struct nft_immediate_expr *priv = nft_expr_priv(expr);
        struct nft_ctx *pctx = (struct nft_ctx *)ctx;
        const struct nft_data *data;
        int err;

        if (priv->dreg != NFT_REG_VERDICT)
                return 0;

        data = &priv->data;
        switch (data->verdict.code) {
        case NFT_JUMP:
        case NFT_GOTO:
                pctx->level++;
                err = nft_chain_validate(ctx, data->verdict.chain);
                if (err < 0)
                        return err;
                pctx->level--;
                break;
        default:
                break;
        }

        return 0;
}
This is part of a validation routine that runs whenever new rules are added, and it is responsible for the "Too many links" error if you try to add rules with a jump depth that would exhaust the 16-frame stack.

We can see that jumps and gotos are both handled the same way - pctx->level is incremented when following either, and nft_chain_validate() will return an EMLINK error once the level reaches 16. This seems wrong: a jump consumes a stack frame, because the kernel has to remember where to return to, but a goto transfers control without ever returning - like a tail call - so it costs no stack space at all. Looking at the rest of the code, I can't see a reason for treating them alike, so I changed it to the following (in both files), rebuilt, and that seems to solve that problem:
switch (data->verdict.code) {
case NFT_JUMP:
        /* A jump pushes a return frame, so it counts against the
         * 16-deep stack. */
        pctx->level++;
        err = nft_chain_validate(ctx, data->verdict.chain);
        if (err < 0)
                return err;
        pctx->level--;
        break;
case NFT_GOTO:
        /* A goto never returns, so it consumes no stack space and
         * the level is left untouched. */
        err = nft_chain_validate(ctx, data->verdict.chain);
        if (err < 0)
                return err;
        break;
default:
        break;
}
There are a couple of other obvious problems with this code which I haven't tried to fix:
- ctx is passed as a pointer to const, but the const qualifier is immediately cast away. Why do this? Because the author likes watching the world burn?
- The "level" attribute of ctx (which is supposed to be a constant) is modified. It does get restored before the function exits, but that is neglected if there's an error. I'm guessing that the calling functions just discard the contents of ctx if there is an error, so this probably doesn't really matter, but yuckity yuck!
Problem 2: CPU Lockup
With the above fix in place I tried again, and the machine crashed - it totally vanished off the network, and the console was unresponsive, eventually showing "kernel:watchdog: BUG: soft lockup - CPU#1 stuck for 23s!" warnings. Great.
When a new rule set is committed, the kernel runs two validation routines. In nf_tables_api.c we find the nf_tables_check_loops() function, which checks the graph for cycles and rejects it if it has any; and nft_table_validate(), which calls into the nft_chain_validate() code mentioned above to reject anything that would exceed the stack depth.
These functions are executed each time a change is committed. Thankfully that means once per commit, not once per rule: if you use the iptables-restore command you can make multiple changes at once and have the lot validated in one go, whereas adding rules one at a time with the iptables command re-runs the validation for every rule you add.
Unfortunately the algorithm used by these functions is just a brute-force walk of the entire graph, visiting each vertex once for every path that reaches it. The following set of rules is a pathological case (it doesn't actually do anything useful - it's just a test case to demonstrate the problem):
Chain INPUT (policy ACCEPT)
target prot opt source destination
A0 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
Chain FORWARD (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
Chain A0 (1 references)
target prot opt source destination
A1 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A1 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A1 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A1 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A1 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A1 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A1 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A1 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A1 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A1 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
Chain A1 (10 references)
target prot opt source destination
A2 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A2 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A2 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A2 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A2 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A2 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A2 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A2 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A2 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A2 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
Chain A2 (10 references)
target prot opt source destination
A3 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A3 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A3 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A3 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A3 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A3 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A3 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A3 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A3 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A3 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
Chain A3 (10 references)
target prot opt source destination
A4 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A4 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A4 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A4 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A4 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A4 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A4 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A4 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A4 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A4 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
Chain A4 (10 references)
target prot opt source destination
A5 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A5 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A5 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A5 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A5 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A5 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A5 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A5 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A5 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A5 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
Chain A5 (10 references)
target prot opt source destination
A6 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A6 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A6 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A6 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A6 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A6 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A6 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A6 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A6 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A6 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
Chain A6 (10 references)
target prot opt source destination
A7 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A7 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A7 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A7 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A7 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A7 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A7 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A7 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A7 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A7 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
Chain A7 (10 references)
target prot opt source destination
A8 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A8 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A8 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A8 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A8 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A8 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A8 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A8 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A8 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
A8 all -- 0.0.0.0/0 0.0.0.0/0 [goto]
Chain A8 (10 references)
target prot opt source destination
This is a mere 81 rules, yet it takes about 5 seconds to commit them: with 10 parallel gotos at each of eight levels, a brute-force walk has on the order of 10^8 distinct paths to follow. Adding a single additional rule using "iptables -A INPUT -g A0" takes about another 5 seconds, because it re-triggers the validation, and worst of all the CPU is locked up for the duration of the validation, leaving the machine unresponsive.

The problem gets exponentially worse: each extra level multiplies the number of paths by ten, so if we add another chain and another 10 rules the machine will become unresponsive for 50 seconds. Add another and your server will be AWOL for 8 minutes!
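To put numbers on that, here's a toy program - nothing to do with the kernel code, it just counts the paths a naive depth-first walk would take through a graph of the same shape as the test case above:

#include <stdio.h>

/* LEVELS chains in a row (A0..A7 feeding into the empty A8), each
 * containing FANOUT gotos to the next - the shape of the test case. */
#define LEVELS 8
#define FANOUT 10

static unsigned long long paths(int level)
{
        if (level == LEVELS)
                return 1;               /* reached the empty chain A8 */
        unsigned long long total = 0;
        for (int i = 0; i < FANOUT; i++)
                total += paths(level + 1);
        return total;
}

int main(void)
{
        /* Prints 100000000 - one hundred million paths for 81 rules. */
        printf("%llu\n", paths(0));
        return 0;
}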
nft_table_validate() always walks the whole graph; nf_tables_check_loops() often only walks a subsection of it, but is actually pretty quick even when it does cover the whole graph. nft_table_validate() is very, very slow by comparison. I couldn't immediately see why there was such a big difference in speed - the two functions look very similar, and neither really does a lot other than recursively walking across the whole graph.
What's the Solution?
The legacy iptables kernel modules still exist in CentOS 8, but the legacy user-space tools aren't packaged any more. The legacy tools are still part of the iptables SRPM, though - they are even built, but then omitted from the final RPM - so it was a trivial job for me to edit the spec file, rebuild the package, and continue using the legacy iptables modules rather than switching to nftables.
So in the short term, that's what we're doing. We've got a lot of other modernisation work going on (a big overhaul of the UI and the heritable policy structure), so we would rather not get sidetracked right now.
I've got a longer-term plan to switch to using nftables natively, as well as to move a lot of the decision making into user space to reduce the complexity of the rules. We may well end up with a simple enough nftables configuration for these problems to largely go away, although I dislike the idea of living with a bug that can bring down a server whenever a particularly pathological configuration is loaded.
As mentioned above, it's not clear to me why there's such a big difference in speed between nf_tables_check_loops() and nft_table_validate(). There are certainly better algorithms for the validation they are doing, and even just adding some simple caching to avoid revisiting the same vertices over and over would be a big help, as sketched below. I may end up revisiting this and rewriting the offending bits of the Linux kernel at some point in the future.
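To sketch the caching idea using the toy structs from earlier (again, illustrative names and a hypothetical validated_at field - this is not a kernel patch): remember the deepest level at which each chain has already been validated, and prune any revisit at that depth or shallower. Passing validation at level L proves the chain's subtree fits within the remaining stack budget, which also covers any later, shallower visit. For brevity the sketch counts every edge as a jump:

#define MAX_DEPTH 16    /* nftables' fixed jump stack depth */

/* The toy structs from earlier, with one extra bookkeeping field. */
struct chain;
struct rule {
        struct chain *target;   /* NULL unless this rule jumps/gotos */
};
struct chain {
        struct rule  *rules;
        unsigned int  nrules;
        int           validated_at;     /* deepest level proven safe;
                                         * initialise to -1 */
};

/* Cycles are rejected separately (by the loop check), so marking a
 * chain before recursing into it is safe. */
static int validate_depth_cached(struct chain *c, int level)
{
        if (level > MAX_DEPTH)
                return -1;                      /* stack would overflow */
        if (c->validated_at >= level)
                return 0;               /* already proven at this depth */
        c->validated_at = level;
        for (unsigned int i = 0; i < c->nrules; i++) {
                struct chain *t = c->rules[i].target;
                if (t && validate_depth_cached(t, level + 1) < 0)
                        return -1;
        }
        return 0;
}

With this, each chain is walked at most MAX_DEPTH times rather than once per path, so the 10^8-path test case above collapses to a few hundred visits.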
Bugzilla
These bugs have been submitted to the Netfilter Bugzilla at the following links: