Discussion:
merging forwarding & packet filtering?
David Young
2011-02-18 17:43:03 UTC
What do people think about gradually merging the packet-forwarding and
packet-filtering functions in the kernel?

I ask because I keep rediscovering that NetBSD is not adequate for
creating any non-trivial router unless I resort to using PF 'route-to'
statements that consult a packet's source address or port and then
contravene the forwarding table, create state, et cetera. Route-to
rules sometimes lead to hard-to-predict and inefficient behavior inside
the network stack: for example, a packet outbound on interface A may
make a hairpin turn to go out interface B, instead. Once I have set
up the forwarding the way I like it, the kernel forwarding table does
such light duty on these routers that you could almost, but not quite,
toss it out[1].

It seems to me that if packet filters are already making up for the
forwarding table's shortcomings, and if their pace of development,
speed, suitability for SMP (thinking of NPF here), and versatility
outmatch the forwarding table, we will save ourselves a lot of the
effort of refurbishing forwarding by rolling the forwarding/filtering
functions into one subsystem.

Dave

[1] Route-to rules do sometimes consult the forwarding table---that's my
hazy recollection, anyway.
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 344-0444 x24

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
der Mouse
2011-02-18 18:01:14 UTC
Post by David Young
I ask because I keep rediscovering that NetBSD is not adequate for
creating any non-trivial router unless I resort to using PF
'route-to' statements that consult a packet's source address [...]
There are also srt interfaces, which I have a hazy memory got brought
into the main tree. They didn't get along with NAT, but I recently
fixed that (not committed because I have no -current test capability;
I'll be happy to send the relevant patch to anyone who wants it, e.g.,
for merging and committing). Or, of course, I could be wrong in
thinking someone committed them, in which case I'll be happy to send
you what I have if you want to give it a try.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

Chuck Swiger
2011-02-18 18:38:28 UTC
Post by David Young
It seems to me that if packet filters are already making up for the
forwarding table's shortcomings, and if their pace of development,
speed, suitability for SMP (thinking of NPF here), and versatility
outmatch the forwarding table, we will save ourselves a lot of the
effort of refurbishing forwarding by rolling the forwarding/filtering
functions into one subsystem.
Sounds like a fine idea. (It reminds me of FreeBSD's netgraph....)

Regards,
--
-Chuck


Dennis Ferguson
2011-02-18 23:20:27 UTC
Post by David Young
make a hairpin turn to go out interface B, instead. Once I have set
up the forwarding the way I like it, the kernel forwarding table does
such light duty on these routers that you could almost, but not quite,
toss it out[1].
I'm not sure I have an opinion on the exact question you are asking but
I'm pretty sure the reason you don't find much use for the forwarding
table is that it is already underused, and that by itself is worth
fixing. There are routes which the destination (or destination/source)
address in a packet must be matched against stored all over the network
stack in all kinds of data structures. When a packet arrives at ip_input()
its destination is first looked up in a hash table to see if it is a local
address. Then it is compared to addresses attached to the interface to
see if it is a local network broadcast address. Then it is checked to
see if it is a multicast address, and if so it is looked up in one kind
of table to see if it needs to be forwarded, and then another kind of
table to see if it is something of local interest. It is then checked
to see if it is addressed to the all-ones or all-zeros address. Only
if it gets this far is it looked up in the forwarding table. Oh, and I
forgot the fast-forwarding cache thing that it gets looked up in
first, before all the rest, the existence of which is an admission that
the rest of this is pretty crappy. The stack looks at the same header
fields over and over and over again in so many ways.

All of this stuff that is matched against the packet's destination/source
addresses needs to be collected into just one (or maybe two, but no
more) data structure(s), so when a packet arrives you do just one
lookup that tells you what you are dealing with. This is maximally efficient,
and leaves you with just one data structure to make SMP-safe and fast.
And if the lookup you need to do can be implemented by a simple prefix match
against an N-bit (64 bits for IPv4, 256 bits for IPv6) destination/source
key from the packet, as all of the operations described above and done in
so many ways can be, there are data structures for this which are fast,
incrementally updatable and needn't lock out readers during updates. And,
as a side effect, the 100 or more lines of code in ip_input() which
do exactly the same thing, but divide the problem up into a bunch of
serially-executed special cases, go away.
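As a rough sketch of the consolidation Dennis describes, here is a single
longest-prefix-match table that classifies a destination as local,
broadcast, multicast, or forwardable in one lookup (illustrative Python,
not kernel code; the linear scan stands in for a real lookup trie, and all
addresses are made up):

```python
import ipaddress

class UnifiedLookup:
    """One table answering every destination-address question at once."""

    def __init__(self):
        # entries: (network, result), kept sorted longest-prefix-first
        self.entries = []

    def add(self, prefix, result):
        self.entries.append((ipaddress.ip_network(prefix), result))
        self.entries.sort(key=lambda e: e[0].prefixlen, reverse=True)

    def lookup(self, addr):
        a = ipaddress.ip_address(addr)
        for net, result in self.entries:   # longest prefix wins
            if a in net:
                return result
        return None

# Local addresses, broadcast, multicast and routes all live in one table,
# replacing the serial special-case checks in ip_input():
t = UnifiedLookup()
t.add("10.0.0.10/32", ("local", None))
t.add("10.0.0.255/32", ("broadcast", None))
t.add("224.0.0.0/4", ("multicast", None))
t.add("0.0.0.0/0", ("forward", "10.0.0.1"))
```

One lookup now tells the input path what it is dealing with, so there is
only one structure to make SMP-safe.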

Beyond this, however, I'll just point out that my experience has been
that there is an impedance mismatch between the requirements of
standard routing protocols and what I'll call the "policy forwarding"
normally associated with filters. The needs of standard routing protocols
(and I include ARP and ND among them) are fully met by a simple prefix
lookup, but all standard routing protocols learn their routes incrementally
and really want to modify them that way. More than this, since the
things that standard routing protocols learn about are generated by
computers talking to computers, they can sometimes change frequently
(particularly if you have a lot of routes), so in general they want
relatively simple and uniform operations done on a structure which can
be incrementally updated at low cost.

"policy forwarding", on the other hand, generally can't be done with
a simple prefix match, and if you need it to go fast then the best
structures seem to depend on knowing all the data in the filter and
compiling it to get the result you need, i.e. fast and incremental
updates are normally mutually exclusive. In practice this is okay,
however, since the information on which "policy forwarding" is based
generally originates from a human somewhere (the filter which implements
the policy might be generated by an automaton, but a human is normally
the ultimate source), and tends to change closer to human time, so the
fact that an update might require considerable computing resources to
compile isn't a big problem in practice.

To keep both types of users well served the structures which implement
standard routing (which must be there) and "policy routing" are generally
kept separate, preserving the scalability of the former while avoiding
implementation constraints that might exist in the latter if it needs
to serve users which want to do frequent, incremental updates. If you
think you have a structure which can serve both without penalizing
one or the other that is great, but in my experience there is no way
to make one thing do both really well.

Dennis Ferguson


David Young
2011-03-08 22:25:48 UTC
Post by Dennis Ferguson
To keep both types of users well served the structures which implement
standard routing (which must be there) and "policy routing" are generally
kept separate, preserving the scalability of the former while avoiding
implementation constraints that might exist in the latter if it needs
to serve users which want to do frequent, incremental updates. If you
think you have a structure which can serve both without penalizing
one or the other that is great, but in my experience there is no way
to make one thing do both really well.
Thanks for your thoughtful and informative response!

I don't think that it's necessary to create a new data structure
that meets every need. Some policies just refine forwarding rules,
and we can apply those policies after a super-fast forwarding-table
lookup. For example, a policy that uses PF state to pin a flow to the
current default nexthop is just a refinement of the default route, so
let us write it as such. Let pin_flow_to_nexthop be the name of a
packet-filter rule that creates a state for each flow and copies the
nexthop to it, and let -policy attach a policy to a forwarding-table
entry. First let's set the default nexthop to 10.0.0.1 (happens to be
connected to wm0):

# route add -net default 10.0.0.1 -policy pin_flow_to_nexthop

Wait a while for flows to start:

# sleep 10
# route -n get default
route to: default
destination: default
mask: default
gateway: 10.0.0.1
local addr: 10.0.0.10
interface: wm0
policy: pin_flow_to_nexthop
flows: tcp 10.0.0.5:38237 <-> 10.25.43.7:22 nexthop wm0:10.0.0.1
tcp 10.0.0.29:38783 <-> 10.19.0.7:34543 nexthop wm0:10.0.0.1
flags: <UP,GATEWAY,DONE,STATIC>
recvpipe sendpipe ssthresh rtt,msec rttvar hopcount mtu expire
0 0 0 0 0 0 0 0

Now, let's change the default nexthop to 10.1.1.254 (happens to be
connected to tlp0). Old flows stay with the previous nexthop. New
flows take the new nexthop:

# route change -net default 10.1.1.254 -policy pin_flow_to_nexthop
# sleep 10
# route -n get default
route to: default
destination: default
mask: default
gateway: 10.0.0.1
local addr: 10.0.0.10
interface: tlp0
flags: <UP,GATEWAY,DONE,STATIC>
policy: pin_flow_to_nexthop
flows: tcp 10.0.0.5:38237 <-> 10.25.43.7:22 nexthop wm0:10.0.0.1
tcp 10.0.0.29:38783 <-> 10.19.0.7:34543 nexthop wm0:10.0.0.1
tcp 10.0.0.29:21012 <-> 10.19.25.8:4001 nexthop tlp0:10.1.1.254
recvpipe sendpipe ssthresh rtt,msec rttvar hopcount mtu expire
0 0 0 0 0 0 0 0

We can think of these as fancy cloning routes, where the clones are
indexed by 4-tuple instead of by destination.
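A minimal sketch of such a cloning route (illustrative Python;
pin_flow_to_nexthop is a hypothetical rule name, not an existing PF
feature): each flow copies the nexthop that was current when its first
packet arrived and keeps it across later route changes:

```python
class PinnedRoute:
    """A route entry whose clones are indexed by 4-tuple, not destination."""

    def __init__(self, nexthop):
        self.nexthop = nexthop      # current nexthop of the route
        self.flows = {}             # (saddr, sport, daddr, dport) -> pin

    def change_nexthop(self, nexthop):
        self.nexthop = nexthop      # existing flows keep their old pin

    def lookup(self, flow):
        # First packet of a flow pins the current nexthop; later packets
        # of the same flow get the pinned value back.
        return self.flows.setdefault(flow, self.nexthop)

r = PinnedRoute("wm0:10.0.0.1")
old = r.lookup(("10.0.0.5", 38237, "10.25.43.7", 22))   # pins to wm0
r.change_nexthop("tlp0:10.1.1.254")                     # route change
new = r.lookup(("10.0.0.29", 21012, "10.19.25.8", 4001))  # new flow
```

This mirrors the route(8) transcript above: old flows stay on wm0, new
flows take tlp0.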

The advantage of hanging the rule off of a forwarding-table entry in
this way is that I have one place to start if I want to read or set the
policy of the system, and I can tell by reading the forwarding table
whether the entries therein are potentially contravened.

This is not the only way, and it may not be the best way to make the
forwarding policy more transparent, but I hope that it gives you an idea
of what I mean by and what benefits I desire from merging forwarding &
packet filtering.

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 344-0444 x24

Thor Lancelot Simon
2011-03-09 03:16:14 UTC
Post by David Young
Post by Dennis Ferguson
To keep both types of users well served the structures which implement
standard routing (which must be there) and "policy routing" are generally
kept separate, preserving the scalability of the former while avoiding
implementation constraints that might exist in the latter if it needs
to serve users which want to do frequent, incremental updates. If you
think you have a structure which can serve both without penalizing
one or the other that is great, but in my experience there is no way
to make one thing do both really well.
Thanks for your thoughtful and informative response!
I don't think that it's necessary to create a new data structure
that meets every need. Some policies just refine forwarding rules,
and we can apply those policies after a super-fast forwarding-table
lookup. For example, a policy that uses PF state to pin a flow to the
current default nexthop is just a refinement of the default route, so
let us write it as such.
But using PF state is slow.

For what it's worth, we have policy routing requirements probably
different from (much simpler than) Dennis' but we ended up in much
the same place in terms of his general observation at least.

What we ended up with is probably available to NetBSD if it's wanted.
It is very young and will doubtless need bugs shaken out, though...

Along the way we did notice some relatively quick and painless
optimizations (depessimizations?) that could be made to the current
route lookup. One very obvious one is that you shouldn't have to
chase a pointer from the route entry to find out the next hop
address! Some builtin storage in the route entry for this would
be a huge win.

Thor

Dennis Ferguson
2011-03-10 04:52:26 UTC
Post by David Young
Thanks for your thoughtful and informative response!
I don't think that it's necessary to create a new data structure
that meets every need. Some policies just refine forwarding rules,
and we can apply those policies after a super-fast forwarding-table
lookup. For example, a policy that uses PF state to pin a flow to the
current default nexthop is just a refinement of the default route, so
let us write it as such. Let pin_flow_to_nexthop be the name of a
packet-filter rule that creates a state for each flow and copies the
nexthop to it, and let -policy attach a policy to a forwarding-table
entry. First let's set the default nexthop to 10.0.0.1 (happens to be
Since I don't quite get the useful purpose achieved by the example,
let me just make a scatter-shot response concerning the things I don't
like about it in general (it doesn't mean there is something wrong with
wanting things to work that way, it is just that a world where it didn't
have to work that way is more to my taste). There are possibly
interesting questions about what the basic philosophy for building
a forwarding path should be, however, so I'll take some time making
my arguments.

First, while I realize it is popular, I cringe a bit at the notion that
"policy routing" should be somehow equivalent to flow state instantiation,
that is the creation of state to add to your forwarding path based on
packets which have recently been moving through it. You are far better
off defining policies in a way that allows you to make forwarding
decisions statelessly, i.e. which doesn't require you to remember what
you did the last time you saw a packet that looked like that. Part of
the reason for this is the generic observation that flow forwarding caches
are very often a waste since flows typically don't last long enough to
amortize the cost of instantiating the cache entry (on big routers it is far
worse: if you are filling umpteen-Gbps circuits with traffic from teeny, tiny
DSL circuits and have a packet which looks like X, it will take forever before
you see another one like that; the concurrent forwarding-state requirement
is huge and doesn't scale), but there is also a rationale for avoiding this
based on the constraints the processing hardware doing this work is increasingly
imposing. The most scalable forwarding operations will be the ones which
take no locks while moving a packet from the input interface to the output
queue since these run perfectly in parallel and scale with the number of
processor cores available to do the work. There are very few good lookup data
structures which allow concurrent updating (i.e. writes, rather than reads),
however, and those which do exist tend to be too complex or too constraining
to think about using in real life, so a flow cache is probably going to
require a lock and one that is taken frequently will serialize everything back
down to near-single-core speed. While there are nails which absolutely
require the hammer of flow caching (e.g. NAT), insisting that all forwarding
should be stateless with respect to traffic (beyond counting things) unless
there is no way to make do with that is a policy which will very probably
yield the best results in the most circumstances with the least complexity.
And note that I'm ignoring the other fact, that making a flow cache perfectly
reliable requires the router to be prepared to reassemble fragmented packets
that are not its own, something it otherwise never has to do for IPv4 and
which is an even bigger problem for IPv6, where the fragmentation protocol is
end-to-end. I don't think flow caching is ever an attractive option, so it
is better to reserve it for situations where there is no other option.

The second thing I don't like about the example, however, is that I don't
have a good picture of how it is supposed to continue to work if you need
to do the same thing (whatever that is) in a slightly larger network where
installing routes with route(8) doesn't work and you have to run a routing
protocol to find your routes instead. In that case you don't install the
routes, your routing protocol implementation does, so how would it know
to make certain routes "special"? Would the plan then be to modify your
routing protocol implementation so that it did this special thing too?
And, if so, if you wanted to set policy like that not on a default route
but rather on a route to a direct neighbour, how would that get done?
You don't install those routes either, another routing protocol (ARP)
does, so does ARP need to learn about your policy requirements as well?
It seems like modifying route(8) is just the start; you may also need
to modify everything which isn't route(8) but which puts routes in the same
table to make this general. And this gets fuzzier given that the
example seems to be doing something which strikes me as fundamentally
wrong. I don't know what would prompt you to change the default route
with route(8), but I know why a routing protocol would do it: it would
do it because it determined that the old default route wasn't working
any more, and in this case leaving those flow entries pointing in the
old, broken direction just can't be right. Does this mean the routing
protocol implementation would also have to be taught about the flow cache,
so that it can go in and fix up the broken bits when appropriate, like
when the routes change because the old ones are broken rather than
"just because"? It seems like the complexity of this has the potential
to creep all over the place.

Finally, though, there is the issue of what useful purpose this might
serve and whether there are other ways to get to the same place. I'm
not sure what the purpose of the example might be, but let me just assume
that it is a method for doing something useful when you have two
working default routes and want to split traffic between them. I really
like multiple routing tables for this instead. That is, configure
two kernel routing tables, add a different default route to each table,
and then write (stateless) policy ahead of the route lookup to determine
the appropriate table to do the lookup in for the packet you are looking
at. This works not only with route(8), but also with real routing
protocols; just run two instances of the routing protocol, one to
maintain the routes in each of the tables separately. The routing
protocols don't care what your policy is doing, and your policy
doesn't care where the routes are coming from. If the purpose really
is just determining what is going on, however, then maybe it would be
worthwhile to look at things like packet header tapping (like bpf
only done at the end of the forwarding process, so you get not only
what the packet looked like but what the forwarding operation decided
to do with it), and/or maybe packet injection, where you insert a packet
that looks like 'blah' into the start of the forwarding path and capture
it again at the end to see what happened to it. These mechanisms might
allow you to find out everything about the operation that the flow cache
would tell you, but with the huge benefit that they only cost something
when you actively want to find that out; the flow cache cost is forever,
even if there is no one to look at it.
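The multiple-tables alternative can be sketched as follows (illustrative
Python; the table layout and policy predicate are assumptions, not a real
NetBSD interface). The point is that the policy is stateless, deciding
the table from packet fields alone, while each table is owned by its own
routing-protocol instance:

```python
# Two kernel routing tables, each with its own default route, each
# maintained independently (e.g. by two routing protocol instances).
TABLES = {
    0: {"default": "wm0:10.0.0.1"},
    1: {"default": "tlp0:10.1.1.254"},
}

def policy_table(packet):
    # Stateless policy: decided from the packet alone, nothing remembered
    # about previous packets. The source-prefix test is just an example.
    return 1 if packet["saddr"].startswith("10.1.") else 0

def route(packet):
    # Policy first selects the table, then a normal lookup runs in it.
    table = TABLES[policy_table(packet)]
    return table.get(packet["daddr"], table["default"])
```

The routing protocols don't care what the policy does, and the policy
doesn't care where the routes came from.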

Anyway, here are my biases. I think it is best to assume that the
contents of kernel forwarding tables are entirely owned by standard,
automatic routing protocols, and to build a policy routing
infrastructure around them rather than in them. If one isn't running
a routing protocol for some routes (ARP or ND are almost always in use)
and is instead populating parts of these tables with route(8), don't let
route(8) do anything a standard routing protocol couldn't do on its
own. I think forwarding operations should be stateless unless there
is absolutely no way to provide a function without keeping state,
and that function is important enough that it needs to be done,
since stateless operations are maximally scalable and minimally
complex. If there is an issue of performance related to doing
stateless operations, work on making the stateless operations faster
rather than trying to remember the result you got the last time you
did it, since faster stateless forwarding is guaranteed to make
everything faster while result caching only helps if some assumptions
about the nature of the traffic you are handling are also true, and they
commonly aren't. And I think facilities used for debugging should only cost
something when you are actively using them to debug something and should
cost little or nothing when you aren't.

Dennis Ferguson
David Young
2011-03-10 16:34:11 UTC
Post by Dennis Ferguson
Finally, though, there is the issue of what useful purpose this might
serve and whether there are other ways to get to the same place. I'm
not sure what the purpose of the example might be, but let me just assume
that it is a method for doing something useful when you have two
working default routes and want to split traffic between them.
It's a method for achieving the best possible Internet reliability at a
site that connects to two or more Internet providers on consumer-class
subscriber lines---i.e., BGP is not available---and the computers at
the site connect to the Internet through a NAT router. When the link
to provider A goes down, you don't know ahead of time for how long.
It is helpful to direct new flows to provider B during an outage of
provider A, however, redirecting existing flows to provider B during an
outage is unhelpful at best. At worst, it kills the flows[1]. If the
outage lasts just 10 seconds, and switching providers kills flows, then
reliability may be worse than if you did not fail over to B at all. The
best possible thing to do is to hold existing flows on provider A and
to let new flows start on provider B. I haven't found a way to do that
without keeping some flow state.
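The behaviour described above can be sketched as follows (illustrative
Python; all names are made up). The only state kept is exactly the pin
of each existing flow to its provider:

```python
class FailoverNAT:
    """Hold existing flows on their provider; start new flows elsewhere."""

    def __init__(self):
        self.active = "A"           # provider used for new flows
        self.flows = {}             # 4-tuple -> pinned provider

    def provider_down(self, p):
        # Redirect only *new* flows away from the failed provider.
        if self.active == p:
            self.active = "B" if p == "A" else "A"

    def classify(self, flow):
        # Existing flows keep their pin even across an outage; a short
        # outage of A then costs nothing to A's established flows.
        return self.flows.setdefault(flow, self.active)

n = FailoverNAT()
f1 = ("10.0.0.5", 38237, "10.25.43.7", 22)
n.classify(f1)                      # started while A was up, pinned to A
n.provider_down("A")                # 10-second outage begins
f2 = ("10.0.0.29", 21012, "10.19.25.8", 4001)  # a new flow during it
```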

Dave

[1] Under certain circumstances a TCP RST or an ICMP packet will
come back from provider B.
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 344-0444 x24

Dennis Ferguson
2011-03-15 03:54:48 UTC
Post by David Young
Post by Dennis Ferguson
Finally, though, there is the issue of what useful purpose this might
serve and whether there are other ways to get to the same place. I'm
not sure what the purpose of the example might be, but let me just assume
that it is a method for doing something useful when you have two
working default routes and want to split traffic between them.
It's a method for achieving the best possible Internet reliability at a
site that connects to two or more Internet providers on consumer-class
subscriber lines---i.e., BGP is not available---and the computers at
the site connect to the Internet through a NAT router. When the link
to provider A goes down, you don't know ahead of time for how long.
It is helpful to direct new flows to provider B during an outage of
provider A, however, redirecting existing flows to provider B during an
outage is unhelpful at best. At worst, it kills the flows[1]. If the
outage lasts just 10 seconds, and switching providers kills flows, then
reliability may be worse than if you did not fail over to B at all. The
best possible thing to do is to hold existing flows on provider A and
to let new flows start on provider B. I haven't found a way to do that
without keeping some flow state.
Ah, I had a feeling this would end up having something to do with NAT.
What you are doing is sort of like a special case of NAT, call it NAT
Ultra-Lite, where you aren't doing the NAT operation itself but need
to behave in a way which mimics the behaviour and constraints of the
downstream routers which are. All things related to NAT are (necessary)
evil, and inherently require one to keep flow state.

What I would object to isn't the need to do it, but rather how and where
you want to do it. My personal indicator that a function is being
implemented in the wrong spot, or is being thought of the wrong way,
is when it seems to require unnatural acts to get it to work correctly
in all cases. In this case those "wrong spot" alarms are going off
all over: a correctly implemented flow state table inherently requires
a packet reassembly stage in front of it, so that fragmented packets are
made whole before the flow lookup, since you can't do a flow lookup on a
not-first fragment and it is only by getting those not-first fragments
attached to their first fragment before you look up the flow that you end
up with everything that needs to go in the right direction. It may
be that fragmented packets are uncommon (or never happen, even) in many
situations, but not dealing with them isn't "right" even if it might work
in a lot of cases. I just can't see how this can be made "right" the way
you want to do it.

I'll just point out that if you had to repair this, both for your NAT
Ultra-Lite or for full NAT, you would probably end up with something that
looks like this:

<forwarding/policy>--><reassembly>--><flow lookup>--><create state/do stuff/send packet>

That is, you would use a (stateless) forwarding/policy lookup to
identify those packets that need to be processed through the flow table,
then reassemble the fragments (a null function for not-fragmented packets)
and only then do the flow state lookup. This suggests that <forwarding/policy>
and <flow lookup> probably need to be separate.

Now let me cut-and-paste the same thing with a slightly different set of
operations on the right:

<forwarding/policy>--><reassembly>--><flow lookup>--><L4(e.g. TCP)>--><socket buffer>

What that is describing is the function of the input side of a host
networking stack of the kind that needs to exist in all kernels. In
this case (stateless) <forwarding/policy> picks out the arriving packets
which need the flow lookup by observing that the destination address in
the packet is a "local" one, the packets are reassembled if necessary,
and then a flow lookup is done to find the right transport/raw protocol
machinery and connection state to process the packet data out to the
application's socket. The <flow lookup> data structure required here
is a slightly more general one than you maybe need, since it needs to
do partial, as well as full, matches on the 5-tuple (think, say, a
service socket with only a protocol or protocol and port number binding),
but would otherwise provide the exact service that NAT, and NAT Ultra-Lite,
need to do their work too.
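Such a lookup might be sketched like this (illustrative Python; the keys
and fallback order are assumptions): try the full 5-tuple first, then
progressively less specific bindings, the way a connected socket shadows
a listening socket:

```python
def flow_lookup(table, proto, laddr, lport, faddr, fport):
    """Most-specific-first flow lookup with partial-match fallback."""
    for key in ((proto, laddr, lport, faddr, fport),  # connected socket
                (proto, laddr, lport, None, None),    # bound to addr+port
                (proto, None, lport, None, None)):    # bound to port only
        if key in table:
            return table[key]
    return None                                       # no flow: not local

# Entries roughly correspond to open sockets of any description:
table = {
    ("tcp", "10.0.0.10", 22, "10.0.0.5", 38237): "established-ssh",
    ("tcp", None, 80, None, None): "http-listener",
}
```

The same structure, in a different instance, would serve NAT and
NAT Ultra-Lite, which only ever need the full-match case.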

Of course the current host stack implementation doesn't really have
anything that fits cleanly into the <flow lookup> spot; while it eventually
does the equivalent operation it does it on sort of an ad hoc, special
case basis that spreads the code around. Like the destination address
matching done 6 ways that I ranted about before, however, this represents
a problem for making the code SMP-safe and lockless for readers since it
requires chasing through all the special cases to make sure they work.
A better arrangement would consolidate all of this into one, single lookup
structure that would sit in the <flow lookup> spot, with entries roughly
corresponding to open sockets of any description, since then you only have
one data structure (a "flow state" lookup) to make SMP-safe and fast.

What this means is that if you have to have a "flow state" lookup to do
a different function, like NAT or NAT Ultra-Lite, the best way to do that
would be to just reuse the same code, and a different instance of the same
data structure, that the host stack is going to require anyway. All flow state
problems should use the same solution, since then you only need one solution.

Or, to put it the other way around, any solution to a "flow state" problem
which doesn't get you closer to the integrated connection lookup the host
stack needs is just adding more special-case-of-the-same-thing code that will
need to be dragged around forever. I prefer an approach that tries to get more
done with less code, even if that makes the immediate special-case problem
a little harder to get done.

Dennis Ferguson
der Mouse
2011-03-15 05:18:21 UTC
Most of what Dennis says I agree with (not surprising, considering that
All things related to NAT are (necessary) evil, and inherently
require one to keep flow state.
Actually, not quite. There are some (moderately rare, in today's net)
cases of NAT which don't require keeping state. (In ipnat terms, rdr
and bimap rules are examples, if I understand them right - I don't use
ipnat all that much.)
[...] a correctly implemented flow state table inherently requires a
packet reassembly stage in front of it, so that fragmented packets
are made whole before the flow lookup, since you can't do a flow
lookup on a not-first fragment and it is only by getting those
not-first fragments attached to their first fragment before you look
up the flow that you end up with everything that needs to go in
the right direction.
I disagree with this too. Doing reassembly is certainly an easy way of
doing this, but I don't see it as essential; non-first fragments must
be held until the first fragment arrives, but the packet IDs that drive
reassembly in an end host could equally well drive flow state
selection, rather than reassembly, in a NAT. (Fragments which include
part but not all of the UDP/TCP header complicate things, it's true,
but it's hardly impossible to deal with the rewriting issues involved,
just annoying, and avoiding the issues reassembly brings might be worth
it in some environments.)
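The scheme Mouse describes might be sketched like this (illustrative
Python; queue bounds, timeouts and the partial-header complications are
omitted): the (src, dst, IP-ID) key ties non-first fragments to the flow
decision made for the first fragment, with early arrivals held until it
shows up:

```python
class FragFlowMap:
    """Drive flow selection, rather than reassembly, off the IP ID."""

    def __init__(self):
        self.by_id = {}   # (src, dst, ipid) -> flow decision
        self.held = {}    # (src, dst, ipid) -> fragments queued early

    def first_fragment(self, key, flow):
        # The first fragment carries the L4 header, so the flow decision
        # is made here; release anything that arrived before it.
        self.by_id[key] = flow
        return self.held.pop(key, [])

    def later_fragment(self, key, frag):
        if key in self.by_id:
            return self.by_id[key]          # forward with the pinned flow
        self.held.setdefault(key, []).append(frag)
        return None                         # hold until the first shows up

m = FragFlowMap()
key = ("10.0.0.5", "10.25.43.7", 0x1234)
held = m.later_fragment(key, "frag2")       # arrives early, gets held
released = m.first_fragment(key, "via-A")   # decision made, queue drained
```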

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

Darren Reed
2011-03-13 23:03:12 UTC
Post by David Young
What do people think about gradually merging the packet-forwarding and
packet-filtering functions in the kernel?
Probably the most sensible thing to do is to make it possible for
the inbound filter to return a "hint" about where the kernel should
route the packet; if that hint is null when the kernel gets to doing
the forwarding, the kernel consults the routing table(s).
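A sketch of the hint scheme (illustrative Python; all names are
hypothetical): the inbound filter may return a forwarding hint, and a
null hint falls through to the normal routing-table lookup:

```python
def filter_inbound(packet, rules):
    """Run the filter; a matching rule may supply a forwarding hint."""
    for match, hint in rules:       # first matching rule wins
        if match(packet):
            return hint
    return None                     # no opinion: let routing decide

def forward(packet, rules, routing_table):
    hint = filter_inbound(packet, rules)
    if hint is not None:
        return hint                 # the filter chose the nexthop
    # Hint was null, so the kernel consults the routing table as usual.
    return routing_table.get(packet["daddr"], routing_table["default"])

# Example: steer ssh out one provider, everything else follows routing.
rules = [(lambda p: p.get("dport") == 22, "wm0:10.0.0.1")]
rt = {"default": "tlp0:10.1.1.254"}
```

The filter never owns forwarding; it only gets a chance to override it.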

Needing or having the firewall do forwarding is ridiculous and a
gross hack. Yes, it works, but that doesn't make it right.

Darren


S.P.Zeidler
2011-03-18 07:48:02 UTC
Post by David Young
What do people think about gradually merging the packet-forwarding and
packet-filtering functions in the kernel?
If we touch the packet forwarding at all, please consider:

- for IPv6 PA multihoming you must consider source prefix as well
(sending provider B traffic with provider A prefix is not going to work
if provider B has their ducks in a row).

The solution to the two-providers-and-NAPT problem is to stop natting
new connections to provider A and to route -after- NAT based on the
source address you have, i.e., you should have <addr A>:default and
<addr B>:default at the same time. Thus you only need to keep the
NAPT state. Signalling the translator that source address A became
a bad choice is left as an exercise to the reader :-P

- metric; also, stateless ECMP (RFC2991) routing.
Not a must, but a rather definite want :)

regards,
spz
--
***@serpens.de (S.P.Zeidler)
