Post by David Young
Thanks for your thoughtful and informative response!
I don't think that it's necessary to create a new data structure
that meets every need. Some policies just refine forwarding rules,
and we can apply those policies after a super-fast forwarding-table
lookup. For example, a policy that uses PF state to pin a flow to the
current default nexthop is just a refinement of the default route, so
let us write it as such. Let pin_flow_to_nexthop be the name of a
packet-filter rule that creates a state for each flow and copies the
nexthop to it, and let -policy attach a policy to a forwarding-table
entry. First let's set the default nexthop to 10.0.0.1 (happens to be
Since I don't quite get the useful purpose achieved by the example,
let me just make a scatter-shot response concerning the things I don't
like about it in general (which doesn't mean there is something wrong with
wanting things to work that way; it is just that a world where it didn't
have to work that way is more to my taste). There are possibly
interesting questions about what the basic philosophy for building
a forwarding path should be, however, so I'll take some time making
my arguments.
First, while I realize it is popular, I cringe a bit at the notion that
"policy routing" should be somehow equivalent to flow state instantiation,
that is, the creation of state to add to your forwarding path based on
packets which have recently been moving through it. You are far better
off defining policies in a way that allows you to make forwarding
decisions statelessly, i.e. which doesn't require you to remember what
you did the last time you saw a packet that looked like that. Part of
the reason for this is the generic observation that flow forwarding caches
are very often a waste since flows typically don't last long enough to
amortize the cost of instantiating the cache entry (on big routers it is far
worse: if you are filling umpteen Gbps circuits with traffic from teeny, tiny
DSL circuits and have a packet which looks like X, it'll take forever before
you see another one like that; the concurrent forwarding state requirement
is huge and doesn't scale), but there is also a rationale for avoiding this
based on the constraints which the processing hardware doing this work
increasingly imposes. The most scalable forwarding operations will be the ones which
take no locks while moving a packet from the input interface to the output
queue since these run perfectly in parallel and scale with the number of
processor cores available to do the work. There are very few good lookup data
structures which allow concurrent updating (i.e. writes, rather than reads),
however, and those which do exist tend to be too complex or too constraining
to think about using in real life, so a flow cache is probably going to
require a lock, and a lock that is taken frequently will serialize everything back
down to near-single-core speed. While there are nails which absolutely
require the hammer of flow caching (e.g. NAT), insisting that all forwarding
be stateless with respect to traffic (beyond counting things) unless there
is simply no stateless way to provide the function is a policy which will
very probably yield the best results in the most circumstances with the
least complexity.
And note that I'm ignoring the other fact, that making a flow cache perfectly
reliable requires the router to be prepared to reassemble fragmented packets
that are not its own, something it otherwise never has to do for IPv4 and which
is an even bigger problem for IPv6, where the fragmentation protocol is
end-to-end. I don't think flow caching is ever an attractive option, so it
is better reserved for situations where there is no other option.
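
To make the stateless/stateful distinction concrete, here is a minimal
sketch in pf.conf-style syntax (the interface name and addresses are
made up for illustration). The first rule can be evaluated using nothing
but the headers of the packet in hand; the second instantiates a state
entry for each new flow and matches subsequent packets against it:

  # stateless: the decision is a pure function of the packet header
  pass out quick on em0 inet proto tcp from 10.0.0.0/24 to any port 80 no state

  # stateful: the first packet instantiates a flow entry which later
  # packets are matched against
  pass out quick on em0 inet proto tcp from 10.0.0.0/24 to any port 80 keep state

The first form requires no per-flow memory, and hence nothing which must
be locked and written per packet; the second is exactly the sort of flow
state instantiation argued against above, except where it is unavoidable.
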
The second thing I don't like about the example, however, is that I don't
have a good picture of how it is supposed to continue to work if you need
to do the same thing (whatever that is) in a slightly larger network where
installing routes with route(8) doesn't work and you have to run a routing
protocol to find your routes instead. In that case you don't install the
routes, your routing protocol implementation does, so how would it know
to make certain routes "special"? Would the plan then be to modify your
routing protocol implementation so that it did this special thing too?
And, if so, if you wanted to set policy like that not on a default route
but rather on a route to a direct neighbour, how would that get done?
You don't install those routes either, another routing protocol (ARP)
does, so does ARP need to learn about your policy requirements as well?
It seems like modifying route(8) is just the start; to make this general you
may also need to modify everything which isn't route(8) but which puts routes
in the same table. And this gets fuzzier given that the
example seems to be doing something which strikes me as fundamentally
wrong. I don't know what would prompt you to change the default route
with route(8), but I know why a routing protocol would do it: it would
do it because it determined that the old default route wasn't working
any more, and in this case leaving those flow entries pointing in the
old, broken direction just can't be right. Does this mean the routing
protocol implementation would also have to be taught about the flow cache,
so that it can go in and fix up the broken bits when appropriate, like
when the routes change because the old ones are broken rather than
"just because"? It seems like the complexity of this has the potential
to creep all over the place.
Finally, though, there is the issue of what useful purpose this might
serve and whether there are other ways to get to the same place. I'm
not sure what the purpose of the example might be, but let me just assume
that it is a method for doing something useful when you have two
working default routes and want to split traffic between them. I really
like multiple routing tables for this instead. That is, configure
two kernel routing tables, add a different default route to each table,
and then write (stateless) policy ahead of the route lookup to determine
the appropriate table to do the lookup in for the packet you are looking
at. This works not only with route(8), but also with real routing
protocols; just run two instances of the routing protocol, one to
maintain the routes in each of the tables separately. The routing
protocols don't care what your policy is doing, and your policy
doesn't care where the routes are coming from. If the purpose really
is just determining what is going on, however, then maybe it would be
worthwhile looking at things like packet header tapping (like bpf
only done at the end of the forwarding process, so you get not only
what the packet looked like but what the forwarding operation decided
to do with it), and/or maybe packet injection, where you insert a packet
that looks like 'blah' into the start of the forwarding path and capture
it again at the end to see what happened to it. These mechanisms might
allow you to find out everything about the operation that the flow cache
would tell you, but with the huge benefit that they only cost something
when you actively want to find that out; the flow cache cost is forever,
even if there is no one to look at it.
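
As a concrete sketch of the two-table arrangement, in OpenBSD-flavoured
syntax (this assumes a kernel with multiple routing-table support; the
table numbers, interface name, and addresses are illustrative only):

  # two kernel routing tables, each with its own default route
  route -T 1 add default 192.0.2.1
  route -T 2 add default 198.51.100.1

  # stateless policy ahead of the route lookup, classifying on headers only
  pass in on em0 from 10.0.1.0/24 rtable 1
  pass in on em0 from 10.0.2.0/24 rtable 2

One routing protocol instance maintaining table 1 and another maintaining
table 2 could replace the route(8) commands without the pf rules changing
at all, which is the point: the policy and the routes remain independent
of each other.
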
Anyway, here are my biases. I think it is best to assume that the
contents of kernel forwarding tables are entirely owned by standard,
automatic routing protocols, and to build a policy routing
infrastructure around them rather than in them. If one isn't running
a routing protocol for some routes (ARP or ND are almost always in use)
and is instead populating parts of the tables with route(8), don't let
route(8) do anything a standard routing protocol couldn't do on its
own. I think forwarding operations should be stateless unless there
is absolutely no way to provide a function without keeping state,
and that function is important enough that it needs to be done,
since stateless operations are maximally scalable and minimally
complex. If there is an issue of performance related to doing
stateless operations, work on making the stateless operations faster
rather than trying to remember the result you got the last time you
did it, since faster stateless forwarding is guaranteed to make
everything faster while result caching only helps if some assumptions
about the nature of the traffic you are handling are also true, and they
commonly aren't. And I think facilities used for debugging should only cost
something when you are actively using them to debug something and should
cost little or nothing when you aren't.
Dennis Ferguson