Dennis Ferguson
2014-10-10 23:57:13 UTC
I have a library implementing a longest match prefix
lookup (i.e. a route lookup) which might be useful
and which I'd like people to look at if anyone is
interested. The library is intended to be useful both
in the kernel and as a user space library with the code
as-is, since it does its memory allocations through functions
provided by the user. The library has
only a few other dependencies on the external environment;
it is mostly self-contained data structure builder and
parser code. It should be available here:
http://www.netbsd.org/~dennis/rttree.tar.gz
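
As a minimal sketch of the caller-supplied allocation idea: the
names below (rt_alloc_ops, ra_alloc and so on) are invented for
illustration and are not the actual rttree interface, but the same
ops structure could wrap malloc()/free() in user space or the
kernel allocator in the kernel.

#include <stdlib.h>
#include <stddef.h>

struct rt_alloc_ops {
	void *(*ra_alloc)(size_t size, void *ctx);	/* memory or NULL */
	void (*ra_free)(void *ptr, size_t size, void *ctx);
	void *ra_ctx;					/* opaque caller context */
};

/* User space implementation on top of malloc()/free(). */
static void *
user_alloc(size_t size, void *ctx)
{
	(void)ctx;
	return malloc(size);
}

static void
user_free(void *ptr, size_t size, void *ctx)
{
	(void)size;
	(void)ctx;
	free(ptr);
}

static const struct rt_alloc_ops user_ops = {
	.ra_alloc = user_alloc,
	.ra_free = user_free,
	.ra_ctx = NULL,
};
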
The library implements a fairly modern route lookup data
structure. Its memory usage is O(N), that is, a reasonably
constant number of bytes per route installed in the
structure. As for the value of that constant, there is
a memory-versus-performance tradeoff made by letting the
user tell it how aggressive it should be. The default
schedule used makes the internal memory usage more or less
the same as the current kernel radix trie, though I like the
effect of the alternative schedule which grows internal
memory to maybe 30% larger than that.
The structure is a tree, but the average complexity of a lookup
scales better than O(log(N)). I can't tell you how much better.
I once convinced myself that if the algorithm were solely constrained
by the O(N) memory consumption it would scale as something
like O(log(log(N))) on average, but additional constraints
placed on it to make incremental updates efficient have likely
impaired that.
The structure can be modified while lookups are concurrently
in progress, though this support is not perfectly transparent
to readers and requires readers in a tree being modified to
use slightly different code to deal with the transient states
they might see which wouldn't otherwise occur. So as not to
penalize either use, the library currently provides alternative
entry points for when the tree might be changing and for when
it isn't, though that might eventually change (the difference
in code isn't very large). Modification operations are add
a node, delete a node, or replace a node with another having
the same prefix. There are tree traversal functions for every
kind of traversal I've ever had a use for in both non-volatile
and volatile tree versions, though some of the latter are still
experimental (i.e. suspect).
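
To give a plausible shape to the two-entry-point idea, here is a
sketch assuming a simple binary trie; the type and function names
are invented for illustration and are not the rttree library's API.

#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

struct tnode {
	_Atomic(struct tnode *) child[2];
	const void *route;	/* non-NULL if a route terminates here */
	unsigned bit;		/* key bit tested at this node */
};

/* Non-volatile entry point: no writer can be running, so ordinary
 * (relaxed) loads of the child pointers are sufficient. */
static const void *
lookup_stable(struct tnode *n, uint32_t key)
{
	const void *best = NULL;

	while (n != NULL) {
		if (n->route != NULL)
			best = n->route;
		n = atomic_load_explicit(&n->child[(key >> n->bit) & 1],
		    memory_order_relaxed);
	}
	return best;
}

/* Volatile entry point: a writer may be modifying the tree, so
 * child pointers are read with acquire ordering to ensure a newly
 * linked node is seen fully initialized.  A real version would
 * also need the extra transient-state handling mentioned above,
 * which this sketch omits. */
static const void *
lookup_concurrent(struct tnode *n, uint32_t key)
{
	const void *best = NULL;

	while (n != NULL) {
		if (n->route != NULL)
			best = n->route;
		n = atomic_load_explicit(&n->child[(key >> n->bit) & 1],
		    memory_order_acquire);
	}
	return best;
}
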
The structure additionally supports the storage of multiple nodes
with the same destination prefix but which are distinguished from
each other by some other user-defined attribute. This support is
a bit clumsy and doesn't scale to large numbers of routes per prefix, but
it doesn't cost anything if you don't use it and for certain
applications (including the kernel) it provides some utility. If
someone makes use of it and has opinions, it can be fixed. It
lets the structure do dual duty as both a forwarding table (where
more than one route per prefix is irrelevant) and a route table
(where it might be relevant).
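
A rough sketch of that multiple-routes-per-prefix arrangement,
again with invented names; the library's actual distinguishing
attribute is user-defined rather than a plain int.

#include <stddef.h>

struct route {
	struct route *r_next;	/* other routes sharing this prefix */
	int r_attr;		/* user-defined distinguishing attribute */
	/* destination prefix, nexthop, etc. would live here */
};

/* A forwarding lookup can just take the head of the list; a route
 * table lookup can search for a particular attribute. */
static struct route *
prefix_find_attr(struct route *head, int attr)
{
	struct route *r;

	for (r = head; r != NULL; r = r->r_next)
		if (r->r_attr == attr)
			return r;
	return NULL;
}
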
There is one thing it doesn't do: it doesn't support (non-prefix)
routes with non-contiguous masks. I spent quite a while trying to
figure out how to support this at one point, with the only constraint
being that I didn't want the existence of that support to impair
performance when there were no non-contiguous masks in the table
(since no one ever uses them), but ended up convincing myself that
this was impossible. You can build a much better best match
structure when the assumption that all nodes are prefixes is
baked into the structure itself than when that assumption might
not hold. I might be wrong,
and I stand ready to be corrected, but I don't know of a way to
build a good route lookup compatible with non-contiguous masks,
and I didn't want to have a not-so-good route lookup for the
benefit of that feature.
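
For anyone unfamiliar with the distinction: a prefix route's mask
has all of its one bits on the left, like 255.255.240.0, while a
non-contiguous mask such as 255.0.255.0 matches scattered bits.
A mask m in host byte order is contiguous exactly when its
complement is one less than a power of two, which is easy to test:

#include <stdint.h>
#include <stdbool.h>

static bool
mask_is_contiguous(uint32_t m)
{
	uint32_t x = ~m;

	return (x & (x + 1)) == 0;	/* true iff x is 2^k - 1 */
}
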
I've done route lookup structures for money before, but this one
has been a personal project I've been tinkering with for quite a
while now. I wanted to find a best match lookup good enough
that it would be a minor fraction of the practical cost of
getting a packet from one interface to another in a kernel-software
router (and remove any reason to think a forwarding path route
cache would somehow make this better), even with a lot of routes,
while also being appropriate and economical for a host with one
interface and not too many routes (hence the fixation with O(N)
memory use and self-scaling memory consumption). I'm biased, and
I'm pretty sure there is plenty of room for improvement, but I think
this one makes the tradeoffs pretty well and isn't too bad.
I also thought I should try to measure performance of the thing
before talking about it, since everyone likes numbers. I'd
actually not done this before because I'm personally pleased enough
by complexity arguments, but also because performance is a "fuzzy"
topic with no correct answer and I never quite know how to measure
it realistically (I guess this is also an admission that the code
has never been performance-optimized). The practical performance
of a software route lookup, like any search of memory, is hugely
dominated not so much by the number of reads you have to do to get
a result, but by the number of those reads which need something not
already in the processor cache. The first lookup in something you
haven't touched for a while can take a long time, but the second
can take no time at all. I don't think either the really slow
number or the really speedy number is representative of anything
(and the slow number is hard to measure), so I wasn't quite sure what
to do.
I arrived at the following since it was easy for my test bench. I
downloaded a full set of external Internet IPv4 routes, about 523,000,
and added them to a table. Including the routes and internal stuff
this produced about 90 MB of data structure on an amd64 machine, way
too big to fit in any processor cache. I then constructed 523,000
addresses to look up in the table, one matching each route in there,
and looked these up in one of two orders. The first processed
the addresses in the tree's lexical order, so that each lookup would
follow a path through the structure memory almost the same as the
previous lookup. This "good cache" case will still have cache
misses, but just enough to cycle the 90 MB through the cache once.
The other case instead fully randomizes the order of the addresses,
so route lookups instead take random pokes at the 90 MB to make it
the "bad cache" case. They hence both do the same work but in a
different order. I should note that this is still not a good
measure of real life, since real life would have way more than
just route lookup competing for cache space (then again, real
life for this structure is unlikely to be 523,000 routes very
often), but it's something.
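
The shape of that measurement, sketched with a stand-in
table_lookup() rather than the library's real entry point:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

extern const void *table_lookup(uint32_t addr);	/* stand-in */

static uint64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Fisher-Yates shuffle, to turn the lexically ordered probe
 * addresses into the "bad cache" order. */
static void
shuffle(uint32_t *a, size_t n)
{
	for (size_t i = n - 1; i > 0; i--) {
		size_t j = (size_t)rand() % (i + 1);
		uint32_t t = a[i]; a[i] = a[j]; a[j] = t;
	}
}

/* Time n lookups and report the per-lookup average. */
static void
run(const char *label, const uint32_t *addrs, size_t n)
{
	uint64_t t0 = now_ns();

	for (size_t i = 0; i < n; i++)
		(void)table_lookup(addrs[i]);
	printf("%s: %.1f ns/lookup\n", label,
	    (double)(now_ns() - t0) / (double)n);
}

/* Usage: with addrs[] sorted in the tree's lexical order,
 *   run("good cache", addrs, n);
 *   shuffle(addrs, n);
 *   run("bad cache", addrs, n);
 */
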
The "good cache" case runs about 5 times faster than the "bad cache"
case on my machine, so it is doing something. On a recent Core i7
running NetBSD I get 30 ns per "good cache" lookup and 150 ns
per "bad cache". I thought these numbers seemed low but they came
out the same on my Mac laptop with a similar processor and a
different compiler. I eventually resorted to instruction counting
to make sure the 30 ns didn't violate physics, and it does seem
plausible. If everything goes perfectly, 30 ns might be enough time
to execute as many as 150 instructions on these machines, and I
think a typical lookup in this table might end up needing somewhere
in the ballpark of 110-120 instructions. As this is a class of
machine where I think you'd be doing really well to fully process
a packet from input to output with general purpose code in less
than a few microseconds, I feel pretty comfortable arguing that
adding a forwarding path cache to avoid doing a full longest-match
route lookup per packet is unlikely to save anything worth the
cost of maintaining any additional lines of code, and that
just about anything matching rtcache_* in the kernel should not
be there.
In any case, this is part of a larger library of lockless-reader
data structure code and other bits which provide a sufficient base
to construct a basic network stack which can move packets from
input to output interface, or input interface to demultiplexed
transport session, without locking around the shared data. I'm
trying to organize the rest of it so that I can donate it all, but
everything is more or less implemented in the "style" of the
above (e.g. builds into the kernel or a user space library
unchanged, often provides separate "volatile" and "non-volatile"
reader functions, ...) and I was interested in seeing if anyone
found bits of this objectionable.
Oh, and it probably won't work on a multiprocessor DEC Alpha the
way it is. I started off aiming for that, but found having to
encapsulate every pointer dereference made the code even uglier
than it is already. I guess I could go back and retrofit that
if it is essential.
Dennis Ferguson