Discussion:
Improving the data supplied by BPF
Thor Lancelot Simon
2008-12-25 18:16:42 UTC
Permalink
This set of diffs attempts to address that by introducing a new BPF
record format that the kernel can produce. At present this is enabled
by issuing an ioctl that could effectively be turned into something
that "tuned" the data format provided to applications. The other way
I thought of providing a different format was to create a /dev/ebpf,
but that meant a whole lot more trouble.
I have an immediate use for this and am very glad you did the work.

I hope this is checked in to NetBSD.

Thank you!

Thor

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Arnaud Lacombe
2008-12-25 20:58:54 UTC
Permalink
Hi,
Recently I've talked with a few different folks about packet capture
and have become aware of some of the problems that people face when
trying to use BPF vs other proprietary solutions that exist. While it
may be possible to capture data at a good rate with BPF, there is
important metadata that isn't provided.
could you detail what BPF is missing vs. other proprietary solutions?
What can a heavy tcpdump user expect compared to the current one?
This set of diffs attempts to address that by introducing a new BPF
maybe your changes would be clearer if you only provided the diff made
on BPF itself (about 10% of the whole diff), and a sample use-case.
Everything else is only API change.
The purpose of the sequence number is to provide the rolling counter
of the packets captured for the one in question. Thus if in successive
reads the count went from 2 to 5, you know 3 packets have been missed.
what if the count goes from 3 to... 3, i.e. the seq number overflowed
(for whatever reason)?
/*
* Enhanced BPF packet record structure
*/
typedef struct ebpf_rec_s {
uint64_t ebr_secs; /* No more Y2k38 problem */
why unsigned? currently `tv_sec' is signed. Why not use time_t?
There is an obvious ABI breakage when we switch to 64-bit time_t,
but it would be a better type than a raw integer. The breakage is a
different trouble and should be dealt with separately.
uint32_t ebr_nsecs;
why do you want nanosecond precision if you are getting your information
from a microsecond-precision variable? There is no information gain
there, and your code reflects this (i.e. you just "* 1000" the
microsecond value to get the nanosecond value).

This field would have a meaning if you changed the call to
microtime() into nanotime() in bpf_tap()/bpf_deliver() and built a
homegrown `struct timeval' in the non-extended capture format. You
don't have any precision loss in that case.

btw, why not just use a `struct timespec'?
uint32_t ebr_seqno; /* sequence number in capture */
how do you detect a wrap in the sequence number?

As we have timestamps, these can be used to order sequence numbers, as
done with TCP's PAWS, I guess.
uint32_t ebr_flags;
uint32_t ebr_rlen; /* 16 bits is not enough for IPv6 */
uint32_t ebr_wlen; /* Jumbograms, so we have to use */
uint32_t ebr_clen; /* 32 bits to represent all lengths */
uint32_t ebr_pktoff;
uint16_t ebr_type; /* DLT_* type */
uint16_t ebr_subtype;
} ebpf_rec_t;
/*
* rlen = total record length (header + packet)
* wlen = wire length of packet
* clen = captured length of packet
* pktoff = offset from ebr_secs to the start of the packet data (may not be
* the same as sizeof(ebr_rec_t))
*
s/asa/as/ :)
*/
#define EBPF_OUT 0x00000001 /* Transmitted packet */
I guess there will also be EBPF_IN; do you foresee any other possible flags?

Many thanks,

- Arnaud

Christos Zoulas
2008-12-25 23:34:10 UTC
Permalink
Post by Arnaud Lacombe
The purpose of the sequence number is to provide the rolling counter
of the packets captured for the one in question. Thus if in successive
reads the count went from 2 to 5, you know 3 packets have been missed.
what if the count goes from 3 to... 3, ie. the seq number overflowed
(for whatever reason) ?
Highly unlikely.
Post by Arnaud Lacombe
why unsigned? currently `tv_sec' is signed. Why not use time_t?
There is an obvious ABI breakage when we switch to 64-bit time_t,
but it would be a better type than a raw integer. The breakage is a
different trouble and should be dealt with separately.
You answered your own question. All types in the struct should be fixed size
and time_t will change soon to be 64 bits.
Post by Arnaud Lacombe
uint32_t ebr_nsecs;
why do you want nanosecond precision if you are getting your information
from a microsecond-precision variable? There is no information gain
there, and your code reflects this (i.e. you just "* 1000" the
microsecond value to get the nanosecond value).
In the time_t branch it already uses nanoseconds. Plus, why would you want
to stick with micros in the new interface?
Post by Arnaud Lacombe
This field would have a meaning if you changed the call to
microtime() into nanotime() in bpf_tap()/bpf_deliver() and built a
homegrown `struct timeval' in the non-extended capture format. You
don't have any precision loss in that case.
btw, why not just use a `struct timespec'?
Fixed sizes. Most on-wire structures use fixed sizes; this allows
portability across different architectures.
Post by Arnaud Lacombe
uint32_t ebr_seqno; /* sequence number in capture */
how do you detect a wrap in the sequence number?
It does not happen in real situations.


christos


Arnaud Lacombe
2008-12-26 04:45:07 UTC
Permalink
Post by Christos Zoulas
Post by Arnaud Lacombe
The purpose of the sequence number is to provide the rolling counter
of the packets captured for the one in question. Thus if in successive
reads the count went from 2 to 5, you know 3 packets have been missed.
what if the count goes from 3 to... 3, i.e. the seq number overflowed
(for whatever reason)?
Highly unlikely.
2^32 1500-byte packets is about 6TB of data: on a 100Mbit link, the
sequence number will wrap after 5.6 days (if you consider
uni-directional traffic); on a 1Gb link, after half a day; and after a bit
more than 1 hour on a 10Gb link. This is the worst-case scenario. Two
records taken at the <wrap_time> interval will likely collide on a
high-load link.
Post by Christos Zoulas
Post by Arnaud Lacombe
why unsigned? currently `tv_sec' is signed. Why not use time_t?
There is an obvious ABI breakage when we switch to 64-bit time_t,
but it would be a better type than a raw integer. The breakage is a
different trouble and should be dealt with separately.
You answered your own question. All types in the struct should be fixed size
and time_t will change soon to be 64 bits.
This is not the problem here.

The ABI breakage will anyway cause a problem with all system calls
taking a `struct timespec' as a parameter or including a `struct
timespec' in one of their parameters.
Post by Christos Zoulas
Post by Arnaud Lacombe
uint32_t ebr_nsecs;
why do you want nanosecond precision if you are getting your information
from a microsecond-precision variable? There is no information gain
there, and your code reflects this (i.e. you just "* 1000" the
microsecond value to get the nanosecond value).
In the time_t branch it already uses nanoseconds. Plus, why would you want
to stick with micros in the new interface?
The patch I commented on used a micro value to get a nano value. I don't
see your point. I don't get the link between Darren's patch and the
time_t branch either.
Post by Christos Zoulas
Post by Arnaud Lacombe
This field would have a meaning if you changed the call to
microtime() into nanotime() in bpf_tap()/bpf_deliver() and built a
homegrown `struct timeval' in the non-extended capture format. You
don't have any precision loss in that case.
btw, why not just use a `struct timespec'?
Fixed sizes. Most on-wire structures use fixed sizes; this allows
portability across different architectures.
the current BPF doesn't... It uses a `struct timeval' whose tv_sec and
tv_usec are `long', which is not portable across architectures (i386's
long == 4, amd64's long == 8).
Post by Christos Zoulas
Post by Arnaud Lacombe
uint32_t ebr_seqno; /* sequence number in capture */
how do you detect a wrap in the sequence number?
It does not happen in real situations.
I'm not so sure about this, cf. above.

btw, why is the current `struct bpf_hdr' not packed? This would avoid
the SIZEOF_BPF_HDR hack...

- Arnaud

Christos Zoulas
2008-12-26 15:13:00 UTC
Permalink
On Dec 25, 11:45pm, ***@gmail.com ("Arnaud Lacombe") wrote:
-- Subject: Re: Improving the data supplied by BPF

| 2^32 1500-byte packets is about 6TB of data: on a 100Mbit link, the
| sequence number will wrap after 5.6 days (if you consider
| uni-directional traffic); on a 1Gb link, after half a day; and after a bit
| more than 1 hour on a 10Gb link. This is the worst-case scenario. Two
| records taken at the <wrap_time> interval will likely collide on a
| high-load link.

we could make it 64 bit, but still, what's the scenario here? That
we look at the log file and we can't tell if it wrapped?

| >>why unsigned? currently `tv_sec' is signed. Why not use time_t?
| >>There is an obvious ABI breakage when we switch to 64-bit time_t,
| >>but it would be a better type than a raw integer. The breakage is a
| >>different trouble and should be dealt with separately.
| >
| > You answered your own question. All types in the struct should be fixed size
| > and time_t will change soon to be 64 bits.
| >
| This is not the problem here.
|
| The ABI breakage will anyway cause a problem with all system calls
| taking a `struct timespec' as a parameter or including a `struct
| timespec' in one of their parameters.

This has already been taken care of in the branch, along with timeval.
All the calls that use them have been versioned.

| > In the time_t branch it already uses nanoseconds. Plus, why would you want
| > to stick with micros in the new interface?
| >
| The patch I commented on used a micro value to get a nano value. I don't
| see your point. I don't get the link between Darren's patch and the
| time_t branch either.

In the time_t branch time is 64 bits and most networking kernel sampling
code has been converted to timespecs.

| the current BPF doesn't... It uses a `struct timeval' whose tv_sec and
| tv_usec are `long', which is not portable across architectures (i386's
| long == 4, amd64's long == 8).

And has caused me a lot of trouble in the past. This fixes it.

| btw, why is the current `struct bpf_hdr' not packed? This would avoid
| the SIZEOF_BPF_HDR hack...

packing a structure should be avoided when we can force optimal packing
by re-ordering members because it makes the code compiler neutral.

christos

der Mouse
2008-12-26 17:18:46 UTC
Permalink
Post by Christos Zoulas
Post by Arnaud Lacombe
btw, why is the current `struct bpf_hdr' not packed? This would
avoid the SIZEOF_BPF_HDR hack...
packing a structure should be avoided when we can force optimal
packing by re-ordering members because it makes the code compiler
neutral.
Until you meet a compiler with unusual structure member alignment
rules. (Nothing says, for example, that a struct beginning with
multiple 64-bit values will have no padding between them - as may very
well happen on a machine with 9- or 18-bit bytes.)

I would much prefer to see a decision whether the interface is "however
this struct turns out to be laid out in memory" (in which case the only
reason to care about padding is efficiency of memory use) or "this
sequence of values of defined sizes, packed tightly" (in which case the
resulting definition may be borderline unimplementable on unusual
architectures - and depending on a struct to get it is, to steal a
phrase, unwarranted chumminess with the compiler).

I've sometimes thought about building a bug-finding compiler, one which
goes out of its way to break common-but-not-guaranteed assumptions
(such as the struct-packing one mentioned above, or "nil pointers of
all types are all the same size and all all-0-bits"), to help people
who want to produce more portable code.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

David Young
2008-12-26 18:39:42 UTC
Permalink
Post by Christos Zoulas
| btw, why is the current `struct bpf_hdr' not packed? This would avoid
| the SIZEOF_BPF_HDR hack...
packing a structure should be avoided when we can force optimal packing
by re-ordering members because it makes the code compiler neutral.
I don't think that you can force space-optimal packing by re-ordering
members, can you? AFAIK, no standard forbids the compiler from
padding this struct,

struct x {
uint64_t x;
uint32_t y;
uint16_t z;
uint8_t w;
};

like this

struct x {
uint64_t x;
uint32_t y;
uint32_t pad1;
uint16_t z;
uint16_t pad2;
uint32_t pad3;
uint8_t w;
uint8_t pad4;
uint16_t pad5;
uint32_t pad6;
};

Ordinarily, I would qualify the struct x definition with __packed
__aligned(8) in order to get the close-packed binary format that
I desired.

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 278-3933

Joerg Sonnenberger
2008-12-26 18:45:53 UTC
Permalink
Post by David Young
I don't think that you can force space-optimal packing by re-ordering
members, can you? AFAIK, no standard forbids the compiler from
padding this struct,
The standard doesn't, but any sane ABI will. I don't think we should
really bother with theoretical compilers or the mistakes of the ancient
past. The only mildly insane ABI left is ARM with its explicit struct
alignment rules.

Joerg

David Young
2008-12-26 19:26:14 UTC
Permalink
Post by Joerg Sonnenberger
Post by David Young
I don't think that you can force space-optimal packing by re-ordering
members, can you? AFAIK, no standard forbids the compiler from
padding this struct,
The standard doesn't, but any sane ABI will. I don't think we should
really bother with theoretical compilers or the mistakes of the ancient
past. The only mildly insane ABI left is ARM with its explicit struct
alignment rules.
One should still write code that is portable by writing that the
struct is __packed __aligned(n), if only to persuade those of us
who try to rely on only one arcane standard at a time to interpret
a program. :-)

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 278-3933

Jim Wise
2008-12-27 00:43:28 UTC
Permalink
Post by Arnaud Lacombe
Post by Christos Zoulas
Post by Arnaud Lacombe
what if the count goes from 3 to... 3, i.e. the seq number overflowed
(for whatever reason)?
Highly unlikely.
2^32 1500-byte packets is about 6TB of data: on a 100Mbit link, the
sequence number will wrap after 5.6 days (if you consider
uni-directional traffic); on a 1Gb link, after half a day; and after a bit
more than 1 hour on a 10Gb link. This is the worst-case scenario. Two
records taken at the <wrap_time> interval will likely collide on a
high-load link.
Will they? Wouldn't you have to not sample for a whole <wrap period>
for it not to be immediately obvious that a wraparound had occurred?

Is an application which wants to fill a 100Mbit link, but not sample more
than once every five days, what we should be designing for?
--
Jim Wise
***@draga.com
Arnaud Lacombe
2008-12-27 01:32:47 UTC
Permalink
Hi,
Post by Jim Wise
Post by Arnaud Lacombe
2^32 1500-byte packets is about 6TB of data: on a 100Mbit link, the
sequence number will wrap after 5.6 days (if you consider
uni-directional traffic); on a 1Gb link, after half a day; and after a bit
more than 1 hour on a 10Gb link. This is the worst-case scenario. Two
records taken at the <wrap_time> interval will likely collide on a
high-load link.
Will they? Wouldn't you have to not sample for a whole <wrap period>
for it not to be immediately obvious that a wraparound had occurred?
no; think about the case where you set up a low-pass filter in tcpdump
to monitor only some events. In that case, sequence numbers get consumed
by high-frequency events and the wrap can happen in the background.
Post by Jim Wise
Is an application which wants to fill a 100Mbit link, but not sample more
than once every five days, what we should be designing for?
no; the timestamp will help you discern and tell whether a wrap occurred.
If two records have the same timestamp (we only have microsecond
precision now), the seq number will give you their order. If two
records have the same seq number but not the same timestamp, then you
can say that a wrap occurred and you'll make the distinction based on
the timestamp. You are in trouble if a record has the same timestamp and
seq number :-)

Note that the wrap can happen sooner than the 5.6 days if the packets
sent are smaller.

- Arnaud

Christos Zoulas
2008-12-27 03:09:18 UTC
Permalink
Post by Arnaud Lacombe
Hi,
Post by Jim Wise
Post by Arnaud Lacombe
2^32 1500-byte packets is about 6TB of data: on a 100Mbit link, the
sequence number will wrap after 5.6 days (if you consider
uni-directional traffic); on a 1Gb link, after half a day; and after a bit
more than 1 hour on a 10Gb link. This is the worst-case scenario. Two
records taken at the <wrap_time> interval will likely collide on a
high-load link.
Will they? Wouldn't you have to not sample for a whole <wrap period>
for it not to be immediately obvious that a wraparound had occurred?
no; think about the case where you set up a low-pass filter in tcpdump
to monitor only some events. In that case, sequence numbers get consumed
by high-frequency events and the wrap can happen in the background.
Post by Jim Wise
Is an application which wants to fill a 100Mbit link, but not sample more
than once every five days, what we should be designing for?
no; the timestamp will help you discern and tell whether a wrap occurred.
If two records have the same timestamp (we only have microsecond
precision now), the seq number will give you their order. If two
records have the same seq number but not the same timestamp, then you
can say that a wrap occurred and you'll make the distinction based on
the timestamp. You are in trouble if a record has the same timestamp and
seq number :-)
Note that the wrap can happen sooner than the 5.6 days if the packets
sent are smaller.
- Arnaud
Well, we can make the counter 64 bits...


christos


Darren Reed
2008-12-28 02:03:48 UTC
Permalink
Post by Arnaud Lacombe
Hi,
Post by Jim Wise
Post by Arnaud Lacombe
2^32 1500-byte packets is about 6TB of data: on a 100Mbit link, the
sequence number will wrap after 5.6 days (if you consider
uni-directional traffic); on a 1Gb link, after half a day; and after a bit
more than 1 hour on a 10Gb link. This is the worst-case scenario. Two
records taken at the <wrap_time> interval will likely collide on a
high-load link.
Will they? Wouldn't you have to not sample for a whole <wrap period>
for it not to be immediately obvious that a wraparound had occurred?
no; think about the case where you set up a low-pass filter in tcpdump
to monitor only some events. In that case, sequence numbers get consumed
by high-frequency events and the wrap can happen in the background.
The filter is applied before the counter increments.

It is a count of packets accepted by the filter.

It is not a count of packets on the NIC unless the filter accepts
all packets on the NIC.

So if your filter was "arp" or "port 67", then if there were
500,000pps of NFS traffic, the counter would not move because
of NFS packets, only because of the ARP or DHCP/BOOTP messages.

Darren

Arnaud Lacombe
2008-12-28 03:59:57 UTC
Permalink
Post by Darren Reed
The filter is applied before the counter increments.
It is a count of packets accepted by the filter.
It is not a count of packets on the NIC unless the filter accepts
all packets on the NIC.
So if your filter was "arp" or "port 67", then if there were
500,000pps of NFS traffic, the counter would not move because
of NFS packets, only because of the ARP or DHCP/BOOTP messages.
looks like I took too quick a look at the code :)

cheers,

- Arnaud

Darren Reed
2008-12-26 13:38:12 UTC
Permalink
Post by Arnaud Lacombe
Hi,
Recently I've talked with a few different folks about packet capture
and have become aware of some of the problems that people face when
trying to use BPF vs other proprietary solutions that exist. While it
may be possible to capture data at a good rate with BPF, there is
important metadata that isn't provided.
could you detail what BPF is missing vs. other proprietary solutions?
What can a heavy tcpdump user expect compared to the current one?
Notification about when packets are dropped, an indication of
whether or not the packet was going in or out... there are
additional characteristics, such as if the packet was an "error"
(i.e. bad ethernet CRC, runt, etc) but it appears BPF doesn't
see those anyway. Being able to easily find the start of the
packet, being told the complete size of the current record...
Post by Arnaud Lacombe
This set of diffs attempts to address that by introducing a new BPF
maybe your changes would be clearer if you only provided the diff made
on BPF itself (about 10% of the whole diff), and a sample use-case.
Everything else is only API change.
If you're sufficiently interested then I'm sure you can extract the
part that concerns you... but honestly, the code changes to BPF are
trivial. What's really important is what I included and what you've
commented on.

Simple use case? Say you've got a bridge port on your NetBSD box and
you capture packets on it. How do you know which packets were going
to boxes that are connected out that wire vs some other bridge port?
i.e. if you used tcpdump today, you've got a bunch of packets that
show a conversation between two hosts. How do you know from the
capture which packets were sent out the NIC vs which were received?

Say you've got two raw capture files from different interfaces that
have different media types. How do you merge them into one for easier
analysis? (Current pcap files encode the link type in the file header,
thus implying every packet has the same MAC type.)
Post by Arnaud Lacombe
The purpose of the sequence number is to provide the rolling counter
of the packets captured for the one in question. Thus if in successive
reads the count went from 2 to 5, you know 3 packets have been missed.
what if the count goes from 3 to... 3, i.e. the seq number overflowed
(for whatever reason)?
So while the program was sleeping, 4 billion packets went through.
Well, I suppose that's only an hour or so of sleeping with line
rate on a 10G card. I think there's a chance that a sequence number
wrap will be noticed in those conditions.... not to mention that
grabbing the BPF statistics would show a very very large delta in
bs_drop. But left long enough on a fast NIC, even that will wrap.
Post by Arnaud Lacombe
/*
* Enhanced BPF packet record structure
*/
typedef struct ebpf_rec_s {
uint64_t ebr_secs; /* No more Y2k38 problem */
why unsigned? currently `tv_sec' is signed. Why not use time_t?
There is an obvious ABI breakage when we switch to 64-bit time_t,
but it would be a better type than a raw integer. The breakage is a
different trouble and should be dealt with separately.
I'd use "time_t" here but I don't want to risk it being
mistaken for a 32-bit value. uint64_t allows me to be specific
about the size of the field. Why unsigned? Because unsigned
containers never influence the value that gets put in them.
Post by Arnaud Lacombe
uint32_t ebr_nsecs;
why do you want nanosecond precision if you are getting your information
from a microsecond-precision variable? There is no information gain
there, and your code reflects this (i.e. you just "* 1000" the
microsecond value to get the nanosecond value).
Let's see... with a 10Gb port, what do you think the spacing
is between packets when they're arriving at a rate of 10,000,000
per second? Finer than microsecond granularity can provide.
I don't know what the current line-speed tests of NetBSD are
with 10G cards, but at Sun I've seen boxes forwarding at
greater than 50% of 10G line speed (>5,000,000 pps.)

The point of defining the field in this manner is to make it
easily possible for future code changes to take advantage of
the extra precision available.
Post by Arnaud Lacombe
This field would have a meaning if you changed the call to
microtime() into nanotime() in bpf_tap()/bpf_deliver() and built a
homegrown `struct timeval' in the non-extended capture format. You
don't have any precision loss in that case.
I'm just trying to leverage existing code and make the
minimal number of changes necessary to support a new format.

But by defining a new time format to use nano-seconds rather
than microseconds, I make the change you've described possible.

For example, I don't know if nanotime() is designed to be called
1 million or more times a second... it may be the wrong thing to
use when it becomes necessary to deal with packets at that speed.
Even now, microtime isn't that fine-grained (it's rather chunky),
so I'm not trying to pretend that nanosecond precision is
possible with the existing APIs, but at the same time, if a change
is to be made then it needs to look forward, and that means using
nanoseconds here.
Post by Arnaud Lacombe
btw, why not just use a `struct timespec'?
Because I don't want there to be any vagueness about the size of
the field to store seconds in.

For example, even -current on i386 defines time_t (which timespec
uses) as being a "long", so it would be 32 bits. Again, if a change
is to be made then we need to apply some amount of future-proofing.

I suppose that we could define it in terms of picoseconds if you
feel that nanoseconds is not enough and make it a 64bit field too?
Post by Arnaud Lacombe
uint32_t ebr_seqno; /* sequence number in capture */
how do you detect a wrap in the sequence number?
That's up to the consumer to decide. Whatever size field is used,
there's always going to be a "wrap problem", no matter what sort
of counter or wrap-counting counter is used.

I could almost be convinced to make this a 64-bit counter, but the
counter it pulls information from (bh_ccount) is only 32 bits on
some platforms (it's a long in bpfdesc.h) so it's possibly a waste
of bits, anyway. Then again, maybe bh_{c,d,r}count should all be
forcibly bumped to 64 bits and then this also...
Post by Arnaud Lacombe
As we have timestamps, these can be used to order sequence numbers, as
done with TCP's PAWS, I guess.
This field isn't there for sequencing; it's to provide the
consumer of the BPF data with knowledge about whether or not
there has been a dropped packet in the block of data received.
Post by Arnaud Lacombe
uint32_t ebr_flags;
uint32_t ebr_rlen; /* 16 bits is not enough for IPv6 */
uint32_t ebr_wlen; /* Jumbograms, so we have to use */
uint32_t ebr_clen; /* 32 bits to represent all lengths */
uint32_t ebr_pktoff;
uint16_t ebr_type; /* DLT_* type */
uint16_t ebr_subtype;
} ebpf_rec_t;
/*
* rlen = total record length (header + packet)
* wlen = wire length of packet
* clen = captured length of packet
* pktoff = offset from ebr_secs to the start of the packet data (may not be
* the same as sizeof(ebr_rec_t))
*
s/asa/as/ :)
*/
#define EBPF_OUT 0x00000001 /* Transmitted packet */
I guess there will also be EBPF_IN; do you foresee any other possible flags?
How do you know if something is black?

If EBPF_OUT isn't set to indicate out, doesn't that
then imply that the packet is an "input" packet?

Darren


David Young
2008-12-26 18:21:51 UTC
Permalink
Post by Darren Reed
Post by Arnaud Lacombe
*/
#define EBPF_OUT 0x00000001 /* Transmitted packet */
I guess there will also be EBPF_IN; do you foresee any other possible flags?
How do you know if something is black?
If EBPF_OUT isn't set to indicate out, doesn't that
then imply that the packet is an "input" packet?
It could mean "don't know" or "don't care". It is the sort of
thing that deserves to be spelled out in documentation.

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 278-3933

Greg Troxel
2008-12-26 03:03:29 UTC
Permalink
One thing I've always found missing in bpf is a flag to denote outgoing
vs incoming packets. It might also be nice to note 'tapped as added to
queue' vs 'tapped when tx dma set up' vs 'tapped when tx ack arrives'.
Darren Reed
2008-12-26 13:49:16 UTC
Permalink
Post by Greg Troxel
One thing I've always found missing in bpf is a flag to denote outgoing
vs incoming packets. It might also be nice to note 'tapped as added to
queue' vs 'tapped when tx dma set up' vs 'tapped when tx ack arrives'.
Interesting, but I think that's outside the scope of what I'm
trying to do here and now. Additionally, if that were possible,
we'd need to think about if/how that is exposed to the user
and how to interpret a user saying "tap tx dma set up packets
for loopback" (because they'll try to use the same CLI/filter
for loopback as they did for the bge chip...)

Darren

Greg Troxel
2008-12-26 14:45:08 UTC
Permalink
Post by Darren Reed
Post by Greg Troxel
One thing I've always found missing in bpf is a flag to denote outgoing
vs incoming packets. It might also be nice to note 'tapped as added to
queue' vs 'tapped when tx dma set up' vs 'tapped when tx ack arrives'.
Interesting, but I think that's outside the scope of what I'm
trying to do here and now. Additionally, if that were possible,
we'd need to think about if/how that is exposed to the user
and how to interpret a user saying "tap tx dma set up packets
for loopback" (because they'll try to use the same CLI/filter
for loopback as they did for the bge chip...)
I wasn't proposing an interface to control tap location, just flag bits
to record what happened. I once added support to ath(4) to do the
tapping on tx complete so it could include more status.
But I see the point that one has to stop somewhere.
Darren Reed
2008-12-26 15:03:50 UTC
Permalink
Post by Greg Troxel
Post by Darren Reed
Post by Greg Troxel
One thing I've always found missing in bpf is a flag to denote outgoing
vs incoming packets. It might also be nice to note 'tapped as added to
queue' vs 'tapped when tx dma set up' vs 'tapped when tx ack arrives'.
Interesting, but I think that's outside the scope of what I'm
trying to do here and now. Additionally, if that were possible,
we'd need to think about if/how that is exposed to the user
and how to interpret a user saying "tap tx dma set up packets
for loopback" (because they'll try to use the same CLI/filter
for loopback as they did for the bge chip...)
I wasn't proposing an interface to control tap location, just flag bits
to record what happened. I once added support to ath(4) to do the
tapping on tx complete so it could include more status.
But I see the point that one has to stop somewhere.
I wonder if what you're looking for is something more akin to
what dtrace allows? bpf's purpose is primarily packet data and
not so much NIC status, etc. A dtrace-like solution would let
you tap packets as they move along, in and out of the stack, and
at specific "probe" points. I say that because it sounds like
you are just as (if not more) interested in non-packet data
and for that, I think BPF is not really the right answer...

Darren


Greg Troxel
2008-12-26 15:27:41 UTC
Permalink
No, I am really looking for packet data, but perhaps packet metadata a
la DLT_IEEE802_11_RADIO. As a first step, I want the data source
labeled, so I can know if I have queueing delays or not. The point of
tapping at tx complete is that you can find out how many retries were
used and what the actual rate for that packet was (sometimes an 802.11
interface will drop back to 1 Mb/s after failing to get an ack at the
chosen rate), or if no ack was received. I wanted this data to compute
the ETX metric

http://pdos.csail.mit.edu/papers/grid:mobicom03/paper.pdf
http://pdos.csail.mit.edu/decouto/abstract.html

which is otherwise a bit hard to get. I was relatively unconcerned
about processing overhead and how the software was functioning etc. In
my case I added a new 'tracerecord' DLT that had a metadata record for
each packet with the at-tx-complete stats.
Darren Reed
2008-12-27 14:55:11 UTC
Permalink
Post by Greg Troxel
No, I am really looking for packet data, but perhaps packet metadata a
la DLT_IEEE802_11_RADIO. As a first step, I want the data source
labeled, so I can know if I have queueing delays or not. The point of
tapping at tx complete is that you can find out how many retries were
used and what the actual rate for that packet was (sometimes an 802.11
interface will drop back to 1 Mb/s after failing to get an ack at the
chosen rate), or if no ack was received. I wanted this data to compute
the ETX metric
http://pdos.csail.mit.edu/papers/grid:mobicom03/paper.pdf
http://pdos.csail.mit.edu/decouto/abstract.html
which is otherwise a bit hard to get. I was relatively unconcerned
about processing overhead and how the software was functioning etc. In
my case I added a new 'tracerecord' DLT that had a metadata record for
each packet with the at-tx-complete stats.
That sounds like a very specialised piece of instrumentation,
which BPF (as a generic packet capture feature) isn't really
well suited to be (I think.)

*but* I do think dtrace would help here... for example, you can
measure the time between a packet entering ip_output and it being
free'd or entering the send routine for a NIC lower down.

I can't see anything in your description that would preclude a
tool like dtrace from being used...

Darren


Michael Richardson
2008-12-29 22:29:02 UTC
Permalink
Post by Greg Troxel
I wasn't proposing an interface to control tap location, just
flag bits to record what happened. I once added support to
ath(4) to do the tapping on tx complete so it could include more
status. But I see the point that one has to stop somewhere.
It would be very nice to be able to annotate where the tap was.
I've often shoved data into the bpf system from unusual places as a
debug feature....
Darren> I wonder if what you're looking for is something more akin
Darren> to what dtrace allows? bpf's purpose is primarily packet
Darren> data and not so much NIC status, etc. A dtrace-like solution
Darren> would let you tap packets as they move along, in and out of
Darren> packets and at specific "probe" points. I say that because
Darren> it sounds like you are just as (if not more) interested in
Darren> non-packet data and for that, I think BPF is not really the
Darren> right answer...

uhm. yeah, dtrace-ish, but really, that's too programmer oriented I
think, even if under the hood that's the right implementation.
--
] Y'avait une poule de jammé dans l'muffler!!!!!!!!! | firewalls [
] Michael Richardson, Sandelman Software Works, Ottawa, ON |net architect[
] ***@sandelman.ottawa.on.ca http://www.sandelman.ottawa.on.ca/ |device driver[
] panic("Just another Debian GNU/Linux using, kernel hacking, security guy"); [


Darren Reed
2008-12-30 00:04:33 UTC
Permalink
Whilst letting the bits and bytes of this change digest over the
last few days, one aspect of the current BPF design gave me cause
to pause: the existing BPF header record for /dev/bpf is 18 bytes
long, meaning that when combined with ethernet, IP packets start
on a 32bit aligned boundary. This is significant for platforms
such as SPARC, where IP addresses can then be read through
correctly aligned accesses.

With this in mind, I added bf_exthdrlen. As per the comments in
the attached diff, it gets bumped up by an amount to ensure that
the header following the link layer header is on a 32bit aligned
address. For things such as gre/tun/loopback, which are using
bpf_mtap_af(), there would be no padding (link layer record is a
32bit int) but for ethernet, 2 bytes of padding are added.

Additionally, I've made some changes to force the buffer size
set via BIOCSBLEN to always be a multiple of 64bits. This change
takes precedence over bpf_maxbufsize and BPF_MINBUFSIZE: if either
of those would force the set length to a value that is not a
multiple of 8, the rounding now overrides it.

The question I find myself asking is, are the changes to start
an IP (layer3) header on a 32bit aligned address warranted?
Or is that needless microbenchmarking?

Darren
