hardware timestamping of packets

On 13/06/2012 8:44 AM, Vlad Balan wrote:
...

Post by Vlad Balan
First of all, most hardware cards that claim to do timestamping can
only timestamp a limited number of packets and use filters in order to
determine which packets to stamp.

...

Post by Vlad Balan
When the filter is disabled and all packets are
timestamped, the registers are no longer locked and the timestamp
always corresponds to the latest received packet. Therefore if you
have multiple packets in one interrupt, you only get the hardware
timestamp of the last one. The only way to go around this (at least
for the Intel card) is to set a filter that makes most packet arrivals
create an interrupt, but then capturing at line speed would probably
not be feasible.

What is the resolution of the time stamp?
Microseconds? Nanoseconds? Picoseconds?
Or is it in some other measure, such as ticks?

Post by Vlad Balan
Another issue has to do with the interface when timestamping TX
packets. Timestamps are returned over the socket as ancillary data,
which means that they come with the associated packet. Of course, on
the transmit side there is no incoming packet, so what Linux does is
to copy the outgoing packet into a third queue associated with each
socket (the error queue, this seems Linux specific) and attach the
timestamp as ancillary data to that. This extension is not impossible
to add to NetBSD, however I wanted to check with you before adding a
third queue to the sockets. Linux uses a specific flag in recvmsg to
indicate that it wants a packet from the error queue.

Why wouldn't it be appropriate to use SCM_TIMESTAMP or
something similar to that?

Post by Vlad Balan
The third issue has to do with transmitting timestamps associated with
packets from the driver level to the socket level, where they can be
transformed into ancillary information to be returned to the users.

I would recommend investigating use of m_tag.

Darren

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Vlad Balan

2012-06-13 03:06:36 UTC

Hi Darren,

Post by Vlad Balan
First of all, most hardware cards that claim to do timestamping can
only timestamp a limited number of packets and use filters in order to
determine which packets to stamp.

...

What is the resolution of the time stamp?
Microseconds? Nanoseconds? Picoseconds?
Or is it in some other measure, such as ticks?

The Intel card that I looked at returns values in 16ns increments. I
think the natural resolution would be nanoseconds.

Why wouldn't it be appropriate to use SCM_TIMESTAMP or
something similar to that?

That works on the receive path, however on the send path we still want
in some scenarios to know, on the sender machine, when the packet was
sent. sendmsg does not return information, so we must call recvmsg to
get the ancillary data. The data is obtained by calling recvmsg on the
error queue, where the copy of the sent packet, together with the
ancillary data that tells us when the packet was sent, is waiting to
be read.

I would recommend investigating use of m_tag.

This is a good idea. That's what I get for using Stevens as my only
reference :) I should look at all the newer additions to the kernel,
but this definitely seems like a good way to go.

Post by Darren Reed
Darren

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Darren Reed

2012-06-13 16:12:06 UTC

Post by Vlad Balan
Hi Darren,

Post by Vlad Balan
First of all, most hardware cards that claim to do timestamping can
only timestamp a limited number of packets and use filters in order to
determine which packets to stamp.

...

What is the resolution of the time stamp?
Microseconds? Nanoseconds? Picoseconds?
Or is it in some other measure, such as ticks?

The Intel card that I looked at returns values in 16ns increments.
I think the natural resolution would be nanoseconds.

At 16ns, that allows for a packet rate pf 6 million pps?

More pertinent is that this allows for us to "invent" a timestamp.
So if there are 16 packets delivered every 16ns, then we can add
1ns to the timestamp of each packet after the first to represent
the time that they arrived.

Post by Vlad Balan

Why wouldn't it be appropriate to use SCM_TIMESTAMP or
something similar to that?

And how do you tie something received with recvmsg() with something sent
via sendmsg()?

Darren

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Matthew Mondor

2012-06-13 19:42:40 UTC

On Thu, 14 Jun 2012 02:12:06 +1000

Post by Vlad Balan
That works on the receive path, however on the send path we still want
in some scenarios to know, on the sender machine, when the packet was
sent. sendmsg does not return information, so we must call recvmsg to
get the ancillary data. The data is obtained by calling recvmsg on the
error queue, where the copy of the sent packet, together with the
ancillary data that tells us when the packet was sent, is waiting to
be read.

And how do you tie something received with recvmsg() with something sent
via sendmsg()?

I was wondering the same: should one then recvmsg(2) from the error
queue every packet after sendmsg(2) to ensure synchronization? What
happens if the application doesn't, will those accumulate in the "error
queue", or is there room for a single packet there? Another common
idiom would be getsockopt(2) to obtain status, but this has the same
difficulties...

It also makes the protocol synchronous, with two syscalls required per
sent packet. I wonder if there's precedent on some other OSs for a
sendmsg(2) variant which can accept a struct msghdr *? If so, would
this be realistic using our current network+device stack to have the
status filled before the syscall returns?

--
Matt

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Matthew Mondor

2012-06-16 00:52:32 UTC

Post by Dennis Ferguson
I think they return the entire transmitted packet to the error queue, so
the contents of the packet are available to use to figure out which
transmitted packet the timestamp corresponds to.
I think the queue can be maintained like any datagram socket receive
buffer; if you don't read it regularly it may fill to its limit, after
which subsequent packets will be dropped off the end of the queue.
I think this works.
In contrast BSD kernels have managed to get by without this for other
purposes so this would be a mechanism that would be added to (all?)
sockets on the off chance that the application is going to want to
timestamp outbound packets, something which I suspect very few
processes are going to be interested in.
Unless there is some sense that the "error queue" thing is going to be
useful for a wider variety of applications than just this. If
timestamping packets is the only application for this it might be better
to design a mechanism which only needs to exist when the application
declares that it is going to be timestamping packets (perhaps using a
setsockopt()/getsockopt() protocol).
Therefore if you have multiple packets in one interrupt, you only get
the hardware timestamp of the last one. The only way to go around this
(at least for the Intel card) is to set a filter that makes most packet
arrivals create an interrupt, but then capturing at line speed would
probably not be feasible.

Thanks for the details,

It seems that indeed the simplest is to receive whole timestamped
packets back like on Linux (and probably good for an initial
implementation)... I looked at the Linux documentation for MSG_ERRQUEUE,
and have no opinion on if we'd want it in the future. Are kqueue(2)'s
EV_ERROR a possible reason we don't need it?

But if we don't want to support it fully, it probably makes sense to
have such queue only be created on request, as Dennis suggested.
Either a setsockopt(2) IO_SNDTS to enable bidirectional mode,
or a setsockopt(2) to create a second socket and getsockopt(2) to
obtain its FD?

I guess that even if we wanted to support the general MSG_ERRQUEUE it'd
be possible to enable it on a per-socket basis using a setsockopt(2)
like SO_ERRQUEUE?

Post by Dennis Ferguson
If some sort of "transaction ID" were included with the original
sendmsg() then it wouldn't be required to return the entire
transmitted packet to the application, just the timestamp and the
"transaction ID" would do it.

Post by Matthew Mondor
It also makes the protocol synchronous, with two syscalls required per
sent packet. I wonder if there's precedent on some other OSs for a
sendmsg(2) variant which can accept a struct msghdr *? If so, would
this be realistic using our current network+device stack to have the
status filled before the syscall returns?

I guess the difficulty with this is that the underlying transmit timestamp
mechanism is actually asynchronous with respect to the sendmsg() syscall.
That is, the syscall to send a datagram is generally finished when the packet
is queued at the tail of the transmit queue of the output interface but the
timestamp is unavailable from the hardware until after the packet is output
to the wire. There is a variable, and possibly quite large, time between
these two events, so returning the result to a single system call is going
to require a sleep to wait for completion where no sleep is done now.

Other potential ideas, which probably don't matter unless we need to
eventually obtain the timestamps of a steady fast flow of outgoing
packets, if a card supported that and the existing interface turned
out to be suboptimal:

If returning a timestamp for an outgoing packet before the syscall
returns is problematic, if we wanted to avoid whole packets, indeed a
transaction-id could be either user-provided via an ancillary message
(possibly problematic), or assigned by the kernel to packets via a
sendmsg(2)-variant, in which case a single read(2) could potentially
contain multiple id/timestamp entries...

A new kqueue(2) filter (say EVFILT_SENDMSGTS), but here again
transaction-IDs would have to fit in "ident", be process-wide-unique,
probably assigned via a sendmsg(2)-variant (say sendmsgts(2), and not
need EV_ADD), and timestamps would have to fit within the 64-bit "data"
field...

Some async signal with siginfo (I'm actually unsure of the efficiency
of this, and it has its own issues). Perhaps by extension notifying
that an mmapped buffer (mapped read-only in the process) of
id/timestamp pairs is full, like is done for some DMA-capable devices
supporting mmap(2)... but I think I'm getting in awful territory :)

--
Matt

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Darren Reed

2012-06-14 03:38:16 UTC

Post by Matthew Mondor
On Thu, 14 Jun 2012 02:12:06 +1000

And how do you tie something received with recvmsg() with something sent
via sendmsg()?

I was wondering the same: should one then recvmsg(2) from the error
queue every packet after sendmsg(2) to ensure synchronization? What
happens if the application doesn't, will those accumulate in the "error
queue", or is there room for a single packet there? Another common
idiom would be getsockopt(2) to obtain status, but this has the same
difficulties...
It also makes the protocol synchronous, with two syscalls required per
sent packet. I wonder if there's precedent on some other OSs for a
sendmsg(2) variant which can accept a struct msghdr *? If so, would
this be realistic using our current network+device stack to have the
status filled before the syscall returns?

That I don't know - it may be a good question to go back to
Intel with as they may have some exposure to what folks are
doing software wise in order to take advantage of their hardware.

Darren

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Dennis Ferguson

2012-06-15 17:40:18 UTC

Post by Matthew Mondor
On Thu, 14 Jun 2012 02:12:06 +1000

And how do you tie something received with recvmsg() with something sent
via sendmsg()?

I think they return the entire transmitted packet to the error queue, so
the contents of the packet are available to use to figure out which transmitted
packet the timestamp corresponds to. If the application is one of the more
likely users (i.e. NTP or PTP) this is straight forward to do since there will
be (a) unique-to-the-packet timestamp(s) in there which the application will be
preserving anyway. If the error queue packets also preserve the sendmsg()
order then drops can be detected by observing when packets which are expected to
be in the error queue are missing in the received sequence. I think the
queue can be maintained like any datagram socket receive buffer; if you don't
read it regularly it may fill to its limit, after which subsequent packets will
be dropped off the end of the queue. I think this works.

The reason I'm not fond of this, as I understand it, is that for Linux that
error queue was preexisting mechanism which was used for other things, so using
it for this as well was "free" in that it leveraged existing, common mechanism.
In contrast BSD kernels have managed to get by without this for other purposes
so this would be a mechanism that would be added to (all?) sockets on the off
chance that the application is going to want to timestamp outbound packets,
something which I suspect very few processes are going to be interested in.
Unless there is some sense that the "error queue" thing is going to be useful
for a wider variety of applications than just this. If timestamping packets
is the only application for this it might be better to design a mechanism which
only needs to exist when the application declares that it is going to be timestamping
packets (perhaps using a setsockopt()/getsockopt() protocol). If some sort of
"transaction ID" were included with the original sendmsg() then it wouldn't be
required to return the entire transmitted packet to the application, just the
timestamp and the "transaction ID" would do it.

Dennis Ferguson

2012-06-20 20:00:44 UTC

Post by Vlad Balan

Post by Vlad Balan
First of all, most hardware cards that claim to do timestamping can
only timestamp a limited number of packets and use filters in order to
determine which packets to stamp.

...

What is the resolution of the time stamp?
Microseconds? Nanoseconds? Picoseconds?
Or is it in some other measure, such as ticks?

The Intel card that I looked at returns values in 16ns increments. I
think the natural resolution would be nanoseconds

Vlad,

Have you managed to verify that 16 ns number with the actual hardware?

The reason I'm asking is that the only reference I can find to 16 ns in
the data sheet you linked to is in section 7.10.3.1, in the sentence that
starts "For example if the cycle time is 16 ns and the incperiod is one...".
That this isn't saying the actual hardware cycle time is 16 ns is suggested
by the fact that the 82599 controller data sheet, here

http://www.intel.com/content/dam/doc/datasheet/82599-10-gbe-controller-datasheet.pdf

duplicates that entire section (it is section 7.9.3.1 instead) with exactly
the same 16 ns "For example" sentence, but follows it with a table showing
what are apparently the actual hardware cycle times: 6.4 ns when the link is
10 Gbps, 64 ns when the link is 1 Gbps and 640 ns when the link is 100 Mbps.
None of these match the 16 ns example, and the fact that the tick rate follows
the link speed makes it seem (reading between the lines) like the clock may not
increment at all when there is no ethernet link.

Unreliable, variable speed hardware counters don't make good time-of-day clocks,
so I'm interested in how you plan to use this and to present the results to
consumers. What will the timestamp value you return to the application look
like, and how will the application interpret what that means?

Dennis Ferguson
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Vlad Balan

2012-06-20 21:47:03 UTC

This is a good question. I was planning to follow Linux's lead, where
the hardware counter and the system counter can be both reported, upon
request. For each cycle counter in the system Linux maintains a rough
translate to system time, approximating the skew and the offset based
on a few measurements and comparisons between the cycle counter and
system clock. This kind of timestamp translation is not hard to
implement.

I will start by providing support for the hardware timestamp and then
take care of translating the cycle counts to system time as well.

The rest of the discussion identified the problems present in
providing a socket interface for timestamping, on the transmission
path. I will try, for starters, to provide an interface that is
compatible with the Linux one in order to avoid too much
differentiation to userland applications. I agree that the current
form of the interface satisfies the needs of only a handful of
applications. I will try to maintain the implementation of
transmit-side timestamps clean in order to make it possible to
implement additional mechanisms for passing them on to userland.

I agree that providing hardware timestamps to bpf is a good idea. I
will try to see how to best implement it, given that bpf takes its
timestamps through a slightly different mechanism, close to the packet
arrival time. Here we might have to decide upon an interface again, to
specify the source of timestamps to be returned.

Finally, it appears that none of the cards currently available is
capable of obtaining hardware timestamps for all packets and returning
them in the packet descriptors. Line capture cards built on FPGAs have
this capability, however my understanding is that they use different
interfaces to the host (possibly userspace), making them unsuitable
for testing the hardware timestamping kernel feature.

Regards,
Vlad

On Wed, Jun 20, 2012 at 1:00 PM, Dennis Ferguson

Post by Dennis Ferguson

Post by Vlad Balan

Post by Vlad Balan
First of all, most hardware cards that claim to do timestamping can
only timestamp a limited number of packets and use filters in order to
determine which packets to stamp.

...

What is the resolution of the time stamp?
Microseconds? Nanoseconds? Picoseconds?
Or is it in some other measure, such as ticks?

The Intel card that I looked at returns values in 16ns increments. I
think the natural resolution would be nanoseconds

Vlad,
Have you managed to verify that 16 ns number with the actual hardware?
The reason I'm asking is that the only reference I can find to 16 ns in
the data sheet you linked to is in section 7.10.3.1, in the sentence that
starts "For example if the cycle time is 16 ns and the incperiod is one...".
That this isn't saying the actual hardware cycle time is 16 ns is suggested
by the fact that the 82599 controller data sheet, here
http://www.intel.com/content/dam/doc/datasheet/82599-10-gbe-controller-datasheet.pdf
duplicates that entire section (it is section 7.9.3.1 instead) with exactly
the same 16 ns "For example" sentence, but follows it with a table showing
what are apparently the actual hardware cycle times: 6.4 ns when the link is
10 Gbps, 64 ns when the link is 1 Gbps and 640 ns when the link is 100 Mbps.
None of these match the 16 ns example, and the fact that the tick rate follows
the link speed makes it seem (reading between the lines) like the clock may not
increment at all when there is no ethernet link.
Unreliable, variable speed hardware counters don't make good time-of-day clocks,
so I'm interested in how you plan to use this and to present the results to
consumers. What will the timestamp value you return to the application look
like, and how will the application interpret what that means?
Dennis Ferguson

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

David Young

2012-06-21 18:01:04 UTC

Post by Vlad Balan
Finally, it appears that none of the cards currently available is
capable of obtaining hardware timestamps for all packets and returning
them in the packet descriptors. Line capture cards built on FPGAs have
this capability, however my understanding is that they use different
interfaces to the host (possibly userspace), making them unsuitable
for testing the hardware timestamping kernel feature.

There are several WLAN adapters that provide a microsecond timestamp,
corresponding to the 802.11 Time Synchronization Function (TSF), in
the descriptors. Sometimes the timestamp in the descriptor is 64 bits
wide, but sometimes the timestamp in the descriptor is less than 16
bits wide. Thus you may have 65 milliseconds or less to convert the
descriptor's timestamp to an unambiguous timestamp for your application.
That may involve reading a couple of TSF registers, timebase conversion
(TSF -> uptime), etc.

I found that I could rely on the Rx timestamps on an Atheros WLAN
adapter to help me resolve the distance from one Atheros WLAN adapter to
another to within 10 or 20 feet by sending a carefully-crafted train of
packets from one adapter to the other, recording and averaging the time
interval between the train of link-layer acknowledgements. I reckon
that I couldn't have done that if it wasn't a quality timestamp.

I think that converting TSF timestamps to civil time may be a bit tricky
in the details.

It has always seemed to me that the Linux API for packet timestamps
is inadequate. I'm concerned that following the Linux API will
divert your thought and development from the most productive path. I
suggest creating an original userland API for packet timestamping that
incorporates the advice that you receive here. Emulating the Linux API
should not be difficult.

Dave

--
David Young
***@pobox.com Urbana, IL (217) 721-9981

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Dennis Ferguson

2012-06-22 19:36:41 UTC

Post by Vlad Balan
This is a good question. I was planning to follow Linux's lead, where
the hardware counter and the system counter can be both reported, upon
request. For each cycle counter in the system Linux maintains a rough
translate to system time, approximating the skew and the offset based
on a few measurements and comparisons between the cycle counter and
system clock. This kind of timestamp translation is not hard to
implement.
I will start by providing support for the hardware timestamp and then
take care of translating the cycle counts to system time as well.
The rest of the discussion identified the problems present in
providing a socket interface for timestamping, on the transmission
path. I will try, for starters, to provide an interface that is
compatible with the Linux one in order to avoid too much
differentiation to userland applications. I agree that the current
form of the interface satisfies the needs of only a handful of
applications. I will try to maintain the implementation of
transmit-side timestamps clean in order to make it possible to
implement additional mechanisms for passing them on to userland.
I agree that providing hardware timestamps to bpf is a good idea. I
will try to see how to best implement it, given that bpf takes its
timestamps through a slightly different mechanism, close to the packet
arrival time. Here we might have to decide upon an interface again, to
specify the source of timestamps to be returned.
Finally, it appears that none of the cards currently available is
capable of obtaining hardware timestamps for all packets and returning
them in the packet descriptors. Line capture cards built on FPGAs have
this capability, however my understanding is that they use different
interfaces to the host (possibly userspace), making them unsuitable
for testing the hardware timestamping kernel feature.

After thinking about it for a bit, I think I'd have the following
concerns about and preferences for this.

The reason why I'm interested in the applications which will use
this facility is that different time consumers have different
requirements. I agree that hardware tick counters need to be
translated into a regular, common time format for presentation to
consumers, but how you do this and where the data best comes from
depends a lot on what the application is using this for. Roughly
maintaining the hardware clock in synchronization with the system
clock (and hardwiring that relationship) might be appropriate if
what you want from the hardware timestamps is a rough approximation
to system time, but if what you instead want to do with this is an
IEEE 1588 implementation operating at the precision of the hardware
it seems like both the "rough translation" thing and the synchronization
direction end up being exactly wrong. I think for a good PTP
implementation you want the application which is sending and receiving
PTP packets and having them timestamped with ethernet clock timestamps
is to precisely adjust the ethernet clock into synchronization with
the PTP master (this can be done very precisely since the data you are
getting, referenced to the ethernet clock, is very good) and then to
synchronize the system clock as closely as possible to the now-synchronized
ethernet clock. This suggests to me that this might be more useful
if what was provided was a precise user-space clock adjustment
interface that could be used to discipline any or all clocks in the
system and allow the application using that API to determine what
is the best way to apply it.

The other thing I'd like is for system clock timestamps to be acquired
as close to the packet arrival time as possible (i.e. early in
the receive interrupt). This would make bpf's normal source of
receive timestamps the same as for regular packets delivered
to sockets; if the timestamp(s) was(/were) attached to the packet's
mbuf by the time bpf looked at it, bpf could just copy the timestamp
out of there. Being able to timestamp received packets with a system
timestamp very soon after they've arrived will provide moderately
accurate time samples for interfaces with no hardware timestamping
capability, having every packet carry a system clock arrival timestamp
allows an interesting and useful alternative way to benchmark the
kernel's network stack (measuring how long it takes to get individual
packets from where they entered the kernel to where they end up),
and any timestamps unused for other purposes might (or might not)
provide useful entropy for the random number pool. Implementing the
transmit timestamp interface in terms of system clock timestamps would
complete this so that you could always count on the timestamp mechanism
working to the level of precision available even without hardware
support.

Finally, I have a rudimentary design for a clock adjustment interface
which deals with multiple clocks in a system using common adjustment
code, along with addressing some issues to make it easier to to write
accurate time synchronization applications. If I haven't made a mistake
it should be available here:

http://www.mistimed.com/Clock.pdf

This was designed without your application in mind, and its current
drawback for this particular application is that I never considered
trying to use this to maintain an "unreliable" clock whose underlying
hardware rate might change during operation (which is why I was
concerned about this). It instead thinks the only thing you'll
want to do is avoid using such clocks altogether, but in this case I
can see how trying to deal with this might be useful and with a little
thinking it might be possible to enhance the interface to help track
the state of such clocks. Beyond this, however, it does address the
issues of how to maintain clock conversions accurately and how to poll
one clock against another so that one can be synchronized to the other,
so it might provide a more generically useful framework for your work
than just hardcoding sampling and conversion stuff in the kernel.

Dennis Ferguson

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Alexander Nasonov

2012-06-15 17:38:11 UTC