Discussion:
Network driver receive path
Maen Suleiman
2007-03-14 17:24:43 UTC
Permalink
Hi,

I am trying to tune our giga driver performance. I have noticed that
the system spends 57% of its time in interrupts when we run a
receive-oriented test, while it spends only 20% when we run a
send-oriented test.

From the profiler results, we understood that the main reason for
spending this time in the RX interrupt was MGETHDR, MCLGET and
bus_dmamap_load, and mainly the bus_dmamap_load function.

The problem is that we couldn't find an alternative to allocating
mbufs and calling bus_dmamap_load in the RX interrupt.
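For reference, the refill path in question usually looks something like
this in a NetBSD driver's RX interrupt handler. This is a minimal sketch
only; the softc layout, descriptor names and error handling are
hypothetical:

```c
/*
 * Sketch of the per-packet RX refill path being profiled.
 * gige_softc/gige_rxdesc and field names are hypothetical.
 */
static int
gige_rx_refill(struct gige_softc *sc, int idx)
{
	struct gige_rxdesc *rxd = &sc->sc_rxdescs[idx];
	struct mbuf *m;
	int error;

	MGETHDR(m, M_DONTWAIT, MT_DATA);	/* allocate mbuf header */
	if (m == NULL)
		return ENOBUFS;
	MCLGET(m, M_DONTWAIT);			/* attach a cluster */
	if ((m->m_flags & M_EXT) == 0) {
		m_freem(m);
		return ENOBUFS;
	}
	m->m_len = m->m_pkthdr.len = MCLBYTES;

	/* The expensive step: a KVA-to-physical translation per packet. */
	error = bus_dmamap_load_mbuf(sc->sc_dmat, rxd->rxd_dmamap, m,
	    BUS_DMA_READ | BUS_DMA_NOWAIT);
	if (error) {
		m_freem(m);
		return error;
	}
	bus_dmamap_sync(sc->sc_dmat, rxd->rxd_dmamap, 0,
	    rxd->rxd_dmamap->dm_mapsize, BUS_DMASYNC_PREREAD);
	rxd->rxd_mbuf = m;
	/* hand the physical address to the descriptor ring */
	rxd->rxd_paddr = rxd->rxd_dmamap->dm_segs[0].ds_addr;
	return 0;
}
```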

Will using a task to do the mbuf handling help ?

Is there a way to tell the TCP stack to give me back the mbuf that was
delivered to it, so that I can re-use the same mbufs without calling
bus_dmamap_load?

Is there a way to allocate a fixed physical memory block for the RX
DMA, and then use this block for the mbufs that will be delivered
to the stack? In this case I would need to know when the TCP stack has
finished handling the mbuf, so that I can re-use the same physical
memory.

Thanks in advance

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Joerg Sonnenberger
2007-03-14 17:40:56 UTC
Permalink
Post by Maen Suleiman
I am trying to tune our giga driver performance. I have noticed that
the system spends 57% of its time in interrupts when we run a
receive-oriented test, while it spends only 20% when we run a
send-oriented test.
First of all, which hardware platform is this on?

Joerg

Maen Suleiman
2007-03-14 17:46:42 UTC
Permalink
It is based on the NetBSD ARM port, and the giga port is part of the
ARM SoC, not a PCI device. Our hardware is not part of official NetBSD
yet. I hope this answers the question.
Thanks
Post by Joerg Sonnenberger
Post by Maen Suleiman
I am trying to tune our giga driver performance. I have noticed that
the system spends 57% of its time in interrupts when we run a
receive-oriented test, while it spends only 20% when we run a
send-oriented test.
First of all, which hardware platform is this on?
Joerg
Joerg Sonnenberger
2007-03-14 17:56:28 UTC
Permalink
Post by Maen Suleiman
It is based on the NetBSD ARM port, and the giga port is part of the
ARM SoC, not a PCI device. Our hardware is not part of official NetBSD
yet. I hope this answers the question.
Well, the reason for the question is whether it is an architecture where
direct-mapped physical memory is available or not.

The other question is whether your device has interrupt mitigation
support and whether it is used, as interrupts in general are expensive.

Joerg

Maen Suleiman
2007-03-14 18:02:56 UTC
Permalink
Yes, we have hardware support for interrupt mitigation and we are using
it. I think the problem is the amount of time we spend in the interrupt
handler, not the number of interrupts per second (we get almost the same
interrupts/second in the send and receive cases).
Thanks in advance
Post by Joerg Sonnenberger
Post by Maen Suleiman
It is based on the NetBSD ARM port, and the giga port is part of the
ARM SoC, not a PCI device. Our hardware is not part of official NetBSD
yet. I hope this answers the question.
Well, the reason for the question is whether it is an architecture where
direct-mapped physical memory is available or not.
The other question is whether your device has interrupt mitigation
support and whether it is used, as interrupts in general are expensive.
Joerg
j***@dsg.stanford.edu
2007-03-14 18:28:45 UTC
Permalink
Hi,
I am trying to tune our giga driver performance,
Is a "giga" a gigabit Ethernet interface?
I have noticed that
the system spends 57% of its time in interrupts when we run a
receive-oriented test, while it spends only 20% when we run a
send-oriented test.
From the profiler results, we understood that the main reason for
spending this time in the RX interrupt was MGETHDR, MCLGET and
bus_dmamap_load, and mainly the bus_dmamap_load function.
Are your tests sustaining the same (or closely comparable) throughput?
If so, then your driver is DMA-mapping roughly the same amount of data
for both transmit and receive. Again, if so, that'd tend to suggest
the problem is interrupt rate on receive side, rather than transmit
side. The fix for *that* is to use interrupt mitigation, if you can.

On the other hand, if you are confident in your profile data pointing
to bus_dmamap_load, perhaps the DMA map for receive data really is
significantly more expensive (per packet) than for TX data. At a
wild guess, perhaps RX incurs more work than TX (e.g., forcing
lines of cached data from the CPU cache out into main memory?)
The problem is that we couldn't find an alternative to allocating
mbufs and calling bus_dmamap_load in the RX interrupt.
Will using a task to do the mbuf handling help ?
Nope, not at this time. And in general, probably not for any
single-CPU system: you're doing the same work, plus adding some
context-switch overhead.

[... reordered...]
Is there a way to allocate a fixed physical memory block for the RX
DMA, and then use this block for the mbufs that will be delivered
to the stack? In this case I would need to know when the TCP stack has
finished handling the mbuf, so that I can re-use the same physical
memory.
Not really, not in any MI way in NetBSD. bus_dma(9) does include a
"BUS_DMA_COHERENT" mapping, but it's documented as being a "hint" to
(machine-dependent) implementations of bus_dma(9); portable NetBSD
drivers still have to issue appropriate bus_dmamap_sync() calls.
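For illustration, an allocation path that requests the hint could look
roughly like this (sc, RXBUF_SIZE and map are placeholders; even when
the platform honours the hint, portable code keeps the
bus_dmamap_sync(9) calls):

```c
bus_dma_segment_t seg;
int rseg, error;
void *kva;

/* Allocate DMA-safe memory and map it, asking for a coherent mapping. */
error = bus_dmamem_alloc(sc->sc_dmat, RXBUF_SIZE, PAGE_SIZE, 0,
    &seg, 1, &rseg, BUS_DMA_NOWAIT);
if (error == 0)
	error = bus_dmamem_map(sc->sc_dmat, &seg, rseg, RXBUF_SIZE,
	    &kva, BUS_DMA_NOWAIT | BUS_DMA_COHERENT);	/* hint only */

/* Still bracket DMA with syncs; e.g. before the CPU reads freshly
 * DMA'd RX data ("map" is the corresponding loaded bus_dmamap_t): */
bus_dmamap_sync(sc->sc_dmat, map, 0, RXBUF_SIZE, BUS_DMASYNC_POSTREAD);
```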
Is there a way to tell the TCP stack to give me back the mbuf that was
delivered to it, so that I can re-use the same mbufs without calling
bus_dmamap_load?
Not for mbufs, not really.

For mbuf *clusters* you could implement a driver-private mbuf cluster
pool, backed by normal DMA mechanisms. You _could_ then attempt some
machine-dependent violations of the machine-independent API, based on
your own knowledge of your CPU and private memory pool; but such a
driver wouldn't work on other ports of NetBSD to other CPU architectures
(e.g., those which have IOMMUs and therefore rely on drivers following
the documented bus_dma(9) API for correct operation).
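A sketch of what such a driver-private cluster pool could look like,
using the MEXTADD external-storage mechanism from mbuf(9). All gige_*
names and the pool functions are hypothetical, and the exact ext-free
callback signature should be checked against the NetBSD version in use:

```c
/* Hypothetical driver-private pool of pre-DMA-mapped RX buffers. */
static void
gige_buf_free(struct mbuf *m, void *buf, size_t size, void *arg)
{
	struct gige_bufpool *pool = arg;

	/* Return the buffer to the pool. Its DMA map stays loaded,
	 * so the next RX refill avoids bus_dmamap_load(). */
	gige_pool_put(pool, buf);
}

static struct mbuf *
gige_rx_mbuf(struct gige_bufpool *pool)
{
	struct mbuf *m;
	void *buf;

	MGETHDR(m, M_DONTWAIT, MT_DATA);
	if (m == NULL)
		return NULL;
	if ((buf = gige_pool_get(pool)) == NULL) {
		m_freem(m);
		return NULL;
	}
	/* Attach the driver-owned buffer as external storage; the
	 * stack calls gige_buf_free() when it is done with it. */
	MEXTADD(m, buf, GIGE_BUFSZ, M_DEVBUF, gige_buf_free, pool);
	m->m_len = m->m_pkthdr.len = GIGE_BUFSZ;
	return m;
}
```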


If possible, a better approach might be to extend the bus_dma(9)
implementation and mbuf-cluster information, to attempt to cache and
reuse more information, to avoid (for example) repeated
KVA-to-physical mappings if you reuse the same physical
addresses. That's likely to be a big undertaking, and I'd suggest some
close discussion with Jason Thorpe before going down that route.

But my guess is, you really need to find, and discuss options with,
someone who understands both the bus_dma(9) backend for your CPU
(ARM?) and your non-PCI "giga" device.

Maen Suleiman
2007-03-15 10:48:19 UTC
Permalink
Post by j***@dsg.stanford.edu
Hi,
I am trying to tune our giga driver performance,
Is a "giga" a gigabit Ethernet interface?
Yes
Post by j***@dsg.stanford.edu
I have noticed that
the system spends 57% of its time in interrupts when we run a
receive-oriented test, while it spends only 20% when we run a
send-oriented test.
From the profiler results, we understood that the main reason for
spending this time in the RX interrupt was MGETHDR, MCLGET and
bus_dmamap_load, and mainly the bus_dmamap_load function.
Are your tests sustaining the same (or closely comparable) throughput?
If so, then your driver is DMA-mapping roughly the same amount of data
for both transmit and receive. Again, if so, that'd tend to suggest
the problem is interrupt rate on receive side, rather than transmit
side. The fix for *that* is to use interrupt mitigation, if you can.
I get about 150% better throughput in TX than in RX.
Post by j***@dsg.stanford.edu
On the other hand, if you are confident in your profile data pointing
to bus_dmamap_load, perhaps the DMA map for receive data really is
significantly more expensive (per packet) than for TX data. At a
wild guess, perhaps RX incurs more work than TX (e.g., forcing
lines of cached data from the CPU cache out into main memory?)
Usually TX involves a cache flush while RX involves invalidation, and
an invalidate should be less expensive than a flush.
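In bus_dma(9) terms, that asymmetry corresponds to the sync operations
the driver issues around each transfer (generic sketch; txmap, rxmap
and len are placeholders):

```c
/* TX: the CPU wrote the packet data; PREWRITE pushes (flushes) any
 * dirty cache lines out to memory before the device reads them. */
bus_dmamap_sync(sc->sc_dmat, txmap, 0, len, BUS_DMASYNC_PREWRITE);

/* RX: the device writes into the buffer; PREREAD before posting it
 * to the ring, then POSTREAD (typically a cache invalidate) before
 * the CPU reads the received data. */
bus_dmamap_sync(sc->sc_dmat, rxmap, 0, len, BUS_DMASYNC_PREREAD);
/* ... DMA completes ... */
bus_dmamap_sync(sc->sc_dmat, rxmap, 0, len, BUS_DMASYNC_POSTREAD);
```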
Post by j***@dsg.stanford.edu
The problem is that we couldn't find an alternative to allocating
mbufs and calling bus_dmamap_load in the RX interrupt.
Will using a task to do the mbuf handling help ?
Nope, not at this time. And in general, probably not for any
single-CPU system: you're doing the same work, plus adding some
context-switch overhead.
[... reordered...]
Thanks
Post by j***@dsg.stanford.edu
Is there a way to allocate a fixed physical memory block for the RX
DMA, and then use this block for the mbufs that will be delivered
to the stack? In this case I would need to know when the TCP stack has
finished handling the mbuf, so that I can re-use the same physical
memory.
Not really, not in any MI way in NetBSD. bus_dma(9) does include a
"BUS_DMA_COHERENT" mapping, but it's documented as being a "hint" to
(machine-dependent) implementations of bus_dma(9); portable NetBSD
drivers still have to issue appropriate bus_dmamap_sync() calls.
Thanks
Post by j***@dsg.stanford.edu
Is there a way to tell the TCP stack to give me back the mbuf that was
delivered to it, so that I can re-use the same mbufs without calling
bus_dmamap_load?
Not for mbufs, not really.
For mbuf *clusters* you could implement a driver-private mbuf cluster
pool, backed by normal DMA mechanisms. You _could_ then attempt some
machine-dependent violations of the machine-independent API, based on
your own knowledge of your CPU and private memory pool; but such a
driver wouldn't work on other ports of NetBSD to other CPU architectures
(e.g., those which have IOMMUs and therefore rely on drivers following
the documented bus_dma(9) API for correct operation).
If possible, a better approach might be to extend the bus_dma(9)
implementation and mbuf-cluster information, to attempt to cache and
reuse more information, to avoid (for example) repeated
KVA-to-physical mappings if you reuse the same physical
addresses. That's likely to be a big undertaking, and I'd suggest some
close discussion with Jason Thorpe before going down that route.
But my guess is, you really need to find, and discuss options with,
someone who understands both the bus_dma(9) backend for your CPU
(ARM?) and your non-PCI "giga" device.