Discussion:
tlp(4) DMA synchronization
David Young
2009-08-27 21:52:03 UTC
I made a patch to complete/optimize DMA synchronization in tlp(4),
does it look correct? URL:

ftp://cuw.ojctech.com/users/netbsd-c325f0fd/tulip.dmasync

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 278-3933

Izumi Tsutsui
2009-08-27 23:21:38 UTC
Post by David Young
I made a patch to complete/optimize DMA synchronization in tlp(4),
ftp://cuw.ojctech.com/users/netbsd-c325f0fd/tulip.dmasync
+ TULIP_CDRXSYNC(sc, i,
+ BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
+ if (txstat & TDSTAT_OWN) {
+ TULIP_CDTXSYNC(sc, txs->txs_lastdesc, 1,
+ BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
BUS_DMASYNC_PREREAD alone is enough here, because no write (host -> device)
op will happen after these polls.

These ops are required on systems which don't handle BUS_DMA_COHERENT.
But strictly speaking, on such systems we'd have to use chained mode
(not ring mode) with proper padding between DMA descriptors,
i.e. one descriptor per cacheline, to avoid a host vs. DMA race.
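
A minimal sketch of the polling pattern in question, abbreviated from
tlp_rxintr() (the surrounding loop is omitted and details may differ
from the actual driver):

        /* See what the chip says about this descriptor. */
        TULIP_CDRXSYNC(sc, i,
            BUS_DMASYNC_POSTREAD|BUS_DMASYNC_POSTWRITE);
        rxstat = le32toh(sc->sc_rxdescs[i].td_status);
        if (rxstat & TDSTAT_OWN) {
                /*
                 * The chip still owns it.  The host wrote nothing,
                 * so a PREREAD-only sync before resuming the poll
                 * later is sufficient.
                 */
                TULIP_CDRXSYNC(sc, i, BUS_DMASYNC_PREREAD);
                break;
        }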
---
Izumi Tsutsui

David Young
2009-08-27 23:57:18 UTC
Post by Izumi Tsutsui
Post by David Young
I made a patch to complete/optimize DMA synchronization in tlp(4),
ftp://cuw.ojctech.com/users/netbsd-c325f0fd/tulip.dmasync
+ TULIP_CDRXSYNC(sc, i,
+ BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
+ if (txstat & TDSTAT_OWN) {
+ TULIP_CDTXSYNC(sc, txs->txs_lastdesc, 1,
+ BUS_DMASYNC_PREREAD|BUS_DMASYNC_PREWRITE);
BUS_DMASYNC_PREREAD alone is enough here, because no write (host -> device)
op will happen after these polls.
These ops are required on systems which don't handle BUS_DMA_COHERENT.
According to the documentation, we cannot count on BUS_DMA_COHERENT to
do anything, so the ops are always required. :-)
Post by Izumi Tsutsui
But strictly speaking, on such systems we'd have to use chained mode
(not ring mode) with proper padding between DMA descriptors,
i.e. one descriptor per cacheline, to avoid a host vs. DMA race.
You're right, of course. I have seen these races occur on architectures
that we aim to support, such as ARM.

I think that in principle, the host can use ring mode if it does not reuse
a descriptor until after the NIC has relinquished all of the other
descriptors in the same cacheline.

We may be able to use cacheline-aligned descriptors in ring mode if the
chip respects the Descriptor Skip Length (DSL) field of the Bus Mode
Register. According to the datasheet, the DSL field "Specifies the
number of longwords to skip between two unchained descriptors."
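
If it does, programming it might look roughly like this (a sketch only:
BUSMODE_DSL_SHIFT here assumes the DSL field occupies bits 6:2 of CSR0
as described in the 21143 datasheet, and MAX_CACHE_LINE_SIZE is just a
stand-in for whatever cacheline-sized descriptor stride we settle on):

/*
 * The hardware descriptor proper is 4 longwords (16 bytes); ask the
 * chip to skip the pad longwords that bring each ring entry up to a
 * full cacheline.  sc_busmode is the softc's cached copy of CSR_BUSMODE.
 */
#define BUSMODE_DSL_SHIFT       2       /* assumption, see above */

uint32_t skip = (MAX_CACHE_LINE_SIZE - 16) / sizeof(uint32_t);

sc->sc_busmode |= skip << BUSMODE_DSL_SHIFT;
TULIP_WRITE(sc, CSR_BUSMODE, sc->sc_busmode);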

What do you think?

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 278-3933

Izumi Tsutsui
2009-08-28 11:29:21 UTC
Post by David Young
Post by Izumi Tsutsui
These ops are required on systems which don't handle BUS_DMA_COHERENT.
According to the documentation, we cannot count on BUS_DMA_COHERENT to
do anything, so the ops are always required. :-)
Yes, we should always call sync ops after touching DMA descriptors.
But in fact only a few drivers do it properly, which means
most drivers rely on BUS_DMA_COHERENT (or cache-coherent hardware).

For example, wm(4) and re(4) won't work without BUS_DMA_COHERENT
on sgimips O2, even though the latter may have all necessary
bus_dmamap_sync(9) calls (I believe ;-).
Post by David Young
Post by Izumi Tsutsui
But strictly speaking, on such systems we'd have to use chained mode
(not ring mode) with proper padding between DMA descriptors,
i.e. one descriptor per cacheline, to avoid a host vs. DMA race.
You're right, of course. I have seen these races occur on architectures
that we aim to support, such as ARM.
Hmm, how hard is it to implement uncached mappings for BUS_DMA_COHERENT?
Post by David Young
I think that in principle, the host can use ring mode if it does not reuse
a descriptor until after the NIC has relinquished all of the other
descriptors in the same cacheline.
Consider the following scenario:

(1) rxdescs[0].td_status in rxintr is polled and cached
(2) the received packet for rxdescs[0] is handled
(3) rxdescs[0] data in cacheline is updated for the next RX op
in TULIP_INIT_RXDESC() and then the cacheline is marked dirty
(4) rxdescs[0] data in the cacheline is written back and invalidated
by bus_dmamap_sync(9) op at the end of TULIP_INIT_RXDESC()

If the cacheline size is larger than the descriptor size
(i.e. the same cacheline also holds rxdescs[1])
and rxdescs[1], the next descriptor, is updated
(to clear TDSTAT_OWN) by the device between (1) and (4),
the updated data will be lost by the writeback op at (4).
We can put a PREREAD sync op before (3), but the race could still
happen between (3) and (4) because of write-allocate at (3).

I think this could happen in normal operation, because
the next RX packet will often be received right after the
RX interrupt caused by the previous one.
Post by David Young
We may be able to use cacheline-aligned descriptors in ring mode if the
chip respects the Descriptor Skip Length (DSL) field of the Bus Mode
Register. According to the datasheet, the DSL field "Specifies the
number of longwords to skip between two unchained descriptors."
What do you think?
In the perfect world, it should work ;-)
(though I have not checked the datasheet yet)

In the real world, we need the following changes:

- prepare a new MI API which returns the maximum cache line size
for each architecture, at least on ports which have bus_dma(9)
- store the cache line size value in tlp_softc
- calculate the size of the whole TX/RX descriptor area (and memory for
setup packets) dynamically on attach
(we can't use a static sizeof(struct tulip_control_data))
- replace all macros which use offsetof(struct tulip_control_data, x)
to get the region of each descriptor, including all sync ops
- as a tlp(4)-specific problem, make sure all dumb clones support
chained mode or the DSL field properly

One example of such an implementation is sys/dev/ic/i82596.c for iee(4),
but I'm afraid it's much easier to implement machine-dependent
uncached-mapping support for BUS_DMA_COHERENT if the hardware supports it.
I guess that's the reason why many drivers still lack proper
bus_dmamap_sync(9) calls for DMA descriptors even nowadays.
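
On a port like mips that machine-dependent idea could be fairly small.
A very rough sketch (not the actual sys/arch code; the function name and
the fallback are made up for illustration):

/*
 * Hand back an uncached KSEG1 mapping when the caller asked for
 * BUS_DMA_COHERENT and the segment is a single physically contiguous
 * chunk.  Real bus_dmamem_map() implementations handle more cases.
 */
int
example_dmamem_map(bus_dma_tag_t t, bus_dma_segment_t *segs, int nsegs,
    size_t size, void **kvap, int flags)
{

        if (nsegs == 1 && (flags & BUS_DMA_COHERENT) != 0) {
                *kvap = (void *)MIPS_PHYS_TO_KSEG1(segs[0].ds_addr);
                return 0;
        }
        /* ... otherwise fall back to a normal cached mapping ... */
        return EINVAL;          /* placeholder in this sketch */
}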

(note that iee(4), which uses direct DMA with the complex sync ops, seems
slower on hp700 than the old ie(4), which uses a fixed DMA buffer and copies)
---
Izumi Tsutsui

David Young
2009-08-28 17:34:45 UTC
Post by Izumi Tsutsui
Post by David Young
Post by Izumi Tsutsui
These ops are required on systems which don't handle BUS_DMA_COHERENT.
According to the documentation, we cannot count on BUS_DMA_COHERENT to
do anything, so the ops are always required. :-)
Yes, we should always call sync ops after touching DMA descriptors.
But in fact only a few drivers do it properly, which means
most drivers rely on BUS_DMA_COHERENT (or cache-coherent hardware).
The drivers rely on BUS_DMA_COHERENT, and BUS_DMA_COHERENT cannot be
relied on. It is a sad state of affairs. :-/
Post by Izumi Tsutsui
Post by David Young
Post by Izumi Tsutsui
But strictly speaking, on such systems we'd have to use chained mode
(not ring mode) with proper padding between DMA descriptors,
i.e. one descriptor per cacheline, to avoid a host vs. DMA race.
You're right, of course. I have seen these races occur on architectures
that we aim to support, such as ARM.
Hmm, how hard is it to implement uncached mappings for BUS_DMA_COHERENT?
I don't know. You mentioned wm(4) and re(4). Do all of the ports where
those drivers will break without BUS_DMA_COHERENT provide the uncached
mappings?
Post by Izumi Tsutsui
Post by David Young
I think that in principle, the host can use ring mode if it does not reuse
a descriptor until after the NIC has relinquished all of the other
descriptors in the same cacheline.
(1) rxdescs[0].td_status in rxintr is polled and cached
(2) the received packet for rxdescs[0] is handled
(3) rxdescs[0] data in cacheline is updated for the next RX op
in TULIP_INIT_RXDESC() and then the cacheline is marked dirty
(4) rxdescs[0] data in the cacheline is written back and invalidated
by bus_dmamap_sync(9) op at the end of TULIP_INIT_RXDESC()
If the cacheline size is larger than the descriptor size
(i.e. the same cacheline also holds rxdescs[1])
and rxdescs[1], the next descriptor, is updated
(to clear TDSTAT_OWN) by the device between (1) and (4),
the updated data will be lost by the writeback op at (4).
We can put a PREREAD sync op before (3), but the race could still
happen between (3) and (4) because of write-allocate at (3).
That is just the scenario that I had in mind. I think that we can use
ring mode and avoid that scenario, if we postpone step (3) until the NIC
is finished with the rest of the Rx descriptors in the same cacheline,
rxdescs[1] through rxdescs[descs_per_cacheline - 1].
Post by Izumi Tsutsui
I think this could happen in normal operation, because
the next RX packet will often be received right after the
RX interrupt caused by the previous one.
Oh, I know from experience that it does! :-)
Post by Izumi Tsutsui
Post by David Young
We may be able to use cacheline-aligned descriptors in ring mode if the
chip respects the Descriptor Skip Length (DSL) field of the Bus Mode
Register. According to the datasheet, the DSL field "Specifies the
number of longwords to skip between two unchained descriptors."
What do you think?
In the perfect world, it should work ;-)
(though I have not checked the datasheet yet)
- prepare a new MI API which returns maximum cache line size
for each architecture, at least on ports which have bus_dma(9)
I think that the *maximum* cacheline size could be a compile-time
MI constant. Then we can avoid a lot of complicated code by using either
something like this,

struct tulip_desc {
        /* ... */
} __packed __aligned(MAX_CACHE_LINE_SIZE);

or something like this,

struct proto_tulip_desc {
        /* ... descriptor fields ... */
        uint8_t td_pad;
};

struct tulip_desc {
        /* ... descriptor fields ... */
        uint8_t td_pad[MAX_CACHE_LINE_SIZE -
            offsetof(struct proto_tulip_desc, td_pad)];
} __packed __aligned(4);
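
(With either layout, a compile-time check could catch a descriptor that
no longer fills a whole line; this assumes a CTASSERT()-style compile-time
assertion is usable where the structure is defined:)

/* Fail the build if a descriptor stops filling a whole cacheline. */
CTASSERT(sizeof(struct tulip_desc) == MAX_CACHE_LINE_SIZE);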

Either way we do it, I think it avoids the complexity of the
following; what do you think?
Post by Izumi Tsutsui
- store the cache line size value in tlp_softc
- calculate the size of the whole TX/RX descriptor area (and memory for
setup packets) dynamically on attach
(we can't use a static sizeof(struct tulip_control_data))
- replace all macros which use offsetof(struct tulip_control_data, x)
to get the region of each descriptor, including all sync ops
- as a tlp(4)-specific problem, make sure all dumb clones support
chained mode or the DSL field properly
(note that iee(4), which uses direct DMA with the complex sync ops, seems
slower on hp700 than the old ie(4), which uses a fixed DMA buffer and copies)
Just wondering aloud: will performance improve on all architectures if
we avoid such cacheline interference as leads to the dangerous race
condition on the non-coherent architectures? For example, will i386
benefit? IIUC, an i386 CPU watches the bus for bus-master access to
memory regions covered by active cache lines, and it writes back a dirty
line or discards a clean line ahead of bus-master access to it. If the
CPU writes to lines where the bus-master still owns some descriptors,
then there will be more write-backs than if the driver is programmed
never to write those lines. Those write-backs come with a cost in
memory bandwidth; if the bus-master has to retry its access, there may
be additional latency, too.

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 278-3933

Izumi Tsutsui
2009-08-28 18:53:54 UTC
Post by David Young
Post by Izumi Tsutsui
Post by David Young
According to the documentation, we cannot count on BUS_DMA_COHERENT to
do anything, so the ops are always required. :-)
Yes, we should always call sync ops after touching DMA descriptors.
But in fact only a few drivers do it properly, which means
most drivers rely on BUS_DMA_COHERENT (or cache-coherent hardware).
The drivers rely on BUS_DMA_COHERENT, and BUS_DMA_COHERENT cannot be
relied on. It is a sad state of affairs. :-/
Yes. The problem is a tradeoff among hardware cost,
software complexity, and total performance, but
it looks like non-coherent DMA systems often have trouble,
for example:
http://mail-index.NetBSD.org/port-sgimips/2000/06/29/0006.html
Post by David Young
Post by Izumi Tsutsui
Hmm, how hard is it to implement uncached mappings for BUS_DMA_COHERENT?
I don't know. You mentioned wm(4) and re(4). Do all of the ports where
those drivers will break without BUS_DMA_COHERENT provide the uncached
mappings?
On ports whose cache systems don't handle DMA (by bus snooping etc.), yes.
At least they had trouble on O2:
http://mail-index.NetBSD.org/port-sgimips/2008/01/20/msg000022.html
http://mail-index.NetBSD.org/source-changes/2006/10/20/msg176308.html
(though the problem is the cacheline size vs. descriptor size issue
mentioned below, not the drivers themselves)
Post by David Young
Post by Izumi Tsutsui
Post by David Young
I think that in principle, the host can use ring mode if it does not reuse
a descriptor until after the NIC has relinquished all of the other
descriptors in the same cacheline.
(1) rxdescs[0].td_status in rxintr is polled and cached
(2) the received packet for rxdescs[0] is handled
(3) rxdescs[0] data in cacheline is updated for the next RX op
in TULIP_INIT_RXDESC() and then the cacheline is marked dirty
(4) rxdescs[0] data in the cacheline is written back and invalidated
by bus_dmamap_sync(9) op at the end of TULIP_INIT_RXDESC()
If the cacheline size is larger than the descriptor size
(i.e. the same cacheline also holds rxdescs[1])
and rxdescs[1], the next descriptor, is updated
(to clear TDSTAT_OWN) by the device between (1) and (4),
the updated data will be lost by the writeback op at (4).
We can put a PREREAD sync op before (3), but the race could still
happen between (3) and (4) because of write-allocate at (3).
That is just the scenario that I had in mind. I think that we can use
ring mode and avoid that scenario, if we postpone step (3) until the NIC
is finished with the rest of the Rx descriptors in the same cacheline,
rxdescs[1] through rxdescs[descs_per_cacheline - 1].
Hmm, it might work on RX, which uses one descriptor per packet.
On the other hand, TX packets might use multiple descs to handle
fragmentation (which would not be a multiple of descs_per_cacheline),
so I'm not sure if we can handle it properly.
Post by David Young
Post by Izumi Tsutsui
- prepare a new MI API which returns maximum cache line size
for each architecture, at least on ports which have bus_dma(9)
I think that the *maximum* cacheline size could be a compile-time
MI constant. Then we can avoid a lot of complicated code by using either
something like this,
struct tulip_desc {
        /* ... */
} __packed __aligned(MAX_CACHE_LINE_SIZE);
or something like this,
struct proto_tulip_desc {
        /* ... descriptor fields ... */
        uint8_t td_pad;
};
struct tulip_desc {
        /* ... descriptor fields ... */
        uint8_t td_pad[MAX_CACHE_LINE_SIZE -
            offsetof(struct proto_tulip_desc, td_pad)];
} __packed __aligned(4);
Either way we do it, I think it avoids the complexity of the
following; what do you think?
Hmm, I put similar code in sys/arch/cobalt/stand/boot/tlp.c,
but for MI drivers there is one concern: how large the possible
MAX_CACHE_LINE_SIZE is.

On mips, the cacheline size can be 128 bytes, while most systems
use 32 bytes. Wasting 128 bytes of DMA-safe memory per 16-byte descriptor
might be problematic on some ports, because such memory can be a
limited resource and bus_dmamem_alloc(9) might fail for too-large
segments, especially when attaching devices on a running system, which
could have less physically contiguous memory than at boot time.

ex(4) uses more DMA memory for descriptors even without alignment
(IIRC it's >64KB), but it already has a problem on hotswap:
http://www.NeTBSD.org/cgi-bin/query-pr-single.pl?number=10734

In the tlp(4) case, NTXDESC is 1024 (== 64 * 16) and NRXDESC is 64,
so using 128 bytes per descriptor consumes >128KB of DMA-safe memory.
(we could use non-contiguous pages in chained mode, though)
Post by David Young
Post by Izumi Tsutsui
(note that iee(4), which uses direct DMA with the complex sync ops, seems
slower on hp700 than the old ie(4), which uses a fixed DMA buffer and copies)
Just wondering aloud: will performance improve on all architectures if
we avoid such cacheline interference as leads to the dangerous race
condition on the non-coherent architectures? For example, will i386
benefit? IIUC, an i386 CPU watches the bus for bus-master access to
memory regions covered by active cache lines, and it writes back a dirty
line or discards a clean line ahead of bus-master access to it. If the
CPU writes to lines where the bus-master still owns some descriptors,
then there will be more write-backs than if the driver is programmed
never to write those lines. Those write-backs come with a cost in
memory bandwidth; if the bus-master has to retry its access, there may
be additional latency, too.
Well, I don't have evidence of which operations
(cache flush ops, descriptor ops by software, bus arbitration by hardware, etc.)
could be the bottleneck. (modern hardware is fast and optimized enough)

Nowadays most hardware designers consider only x86 systems,
which don't have any DMA coherency issue (it should be handled
by the CPU or chipset), so I guess few people actually think about
performance around bus-snoop or arbitration ops vs. DMA descriptor
alignment. Anyway, we need proper benchmarks for each implementation.

---
Izumi Tsutsui

Thor Lancelot Simon
2009-08-28 19:28:36 UTC
Post by Izumi Tsutsui
On mips, the cacheline size can be 128 bytes, while most systems
use 32 bytes. Wasting 128 bytes of DMA-safe memory per 16-byte descriptor
might be problematic on some ports, because such memory can be a
limited resource and bus_dmamem_alloc(9) might fail for too-large
segments, especially when attaching devices on a running system, which
could have less physically contiguous memory than at boot time.
The effective cache line size on some i386 and amd64 systems can be
128 bytes, because the cache lines are 64 bytes but adjacent-cache-line
prefetch means fetching one fetches two. I believe the snoop
logic still uses a 64-byte granularity, though.
--
Thor Lancelot Simon ***@rek.tjls.com
"Even experienced UNIX users occasionally enter rm *.* at the UNIX
prompt only to realize too late that they have removed the wrong
segment of the directory structure." - Microsoft WSS whitepaper

David Laight
2009-08-30 16:35:01 UTC
Post by David Young
Post by Izumi Tsutsui
(1) rxdescs[0].td_status in rxintr is polled and cached
(2) the received packet for rxdescs[0] is handled
(3) rxdescs[0] data in cacheline is updated for the next RX op
in TULIP_INIT_RXDESC() and then the cacheline is marked dirty
(4) rxdescs[0] data in the cacheline is written back and invalidated
by bus_dmamap_sync(9) op at the end of TULIP_INIT_RXDESC()
If the cacheline size is larger than the descriptor size
(i.e. the same cacheline also holds rxdescs[1])
and rxdescs[1], the next descriptor, is updated
(to clear TDSTAT_OWN) by the device between (1) and (4),
the updated data will be lost by the writeback op at (4).
We can put a PREREAD sync op before (3), but the race could still
happen between (3) and (4) because of write-allocate at (3).
That is just the scenario that I had in mind. I think that we can use
ring mode and avoid that scenario, if we postpone step (3) until the NIC
is finished with the rest of the Rx descriptors in the same cacheline,
rxdescs[1] through rxdescs[descs_per_cacheline - 1].
For RX you also need to put new buffers into the cache-line sized
block of descriptors in reverse order, making them available all together
even if an intervening cache flush happens.
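
Something like this, perhaps (a sketch only; descs_per_cacheline and the
batching are assumptions layered on the driver's existing macros):

/*
 * Re-arm RX descriptors only once the chip has given back the whole
 * cacheline's worth, and set TDSTAT_OWN in reverse order so that the
 * first descriptor of the line (the one the chip will look at next)
 * is the last one to become visible.
 */
int j, first = rounddown(i, descs_per_cacheline);

for (j = first + descs_per_cacheline - 1; j >= first; j--)
        TULIP_INIT_RXDESC(sc, j);       /* sets TDSTAT_OWN and syncs */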

TX seems rather harder, since the chip will be writing TX status while
the driver is filling in later entries in the same cache line.
Ignoring TX status for all but the last active descriptor might work.

Linked mode for TX can put multiple descriptors per cache line - provided
they are for the same frame.

Presumably uncached memory could be used for the rings?
(Or double-map, cached for reads and uncached for writes!)

David
--
David Laight: ***@l8s.co.uk

Jason Thorpe
2009-09-30 19:17:17 UTC
Post by David Young
We may be able to use cacheline-aligned descriptors in ring mode if the
chip respects the Descriptor Skip Length (DSL) field of the Bus Mode
Register. According to the datasheet, the DSL field "Specifies the
number of longwords to skip between two unchained descriptors."
What do you think?
Sorry this is late...

Not all of the clones support DSL, and not all of the clones support
chain mode.

-- thorpej

