Post by David Young
Post by Izumi Tsutsui
Post by David Young
According to the documentation, we cannot count on BUS_DMA_COHERENT to
do anything, so the ops are always required. :-)
Yes, we should always call sync ops after touching DMA descriptors.
But in fact only a few drivers do it properly, which means
most drivers rely on BUS_DMA_COHERENT (or cache-coherent hardware).
The drivers rely on BUS_DMA_COHERENT, and BUS_DMA_COHERENT cannot be
relied on. It is a sad state of affairs. :-/
Yes. The problem is a tradeoff among hardware cost,
software complexity, and total performance, but
it looks like non-coherent DMA systems often have trouble.
For example:
http://mail-index.NetBSD.org/port-sgimips/2000/06/29/0006.html
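For reference, the discipline in question looks roughly like this
(a minimal sketch; sc_dmat, sc_cddmamap, and the descriptor layout
are assumed names in common NetBSD driver style, not the exact
tlp(4) code):

	/* sketch: in rxintr, for descriptor i (assumed names) */

	/* before reading a descriptor the device may have written */
	bus_dmamap_sync(sc->sc_dmat, sc->sc_cddmamap,
	    i * sizeof(struct tulip_desc), sizeof(struct tulip_desc),
	    BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
	if (le32toh(sc->sc_rxdescs[i].td_status) & TDSTAT_OWN)
		return;		/* the device still owns it */

	/* ... handle the received packet ... */

	/* after rewriting the descriptor, hand it back to the device */
	sc->sc_rxdescs[i].td_status = htole32(TDSTAT_OWN);
	bus_dmamap_sync(sc->sc_dmat, sc->sc_cddmamap,
	    i * sizeof(struct tulip_desc), sizeof(struct tulip_desc),
	    BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);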
Post by David Young
Post by Izumi Tsutsui
Hmm, how hard is it to implement uncached mappings for BUS_DMA_COHERENT?
I don't know. You mentioned wm(4) and re(4). Do all of the ports where
those drivers will break without BUS_DMA_COHERENT provide the uncached
mappings?
On ports whose cache systems don't handle DMA (by bus snooping etc.), yes.
At least those drivers had trouble on the O2:
http://mail-index.NetBSD.org/port-sgimips/2008/01/20/msg000022.html
http://mail-index.NetBSD.org/source-changes/2006/10/20/msg176308.html
(though the problem there is the cacheline size vs. descriptor size
issue mentioned below, not the drivers themselves)
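FWIW, the uncached mapping itself can be conceptually cheap on mips:
for a physically contiguous segment, bus_dmamem_map() can hand back
a KSEG1 (uncached) address for the same memory. A simplified sketch
(the function name is hypothetical, and real code must handle
multiple segments and fall back to a normal cached mapping):

	/* needs <sys/bus.h>; MIPS_PHYS_TO_KSEG1 is in <mips/cpuregs.h> */
	int
	example_dmamem_map(bus_dma_tag_t t, bus_dma_segment_t *segs,
	    int nsegs, size_t size, void **kvap, int flags)
	{

		if (nsegs == 1 && (flags & BUS_DMA_COHERENT)) {
			/* uncached window onto the same physical pages */
			*kvap = (void *)MIPS_PHYS_TO_KSEG1(segs[0].ds_addr);
			return 0;
		}
		/* otherwise map it cached through the kernel VM ... */
		return EINVAL;	/* placeholder in this sketch */
	}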
Post by David Young
Post by Izumi Tsutsui
Post by David Young
I think that in principle, the host can use ring mode if it does not reuse
a descriptor until after the NIC has relinquished every other descriptor
in the same cacheline.
(1) rxdescs[0].td_status is polled (and cached) in rxintr
(2) the received packet for rxdescs[0] is handled
(3) rxdescs[0] in the cacheline is updated for the next RX op
in TULIP_INIT_RXDESC(), and the cacheline is marked dirty
(4) the cacheline holding rxdescs[0] is written back and invalidated
by the bus_dmamap_sync(9) op at the end of TULIP_INIT_RXDESC()
If the cacheline size is larger than sizeof(rxdescs[0])
(i.e. the same cacheline also holds rxdescs[1])
and rxdescs[1], the next descriptor, is being updated by the device
(to clear TDSTAT_OWN) between (1) and (4),
the device's update will be lost to the writeback at (4).
We can put a PREREAD sync op before (3), but the race can still
happen between (3) and (4) because of write allocation at (3).
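In code, the window looks like this (a sketch with the same assumed
names as above, a 32-byte cacheline, and 16-byte descriptors):

	status = le32toh(sc->sc_rxdescs[0].td_status);	/* (1) line cached */
	/* (2) ... handle the packet for rxdescs[0] ... */

	/* meanwhile the device clears TDSTAT_OWN in rxdescs[1], but
	   only in RAM; the CPU's cached copy of the line is stale */

	sc->sc_rxdescs[0].td_status = htole32(TDSTAT_OWN); /* (3) dirty */
	bus_dmamap_sync(sc->sc_dmat, sc->sc_cddmamap,
	    0, sizeof(struct tulip_desc),
	    BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
	/* (4) the writeback pushes the whole line, clobbering the
	   device's update to rxdescs[1] with the stale cached copy */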
That is just the scenario that I had in mind. I think that we can use
ring mode and avoid that scenario, if we postpone step (3) until the NIC
is finished with the rest of the Rx descriptors in the same cacheline,
rxdescs[1] through rxdescs[descs_per_cacheline - 1].
Hmm, it might work for RX, which uses one descriptor per packet.
On the other hand, TX packets might use multiple descriptors to handle
fragmentation (and the count would not be a multiple of
descs_per_cacheline), so I'm not sure we can handle that properly.
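For RX, the proposal could look something like this (a sketch;
DESCS_PER_LINE and the loop structure are assumptions, not existing
tlp(4) code):

	#define DESCS_PER_LINE \
	    (MAX_CACHE_LINE_SIZE / sizeof(struct tulip_desc))

	int i, j;

	for (i = 0; i < NRXDESC; i++) {
		/* POSTREAD|POSTWRITE sync here, then check ownership */
		if (le32toh(sc->sc_rxdescs[i].td_status) & TDSTAT_OWN)
			break;		/* the NIC still owns this one */
		/* ... pass the packet up ... */
		if ((i % DESCS_PER_LINE) == DESCS_PER_LINE - 1) {
			/* the NIC has handed back every descriptor in
			   this line; only now is it safe to rewrite it */
			for (j = i - (DESCS_PER_LINE - 1); j <= i; j++)
				TULIP_INIT_RXDESC(sc, j);
		}
	}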
Post by David Young
Post by Izumi Tsutsui
- prepare a new MI API which returns maximum cache line size
for each architecture, at least on ports which have bus_dma(9)
I think that the *maximum* cacheline size could be a compile-time
MI constant. Then we can avoid a lot of complicated code by using either
something like this,
struct tulip_desc {
	/* ... */
} __packed __aligned(MAX_CACHE_LINE_SIZE);
or something like this,
struct proto_tulip_desc {
	/* ... descriptor fields ... */
	uint8_t td_pad;
};
struct tulip_desc {
	/* ... descriptor fields ... */
	uint8_t td_pad[MAX_CACHE_LINE_SIZE -
	    offsetof(struct proto_tulip_desc, td_pad)];
} __packed __aligned(4);
Either way we do it, I think that it avoids the complexity of the
following; what do you think?
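Either variant could be backed by a compile-time check, e.g. with the
old negative-array-size trick (a sketch):

	/* fail the build unless the padded descriptor exactly fills
	   one maximum-size cacheline */
	typedef char tulip_desc_size_check[
	    (sizeof(struct tulip_desc) == MAX_CACHE_LINE_SIZE) ? 1 : -1];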
Hmm, I put similar code in sys/arch/cobalt/stand/boot/tlp.c,
but for MI drivers there is one concern: how large the possible
MAX_CACHE_LINE_SIZE is.
On mips, the cacheline size can be 128 bytes, while most systems
use 32 bytes. Wasting 128 bytes of DMA-safe memory per 16-byte
descriptor might be problematic on some ports, because such memory
can be a limited resource and bus_dmamem_alloc(9) might fail for
too-large segments, especially when attaching devices on running
systems, which may have less physically contiguous memory than at
boot time.
ex(4) uses even more DMA memory for descriptors without any alignment
padding (IIRC it's >64KB), and it already has a problem on hotswap:
http://www.NetBSD.org/cgi-bin/query-pr-single.pl?number=10734
In the tlp(4) case, NTXDESC is 1024 (== 64 * 16) and NRXDESC is 64,
so using 128 bytes per descriptor consumes (1024 + 64) * 128 bytes,
i.e. more than 128KB of DMA-safe memory.
(we could use non-contiguous pages in chained mode, though)
Post by David Young
Post by Izumi Tsutsui
(note that iee(4), which uses direct DMA with the complex sync ops,
seems slower on hp700 than the old ie(4), which uses a fixed DMA
buffer and copies)
Just wondering aloud: will performance improve on all architectures if
we avoid the kind of cacheline interference that leads to the dangerous
race condition on the non-coherent architectures? For example, will i386
benefit? IIUC, an i386 CPU watches the bus for bus-master access to
memory regions covered by active cache lines, and it writes back a dirty
line or discards a clean line ahead of bus-master access to it. If the
CPU writes to lines where the bus-master still owns some descriptors,
then there will be more write-backs than if the driver is programmed
never to write those lines. Those write-backs come with a cost in
memory bandwidth; if the bus-master has to retry its access, there may
be additional latency, too.
Well, I don't have evidence about which operations
(cache flush ops, descriptor handling in software, bus arbitration
in hardware, etc.) could be the bottleneck. (modern hardware is fast
and optimized enough)
Nowadays most hardware designers consider only x86 systems,
which don't have any DMA coherency issues (those are handled by the
CPU or chipset), so I guess few people actually think about the
performance of bus-snoop or arbitration ops vs. DMA descriptor
alignment. Anyway, we need proper benchmarks for each implementation.
---
Izumi Tsutsui