Discussion:
NetBSD 5.1 TCP performance issue (lots of ACK)
Manuel Bouyer
2011-10-17 16:03:18 UTC
Permalink
Hello,
I've been playing with glusterfs a bit today, and found some performance
differences between NetBSD and linux, which I tracked down to our TCP
stack. Basically, between a NetBSD/linux pair, performance is much
better than between 2 NetBSD hosts. It doesn't matter if linux is client
or server, so this points to some issue outside of glusterfs.

So I have done some packet capture and found a strange TCP behavior between
2 NetBSD hosts.

The setup: 192.168.1.2 is a NetBSD 5.1 glusterfs server, 192.168.1.1 is a
NetBSD 5.1 glusterfs client and 192.168.1.3 is a linux (RHEL6) glusterfs
client. The clients read a 640MB file from the server (with
dd if=file of=/dev/null bs=64k). All 3 hosts are strictly identical
hardware (same CPUs, ram, motherboard, gigabit network adapter and hard disk).
The linux client can read at 95MB/s from the NetBSD server, the NetBSD
client only 50MB/s. (But in other tests, a NetBSD client can read at 90MB/s
out of a linux server, so neither the NetBSD server nor the NetBSD client
is the bottleneck in a NetBSD/NetBSD setup).

Attached are the tcptrace outputs for both clients. The problem is, the
NetBSD client is sending 242873 packets to the server for the file read,
while the linux client sends only 34581 packets, so I suspect there's
something wrong with our TCP ACK code (it looks like we ACK way too often).

The relevant part of sysctl.conf of the client is:
net.inet.tcp.sendbuf_auto=0
net.inet.tcp.recvbuf_auto=0
kern.sbmax=4194304
net.inet.tcp.sendbuf_max=1048576
net.inet.tcp.recvbuf_max=1048576
net.inet.tcp.sendspace=524288
net.inet.tcp.recvspace=524288
net.inet.tcp.abc.enable=0
net.inet.ip.ifq.maxlen=512

Any idea of a sysctl setting (or something else) that could help?

I actually suspect it's also the reason why I see a lower data rate
over my ADSL connection when downloading from ftp.fr.netbsd.org than
when talking to some linux server ...
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Greg Troxel
2011-10-17 17:25:28 UTC
Permalink
To diagnose this, I recommend making a tcpdump of the traffic with

tcpdump -w TRACE tcp

and then using tcpdump2plot in graphics/xplot-devel.

I am not clear if graphics/xplot-devel (or /xplot) has caught up to
minor changes in NetBSD tcp

These plots are not 100% obvious how to read, but I'm happy to look at
them if you can provide the above trace file. Counts are useful, but
this lets you see the fine-grained behavior of what got sent when.

From the stats, it looks like there is loss in the data stream causing
SACK. But I don't see how there would be 237850 pure acks for ~98K data
packets; there should be more like 49K.
Manuel Bouyer
2011-10-17 17:52:03 UTC
Permalink
Post by Greg Troxel
To diagnose this, I recommend making a tcpdump of the traffic with
tcpdump -w TRACE tcp
and then using tcpdump2plot in graphics/xplot-devel.
I am not clear if graphics/xplot-devel (or /xplot) has caught up to
minor changes in NetBSD tcp
I looked at this quickly but I didn't understand how to use tcpdump2plot
(or maybe tcpdump2plot is too old for our tcpdump).
Post by Greg Troxel
These plots are not 100% obvious how to read, but I'm happy to look at
them if you can provide the above trace file. Counts are useful, but
this lets you see the fine-grained behavior of what got sent when.
ftp://ftp-asim.lip6.fr/outgoing/xen1.3.pcap
Post by Greg Troxel
From the stats, it looks like there is loss in the data stream causing
SACK.
I've noticed this. I don't know where the data loss occurs.
No errors on interfaces, and no drop in ipintr or interface queue.
Post by Greg Troxel
But I don't see how there would be 237850 pure acks for ~98K data
packets; there should be more like 49K.
That's my understanding too. There are way too many ACKs. The traces for both
the linux and NetBSD clients were collected on the NetBSD server, so I don't
think it can be bogus duplicate packets at the capture level.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Havard Eidnes
2011-10-24 09:44:10 UTC
Permalink
Post by Greg Troxel
To diagnose this, I recommend making a tcpdump of the traffic with
tcpdump -w TRACE tcp
and then using tcpdump2plot in graphics/xplot-devel.
I am not clear if graphics/xplot-devel (or /xplot) has caught up to
minor changes in NetBSD tcp
I'm more familiar with the graphs which tcptrace can produce. It can
make time/sequence plots (ref. the -S option) which are well
understood by graphics/xplot, at least last I looked.
Post by Greg Troxel
From the stats, it looks like there is loss in the data stream causing
SACK. But I don't see how there would be 237850 pure acks for ~98K data
packets; there should be more like 49K.
I agree, the traditional behaviour is that NetBSD TCP acks every
second TCP segment.
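
For reference, a minimal sketch of that every-other-segment rule (invented
names, not the actual tcp_input() code):

/*
 * Minimal sketch of the "ACK every second segment" rule: the first
 * in-order segment only sets a delayed-ACK flag; the second one forces
 * an immediate ACK.  Purely illustrative.
 */
struct tcb {
	int delack_pending;
};

static void ack_now(struct tcb *tp) { tp->delack_pending = 0; /* emit a pure ACK */ }

static void
on_in_order_data(struct tcb *tp)
{
	if (tp->delack_pending)
		ack_now(tp);		/* second segment: ACK immediately */
	else
		tp->delack_pending = 1;	/* first segment: wait for the delack timer */
}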

There doesn't appear to be all that much retransmission, though; only
54 data packets have been retransmitted in the "connection 5" example,
although the retransmission rate *is* non-zero, so some packets may
indeed go missing in the network somewhere, which is never a good
recipe for high TCP performance.

Generally, the congestion control algorithm used by default by Linux
these days is a bit more aggressive when ramping up the congestion
window ("how much data can we send before having to wait for an ack")
after an idle period or after a packet loss event; it *may* be that
the effect of this is what you're seeing.
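
For a concrete picture of "ramping up", here is a minimal Reno-style sketch
(invented names, not the actual NetBSD or Linux code); Linux's default these
days (CUBIC) opens the window noticeably faster after a loss or an idle
period than this:

/*
 * Minimal Reno-style congestion window growth, per arriving ACK.
 * Purely illustrative; variable names are invented.
 */
static void
cwnd_on_ack(unsigned *cwnd, unsigned ssthresh, unsigned mss)
{
	if (*cwnd < ssthresh)
		*cwnd += mss;			/* slow start: cwnd roughly doubles per RTT */
	else
		*cwnd += mss * mss / *cwnd;	/* congestion avoidance: about one MSS per RTT */
}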

When you say that "NetBSD with Linux performs well", which end in this
setup is the sender of the majority of data? If it's the Linux end,
that could be a further hint that it's the TCP congestion control
algorithm variants which play a role.

Regards,

- Håvard

Manuel Bouyer
2011-10-24 10:10:57 UTC
Permalink
Post by Havard Eidnes
Post by Greg Troxel
To diagnose this, I recommend making a tcpdump of the traffic with
tcpdump -w TRACE tcp
and then using tcpdump2plot in graphics/xplot-devel.
I am not clear if graphics/xplot-devel (or /xplot) has caught up to
minor changes in NetBSD tcp
I'm more familiar with the graphs which tcptrace can produce. It can
make time/sequence plots (ref. the -S option) which are well
understood by graphics/xplot, at least last I looked.
Post by Greg Troxel
From the stats, it looks like there is loss in the data stream causing
SACK. But I don't see how there would be 237850 pure acks for ~98K data
packets; there should be more like 49K.
I agree, the traditional behaviour is that NetBSD TCP acks every
second TCP segment.
There doesn't appear to be all that much retransmission, though; only
54 data packets have been retransmitted in the "connection 5" example,
although the retransmission rate *is* non-zero, so some packets may
indeed go missing in the network somewhere, which is never a good
recipe for high TCP performance.
I also have traces where there are no retransmissions at all, but this
doesn't change the data rate much.
Post by Havard Eidnes
Generally, the congestion control algorithm used by default by Linux
these days is a bit more aggressive when ramping up the congestion
window ("how much data can we send before having to wait for an ack")
after an idle period or after a packet loss event; it *may* be that
the effect of this is what you're seeing.
When you say that "NetBSD with Linux performs well", which end in this
setup is the sender of the majority of data? If it's the Linux end,
that could be a further hint that it's the TCP congestion control
algorithm variants which play a role.
It doesn't matter which end linux is. A NetBSD client performs
better against a Linux server than against a NetBSD server, and a linux
client performs better than a NetBSD client against the same NetBSD server.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
David Laight
2011-10-24 17:26:19 UTC
Permalink
Post by Havard Eidnes
Post by Greg Troxel
From the stats, it looks like there is loss in the data stream causing
SACK. But I don't see how there would be 237850 pure acks for ~98K data
packets; there should be more like 49K.
I agree, the traditional behaviour is that NetBSD TCP acks every
second TCP segment.
I've seen Linux stacks defer sending an ACK until the next? kernel
clock tick. This will reduce the ACK count somewhat.
In my case it caused problems with 'slow start' at the other end
(which was also Linux).
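
For illustration, the tick-driven variant looks roughly like this (a generic
sketch with invented names, not the Linux code):

/*
 * Generic sketch of deferring ACKs to a periodic timer: any connection
 * still holding a delayed ACK when the tick fires sends it then, so at
 * most one pure ACK per connection per tick.
 */
struct conn {
	int delack_pending;
};

static void send_pure_ack(struct conn *c) { c->delack_pending = 0; /* emit the ACK */ }

static void
delack_tick(struct conn *conns, int n)
{
	for (int i = 0; i < n; i++)
		if (conns[i].delack_pending)
			send_pure_ack(&conns[i]);
}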

David
--
David Laight: ***@l8s.co.uk

Manuel Bouyer
2011-10-26 20:55:14 UTC
Permalink
Post by Greg Troxel
Looking at the trace you provided, I am mostly seeing correct
every-other ack behavior. I continue to wonder if the bad pcap trace is
masking something else. Try setting net.bpf.maxbufsize larger, but I am
still not used to seeing 0-len captures even if packets are dropped.
In counting packets, I concur that something seems wrong. But I am
unable to find much fine-grained oddness.
Big buffers should not be an issue.
But it looks like there is one, as I get twice the speed between NetBSD
and linux as between 2 NetBSD hosts, on strictly identical hardware.

I reran some tests. I collected pcap traces on both client and
server, with
xen1:/domains#tcpdump -n -p -i wm0 -w netbsd-client.pcap host xen2-priv
tcpdump: listening on wm0, link-type EN10MB (Ethernet), capture size 96 bytes
^C
698848 packets captured
701273 packets received by filter
2342 packets dropped by kernel

and
xen2:/domains#tcpdump -n -p -i wm0 -w netbsd-server.pcap host xen1-priv
tcpdump: listening on wm0, link-type EN10MB (Ethernet), capture size 96 bytes
^C
565269 packets captured
565345 packets received by filter
0 packets dropped by kernel

(net.bpf.maxbufsize was set to 4194304).

I ran this after a fresh reboot of both client and server, and
netstat -s shows:
on client:
tcp:
241942 packets sent
5227 data packets (857181 bytes)
0 data packets (0 bytes) retransmitted
228294 ack-only packets (229090 delayed)
0 URG only packets
0 window probe packets
8415 window update packets
6 control packets
0 send attempts resulted in self-quench
459790 packets received
2818 acks (for 857137 bytes)
0 duplicate acks
0 acks for unsent data
456840 packets (655320993 bytes) received in-sequence
6 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
259 out-of-order packets (370132 bytes)
0 packets (0 bytes) of data after window
0 window probes
3 window update packets
0 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
3 connection requests
1 connection accept
4 connections established (including accepts)
15 connections closed (including 0 drops)
0 embryonic connections dropped
0 delayed frees of tcpcb
2821 segments updated rtt (of 1189 attempts)
0 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts (resulting in 0 dropped connections)
0 keepalive timeouts
0 keepalive probes sent
0 connections dropped by keepalive
98 correct ACK header predictions
455263 correct data packet header predictions
166 PCB hash misses
82 dropped due to no socket
0 connections drained due to memory shortage
0 PMTUD blackholes detected
0 bad connection attempts
1 SYN cache entries added
0 hash collisions
1 completed
0 aborted (no space to build PCB)
0 timed out
0 dropped due to overflow
0 dropped due to bucket overflow
0 dropped due to RST
0 dropped due to ICMP unreachable
1 delayed free of SYN cache entries
0 SYN,ACKs retransmitted
0 duplicate SYNs received for entries already in the cache
0 SYNs dropped (no route or no space)
0 packets with bad signature
0 packets with good signature
0 sucessful ECN handshakes
0 packets with ECN CE bit
0 packets ECN ECT(0) bit

and on server:
tcp:
323882 packets sent
321397 data packets (656445809 bytes)
5 data packets (12476 bytes) retransmitted
2364 ack-only packets (2753 delayed)
0 URG only packets
0 window probe packets
6 window update packets
110 control packets
0 send attempts resulted in self-quench
242229 packets received
229309 acks (for 656075639 bytes)
0 duplicate acks
0 acks for unsent data
5135 packets (853013 bytes) received in-sequence
489 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
0 out-of-order packets (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
7299 window update packets
0 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
107 connection requests
7 connection accepts
8 connections established (including accepts)
65660 connections closed (including 0 drops)
106 embryonic connections dropped
0 delayed frees of tcpcb
229310 segments updated rtt (of 22778 attempts)
1 retransmit timeout
0 connections dropped by rexmit timeout
0 persist timeouts (resulting in 0 dropped connections)
133 keepalive timeouts
133 keepalive probes sent
0 connections dropped by keepalive
6 correct ACK header predictions
4991 correct data packet header predictions
32 PCB hash misses
9 dropped due to no socket
0 connections drained due to memory shortage
0 PMTUD blackholes detected
0 bad connection attempts
7 SYN cache entries added
0 hash collisions
7 completed
0 aborted (no space to build PCB)
0 timed out
0 dropped due to overflow
0 dropped due to bucket overflow
0 dropped due to RST
0 dropped due to ICMP unreachable
7 delayed free of SYN cache entries
0 SYN,ACKs retransmitted
0 duplicate SYNs received for entries already in the cache
0 SYNs dropped (no route or no space)
0 packets with bad signature
0 packets with good signature
0 sucessful ECN handshakes
0 packets with ECN CE bit
0 packets ECN ECT(0) bit

I still have the bad-len packets on the server side, but not on the
client side. I wonder if this could be because of tso4 on the interface.

traces are available in ftp://ftp-asim.lip6.fr/outgoing/bouyer/

I also transferred the same file using ttcp instead of through
glusterfs:
xen1:/home/bouyer>ttcp -s -r -l 65536
ttcp-r: buflen=65536, nbuf=2048, align=16384/0, port=5001 tcp
ttcp-r: socket
ttcp-r: accept from 192.168.1.2
ttcp-r: 655360000 bytes in 6.61 real seconds = 96871.52 KB/sec +++
ttcp-r: 14856 I/O calls, msec/call = 0.46, calls/sec = 2248.63
ttcp-r: 0.0user 1.4sys 0:06real 21% 0i+0d 0maxrss 0+16pf 7627+2csw

xen2:/home/bouyer>ttcp -t -l 65536 xen1-priv < /glpool/truc
ttcp-t: buflen=65536, nbuf=2048, align=16384/0, port=5001 tcp -> xen1-priv
ttcp-t: socket
ttcp-t: connect
ttcp-t: 655360000 bytes in 6.60 real seconds = 96899.55 KB/sec +++
ttcp-t: 10000 I/O calls, msec/call = 0.68, calls/sec = 1514.06
ttcp-t: -1.9user 5.3sys 0:06real 80% 0i+0d 0maxrss 0+16pf 1394+171csw

I also got pcap traces:
xen1:/domains#tcpdump -n -p -i wm0 -w netbsd-ttcpclient.pcap host xen2-priv
tcpdump: listening on wm0, link-type EN10MB (Ethernet), capture size 96 bytes
^C
690857 packets captured
694270 packets received by filter
3249 packets dropped by kernel

xen2:/domains#tcpdump -n -p -i wm0 -w netbsd-ttcpserver.pcap host xen1-priv
tcpdump: listening on wm0, link-type EN10MB (Ethernet), capture size 96 bytes
^C
546336 packets captured
546595 packets received by filter
0 packets dropped by kernel


There is again the IP bad-len 0 in the server-side trace but not in the
client side trace.

netstat -s:
client:
tcp:
241714 packets sent
249 data packets (19529 bytes)
0 data packets (0 bytes) retransmitted
227005 ack-only packets (225830 delayed)
0 URG only packets
0 window probe packets
14459 window update packets
1 control packet
0 send attempts resulted in self-quench
453104 packets received
219 acks (for 19484 bytes)
0 duplicate acks
0 acks for unsent data
451342 packets (653225521 bytes) received in-sequence
0 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
368 out-of-order packets (532864 bytes)
0 packets (0 bytes) of data after window
0 window probes
0 window update packets
0 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
0 connection requests
2 connection accepts
2 connections established (including accepts)
17 connections closed (including 0 drops)
0 embryonic connections dropped
0 delayed frees of tcpcb
219 segments updated rtt (of 202 attempts)
0 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts (resulting in 0 dropped connections)
0 keepalive timeouts
0 keepalive probes sent
0 connections dropped by keepalive
46 correct ACK header predictions
451213 correct data packet header predictions
352 PCB hash misses
174 dropped due to no socket
0 connections drained due to memory shortage
0 PMTUD blackholes detected
0 bad connection attempts
2 SYN cache entries added
0 hash collisions
2 completed
0 aborted (no space to build PCB)
0 timed out
0 dropped due to overflow
0 dropped due to bucket overflow
0 dropped due to RST
0 dropped due to ICMP unreachable
2 delayed free of SYN cache entries
0 SYN,ACKs retransmitted
0 duplicate SYNs received for entries already in the cache
0 SYNs dropped (no route or no space)
0 packets with bad signature
0 packets with good signature
0 sucessful ECN handshakes
0 packets with ECN CE bit
0 packets ECN ECT(0) bit

and server:
tcp:
305529 packets sent
305129 data packets (655917613 bytes)
0 data packets (0 bytes) retransmitted
200 ack-only packets (216 delayed)
0 URG only packets
0 window probe packets
3 window update packets
197 control packets
0 send attempts resulted in self-quench
242418 packets received
226234 acks (for 655384706 bytes)
0 duplicate acks
0 acks for unsent data
225 packets (13905 bytes) received in-sequence
1534 completely duplicate packets (0 bytes)
0 old duplicate packets
0 packets with some dup. data (0 bytes duped)
1 out-of-order packet (0 bytes)
0 packets (0 bytes) of data after window
0 window probes
14338 window update packets
0 packets received after close
0 discarded for bad checksums
0 discarded for bad header offset fields
0 discarded because packet too short
197 connection requests
4 connection accepts
6 connections established (including accepts)
65747 connections closed (including 0 drops)
195 embryonic connections dropped
0 delayed frees of tcpcb
226236 segments updated rtt (of 39837 attempts)
0 retransmit timeouts
0 connections dropped by rexmit timeout
0 persist timeouts (resulting in 0 dropped connections)
244 keepalive timeouts
244 keepalive probes sent
0 connections dropped by keepalive
0 correct ACK header predictions
88 correct data packet header predictions
24 PCB hash misses
8 dropped due to no socket
0 connections drained due to memory shortage
0 PMTUD blackholes detected
0 bad connection attempts
4 SYN cache entries added
0 hash collisions
4 completed
0 aborted (no space to build PCB)
0 timed out
0 dropped due to overflow
0 dropped due to bucket overflow
0 dropped due to RST
0 dropped due to ICMP unreachable
4 delayed free of SYN cache entries
0 SYN,ACKs retransmitted
0 duplicate SYNs received for entries already in the cache
0 SYNs dropped (no route or no space)
0 packets with bad signature
0 packets with good signature
0 sucessful ECN handshakes
0 packets with ECN CE bit
0 packets ECN ECT(0) bit

So:
- it seems ACKs are not the issue: we get about the same number of ACKs
with ttcp, and we can run at full speed.
- the hardware and network can do more than 90MB/s, as ttcp manages
to do it.
- glusterfs can also do it, as a linux client can get data from the
NetBSD server at more than 90MB/s, and the NetBSD client can get data
from a linux server at more than 90MB/s. The linux box involved
is strictly identical (hardware-wise) to the NetBSD boxes, and
connected to the same gigabit switch.

I don't know where to look next; any ideas welcome.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Greg Troxel
2011-10-27 00:15:44 UTC
Permalink
The speed

with glusterfs: seems to be a combination of 74 MB/s and pauses

with ttcp: seems to be 112 MB/s burst (for 0.2s) and some smaller
pauses, which gut checks with 95 MB/s as reported by ttcp.

So are you seeing high 40s MB/s out of glusterfs?


What is between these two devices? Is this just a gigabit switch, or
anything more complicated? We are seeing reordering which I would not
expect on an ethernet. I wonder if the tso4 option is causing that.
What if you turn off the offload options? (I realize it may slow down,
but if both are then equal, that's a clue.)


I wonder if the very large buffers get full and that causes cache
thrashing. What happens if you change gluster to have smaller buffers
(I don't understand why it's ok to have the FS change the tcp socket
buffer options from system default)?

Grab

http://www.ir.bbn.com/~gdt/netbsd/gluster-xplot.tgz

to look at the plots yourself (you'll need xplot, but the conversion
from tcpdump is done already).
Manuel Bouyer
2011-10-27 09:45:16 UTC
Permalink
Post by Greg Troxel
The speed
with glusterfs: seems to be a combination of 74 MB/s and pauses
with ttcp: seems to be 112 MB/s burst (for 0.2s) and some smaller
pauses, which gut checks with 95 MB/s as reported by ttcp.
So are you seeing high 40s MB/s out of glusterfs?
Yes, between 40 and 50MB/s
Post by Greg Troxel
What is between these two devices? Is this just a gigabit switch, or
anything more complicated?
they're all (the 2 NetBSD and the linux host) connected to a cisco 3750
gigabit switch. I also tested with a single crossover cable; this doesn't
change anything.
Post by Greg Troxel
We are seeing reordering which I would not
expect on an ethernet. I wonder if the tso4 option is causing that.
What if you turn off the offload options? (I realize it may slow down,
but if both are then equal, that's a clue.)
That's easy. And yes, I get better performance: 77MB/s instead of < 50.
So it looks like we have something wrong with TSO.
The traces are still at ftp://asim.lip6.fr/outgoing/bouyer/
(netbsd-{client,server}-notso.pcap.gz).

Did you see the reordering in the ttcp trace too ?

But that still doesn't explain why I get good performance when one
of the hosts is linux. NetBSD used tso as well, and it didn't seem to cause
problems for linux ...

BTW, how does TSO work? Does the adapter get a single data block of
a full window size? If so, maybe the transmit ring just isn't big
enough ...
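
My rough mental model of it, just to be explicit (a conceptual sketch with
invented names, not what wm(4) or the chip actually does internally):

/*
 * Conceptual sketch of TCP segmentation offload: the stack hands down one
 * large TCP payload (up to 64k) and the hardware cuts it into MSS-sized
 * frames, duplicating the headers and advancing the sequence number.
 */
#include <stddef.h>
#include <stdint.h>

static void
emit_frame(uint32_t seq, const char *data, size_t len)
{
	(void)seq; (void)data; (void)len;	/* stands in for "put one frame on the wire" */
}

static void
tso_cut(uint32_t start_seq, const char *payload, size_t len, size_t mss)
{
	for (size_t off = 0; off < len; off += mss) {
		size_t chunk = (len - off < mss) ? len - off : mss;
		emit_frame(start_seq + (uint32_t)off, payload + off, chunk);
	}
}
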
Post by Greg Troxel
I wonder if the very large buffers get full and that causes cache
thrashing. What happens if you change gluster to have smaller buffers
(I don't understand why it's ok to have the FS change the tcp socket
buffer options from system default)?
Because it knows the size of its packets, or its internal receive buffers ?
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Greg Troxel
2011-10-27 12:30:12 UTC
Permalink
Post by Manuel Bouyer
Yes, between 40 and 50MB/s
ok, that matches what I see in the trace.
Post by Manuel Bouyer
Post by Greg Troxel
What is between these two devices? Is this just a gigabit switch, or
anything more complicated?
they're all (the 2 NetBSD and the linux host) connected to a cisco 3750
gigabit switch. I also tested with a single crossover cable; this doesn't
change anything.
OK - I've just seen enough things that are supposed to be transparent
and aren't.
Post by Manuel Bouyer
That's easy. And yes, I get better performance: 77MB/s instead of < 50.
And does gluster then match ttcp, as in both 77?
Post by Manuel Bouyer
So it looks like we have something wrong with TSO.
The traces are still at ftp://asim.lip6.fr/outgoing/bouyer/
(netbsd-{client,server}-notso.pcap.gz).
Did you see the reordering in the ttcp trace too ?
There were some, but it seems not big enough to cause real problems. As
long as TCP does fast recovery and doesn't go into timeout, things work
ok enough that it's really hard to notice.
Post by Manuel Bouyer
But, that still doesn't explain why I get good performances when one
of the host is linux. NetBSD used tso as well, and it didn't seem to cause
problems for linux ...
Sure, but TCP performance is subtle and there are all sorts of ways
things can line up to provoke or not provoke latent bugs. It seems
likely that whatever bad behavior the tso option is causing either
doesn't bother the linux receiver in terms of the acks it sends, or the
congestion window doesn't get big enough to trigger the tso bugs, or
something else like that. You can't conclude much from linux/netbsd
working well other than that things are mostly ok.
Post by Manuel Bouyer
BTW, how does TSO work? Does the adapter get a single data block of
a full window size? If so, maybe the transmit ring just isn't big
enough ...
I have no idea. Also, is there receive offload? The receiver has
packets arriving all together whereas they are showing up more spread
out at the transmitter. It may be that reordering happens in the
controller, or it may be that it happens at the receiver when the
packets are regenerated from the large buffer (and then injected out of
order).

One thing to keep in mind is that the tcpdump timestamps are not when
the packet arrives on the wire. They are the system time when the bpf
call is made, which is in many drivers when the packet's pointers are
loaded into the transmit ring.
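
Schematically (a generic sketch with invented names, not the actual wm(4)
code), the transmit path looks like this, which is why the timestamp says
little about when the frame really hits the wire:

/*
 * Generic sketch of a driver transmit path, showing where the bpf tap
 * (and therefore the tcpdump timestamp) happens.
 */
struct pkt { struct pkt *next; };
struct nic { struct pkt *sendq; };

static struct pkt *
dequeue(struct nic *sc)
{
	struct pkt *p = sc->sendq;
	if (p != NULL)
		sc->sendq = p->next;
	return p;
}
static void tap_for_bpf(struct pkt *p) { (void)p; /* bpf_mtap() equivalent; timestamp taken here */ }
static void load_tx_ring(struct nic *sc, struct pkt *p) { (void)sc; (void)p; /* DMA happens later */ }

static void
example_if_start(struct nic *sc)
{
	struct pkt *p;

	while ((p = dequeue(sc)) != NULL) {
		/* tapped before it reaches the hardware, not when it leaves the adapter */
		tap_for_bpf(p);
		load_tx_ring(sc, p);
	}
}
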
Post by Manuel Bouyer
Post by Greg Troxel
thrashing. What happens if you change gluster to have smaller buffers
I would do this experiment; that may avoid the problem. I'm not
suggesting that you run this way forever, but it will help us understand
what's wrong.
Post by Manuel Bouyer
Post by Greg Troxel
(I don't understand why it's ok to have the FS change the tcp socket
buffer options from system default)?
Because it knows the size of its packets, or its internal receive buffers ?
This is TCP, so gluster can have a large buffer in user space
independently of what the TCP socket buffer is. People set TCP socket
buffers to control the advertised window and to balance throughput on
long fat pipes with memory usage. In your case the RTT is only a few ms
even under load, so it wouldn't seem that huge buffers are necessary.
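
(Rough arithmetic: at gigabit speed with ~2 ms of RTT, the bandwidth-delay
product is about 125 MB/s * 0.002 s ~= 250 KB, so a 512 KB socket buffer
already more than covers the pipe; anything much larger mostly just adds
queueing.)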

Do you have actual problems if gluster doesn't force the buffer to be
large?

(That said, having buffers large enough to allow streaming is generally
good. But if you need that, it's not really about one user of TCP. I
have been turning on

net.inet.tcp.recvbuf_auto = 1
net.inet.tcp.sendbuf_auto = 1
net.inet6.tcp6.recvbuf_auto = 1
net.inet6.tcp6.sendbuf_auto = 1

to let buffers get bigger when TCP would be blocked by socket buffer.
In 5.1, that seems to lead to running out of mbuf clusters rather than
reclaiming them (when there are lots of connections), but I'm hoping
this is better in -current (or rather deferring looking into it until I
jump to current).

If you can get ttcp to show the same performance problems (by setting
buffer sizes, perhaps), then we can debug this without gluster, which
would help.

Also, it would be nice to have a third machine on the switch and run
tcpdump (without any funky offload behavior) and see what the packets on
the wire really look like. With the tso behavior I am not confident
that either trace is exactly what's on the wire.

Have you seen: http://gnats.netbsd.org/42323
Manuel Bouyer
2011-10-27 14:02:12 UTC
Permalink
Post by Greg Troxel
Post by Manuel Bouyer
Yes, between 40 and 50MB/s
ok, that matches what I see in the trace.
Post by Manuel Bouyer
Post by Greg Troxel
What is between these two devices? Is this just a gigabit switch, or
anything more complicated?
they're all (the 2 NetBSD and the linux host) connected to a cisco 3750
gigabit switch. I also tested with a single crossover cable; this doesn't
change anything.
OK - I've just seen enough things that are supposed to be transparent
and aren't.
Post by Manuel Bouyer
That's easy. And yes, I get better performance: 77MB/s instead of < 50.
And does gluster then match ttcp, as in both 77?
ttcp is at 108MB/s (so it's also faster without tso4). Looks like there's
definitely a problem with TSO on our side.
Post by Greg Troxel
[...]
I have no idea. Also, is there receive offload? The receiver has
packets arriving all together whereas they are showing up more spread
out at the transmitter. It may be that reordering happens in the
controller, or it may be that it happens at the receiver when the
packets are regenerated from the large buffer (and then injected out of
order).
there is ip/tcp checksum offload on the receiver side but nothing else.
Post by Greg Troxel
Post by Manuel Bouyer
Post by Greg Troxel
thrashing. What happens if you change gluster to have smaller buffers
I would do this experiment; that may avoid the problem. I'm not
suggesting that you run this way forever, but it will help us understand
what's wrong.
Post by Manuel Bouyer
Post by Greg Troxel
(I don't understand why it's ok to have the FS change the tcp socket
buffer options from system default)?
Because it knows the size of its packets, or its internal receive buffers ?
This is TCP, so gluster can have a large buffer in user space
independently of what the TCP socket buffer is. People set TCP socket
buffers to control the advertised window and to balance throughput on
long fat pipes with memory usage. In your case the RTT is only a few ms
even under load, so it wouldn't seem that huge buffers are necessary.
Do you have actual problems if gluster doesn't force the buffer to be
large?
that's interesting: I now have 78MB/s with tso4, and 48MB/s without
tso4. Just as if the setsockopt would turn tso4 off.
Post by Greg Troxel
(That said, having buffers large enough to allow streaming is generally
good. But if you need that, it's not really about one user of TCP. I
have been turning on
net.inet.tcp.recvbuf_auto = 1
net.inet.tcp.sendbuf_auto = 1
net.inet6.tcp6.recvbuf_auto = 1
net.inet6.tcp6.sendbuf_auto = 1
to let buffers get bigger when TCP would be blocked by socket buffer.
In 5.1, that seems to lead to running out of mbuf clusters rather than
reclaiming them (when there are lots of connections), but I'm hoping
this is better in -current (or rather deferring looking into it until I
jump to current).
I have these too, and no nmbclusters issues.
Post by Greg Troxel
If you can get ttcp to show the same performance problems (by setting
buffer sizes, perhaps), then we can debug this without gluster, which
would help.
I tried ttcp -l524288 (this is what gluster uses) but it doesn't cause
problems either.
Post by Greg Troxel
Also, it would be nice to have a third machine on the switch and run
tcpdump (without any funky offload behavior) and see what the packets on
the wire really look like. With the tso behavior I am not confident
that either trace is exactly what's on the wire.
playing with rspan it should be possible; I'll have a look.
Post by Greg Troxel
Have you seen: http://gnats.netbsd.org/42323
Yes, but I'm not seeing the problems described here.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
David Young
2011-10-29 01:24:53 UTC
Permalink
Post by Manuel Bouyer
Post by David Young
Post by Manuel Bouyer
Here is an updated patch. The key point to avoid the receive errors is
to do another BUS_DMASYNC after reading wrx_status, before reading the
other values to avoid reading e.g. len before status gets updated.
The errors were because of 0-len receive descriptors.
Index: sys/dev/pci/if_wm.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/if_wm.c,v
retrieving revision 1.162.4.15
diff -u -p -u -r1.162.4.15 if_wm.c
--- sys/dev/pci/if_wm.c 7 Mar 2011 04:14:19 -0000 1.162.4.15
+++ sys/dev/pci/if_wm.c 28 Oct 2011 14:03:33 -0000
@@ -2879,11 +2907,7 @@ wm_rxintr(struct wm_softc *sc)
device_xname(sc->sc_dev), i));
WM_CDRXSYNC(sc, i, BUS_DMASYNC_POSTREAD|BUS_DMASYNC_POSTWRITE);
-
status = sc->sc_rxdescs[i].wrx_status;
- errors = sc->sc_rxdescs[i].wrx_errors;
- len = le16toh(sc->sc_rxdescs[i].wrx_len);
- vlantag = sc->sc_rxdescs[i].wrx_special;
if ((status & WRX_ST_DD) == 0) {
/*
@@ -2892,6 +2916,14 @@ wm_rxintr(struct wm_softc *sc)
WM_CDRXSYNC(sc, i, BUS_DMASYNC_PREREAD);
break;
}
Should
Post by Manuel Bouyer
WM_CDRXSYNC(sc, i, BUS_DMASYNC_PREREAD);
move above
Post by Manuel Bouyer
if ((status & WRX_ST_DD) == 0) {
?
I don't think so: if WRX_ST_DD is not set, we won't read anything more from
this descriptor so there's no need to sync it again.
Currently, if WRX_ST_DD is not set, we sync the descriptor and
get out of the loop:

WM_CDRXSYNC(sc, i, BUS_DMASYNC_PREREAD);
break;

If WRX_ST_DD is set, however, we do read more from the descriptor. That
is why I ask whether we should sync it again.

It is strange and possibly unnecessary to have two sync calls
back-to-back, for it would be:

WM_CDRXSYNC(sc, i, BUS_DMASYNC_PREREAD);
if ((status & WRX_ST_DD) == 0) {
break;
}

/*
* sync again, to make sure the values below have been read
* after status.
*/
WM_CDRXSYNC(sc, i, BUS_DMASYNC_POSTREAD|BUS_DMASYNC_POSTWRITE);

Dave
--
David Young OJC Technologies is now Pixo
***@pixotech.com Urbana, IL (217) 344-0444 x24

Manuel Bouyer
2011-10-29 20:02:10 UTC
Permalink
Post by David Young
Post by Manuel Bouyer
I don't think so: if WRX_ST_DD is not set, we won't read anything more from
this descriptor so there's no need to sync it again.
Currently, if WRX_ST_DD is not set, we sync the descriptor and
WM_CDRXSYNC(sc, i, BUS_DMASYNC_PREREAD);
break;
If WRX_ST_DD is set, however, we do read more from the descriptor. That
is why I ask whether we should sync it again.
It is strange and possibly unnecessary to have two sync calls
WM_CDRXSYNC(sc, i, BUS_DMASYNC_PREREAD);
if ((status & WRX_ST_DD) == 0) {
break;
}
/*
* sync again, to make sure the values below have been read
* after status.
*/
WM_CDRXSYNC(sc, i, BUS_DMASYNC_POSTREAD|BUS_DMASYNC_POSTWRITE);
OK, I see. But is there a platform where BUS_DMASYNC_PREREAD is not
a NOP? I can't see what kind of work BUS_DMASYNC_PREREAD could have to do ...
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Martin Husemann
2011-10-29 20:14:18 UTC
Permalink
Post by Manuel Bouyer
OK, I see. But is there a platform where BUS_DMASYNC_PREREAD is not
a NOP? I can't see what kind of work BUS_DMASYNC_PREREAD could have to do ...
On platforms with memory reordering it might be a barrier (i.e. sparc64 does
a membar_sync() for it).
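
Very schematically, for coherent memory it can boil down to something like
this (a simplified sketch with invented names, not any particular port's
bus_dma code):

/*
 * Simplified sketch: on a cache-coherent platform the "sync" is just a
 * memory barrier ordering CPU loads/stores against device DMA.  Real
 * implementations also deal with bounce buffers and cache flushes.
 */
#define SKETCH_PREREAD   0x01
#define SKETCH_POSTREAD  0x02
#define SKETCH_PREWRITE  0x04
#define SKETCH_POSTWRITE 0x08

static void
sketch_dmamap_sync(int ops)
{
	if (ops & (SKETCH_PREREAD | SKETCH_POSTREAD))
		__sync_synchronize();	/* e.g. membar_sync() / lfence */
	if (ops & (SKETCH_PREWRITE | SKETCH_POSTWRITE))
		__sync_synchronize();	/* e.g. a write barrier / sfence */
}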

Martin

Manuel Bouyer
2011-11-02 19:04:40 UTC
Permalink
Post by Martin Husemann
Post by Manuel Bouyer
OK, I see. But is there a platform where BUS_DMASYNC_PREREAD is not
a NOP? I can't see what kind of work BUS_DMASYNC_PREREAD could have to do ...
On platforms with memory reordering it might be a barrier (i.e. sparc64 does
a membar_sync() for it).
So does x86. But I think it's redundant with the POSTREAD (which is also
a barrier) we'll do later.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
David Young
2011-11-03 19:22:27 UTC
Permalink
Post by Manuel Bouyer
Post by Martin Husemann
Post by Manuel Bouyer
OK, I see. But is there a platform where BUS_DMASYNC_PREREAD is not
a NOP? I can't see what kind of work BUS_DMASYNC_PREREAD could have to do ...
On platforms with memory reordering it might be a barrier (i.e. sparc64 does
a membar_sync() for it).
So does x86. But I think it's redundant with the POSTREAD (which is also
a barrier) we'll do later.
It looks to me, too, like the x86 implementation will make a redundant
x86_lfence() call.

I think it is preferable for the bus_dma implementation to try to avoid
redundant operations rather than for MI drivers to try to do so. What do you
think?

Dave
--
David Young OJC Technologies is now Pixo
***@pixotech.com Urbana, IL (217) 344-0444 x24

Manuel Bouyer
2011-11-28 18:53:37 UTC
Permalink
Post by David Young
Post by Manuel Bouyer
Post by Martin Husemann
Post by Manuel Bouyer
OK, I see. But is there a platform where BUS_DMASYNC_PREREAD is not
a NOP? I can't see what kind of work BUS_DMASYNC_PREREAD could have to do ...
On platforms with memory reordering it might be a barrier (i.e. sparc64 does
a membar_sync() for it).
So does x86. But I think it's redundant with the POSTREAD (which is also
a barrier) we'll do later.
It looks to me, too, like the x86 implementation will make a redundant
x86_lfence() call.
I think it is preferable for the bus_dma implementation to try to avoid
redundant operations rather than for MI drivers to try to do so. What do you
think?
Back to this old question I missed:

It's not clear to me how the bus_dma implementation could avoid
redundant calls here. I think PREREAD and POSTREAD should both
be a barrier. I'm not sure we can avoid redundant barriers anyway,
as we can't mix, e.g. POSTREAD with PREWRITE.

Back to this specific case, the PREREAD is not needed on every loop iteration
because, when descriptors are updated, wm_add_rxbuf() will do a PREREAD|PREWRITE.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Manuel Bouyer
2011-10-27 15:31:47 UTC
Permalink
Post by Manuel Bouyer
Post by Greg Troxel
Do you have actual problems if gluster doesn't force the buffer to be
large?
that's interesting: I now have 78MB/s with tso4, and 48MB/s without
tso4. Just as if the setsockopt would turn tso4 off.
Even more interesting: without changes on the linux side, a linux client
gets only 25MB/s out of the NetBSD server without the setsockopt (it gets 95MB/s
when the NetBSD server sets the snd/rcv buf size),
and the NetBSD client without the setsockopt gets only 73MB/s out of the linux
server (it gets 95MB/s).

I'll summarize this in the table below (all hosts are using tso4 and
no large receive offload):

                          client:
server                    NetBSD            NetBSD            Linux
                          no SND/RCVBUF     with SND/RCVBUF   with SND/RCVBUF
NetBSD no SND/RCVBUF      78MB/s            49MB/s            25MB/s
NetBSD with SND/RCVBUF    52MB/s            49MB/s            95MB/s
Linux  with SND/RCVBUF    73MB/s            99MB/s

I wonder if it could have something to do with the handling of PUSH in tcp.
With large socket buffers, data may be delayed for longer before being made
available to the userland process.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Thor Lancelot Simon
2011-10-27 15:57:00 UTC
Permalink
Post by Manuel Bouyer
ttcp is at 108MB/s (so it's also faster without tso4). Looks like there's
definitely a problem with TSO on our side.
The sending network adapter is a 'wm'? There are a few models of wm
with broken TSO support, but as far as I know most work fine. In
particular, most 82573 variants you'll actually encounter in the field
are bad.

Thor

Thor Lancelot Simon
2011-10-27 16:00:33 UTC
Permalink
Post by Manuel Bouyer
Post by Manuel Bouyer
Post by Greg Troxel
Do you have actual problems if gluster doesn't force the buffer to be
large?
that's interesting: I now have 78MB/s with tso4, and 48MB/s without
tso4. Just as if the setsockopt would turn tso4 off.
Even more interesting: without changes on the linux side, a linux client
gets only 25MB/s out of the NetBSD server without the setsockopt (it gets 95MB/s
when the NetBSD server sets the snd/rcv buf size),
and the NetBSD client without the setsockopt gets only 73MB/s out of the linux
server (it gets 95MB/s).
It's possible this has to do with the interrupt moderation tuning. I
believe better values than the ones I worked out from the documentation
have been pending checkin for quite some time -- there were
highly unobvious performance effects with small buffers. Simon did
a bunch of testing and concluded, as I recall, that the values used
by Intel in the Linux driver were "magic" and that we should use
those, not mine.

If this hasn't been adjusted to match the Linux driver, you might
want to take a quick look at the values it uses and see whether
they yield better small-buffer performance in your case.
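
As a rough guide to what those values mean: on these Intel parts the
interrupt throttling register holds a minimum inter-interrupt gap in 256 ns
units, so a target interrupt rate converts to a register value roughly as
below (an illustrative sketch; check the datasheet and if_wm.c for the exact
registers and encodings rather than taking this as authoritative):

/*
 * Rough sketch: convert a target interrupt rate into an Intel-style ITR
 * value (minimum gap between interrupts, in 256 ns units).
 */
static unsigned
itr_from_rate(unsigned ints_per_sec)
{
	if (ints_per_sec == 0)
		return 0;		/* 0 disables throttling */
	return 1000000000u / (256u * ints_per_sec);
}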

Also it may be worth checking we are not doing TSO twice, once in
hardware and once in software. There was a bug a long time ago
that could cause that but last time I looked I determined by setting
breakpoints in the relevant segmentation functions that it was,
as far as I could tell, fixed.

There should never be a performance loss from TSO, given a sane
implementation. In fact there shouldn't be much gain either on
modern systems _unless_ they are under very heavy load, since the
real savings is in transmitting host CPU time.

Thor

Manuel Bouyer
2011-10-27 16:03:34 UTC
Permalink
Post by Thor Lancelot Simon
Post by Manuel Bouyer
ttcp is at 108MB/s (so it's also faster without tso4). Looks like there's
definitely a problem with TSO on our side.
The sending network adapter is a 'wm'? There are a few models of wm
with broken TSO support, but as far as I know most work fine. In
particular, most 82573 variants you'll actually encounter in the field
are bad.
Mine is:
wm0 at pci4 dev 0 function 0: i80003 dual 1000baseT Ethernet, rev. 1
wm0: interrupting at ioapic0 pin 18
wm0: PCI-Express bus
wm0: 65536 word (16 address bits) SPI EEPROM
wm0: Ethernet address 00:30:48:32:13:10
ikphy0 at wm0 phy 1: i82563 10/100/1000 media interface, rev. 2

And, except for a few strange things in the pcap file on the sender side,
it seems to work fine (I'm also doing ssh, including some large file
transfers over this interface with tso enabled).
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Manuel Bouyer
2011-10-27 19:38:09 UTC
Permalink
Post by Greg Troxel
Perhaps, but bpf doesn't know about the interface MTU, and has its own
snaplen limiting how much it captures, so if it is seeing a 32 KB TCP
packet then that should be ok. But I saw no such packets in the trace.
TSO packets can be up to 64k.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Greg Troxel
2011-10-27 16:44:22 UTC
Permalink
one has to wonder how bpf works when tso is enabled. are the packets
read back from the card, or are they also created in software and then
discarded?
Manuel Bouyer
2011-10-27 17:51:43 UTC
Permalink
Post by Thor Lancelot Simon
It's possible this has to do with the interrupt moderation tuning. I
believe we've been pending the checkin of better values than the ones
I worked out from the documentation for quite some time -- there were
highly unobvious performance effects with small buffers. Simon did
a bunch of testing and concluded, as I recall, that the values used
by Intel in the Linux driver were "magic" and that we should use
those, not mine.
If this hasn't been adjusted to match the Linux driver, you might
want to take a quick look at the values it uses and see whether
they yield better small-buffer performance in your case.
I looked quickly at this and came up with the attached patch.

With this (installed on both NetBSD hosts) I get mixed results:
- the NetBSD client against the linux server gets degraded and unstable
performance; several runs give large variations in speed
- the NetBSD client against the NetBSD server gets better performance
on average (but still not in the 90MB/s range), also with large
variations between runs
- the linux client against the NetBSD server gets a little boost and
the speed is still stable between runs
- ttcp performance between NetBSD hosts gets a little boost too,
and the speed is still stable between runs

But I do get Ierrs on both NetBSD hosts now, with the ttcp or glusterfs
test. I don't know where these errors come from. Linux has no errors.
I don't think it's wm_add_rxbuf(); netstat -m and vmstat -m show
no issues with mbuf allocations.
So I guess these are errors at the adapter level; we may need to change
more things to match these values.
Also, linux seems to be using more advanced features for these adapters;
this is something we may have to look at too.
Post by Thor Lancelot Simon
Also it may be worth checking we are not doing TSO twice, once in
hardware and once in software. There was a bug a long time ago
that could cause that but last time I looked I determined by setting
breakpoints in the relevant segmentation functions that it was,
as far as I could tell, fixed.
This would affect the linux client/server against the NetBSD server/client
as well, wouldn't it?
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Manuel Bouyer
2011-10-27 17:55:16 UTC
Permalink
Post by Greg Troxel
one has to wonder how bpf works when tso is enabled. are the packets
read back from the card, or are they also created in software and then
discarded?
The mbuf passed to the adapter is also passed to bpf_mtap(),
so bpf_mtap() gets the tso packet and not the on-wire packets when
tso is in use.

AFAIK bpf_mtap() doesn't make any attempt to do something special
with the TSO mbufs it could get. So I guess tcpdump is getting the TSO
buffer ...
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Thor Lancelot Simon
2011-10-27 17:57:22 UTC
Permalink
Post by Manuel Bouyer
Post by Thor Lancelot Simon
If this hasn't been adjusted to match the Linux driver, you might
want to take a quick look at the values it uses and see whether
they yield better small-buffer performance in your case.
I looked quickly at this and came up with the attached patch.
I think what Simon tested was the change to the moderation timers
and thresholds only -- not to the total number of descriptors in
use. Might be worth checking to see whether that is what's causing
the input errors you're seeing.

Thor

Thor Lancelot Simon
2011-10-27 17:58:15 UTC
Permalink
Post by Manuel Bouyer
Post by Greg Troxel
one has to wonder how bpf works when tso is enabled. are the packets
read back from the card, or are they also created in software and then
discarded?
The mbuf passed to the adapter is also passed to bpf_mtap(),
so bpf_mtap() gets the tso packet and not the on-wire packets when
tso is in use.
AFAIK bpf_mtap() doesn't make any attempt to do something special
with the TSO mbufs it could get. So I guess tcpdump is getting the TSO
buffer ...
...which should frequently exceed the MTU, shouldn't it? So I would
actually expect tcpdump to misbehave.

Thor

Manuel Bouyer
2011-10-27 18:02:15 UTC
Permalink
Post by Thor Lancelot Simon
Post by Manuel Bouyer
Post by Thor Lancelot Simon
If this hasn't been adjusted to match the Linux driver, you might
want to take a quick look at the values it uses and see whether
they yield better small-buffer performance in your case.
I looked quickly at this and came up with the attached patch.
I think what Simon tested was the change to the moderation timers
and thresholds only -- not to the total number of descriptors in
use. Might be worth checking to see whether that is what's causing
the input errors you're seeing.
I also tested with the original value (256) and I did get the same
errors. I increased it in case the errors were because the adapter
was overflowing the receive ring.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
Greg Troxel
2011-10-27 19:34:14 UTC
Permalink
Post by Thor Lancelot Simon
Post by Manuel Bouyer
The mbuf passed to the adapter is also passed to bpf_mtap(),
so bpf_mtap() gets the tso packet and not the on-wire packets when
tso is in use.
AFAIK bpf_mtap() doesn't make any attempt to do something special
with the TSO mbufs it could get. So I guess tcpdump is getting the TSO
buffer ...
...which should frequently exceed the MTU, shouldn't it? So I would
actually expect tcpdump to misbehave.
Perhaps, but bpf doesn't know about the interface MTU, and has its own
snaplen limiting how much it captures, so if it is seeing a 32 KB TCP
packet then that should be ok. But I saw no such packets in the trace.
Manuel Bouyer
2011-10-28 14:10:36 UTC
Permalink
Post by Manuel Bouyer
Post by Thor Lancelot Simon
It's possible this has to do with the interrupt moderation tuning. I
believe we've been pending the checkin of better values than the ones
I worked out from the documentation for quite some time -- there were
highly unobvious performance effects with small buffers. Simon did
a bunch of testing and concluded, as I recall, that the values used
by Intel in the Linux driver were "magic" and that we should use
those, not mine.
If this hasn't been adjusted to match the Linux driver, you might
want to take a quick look at the values it uses and see whether
they yield better small-buffer performance in your case.
I looked quickly at this and came up with the attached patch.
- the NetBSD client against the linux server gets degraded and unstable
performance; several runs give large variations in speed
- the NetBSD client against the NetBSD server gets better performance
on average (but still not in the 90MB/s range), also with large
variations between runs
- the linux client against the NetBSD server gets a little boost and
the speed is still stable between runs
- ttcp performance between NetBSD hosts gets a little boost too,
and the speed is still stable between runs
But I do get Ierrs on both NetBSD hosts now, with the ttcp or glusterfs
test. I don't know where these errors come from. Linux has no errors.
I don't think it's wm_add_rxbuf(); netstat -m and vmstat -m show
no issues with mbuf allocations.
So I guess these are errors at the adapter level; we may need to change
more things to match these values.
Also, linux seems to be using more advanced features for these adapters;
this is something we may have to look at too.
Here is an updated patch. The key point to avoid the receive errors is
to do another BUS_DMASYNC after reading wrx_status, before reading the
other values to avoid reading e.g. len before status gets updated.
The errors were because of 0-len receive descriptors.

With this I get 113MB/s with the ttcp test, and between 70 and 90MB/s
with glusterfs. The NetBSD client now gets the same speed with a
NetBSD or linux server.

In the patch there are changes for the WM_F_NEWQUEUE adapters but they
may not be correct. When using WM_F_NEWQUEUE for the i80003 (which
the linux driver does), performance is a little lower and I get
a high interrupt rate - just as if interrupt mitigation was not
working.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
David Young
2011-10-28 15:27:56 UTC
Permalink
Post by Manuel Bouyer
Post by Manuel Bouyer
Post by Thor Lancelot Simon
It's possible this has to do with the interrupt moderation tuning. I
believe we've been pending the checkin of better values than the ones
I worked out from the documentation for quite some time -- there were
highly unobvious performance effects with small buffers. Simon did
a bunch of testing and concluded, as I recall, that the values used
by Intel in the Linux driver were "magic" and that we should use
those, not mine.
If this hasn't been adjusted to match the Linux driver, you might
want to take a quick look at the values it uses and see whether
they yield better small-buffer performance in your case.
I looked quickly at this and came up with the attached patch.
- the NetBSD client against the linux server gets degraded and unstable
performance; several runs give large variations in speed
- the NetBSD client against the NetBSD server gets better performance
on average (but still not in the 90MB/s range), also with large
variations between runs
- the linux client against the NetBSD server gets a little boost and
the speed is still stable between runs
- ttcp performance between NetBSD hosts gets a little boost too,
and the speed is still stable between runs
But I do get Ierrs on both NetBSD hosts now, with the ttcp or glusterfs
test. I don't know where these errors come from. Linux has no errors.
I don't think it's wm_add_rxbuf(); netstat -m and vmstat -m show
no issues with mbuf allocations.
So I guess these are errors at the adapter level; we may need to change
more things to match these values.
Also, linux seems to be using more advanced features for these adapters;
this is something we may have to look at too.
Here is an updated patch. The key point to avoid the receive errors is
to do another BUS_DMASYNC after reading wrx_status, before reading the
other values to avoid reading e.g. len before status gets updated.
The errors were because of 0-len receive descriptors.
Index: sys/dev/pci/if_wm.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/if_wm.c,v
retrieving revision 1.162.4.15
diff -u -p -u -r1.162.4.15 if_wm.c
--- sys/dev/pci/if_wm.c 7 Mar 2011 04:14:19 -0000 1.162.4.15
+++ sys/dev/pci/if_wm.c 28 Oct 2011 14:03:33 -0000
@@ -2879,11 +2907,7 @@ wm_rxintr(struct wm_softc *sc)
device_xname(sc->sc_dev), i));
WM_CDRXSYNC(sc, i, BUS_DMASYNC_POSTREAD|BUS_DMASYNC_POSTWRITE);
-
status = sc->sc_rxdescs[i].wrx_status;
- errors = sc->sc_rxdescs[i].wrx_errors;
- len = le16toh(sc->sc_rxdescs[i].wrx_len);
- vlantag = sc->sc_rxdescs[i].wrx_special;
if ((status & WRX_ST_DD) == 0) {
/*
@@ -2892,6 +2916,14 @@ wm_rxintr(struct wm_softc *sc)
WM_CDRXSYNC(sc, i, BUS_DMASYNC_PREREAD);
break;
}
Should
Post by Manuel Bouyer
WM_CDRXSYNC(sc, i, BUS_DMASYNC_PREREAD);
move above
Post by Manuel Bouyer
if ((status & WRX_ST_DD) == 0) {
?
Post by Manuel Bouyer
/*
+ /*
+ * sync again, to make sure the values below have been read
+ * after status.
+ */
+ WM_CDRXSYNC(sc, i, BUS_DMASYNC_POSTREAD|BUS_DMASYNC_POSTWRITE);
+ errors = sc->sc_rxdescs[i].wrx_errors;
+ len = le16toh(sc->sc_rxdescs[i].wrx_len);
+ vlantag = sc->sc_rxdescs[i].wrx_special;
Dave
--
David Young OJC Technologies is now Pixo
***@pixotech.com Urbana, IL (217) 344-0444 x24

Thor Lancelot Simon
2011-10-28 16:30:57 UTC
Permalink
Post by Manuel Bouyer
Here is an updated patch. The key point to avoid the receive errors is
to do another BUS_DMASYNC after reading wrx_status, before reading the
other values to avoid reading e.g. len before status gets updated.
The errors were because of 0-len receive descriptors.
With this I get 113MB/s with the ttcp test, and between 70 and 90MB/s
with glusterfs. The NetBSD client now gets the same speed with a
NetBSD or linux server.
In the patch there are changes for the WM_F_NEWQUEUE adapters but they
may not be correct. When using WM_F_NEWQUEUE for the i80003 (which
the linux driver does), performance is a little lower and I get
a high interrupt rate - just as if interrupt mitigation was not
working.
I'd say turn off WM_F_NEWQUEUE -- assuming that works -- and commit.

While you're in there, can you see whether we're actually feeding
the TSO enough data at once that it's actually useful? The tcpdump
makes me suspect that, for some reason, we may not be.

Thor

Manuel Bouyer
2011-10-28 16:54:31 UTC
Permalink
Post by Thor Lancelot Simon
I'd say turn off WM_F_NEWQUEUE -- assuming that works -- and commit.
I'll look at this.
Post by Thor Lancelot Simon
While you're in there, can you see whether we're actually feeding
the TSO enough data at once that it's actually useful? The tcpdump
makes me suspect for some reason, we may not be.
I added a sc_ev_txlargetso, which gets incremented by:
if (m0->m_pkthdr.len > 32768)
WM_EVCNT_INCR(&sc->sc_ev_txlargetso);

in wm_tx_offload() at the end of the
if ((m0->m_pkthdr.csum_flags & (M_CSUM_TSOv4 | M_CSUM_TSOv6)) != 0) {
block.
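
(For reference, a minimal sketch of how such an event counter is typically
wired up with evcnt(9) so that it shows up in vmstat -e output like that
quoted later in the thread; the attach call and its placement mirror the
driver's existing WM_EVENT_COUNTERS style but are assumptions, not the
committed code.)

    /* in struct wm_softc */
    struct evcnt sc_ev_txlargetso;    /* TSO packets > 32k handed to the chip */

    /* in wm_attach(), next to the other event-counter attachments */
    evcnt_attach_dynamic(&sc->sc_ev_txlargetso, EVCNT_TYPE_MISC,
        NULL, device_xname(sc->sc_dev), "txlargetso");

    /* in wm_tx_offload(), at the end of the TSO block, as described above */
    if (m0->m_pkthdr.len > 32768)
        WM_EVCNT_INCR(&sc->sc_ev_txlargetso);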

With
ttcp -t -l65536 -b524288 -D xen1-priv < /glpool/truc

(64k writes but a 512k TCP buffer) I get no "largetso".
With ttcp -t -l524288 -b524288 -D xen1-priv < /glpool/truc
I get only 16 (the file to send is 640MB, so I'd expect about 10000 TSO
segments).

So yes, TSO is not very useful, it seems.

When talking to the linux host,
ttcp -t -l65536 -b524288 -D
gives 615 largetso (still not enough, but better), and 693 with
ttcp -t -l524288 -b524288.

A glusterfs read of the same 640MB file generates no largetso when sending
to the NetBSD client, and 2159 when sending to the linux host.

So there's still something in our TCP that prevents sending large
TCP segments. I suspect we're back to something with ACK (like,
we ack a small segment so the sender sends a small segment to keep the window
full).
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

Thor Lancelot Simon
2011-10-28 18:24:04 UTC
Permalink
Post by Manuel Bouyer
Post by Thor Lancelot Simon
I'd say turn off WM_F_NEWQUEUE -- assuming that works -- and commit.
I'll look at this.
Post by Thor Lancelot Simon
While you're in there, can you see whether we're actually feeding
the TSO enough data at once that it's actually useful? The tcpdump
makes me suspect for some reason, we may not be.
if (m0->m_pkthdr.len > 32768)
WM_EVCNT_INCR(&sc->sc_ev_txlargetso);
It'd be interesting, perhaps, to see how often we were actually even > 1500.
Post by Manuel Bouyer
ttcp -t -l65536 -b524288 -D xen1-priv < /glpool/truc
(64k writes but a 512k TCP buffer) I get no "largetso".
Does the window ever have 32K of space, I wonder? Maybe the delay is
not large enough to ever let it open that wide.

Thor

Manuel Bouyer
2011-10-29 20:25:54 UTC
Permalink
Post by Thor Lancelot Simon
Post by Manuel Bouyer
if (m0->m_pkthdr.len > 32768)
WM_EVCNT_INCR(&sc->sc_ev_txlargetso);
It'd be interesting, perhaps, to see how often we were actually even > 1500.
I added this:
if (m0->m_pkthdr.len >= 32768)
WM_EVCNT_INCR(&sc->sc_ev_txtsolarge);
if (m0->m_pkthdr.len >= 16384)
WM_EVCNT_INCR(&sc->sc_ev_txtsomedium);
if (m0->m_pkthdr.len >= 1500)
WM_EVCNT_INCR(&sc->sc_ev_txtsomtu);

and here's the results:
for ttcp against a NetBSD client:
wm0 txtso 188555 695 misc
wm0 txtsomtu 188555 695 misc
wm0 txtsomedium 92 0 misc
wm0 txtsolarge 4 0 misc

glusterfs against a NetBSD client:
wm0 txtso 114607 690 misc
wm0 txtsomtu 114607 690 misc
wm0 txtsomedium 1496 9 misc
wm0 txtsolarge 709 4 misc

glusterfs against a linux client:
wm0 txtso 86567 1185 misc
wm0 txtsomtu 86567 1185 misc
wm0 txtsomedium 6897 94 misc
wm0 txtsolarge 2391 32 misc
Post by Thor Lancelot Simon
Post by Manuel Bouyer
ttcp -t -l65536 -b524288 -D xen1-priv < /glpool/truc
(64k writes but a 512k TCP buffer) I get no "largetso".
Does the window ever have 32K of space, I wonder? Maybe the delay is
not large enough to ever let it open that wide.
We should have this in the trace?
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

David Young
2011-11-03 21:06:21 UTC
Permalink
Post by Manuel Bouyer
Post by Thor Lancelot Simon
Post by Manuel Bouyer
if (m0->m_pkthdr.len > 32768)
WM_EVCNT_INCR(&sc->sc_ev_txlargetso);
It'd be interesting, perhaps, to see how often we were actually even > 1500.
if (m0->m_pkthdr.len >= 32768)
WM_EVCNT_INCR(&sc->sc_ev_txtsolarge);
if (m0->m_pkthdr.len >= 16384)
WM_EVCNT_INCR(&sc->sc_ev_txtsomedium);
if (m0->m_pkthdr.len >= 1500)
WM_EVCNT_INCR(&sc->sc_ev_txtsomtu);
wm0 txtso 188555 695 misc
wm0 txtsomtu 188555 695 misc
wm0 txtsomedium 92 0 misc
wm0 txtsolarge 4 0 misc
wm0 txtso 114607 690 misc
wm0 txtsomtu 114607 690 misc
wm0 txtsomedium 1496 9 misc
wm0 txtsolarge 709 4 misc
wm0 txtso 86567 1185 misc
wm0 txtsomtu 86567 1185 misc
wm0 txtsomedium 6897 94 misc
wm0 txtsolarge 2391 32 misc
Post by Thor Lancelot Simon
Post by Manuel Bouyer
ttcp -t -l65536 -b524288 -D xen1-priv < /glpool/truc
(64k writes but a 512k TCP buffer) I get no "largetso".
Does the window ever have 32K of space, I wonder? Maybe the delay is
not large enough to ever let it open that wide.
We should have this in the trace ?
Should be.

Does anything keep NetBSD from dribbling MTU-size segments onto the
interface output queue while one or two super-segments (size > MTU)
still wait to be transmitted?

The TSO code definitely tries to prevent pathological alternation
between super-segments and runts[1]; however, it looks to me like
nothing prevents alternation of super-segments and MTU-size segments.
That is not pathological, perhaps, but we can probably do better.

Dave

[1] This code is in tcp_output():

if (len > txsegsize) {
if (use_tso) {
/*
* Truncate TSO transfers to IP_MAXPACKET, and make
* sure that we send equal size transfers down the
* stack (rather than big-small-big-small-...).
*/
#ifdef INET6
CTASSERT(IPV6_MAXPACKET == IP_MAXPACKET);
#endif
len = (min(len, IP_MAXPACKET) / txsegsize) * txsegsize;
if (len <= txsegsize) {
use_tso = 0;
}
} else
len = txsegsize;

It looks to me like the code prevents an alternation large,
sub-t_segsz, large, sub-t_segsz, large, but not an alternation
large, t_segsz, large, t_segsz, large....
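
(A quick numeric illustration of that alternation, as a standalone toy
program; the 1448-byte MSS and the 66608-byte backlog are made-up example
values, not numbers taken from the traces above.)

    #include <stdio.h>

    int
    main(void)
    {
        const int ip_maxpacket = 65535;   /* IP_MAXPACKET */
        const int txsegsize = 1448;       /* MSS for a 1500-byte MTU (example) */
        int avail = 46 * txsegsize;       /* 66608 bytes queued (example) */

        while (avail > 0) {
            int len = avail;
            if (len > txsegsize) {
                /* the rounding from tcp_output() quoted above */
                len = ((len < ip_maxpacket ? len : ip_maxpacket)
                    / txsegsize) * txsegsize;
            }
            printf("%s segment: %d bytes\n",
                len > txsegsize ? "TSO super" : "MTU-size", len);
            avail -= len;
        }
        /* prints one 65160-byte super-segment, then one 1448-byte
         * segment: the large, t_segsz, large, ... pattern */
        return 0;
    }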
--
David Young OJC Technologies is now Pixo
***@pixotech.com Urbana, IL (217) 344-0444 x24

David Laight
2011-10-28 17:55:30 UTC
Permalink
Post by Manuel Bouyer
Here is an updated patch. The key point to avoid the receive errors is
to do another BUS_DMASYNC after reading wrx_status, before reading the
other values to avoid reading e.g. len before status gets updated.
The errors were because of 0-len receive descriptors.
I'm not entirely clear where the mis-ordering happens. I presume the
fields are volatile so gcc won't re-order them. Which seems to imply
that the only problem can be the adapter writing the fields in the
wrong order (unless the data is cached and spans cache lines).
In that case the BUS_DMASYNC is also acting as a delay.

I'm also not entirely certain it is a good idea to use le16toh()
on a volatile data item!
Post by Manuel Bouyer
len = le16toh(sc->sc_rxdescs[i].wrx_len);
I'd rather split it into:
len = sc->sc_rxdescs[i].wrx_len;
len = le16toh(len);
David
--
David Laight: ***@l8s.co.uk

Manuel Bouyer
2011-10-28 19:21:08 UTC
Permalink
Post by David Young
Post by Manuel Bouyer
Here is an updated patch. The key point to avoid the receive errors is
to do another BUS_DMASYNC after reading wrx_status, before reading the
other values to avoid reading e.g. len before status gets updated.
The errors were because of 0-len receive descriptors.
Index: sys/dev/pci/if_wm.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/if_wm.c,v
retrieving revision 1.162.4.15
diff -u -p -u -r1.162.4.15 if_wm.c
--- sys/dev/pci/if_wm.c 7 Mar 2011 04:14:19 -0000 1.162.4.15
+++ sys/dev/pci/if_wm.c 28 Oct 2011 14:03:33 -0000
@@ -2879,11 +2907,7 @@ wm_rxintr(struct wm_softc *sc)
device_xname(sc->sc_dev), i));
WM_CDRXSYNC(sc, i, BUS_DMASYNC_POSTREAD|BUS_DMASYNC_POSTWRITE);
-
status = sc->sc_rxdescs[i].wrx_status;
- errors = sc->sc_rxdescs[i].wrx_errors;
- len = le16toh(sc->sc_rxdescs[i].wrx_len);
- vlantag = sc->sc_rxdescs[i].wrx_special;
if ((status & WRX_ST_DD) == 0) {
/*
@@ -2892,6 +2916,14 @@ wm_rxintr(struct wm_softc *sc)
WM_CDRXSYNC(sc, i, BUS_DMASYNC_PREREAD);
break;
}
Should
Post by Manuel Bouyer
WM_CDRXSYNC(sc, i, BUS_DMASYNC_PREREAD);
move above
Post by Manuel Bouyer
if ((status & WRX_ST_DD) == 0) {
?
I don't think so: if WRX_ST_DD is not set, we won't read anything more from
this descriptor, so there's no need to sync it again.
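
(For reference, roughly what the receive path looks like with the patch
applied, reassembled from the two hunks quoted earlier; surrounding code
elided.)

    WM_CDRXSYNC(sc, i, BUS_DMASYNC_POSTREAD|BUS_DMASYNC_POSTWRITE);
    status = sc->sc_rxdescs[i].wrx_status;
    if ((status & WRX_ST_DD) == 0) {
        /* descriptor not done yet: give it back and stop */
        WM_CDRXSYNC(sc, i, BUS_DMASYNC_PREREAD);
        break;
    }
    /*
     * sync again, to make sure the values below are read
     * after status.
     */
    WM_CDRXSYNC(sc, i, BUS_DMASYNC_POSTREAD|BUS_DMASYNC_POSTWRITE);
    errors = sc->sc_rxdescs[i].wrx_errors;
    len = le16toh(sc->sc_rxdescs[i].wrx_len);
    vlantag = sc->sc_rxdescs[i].wrx_special;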
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

Manuel Bouyer
2011-10-29 19:59:07 UTC
Permalink
Post by David Laight
Post by Manuel Bouyer
Here is an updated patch. The key point to avoid the receive errors is
to do another BUS_DMASYNC after reading wrx_status, before reading the
other values to avoid reading e.g. len before status gets updated.
The errors were because of 0-len receive descriptors.
I'm not entirely clear where the mis-ordering happens. I presume the
fields a volatile so gcc won't re-order them. Which seems to imply
that the only problem can be the adapter writing the fields in the
wrong order (unless the data is cached and spans cache lines).
In that case the BUS_DMASYNC is also acting as a delay.
AFAIK the CPU is allowed to reorder reads. linux has a rmb() here,
which is an equivalent of our x86_lfence() I guess.
But for platforms where BUS_DMASYNC is not a simple barrier,
2 BUS_DMASYNC calls are needed.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

Dennis Ferguson
2011-10-29 20:37:40 UTC
Permalink
Post by Manuel Bouyer
Post by David Laight
Post by Manuel Bouyer
Here is an updated patch. The key point to avoid the receive errors is
to do another BUS_DMASYNC after reading wrx_status, before reading the
other values to avoid reading e.g. len before status gets updated.
The errors were because of 0-len receive descriptors.
I'm not entirely clear where the mis-ordering happens. I presume the
fields a volatile so gcc won't re-order them. Which seems to imply
that the only problem can be the adapter writing the fields in the
wrong order (unless the data is cached and spans cache lines).
In that case the BUS_DMASYNC is also acting as a delay.
AFAIK the CPU is allowed to reorder reads. linux has a rmb() here,
which is an equivalent of our x86_lfence() I guess.
But for platforms where BUS_DMASYNC is not a simple barrier,
2 BUS_DMASYNC calls are needed.
CPUs in general are allowed to reorder reads, but Intel and AMD
x86 CPUs in particular won't do that. The linux rmb() expands to
an empty asm() statement, essentially (not quite) a NOP.

There might be another problem, though. If your code looks like

volatile int a, b;
<. . .>
my_a = a;
my_b = b;

I don't think it is guaranteed that the compiler will generate code
which reads a before it reads b. The volatile declarations ensure
that a and b will be (re-)read, but don't indicate to the compiler that
reads from different locations need to be done in the order you wrote
them. To ensure the latter you need to do something like

my_a = a;
something();
my_b = b;

where something() is either a function call or that empty asm()
statement that linux rmb() expands to. In essence this means
that you can't ever leave out the calls to the memory barrier
primitives if the code depends on reads being done in order,
even on uniprocessors and even if the CPU hardware doesn't
reorder reads, because you still need something there to tell
the compiler to maintain the order. On NetBSD you would need a
membar_consumer() in there (though it would be better if the
x86 membar_consumer() turned into an empty asm() statement
the way linux x86 rmb() does).
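
(A minimal sketch of the ordering pattern described above, using the NetBSD
primitive; the variable names follow the earlier example.)

    #include <sys/atomic.h>            /* membar_consumer() */

    volatile int a, b;

    void
    read_in_order(int *my_a, int *my_b)
    {
        *my_a = a;
        membar_consumer();    /* keep the compiler and, on weakly-ordered
                               * CPUs, the hardware from moving the read
                               * of b above the read of a */
        *my_b = b;
    }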

I've been writing a lot of SMP data structure code recently, and
you sometimes find bugs like this when you change compilers.

Dennis Ferguson

David Laight
2011-10-31 21:29:30 UTC
Permalink
Post by Dennis Ferguson
CPUs in general are allowed to reorder reads, but Intel and AMD
x86 CPUs in particular won't do that.
The rules can also be different for cached v uncached values.
Post by Dennis Ferguson
The linux rmb() expands to
an empty asm() statement, essentially (not quite) a NOP.
An empty asm statement isn't enough. You need:
asm volatile ( ::: "memory" )
to tell gcc that the statement might modify any memory location.
However, if the data reference is volatile that isn't necessary.
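
(The difference being pointed out, sketched as macros; the names are
illustrative, not existing NetBSD API.)

    /* A real compiler barrier: no instruction is emitted, but the
     * "memory" clobber forces gcc to assume any memory may have been
     * modified, so it may not cache memory values or move ordinary
     * loads and stores across it. */
    #define compiler_barrier()      __asm volatile("" ::: "memory")

    /* An empty asm without the "memory" clobber does not stop gcc from
     * caching memory values or moving ordinary loads and stores
     * across it. */
    #define not_a_real_barrier()    __asm volatile("")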
Post by Dennis Ferguson
There might be another problem, though. If your code looks like
volatile int a, b;
<. . .>
my_a = a;
my_b = b;
I don't think it is guaranteed that the compiler will generate code
which reads a before it reads b. The volatile declarations ensure
that a and b will be (re-)read, but don't indicate to the compiler that
reads from different locations need to be done in the order you wrote
them.
The volatiles do guarantee the order of the instructions.
(They don't guarantee the size of the bus cycles)

...
Post by Dennis Ferguson
I've been writing a lot of SMP data structure code recently, and
you sometimes find bugs like this when you change compilers.
You also find a lot of code that just doesn't work...
Typical hints are locks that don't tie two actions together, and
places where locks are released over function calls (they can be ok).

David
--
David Laight: ***@l8s.co.uk

David Young
2011-11-13 21:04:51 UTC
Permalink
Post by David Laight
Post by Dennis Ferguson
CPUs in general are allowed to reorder reads, but Intel and AMD
x86 CPUs in particular won't do that.
The rules can also be different for cached v uncached values.
Post by Dennis Ferguson
The linux rmb() expands to
an empty asm() statement, essentially (not quite) a NOP.
asm volatile ( ::: "memory" )
to tell gcc that the statement might modify any memory location.
However, if the data reference is volatile that isn't necessary.
Post by Dennis Ferguson
There might be another problem, though. If your code looks like
volatile int a, b;
<. . .>
my_a = a;
my_b = b;
I don't think it is guaranteed that the compiler will generate code
which reads a before it reads b. The volatile declarations ensure
that a and b will be (re-)read, but don't indicate to the compiler that
them.
The volatiles do guarantee the order of the instructions.
(They don't guarantee the size of the bus cycles)
Perhaps there are differences between GCC in 5.1 and in -current such
that different assembly is generated, but in my recentish -current
sources, I've added volatile to the rx descriptor, and GCC generated the
same assembly as before. It is clear from reading the assembly that the
fields are read in the correct order.

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

David Young
2011-11-13 21:04:28 UTC
Permalink
Post by Dennis Ferguson
Post by Manuel Bouyer
Post by David Laight
Post by Manuel Bouyer
Here is an updated patch. The key point to avoid the receive errors is
to do another BUS_DMASYNC after reading wrx_status, before reading the
other values to avoid reading e.g. len before status gets updated.
The errors were because of 0-len receive descriptors.
I'm not entirely clear where the mis-ordering happens. I presume the
fields a volatile so gcc won't re-order them. Which seems to imply
that the only problem can be the adapter writing the fields in the
wrong order (unless the data is cached and spans cache lines).
In that case the BUS_DMASYNC is also acting as a delay.
AFAIK the CPU is allowed to reorder reads. linux has a rmb() here,
which is an equivalent of our x86_lfence() I guess.
But for platforms where BUS_DMASYNC is not a simple barrier,
2 BUS_DMASYNC calls are needed.
CPUs in general are allowed to reorder reads, but Intel and AMD
x86 CPUs in particular won't do that. The linux rmb() expands to
an empty asm() statement, essentially (not quite) a NOP.
I have established that in -current, at least, the compiler
doesn't reorder the reads in wm_rxintr(). People seem to disagree
whether an x86 CPU will reorder the reads. :-)

According to <http://www.linuxjournal.com/article/8212?page=0,2>, x86
will reorder reads:

..., x86 CPUs give no ordering guarantees for loads, so the smp_mb()
and smp_rmb() primitives expand to lock;addl.

In NetBSD-current, membar_consumer() is

ENTRY(_membar_consumer)
LOCK(13)
addl $0, -4(%esp)
ret
ENDLABEL(membar_consumer_end)

which resembles the x86_lfence() that bus_dmamap_sync(POSTREAD) calls,

ENTRY(x86_lfence)
lock
addl $0, -4(%esp)
ret
END(x86_lfence)

I believe that on a UP machine, the LOCK prefix in membar_consumer() is
overwritten with a NOP. The LOCK prefix in x86_lfence() is not erased
in that way. Is the LOCK prefix important to the proper operation of
bus_dmamap_sync() even on a UP machine?

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

Greg Oster
2011-11-21 20:58:57 UTC
Permalink
On Fri, 28 Oct 2011 16:10:36 +0200
Post by Manuel Bouyer
Post by Manuel Bouyer
Post by Thor Lancelot Simon
It's possible this has to do with the interrupt moderation
tuning. I believe we've been pending the checkin of better
values than the ones I worked out from the documentation for
quite some time -- there were highly unobvious performance
effects with small buffers. Simon did a bunch of testing and
concluded, as I recall, that the values used by Intel in the
Linux driver were "magic" and that we should use those, not mine.
If this hasn't been adjusted to match the Linux driver, you might
want to take a quick look at the values it uses and see whether
they yield better small-buffer performance in your case.
I looked quickly at this and came up with the attached patch.
- the NetBSD client against the linux server gets degranded and
unstable performances several runs gives large variations in speed
- the NetBSD client against the NetBSD server gets better
performances in average (but still not in the 90MB range) and also
with large variations between runs
- the linux client against the NetBSD server gets a little boost and
the speed is stll stable between runs
- ttcp performances between NetBSD hosts gets a little boost too,
and the speed is stll stable between runs
But I do get Ierrs on both NetBSD hosts now, with the ttcp or
glusterfs test. I don't know where these errors comes from. Linux
has no errors. I don't think it's wm_add_rxbuf(), netstat -m and
vmstat -m shows no issues with mbuf allocations.
So I guess these are errors at the adapter level, we may need to
change more things to match these values.
Also, linux seems to be using more advanced features for these
adapters, this is something we may have to look at too.
Here is an updated patch. The key point to avoid the receive errors is
to do another BUS_DMASYNC after reading wrx_status, before reading the
other values to avoid reading e.g. len before status gets updated.
The errors were because of 0-len receive descriptors.
With this I get 113MB/s with the ttcp test, and between 70 and 90MB/s
with glusterfs. the NetBSD client now gets the same speed with a
NetBSD or linux server.
In the patch there is changes for the WM_F_NEWQUEUE adapters but they
may not be correct. When using WM_F_NEWQUEUE for the i80003 (which
the linux driver does), performances are a little lower and I get
a high interrupt rate - just as if interrupt mitigation was not
working.
Hi Manuel

Have there been issues with these patches that prevent them from being
applied to -current and/or pulled up? (says he who saw reasonably
abysmal network performance this weekend on a bunch of machines with
wm0 in use, and would like to see some better performance.)

Thanks.

Later...

Greg Oster

David Young
2011-11-22 06:41:13 UTC
Permalink
Post by Manuel Bouyer
Post by Greg Oster
Hi Manuel
Have there been issues with these patches that prevent them from being
applied to -current and/or pulled up?
Nothing wrong AFAIK, I just got distracted.
In the discussion of the patches, people seem to disagree how the
patches work to improve the performance as they do, whether the patches
are portable, and whether or not the whole patch is necessary or just
the bus_dmamap_sync() part of it. I hope our understanding improves
before there is a commit. :-/

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

Manuel Bouyer
2011-11-22 14:32:41 UTC
Permalink
Post by David Young
Post by Manuel Bouyer
Post by Greg Oster
Hi Manuel
Have there been issues with these patches that prevent them from being
applied to -current and/or pulled up?
Nothing wrong AFAIK, I just got distracted.
In the discussion of the patches, people seem to disagree how the
patches work to improve the performance as they do, whether the patches
are portable, and whether or not the whole patch is necessary or just
the bus_dmamap_sync() part of it. I hope our understanding improves
before there is a commit. :-/
There are 2 parts: changes to the interrupt setup, and the reading of
receive descriptors.

I didn't see people having issues with the interrupt setup.

As for the discussion of whether an x86 CPU will reorder reads, I'm sure they do:
I've had trouble in the Xen front/back drivers because of this
(and there is explicit lfence() in the linux Xen drivers, because of this).
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

Dennis Ferguson
2011-11-22 20:23:06 UTC
Permalink
Post by Manuel Bouyer
Post by David Young
Post by Manuel Bouyer
Post by Greg Oster
Hi Manuel
Have there been issues with these patches that prevent them from being
applied to -current and/or pulled up?
Nothing wrong AFAIK, I just got distracted.
In the discussion of the patches, people seem to disagree how the
patches work to improve the performance as they do, whether the patches
are portable, and whether or not the whole patch is necessary or just
the bus_dmamap_sync() part of it. I hope our understanding improves
before there is a commit. :-/
There are 2 parts: changes to the interrupt setup, and the reading of
receive descriptors.
I didn't see peoples having issues with the interrupt setup.
I've had troubles in Xen front/back driver because of this
(and there is explicit lfence() in the linux Xen drivers, because of this).
Needless to say, the last bit would be entirely inconsistent with section
7.2 of any version of the "Intel 64 and IA-32 Architectures Software Developer's
Manual, Volume 3A: System Programming Guide, Part 1" published more recently
than 2007. I won't repeat what it says here, but it is rather unambiguous
about the fact that newer reads (in program order) are always done after older
reads, at least in the basic instruction set.

Of course if that is always true then it would also imply that an lfence instruction
is useless, because the only thing an lfence instruction does would seem to be
guaranteed even if an lfence instruction isn't there. Yet the lfence instruction
does exist, which made me wonder what it is used for?

After looking for an answer to that it turns out that while read ordering is
guaranteed for loads done using the basic x86 instruction set, it is not guaranteed
with respect to loads done by certain SSE instructions. The lfence can be necessary
if the compiler is generating SSE instructions, and if we now have a complier which
is more aggressive about finding SSE instructions to generate it is possible that
there will be code which once worked fine without memory barriers which now requires
them. Maybe this is an instance of that?

Dennis Ferguson
Manuel Bouyer
2011-11-22 20:53:53 UTC
Permalink
Post by Dennis Ferguson
[...]
Needless to say, the last bit would be entirely inconsistent with section
7.2 of any version of the "Intel 64 and IA-32 Architectures Software Developer's
Manual, Volume 3A: System Programming Guide, Part 1" published more recently
than 2007. I won't repeat what it says here, but it is rather unambiguous
about the fact that newer reads (in program order) are always done after older
reads, at least in the basic instruction set.
I have "IA-32 Intel® Architecture Software Developer's Manual Volume 3:
System Programming Guide" from 2004, and it says at the beggining
of 7.2.2:
| In a single-processor system for memory regions defined as write-back
| cacheable, the following ordering rules apply:
| 1. Reads can be carried out speculatively and in any order.

and a bit later
| In a multiple-processor system, the following ordering rules apply:
| · Individual processors use the same ordering rules as in a
| single-processor system.
| · Writes by a single processor are observed in the same order by all
| processors.
| · Writes from the individual processors on the system bus are NOT ordered
| with respect to each other.

I can't see anything about read ordering being stronger in an SMP system.

So we have to assume that reads can be reordered, unless we want
NetBSD to run only on x86 systems newer than 2007 (and an lfence instruction
is enough to enforce read ordering).
And yes, my test system is older than 2007.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

Dennis Ferguson
2011-11-22 23:10:52 UTC
Permalink
Post by Manuel Bouyer
Post by Dennis Ferguson
[...]
Needless to say, the last bit would be entirely inconsistent with section
7.2 of any version of the "Intel 64 and IA-32 Architectures Software Developer's
Manual, Volume 3A: System Programming Guide, Part 1" published more recently
than 2007. I won't repeat what it says here, but it is rather unambiguous
about the fact that newer reads (in program order) are always done after older
reads, at least in the basic instruction set.
System Programming Guide" from 2004, and it says at the beggining
| In a single-processor system for memory regions defined as write-back
| 1. Reads can be carried out speculatively and in any order.
You are assuming the above somehow applied to Intel CPUs which existed
in 2004, but that assumption is incorrect. There were no Intel (or AMD)
CPUs which worked like that in 2004, since post-2007 manuals document the
ordering behavior of all x86 models from the 386 forward, and explicitly
say that none of them have reordered reads, so the above could only be a
statement of what they expected future CPUs might do and not what
they actually did.

This is clear in the post-2007 revision I have, where the section you quote
above now says:

7.2.2 Memory Ordering in P6 and More Recent Processor Families

The Intel Core 2 Duo, Intel Core Duo, Pentium 4, and P6 family processors also
use a processor-ordered memory-ordering model that can be further defined as
“write ordered with store-buffer forwarding.” This model can be characterized as follows.

In a single-processor system for memory regions defined as write-back cacheable,
the following ordering principles apply […]:

• Reads are not reordered with other reads.
• Writes are not reordered with older reads.

and, about speculative reads in particular, later says:

The processor-ordering model described in this section is virtually identical to
that used by the Pentium and Intel486 processors. The only enhancements in the Pentium 4,
Intel Xeon, and P6 family processors are:

• Added support for speculative reads, while still adhering to the ordering principles above.
• Store-buffer forwarding, when a read passes a write to the same memory location.

That is, they've tightened up the guarantees for the modern processors, while the
older processors were in fact even more strictly ordered than this. Speculative
reads can still occur on the modern processors, but they are now guaranteed
to be implemented in a way which still observes program ordering. There are no
Intel processors, past or present, which work some other way.
Post by Manuel Bouyer
I can't see anything about read ordering being stronger in a SMP system.
So we have to assume that reads can be reorderd, unless we want
NetBSD to run only on x86 systems newer than 2007 (and a lfence instruction
is enough to enforce read ordering).
and yes, my test system is older than 2007.
You are reading about what they thought, in 2004, that they might build in future.
The post-2007 manuals make clear how all processors have actually worked, from
the 386 to the modern ones, and none of them have worked that way. If you
don't want to take my word for it then take a look at the x86 section of
this (very good) 2007 paper:

http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf

which says about the same thing. If you are using an Intel or AMD CPU (I think
Cyrix x86's might have had an out-of-order option) it won't be reordering
reads in the basic instruction set.

Dennis Ferguson


Manuel Bouyer
2011-11-23 11:12:05 UTC
Permalink
Post by Dennis Ferguson
[...]
You are assuming the above somehow applied to Intel CPUs which existed
in 2004, but that assumption is incorrect. There were no Intel (or AMD)
CPUs which worked like that in 2004, since post-2007 manuals document the
ordering behavior of all x86 models from the 386 forward, and explicitly
says that none of them have reordered reads, so the above could only a
statement of what they expected future CPUs might do and not what
they actually did.
This is clearly not my experience. I can say for sure that without lfence
instructions, the xen front/back drivers are not working properly
(and I'm not the only one saying this).
Post by Dennis Ferguson
This is clear in the post-2007 revision I have, where the section you quote
It also says that we should not rely on this behavior and that, for compatibility
with future processors, programmers should use memory barrier instructions
where appropriate.

Anyway, what prompted this discussion is the added bus_dmamap_sync()
in the wm driver. It's needed because:
- we may be using bounce buffering, and we don't know in which order
the copy to bounce buffer is done
- all the world is not x86.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

Manuel Bouyer
2011-11-23 11:30:39 UTC
Permalink
Post by Manuel Bouyer
Post by Dennis Ferguson
[...]
You are assuming the above somehow applied to Intel CPUs which existed
in 2004, but that assumption is incorrect. There were no Intel (or AMD)
CPUs which worked like that in 2004, since post-2007 manuals document the
ordering behavior of all x86 models from the 386 forward, and explicitly
says that none of them have reordered reads, so the above could only a
statement of what they expected future CPUs might do and not what
they actually did.
This is clearly not my experience. I can say for sure that without lfence
instructions, the xen front/back drivers are not working properly
(and I'm not the only one saying this).
To be more specific: on linux, rmb() is *not* a simple compiler barrier;
it's either lock; addl $0,0(%%esp) or lfence depending on the CPU
target.
smp_rmb() is defined as either barrier() (a compiler barrier) or,
when compiled with the CONFIG_X86_PPRO_FENCE option, rmb().
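
(For reference, those definitions boil down to roughly the following; this is
paraphrased from the Linux x86 headers of that era, and the exact forms vary
with kernel version and configuration.)

    /* Compiler barrier only: no instruction is emitted. */
    #define barrier()       __asm__ __volatile__("" ::: "memory")

    /*
     * Unconditional read barrier: an instruction is always emitted
     * (lfence where available, otherwise the lock; addl $0,0(%%esp)
     * form mentioned above).
     */
    #define rmb()           __asm__ __volatile__("lfence" ::: "memory")

    /* SMP read barrier: only a compiler barrier unless the kernel is
     * built with CONFIG_X86_PPRO_FENCE. */
    #ifdef CONFIG_X86_PPRO_FENCE
    #define smp_rmb()       rmb()
    #else
    #define smp_rmb()       barrier()
    #endif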
Post by Manuel Bouyer
Post by Dennis Ferguson
This is clear in the post-2007 revision I have, where the section you quote
It also says that we should not rely on this behavior and, for compatibility
with future processors programmers should use memory barrier instructions
where appropriate.
Anyway, what prompted this discussion is the added bus_dmamap_sync()
- we may be using bounce buffering, and we don't know in which order
the copy to bounce buffer is done
- all the world is not x86.
Also, the Intel manual specifies what happens between CPUs, it doesn't
says what happens when main memory is written to by a DMA device.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

David Young
2011-11-23 19:22:55 UTC
Permalink
Post by Manuel Bouyer
Post by Dennis Ferguson
[...]
You are assuming the above somehow applied to Intel CPUs which existed
in 2004, but that assumption is incorrect. There were no Intel (or AMD)
CPUs which worked like that in 2004, since post-2007 manuals document the
ordering behavior of all x86 models from the 386 forward, and explicitly
says that none of them have reordered reads, so the above could only a
statement of what they expected future CPUs might do and not what
they actually did.
This is clearly not my experience. I can say for sure that without lfence
instructions, the xen front/back drivers are not working properly
(and I'm not the only one saying this).
Are the xen front-/back-end drivers otherwise correct? I.e., using
volatile where they ought to? wm(4) definitely does *not* use volatile
everywhere it ought to, and I've just found out that that explains this
bug.

I've just tried the same experiment on the netbsd-5 branch. The
compiler generates different assembly for wm_rxintr() before and after.
The before-assembly definitely loads wrx_len before wrx_status, which is
wrong; the after-assembly loads wrx_status first. So we can explain
the wm(4) bug with re-ordering of reads by the compiler, not the CPU.

(BTW, in -current, when I added volatile to the rx descriptor
members and recompiled, the compiler generated the same assembly for
wm_rxintr(). Makes me wonder, does the newer GCC in -current cover a
lot of bugs?)
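
(A minimal illustration of the compiler-level reordering in question; the
struct below is a simplified stand-in, not the real wiseman rx descriptor.)

    #include <stdint.h>

    struct rxd {
        uint8_t    status;
        uint8_t    errors;
        uint16_t   len;
    };

    struct rxd_v {                    /* same layout, volatile members */
        volatile uint8_t    status;
        volatile uint8_t    errors;
        volatile uint16_t   len;
    };

    /* Two independent loads from ordinary memory: the compiler is free
     * to emit the load of d->len before the load of d->status. */
    uint16_t
    read_plain(const struct rxd *d, uint8_t *st)
    {
        *st = d->status;
        return d->len;
    }

    /* With volatile members the accesses must be emitted in program
     * order (volatile accesses may not be reordered with each other),
     * which matches the assembly change described above. */
    uint16_t
    read_volatile(const struct rxd_v *d, uint8_t *st)
    {
        *st = d->status;
        return d->len;
    }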
Post by Manuel Bouyer
Post by Dennis Ferguson
This is clear in the post-2007 revision I have, where the section you quote
It also says that we should not rely on this behavior and, for compatibility
with future processors programmers should use memory barrier instructions
where appropriate.
Agreed.
Post by Manuel Bouyer
Anyway, what prompted this discussion is the added bus_dmamap_sync()
- we may be using bounce buffering, and we don't know in which order
the copy to bounce buffer is done
- all the world is not x86.
I agree strongly with your bullet points, and I think that by the same
rationale, we need one more bus_dmamap_sync(). :-)

Maybe I do not remember correctly, but I thought that the previous
discussion of how many _sync()s to use, where they should go, and why,
left off with me asking, "what do you think?" I do really want to know!

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

Dennis Ferguson
2011-11-23 23:25:36 UTC
Permalink
Post by Manuel Bouyer
Post by Dennis Ferguson
[...]
You are assuming the above somehow applied to Intel CPUs which existed
in 2004, but that assumption is incorrect. There were no Intel (or AMD)
CPUs which worked like that in 2004, since post-2007 manuals document the
ordering behavior of all x86 models from the 386 forward, and explicitly
says that none of them have reordered reads, so the above could only a
statement of what they expected future CPUs might do and not what
they actually did.
This is clearly not my experience. I can say for sure that without lfence
instructions, the xen front/back drivers are not working properly
(and I'm not the only one saying this).
I am very sure that adding lfence() calls to that code fixed it. What I
suspect is that you don't understand why it fixed it, since I'm pretty positive the
original problem couldn't have been an Intel CPU reordering reads from cached
memory. For example if the thing you did to generate the instruction was
either a function call or an `asm volatile ("lfence":::"memory")' it will
have effects beyond just adding the instruction and those effects, rather
than the instruction, might be what mattered.
Post by Manuel Bouyer
Post by Dennis Ferguson
This is clear in the post-2007 revision I have, where the section you quote
It also says that we should not rely on this behavior and, for compatibility
with future processors programmers should use memory barrier instructions
where appropriate.
If you are talking about the last paragraph in 7.2 it doesn't say you should
add memory barrier instructions where they serve no purpose. It says you should
use a memory synchronization API that can be made to do the right thing if
ordering constraints become weaker in future. With current hardware an
'lfence' instruction, while being costly to execute, is very nearly useless
(I've heard it is useful only for write-combining memory), so it makes no
sense for the API to generate it until there are CPUs which need it.
Post by Manuel Bouyer
Anyway, what prompted this discussion is the added bus_dmamap_sync()
- we may be using bounce buffering, and we don't know in which order
the copy to bounce buffer is done
- all the world is not x86.
Same thing. I'm sure the bus_dmamap_sync() (or some bit of API which generates
a barrier instruction on machines that would need it) is required there for some
machines other than the x86, but the fact is that the problem occurred on an x86
and a read barrier instruction by itself isn't fixing any problem there
(though apparently the compiler barrier that comes along with that might do
the trick in this case).

Dennis Ferguson
David Laight
2011-11-24 19:24:26 UTC
Permalink
Post by Dennis Ferguson
Post by Manuel Bouyer
This is clearly not my experience. I can say for sure that without lfence
instructions, the xen front/back drivers are not working properly
(and I'm not the only one saying this).
I am very sure that adding lfence() calls to that code fixed it. What I
suspect is that you don't understand why it fixed it, since I'm pretty positive the
original problem couldn't have been an Intel CPU reordering reads from cached
memory. For example if the thing you did to generate the instruction was
either a function call or an `asm volatile ("lfence":::"memory")' it will
have effects beyond just adding the instruction and those effects, rather
than the instruction, might be what mattered.
The change also separated the two reads by a lot more C code - which
would in itself change the timings.

I do remember some docs going way back to the early pentium days
that implied that some reads/writes might happen out of sequence.
But those rules would actually have broken a lot of legacy code
- so were probably never actually implemented.

Some of the 'fence' instructions might be required for correct
sequencing between cached and uncached operations - eg to
ensure that a write to cached memory is snoopable before an IO
write is seen.

IIRC the code in question shouldn't need any kind of barrier
- provided the compiler generated the reads in the correct order
which I believe it is required to do for volatile data.

I'd look at the order of the reads in the faulty code (I think they've
been noted to be reversed), then add 'asm volatile ("":::"memory")'
between them (without changing anything else), recompile and retest.
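
(The proposed experiment, sketched against the descriptor reads quoted
earlier in the thread; this is the diagnostic change only, not a suggested
fix.)

    status = sc->sc_rxdescs[i].wrx_status;
    __asm volatile("" ::: "memory");    /* compiler barrier only:
                                         * no fence instruction */
    errors = sc->sc_rxdescs[i].wrx_errors;
    len = le16toh(sc->sc_rxdescs[i].wrx_len);
    vlantag = sc->sc_rxdescs[i].wrx_special;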

David
--
David Laight: ***@l8s.co.uk

David Young
2011-12-31 18:44:45 UTC
Permalink
Post by Manuel Bouyer
Post by Manuel Bouyer
Post by Dennis Ferguson
[...]
You are assuming the above somehow applied to Intel CPUs which existed
in 2004, but that assumption is incorrect. There were no Intel (or AMD)
CPUs which worked like that in 2004, since post-2007 manuals document the
ordering behavior of all x86 models from the 386 forward, and explicitly
says that none of them have reordered reads, so the above could only a
statement of what they expected future CPUs might do and not what
they actually did.
This is clearly not my experience. I can say for sure that without lfence
instructions, the xen front/back drivers are not working properly
(and I'm not the only one saying this).
To be more specific: on linux, rmb() is *not* a simple compiler barrier,
it's either lock; addl $0,0(%%esp) or lfence depending on CPU
target.
smp_rmb() is defined to either barrier() (a compiler barrier) or
rmb() when compiled with CONFIG_X86_PPRO_FENCE option.
Post by Manuel Bouyer
Post by Dennis Ferguson
This is clear in the post-2007 revision I have, where the section you quote
It also says that we should not rely on this behavior and, for compatibility
with future processors programmers should use memory barrier instructions
where appropriate.
Anyway, what prompted this discussion is the added bus_dmamap_sync()
- we may be using bounce buffering, and we don't know in which order
the copy to bounce buffer is done
- all the world is not x86.
Also, the Intel manual specifies what happens between CPUs, it doesn't
says what happens when main memory is written to by a DMA device.
The ordering question seems to be settled for write-back cached memory
that is accessed exclusively by one or more Intel CPUs: reads will not
be re-ordered on any past or present Intel CPU, so no memory barriers
are necessary to protect against said re-ordering.

I still have some doubts that the same rules apply to memory that is
uncached, write-through cached, or else memory that is write-back cached
but shared with bus-mastering peripherals. Perhaps that is the reason
for the fence instructions to exist?

(BTW, recently I read a NetBSD kernel profile where x86_mfence and
x86_lfence together accounted for 5% of the kernel run time. That seems
like an awful lot of time to spend on those barriers if they really are
unnecessary!)

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

David Young
2011-12-31 18:09:59 UTC
Permalink
Post by Manuel Bouyer
Post by Greg Oster
Hi Manuel
Have there been issues with these patches that prevent them from being
applied to -current and/or pulled up?
Nothing wrong AFAIK, I just got distracted.
Post by Greg Oster
(says he who saw reasonably
abysmal network performance this weekend on a bunch of machines with
wm0 in use, and would like to see some better performance.)
But it would be good to test these patches on different hardware
or with different applications :)
The changes to interrupt-mitigation made performance worse with the
82573: the CPU saturates at a lower packet rate with the patch than
without.

It looks to me like wm(4) tries to throttle receive interrupts using two
different methods simultaneously. One method is RDTR + RADV, and the
other is ITR. It seems to me that the driver should pick one method
based on the product and use that method exclusively.

In the manual for the 82573, Intel recommends ITR instead of RDTR+RADV
(as opposed to ITR in addition to RDTR+RADV). I think that we should
follow the manual.

I see that wm(4) tries to throttle transmit interrupts, too. I think
that the driver would be simpler, and it would perform no worse, if it
did not request transmit interrupts. Just reclaim transmitted buffers
in the if_start routine and in an if_drain routine.

Dave
--
David Young
***@pobox.com Urbana, IL (217) 721-9981

David Laight
2012-01-01 17:20:17 UTC
Permalink
Post by David Young
I see that wm(4) tries to throttle transmit interrupts, too. I think
that the driver would be simpler, and it would perform no worse, if it
did not request transmit interrupts. Just reclaim transmitted buffers
in the if_start routine and in an if_drain routine.
We used to request 'end of transmit' interrupts under two conditions:
1) Transmit ring full - packet queued until space available.
2) Some 'buffer loan' conditions, in particular the SYSV NFS stack
wouldn't transmit another packet until the callback activated
when an earlier packet was freed.
Dunno if netbsd has any similar code.

Most of my ethernet drivers would copy tx packets into a dedicated
tx buffer area - particularly beneficial on systems with a real iommu
(think sparc sbus), those with memory the ethernet chip couldn't
directly access, packets with a large number of buffer fragments,
chipsets that required the first fragment to be at least 100 bytes
(amd lance - so it could refetch after collision).

David
--
David Laight: ***@l8s.co.uk

Manuel Bouyer
2011-11-21 21:11:17 UTC
Permalink
Post by Greg Oster
Hi Manuel
Have there been issues with these patches that prevent them from being
applied to -current and/or pulled up?
Nothing wrong AFAIK, I just got distracted.
Post by Greg Oster
(says he who saw reasonably
abysmal network performance this weekend on a bunch of machines with
wm0 in use, and would like to see some better performance.)
But it would be good to test these patches on different hardware
or with different applications :)
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

John R. Shannon
2011-11-21 22:03:17 UTC
Permalink
Post by Manuel Bouyer
Post by Greg Oster
Hi Manuel
Have there been issues with these patches that prevent them from being
applied to -current and/or pulled up?
Nothing wrong AFAIK, I just got distracted.
Post by Greg Oster
(says he who saw reasonably
abysmal network performance this weekend on a bunch of machines with
wm0 in use, and would like to see some better performance.)
But it would be good to test these patches on different hardware
or with different applications :)
I've been running the patch on two NetBSD 5.1_STABLE machines with the following
controllers:
- Intel i82574L
- Intel i82546GB
built into the motherboards. I've experienced no problems with them
after applying the patches.
--
John R. Shannon
DSCI

Manuel Bouyer
2011-11-21 22:19:57 UTC
Permalink
Post by John R. Shannon
Post by Manuel Bouyer
Post by Greg Oster
Hi Manuel
Have there been issues with these patches that prevent them from being
applied to -current and/or pulled up?
Nothing wrong AFAIK, I just got distracted.
Post by Greg Oster
(says he who saw reasonably
abysmal network performance this weekend on a bunch of machines with
wm0 in use, and would like to see some better performance.)
But it would be good to test these patches on different hardware
or with different applications :)
I've been running the patch on two NetBSD 5.1_STABLE with the
- Intel i82574L
- Intel i82546GB
built into the motherboards. I've experienced no problems with them
after applying the patches.
And did you experience better performance?
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

Manuel Bouyer
2011-10-25 13:05:08 UTC
Permalink
Post by David Laight
I've seen Linux stacks defer sending an ACK until the next? kernel
clock tick. This will reduce the ACK count somewhat.
In my case it caused problems with 'slow start' at the other end
(which was also Linux).
In this case NetBSD sends nearly one ACK for every data packet received.
I suspect this is what is causing the performance issues.
Without deferring too much we can probably be clever here.

I also suspect it's a bug somewhere, because in other conditions NetBSD
doesn't send that many ACKs. It seems that glusterfs is using
SO_SNDBUF/SO_RCVBUF; I wonder if this is what is causing this behavior.
Both are set to 512KB.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

Greg Troxel
2011-10-25 16:58:29 UTC
Permalink
Looking at the trace you provided, I am mostly seeing correct
every-other ack behavior. I continue to wonder if the bad pcap trace is
masking something else. Try setting net.bpf.maxbufsize larger, but I am
still not used to seeing 0-len captures even if packets are dropped.

In counting packets, I concur that something seems wrong. But I am
unable to find much fine-grained oddness.

Big buffers should not be an issue.
Matthias Scheler
2011-10-22 14:21:34 UTC
Permalink
Post by Manuel Bouyer
I've been playing with glusterfs a bit today, and found some performance
differences between NetBSD and linux, which I tracked down to our TCP
stack. Basically, between a NetBSD/linux pair, performances are much
better than between 2 NetBSD hosts.
What kind of network interfaces have these machines? And are you really
using NetBSD 5.1 or NetBSD 5.1_STABLE? Have you enabled hardware
checksums under NetBSD? Linux turns them on by default.

If you use NetBSD 5.1 and the machines have bge(4) interfaces I wouldn't
be surprised that you get bad performance. An update to NetBSD 5.1_STABLE
should help in that case.

Kind regards
--
Matthias Scheler http://zhadum.org.uk/

Manuel Bouyer
2011-10-22 14:55:17 UTC
Permalink
Post by Matthias Scheler
Post by Manuel Bouyer
I've been playing with glusterfs a bit today, and found some performance
differences between NetBSD and linux, which I tracked down to our TCP
stack. Basically, between a NetBSD/linux pair, performances are much
better than between 2 NetBSD hosts.
What kind of network interfaces have these machines?
wm0 at pci4 dev 0 function 0: i80003 dual 1000baseT Ethernet, rev. 1
wm0: interrupting at ioapic0 pin 18
wm0: PCI-Express bus
wm0: 65536 word (16 address bits) SPI EEPROM
wm0: Ethernet address 00:30:48:32:13:10
ikphy0 at wm0 phy 1: i82563 10/100/1000 media interface, rev. 2
Post by Matthias Scheler
And are you really
using NetBSD 5.1 or NetBSD 5.1_STABLE?
NetBSD 5.1_STABLE/amd64
Build date Wed Oct 5 16:51:41 UTC 2011
Built by ***@b6.netbsd.org

BUILDID = '201110051530Z'
DESTDIR = '/home/builds/ab/netbsd-5/amd64/201110051530Z-dest'
Post by Matthias Scheler
Have you enable hardware
checksums under NetBSD? Linux turns them on by default.
Yes, without hardware checksums and tso4, performance is much worse.

Note that both NetBSD hosts have good performance against the linux host.
The problem is between the 2 NetBSD hosts. The 3 boxes have strictly
identical hardware.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

Thor Lancelot Simon
2011-10-22 18:43:05 UTC
Permalink
Post by Matthias Scheler
Post by Manuel Bouyer
I've been playing with glusterfs a bit today, and found some performance
differences between NetBSD and linux, which I tracked down to our TCP
stack. Basically, between a NetBSD/linux pair, performances are much
better than between 2 NetBSD hosts.
What kind of network interfaces have these machines? And are you really
using NetBSD 5.1 or NetBSD 5.1_STABLE? Have you enable hardware
checksums under NetBSD? Linux turns them on by default.
Not sure this is really relevant -- Manuel has pointed out what seems
like it must be a serious issue with the TCP stack, namely its generation
of huge numbers of ACKs for data flows where the Linux stack ACKs about
as often as one would expect.

Thor

Manuel Bouyer
2011-10-22 18:50:24 UTC
Permalink
Post by Thor Lancelot Simon
Post by Matthias Scheler
Post by Manuel Bouyer
I've been playing with glusterfs a bit today, and found some performance
differences between NetBSD and linux, which I tracked down to our TCP
stack. Basically, between a NetBSD/linux pair, performances are much
better than between 2 NetBSD hosts.
What kind of network interfaces have these machines? And are you really
using NetBSD 5.1 or NetBSD 5.1_STABLE? Have you enable hardware
checksums under NetBSD? Linux turns them on by default.
Not sure this is really relevant -- Manuel has pointed out what seems
like it must be a serious issue with the TCP stack, namely its generation
of huge numbers of ACKs for data flows where the Linux stack ACKs about
as often as one would expect.
I think this was in a private mail to Greg Troxel, but I double-checked
this watching port counters on the cisco switch, and from the
output of netstat -s on both hosts after a clean boot and a single run of
the test.
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--
