Discussion:
NetBSD in BSD Router / Firewall Testing
Hubert Feyrer
2006-11-30 23:06:17 UTC
[adding tech-net@ as I don't really know what to answer...

Context: adding NetBSD in the benchmark at
http://www.tancsa.com/blast.html, with the wm(4) driver in
-current, as it's not available in 3.1]


On Thu, 30 Nov 2006, Mike Tancsa wrote:
> Gave it a try and I posted the results on the web page. The Intel driver
> doesn't seem to work too well. Is there debugging in this kernel?

That indeed doesn't sound so great. I do not know about the wm(4) driver,
but maybe someone on tech-net@ (CC:d) has an idea. IIRC that's with a
-current (HEAD) GENERIC kernel and the wm(4) driver, while the bge(4) driver
works ok.

What I wonder is: how does the bge(4) driver perform under -current? Do
you have numbers for that? (Just to make sure it's not -current that's
hosed.)


- Hubert

Thor Lancelot Simon
2006-11-30 23:49:04 UTC
On Fri, Dec 01, 2006 at 12:06:17AM +0100, Hubert Feyrer wrote:
>
> [adding tech-net@ as I don't really know what to answer...
>
> Context: adding NetBSD in the benchmark at
> http://www.tancsa.com/blast.html, with the wm(4) driver in
> -current, as it's not available in 3.1]
>
>
> On Thu, 30 Nov 2006, Mike Tancsa wrote:
> >Gave it a try and I posted the results on the web page. The Intel driver
> >doesnt seem to work too well. Is there debugging in this kernel ?
>
> That sounds indeed not so bright. I do not know about the wm(4) driver,
> but maybe someone on tech-net@ (CC:d) has an idea. IIRC that's with a
> -current (HEAD) GENERIC kernel and the wm(4) driver, while bge(4) driver
> works ok.

There are some severe problems with the test configuration.

1) The published test results freely mix configurations where the switch
   applies and removes the vlan tags with configurations where the host
   does so. This is not a good idea:

   1) The efficiency of the switch itself will differ in these configurations
   2) The difference in frame size will actually measurably impact the PPS.
   3) One of the device drivers you're testing doesn't do hardware VLAN
      tag insertion/removal in NetBSD due to a bug (wm). Obviously, this
      one is our fault, not yours.

2) The NetBSD kernels you're testing don't have options GATEWAY, so they
   don't have the fastroute code.

3) There is a problem with autonegotiation either on your switch, on the
   particular wm adapter you're using, or in NetBSD -- there's not quite
   enough data to tell which. But look at the number of input errors on
   the wm adapter in your test with NetBSD-current: it's 3 million. This
   alone is probably responsible for most of the performance difference
   between the wm and bge test cases with NetBSD kernels (and the hardware
   vlan support in the bge driver may be responsible for the rest).

4) You don't appear to be using hardware IP checksum offload. You're going
   to have trouble turning this on with a mismatched kernel and ifconfig
   executable, however. :-/

With these fixed, we can probably help diagnose any remaining issues
pretty quickly.
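
[For concreteness, roughly what addressing points 2) and 4) above would
involve on the NetBSD side -- a sketch only: the kernel config name ROUTER
is just an example, and the checksum flags assume the kernel and the
ifconfig binary are back in sync:

    # sys/arch/i386/conf/ROUTER (example name), then config/build as usual
    include "arch/i386/conf/GENERIC"
    options         GATEWAY         # IP forwarding + the fastroute code

    # once kernel and userland match, enable IPv4 checksum offload
    ifconfig wm0 ip4csum
    ifconfig wm1 ip4csum
]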

Thor

Mike Tancsa
2006-12-01 00:41:45 UTC
At 06:49 PM 11/30/2006, Thor Lancelot Simon wrote:

>There are some severe problems with the test configuration.
>
>1) The published test results freely mix configurations where the switch
> applies and removes the vlan tags with configurations where the host
> does so. This is not a good idea:

Hi,
The switch is always involved. The ports are only in trunk mode for
the trunking tests. Otherwise, it's switchport access. The same
"limitations" apply to all tested configurations. When I swapped in
a faster CPU briefly, I was seeing rates of +1Mpps on RELENG_4 with
no dropped packets and no firewall in the kernel. That's the same
hardware, so I am not sure how it's suddenly inadequate hardware for
the NetBSD tests.


> 1) The efficiency of the switch itself will differ in these configurations

Why ? The only thing being changed from test to test is the OS.


> 2) The difference in frame size will actually measurably impact the PPS.

Frame size is always the same: a UDP packet with a 10-byte payload. The
generators are the same devices all the time. I am not using
different frame sizes for different setups to try and make some things
look good and other things bad.



> 3) One of the device drivers you're testing doesn't do hardware VLAN
> tag insertion/removal in NetBSD due to a bug (wm). Obviously, this
> one is our fault, not yours.

When I did the wm tests, this was just plugged into the switch with a
port-based VLAN (the Cisco equivalent of switchport access). There was no
trunking going on.


>2) The NetBSD kernels you're testing don't have options GATEWAY, so they
> don't have the fastroute code.

Like I said to Hubert, I am not a NetBSD person and just did the
default install. I am happy to re-test with a more appropriate kernel
config. FreeBSD and dfly both have fastforward (I am guessing it is like
your fastroute) as a sysctl tunable, so I could add that to the kernel...
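
[If it helps, the FreeBSD-side knob referred to here is just a sysctl;
assuming the name hasn't changed recently, enabling it looks like:

    # FreeBSD: enable the optional IP fast-forwarding path
    sysctl net.inet.ip.fastforwarding=1

The NetBSD counterpart is the options GATEWAY fastroute code mentioned
above.]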



>3) There is a problem with autonegotiation either on your switch, on the
> particular wm adapter you're using, or in NetBSD -- there's not quite
> enough data to tell which. But look at the number of input errors on
> the wm adapter in your test with NetBSD-current: it's 3 million. This
> alone is probably responsible for most of the performance difference

.... Or the kernel just was not able to forward fast enough. FYI, the
switch properly negotiated with all the other OSes tested, nor were
there errors on the switchport.

---Mike


Thor Lancelot Simon
2006-12-01 02:43:22 UTC
On Thu, Nov 30, 2006 at 07:41:45PM -0500, Mike Tancsa wrote:
> At 06:49 PM 11/30/2006, Thor Lancelot Simon wrote:
>
> > 1) The efficiency of the switch itself will differ in these
> > configurations
>
> Why ? The only thing being changed from test to test is the OS.

Because the switch hardware does not forward packets at the same rate
when it is inserting and removing VLAN tags as it does when it's not.
The effect will be small, but measurable.

> > 2) The difference in frame size will actually measurably impact the PPS.
>
> Framesize is always the same. UDP packet with a 10byte payload.

No. The Ethernet packets with the VLAN tag on them are not, in fact,
the same size as those without it; and for a packet as small as a 10
byte UDP packet, this will make quite a large difference if you actually
have a host that can inject packets at anywhere near wire speed.

> generators are the same devices all the time. I am not using
> different frame sizes for different setups to try and make something
> look good and other things bad.

I didn't say that you were, just to be clear. But that does not mean
that running some tests with tagging turned on, and others not, is
good benchmarking practice: you should run the exact same set of tests
for all host configurations, because doing otherwise yields distorted
results.

> >3) There is a problem with autonegotiation either on your switch, on the
> > particular wm adapter you're using, or in NetBSD -- there's not quite
> > enough data to tell which. But look at the number of input errors on
> > the wm adapter in your test with NetBSD-current: it's 3 million. This
> > alone is probably responsible for most of the performance difference
>
> .... Or the kernel just was not able forward fast enough.

No; that will simply not cause the device driver to report an input
error, whereas your netstat output shows that it reported three *million*
of them. Something is wrong at the link layer. It could be in the NetBSD
driver for the Intel gigabit PHY, but there's not enough data in your
report to be sure. FWIW, I work for a server load balancer vendor that
ships a FreeBSD-based product, and I consequently do a lot of load testing.
Even with tiny UDP packets, I get better forwarding performance from
basically _every_ OS you tested than you seem to, which is why I think
there's something that's not quite right with your test rig. I am just
doing my best to point out the first things that come to mind when I look
at the data you've put online.

I note that you snipped the text where I noted that because you're
testing the wm card with mismatched kernel and ifconfig, you're not
using its hardware checksum offload. That's one thing you should
definitely fix, and if you don't have that turned on for other
kernels you're testing, of course you should probably fix it there too.

--
Thor Lancelot Simon ***@rek.tjls.com
"The liberties...lose much of their value whenever those who have greater
private means are permitted to use their advantages to control the course
of public debate." -John Rawls

Mike Tancsa
2006-12-01 03:15:04 UTC
At 09:43 PM 11/30/2006, Thor Lancelot Simon wrote:
>On Thu, Nov 30, 2006 at 07:41:45PM -0500, Mike Tancsa wrote:
> > At 06:49 PM 11/30/2006, Thor Lancelot Simon wrote:
> >
> > > 1) The efficiency of the switch itself will differ in these
> > > configurations
> >
> > Why ? The only thing being changed from test to test is the OS.
>
>Because the switch hardware does not forward packets at the same rate
>when it is inserting and removing VLAN tags as it does when it's not.
>The effect will be small, but measurable.


But the same impact will hurt *all* the OSes tested equally, not just
NetBSD. Besides, supposedly the switch is rated to 17Mpps. No doubt
there is a bit of vendor exaggeration, but I doubt they would stretch
the number by a factor of 10. Still, even if they did, I would not have
been able to push over 1Mpps on my RELENG_4 setup.



> > > 2) The difference in frame size will actually measurably
> impact the PPS.
> >
> > Framesize is always the same. UDP packet with a 10byte payload.
>
>No. The Ethernet packets with the VLAN tag on them are not, in fact,

I did both sets of tests. E.g. the line

RELENG_6 UP i386 FastFWD Polling

means that em0 was in the equiv of 0/4 and em1 in 0/5, with port 0/4
 switchport access 44
and port 0/5
 switchport access 88

whereas the test

RELENG_4, FastFWD, vlan44 and vlan88 off single int, em1 Polling, HZ=2000

has a switch config of
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 44,88
 switchport mode trunk

on port 5.

I tested the NetBSD 3.1 bge NIC against the second (trunked) config. So I
don't see why you can't compare the results of that to

HEAD, FastFWD, vlan44 and vlan88 off single int, em1 (Nov 24th sources)
RELENG_4, FastFWD, vlan44 and vlan88 off single int, em1 Polling, HZ=2000
HEAD, FastFWD, vlan44 and vlan88 off single int, bge0 (Nov 24th sources)
RELENG_6, FastFWD, INTR_FAST, vlan44 and vlan44 off single int, em1

which had the exact same switch config.


>the same size as those without it; and for a packet as small as a 10
>byte UDP packet, this will make quite a large difference if you actually
>have a host that can inject packets at anywhere near wire speed.

That's why I use at least 2...


> > generators are the same devices all the time. I am not using
> > different frame sizes for different setups to try and make something
> > look good and other things bad.
>
>I didn't say that you were, just to be clear. But that does not mean
>that running some tests with tagging turned on, and others not, is
>good benchmarking practice: you should run the exact same set of tests
>for all host configurations, because doing otherwise yields distorted
>results.

I did where I could. I am not saying compare the trunking performance
of NetBSD to the non-trunking performance of FreeBSD. I am looking
at trunking to trunking, non-trunking to non-trunking. I did the
majority of my testing with the Intel PCIe dual-port card, which
NetBSD 3.1 does not support. So, since I had some bge tests, I ran
the bge tests in vlan mode, and I don't see why you can't compare that
to vlan mode on FreeBSD using the same bge card. It's the exact same
switch config for both sets of tests, and the same traffic generators,
so I don't see why it's not a valid comparison.




> > >3) There is a problem with autonegotiation either on your switch, on the
> > > particular wm adapter you're using, or in NetBSD -- there's not quite
> > > enough data to tell which. But look at the number of input errors on
> > > the wm adapter in your test with NetBSD-current: it's 3 million. This
> > > alone is probably responsible for most of the performance difference
> >
> > .... Or the kernel just was not able forward fast enough.
>
>No; that will simply not cause the device driver to report an input
>error, whereas your netstat output shows that it reported three *million*
>of them. Something is wrong at the link layer. It could be in the NetBSD
>driver for the Intel gigabit PHY, but there's not enough data in your
>report to be sure. FWIW, I work for a server load balancer vendor that
>ships a FreeBSD-based product, and I consequently do a lot of load testing.
>Even with tiny UDP packets, I get better forwarding performance from
>basically _every_ OS you tested than you seem to, which is why I think
>there's something that's not quite right with your test rig. I am just
>doing my best to point out the first things that come to mind when I look
>at the data you've put online.

Stock FreeBSD, or modified FreeBSD? With RELENG_4 I can push over
1Mpps. All of the test setups I used saw input errors when I tried
to push too many packets through the box. I really don't know much
about NetBSD, but it too will have some sort of limit as to how much
it can forward. Once its limit is hit, how does it report that?
Does it just silently drop the packet? Or does it show up as an
input error?




>I note that you snipped the text where I noted that because you're
>testing the wm card with mismatched kernel and ifconfig, you're not
>using its hardware checksum offload. That's one thing you should
>definitely fix, and if you don't have that turned on for other
>kernels you're testing, of course you should probably fix it there too.

It didn't seem to make much difference on FreeBSD (i.e. turning hardware
checksums on or off for routing performance), but I will see if I can
get the box rebuilt to sync the base with the kernel.

---Mike


Mike Tancsa
2006-12-01 06:06:24 UTC
At 09:43 PM 11/30/2006, Thor Lancelot Simon wrote:

>I note that you snipped the text where I noted that because you're
>testing the wm card with mismatched kernel and ifconfig, you're not
>using its hardware checksum offload. That's one thing you should
>definitely fix, and if you don't have that turned on for other
>kernels you're testing, of course you should probably fix it there too.

OK, I updated the base as well and rebuilt the kernel. There doesn't
seem to be much difference, perhaps +5Kpps by turning it on. But it
seems to be the driver, as I get FAR better results with the bge NIC
(see below).


# ifconfig wm0
wm0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
capabilities=7ff80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx,TCP6CSUM_Tx,UDP6CSUM_Rx,UDP6CSUM_Tx,TSO6>
enabled=300<IP4CSUM_Rx,IP4CSUM_Tx>
address: 00:15:17:0b:70:98
media: Ethernet autoselect (1000baseT
full-duplex,flowcontrol,rxpause,txpause)
status: active
inet 192.168.88.223 netmask 0xffffff00 broadcast 192.168.88.255
inet6 fe80::215:17ff:fe0b:7098%wm0 prefixlen 64 scopeid 0x5
# ifconfig wm1
wm1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
capabilities=7ff80<TSO4,IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx,UDP4CSUM_Rx,UDP4CSUM_Tx,TCP6CSUM_Rx,TCP6CSUM_Tx,UDP6CSUM_Rx,UDP6CSUM_Tx,TSO6>
enabled=300<IP4CSUM_Rx,IP4CSUM_Tx>
address: 00:15:17:0b:70:99
media: Ethernet autoselect (1000baseT
full-duplex,flowcontrol,rxpause,txpause)
status: active
inet 192.168.44.223 netmask 0xffffff00 broadcast 192.168.44.255
inet6 fe80::215:17ff:fe0b:7099%wm1 prefixlen 64 scopeid 0x6
# netstat -ni
Name  Mtu   Network         Address               Ipkts   Ierrs    Opkts  Oerrs  Colls
wm0   1500  <Link>          00:15:17:0b:70:98  32226898  281780       15      0      0
wm0   1500  192.168.88/24   192.168.88.223     32226898  281780       15      0      0
wm0   1500  fe80::/64       fe80::215:17ff:fe  32226898  281780       15      0      0
wm1   1500  <Link>          00:15:17:0b:70:99        34       0  7117358      0      0
wm1   1500  192.168.44/24   192.168.44.223           34       0  7117358      0      0
wm1   1500  fe80::/64       fe80::215:17ff:fe        34       0  7117358      0      0


There are no errors on the switchport.

And an SMP kernel, which is GENERIC.MP with

options         GATEWAY         # packet forwarding




NetBSD 4.99.4 (ROUTER) #1: Thu Nov 30 19:23:52 EST 2006
***@r2-netbsd.sentex.ca:/usr/obj/sys/arch/i386/compile/ROUTER
total memory = 2047 MB
avail memory = 2002 MB
timecounter: Timecounters tick every 10.000 msec
timecounter: Timecounter "i8254" frequency 1193182 Hz quality 100
BIOS32 rev. 0 found at 0xf21a0
mainbus0 (root)
mainbus0: Intel MP Specification (Version 1.4) (OEM00000 PROD00000000)
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: AMD Unknown K7 (Athlon) (686-class), 2015.10 MHz, id 0x20fb1
cpu0: features f7dbfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features f7dbfbff<PGE,MCA,CMOV,PAT,PSE36,MPC,NOX,MMXX,MMX>
cpu0: features f7dbfbff<FXSR,SSE,SSE2,HTT,LONG,3DNOW2,3DNOW>
cpu0: features2 1<SSE3>
cpu0: "AMD Athlon(tm) 64 X2 Dual Core Processor 3800+"
cpu0: I-cache 64 KB 64B/line 2-way, D-cache 64 KB 64B/line 2-way
cpu0: L2 cache 512 KB 64B/line 16-way
cpu0: ITLB 32 4 KB entries fully associative, 8 4 MB entries fully associative
cpu0: DTLB 32 4 KB entries fully associative, 8 4 MB entries fully associative
cpu0: AMD Power Management features: f<TTP,VID,FID,TS>
cpu0: calibrating local timer
cpu0: apic clock running at 201 MHz
cpu0: 8 page colors
cpu1 at mainbus0: apid 1 (application processor)
cpu1: starting
cpu1: AMD Unknown K7 (Athlon) (686-class), 2015.00 MHz, id 0x20fb1
cpu1: features f7dbfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu1: features f7dbfbff<PGE,MCA,CMOV,PAT,PSE36,MPC,NOX,MMXX,MMX>
cpu1: features f7dbfbff<FXSR,SSE,SSE2,HTT,LONG,3DNOW2,3DNOW>
cpu1: features2 1<SSE3>
cpu1: "AMD Athlon(tm) 64 X2 Dual Core Processor 3800+"
cpu1: I-cache 64 KB 64B/line 2-way, D-cache 64 KB 64B/line 2-way
cpu1: L2 cache 512 KB 64B/line 16-way
cpu1: ITLB 32 4 KB entries fully associative, 8 4 MB entries fully associative
cpu1: DTLB 32 4 KB entries fully associative, 8 4 MB entries fully associative
cpu1: AMD Power Management features: f<TTP,VID,FID,TS>
mpbios: bus 0 is type PCI
mpbios: bus 1 is type PCI
mpbios: bus 2 is type PCI
mpbios: bus 3 is type PCI
mpbios: bus 4 is type PCI
mpbios: bus 5 is type PCI
mpbios: bus 6 is type ISA
ioapic0 at mainbus0 apid 2 (I/O APIC)
ioapic0: pa 0xfec00000, version 11, 24 pins
ioapic0: misconfigured as apic 0
ioapic0: remapped to apic 2
pci0 at mainbus0 bus 0: configuration mode 1
pci0: i/o space, memory space enabled, rd/line, rd/mult, wr/inv ok
NVIDIA nForce4 Memory Controller (miscellaneous memory, revision
0xa3) at pci0 dev 0 function 0 not configured
pcib0 at pci0 dev 1 function 0
pcib0: NVIDIA product 0x0050 (rev. 0xa3)
NVIDIA nForce4 SMBus (SMBus serial bus, revision 0xa2) at pci0 dev 1
function 1 not configured
viaide0 at pci0 dev 6 function 0
viaide0: NVIDIA nForce4 IDE Controller (rev. 0xf2)
viaide0: bus-master DMA support present
viaide0: primary channel configured to compatibility mode
viaide0: primary channel interrupting at ioapic0 pin 14 (irq 14)
atabus0 at viaide0 channel 0
viaide0: secondary channel configured to compatibility mode
viaide0: secondary channel interrupting at ioapic0 pin 15 (irq 15)
atabus1 at viaide0 channel 1
ppb0 at pci0 dev 9 function 0: NVIDIA nForce4 PCI Host Bridge (rev. 0xa2)
pci1 at ppb0 bus 5
pci1: i/o space, memory space enabled
vga1 at pci1 dev 8 function 0: ATI Technologies Rage XL (AGP) (rev. 0x65)
wsdisplay0 at vga1 kbdmux 1: console (80x25, vt100 emulation)
wsmux1: connecting to wsdisplay0
nfe0 at pci0 dev 10 function 0: ioapic0 pin 3 (irq 3), address
00:13:d4:ae:9b:6b
makphy0 at nfe0 phy 1: Marvell 88E1111 Gigabit PHY, rev. 2
makphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-FDX, auto
ppb1 at pci0 dev 11 function 0: NVIDIA nForce4 PCIe Host Bridge (rev. 0xa3)
pci2 at ppb1 bus 4
pci2: i/o space, memory space enabled, rd/line, wr/inv ok
bge0 at pci2 dev 0 function 0: Broadcom BCM5751 Gigabit Ethernet
bge0: interrupting at ioapic0 pin 11 (irq 11)
bge0: pcie mode=0x105000
bge0: ASIC BCM5750 A1 (0x4001), Ethernet address 00:10:18:14:15:12
bge0: setting short Tx thresholds
brgphy0 at bge0 phy 1: BCM5750 1000BASE-T media interface, rev. 0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-FDX, auto
ppb2 at pci0 dev 12 function 0: NVIDIA nForce4 PCIe Host Bridge (rev. 0xa3)
pci3 at ppb2 bus 3
pci3: i/o space, memory space enabled, rd/line, wr/inv ok
bge1 at pci3 dev 0 function 0: Broadcom BCM5751 Gigabit Ethernet
bge1: interrupting at ioapic0 pin 10 (irq 10)
bge1: pcie mode=0x105000
bge1: ASIC BCM5750 A1 (0x4001), Ethernet address 00:10:18:14:27:d5
bge1: setting short Tx thresholds
brgphy1 at bge1 phy 1: BCM5750 1000BASE-T media interface, rev. 0
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-FDX, auto
ppb3 at pci0 dev 13 function 0: NVIDIA nForce4 PCIe Host Bridge (rev. 0xa3)
pci4 at ppb3 bus 2
pci4: i/o space, memory space enabled, rd/line, wr/inv ok
bge2 at pci4 dev 0 function 0: Broadcom BCM5751 Gigabit Ethernet
bge2: interrupting at ioapic0 pin 5 (irq 5)
bge2: pcie mode=0x105000
bge2: ASIC BCM5750 A1 (0x4001), Ethernet address 00:10:18:14:38:d2
bge2: setting short Tx thresholds
brgphy2 at bge2 phy 1: BCM5750 1000BASE-T media interface, rev. 0
brgphy2: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-FDX, auto
ppb4 at pci0 dev 14 function 0: NVIDIA nForce4 PCIe Host Bridge (rev. 0xa3)
pci5 at ppb4 bus 1
pci5: i/o space, memory space enabled, rd/line, wr/inv ok
wm0 at pci5 dev 0 function 0: Intel PRO/1000 PT (82571EB), rev. 6
wm0: interrupting at ioapic0 pin 7 (irq 7)
wm0: PCI-Express bus
wm0: 65536 word (16 address bits) SPI EEPROM
wm0: Ethernet address 00:15:17:0b:70:98
igphy0 at wm0 phy 1: Intel IGP01E1000 Gigabit PHY, rev. 0
igphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-FDX, auto
wm1 at pci5 dev 0 function 1: Intel PRO/1000 PT (82571EB), rev. 6
wm1: interrupting at ioapic0 pin 5 (irq 5)
wm1: PCI-Express bus
wm1: 65536 word (16 address bits) SPI EEPROM
wm1: Ethernet address 00:15:17:0b:70:99
igphy1 at wm1 phy 1: Intel IGP01E1000 Gigabit PHY, rev. 0
igphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseT,
1000baseT-FDX, auto
pchb0 at pci0 dev 24 function 0
pchb0: Advanced Micro Devices AMD64 HyperTransport configuration (rev. 0x00)
pchb1 at pci0 dev 24 function 1
pchb1: Advanced Micro Devices AMD64 Address Map configuration (rev. 0x00)
pchb2 at pci0 dev 24 function 2
pchb2: Advanced Micro Devices AMD64 DRAM configuration (rev. 0x00)
pchb3 at pci0 dev 24 function 3
pchb3: Advanced Micro Devices AMD64 Miscellaneous configuration (rev. 0x00)
isa0 at pcib0
com0 at isa0 port 0x3f8-0x3ff irq 4: ns16550a, working fifo
pckbc0 at isa0 port 0x60-0x64
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pms0 at pckbc0 (aux slot)
pckbc0: using irq 12 for aux slot
wsmouse0 at pms0 mux 0
attimer0 at isa0 port 0x40-0x43: AT Timer
pcppi0 at isa0 port 0x61
pcppi0: children must have an explicit unit
midi0 at pcppi0: PC speaker (CPU-intensive output)
sysbeep0 at pcppi0
isapnp0 at isa0 port 0x279: ISA Plug 'n Play device support
npx0 at isa0 port 0xf0-0xff
npx0: reported by CPUID; using exception 16
pcppi0: attached to attimer0
isapnp0: no ISA Plug 'n Play devices found
ioapic0: enabling
timecounter: Timecounter "clockinterrupt" frequency 100 Hz quality 0
Kernelized RAIDframe activated
wd0 at atabus0 drive 0: <ST340014A>
wd0: drive supports 16-sector PIO transfers, LBA48 addressing
wd0: 38166 MB, 77545 cyl, 16 head, 63 sec, 512 bytes/sect x 78165360 sectors
wd0: 32-bit data port
wd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd0(viaide0:0:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
atapibus0 at atabus1: 2 targets
cd0 at atapibus0 drive 1: <AOPEN 8X8 DVD Dual AAN, , 1.4A> cdrom removable
cd0: 32-bit data port
cd0: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 2 (Ultra/33)
wd1 at atabus1 drive 0: <ST340014A>
wd1: drive supports 16-sector PIO transfers, LBA48 addressing
wd1: 38166 MB, 77545 cyl, 16 head, 63 sec, 512 bytes/sect x 78165360 sectors
wd1: 32-bit data port
wd1: drive supports PIO mode 4, DMA mode 2, Ultra-DMA mode 5 (Ultra/100)
wd1(viaide0:1:0): using PIO mode 4, Ultra-DMA mode 5 (Ultra/100) (using DMA)
cd0(viaide0:1:1): using PIO mode 4, Ultra-DMA mode 2 (Ultra/33) (using DMA)
boot device: wd0
root on wd0a dumps on wd0b
root file system type: ffs
cpu1: CPU 1 running
wsdisplay0: screen 1 added (80x25, vt100 emulation)
wsdisplay0: screen 2 added (80x25, vt100 emulation)
wsdisplay0: screen 3 added (80x25, vt100 emulation)
wsdisplay0: screen 4 added (80x25, vt100 emulation)
#



The best I can get is about 125Kpps.


However, if I switch to the 2 bge NICs (i.e. NON-trunked mode), I get
close to 600Kpps on the one stream and a max of 360Kpps when I have
the stream in the opposite direction going as well. This is comparable to
the other boxes. However, the driver did wedge, and I had to ifconfig
it down/up to recover once during testing.

Nov 30 19:36:21 r2-netbsd /netbsd: bge1: pcie mode=0x105000
Nov 30 19:38:00 r2-netbsd /netbsd: bge2: pcie mode=0x105000
Nov 30 19:54:18 r2-netbsd /netbsd: bge: failed on len 52?
Nov 30 19:54:49 r2-netbsd last message repeated 10930 times
Nov 30 19:55:55 r2-netbsd last message repeated 14526 times
Nov 30 19:56:11 r2-netbsd /netbsd: ed on len 52?
Nov 30 19:56:11 r2-netbsd /netbsd: bge: failed on len 52?
Nov 30 19:56:12 r2-netbsd last message repeated 719 times
Nov 30 19:56:20 r2-netbsd /netbsd: ed on len 52?
Nov 30 19:56:20 r2-netbsd /netbsd: bge: failed on len 52?
Nov 30 19:56:21 r2-netbsd last message repeated 717 times

---Mike


Jonathan Stone
2006-12-01 18:49:30 UTC
As sometime principal maintainer of NetBSD's bge(4) driver, and the
author of many of the changes and chip-variant support subsequently
folded into OpenBSD's bge(4) by ***@openbsd.org, I'd like to speak
to a couple of points here.

First point is Thor's comment about variance in frame size due to inserting,
or not inserting, VLAN tags. I've always quietly assumed that
full-duplex Ethernet packets obey the original 10Mbit CSMA/CD minimum
packet length: in case, for example, a small frame is switched onto a
half-duplex link, such as a 100Mbit hub, or 10Mbit coax.

I believe the UDP packets in Mike's tests are all so small that, even
with a VLAN tag added, the Ethernet payload (IPv4 header, UDP header,
10 bytes UDP payload), plus 14-byte Ethernet header, plus 4-byte CRC,
is still less than the ETHER_MIN_MTU. If so, I don't see how
frame size is a factor, since the packets will be padded to the minimum
valid Ethernet payload in any case. OTOH, switch forwarding PPS may
well show a marginal degradation due to VLAN insertion; but we're
still 2 or 3 orders of magnitude away from those limits.
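
[To put rough numbers on that: 14-byte Ethernet header + 20-byte IPv4 header
+ 8-byte UDP header + 10 bytes of payload + 4-byte CRC = 56 bytes, or 60
bytes with a 4-byte 802.1Q tag -- both under the 64-byte Ethernet minimum,
so either way the frame should go out padded to 64 bytes on the wire.]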

Second point: NetBSD's bge(4) driver includes support for runtime
manual tuning of interrupt mitigation. I chose the tuning values
based on empirical measurements of large TCP flows on bcm5700s and bcm5704s.

If my (dimming) memory serves, the default value of 0 yields
thresholds close to Bill Paul's original FreeBSD driver. A value of
1 yields a bge interrupt for every two full-sized Ethernet
frames. Each increment of the sysctl knob will, roughly, halve the receive
interrupt rate, up to a maximum of 5, which interrupts about once every 30
to 40 full-sized TCP segments.

I personally haven't done peak packet-rate measurements with bge(4) in
years. *However*, I can state for a fact that for ttcp-like
workloads, the NetBSD-style interrupt mitigation gives superior
throughput and lower CPU utilization than FreeBSD-6.1. (I have discussed
various measurements privately with Robert Watson, Andre, and Sam Leffler
at length.)

I therefore see very, very good grounds to expect that NetBSD would
show much better performance if you increase bge interrupt mitigation.
However, as interrupt mitigation increases, the lengths of
per-interrupt bursts of packets hitting ipintrq build up by a factor
of 2 for each increment in interrupt level. I typically run ttcp with
BGE interrupt mitigation at 4 or 5, and an ipintrq depth of 512 per
interface (2048 for 4 interfaces). NetBSD-3.1 on a 2.4GHz Opteron can
handle at least 320,000 packets/sec of receive TCP traffic, including
delivering the TCP traffic to userspace. For a tinygram stream, I'd
expect you would need to make ipintrq even deeper.
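
[In concrete terms -- assuming the knob names on your kernel match the ones
shown elsewhere in this thread -- that tuning is just:

    sysctl -w hw.bge.rx_lvl=4                # or 5: more mitigation, bigger bursts
    sysctl -w net.inet.ip.ifq.maxlen=1024    # deepen ipintrq to absorb those bursts
]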

On a related note: each setting of the bge interrupt mitigation "knob"
has two values, one for per-packet limits and one for DMA-segment
limits (essentially, bytes). I'd not be surprised if the per-packet
limits are suboptimal for traffic consisting solely of tinygrams.


That said: I see a very strong philosophical design difference between
FreeBSD's polling machinery, and the interrupt-mitigation approaches
variously implemented by Jason Thorpe in wm(4) and by myself in
bge(4). For the workloads I care about, the design-point tradeoffs in
FreeBSD-4's polling are simply not acceptable. I *want* kernel
softint processing to pre-empt userspace processes, and even
kthreads. I acknowledge that my needs are, perhaps, unusual.

Even so, I'd be glad to work on improving bge(4) tuning for workloads
dominated by tinygrams. The same packet rate as ttcp (over
400kpacket/sec on a 2.4GHz Opteron) seems like an achievable target
--- unless there's a whole lot of CPU processing going on inside
IP-forwarding that I'm wholly unaware of.

At a receive rate of 123Mbyte/sec per bge interface, I see roughly
5,000 interrupts per bge per second. What interrupt rates are you
seeing for each bge device in your tests?


>NetBSD 4.99.4 (ROUTER) #1: Thu Nov 30 19:23:52 EST 2006

[snip dmesg showing Broadcom 5750 NICs; see original for details]


>The best I can get is about 125Kpps
>
>
>However, if I switch to the 2 bge nics (ie NON trunked mode), I get
>close to 600 Kpps on the one stream and a max of 360Kpps when I have
>the stream in the opposite direction going. This is comparable to
>the other boxes. However, the driver did wedge and I had to ifconfig
>down/up it to recover once during testing.


>Nov 30 19:36:21 r2-netbsd /netbsd: bge1: pcie mode=0x105000
>Nov 30 19:38:00 r2-netbsd /netbsd: bge2: pcie mode=0x105000

Oops. Those messages were for my own verification and shouldn't be in
normal builds.

>Nov 30 19:54:18 r2-netbsd /netbsd: bge: failed on len 52?
>Nov 30 19:54:49 r2-netbsd last message repeated 10930 times
>Nov 30 19:55:55 r2-netbsd last message repeated 14526 times
>Nov 30 19:56:11 r2-netbsd /netbsd: ed on len 52?
>Nov 30 19:56:11 r2-netbsd /netbsd: bge: failed on len 52?
>Nov 30 19:56:12 r2-netbsd last message repeated 719 times
>Nov 30 19:56:20 r2-netbsd /netbsd: ed on len 52?
>Nov 30 19:56:20 r2-netbsd /netbsd: bge: failed on len 52?
>Nov 30 19:56:21 r2-netbsd last message repeated 717 times

I've never seen that particular bug. I don't believe I have any actual
5750 chips to try to reproduce it. I do have access to: 5700, 5701,
5705, 5704, 5721, 5752, 5714, 5715, 5780. (I have one machine with one
5752; and the 5780 is one dual-port per HT-2000 chip, which means one
per motherboard. But for most people's purposes, the 5780/5714/5715
are indistinguishable.)

I wonder, does this problem go away if you crank up interrupt mitigation?

Joerg Sonnenberger
2006-12-01 10:49:59 UTC
On Thu, Nov 30, 2006 at 10:15:04PM -0500, Mike Tancsa wrote:
> Stock FreeBSD, or modified FreeBSD ? With RELENG_4 I can push over
> 1Mpps. All of the test setups I used saw input errors when I tried
> to push too many packets through the box. I really dont know much
> about NetBSD but it too will have some sort of limit as to how much
> it can forward. Once its limit is hit, how does it report that
> ? Does it just silently drop the packet ? Or does it show up as an
> input error ?

Input errors are problems of the hardware, not dropped packets (*). Try
"netstat -q" if you want to see whether the queues dropped packets --
otherwise they are processed.

(*) If the RX processing can't keep up, it might signal that as an error, not
sure. That would hint at an interrupt problem though.

Joerg

Mike Tancsa
2006-12-01 14:31:23 UTC
At 06:06 PM 11/30/2006, Hubert Feyrer wrote:

>[adding tech-net@ as I don't really know what to answer...
>
> Context: adding NetBSD in the benchmark at
> http://www.tancsa.com/blast.html, with the wm(4) driver in
> -current, as it's not available in 3.1]
>
>
>On Thu, 30 Nov 2006, Mike Tancsa wrote:
>>Gave it a try and I posted the results on the web page. The Intel
>>driver doesnt seem to work too well. Is there debugging in this kernel ?
>
>That sounds indeed not so bright. I do not know about the wm(4)
>driver, but maybe someone on tech-net@ (CC:d) has an idea. IIRC
>that's with a -current (HEAD) GENERIC kernel and the wm(4) driver,
>while bge(4) driver works ok.
>
>What I wonder is: how does the bge(4) driver perform under -current,
>do you have numbers for that? (Just to make sure it's not -curren that's hosed)

Done and posted. I also looked at netstat -q, and indeed it reports
dropped packets:

# netstat -q
arpintrq:
queue length: 0
maximum queue length: 50
packets dropped: 151
ipintrq:
queue length: 0
maximum queue length: 256
packets dropped: 133721212
ip6intrq:
queue length: 0
maximum queue length: 256
packets dropped: 0
atintrq1:
queue length: 0
maximum queue length: 256
packets dropped: 0
atintrq2:
queue length: 0
maximum queue length: 256
packets dropped: 0
clnlintrq:
queue length: 0
maximum queue length: 256
packets dropped: 0
ppoediscinq:
queue length: 0
maximum queue length: 256
packets dropped: 0
ppoeinq:
queue length: 0
maximum queue length: 256
packets dropped: 0


Steven M. Bellovin
2006-12-01 14:55:06 UTC
On Fri, 01 Dec 2006 09:31:23 -0500
Mike Tancsa <***@sentex.net> wrote:

>
> # netstat -q
> arpintrq:
> queue length: 0
> maximum queue length: 50
> packets dropped: 151

I'm not sure this one matters much in the real world -- I suspect it can
only happen when a large number of addresses are polled in a very short
time. (OTOH, it might happen if a scanning worm was working through
the router.)

> ipintrq:
> queue length: 0
> maximum queue length: 256
> packets dropped: 133721212

This is the second report we've seen recently of packet drops in this
queue. We need to understand what's going on, I think.

--Steve Bellovin, http://www.cs.columbia.edu/~smb

Mike Tancsa
2006-12-01 15:00:09 UTC
At 09:55 AM 12/1/2006, Steven M. Bellovin wrote:

> > ipintrq:
> > queue length: 0
> > maximum queue length: 256
> > packets dropped: 133721212
>
>This is the second report we've seen recently of packet drops in this
>queue. We need to understand what's going on, I think.

Hi,

I am guessing I am just overwhelming the box, no? Each of my
generator boxes is blasting about 600Kpps of 10-byte UDP packets
in opposite directions through the box. Even when doing just the one
stream on NetBSD, the box (r2) acting as the router is totally
unresponsive from the serial console and OOB NIC.
---Mike


Steven M. Bellovin
2006-12-01 16:25:31 UTC
On Fri, 01 Dec 2006 10:00:09 -0500
Mike Tancsa <***@sentex.net> wrote:

> At 09:55 AM 12/1/2006, Steven M. Bellovin wrote:
>
> > > ipintrq:
> > > queue length: 0
> > > maximum queue length: 256
> > > packets dropped: 133721212
> >
> >This is the second report we've seen recently of packet drops in this
> >queue. We need to understand what's going on, I think.
>
> Hi,
>
> I am guessing I am just overwhelming the box no ? Each of my
> generator boxes are blasting about 600Kpps in opposite directions
> through the box 10 byte UDP packets. Even when doing just the one
> stream in NetBSD, the box (r2) acting as the router is totally
> unresponsive from the serial console and OOB NIC.
>
I'd have expected the problem to show as drops on the output queue, not
ipintrq, unless you're running at near-100% CPU. The previous case did
not involve CPU exhaustion -- does yours?

--Steve Bellovin, http://www.cs.columbia.edu/~smb

Mike Tancsa
2006-12-01 16:55:21 UTC
At 11:25 AM 12/1/2006, Steven M. Bellovin wrote:
> >
>I'd have expected the problem to show as drops on the output queue, not
>ipintrq, unless you're running at near-100% CPU. The previous case did
>not involve CPU exhaustion -- does yours?

Hi,

I think it does in this case. As I cannot interact with the box at
the time of testing, it's hard to tell. But if I moderate the blast to
a slower rate, top seems to indicate it's approaching full utilization
for interrupt processing. I am using FreeBSD's
/usr/src/tools/tools/netrate to generate the traffic.

At 100K, interrupt usage gets to 30%

load averages: 0.06, 0.08, 0.08                  up 0 days, 11:22   06:49:02
37 processes: 1 runnable, 35 sleeping, 1 on processor
CPU0 states: 0.0% user, 0.0% nice, 0.0% system, 28.3% interrupt, 71.7% idle
CPU1 states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Memory: 31M Act, 484K Wired, 4100K Exec, 5284K File, 1950M Free
Swap: 128M Total, 128M Free


200K

load averages: 0.13, 0.09, 0.08                  up 0 days, 11:23   06:50:06
38 processes: 1 runnable, 36 sleeping, 1 on processor
CPU0 states: 0.0% user, 0.0% nice, 0.0% system, 50.0% interrupt, 50.0% idle
CPU1 states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Memory: 31M Act, 484K Wired, 4100K Exec, 5300K File, 1950M Free
Swap: 128M Total, 128M Free

As it gets to 450Kpps, the box gets a little sluggish and difficult
to interact with.


load averages: 0.15, 0.11, 0.09                  up 0 days, 11:26   06:53:19
38 processes: 1 runnable, 36 sleeping, 1 on processor
CPU0 states: 0.0% user, 0.0% nice, 0.0% system, 97.2% interrupt, 2.8% idle
CPU1 states: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Memory: 31M Act, 484K Wired, 4100K Exec, 5300K File, 1950M Free
Swap: 128M Total, 128M Free


Thor Lancelot Simon
2006-12-01 16:46:41 UTC
On Fri, Dec 01, 2006 at 09:31:23AM -0500, Mike Tancsa wrote:
>
> ipintrq:
> queue length: 0
> maximum queue length: 256
> packets dropped: 133721212

You may actually be able to get the Broadcom driver to do this, too
(not that you'd *want* to, but it would be an interesting experiment!)
by turning the interrupt moderation all the way up -- in the bge driver,
it can be turned up by hand. I think what's going on is that because
the packets you're sending are so small, the wm driver is dumping more
than a queueful of them onto the network stack at each interrupt; it
is coalescing too aggressively. I will see about adding some
diagnostics for this.

But I think there is _also_ still another issue. Still, let's fix
what we can, while we can.
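
[For reference, a rough sketch of the classic 4.4BSD-style hand-off that is
overflowing here -- simplified, not the literal NetBSD source; the function
name is made up for illustration:

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <net/if.h>
    #include <net/netisr.h>

    /*
     * Each received packet is queued for the IP software interrupt.  Once
     * the fixed-length ipintrq holds ifq_maxlen packets, every further
     * packet in the same interrupt burst is dropped and counted, which is
     * what "packets dropped" under ipintrq in netstat -q shows.
     */
    static void
    example_enqueue_for_ip(struct ifqueue *inq, struct mbuf *m)
    {
            int s = splnet();               /* block network interrupts */

            if (IF_QFULL(inq)) {
                    IF_DROP(inq);           /* bump the drop counter */
                    m_freem(m);
            } else {
                    IF_ENQUEUE(inq, m);
                    schednetisr(NETISR_IP); /* run ipintr() at splsoftnet later */
            }
            splx(s);
    }

So with heavy interrupt mitigation and a 256-entry queue, one interrupt
delivering a few hundred tinygrams can blow straight past the limit.]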

Thor

Mike Tancsa
2006-12-01 17:06:00 UTC
At 11:46 AM 12/1/2006, Thor Lancelot Simon wrote:
>(not that you'd *want* to, but it would be an interesting experiment!)
>by turning the interrupt moderation all the way up -- in the bge driver,
>it can be turned up by hand. I think what's going on is that because


I did a bit of twiddling in the Intel driver under FreeBSD, as it has
a number of tunables when it comes to the moderation rate, but didn't
get very far (results on the web page)... The performance difference
wasn't that great, and I think if I really wanted to tune those values
properly, I should try and generate more realistic traffic
patterns (variable packet sizes and rates). I recall skimming
through an article (way above my head) about how very sophisticated
algorithms are used in NICs that support coalescing, so I am
speculating these algorithms make assumptions about traffic patterns
which my testing probably runs counter to.

---Mike


Jason Thorpe
2006-12-01 17:23:59 UTC
On Dec 1, 2006, at 7:00 AM, Mike Tancsa wrote:

> At 09:55 AM 12/1/2006, Steven M. Bellovin wrote:
>
>> > ipintrq:
>> > queue length: 0
>> > maximum queue length: 256
>> > packets dropped: 133721212
>>
>> This is the second report we've seen recently of packet drops in this
>> queue. We need to understand what's going on, I think.
>
> Hi,
>
> I am guessing I am just overwhelming the box no ? Each of my
> generator boxes are blasting about 600Kpps in opposite directions
> through the box 10 byte UDP packets. Even when doing just the one
> stream in NetBSD, the box (r2) acting as the router is totally
> unresponsive from the serial console and OOB NIC.

I've jumped into this thread late -- what exactly is your
configuration? Are you using IP Filter or PF anywhere in the mix
here? If not, then it would be good to know why IP Fast Forwarding
isn't kicking in here (bypasses the IP input queue completely).

-- thorpej


Jason Thorpe
2006-12-01 17:21:27 UTC
On Nov 30, 2006, at 10:06 PM, Mike Tancsa wrote:

> wm0 1500 <Link> 00:15:17:0b:70:98 32226898 281780
> 15 0 0

That's still a lot of input errors.

There are a few reasons for these to accumulate:

- Receive ring overrun. Unfortunately, the log message for this is
wrapped in #ifdef WM_DEBUG, so you'll need to tweak the driver and
rebuild the kernel to see the log message.

- Failure to allocate a new receive buffer. When this happens, the
received packet is dropped and its buffer recycled. Again,
unfortunately, this has a debug-only kernel printf associated with it.

- The chip reported some sort of error with the packet. It logs
messages in system log for the following:

- symbol error
- receive sequence error
- CRC error

A carrier extension error or a Rx data error could also occur, but
these are not logged.

-- thorpej


Mike Tancsa
2006-12-01 18:07:44 UTC
At 12:23 PM 12/1/2006, Jason Thorpe wrote:

>On Dec 1, 2006, at 7:00 AM, Mike Tancsa wrote:
>
>>At 09:55 AM 12/1/2006, Steven M. Bellovin wrote:
>>
>>> > ipintrq:
>>> > queue length: 0
>>> > maximum queue length: 256
>>> > packets dropped: 133721212
>>>
>>>This is the second report we've seen recently of packet drops in this
>>>queue. We need to understand what's going on, I think.
>>
>>Hi,
>>
>>I am guessing I am just overwhelming the box no ? Each of my
>>generator boxes are blasting about 600Kpps in opposite directions
>>through the box 10 byte UDP packets. Even when doing just the one
>>stream in NetBSD, the box (r2) acting as the router is totally
>>unresponsive from the serial console and OOB NIC.
>
>I've jumped into this thread late -- what exactly is your
>configuration?

Hi,


Details of the test setup at
http://www.tancsa.com/blast.html


>Are you using IP Filter

On NetBSD, enabled and disabled.... But not removed from the kernel


>or PF anywhere in the mix

Only on FreeBSD, but it was far too slow

>here? If not, then it would be good to know why IP Fast Forwarding
>isn't kicking in here (bypasses the IP input queue completely).

I was told options GATEWAY would do it. Perhaps because I am testing
SMP? Don't know. This week was my first experience with NetBSD.

---Mike


Mike Tancsa
2006-12-01 20:34:21 UTC
At 01:49 PM 12/1/2006, Jonathan Stone wrote:

>As sometime principial maintaner of NetBSD's bge(4) driver, and the
>author of many of the changes and chip-variant support subsequently
>folded into OpenBSD's bge(4) by ***@openbsd.org, I'd like to speak
>to a couple of points here.

First off, thanks for the extended insights! This has been a most
interesting exercise for me.



>I beleive the UDP packets in Mike's tests are all so small that, even
>with a VLAN tag added, the Ethernet payload (IPv4 header, UDP header,
>10 bytes UDP payload), plus 14-byte Ethernet header, plus 4-byte CRC,
>is still less than the ETHER_MIN_MTU. If so, I don't see how
>framesize is a factor, since the packets will be padded to the minimum
>valid Ethernet payload in any case. OTOH, Switch forwarding PPS may
>well show a marginal degradation due to VLAN insertion; but we're
>still 2 or 3 orders of magnitude away from those limits.

Unfortunately, my budget is not so high that I can afford to have a
high-end gigE switch in my test area. I started off with a Linksys,
which I managed to hang under moderately high loads. I had an
opportunity to test the Netgear, and it was a pretty reasonable price
(~$650 USD) for what it claims it's capable of (17Mpps). It certainly
hasn't locked up, and I tried putting a bunch of boxes online and
forwarding packets as fast as all 8 of the boxes could, and there
didn't seem to be any ill effects on the switch. Similarly, trunking,
although a bit wonky to configure (I am far more used to Cisco land),
at least works and doesn't seem to degrade overall performance.


>Second point: NetBSD's bge(4) driver includes support for runtime
>manual tuning of interrupt mitigation. I chose the tuning values
>based on empirical measurements of large TCP flows on bcm5700s and bcm5704s.
>
>If my (dimming) memory serves, the default value of 0 yields
>thresh-holds close to Bill Paul's original FreeBSD driver. A value of
>1 yields an bge interrrupt for every two full-sized Ethernet
>frames. Each increment of the sysctl knob will, roughly, halve receive
>interrupt rate, up to a maximum of 5, which interrupts about every 30
>to 40 full-sized TCP segments.

I take it this is it:
# sysctl -d hw.bge.rx_lvl
hw.bge.rx_lvl: BGE receive interrupt mitigation level
# sysctl hw.bge.rx_lvl
hw.bge.rx_lvl = 0
#

With ipf enabled and 10 poorly written rules.

rx_lvl pps

0 219,181
1 229,334
2 280,508
3 328,896
4 333,585
5 346,974


Blasting for 10 seconds with the value set to 5, here is the before
and after for netstat -i and netstat -q after doing
[4600X2-88-176]# ./netblast 192.168.44.1 500 10 10

start: 1165001022.659075049
finish: 1165001032.659352738
send calls: 5976399
send errors: 0
approx send rate: 597639
approx error rate: 0
[4600X2-88-176]#
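
[For reference, netblast's arguments there appear to be destination IP, UDP
port, payload size in bytes, and duration in seconds -- i.e. 10-byte
datagrams blasted for 10 seconds.]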


# netstat -q
arpintrq:
queue length: 0
maximum queue length: 50
packets dropped: 153
ipintrq:
queue length: 0
maximum queue length: 256
packets dropped: 180561075
ip6intrq:
queue length: 0
maximum queue length: 256
packets dropped: 0
atintrq1:
queue length: 0
maximum queue length: 256
packets dropped: 0
atintrq2:
queue length: 0
maximum queue length: 256
packets dropped: 0
clnlintrq:
queue length: 0
maximum queue length: 256
packets dropped: 0
ppoediscinq:
queue length: 0
maximum queue length: 256
packets dropped: 0
ppoeinq:
queue length: 0
maximum queue length: 256
packets dropped: 0
# netstat -i
Name   Mtu    Network          Address                Ipkts   Ierrs      Opkts  Oerrs  Colls
nfe0   1500   <Link>           00:13:d4:ae:9b:6b      38392     584       5517      0      0
nfe0   1500   fe80::/64        fe80::213:d4ff:fe      38392     584       5517      0      0
nfe0   1500   192.168.43/24    192.168.43.222         38392     584       5517      0      0
bge0*  1500   <Link>           00:10:18:14:15:12          0       0          0      0      0
bge1   1500   <Link>           00:10:18:14:27:d5   46026021  489390  213541721      0      0
bge1   1500   192.168.44/24    192.168.44.223      46026021  489390  213541721      0      0
bge1   1500   fe80::/64        fe80::210:18ff:fe   46026021  489390  213541721      0      0
bge2   1500   <Link>           00:10:18:14:38:d2  354347890  255587   19537142      0      0
bge2   1500   192.168.88/24    192.168.88.223     354347890  255587   19537142      0      0
bge2   1500   fe80::/64        fe80::210:18ff:fe  354347890  255587   19537142      0      0
wm0    1500   <Link>           00:15:17:0b:70:98   17816154      72         31      0      0
wm0    1500   fe80::/64        fe80::215:17ff:fe   17816154      72         31      0      0
wm1    1500   <Link>           00:15:17:0b:70:99       1528       0    2967696      0      0
wm1    1500   fe80::/64        fe80::215:17ff:fe       1528       0    2967696      0      0
lo0    33192  <Link>                                      3       0          3      0      0
lo0    33192  127/8            localhost                  3       0          3      0      0
lo0    33192  localhost/128    ::1                        3       0          3      0      0
lo0    33192  fe80::/64        fe80::1                    3       0          3      0      0
# netstat -q
arpintrq:
queue length: 0
maximum queue length: 50
packets dropped: 153
ipintrq:
queue length: 0
maximum queue length: 256
packets dropped: 183066795
ip6intrq:
queue length: 0
maximum queue length: 256
packets dropped: 0
atintrq1:
queue length: 0
maximum queue length: 256
packets dropped: 0
atintrq2:
queue length: 0
maximum queue length: 256
packets dropped: 0
clnlintrq:
queue length: 0
maximum queue length: 256
packets dropped: 0
ppoediscinq:
queue length: 0
maximum queue length: 256
packets dropped: 0
ppoeinq:
queue length: 0
maximum queue length: 256
packets dropped: 0
# netstat -i
Name   Mtu    Network          Address                Ipkts   Ierrs      Opkts  Oerrs  Colls
nfe0   1500   <Link>           00:13:d4:ae:9b:6b      38497     585       5596      0      0
nfe0   1500   fe80::/64        fe80::213:d4ff:fe      38497     585       5596      0      0
nfe0   1500   192.168.43/24    192.168.43.222         38497     585       5596      0      0
bge0*  1500   <Link>           00:10:18:14:15:12          0       0          0      0      0
bge1   1500   <Link>           00:10:18:14:27:d5   46026057  489390  217012400      0      0
bge1   1500   192.168.44/24    192.168.44.223      46026057  489390  217012400      0      0
bge1   1500   fe80::/64        fe80::210:18ff:fe   46026057  489390  217012400      0      0
bge2   1500   <Link>           00:10:18:14:38:d2  360324326  255587   19537143      0      0
bge2   1500   192.168.88/24    192.168.88.223     360324326  255587   19537143      0      0
bge2   1500   fe80::/64        fe80::210:18ff:fe  360324326  255587   19537143      0      0
wm0    1500   <Link>           00:15:17:0b:70:98   17816195      72         31      0      0
wm0    1500   fe80::/64        fe80::215:17ff:fe   17816195      72         31      0      0
wm1    1500   <Link>           00:15:17:0b:70:99       1528       0    2967696      0      0
wm1    1500   fe80::/64        fe80::215:17ff:fe       1528       0    2967696      0      0
lo0    33192  <Link>                                      3       0          3      0      0
lo0    33192  127/8            localhost                  3       0          3      0      0
lo0    33192  localhost/128    ::1                        3       0          3      0      0
lo0    33192  fe80::/64        fe80::1                    3       0          3      0      0



>I therefore see very, very good grounds to expect that NetBSD would
>show much better performance if you increase bge interrupt mitigation.

Yup, it certainly seems so!



>That said: I see a very strong philosophical design difference between
>FreeBSD's polling machinery, and the interrupt-mitigation approaches
>variously implemented by Jason Thorpe in wm(4) and by myself in
>bge(4). For the workloads I care about, the design-point tradeoffs in
>FreeBSD-4's polling are simply not acceptable. I *want* kernel
>softint processing to pre-empt userspace procesese, and even
>kthreads. I acknowledge that my needs are, perhaps, unusual.

There are certainly tradeoffs. I guess for me, in a firewall capacity,
I want to be able to get into the box OOB when it's under
attack. 1Mpps is still considered a medium to heavy attack right
now, but with more and more botnets out there, it's only going to get
more commonplace :( I guess I would like the best of both worlds: a
way to give priority to OOB access, be that serial console or other
interface... But I don't see a way of doing that right now with the interrupt method.





>Even so, I'd be glad to work on improving bge(4) tuning for workloads
>dominated by tinygrams. The same packet rate as ttcp (over
>400kpacket/sec on a 2.4Ghz Opteron) seems like an achievable target
>--- unless there's a whole lot of CPU processing going on inside
>IP-forwarding that I'm wholly unaware of.

The AMD I am testing on is just a 3800 X2, so ~2.0GHz.



>At a recieve rate of 123Mbyte/sec per bge interface, I see roughly
>5,000 interrupts per bge per second. What interrupt rates are you
>seeing for each bge device in your tests?


After 10 seconds of blasting,

# vmstat -i
interrupt total rate
cpu0 softclock 5142870 98
cpu0 softnet 1288284 24
cpu0 softserial 697 0
cpu0 timer 5197361 100
cpu0 FPU synch IPI 5 0
cpu0 TLB shootdown IPI 373 0
cpu1 timer 5185327 99
cpu1 FPU synch IPI 2 0
cpu1 TLB shootdown IPI 1290 0
ioapic0 pin 14 1659 0
ioapic0 pin 15 30 0
ioapic0 pin 3 44586 0
ioapic0 pin 10 2596838 49
ioapic0 pin 5 11767286 226
ioapic0 pin 7 64269 1
ioapic0 pin 4 697 0
Total 31291574 602

# vmstat -i
interrupt total rate
cpu0 softclock 5145604 98
cpu0 softnet 1288376 24
cpu0 softserial 697 0
cpu0 timer 5201094 100
cpu0 FPU synch IPI 5 0
cpu0 TLB shootdown IPI 373 0
cpu1 timer 5189060 99
cpu1 FPU synch IPI 2 0
cpu1 TLB shootdown IPI 1291 0
ioapic0 pin 14 1659 0
ioapic0 pin 15 30 0
ioapic0 pin 3 44664 0
ioapic0 pin 10 2596865 49
ioapic0 pin 5 11873637 228
ioapic0 pin 7 64294 1
ioapic0 pin 4 697 0
Total 31408348 603

That was with hw.bge.rx_lvl=5

>

>I've never seen that particular bug. I don't beleive I have any acutal
>5750 chips to try to reproduce it. I do have access to: 5700, 5701,
>5705, 5704, 5721, 5752, 5714, 5715, 5780. (I have one machine with one
>5752; and the 5780 is one-dualport-per HT-2000 chip, which means one
>per motherboard. But for most people's purposes, the 5780/5714/5715
>are indistinguishable).
>
>I wonder, does this problem go away if you crank up interrupt mitigation?

It's hard to reproduce, but if I use 2 generators to blast in one
direction, it seems to trigger it even with the value at 5.

Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec 1 10:21:29 r2-netbsd last message repeated 2 times
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec 1 10:21:29 r2-netbsd last message repeated 3 times
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec 1 10:21:29 r2-netbsd last message repeated 2 times
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?
Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 52?
Dec 1 10:21:30 r2-netbsd last message repeated 2365 times


With ipfilter disabled, I am able to get about 680Kpps through the
box using 2 streams in one direction. (As a comparison, RELENG_4 was
able to do 950Kpps and with a faster CPU (AMD 4600), about 1.2Mpps)

Note that with all these tests, the NetBSD box is essentially locked up
servicing interrupts.


---Mike


j***@dsg.stanford.edu
2006-12-01 22:31:31 UTC
In message <***@lava.sentex.ca>, Mike Tancsa writes:
>At 01:49 PM 12/1/2006, Jonathan Stone wrote:
>
>>As sometime principial maintaner of NetBSD's bge(4) driver, and the
>>author of many of the changes and chip-variant support subsequently
>>folded into OpenBSD's bge(4) by ***@openbsd.org, I'd like to speak
>>to a couple of points here.
>
>First off, thanks for the extended insights! This has been a most
>interesting exercise for me.

You're most welcome. (And thank you in turn for giving me a periodic
reminder that I really should write some text about interrupt
mitigation for NetBSD's bge(4) manpage.)

[Jonathan comments that we're 2 or 3 orders of magnitude away
from where switch VLAN insertion should matter.]

>Unfortunately, my budget is not so high that I can afford to have a
>high end gigE switch in my test area. I started off with a linksys,
>which I managed to hang under moderately high loads. I had an
>opportunity to test the Netgear and it was a pretty reasonable price
>(~$650 USD) for what it claims its capable of (17Mpps).

Hmm, so 17Mpps versus some 0.45Mpps is a factor of 37; let's call
it 2 and a half orders of magnitude :-/.

> Similarly, trunking,
>although a bit wonky to configure (I am far more used to Cisco land)
>at least works and doesnt seem to degrade overall performance.

"Trunking" is overloaded: it can be used mean either link aggregation,
or VLAN-tagging. I have found "trunking" causese enough
misunderstandings that I avoid using the term. I assume here you mean
insertion of VLAN tags, as e.g., commonly used for switch-to-switch
links?


>>Second point: NetBSD's bge(4) driver includes support for runtime
>>manual tuning of interrupt mitigation. I chose the tuning values
>>based on empirical measurements of large TCP flows on bcm5700s and bcm5704s.

[....]

>hw.bge.rx_lvl = 0

Yes. I can never remember if it's global or per-device-instance.
(My original code was global; others have asked for per-instance.)

Snipping the following...

>#
>
>With ipf enabled and 10 poorly written rules.
>
>rx_lvl pps
>
>0 219,181
>1 229,334
>2 280,508
>3 328,896
>4 333,585
>5 346,974

I believe the following were before-and-after stats for a 10-second
run:


>ipintrq:
> queue length: 0
> maximum queue length: 256
> packets dropped: 180561075



>ipintrq:
> queue length: 0
> maximum queue length: 256
> packets dropped: 183066795

Hmm. That indicates ipintrq dropped 2505720 packets during your
10-second run. Call it 250k packet drops/sec. Can you repeat your test
after increasing ipintrq via (as root)

sysctl -w net.inet.ip.ifq.maxlen=1024

Or even increase it to 2048? As I mentioned earlier, even for TCP traffic
(bidirectional ttcp streams have 1 ACK every 2 packets, or a 2:1 ratio
of full-size frames to minimum-size frames), I need to configure about
512 ipintrq entries per interface. The default value of 256 isn't
really appropriate for multiple GbE interfaces using interrupt
moderation; but it is at least better than the former [ex-CSRG]
default of 50, which dated back to 10Mbit Ethernet. (Or even 3Mbit?)



>>I therefore see very, very good grounds to expect that NetBSD would
>>show much better performance if you increase bge interrupt mitigation.
>
>Yup, it certainly seems so!

I would hope NetBSD can do even better again, after attention to
runtime tunables; but see below.

>There are certainly tradeoffs. I guess for me in a firewall capacity,
>I want to be able to get into the box OOB when its under
>attack. 1Mpps is still considered a medium to heavy attack right
>now, but with more and more botnets out there, its only going to get
>more common place :( I guess I would like the best of both worlds, a
>way to give priority for OOB access, be that serial console or other
>interface... But I dont see a way of doing that right now via Interrupt method.

Oh, it's doable, given patience; I've done it. The first step is to
mitigate hardware interrupts to a level where the CPU can keep up with
hardware interrupt servicing of a minimal-length traffic stream, with
CPU to spare. The second step is to tweak (or fine-tune) ipintrq max
depth to where ipintrq overflows *just* enough that processing the
non-overflowed packets (done at spl[soft]net) doesn't leave you
livelocked. On the other hand, any fastpath forwarding that bypasses
ipintrq makes that approach impossible :).


>>Even so, I'd be glad to work on improving bge(4) tuning for workloads
>>dominated by tinygrams. The same packet rate as ttcp (over
>>400kpacket/sec on a 2.4Ghz Opteron) seems like an achievable target
>>--- unless there's a whole lot of CPU processing going on inside
>>IP-forwarding that I'm wholly unaware of.
>
>The AMD I am testing on is just a 3800 X2 so ~ 2.0Ghz.

Hmm. I can probably attempt to set up two bcm5721s in a similar box;
I'd have to look into load-generation.

>>At a recieve rate of 123Mbyte/sec per bge interface, I see roughly
>>5,000 interrupts per bge per second. What interrupt rates are you
>>seeing for each bge device in your tests?
>

[...]
>
>That was with hw.bge.rx_lvl=5

Sorry, I didn't keep your dmesg. Which interrupts were the bge devices?



>Its hard to reproduce, but if I use 2 generators to blast in one
>direction, it seems to trigger it even with the value at 5
>
>Dec 1 10:21:29 r2-netbsd /netbsd: bge: failed on len 142?

If I'm reading -current correctly, the message indicates that the
hardware Tx queue filled up, and therefore an outbound packet was put
onto the software queue and IFF_OACTIVE was set, in the hope that the
packet will be picked up later when the Tx queue has space available.
But for that to work, bge_start() should return whenever it's called with
IFF_OACTIVE set. bge_start() lacks that check. bge_intr() has a check before
it calls bge_start(), but the other calls to bge_start() (e.g. bge_tick())
don't do that. (Some calls check for ifq_snd being non-NULL, but that may be
a hangover from Christos' initial import of Bill Paul's original code.)
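
[A rough sketch of the kind of guard being described, at the top of the
start routine -- simplified and generic, not the literal bge_start(); the
helper that checks the Tx ring is made up for illustration:

    #include <sys/param.h>
    #include <sys/mbuf.h>
    #include <net/if.h>

    static int example_tx_ring_full(struct ifnet *);   /* hypothetical ring check */

    static void
    example_start(struct ifnet *ifp)
    {
            struct mbuf *m;

            /* Bail out unless RUNNING and not already marked OACTIVE. */
            if ((ifp->if_flags & (IFF_RUNNING | IFF_OACTIVE)) != IFF_RUNNING)
                    return;

            for (;;) {
                    IFQ_POLL(&ifp->if_snd, m);      /* peek at the next packet */
                    if (m == NULL)
                            break;
                    if (example_tx_ring_full(ifp)) {
                            /* Tx ring full: mark OACTIVE; the Tx-done
                             * interrupt clears it and calls start again. */
                            ifp->if_flags |= IFF_OACTIVE;
                            break;
                    }
                    IFQ_DEQUEUE(&ifp->if_snd, m);
                    /* ... load m into the Tx ring and kick the chip ... */
            }
    }
]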

Let's talk about that offline. If nothing else, you could try ifdef'ing
out the printf().
