Discussion: Measuring dropped packets
Christoph Kaegi
2006-10-26 08:59:14 UTC
Hello List

Our NetBSD 3.0 ipf firewall handles several thousand users on a 40 Mbit/s
link to the Internet.

We are now experiencing delays on Internet connections, and certain
applications (video conferencing) report packet loss.

How can I find out whether and where packets are being dropped on the
firewall (apart from using netstat -di)?

One observation is that we have quite a high interrupt load
(between 10'000 and 20'000 interrupts/second at the moment, with
larger peaks whose exact values I can't remember anymore).

The NICs are:
---------------------------- 8< ----------------------------
wm0 at pci3 dev 2 function 0: Intel i82546EB 1000BASE-T Ethernet, rev. 1
wm0: interrupting at irq 12
wm0: 64-bit 133MHz PCIX bus
makphy0 at wm0 phy 1: Marvell 88E1011 Gigabit PHY, rev. 3
wm1 at pci3 dev 2 function 1: Intel i82546EB 1000BASE-T Ethernet, rev. 1
wm1: interrupting at irq 12
wm1: 64-bit 133MHz PCIX bus
makphy1 at wm1 phy 1: Marvell 88E1011 Gigabit PHY, rev. 3
wm2 at pci5 dev 1 function 0: Intel i82546GB 1000BASE-T Ethernet, rev. 3
wm2: interrupting at irq 11
wm2: 64-bit 66MHz PCIX bus
makphy2 at wm2 phy 1: Marvell 88E1011 Gigabit PHY, rev. 5
wm3 at pci5 dev 1 function 1: Intel i82546GB 1000BASE-T Ethernet, rev. 3
wm3: interrupting at irq 11
wm3: 64-bit 66MHz PCIX bus
makphy3 at wm3 phy 1: Marvell 88E1011 Gigabit PHY, rev. 5
wm4 at pci5 dev 2 function 0: Intel i82546GB 1000BASE-T Ethernet, rev. 3
wm4: interrupting at irq 11
wm4: 64-bit 66MHz PCIX bus
makphy4 at wm4 phy 1: Marvell 88E1011 Gigabit PHY, rev. 5
wm5 at pci5 dev 2 function 1: Intel i82546GB 1000BASE-T Ethernet, rev. 3
wm5: interrupting at irq 11
wm5: 64-bit 66MHz PCIX bus
makphy5 at wm5 phy 1: Marvell 88E1011 Gigabit PHY, rev. 5
---------------------------- 8< ----------------------------

I also get messages like:
---------------------------- 8< ----------------------------
wm2: Receive overrun
wm0: device timeout (txfree 3706 txsfree 0 txnext 22)
wm0: device timeout (txfree 3750 txsfree 0 txnext 459)
wm0: device timeout (txfree 3852 txsfree 0 txnext 352)
wm2: Receive overrun
---------------------------- 8< ----------------------------

The problem seems to primarily hurt UDP traffic, but
TCP traffic could also be affected because we use
stateful filtering.

It would be great if anybody could give me the right
pointers on where to look.
Also, are there useful tools (apart from tcpdump) that
would help diagnose such a situation?

Thanks
Chris
--
----------------------------------------------------------------------
Christoph Kaegi ***@zhwin.ch
----------------------------------------------------------------------

Manuel Bouyer
2006-10-26 13:14:33 UTC
Post by Christoph Kaegi
Hello List
Our 3.0 ipf Firewall handles several thousand users on a 40MBit/s
link to the internet.
Now we experience delays on internet connections and certain
applications (video conferencing) report packet loss.
How can I find out if and where packets are dropped on the firewall?
(apart from netstat -di)
Well, netstat -di can already give a good hint. But the wm driver
didn't properly report some input errors; I fixed this recently in
-current.
You can also look at netstat -q to see if there are drops at a higher
level. If you see drops there, you can try to bump IFQ_MAXLEN
to something larger than 256.
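Roughly something like this should show whether the IP input queue is
overflowing and let you raise the limit without a reboot (untested
sketch; 1024 is only an example value, and the sysctl names are the
ones under net.inet.ip.ifq):
---------------------------- 8< ----------------------------
#!/bin/sh
# show the IP input queue state and its drop counter
netstat -q
sysctl net.inet.ip.ifq.maxlen net.inet.ip.ifq.drops
# raise the limit at run time (1024 is only an example value)
sysctl -w net.inet.ip.ifq.maxlen=1024
---------------------------- 8< ----------------------------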

Also look at vmstat -m, especially for failed requests to mbpl and mclpl.
If there are failed requests, you have to bump NMBCLUSTERS (you'll probably
have to anyway if you bump IFQ_MAXLEN, I think).
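For example (rough sketch; non-zero numbers in the Fail column are what
to look for):
---------------------------- 8< ----------------------------
#!/bin/sh
# failed allocations show up in the Fail column for mbpl/mclpl
vmstat -m | egrep 'Requests|mbpl|mclpl'
# overall mbuf statistics, including calls to protocol drain routines
netstat -m
# current cluster limit (the NMBCLUSTERS kernel option)
sysctl kern.mbuf.nmbclusters
---------------------------- 8< ----------------------------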

You may also want to install something like pkgsrc/net/mrtg to
monitor traffic, in both byte counts and packet counts (the script provided
in that package counts bytes, but it's trivial to change it to count
packets too).
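For packet counts, an external script in the four-line format mrtg
expects (in counter, out counter, uptime, target name) could look
roughly like this; it's untested, and the awk column numbers
(Ipkts/Opkts) may need adjusting to your netstat output:
---------------------------- 8< ----------------------------
#!/bin/sh
# print input/output packet counters for one interface in the
# four-line format mrtg expects from an external target script
IF=${1:-wm0}
# Ipkts is column 5 and Opkts column 7 in "netstat -in" here;
# check this against your own output
netstat -in | awk -v ifname="$IF" '$1 == ifname { print $5; print $7; exit }'
uptime
echo "$IF packets"
---------------------------- 8< ----------------------------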
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 years of experience will always make the difference
--

Christoph Kaegi
2006-10-26 14:19:16 UTC
Post by Manuel Bouyer
You can also look at netstat -q to see if there are drops at a higher
level. If you see drops there, you can try to bump IFQ_MAXLEN
to something larger than 256.
---------------------------- 8< ----------------------------
# netstat -q
arpintrq:
queue length: 0
maximum queue length: 50
packets dropped: 826
ipintrq:
queue length: 0
maximum queue length: 256
packets dropped: 91897332
---------------------------- 8< ----------------------------

Ugh! That seems like a lot of dropped packets on the last line.
Can you explain where exactly these packets were dropped?
Post by Manuel Bouyer
Also look at vmstat -m, especially for failed requests to mbpl and mclpl.
If there are failed requests you have to bump NMBCLUSTERS (you'll have to
if you bump IFQ_MAXLEN anyway, I think)
---------------------------- 8< ----------------------------
intfw (primary)# vmstat -m
vmstat: Kmem statistics are not being gathered by the kernel.
Memory resource pool statistics
Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
mbpl 256 3436 0 0 216 0 216 216 1 inf 1
mclpl 2048 1501 0 0 755 0 755 755 4 1024 4
---------------------------- 8< ----------------------------
Post by Manuel Bouyer
You may also want to install something like pkgsrc/net/mrtg, to
monitor traffic, in both byte count and packets counts (the script provided
in the above package does byte count, but it's trivial to change it to
do packets count too)
I'm running nload at the moment, which reports this for the outer interface:
---------------------------- 8< ----------------------------
Incoming: Outgoing:
Curr: 11.92 MBit/s Curr: 3.34 MBit/s
Avg: 11.67 MBit/s Avg: 6.67 MBit/s
Min: 0.00 MBit/s Min: 0.90 MBit/s
Max: 61.23 MBit/s Max: 44.10 MBit/s
---------------------------- 8< ----------------------------

I'm going to look at where to bump IFQ_MAXLEN now...

Thanks and until later

Chris
--
----------------------------------------------------------------------
Christoph Kaegi ***@zhwin.ch
----------------------------------------------------------------------

Christoph Kaegi
2006-10-26 14:35:22 UTC
Post by Christoph Kaegi
---------------------------- 8< ----------------------------
# netstat -q
arpintrq:
queue length: 0
maximum queue length: 50
packets dropped: 826
ipintrq:
queue length: 0
maximum queue length: 256
packets dropped: 91897332
---------------------------- 8< ----------------------------
sysctl also seems to report this number:

---------------------------- 8< ----------------------------
# sysctl -a |grep -i ifq
net.inet.ip.ifq.len = 0
net.inet.ip.ifq.maxlen = 256
net.inet.ip.ifq.drops = 91897332
---------------------------- 8< ----------------------------

The options(4) manpage says the default is 50. Is that no longer
correct? Should I send-pr this?

I tried

# sysctl -w net.inet.ip.ifq.maxlen=512

I'll watch if this helps now.
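If it does help, I guess I should also put it into /etc/sysctl.conf so
it survives a reboot, something like:
---------------------------- 8< ----------------------------
# /etc/sysctl.conf -- applied at boot by rc.d/sysctl
net.inet.ip.ifq.maxlen=512
---------------------------- 8< ----------------------------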

Thanks
Chris
--
----------------------------------------------------------------------
Christoph Kaegi ***@zhwin.ch
----------------------------------------------------------------------

Manuel Bouyer
2006-10-26 18:24:37 UTC
Post by Christoph Kaegi
Post by Christoph Kaegi
---------------------------- 8< ----------------------------
# netstat -q
arpintrq:
queue length: 0
maximum queue length: 50
packets dropped: 826
ipintrq:
queue length: 0
maximum queue length: 256
packets dropped: 91897332
---------------------------- 8< ----------------------------
---------------------------- 8< ----------------------------
# sysctl -a |grep -i ifq
net.inet.ip.ifq.len = 0
net.inet.ip.ifq.maxlen = 256
net.inet.ip.ifq.drops = 91897332
---------------------------- 8< ----------------------------
The options(4) manpage says the default is 50. Is that no longer
correct? Should I send-pr this?
I tried
# sysctl -w net.inet.ip.ifq.maxlen=512
I'll watch if this helps now.
I would set it even larger. If you have several wm(4) cards, I think
you should make it at least 256 * the number of adapters.
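With your six wm(4) interfaces that rule of thumb would give something
like this (just a lower bound, not a tested value):
---------------------------- 8< ----------------------------
#!/bin/sh
# 6 adapters * 256 = 1536 as a lower bound for the IP input queue length
sysctl -w net.inet.ip.ifq.maxlen=1536
---------------------------- 8< ----------------------------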
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 years of experience will always make the difference
--

Manuel Bouyer
2006-10-26 18:27:27 UTC
Post by Christoph Kaegi
Post by Manuel Bouyer
You can also look at netstat -q to see if there are drops at a higher
level. If you see drops there, you can try to bump IFQ_MAXLEN
to something larger than 256.
---------------------------- 8< ----------------------------
# netstat -q
arpintrq:
queue length: 0
maximum queue length: 50
packets dropped: 826
ipintrq:
queue length: 0
maximum queue length: 256
packets dropped: 91897332
---------------------------- 8< ----------------------------
Ugh! That seems like a lot of dropped packets on the last line.
Can you explain where exactly these packets were dropped?
When a packet is received, the interrupt handler for the adapter places it
in the IP input queue (ipintrq), which is processed later at a lower
interrupt priority. What happens here, I guess, is that you have several
adapters, each with a large receive ring; when the interrupt handlers
process the adapters' receive rings, each holding several packets, the
ipintrq length limit is reached and the extra packets are dropped.
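A very rough way to see how fast that counter is growing right now
(untested sketch):
---------------------------- 8< ----------------------------
#!/bin/sh
# sample the ipintrq drop counter twice, 10 seconds apart,
# and print the number of packets dropped in between
a=$(sysctl -n net.inet.ip.ifq.drops)
sleep 10
b=$(sysctl -n net.inet.ip.ifq.drops)
echo "$((b - a)) packets dropped from ipintrq in the last 10 seconds"
---------------------------- 8< ----------------------------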
--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 years of experience will always make the difference
--

Christoph Kaegi
2006-10-27 06:30:22 UTC
Post by Manuel Bouyer
Post by Christoph Kaegi
packets dropped: 91897332
---------------------------- 8< ----------------------------
Ugh! That seems like a lot of dropped packets on the last line.
Can you explain where exactly these packets were dropped?
When a packet is received, the interrupt handler for the adapter places it
in the IP input queue (ipintrq), which is processed later at a lower
interrupt priority. What happens here, I guess, is that you have several
adapters, each with a large receive ring; when the interrupt handlers
process the adapters' receive rings, each holding several packets, the
ipintrq length limit is reached and the extra packets are dropped.
Thanks for the explanation Manuel. :-)

Chris
--
----------------------------------------------------------------------
Christoph Kaegi ***@zhwin.ch
----------------------------------------------------------------------

Christoph Kaegi
2006-10-30 07:47:26 UTC
Post by Manuel Bouyer
Post by Christoph Kaegi
# sysctl -w net.inet.ip.ifq.maxlen=512
I'll watch if this helps now.
I would set it even larger. If you have several wm(4) cards, I think
you should make it at least 256 * the number of adapters.
I set net.inet.ip.ifq.maxlen to 1024 on Friday. Then, during the
weekend, the machine seemed to forward fewer and fewer packets,
until Sunday evening, when it stopped forwarding completely.

There was no message on the console, in the kernel message buffer,
or in the logs. The machine was just stuck and no longer responded
to network packets or even the keyboard.

What could be the reason for this?
And, more importantly for me, how can I prevent it from occurring
again, or how can I at least measure this slow dying?
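Maybe I could at least run something like this in the background next
time, so the decline leaves a trace (untested sketch; the log file name
is arbitrary):
---------------------------- 8< ----------------------------
#!/bin/sh
# append queue, mbuf and pool statistics to a log every 5 minutes,
# so a gradual failure leaves something to look at afterwards
LOG=/var/log/fw-health.log
while :; do
    {
        date
        netstat -q
        netstat -m
        vmstat -m | egrep 'Requests|mbpl|mclpl'
        echo '----'
    } >> "$LOG"
    sleep 300
done
---------------------------- 8< ----------------------------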

Chris
--
----------------------------------------------------------------------
Christoph Kaegi ***@zhwin.ch
----------------------------------------------------------------------

Rui Paulo
2006-10-30 14:05:54 UTC
Post by Christoph Kaegi
Post by Manuel Bouyer
Post by Christoph Kaegi
# sysctl -w net.inet.ip.ifq.maxlen=512
I'll watch if this helps now.
I would set it even larger. If you have several wm(4) cards, I think
you should make it at least 256 * the number of adapters.
I set net.inet.ip.ifq.maxlen to 1024 on Friday. Then, during the
weekend, the machine seemed to forward fewer and fewer packets,
until Sunday evening, when it stopped forwarding completely.
There was no message on the console, in the kernel message buffer,
or in the logs. The machine was just stuck and no longer responded
to network packets or even the keyboard.
What could be the reason for this?
And, more importantly for me, how can I prevent it from occurring
again, or how can I at least measure this slow dying?
Maybe you ran out of mbufs?
netstat -m can help you here. If you see a lot of calls to protocol drain
routines, you need to bump NMBCLUSTERS.
Either way, if you had reached the mclpool limit, a message should have
been printed on the console / in dmesg.
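A rough way to check whether the drain calls are still increasing
(untested sketch):
---------------------------- 8< ----------------------------
#!/bin/sh
# compare the "calls to protocol drain routines" counter a minute
# apart; if it keeps climbing, the cluster pool is still hitting
# its limit
a=$(netstat -m | awk '/calls to protocol drain/ { print $1 }')
sleep 60
b=$(netstat -m | awk '/calls to protocol drain/ { print $1 }')
echo "drain calls in the last minute: $((b - a))"
---------------------------- 8< ----------------------------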

--
Rui Paulo



Christoph Kaegi
2006-10-30 15:28:27 UTC
Post by Rui Paulo
Maybe you ran out of mbufs?
netstat -m can help you here. If you see a lot of calls to protocol drain
routines, you need to bump NMBCLUSTERS.
---------------------------- 8< ----------------------------
# sysctl -a | grep nmbclust
kern.mbuf.nmbclusters = 2048
---------------------------- 8< ----------------------------


and netstat -m says:
---------------------------- 8< ----------------------------
1028 mbufs in use:
1025 mbufs allocated to data
3 mbufs allocated to packet headers
247034 calls to protocol drain routines
---------------------------- 8< ----------------------------
Post by Rui Paulo
Either way, if you had reached the mclpool limit, a message should have
been printed on the console / in dmesg.
---------------------------- 8< ----------------------------
# vmstat -m | egrep "Requests|mcl"
vmstat: Kmem statistics are not being gathered by the kernel.
Name Size Requests Fail Releases Pgreq Pgrel Npage Hiwat Minpg Maxpg Idle
mclpl 2048 2002 247026 0 1001 0 1001 1001 4 1024 0
---------------------------- 8< ----------------------------

Looking at the 'Fail' column, I get the impression I should
indeed increase NMBCLUSTERS.
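So I'd either rebuild the kernel with a larger NMBCLUSTERS, or, if
kern.mbuf.nmbclusters can be raised at run time on 3.0 (I'm not sure it
can), try something like this (the value is just a guess):
---------------------------- 8< ----------------------------
#!/bin/sh
# check the current limit, then try to raise it at run time
# (if the sysctl is writable at all, it can typically only be increased)
sysctl kern.mbuf.nmbclusters
sysctl -w kern.mbuf.nmbclusters=8192
# boot-time equivalent in the kernel config file:
#   options NMBCLUSTERS=8192
---------------------------- 8< ----------------------------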

Are there other variables I should look at?

Chris
--
----------------------------------------------------------------------
Christoph Kaegi ***@zhwin.ch
----------------------------------------------------------------------
