problems with nmbcluster (?)

6***@6bone.informatik.uni-leipzig.de

2007-01-07 16:44:48 UTC

hello,

I have some problems with the network. I have to restart my server
repeatedly, because after a few days it loses all connection to the
network. No connections can be established and pings fail. The only remedy
is a restart, after which everything works fine for a few days.

I have tested several kernels (3.0, 3.1, current....) but the same effect
always occurs. The server runs no special services, only apache2 and
postgresql from pkgsrc. I don't know why the problem occurs only on my
system. It is a dual-CPU i386/PIII with IPv6 enabled and an Intel NIC.

I cannot give more specific hints, only the output of 'netstat -mss'
after the connection was lost:

1441 mbufs in use:
1150 mbufs allocated to data
291 mbufs allocated to packet headers
132521 calls to protocol drain routines

Can anyone give me a hint for a possible solution or workaround? The
continuous restarts are no longer feasible. I have already replaced all of
the hardware and software.

thank you for your efforts
Uwe

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Manuel Bouyer

2007-01-07 18:09:59 UTC

What does 'vmstat -m|grep mclpl' show?

--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

6***@6bone.informatik.uni-leipzig.de

2007-01-07 19:28:10 UTC

Post by Manuel Bouyer
What does 'vmstat -m|grep mclpl' show?

The uptime at the moment is only 4h, so I can only report the current
output:

netstat -mss && vmstat -m|grep mclpl

1497 mbufs in use:
1110 mbufs allocated to data
387 mbufs allocated to packet headers
34 calls to protocol drain routines

vmstat: Kmem statistics are not being gathered by the kernel.
mclpl 2048 1578 0 938 408 74 334 398 4 512 7
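
For readers following along, the mclpl line can be broken down with a short script. This is only a sketch: the column names are an assumption based on the usual NetBSD `vmstat -m` pool header (Size, Requests, Fail, Releases, Pgreq, Pgrel, Npage, Hiwat, Minpg, Maxpg, Idle) and should be checked against the header printed on the system in question. The helper names are made up for illustration.

```python
# Column names are an assumption taken from the usual NetBSD `vmstat -m`
# pool header; verify them against the header line on your own system.
POOL_COLUMNS = ["size", "requests", "fail", "releases", "pgreq", "pgrel",
                "npage", "hiwat", "minpg", "maxpg", "idle"]


def parse_pool_line(line):
    """Split one pool line into its name and a dict of integer counters.

    Only handles pools with a numeric page limit, as in the mclpl line.
    """
    fields = line.split()
    name = fields[0]
    values = [int(v) for v in fields[1:]]
    return name, dict(zip(POOL_COLUMNS, values))


def pool_summary(stats):
    """Derive outstanding items and page usage from the parsed counters."""
    # Items handed out and not yet returned to the pool.
    in_use = stats["requests"] - stats["releases"]
    # Fraction of the page limit currently allocated.
    page_use = stats["npage"] / stats["maxpg"]
    return {"in_use": in_use, "fail": stats["fail"], "page_use": page_use}


name, stats = parse_pool_line("mclpl 2048 1578 0 938 408 74 334 398 4 512 7")
summary = pool_summary(stats)
print(name, summary)
```

Under that reading, a non-zero "fail" count or a page usage close to 1.0 would be the values to watch as the uptime grows.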

Uwe

Manuel Bouyer

2007-01-07 21:16:59 UTC

I suspect your system is running out of mclpl on occasion, and this causes the
network adapter (or the IP stack) to stall. Try bumping nmbclusters.

For example on ftp.fr.netbsd.org I have it set to 8192.

6***@6bone.informatik.uni-leipzig.de

2007-01-08 06:01:40 UTC

Post by Manuel Bouyer
I suspect your system is running out of mclpl on occasion, and this causes the
network adapter (or the IP stack) to stall. Try bumping nmbclusters.
For example on ftp.fr.netbsd.org I have it set to 8192.

I have already tested with NMBCLUSTERS=4096. I think the system runs a few
days longer until the stall occurs. Now I will test with 8192, but I think
it will not solve the problem.

Manuel Bouyer

2007-01-08 10:20:36 UTC

Post by 6***@6bone.informatik.uni-leipzig.de
I have already tested with NMBCLUSTERS=4096. I think the system runs a few
days longer until the stall occurs. Now I will test with 8192, but I think
it will not solve the problem.

Then we'll need more information when this occurs:
ifconfig -a
netstat -n
netstat -m
vmstat -m

If you can take a core dump (reboot -d), that would be helpful too.
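
Since the network is unreachable once the stall happens, it may help to record these outputs locally at regular intervals beforehand. The following is a minimal sketch, not a tested tool: the four commands are the ones listed above, while the function names and the log path passed to `append_snapshot` are hypothetical.

```python
import subprocess
import time

# The commands requested in the thread.
COMMANDS = [["ifconfig", "-a"], ["netstat", "-n"],
            ["netstat", "-m"], ["vmstat", "-m"]]


def run_command(argv):
    """Run one command and return its combined stdout/stderr as text."""
    try:
        result = subprocess.run(argv, stdout=subprocess.PIPE,
                                stderr=subprocess.STDOUT,
                                universal_newlines=True)
    except OSError as err:
        # Command missing or not executable; record that instead of failing.
        return "could not run %s: %s\n" % (" ".join(argv), err)
    return result.stdout


def snapshot(runner=run_command, now=time.time):
    """Collect the output of every command into one timestamped report."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(now()))
    parts = ["=== snapshot taken %s UTC ===" % stamp]
    for argv in COMMANDS:
        parts.append("--- %s ---" % " ".join(argv))
        parts.append(runner(argv))
    return "\n".join(parts) + "\n"


def append_snapshot(path, runner=run_command):
    """Append one snapshot to a log file chosen by the caller."""
    with open(path, "a") as log:
        log.write(snapshot(runner))
```

Calling `append_snapshot` from cron every few minutes, with a path on local disk, would leave a record of the last state before the stall.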

--
Manuel Bouyer, LIP6, Universite Paris VI. ***@lip6.fr
NetBSD: 26 ans d'experience feront toujours la difference
--

Stephen Jones

2007-01-11 01:35:09 UTC

Manuel -

Why is there such black magic to this? Is this something that could be
handled more gracefully, with kernel warnings prior to actually hanging?
Could it be set to increase (or decrease) dynamically?

Nearly all the NetBSD crashes I experience are related to this, or so I am
told, and over the years I've never gotten it figured out. I've cited this
as a 'vnlock deadlock' issue, but that's just a symptom. The real issue is
resource starvation .. but is NMBCLUSTERS a spectre or the real ghost?

One of the big problems is that you might not even get a clue before a
system hangs. So for me, I see about 18-24 days of uptime prior to the
inevitable silent hang. No warning, no panic .. just a hang on the NFS
server which causes all of the clients to cascade vnlock deadlocks.

Just a few days ago I had a fortunate clue. I awoke to my phone beeping at
me, telling me of a problem, and when I got to the console I was able to
break to a debugger and kill init to get the NFS server to drop to single
user mode. I was being patient, hoping that it would eventually recover and
give me a shell so I could bring it back up, when:

mclpool limit reached: increase NMBCLUSTERS

spewed down the screen 50 or so times. Finally, a real clue and
confirmation! So what's the history of this?

I tried 8192, 16k, 24k, 32k, 64k .. now I'm at 92k, yet still .. I need to
increase NMBCLUSTERS. To quote Nintendo, how high can you go? What's the
logic behind NMBCLUSTERS? I realise that this is a single value that can
affect other parameters, isn't that correct? So is it a phantom, or should I
really be ever increasing NMBCLUSTERS? What happens if I tell it to go to
256k? Is that too high?

Did you mention to 6bone to send the output of pstat -T? Will that help
out?

Stephen

Manuel Bouyer

2007-01-11 11:35:21 UTC

Post by Stephen Jones
Manuel -
Why is there such black magic to this? Is this something that could be
handled more gracefully, with kernel warnings prior to actually hanging?

I think it will print messages about it. Also, it usually doesn't hang,
but recovers from this situation.

Post by Stephen Jones
Could it be set to increase (or decrease) dynamically?

This is just a limit on a memory pool. We could remove the limit, but then
this would make DoS attacks easier. If it's properly tuned for the system's
usage it should be safe. The default value is fine for most usage; I have
usually needed to tune it only on systems with a lot of outgoing connections.

Post by Stephen Jones
I tried 8192, 16k, 24k, 32k, 64k .. now I'm at 92k, yet still .. I need to
increase NMBCLUSTERS.

Ouh, there's a problem here. With that many NMBCLUSTERS it's possible
that you're running into other limits, depending on how much RAM
your system has (92k NMBCLUSTERS is about 184MB of RAM at 2048 bytes per
cluster, non-pageable).
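
The memory cost grows linearly with the cluster count. A rough sketch of the arithmetic, assuming 2048-byte clusters (the size shown in the mclpl line earlier in the thread); with a different cluster size the totals change accordingly. The function name is made up for illustration.

```python
MCLBYTES = 2048  # assumed cluster size in bytes, taken from the mclpl line


def cluster_memory_mib(nmbclusters, cluster_bytes=MCLBYTES):
    """Return the memory, in MiB, used if every cluster is allocated."""
    return nmbclusters * cluster_bytes / (1024 * 1024)


# Values mentioned in the thread.
for label, count in [("8192", 8192), ("32k", 32 * 1024),
                     ("92k", 92 * 1024), ("256k", 256 * 1024)]:
    print("%5s clusters -> %4.0f MiB" % (label, cluster_memory_mib(count)))
```

This is an upper bound for a fully used pool, not what the kernel has allocated at any given moment.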

I suspect you're experiencing an mbuf leak here. To help debug this,
please rebuild a kernel with
options MBUFTRACE
and provide the outputs of
netstat -m
netstat -n
vmstat -m

after a few days of use (or, better, once the network is hung). You can also
get a core dump from the kernel once the limit is reached: reboot -d,
or enter ddb and type reboot(0x104)

David Young

2007-01-11 10:22:55 UTC

I am going to take a wild guess that apache does not read its sockets
fast enough to keep its socket queues from growing long tails of mbufs,
and then apache tries to do a blocking write(2) on a socket before
read(2)ing its sockets. If apache tries to write(2) when all mbufs are
either on its receive queues or on wm's receive ring, it seems to me
that the system will deadlock.
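
The guess can be illustrated with a toy model. This is not how the kernel or apache actually behaves; it only counts clusters, treats everything held on the receive side as socket receive queues (the NIC receive ring is left out), and all names are invented for the sketch.

```python
def write_can_proceed(total_clusters, held_on_receive, reads_before_write):
    """Toy model: does a blocking write find a free cluster?

    total_clusters      size of the cluster pool (the NMBCLUSTERS limit)
    held_on_receive     clusters sitting unread on socket receive queues
    reads_before_write  True if the application drains its sockets first
    """
    if reads_before_write:
        # Reading hands the queued clusters back to the pool.
        held_on_receive = 0
    free = total_clusters - held_on_receive
    # The write needs at least one cluster. With none free it blocks, and
    # in this model nothing else will ever free one.
    return free > 0


print(write_can_proceed(1024, 1024, reads_before_write=False))  # False
print(write_can_proceed(1024, 1024, reads_before_write=True))   # True
```

In the model, raising the pool size only helps until the receive queues grow to fill it again, which would match stalls that come later but still come.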

Dave

--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 278-3933
