DIAGNOSTIC panic in unp

Discussion:

DIAGNOSTIC panic in unp_gc function

(too old to reply)

Ritesh Agrawal

2015-02-27 13:00:50 UTC

Hi All,

I have been seeing random panics on my NetBSD-6.0 based system and this
panic was in AF_LOCAL protocol code (uipc_usrreq.c). One of the panic was:

uvm_fault(0xffffffff81a195e0, 0x0, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff803042ff cs 8 rflags 10283 cr2 8 cpl 0 rsp
fffffe810f27bab0
panic: trap
cpu1: Begin traceback...
printf_nolog() at netbsd:printf_nolog
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
unp_detach() at netbsd:unp_detach+0x2e
uipc_usrreq() at netbsd:uipc_usrreq+0x79
soclose() at netbsd:soclose+0x79
soo_close() at netbsd:soo_close+0x1a
closef() at netbsd:closef+0x4a
unp_thread() at netbsd:unp_thread+0x3cb
cpu1: End traceback...

I than installed the NetBSD kernel with "DIAGNOSTIC option" and the
DIAGNOSTIC kernel paniced on following line no:

sys/kern/uipc_usrreq.c:1713 with current TOT for MAIN branch.

It seems that we can get into this code with file pointer reference
count as 0.
I got into this situation by following steps:

1) I am passing file between two processes using AF_LOCAL socket.
2) process 1 opens a AF_LOCAL socket with fd "x"
3) now process 1 passes the same fd as part of control message to
process 2 using this fd.
4) process 1 closes its fd
5) This fd is now only in kernel as part of control message of mbuf
receive Q for the AF_LOCAL socket held by process 2
6) Therefore "fp->f_count" of the passed fd is 1
7) Now unp_thread kicks in to process the deferred closed
8) It looks into the filehead list and for UNIX domain socket scans the
receive mbufs
9) It marks all the file descriptor which are socket in control message
as "FDEFER"
10) Therefore the file descriptor sent by process 1 is marked as FDEFER
11) Before the "filehead" list is rescanned again (FDEFER fd is ahead in
list), process 2 wakes up and receives the mbuf and closes the FDEFER
file descriptor
12) As there was only one reference count on this file descriptor, the
file is put into "file_cache" pool cache when it is closed by process 2.
13) This file is freed but it still remains in "filehead" list because
file is removed from "filehead" list in "file_dtor" function.
14) "file_dtor" function is "pc_dtor" function and are called
conditionally.
15) Now the "unp_gc" rescans the "filehead" and finds this file
descriptor with "FDEFER" set and file pointer "f_count" as 0
16) It hits KASSERT and system panics
17) If it is not DIAGNOSTIC kernel then it access data which are not
valid and crashes elsewhere.

I saw the current code and this code still exist and looks similar. I
think we should either increase the f_count of this file pointer while
marking it FDEFER and then cleanup when we get into this loop. We can
also check for file pointer with "FDEFER" and "f_count == 0" in line
1713 and just "continue" as this could be valid case.

Please advice.

Thanks
Ritesh

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Christos Zoulas

2015-02-27 14:29:11 UTC

Permalink

Post by Ritesh Agrawal
Hi All,
I have been seeing random panics on my NetBSD-6.0 based system and this
uvm_fault(0xffffffff81a195e0, 0x0, 1) -> e
fatal page fault in supervisor mode
trap type 6 code 0 rip ffffffff803042ff cs 8 rflags 10283 cr2 8 cpl 0 rsp
fffffe810f27bab0
panic: trap
cpu1: Begin traceback...
printf_nolog() at netbsd:printf_nolog
startlwp() at netbsd:startlwp
alltraps() at netbsd:alltraps+0x96
unp_detach() at netbsd:unp_detach+0x2e
uipc_usrreq() at netbsd:uipc_usrreq+0x79
soclose() at netbsd:soclose+0x79
soo_close() at netbsd:soo_close+0x1a
closef() at netbsd:closef+0x4a
unp_thread() at netbsd:unp_thread+0x3cb
cpu1: End traceback...
I than installed the NetBSD kernel with "DIAGNOSTIC option" and the
sys/kern/uipc_usrreq.c:1713 with current TOT for MAIN branch.
It seems that we can get into this code with file pointer reference
count as 0.
1) I am passing file between two processes using AF_LOCAL socket.
2) process 1 opens a AF_LOCAL socket with fd "x"
3) now process 1 passes the same fd as part of control message to
process 2 using this fd.
4) process 1 closes its fd
5) This fd is now only in kernel as part of control message of mbuf
receive Q for the AF_LOCAL socket held by process 2
6) Therefore "fp->f_count" of the passed fd is 1
7) Now unp_thread kicks in to process the deferred closed
8) It looks into the filehead list and for UNIX domain socket scans the
receive mbufs
9) It marks all the file descriptor which are socket in control message
as "FDEFER"
10) Therefore the file descriptor sent by process 1 is marked as FDEFER
11) Before the "filehead" list is rescanned again (FDEFER fd is ahead in
list), process 2 wakes up and receives the mbuf and closes the FDEFER
file descriptor
12) As there was only one reference count on this file descriptor, the
file is put into "file_cache" pool cache when it is closed by process 2.
13) This file is freed but it still remains in "filehead" list because
file is removed from "filehead" list in "file_dtor" function.
14) "file_dtor" function is "pc_dtor" function and are called
conditionally.
15) Now the "unp_gc" rescans the "filehead" and finds this file
descriptor with "FDEFER" set and file pointer "f_count" as 0
16) It hits KASSERT and system panics
17) If it is not DIAGNOSTIC kernel then it access data which are not
valid and crashes elsewhere.
I saw the current code and this code still exist and looks similar. I
think we should either increase the f_count of this file pointer while
marking it FDEFER and then cleanup when we get into this loop. We can
also check for file pointer with "FDEFER" and "f_count == 0" in line
1713 and just "continue" as this could be valid case.

Yes, I've seen that too but it is rare. Can you make an example program
that triggers it?

christos

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Ritesh Agrawal

2015-02-27 14:53:38 UTC

Permalink

Post by Christos Zoulas

Yes, I've seen that too but it is rare. Can you make an example program
that triggers it?

I agree, this is not easily reproducible but I am able to reproduce this
in my setup using in house developed automation tool.
It occurs in specific condition and I am not sure whether I will be able
to write a sample code which can reproduce this issue at will. And might
need a complete infrastructure and similar setup to reproduce this. We
will need to simulate closing of socket by process 2 when the "unp_gc"
has marked the socket as "FDEFER".
We hit upon this issue by mistake, we were suppose to pass the regular
file descriptor but due to some bug in code the original file descriptor
got closed. And the communicating AF_LOCAL socket took the same fd
number and got passed along the message.
Do you think how we should fix this issue as we can definitely reproduce
this in closed environment?
I put a check instead of "KASSERT(fp->f_count != 0);" and did continue.

1713c1713,1716
< KASSERT(fp->f_count != 0);
---

Post by Christos Zoulas
if (fp->f_count == 0) {
mutex_exit(&fp->f_lock);
continue;
}

But I know this is not a better method to fix it and we will need a
better fix for this problem.

Ritesh

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Christos Zoulas

2015-02-27 17:02:02 UTC

Permalink

On Feb 27, 8:23pm, ***@teznetworks.com (Ritesh Agrawal) wrote:
-- Subject: Re: DIAGNOSTIC panic in unp_gc function

| > Yes, I've seen that too but it is rare. Can you make an example program
| > that triggers it?

| I agree, this is not easily reproducible but I am able to reproduce this
| in my setup using in house developed automation tool.
| It occurs in specific condition and I am not sure whether I will be able
| to write a sample code which can reproduce this issue at will. And might
| need a complete infrastructure and similar setup to reproduce this. We
| will need to simulate closing of socket by process 2 when the "unp_gc"
| has marked the socket as "FDEFER".
| We hit upon this issue by mistake, we were suppose to pass the regular
| file descriptor but due to some bug in code the original file descriptor
| got closed. And the communicating AF_LOCAL socket took the same fd
| number and got passed along the message.
| Do you think how we should fix this issue as we can definitely reproduce
| this in closed environment?
| I put a check instead of "KASSERT(fp->f_count != 0);" and did continue.
|
| 1713c1713,1716
| < KASSERT(fp->f_count != 0);
| ---
| > if (fp->f_count == 0) {
| > mutex_exit(&fp->f_lock);
| > continue;
| > }
|
| But I know this is not a better method to fix it and we will need a
| better fix for this problem.

Well, the only other place where we decrement the f_count is in
kern_descrip.c closef(), and that does not pay attention to FDEFER,
so that fix is probably correct (or we will have to make closef()
pay attention to FDEFER). Assuming that my guess is right and closef()
is called while we are gc'ing.

christos

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de