Discussion:
[Fwd: Re: dccifd: restart after signal 6]
(too old to reply)
Petar Bogdanovic
2009-06-07 09:38:16 UTC
Permalink
Hi,

attached is an excerpt from the [1]DCC mailing list. One of the dcc
daemons started to die unexpectedly because it used _res in conjunction
with threads which seems to be a weird combination on NetBSD even if you
assume that your resolver is not thread safe and lock him properly:

/usr/src/include/resolv.h:
/*
* Source and Binary compatibility; _res will not work properly
* with multi-threaded programs.
*/
extern struct __res_state *__res_state(void);
#define _res (*__res_state())

/usr/src/lib/libpthread/res_state.c:
/*
* This is aliased via a macro to _res; don't allow multi-threaded programs
* to use it.
*/
res_state
__res_state(void)
{
static const char res[] = "_res is not supported for multi-threaded"
" programs.\n";
(void)write(STDERR_FILENO, res, sizeof(res) - 1);
abort();
return NULL;
}


At least that's how I interpreted the story so it would be nice if
somebody here could analyze/comment the issue.

Thank you,



Petar Bogdanovic



[1] http://www.rhyolite.com/dcc/

----- Forwarded message from Vernon Schryver <***@calcite.rhyolite.com> -----

Date: Fri, 5 Jun 2009 21:52:05 GMT
As expected, this is the result of a kill(2) call.
#0 0xbbaf923f in kill () from /usr/lib/libc.so.12
#1 0xbbb95a64 in abort () from /usr/lib/libc.so.12
#2 0xbbbdc60c in __res_state () from /usr/lib/libpthread.so.0
#3 0x0806b9ca in dcc_res_delays (budget=4) at get_port.c:476
#4 0x080660f2 in dcc_clnt_rdy (emsg=0xb91fff50 "", ctxt=0x80c9000,
clnt_fgs=8 '\b') at clnt_send.c:1740
#5 0x080544dc in clnt_resolve_thread (arg=0x0) at clnt_threaded.c:394
#6 0xbbbe562d in pthread_join () from /usr/lib/libpthread.so.0
#7 0xbbb1aa2c in swapcontext () from /usr/lib/libc.so.12
Now that you mention it, I saw an instance of it a week or two ago,
but hoped it was a fluke. I've been unable to reproduce it then or today.

That's an ugly one, because it's not in my code.
This is the relevant part of my get_port.c:

if (!dcc_host_locked)
dcc_logbad(EX_SOFTWARE, "dcc_get_host() not locked");

/* get the current value */
if (!(_res.options & RES_INIT))
res_init();

dcc_logbad() calls abort() after syslog().
Because I assume the resolver is not thread safe and check that it's
locked, it can't be a simple, valid locking problem.

I guess I'll have to look for NetBSD's version of the resolver library
to see what NetBSD has done to it. There are no abort() calls in the
FreeBSD 7.1 version of res_state.c


Have I mentioned that I'm not a fan of the clean target in
the NetBSD bsd.prog.mk because it deletes .gdbinit?


thanks,
Vernon Schryver ***@rhyolite.com
_______________________________________________
DCC mailing list ***@rhyolite.com
http://www.rhyolite.com/mailman/listinfo/dcc

----- End forwarded message -----
----- Forwarded message from MrC <lists-***@cappella.us> -----

Date: Fri, 05 Jun 2009 16:16:23 -0700
As expected, this is the result of a kill(2) call.
#0 0xbbaf923f in kill () from /usr/lib/libc.so.12
#1 0xbbb95a64 in abort () from /usr/lib/libc.so.12
#2 0xbbbdc60c in __res_state () from /usr/lib/libpthread.so.0
#3 0x0806b9ca in dcc_res_delays (budget=4) at get_port.c:476
#4 0x080660f2 in dcc_clnt_rdy (emsg=0xb91fff50 "", ctxt=0x80c9000,
clnt_fgs=8 '\b') at clnt_send.c:1740
#5 0x080544dc in clnt_resolve_thread (arg=0x0) at clnt_threaded.c:394
#6 0xbbbe562d in pthread_join () from /usr/lib/libpthread.so.0
#7 0xbbb1aa2c in swapcontext () from /usr/lib/libc.so.12
Now that you mention it, I saw an instance of it a week or two ago,
but hoped it was a fluke. I've been unable to reproduce it then or today.
That's an ugly one, because it's not in my code.
if (!dcc_host_locked)
dcc_logbad(EX_SOFTWARE, "dcc_get_host() not locked");
/* get the current value */
if (!(_res.options& RES_INIT))
res_init();
dcc_logbad() calls abort() after syslog().
Because I assume the resolver is not thread safe and check that it's
locked, it can't be a simple, valid locking problem.
I guess I'll have to look for NetBSD's version of the resolver library
to see what NetBSD has done to it. There are no abort() calls in the
FreeBSD 7.1 version of res_state.c
I don't see anything that immediately jumps out between changes in 4.0 and
4.0.1, but I'm long since familiar with libc code.


Here's the abort() in the libpthread version of res_state:

libpthread/res_state.c

/*
* This is aliased via a macro to _res; don't allow multi-threaded programs
* to use it.
*/
res_state
__res_state(void)
{
static const char res[] = "_res is not supported for multi-threaded"
" programs.\n";
(void)write(STDERR_FILENO, res, sizeof(res) - 1);
abort();
return NULL;
}


The libc version uses weak aliases:

#include <sys/cdefs.h>
#if defined(LIBC_SCCS) && !defined(lint)
__RCSID("$NetBSD: res_state.c,v 1.5.10.1 2007/05/17 21:25:19 jdc Exp $");
#endif

#include <sys/types.h>
#include <arpa/inet.h>
#include <arpa/nameser.h>
#include <netdb.h>
#include <resolv.h>

struct __res_state _nres
# if defined(__BIND_RES_TEXT)
= { .retrans = RES_TIMEOUT, } /*%< Motorola, et al. */
# endif
;

res_state __res_get_state_nothread(void);
void __res_put_state_nothread(res_state);

#ifdef __weak_alias
__weak_alias(__res_get_state, __res_get_state_nothread)
__weak_alias(__res_put_state, __res_put_state_nothread)
/* Source compatibility; only for single threaded programs */
__weak_alias(__res_state, __res_get_state_nothread)
#endif

res_state
__res_get_state_nothread(void)
{
if ((_nres.options & RES_INIT) == 0 && res_ninit(&_nres) == -1) {
h_errno = NETDB_INTERNAL;
return NULL;
}
return &_nres;
}

And dccifd is linked against libpthread:

$ ldd /var/dcc/libexec/dccifd
/var/dcc/libexec/dccifd:
-lpthread.0 => /usr/lib/libpthread.so.0
-lm.0 => /usr/lib/libm387.so.0
-lm.0 => /usr/lib/libm.so.0
-lc.12 => /usr/lib/libc.so.12


Let me know if there is something I can do to help.
Have I mentioned that I'm not a fan of the clean target in
the NetBSD bsd.prog.mk because it deletes .gdbinit?
Oh, boy. They really mean *squeaky clean*.

Thanks for your time,
Mike
_______________________________________________
DCC mailing list ***@rhyolite.com
http://www.rhyolite.com/mailman/listinfo/dcc

----- End forwarded message -----
----- Forwarded message from Vernon Schryver <***@calcite.rhyolite.com> -----

Date: Sat, 6 Jun 2009 00:17:27 GMT
static const char res[] = "_res is not supported for
multi-threaded"
" programs.\n";
(void)write(STDERR_FILENO, res, sizeof(res) - 1);
abort();
return NULL;
}
That is utterly lame, overprotective mommy-ism. Besides, it stupidly
assumes that stderr has not been closed, as competent programmer does
in a daemon.

A competent programmer knows to suspect that a library is not thread
safe unless it explicitly says it is. More than that, you assume that
a library that keeps internal state like _res.options and res_init()
is not thread safe unless it both claims to be thread safe and you can't
find any evidence about problems.

And to omit any hint of the nonsense in the resolver man page!
There is this passing mention in resolve.h:
* Source and Binary compatibility; _res will not work properly
* with multi-threaded programs.
Because dccifd and dccm use threads
Let me know if there is something I can do to help.
Any suggestions on the least nasty kludge to link dccifd and dccm
to the libc resolver intead of the broken-by-design libpthread
resolver? I'd have to force not only the res_state hooks, but the
whole resolver edifice including anything called inside gethostby*().

Should I change the Makefiles to treat NetBSD like Windows and not build
dccifd and dccm under the toy-applications-for-toy-operating-systems rule?

Maybe I can arrange to not tweak the resolver timeouts for the threaded
DCC programs to limit the total DCC delays and so keep SpamAssassin and
MTAs from being unhappy.


Have I mentioned I'm becoming ever less enamored of recent versions of
NetBSD? The Linux experts only gratuitously, incompatibly changed the
names of the resolver hooks.


thanks any way,
Vernon Schryver ***@rhyolite.com
_______________________________________________
DCC mailing list ***@rhyolite.com
http://www.rhyolite.com/mailman/listinfo/dcc

----- End forwarded message -----
----- Forwarded message from Paul Vixie <***@isc.org> -----

Date: Sat, 06 Jun 2009 00:49:55 +0000
From: Paul Vixie <***@isc.org>
To: Vernon Schryver <***@calcite.rhyolite.com>
Cc: lists-***@cappella.us, ***@rhyolite.com

the strange part of all this is, there has been a thread safe superset
of libresolv for about 11 years. just call res_ninit instead of
res_init, and res_nsend instead of res_send. netbsd doesn't know this?
_______________________________________________
DCC mailing list ***@rhyolite.com
http://www.rhyolite.com/mailman/listinfo/dcc

----- End forwarded message -----
----- Forwarded message from Vernon Schryver <***@calcite.rhyolite.com> -----

Date: Sat, 6 Jun 2009 01:02:56 GMT
the strange part of all this is, there has been a thread safe superset
of libresolv for about 11 years. just call res_ninit instead of
res_init, and res_nsend instead of res_send. netbsd doesn't know this?
I didn't know that, but now that you mention it, it's in non-NetBSD
versions of res_state.c, and so probably in NetBSD versions.
I'd not trust it without reading the whole resolver library because
"thread" does appear in res_state.c. But there are plenty of
pthread.h and similar stains in other files in the
FreeBSD version of /usr/src/lib/libc/resolv/*


While I'm ranting about NetBSD, I wish they'd get with the program in
minor ways that wouldn't break anything. I can't think of an excuse
for the spews of compiler warnings like these:
sign.c:77: warning: pointer targets in passing argument 2 of 'MD5Update' differ in signedness
Declaring your MD5Update() to take a const void * instead of a const
unsigned char * should be done even before you tune it. It should be
done when you add "const" to the ancient code from the RFC 1321.


Vernon Schryver ***@rhyolite.com
_______________________________________________
DCC mailing list ***@rhyolite.com
http://www.rhyolite.com/mailman/listinfo/dcc

----- End forwarded message -----

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Christos Zoulas
2009-06-08 00:31:47 UTC
Permalink
In article <***@pintail.smokva.net>,
Petar Bogdanovic <***@smokva.net> wrote:

How many years will it take people to switch from res_foo to res_nfoo
which is re-entrant? Really, these functions have been there for
more than a decade. There is no excuse for a multi-threaded program
not to use them if they are available and instead use their own
API and do their own locking. What is next? Don't use the _r functions
and do your own locking? Add mutexes to protect a possibly non-thread-safe
malloc? Unless of course I am missing something, and if so I apologize.
99% of the programs that use _res in a multi-threaded environment do
so incorrectly, and it is a good thing to make them use the new API's.
Someone made a mistake exposing _res a *long* time ago. Let's not
perpetuate it by having new code try to use it.

christos
Post by Petar Bogdanovic
Hi,
attached is an excerpt from the [1]DCC mailing list. One of the dcc
daemons started to die unexpectedly because it used _res in conjunction
with threads which seems to be a weird combination on NetBSD even if you
/*
* Source and Binary compatibility; _res will not work properly
* with multi-threaded programs.
*/
extern struct __res_state *__res_state(void);
#define _res (*__res_state())
/*
* This is aliased via a macro to _res; don't allow multi-threaded programs
* to use it.
*/
res_state
__res_state(void)
{
static const char res[] = "_res is not supported for multi-threaded"
" programs.\n";
(void)write(STDERR_FILENO, res, sizeof(res) - 1);
abort();
return NULL;
}
At least that's how I interpreted the story so it would be nice if
somebody here could analyze/comment the issue.
Thank you,
Petar Bogdanovic
[1] http://www.rhyolite.com/dcc/
Date: Fri, 5 Jun 2009 21:52:05 GMT
As expected, this is the result of a kill(2) call.
#0 0xbbaf923f in kill () from /usr/lib/libc.so.12
#1 0xbbb95a64 in abort () from /usr/lib/libc.so.12
#2 0xbbbdc60c in __res_state () from /usr/lib/libpthread.so.0
#3 0x0806b9ca in dcc_res_delays (budget=4) at get_port.c:476
#4 0x080660f2 in dcc_clnt_rdy (emsg=0xb91fff50 "", ctxt=0x80c9000,
clnt_fgs=8 '\b') at clnt_send.c:1740
#5 0x080544dc in clnt_resolve_thread (arg=0x0) at clnt_threaded.c:394
#6 0xbbbe562d in pthread_join () from /usr/lib/libpthread.so.0
#7 0xbbb1aa2c in swapcontext () from /usr/lib/libc.so.12
Now that you mention it, I saw an instance of it a week or two ago,
but hoped it was a fluke. I've been unable to reproduce it then or today.
That's an ugly one, because it's not in my code.
if (!dcc_host_locked)
dcc_logbad(EX_SOFTWARE, "dcc_get_host() not locked");
/* get the current value */
if (!(_res.options & RES_INIT))
res_init();
dcc_logbad() calls abort() after syslog().
Because I assume the resolver is not thread safe and check that it's
locked, it can't be a simple, valid locking problem.
I guess I'll have to look for NetBSD's version of the resolver library
to see what NetBSD has done to it. There are no abort() calls in the
FreeBSD 7.1 version of res_state.c
Have I mentioned that I'm not a fan of the clean target in
the NetBSD bsd.prog.mk because it deletes .gdbinit?
thanks,
_______________________________________________
http://www.rhyolite.com/mailman/listinfo/dcc
----- End forwarded message -----
Date: Fri, 05 Jun 2009 16:16:23 -0700
As expected, this is the result of a kill(2) call.
#0 0xbbaf923f in kill () from /usr/lib/libc.so.12
#1 0xbbb95a64 in abort () from /usr/lib/libc.so.12
#2 0xbbbdc60c in __res_state () from /usr/lib/libpthread.so.0
#3 0x0806b9ca in dcc_res_delays (budget=4) at get_port.c:476
#4 0x080660f2 in dcc_clnt_rdy (emsg=0xb91fff50 "", ctxt=0x80c9000,
clnt_fgs=8 '\b') at clnt_send.c:1740
#5 0x080544dc in clnt_resolve_thread (arg=0x0) at clnt_threaded.c:394
#6 0xbbbe562d in pthread_join () from /usr/lib/libpthread.so.0
#7 0xbbb1aa2c in swapcontext () from /usr/lib/libc.so.12
Now that you mention it, I saw an instance of it a week or two ago,
but hoped it was a fluke. I've been unable to reproduce it then or today.
That's an ugly one, because it's not in my code.
if (!dcc_host_locked)
dcc_logbad(EX_SOFTWARE, "dcc_get_host() not locked");
/* get the current value */
if (!(_res.options& RES_INIT))
res_init();
dcc_logbad() calls abort() after syslog().
Because I assume the resolver is not thread safe and check that it's
locked, it can't be a simple, valid locking problem.
I guess I'll have to look for NetBSD's version of the resolver library
to see what NetBSD has done to it. There are no abort() calls in the
FreeBSD 7.1 version of res_state.c
I don't see anything that immediately jumps out between changes in 4.0 and
4.0.1, but I'm long since familiar with libc code.
libpthread/res_state.c
/*
* This is aliased via a macro to _res; don't allow multi-threaded programs
* to use it.
*/
res_state
__res_state(void)
{
static const char res[] = "_res is not supported for multi-threaded"
" programs.\n";
(void)write(STDERR_FILENO, res, sizeof(res) - 1);
abort();
return NULL;
}
#include <sys/cdefs.h>
#if defined(LIBC_SCCS) && !defined(lint)
__RCSID("$NetBSD: res_state.c,v 1.5.10.1 2007/05/17 21:25:19 jdc Exp $");
#endif
#include <sys/types.h>
#include <arpa/inet.h>
#include <arpa/nameser.h>
#include <netdb.h>
#include <resolv.h>
struct __res_state _nres
# if defined(__BIND_RES_TEXT)
= { .retrans = RES_TIMEOUT, } /*%< Motorola, et al. */
# endif
;
res_state __res_get_state_nothread(void);
void __res_put_state_nothread(res_state);
#ifdef __weak_alias
__weak_alias(__res_get_state, __res_get_state_nothread)
__weak_alias(__res_put_state, __res_put_state_nothread)
/* Source compatibility; only for single threaded programs */
__weak_alias(__res_state, __res_get_state_nothread)
#endif
res_state
__res_get_state_nothread(void)
{
if ((_nres.options & RES_INIT) == 0 && res_ninit(&_nres) == -1) {
h_errno = NETDB_INTERNAL;
return NULL;
}
return &_nres;
}
$ ldd /var/dcc/libexec/dccifd
-lpthread.0 => /usr/lib/libpthread.so.0
-lm.0 => /usr/lib/libm387.so.0
-lm.0 => /usr/lib/libm.so.0
-lc.12 => /usr/lib/libc.so.12
Let me know if there is something I can do to help.
Have I mentioned that I'm not a fan of the clean target in
the NetBSD bsd.prog.mk because it deletes .gdbinit?
Oh, boy. They really mean *squeaky clean*.
Thanks for your time,
Mike
_______________________________________________
http://www.rhyolite.com/mailman/listinfo/dcc
----- End forwarded message -----
Date: Sat, 6 Jun 2009 00:17:27 GMT
static const char res[] = "_res is not supported for multi-threaded"
" programs.\n";
(void)write(STDERR_FILENO, res, sizeof(res) - 1);
abort();
return NULL;
}
That is utterly lame, overprotective mommy-ism. Besides, it stupidly
assumes that stderr has not been closed, as competent programmer does
in a daemon.
A competent programmer knows to suspect that a library is not thread
safe unless it explicitly says it is. More than that, you assume that
a library that keeps internal state like _res.options and res_init()
is not thread safe unless it both claims to be thread safe and you can't
find any evidence about problems.
And to omit any hint of the nonsense in the resolver man page!
* Source and Binary compatibility; _res will not work properly
* with multi-threaded programs.
Because dccifd and dccm use threads
Let me know if there is something I can do to help.
Any suggestions on the least nasty kludge to link dccifd and dccm
to the libc resolver intead of the broken-by-design libpthread
resolver? I'd have to force not only the res_state hooks, but the
whole resolver edifice including anything called inside gethostby*().
Should I change the Makefiles to treat NetBSD like Windows and not build
dccifd and dccm under the toy-applications-for-toy-operating-systems rule?
Maybe I can arrange to not tweak the resolver timeouts for the threaded
DCC programs to limit the total DCC delays and so keep SpamAssassin and
MTAs from being unhappy.
Have I mentioned I'm becoming ever less enamored of recent versions of
NetBSD? The Linux experts only gratuitously, incompatibly changed the
names of the resolver hooks.
thanks any way,
_______________________________________________
http://www.rhyolite.com/mailman/listinfo/dcc
----- End forwarded message -----
Date: Sat, 06 Jun 2009 00:49:55 +0000
the strange part of all this is, there has been a thread safe superset
of libresolv for about 11 years. just call res_ninit instead of
res_init, and res_nsend instead of res_send. netbsd doesn't know this?
_______________________________________________
http://www.rhyolite.com/mailman/listinfo/dcc
----- End forwarded message -----
Date: Sat, 6 Jun 2009 01:02:56 GMT
the strange part of all this is, there has been a thread safe superset
of libresolv for about 11 years. just call res_ninit instead of
res_init, and res_nsend instead of res_send. netbsd doesn't know this?
I didn't know that, but now that you mention it, it's in non-NetBSD
versions of res_state.c, and so probably in NetBSD versions.
I'd not trust it without reading the whole resolver library because
"thread" does appear in res_state.c. But there are plenty of
pthread.h and similar stains in other files in the
FreeBSD version of /usr/src/lib/libc/resolv/*
While I'm ranting about NetBSD, I wish they'd get with the program in
minor ways that wouldn't break anything. I can't think of an excuse
sign.c:77: warning: pointer targets in passing argument 2 of
'MD5Update' differ in signedness
Declaring your MD5Update() to take a const void * instead of a const
unsigned char * should be done even before you tune it. It should be
done when you add "const" to the ancient code from the RFC 1321.
_______________________________________________
http://www.rhyolite.com/mailman/listinfo/dcc
----- End forwarded message -----
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Vernon Schryver
2009-06-08 14:25:54 UTC
Permalink
Post by Christos Zoulas
How many years will it take people to switch from res_foo to res_nfoo
which is re-entrant? Really, these functions have been there for
more than a decade. There is no excuse for a multi-threaded program
not to use them if they are available and instead use their own
API and do their own locking. What is next? Don't use the _r functions
and do your own locking? Add mutexes to protect a possibly non-thread-safe
malloc? Unless of course I am missing something, and if so I apologize.
99% of the programs that use _res in a multi-threaded environment do
so incorrectly, and it is a good thing to make them use the new API's.
Someone made a mistake exposing _res a *long* time ago. Let's not
perpetuate it by having new code try to use it.
It is crazy to expect people to use undocument facilities.
"res_init", "res_send", and so forth appear in `man 3 resolver` on
"NetBSD 4.0.1 (GENERIC) " but the strings "nres" and "thread" do
not appear even once.

Then there is the issue of how the resolver state structure set by
res_nfoo is supposed to be communicated to the resolver inside
gethostbyname(). Putting values into a per-thread, caller-maintained
structure does nothing unless the library code knows to use the per-thread
structure instead of the common global structure. Have you bothered
to check that changing values with res_nfoo has any effect? Maybe set
the retry limits 1 retransmission and 1 second and then seeing how soon
a failure happens when you try to resolve a domain whose authoritative
server does not answer at all? (I use timeo.rhyolite.com for such tests)

It is incompetent to convert _res into a run-time crash when correct or
less bad things to do are so easy and obvious. You could easily make
threaded programs not link if they use _res. Another tactic would be
to change the threaded version of the resolver library to use per-thread
_res structure, and to make the res_foo() functions in the threaded
resolver library call res_nfoo().

That you call abort() after writing to stderr in is emblematic of
the NetBSD problems. It would be dicey to call syslog(), but simply
assuming that stderr has not been long since closed is at best far
too naive for anyone allowed to touch a libc source tree.

As for malloc(), I don't know what other people do, but I use it only
with sufficient locking. It is incompetent to just assume that any
function that must obviously keep internal state is thread-safe without
explicit words in documentation. (Never mind that I avoid malloc()
in general.)
Besides, the crazy NetBSD philosophy would be to malloc() for
threaded applications into {printf(); abort();} and expect people
to know by mental telepathy to use an undocument function that in
fact does not work.

It is stupid egotism to change well documented names simply because
you can. For decades starting long before NetBSD, existed people
including myself have been doing all kinds of stuff under the covers
in system libraries including libc to preserve APIs. You only change
names as with localtime()/localtime_r() when the old API cannot be
maintained or when you want to offer an improved API.
If for some reason you can't offer a new name, you at least document
the problem. If you don't do that, you at least delete the old
documentation!

Of course on abstract grounds, nres_foo() is better than res_foo() just
as localtime_r() is better than localtime(). But only people who
shouldn't be allowed near a source tree would willfully and knowingly
break res_foo() without either documenting res_nfoo() or implementing
res_foo() as something like res_nfoo(pthread_getspecific())


Finally DO NOT WRITE ME about this stuff. I care only about the
code. I'm not interested in joining a djb style,
we-re-so-wonderful-because-we-tell-each-other-so cult.
The nature of NetBSD in this decade is clear regardless of this
_res silliness.


Vernon Schryver ***@rhyolite.com

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Christos Zoulas
2009-06-08 15:00:28 UTC
Permalink
On Jun 8, 2:25pm, ***@calcite.rhyolite.com (Vernon Schryver) wrote:
-- Subject: Re: [Fwd: Re: dccifd: restart after signal 6]

| > To: tech-***@netbsd.org
| > From: ***@astron.com (Christos Zoulas)
|
| > How many years will it take people to switch from res_foo to res_nfoo
| > which is re-entrant? Really, these functions have been there for
| > more than a decade. There is no excuse for a multi-threaded program
| > not to use them if they are available and instead use their own
| > API and do their own locking. What is next? Don't use the _r functions
| > and do your own locking? Add mutexes to protect a possibly non-thread-safe
| > malloc? Unless of course I am missing something, and if so I apologize.
| > 99% of the programs that use _res in a multi-threaded environment do
| > so incorrectly, and it is a good thing to make them use the new API's.
| > Someone made a mistake exposing _res a *long* time ago. Let's not
| > perpetuate it by having new code try to use it.
|
| It is crazy to expect people to use undocument facilities.
| "res_init", "res_send", and so forth appear in `man 3 resolver` on
| "NetBSD 4.0.1 (GENERIC) " but the strings "nres" and "thread" do
| not appear even once.

Yes, this is NetBSD's fault. The documentation is there, it just has
not been updated. I will fix it right now.

| Then there is the issue of how the resolver state structure set by
| res_nfoo is supposed to be communicated to the resolver inside
| gethostbyname(). Putting values into a per-thread, caller-maintained
| structure does nothing unless the library code knows to use the per-thread
| structure instead of the common global structure. Have you bothered
| to check that changing values with res_nfoo has any effect? Maybe set
| the retry limits 1 retransmission and 1 second and then seeing how soon
| a failure happens when you try to resolve a domain whose authoritative
| server does not answer at all? (I use timeo.rhyolite.com for such tests)

This is another obsolete interface and non-thread-safe interface that has
been replaced by gethostinfo().

| It is incompetent to convert _res into a run-time crash when correct or
| less bad things to do are so easy and obvious. You could easily make
| threaded programs not link if they use _res. Another tactic would be
| to change the threaded version of the resolver library to use per-thread
| _res structure, and to make the res_foo() functions in the threaded
| resolver library call res_nfoo().

That is harder to do, since the threaded program links in with libc
which needs the _res symbol defined.

| That you call abort() after writing to stderr in is emblematic of
| the NetBSD problems. It would be dicey to call syslog(), but simply
| assuming that stderr has not been long since closed is at best far
| too naive for anyone allowed to touch a libc source tree.

Well, I agree with you, but this is not the only function that prints
errors to stderr in libc; and if stderr is closed, you could easily
look at the backtrace in the debugger to find out what went wrong.

| As for malloc(), I don't know what other people do, but I use it only
| with sufficient locking. It is incompetent to just assume that any
| function that must obviously keep internal state is thread-safe without
| explicit words in documentation. (Never mind that I avoid malloc()
| in general.)

Most such functions have been replaced with thread-safe variants. Insisting
on calling the old API's from threaded contexts makes little sense. It
will cause unexpected behavior, and poor performance.

| Besides, the crazy NetBSD philosophy would be to malloc() for
| threaded applications into {printf(); abort();} and expect people
| to know by mental telepathy to use an undocument function that in
| fact does not work.

This is a different case. It is a fine line we've drawn but never-the-less
it is a line. Using an API that we've been trying to make obsolete for
more than 10 years that exposes the guts of libc is one thing, using an
API that is still valid is something different. I challenge you to find
another OS where a binary created on that OS more 10 years ago will work
unmodified in the current version of that OS and uses _res. Yes, we have
kept binary compatibility of _res since that long ago. But we have not
allowed _res to work in multi-threaded programs since we imported the
new multi-threaded aware resolver api's. Using _res from a multi-threaded
program has never worked on NetBSD.
_res directly

| It is stupid egotism to change well documented names simply because
| you can. For decades starting long before NetBSD, existed people
| including myself have been doing all kinds of stuff under the covers
| in system libraries including libc to preserve APIs. You only change
| names as with localtime()/localtime_r() when the old API cannot be
| maintained or when you want to offer an improved API.
| If for some reason you can't offer a new name, you at least document
| the problem. If you don't do that, you at least delete the old
| documentation!

Yes, I will fix the documentation.

| Of course on abstract grounds, nres_foo() is better than res_foo() just
| as localtime_r() is better than localtime(). But only people who
| shouldn't be allowed near a source tree would willfully and knowingly
| break res_foo() without either documenting res_nfoo() or implementing
| res_foo() as something like res_nfoo(pthread_getspecific())

Ask the libbind maintainers to do this.

| Finally DO NOT WRITE ME about this stuff. I care only about the
| code. I'm not interested in joining a djb style,
| we-re-so-wonderful-because-we-tell-each-other-so cult.
| The nature of NetBSD in this decade is clear regardless of this
| _res silliness.

I did not write you. I replied to a question on why _res aborts in
multi-threaded programs in the mailing list.

christos

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Petar Bogdanovic
2009-06-08 15:11:34 UTC
Permalink
Post by Christos Zoulas
| Finally DO NOT WRITE ME about this stuff. I care only about the
| code. I'm not interested in joining a djb style,
| we-re-so-wonderful-because-we-tell-each-other-so cult.
| The nature of NetBSD in this decade is clear regardless of this
| _res silliness.
I did not write you. I replied to a question on why _res aborts in
multi-threaded programs in the mailing list.
I apologize, that was me. I thought you forgot to cc him, and it didn't
come to my mind, that the action could provoke such an outrage. Sorry!



Petar Bogdanovic




--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Laight
2009-06-08 21:48:09 UTC
Permalink
Post by Christos Zoulas
| That you call abort() after writing to stderr in is emblematic of
| the NetBSD problems. It would be dicey to call syslog(), but simply
| assuming that stderr has not been long since closed is at best far
| too naive for anyone allowed to touch a libc source tree.
Well, I agree with you, but this is not the only function that prints
errors to stderr in libc; and if stderr is closed, you could easily
look at the backtrace in the debugger to find out what went wrong.
And they should all be nuked ....

Many years ago we spent ages locating a problem that was actually
a splurious printf() call (it should have been an sprintf() call).
Since the generated data disn't include a '\n', and stdout is line
buffered, nothing untoward happenend until the code had run enough
times to fill the stdio buffer.
Then the data buffer was written to fd 1.
The program had earlier call close(1), and was reusing fd 1 as a
control pipe to another daemon - the text broke the protocol ....

Ok, if the program had done fclose(stderr) instead of close(1)
then it wouldn't have been a problem - but many daemons will
just do close(0); close(1); close(2); and assume that because
they don't access stdin, stdout or stderr that nothing else will.

David
--
David Laight: ***@l8s.co.uk

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Loading...