shutdown(2)'ing a bound UDP socket

Discussion:

Peter J. Philipp

2020-07-15 12:00:38 UTC

Hi,

I'm the author of delphinusdnsd, a lightweight dns server. I develop on
OpenBSD but produce ports to Linux, FreeBSD and NetBSD. My latest code I
have ported to Linux and FreeBSD successfully, but NetBSD is not working.

What I do in my code is bind (with the SO_REUSEPORT option) two UDP descriptors
to the same port and shutdown(2) one of them (called dup) for reading
(SHUT_RD). This lets me read from the non-shut descriptor but send packets on
either; it works out well on OpenBSD. However, while NetBSD does allow shutting
down the descriptor (unlike FreeBSD, which seems to have other code to deal
with that problem), it still wants to deliver incoming queries to the shut
descriptor. I get one answer from my server on NetBSD and then it blocks. I
tried patching this in the kernel but it seems to be over my head; I'm doing
something wrong. Basically the socket should get the SS_CANTRCVMORE state, but
checking for this seems to be hard, plus I don't know what I'm doing in the
NetBSD kernel.

So I'm basically left begging someone to fix this, so that delivery skips
bound sockets that were shutdown(2) for reading and lets the ones that do read
receive the packet.

Otherwise this may be my last year of supporting NetBSD, unfortunately. I
would like to give a donation but I'm dirt poor; as a gesture I can maybe
afford five euros or something, but can't find more. I donated five USD
before in 2018, if that's worth anything. I'm releasing version 1.5.0 between
September 2020 and November 2020, and I hope to continue NetBSD support, if
only in -current.

If you need to see my code to see what I'm doing you can get it at
https://delphinusdns.org/download/snapshot/delphinusdnsd-snapshot.tgz and the
relevant lines of code are in delphinusdnsd.c (main()) and go further into
forward.c. If you need a config file for the forwarding mode I can produce
you one on request.

Please CC me directly as I'm not on the tech-***@netbsd.org list.

Best Regards,
-peter

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Roy Marples

2020-07-16 09:24:16 UTC

Hi Peter

Do you have a small test case for this that can reproduce the issue?

Roy


Peter J. Philipp

2020-07-16 09:35:50 UTC

Post by Roy Marples
Hi Peter

Do you have a small test case for this that can reproduce the issue?
Roy

No I don't. I'll construct something but it may take a few days. I'll
contact you back.

Best Regards,
-peter


Peter J. Philipp

2020-07-19 19:39:29 UTC

Post by Roy Marples
Hi Peter

Hi Roy and tech-net,

Thank you for the patient wait. I have finally written something and was
able to reproduce the condition tonight. I'm going to paste the program
inline here and then talk a little below it. Just search for "---" to
skip the code to see the commentary...

------->
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>

#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>

#include <err.h>
#include <errno.h>

int bind_socketv6(int so, u_short port);
int bind_socketv4(int so, u_short port);

int
main(int argc, char *argv[])
{
	struct sockaddr_in sin;
	struct sockaddr_in6 sin6;
	struct timeval tv;

	int so, so6, dup, dup6, on = 1;
	int max, len, sel;

	u_short port = 65053;
	fd_set rdset;
	pid_t pid;
	socklen_t slen;

	char buf[512];

	if (argc > 1)
		port = atoi(argv[1]);

	/* make the dup's */

	dup = socket(AF_INET, SOCK_DGRAM, 0);
	dup6 = socket(AF_INET6, SOCK_DGRAM, 0);

	if (dup < 0 || dup6 < 0) {
		err(1, "socket");
	}

	on = 1;
	if (setsockopt(dup, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) < 0) {
		err(1, "setsockopt");
	}

	on = 1;
	if (setsockopt(dup6, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) < 0) {
		err(1, "setsockopt6");
	}

	/* shutdown on the dup's */

	if (shutdown(dup, SHUT_RD) < 0) {
		err(1, "shutdown dup");
	}

	if (shutdown(dup6, SHUT_RD) < 0) {
		err(1, "shutdown dup6");
	}

	/* bind the dup's too */

	if (bind_socketv4(dup, port) < 0) {
		err(1, "dup bind");
	}

	if (bind_socketv6(dup6, port) < 0) {
		err(1, "dup bind6");
	}

	/* the main sockets */

	so = socket(AF_INET, SOCK_DGRAM, 0);
	so6 = socket(AF_INET6, SOCK_DGRAM, 0);

	if (so < 0 || so6 < 0) {
		err(1, "socket");
	}

	on = 1;
	if (setsockopt(so, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) < 0) {
		err(1, "setsockopt");
	}

	on = 1;
	if (setsockopt(so6, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on)) < 0) {
		err(1, "setsockopt6");
	}

	if (bind_socketv4(so, port) < 0) {
		err(1, "bind");
	}

	if (bind_socketv6(so6, port) < 0) {
		err(1, "bind6");
	}

	/* fork a child */

	switch (pid = fork()) {
	case -1:
		err(1, "fork");
		break;
	case 0:
		/* child: keeps only the SHUT_RD dup's and writes on them */
		close(so); close(so6);
		for (;;) {
			/* here we can write to the dup's if we wish */
			sleep(10);
			memset(buf, 'X', 16);
			memset(&sin, 0, sizeof(sin));
			sin.sin_family = AF_INET;
			sin.sin_port = htons(8888);
			sin.sin_addr.s_addr = inet_addr("192.168.177.2");
			sendto(dup, buf, 16, 0, (struct sockaddr *)&sin,
			    sizeof(struct sockaddr_in));
		}
		/* NOTREACHED */
		break;
	default:
		/* parent: keeps only the readable sockets */
		close(dup); close(dup6);
		max = (so > so6) ? so : so6;
		break;
	}

	for (;;) {
		FD_ZERO(&rdset);
		FD_SET(so, &rdset);
		FD_SET(so6, &rdset);

		tv.tv_sec = 5;
		tv.tv_usec = 0;

		if ((sel = select(max + 1, &rdset, NULL, NULL, &tv)) < 0) {
			fprintf(stderr, "select error: %s\n", strerror(errno));
			continue;
		}

		if (sel == 0) {
			continue;
		}

		if (FD_ISSET(so, &rdset)) {
			slen = sizeof(struct sockaddr_in);
			if ((len = recvfrom(so, buf, sizeof(buf), 0,
			    (struct sockaddr *)&sin, &slen)) < 0) {
				warn("recvfrom");
				continue;	/* nothing to echo back */
			}
			/* time_t has no portable printf format, so cast it */
			printf("%lld so read %d bytes\n",
			    (long long)time(NULL), len);

			/* send something back */
			if (sendto(so, buf, len, 0, (struct sockaddr *)&sin,
			    slen) < 0)
				warn("sendto");
			continue;
		} else if (FD_ISSET(so6, &rdset)) {
			slen = sizeof(struct sockaddr_in6);
			if ((len = recvfrom(so6, buf, sizeof(buf), 0,
			    (struct sockaddr *)&sin6, &slen)) < 0) {
				warn("recvfrom");
				continue;
			}
			printf("%lld so6 read %d bytes\n",
			    (long long)time(NULL), len);
			continue;
		}

	} /* for(); */
	/* NOTREACHED */
}

int
bind_socketv4(int so, u_short port)
{
	struct sockaddr_in sin;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_port = htons(port);

	return (bind(so, (struct sockaddr *)&sin, sizeof(sin)));
}

int
bind_socketv6(int so, u_short port)
{
	struct sockaddr_in6 sin6;

	memset(&sin6, 0, sizeof(sin6));
	sin6.sin6_family = AF_INET6;
	sin6.sin6_port = htons(port);
	sin6.sin6_len = sizeof(struct sockaddr_in6);

	return (bind(so, (struct sockaddr *)&sin6, sizeof(sin6)));
}
<-------

So as you can see the program binds to port 65053 by default; I ran it as
./test-program 8888 to have it bind to port 8888, but it doesn't really
matter which port.

The dup descriptors go into the child, which sends a packet every 10 seconds
to 192.168.177.2 (my OpenBSD workstation, but it can be any address).

What I have found is that with netcat from any source I can get the select
loop in the parent to display the UNIX timestamp and "so read X bytes",
UNTIL the child in the for(;;)/sleep(10) loop transmits. After that the
parent won't receive any more. And this is exactly what I'm running into in
my program delphinusdnsd.

I hope you can work with this; let me know if I must do anything to accommodate
some sort of fix. The behaviour I'm looking for is that one can write on
the SHUT_RD socket (dup) and still read, without it blocking, on the parent
socket (so).

Best Regards,
-peter


Peter J. Philipp

2020-07-20 18:56:30 UTC

Hi Erik,

Post by Erik Fair
Unless I have misunderstood (which is certainly possible), the question turns into: “is a shutdown(sock, SHUT_RD) local to that particular descriptor, or global to all descriptors which reference a particular host address+protocol+port number tuple?”

The shutdown'ed descriptor is global I think. It is able to bind only because
of the SO_REUSEPORT setsockopt. Usually this is done to give forked children
an opportunity to receive a packet after some algorithm (I'm guessing it's
a round-robin alg or similar). By shutting down one of these I hope to tell
the kernel that I want it excluded from this algorithm, and to let the other
global "tuple" receive that packet.

Post by Erik Fair
Which is to say, you want to declare on one descriptor that you’re never going to read from it again, but read from another descriptor at the same network address+protocol+port number tuple, i.e., that a shutdown(sock, SHUT_RD) should be local to a given specific descriptor, as opposed to global to all descriptors which reference a given IP address, protocol, port number tuple.

Yes. It wasn't a decision that was planned, I felt my way through the
OpenBSD network stack in this regard and it did what I had hoped.
Unfortunately Linux required a bpf filter and a different order of setting up
these sockets in my tests, because this wasn't possible out of the box there.
With FreeBSD I didn't have to shutdown the descriptor, it seemed to detect
what descriptor in the global tuple I was selecting on and directed the packet
in that direction, but this is just a conclusion I made after 3 or 4 hours
max. After adding some ifdef's FreeBSD seemed to not drop a packet.

Post by Erik Fair
Why are you doing this dance of multiple descriptors? What behavior are you trying to achieve, or condition you’re trying to avoid, in your server code?

It is unorthodox but made some sense in that I had hoped for an opportunity
to write from a "global tuple" from other processes than the one that receives.
Other choices I considered were writing to a raw socket but it would be a lot
of overhead and hard work when it comes to fragmentation. Lastly I found that
I can set up shared memory to write the packet back to the process that
received it; that's entirely possible, at the cost of writing the code.
This last choice is what I'll have to do if NetBSD can't help me.

And then why, you probably mean, am I spreading this functionality across
processes. It had to do with OpenBSD and their sandbox'ing mechanism called
pledge(2). The benefit of writing packets via imsg(3) and shared memory into
a process that is "stdio sendfd recvfd" pledged in order to parse a DNS
message is very attractive.

If someone managed to overflow a buffer, somehow they'd be trapped to a very
restricted sandbox. They can't open a file descriptor or open a network socket,
the kernel would kill the process if they tried. I use and develop this
daemon of mine on OpenBSD but I want to also make it available to other
OS's with the sandbox mechanism disabled. It's a trade-off for those people
that absolutely want to use delphinusdnsd but don't have OpenBSD available.

The other process that does the "working" of the forwarding is busy enough
that I would shy away from putting it into the UDP receiving process. I
write all my programs single-threaded and fork instead of using threads,
mainly because I don't understand threads all that much.

It comes down to "what to do with shutdown(2)" and it's as much a political as
a technical decision. I think the way OpenBSD allows this makes sense, and it
accommodated my line of planning on this. FreeBSD and Linux are less
accommodating: I had to override Linux's natural algorithm with a BPF
filter, and on FreeBSD I got lucky... (but was not able to shutdown at all,
it throws an error).

curious,
Erik Fair

Best Regards,
-peter


Roy Marples

2020-07-20 19:53:52 UTC

Post by Peter J. Philipp
It is unorthodox but made some sense in that I had hoped for an opportunity
to write from a "global tuple" from other processes than the one that receives.
Other choices I considered were writing to a raw socket but it would be a lot
of overhead and hard work when it comes to fragmentation. Lastly I found that
I can set up shared memory to write the packet back to the process that
received it; that's entirely possible, at the cost of writing the code.
This last choice is what I'll have to do if NetBSD can't help me.
And then why, you probably mean, am I spreading this functionality across
processes. It had to do with OpenBSD and their sandbox'ing mechanism called
pledge(2). The benefit of writing packets via imsg(3) and shared memory into
a process that is "stdio sendfd recvfd" pledged in order to parse a DNS
message is very attractive.

dhcpcd recently gained both pledge and capsicum support.
I faced a similar issue and elected to read from the bound socket in an unpriv
chrooted process but write via a raw socket from a privileged process.
I can shutdown SHUT_RD the raw socket as it's never read from and this works
wonderfully on all OS.

Interestingly, the raw socket approach actually removed a lot more code than it
added because I no longer needed to play "guess the source address" in a
specific way of using dhcpcd. This was also required to finish the capsicum
support on FreeBSD - like OpenBSD's pledge.

Post by Peter J. Philipp
If someone managed to overflow a buffer, somehow they'd be trapped to a very
restricted sandbox. They can't open a file descriptor or open a network socket,
the kernel would kill the process if they tried. I use and develop this
daemon of mine on OpenBSD but I want to also make it available to other
OS's with the sandbox mechanism disabled. It's a trade-off for those people
that absolutely want to use delphinusdnsd but don't have OpenBSD available.

Aside from pledge and capsicum, there is also the resource limited sandbox which
works well on NetBSD and DragonFlyBSD - but ironically not at all on OpenBSD due
to a limitation with their ppoll(2) interface.
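A minimal sketch of what such a sandbox might look like, assuming that "resource limited" here means lowering RLIMIT_NOFILE to zero with setrlimit(2) so the process can keep using the descriptors it already holds but cannot create new ones. The function names are hypothetical and this is not taken from dhcpcd's source:

```c
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <netinet/in.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: forbid the creation of any new descriptor.
 * Descriptors that are already open keep working.
 */
int
limit_new_descriptors(void)
{
	struct rlimit rl;

	rl.rlim_cur = 0;
	rl.rlim_max = 0;
	return (setrlimit(RLIMIT_NOFILE, &rl));
}

/*
 * Demonstration, run in a forked child so the caller stays unrestricted.
 * Returns 1 if, after limiting, open(2) and socket(2) both fail while a
 * write(2) on an already open descriptor still succeeds.
 */
int
sandbox_blocks_new_descriptors(void)
{
	int pfd[2], status, ok;
	pid_t pid;

	if (pipe(pfd) < 0)
		return (-1);

	if ((pid = fork()) < 0) {
		close(pfd[0]);
		close(pfd[1]);
		return (-1);
	}

	if (pid == 0) {
		if (limit_new_descriptors() < 0)
			_exit(2);
		ok = open("/dev/null", O_RDONLY) < 0 &&
		    socket(AF_INET, SOCK_DGRAM, 0) < 0 &&
		    write(pfd[1], "x", 1) == 1;
		_exit(ok ? 0 : 1);
	}

	close(pfd[0]);
	close(pfd[1]);
	if (waitpid(pid, &status, 0) < 0)
		return (-1);
	return (WIFEXITED(status) && WEXITSTATUS(status) == 0);
}
```

Unlike pledge(2), nothing kills the process here; the forbidden calls simply fail, so the sandboxed code has to have every descriptor it needs before the limit is applied.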

Roy


Mouse

2020-07-20 20:03:03 UTC

Post by Peter J. Philipp
The shutdown'ed descriptor is global I think.

No; shutdown() affects the socket, not the descriptor and not the
address/port pair. You are a very unusual case, having multiple
unconnected sockets on the same address/port pair.

Post by Peter J. Philipp
Usually this is done to give forked children an opportunity to receive
a packet after some algorithm (I'm guessing it's a round-robin alg or
similar). By shutting down one of these I hope to tell the kernel that
I want it excluded from this algorithm, and to let the other global
"tuple" receive that packet.

Why not use a single socket with descriptors open in multiple
processes? I don't know full details of how you're starting up, so
this might be difficult, but it seems to me that you want one read
descriptor and multiple write descriptors for a given address/port. To
me, this sounds perfect for a single socket with multiple fds on it:
just treat most of them as read-only. (Ideally, you should be able to
turn off write access at the descriptors without shutting it off for
the socket, but the obvious way to do that (fcntl F_SETFL turning off
FWRITE) probably won't work - it won't as of 5.2, and I'd be surprised
if anyone had made it work since then.)

To create it, the simplest way is to create the socket before you fork.
That may or may not be feasible in your case.
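A minimal sketch of that approach, assuming a loopback socket on a kernel-chosen port and a child that sends a single datagram back to the socket's own address so the parent has something to read. The function names are hypothetical and none of this is delphinusdnsd code:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/* Create one UDP socket on 127.0.0.1; the chosen port goes into *port. */
int
make_shared_socket(unsigned short *port)
{
	struct sockaddr_in sin;
	struct timeval tv = { 2, 0 };	/* do not block forever in the demo */
	socklen_t slen = sizeof(sin);
	int so;

	if ((so = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
		return (-1);

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	sin.sin_port = 0;		/* let the kernel pick a port */

	if (setsockopt(so, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0 ||
	    bind(so, (struct sockaddr *)&sin, sizeof(sin)) < 0 ||
	    getsockname(so, (struct sockaddr *)&sin, &slen) < 0) {
		close(so);
		return (-1);
	}
	*port = ntohs(sin.sin_port);
	return (so);
}

/*
 * The socket is created before fork(), so parent and child hold
 * descriptors for the same socket. The child only writes, the parent
 * only reads. Returns the number of bytes the parent read, or -1.
 */
ssize_t
demo_split_reader_writer(char *buf, size_t buflen)
{
	struct sockaddr_in dst;
	unsigned short port;
	ssize_t n;
	pid_t pid;
	int so;

	if ((so = make_shared_socket(&port)) < 0)
		return (-1);

	switch (pid = fork()) {
	case -1:
		close(so);
		return (-1);
	case 0:
		/* child: write-only use of the inherited descriptor */
		memset(&dst, 0, sizeof(dst));
		dst.sin_family = AF_INET;
		dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
		dst.sin_port = htons(port);
		(void)sendto(so, "reply", 5, 0, (struct sockaddr *)&dst,
		    sizeof(dst));
		_exit(0);
	default:
		break;
	}

	/* parent: the only process that ever calls recvfrom() */
	n = recvfrom(so, buf, buflen, 0, NULL, NULL);
	(void)waitpid(pid, NULL, 0);
	close(so);
	return (n);
}
```

Because there is only one socket bound to the address/port pair, the kernel has no second socket to deliver to, so neither SO_REUSEPORT nor shutdown(2) is involved.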

Post by Peter J. Philipp
[...why...] It had to do with OpenBSD and their sandbox'ing
mechanism called pledge(2).

I know nothing of pledge(2) beyond what I can infer from your mail,
but, if it breaks the "single socket with multiple descriptors"
approach, maybe you could do that only when pledge is unavailable?

Post by Peter J. Philipp
I write all my programs single-threaded and fork instead of using
threads, mainly because I don't understand threads all that much.

Yeah, UNIX-and-C is a bad system for writing threaded code. Things
like pthreads kinda-sorta make it work anyway, but it's never been a
good fit.

I'm not sure what the best fix is.

Post by Peter J. Philipp
It comes down to "what to do with shutdown(2)" and it's as much
political as technical decision. I think the way OpenBSD allows this
makes sense, [...]

Given that it supports SO_REUSEPORT at all, I suppose that behaviour
makes as much sense as any other. I'm not sure what I think of
SO_REUSEPORT. It breaks the naming paradigm sockets were built around;
I think either it should be eliminated or the whole naming scheme (and
probably parts of the API) should be torn down and rebuilt in a way
that makes sense in the presence of multiple sockets on a given
address/port. (With luck, that'd even permit getting rid of sin_zero!)

If we had a properly designed API, it would just be a question of
checking whether some OSes are out of spec or whether there is code
depending on behaviour not promised by the API. But the socket API
wasn't really designed; it was more like `accreted', and it shows.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B


Roy Marples

2020-07-20 20:49:04 UTC

Post by Peter J. Philipp
Other choices I considered were writing to a raw socket but it would be a lot
of overhead and hard work when it comes to fragmentation.

Why are you concerned about fragmentation?
Provided you don't set INP_HDRINCL then the kernel will include relevant IP
headers and handle fragmentation for you (at least that's my understanding
looking at NetBSD's code).

The only extra overhead you should have is writing the UDP header which is
pretty straight forward.
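A rough sketch of that idea for IPv4, assuming the raw socket is opened with IPPROTO_UDP and the IP_HDRINCL socket option (the userland counterpart of the kernel's INP_HDRINCL flag) is left unset. The struct and function names are hypothetical, the checksum is left at zero (which UDP over IPv4 permits, meaning "no checksum"), and opening the raw socket normally needs root:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Own definition of the 8-byte UDP header, to avoid per-OS field names. */
struct udp_hdr {
	uint16_t sport;
	uint16_t dport;
	uint16_t len;
	uint16_t sum;
};

/* Write UDP header plus payload into out; returns total length or 0. */
size_t
build_udp_datagram(unsigned char *out, size_t outlen, uint16_t sport,
    uint16_t dport, const void *payload, size_t plen)
{
	struct udp_hdr uh;

	if (plen > 65535 - sizeof(uh) || outlen < sizeof(uh) + plen)
		return (0);

	uh.sport = htons(sport);
	uh.dport = htons(dport);
	uh.len = htons((uint16_t)(sizeof(uh) + plen));
	uh.sum = 0;	/* 0 means "no checksum"; valid for IPv4 only */

	memcpy(out, &uh, sizeof(uh));
	memcpy(out + sizeof(uh), payload, plen);
	return (sizeof(uh) + plen);
}

/*
 * Send one UDP datagram through a raw socket. IP_HDRINCL is not set,
 * so the kernel builds the IP header and fragments if necessary.
 */
ssize_t
raw_udp_send(const char *dst_ip, uint16_t sport, uint16_t dport,
    const void *payload, size_t plen)
{
	unsigned char pkt[sizeof(struct udp_hdr) + 4096];
	struct sockaddr_in dst;
	size_t len;
	ssize_t n;
	int raw;

	len = build_udp_datagram(pkt, sizeof(pkt), sport, dport, payload, plen);
	if (len == 0)
		return (-1);

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	if (inet_pton(AF_INET, dst_ip, &dst.sin_addr) != 1)
		return (-1);

	if ((raw = socket(AF_INET, SOCK_RAW, IPPROTO_UDP)) < 0)
		return (-1);	/* usually EPERM/EACCES without privileges */

	(void)shutdown(raw, SHUT_RD);	/* this socket is never read from */
	n = sendto(raw, pkt, len, 0, (struct sockaddr *)&dst, sizeof(dst));
	close(raw);
	return (n);
}
```

A real server would presumably keep the raw socket open in the privileged process instead of opening it per packet, and would fill in the checksum rather than leave it at zero.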

Roy


Erik Fair

2020-07-20 17:20:20 UTC

Unless I have misunderstood (which is certainly possible), the question turns into: “is a shutdown(sock, SHUT_RD) local to that particular descriptor, or global to all descriptors which reference a particular host address+protocol+port number tuple?”

Which is to say, you want to declare on one descriptor that you’re never going to read from it again, but read from another descriptor at the same network address+protocol+port number tuple, i.e., that a shutdown(sock, SHUT_RD) should be local to a given specific descriptor, as opposed to global to all descriptors which reference a given IP address, protocol, port number tuple.

Why are you doing this dance of multiple descriptors? What behavior are you trying to achieve, or condition you’re trying to avoid, in your server code?

curious,

Erik Fair


Mouse

2020-07-20 19:35:26 UTC

Unless I have misunderstood (which is certainly possible), the question turn$

(Please don't use paragraph-length lines.)

Neither, I think.

It is local to that particular socket. Like an open plain file, it is
possible to have multiple descriptors referring to a single socket.

In the case at hand, SO_REUSEPORT is allowing the creation of two
distinct sockets on the same address/port, only one of which is subject
to the shutdown().
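The socket/descriptor distinction can be illustrated with a small sketch (hypothetical function names, loopback only): a descriptor made with dup(2) refers to the same socket, so a bind done through one is visible through the other, while a second socket(2) call yields an independent socket.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/* Bind fd to 127.0.0.1 on a port chosen by the kernel. */
static int
bind_loopback_any_port(int fd)
{
	struct sockaddr_in sin;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	return (bind(fd, (struct sockaddr *)&sin, sizeof(sin)));
}

/* Local port of the socket behind fd in host order, or -1 on error. */
int
local_port(int fd)
{
	struct sockaddr_in sin;
	socklen_t slen = sizeof(sin);

	if (getsockname(fd, (struct sockaddr *)&sin, &slen) < 0)
		return (-1);
	return (ntohs(sin.sin_port));
}

/* Returns 1 if a bind done via one descriptor shows up on its dup(). */
int
dup_shares_socket(void)
{
	int a, b, same = 0;

	if ((a = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
		return (-1);
	if ((b = dup(a)) < 0) {
		close(a);
		return (-1);
	}
	if (bind_loopback_any_port(a) == 0)
		same = local_port(a) > 0 && local_port(a) == local_port(b);
	close(a);
	close(b);
	return (same);
}

/* Returns 1 if a second socket() is unaffected by binding the first. */
int
separate_sockets_are_independent(void)
{
	int a, c, indep = 0;

	if ((a = socket(AF_INET, SOCK_DGRAM, 0)) < 0)
		return (-1);
	if ((c = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
		close(a);
		return (-1);
	}
	if (bind_loopback_any_port(a) == 0)
		indep = local_port(a) > 0 && local_port(c) == 0;
	close(a);
	close(c);
	return (indep);
}
```

The same reasoning applies to shutdown(2): it acts on the socket, so it would be seen through every descriptor that refers to that socket, but not through a separate socket that merely shares the address/port via SO_REUSEPORT.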

Why are you doing this dance of multiple descriptors? What behavior are you $

Yeah, that'd be my question too. Why not just use one socket with two
descriptors on it and just never read from one of them?

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B


Peter J. Philipp

2020-07-21 05:51:34 UTC

Post by Roy Marples

Post by Peter J. Philipp
Other choices I considered were writing to a raw socket but it would be a lot
of overhead and hard work when it comes to fragmentation.

Why are you concerned about fragmentation?
Provided you don't set INP_HDRINCL then the kernel will include relevant IP
headers and handle fragmentation for you (at least that's my understanding
looking at NetBSD's code).
The only extra overhead you should have is writing the UDP header which is
pretty straight forward.
Roy

Hi Roy,

OH! I hadn't considered a raw socket without HDRINCL; then it may indeed
be worth it. I'll tinker with this when I find some time and let you know
how it goes. It may take some days.

Thanks!

-peter


Peter J. Philipp

2020-07-21 19:02:02 UTC

Post by Roy Marples

Post by Peter J. Philipp
Other choices I considered were writing to a raw socket but it would be a lot
of overhead and hard work when it comes to fragmentation.

Why are you concerned about fragmentation?
Provided you don't set INP_HDRINCL then the kernel will include relevant IP
headers and handle fragmentation for you (at least that's my understanding
looking at NetBSD's code).
The only extra overhead you should have is writing the UDP header which is
pretty straight forward.
Roy

OK, I've made the changes, and it works awesome! Thank you! I spent most of
the day fine-tuning the UDP checksumming. It was developed on OpenBSD, and
when I ported it to Linux it almost worked out of the box, except that Linux
does a weird thing with the protocol set in sin6.sin6_port. But I figured
it out. Next I tested NetBSD and it worked perfectly. Last I tested FreeBSD.
No regressions. This is awesome! I wasn't sure if you wanted your name in the
credits, so I credited the NetBSD organization in my CVS. I'm gonna see if I
can get someone to donate to NetBSD as well.
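For readers following along, the UDP checksum over the IPv4 pseudo-header can be sketched as below. This is a generic illustration of the standard Internet checksum, not the code that went into delphinusdnsd, and the function names are hypothetical:

```c
#include <stddef.h>
#include <stdint.h>

/* Add len bytes to acc as big-endian 16-bit words (odd tail zero-padded). */
static uint32_t
sum16(const unsigned char *p, size_t len, uint32_t acc)
{
	while (len > 1) {
		acc += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len == 1)
		acc += (uint32_t)p[0] << 8;
	return (acc);
}

/*
 * Checksum of a UDP datagram (header with a zeroed checksum field, plus
 * payload) over the IPv4 pseudo-header. The result is in host order;
 * store it big-endian in bytes 6 and 7 of the UDP header.
 */
uint16_t
udp4_checksum(const unsigned char src[4], const unsigned char dst[4],
    const unsigned char *udp, size_t udplen)
{
	uint32_t acc = 0;

	acc = sum16(src, 4, acc);	/* pseudo-header: source address */
	acc = sum16(dst, 4, acc);	/* pseudo-header: destination */
	acc += 17;			/* zero byte + protocol (UDP = 17) */
	acc += (uint32_t)udplen;	/* pseudo-header: UDP length */
	acc = sum16(udp, udplen, acc);	/* UDP header and payload */

	while (acc >> 16)		/* fold carries back in */
		acc = (acc & 0xffff) + (acc >> 16);
	acc = ~acc & 0xffff;

	/* a computed 0 is sent as 0xffff; 0 on the wire means "none" */
	return ((uint16_t)(acc == 0 ? 0xffff : acc));
}
```

IPv6 uses a different pseudo-header (128-bit addresses, 32-bit length and next-header fields) and makes the checksum mandatory, so this function would not carry over unchanged.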

Best Regards,
-peter
