Discussion:
AF_UNIX socketpair dgram queue sizes
Jan Schaumann
2021-11-09 03:32:43 UTC
Hello,

I'm trying to wrap my head around the buffer sizes
relevant to AF_UNIX/PF_LOCAL dgram socketpairs.

On a NetBSD/amd64 9.2 system, creating a socketpair
and simply writing a single byte in a loop to the
write end without reading the data in non-blocking
mode, I can write

net.local.dgram.recvspace / 512 datagrams with a
single byte

e.g.,

16384 recvspace => 32 1-byte writes

Likewise, I can perform 32 writes of up to 400 bytes
each, but if I try to write chunks of 401 bytes, I can
only perform 22 such writes.

This is similarly observed for any variation of
recvspace (32K => 64 writes of 1 - 400 bytes, but 43
writes if the payload is 401 bytes etc.).
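The experiment described above can be sketched roughly as follows. This is only an illustration, not the program used for the post: the helper name count_writes and its limit safety cap are made up here, and the counts it returns depend on the operating system. The figures quoted in this post are for NetBSD/amd64 9.2; other systems account for socket buffers differently.

```python
import socket


def count_writes(payload_size, rcvbuf=None, limit=1_000_000):
    """Count how many datagrams of payload_size bytes can be queued on an
    AF_UNIX SOCK_DGRAM socketpair before a non-blocking send fails."""
    rd, wr = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        if rcvbuf is not None:
            # Optionally shrink the receive buffer of the read end.
            rd.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, rcvbuf)
        wr.setblocking(False)
        payload = b"x" * payload_size
        count = 0
        while count < limit:  # safety cap so the loop always terminates
            try:
                wr.send(payload)
            except OSError:
                # EAGAIN/EWOULDBLOCK or ENOBUFS, depending on the system.
                break
            count += 1
        return count
    finally:
        rd.close()
        wr.close()


if __name__ == "__main__":
    for size in (1, 400, 401):
        print(size, count_writes(size))
```

Nothing is ever read from rd, so every datagram that was accepted stays queued until the sockets are closed.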

This 400-byte cutoff does not appear to be related to
either recvspace or SO_SNDBUF -- which, for a dgram,
should be the max datagram size. And indeed I cannot
write any chunks of > 2560 bytes, but can write
(6 * 2560 + 1 * 1010) = 16370 bytes (which, at an
apparent overhead of 2 bytes per socketpair dgram,
i.e., 7 * 2 = 14 bytes, adds up to exactly 16K
recvspace).

Does anybody know where the 400 byte number comes
from, or what I'm getting confused about here?

-Jan

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Jan Schaumann
2021-11-13 04:53:10 UTC
Post by Jan Schaumann
> I'm trying to wrap my head around the buffer sizes
> relevant to AF_UNIX/PF_LOCAL dgram socketpairs.

Still working on this, and I'm making some progress in
my understanding of mbufs here.

So:

An mbuf is 512 bytes in size, with an m_hdr consuming
56 bytes, and a packet header consuming 56 bytes.

If I write a small amount of data, then I only need
one mbuf of type MT_DATA, so can squeeze in 400 bytes.

If I write more than 856 bytes of data, then I get an
mbuf of type MT_DATA with flags M_EXT set, i.e., an
mbuf cluster.

For any amount of data >400 but <= 856, I get one mbuf
of type MT_DATA with flags M_PKTHDR set, capable of
holding 400 bytes plus a second mbuf of type MT_DATA
with up to 456 bytes.

After I allocate these chains, I then prepend an mbuf
of type MT_SONAME (and size 2 bytes for the socketpair
socket "name") to the beginning of the chain.

So for any dgram of size 1 byte to 400 bytes, I will
need two mbufs (one of type MT_SONAME plus one of type
MT_DATA); for a dgram of size 401 bytes up to 856
bytes, I will need three mbufs (one SONAME, two
DATA), but for a dgram of size > 856 bytes, I only need
two mbufs (one SONAME, one DATA with M_EXT).
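The size classes above can be written down as a small model. It is a sketch built only from the numbers reported in this post (512-byte mbufs, 56 bytes each for m_hdr and the packet header). The function names and the string labels are invented for illustration, and deriving the 857-byte cluster threshold as MHLEN + MLEN + 1 is an assumption that happens to match the observed cutoff.

```python
MSIZE = 512          # size of one mbuf on the poster's amd64 system
M_HDR_SIZE = 56      # m_hdr overhead, as reported in the post
PKTHDR_SIZE = 56     # packet header overhead, as reported in the post

MLEN = MSIZE - M_HDR_SIZE        # data area of a plain mbuf
MHLEN = MLEN - PKTHDR_SIZE       # data area of a packet-header mbuf
MINCLSIZE = MHLEN + MLEN + 1     # assumed smallest payload stored in a cluster


def data_mbufs(nbytes):
    """Return illustrative labels for the mbufs holding nbytes of payload."""
    if nbytes <= MHLEN:
        return ["MT_DATA (M_PKTHDR)"]
    if nbytes < MINCLSIZE:
        return ["MT_DATA (M_PKTHDR)", "MT_DATA"]
    return ["MT_DATA (M_EXT, cluster)"]


def dgram_chain(nbytes):
    """Full chain for one socketpair datagram: the 2-byte socket name
    mbuf is prepended to the data mbufs."""
    return ["MT_SONAME"] + data_mbufs(nbytes)


if __name__ == "__main__":
    for size in (1, 400, 401, 856, 857):
        print(size, dgram_chain(size))
```

With these constants MLEN is 456, MHLEN is 400 and MINCLSIZE is 857, so the chain is two mbufs long up to 400 bytes, three from 401 to 856 bytes, and two again from 857 bytes on.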

Given a socketpair with the read end having a
SO_RCVBUF set to 2048 bytes, I now observe the
following:

1) I can write 4 dgrams of 1 to 400 bytes each
before my next attempt to write another 1 byte dgram
fails with ENOBUFS. This uses 8 mbufs of 512 bytes each.

2) I can write 3 dgrams of size 401 to 856 bytes
before my next attempt to write another dgram fails
with ENOBUFS. This uses 9 512 byte mbufs.

(If I write 856 bytes, the third datagram will only
carry 330 bytes for a total of 856 + 856 + 330 = 2042
bytes, which, when we add the two bytes from the
SONAME mbuf for each, adds up to 2048, i.e.,
SO_RCVBUF.)

3) I can write 2 dgrams of size 857 bytes before my
next attempt yields ENOBUFS. This uses 4 mbufs,
writing 1718 bytes in total.


My question at this point is: when does my mbuf
allocation fail? I.e., what is the limit here?

In (1), I allocated 8 mbufs = 4096 bytes = 2 * SO_RCVBUF.

In (2), I allocated 9 mbufs = 4608 bytes = 2 * SO_RCVBUF + 512.
But I ended up writing less data.

In (3), I allocate 4 mbufs = 2048 bytes = SO_RCVBUF.


These correlations don't make much sense to me. What
am I missing?

-Jan

Michael van Elst
2021-11-13 08:21:15 UTC
Post by Jan Schaumann
> 1) I can write 4 dgrams of 1 to 400 bytes each
> before my next attempt to write another 1 byte dgram
> fails with ENOBUFS. This uses 8 mbufs of 512 bytes each.
> 2) I can write 3 dgrams of size 401 to 856 bytes
> before my next attempt to write another dgram fails
> with ENOBUFS. This uses 9 512 byte mbufs.
> (If I write 856 bytes, the third datagram will only
> carry 330 bytes for a total of 856 + 856 + 330 = 2042
> bytes, which, when we add the two bytes from the
> SONAME mbuf for each, adds up to 2048, i.e.,
> SO_RCVBUF.)
> 3) I can write 2 dgrams of size 857 bytes before my
> next attempt yields ENOBUFS. This uses 4 mbufs,
> writing 1718 bytes in total.
> My question at this point is: when does my mbuf
> allocation fail? I.e., what is the limit here?
> In (1), I allocated 8 mbufs = 4096 bytes = 2 * SO_RCVBUF.
> In (2), I allocated 9 mbufs = 4608 bytes = 2 * SO_RCVBUF + 512.
> But I ended up writing less data.
> In (3), I allocate 4 mbufs = 2048 bytes = SO_RCVBUF.

There are two limits:

sb_cc counts the valid bytes against sb_hiwat = SO_RCVBUF.
sb_mbcnt counts the mbuf sizes against sb_mbmax = 2 * SO_RCVBUF.

Either limit gives you an amount of free space in the buffer
and the minimum is the limit for the write operation. This
is calculated in sbspace().


If I calculated right you get the following cases:

1)
after writing 3 datagrams you have:
1206 bytes written -> 842 bytes free
3072 bytes of mbufs appended -> 1024 bytes free
-> you can write another 842 bytes

next datagram of 400 + 2 bytes fits and you have:
1608 bytes written -> 440 bytes free
4096 bytes of mbufs appended -> 0 bytes free
-> next write will fail

The same happens also when you just write 1 byte datagrams
as sb_mbcnt will be the same.

2)
after writing 2 datagrams you have:
1716 bytes written -> 332 bytes free
3072 bytes of mbufs appended -> 1024 bytes free
-> you can write another 332 bytes

next datagram of 330 + 2 bytes fits and you have
2048 bytes written -> 0 bytes free
4096 bytes of mbufs appended -> 0 bytes free
-> next write will fail

3)
after writing 1 datagram you have:
859 bytes written -> 1189 bytes free
2560 bytes of mbufs appended -> 1536 bytes free
-> you can write another 1189 bytes

next datagram of 857 + 2 bytes fits and you have:
1718 bytes written -> 330 bytes free
5120 bytes of mbufs appended -> 0 bytes free
-> next write will fail
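The three cases above can be replayed with a toy model of the two counters. This is a simplified sketch, not the kernel code: the class SockBuf and the helper replay are made up for illustration, the rule "a datagram is accepted if its valid bytes fit into the free space" is an assumption based on the description of sbspace() above, and the mbuf storage charged per datagram is taken directly from the figures in this reply rather than derived.

```python
class SockBuf:
    """Toy receive buffer with the two limits described above."""

    def __init__(self, hiwat):
        self.hiwat = hiwat        # byte limit, SO_RCVBUF
        self.mbmax = 2 * hiwat    # mbuf storage limit
        self.cc = 0               # valid bytes queued
        self.mbcnt = 0            # mbuf storage queued

    def space(self):
        """Free space: the smaller of the two remaining budgets, never negative."""
        return max(0, min(self.hiwat - self.cc, self.mbmax - self.mbcnt))

    def append(self, valid_bytes, mbuf_bytes):
        """Queue one datagram if its valid bytes fit into the free space."""
        if valid_bytes > self.space():
            return False
        self.cc += valid_bytes
        self.mbcnt += mbuf_bytes
        return True


def replay(writes, hiwat=2048):
    """Apply (valid_bytes, mbuf_bytes) writes until one is rejected."""
    sb = SockBuf(hiwat)
    accepted = 0
    for valid_bytes, mbuf_bytes in writes:
        if not sb.append(valid_bytes, mbuf_bytes):
            break
        accepted += 1
    return accepted, sb.cc, sb.mbcnt, sb.space()


if __name__ == "__main__":
    # Each entry: (payload + 2 bytes of socket name, mbuf storage charged).
    case1 = [(400 + 2, 1024)] * 5
    case2 = [(856 + 2, 1536), (856 + 2, 1536), (330 + 2, 1024), (1 + 2, 1024)]
    case3 = [(857 + 2, 2560)] * 3
    for name, case in (("1", case1), ("2", case2), ("3", case3)):
        print(name, replay(case))
```

In case 1 and case 3 the mbuf storage limit is what stops the next write, while in case 2 both limits are exhausted at the same time.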


Michael van Elst
2021-11-13 16:48:06 UTC
Post by Jan Schaumann
> By the way, it strikes me as odd to treat dgrams of
> between 401 and 856 bytes as special. That's the
> only scenario where we use two normal mbufs. Is this
> a frequent enough case that makes it worth treating
> as special instead of simplifying to e.g., dgrams <=
> 400 => 1 mbuf, dgrams > 400 => 1 mbuf cluster ?

Small packets (one mbuf) and large packets (one cluster) are probably
most frequent. If you want to simplify for the medium sized packets,
you start wasting half of the memory, but only for a rather infrequent
case.

These heuristics are very old, but probably still valid if you
think that a cluster is good enough for the 1500 byte Ethernet MTU.
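The trade-off can be put into numbers with a small comparison. This is a rough sketch only: the function names are invented, the SONAME mbuf is left out, and MCLBYTES = 2048 is an assumption inferred from the 2560 bytes charged for a cluster-backed datagram earlier in the thread (2560 - 512).

```python
MSIZE = 512       # one mbuf, as reported in the thread
MHLEN = 400       # payload that fits into a packet-header mbuf
MLEN = 456        # payload that fits into a plain mbuf
MCLBYTES = 2048   # assumed cluster size, inferred from 2560 - MSIZE


def storage_with_two_mbufs(nbytes):
    """Storage charged under the scheme observed in the thread."""
    if nbytes <= MHLEN:
        return MSIZE
    if nbytes <= MHLEN + MLEN:
        return 2 * MSIZE
    return MSIZE + MCLBYTES


def storage_cluster_only(nbytes):
    """Storage charged if every payload above MHLEN went into a cluster."""
    if nbytes <= MHLEN:
        return MSIZE
    return MSIZE + MCLBYTES


if __name__ == "__main__":
    for size in (400, 401, 856, 857):
        print(size, storage_with_two_mbufs(size), storage_cluster_only(size))
```

Under these assumptions a 401 to 856 byte payload is charged 1024 bytes of mbuf storage with two mbufs and 2560 bytes with a cluster, so the simplified scheme would more than double the storage for that size range.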



Martin Husemann
2021-11-13 10:49:49 UTC
Post by Jan Schaumann
> An mbuf is 512 bytes in size, with an m_hdr consuming
> 56 bytes, and a packet header consuming 56 bytes.

Those numbers depend on the architecture, see mbuf.h:

/*
* Mbufs are of a single size, MSIZE (machine/param.h), which
* includes overhead. An mbuf may add a single "mbuf cluster" of size
* MCLBYTES (also in machine/param.h), which has no additional overhead
* and is used instead of the internal data area; this is done when
* at least MINCLSIZE of data must be stored.
*/

Martin

Jan Schaumann
2021-11-13 15:27:39 UTC
Post by Michael van Elst
> sb_cc counts the valid bytes against sb_hiwat = SO_RCVBUF.
> sb_mbcnt counts the mbuf sizes against sb_mbmax = 2 * SO_RCVBUF.

That second part was the piece I was missing!

Post by Michael van Elst
> Either limit gives you an amount of free space in the buffer
> and the minimum is the limit for the write operation. This
> is calculated in sbspace().

Right, that makes sense now. Thanks!

By the way, it strikes me as odd to treat dgrams of
between 401 and 856 bytes as special. That's the
only scenario where we use two normal mbufs. Is this
a frequent enough case that makes it worth treating
as special instead of simplifying to e.g., dgrams <=
400 => 1 mbuf, dgrams > 400 => 1 mbuf cluster ?

-Jan
