Discussion:
kernel frameworks for jumbo frame
(too old to reply)
Maxime Villard
2019-01-29 08:32:15 UTC
Permalink
Hi,
I'm working on jumbo frame support for axen(4) and mue(4). However,
unfortunately, I find that we do not have proper kernel frameworks
(1)
We do not have external mbuf cluster capable for receiving jumbo
frames. We need to use m_devget(9) [ale(4), sk(4), ...], or have
per driver pool [bge(4), dge(4), ...]. The former has a performance
problem, whereas the latter makes drivers complicated.
FreeBSD has 4 cluster sizes, MCLBYTES, pagesize, 9KiB, and 16KiB.
https://www.freebsd.org/cgi/man.cgi?query=m_getjcl
Here, the "size" argument should be chosen from 4 values above
(description in man page is misleading).
OpenBSD has 8 cluster sizes, MCLBYTES, MCLBYTES + 2 (ether aligned
2KiB buffer), 4, 8, 9, 12, 16, and 64KiB. They are usable via
http://man.openbsd.org/MCLGETI
Unlike FreeBSD, the "len" argument can be arbitrary; proper cluster
is chosen automatically. The "ifp" argument is not used.
We have several problems when it comes to clusters.

- As you said, we have only one cluster size to begin with.
- MEXTMALLOC is here to allocate variable-sized clusters, but it is
really ugly and should be removed. Same for MEXTADD.
- The error handling of MCLGET() is error-prone.
- The naming of certain functions is confusing, like m_getcl, which
is one typo away from m_clget.

I propose we do this:

- Find another name for m_add, or just remove it (only two users).
- Rename m_clget to m_addjcl, add a 'size' argument to it, and make
it return -1 on error. MCLGET() will be an alias to it with MCLBYTES
as size. The callers are not forced to read the return value and
can still check for M_EXT. We can then smoothly switch them to
handle the return value instead.
- Add m_getjcl, same as FreeBSD, and make it call m_addjcl. This should
solve your problem.
- Make m_getcl an alias to m_getjcl with MCLBYTES as size.

In short, something like:

int
m_addjcl(struct mbuf *m, int how, int size)
{
// transform 'm' into a cluster of size 'size'
// return -1 on error, 0 otherwise
}

#define MCLGET(m, how) m_addjcl((m), (how), MCLBYTES)

struct mbuf *
m_getjcl(int how, int type, int flags, int size)
{
... allocate an mbuf 'm'...

if (m_addjcl(m, how, size) != 0) {
m_free(m);
return NULL;
}

return m;
}

#define m_getcl(how, type, flags) m_getjcl(how, type, flags, MCLBYTES)

This design should simplify a lot of things: we will be able to remove
MEXTMALLOC, improve the error handling of MCLGET, and have a unified
way to allocate jumbo frames.

Changes I would then like to see (but which don't matter right now):

- Rename MCLGET to m_addcl, and switch each caller to handle errors via
the return value. This will be a great improvement.
- If MEXTADD is still needed, rename it to m_addraw. But normally it
shouldn't be needed, and we should be able to remove it.

What do you think about that?

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Maxime Villard
2019-01-29 09:03:43 UTC
Permalink
Post by Maxime Villard
- As you said, we have only one cluster size to begin with.
- MEXTMALLOC is here to allocate variable-sized clusters, but it is
really ugly and should be removed. Same for MEXTADD.
- The error handling of MCLGET() is error-prone.
- The naming of certain functions is confusing, like m_getcl, which
is one typo away from m_clget.
MEXTMALLOC() should probably go away, yes. However, MEXTADD() and the whole
set of "arbitrary external storage" has the potential to be quite useful.
Back in the day, I used it to implement a zero-copy data path from
disk-to-network in an iSCSI storage appliance. I know that the sosend_loan
stuff has been disabled for a while, but I think it's worth keeping the
infrastructure that enables it around.
Now, if you're just talking about removing MEXTADD() itself, but keeping the
underlying _m_ext_storage infrastructure, fine.
The main reason why I don't like MEXTADD is that the 'ext_free' field of
_m_ext_storage is in charge of freeing the mbuf itself, and not just its
storage. Because of that we have to make 'mb_cache' public, and it is now
hanging around in several drivers. In terms of abstraction, this is bad.
In terms of diagnostic checks, too, because the drivers don't set MT_FREE
and m_data to NULL.

But anyway, we can keep MEXTADD and just fix that (and also rename it).

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Jason Thorpe
2019-01-30 11:07:30 UTC
Permalink
I think we agree to add API something like
Post by Maxime Villard
int
m_addjcl(struct mbuf *m, int how, int size)
Just a minor nit: avoid "JCL" in any names - it causes great pain for some
readers.
LOL!

-- thorpej


--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Maxime Villard
2019-03-09 09:56:00 UTC
Permalink
I think we agree to add API something like
Post by Maxime Villard
int
m_addjcl(struct mbuf *m, int how, int size)
Just a minor nit: avoid "JCL" in any names - it causes great pain for some
readers.
Martin
Do you have a better name to suggest? Because Jumbo Cluster is about the best
name I can think of (and so does FreeBSD).

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Valery Ushakov
2019-03-09 11:27:33 UTC
Permalink
I think we agree to add API something like
Post by Maxime Villard
int
m_addjcl(struct mbuf *m, int how, int size)
Just a minor nit: avoid "JCL" in any names - it causes great pain for some
readers.
Do you have a better name to suggest? Because Jumbo Cluster is
about the best name I can think of (and so does FreeBSD).
I'm not sure Martin was entirely serious. I bet those few of us that
are triggered by "JCL" can survive it. :)

FreeBSD already uses "jcl" for jumbo clusters so it's probably a good
idea to stay as consistent as possible even if we don't borrow the api
verbatim.

-uwe


--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Martin Husemann
2019-03-09 14:11:03 UTC
Permalink
Post by Valery Ushakov
I'm not sure Martin was entirely serious. I bet those few of us that
are triggered by "JCL" can survive it. :)
Indeed.
Post by Valery Ushakov
FreeBSD already uses "jcl" for jumbo clusters so it's probably a good
idea to stay as consistent as possible even if we don't borrow the api
verbatim.
I would prefer mbuf_jumbo_* or similar, but I am really bad at coming up
with names.

Martin

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Jason Thorpe
2019-03-09 16:08:08 UTC
Permalink
Post by Valery Ushakov
Do you have a better name to suggest? Because Jumbo Cluster is
about the best name I can think of (and so does FreeBSD).
I'm not sure Martin was entirely serious. I bet those few of us that
are triggered by "JCL" can survive it. :)
FreeBSD already uses "jcl" for jumbo clusters so it's probably a good
idea to stay as consistent as possible even if we don't borrow the api
verbatim.
If we don't take the API verbatim (I don't think we should), I don't see much use in keeping annoyingly short names. Besides, "jumbo" might seem like a ridiculous size-judgement in 5 years ("16K are **tiny** packets! Those old guys! Hilarious!")

Personally, I also have a problem with the use of "cluster" in the name, because it refers directly to how these larger buffers were implemented when the API first sprang into existence, but I stuck with it in my previous comments on the subject for lack of a better naming idea.

-- thorpej


--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Jason Thorpe
2019-01-30 11:07:02 UTC
Permalink
Jason, Maxime, thank you for helpful comments!
I think we agree to add API something like
Post by Maxime Villard
int
m_addjcl(struct mbuf *m, int how, int size)
or
Post by Maxime Villard
int mbuf_get_cluster(struct mbuf *m, size_t desired_size);
I would like to write a draft of patch. Here, how should we treat size
argument; (a) cluster size itself (like FreeBSD), or (b) minimum required
size of cluster (like OpenBSD)? IMO, the latter is better; the overhead
should be negligible for modern processors. Otherwise, in addition to
I agree, I think the latter is better.
Post by Maxime Villard
size_t mbuf_max_cluster_size(void);
function, we may need a function which returns a cluster size
corresponding to a required size. Is it too much?
Yes, I think something like:

size_t mbuf_cluster_size_for_size(size_t desired_size);

...(or similar -- I understand some might find that name too verbose :-) would be a good and reasonable addition.
I don't understand where the value of "ethernet-max-mtu" itself is
stored? I don't think it is not so bad idea to have something like
if_maxmtu member in struct ifnet; it can also be used in the future
for other media that support variable MTU.
The idea would be that instead of adding additional property fields to struct ifnet, you could add either a prop_dictionary_t or nvlist or whatever (I obviously have my own preference, but I'll leave that aside :-) that allows for arbitrary properties to be added. None of these things are generally consulted in critical performance paths, only during configuration events.

-- thorpej


--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Gert Doering
2019-03-10 19:33:43 UTC
Permalink
Hi,
hello. I'm not saying anything that anyone here doesn't already know,
but I'll add that Linux seems to have taken the position that all ethernet
interfaces should be called eth0, eth1, etc.
This argument surprises me a bit, as Linux has moved *away* from doing
exactly this a few years ago...

https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/

(if you use udev, but that's the default for all major distributions,
not tied to systemd)


Personally, I'm a bit torn on what is "the best" thing.

On a system that only has one ethernet card, it's highly annoying
that I have to figure out if this is a "em0" or "igb0" - or an
"eno1" vs. "enp0s25" (both "onboard ethernet 0", but BIOS tables
leading udev to believing the second one isn't). So "eth0, always"
would be very convenient. Or possibly "eth0" for "wired ethernet"
and "wifi0" for wireless - as that's a common scenario.

For a system with multiple ethernet cards (server with frontend/backend
networks, or firewals, or routers) naming *stability* is a must - having
"em0" and "em1" swap their numbers at upgrading is as disruptive as
"eth0 and eth1 get swapped because one card had to be changed and the
other driver loads first". So here, something based on bus numbering
might be best - stability trumping convenience...

gert
--
"If was one thing all people took for granted, was conviction that if you
feed honest figures into a computer, honest figures come out. Never doubted
it myself till I met a computer with a sense of humor."
Robert A. Heinlein, The Moon is a Harsh Mistress

Gert Doering - Munich, Germany ***@greenie.muc.de

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Roy Marples
2019-03-11 13:03:04 UTC
Permalink
This is something I thought was solved *years ago* in the Linux world, and I don’t understand why udev/systemd had to break it.
udev has changed nameing scheme at least twice in my memory.
This is on top of any kernel changes in this area.
Splitting eth0 vs wifi0 makes sense to me, as the inherent nature of the devices are fundamentally different.
This actually makes little sense to me - as a developer.
I can tell wifi from ethernet by virtue of the SSID property being
present via ifconfig.

What I struggle to tell, regardles of OS, is if a device is physical or
virtual (such as PPP, TAP or TUN). On BSD, because the name cannot be
changed, I can derive this. But a hard coded list of device names to
special case is pretty bad really.

Roy

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Loading...