if_txtimer API to replace (*if_watchdog)()

Discussion:

if_txtimer API to replace (*if_watchdog)()

(too old to reply)

Jason Thorpe

2020-01-20 22:48:31 UTC

Folks --

The legacy (*if_watchdog)() interface has a couple of problems:

1- It does not have any way to represent multiple hardware-managed transmit queues.

2- It's not easy to make MP-safe because it relies on modifying the ifnet structure periodically outside of the normal locking mechanisms.

The wm(4) driver solved the problem in a reasonable way, and to make it easier for the other drivers in the tree to adopt it's strategy, I re-factored it into a new if_txtimer structure and API.

So save typing, I'll paste the relevant section of <net/if.h>:

/*
* if_txtimer --
*
* Transmission timer (to replace the legacy ifnet watchdog timer).
*
* The driver should allocate one if_txtimer per hardware-managed
* transmission queue and initialize it with if_txtimer_init(). The
* driver should also arrange to have a callout invoked at a regular
* interval (this may be coindidental with a timer used to check
* link status).
*
* When the driver gives packets to the hardware to transmit, it should
* arm the timer by calling if_txtimer_arm(). When it is sweeping up
* completed transmit jobs, it should disarm the timer by calling
* if_txtimer_disarm() if there are no outstanding jobs remaining.
*
* In the periodic callout, the driver should check if the timer has
* expired by calling if_txtimer_is_expired(). It is not necessary to
* check if the timer is armed when checking for expiration. If the
* timer has is expired, then the transmission has timed out and the
* driver should take corrective action to recover from the error.
*
* If the driver needs to check multiple transmission queues, an
* optimization is available that avoids repeated calls to fetch
* the compare time. In this case, the driver can get the compare
* time by calling if_txtimer_now() and can check for timer expiration
* using if_txtimer_is_expired_explicit().
*
* The granularity of the if_txtimer is 1 second.
*
* Locking: All locking of the if_txtimer is the responsibility of
* the driver. The if_txtimer should be protected by the same lock
* that protects the associated transmission queue.
*/

See the diff for complete details. Included is a conversion of wm(4) (which uses the if_txtimer_is_expired_explicit() variant), and pcn(4) (which uses the simpler if_txtimer_is_expired() variant). The diff for pcn(4) looks artificially large because I relocated to pcn_txtimer_check() its sole call site after I renamed it from pcn_watchdog().

My plan is to get the new API in and start updating drivers to use the new mechanism. The changes will be basically mechanical. Before I get started on that, I'd like feedback on the proposed API.

Thx.

Jason Thorpe

2020-01-20 22:59:34 UTC

Post by Jason Thorpe
My plan is to get the new API in and start updating drivers to use the new mechanism. The changes will be basically mechanical. Before I get started on that, I'd like feedback on the proposed API.

It was suggested to me off-list to use _p naming for the predicate functions, i.e.:

if (if_txtimer_expired_p(...)) {
...
}

...and:

if (if_txtimer_expired_explicit_p(...)) {
...
}

I'm fine using that naming convention if that's was folks prefer.

-- thorpej

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Jason Thorpe

2020-01-22 00:21:49 UTC

Post by Jason Thorpe
Folks --
1- It does not have any way to represent multiple hardware-managed transmit queues.
2- It's not easy to make MP-safe because it relies on modifying the ifnet structure periodically outside of the normal locking mechanisms.
The wm(4) driver solved the problem in a reasonable way, and to make it easier for the other drivers in the tree to adopt it's strategy, I re-factored it into a new if_txtimer structure and API.

I spent some time thinking about this a little more, especially around making it easier to convert drivers that don't have a periodic tick already (this is mainly drivers that don't use MII), so I added some extra support for such drivers and as a proof of concept, converted the SGI "sq" driver. You can see that it's a very easy mechanical conversion for legacy drivers, that still gives them a simple NET_MPSAFE migration path (there's a provision for providing a tx queue interlock to the timer expiration callback).

If there is consensus that this is a reasonable direction, then I'll start migrating drivers and, when complete, remove the legacy (*if_watchdog)() and if_timer fields from struct ifnet.

For completeness, here's the big block comment:

/*
* if_txtimer --
*
* Transmission timer (to replace the legacy ifnet watchdog timer).
*
* The driver should allocate one if_txtimer per hardware-managed
* transmission queue. There are two different ways to allocate
* the and use the timer, based on the driver's structure.
*
* DRIVERS THAT PROVIDE THEIR OWN PERIODIC CALLOUT
* ===============================================
*
* ==> Allocate timers using if_txtimer_alloc().
* ==> In the periodic callout, check for txtimer expiration using
* if_txtimer_is_expired() or if_txtimer_is_expired_explicit()
* (see below).
*
* DRIVERS THAT DO NOT PROVIDE THEIR OWN PERIODIC CALLOUT
* ======================================================
*
* ==> Allocate timers using if_txtimer_alloc_with_callback().
* This variant allocates a callout and provides a facility
* for the callout to invoke a driver-provided callack when
* the timer expires, with an optional interlock (typically
* the transmit queue mutex).
*
* If an interlock is provided, the interlock will be acquired
* before checking for timer expiration, and will invoke the
* callback with the interlock held if the timer has expired.
* NOTE: the callback is responsible for releasing the interlock.
* If an interlock is NOT provided, then IPL will be raised to
* splnet() before checking for timer expiration and invoking
* the callback. In this case, the IPL will be restored on
* the callback's behalf when it returns.
* ==> In the driver's (*if_init)() routine, the timer's callout
* should be started with if_txtimer_start(). In the driver's
* (*if_stop)() routine, the timer's callout should be stopped
* with if_txtimer_stop() or if_txtimer_halt() (see callout(9)
* for the difference between stop and halt).
*
* In both cases, all locking of the if_txtimer is the responsibility
* of the driver. The if_txtimer should be protected by the same lock
* that protects the associated transmission queue. The queue
* associated with the timer should be locked when arming and disarming
* the timer and when checking the timer for expiration.
*
* When the driver gives packets to the hardware to transmit, it should
* arm the timer by calling if_txtimer_arm(). When it is sweeping up
* completed transmit jobs, it should disarm the timer by calling
* if_txtimer_disarm() if there are no outstanding jobs remaining.
*
* If a driver needs to check multiple transmission queues, an
* optimization is available that avoids repeated calls to fetch
* the compare time. In this case, the driver can get the compare
* time by calling if_txtimer_now() and can check for timer expiration
* using if_txtimer_is_expired_explicit().
*
* The granularity of the if_txtimer is 1 second.
*/

Kengo NAKAHARA

2020-01-23 04:02:08 UTC

Hi,

Sorry for my late reply.

Post by Jason Thorpe

Post by Jason Thorpe
Folks --
1- It does not have any way to represent multiple hardware-managed transmit queues.
2- It's not easy to make MP-safe because it relies on modifying the ifnet structure periodically outside of the normal locking mechanisms.
The wm(4) driver solved the problem in a reasonable way, and to make it easier for the other drivers in the tree to adopt it's strategy, I re-factored it into a new if_txtimer structure and API.

I spent some time thinking about this a little more, especially around making it easier to convert drivers that don't have a periodic tick already (this is mainly drivers that don't use MII), so I added some extra support for such drivers and as a proof of concept, converted the SGI "sq" driver. You can see that it's a very easy mechanical conversion for legacy drivers, that still gives them a simple NET_MPSAFE migration path (there's a provision for providing a tx queue interlock to the timer expiration callback).
If there is consensus that this is a reasonable direction, then I'll start migrating drivers and, when complete, remove the legacy (*if_watchdog)() and if_timer fields from struct ifnet.

Looks good for me. Thank you for detailed comments.

Thanks,

Post by Jason Thorpe
/*
* if_txtimer --
*
* Transmission timer (to replace the legacy ifnet watchdog timer).
*
* The driver should allocate one if_txtimer per hardware-managed
* transmission queue. There are two different ways to allocate
* the and use the timer, based on the driver's structure.
*
* DRIVERS THAT PROVIDE THEIR OWN PERIODIC CALLOUT
* ===============================================
*
* ==> Allocate timers using if_txtimer_alloc().
* ==> In the periodic callout, check for txtimer expiration using
* if_txtimer_is_expired() or if_txtimer_is_expired_explicit()
* (see below).
*
* DRIVERS THAT DO NOT PROVIDE THEIR OWN PERIODIC CALLOUT
* ======================================================
*
* ==> Allocate timers using if_txtimer_alloc_with_callback().
* This variant allocates a callout and provides a facility
* for the callout to invoke a driver-provided callack when
* the timer expires, with an optional interlock (typically
* the transmit queue mutex).
*
* If an interlock is provided, the interlock will be acquired
* before checking for timer expiration, and will invoke the
* callback with the interlock held if the timer has expired.
* NOTE: the callback is responsible for releasing the interlock.
* If an interlock is NOT provided, then IPL will be raised to
* splnet() before checking for timer expiration and invoking
* the callback. In this case, the IPL will be restored on
* the callback's behalf when it returns.
* ==> In the driver's (*if_init)() routine, the timer's callout
* should be started with if_txtimer_start(). In the driver's
* (*if_stop)() routine, the timer's callout should be stopped
* with if_txtimer_stop() or if_txtimer_halt() (see callout(9)
* for the difference between stop and halt).
*
* In both cases, all locking of the if_txtimer is the responsibility
* of the driver. The if_txtimer should be protected by the same lock
* that protects the associated transmission queue. The queue
* associated with the timer should be locked when arming and disarming
* the timer and when checking the timer for expiration.
*
* When the driver gives packets to the hardware to transmit, it should
* arm the timer by calling if_txtimer_arm(). When it is sweeping up
* completed transmit jobs, it should disarm the timer by calling
* if_txtimer_disarm() if there are no outstanding jobs remaining.
*
* If a driver needs to check multiple transmission queues, an
* optimization is available that avoids repeated calls to fetch
* the compare time. In this case, the driver can get the compare
* time by calling if_txtimer_now() and can check for timer expiration
* using if_txtimer_is_expired_explicit().
*
* The granularity of the if_txtimer is 1 second.
*/
-- thorpej

--
//////////////////////////////////////////////////////////////////////
Internet Initiative Japan Inc.

Device Engineering Section,
Product Development Department,
Product Division,
Technology Unit

Kengo NAKAHARA <k-***@iij.ad.jp>

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Ryota Ozaki

2020-01-23 06:56:42 UTC

Post by Jason Thorpe

Post by Jason Thorpe
Folks --
1- It does not have any way to represent multiple hardware-managed transmit queues.
2- It's not easy to make MP-safe because it relies on modifying the ifnet structure periodically outside of the normal locking mechanisms.
The wm(4) driver solved the problem in a reasonable way, and to make it easier for the other drivers in the tree to adopt it's strategy, I re-factored it into a new if_txtimer structure and API.

I spent some time thinking about this a little more, especially around making it easier to convert drivers that don't have a periodic tick already (this is mainly drivers that don't use MII), so I added some extra support for such drivers and as a proof of concept, converted the SGI "sq" driver. You can see that it's a very easy mechanical conversion for legacy drivers, that still gives them a simple NET_MPSAFE migration path (there's a provision for providing a tx queue interlock to the timer expiration callback).
If there is consensus that this is a reasonable direction, then I'll start migrating drivers and, when complete, remove the legacy (*if_watchdog)() and if_timer fields from struct ifnet.

No objection to the direction. Thank you for the effort!

I have two comments of the API. See below.

Post by Jason Thorpe
/*
* if_txtimer --
*
* Transmission timer (to replace the legacy ifnet watchdog timer).
*
* The driver should allocate one if_txtimer per hardware-managed
* transmission queue. There are two different ways to allocate
* the and use the timer, based on the driver's structure.
*
* DRIVERS THAT PROVIDE THEIR OWN PERIODIC CALLOUT
* ===============================================
*
* ==> Allocate timers using if_txtimer_alloc().
* ==> In the periodic callout, check for txtimer expiration using
* if_txtimer_is_expired() or if_txtimer_is_expired_explicit()
* (see below).
*
* DRIVERS THAT DO NOT PROVIDE THEIR OWN PERIODIC CALLOUT
* ======================================================
*
* ==> Allocate timers using if_txtimer_alloc_with_callback().
* This variant allocates a callout and provides a facility
* for the callout to invoke a driver-provided callack when
* the timer expires, with an optional interlock (typically
* the transmit queue mutex).
*
* If an interlock is provided, the interlock will be acquired
* before checking for timer expiration, and will invoke the
* callback with the interlock held if the timer has expired.
* NOTE: the callback is responsible for releasing the interlock.

Why? I think we should keep lock/unlock in the same place,
at least in the same layer, as much as possible unless there is
a critical reason to do so.

Post by Jason Thorpe
* If an interlock is NOT provided, then IPL will be raised to
* splnet() before checking for timer expiration and invoking
* the callback. In this case, the IPL will be restored on
* the callback's behalf when it returns.
* ==> In the driver's (*if_init)() routine, the timer's callout
* should be started with if_txtimer_start(). In the driver's
* (*if_stop)() routine, the timer's callout should be stopped
* with if_txtimer_stop() or if_txtimer_halt() (see callout(9)
* for the difference between stop and halt).

We need to teach if_txtimer_tick not to call callout_schedule
somehow on destruction, otherwise if_txtimer_tick can continue
to run even after calling callout_halt. wm avoids the flaw by
checking sc_core_stopping at the beginning of wm_tick.

ozaki-r

Post by Jason Thorpe
*
* In both cases, all locking of the if_txtimer is the responsibility
* of the driver. The if_txtimer should be protected by the same lock
* that protects the associated transmission queue. The queue
* associated with the timer should be locked when arming and disarming
* the timer and when checking the timer for expiration.
*
* When the driver gives packets to the hardware to transmit, it should
* arm the timer by calling if_txtimer_arm(). When it is sweeping up
* completed transmit jobs, it should disarm the timer by calling
* if_txtimer_disarm() if there are no outstanding jobs remaining.
*
* If a driver needs to check multiple transmission queues, an
* optimization is available that avoids repeated calls to fetch
* the compare time. In this case, the driver can get the compare
* time by calling if_txtimer_now() and can check for timer expiration
* using if_txtimer_is_expired_explicit().
*
* The granularity of the if_txtimer is 1 second.
*/
-- thorpej

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Jason Thorpe

2020-01-23 14:28:39 UTC

Post by Ryota Ozaki

Post by Jason Thorpe
*
* If an interlock is provided, the interlock will be acquired
* before checking for timer expiration, and will invoke the
* callback with the interlock held if the timer has expired.
* NOTE: the callback is responsible for releasing the interlock.

Why? I think we should keep lock/unlock in the same place,
at least in the same layer, as much as possible unless there is
a critical reason to do so.

Andrew / Taylor, would appreciate you weighing in on this, as well.

I went back-and-forth on this point. My thinking there is:

a) The interlock must be taken in order to test for timer expiration.

b) The error handling in the callback will need to drop the lock if it decides to reset the interface.

c) If I drop the lock in if_txtimer_tick() before calling the callback, then the state of the timer will be indeterminate when the callback is called; the callback would have to reacquire the lock and re-check for timer expiration.

This is why I settled on this asymmetry. I would add a KASSERT() to check the lock state on return from the callback, but I don't think it's possible to safely check mutex_owned() on a SPIN mutex. Andrew, can you comment on this?

Post by Ryota Ozaki
We need to teach if_txtimer_tick not to call callout_schedule
somehow on destruction, otherwise if_txtimer_tick can continue
to run even after calling callout_halt. wm avoids the flaw by
checking sc_core_stopping at the beginning of wm_tick.

This is a general problem in all drivers that use callouts, and if_txtimer_tick() doesn't really make it any worse than existing drivers that use tick callouts themselves. I think even wm(4)'s approach has a problem in the detach case. I have some thoughts on how to solve this problem generally, and will attack this problem at a future time.

-- thorpej

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Jason Thorpe

2020-01-23 15:23:46 UTC

Post by Jason Thorpe

Post by Ryota Ozaki
We need to teach if_txtimer_tick not to call callout_schedule
somehow on destruction, otherwise if_txtimer_tick can continue
to run even after calling callout_halt. wm avoids the flaw by
checking sc_core_stopping at the beginning of wm_tick.

This is a general problem in all drivers that use callouts, and if_txtimer_tick() doesn't really make it any worse than existing drivers that use tick callouts themselves. I think even wm(4)'s approach has a problem in the detach case. I have some thoughts on how to solve this problem generally, and will attack this problem at a future time.

Oh! I just realized that I missed the point you were making here. I see what you mean now, and I will address it. But my other statements about callouts is also true :-)

-- thorpej

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Andrew Doran

2020-01-23 16:01:29 UTC

Post by Jason Thorpe

Post by Jason Thorpe

Post by Ryota Ozaki
We need to teach if_txtimer_tick not to call callout_schedule
somehow on destruction, otherwise if_txtimer_tick can continue
to run even after calling callout_halt. wm avoids the flaw by
checking sc_core_stopping at the beginning of wm_tick.

This is a general problem in all drivers that use callouts, and if_txtimer_tick() doesn't really make it any worse than existing drivers that use tick callouts themselves. I think even wm(4)'s approach has a problem in the detach case. I have some thoughts on how to solve this problem generally, and will attack this problem at a future time.

Oh! I just realized that I missed the point you were making here. I see what you mean now, and I will address it. But my other statements about callouts is also true :-)

This might do it, I think it's definitely needed for the destroy/detach use
case:

http://www.netbsd.org/~ad/2020/kern_timeout.c.diff

The callout_stop/halt thing is an endless source of work.

Andrew

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Andrew Doran

2020-01-23 16:10:27 UTC

Post by Jason Thorpe

Post by Ryota Ozaki

Post by Jason Thorpe
*
* If an interlock is provided, the interlock will be acquired
* before checking for timer expiration, and will invoke the
* callback with the interlock held if the timer has expired.
* NOTE: the callback is responsible for releasing the interlock.

Why? I think we should keep lock/unlock in the same place,
at least in the same layer, as much as possible unless there is
a critical reason to do so.

Andrew / Taylor, would appreciate you weighing in on this, as well.
a) The interlock must be taken in order to test for timer expiration.
b) The error handling in the callback will need to drop the lock if it
decides to reset the interface.

How about permitting that but require that the lock is taken again before
return back to if_txtimer_tick(). Would it be a big pessimisation? If
you're resetting the interface I guess not.

Post by Jason Thorpe
c) If I drop the lock in if_txtimer_tick() before calling the callback,
then the state of the timer will be indeterminate when the callback is
called; the callback would have to reacquire the lock and re-check for
timer expiration.
This is why I settled on this asymmetry. I would add a KASSERT() to check
the lock state on return from the callback, but I don't think it's
possible to safely check mutex_owned() on a SPIN mutex. Andrew, can you
comment on this?

With spin mutexes you can only do a positive check, i.e. "I know I should
own this right now and want to check that" because for the negative case
some other CPU could own it, or mutex_owned() could be lying to you because
the kernel was compiled without options MULTIPROCESSOR.

Post by Jason Thorpe

Post by Ryota Ozaki
We need to teach if_txtimer_tick not to call callout_schedule
somehow on destruction, otherwise if_txtimer_tick can continue
to run even after calling callout_halt. wm avoids the flaw by
checking sc_core_stopping at the beginning of wm_tick.

This is a general problem in all drivers that use callouts, and
if_txtimer_tick() doesn't really make it any worse than existing drivers
that use tick callouts themselves. I think even wm(4)'s approach has a
problem in the detach case. I have some thoughts on how to solve this
problem generally, and will attack this problem at a future time.
-- thorpej

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Jason Thorpe

2020-01-23 17:03:57 UTC

Post by Andrew Doran
How about permitting that but require that the lock is taken again before
return back to if_txtimer_tick(). Would it be a big pessimisation? If
you're resetting the interface I guess not.

Well, it's not the only asymmetry... in the "timer expired" case, if_txtimer_tick() also does not reschedule the callout (it relies on the callback's error handler to decide if that's appropriate). Basically, "timer expired" == "driver handles the situation". It just seems weird and redundant to require the callback to reacquire the lock just to have if_txtimer_tick() do nothing with it other than release it.

-- thorpej

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Jason Thorpe

2020-01-23 17:19:42 UTC

Post by Andrew Doran
With spin mutexes you can only do a positive check, i.e. "I know I should
own this right now and want to check that" because for the negative case
some other CPU could own it, or mutex_owned() could be lying to you because
the kernel was compiled without options MULTIPROCESSOR.

Hm, it'd be nice if we could record this somehow to assert that "it's this CPU that has it".

-- thorpej

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Andrew Doran

2020-01-24 00:32:54 UTC

Post by Andrew Doran
How about permitting that but require that the lock is taken again before
return back to if_txtimer_tick(). Would it be a big pessimisation? If
you're resetting the interface I guess not.

Well, it's not the only asymmetry... in the "timer expired" case,
if_txtimer_tick() also does not reschedule the callout (it relies on the
callback's error handler to decide if that's appropriate). Basically,
"timer expired" == "driver handles the situation". It just seems weird
and redundant to require the callback to reacquire the lock just to have
if_txtimer_tick() do nothing with it other than release it.

I'm not good with the network stack and have been avoiding it for years!
All I can say is I don't have an objection to the pattern of handing off the
release if there's a decent justification for it. We do the similar in the
scheduler in a few places if it's a dead end situation, or there's a cost in
holding the lock too long (e.g. wake another CPU up which will immediately
slam into a lock curcpu holds).

Andrew

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Andrew Doran

2020-01-24 00:51:33 UTC

Post by Jason Thorpe

Post by Andrew Doran
With spin mutexes you can only do a positive check, i.e. "I know I should
own this right now and want to check that" because for the negative case
some other CPU could own it, or mutex_owned() could be lying to you because
the kernel was compiled without options MULTIPROCESSOR.

Hm, it'd be nice if we could record this somehow to assert that "it's this CPU that has it".

It's a big compromise, for older computers that don't have CAS, or where
they'll never have more than one CPU. I'd rather it didn't work that way
but it's the price to pay. If you have the CPU cycles to burn you can run
LOCKDEBUG and it'll keep you honest with stuff like that, it does a kind of
crude parallel simulation with its own data structures and keeps track of
what locks are held, and remembers where they were taken, and by whom and
readily panics if it doesn't like how things turn out.

Andrew

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Taylor R Campbell

2020-01-24 05:38:25 UTC

I don't really understand why you're proposing a new API. But let me
make sure I understand the context first.

When we submit a packet to a hardware queue for tx, we want to wait up
to n seconds before giving up -- meaning that we report the packet
dropped (if_oerrors++), reset the device, free the packet, recycle
queue entries, whatever.

Right now, we have a mechanism if_watchdog (a.k.a. if_slowtimo) which
works as follows, for drivers that use it, like wpi(4): Every second,
if ifp->if_timer is nonzero, we decrement it, and if it becomes zero,
we call ifp->if_watchdog -- all under the kernel lock.

- Some drivers like sk(4) set ifp->if_timer = 5 and use if_watchdog to
give up after 5sec.

- Some drivers like wpi(4) set ifp->if_timer = 1, sc->sc_tx_timer = 5,
and use if_watchdog to count down sc->sc_tx_timer every second.
That way they can give up after 5sec but also do other housekeeping
like ieee80211_watchdog every second.

- Some drivers like wm(4) use a callout instead of if_watchdog.

- Some drivers like wm(4) maintain one timer for each tx queue.

The _main_ problem you're trying to solve, as I understand it, is that
management of ifp->if_timer -- in net/if.c but also scattered
throughout drivers all over the tree -- is open-coded and relies on
the kernel lock, which you would like to avoid (but please correct me

Post by Jason Thorpe
2- It's not easy to make MP-safe because it relies on modifying
the ifnet structure periodically outside of the normal locking
mechanisms.

Presumably we could add a new lock for access to if_timer, and that
would solve this problem. But it seems to me it would be even simpler
to just use a callout to solve this:

- Where a driver currently sets ifp->if_timer = 5, it can instead
callout_schedule(sc->sc_ch, 5).

- Where a driver currently sets ifp->if_timer = 1 and also uses
sc->sc_tx_timer, it can instead callout_schedule(sc->sc_ch, 1) and
continue to use sc->sc_tx_timer under its own lock.

- Where a driver currently sets ifp->if_timer = 0, it can
callout_stop(sc->sc_ch), provided, of course, that MP-safe drivers
also callout_halt(sc->sc_ch) before callout_destroy(sc->sc_ch) in
detach and before anything else relies on the callout not to be
running.

- Where a driver currently queries ifp->if_timer == 0 to decide
whether to reschedule it, it can do (under the driver lock):

if (!callout_pending(sc->sc_ch) && !callout_invoking(sc->sc_ch))
callout_schedule(sc->sc_ch, ...);

Why do we need anything more than a callout?

Post by Jason Thorpe
1- It does not have any way to represent multiple hardware-managed transmit queues.

But I don't really see why adding a new txtimer API helps any more
than a callout. Does it actually reduce any duplicated logic that
could meaningfully be shared between, e.g., wm(4) and another
interface with multiple tx queues?

I'm especially skeptical of an API that, by dynamic conditionals,
may combine

(a) a simple countdown timer (if_txtimer),
and
(b) a replica of the callout API using dynamic allocation but with an
interlock determined up front and other slight changes and that has an
embedded simple countdown timer with a boolean indicating that it's
actually the more complex thing (if_txtimer_callout),

without common logic that is meaningfully agnostic to which case it's
working with. For example, will any driver call if_txtimer_start
without actually knowing up front whether it needs to use the callout?
Will any driver call if_txtimer_stop in a code path that would not be
adequately served either by if_txtimer_disarm, or by an unconditional
if_txtimer_disarm & callout_stop?

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Jason Thorpe

2020-01-24 06:37:55 UTC

Post by Taylor R Campbell
I don't really understand why you're proposing a new API. But let me
make sure I understand the context first.
When we submit a packet to a hardware queue for tx, we want to wait up
to n seconds before giving up -- meaning that we report the packet
dropped (if_oerrors++), reset the device, free the packet, recycle
queue entries, whatever.

That's the basic idea, yes.

Post by Taylor R Campbell
Right now, we have a mechanism if_watchdog (a.k.a. if_slowtimo) which
works as follows, for drivers that use it, like wpi(4): Every second,
if ifp->if_timer is nonzero, we decrement it, and if it becomes zero,
we call ifp->if_watchdog -- all under the kernel lock.

Note: The WiFi drivers are a little weird in that net80211 kind of abuses what if_watchdog was originally meant for.

Post by Taylor R Campbell
- Some drivers like sk(4) set ifp->if_timer = 5 and use if_watchdog to
give up after 5sec.
- Some drivers like wpi(4) set ifp->if_timer = 1, sc->sc_tx_timer = 5,
and use if_watchdog to count down sc->sc_tx_timer every second.
That way they can give up after 5sec but also do other housekeeping
like ieee80211_watchdog every second.

Again, the WiFi drivers are abusing the historical use of if_watchdog.

Post by Taylor R Campbell
- Some drivers like wm(4) use a callout instead of if_watchdog.

Originally, wm(4) used if_watchdog / if_timer like ~everyone else. But many drivers also grew their own independent 1-second timer used for link maintenance (I'm fairly familiar with the original behavior and intent of both of these things, since I am the original author of wm(4) and also wrote our MII layer, which is the origins of the 1-second link maintenance timer that many drivers employ). The wm(4) driver was one such driver that used the "tick" callout model for link maintenance (to call mii_tick()).

Post by Taylor R Campbell
- Some drivers like wm(4) maintain one timer for each tx queue.
The _main_ problem you're trying to solve, as I understand it, is that
management of ifp->if_timer -- in net/if.c but also scattered
throughout drivers all over the tree -- is open-coded and relies on
the kernel lock, which you would like to avoid (but please correct me

There are a 3 basic problems:

1- if_timer relies on the KERNEL_LOCK.

2- if_timer does not have a way to model multiple queues.

3- There should be one basic way to modeling the transmit watchdog timer, rather than N. (In the examples you cited above, N is definitely > 1.)

Post by Taylor R Campbell

Post by Jason Thorpe
2- It's not easy to make MP-safe because it relies on modifying
the ifnet structure periodically outside of the normal locking
mechanisms.

Presumably we could add a new lock for access to if_timer, and that
would solve this problem. But it seems to me it would be even simpler

So, yes, one could just use a callout to solve this universally; I initially considered it. That will be a bit more work to adopt inside drivers, but that's not really a showstopper. But this proposal isn't borne simply out of ease-of-conversion.

Callouts are a bit more expensive... to arm them or introspect them requires taking an additional lock **shared by other callouts** that could be a source of lock contention; my proposal has the txtimer protected by the txq lock that would already be held in a NET_MPSAFE driver, and arming is just a 64-bit relaxed atomic load (yes, I know this would require a slightly more expensive operation on 32-bit machines), an unprotected 64-bit store, and an unprotected store of a bool (these unprotected stores are actually protected by the txq lock; I call them "unprotected" because they themselves don't requite atomicity or ordering). This could make a difference when you're trying to blast out packets at 40Gb/s.

Callouts also have a larger memory footprint ... not hugely so, mind you, but each one is 10 pointers large; my proposal uses a bit less memory than callouts... nearly a wash if you consider the fields that can be removed from struct ifnet when the conversion of every driver is complete. (Though, I do concede that for drivers that do not already use a periodic timer, it would end up using about the same amount of memory as a plain callout, once you account for the fields that can be removed from struct ifnet). This memory footprint isn't going to shatter the earth... but it does potentially mean more cache lines occupied.

Tx transmit timers have no need for precision ... 1 second granularity is totally sufficient, and many drivers are already using a 1-second tick timer for link maintenance / stats updating; my proposal piggy-backs on that. The cost is that you do some work once a second.

Post by Taylor R Campbell
- Where a driver currently sets ifp->if_timer = 5, it can instead
callout_schedule(sc->sc_ch, 5).
- Where a driver currently sets ifp->if_timer = 1 and also uses
sc->sc_tx_timer, it can instead callout_schedule(sc->sc_ch, 1) and
continue to use sc->sc_tx_timer under its own lock.
- Where a driver currently sets ifp->if_timer = 0, it can
callout_stop(sc->sc_ch), provided, of course, that MP-safe drivers
also callout_halt(sc->sc_ch) before callout_destroy(sc->sc_ch) in
detach and before anything else relies on the callout not to be
running.
- Where a driver currently queries ifp->if_timer == 0 to decide
if (!callout_pending(sc->sc_ch) && !callout_invoking(sc->sc_ch))
callout_schedule(sc->sc_ch, ...);
Why do we need anything more than a callout?

Post by Jason Thorpe
1- It does not have any way to represent multiple hardware-managed transmit queues.

But I don't really see why adding a new txtimer API helps any more
than a callout. Does it actually reduce any duplicated logic that
could meaningfully be shared between, e.g., wm(4) and another
interface with multiple tx queues?

You are not wrong that the entire problem could be addressed by using callouts everywhere.

Post by Taylor R Campbell
I'm especially skeptical of an API that, by dynamic conditionals,
may combine
(a) a simple countdown timer (if_txtimer),
and

if_txtimer is not a countdown timer. When the timer is armed, it take a timestamp, and takes another timestamp and compares them when the "is this expired??" question is asked. It's specifically designed to let someone else do the asking of that question.

Post by Taylor R Campbell
(b) a replica of the callout API using dynamic allocation but with an
interlock determined up front and other slight changes and that has an
embedded simple countdown timer with a boolean indicating that it's
actually the more complex thing (if_txtimer_callout),

The intent of if_txtimer_callout is to essentially provide a the callout-backed "tick" facility to drive if_txtimer to drivers that wouldn't otherwise need that facility for their own purposes. It's a set of convenience functions to reduce code duplication.

Post by Taylor R Campbell
without common logic that is meaningfully agnostic to which case it's
working with.

if_txtimer_arm(), if_txtimer_disarm(), if_txtimer_is_armed(), if_txtimer_is_expired(), and if_txtimer_is_expired_explicit() work exactly the same for both the if_txtimer and if_txtimer_callout cases. The users of if_txtimer_callout don't generally need to call if_txtimer_is_expired(), however, because that's done by a convenience function. Every driver could use if_txtimer_callout, but I chose to let drivers that already have their own tick callout piggy-back on that.

Post by Taylor R Campbell
For example, will any driver call if_txtimer_start
without actually knowing up front whether it needs to use the callout?
Will any driver call if_txtimer_stop in a code path that would not be
adequately served either by if_txtimer_disarm, or by an unconditional
if_txtimer_disarm & callout_stop?

The rules are pretty simple:

-- If the driver already provides its own 1-second "tick" callout, it uses that, and doesn't bother with if_txtimer_start() or if_txtimer_{stop,halt}().

-- If the driver does not already provide its own 1-second "tick" callout, it uses if_txtimer_start() and if_txtimer_{stop,halt}().

if_txtimer_{stop,halt}() internally also does a if_txtimer_disarm(), but that's to avoid having to call two different function in the if_stop routine (if_txtimer_disarm() followed by if_txtimer_stop()).

-- thorpej

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Taylor R Campbell

2020-01-24 08:27:55 UTC

Date: Thu, 23 Jan 2020 22:37:55 -0800
1- if_timer relies on the KERNEL_LOCK.

Agreed.

2- if_timer does not have a way to model multiple queues.

Does if_txtimer have a way to model multiple queues? Is it not just
driver-specific logic to handle sc->sc_txq[i].txq_lastsent for each
queue independently like wm(4)?

3- There should be one basic way to modeling the transmit watchdog
timer, rather than N. (In the examples you cited above, N is
definitely > 1.)

If we just replaced if_slowtimo by an explicit callout in each driver
just like wm_tick, would you still consider that to be >1 model?

Callouts are a bit more expensive... to arm them or introspect them
requires taking an additional lock **shared by other callouts** that
could be a source of lock contention;

The lock is normally the per-CPU lock of the CPU where it was last
scheduled, so that should be pretty cheap unless something is going
seriously wrong with callouts and/or the driver.

This could make a difference when you're trying to blast out packets
at 40Gb/s.

Which drivers (a) are for hardware devices that can handle >>1GB/sec,
and (b) use if_watchdog -- i.e., which high-throughput drivers would
take advantage of the callout-based mechanism inside your proposed
if_txtimer?

Callouts also have a larger memory footprint ... not hugely so, mind
you, but each one is 10 pointers large; my proposal uses a bit less
memory than callouts... nearly a wash if you consider the fields
that can be removed from struct ifnet when the conversion of every
driver is complete.

If we break the drivers into two classes:

1. Those that currently use if_watchdog.
- Status quo: There already is exactly 1 callout per device.
- With callouts: There would still be exactly 1 callout per device,
but in struct foo_softc instead of struct ifnet.
- With if_txtimer: There would still be exactly 1 callout per
device, but in struct if_txtimer instead of struct ifnet.

2. Those that currently do not use if_watchdog and do their own thing.
- Status quo: There already are (1 + n) callouts per device, where
n is however many the driver allocates to do its own thing.
- With callouts: There would be n callouts per device because we
get rid of if_slowtimo.
- With if_txtimer: There would be n callouts per device because we
get rid of if_slowtimo.

I don't see a difference. Am I missing something?

Tx transmit timers have no need for precision ... 1 second
granularity is totally sufficient, and many drivers are already
using a 1-second tick timer for link maintenance / stats updating;
my proposal piggy-backs on that. The cost is that you do some work
once a second.

I don't understand how this makes a difference between just using
callouts directly, and hiding callouts in an API that uses them on
request. Can you elaborate?

The intent of if_txtimer_callout is to essentially provide a the
callout-backed "tick" facility to drive if_txtimer to drivers that
wouldn't otherwise need that facility for their own purposes. It's
a set of convenience functions to reduce code duplication.

Can you show what code is deduplicated this way?

The diff has +102 -82, even if I ignore the additions in net/if.h and
net/if.c. Some + is to be expected by pushing stuff from struct ifnet
optionally into struct foo_softc, but I read through the patch and I
don't see anything that looks plausibly like deduplication, other than
perhaps the expression in if_txtimer_expired_explicit in wm(4). The
comment and whitespace density seems to be about the same before and
after, so that seems like it's increasing the amount of code _and_ the
API complexity.

Post by Taylor R Campbell
For example, will any driver call if_txtimer_start
without actually knowing up front whether it needs to use the callout?
Will any driver call if_txtimer_stop in a code path that would not be
adequately served either by if_txtimer_disarm, or by an unconditional
if_txtimer_disarm & callout_stop?

My questions weren't about the rules for use. My questions were:

1. Will any driver call if_txtimer_start without actually knowing up
front whether it needs to use the callout?

2. Will any driver call if_txtimer_stop in a code path that would not
be adequately served either by if_txtimer_disarm, or by an
unconditional if_txtimer_disarm & callout_stop?

That is, will there be any code paths which meaningfully take
advantage of having a run-time conditional? If yes, can you give an
example? If no, why do we need a replica of the callout API spelled
differently in if_txtimer?

In the diff you showed, the answers to both questions are no: wm(4)
and pcn(4) _know_ they don't use the if_txtimer callout, so they don't
need if_txtimer_start/stop at all; sq(4) _knows_ it uses the
if_txtimer callout, so it could just as well use callout_schedule and
callout_stop instead of if_txtimer_start/stop.

This suggests to me that the if_txtimer API (at least, the version
with optional callouts) is confounding two different purposes and
would make it harder, not easier, to understand the drivers -- it uses
one name, but there are really two disparate usage models.

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Jason Thorpe

2020-01-24 14:57:03 UTC

Post by Taylor R Campbell

Post by Jason Thorpe
2- if_timer does not have a way to model multiple queues.

Does if_txtimer have a way to model multiple queues? Is it not just
driver-specific logic to handle sc->sc_txq[i].txq_lastsent for each
queue independently like wm(4)?

if_txtimer is logically associated with a transmit queue, so you would have one if_txtimer per transmit queue. This is what wm(4) does. (I think I mentioned originally that if_txtimer is essentially the wm(4) driver's transmit timer logic wrapped up in a small API that makes it easy to adopt in other drivers with minimal changes.)

Post by Taylor R Campbell

Post by Jason Thorpe
3- There should be one basic way to modeling the transmit watchdog
timer, rather than N. (In the examples you cited above, N is
definitely > 1.)

If we just replaced if_slowtimo by an explicit callout in each driver
just like wm_tick, would you still consider that to be >1 model?

If each driver used a callout per transmit queue, then it would be a single model. I noted in my last message that callout would address all of the same issues as if_txtimer, but that I had other reasons for not using it.

Post by Taylor R Campbell

Post by Jason Thorpe
Callouts are a bit more expensive... to arm them or introspect them
requires taking an additional lock **shared by other callouts** that
could be a source of lock contention;

The lock is normally the per-CPU lock of the CPU where it was last
scheduled, so that should be pretty cheap unless something is going
seriously wrong with callouts and/or the driver.

With these timers, in callout's terminology, it will be extremely common (like maybe 99.999%-100% of the time) for them to be "pending", and extremely rare for them to be "invoking" or "expired". In a high-traffic scenario, this means that they will be almost always assigned to a CPU. My concern is not necessarily other callouts within the driver causing contention, but other callouts in the system that might happen to also be assigned to that CPU.

Perhaps I'm engaged in premature optimization, but I wanted to avoid having to use *any* additional locks other than what the driver would already be using internally.

Post by Taylor R Campbell

Post by Jason Thorpe
This could make a difference when you're trying to blast out packets
at 40Gb/s.

Which drivers (a) are for hardware devices that can handle >>1GB/sec,
and (b) use if_watchdog -- i.e., which high-throughput drivers would
take advantage of the callout-based mechanism inside your proposed
if_txtimer?

dge(4).

Post by Taylor R Campbell

Post by Jason Thorpe
Callouts also have a larger memory footprint ... not hugely so, mind
you, but each one is 10 pointers large; my proposal uses a bit less
memory than callouts... nearly a wash if you consider the fields
that can be removed from struct ifnet when the conversion of every
driver is complete.

1. Those that currently use if_watchdog.
- Status quo: There already is exactly 1 callout per device.
- With callouts: There would still be exactly 1 callout per device,
but in struct foo_softc instead of struct ifnet.
- With if_txtimer: There would still be exactly 1 callout per
device, but in struct if_txtimer instead of struct ifnet.
2. Those that currently do not use if_watchdog and do their own thing.
- Status quo: There already are (1 + n) callouts per device, where
n is however many the driver allocates to do its own thing.
- With callouts: There would be n callouts per device because we
get rid of if_slowtimo.
- With if_txtimer: There would be n callouts per device because we
get rid of if_slowtimo.
I don't see a difference. Am I missing something?

This is not actually the real breakdown. The current situation is:

- If a driver has a non-NULL (*if_slowtimo)() (i.e. (*if_watchdog)()) function pointer, that ifnet structure will have a callout that fires 1x/sec to implement if_slowtimo (note the use of a callout for this purpose is different from the original BSD implementation). The driver does not have access to this 1x/sec callout; it is only called back via (*if_slowtimo)() iff when if_timer transitions from 1 -> 0.

- Many network interface drivers have their own independent 1x/sec "tick" callout to implement link maintenance (this was introduced when mii(4) was added), stats gathering, etc.

So, the truth is that several drivers today actually have 2 -- if_slowtimo and "tick". The wm(4) driver has 1 (it has a NULL (*if_slowtimo)(), and provides a "tick" callout for its own use).

My proposal keeps wm(4) at 1 (its own "tick" callout) (because it simply encapsulates the wm(4) approach into an API). Using bare callouts to implement the tx watchdog timer would mean wm(4) could have up to 17 ("tick" + one callout per hardware tx queue).

The reason that wm(4) only has a single callout despite having up to 16 hardware tx queues is because in its "tick" callout, it checks all tx queue timers for expiration in a batch. My intent with if_txtimer is that drivers that have multiple hardware tx queues would use the wm(4) driver's pattern.

My addition of if_txtimer_callout is meant to simplify the "hardware only has 1 tx queue, and driver does not already have a tick callout" case. I.e. it allows this class of simpler driver (perhaps those used for legacy chips, e.g. if_sq.c) to use the new NET_MPSAFE mechanism without having to add their own "tick" callout directly.

Post by Taylor R Campbell

Post by Jason Thorpe
Tx transmit timers have no need for precision ... 1 second
granularity is totally sufficient, and many drivers are already
using a 1-second tick timer for link maintenance / stats updating;
my proposal piggy-backs on that. The cost is that you do some work
once a second.

I don't understand how this makes a difference between just using
callouts directly, and hiding callouts in an API that uses them on
request. Can you elaborate?

The trade-off I am making here is "arming the timer in the driver transmit hot path is super cheap at the cost of checking it once a second" vs. "arming the timer in the driver transmit hot path a little more expensive, but you don't have to check it periodically once a second".

Maybe you're missing that if_txtimer_callout fires once a second, not when the transmit timer actually expires.

Post by Taylor R Campbell

Post by Jason Thorpe
The intent of if_txtimer_callout is to essentially provide a the
callout-backed "tick" facility to drive if_txtimer to drivers that
wouldn't otherwise need that facility for their own purposes. It's
a set of convenience functions to reduce code duplication.

Can you show what code is deduplicated this way?
The diff has +102 -82, even if I ignore the additions in net/if.h and
net/if.c. Some + is to be expected by pushing stuff from struct ifnet
optionally into struct foo_softc, but I read through the patch and I
don't see anything that looks plausibly like deduplication, other than
perhaps the expression in if_txtimer_expired_explicit in wm(4). The
comment and whitespace density seems to be about the same before and
after, so that seems like it's increasing the amount of code _and_ the
API complexity.

My comment there was directed at if_txtimer_callout ... again, that is a convenience flavor of the API meant to prevent a driver that does not already provide a "tick" callout from having to add one. Right now, there is a shared "slowtimo" facility that all drivers can leverage (it's just that it happens to be a bad mechanism). Eventually, that will be removed.

Increasing API complexity is inevitable because MP-safe APIs are more complex than naive ones. I am, however, attempting to minimize the complexity and still make the MP-safe mechanism easy to adopt universally (even for platforms that don't actually need it).

Post by Taylor R Campbell

Post by Jason Thorpe

Post by Taylor R Campbell
For example, will any driver call if_txtimer_start
without actually knowing up front whether it needs to use the callout?
Will any driver call if_txtimer_stop in a code path that would not be
adequately served either by if_txtimer_disarm, or by an unconditional
if_txtimer_disarm & callout_stop?

1. Will any driver call if_txtimer_start without actually knowing up
front whether it needs to use the callout?

No. if_txtimer_start() is meant only for drivers that don't provide their own "tick" ... i.e. it starts the "tick" that's hidden inside the API.

Post by Taylor R Campbell
2. Will any driver call if_txtimer_stop in a code path that would not
be adequately served either by if_txtimer_disarm, or by an
unconditional if_txtimer_disarm & callout_stop?

if_txtimer_stop() stops the "tick" that's hidden inside the API. Drivers that provide their own "tick" won't use it. It's meant to be used inside the driver routine that brings the interface down. It is a convenience wrapper around callout_stop() for drivers that don't already have their own callout_stop() call in that location.

In if_txtimer_disarm() is a lightweight function that can be used inside there driver's tx interrupt routine (perhaps a hot path) that means the equivalent of "ifp->if_timer = 0;".

Post by Taylor R Campbell
That is, will there be any code paths which meaningfully take
advantage of having a run-time conditional? If yes, can you give an
example? If no, why do we need a replica of the callout API spelled
differently in if_txtimer?

It's not a replica. It's a wrapper, and it's meant to be used by drivers that themselves don't already use a periodic callout directly.

Post by Taylor R Campbell
In the diff you showed, the answers to both questions are no: wm(4)
and pcn(4) _know_ they don't use the if_txtimer callout, so they don't
need if_txtimer_start/stop at all; sq(4) _knows_ it uses the
if_txtimer callout, so it could just as well use callout_schedule and
callout_stop instead of if_txtimer_start/stop.
This suggests to me that the if_txtimer API (at least, the version
with optional callouts) is confounding two different purposes and
would make it harder, not easier, to understand the drivers -- it uses
one name, but there are really two disparate usage models.

No, they're not disparate usage models at all, because you'll notice that the way those drivers use if_txtimer_arm() and if_txtimer_disarm() are equivalent.

The difference is really one just of setup.

-- thorpej

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Taylor R Campbell

2020-01-28 16:29:14 UTC

Date: Fri, 24 Jan 2020 06:57:03 -0800

Post by Taylor R Campbell
Does if_txtimer have a way to model multiple queues? Is it not just
driver-specific logic to handle sc->sc_txq[i].txq_lastsent for each
queue independently like wm(4)?

if_txtimer is logically associated with a transmit queue, so you
would have one if_txtimer per transmit queue. This is what wm(4)
does. (I think I mentioned originally that if_txtimer is
essentially the wm(4) driver's transmit timer logic wrapped up in a
small API that makes it easy to adopt in other drivers with minimal
changes.)

You say if_txtimer is logically associated with a tx queue, but _part_
of the if_txtimer API is never associated with tx queue -- the callout
is, as far as I can see, only ever one-per-driver. This model strikes
me as confusing: half of what if_txtimer does is per-txqueue, and half
of what it does is per-driver, but the API presents it as one thing --
if_txtimer.

Perhaps I'm engaged in premature optimization, but I wanted to avoid
having to use *any* additional locks other than what the driver
would already be using internally.

I think this is premature optimization because the drivers we're
talking about changing already rely on the kernel lock!

Post by Taylor R Campbell
Which drivers (a) are for hardware devices that can handle >>1GB/sec,
and (b) use if_watchdog -- i.e., which high-throughput drivers would
take advantage of the callout-based mechanism inside your proposed
if_txtimer?

dge(4).

Does dge(4) actually attain 10GB/sec using only a single tx queue with
the kernel lock as is?

My proposal keeps wm(4) at 1 (its own "tick" callout) (because it
simply encapsulates the wm(4) approach into an API). Using bare
callouts to implement the tx watchdog timer would mean wm(4) could
have up to 17 ("tick" + one callout per hardware tx queue).

Perhaps I wasn't clear -- I didn't mean using one callout per tx
queue; I meant using one `tick' callout per driver. In other words,
why don't we just convert the drivers that use if_watchdog now to set
up their own tick callout?

No change needed to wm(4), except perhaps the time_uptime expiry
expressions (which as you noted should really use 32-bit atomic
load/store instead of the racy 64-bit time_uptime).

For single-queue drivers currently relying on the kernel lock I
imagine it's perfectly fine to replace ifp->if_timer = 5 by
callout_schedule(sc->sc_ch, 5*hz). The only other differences from
the status quo are that you would say callout_init(...) instead of
ifp->if_watchdog = ... and you have to call callout_stop/halt in
detach. Drivers needing to avoid the callout lock, or multi-queue
drivers, can just make the callout tick once per second like wm(4)
does.

What I see in the if_txtimer proposal is that it's supposed to be used
per-queue, EXCEPT for single-queue drivers, in which case it also
serves as another way to spell a tick callout.

But it also has a slightly different variant of the callout API -- and
I see some bugs in the patch you sent (e.g., a disarmed txtimer that
has a callout races with halt), which is part of why I'm suspicious of
adding something that is almost just another way to spell callouts but
with slightly different API requirements.

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Jason Thorpe

2020-01-28 17:04:11 UTC

Post by Taylor R Campbell
You say if_txtimer is logically associated with a tx queue, but _part_
of the if_txtimer API is never associated with tx queue -- the callout
is, as far as I can see, only ever one-per-driver. This model strikes
me as confusing: half of what if_txtimer does is per-txqueue, and half
of what it does is per-driver, but the API presents it as one thing --
if_txtimer.

This is because the simplified variant of the API is there specifically to support simple drivers / devices that have only a single hardware-managed transmit queue and that don't already use a periodic timer. (There are a bunch of these.)

Post by Taylor R Campbell

Post by Jason Thorpe
Perhaps I'm engaged in premature optimization, but I wanted to avoid
having to use *any* additional locks other than what the driver
would already be using internally.

I think this is premature optimization because the drivers we're
talking about changing already rely on the kernel lock!

No, we're not. There are other drivers in the tree that DO NOT use the kernel lock that WOULD STILL USE the core functionality of if_txtimer. I am trying to make a single API that works well for both the old-and-slow and the new-and-fast use cases. It seems stupid to just make everything ad hoc.

Post by Taylor R Campbell

Post by Jason Thorpe

Post by Taylor R Campbell
Which drivers (a) are for hardware devices that can handle >>1GB/sec,
and (b) use if_watchdog -- i.e., which high-throughput drivers would
take advantage of the callout-based mechanism inside your proposed
if_txtimer?

dge(4).

Does dge(4) actually attain 10GB/sec using only a single tx queue with
the kernel lock as is?

Post by Jason Thorpe
My proposal keeps wm(4) at 1 (its own "tick" callout) (because it
simply encapsulates the wm(4) approach into an API). Using bare
callouts to implement the tx watchdog timer would mean wm(4) could
have up to 17 ("tick" + one callout per hardware tx queue).

Perhaps I wasn't clear -- I didn't mean using one callout per tx
queue; I meant using one `tick' callout per driver. In other words,
why don't we just convert the drivers that use if_watchdog now to set
up their own tick callout?

Because there are a bunch of drivers for which that would mean adding a bunch of boilerplate code that serves no other purpose. I was originally going to do that, but then figured I'd get pushback for a bunch of duplicate boilerplate code that serves no other purpose, so I decided to provide helper routines that all of those cases could use. I never imagined I'd have to defend that decision to this extent.

Post by Taylor R Campbell
No change needed to wm(4), except perhaps the time_uptime expiry
expressions (which as you noted should really use 32-bit atomic
load/store instead of the racy 64-bit time_uptime).
For single-queue drivers currently relying on the kernel lock I
imagine it's perfectly fine to replace ifp->if_timer = 5 by
callout_schedule(sc->sc_ch, 5*hz). The only other differences from
the status quo are that you would say callout_init(...) instead of
ifp->if_watchdog = ... and you have to call callout_stop/halt in
detach. Drivers needing to avoid the callout lock, or multi-queue
drivers, can just make the callout tick once per second like wm(4)
does.
What I see in the if_txtimer proposal is that it's supposed to be used
per-queue, EXCEPT for single-queue drivers, in which case it also
serves as another way to spell a tick callout.

IF you want to add a bunch of duplicate boilerplate code that could be easily refactored to a bunch of drivers, be my guest.

-- thorpej

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Jason Thorpe

2020-01-24 17:08:16 UTC

Post by Jason Thorpe
No, they're not disparate usage models at all, because you'll notice that the way those drivers use if_txtimer_arm() and if_txtimer_disarm() are equivalent.
The difference is really one just of setup.

While driving into work, I thought of something else I wanted to say about this...

You're saying "two disparate usage models" ... but I see it very differently.

I see this as "an API that a simple way of using it that satisfies a lot of usage scenarios, and a more complicated way of using it for applications that want a little more control and/or flexibility, or who wish to optimize".

The simpler model is to use if_txtimer_alloc_with_callback() (i.e. what we've been referring to as "if_txtimer_callout" in this thread). This model is the "Hey, we set things up to handle a single transmit queue and do a bunch of the work for you, all you need to do is tell us what function to call when we detect that the timer has expired, and when you bring the interface up and down."

The more complicated model is to use if_txtimer_alloc(). This model is the "Hey, I understand you have a more complex driver there. Here's a way to use this facility cheaply within the structure you already have."

There is nothing preventing the simpler variant from being used everywhere, but the more complicated variant (which isn't really that much more complicated) allows for a bit less overhead when the driver is already doing some of necessary work.

That's really the gist of it.

-- thorpej

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Jason Thorpe

2020-01-24 17:15:20 UTC

Post by Jason Thorpe
Callouts are a bit more expensive... to arm them or introspect them requires taking an additional lock **shared by other callouts** that could be a source of lock contention; my proposal has the txtimer protected by the txq lock that would already be held in a NET_MPSAFE driver, and arming is just a 64-bit relaxed atomic load (yes, I know this would require a slightly more expensive operation on 32-bit machines), an unprotected 64-bit store, and an unprotected store of a bool (these unprotected stores are actually protected by the txq lock; I call them "unprotected" because they themselves don't requite atomicity or ordering). This could make a difference when you're trying to blast out packets at 40Gb/s.

Another point that occurred to me as I was walking out to my car this morning... it doesn't have to be a 64-bit load... Under the covers, it's using time_uptime ... but I see very little reason why time_uptime has to be a time_t. A 32-bit number is sufficient to count ~136 years of uptime.

-- thorpej

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

20 Replies
1 View
Permalink to this page
Disable enhanced parsing

Thread Navigation

Jason Thorpe 2020-01-20 22:48:31 UTC

Jason Thorpe 2020-01-20 22:59:34 UTC

Jason Thorpe 2020-01-22 00:21:49 UTC

Kengo NAKAHARA 2020-01-23 04:02:08 UTC

Ryota Ozaki 2020-01-23 06:56:42 UTC

Jason Thorpe 2020-01-23 14:28:39 UTC

Jason Thorpe 2020-01-23 15:23:46 UTC

Andrew Doran 2020-01-23 16:01:29 UTC

Andrew Doran 2020-01-23 16:10:27 UTC

Jason Thorpe 2020-01-23 17:03:57 UTC

Jason Thorpe 2020-01-23 17:19:42 UTC

Andrew Doran 2020-01-24 00:32:54 UTC

Andrew Doran 2020-01-24 00:51:33 UTC

Taylor R Campbell 2020-01-24 05:38:25 UTC

Jason Thorpe 2020-01-24 06:37:55 UTC

Taylor R Campbell 2020-01-24 08:27:55 UTC

Jason Thorpe 2020-01-24 14:57:03 UTC

Taylor R Campbell 2020-01-28 16:29:14 UTC

Jason Thorpe 2020-01-28 17:04:11 UTC

Jason Thorpe 2020-01-24 17:08:16 UTC

Jason Thorpe 2020-01-24 17:15:20 UTC

about - legalese

Loading...