workqueue in if_bnx: static struct work * seems wrong

Discussion:

(too old to reply)

Greg Troxel

2012-02-21 17:30:56 UTC

My colleague Bev Schwartz found that in src/sys/dev/pci/if_bnx.c,
workqueue_enqueue is called with a struct work * that points to a static
variable. But, this struct is used as a linked list, and reuse of the
struct risks creating a circular list.

bnx's workqueue callback does not free the struct work *. Other
workqueue uses seem to alloc before enqueue and then free after the work
is done.

It turns out that under our workloads we have yet to see a case where
wk->wk_entry == wk.

Can anyone explain why the current code is ok, or how it should change?

Manuel Bouyer

2012-02-22 10:42:22 UTC

Permalink

Post by Greg Troxel
My colleague Bev Schwartz found that in src/sys/dev/pci/if_bnx.c,
workqueue_enqueue is called with a struct work * that points to a static
variable. But, this struct is used as a linked list, and reuse of the
struct risks creating a circular list.
bnx's workqueue callback does not free the struct work *. Other
workqueue uses seem to alloc before enqueue and then free after the work
is done.
It turns out that under our workloads we have yet to see a case where
wk->wk_entry == wk.
Can anyone explain why the current code is ok, or how it should change?

I agree it's not clean, and it should probably be changed.
It works because each workqueue (which are per-bnx instance) can have
only one work enqueued at a time, so wk->wk_entry is always NULL.
If the workqueue implementation is ever changed to use something else
but a SIMPLE_QUEUE, this will fail.

--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Greg Troxel

2012-02-22 13:58:15 UTC

Permalink

Thanks; perhaps I will add a KASSERT and a comment (that having more
than one is prevented by the code).

jean-Yves Migeon

2012-02-26 14:32:23 UTC

Permalink

I made this change to defer allocation of bnx packets outside the
interrupt handler while staying close to what was done in OpenBSD's
bnx(4). For obvious reasons, I could not defer allocation to a thread
context while having to allocate a work struct from handler (chicken/egg
problem).

The work struct is not used per-see, the workqueue_enqueue() step is
solely used to wake the wq thread. IIRC workqueue(9) does not support
passing NULL arguments, so I had to restrict the variable scope to its
smallest block, hence the static declaration in the function. Ugly, but
the allocation deferal fixed the LOCKDEBUG kernel with this driver.

A much cleaner approach would be an API where we can arbitrarily pass
continuations to a thread with /optional/ arguments, and a way to wakeup
a thread without having to implement the locking in the caller every
time (workqueue(9) is close to that but allocating wk structs prevent
API use from interrupt code). Matt's API proposal was interesting in
this regard, but I did not have time to look into it.

--
Jean-Yves Migeon
***@NetBSD.org

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Greg Troxel

2012-02-26 14:47:16 UTC

Permalink

Post by jean-Yves Migeon

Post by Greg Troxel
My colleague Bev Schwartz found that in src/sys/dev/pci/if_bnx.c,
workqueue_enqueue is called with a struct work * that points to a
static variable. But, this struct is used as a linked list, and
reuse of the struct risks creating a circular list.
bnx's workqueue callback does not free the struct work *. Other
workqueue uses seem to alloc before enqueue and then free after the
work is done.
It turns out that under our workloads we have yet to see a case where
wk->wk_entry == wk.
Can anyone explain why the current code is ok, or how it should change?

I made this change to defer allocation of bnx packets outside the
interrupt handler while staying close to what was done in OpenBSD's
bnx(4).

Understood, and that seems fine.

Post by jean-Yves Migeon
For obvious reasons, I could not defer allocation to a thread
context while having to allocate a work struct from handler
(chicken/egg problem).

I don't follow that. One only needs to be able to allocate a small
struct, but I guess you are saying that there might not be memory
available? Or are you saying that all means of allocating memory
require a thread context.

Post by jean-Yves Migeon
The work struct is not used per-see, the workqueue_enqueue() step is
solely used to wake the wq thread. IIRC workqueue(9) does not support
passing NULL arguments, so I had to restrict the variable scope to its
smallest block, hence the static declaration in the function. Ugly,
but the allocation deferal fixed the LOCKDEBUG kernel with this
driver.

I saw that it wasn't used, but subr_workqueue will, if workqueue_enqueue
were called twice, write the forward pointers on the work struct.
The code is missing an explanation of why this apparently unsafe code
is ok, and the critical point is the flag that prevents a second request
From being made while the first one is still on the workqueue proper
(rather than the workqueue's runlist). I have a commit in progress that
adds comments and a KASSERT.

Post by jean-Yves Migeon
A much cleaner approach would be an API where we can arbitrarily pass
continuations to a thread with /optional/ arguments, and a way to
wakeup a thread without having to implement the locking in the caller
every time (workqueue(9) is close to that but allocating wk structs
prevent API use from interrupt code). Matt's API proposal was
interesting in this regard, but I did not have time to look into it.

I don't think the problem isn't the optional argument; it's that
workqueue uses struct work to chain the pending requests.

So we are moving to a world where allocation of memory in interrupt
context is basically not allowed? I think then we need a primitive
which is like workqueue but only allows it to be signalled, and has no
argument, basically a way to request the allocate_stuff() function to be
run.

So this would be workqueue, minus the queue, and hence minus struct
work. Perhaps that's where Matt ended up.

Jean-Yves Migeon

2012-02-26 16:22:50 UTC

Permalink

Post by Greg Troxel

Post by jean-Yves Migeon
For obvious reasons, I could not defer allocation to a thread
context while having to allocate a work struct from handler
(chicken/egg problem).

The former. I could allocate from intterupt handler, but had to handle
the case where memory is scarced and allocation fails (pool_cache(9)
allows that, for example). I regularly read that we should avoid
allocating from interrupt and defer that step to thread context, so I
took down this path.

Note that this choice was influenced because I wanted my patch to
remain close to the one in OpenBSD. Besides, the wk struct is unused in
the current code so allocating it to free it a few seconds later is just
extra overhead for no real purpose.

Post by Greg Troxel

Post by jean-Yves Migeon
The work struct is not used per-see, the workqueue_enqueue() step is
solely used to wake the wq thread. IIRC workqueue(9) does not
support
passing NULL arguments, so I had to restrict the variable scope to its
smallest block, hence the static declaration in the function. Ugly,
but the allocation deferal fixed the LOCKDEBUG kernel with this
driver.

I saw that it wasn't used, but subr_workqueue will, if
workqueue_enqueue
were called twice, write the forward pointers on the work struct.
The code is missing an explanation of why this apparently unsafe code
is ok, and the critical point is the flag that prevents a second request
From being made while the first one is still on the workqueue proper
(rather than the workqueue's runlist). I have a commit in progress that
adds comments and a KASSERT.

Indeed. Go ahead.

Post by Greg Troxel

I don't think the problem isn't the optional argument; it's that
workqueue uses struct work to chain the pending requests.
So we are moving to a world where allocation of memory in interrupt
context is basically not allowed? I think then we need a primitive
which is like workqueue but only allows it to be signalled, and has no
argument, basically a way to request the allocate_stuff() function to be
run.

It is allowed, but for that we have to use allocators like
pool_cache(9). And like all "high priority" allocations they can fail,
so the handler has to cope with this. Last time I checked OpenBSD's
bnx(4) driver this code path was not needed, because of the way their
workqueue implementation works.

Post by Greg Troxel
So this would be workqueue, minus the queue, and hence minus struct
work. Perhaps that's where Matt ended up.

Very likely. IMHO workqueues, callouts and softint scheduling have the
same idea behind: pass down work to be processed in another context. So
they follow the same logic behind, perhaps a general implementation
could be factored out.

--
Jean-Yves Migeon
***@free.fr

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Manuel Bouyer

2012-02-26 21:05:59 UTC

Permalink

Post by Jean-Yves Migeon
[...]
Note that this choice was influenced because I wanted my patch to
remain close to the one in OpenBSD. Besides, the wk struct is unused
in the current code so allocating it to free it a few seconds later
is just extra overhead for no real purpose.

wk is unused by the caller, but workqueue_enqueue() will still write
to it (it's used to maintain the queue, precisely). That's
why you have to wait for a wk to complete before enqueuing it again.

In the way it's used here, there could be a problem when multiple bnx(4)
instances are present (all instances uses the same wk). Because of
the worqueue(9) implementation (and that's not in the specifications) it
works, because it internally uses a simple list and as there's only one
work per workqueue, the next pointer is always NULL (so it's not a problem
to share the same wk by different workqueues). But it's a side effect
of implementation; it's it's changed to use e.g. a circular list,
things will break badly.
I think the correct thing to do would be to have the wk allocated in the
softc.

Manuel Bouyer

2012-03-01 16:29:32 UTC

Permalink

Post by Jean-Yves Migeon

[...]

Yep, you are right. I just had a look at workqueue(9) implementation
this morning, and multiple bnx(4) instances would badly break with
anything besides SIMPLEQ.
Having a wk allocated in sc is the way to go.

Instead of every driver implementing its own memory management, would it not
make more sense to use an common mechanism like, e.g., pool_cache?

A pool for a single per-instance structure which is completely static
deosn't make sense. Just declare it static in the softc ...

Sverre Froyen

2012-03-01 16:23:34 UTC

Permalink

Post by Jean-Yves Migeon

[...]

Instead of every driver implementing its own memory management, would it not
make more sense to use an common mechanism like, e.g., pool_cache?

Regards,
Sverre

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Thor Lancelot Simon

2012-03-01 18:41:57 UTC

Permalink

Post by Jean-Yves Migeon
Having a wk allocated in sc is the way to go.

Instead of every driver implementing its own memory management, would it not
make more sense to use an common mechanism like, e.g., pool_cache?

What "memory management"? This is a single static structure in the device's
private configuration information.

Thor

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Jean-Yves Migeon

2012-02-27 09:22:20 UTC

Permalink

Post by Manuel Bouyer

wk is unused by the caller, but workqueue_enqueue() will still write
to it (it's used to maintain the queue, precisely). That's
why you have to wait for a wk to complete before enqueuing it again.
In the way it's used here, there could be a problem when multiple bnx(4)
instances are present (all instances uses the same wk). Because of
the worqueue(9) implementation (and that's not in the specifications) it
works, because it internally uses a simple list and as there's only one
work per workqueue, the next pointer is always NULL (so it's not a problem
to share the same wk by different workqueues). But it's a side effect
of implementation; it's it's changed to use e.g. a circular list,
things will break badly.
I think the correct thing to do would be to have the wk allocated in the
softc.

--
Jean-Yves Migeon
***@free.fr

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Sverre Froyen

2012-03-02 16:57:00 UTC

Permalink

Post by Manuel Bouyer

Post by Jean-Yves Migeon

[...]

Instead of every driver implementing its own memory management, would it
not make more sense to use an common mechanism like, e.g., pool_cache?

A pool for a single per-instance structure which is completely static
deosn't make sense. Just declare it static in the softc ...

Sorry, I was following up to the latest message on the thread and doing a poor
job of quoting.

I was really thinking about the change that prompted this discussion (if_bnx.c
rev 1.44). In particular the replacement of

bnx_alloc_pkts(sc)

with

workqueue_enqueue(sc->bnx_wq, &bnx_wk, NULL);
SET(sc->bnx_flags, BNX_ALLOC_PKTS_FLAG);

which was made in order to avoid memory allocation in interrupt context.

It seems to me that this could be handled by the pool code instead. Perhaps as
follows:

1) Create pool_caches of header structures (one per buffer size).

2) Use a custom allocator that does the bus_dma dance and attaches the
resulting memory pointers to the header structure.

3) Require a minimum number of free items (pool_cache_setlowat).

4) Call pool_cache_get in the driver to obtain memory buffers.

I attempted to get this working in the iwn driver( for the rx buffers) but ran
into some issues in the pool code that prevented a simple minded version from
working. I believe the problems can be fixed, however.

Regards,
Sverre

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Manuel Bouyer

2012-03-02 17:00:44 UTC

Permalink

Post by Sverre Froyen
Sorry, I was following up to the latest message on the thread and doing a poor
job of quoting.
I was really thinking about the change that prompted this discussion (if_bnx.c
rev 1.44). In particular the replacement of
bnx_alloc_pkts(sc)
with
workqueue_enqueue(sc->bnx_wq, &bnx_wk, NULL);
SET(sc->bnx_flags, BNX_ALLOC_PKTS_FLAG);
which was made in order to avoid memory allocation in interrupt context.
It seems to me that this could be handled by the pool code instead. Perhaps as
1) Create pool_caches of header structures (one per buffer size).
2) Use a custom allocator that does the bus_dma dance and attaches the
resulting memory pointers to the header structure.

The problem is actually in bus_dma, so using pool_cache here won't help.
The problematic bus_dma routine (I don't remmeber which one, sorry) could
still be called from interrupt context.

Sverre Froyen

2012-03-02 17:29:10 UTC

Permalink

Post by Manuel Bouyer

Post by Sverre Froyen
Sorry, I was following up to the latest message on the thread and doing a
poor job of quoting.
I was really thinking about the change that prompted this discussion
(if_bnx.c rev 1.44). In particular the replacement of
bnx_alloc_pkts(sc)
with
workqueue_enqueue(sc->bnx_wq, &bnx_wk, NULL);
SET(sc->bnx_flags, BNX_ALLOC_PKTS_FLAG);
which was made in order to avoid memory allocation in interrupt context.
It seems to me that this could be handled by the pool code instead.
1) Create pool_caches of header structures (one per buffer size).
2) Use a custom allocator that does the bus_dma dance and attaches the
resulting memory pointers to the header structure.

The problem is actually in bus_dma, so using pool_cache here won't help.
The problematic bus_dma routine (I don't remmeber which one, sorry) could
still be called from interrupt context.

It's bus_dmamem_map which is using a pool with IPL_NONE. This is one of the
issues I was referring to. I think the trick might be to

1) allow pool_cache_get to return existing items for an IPL_NONE cache even
when called in interrupt context (this will require changes to the pool /
pool_cache locking code).

2) fix the code that manages the low water mark so that it does not run in
interrupt context (perhaps use a separate thread to grow the pools).

The idea is that the low water mark would guarantee available buffers for the
drivers, most of the time.

Regards,
Sverre

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Manuel Bouyer

2012-03-02 17:48:28 UTC

Permalink

Post by Sverre Froyen
It's bus_dmamem_map which is using a pool with IPL_NONE. This is one of the
issues I was referring to. I think the trick might be to
1) allow pool_cache_get to return existing items for an IPL_NONE cache even
when called in interrupt context (this will require changes to the pool /
pool_cache locking code).

I don't think that's desireable.

Post by Sverre Froyen
2) fix the code that manages the low water mark so that it does not run in
interrupt context (perhaps use a separate thread to grow the pools).

thers's more problems than this one in this code path. There are adaptative
mutexes, for example. pmap_enter() is just not designed to be called
from interrupt context. It's not just pool_cache.