BNX driver problem when mbuf clusters run out

Discussion:

BNX driver problem when mbuf clusters run out

(too old to reply)

Beverly Schwartz

2012-04-19 23:39:33 UTC

In bnx_rx_intr, there is a while loop:

while (sw_cons != hw_cons)

Inside this loop, we grab the next mbuf that's available.

m = sc->rx_mbuf_ptr[sw_chain_cons];
sc->rx_mbuf_ptr[sw_chain_cons] = NULL;

It then goes on and tries to get an mbuf cluster to replace the one we just took off the ring.

if (bnx_get_buf(sc, &sw_prod, &sw_chain_prod, &sw_prod_bseq))

If bnx_get_buf fails, the code then calls bnx_add_buf to put the mbuf we just received back on the ring.

bnx_add_buf(sc, m, &sw_prod, &sw_chain_prod, &sw_prod_bseq)

Inside bnx_add_buf, first we put the mbuf (the recycled m) at sw_chain_prod. Because this is the same as sw_chain_cons, m gets placed back at the point we just nulled out.

sc->rx_buf_ptr[*chain_prod] = m_new;

Then sw_chain_prod gets bumped up.

*prod = NEXT_RX_BD(*prod);
*chain_prod = RX_CHAIN_IDX(*prod);

When the code returns from bnx_add_buf, a "continue" is executed, thus going around the loop again.

sw_chain_cons has NOT been incremented, since that call to NEXT_RX_BD is further down in processing in the loop.

However, sw_chain_prod has been advanced in bnx_add_buf.

Next time around the loop, we do all of the above, but now sw_chain_prod is one greater than sw_chain_cons. Because we know we're out of mbuf clusters, bnx_get_buf will fail again, and we will recycle the mbuf once again. However, this time, it will be placed one place ahead of sw_chain_cons. Now we have lost an mbuf cluster forever (because there was one already at sc->rx_buf_ptr[*chain_prod] which will be overwritten), and things go downhill from there. Eventually we lose all mbuf clusters, and our interfaces no longer function at all.

Note, there is another condition in which we recycle mbufs, which suffers from the same problem.

I can think of four things we can do, but I'm not sure which is the "right" answer.

1. When we return from bnx_add_buf, restore sw_chain_prod to whatever it was before we called bnx_add_buf. This will probably cause an infinite loop, so probably isn't a great solution.

2. When we return from bnx_add_buf, push sw_cons along. This will cause all of the packets that the bnx driver has sucked in to be dropped.

3. Instead of calling continue, call break. This will leave the receive chain intact. It breaks out of the loop, but bnx_rx_intr will probably be called trying to process the same packet over and over.

4. Instead of calling continue, increment sw_cons and break. This will cause one packet to be dropped, will at least change conditions for the driver.

I shall try all of these options and see what happens...

-Bev
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Joerg Sonnenberger

2012-04-19 23:43:26 UTC

Post by Beverly Schwartz
Note, there is another condition in which we recycle mbufs,
which suffers from the same problem.

The easiest approach for this is generally to invert the order. *First*
try to obtain an mbuf for the list. If that fails, just drop the newly
received packet. That avoids most of the complications.

Joerg

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Beverly Schwartz

2012-04-19 23:45:35 UTC

That won't work in this case. If bnx_get_buf succeeds in getting an mbuf cluster, it then puts it in the rx ring. If we haven't removed the mbuf first, there will be no place to put the new mbuf.

-Bev

Post by Joerg Sonnenberger

Post by Beverly Schwartz
Note, there is another condition in which we recycle mbufs,
which suffers from the same problem.

The easiest approach for this is generally to invert the order. *First*
try to obtain an mbuf for the list. If that fails, just drop the newly
received packet. That avoids most of the complications.
Joerg

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Matt Thomas

2012-04-20 00:27:09 UTC

Post by Beverly Schwartz

Post by Matt Thomas
Unfortunately, that can lead to receiver livelock (spend all the time servicing interrupts without making progress). The best thing to do is pass the packet up the stack and just another packet to the hardware. Instead use bnx_tick to add mbufs if needed. This gives the stack a chance to run and hopefully free some mbufs.

I tried that. And at first, things move along. But eventually, bnx_rx_intr never gets called, because
if (sblk->status_rx_quick_consumer_index0 != sc->nw_rx_cons)
in bnx_intr always fails.

For now, can you try the patch at

http://www.netbsd.org/~matt/if_bnx-diff.txt

It's kind of ugly but it is a minimal change.
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Beverly Schwartz

2012-04-20 01:48:38 UTC

Post by Matt Thomas

Post by Beverly Schwartz

Post by Matt Thomas
Unfortunately, that can lead to receiver livelock (spend all the time servicing interrupts without making progress). The best thing to do is pass the packet up the stack and just another packet to the hardware. Instead use bnx_tick to add mbufs if needed. This gives the stack a chance to run and hopefully free some mbufs.

I tried that. And at first, things move along. But eventually, bnx_rx_intr never gets called, because
if (sblk->status_rx_quick_consumer_index0 != sc->nw_rx_cons)
in bnx_intr always fails.

For now, can you try the patch at
http://www.netbsd.org/~matt/if_bnx-diff.txt
It's kind of ugly but it is a minimal change.

Doesn't quite do it. With this change, the sw_cons index doesn't get incremented. Next time bnx_rx_intr is called, we'll have the same overwrite problem. I'm now trying incremeing sw_cons before calling break. When I left work, it was still running. Tomorrow morning, I'll see if everything is still intact.

-Bev
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Matt Thomas

2012-04-20 05:27:29 UTC

Post by Beverly Schwartz

Post by Matt Thomas

Post by Beverly Schwartz

Post by Matt Thomas
Unfortunately, that can lead to receiver livelock (spend all the time servicing interrupts without making progress). The best thing to do is pass the packet up the stack and just another packet to the hardware. Instead use bnx_tick to add mbufs if needed. This gives the stack a chance to run and hopefully free some mbufs.

I tried that. And at first, things move along. But eventually, bnx_rx_intr never gets called, because
if (sblk->status_rx_quick_consumer_index0 != sc->nw_rx_cons)
in bnx_intr always fails.

For now, can you try the patch at
http://www.netbsd.org/~matt/if_bnx-diff.txt
It's kind of ugly but it is a minimal change.

Doesn't quite do it. With this change, the sw_cons index doesn't get incremented. Next time bnx_rx_intr is called, we'll have the same overwrite problem. I'm now trying incremeing sw_cons before calling break. When I left work, it was still running. Tomorrow morning, I'll see if everything is still intact.

It should get incremented since

sw_cons = NEXT_RX_BD(sw_cons);

is after the loop.
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Beverly Schwartz

2012-04-20 10:29:24 UTC

Post by Matt Thomas

Post by Beverly Schwartz

Post by Matt Thomas

Post by Beverly Schwartz

Post by Matt Thomas
Unfortunately, that can lead to receiver livelock (spend all the time servicing interrupts without making progress). The best thing to do is pass the packet up the stack and just another packet to the hardware. Instead use bnx_tick to add mbufs if needed. This gives the stack a chance to run and hopefully free some mbufs.

I tried that. And at first, things move along. But eventually, bnx_rx_intr never gets called, because
if (sblk->status_rx_quick_consumer_index0 != sc->nw_rx_cons)
in bnx_intr always fails.

For now, can you try the patch at
http://www.netbsd.org/~matt/if_bnx-diff.txt
It's kind of ugly but it is a minimal change.

Doesn't quite do it. With this change, the sw_cons index doesn't get incremented. Next time bnx_rx_intr is called, we'll have the same overwrite problem. I'm now trying incremeing sw_cons before calling break. When I left work, it was still running. Tomorrow morning, I'll see if everything is still intact.

It should get incremented since
sw_cons = NEXT_RX_BD(sw_cons);
is after the loop.

It's after the if conditional. It's still in the loop.

-Bev

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Matt Thomas

2012-04-20 13:45:22 UTC

Post by Beverly Schwartz

Post by Matt Thomas

Post by Beverly Schwartz

Post by Matt Thomas

Post by Beverly Schwartz

Post by Matt Thomas
Unfortunately, that can lead to receiver livelock (spend all the time servicing interrupts without making progress). The best thing to do is pass the packet up the stack and just another packet to the hardware. Instead use bnx_tick to add mbufs if needed. This gives the stack a chance to run and hopefully free some mbufs.

I tried that. And at first, things move along. But eventually, bnx_rx_intr never gets called, because
if (sblk->status_rx_quick_consumer_index0 != sc->nw_rx_cons)
in bnx_intr always fails.

For now, can you try the patch at
http://www.netbsd.org/~matt/if_bnx-diff.txt
It's kind of ugly but it is a minimal change.

Doesn't quite do it. With this change, the sw_cons index doesn't get incremented. Next time bnx_rx_intr is called, we'll have the same overwrite problem. I'm now trying incremeing sw_cons before calling break. When I left work, it was still running. Tomorrow morning, I'll see if everything is still intact.

It should get incremented since
sw_cons = NEXT_RX_BD(sw_cons);
is after the loop.

It's after the if conditional. It's still in the loop.

I don't think you applied the patch properly.

The "if" for testing the mbuf was changed to a while loop so the breaks
just terminate that loop.

The patch has been updated a bit to include a missing bus_dmamap_sync
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Beverly Schwartz

2012-04-20 15:06:02 UTC

Post by Matt Thomas
I don't think you applied the patch properly.
The "if" for testing the mbuf was changed to a while loop so the breaks
just terminate that loop.
The patch has been updated a bit to include a missing bus_dmamap_sync

You are correct, I missed the added loop.

My interfaces still freeze. Here's what I think the problem is.

If bnx_get_buf fails, we still call bnx_add_buf. The mbuf gets put back in the ring. We break out of the inner loop, and then end up calling ifp->ip_input, which eventually frees the mbuf.

I think there are some other problems with this change:

If there are receive frame problems, the mbuf is put back into the ring, and the mbuf is also processed. This will have the same problem as bnx_get_buf failure. However, in this case, we add processing a packet that probably shouldn't be processed.

Further down in the inner loop, we check for a VLAN tag. This has a continue statement. I believe we will now be in an infinite loop. (There are some other weird outcomes I could posit, but in any case, I think we will have undesired behavior.)

I also think there's another bug in the original code. sc->free_rx_bd is incremented before we check if the the mbuf pointer is NULL. Shouldn't we only increment free_rx_bd if we're actually processing a packet?

-Bev
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Beverly Schwartz

2012-04-21 21:07:01 UTC

Post by Matt Thomas
The patch has been updated a bit to include a missing bus_dmamap_sync

I do believe I have tracked down the problem. I have not added the bus_dmamap_sync to my patch since Matt already got that one.

-Bev

Here's my description:

Problem 1:
In bnx_rx_intr, there are two places where the producer can get pushed
ahead of the consumer. In the while (sw_cons != hw_cons) loop,
there is a check for a packet whose layer 2 data has been corrupted,
and a check for being able to allocate a new mbuf cluster to be
placed in the receive queue. In both cases, if the test fails,
the mbuf cluster that has just been removed from the receive ring
buffer is put back, and the producer index is pushed forward. However,
the consumer index is not pushed forward, so the next time around,
the mbuf cluster is removed from where the consumer points, but a
new (or recycled) mbuf cluster is put where the producer points. This
over writes the pointer to another mbuf cluster. The mbuf cluster is
lost forever, and the receive buffer ends up filled with NULLs because
the consumer and produce indices stay out of sync. Eventually all
mbuf clusters are lost, and no more communication can happen over any
interface.

To fix this, I changed the action taken in each test in case of failure.
For the botched layer 2 data, I advance the consumer pointer, and then
break out of the loop. For the failure to get a new mbuf cluster,
I removed the call to recycle the mbuf and allow the packet to be
processed rather than end up dropped on the floor. This now prevents
live lock on the interface because incoming packets can still be
processed even if there are no mbuf clusters to be had. And because
the packet is processed, the consumer pointer is pushed forward.
The producer pointer stays put, because no mbuf has been put into
place in the receiver ring buffer.

In addition, I put a KASSERT in bnx_add_buf to check that no mbuf
cluster are being overwritten in the receive ring.

These fixes stopped the lossage of mbuf clusters.

Problem 2:
bnx_get_buf relies on the counter free_rx_bd to determine whether it
should obtain a new mbuf cluster for the receiver ring. bnx_rx_intr
increments this counter before checking if there was an mbuf available
to be removed. Thus, this counter could end up too high, causing
bnx_get_buf to obtain too many mbuf clusters. Once again, we can
leak mbuf clusters. I moved the increment to the place where the
mbuf cluster is actually removed from the receive ring.

Problem 3:
It is possible for the receive ring to completely lose all of its
mbuf clusters in the receive ring. If this happens, the driver is
dependent on bnx_tick to call bnx_get_buf to refill the ring.
bnx_tick did that, but then forgot to tell the chip about it.
So the chip thought it didn't have any buffers to fill, so it waited.
bnx_intr never got an indication that there were new packets to
read because the chip wasn't loading them, and thus, bnx_rx_intr
would never be called. This caused an individual interface to
freeze, but taking the interface down and up would fix the problem
since it would resync the cpu and the chip. Adding calls to
resync the chip after bnx_tick succeeds in getting new mbuf clusters
fixes the problem.

Here's my patch:

diff --git a/netbsd/src/sys/dev/pci/if_bnx.c b/netbsd/src/sys/dev/pci/if_bnx.c
index 711ec34..6fc815d 100644
--- a/netbsd/src/sys/dev/pci/if_bnx.c
+++ b/netbsd/src/sys/dev/pci/if_bnx.c
@@ -3730,6 +3730,7 @@ bnx_add_buf(struct bnx_softc *sc, struct mbuf *m_new, u_in
* last rx_bd entry to that rx_mbuf_ptr and rx_mbuf_map matches)
* and update counter.
*/
+ KASSERT(sc->rx_mbuf_ptr[*chain_prod] == NULL);
sc->rx_mbuf_ptr[*chain_prod] = m_new;
sc->rx_mbuf_map[first_chain_prod] = sc->rx_mbuf_map[*chain_prod];
sc->rx_mbuf_map[*chain_prod] = map;
@@ -4346,7 +4347,6 @@ bnx_rx_intr(struct bnx_softc *sc)

/* Get the used rx_bd. */
rxbd = &sc->rx_bd_chain[RX_PAGE(sw_chain_cons)][RX_IDX(sw_chain_
- sc->free_rx_bd++;

DBRUN(BNX_VERBOSE_RECV, aprint_error("%s(): ", __func__);
bnx_dump_rxbd(sc, sw_chain_cons, rxbd));
@@ -4393,6 +4393,7 @@ bnx_rx_intr(struct bnx_softc *sc)
/* Remove the mbuf from the driver's chain. */
m = sc->rx_mbuf_ptr[sw_chain_cons];
sc->rx_mbuf_ptr[sw_chain_cons] = NULL;
+ sc->free_rx_bd++;

/*
* Frames received on the NetXteme II are prepended
@@ -4445,7 +4446,8 @@ bnx_rx_intr(struct bnx_softc *sc)
panic("%s: Can't reuse RX mbuf!\n",
device_xname(sc->bnx_dev));
}
- continue;
+ sw_cons = NEXT_RX_BD(sw_cons);
+ break;
}

/*
@@ -4459,18 +4461,6 @@ bnx_rx_intr(struct bnx_softc *sc)
DBRUN(BNX_WARN, aprint_debug_dev(sc->bnx_dev,
"Failed to allocate "
"new mbuf, incoming frame dropped!\n"));
-
- ifp->if_ierrors++;
-
- /* Try and reuse the exisitng mbuf. */
- if (bnx_add_buf(sc, m, &sw_prod,
- &sw_chain_prod, &sw_prod_bseq)) {
- DBRUNIF(1, bnx_breakpoint(sc));
- panic("%s: Double mbuf allocation "
- "failure!",
- device_xname(sc->bnx_dev));
- }
- continue;
}

/* Skip over the l2_fhdr when passing the data up
@@ -5615,8 +5605,12 @@ bnx_tick(void *xsc)
prod_bseq = sc->rx_prod_bseq;
chain_prod = RX_CHAIN_IDX(prod);
bnx_get_buf(sc, &prod, &chain_prod, &prod_bseq);
- sc->rx_prod = prod;
- sc->rx_prod_bseq = prod_bseq;
+ if (sc->rx_prod != prod || sc->rx_prod_bseq != prod_bseq) {
+ sc->rx_prod = prod;
+ sc->rx_prod_bseq = prod_bseq;
+ REG_WR16(sc, MB_RX_CID_ADDR + BNX_L2CTX_HOST_BDIDX, prod);
+ REG_WR(sc, MB_RX_CID_ADDR + BNX_L2CTX_HOST_BSEQ, prod_bseq);
+ }
splx(s);
return;
}

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Greg Troxel

2012-04-21 23:47:14 UTC

If there are no objections I'll commit Bev's patch to current on Monday
or Tuesday. It's getting pretty serious stress testing and has
survived.

<pointy-haired>
Also, should anyone feel in a rush to commit it before I do, please
include in the commit message that this patch is:

Approved for Public Release, Distribution Unlimited

This material is based upon work supported by the Defense Advanced Research
Projects Agency and Space and Naval Warfare Systems Center, Pacific, under
Contract No. N66001-09-C-2073.
</>

Greg Troxel

2012-04-19 23:48:44 UTC

Post by Joerg Sonnenberger

Post by Beverly Schwartz
Note, there is another condition in which we recycle mbufs,
which suffers from the same problem.

The easiest approach for this is generally to invert the order. *First*
try to obtain an mbuf for the list. If that fails, just drop the newly
received packet. That avoids most of the complications.

A fair point, but the driver doesn't actually free an mbuf with the
packet. The 'drop' mechanism is to leave the cluster in place (to be
overwritten later), and the error is in the ring pointer accounting.

My approach would be to try to have the code path in the error case be
as similar to the normal case as possible (in terms of updating ring
pointers), and just not call *if_input with the received mbuf.

Joerg Sonnenberger

2012-04-19 23:54:56 UTC

Post by Beverly Schwartz
That won't work in this case. If bnx_get_buf succeeds in getting
an mbuf cluster, it then puts it in the rx ring. If we haven't
removed the mbuf first, there will be no place to put the new mbuf.

That's circular reasoning. There are three places currently using
bnx_get_buf:

bnx_init_rx_chain
bnx_rx_intr
bnx_tick

The last one can be dropped with the changed semantic, since the RX ring
will never be depleted. The rest is solved by properly splitting up the
actions into allocating a mbuf cluster for the RX ring, loading a
cluster into the RX ring and unloading an already mapped entry.
bnx_init_rx_chain wants to the first two in order, bnx_rx_intr wants to
alloc, if that works: unload and load new cluster.

Joerg

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Matt Thomas

2012-04-20 00:02:53 UTC

Post by Joerg Sonnenberger

Post by Beverly Schwartz
That won't work in this case. If bnx_get_buf succeeds in getting
an mbuf cluster, it then puts it in the rx ring. If we haven't
removed the mbuf first, there will be no place to put the new mbuf.

That's circular reasoning. There are three places currently using
bnx_init_rx_chain
bnx_rx_intr
bnx_tick
The last one can be dropped with the changed semantic, since the RX ring
will never be depleted. The rest is solved by properly splitting up the
actions into allocating a mbuf cluster for the RX ring, loading a
cluster into the RX ring and unloading an already mapped entry.
bnx_init_rx_chain wants to the first two in order, bnx_rx_intr wants to
alloc, if that works: unload and load new cluster.

Unfortunately, that can lead to receiver livelock (spend all the time servicing interrupts without making progress). The best thing to do is pass the packet up the stack and just another packet to the hardware. Instead use bnx_tick to add mbufs if needed. This gives the stack a chance to run and hopefully free some mbufs.
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Beverly Schwartz

2012-04-20 00:08:29 UTC

Post by Matt Thomas

Post by Joerg Sonnenberger

Post by Beverly Schwartz
That won't work in this case. If bnx_get_buf succeeds in getting
an mbuf cluster, it then puts it in the rx ring. If we haven't
removed the mbuf first, there will be no place to put the new mbuf.

That's circular reasoning. There are three places currently using
bnx_init_rx_chain
bnx_rx_intr
bnx_tick
The last one can be dropped with the changed semantic, since the RX ring
will never be depleted. The rest is solved by properly splitting up the
actions into allocating a mbuf cluster for the RX ring, loading a
cluster into the RX ring and unloading an already mapped entry.
bnx_init_rx_chain wants to the first two in order, bnx_rx_intr wants to
alloc, if that works: unload and load new cluster.

Unfortunately, that can lead to receiver livelock (spend all the time servicing interrupts without making progress). The best thing to do is pass the packet up the stack and just another packet to the hardware. Instead use bnx_tick to add mbufs if needed. This gives the stack a chance to run and hopefully free some mbufs.

I tried that. And at first, things move along. But eventually, bnx_rx_intr never gets called, because
if (sblk->status_rx_quick_consumer_index0 != sc->nw_rx_cons)
in bnx_intr always fails.

-Bev

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Manuel Bouyer

2012-04-20 11:05:54 UTC

Post by Greg Troxel
A fair point, but the driver doesn't actually free an mbuf with the
packet. The 'drop' mechanism is to leave the cluster in place (to be
overwritten later), and the error is in the ring pointer accounting.
My approach would be to try to have the code path in the error case be
as similar to the normal case as possible (in terms of updating ring
pointers), and just not call *if_input with the received mbuf.

My feeling is that this is the right approach. When we're out of
mbuf cluster we should drop the received packet(s) and reuse the
mbuf(s) from the ring. If we fail to ack the received packets,
we can either create an interrupt storm or stop the receive process
completely.

--
Manuel Bouyer <***@antioche.eu.org>
NetBSD: 26 ans d'experience feront toujours la difference
--

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

15 Replies
1 View
Permalink to this page
Disable enhanced parsing

Thread Navigation

Beverly Schwartz 2012-04-19 23:39:33 UTC

Joerg Sonnenberger 2012-04-19 23:43:26 UTC

Beverly Schwartz 2012-04-19 23:45:35 UTC

Matt Thomas 2012-04-20 00:27:09 UTC

Beverly Schwartz 2012-04-20 01:48:38 UTC

Matt Thomas 2012-04-20 05:27:29 UTC

Beverly Schwartz 2012-04-20 10:29:24 UTC

Matt Thomas 2012-04-20 13:45:22 UTC

Beverly Schwartz 2012-04-20 15:06:02 UTC

Beverly Schwartz 2012-04-21 21:07:01 UTC

Greg Troxel 2012-04-21 23:47:14 UTC

Greg Troxel 2012-04-19 23:48:44 UTC

Joerg Sonnenberger 2012-04-19 23:54:56 UTC

Matt Thomas 2012-04-20 00:02:53 UTC

Beverly Schwartz 2012-04-20 00:08:29 UTC

Manuel Bouyer 2012-04-20 11:05:54 UTC

about - legalese

Loading...