Discussion:
bnx(4) lockups?
Stephen Borrill
2019-03-19 13:27:58 UTC
I have a reliable installation on one Lenovo server which has wm(4) NICs.
It has connections to the LAN and directly to the open Internet. I tried
to migrate it to a different server which has bnx(4) NICs. If not
connected to the Internet, it runs reliably. If connected to the Internet
and traffic is flowing, it will lock solid after a while. I have
encountered this previously and tracked it down to IPFilter. My workaround
on that occasion was to use "pass in all/pass out all" rules (I could not
disable IPFilter as I needed NAT) as anything more complex caused the
lock. However, on this occasion I switched to NPF and the lockups
continue. This is on netbsd-7.

The hardware has been running XenServer solidly under
heavy load for a number of years. Doing internal data copies (to stress
the HDDs and RAID controller) is also reliable.

I really can't think of much else besides the network hardware (and I
don't have any spare NICs to hand). Is bnx(4) known bad?

bnx0 at pci1 dev 0 function 0: Broadcom NetXtreme II BCM5709 1000Base-T
bnx0: Ethernet address 5c:f3:fc:e4:e6:78
bnx0: interrupting at ioapic1 pin 4
bnx0: ASIC BCM5709 C0 (0x57092003)
bnx0: PCIe x2 5Gbps
bnx0: Coal (RX:6,6,18,18; TX:20,20,80,80)
bnx1 at pci1 dev 0 function 1: Broadcom NetXtreme II BCM5709 1000Base-T
bnx1: Ethernet address 5c:f3:fc:e4:e6:7a
bnx1: interrupting at ioapic1 pin 16
bnx1: ASIC BCM5709 C0 (0x57092003)
bnx1: PCIe x2 5Gbps
bnx1: Coal (RX:6,6,18,18; TX:20,20,80,80)
bnx2 at pci2 dev 0 function 0: Broadcom NetXtreme II BCM5709 1000Base-T
bnx2: Ethernet address 5c:f3:fc:6b:c6:b4
bnx2: interrupting at ioapic1 pin 5
bnx2: ASIC BCM5709 C0 (0x57092003)
bnx2: PCIe x2 5Gbps
bnx2: Coal (RX:6,6,18,18; TX:20,20,80,80)
bnx3 at pci2 dev 0 function 1: Broadcom NetXtreme II BCM5709 1000Base-T
bnx3: Ethernet address 5c:f3:fc:6b:c6:b6
bnx3: interrupting at ioapic1 pin 17
bnx3: ASIC BCM5709 C0 (0x57092003)
bnx3: PCIe x2 5Gbps
bnx3: Coal (RX:6,6,18,18; TX:20,20,80,80)
--
Stephen

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Masanobu SAITOH
2019-03-20 07:44:23 UTC
Hi, Stephen.
Post by Stephen Borrill
I have a reliable installation on one Lenovo server which has wm(4) NICs. It has connections to the LAN and directly to the open Internet. I tried to migrate it to a different server which has bnx(4) NICs. If not connected to the Internet, it runs reliably. If connected to the Internet and traffic is flowing, it will lock solid after a while. I have encountered this previously and tracked it down to IPFilter. My workaround on that occasion was to use "pass in all/pass out all" rules (I could not disable IPFilter as I needed NAT) as anything more complex caused the lock. However, on this occasion I switched to NPF and the lockups continue. This is on netbsd-7.
The hardware has been running XenServer solidly under heavy load for a number of years. Doing internal data copies (to stress the HDDs and RAID controller) is also reliable.
I really can't think of much else besides the network hardware (and I don't have any spare NICs to hand). Is bnx(4) known bad?
[snip]
Could you try the following patch? It is taken from OpenBSD rev. 1.93.
It's just a guess and has not been tested yet.


Index: if_bnx.c
===================================================================
RCS file: /cvsroot/src/sys/dev/pci/if_bnx.c,v
retrieving revision 1.68
diff -u -p -r1.68 if_bnx.c
--- if_bnx.c 22 Jan 2019 03:42:27 -0000 1.68
+++ if_bnx.c 20 Mar 2019 07:37:55 -0000
@@ -2213,6 +2213,8 @@ bnx_dma_free(struct bnx_softc *sc)
 
 	/* Destroy the status block. */
 	if (sc->status_block != NULL && sc->status_map != NULL) {
+		bus_dmamap_sync(sc->bnx_dmatag, sc->status_map, 0,
+		    sc->status_map->dm_mapsize, BUS_DMASYNC_POSTREAD);
 		bus_dmamap_unload(sc->bnx_dmatag, sc->status_map);
 		bus_dmamem_unmap(sc->bnx_dmatag, (void *)sc->status_block,
 		    BNX_STATUS_BLK_SZ);
@@ -2355,6 +2357,9 @@ bnx_dma_alloc(struct bnx_softc *sc)
 		goto bnx_dma_alloc_exit;
 	}
 
+	bus_dmamap_sync(sc->bnx_dmatag, sc->status_map, 0,
+	    sc->status_map->dm_mapsize, BUS_DMASYNC_PREREAD);
+
 	sc->status_block_paddr = sc->status_map->dm_segs[0].ds_addr;
 	memset(sc->status_block, 0, BNX_STATUS_BLK_SZ);
 
@@ -5275,7 +5280,7 @@ bnx_intr(void *xsc)
 	DBRUNIF(1, sc->interrupts_generated++);
 
 	bus_dmamap_sync(sc->bnx_dmatag, sc->status_map, 0,
-	    sc->status_map->dm_mapsize, BUS_DMASYNC_POSTWRITE);
+	    sc->status_map->dm_mapsize, BUS_DMASYNC_POSTREAD);
 
 	/*
 	 * If the hardware status block index
@@ -5354,7 +5359,7 @@ bnx_intr(void *xsc)
 	}
 
 	bus_dmamap_sync(sc->bnx_dmatag, sc->status_map, 0,
-	    sc->status_map->dm_mapsize, BUS_DMASYNC_PREWRITE);
+	    sc->status_map->dm_mapsize, BUS_DMASYNC_PREREAD);
 
 	/* Re-enable interrupts. */
 	REG_WR(sc, BNX_PCICFG_INT_ACK_CMD,



The same diff is at:

http://www.netbsd.org/~msaitoh/bnx-20190320-0.dif
--
-----------------------------------------------
SAITOH Masanobu (***@execsw.org
***@netbsd.org)

Stephen Borrill
2019-03-21 17:14:11 UTC
Post by Masanobu SAITOH
Hi, Stephen.
Post by Stephen Borrill
I have a reliable installation on one Lenovo server which has wm(4) NICs.
It has connections to the LAN and directly to the open Internet. I tried to
migrate it to a different server which has bnx(4) NICs. If not connected to
the Internet, it runs reliably. If connected to the Internet and traffic is
flowing, it will lock solid after a while. I have encountered this
previously and tracked it down to IPFilter. My workaround on that occasion
was to use "pass in all/pass out all" rules (I could not disable IPFilter
as I needed NAT) as anything more complex caused the lock. However, on this
occasion I switched to NPF and the lockups continue. This is on netbsd-7.
The hardware has been running XenServer solidly under heavy load for a
number of years. Doing internal data copies (to stress the HDDs and RAID
controller) is also reliable.
I really can't think of much else besides the network hardware (and I don't
have any spare NICs to hand). Is bnx(4) known bad?
[snip]
Post by Masanobu SAITOH
Could you try the following patch? This is taken from OpenBSD rev. 1.93.
Just guess. Not tested yet.
Unfortunately, it still locked up with this patch.
--
Stephen


Masanobu SAITOH
2019-03-23 02:32:47 UTC
Post by Stephen Borrill
Post by Masanobu SAITOH
Could you try the following patch? This is taken from OpenBSD rev. 1.93.
Just guess. Not tested yet.
Unfortunately, it still locked up with this patch.
:-(

I will try to reproduce the hard hang next week.

Our bnx(4) driver is based on OpenBSD's, and we have not pulled changes
from it for many years. We can also see the differences from FreeBSD's
bce(4) by doing "s/bce/bnx/". I'm going to take some changes from the other
BSDs, which might fix the problem.
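
The "s/bce/bnx/" comparison mentioned above can be sketched as a small shell helper. This is only an illustration: the function name bce_vs_bnx and the file paths are placeholders, and the extra upper-case substitution (BCE to BNX, for macro names) is an assumption that goes beyond the single substitution quoted.

```sh
# Hypothetical helper: rename bce -> bnx in FreeBSD's driver source ($1)
# and show a unified diff against NetBSD's driver source ($2).
# The upper-case rule for macro names is an assumption, not from the post.
bce_vs_bnx() {
    sed -e 's/bce/bnx/g' -e 's/BCE/BNX/g' "$1" | diff -u - "$2"
}

# Example call (placeholder paths):
#   bce_vs_bnx /path/to/freebsd/if_bce.c /path/to/netbsd/if_bnx.c
```

The exit status is that of diff: 0 when the renamed source is identical, 1 when differences remain.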

From my experience with bge(4), I suspect that bnx(4)'s hard hang might
come from an access conflict between the driver (CPU) and the embedded
controller.
--
-----------------------------------------------
SAITOH Masanobu (***@execsw.org
***@netbsd.org)

Masanobu SAITOH
2019-03-23 02:36:18 UTC
One question.
Post by Stephen Borrill
bnx0 at pci1 dev 0 function 0: Broadcom NetXtreme II BCM5709 1000Base-T
[snip]
No MII PHYs?

Could you show me:

0) the dmesg output of the PHYs if available.

1) ifconfig -m

2) pcictl pci0 dump -b [12] -d 0 -f [01]

Thanks.
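
The bracket notation in item 2 reads as shorthand for four separate dumps (bus 1 or 2, function 0 or 1). A minimal sketch that spells them out; the helper name bnx_dump_cmds is hypothetical, and it only prints the command lines so they can be reviewed before anything is run:

```sh
# Hypothetical helper: expand "-b [12] -d 0 -f [01]" into explicit commands.
# It prints the command lines rather than executing pcictl.
bnx_dump_cmds() {
    for bus in 1 2; do
        for func in 0 1; do
            echo "pcictl pci0 dump -b $bus -d 0 -f $func"
        done
    done
}

bnx_dump_cmds
```

Piping the output to sh would run the four dumps in turn. Whether bus numbers 1 and 2 really correspond to the pci1 and pci2 attachments in the dmesg output is an assumption to check against the Vendor Name and Device Name lines of each dump.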
--
-----------------------------------------------
SAITOH Masanobu (***@execsw.org
***@netbsd.org)

Stephen Borrill
2019-03-25 11:48:31 UTC
Post by Masanobu SAITOH
From my experience in bge(4), I suspect the bnx(4)'s hard hang might
come from access conflict between the driver(CPU) and the embedded
controller.
One question.
[snip]
No MII PHYs?
Sorry, missed them out.
Post by Masanobu SAITOH
0) the dmesg output of the PHYs if available.
All 4 are like this:

bnx0 at pci1 dev 0 function 0: Broadcom NetXtreme II BCM5709 1000Base-T
bnx0: Ethernet address 5c:f3:fc:e4:e6:78
bnx0: interrupting at ioapic1 pin 4
bnx0: ASIC BCM5709 C0 (0x57092003)
bnx0: PCIe x2 5Gbps
bnx0: Coal (RX:6,6,18,18; TX:20,20,80,80)
brgphy0 at bnx0 phy 1: BCM5709 10/100/1000baseT PHY, rev. 8
Post by Masanobu SAITOH
1) ifconfig -m
bnx0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
capabilities=3f00<IP4CSUM_Rx,IP4CSUM_Tx,TCP4CSUM_Rx,TCP4CSUM_Tx>
capabilities=3f00<UDP4CSUM_Rx,UDP4CSUM_Tx>
enabled=0
ec_capabilities=7<VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU>
ec_enabled=0
address: 5c:f3:fc:e4:e6:78
media: Ethernet autoselect (1000baseT full-duplex)
status: active
supported Ethernet media:
media none
media 10baseT
media 10baseT mediaopt full-duplex
media 100baseTX
media 100baseTX mediaopt full-duplex
media 1000baseT
media 1000baseT mediaopt full-duplex
media autoselect
inet 10.4.0.11 netmask 0xffff0000 broadcast 10.4.255.255
inet6 fe80::5ef3:fcff:fee4:e678%bnx0 prefixlen 64 scopeid 0x1
Post by Masanobu SAITOH
2) pcictl pci0 dump -b [12] -d 0 -f [01]
I only get output for -b 1 -d 0 -f 0:

PCI configuration registers:
Common header:
0x00: 0x00791000 0x00100047 0x01040005 0x00000010

Vendor Name: Symbios Logic (0x1000)
Device Name: MegaRAID SAS2108 GEN2 (0x0079)
Command register: 0x0047
I/O space accesses: on
Memory space accesses: on
Bus mastering: on
Special cycles: off
MWI transactions: off
Palette snooping: off
Parity error checking: on
Address/data stepping: off
System error (SERR): off
Fast back-to-back transactions: off
Interrupt disable: off
Status register: 0x0010
Interrupt status: inactive
Capability List support: on
66 MHz capable: off
User Definable Features (UDF) support: off
Fast back-to-back capable: off
Data parity error detected: off
DEVSEL timing: fast (0x0)
Slave signaled Target Abort: off
Master received Target Abort: off
Master received Master Abort: off
Asserted System Error (SERR): off
Parity error detected: off
Class Name: mass storage (0x01)
Subclass Name: RAID (0x04)
Interface: 0x00
Revision ID: 0x05
BIST: 0x00
Header Type: 0x00 (0x00)
Latency Timer: 0x00
Cache Line Size: 64bytes (0x10)

Type 0 ("normal" device) header:
0x10: 0x00001001 0x9b940004 0x00000000 0x9b900004
0x20: 0x00000000 0x00000000 0x00000000 0x03b21014
0x30: 0xfffe0000 0x00000050 0x00000000 0x0000010b

Base address register at 0x10
type: i/o
base: 0x00001000, not sized
Base address register at 0x14
type: 64-bit nonprefetchable memory
base: 0x000000009b940000, not sized
Base address register at 0x1c
type: 64-bit nonprefetchable memory
base: 0x000000009b900000, not sized
Base address register at 0x24
not implemented(?)
Cardbus CIS Pointer: 0x00000000
Subsystem vendor ID: 0x1014
Subsystem ID: 0x03b2
Expansion ROM Base Address: 0xfffe0000
Capability list pointer: 0x50
Reserved @ 0x38: 0x00000000
Maximum Latency: 0x00
Minimum Grant: 0x00
Interrupt pin: 0x01 (pin A)
Interrupt line: 0x0b

Capability register at 0x50
type: 0x01 (Power Management)
Capability register at 0x68
type: 0x10 (PCI Express)
Capability register at 0xd0
type: 0x03 (VPD)
Capability register at 0xa8
type: 0x05 (MSI)
Capability register at 0xc0
type: 0x11 (MSI-X)

PCI Power Management Capabilities Register
Capabilities register: 0x0603
Version: 1.2
PME# clock: off
Device specific initialization: off
3.3V auxiliary current: self-powered
D1 power management state support: on
D2 power management state support: on
PME# support D0: off
PME# support D1: off
PME# support D2: off
PME# support D3 hot: off
PME# support D3 cold: off
Control/status register: 0x0008
Power state: D0
PCI Express reserved: off
No soft reset: on
PME# assertion: disabled
PME# status: off
Bridge Support Extensions register: 0x00
B2/B3 support: off
Bus Power/Clock Control Enable: off
Data register: 0x00

PCI Message Signaled Interrupt
Message Control register: 0x0080
MSI Enabled: off
Multiple Message Capable: no (1 vector)
Multiple Message Enabled: off (1 vector)
64 Bit Address Capable: on
Per-Vector Masking Capable: off
Message Address (lower) register: 0x00000000
Message Address (upper) register: 0x00000000
Message Data register: 0x00000000

PCI Express Capabilities Register
Capability register: 0002
Capability version: 2
Device type: PCI Express Endpoint device
Slot implemented: off
Interrupt Message Number: 0
Device Capabilities Register: 0x10008025
Max Payload Size Supported: 4096 bytes max
Phantom Functions Supported: not available
Extended Tag Field Supported: 8bit
Endpoint L0 Acceptable Latency: Less than 64ns
Endpoint L1 Acceptable Latency: Less than 1us
Attention Button Present: off
Attention Indicator Present: off
Power Indicator Present: off
Role-Based Error Report: on
Captured Slot Power Limit Value: 0
Captured Slot Power Limit Scale: 0
Function-Level Reset Capability: on
Device Control Register: 0x5916
Correctable Error Reporting Enable: off
Non Fatal Error Reporting Enable: on
Fatal Error Reporting Enable: on
Unsupported Request Reporting Enable: off
Enable Relaxed Ordering: on
Max Payload Size: 128 byte
Extended Tag Field Enable: on
Phantom Functions Enable: off
Aux Power PM Enable: off
Enable No Snoop: on
Max Read Request Size: 4096 byte
Device Status Register: 0x0009
Correctable Error Detected: on
Non Fatal Error Detected: off
Fatal Error Detected: off
Unsupported Request Detected: on
Aux Power Detected: off
Transaction Pending: off
Link Capabilities Register: 0x00000482
Maximum Link Speed: 5.0GT/s
Maximum Link Width: x8 lanes
Active State PM Support: L0s Entry supported
L0 Exit Latency: Less than 64ns
L1 Exit Latency: Less than 1us
Port Number: 0
Clock Power Management: off
Surprise Down Error Report: off
Data Link Layer Link Active: off
Link BW Notification Capable: off
ASPM Optionally Compliance: off
Link Control Register: 0x0040
Active State PM Control: disabled
Read Completion Boundary Control: 64bytes
Link Disable: off
Retrain Link: off
Common Clock Configuration: on
Extended Synch: off
Enable Clock Power Management: off
Hardware Autonomous Width Disable: off
Link Bandwidth Management Interrupt Enable: off
Link Autonomous Bandwidth Interrupt Enable: off
Link Status Register: 0x1041
Negotiated Link Speed: 2.5GT/s
Negotiated Link Width: x4 lanes
Training Error: off
Link Training: off
Slot Clock Configuration: on
Data Link Layer Link Active: off
Link Bandwidth Management Status: off
Link Autonomous Bandwidth Status: off
Device Capabilities 2: 0x00000016
Completion Timeout Ranges Supported: 6
Completion Timeout Disable Supported: on
ARI Forwarding Supported: off
AtomicOp Routing Supported: off
32bit AtomicOp Completer Supported: off
64bit AtomicOp Completer Supported: off
128-bit CAS Completer Supported: off
No RO-enabled PR-PR passing: off
LTR Mechanism Supported: off
TPH Completer Supported: 0
OBFF Supported: Not supported
Extended Fmt Field Supported: off
End-End TLP Prefix Supported: off
Max End-End TLP Prefixes: 0
Device Control 2: 0x0009
Completion Timeout Value: 260ms to 900ms
Completion Timeout Disabled: off
ARI Forwarding Enabled: off
AtomicOp Rquester Enabled: off
AtomicOp Egress Blocking: off
IDO Request Enabled: off
IDO Completion Enabled: off
LTR Mechanism Enabled: off
OBFF: Disabled
End-End TLP Prefix Blocking on: off
Link Capabilities 2: 0x00000000
Supported Link Speed Vector:
Crosslink Supported: off
Link Control 2: 0x0002
Target Link Speed: 5.0GT/s
Enter Compliance Enabled: off
HW Autonomous Speed Disabled: off
Selectable De-emphasis: off
Transmit Margin: 0
Enter Modified Compliance: off
Compliance SOS: off
Compliance Present/De-emphasis: 0
Link Status 2: 0x0000
Current De-emphasis Level: off
Equalization Complete: off
Equalization Phase 1 Successful: off
Equalization Phase 2 Successful: off
Equalization Phase 3 Successful: off
Link Equalization Request: off

MSI-X Capability Register
Message Control register: 0x000e
Table Size: 15
Function Mask: off
MSI-X Enable: off
Table offset register: 0x00002001
Table offset: 00002000
BIR: 0x1
Pending bit array register: 0x00003801
Pending bit array offset: 00003800
BIR: 0x1

Device-dependent header:
0x40: 0x00000000 0x00000000 0x00000000 0x00000000
0x50: 0x06036801 0x00000008 0x00000000 0x00000000
0x60: 0x00000000 0x00000100 0x0002d010 0x10008025
0x70: 0x00095916 0x00000482 0x10410040 0x00000000
0x80: 0x00000000 0x00000000 0x00000000 0x00000016
0x90: 0x00000009 0x00000000 0x00000002 0x00000000
0xa0: 0x00000000 0x00000000 0x0080c005 0x00000000
0xb0: 0x00000000 0x00000000 0x00000000 0x00000000
0xc0: 0x000e0011 0x00002001 0x00003801 0x00000000
0xd0: 0x0000a803 0x00000000 0x00000000 0x00000000
0xe0: 0x00000000 0x00000000 0x00000000 0x00000000
0xf0: 0x00000000 0x00000000 0x00000000 0x00000000
--
Stephen
Masanobu SAITOH
2019-03-26 06:34:30 UTC
Post by Stephen Borrill
Post by Masanobu SAITOH
No MII PHYs?
Sorry, missed them out.
Post by Masanobu SAITOH
0) the dmesg output of the PHYs if available.
[snip]
brgphy0 at bnx0 phy 1: BCM5709 10/100/1000baseT PHY, rev. 8
Post by Masanobu SAITOH
1) ifconfig -m
bnx0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
[snip]
Thanks. There is no problem in the output.
Post by Stephen Borrill
Post by Masanobu SAITOH
2) pcictl pci0 dump -b [12] -d 0 -f [01]
0x00: 0x00791000 0x00100047 0x01040005 0x00000010
Vendor Name: Symbios Logic (0x1000)
Device Name: MegaRAID SAS2108 GEN2 (0x0079)
That's not bnx(4) :)
Anyway, the dump is not required after all, because the device is a BCM5709
C0 stepping (P21; the chip marking's formula is different between bge and
bnx). I have a card which has the same chip.
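
Why the quoted dump cannot be a bnx(4) device follows from its first configuration dword alone. The sketch below splits that dword into the two ID fields (vendor ID in the low 16 bits, device ID in the high 16 bits, as laid out in the PCI configuration header); the function name pci_id_decode is made up for this illustration:

```sh
# Hypothetical helper: split the configuration dword at offset 0x00
# into vendor ID (bits 15:0) and device ID (bits 31:16).
pci_id_decode() {
    reg=$(( $1 ))
    printf 'vendor=0x%04x device=0x%04x\n' \
        "$(( reg & 0xffff ))" "$(( (reg >> 16) & 0xffff ))"
}

pci_id_decode 0x00791000
```

For the dword quoted above this prints vendor=0x1000 device=0x0079, matching the Symbios Logic and MegaRAID lines; a Broadcom NIC would show vendor ID 0x14e4 instead.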
Post by Stephen Borrill
[snip]
Post by Stephen Borrill
If connected to the Internet and traffic is flowing, it will lock solid after a while
Does the machine recover from the hard hang after the traffic stops,
e.g. after removing the cable?
--
-----------------------------------------------
SAITOH Masanobu (***@execsw.org
***@netbsd.org)

Masanobu SAITOH
2019-03-26 09:37:12 UTC
New patch:

http://www.netbsd.org/~msaitoh/bnx-n7-20190326-0.dif
http://www.netbsd.org/~msaitoh/bnx-n8-20190326-0.dif
http://www.netbsd.org/~msaitoh/bnx-cur-20190326-0.dif

This diff might improve stability under heavy interrupt load.
It also seems that bnx(4) doesn't support flow control.
I'll add it in a few days.
--
-----------------------------------------------
SAITOH Masanobu (***@execsw.org
***@netbsd.org)

Masanobu SAITOH
2019-03-28 03:56:07 UTC
Post by Masanobu SAITOH
http://www.netbsd.org/~msaitoh/bnx-n7-20190326-0.dif
http://www.netbsd.org/~msaitoh/bnx-n8-20190326-0.dif
http://www.netbsd.org/~msaitoh/bnx-cur-20190326-0.dif
This diff might improve stability on heavy interrupt.
It seems that bnx(4) also doesn't support the flow control.
I'll add it in a few days.
New patches:

http://www.netbsd.org/~msaitoh/bnx-n7-20190328-0.dif
http://www.netbsd.org/~msaitoh/bnx-n8-20190328-0.dif

And copy the latest bnxfw.h (rev. 1.5)
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/dev/microcode/bnx/bnxfw.h
--
-----------------------------------------------
SAITOH Masanobu (***@execsw.org
***@netbsd.org)

Masanobu SAITOH
2019-03-29 11:28:28 UTC
Updated patches:

http://www.netbsd.org/~msaitoh/bnx-n7-20190329-0.dif
http://www.netbsd.org/~msaitoh/bnx-n8-20190329-0.dif
And don't forget to copy the latest bnxfw.h (rev. 1.5)

I can't reproduce the hang...
Post by Stephen Borrill
If connected to the Internet and traffic is flowing, it will lock solid after a while
Post by Masanobu SAITOH
Does the machine recover from the hard hang after stopping the traffic?
e.g. removing cable.
If the machine doesn't recover from the hang even when there is no traffic,
I think the above diff won't fix the problem (flow control might reduce the
likelihood of the hang).

The descriptor ring and/or DMA map code might have bugs. I'm not
familiar with the code in NetBSD's if_bnx.c, so I can't modify it.
Sorry. (If I could reproduce the problem, I could ...)
Could anyone take a look? NetBSD's if_bnx.c has NetBSD-specific modifications
in the buffer management. It would be worth trying to replace them with
OpenBSD's or FreeBSD's code.

Regards.
--
-----------------------------------------------
SAITOH Masanobu (***@execsw.org
***@netbsd.org)

Stephen Borrill
2019-03-29 11:39:15 UTC
Post by Masanobu SAITOH
Post by Masanobu SAITOH
If connected to the Internet and traffic is flowing, it will lock solid after a while
Does the machine recover from the hard hang after stopping the traffic?
e.g. removing cable.
Removing the cable makes no difference.
Post by Masanobu SAITOH
http://www.netbsd.org/~msaitoh/bnx-n7-20190326-0.dif
http://www.netbsd.org/~msaitoh/bnx-n8-20190326-0.dif
http://www.netbsd.org/~msaitoh/bnx-cur-20190326-0.dif
This diff might improve stability on heavy interrupt.
It seems that bnx(4) also doesn't support the flow control.
I'll add it in a few days.
http://www.netbsd.org/~msaitoh/bnx-n7-20190328-0.dif
http://www.netbsd.org/~msaitoh/bnx-n8-20190328-0.dif
And copy the latest bnxfw.h (rev. 1.5)
http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/dev/microcode/bnx/bnxfw.h
Further testing shows that this is not bnx(4)-specific. I can get the hang
with a wm(4) I350 (with all the on-board bnx(4) devices disabled). I'm
certain it's network-related though. I've brought the box up in
single-user mode with filesystems mounted read-only. If I bring up bnx0
and run netio to it constantly, it will run for an hour without problems.
If I bring up bnx1 too and run the same netio test (to bnx0), it hangs
after a couple of minutes. This rules out storage drivers, firewalling,
etc.

In some trials I ran "ifconfig bnx1 up" after the machine had been running
OK for an hour, and it locked up immediately (even without an address
assigned to bnx1).

Note this is an identical kernel to that running on a different model of
machine adjacent to it faultlessly under continual heavy load. The machine
itself is OK as it was running XenServer (Linux) without problems
immediately prior to the NetBSD install.
--
Stephen