CRC errors with gem(4)

Discussion:

Julian Coleman

2008-01-01 16:56:07 UTC

Hi,

I'm trying to track down a bug with copper gem cards: they generate
invalid frames when sending lots of back-to-back UDP frames.
A simple way to reproduce this is to run:

/tmp/ttcp -u -s -t -b 32768 -n 10 -l 16384 <somehost>

using a gem card. It consistently generates invalid frames; e.g. at
100Mb/s, my Cisco switch always sees 35 CRC errors for this command.

I noticed that it's possible to program the gem chip to pass up packets
with invalid CRC, so I added this to the driver and looped back gem1 to
gem0 with a cross-over cable. Now, when I run the command from gem1, and
capture with:

tcpdump -e -x -vv -i gem0 > /tmp/tcpdump.out 2>&1 &

I see lots of good packets:

16:03:21.173534 00:03:ba:68:35:4a > 08:00:20:f7:8e:80, ethertype IPv4 (0x0800), length 1514: IP (tos 0x0, ttl 64, id 34, offset 13320, flags [+], length: 1500) anor > sirion: udp
0x0000: 4500 05dc 0022 2681 4011 d010 5102 6e2a E...."&.@...Q.n*
0x0010: 5102 6e2f 2c2d 2e2f 3031 3233 3435 3637 Q.n/,-./01234567
0x0020: 3839 3a3b 3c3d 3e3f 4041 4243 4445 4647 89:;<=>?@ABCDEFG
0x0030: 4849 4a4b 4c4d 4e4f 5051 5253 5455 5657 HIJKLMNOPQRSTUVW
0x0040: 5859 5a5b 5c5d 5e5f 6061 6263 6465 6667 XYZ[\]^_`abcdefg
0x0050: 6869 hi

and occasional packets like:

16:03:21.206802 20:f7:8e:80:00:03 > 37:38:39:3a:08:00, ethertype Unknown (0xba68), length 150:
0x0000: 354a 0800 4500 0084 0022 07f3 4011 f3f6 5J..E...."..@...
0x0010: 5102 6e2a 5102 6e2f 3b3c 3d3e 3f40 4142 Q.n*Q.n/;<=>?@AB
0x0020: 4344 4546 4748 494a 4b4c 4d4e 4f50 5152 CDEFGHIJKLMNOPQR
0x0030: 5354 5556 5758 595a 5b5c 5d5e 5f60 6162 STUVWXYZ[\]^_`ab
0x0040: 6364 6566 6768 696a 6b6c 6d6e 6f70 7172 cdefghijklmnopqr
0x0050: 7374 st

or:

16:03:21.472989 08:00:20:f7:8e:80 > 46:47:48:49:4a:4b, 802.3, length 66: LLC, dsap Unknown (0xba), ssap Unknown (0x68), cmd 0x35, sap 68 > sap ba rnr (r=37,C) len=48
0x0000: ba68 354a 0800 4500 0020 0000 0000 4011 .h5J..E.......@.
0x0010: fc6f 5102 6e2a 5102 6e2f fffa 1389 000c .oQ.n*Q.n/......
0x0020: 2bb0 2021 2223 0000 0000 0000 0000 0000 +..!"#..........
0x0030: 0000 0000 ....

Some expected packets don't appear in the capture (they could be dropped
by the receiving hardware though).
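A quick way to pick the damaged frames out of a capture like this is to compare the MAC addresses on each tcpdump header line with the expected pair. The following is only a sketch: the `classify` helper and its regular expression are illustrative (not part of tcpdump or the driver), and it assumes the `tcpdump -e` header format shown in the samples above.

```python
import re

# Matches a "tcpdump -e" header line: timestamp, source MAC, ">", dest MAC.
# Hex dump lines ("0x0000: ...") do not match and are skipped.
HEADER = re.compile(
    r"^\d{2}:\d{2}:\d{2}\.\d+ "
    r"(?P<src>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}) > "
    r"(?P<dst>(?:[0-9a-f]{2}:){5}[0-9a-f]{2}),"
)

def classify(lines, src="00:03:ba:68:35:4a", dst="08:00:20:f7:8e:80"):
    """Count frames with the expected addresses and collect the rest.

    The default addresses are the ones seen in the good frames above.
    """
    good = 0
    bad = []
    for line in lines:
        m = HEADER.match(line)
        if m is None:
            continue  # hex dump or other non-header output
        if m["src"] == src and m["dst"] == dst:
            good += 1
        else:
            bad.append(line.rstrip())
    return good, bad

if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as f:
        good, bad = classify(f)
    print(f"{good} good frames, {len(bad)} suspect frames")
    for line in bad:
        print(line)
```

Run against the saved capture (e.g. /tmp/tcpdump.out), this lists every frame whose addresses differ from the expected pair. It cannot say anything about frames that never reached the capture.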

A hack to get round this is to add a delay(70) before transmitting each
full size UDP packet. Any smaller delay doesn't help. I've also tried
increasing the inter-packet gap (which had no effect) and making the card
generate an interrupt for each UDP packet sent (which helped a little -
CRC errors dropped to 7).

I don't see the problem with TCP. I haven't tested IPv6. Hardware
checksums are off. This happens with 4.0 and -current on both sparc64
and macppc.

It looks like the hardware generates the correct TX complete interrupts
even for the invalid and the missing packets.

If anyone has any ideas as to why this might be happening (bugs in the gem
DMA code or hardware errors), that would be great.

Thanks,

J

PS. Thanks to dyoung@ for pointers (and gem fixes) and to riz@ for testing.

The complete tcpdump is at:

http://www.coris.org.uk/misc/tcpdump-gem-broken.out

--
My other computer also runs NetBSD / Sailing at Newbiggin
http://www.netbsd.org/ / http://www.newbigginsailingclub.org/

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Julian Coleman

2008-01-01 17:31:42 UTC

Hi,

I should have pointed out that the "ethertype Unknown" packets always start:

20f7 8e80 0003 wwxx yyzz 0800 ba68 354a 0800 4500

instead of:

0800 20f7 8e80 0003 ba68 354a 0800 4500

The destination MAC address (0800 20f7 8e80) is in bytes 10, 11, 0, 1, 2, 3.

The source MAC address (0003 ba68 354a) is in bytes 4, 5, 12, 13, 14, 15.

Bytes 6-9 appear to be either parts of the data (3738 393a in this case)
or sometimes 0000 0000.

The IP and UDP parts of the mangled packets are sometimes intact, sometimes
part zeros.
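The byte layout described above can be checked mechanically. The sketch below tests whether the first 16 bytes of a captured frame carry the two MAC addresses in the rearranged positions (destination in bytes 10, 11, 0-3; source in bytes 4, 5, 12-15). The function names are illustrative, and the default addresses are simply the ones from this capture.

```python
# Addresses as they appear in the good frames in this thread.
DST = bytes.fromhex("080020f78e80")
SRC = bytes.fromhex("0003ba68354a")

def looks_mangled(frame: bytes, dst: bytes = DST, src: bytes = SRC) -> bool:
    """True if the first 16 bytes match the rearranged layout."""
    if len(frame) < 16:
        return False
    dst_seen = frame[10:12] + frame[0:4]   # bytes 10, 11, 0, 1, 2, 3
    src_seen = frame[4:6] + frame[12:16]   # bytes 4, 5, 12, 13, 14, 15
    return dst_seen == dst and src_seen == src

def filler(frame: bytes) -> bytes:
    """Bytes 6-9: payload data or zeros in the mangled frames."""
    return frame[6:10]

# Start of the "ethertype Unknown" frame from the first message,
# and the start of a good frame for comparison.
bad = bytes.fromhex("20f78e8000033738393a0800ba68354a08004500")
good = bytes.fromhex("080020f78e800003ba68354a08004500")

print(looks_mangled(bad), filler(bad).hex())   # True 3738393a
print(looks_mangled(good))                     # False
```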

Thanks,

J

Darren Reed

2008-01-02 12:40:23 UTC

Post by Julian Coleman
Hi,
20f7 8e80 0003 wwxx yyzz 0800 ba68 354a
0800 4500
0800 20f7 8e80 0003 ba68 354a 0800 4500
The destination MAC address (0800 20f7 8e80) is in bytes 10, 11, 0, 1, 2, 3.
The source MAC address is (0003 ba68 354a) in bytes 4, 5, 12, 13, 14, 15
Bytes 6-9 appear to be either parts of the data (3738 393a in this case)
or sometimes 0000 0000.
The IP and TCP parts of the mangled packets are sometimes intact, sometimes
part zeros.

If I understand you correctly, the implication here is that the
bytes are being transmitted corrupt.

At what packet rate (pps) do you start to see problems?
Is hardware checksum enabled?
If so, does disabling it improve matters?
Are there any comments/workarounds in opensolaris code
for a problem that resembles this?

Darren

Julian Coleman

2008-01-05 20:47:22 UTC

Hi,

Post by Darren Reed
If I understand you correctly, the implication here is that the
bytes are being transmitted corrupt.

Yes.

Post by Darren Reed
At what packet rate (pps) do you start to see problems?

Sending 10 16k UDP frames with ttcp shows up the problem, so this is at most
110 pps.
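For scale, each 16k datagram is split into several Ethernet frames by IP fragmentation. The arithmetic can be sketched as follows, assuming a 1500-byte MTU and a 20-byte IP header with no options (consistent with the "4500" and "length: 1500" in the capture); the `fragments` helper is illustrative only.

```python
IP_HEADER = 20   # bytes, no IP options assumed
UDP_HEADER = 8   # bytes
MTU = 1500       # assumed Ethernet MTU

def fragments(udp_payload: int, mtu: int = MTU):
    """Return (offset, ip_total_length, more_fragments) for each fragment."""
    total = udp_payload + UDP_HEADER
    # Fragment data must be a multiple of 8 bytes (except the last one).
    per_frag = (mtu - IP_HEADER) // 8 * 8
    result = []
    offset = 0
    while offset < total:
        size = min(per_frag, total - offset)
        more = offset + size < total
        result.append((offset, IP_HEADER + size, more))
        offset += size
    return result

frags = fragments(16384)
print(len(frags))    # 12
print(frags[9])      # (13320, 1500, True)
print(frags[-1])     # (16280, 132, False)
```

Under these assumptions a 16384-byte write becomes 12 frames. The tenth matches the good sample in the first message (offset 13320, length 1500, more-fragments set), and the last has an IP length of 132 (0x0084) at offset 16280 (0x07f3 in 8-byte units), which are the values seen in the "ethertype Unknown" sample. That is only an observation about that one sample, not a claim that every damaged frame is a final fragment.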

Post by Darren Reed
Is hardware checksum enabled?
If so, does disabling it improve matters?

No. No. The card doesn't really support UDP checksums, so I've disabled it.

Post by Darren Reed
Are there any comments/workarounds in opensolaris code
for a problem that resembles this?

Unfortunately not.

One thing I tried was to increase the size of the TX descriptor ring. This
made the problem disappear for 10 frames, but it's still there at 100 frames.

Thanks,

J