Discussion:
CVS commit: [yamt-lazymbuf] src/sys/arch/amd64/include
(too old to reply)
YAMAMOTO Takashi
2008-01-23 14:09:23 UTC
Permalink
[ moved from source-changes ]

hi,
remove __HAVE_LAZY_MBUF for now as it's incompatible with in_cksum.S.
Having read the necessary changes, I am not sure I like the current
approach. Can we use two flags please, one to annotate that the current
mbuf is external, but not mapped and one to annotate that this mbuf
chain might have such a mapping? That way we can keep the complexity out
of a few routines and have a simple
if (mbuf_chain_is_lazy(m)) map_mbuf_chain(m)
at the beginning of the glue code e.g. for in_cksum. I would expect this
to apply to a few other cases where we just have to deal with the whole
mbuf anyway.
Joerg
in_cksum is often used for the header portion of a chain.
ie. not the whole chain.

otoh in{4,6}_cksum can benefit from mbuf_chain_is_lazy.
however in_cksum and in{4,6}_cksum shares the body of the code on i386.

i think there are pros and cons.
assuming "lazy" mbufs are rather rare, it probably saves
some code and cycles.

btw, i'm not quite happy with the current approach (the check in mtod) either
because, while it makes changes in the network stack minimum, it increases
code size much. any suggestions are welcome.

YAMAMOTO Takashi

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Joerg Sonnenberger
2008-01-23 16:55:16 UTC
Permalink
Post by YAMAMOTO Takashi
in_cksum is often used for the header portion of a chain.
ie. not the whole chain.
Current usage:
bridge(4): IP header checksum
ip_fastforward: IP header checksum
igmp_input: full packet, should be switched to the generalised version later.
igmp_sendpkt: partial packet, should be switched as well
tcp4_segment: header only
carp: full packet (one exception, but that should be fixed)
icmp_input: header only
icmp_send: full
ip_input: IP header checksum
ip_mroute: IP header checksum (can be optimised, I think), full in some
other cases
ip_output: IP header sum
tcp_respond: full packet
udp6_output: full packet
ipsec4_encapsulate: header

in4_cksum and in6_cksum seem to be used almost exclusively with the full
header. I haven't checked which of the above are in the input path and
would benefit from lazy mbuf chains. I don't think those in the output
path normally do. So what again was the justification for the added
complexity?

Joerg

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
YAMAMOTO Takashi
2008-01-24 03:11:42 UTC
Permalink
Post by Joerg Sonnenberger
Post by YAMAMOTO Takashi
in_cksum is often used for the header portion of a chain.
ie. not the whole chain.
bridge(4): IP header checksum
ip_fastforward: IP header checksum
igmp_input: full packet, should be switched to the generalised version later.
igmp_sendpkt: partial packet, should be switched as well
tcp4_segment: header only
carp: full packet (one exception, but that should be fixed)
icmp_input: header only
icmp_send: full
ip_input: IP header checksum
ip_mroute: IP header checksum (can be optimised, I think), full in some
other cases
ip_output: IP header sum
tcp_respond: full packet
udp6_output: full packet
ipsec4_encapsulate: header
in4_cksum and in6_cksum seem to be used almost exclusively with the full
header. I haven't checked which of the above are in the input path and
would benefit from lazy mbuf chains. I don't think those in the output
path normally do. So what again was the justification for the added
complexity?
Joerg
consider the case where tcp checksumming is h/w offloaded but
ip header checksumming is not.
in that case, ip_output/in_cksum needs to map the ipv4 header portion of
the header (which is unlikely being lazy anyway) but it isn't desirable to
map the entire chain.

are you suggesting to sprinkle manual map calls instead of
"automatic" one in mtod (and its equivalent in asm code)?

YAMAMOTO Takashi

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Joerg Sonnenberger
2008-01-24 13:47:49 UTC
Permalink
Post by YAMAMOTO Takashi
are you suggesting to sprinkle manual map calls instead of
"automatic" one in mtod (and its equivalent in asm code)?
Yes, I would strongly prefer to do this in the high-level C code instead
of touching the assembly. If mtod does that, that's ok, but I would
prefer if the MD in_cksum backend does not have to deal with it.

Joerg

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
YAMAMOTO Takashi
2008-02-04 10:39:23 UTC
Permalink
Post by Joerg Sonnenberger
Post by YAMAMOTO Takashi
are you suggesting to sprinkle manual map calls instead of
"automatic" one in mtod (and its equivalent in asm code)?
Yes, I would strongly prefer to do this in the high-level C code instead
of touching the assembly. If mtod does that, that's ok, but I would
prefer if the MD in_cksum backend does not have to deal with it.
Joerg
if we go that route, isn't it better to remove mbuf knowledge from
the MD backend completely? ie. like linux.

YAMAMOTO Takashi

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Joerg Sonnenberger
2008-02-04 12:46:50 UTC
Permalink
Post by YAMAMOTO Takashi
Post by Joerg Sonnenberger
Post by YAMAMOTO Takashi
are you suggesting to sprinkle manual map calls instead of
"automatic" one in mtod (and its equivalent in asm code)?
Yes, I would strongly prefer to do this in the high-level C code instead
of touching the assembly. If mtod does that, that's ok, but I would
prefer if the MD in_cksum backend does not have to deal with it.
Joerg
if we go that route, isn't it better to remove mbuf knowledge from
the MD backend completely? ie. like linux.
The MD backend only needs a very limited understanding of mbufs:
- m_len
- m_data
- m_next

Removing that would mean one function call per mbuf, possible register
savings and complications for handling misaligned / odd length mbufs.
That in turn would likely have a measurable performance effect.

Joerg

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
YAMAMOTO Takashi
2008-02-09 12:48:56 UTC
Permalink
Post by Joerg Sonnenberger
Post by YAMAMOTO Takashi
Post by Joerg Sonnenberger
Post by YAMAMOTO Takashi
are you suggesting to sprinkle manual map calls instead of
"automatic" one in mtod (and its equivalent in asm code)?
Yes, I would strongly prefer to do this in the high-level C code instead
of touching the assembly. If mtod does that, that's ok, but I would
prefer if the MD in_cksum backend does not have to deal with it.
Joerg
if we go that route, isn't it better to remove mbuf knowledge from
the MD backend completely? ie. like linux.
- m_len
- m_data
- m_next
yes.
Post by Joerg Sonnenberger
Removing that would mean one function call per mbuf, possible register
savings and complications for handling misaligned / odd length mbufs.
That in turn would likely have a measurable performance effect.
Joerg
does it really matter?
i think the fetch-and-add part is dominant for performance.

btw, according to regress/sys/net/in_cksum, i386 asm version
(cpu_in_cksum.S 1.2) seems slower than portable version
on my cpu.

YAMAMOTO Takashi


nfskuro% dmesg|grep cpu0
cpu0 at mainbus0 apid 0: (boot processor)
cpu0: Intel (686-class), 2793.06 MHz, id 0xf41
cpu0: features bfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR>
cpu0: features bfebfbff<PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX>
cpu0: features bfebfbff<FXSR,SSE,SSE2,SS,HTT,TM,SBF>
cpu0: features2 641d<SSE3,DTES64,MONITOR,DS-CPL,CID,CX16,xTPR>
cpu0: features3 20100000<XD,EM64T>
cpu0: "Intel(R) Xeon(TM) CPU 2.80GHz"
cpu0: I-cache 12K uOp cache 8-way
cpu0: L2 cache 1 MB 64B/line 8-way
cpu0: ITLB 4K/4M: 64 entries
cpu0: DTLB 4K/4M: 64 entries
cpu0: using thermal monitor 1
cpu0: Initial APIC ID 0
cpu0: Cluster/Package ID 0
cpu0: SMT ID 0
cpu0: calibrating local timer
cpu0: apic clock running at 199 MHz
cpu0: 32 page colors
nfskuro% ./in_cksum 1 1 10000 1 1 1
portable version: 0.000484
test version: 0.000326
relative time: 67%
nfskuro% ./in_cksum 1 1 10000 40
portable version: 0.000147
test version: 0.000270
relative time: 182%
nfskuro% ./in_cksum 1 1 10000 50
portable version: 0.000000
test version: 0.001749
relative time: 174900%
nfskuro% ./in_cksum 1 1 10000 60
portable version: 0.000162
test version: 0.000360
relative time: 221%
nfskuro% ./in_cksum 1 1 10000 1600
portable version: 0.004768
test version: 0.014617
relative time: 307%
nfskuro% ./in_cksum 1 1 10000 9000
portable version: 0.020984
test version: 0.082724
relative time: 394%
nfskuro% ./in_cksum 1 1 10000 1500 1500 40
portable version: 0.003678
test version: 0.032719
relative time: 889%
nfskuro%

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Joerg Sonnenberger
2008-02-09 13:10:20 UTC
Permalink
Post by YAMAMOTO Takashi
btw, according to regress/sys/net/in_cksum, i386 asm version
(cpu_in_cksum.S 1.2) seems slower than portable version
on my cpu.
Actually, I'm not surprised by that. Given that the inner loop is using
adc all the time, the adds themselve can't run in parallel. I haven't
had time to carefully analyse the code if the amd64 could be ported.

Joerg

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Laight
2008-02-09 13:54:23 UTC
Permalink
Post by Joerg Sonnenberger
Post by YAMAMOTO Takashi
btw, according to regress/sys/net/in_cksum, i386 asm version
(cpu_in_cksum.S 1.2) seems slower than portable version
on my cpu.
Actually, I'm not surprised by that. Given that the inner loop is using
adc all the time, the adds themselve can't run in parallel. I haven't
had time to carefully analyse the code if the amd64 could be ported.
I also suspect it is excessively unrolled - but that only affects
the cold cache time.

Possibly you could run 2 (maybe 3) add sequences in parallel, using 'adc $0'
to break the carry chain - but I don't know if that actually works!
ISTR than an instruction that writes to ALL the condition flags doesn't
have a dependency against the previous copy of the flags.

Note that an 'adc' chain can't end up with a value of ~0 and carry set
(unless that is the initial condition and you've added ~0).

David
--
David Laight: ***@l8s.co.uk

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Loading...