Discussion:
[patch] bug fix & TCP networking performance improvements
(too old to reply)
David Young
2011-04-08 17:16:20 UTC
Permalink
I am preparing to commit some improvements to TCP networking and a bug
fix both contributed by CoyotePoint Systems, Inc. The patch is large,
so I posted it at
<ftp://elmendorf.ojctech.com/users/netbsd-bdffaa79/devel-vtw-diff1>.

Please review the patches and send any questions/comments you may have.

Here's the proposed commit message:

Sometimes the reclamation of protocol buffers by tcp_drain(),
arp_drain(), frag6_drain(), or ip_drain() will sleep. When a
protocol-drain routines called in an interrupt context sleeps, the
system deadlocks. Avoid a deadlock when memory is low by deferring to a
thread context the reclamation of protocol buffers. The protocol's fast
time-out timer, protosw.pr_fasttimo, runs in thread context, so defer
reclamation to the next fast time-out.

Even after a TCP session enters the TIME_WAIT state, its corresponding
socket and protocol control blocks (PCBs) stick around until the TCP
Maximum Segment Lifetime (MSL) expires. On a host whose workload
necessarily creates and closes down many TCP sockets, the sockets & PCBs
for TCP sessions in TIME_WAIT state amount to many megabytes of dead
weight in RAM.

This patch reduces the resources demanded by TIME_WAIT-state sessions
using methods called Vestigial Time-Wait and Maximum Segment Lifetime
Truncation.

Maximum Segment Lifetimes Truncation (MSLT) assigns each TCP session to
a class based on the nearness of the peer. Corresponding to each class
is an MSL, and a session uses the MSL of its class. The classes are
loopback (local host equals remote host), local (local host and remote
host are on the same link/subnet), and remote (local host and remote
host communicate via one or more gateways). Classes corresponding to
nearer peers have lower MSLs by default: 2 seconds for loopback, 10
seconds for local, 60 seconds for remote. Loopback and local sessions
expire more quickly when MSLT is used.

Vestigial Time-Wait (VTW) replaces a TIME_WAIT session's PCB/socket
dead weight with a compact representation of the session, called a
"vestigial PCB". VTW data structures are designed to be very fast and
memory-efficient: for fast insertion and lookup of vestigial PCBs,
the PCBs are stored in a hash table that is designed to minimize the
number of cacheline visits per lookup/insertion. The memory both
for vestigial PCBs and for elements of the PCB hashtable come from
fixed-size pools, and linked data structures exploit this to conserve
memory by representing references with a narrow index/offset from the
start of a pool instead of a pointer. When space for new vestigial PCBs
runs out, VTW makes room by discarding old vestigial PCBs, oldest first.
VTW cooperates with MSLT.

It may help to think of VTW as a "FIN cache" by analogy to the SYN
cache.

A 2.8-GHz Pentium 4 running a test workload that creates TIME_WAIT
sessions as fast as it can is approximately 17% idle when VTW is active
versus 0% idle when VTW is inactive. It has 103 megabytes more free RAM
when VTW is active (approximately 64k vestigial PCBs are created) than
when it is inactive.

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 344-0444 x24

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Matthias Drochner
2011-04-14 20:44:36 UTC
Permalink
When a protocol-drain routines called in an interrupt context sleeps,
the system deadlocks.
What do you mean exactly here? Sleeping in an interrupt context is
a hard error, not just a deadlock. Or is this about a deadlock
at a more functional level, where some memory might need to be
allocated short-term to be able to free a bigger chunk?
I don't mind the move from the "drain" entry to "fasttimo"
(ahthough it could have performance implications in cases
where the 1000 lingering connections are not needed) but
want to understand the issues. I think to get to better
MP concurrency it makes sense to get the drain/fast/slowtimo
entries out of KERNEL_LOCK, protocol by protocol which would
need some flag(s) in the protosw.

I can't comment on the VTW implementation itself, just some
lower MSLs by default: 2 seconds for loopback
This looks short. In particular with filesystem pressure my boxes
are often unresponsive for significantly longer. There is no
justification for the big difference to
10 seconds for local
Some "#ifdef INET6" and so seem to be missing to reduce bloat.

There is some change about permissions for raw sockets which
is not explained.

The netstat change is not documentsd.

And tlas as "vtw" are not in common use so I'd avoid that for
visible things like filenames.

best regards
Matthias



------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

Besuchen Sie uns auf unserem neuen Webauftritt unter www.fz-juelich.de

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Thor Lancelot Simon
2011-04-14 21:43:02 UTC
Permalink
Post by Matthias Drochner
When a protocol-drain routines called in an interrupt context sleeps,
the system deadlocks.
What do you mean exactly here? Sleeping in an interrupt context is
a hard error, not just a deadlock. Or is this about a deadlock
at a more functional level, where some memory might need to be
allocated short-term to be able to free a bigger chunk?
I think David meant to say "would sleep". We independently fixed
the "pool allocator can cause map reclaim which can sleep from interrupt
context" in our tree -- I believe that does not appear in this patch
because it was almost simultaneously fixed in NetBSD-current -- but
the result was that the system could lock up, unable to get mbufs
while holding the softint lock.

Something's got to drain the protocols in low-memory conditions, and,
before, it was those drain routines called in a way that potentially
had locks held. Fixing that left them effectively never called at
all. Stealing the fasttimo was a quick and reasonably effective
way of ensuring that the drain routines were ever called at all.

I think at least one other developer has a change in his tree that
uses a separate callout or kthread to do the drains. That's better
if and when it shows up in NetBSD-current but I think this interim
fix is good enough to go in.
Post by Matthias Drochner
I can't comment on the VTW implementation itself, just some
lower MSLs by default: 2 seconds for loopback
This looks short. In particular with filesystem pressure my boxes
are often unresponsive for significantly longer. There is no
justification for the big difference to
I strongly disagree. Applications that make large numbers of local
connections benefit dramatically from this change, and if the kernel
cannot provide a fin-ack for a matter of *seconds* there is a bug
elsewhere in the kernel.

The 2MSL rule is meant to give packets time to be in-flight in the
network, not to accomodate buggy stacks -- not even local ones.

At least, that is how I see it.
Post by Matthias Drochner
Some "#ifdef INET6" and so seem to be missing to reduce bloat.
There is some change about permissions for raw sockets which
is not explained.
Eep. This is likely a part of a completely different change which
leaked in by mistake. It won't work without security model changes
we cannot currently contribute back. Thank you for pointing it out!

Thor

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Thor Lancelot Simon
2011-04-18 19:48:22 UTC
Permalink
Post by Matthias Drochner
There is some change about permissions for raw sockets which
is not explained.
So, I just looked at this harder.

I believe this change is correct, and general, and I do not believe it
requires explanation in the code. What is going on is that the INET6
stack is missing all the kauth work that was done on the inet stack. This
change adds the appropriate and analogous kauth check on raw socket open.

It does so because the tree we did this work in happens to have a security
model that allows ping, etc. to not be setuid root. But it will function
correctly in the standard tree.

I think the other large kauth related hunk of the patch, the one relating
to setting IPv6 interface addresses, is also correct -- or, at least,
better than the situation is right now (again, it is because in our tree
we can run ifconfig not-as-root).

I am not sure what the engineer who added this hunk meant when he
added the comment "I am not going to try and really fix this" above
that hunk of code but I can try to find out.

Obviously a lot of work needs to be done to kauth-ify the v6 stack as the
v4 stack has been but I think these are (very small) steps in the right
direction.


--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Matthias Drochner
2011-04-19 16:48:06 UTC
Permalink
[kauth in inet6]
Post by Thor Lancelot Simon
I am not sure what the engineer who added this hunk meant when he
added the comment "I am not going to try and really fix this" above
that hunk of code
This code is clearly - hmm - unfinished. The new check
is just inserted after the old suser check, so it can't do
anything for you if the suser check already failed. Also,
half of the commands for which the check is done are no-ops
(or EINVALs) anyway.
I'd suggest to keep the kauth stuff seperate for now; it
is unrelated anyway and it needs some care.

So I've tried the patch. When I enabled vtw on the running
system, it locked up hard immediately. Enabled it early
before going multiuser and the boot succeeded. But when
I closed an outgoing TCP connection I got a diagnostic
panic because kmem_alloc() was called in softint context.
(from tcp_input() iirc -- had to find that savecore(8)
doesn't work anymore if the running kernel is not that
which produced the crashdump)
The (e)gid changes in the netstat(1) patch don't work.
The kmem egid is given up too early and an open
of /dev/mem fails later.

best regards
Matthias



------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

Besuchen Sie uns auf unserem neuen Webauftritt unter www.fz-juelich.de

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Young
2011-04-19 17:31:09 UTC
Permalink
Post by Matthias Drochner
So I've tried the patch. When I enabled vtw on the running
system, it locked up hard immediately. Enabled it early
before going multiuser and the boot succeeded. But when
I closed an outgoing TCP connection I got a diagnostic
panic because kmem_alloc() was called in softint context.
(from tcp_input() iirc -- had to find that savecore(8)
doesn't work anymore if the running kernel is not that
which produced the crashdump)
Will you send your kernel configuration and the precise steps that you
took to cause this panic?
Post by Matthias Drochner
The (e)gid changes in the netstat(1) patch don't work.
The kmem egid is given up too early and an open
of /dev/mem fails later.
Hmm, I have probably only run netstat as root.

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 344-0444 x24

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Matthias Drochner
2011-04-19 18:12:12 UTC
Permalink
Post by David Young
Will you send your kernel configuration and the precise steps that you
took to cause this panic?
I've just managed to get a stacktrace. The problem is in
fatp_init() (and looks easy to solve by doing these
things earlier).
If you still need the kernel config, let me know.

best regards
Matthias

#10 0xc05c28f9 in kern_assert (t=0xc063264a "diagnostic ",
f=0xc067b6e9 "../../../../kern/subr_kmem.c", l=196, e=0xc067a181
"!cpu_softintr_p()")
at ../../../../../../lib/libkern/kern_assert.c:50
#11 0xc0459728 in kmem_alloc (size=4096, kmflags=2) at
../../../../kern/subr_kmem.c:196
#12 0xc049b801 in vtw_control (af=<value optimized out>, msl=<value optimized
out>)
at ../../../../netinet/tcp_vtw.c:178
#13 0xc049ddd5 in vtw_add (af=2, tp=0xc1c767e4) at
../../../../netinet/tcp_vtw.c:1832
#14 0xc048dd97 in tcp_input (m=0xc1d18d00) at ../../../../netinet/tcp_input.c:2
998
#15 0xc02c363b in ip_input (m=0xc1d18d00) at ../../../../netinet/ip_input.c:891
#16 0xc02c3b6d in ipintr () at ../../../../netinet/ip_input.c:393
#17 0xc02ffddf in softint_dispatch (pinned=0xcca85aa0, s=4) at
../../../../kern/kern_softint.c:541
#18 0xc010110b in Xsoftintr ()



------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

Besuchen Sie uns auf unserem neuen Webauftritt unter www.fz-juelich.de

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Young
2011-04-19 18:20:56 UTC
Permalink
Post by Matthias Drochner
Post by David Young
Will you send your kernel configuration and the precise steps that you
took to cause this panic?
I've just managed to get a stacktrace. The problem is in
fatp_init() (and looks easy to solve by doing these
things earlier).
If you still need the kernel config, let me know.
Thanks. Looks easy to fix.

Please send your kernel config.

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 344-0444 x24

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Matthias Drochner
2011-04-20 18:39:08 UTC
Permalink
[initialization crash]
Post by David Young
Looks easy to fix.
With the appended hack the box seems to survive.
Didn't do anything benchmark-like yet.
Only netstat(1) isn't helpful -- started as root and it
got into an endless loop printing
tcp 0 0 *.* *.* TIME_WAIT -1303323789977.711ms
until I killed it.

best regards
Matthias



------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

Besuchen Sie uns auf unserem neuen Webauftritt unter www.fz-juelich.de
David Young
2011-04-21 17:05:22 UTC
Permalink
Post by Matthias Drochner
[initialization crash]
Post by David Young
Looks easy to fix.
With the appended hack the box seems to survive.
Didn't do anything benchmark-like yet.
Only netstat(1) isn't helpful -- started as root and it
got into an endless loop printing
tcp 0 0 *.* *.* TIME_WAIT -1303323789977.711ms
until I killed it.
I don't understand why I did not see that myself until today, but I
fixed the obvious netstat bug. I have attached a patch that should fix
all of the issues that you've mentioned, too. It needs to be applied
after my previous patch.

Thanks for the feedback.

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 344-0444 x24
Matthias Drochner
2011-04-21 18:58:45 UTC
Permalink
I fixed the obvious netstat bug
Thanks -- netstat works now, but this reveals another bug:
The expiration check seems to be somewhat fragile. I ended up with:

$ netstat -finet -V
Active Internet connections
Proto Recv-Q Send-Q Local Address Foreign Address State
Expires
tcp 0 48 zelz27.ssh zel459.epmd ESTABLISHED
tcp 0 0 zelz27.ssh zel459.wxbrief ESTABLISHED
tcp 0 0 zelz27.65534 zel459.ssh TIME_WAIT
-122504.647ms
tcp 0 0 localhost.65532 localhost.exp2 TIME_WAIT
-112736.010ms
tcp 0 0 localhost.65533 localhost.sunrpc TIME_WAIT
-112733.070ms
tcp 0 0 localhost.65531 localhost.sunrpc TIME_WAIT
-109477.407ms
tcp 0 0 localhost.65530 localhost.exp2 TIME_WAIT
-109477.268ms
tcp 0 0 localhost.65525 localhost.sunrpc TIME_WAIT
-105651.943ms
tcp 0 0 localhost.65524 localhost.exp2 TIME_WAIT
-105650.831ms
tcp 0 0 localhost.65522 localhost.exp2 TIME_WAIT
-102166.960ms
tcp 0 0 localhost.65523 localhost.sunrpc TIME_WAIT
-102164.806ms

best regards
Matthias



------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

Besuchen Sie uns auf unserem neuen Webauftritt unter www.fz-juelich.de

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Young
2011-04-21 19:10:32 UTC
Permalink
Post by Matthias Drochner
I fixed the obvious netstat bug
$ netstat -finet -V
Active Internet connections
Proto Recv-Q Send-Q Local Address Foreign Address State
Expires
tcp 0 48 zelz27.ssh zel459.epmd ESTABLISHED
tcp 0 0 zelz27.ssh zel459.wxbrief ESTABLISHED
tcp 0 0 zelz27.65534 zel459.ssh TIME_WAIT
-122504.647ms
Doh. The test that's in there is for "reclaimed" not for "expired",
and reclamation can lag expiration if there is not pressure on the
VTW table. I will fix that so that it filters out both reclaimed and
expired entries.

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 344-0444 x24

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Matthias Drochner
2011-04-22 11:40:35 UTC
Permalink
The test that's in there is for "reclaimed" not for "expired", and
reclamation can lag expiration if there is not pressure on the VTW
table
Perhaps I'm not understanding it correctly, but as I see the
entries are marked as expired/reclaimed by special values (0/-1)
in tv_sec/tv_usec. Then, the netstat display should show either
the same integral or the same fractional part for all affected
time differences since the base time is the same for all.
It doesn't...

Btw, I think that it would be better to use microuptime(9)
rather than microtime(9) for these interval things, to make
it robust against time changes. Or getmicrouptime(9) to save
some cycles - accuracy shouldn't be an issue.

best regards
Matthias



------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

Besuchen Sie uns auf unserem neuen Webauftritt unter www.fz-juelich.de

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Thor Lancelot Simon
2011-04-22 15:55:05 UTC
Permalink
Post by Matthias Drochner
Btw, I think that it would be better to use microuptime(9)
rather than microtime(9) for these interval things, to make
it robust against time changes. Or getmicrouptime(9) to save
some cycles - accuracy shouldn't be an issue.
This makes sense to me too.

Thor

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
David Young
2011-04-22 18:29:42 UTC
Permalink
Post by Matthias Drochner
The test that's in there is for "reclaimed" not for "expired", and
reclamation can lag expiration if there is not pressure on the VTW
table
Perhaps I'm not understanding it correctly, but as I see the
entries are marked as expired/reclaimed by special values (0/-1)
in tv_sec/tv_usec. Then, the netstat display should show either
the same integral or the same fractional part for all affected
time differences since the base time is the same for all.
It doesn't...
Ok, I see what you're saying.

Expired and reclaimed are different states. The entries that you showed
to me had expired (hence the negative remaining lifetime in netstat's
output), but they had not been reclaimed. I don't think the operator
will usually care to see either expired or reclaimed entries, so I plan
to hide them all from view.
Post by Matthias Drochner
Btw, I think that it would be better to use microuptime(9)
rather than microtime(9) for these interval things, to make
it robust against time changes. Or getmicrouptime(9) to save
some cycles - accuracy shouldn't be an issue.
Definitely.

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 344-0444 x24

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Matthias Drochner
2011-05-10 16:39:15 UTC
Permalink
Post by David Young
Post by Matthias Drochner
Btw, I think that it would be better to use microuptime(9)
rather than microtime(9) for these interval things, to make
it robust against time changes. Or getmicrouptime(9) to save
some cycles - accuracy shouldn't be an issue.
Definitely.
Here is something which works.
OK, the use of kvm in netstat sucks. But it sucks already that
kvm is needed just to see whether a socket is still in TIME_WAIT,
so it doesn't make it much worse.

best regards
Matthias


------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

Besuchen Sie uns auf unserem neuen Webauftritt unter www.fz-juelich.de
David Young
2011-05-10 18:17:17 UTC
Permalink
Post by Matthias Drochner
Post by David Young
Post by Matthias Drochner
Btw, I think that it would be better to use microuptime(9)
rather than microtime(9) for these interval things, to make
it robust against time changes. Or getmicrouptime(9) to save
some cycles - accuracy shouldn't be an issue.
Definitely.
Here is something which works.
OK, the use of kvm in netstat sucks. But it sucks already that
kvm is needed just to see whether a socket is still in TIME_WAIT,
so it doesn't make it much worse.
Looks fine to me.

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 344-0444 x24

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Thor Lancelot Simon
2011-04-19 20:58:18 UTC
Permalink
Post by Matthias Drochner
[kauth in inet6]
Post by Thor Lancelot Simon
I am not sure what the engineer who added this hunk meant when he
added the comment "I am not going to try and really fix this" above
that hunk of code
This code is clearly - hmm - unfinished. The new check
is just inserted after the old suser check, so it can't do
anything for you if the suser check already failed.
Yes. It turns out the "I am not going to try and really fix this"
comment probably refers to the suser check having been put back
into effect. I'm still digging into it a bit.
Post by Matthias Drochner
Also, half of the commands for which the check is done are no-ops
(or EINVALs) anyway.
Yes, but as interface ioctls they should still authorize via the
appropriate kauth operation as the analogous v4 ioctls do. It is
most consistent and safe that way.
Post by Matthias Drochner
I'd suggest to keep the kauth stuff seperate for now; it
is unrelated anyway and it needs some care.
You are probably correct about this.

I believe David's going to fix the initialization issue you also
noted, which is much more serious. It's actually my opinion that
VTW should be on by default, and that we should plan to remove the
non-VTW mode of operation once users report VTW is stable for them.
After all, we do not have operation both with and without the SYN
cache for TCP, and it seems likely to me the code will end up less
reliable and less well maintained if we leave two modes for FIN
handling.

Thor

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Matthias Drochner
2011-04-20 11:01:26 UTC
Permalink
Post by Thor Lancelot Simon
Post by Matthias Drochner
keep the kauth stuff seperate for now
You are probably correct about this.
While we are here - there is some TCP RTT calculation change
in tcp_input.c which looks related to the recent thread
on tech-net -- was there some conclusion finally?

And there is the addition of "kinfo_pcb2" to socket.h
which I don't see a use for.

Why can't arp_drain() not be called in interrupt context?
It just does m_freem() which should be safe.
Post by Thor Lancelot Simon
VTW should be on by default
I agree that maintaining 2 code paths for the same thing
is not desirable. VTW adds ~22kB kernel code on i386
which is not _that_ much.

best regards
Matthias



------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

Besuchen Sie uns auf unserem neuen Webauftritt unter www.fz-juelich.de

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Thor Lancelot Simon
2011-04-20 11:52:38 UTC
Permalink
Post by Matthias Drochner
Post by Thor Lancelot Simon
Post by Matthias Drochner
keep the kauth stuff seperate for now
You are probably correct about this.
While we are here - there is some TCP RTT calculation change
in tcp_input.c which looks related to the recent thread
on tech-net -- was there some conclusion finally?
I don't see any such change in our patch. Could you point it out?

I do hope Greg will get BBN's fixes into the NetBSD tree soon, however!

Thanks,

Thor

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Thor Lancelot Simon
2011-04-20 11:57:32 UTC
Permalink
Post by Matthias Drochner
And there is the addition of "kinfo_pcb2" to socket.h
which I don't see a use for.
This should go. It's an unrelated local change.
Post by Matthias Drochner
Why can't arp_drain() not be called in interrupt context?
It just does m_freem() which should be safe.
There is a long and tortured explanation of this. It will take me
a couple of days to try to distill it from our internal discussion
and summarize.

Thor

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Matthias Drochner
2011-04-20 11:57:09 UTC
Permalink
Post by Thor Lancelot Simon
I don't see any such change in our patch. Could you point it out?
Here (in your devel-vtw-diff1 file):

@@ -3231,9 +3492,9 @@ tcp_xmit_timer(struct tcpcb *tp, uint32_
* an alpha of .875 (srtt = rtt/8 + srtt*7/8 in fixed
* point). Adjust rtt to origin 0.
*/
- delta = (rtt << 2) - (tp->t_srtt >> TCP_RTT_SHIFT);
+ delta = (rtt << 0) - (tp->t_srtt >> TCP_RTT_SHIFT);
if ((tp->t_srtt += delta) <= 0)
- tp->t_srtt = 1 << 2;
+ tp->t_srtt = 1; // 1 << 2;
/*
* We accumulate a smoothed rtt variance (actually, a
* smoothed mean difference), then set the retransmit

best regards
Matthias



------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

Besuchen Sie uns auf unserem neuen Webauftritt unter www.fz-juelich.de

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Thor Lancelot Simon
2011-04-20 12:21:38 UTC
Permalink
Post by Thor Lancelot Simon
I don't see any such change in our patch. Could you point it out?
Argh! I thought we'd already contributed that back.

I believe this does in fact fix the bug we brought up in our original
discussion -- that shifting srtt left by 2 as the first step of the
computation means it is adjusted by 1/2 RTT rather than 1/8.

I do not know whether or how it interacts with Greg's later patches,
but I don't think those have been checked in yet.

Thor

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Greg Troxel
2011-04-20 12:35:05 UTC
Permalink
I do not know whether or how it interacts with Greg's later patches,
but I don't think those have been checked in yet.

I am trying to get things in today, at least comments that make this a
lot easier to understand.

Right now, I would not want to change the storage representation; things
are *almost* ok.
Matthias Drochner
2011-04-20 13:09:35 UTC
Permalink
Post by Thor Lancelot Simon
I believe this does in fact fix the bug we brought up in our original
discussion -- that shifting srtt left by 2 as the first step of the
computation means it is adjusted by 1/2 RTT rather than 1/8.
Looking at the patch and the CVS history I'm getting the impression
the the change which introduced it (tcp_input.crev.1.16) was an attempt
to change the representation of srtt so that it keeps 5 bits after
the point, thus increasing precision. It was just not communicated
correctly, and likely incomplete.
But if done correctly, it shouldn't have bad effects.

best regards
Matthias



------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Dr. Ulrich Krafft (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

Besuchen Sie uns auf unserem neuen Webauftritt unter www.fz-juelich.de

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Greg Troxel
2011-04-20 13:28:39 UTC
Permalink
Post by Matthias Drochner
Post by Thor Lancelot Simon
I believe this does in fact fix the bug we brought up in our original
discussion -- that shifting srtt left by 2 as the first step of the
computation means it is adjusted by 1/2 RTT rather than 1/8.
Looking at the patch and the CVS history I'm getting the impression
the the change which introduced it (tcp_input.crev.1.16) was an attempt
to change the representation of srtt so that it keeps 5 bits after
the point, thus increasing precision. It was just not communicated
correctly, and likely incomplete.
But if done correctly, it shouldn't have bad effects.
I think that's what happened.

It's almost correct; I'm about to check in a large patch that changes
only comments and explains things - it's quite hard to figure out.

I believe there are only three things wrong; I'll post another note
about that post-commit.
Greg Troxel
2011-04-20 13:45:37 UTC
Permalink
Post by Thor Lancelot Simon
Post by Thor Lancelot Simon
I don't see any such change in our patch. Could you point it out?
Argh! I thought we'd already contributed that back.
I believe this does in fact fix the bug we brought up in our original
discussion -- that shifting srtt left by 2 as the first step of the
computation means it is adjusted by 1/2 RTT rather than 1/8.
I do not know whether or how it interacts with Greg's later patches,
but I don't think those have been checked in yet.
I believe the rttvar calculation is correct as in, modulo the RTO
calculation being wrong. It would be good for people to review the
comments I just checked in - I chased down all the storage
representations and attempted to demystify the calculations.
David Young
2011-05-03 18:30:26 UTC
Permalink
Post by Matthias Drochner
Post by Thor Lancelot Simon
I don't see any such change in our patch. Could you point it out?
@@ -3231,9 +3492,9 @@ tcp_xmit_timer(struct tcpcb *tp, uint32_
* an alpha of .875 (srtt = rtt/8 + srtt*7/8 in fixed
* point). Adjust rtt to origin 0.
*/
- delta = (rtt << 2) - (tp->t_srtt >> TCP_RTT_SHIFT);
+ delta = (rtt << 0) - (tp->t_srtt >> TCP_RTT_SHIFT);
if ((tp->t_srtt += delta) <= 0)
- tp->t_srtt = 1 << 2;
+ tp->t_srtt = 1; // 1 << 2;
/*
* We accumulate a smoothed rtt variance (actually, a
* smoothed mean difference), then set the retransmit
I have committed vestigial time-wait and maximum segment lifetime
truncation. I left the RTT changes out. Do people agree that the
changes are correct?

Dave
--
David Young OJC Technologies
***@ojctech.com Urbana, IL * (217) 344-0444 x24

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Greg Troxel
2011-05-03 22:56:45 UTC
Permalink
Post by David Young
Post by Matthias Drochner
Post by Thor Lancelot Simon
I don't see any such change in our patch. Could you point it out?
@@ -3231,9 +3492,9 @@ tcp_xmit_timer(struct tcpcb *tp, uint32_
* an alpha of .875 (srtt = rtt/8 + srtt*7/8 in fixed
* point). Adjust rtt to origin 0.
*/
- delta = (rtt << 2) - (tp->t_srtt >> TCP_RTT_SHIFT);
+ delta = (rtt << 0) - (tp->t_srtt >> TCP_RTT_SHIFT);
if ((tp->t_srtt += delta) <= 0)
- tp->t_srtt = 1 << 2;
+ tp->t_srtt = 1; // 1 << 2;
/*
* We accumulate a smoothed rtt variance (actually, a
* smoothed mean difference), then set the retransmit
I have committed vestigial time-wait and maximum segment lifetime
truncation. I left the RTT changes out. Do people agree that the
changes are correct?
They are not obviously correct, and I think it's more likely than not
that they aren't correct. Have you looked at them relative to the
comment enhancements I recently committed?

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Loading...