ICMP_UNREACH_NEEDFRAG returns iface MTU instead of route?

Discussion:

Dave Huang

2013-12-19 23:34:41 UTC

It seems that the ICMP fragmentation needed packet contains the
interface MTU rather than the route MTU if the route MTU is lower
than the iface's: see sys/netinet/ip_input.c, ip_forward(), case
EMSGSIZE--around line 1337: destmtu = rt->rt_ifp->if_mtu;

Is that what it should be doing? It seems wrong and isn't what I
expected... if I artificially lower the MTU of a route, e.g.,
route add www.netbsd.org $my_gateway_ip -mtu 1200

Then ping -Ds 1300 www.netbsd.org from another machine that routes
through the above router, I get:
PING www.netbsd.org (149.20.53.86): 1300 data bytes
36 bytes from foxy.azeotrope.org (10.1.1.67): frag needed and DF set. Next MTU=1500 for icmp_seq=0

Shouldn't Next MTU=1200, rather than 1500?

This is on a -current i386 kernel from October 2012, but the latest
ip_input.c (1.308) looks like it does the same thing, though I haven't
actually tried.

--
Name: Dave Huang | Mammal, mammal / their names are called /
INet: ***@azeotrope.org | they raise a paw / the bat, the cat /
FurryMUCK: Dahan | dolphin and dog / koala bear and hog -- TMBG
Dahan: Hani G Y+C 38 Y++ L+++ W- C++ T++ A+ E+ S++ V++ F- Q+++ P+ B+ PA+ PL++

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Greg Troxel

2013-12-20 16:01:03 UTC

Post by Dave Huang
It seems that the ICMP fragmentation needed packet contains the
interface MTU rather than the route MTU if the route MTU is lower
than the iface's: see sys/netinet/ip_input.c, ip_forward(), case
EMSGSIZE--around line 1337: destmtu = rt->rt_ifp->if_mtu;
Is that what it should be doing? It seems wrong and isn't what I
expected... if I artificially lower the MTU of a route, e.g.,
route add www.netbsd.org $my_gateway_ip -mtu 1200
Then ping -Ds 1300 www.netbsd.org from another machine that routes
through the above router, I get:
PING www.netbsd.org (149.20.53.86): 1300 data bytes
36 bytes from foxy.azeotrope.org (10.1.1.67): frag needed and DF set.
Next MTU=1500 for icmp_seq=0
Shouldn't Next MTU=1200, rather than 1500?

That's a good question. I would suggest reading the standards, and
checking what other systems do. My quick reaction is that MTU on a
route is a local matter used to choose MTU on outgoing packets that are
originated, and I don't see why it would affect forwarding. But I can
also see a philosophy that it's better to send ICMP fragmentation
required as soon as possible.

Regardless, it seems clear that a packet rejected because of a route MTU
should provoke an ICMP fragmentation required with a value no greater
than the route MTU. But the first question is about route MTUs in the
first place.

Dave Huang

2013-12-20 17:13:57 UTC

Post by Greg Troxel
That's a good question. I would suggest reading the standards, and
checking what other systems do. My quick reaction is that MTU on a
route is a local matter used to choose MTU on outgoing packets that are
originated, and I don't see why it would affect forwarding. But I can
also see a philosophy that it's better to send ICMP fragmentation
required as soon as possible.

RFC 1191 says, "When a router is unable to forward a datagram because
it exceeds the MTU of the next-hop network and its Don't Fragment bit
is set, the router is required to return an ICMP Destination
Unreachable message to the source of the datagram, with the Code
indicating "fragmentation needed and DF set". To support the Path MTU
Discovery technique specified in this memo, the router MUST include
the MTU of that next-hop network in the low-order 16 bits of the ICMP
header field that is labelled "unused" in the ICMP specification [7]."
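
For reference, the header layout the RFC text describes can be sketched in C. This is only an illustration, not NetBSD code: the helper names needfrag_hdr_encode() and needfrag_hdr_mtu() are made up for this example, and the checksum is deliberately left at zero rather than computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ICMP_UNREACH          3  /* ICMP type: destination unreachable */
#define ICMP_UNREACH_NEEDFRAG 4  /* ICMP code: fragmentation needed and DF set */

/*
 * Fill the first 8 bytes of an ICMP "fragmentation needed" message.
 * Bytes 4-5 remain unused (zero); bytes 6-7 carry the next-hop MTU in
 * network byte order, as RFC 1191 specifies. The checksum is a
 * placeholder here; a real sender computes it over the whole message.
 */
static void
needfrag_hdr_encode(uint8_t hdr[8], uint16_t nexthop_mtu)
{
	assert(hdr != NULL);
	hdr[0] = ICMP_UNREACH;
	hdr[1] = ICMP_UNREACH_NEEDFRAG;
	hdr[2] = hdr[3] = 0;                    /* checksum placeholder */
	hdr[4] = hdr[5] = 0;                    /* still unused */
	hdr[6] = (uint8_t)(nexthop_mtu >> 8);   /* high-order byte first */
	hdr[7] = (uint8_t)(nexthop_mtu & 0xff);
}

/* Read back the advertised next-hop MTU from such a header. */
static uint16_t
needfrag_hdr_mtu(const uint8_t hdr[8])
{
	assert(hdr != NULL);
	return (uint16_t)((hdr[6] << 8) | hdr[7]);
}
```

The "Next MTU=1500" in the ping output earlier in the thread is this 16-bit value as decoded by the sender.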

It seems reasonable to me to interpret the route's MTU as specifying
"MTU of the next-hop network".

I checked a Debian Linux system (running kernel 2.6.32-5-686), and it
returns the route MTU. E.g., repeating the same type of test as I did
earlier: ip route add 149.20.53.86 dev eth0 mtu 1200

Then from another machine, pinged 149.20.53.86 with a 1300-byte packet
and DF set. The MTU returned in the ICMP fragmentation needed packet
was 1200.

NetBSD's current behavior would seem to break PMTU discovery... it
won't forward DF packets larger than the route MTU, but then it tells
the sender that larger packets are OK.
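
To make the failure mode concrete, here is a toy model in C of a sender probing through such a router. It is a sketch under stated assumptions, not kernel code: struct router, router_forward_df() and pmtud_probe() are hypothetical names, and the sender is simplified to "shrink to whatever MTU the ICMP reply advertises and retry".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical router model: drops DF packets larger than the route MTU. */
struct router {
	int  if_mtu;            /* MTU of the outgoing interface */
	int  route_mtu;         /* MTU set on the route, 0 if unset */
	bool report_route_mtu;  /* false: ICMP reply carries if_mtu */
};

/* Size limit the router actually enforces when forwarding. */
static int
router_limit(const struct router *r)
{
	if (r->route_mtu != 0 && r->route_mtu < r->if_mtu)
		return r->route_mtu;
	return r->if_mtu;
}

/*
 * Returns 0 if the DF packet is forwarded; otherwise the Next-Hop MTU
 * value placed in the ICMP "fragmentation needed" reply.
 */
static int
router_forward_df(const struct router *r, int pkt_len)
{
	if (pkt_len <= router_limit(r))
		return 0;
	return r->report_route_mtu ? router_limit(r) : r->if_mtu;
}

/*
 * Sender side: retry at the advertised MTU until a packet gets through.
 * Returns the size that was forwarded, or -1 if the reply advertises an
 * MTU no smaller than the packet that was just dropped (no progress).
 */
static int
pmtud_probe(const struct router *r, int pkt_len)
{
	assert(r->if_mtu > 0 && pkt_len > 0);
	for (;;) {
		int next = router_forward_df(r, pkt_len);

		if (next == 0)
			return pkt_len;
		if (next >= pkt_len)
			return -1;
		pkt_len = next;
	}
}
```

With if_mtu 1500 and route_mtu 1200, the 1328-byte packet from the ping test (1300 data bytes plus 8 bytes of ICMP header and 20 bytes of IP header) stalls when the reply carries the interface MTU, and settles on 1200 when it carries the route MTU.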

Dennis Ferguson

2013-12-21 20:03:48 UTC

Post by Dave Huang

Post by Greg Troxel
That's a good question. I would suggest reading the standards, and
checking what other systems do. My quick reaction is that MTU on a
route is a local matter used to choose MTU on outgoing packets that are
originated, and I don't see why it would affect forwarding. But I can
also see a philosophy that it's better to send ICMP fragmentation
required as soon as possible.

RFC 1191 says, "When a router is unable to forward a datagram because
it exceeds the MTU of the next-hop network and its Don't Fragment bit
is set, the router is required to return an ICMP Destination
Unreachable message to the source of the datagram, with the Code
indicating "fragmentation needed and DF set". To support the Path MTU
Discovery technique specified in this memo, the router MUST include
the MTU of that next-hop network in the low-order 16 bits of the ICMP
header field that is labelled "unused" in the ICMP specification [7]."
It seems reasonable to me to interpret the route's MTU as specifying
"MTU of the next-hop network".

You can interpret it that way if you want, but that isn't what the text
really says. The "MTU of the next-hop network" it is referring to is
the MTU of the network the outgoing interface is attached to; it is a
synonym for the next hop interface MTU. This also means, of course, that
the text above doesn't really apply to your situation since "exceeds the MTU
of the next-hop network" is not the reason the datagram is being discarded
in your case.

I don't think you'll find much help with route MTUs in the standard documents
since I'm pretty sure standard IP doesn't think that routes have associated
MTUs. This is a non-standard feature someone made up, so someone also gets
to make up what it does and how it does that.

Post by Dave Huang
I checked a Debian Linux system (running kernel 2.6.32-5-686), and it
returns the route MTU. E.g., repeating the same type of test as I did
earlier: ip route add 149.20.53.86 dev eth0 mtu 1200
Then from another machine, pinged 149.20.53.86 with a 1300-byte packet
and DF set. The MTU returned in the ICMP fragmentation needed packet
was 1200.
NetBSD's current behavior would seem to break PMTU discovery... it
won't forward DF packets larger than the route MTU, but then it tells
the sender that larger packets are OK.

NetBSD's current behavior is indeed broken. If the router is going to
drop the packet because of its size then the ICMP unreachable error it
returns to inform the sender of this must include a packet size that
the router would not drop.

I think Greg's point is a bit different. Since the bit of the standard
you quote says the reason for returning that error is that "a router was
unable to forward a datagram because it exceeds the MTU of the next-hop
network", then because the packet being forwarded in your case did not
"exceed the MTU of the next-hop network" one could argue that the router
shouldn't be discarding the packet in the first place. That is, the ICMP
message is correct but the packet discard which prompted its sending
was incorrect. Since there is no standard for what route MTUs do the
question of what they should do is going to depend on what function they
serve and what is best for that function.

I, like Greg, had the early reaction that the router shouldn't be using
the route MTU when forwarding packets (as opposed to processing packets
originated by the local host-in-the-router). The reason is that the only
use I've seen made of the route MTU field is to implement a cache of PMTUs
discovered by applications running on the local host, and it is normally
considered quite inappropriate for a router to enforce the PMTUs learned
by its local applications on packets originated by other hosts. It is
possible, however, that the field is no longer used for PMTU discovery
caching and that the remaining uses actually do require route MTUs to be
applied to forwarded packets for some reason.

In any case it is clear that something is broken. What isn't clear, however,
is whether it is the MTU in the ICMP message or the packet drop which caused
the message to be sent which is the broken bit.

Dennis Ferguson

Jarle Greipsland

2013-12-22 10:01:43 UTC

Dennis Ferguson <***@gmail.com> writes:
[ ... ]

Post by Dennis Ferguson
I, like Greg, had the early reaction that the router shouldn't be using
the route MTU when forwarding packets (as opposed to processing packets
originated by the local host-in-the-router). The reason is that the only
use I've seen made of the route MTU field is to implement a cache of PMTUs
discovered by applications running on the local host, and it is normally
considered quite inappropriate for a router to enforce the PMTUs learned
by its local applications on packets originated by other hosts.

[ ... ]
Indeed. Should a downstream router route packets based on source
addresses or some other form of packet classification, the
packets may be sent along paths with different MTUs. Thus, the
calculated route MTU for the NetBSD forwarder will not
necessarily apply to the traffic from an upstream host.

-jarle

--
"Once policy-based routing is solved, some new problem with routing
will arise." -- Marshall T Rose

Dave Huang

2013-12-21 22:55:37 UTC

Post by Dennis Ferguson
You can interpret it that way if you want, but that isn't what the text
really says. The "MTU of the next-hop network" it is referring to is
the MTU of the network the outgoing interface is attached to; it is a
synonym for the next hop interface MTU. This also means, of course, that
the text above doesn't really apply to your situation since "exceeds the MTU
of the next-hop network" is not the reason the datagram is being discarded
in your case.

How does the OS know what the MTU of the next-hop network is? By
adding a static route, I'm declaring what I want the next hop's
outgoing interface to be. It seems reasonable that I should also be
able to use it to declare what the MTU is, if the default MTU assumed
by the system based on the interface type is incorrect for whatever
reason.

Post by Dennis Ferguson
The reason is that the only
use I've seen made of the route MTU field is to implement a cache of PMTUs
discovered by applications running on the local host, and it is normally
considered quite inappropriate for a router to enforce the PMTUs learned
by its local applications on packets originated by other hosts. It is
possible, however, that the field is no longer used for PMTU discovery
caching and that the remaining uses actually do require route MTUs to be
applied to forwarded packets for some reason.

It doesn't seem like NetBSD uses it as a PMTU cache--at least I've
never seen "netstat -r" show any routes added due to PMTU discovery.

Post by Dennis Ferguson
In any case it is clear that something is broken. What isn't clear, however,
is whether it is the MTU in the ICMP message or the packet drop which caused
the message to be sent which is the broken bit.

I was just experimenting with route MTUs when I came across this
problem; I don't have any real need or use for them, so I wouldn't
personally be affected either way. That said, I do think it makes
sense for the route MTU to be a way to administratively set the "MTU
of the next-hop network" for a certain route. Which is what NetBSD
already does-- packets being forwarded through the router that are
larger than the route MTU do get fragmented if DF is not set. This
would also match the Linux behavior, and I think the FreeBSD behavior
too. I haven't actually tried on FreeBSD, but there's a comment in
their ip_input.c:ip_forward() that says, "Try to cache the route MTU
from ip_output so we can consider it for the ICMP_UNREACH_NEEDFRAG
"Next-Hop MTU" field described in RFC1191." followed by
mtu = ro.ro_rt->rt_rmx.rmx_mtu;

Later, when the ICMP needs frag is sent, the MTU sent back is the
minimum of the interface MTU and the route MTU.

Dennis Ferguson

2013-12-22 18:05:18 UTC

Post by Dave Huang

Post by Dennis Ferguson
You can interpret it that way if you want, but that isn't what the text
really says. The "MTU of the next-hop network" it is referring to is
the MTU of the network the outgoing interface is attached to; it is a
synonym for the next hop interface MTU. This also means, of course, that
the text above doesn't really apply to your situation since "exceeds the MTU
of the next-hop network" is not the reason the datagram is being discarded
in your case.

How does the OS know what the MTU of the next-hop network is? By
adding a static route, I'm declaring what I want the next hop's
outgoing interface to be. It seems reasonable that I should also be
able to use it to declare what the MTU is, if the default MTU assumed
by the system based on the interface type is incorrect for whatever
reason.

That's a good question. Since a mismatch between the interface MTU and
another host's MRU causes painful problems, the standards try hard to make
sure this doesn't happen. Often the MTU for a link type is a constant
defined in the standard which specifies how IP is mapped to the link. When
it isn't a constant (e.g. point to point encapsulations, often) there is
often a protocol specified to ensure that everyone involved agrees
what the MTU should be (e.g. PPP). In any case, if there's any possibility
that the choice of MTU the OS is making might not be the same as other
correctly operating hosts on the same link then the OS should also be
providing you with a way to correct its choice of interface MTU directly
by interface configuration.

You are right that you could effectively do the same thing with route
MTUs, if that's what a route MTU was for, but just being able to set an
MTU on a route yourself with `route add' is insufficient by itself. Everything
which adds routes would need to know to do this, including interface
configuration (i.e. interface configuration still needs a correct MTU),
ARP and Neighbor Discovery, DHCP, PPP and ICMP redirects. A much better and
easier workaround for a bad interface MTU would be the ability to reconfigure
the interface MTU.

Post by Dave Huang

Post by Dennis Ferguson
The reason is that the only
use I've seen made of the route MTU field is to implement a cache of PMTUs
discovered by applications running on the local host, and it is normally
considered quite inappropriate for a router to enforce the PMTUs learned
by its local applications on packets originated by other hosts. It is
possible, however, that the field is no longer used for PMTU discovery
caching and that the remaining uses actually do require route MTUs to be
applied to forwarded packets for some reason.

It doesn't seem like NetBSD uses it as a PMTU cache--at least I've
never seen "netstat -r" show any routes added due to PMTU discovery.

Here is a snippet from sys/netinet/tcp_output.c:

    size = tcp_mssdflt;
    if (tp->t_mtudisc && rt->rt_rmx.rmx_mtu != 0) {
        size = rt->rt_rmx.rmx_mtu - hdrlen;

TCP hence seems to think that when it is doing PMTU Discovery the learned
MTU will be stored in a route MTU.
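
As a worked example of that arithmetic, here is a standalone C sketch that mirrors only the three quoted lines. It is not the real function: segment_size() is a made-up name, mss_default stands in for tcp_mssdflt, and whatever else the kernel code does around this snippet is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Start from the default MSS; when PMTU discovery is enabled and the
 * route carries a non-zero MTU, derive the segment size from that MTU.
 * hdrlen is the combined IP and TCP header length in bytes.
 */
static int
segment_size(bool mtudisc, int route_mtu, int hdrlen, int mss_default)
{
	int size = mss_default;

	assert(hdrlen > 0 && route_mtu >= 0);
	if (mtudisc && route_mtu != 0)
		size = route_mtu - hdrlen;
	return size;
}
```

For instance, a route MTU of 1420 and 40 bytes of headers (20 IP plus 20 TCP, no options) gives segments of 1380 bytes; with a route MTU of 0 the default is used unchanged.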

Unfortunately there are more places where routes are stored than the structure
shown by "netstat -r". The routes TCP looks at are apparently stored in the
"route cache" version of the table. I don't know the command to display the
contents of that, but I do know that forwarding lookups also look in the
"route cache" version of the table so it isn't clear to me that the results
of PMTU Discovery aren't being applied to forwarded packets.

I really dislike the route cache thing, though that's a different topic.

Post by Dave Huang
Later, when the ICMP needs frag is sent, the MTU sent back is the
minimum of the interface MTU and the route MTU.

This is the problem with mechanisms which have ill-defined (or undefined)
purposes: everyone gets to have an opinion about how to use it, and there's
no reason for everyone to have the same opinion.

What I do know is that it is inappropriate for a router to enforce
PMTUs discovered by its local applications on packets originated by other
hosts, and that the route MTU on some routes seems to be coming from
PMTU Discovery. I admit the argument that being bug-for-bug compatible
with Linux and/or FreeBSD is useful might have some merit, however.

Dennis Ferguson

Dave Huang

2013-12-23 05:23:06 UTC

Post by Dave Huang

Post by Dennis Ferguson
The reason is that the only
use I've seen made of the route MTU field is to implement a cache of PMTUs
discovered by applications running on the local host

It doesn't seem like NetBSD uses it as a PMTU cache--at least I've
never seen "netstat -r" show any routes added due to PMTU discovery.

Huh, I guess I just haven't been looking at the right time (and also
probably because 1500-byte MTUs are so common?) After running with the
patch in
<http://mail-index.netbsd.org/tech-net/2013/12/20/msg004421.html> that
makes IPSEC send ICMP fragmentation needed, I do in fact see "netstat
-r" show entries for hosts on the other side of the tunnel, with
appropriately-reduced MTUs. E.g., after uploading a file to 10.2.1.20,
this showed up:

Destination        Gateway            Flags    Refs      Use    Mtu Interface
[ ... ]
10.2.1.20          10.1.1.67          UGHD        -        -   1420  fxp0

I still think the patch in kern/48472 improves the situation though...
if it's decided that discovered PMTUs shouldn't affect forwarded
packets (which I agree with), that's a separate topic from accurately
reporting the MTU that's actually being used.

Greg Troxel

2013-12-28 16:03:37 UTC

Dave Huang <***@azeotrope.org> writes:

(Separately from the points addressed here, I should point out that I
agree with everything Dennis has said.)

Post by Dave Huang
How does the OS know what the MTU of the next-hop network is? By
adding a static route, I'm declaring what I want the next hop's
outgoing interface to be. It seems reasonable that I should also be
able to use it to declare what the MTU is, if the default MTU assumed
by the system based on the interface type is incorrect for whatever
reason.

That's layering on a kludge to fix a bug. An interface has a configured
MTU, which can be changed if it's wrong. But an MTU and the implicit
MRU are really a property of the link (v6 term).

Post by Dave Huang
It doesn't seem like NetBSD uses it as a PMTU cache--at least I've
never seen "netstat -r" show any routes added due to PMTU discovery.

I have seen such entries. They are rare now, as DSL has faded, and most
links handle 1500. And almost all v6 links handle 1280.

You could provoke them by hooking up two machines with a configured
(interface) MTU of 576 across an ethernet serving as a gateway to
another link, and then operate normally from the far side.

Post by Dave Huang
I was just experimenting with route MTUs when I came across this
problem; I don't have any real need or use for them, so I wouldn't
personally be affected either way. That said, I do think it makes
sense for the route MTU to be a way to administratively set the "MTU
of the next-hop network" for a certain route. Which is what NetBSD

The MTU is fundamentally a property of the link (next hop), not of paths
(where an MTU is dynamic, and the min of the links). There is a way to
configure MTU for the network, although it is rarely appropriate to
change it.

Post by Dave Huang
already does-- packets being forwarded through the router that are
larger than the route MTU do get fragmented if DF is not set. This
would also match the Linux behavior, and I think the FreeBSD behavior
too. I haven't actually tried on FreeBSD, but there's a comment in
their ip_input.c:ip_forward() that says, "Try to cache the route MTU
from ip_output so we can consider it for the ICMP_UNREACH_NEEDFRAG
"Next-Hop MTU" field described in RFC1191." followed by
mtu = ro.ro_rt->rt_rmx.rmx_mtu;
Later, when the ICMP needs frag is sent, the MTU sent back is the
minimum of the interface MTU and the route MTU.

It still seems overly inventive to be using "route MTUs" which are
really PMTU-D cache entries.

Darren Reed

2013-12-22 22:18:26 UTC

Post by Dave Huang
It seems that the ICMP fragmentation needed packet contains the
interface MTU rather than the route MTU if the route MTU is lower
than the iface's: see sys/netinet/ip_input.c, ip_forward(), case
EMSGSIZE--around line 1337: destmtu = rt->rt_ifp->if_mtu;
Is that what it should be doing? It seems wrong and isn't what I
expected... if I artificially lower the MTU of a route, e.g.,
route add www.netbsd.org $my_gateway_ip -mtu 1200
Then ping -Ds 1300 www.netbsd.org from another machine that routes
through the above router, I get:
PING www.netbsd.org (149.20.53.86): 1300 data bytes
36 bytes from foxy.azeotrope.org (10.1.1.67): frag needed and DF set. Next MTU=1500 for icmp_seq=0
Shouldn't Next MTU=1200, rather than 1500?

Yes.

Otherwise PMTU Discovery does not work.

We can argue about RFC semantics forever and a day but if the
behaviour stays the same then PMTU discovery involving NetBSD
will remain broken in situations like this.

Darren

Dave Huang

2013-12-23 04:11:05 UTC

Post by Dennis Ferguson
What I do know is that it is inappropriate for a router to enforce
PMTUs discovered by its local applications on packets originated by other
hosts, and that the route MTU on some routes seems to be coming from
PMTU Discovery. I admit the argument that being bug-for-bug compatible
with Linux and/or FreeBSD is useful might have some merit, however.

That I agree with... Is there a way to view the PMTU route cache? In
Linux, it's "ip route show cache", and Windows has "netsh interface
ipv4 show destinationcache", but I didn't spot anything relevant in
"apropos mtu" on NetBSD. If I knew what was in the PMTU cache, I could
do a test to see whether the cached MTU affected forwarded packets.

In any case, I think there's agreement that there's a bug, if not
exactly what the bug is. I'll file a PR about it :)

Mouse

2013-12-27 18:39:04 UTC

Post by Dave Huang
It seems that the ICMP fragmentation needed packet contains the
interface MTU rather than the route MTU if the route MTU is lower

Yes. I agree with Darren on this; see PR 44508, and my thread bringing
it up (right here on tech-net) on 2012-06-14. Since neither of those
apparently did much, it's unlikely this thread will do much either.
But I didn't give patches before; this time I have them, at least for
the versions I run (4.0.1 and 5.2).

4.0.1's is

commit 4e8da79fceebe279dbf150348fb79dd20d10f9e2
Author: Mouse <***@Rodents-Montreal.ORG>
Date: Fri Apr 1 21:44:21 2011 -0400

Get "need to frag but DF set" MTU values righter.

This affects the case where the route's MTU is lower than the
interface's; it would formerly use the interface's MTU, which causes
PMTU-D to loop, since the generated ICMPs contain a nonworking MTU.

diff --git a/sys/netinet/ip_input.c b/sys/netinet/ip_input.c
index d088151..7724373 100644
--- a/sys/netinet/ip_input.c
+++ b/sys/netinet/ip_input.c
@@ -1843,6 +1843,7 @@ ip_forward(struct mbuf *m, int srcrt)
         int error, type = 0, code = 0, destmtu = 0;
         struct mbuf *mcopy;
         n_long dest;
+        int rmtu;

         /*
          * We are now in the output path.
@@ -1934,9 +1935,10 @@ ip_forward(struct mbuf *m, int srcrt)
                 }
         }

+        rmtu = 0;
         error = ip_output(m, (struct mbuf *)0, &ipforward_rt,
-            (IP_FORWARDING | (ip_directedbcast ? IP_ALLOWBROADCAST : 0)),
-            (struct ip_moptions *)NULL, (struct socket *)NULL);
+            (IP_FORWARDING | IP_RETURNMTU | (ip_directedbcast ? IP_ALLOWBROADCAST : 0)),
+            (struct ip_moptions *)NULL, (struct socket *)NULL, &rmtu);

         if (error)
                 ipstat.ips_cantforward++;
@@ -1997,7 +1999,7 @@ ip_forward(struct mbuf *m, int srcrt)
                             &ipsecerror);
 #endif

-                destmtu = ipforward_rt.ro_rt->rt_ifp->if_mtu;
+                destmtu = rmtu ? : ipforward_rt.ro_rt->rt_ifp->if_mtu;
 #if defined(IPSEC) || defined(FAST_IPSEC)
                 if (sp != NULL) {
                         /* count IPsec header size */

5.2's is

commit d7138d8a1fa119cbc83621176524e81bfcacb5ba
Author: Mouse <***@Rodents-Montreal.ORG>
Date: Fri Feb 15 18:52:37 2013 -0500

Get "need to frag but DF set" MTU values righter.

diff --git a/sys/netinet/ip_input.c b/sys/netinet/ip_input.c
index 8cd094d..f1920cb 100644
--- a/sys/netinet/ip_input.c
+++ b/sys/netinet/ip_input.c
@@ -1847,6 +1847,7 @@ ip_forward(struct mbuf *m, int srcrt)
                 struct sockaddr dst;
                 struct sockaddr_in dst4;
         } u;
+        int rmtu;

         /*
          * We are now in the output path.
@@ -1926,8 +1927,8 @@ ip_forward(struct mbuf *m, int srcrt)
         }

         error = ip_output(m, NULL, &ipforward_rt,
-            (IP_FORWARDING | (ip_directedbcast ? IP_ALLOWBROADCAST : 0)),
-            (struct ip_moptions *)NULL, (struct socket *)NULL);
+            (IP_FORWARDING | IP_RETURNMTU | (ip_directedbcast ? IP_ALLOWBROADCAST : 0)),
+            (struct ip_moptions *)NULL, (struct socket *)NULL, &rmtu);

         if (error)
                 IP_STATINC(IP_STAT_CANTFORWARD);
@@ -1974,6 +1975,8 @@ ip_forward(struct mbuf *m, int srcrt)
                 if ((rt = rtcache_validate(&ipforward_rt)) != NULL)
                         destmtu = rt->rt_ifp->if_mtu;

+                if (rmtu && (rmtu < destmtu)) destmtu = rmtu;
+
 #if defined(IPSEC) || defined(FAST_IPSEC)
                 {
                         /*

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B

Mouse

2013-12-29 00:00:14 UTC

Post by Thor Lancelot Simon

Post by Mouse

Post by Thor Lancelot Simon
I suggest we fix it this way for now and count the dancing angels
once we have stopped being stuck with the pins.

Why the derision?

Because it's been two weeks of arguing about _how_ to fix a fairly
serious bug, while nobody's actually checked anything in to fix it
even in the interim

What's the urgency about fixing it _now_? It's sat there for over 12
years - after my previous mail in this thread I found mail I sent to
tech-net back in, if I'm reading things right, 2001-11-04 reporting
what appears to have been the same problem. And it's been nearly three
years that PR 44508 has sat there. (I don't know why I didn't commit
anything back in 2001 - I speculate I wasn't running then-current - and
I think I no longer had commit access by 2011-02-03. In the former I
am at least partially at fault in that I said I'd try a change and if
it worked I'd send-pr it; I find a later mail reporting that the change
worked, but I see no evidence I created a PR for it until 2011.)

Post by Thor Lancelot Simon

Post by Mouse
[I]f you really think NetBSD would be better off paying attention to
route MTUs only sometimes, go for it.

I think that if this discussion illustrates anything, it's that the
notion of a "route MTU" is incoherent --

I don't see it as incoherent, just not used everywhere it needs to be.

Post by Thor Lancelot Simon
a back-formed conceptual rationale for the expedient hack of storing
path MTUs in the routing table, which we're now paying for.

If so, it's a remarkably useful expedient hack. I've found lots of
cases where route MTUs are useful. And one of the indications that a
thing is a right thing is when it finds uses not anticipated by its
creators.

Post by Thor Lancelot Simon
I *do* object to the bickering over what seems to me to be the
consequent neologism "route MTU" preventing us from quickly applying
an obvious fix to solve these old, very real, problems for users
caused by the original implementation of path MTU.

Well, sure, apply a fix. I'd just rather it be the one that results in
a more useful system. But, of course, I'm not the one doing the work,
I won't be running the result, and even if I were I'm entirely
competent to replace it with the more useful fix locally anyway. I
just prefer to see things done right.

Greg Troxel

2013-12-28 15:55:53 UTC

Post by Dave Huang
RFC 1191 says, "When a router is unable to forward a datagram because
it exceeds the MTU of the next-hop network and its Don't Fragment bit
is set, the router is required to return an ICMP Destination
Unreachable message to the source of the datagram, with the Code
indicating "fragmentation needed and DF set". To support the Path MTU
Discovery technique specified in this memo, the router MUST include
the MTU of that next-hop network in the low-order 16 bits of the ICMP
header field that is labelled "unused" in the ICMP specification [7]."
It seems reasonable to me to interpret the route's MTU as specifying
"MTU of the next-hop network".

I disagree; there seems to be no notion in the standards that discovered
MTUs for routes are to be propagated. The entire notion of "route MTU"
is just an implementation detail to store PMTU-D information.

A system of nodes that report discovered MTUs seems more complicated and
perhaps more fragile than one in which each host discovers MTUs itself.
But I can't prove the fragile part.

Post by Dave Huang
I checked a Debian Linux system (running kernel 2.6.32-5-686), and it
returns the route MTU. E.g., repeating the same type of test as I did
earlier: ip route add 149.20.53.86 dev eth0 mtu 1200
Then from another machine, pinged 149.20.53.86 with a 1300-byte packet
and DF set. The MTU returned in the ICMP fragmentation needed packet
was 1200.

That's an interesting datapoint about practice, but it strikes me as
non-compliant with standards (see below).

Post by Dave Huang
NetBSD's current behavior would seem to break PMTU discovery... it
won't forward DF packets larger than the route MTU, but then it tells
the sender that larger packets are OK.

I agree that the combination of declining to forward a packet via a
route and returning an interface MTU greater than the route MTU is
broken.

The real question is:

Why is it ok to decline to forward packets because they are bigger
than the route MTU, when the route MTU is about PMTU-D to be used for
locally-sourced packets?

If it is ok to decline, then we need to return route MTU when declining
to forward because of route MTU. If it's not, we need to fix the
forwarding behavior. I just skimmed RFC4821 and found zero discussion
of interaction with forwarding packets. So I believe that route MTUs
(and really the entire routing table entry resulting from PMTU-D) should
be ignored when forwarding packets. The RFC talks about storing PMTU-D
state using flow ids, essentially pointing out that PMTU-D values may
not necessarily be valid for dissimilar packets with the same
destination address.

That leaves routes with explicit MTUs that aren't from PMTU-D as an odd
case. One can view those as buggy or test cases for PMTU-D on the
theory that the route MTU mechanism was added for PMTU-D.

Dave Huang

2013-12-28 19:39:11 UTC

Post by Greg Troxel
I disagree; there seems to be no notion in the standards that discovered
MTUs for routes are to be propagated. The entire notion of "route MTU"
is just an implementation detail to store PMTU-D information.

Is it? That seems like the root of the problem then... the routing
table is used to store routing information, including info on how to
route forwarded packets. PMTU-D info should be stored elsewhere... or
at least it should be marked with some flag that the kernel can look at
to know whether it's a PMTU cache entry or an actual route. AFAICT,
neither Linux nor Windows uses the routing table for its PMTU cache.

Post by Greg Troxel
I agree that the combination of declining to forward a packet via a
route and returning an interface MTU greater than the route MTU is
broken.
Why is it ok to decline to forward packets because they are bigger
than the route MTU, when the route MTU is about PMTU-D to be used for
locally-sourced packets?

I don't think it's OK, but that's an orthogonal issue. If someone
wants to fix that, I'm all for it. However, I don't think that's a
prerequisite for fixing the issue I'm reporting: if NetBSD is going to
drop a packet because it exceeds the MTU, it needs to properly report
what that MTU is.
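
To make the point concrete, here is a rough C sketch in which the same value drives both the decision to drop and the MTU that gets reported. The helper names effective_mtu and needfrag_mtu are made up for this example, and taking the smaller of the two MTUs is an assumption about the desired behavior, not a description of the current kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: the MTU that applies to a packet sent via a route.
 * route_mtu == 0 means no MTU is set on the route.
 */
uint32_t
effective_mtu(uint32_t route_mtu, uint32_t if_mtu)
{
	assert(if_mtu > 0);
	if (route_mtu != 0 && route_mtu < if_mtu)
		return route_mtu;
	return if_mtu;
}

/*
 * Returns 0 if a DF packet of pkt_len bytes fits. Otherwise returns the
 * MTU to put in the ICMP error, which is the same limit that caused
 * the drop.
 */
uint32_t
needfrag_mtu(uint32_t pkt_len, uint32_t route_mtu, uint32_t if_mtu)
{
	uint32_t mtu = effective_mtu(route_mtu, if_mtu);

	return pkt_len > mtu ? mtu : 0;
}
```

For the test case from my original report (1300 data bytes, so a 1328-byte IP packet, over a 1200-byte route on a 1500-byte interface), needfrag_mtu(1328, 1200, 1500) gives 1200 rather than 1500.
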

--
Name: Dave Huang | Mammal, mammal / their names are called /
INet: ***@azeotrope.org | they raise a paw / the bat, the cat /
FurryMUCK: Dahan | dolphin and dog / koala bear and hog -- TMBG
Dahan: Hani G Y+C 38 Y++ L+++ W- C++ T++ A+ E+ S++ V++ F- Q+++ P+ B+ PA+ PL++


Thor Lancelot Simon

2013-12-28 20:30:17 UTC

Post by Mouse

Post by Dave Huang
It seems that the ICMP fragmentation needed packet contains the
interface MTU rather than the route MTU if the route MTU is lower

Yes. I agree with Darren on this; see PR 44508, and my thread bringing
it up (right here on tech-net) on 2012-06-14. Since neither of those
apparently did much, it's unlikely this thread will do much either.
But I didn't give patches before; this time I have them, at least for
the versions I run (4.0.1 and 5.2).

Here is a patch to fix it the other way (use interface MTU when forwarding)
for NetBSD-current. It is untested, but it has the advantage, to me at
least, of being a 1-line change.

I suggest we fix it this way for now and count the dancing angels once
we have stopped being stuck with the pins.

Index: ip_output.c
===================================================================
RCS file: /cvsroot/src/sys/netinet/ip_output.c,v
retrieving revision 1.224
diff -c -r1.224 ip_output.c
*** ip_output.c 29 Jun 2013 21:06:58 -0000 1.224
--- ip_output.c 28 Dec 2013 20:26:43 -0000
***************
*** 283,289 ****
}
ia = ifatoia(rt->rt_ifa);
ifp = rt->rt_ifp;
! if ((mtu = rt->rt_rmx.rmx_mtu) == 0)
mtu = ifp->if_mtu;
rt->rt_use++;
if (rt->rt_flags & RTF_GATEWAY)
--- 283,290 ----
}
ia = ifatoia(rt->rt_ifa);
ifp = rt->rt_ifp;
! if ((flags & (IP_FORWARDING|IP_RAWOUTPUT)) ||
! ((mtu = rt->rt_rmx.rmx_mtu) == 0))
mtu = ifp->if_mtu;
rt->rt_use++;
if (rt->rt_flags & RTF_GATEWAY)
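
Read on its own, the intended effect of the changed condition can be sketched as a small standalone C function. The name select_mtu and the SK_ flag values are placeholders for this example and are not the kernel's definitions; only the shape of the test mirrors the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder flag bits for this sketch only. */
#define SK_FORWARDING	0x1u
#define SK_RAWOUTPUT	0x2u

/*
 * Forwarded or raw-output packets skip the route MTU entirely and use
 * the interface MTU; locally sourced packets use the route MTU when one
 * is set (non-zero).
 */
uint32_t
select_mtu(unsigned flags, uint32_t route_mtu, uint32_t if_mtu)
{
	uint32_t mtu;

	assert(if_mtu > 0);
	if ((flags & (SK_FORWARDING | SK_RAWOUTPUT)) ||
	    (mtu = route_mtu) == 0)
		mtu = if_mtu;
	return mtu;
}
```

So select_mtu(SK_FORWARDING, 1200, 1500) yields 1500, while select_mtu(0, 1200, 1500) yields 1200.
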


Mouse

2013-12-28 21:07:25 UTC

Post by Thor Lancelot Simon
Here is a patch to fix it the other way (use interface MTU when forwarding)

Hmm, I don't like that because it makes it impossible to impose a lower
MTU limit for some but not all of the forwarding traffic through an
interface. (This is something that should never be necessary. But the
net is not that ideal.)

Post by Thor Lancelot Simon
It [...] has the advantage, to me at least, of being a 1-line change.

I'm not sure why one-line versus five-line makes all that much
difference. If this is done, I expect "route MTUs are ignored when
forwarding" bugs to get filed, because I would certainly _expect_
forwarding traffic using a particular route to pay attention to the
route's MTU. Having route MTUs paid attention to only sometimes
strikes me as far worse than having to change a whole four more lines
in the fix.

Post by Thor Lancelot Simon
I suggest we fix it this way for now and count the dancing angels
once we have stopped being stuck with the pins.

Why the derision? As amusing as the turn of phrase is, I don't see the
issue as angel-dancing in any sense and certainly not deserving of
being made out to be ridiculous like this.

Still, it's no skin off my nose, since my systems won't be affected
either way; if you really think NetBSD would be better off paying
attention to route MTUs only sometimes, go for it.

/~\ The ASCII Mouse
\ / Ribbon Campaign
X Against HTML ***@rodents-montreal.org
/ \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B


Thor Lancelot Simon

2013-12-28 21:17:07 UTC

Post by Mouse

Post by Thor Lancelot Simon
I suggest we fix it this way for now and count the dancing angels
once we have stopped being stuck with the pins.

Why the derision?

Because it's been two weeks of arguing about _how_ to fix a fairly
serious bug, while nobody's actually checked anything in to fix it
even in the interim even though we all clearly know at least two
different, very simple, ways how. Users shouldn't be left in the
lurch like that while we discuss.

Post by Mouse
Still, it's no skin off my nose, since my systems won't be affected
either way; if you really think NetBSD would be better off paying
attention to route MTUs only sometimes, go for it.

I think that if this discussion illustrates anything, it's that the
notion of a "route MTU" is incoherent -- a back-formed conceptual
rationale for the expedient hack of storing path MTUs in the routing
table, which we're now paying for.

I don't object to storing path MTUs in the routing table, because it
is a central datastructure with properties that make them convenient
to store there and efficient to look up from there (since we already
must do the routing lookup at packet output time). I *do* object
to the bickering over what seems to me to be the consequent neologism
"route MTU" preventing us from quickly applying an obvious fix to
solve these old, very real, problems for users caused by the original
implementation of path MTU.

Thor


Greg Troxel

2013-12-29 00:28:43 UTC

Post by Dave Huang

Post by Greg Troxel
I disagree; there seems to be no notion in the standards that discovered
MTUs for routes are to be propagated. The entire notion of "route MTU"
is just an implementation detail to store PMTU-D information.

Is it? That seems like the root of the problem then... the routing
table is used to store routing information, including info on how to
route forwarded packets. PMTU-D info should be stored elsewhere... or
at least it should be marked with some flag that the kernel can look at
to know whether it's a PMTU cache entry or an actual route. AFAICT,
neither Linux nor Windows uses the routing table for its PMTU cache.

That's a fair point, but a central design point of the 4.4BSD networking
code is to use (abuse) the (single) routing table for multiple things,
including ARP. It's a reasonable notion that these PMTU-D entries
should be identifiable.
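
One way to picture "identifiable" is a flag on the entry, sketched below in C. Everything here is hypothetical: the struct sk_route, the SK_RTF_PMTU bit, and the function route_mtu_for exist only for this example and do not correspond to anything in the tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag: the entry's MTU was learned by PMTU-D, not configured. */
#define SK_RTF_PMTU	0x1u

struct sk_route {
	unsigned flags;
	uint32_t mtu;		/* 0 = no MTU recorded */
};

/*
 * A discovered path MTU applies only to locally sourced packets;
 * an explicitly configured route MTU applies to forwarded packets too.
 */
uint32_t
route_mtu_for(const struct sk_route *rt, uint32_t if_mtu, bool forwarding)
{
	assert(rt != NULL);
	if (rt->mtu == 0)
		return if_mtu;
	if (forwarding && (rt->flags & SK_RTF_PMTU))
		return if_mtu;
	return rt->mtu;
}
```

With such a flag, a forwarded packet would ignore a learned 1200-byte entry but still honor a 1200-byte MTU that an administrator set by hand.
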

Gert Doering

2013-12-29 10:37:21 UTC

Hi,

Post by Greg Troxel
That's layering on a kludge to fix a bug. An interface has a configured
MTU, which can be changed if it's wrong. But an MTU and the implicit
MRU are really a property of link (v6 term).

True, on layer 3.

Untrue on higher layers, where an MTU is effectively a function of the
whole path between you and the system you are talking to.

I'd argue that for IPSEC, what is *relevant* is the MTU on the path
to the other side of the end host, not the local interface MTU - so
the route MTU is the most likely source of path MTU information that
tells the sender of the to-be-IPSEC-encapsulated packet how to avoid
fragmentation.

gert

--
USENET is *not* the non-clickable part of WWW!
//www.muc.de/~gert/
Gert Doering - Munich, Germany ***@greenie.muc.de
fax: +49-89-35655025 ***@net.informatik.tu-muenchen.de


Gert Doering

2013-12-29 10:39:32 UTC

Hi,

Post by Greg Troxel
Why is it ok to decline to forward packets because they are bigger
than the route MTU, when the route MTU is about PMTU-D to be used for
locally-sourced packets?

Because IPSEC-encapsulated packets *are* locally-sourced, on the outer
header?

gert

--
USENET is *not* the non-clickable part of WWW!
//www.muc.de/~gert/
Gert Doering - Munich, Germany ***@greenie.muc.de
fax: +49-89-35655025 ***@net.informatik.tu-muenchen.de


Jose Luis Rodriguez Garcia

2013-12-30 19:04:28 UTC

I see the following problem with using the route MTU instead of the
iface MTU. I know that it isn't a usual case.

If a router in the path after our server routes based on the source
address, the iface MTU isn't an accurate value. I think that
iface MTU will work in this case.
