FAST_IPSEC fragmentation problem

Discussion:

(too old to reply)

Beverly Schwartz

2012-10-16 20:05:11 UTC

I have detected a problem in which FAST_IPSEC does not return a fragmentation needed icmp message.

Here's the setup:

A=======B--------C

Between A & B is an IPsec esp tunnel. B & C are in the clear.

A initiates a wget from C. After initial TCP handshaking, C sends a packet
with TCP data to fill the mtu between B & C.

B tries to forward the packet to A after adding a new IP header, ESP header,
and padding, but the resultant packet is too large.

B does *not* send a fragmentation needed packet back to C. Instead, C times
out and black hole detection comes into play, C starts sending minimally sized
packets, and then the data gets through. This delays the (small) transfer by
about 5 seconds. For large transfers, this has a signifcant impact on
throughput because of the small-sized packets.

If you look at netipsec/ipsec_output.c, it looks like fragmentation is being
taken care of, but it is not. This is what happens.

In function ipsec4_process_packet, based on the policy set for setting the DF
bit on the outer packet (ip4_ipsec_dfbit) and the setting of the bit in the
inner packet is whether or not the DF bit gets set in the outer packet. This
code does work as expected.

Continuing further down ipsec4_process_packet, we see a call to do the output
transform:
error = (*sav->tdb_xform->xf_output)(m, isr, NULL, i, off);

In my case, we are doing crypto, and this function maps to esp_output.

Looking in netipsec/xform_esp.c, we can see crypto_dispatch called.
crypto_dispatch will queue the packet to have crypto done, returning
ENOENT.

We can see this is netinet/ip_output.c after the call to
ipsec4_process_packet. In ip_output, error is changed to 0 because no
error has occurred - we're just waiting to be called back when opencrypto
is done. Control jumps to done, and ip_forward sees no error. ip_forward
returns.

When crypto has done it's thing, it puts the packet on another queue, and
calls esp_output_cb. esp_output_cb calls ip_output. On this run through
ip_output, the DF bit is checked, and we do have an MTU error. However,
we're no longer returning to ip_forward which has all the context we need
to call icmp_error. Even if esp_output_cb could detect the failure is an
mtu problem, all it has is an encrypted packet, and has no way to figure
out where to send an icmp error message to.

So no fragmentation needed message is sent.

I have been trying different ways to deal with this. What seems easiest
to me, is in the case where the DF bit is set, add a call before the call
to ipsec4_process_packet in ip_output which determines how many bytes
IPsec will add. Then check current packet length plus the extra bytes
against mtu. If this fails, return EMSGSIZE. Here's the code, placed
right before the call to ipsec4_process_packet:

/*
* Check that MTU is sufficient.
*/
if (ntohs(ip->ip_off) & IP_DF) {
size_t ipsec_hdrlen = ipsec_hdrsiz(sp);
if (ip_len + ipsec_hdrlen > mtu) {
if (flags & IP_RETURNMTU)
*mtu_p = mtu - ipsec_hdrlen;
error = EMSGSIZE;
IP_STATINC(IP_STAT_CANTFRAG);
splx(s);
goto bad;
}
}

ipsec_hdrsiz is declared a static function, but this is easy to change.
I prefer calling ipsec_hdrsiz directly rather than using ipsec4_hdrsiz
because we already have the policy so don't need to get it again.

This does not fully solve our problem, however. In ip_forward, we
call ipsec4_hdrsiz in order to determine the mtu size to send
in the icmp error message. Well the calculation doesn't work, and
we end up with an mtu of 1500. No help there.

Fortunately, this problem is easy to fix. Because we set *mtu_p,
ip_forward already has the new mtu. So in case EMSGSIZE in ip_forward,
after setting type and code, we need to add:
if (destmtu != 0)
break;

The only problem I'm still struggling with is that ipsec_hdrsiz returns
a strange (meaning odd number) header length. I think this should be
decreased until we have a multiple of 4. But that's just my opinion.

BTW, IPv6 doesn't quite run into this because it just applies source
fragmentation to the new packet. IPv6 should not fragment midstream,
so this is probably not desired behavior. However, one could argue
that the encapsulated packet is a new packet, therefore fragmentation
is allowed. In my opinion, this doesn't ring true to the spirit of the
IPv6 spec.

-Bev

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Gert Doering

2012-10-16 20:34:56 UTC

Permalink

Hi,

Post by Beverly Schwartz
BTW, IPv6 doesn't quite run into this because it just applies source
fragmentation to the new packet. IPv6 should not fragment midstream,
so this is probably not desired behavior. However, one could argue
that the encapsulated packet is a new packet, therefore fragmentation
is allowed. In my opinion, this doesn't ring true to the spirit of the
IPv6 spec.

IPv6 says "a router must not fragment someone else's packets". An IPSEC
device is fragmenting its own packets and (the important bit) at the
end of the tunnel, the original packet emerges unfragmented.

Whether you fragment in IPSEC tunnels, segment in ATM cells, etc. - as
long as you do not modify the original IPv6 packet, you're fine.

gert

--
USENET is *not* the non-clickable part of WWW!
//www.muc.de/~gert/
Gert Doering - Munich, Germany ***@greenie.muc.de
fax: +49-89-35655025 ***@net.informatik.tu-muenchen.de

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Beverly Schwartz

2012-10-25 19:10:56 UTC

Permalink

This is a continuation of the email I posted on 10/16/12 about FAST_IPSEC not sending "fragmentation needed" messages. I have included that email at the end of this email.

As part of my fix, I now have ip_forward use the mtu returned from ip_output via the mtu_p pointer. The problem is that breaks KAME_IPSEC. KAME_IPSEC doesn't update the mtu value, so with my fix outlined below, KAME_IPSEC will now send a "fragmentation needed" message with a value of 1500 (or whatever the mtu of the interface is). So I have modified the KAME_IPSEC block in ip_output.c to parallel what I do in the FAST_IPSEC block. (See below.)

This all works, but there are a couple of troubling details.

1. ipsec_hdrsiz returns the maximum possible number of bytes IPsec may add, rather than the actual number of bytes that *will* be added to the packet. With my fix, some packets that could actually fit in the MTU will be bounced because ipsec_hdrsiz reports a number of bytes that is greater than what is actually added.

2. ipsec_hdrsiz returns an odd number, so the "fragmentation needed" packet contains an mtu that is an odd number. (This was true before any of my fixes, so is a different but related problem. See esp_hdrsiz in netipsec/xform_esp.c for FAST_IPSEC or netinet6/esp_output.c for KAME_IPSEC.)

For the short term, I am proposing that we use ipsec_hdrsiz as is, but in ip_forward, reduce destmtu until it is on a 4-byte boundary.

For a longer term fix, I propose having two functions, one that calculates the actual number of bytes added to a packet, and another that calculates the maximum possible number of bytes that can be added based on IPsec policy. The calculation should push the number up to a 4-byte, 8-byte or whatever boundary, based on what boundary the IPsec policy uses for padding.

Thoughts? Comments?

-Bev

Subject: FAST_IPSEC fragmentation problem
Date: October 16, 2012 4:05:11 PM EDT
I have detected a problem in which FAST_IPSEC does not return a fragmentation needed icmp message.
A=======B--------C
Between A & B is an IPsec esp tunnel. B & C are in the clear.
A initiates a wget from C. After initial TCP handshaking, C sends a packet
with TCP data to fill the mtu between B & C.
B tries to forward the packet to A after adding a new IP header, ESP header,
and padding, but the resultant packet is too large.
B does *not* send a fragmentation needed packet back to C. Instead, C times
out and black hole detection comes into play, C starts sending minimally sized
packets, and then the data gets through. This delays the (small) transfer by
about 5 seconds. For large transfers, this has a signifcant impact on
throughput because of the small-sized packets.
If you look at netipsec/ipsec_output.c, it looks like fragmentation is being
taken care of, but it is not. This is what happens.
In function ipsec4_process_packet, based on the policy set for setting the DF
bit on the outer packet (ip4_ipsec_dfbit) and the setting of the bit in the
inner packet is whether or not the DF bit gets set in the outer packet. This
code does work as expected.
Continuing further down ipsec4_process_packet, we see a call to do the output
error = (*sav->tdb_xform->xf_output)(m, isr, NULL, i, off);
In my case, we are doing crypto, and this function maps to esp_output.
Looking in netipsec/xform_esp.c, we can see crypto_dispatch called.
crypto_dispatch will queue the packet to have crypto done, returning
ENOENT.
We can see this is netinet/ip_output.c after the call to
ipsec4_process_packet. In ip_output, error is changed to 0 because no
error has occurred - we're just waiting to be called back when opencrypto
is done. Control jumps to done, and ip_forward sees no error. ip_forward
returns.
When crypto has done it's thing, it puts the packet on another queue, and
calls esp_output_cb. esp_output_cb calls ip_output. On this run through
ip_output, the DF bit is checked, and we do have an MTU error. However,
we're no longer returning to ip_forward which has all the context we need
to call icmp_error. Even if esp_output_cb could detect the failure is an
mtu problem, all it has is an encrypted packet, and has no way to figure
out where to send an icmp error message to.
So no fragmentation needed message is sent.
I have been trying different ways to deal with this. What seems easiest
to me, is in the case where the DF bit is set, add a call before the call
to ipsec4_process_packet in ip_output which determines how many bytes
IPsec will add. Then check current packet length plus the extra bytes
against mtu. If this fails, return EMSGSIZE. Here's the code, placed
/*
* Check that MTU is sufficient.
*/
if (ntohs(ip->ip_off) & IP_DF) {
size_t ipsec_hdrlen = ipsec_hdrsiz(sp);
if (ip_len + ipsec_hdrlen > mtu) {
if (flags & IP_RETURNMTU)
*mtu_p = mtu - ipsec_hdrlen;
error = EMSGSIZE;
IP_STATINC(IP_STAT_CANTFRAG);
splx(s);
goto bad;
}
}
ipsec_hdrsiz is declared a static function, but this is easy to change.
I prefer calling ipsec_hdrsiz directly rather than using ipsec4_hdrsiz
because we already have the policy so don't need to get it again.
This does not fully solve our problem, however. In ip_forward, we
call ipsec4_hdrsiz in order to determine the mtu size to send
in the icmp error message. Well the calculation doesn't work, and
we end up with an mtu of 1500. No help there.
Fortunately, this problem is easy to fix. Because we set *mtu_p,
ip_forward already has the new mtu. So in case EMSGSIZE in ip_forward,
if (destmtu != 0)
break;
The only problem I'm still struggling with is that ipsec_hdrsiz returns
a strange (meaning odd number) header length. I think this should be
decreased until we have a multiple of 4. But that's just my opinion.
BTW, IPv6 doesn't quite run into this because it just applies source
fragmentation to the new packet. IPv6 should not fragment midstream,
so this is probably not desired behavior. However, one could argue
that the encapsulated packet is a new packet, therefore fragmentation
is allowed. In my opinion, this doesn't ring true to the spirit of the
IPv6 spec.
-Bev

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de