MTU discovery (Theodore Y. Ts'o)


Date: 	Thu, 20 Apr 2000 23:29:48 -0400
From: "Theodore Y. Ts'o" <tytso@MIT.EDU>
Subject: Re: Problems with MTU?
Newsgroups: fa.linux.kernel

   From: hans@grumbeer.inka.de (Hans-Joachim Baader)
   Date: 	Thu, 20 Apr 2000 09:41:45 +0200 (CEST)

   for several months I had the problem that access to some web sites
   fails mysteriously. This means I never got a reply although the
   servers were reachable with ping and telnet.

   Today I did

	   echo 1 > /proc/sys/net/ipv4/ip_no_pmtu_disc

   and suddenly all these sites worked again for me. I use kernel 2.2.x
   (2.2.pre15-19 currently). Some of the affected sites are
   dri.sourceforge.net (in fact, the whole sourceforge.net)
   and support.3com.com

   Are these web sites broken? I never had these problems under Windows,
   so this is probably not the case.

Actually, yes, the web sites are broken; but it's a relatively subtle
problem.  Here's what's going on.

The web sites are probably behind a firewall which is filtering all ICMP
packets (since some dumb firewall administrator read somewhere that all
ICMP packets are evil).  Somewhere in the network path between you and
the web site, there's a link which has a maximum MTU which is smaller
than the ethernet default of 1500 bytes.

Normally, dumb hosts would just send bytes using their local max MTU,
and then when the packets hit the link with the restricted MTU, the
packet would get fragmented at that point, and then it would get
reassembled at the receiver.  Fragmentation has the problem, though,
that losing a single fragment forces the retransmission of the entire
packet, which wastes network bandwidth.  And if the fragment was
dropped because of link congestion, having to retransmit all of the
fragments can actually make the congestion worse.

Hence, good network citizens (of which Linux is one) use Path MTU
discovery to try to determine the appropriate packet size for any
arbitrary path.  The way it works is very simple; in each direction, the
host sends packets with the Don't Fragment (DF) bit set.  When a packet
which is too large reaches the router just before the link with the
restricted MTU, since the DF bit is set, the router drops
the packet on the floor and sends back an ICMP Destination Unreachable
with a code which means "fragmentation needed and DF set", along with
the MTU of the constricting network link.  This allows the
sender to automatically determine the maximum MTU for that hop, and the
sender can then resend the TCP segment using a smaller packet size, now
that the path MTU is known.
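
The same probing can be done by hand.  What follows is a minimal sketch,
assuming the Linux iputils ping (where "-M do" sets the Don't Fragment
bit and "-s" gives the ICMP payload size); the function names and the
list of candidate MTUs are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes Linux iputils ping.  Function names are hypothetical.

# ICMP echo payload that yields an IPv4 packet of exactly $1 bytes:
# subtract the 20-byte IPv4 header and the 8-byte ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Try candidate MTUs from largest to smallest and print the first one
# for which a DF-marked echo request gets a reply from host $1.
probe_path_mtu() {
    host=$1
    for mtu in 1500 1492 1480 1450 1400 1300 1200 1000 576; do
        if ping -c 1 -W 2 -M do -s "$(icmp_payload_for_mtu "$mtu")" \
                "$host" >/dev/null 2>&1; then
            echo "$mtu"
            return 0
        fi
    done
    return 1
}
```

Something like "probe_path_mtu dri.sourceforge.net" would then print the
largest candidate that got through.  It relies only on echo replies
coming back, not on the "too big" message, so it fails only when echo
traffic itself is being filtered.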

The problem with this comes from, as I mentioned earlier, bone-headed
firewall maintainers who believe that all ICMP packets are bad and
filter all of them.  This includes the ICMP Destination Unreachable
packets which are needed to make Path MTU discovery function correctly.
As a result, a site which is behind one of these firewalls will
continually send big packets with the Don't Fragment bit set, which then
get rejected when they hit the constricting link, but since the firewall
filters out the ICMP "too big" message, the sending site never knows
that the packets are getting rejected, and so it can never send you
anything.

This doesn't come up in normal operation for most hosts because most
links support at least the ethernet maximum MTU of 1500, and if there is
a constricting link, it is at the client endpoint (for example, the
client is dialing up with PPP and so has a restricted MTU).  At the
client end point, it's not a problem, since the client knows that it's
sending packets to an interface with a restricted MTU, and so it sends
small packets.  The problem comes when the constricting link is in the
*middle* of the network path, and so Path MTU discovery is required in
order to make things work.  (Either that, or you have to learn to live
with fragmentation and its attendant disadvantages.)

So this will come up if you are using PPP to connect to your gateway,
and then you use NAT to allow machines on the local ethernet to gain
access to the network via your singleton PPP connection.  It also comes
up with those folks using DSL where the providers are using the
abomination also known as PPP over Ethernet (PPPOE).
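
To put numbers on the constriction, here is a small sketch of the
arithmetic.  The overhead figures are the standard ones (8 bytes for
PPPoE, 20 bytes for IP-in-IP with no options); the function and
variable names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: names are hypothetical.

# MTU left for IP packets after a tunnel adds $2 bytes of overhead
# to a link whose native MTU is $1.
tunnel_mtu() {
    echo $(( $1 - $2 ))
}

# PPPoE: 6-byte PPPoE header plus 2-byte PPP protocol field.
PPPOE_OVERHEAD=8
# IP-in-IP: one extra 20-byte IPv4 header (no IP options).
IPIP_OVERHEAD=20
```

With these, "tunnel_mtu 1500 $PPPOE_OVERHEAD" gives 1492, which is why a
full-size 1500-byte ethernet packet from a machine behind the gateway
no longer fits.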

   Could anyone please explain this? Is there a better solution than
   disabling MTU discovery?

There are a few approaches you can use to solve this problem.  One
is to ask the web site administrators nicely to fix their
firewall.  Unfortunately, this doesn't always work.  I am told that
Amazon's web programmers were told about this problem, and they
acknowledged it *as* a problem, but said they weren't allowed to fix it.
(Probably because the bone-headed firewall administrator in their case
was clueless, and they didn't have the power to override him.)

The second approach, assuming that the constricting link is close to you
(usually it's the second-to-last hop in most scenarios), is to set a
per-route maximum segment size (MSS) parameter, which will force TCP
segments using that particular route to be no larger than the MSS, in
both directions.  Given that the problem is usually on the link
to the outside internet, and that connections to other hosts within the
subnet are OK, it's usually a matter of setting the mss option only on
the default route:

route add default gw 216.176.176.160 mss 1400
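
As a rough guide to picking the number, here is a sketch that assumes
plain 20-byte IPv4 and 20-byte TCP headers with no options; the function
names are made up, and the second one only prints the command (running
it for real needs root).

```shell
#!/bin/sh
# Sketch only: names are hypothetical.

# Largest TCP payload that fits in one IPv4 packet of $1 bytes,
# assuming 20-byte IP and 20-byte TCP headers with no options.
mss_for_mtu() {
    echo $(( $1 - 40 ))
}

# Print (do not run) the route command in the form used above, for
# gateway $1 and a constricting-link MTU of $2.
default_route_cmd() {
    echo "route add default gw $1 mss $(mss_for_mtu "$2")"
}
```

For a 1492-byte PPPoE link, "mss_for_mtu 1492" gives 1452, so the 1400
used above simply leaves some extra headroom.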

The final approach, which apparently some of the DSL providers use, is
that on the cable modem box or on the DSLAM in the telco's central
office, they actively mess with the outgoing packets: they look
for TCP packets with the SYN bit set (which indicates the beginning of a
TCP stream), and rewrite the MSS option in the TCP header to be
small enough to fit within the MTU left over after the PPP over Ethernet
overhead.  This is incredibly ugly, and violates the end-to-end argument
behind the IP architecture.  It
also breaks in the presence of IPSEC, since the DSL provider won't be
able to muck with the packet without breaking the cryptographic
checksums.

So it's completely ugly.  Still, I imagine that Rusty will no doubt be
writing a new "packet fucking" module in ipchains to support this kind
of TCP syn option rewriting.  :-)
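
For reference, later Linux kernels with netfilter/iptables (not
ipchains) expose this kind of SYN rewriting as the TCPMSS target.  A
minimal sketch, assuming iptables is what's installed; the function name
and the ppp0 interface are made up, and the function only prints the
rule, since installing it needs root.

```shell
#!/bin/sh
# Sketch only: assumes netfilter/iptables with the TCPMSS target.
# The function name is hypothetical.

# Print (do not run) a rule that rewrites the MSS option on forwarded
# SYN packets leaving interface $1 so that it is at most $2 bytes.
mss_clamp_rule() {
    echo "iptables -t mangle -A FORWARD -o $1 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $2"
}
```

For a PPPoE uplink, "mss_clamp_rule ppp0 1452" prints the rule to paste
in as root.  The same target also accepts --clamp-mss-to-pmtu in place
of --set-mss, which derives the value from the route instead of a fixed
number.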

						- Ted

Date: 	Fri, 21 Apr 2000 15:35:58 -0400
From: "Theodore Y. Ts'o" <tytso@MIT.EDU>
Subject: Re: P-MTU discovery
Newsgroups: fa.linux.kernel

   Date: Fri, 21 Apr 2000 11:55:04 -0400 (EDT)
   From: jamal <hadi@cyberus.ca>

   I don't know whether telcos are already doing this, but we certainly are in
   Linux. I point the finger to Marc Boucher. He did it! 

Ah, I hadn't realized someone had done it already.  Is it in ipchains?

   The reason is very simple: NAT that good old friend of IPSEC. 
   When you have lotsa boxes that you are masquerading for it is hell to go
   around and start changing their MTU values or doing any sort of per-box
   changes.

Actually, the hack is useful even if you're not doing NAT; it helps any
time you have a configuration where a gateway box is doing some
kind of tunnelling (either PPPOE or IP-IP or something else), and you
have lots of client machines behind the tunnel endpoint, which makes
per-box changes a pain.

If you're using dhcp, something you can do to avoid having to change all
of the boxes one at a time is to set the interface-mtu using dhcp to
1400 or 1450.  The disadvantage of doing this is that *all* packets get
sent with the restricted MTU, not just ones going out through the
tunnel/gateway.  (You'd really like to be able to set a per-route MSS,
but dhcp doesn't appear to have a way of doing that right now.)

   Disabling PMTU at the masquerading box also doesn't help because 
   PPPOE adds an extra shim header to the packet. It will break IPSEC in 
   most cases (maybe not in the case where your masquerading box is also your
   IPSEC gateway). 

Right; that's the problem.  PPPOE, because it adds a shim header,
constricts the link MTU, and so you need to do PMTU discovery at the
endpoints.  And in either case, doing PMTU discovery doesn't help if you
have something in the path which is filtering the ICMP messages.

   From a philosophical angle:
   there is no panacea for these kind of problems. I wonder how long you've 
   been chasing them. You will continuously chase people to try and fix things
   for IPSEC's sake ;-> I wonder how you plan to deal with all those "content
   switching" startups (since that is the greatest thing since sliced bread
   these days). Is the end2end argument really a dead horse? (I am ducking
   ahead of time). Maybe what the IETF needs is to take all chairs into some
   end2end non-breakage indoctrination and give them a qualifying test first.

Here's the problem.  End2end is a great design principle, but it
fundamentally assumes that the intelligence is at the endpoints, and the
middle of the network isn't supposed to do anything special/magical.
But as the internet gets bigger and bigger, trying to change all of the
endpoints to add security, or to handle paths with long latencies
efficiently, gets harder and harder.  And so, it gets easier to make
changes in the middle of the network.  And most of the (to use Rusty's
phrase) "packet fucking" techniques come from this dilemma: NAT's
(easier than IPv6), firewalls (easier than doing real end-point
security), tcp ack spoofing (easier than upgrading Windows TCP stacks to
make them work correctly over satellite links), etc.

One could argue that by violating the IP architecture, they're engaging
in hill-climbing optimizations that in the long-run will cause someone a
lot of pain.  Some things simply won't work if you play such games, and
as long as you acknowledge that fact, use them in good health.

So I've used NAT's before, even though I think that fundamentally
they're evil, because it solved the limited problem I needed to solve at
the time.  But I didn't consider them first class objects, but treated
them rather as kludges.  So if things broke because of the NAT, I knew
it was coming to me, and I would deal.  One of the ways I dealt was to
get myself a /27 at home, but I realize not everyone can get that.

The problem is that more and more users are using things like NAT's and
MSS adjusters, etc., and they don't understand that they're kludges.  So
when other protocols start breaking, they blame those other protocols
instead of correctly placing the blame where it belongs.

   Having said that, there could be an alternative solution in Linux. The
   PPPOE code could be made, after dropping the packet, to generate ICMP "too
   big" messages back to the masqueraded boxes instead (when packet-size
   >PMTU-shim_header). Hopefully, the win* boxes know what to do with these 
   messages. And this will work also for UDP. Marc?

That doesn't help.  We're doing this today already; it's required by the
RFC's, after all.  The problem is that the sender of the big packet has
to receive the ICMP, and if there's something filtering the ICMP
message, you're stuck.

							- Ted
