Certain network links -- even now -- have issues with (a) a low MTU (which by modern convention means an MTU of less than 1500 bytes) and (b) broken PMTU (Path MTU) Discovery (usually caused by some ICMP being gratuitously blocked). This means that any TCP connection through the affected link will stall as soon as the TCP window opens up enough for the sender to start sending full-sized TCP packets.

The hacky workaround is to limit the TCP MSS (Maximum Segment Size), so that TCP never tries to send a packet bigger than the path MTU. This has a performance cost, but it is not as dramatic as the complete lack of packets flowing as soon as the hidden MTU is exceeded.

On Linux this can be done with:

iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss *SIZE* -d *AFFECTEDHOST*

(and there are some other options too). In that command, "SYN,RST" lists the flags to examine, and "SYN" is the flag that must be set (with the other, RST, not set). ACK is not examined at all, so this should -- in theory -- match both SYN and SYN/ACK and thus affect connection setup in both directions.
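As a rough illustration of that matching rule, the sketch below emulates it in plain shell for a comma-separated list of flags. The function name matches_syn_rule is made up for this example and is not part of iptables; it only mirrors the logic described above.

```shell
#!/bin/sh
# Emulate the match "--tcp-flags SYN,RST SYN" for a comma-separated list of
# the flags set on a packet: SYN must be present and RST must be absent.
# Flags outside the examined set (such as ACK) are ignored.
# matches_syn_rule is a hypothetical helper, not an iptables feature.
matches_syn_rule() {
    case ",$1," in
        *,RST,*) return 1 ;;   # RST is examined and must not be set
        *,SYN,*) return 0 ;;   # SYN set, RST clear: the rule matches
        *)       return 1 ;;   # SYN not set
    esac
}

for flags in SYN SYN,ACK SYN,RST ACK; do
    if matches_syn_rule "$flags"; then
        echo "$flags: match"
    else
        echo "$flags: no match"
    fi
done
```

Only the first two cases (SYN and SYN/ACK) report a match, which is why a single rule covers both halves of the handshake.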

The MSS should typically be set to 40 bytes less than the path MTU, to allow for the IP and TCP headers at 20 bytes each (more if IP options are being used too).
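The arithmetic is simple enough to do by hand, but a small helper makes it harder to get wrong. This is a minimal sketch assuming IPv4 and a plain 20 byte TCP header; mss_for_mtu is a name invented here, and the optional second argument is just an extra allowance for cases such as IP options.

```shell
#!/bin/sh
# MSS = path MTU - 20 (IPv4 header) - 20 (TCP header) - any extra allowance.
# mss_for_mtu is a hypothetical helper name, not a standard tool.
mss_for_mtu() {
    mtu=$1
    extra=${2:-0}       # optional extra bytes to reserve, e.g. for IP options
    echo $(( mtu - 20 - 20 - extra ))
}

mss_for_mtu 1500    # prints 1460
mss_for_mtu 1496    # prints 1456
mss_for_mtu 1492    # prints 1452
```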

To deal with MTU issues in both directions, both the SYN and the SYN/ACK need to have their MSS clamped; otherwise the two TCP instances will learn asymmetric MSS values, which will result in traffic working in only one direction. The above works if the place doing the clamping is either the originating host, or a router in between doing something horrible by changing the traffic on the way past. However, on the destination host it only changes the MSS in one direction, which is not sufficient. To apply this MSS clamping in both directions on the destination host it is necessary to do:

iptables -t mangle -A INPUT  -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss *SIZE* -s *AFFECTEDHOST*
iptables -t mangle -A OUTPUT -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss *SIZE* -d *AFFECTEDHOST*

(which will only affect traffic going in/out of the host.)

For instance this script:

#! /bin/sh
# Clamp the TCP MSS (max segment size) to avoid sending TCP packets bigger
# than hidden Path MTU, where the PMTU cannot be discovered via ICMP due
# to broken network configuration.
#
# Written by Ewen McNeill <ewen@naos.co.nz>, 2009-12-07
#---------------------------------------------------------------------------

PATH=/usr/bin:/bin:/usr/sbin:/sbin

MSS=1452
REMOTEIP="${1}"

if [ -n "${REMOTEIP}" ]; then
   SRC_SPEC="-s ${REMOTEIP}"
   DST_SPEC="-d ${REMOTEIP}"
else
   SRC_SPEC=''
   DST_SPEC=''
fi

iptables -t mangle -A INPUT  -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss "${MSS}" $SRC_SPEC
iptables -t mangle -A OUTPUT -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss "${MSS}" $DST_SPEC
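
Since the script only appends rules, it can be handy to see exactly what would be run, and to have the matching removal commands. The following is a sketch, not part of the original script: print_clamp_rules is a hypothetical helper that only prints the iptables commands (using -A to append or -D to delete) so they can be reviewed first, and 192.0.2.10 is a documentation placeholder address.

```shell
#!/bin/sh
# print_clamp_rules ACTION MSS [REMOTEIP]
# ACTION is A (append) or D (delete). The commands are printed, not run.
# print_clamp_rules is a hypothetical helper invented for this example.
print_clamp_rules() {
    action=$1
    mss=$2
    remote=$3
    src_spec=''
    dst_spec=''
    if [ -n "$remote" ]; then
        src_spec=" -s $remote"
        dst_spec=" -d $remote"
    fi
    echo "iptables -t mangle -$action INPUT -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $mss$src_spec"
    echo "iptables -t mangle -$action OUTPUT -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss $mss$dst_spec"
}

print_clamp_rules A 1452 192.0.2.10   # commands that add the rules
print_clamp_rules D 1452 192.0.2.10   # matching commands that remove them
```

After checking the output, it can be piped to sh as root to apply it.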

And a note on testing this:

  • On Linux and BSD (including Mac OS X) ping takes a payload size argument (-s)

  • packets are emitted at payload size + ICMP header ("ping tracking overhead", 8 bytes) + IP header (20 bytes)

  • so payload size of 1472 is the largest one would expect to fit in a "normal" (1500 byte) packet

  • a maximum payload size of 1468 (e.g. as in the example that prompted these notes) indicates that 4 bytes are being used up somewhere else, most likely by an 802.1Q tag or an MPLS label (typically caused by doing 802.1Q or Q-in-Q on a device that only has a 1514 byte maximum frame size -- MPLS or 802.1Q requires 1518 byte frames, and Q-in-Q requires 1522 byte frames, at least if there is to be no user-visible impact on the Path MTU)

  • MSS typically needs to be 40 bytes less than the path MTU (i.e. 12 bytes less than the largest working ping payload size)

  • The OS X ping has a nice ping size "sweep" argument:

    • -g SWEEPMINSIZE
    • -G SWEEPMAXSIZE
    • -h INCREMENTSIZE
  • For example:

    ping -g 1450 -G 1480 -h 2 DESTIP
    

    which can be used to automatically track down gaps like this, although to be most useful the "Don't Fragment" bit also needs to be set:

    ping -g 1450 -G 1480 -h 2 -D DESTIP
    

    otherwise it will find only particularly gratuitous examples of problems.

  • In contrast, mtr has a packet size argument that includes all the IP headers and ICMP headers and tracking overhead, setting the total size of the packets. So:

    mtr -s 1496 DESTIP    # equivalent to 1468 payload size to ping
    mtr -s 1500 DESTIP    # equivalent to 1472 payload size to ping
    
  • Various routers differ in whether the packet size given in their built-in CLIs includes or excludes these headers, so check which convention applies before comparing numbers
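
For Linux, which lacks the sweep options, the same search can be approximated with a loop. The sketch below assumes the iputils ping (where "-M do" sets Don't Fragment, "-W" is the reply timeout in seconds, and "-s" is the payload size); the function names probe, largest_payload and report are invented for this example. It also applies the size arithmetic from the notes above to turn a ping payload size into the path MTU, the mtr size, and the MSS.

```shell
#!/bin/sh
# probe SIZE HOST: succeed if one ping with SIZE payload bytes and Don't
# Fragment set gets a reply. Assumes Linux iputils ping.
probe() {
    ping -c 1 -W 1 -M do -s "$1" "$2" >/dev/null 2>&1
}

# largest_payload MIN MAX STEP HOST: print the largest payload size in the
# sweep that got a reply, or nothing if none did.
largest_payload() {
    size=$1
    best=''
    while [ "$size" -le "$2" ]; do
        if probe "$size" "$4"; then
            best=$size
        fi
        size=$(( size + $3 ))
    done
    [ -n "$best" ] && echo "$best"
    return 0
}

# report PAYLOAD: derive the path MTU, mtr size and MSS from a ping payload.
report() {
    mtu=$(( $1 + 8 + 20 ))      # ICMP header + IPv4 header
    echo "payload=$1 mtu=$mtu mtr_size=$mtu mss=$(( mtu - 40 ))"
}

# Usage (needs network access; DESTIP is a placeholder):
#   report "$(largest_payload 1450 1480 2 DESTIP)"
report 1468     # prints payload=1468 mtu=1496 mtr_size=1496 mss=1456
```

A single probe per size means one lost packet at the boundary can understate the result, so it is worth repeating the sweep before trusting the number.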