On TTLs, heroes and loops


mills@dcn6.arpa
29-Mar-86 20:45:40-UT


Phil, Art and folks,

We happen to use a TTL of 30, so that the maximum time a looper can persist is
less than the reassembly timeout. We arrived at values of this order after
seven years of hacking in these swamps, breaking and being broken in wonderful
ways, but I would not heroically defend our particular choice.

The plea of this note is to resist the urge to crank up the TTL in order to
survive legitimate paths involving many gateway hops. First, consider the
issue of a reasonable lower bound. The largest number of hops reported by the
core system in EGP updates is now five, but values even that high probably
indicate that GGP is broken and "counting to infinity." The EGP hop count
represents a lower bound on the core portion of the path plus whatever the EGP
peer stuffs into the hop-count field, usually zero. Our swamp can involve four
additional hops for nets, subnets and the like, which is probably not
unreasonable for places like MIT, CMU and Stanford as well. Assuming swamps
like ours at both ends of the core path suggests a path with something over
twelve hops is unlikely but possible. I conclude that 15 is a defensible lower
bound and that 30 is probably adequate until such time as the Internet goes
intergalactic.
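
The hop arithmetic above can be tallied in a few lines. This is only a sketch:
the figures are the ones quoted in this note, while the simple additive model
and the names CORE_HOPS, SWAMP_HOPS and worst_case_hops are assumptions made
for illustration.

```python
# Figures quoted in the note; adding them up this way is an assumption.
CORE_HOPS = 5    # largest hop count reported by the core in EGP updates
SWAMP_HOPS = 4   # additional hops for nets, subnets and the like at one end


def worst_case_hops(core=CORE_HOPS, swamp=SWAMP_HOPS):
    """Path length with a swamp like ours at both ends of the core path."""
    return swamp + core + swamp


hops = worst_case_hops()
for ttl in (15, 30):
    print(f"TTL {ttl}: a {hops}-hop path leaves {ttl - hops} hops to spare")
```

With a swamp at each end the total comes to thirteen hops, which fits under
either candidate TTL.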

Now consider the costs of setting the TTL too high. If the rate of destruction
of IPgrams due to loops is estimated by the incidence of ICMP Time Exceeded
messages, loops would seem not to occur too often. From our experience that
reasoning is faulty, since such loops usually consume all buffer resources,
including those necessary to forward the ICMP message. What makes
this acute is the fact that the sender begins to retransmit, which inserts
more stuff in the loop. Obviously, it would be desirable that no IPgram could
survive longer than the minimum retransmission interval, but that is clearly
impractical in the present architecture.

The transmission time of a 576-octet IPgram on a 9600-bps link is on the order
of a half second and at lower speeds you don't wanna think about it. Even with
a TTL of 15, a loop involving such a link, which is reasonably common, can
strain resources for periods longer than the typical TCP retransmission
timeout. Actually, following formation of the loop, what
usually happens is that during a period of a few minutes intense congestion
sets in until the customers all give up. Then every few minutes somebody,
usually a mail daemon, honks a TCP SYN segment and assumes an unrealistically
low retransmission timeout, which often is enough to trip the system again
into congestive collapse, especially if TTLs much greater than 15-30 are used.
We all know that mailers are exquisitely persistent.
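
To put rough numbers on the slow-link case, here is a back-of-the-envelope
sketch. It assumes a loop between two gateways across a single 9600-bps link,
so every hop crosses that link, that each hop decrements the TTL by exactly
one, and that queueing and propagation delays are zero. The function names are
hypothetical.

```python
def transmission_time(octets, bps):
    """Seconds needed to clock one IPgram onto a link of the given speed."""
    return octets * 8 / bps


def loop_lifetime(ttl, octets=576, bps=9600):
    """Rough time an IPgram circulates before its TTL runs out.

    Assumption: every remaining hop crosses the same slow link and
    decrements the TTL by one; queueing delay is ignored.
    """
    return ttl * transmission_time(octets, bps)


# 255 is the largest value the 8-bit TTL field can hold.
for ttl in (15, 30, 255):
    print(f"TTL {ttl:3d}: about {loop_lifetime(ttl):.1f} s in the loop")
```

Under these assumptions a TTL of 15 keeps one IPgram circulating for roughly 7
seconds and a TTL of 30 for roughly 14, and that is before any retransmissions
pile in behind it.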

Within the GGP core system we all know that loops can be particularly painful,
since in many scenarios a net bobbing up or down triggers a transient routing
loop, together with a spasm of updates that can last minutes while the
distances "count to infinity." It's a good thing infinity is a small number
(less than ten). Recent observation of our EGP tables indicates an alarming
number of nets, perhaps five at a time, seem to be doing just that. It is not
clear what is causing this; however, EGP gateway designers should realize that
dispersing reachability changes throughout the Internet is painful and slow,
so that internal changes should be filtered before being advertised
externally.
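
For readers who have not watched distances "count to infinity," the toy model
below shows the effect with two gateways. It is a generic distance-vector
sketch and not GGP's actual algorithm. The gateway names, the update order and
the value chosen for INFINITY are assumptions (the note says only that
infinity is less than ten).

```python
INFINITY = 8  # hypothetical "unreachable" distance, chosen for illustration


def count_to_infinity(infinity=INFINITY):
    """Two gateways, A and B; the net attached directly to A has gone down.

    B still holds a stale route (distance 2 via A). A believes B's
    advertisement, B then believes A's, and the distance climbs by one
    with every update until it reaches infinity.
    """
    dist = {"A": infinity, "B": 2}   # A has lost the net, B has not heard yet
    rounds = 0
    while min(dist.values()) < infinity:
        # Assumption: in each round A hears B's update first, then B hears A's.
        dist["A"] = min(infinity, dist["B"] + 1)
        dist["B"] = min(infinity, dist["A"] + 1)
        rounds += 1
    return rounds, dist


rounds, dist = count_to_infinity()
print(f"settled after {rounds} update rounds: {dist}")
```

Throughout those rounds A and B each point at the other for the dead net, so
any IPgram addressed to it loops between them. A small infinity keeps the
episode short.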

The biggest danger for loop formation may well be subnets and default
gateways. What happens is that somebody glitches a routing table in a gateway
handling the routing for a particular net and set of subnets and then defaults
everything else to the nearest friendly EGP gateway, which doesn't know about
subnets. Then, usually as the result of mismatched host-name/address tables
and routing tables, some innocent honks an IPgram at a host on some subnet not
in the routing table, so that a loop is formed between the subnet gateway and
the EGP gateway. The lesson for subnet gateway designers is acute and clear,
especially in scenarios like the above where two-thirds of the gateways on a
path might well be vulnerable to subnet loops like this.
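
The subnet/default-gateway loop can be sketched as a toy forwarding model. The
gateway names, the subnet numbers and the forward function are hypothetical,
and the model assumes the TTL drops by one each time a gateway forwards the
IPgram.

```python
def forward(dest_subnet, ttl, known_subnets):
    """Follow an IPgram that arrives at the subnet gateway.

    The subnet gateway delivers to subnets in its table and defaults
    everything else to the EGP gateway. The EGP gateway knows only the
    net, not its subnets, so it hands the IPgram straight back.
    """
    at = "subnet-gw"
    hops = 0
    while ttl > 0:
        if at == "subnet-gw":
            if dest_subnet in known_subnets:
                return "delivered", hops
            at = "egp-gw"       # no entry for this subnet: use the default
        else:
            at = "subnet-gw"    # EGP gateway routes on the net number alone
        ttl -= 1
        hops += 1
    return "time exceeded", hops


print(forward(dest_subnet=3, ttl=30, known_subnets={1, 2}))
print(forward(dest_subnet=1, ttl=30, known_subnets={1, 2}))
```

An IPgram for the missing subnet burns its entire TTL bouncing between the two
gateways, so the number of wasted transmissions scales directly with the TTL
the sender chose.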

My conclusion is that the dangers of overstating the TTL far outweigh the
dangers of understating it. I consider values between 15 and 30 to be
appropriate and larger values to be not only ill-advised, but potentially
damaging to the health of the entire community.

Dave
-------


