Yet more on RTTs

Van Jacobson
Fri, 14 Nov 86 12:51:35 PST

A few weeks ago there was a query about estimating packet round trip
time for an RDP implementation. I replied with some local measurements
that suggested that TCP's problems might be similar to RDP's: most
TCP conversations were so short that they behaved as datagrams. Based
on this, I suggested that a part of RTT maintenance be moved from
the TCP layer to the IP layer. If RTT really is common to RDP and TCP,
the IP layer is the logical place to put it. Based on measurement and
simulation, I have reason to believe that this move would improve
the Internet's present, abysmal performance.

I've been out of touch for two weeks (a problem of inter-personal
congestion control) and read the past two weeks of TCP-IP messages
last night. The RTT messages were disappointing: they addressed
problems whose solution is known and which are being solved (albeit
slowly). I think we're facing a whole new set of problems. In an
effort to promote some light (or heat), what follows is my simple
minded explanation of what's going on and what we might start to
do about it. (In what follows, "connection" means a conversation
between two processes over a network, not a TCP connection.)

RTT is measured to help deal with unreliable packet delivery. When
packets are delivered reliably, TCP and RDP are self-clocking. Since
we know that delivery is unreliable, we design our protocols to make
an educated guess about whether a particular packet has been lost:
If no "clock" has been received for a "long time" (relative to the
round trip time), the packet probably needs to be retransmitted.

There are two reasons for losing a packet:
  1) It was damaged or misplaced in transit.
  2) It was discarded due to congestion.
The appropriate recovery strategy depends on the reason: For (1),
the packet should be retransmitted as soon as possible. For (2),
the retransmission should happen after a "long" time (many times
the round trip time) so the congestion has a chance to clear (I'm
making the assumption that there's substantial buffering in the
subnet so the time constants for congestion are long -- this is
true of the nets I deal with and, given current memory prices,
likely to remain true).
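The two strategies above can be sketched in a few lines. This is a minimal, illustrative sketch (in Python, purely for compactness); the function name and the backoff multiple of 10 RTTs are my inventions, not anything from a real implementation:

```python
def retransmit_delay(cause, rtt):
    """How long to wait before retransmitting a lost packet.

    cause: "damage" (case 1) or "congestion" (case 2)
    rtt:   current round-trip-time estimate, in seconds
    """
    if cause == "damage":
        # (1) Damaged or misrouted in transit: resend as soon as possible.
        return 0.0
    # (2) Discarded by a congested gateway: back off for many round
    # trips so the queues in the subnet have a chance to drain.
    return 10.0 * rtt

retransmit_delay("congestion", rtt=0.5)   # -> 5.0 seconds
```

The whole argument of this message is about what to do when the sender cannot tell which `cause` applies.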

Given that the sender doesn't know whether (1) or (2) has occurred,
what strategy should be used? If the strategy for (2) is chosen when (1)
is the cause, the throughput on this connection will go down a bit. If
the strategy for (1) is chosen when (2) is the cause, the problem will
get much worse, both for this host and others on the net. In the
absence of other information, the principle of Least Damage tells us to
use the strategy for (2). [Experience also suggests that damaged
packets are unlikely -- the error rate on our worst net is <0.1%. But,
if you really have to get maximum throughput on a connection, as
opposed to maximum aggregate throughput on all your connections, the
Pollaczek-Khinchine equation says that the variance of the RTT
estimates can be used to distinguish (1) & (2). I design networks for
real-time control in hostile environments and occasionally make use of
this. It's not generally useful.]
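One way to read the variance remark above: queueing inflates both the mean and the spread of the RTT, so a timeout preceded by samples far above the mean (relative to the spread) suggests growing queues, i.e. cause (2). A rough sketch, with an invented threshold `k` and no claim to be the real-time technique alluded to above:

```python
import statistics

def likely_congestion(rtt_samples, latest_rtt, k=2.0):
    """Guess whether a loss is due to congestion (2) rather than
    damage (1), by how far the latest RTT sample sits above the
    recent mean, measured in standard deviations."""
    mean = statistics.mean(rtt_samples)
    dev = statistics.pstdev(rtt_samples) or 1e-9  # guard zero variance
    return (latest_rtt - mean) / dev > k

samples = [0.50, 0.52, 0.48, 0.51, 0.49]
likely_congestion(samples, 0.90)   # far above the mean -> True
likely_congestion(samples, 0.51)   # within normal spread -> False
```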

The strategy for (1) is very "local". It can be detected by the
process running the connection and that process can take corrective
action that should both cure the problem and have negligible effect
on other connections. The strategy for (2) is global. The congestion
detected on a connection is probably not caused by that connection.
In fact, it is likely that no single connection is the cause. Thus no
single action is going to cure the problem, only the combined effect of
several connections reducing their traffic rate. For this to happen,
each of those connections has to discover the problem (which means
sending packets, which aggravate the problem) including newly opened
connections. The recovery time is clearly going to be an exponential
with a long time constant.

A way to reduce the recovery time is to introduce more coupling between
the connections. Congestion is a property of network path(s), not of
connections. When one connection discovers congestion on a path,
that information should be made available to all connections in the
same machine using that path. This isn't hard to implement: A lot
of the congestion happens over paths that look like:

 Host A-| |                     | |
        | |                     | |
        | |----B-----------C----| |
        | |                     | |
        | |                     | |-Host D
Where A is talking to D, the vertical lines are relatively high-speed,
local nets and the horizontal line is a low-speed, long-haul net(s). The
difference in net speeds means that any congestion will almost certainly
occur somewhere on the path from B to C. This means that, from A's
point of view, the round trip time to D is characteristic of the RTT to
any host served by C (I have data which says that the gateway accounts
for 90% of the variation in RTT). If A contains a routing entry for C
(IP requires a routing entry in A for B or C or both), a slot could be
left in that entry for RTT. If TCP, RDP, etc., used that slot for
the value in all their RTT calculations, information about the path
would automatically be shared (and also wouldn't be lost when a TCP
connection closed). Just this much change would eliminate the
"turn-on transient" of retransmissions that occur while a TCP
connection is learning the RTT.

Once one stops regarding RTT as a property of connections and starts to
regard it as a measured, dynamic property of the topology, some related
ideas start to look interesting. Like B telling A topology and A
telling B transit times (the stability problems of the old Arpanet
routing protocol shouldn't show up if this is only done locally and
RTT(local) << RTT(long haul)). Or treating "source quench" as if it
meant "I'm congested" rather than "You should shut up" (it obviously
means both). Under the first interpretation, it is information about
the state of part of the path. If gateways along the return path
wiretap, the information serves several hosts rather than a
single TCP conversation (and we start to get distributed congestion
control via "choke packets" which have some nice properties if there's
enough buffering in the subnet to handle the diffusion time.)
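Under the "I'm congested" interpretation, a source quench would mark the shared path entry, not just one conversation. A sketch under that assumption (the 30-second hold-down is an invented, illustrative constant):

```python
import time

def on_source_quench(routes, gateway):
    # Treat an ICMP source quench as "the path through this gateway is
    # congested", so every conversation using the path backs off, not
    # just the one that happened to receive the quench.
    entry = routes.setdefault(gateway, {})
    entry["congested_until"] = time.time() + 30.0  # illustrative hold-down

def path_congested(routes, gateway):
    entry = routes.get(gateway, {})
    return time.time() < entry.get("congested_until", 0.0)
```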

[I'm sure I'll be toasted for something in the preceding paragraph,
if not for the rest of this opus. We learn by making mistakes.]

I'll close with a brief reiteration of my context. As the round trip
time of the Internet has gotten worse, the nature of our (locally
generated) traffic has changed. Only someone desperate or mad would
try to telnet. Our ftp lacks an automatic retry and it took only a
few "connection timed out"s to make our users abandon file transfer.
The result is a high proportion of mail traffic. Our usual congestion
is not caused by a few hosts flooding the net with packets (perhaps
because the few hosts we've found doing this were quickly and forcibly
disconnected). Our usual congestion is the result of a large fraction
of the 200 hosts on an ethernet trying to ship mail through a gateway
with a 9.6Kbit output line. Each host sends 3 small packets and one big
one, "HELO", "MAIL FROM", "RCPT TO" and "DATA..." (the small packets
are SMTP's fault -- thanks to John Nagle's accumulate-until-the-ack, all
the packets are as big as they can be). The destinations are usually
different. I don't know of a congestion algorithm that deals with this
situation but I feel we need one.

  - Van Jacobson
