Re: 4.2/4.3 TCP and long RTTs

Van Jacobson
Sun, 07 Dec 86 05:44:03 PST

What you observe is probably poor tcp behavior, not antisocial rdp
behavior. If the link is lossy or the mean round trip time is
greater than 15 seconds, the 4.3bsd tcp throughput degrades rapidly.
For long transfers, a link that gives 2.7KB/s throughput with a
1% loss rate gives 0.07KB/s throughput with a 10% loss rate. (As
appalling as this looks, 4.2bsd, TOPS-20 tcp, and some other major
implementations that I've measured, get worse faster. The 4.3
behavior was the best of everything I looked at.)

I know some of the reasons for the degradation. As one might
expect, the failure seems to be due to the cumulative effect of
a number of small things. Here's a list, in roughly the order
that they might bear on your experiment.

1. There is a kernel bug that causes IP fragments to be generated
   and ip fragments have only a 7.5s TTL.

In the distribution 4.3bsd, there is a bug in the routine
in_localaddr that makes it say all addresses are "local". In
most cases, this makes tcp use a 1k mss which results in a lot of
ip fragmentation. On high loss or long delay circuits, a lot of
the tcp traffic gets timed out and discarded at the destination's
ip level.

The bug fix is to change the line:
        if (net == subnetsarelocal ? ia->ia_net : ia->ia_subnet)
in netinet/in.c to
        if (net == (subnetsarelocal ? ia->ia_net : ia->ia_subnet))

I also changed IPFRAGTTL in ip.h to two minutes (from 15 to 240)
because we have more memory than net bandwidth.
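
The precedence problem in that line can be reproduced outside the
kernel. What follows is only a minimal sketch: the struct
ifaddr_sketch and both function names are made up for illustration,
and only the field names ia_net and ia_subnet and the two forms of
the test come from the fix above. It is not the real in_localaddr.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's interface address record.
 * Only the field names ia_net and ia_subnet come from the fix above. */
struct ifaddr_sketch {
    unsigned long ia_net;     /* network number */
    unsigned long ia_subnet;  /* subnet number */
};

/* Buggy form: == binds tighter than ?:, so the test parses as
 * (net == subnetsarelocal) ? ia->ia_net : ia->ia_subnet
 * and is true whenever the selected field is nonzero, whatever net is. */
int is_local_buggy(unsigned long net, unsigned long subnetsarelocal,
                   const struct ifaddr_sketch *ia)
{
    assert(ia != 0);
    if (net == subnetsarelocal ? ia->ia_net : ia->ia_subnet)
        return 1;
    return 0;
}

/* Fixed form: the parentheses make the comparison apply to the
 * field that the flag selects. */
int is_local_fixed(unsigned long net, unsigned long subnetsarelocal,
                   const struct ifaddr_sketch *ia)
{
    assert(ia != 0);
    if (net == (subnetsarelocal ? ia->ia_net : ia->ia_subnet))
        return 1;
    return 0;
}
```

With an interface on net 128, is_local_buggy(36, 1, &ia) returns 1
for the unrelated net 36, while is_local_fixed(36, 1, &ia) returns 0.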

2. The retransmit timer is clamped at 30s.

The 4.3 tcp was put together before the arpanet went to hell and
has some optimistic assumptions about time. Since the retransmit
timer is set to 2 * RTT, an RTT > 15s is treated as 15s. (Last
week, the mean daytime rtt from LBL to UCB was 17s.) On a circuit
with 2min rtt, most packets would be transmitted four times and
the protocol pipelining would be effectively turned off (if 4.3
is retransmitting, it only sends one segment rather than filling
the window). When running in this mode, you're very sensitive to
loss since each dropped packet or ack effectively uses up 4 of
your 12 retries.

I would at least change TCPTV_MAX in netinet/tcp_timer.h to a
more realistic value, say 5 minutes (remembering to adjust
related timers like MSL proportionally). I changed the
TCPT_RANGESET macro to ignore the maximum value because I
couldn't see any justification for a clamp.
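
The arithmetic behind "transmitted four times" can be sketched as
follows. This is a simplified model, not the kernel's TCPT_RANGESET
macro: the function names are hypothetical, and it assumes a fixed
timeout with no backoff and no loss.

```c
#include <assert.h>

/* Hypothetical helper: retransmit timeout in seconds for a given rtt,
 * clamped to max_s. Pass max_s <= 0 to disable the clamp. */
double rexmt_timeout(double rtt_s, double max_s)
{
    assert(rtt_s > 0.0);
    double t = 2.0 * rtt_s;          /* timer is set to 2 * RTT */
    if (max_s > 0.0 && t > max_s)
        t = max_s;
    return t;
}

/* Number of times a segment is sent before its ack can arrive,
 * assuming a fixed timeout, no backoff and no loss (a simplification). */
int transmissions_before_ack(double rtt_s, double timeout_s)
{
    assert(rtt_s > 0.0 && timeout_s > 0.0);
    int sent = 1;                    /* the original transmission */
    double next = timeout_s;         /* time of the next timer expiry */
    while (next < rtt_s) {           /* timer fires before the ack is back */
        sent++;
        next += timeout_s;
    }
    return sent;
}
```

With a 120s rtt and the 30s clamp the model sends each segment 4
times; with a 5 minute maximum the timeout is 240s and the segment
goes out once.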

3. It takes a long time for tcp to learn the rtt.

I've harped on this before. With the default 4k socket buffers
and a 512 byte mss, 4.3 tcp will only try to measure the rtt of
every 8th packet. It will get a measurement only if that packet
and its 7 predecessors are transmitted and acked without error.
Based on trpt trace data, tcp gets the rtt of only one in every
80 packets on a link with a 5% drop rate. Then, because of the
gross filtering suggested in rfc793, only 10% of the new
measurement is used. For a 15s rtt, this means it takes at least
400 packets to get the estimate from the default 3s to 7.5s
(where you stop doing unnecessary retransmits for segments with
average delay) and 1700 packets to get the estimate to 14s (where
you stop unnecessary retransmits because of variance in the
delay). Also, if the minimum delay is greater than 6s
(2*TCPTV_SRTTDFLT), tcp can never learn the rtt because there
will always be a retransmit canceling the measurement.

There are several things we want to try to improve this
situation. I won't suggest anything until we've done some
experiments. But, the problem becomes easier to live with if
you pick a larger value for TCPTV_SRTTDFLT, say 6s, and improve
the transient response in the srtt filter (lower TCP_ALPHA to,
say, .7).
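
To see how much the filter constant matters, here is a rough sketch
of the smoothing step, assuming the usual form
srtt = alpha*srtt + (1-alpha)*sample and assuming every sample equals
the true rtt. The function names are invented for the example.

```c
#include <assert.h>

/* One step of the smoothing filter:
 * srtt = alpha * srtt + (1 - alpha) * sample. */
double srtt_update(double srtt, double sample, double alpha)
{
    assert(alpha > 0.0 && alpha < 1.0);
    return alpha * srtt + (1.0 - alpha) * sample;
}

/* Count the rtt measurements needed to move the estimate from start_s
 * up to at least target_s when every sample equals rtt_s. */
int measurements_to_reach(double start_s, double rtt_s,
                          double target_s, double alpha)
{
    assert(target_s < rtt_s);        /* otherwise the loop may not end */
    int n = 0;
    double srtt = start_s;
    while (srtt < target_s) {
        srtt = srtt_update(srtt, rtt_s, alpha);
        n++;
    }
    return n;
}
```

For a 15s rtt, going from 3s to 7.5s takes 5 measurements with
alpha = .9 (at one measurement per 80 packets, that is the 400
packets quoted above) and 2 measurements with alpha = .7.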

4. The retransmit backoff is wimpy.

Given that most of the links are congested and exhibit a lot of
variance in delay, you would like the retransmit timer to back
off pretty aggressively, particularly given the lousy rtt
estimates. 4.3 backs off linearly most of the time. The actual
intervals, in units of 2*rtt, are:
  1 1 2 4 6 8 10 15 30 30 30 ...
While this is only linear up to 10, the 30s clamp on timers
means you never back off as far as 10 if the mean rtt is >1.5s.
The effect of this slow backoff is to use up a lot of your
potential retries early in a service interruption. E.g., a
2 minute outage when you think the rtt is 3s will cost you 9
of your 12 retries. If the outage happens while you were
trying to retransmit, you probably won't survive it.
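
The interaction between the table and the clamp can be sketched like
this. The multipliers are the ones listed above; padding the table
out to 12 entries with 30s, and all the names, are assumptions made
for the example rather than the kernel's own code.

```c
#include <assert.h>

/* Backoff multipliers as listed above, in units of 2*rtt.
 * Padding the tail with 30 up to 12 entries is an assumption. */
static const int backoff_sketch[] = { 1, 1, 2, 4, 6, 8, 10, 15, 30, 30, 30, 30 };

#define NBACKOFF (sizeof backoff_sketch / sizeof backoff_sketch[0])
#define CLAMP_S  30.0   /* the 30s clamp on timers */

/* Interval in seconds before retry number `attempt` (0-based). */
double rexmt_interval(unsigned attempt, double rtt_s)
{
    assert(attempt < NBACKOFF);
    double t = backoff_sketch[attempt] * 2.0 * rtt_s;
    return t > CLAMP_S ? CLAMP_S : t;
}

/* Largest multiplier that still fits under the clamp for this rtt. */
int largest_effective_multiplier(double rtt_s)
{
    assert(rtt_s > 0.0);
    int best = 0;
    for (unsigned i = 0; i < NBACKOFF; i++) {
        if (backoff_sketch[i] * 2.0 * rtt_s <= CLAMP_S)
            best = backoff_sketch[i];
    }
    return best;
}
```

At rtt = 1.5s the multiplier 10 is just reachable (10 * 3s = 30s);
at rtt = 1.6s the largest one that fits is 8, and at rtt = 3s it
is 4.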

This is another area where we want to do some experiments. It
seems to me that you want to back off aggressively early on, say
 1 4 8 16 ...
for the first part of the table. It also seems like you want
to go linear or constant at some point; waiting 8192*rtt for the
12th retry has to be pointless. The dynamic range depends to
some extent on how good your rtt estimator is and on how robust
the retransmit part of your tcp code is. Also, based on some
modelling of gateway congestion that I did recently, you don't
want the retransmit time to be deterministic. Our first cut
here will probably look a lot like the backoff on an ethernet.

5. "keepalive" ignores rtt.

If you are setting SO_KEEPALIVE on any of your sockets, the
connection will be aborted if there's no inbound packet for
6 minutes (TCPTV_MAXIDLE). With a 2m rtt, that could happen
in the worst case with one dropped packet followed by one
dropped ack. ("Sendmail" sets keepalive and we were having
a lot of problems with this when we first brought up 4.3.)

A fix is to multiply by t_srtt when setting the keepalive
timer and divide t_idle by t_srtt when comparing against
the idle limit.

6. The initial retransmit of a dropped segment happens, at
   best, after 3*rtt rather than 2*rtt.

If the delay is large compared to the window, the steady state
traffic looks like a burst of acks interleaved with data, an ~rtt
delay, a burst of acks interleaved with data and repeat. 4.3
doesn't time individual segments. It starts a 2*rtt timer for
the first segment, then, when the first segment is acked,
restarts the timer at 2*rtt to time the next segment. Since the
2nd segment went out at approximately the same time as the first
and since the ack for the first segment took rtt to come back,
the retransmit time for the 2nd segment is 3*rtt. In the usual
internet case of 4k windows and an mss of 512, the probability of
a loss taking 3*rtt to detect is 7/8.
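
The 7/8 figure is just a count of segments per window. A small
sketch, assuming a full window and a loss equally likely to hit any
segment in it (the function name is made up):

```c
#include <assert.h>

/* Fraction of segments in a full window whose loss takes about 3*rtt
 * to detect: every segment except the first, which is the only one
 * the shared timer times from its own send time. */
double late_detection_fraction(unsigned window_bytes, unsigned mss_bytes)
{
    assert(mss_bytes > 0 && window_bytes >= mss_bytes);
    unsigned segs = window_bytes / mss_bytes;   /* segments per window */
    return (double)(segs - 1) / (double)segs;
}
```

late_detection_fraction(4096, 512) gives 7/8 = 0.875.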

The situation is actually worse than this on lossy circuits.
Because segments are not individually timed, all retransmits
will be timed 2*rtt from the last successful transfer (i.e.,
the last ack that moved snd_una). This tends to add the time
taken by previous retransmissions into the retransmission time
of the current segment, increasing the mean rexmit time
and, thus, lowering the average throughput. On a link with
a 5% loss rate, for long transfers, I've measured the mean time
to retransmit a segment as ~10*rtt.

The preceding may not be clear without a picture (it sure took
me a long time to figure out what was going on) but I'll try to
give an example. Say that the window is 4 segments, the rtt is
R, you want to ship segments A-G and segments B and D are going
to get dropped. At time zero you spit out A B C D. At time R you
get back the ack for A, set the retransmit timer to go off at 3R
("now" + 2*rtt), and spit out E. At 3R the timer goes off and you
retransmit B. At 4R you get back an ack for C, set the retransmit
timer to go off at 6R and transmit F G. At 6R the timer goes off
and you retransmit D. [D should have been retransmitted at 2R.] Even
if we count the retransmit of B delaying everything by 2R (in
what is essentially a congestion control measure), there is an
extra 2R added to D's retransmit because its retransmit time is
slaved to B's ack. Also note that the average throughput has
gone from 8 packets in 2R (if no loss) to 8 packets in 7R, a
factor of four degradation.
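
The two deadlines in that example can be written down directly.
This is only a sketch of the arithmetic, with hypothetical function
names and times in whatever unit the rtt is given in; it does not
model the window or the acks themselves.

```c
#include <assert.h>

/* Shared timer, as described above: the single retransmit timer is
 * restarted 2*rtt after the last ack that advanced the window. */
double shared_timer_deadline(double last_ack_time, double rtt)
{
    assert(rtt > 0.0);
    return last_ack_time + 2.0 * rtt;
}

/* Per-segment timing: a segment is retransmitted 2*rtt after it
 * was sent. */
double per_segment_deadline(double send_time, double rtt)
{
    assert(rtt > 0.0);
    return send_time + 2.0 * rtt;
}

/* Extra delay the shared timer adds to one segment's retransmit. */
double extra_delay(double send_time, double last_ack_time, double rtt)
{
    return shared_timer_deadline(last_ack_time, rtt)
         - per_segment_deadline(send_time, rtt);
}
```

Taking R = 1: B was sent at 0 and the ack for A came at R, so the
shared timer fires at 3R instead of 2R. D was also sent at 0 but the
timer was last restarted by the ack for C at 4R, so it fires at 6R
instead of 2R.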

The obvious fix here is to time each segment. Unfortunately,
this would add 14 bytes to a tcpcb which would then no longer fit
in an mbuf. So, we're still trying to decide what to do. It's
(barely) possible to live within the space limitations by, say,
timing the first and last segments and assuming the segments were
generated at a uniform rate.
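
One way the "first and last segment" idea might look, purely as a
sketch: the names and the linear interpolation are my reading of
"uniform rate", not code from any tcp.

```c
#include <assert.h>

/* Hypothetical estimate of when segment `index` (0-based) of `count`
 * outstanding segments was sent, given only the send times of the
 * first and last segments and assuming a uniform send rate. */
double estimated_send_time(double first_sent, double last_sent,
                           unsigned index, unsigned count)
{
    assert(count > 0 && index < count && last_sent >= first_sent);
    if (count == 1)
        return first_sent;
    return first_sent
         + (last_sent - first_sent) * (double)index / (double)(count - 1);
}

/* Retransmit deadline for that segment: estimated send time + 2*rtt. */
double estimated_deadline(double first_sent, double last_sent,
                          unsigned index, unsigned count, double rtt)
{
    assert(rtt > 0.0);
    return estimated_send_time(first_sent, last_sent, index, count)
         + 2.0 * rtt;
}
```

The cost is two timestamps per connection instead of one per
segment; the estimate is only as good as the uniform-rate
assumption.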

7. The retransmit policy could be better.

In the preceding example, you might have wondered why F G were
shipped after the ack for C rather than D. If I'd changed the
example so that C was dropped rather than D, C D E F would have
been shipped when the ack for B came in (unnecessarily resending
D and E). In either case the behavior is "wrong". The reason it
happens is because an ack after a retransmit is treated the same
way as a normal ack. I.e., because of data that might be in
transit you ignore what the ack tells you to send next and just
use it to open the window. But, because the ack after a
retransmit comes 3*rtt after the last new data was injected, the
two sides are essentially in sync and the ack usually does tell
you what to send next.

It's pretty clear what the retransmit policy should be. We
haven't even started looking into the details of implementing
that policy in tcp_input.c & tcp_output.c. If a grad student
would like a real interesting project ...

There's more but you're probably as tired of reading as I am of
writing. If none of this helps and if you have any Sun-3s handy,
I can probably send you a copy of my tcp monitor (as long as our
lawyers don't find out). This is something like "etherfind"
except it prints out timestamps and all the tcp protocol info.
You'll have to agree to post anything interesting you find out.

Good luck.

  - Van
