more on tcp congestion control


Van Jacobson (van@lbl-csam.arpa)
Mon, 22 Feb 88 21:39:37 PST


Phil -

Thanks for correcting the CUTE reference (I've got to start
putting JSAC and COM in separate piles) and thanks for your
interesting message on changes to the ka9q package. You
bring up a lot of things that I should have covered so I'm
cc'ing this reply to the tcp/ip list (in the unlikely event
that anyone else is interested).

> I suggest rounding your computed RTO values up to the next
> highest number of ticks when setting the timer. This is
> especially important when you calculate retransmission timeouts
> based on mean deviations, since they can be very small.
> Otherwise it's possible for truncation effects to set the timer
> LESS than the current SRTT.

Good point. I should have edited that timer msg I inserted
because it contained several rounding errors. The line

        rto = (Avg >> 3) + (Mdev >> 1);

should have read

        rto = ((Avg >> 2) + Mdev + 3) >> 1;

which is how things read in the timer rfc & the bsd code -- Avg
could also be rounded but it doesn't make much practical
difference. The mysterious "+ 3" is half a tick for rounding
plus one tick because the send is phased randomly with respect
to the clock (i.e., there's a +-1/2 tick uncertainty in when
the timer goes off). Also, because of the timer uncertainty, the
minimum value for rto should be two ticks. You can save a
"if (rto < 2) rto = 2" step, at the cost of a half-tick bias
in rto, if you make the above

        rto = ((Avg >> 2) + Mdev + 4) >> 1;

> 3. When you refer to "packets" in the context of setting the
> congestion window in TCP, what do you mean? One MSS?

Yes, by "packet" without a qualifier (such as "arbitrary size
packet") I always mean a one MSS sized packet.

> I tried that first, but I ran into problems. When I have an
> entire SLIP link to myself, the mean deviation sits at a very
> low value. Thus the RTO sits just above the SRTT. But when
> slowstart opens up the window to send more, the round trip time
> immediately increases beyond the retransmission timeout and an
> unnecessary retransmission occurs. The congestion window
> oscillates rapidly, with an unnecessary retransmission in each
> cycle.

I think I understand the problem (although I think you're
confusing the slowstart & congestion control algorithms -- more
on that in the next section) but I'm not sure I agree with the
analysis. I run tcp stuff, bind & NFS over a 2400 baud slip
link from home to work and I've spent a bit of time
investigating the link's behavior (waiting for the damn slow
link gave me a lot of time when I couldn't do anything but
investigate the link behavior :-).

Having the rtt variance in addition to the rtt is a big help
when you're trying to start a bandwidth-limited link. Since
errors in the variance decay quickly, you can afford to pump it
up whenever you're uncertain about how long a send is going to
take. In particular, when starting we set the variance to a
`big number' (default is 3 sec but this can be overridden on a
per-route basis) and the rtt to either our best estimate (also
kept in the route) or zero. After the syn packet, we'll know
the rtt for a zero length packet and rttvar will get decayed by
at most 25%. As long as the the rtt for the first data packet
changes by < 2*.75*bigNumber, we won't get a spurious timeout.
We're free to overestimate bigNumber initially because even a
factor of 10 error will be `forgotten' in 8 packet exchanges.

The same scheme applies for the 2nd and future packet exchanges.
If the link is delay limited, increasing the window from one to
two packets will have no effect. If the link is bandwidth
limited, doubling the window size will double the rtt. In
general, given that we know the rtt, R(W), for a window size of
W, the worst case rtt change for a window size of W+dW is
R(W)*dW/W. We don't know if we'll see the worse case but the
above reflects our "uncertainty" in the rtt of the next packet
if we increase the window size. Thus it should get lumped into rttvar:

        rttvar += rtt * dW/W;

During slowstart, dW is the max segment size, mss, and W is
snd.cwnd so the above becomes

        rttvar += rtt * mss/snd.cwnd

before each snd.cwnd change. (The above needs appropriate
scaling and rounding but they're implemenatation dependent.
Also, it overestimates the possible rtt change when snd.cwnd is
small but that turns out to be desirable.)

> So I'm trying something else. I initialize the congestion window
> at one MSS, but then I increase the congestion window by the
> actual amount of data acked each time an ACK comes in. This
> effectively slows the slow-start algorithm so the timer can have
> more time to adapt. Or at least the congestion window
> oscillation occurs at a much lower rate. What do you think?

If you mean open cwnd by MIN(amount.acked, mss), that's a real
interesting idea and I'm going to think about it. The above
timer hacking takes care of spurious retransmits but there's
another problem: Slowstart is designed to prevent hitting a
gateway with bursts of packets. If you send a few small packets
when the connection starts, their acks will open the window so
that if you suddenly decide to send a bunch of data, you'll hit
the gateway with a burst. This small/large switch is exactly
what happens in an SMTP conversation & it looks like your mod
will cure the problem.

But, if you mean what you said, I don't think it's a good idea.
Consider what happens when you send 1,2,3,4 and 1 is dropped.



This archive was generated by hypermail 2.0b3 on Thu Mar 09 2000 - 14:40:42 GMT