more on tcp congestion control


Van Jacobson (van@lbl-csam.arpa)
Mon, 22 Feb 88 21:39:37 PST


fill the hole in the sequence space, all four packets will be
acked. If you increment by the amount acked, cwnd will get
fully opened and you'll blast out a window's worth of
back-to-back packets, almost guaranteeing another drop.
(I've shown the IETF lots of pictures of this kind of stable
failure mode.)
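
That burst-on-cumulative-ack failure is easy to sketch in C (a toy
illustration, not the 4.3BSD code; the function names and the
512-byte mss are made up). After a drop, the retransmission that
fills the hole is cumulatively acked together with the packets
queued behind it, so one ack can cover 4*mss bytes:

```c
#define MSS 512L

/* naive policy: open cwnd by every byte the ack covers */
long grow_by_bytes_acked(long cwnd, long bytes_acked)
{
    return cwnd + bytes_acked;
}

/* per-ack policy: open by at most one segment, however big the ack */
long grow_by_one_segment(long cwnd, long bytes_acked)
{
    (void)bytes_acked;          /* size of the ack deliberately ignored */
    return cwnd + MSS;
}
```

Starting from cwnd = 1 packet, the naive policy jumps straight to 5
packets and emits them back to back; the per-ack policy opens the
window one packet at a time, preserving the ack clock.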

> 4. I'm confused by the modified slow-start algorithm you
> described, the one that uses a threshold to change the amount
> the congestion window is increased. Again, what does it mean to
> increase the congestion window by 1/cwind? Is cwind in packets
> (what size?) or bytes (increase the window by fractions of a
> byte?)

We found it was important to distinguish startup behavior from
steady-state behavior. (It sounds as if you're lumping the two
together.) The thing we call "slow start" just has to do with
starting data flowing. The algorithm has a very short lifetime
-- it typically runs for the first 3-5 rtt after you first start
sending on a connection or restart after a drop. When slow
start shuts down, the link is in equilibrium (the pipe is full
and you've exchanged a full window of data so you know the ack
"clock" is working). At this point another algorithm,
congestion avoidance, takes over. This algorithm is responsible
for estimating how much data is needed to fill the pipe under
the current (time-varying) load. The congestion avoidance &
slowstart are separate algorithms with completely different
objectives. They just happen to run consecutively and just
happen to accomplish their individual objectives by twiddling
the same state variable (cwnd).
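
The handoff between the two algorithms can be sketched as a single
per-ack update (illustrative only; the use of ssthresh as the
crossover point follows the discussion here, but the function itself
is hypothetical):

```c
/* Toy per-ack window update, assuming ssthresh marks where slow
 * start hands off to congestion avoidance. */
long cwnd_on_ack(long cwnd, long ssthresh, long mss)
{
    if (cwnd < ssthresh)
        return cwnd + mss;           /* slow start: one mss per ack,
                                        so cwnd roughly doubles per rtt */
    return cwnd + mss * mss / cwnd;  /* congestion avoidance: about
                                        one mss per rtt, spread over acks */
}
```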

The congestion avoidance algorithm is described well in DEC
technical report DEC-TR-506, "Congestion Avoidance in Computer
Networks with a Connectionless Network Layer" by Raj Jain, K. K.
Ramakrishnan, Dah-Ming Chiu, August, 1987. The idea is that when
you try to use a window larger than the bandwidth*delay of the
network path, the excess window won't fit in the pipe so it
must show up in a queue somewhere. In the usual case, "somewhere"
will be the outbound queue of the slowest link in the path.
That link will probably be a bottleneck for a lot of other
conversations and it's likely its queue will fill up and overflow.
We treat this as a signal that our window is too large and
reduce it (Jain doesn't wait for an overflow -- his signal is
a bit in the packet that the gateway can set to say it's congested.)
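
The "excess window shows up in a queue" argument rests on the pipe
size, bandwidth*rtt. A quick worked computation (the link numbers
below are illustrative, roughly ARPANET-era, not from the text):

```c
/* Bytes the path itself can hold: (bits/sec / 8) * (rtt in seconds).
 * Any window beyond this must sit in a queue at the bottleneck. */
long pipe_bytes(long bits_per_sec, long rtt_ms)
{
    return bits_per_sec / 8 * rtt_ms / 1000;
}
```

A 56 kbit/s bottleneck with a 500 ms round trip holds 3500 bytes, so
a 4096-byte window on that path keeps roughly 600 bytes parked in
somebody's queue.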

But the `bandwidth' in the delay-bandwidth product is *your
share* of the fixed link bandwidth and your share can change as
conversations using the same path come and go. The gateway
tells you when you're using too much but says nothing when
you're using too little. To find out if someone has shut down
and your share is now larger, you continuously test by slowly
increasing your window size until the gateway says you're using
too much. (Thus the window will oscillate around some mean
value. The algorithm is designed to do this. But we've tried
to set the adjustable parameters that determine the oscillation
so that the bandwidth lost to testing the path is at most 1%.)

From a linear system argument, you can show that the best probe
policy is to make your bandwidth vs. time obey dB/dt = C for
some constant C. Since B = W / R, where W is the window size
and R is the round trip time, and R is constant, this says you
want dW/dt = c for some other c. In other words, in one rtt we
want the window to go smoothly from size W to size W+c. I want
the algorithm to be "self clocked" so we go looking for how to
do the above change using acks rather than time to drive the dW.
If the max packet size is mss and we have decent silly window
code, a window of W will generate at most W/mss acks and it will
generate them spread out over one rtt. Thus if we increment by
c/(W/mss) = c*mss/W per ack, we will have opened the window by c
after one rtt. An appropriate c for current arpanet conditions
turns out to be one mss so the increment part of the congestion
avoidance turns into

        snd.cwnd += mss*mss / snd.cwnd;

(you might have to worry about the fixed-point in this if you
were trying to send small packets over a net with a large bandwidth
delay product but it's not a problem in any net I know of.)
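
You can check the arithmetic by applying that per-ack increment
W/mss times, i.e. one rtt's worth of acks (a toy sketch; the
function name is made up, and the integer division is exactly the
fixed-point caveat above):

```c
/* Apply snd.cwnd += mss*mss/snd.cwnd once per ack for one rtt
 * (a window of W bytes yields at most W/mss acks) and return how
 * far the window opened.  Integer truncation makes the result come
 * out slightly under one mss. */
long open_over_one_rtt(long cwnd, long mss)
{
    long acks = cwnd / mss;
    long start = cwnd;
    for (long i = 0; i < acks; i++)
        cwnd += mss * mss / cwnd;    /* the increment from the text */
    return cwnd - start;
}
```

With mss = 512 and an 8-packet window this opens cwnd by 484 bytes
over the rtt, just shy of the intended one mss.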

> 5. You didn't specifically mention ICMP source quench, but I
> assume you mean that the congestion window should be cut in half
> each time you get one. To make this meaningful I also limit the
> congestion window to be no more than the offered window. One
> question: should the congestion window be allowed to decrease
> below one MSS? I would say yes, but I'd like to hear your
> opinion.

I didn't mention source quench because it's a disaster and I keep
hoping it will go away. There must be at least one rtt of hysteresis
or "dead time" for any control algorithm to be stable. By using
packet drops and timeouts to drive things, we get this automatically.
But it's very, very hard to get the right kind of hysteresis with
source quench. It's also possible to show that source quench leads
to a very unfair bandwidth allocation -- SQ is preferentially
delivered to the hosts using the smallest windows (I have lots of
trace data showing this and can explain why it happens but the
explanation is mathematical and quite involved).

So, since SQ currently happens only when one or more packets have
been dropped, all we do is close snd.cwnd down to one packet (so
no more will get dropped) but we don't touch snd.ssthresh
(because the slowstart time is short, ssthresh is the crucial
window control, not cwnd). Ssthresh and the rest of the
congestion control stuff will get handled correctly, and with
the right "time constants", when we eventually take the timeout
for the lost packet.
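
That policy is small enough to sketch (the struct and field names
here are made up for illustration, not the BSD tcpcb):

```c
struct tcb { long snd_cwnd, snd_ssthresh, t_maxseg; };

/* On source quench: collapse cwnd to one packet so no more get
 * dropped, but leave ssthresh alone -- the eventual timeout for the
 * lost packet runs the normal slowstart/congestion-avoidance
 * machinery with the right time constants. */
void on_source_quench(struct tcb *tp)
{
    tp->snd_cwnd = tp->t_maxseg;
    /* tp->snd_ssthresh deliberately untouched */
}
```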

We limit cwnd to be at most the offered window (you violate
rfc793 if you ever send more than the offered window). I don't
think it's a good idea to let cwnd get below one mss. Dropping
cwnd below mss forces you to send small packets, increasing the
overhead per data byte xferred. Thus you are using the network
less efficiently at a time it's heavily loaded and you really
want to make the most efficient possible use of it. This is an
unstable positive feedback that can lead to collapse. If mss is
set correctly, it never makes sense to send a smaller packet.
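
Both clamps together, as a sketch (a hypothetical helper, not actual
stack code):

```c
/* Keep cwnd inside the two bounds argued for above: never beyond
 * the receiver's offered window, never below one mss. */
long clamp_cwnd(long cwnd, long offered_win, long mss)
{
    if (cwnd > offered_win)
        cwnd = offered_win;   /* rfc793: never exceed the offer */
    if (cwnd < mss)
        cwnd = mss;           /* sub-mss packets waste a loaded net */
    return cwnd;
}
```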

> 6. It's interesting to note that the "set the congestion window
> to one packet after a timeout" rule automatically implies the
> "first-only retransmission" rule, thus simplifying my code
> somewhat. In effect, a timeout is just a very drastic source
> quench.

Yes, there were several sort-of ad-hoc pieces of code, like the
first-only rexmit and all the source quench handling, that were
taken out because their job was handled in a more unified way by
the slowstart/cong. avoidance framework -- it was very
satisfying. We removed something like 15 lines and replaced
them with 6. Of course, we then stuck in your clamped
retransmit backoff algorithm, a fast retransmit algorithm (that
got described at one of the IETF meetings) and a few more
goodies, and ended up +3 lines from the original 4.3 code
instead of the -9 we expected (not including, of course, dozens
of lines of bug fixes).

I prefer to think of source quench as a poorly considered, ill
behaved timeout :-).

 - Van
-------



This archive was generated by hypermail 2.0b3 on Thu Mar 09 2000 - 14:40:43 GMT