VMTP Checksum Discussion


David Cheriton (cheriton@pescadero.stanford.edu)
Mon, 1 Dec 86 21:25:58 pst


Dear TCP-IPers:
  I have been working on a request-response transport protocol (VMTP) for
possible adoption in the Internet. This work is being done as part of my
participation in the IAB End-to-end task force, chaired by Bob Braden.
  Bob has asked me to bring this mailing list in on some of the discussion of
design choices for checksums. Here is my side of the story.

                  VMTP Checksum Controversy

There appear to be three basic issues regarding the VMTP checksum, or more
generally a checksum for the next generation of transport protocols, namely:
1) What size should the checksum field be?
2) What algorithm should be used?
3) Where in the packet should the checksum be located?
I spent some time talking with John Gill and Marty Hellman of the Stanford
Information Systems Lab, as well as thinking about it myself and doing some
experimentation. (I also got some good feedback from others in my group,
including Steve Deering, Michael Stumm, and Ross Finlayson.)

1) Checksum Size

The checksum is a probabilistic device, clearly. There are an enormous
number of packets that map to the same checksum for either 16 or 32 bits of
checksum.
Let R be the average packet rate in packets per second.
Let P be the probability, per packet, of a corruption that goes undetected
by the lower levels.
As long as P is non-zero, there is some time range T over which
T*R*P becomes quite large, i.e. we have to expect some erroneous packets
being delivered by the IP level. I'll refer to these as nasty packets.
If we view the corruption as randomly assigning the packet a new checksum
from the space of possible checksums, it is clear that, with a 16-bit
checksum, there is a 1 in 65,536 chance that the corrupted packet will have
the same checksum. With a 32-bit checksum, it is one in 4 billion.
The key ingredient we expect to change dramatically over the next 5 years
is the packet rate, for channels, switching nodes and gateways.
It is reasonable to expect a significant increase in traffic and in the number
of gateways, as well as in the packet rate per gateway. Just to play with some
numbers, assume that by 1991 there are 500 gateways, each handling 200 packets
per second, with probability 10**-7 per packet of an error undetected by lower
levels (i.e. the combination of all hardware and software from one endpoint to
another). That means there are 36 packets per hour with errors undetected at
the lower levels.
Assuming random mapping onto checksums, with a 16-bit checksum one might
expect a packet whose error is not detected by the transport-level checksum
every 75 days or so. Maybe this is OK? However, suppose these numbers are
low? For instance, we occasionally get bursts of packets at 2000/sec on
our Ethernet. The future parameters and use of communication could easily
push the time between undetected errors down by an order of magnitude
(7 days?).
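
To make these figures easy to check and to rerun with different assumptions,
here is a small C sketch of the calculation; the gateway count, per-gateway
rate and error probability are just the assumed 1991 numbers above:

    /* Back-of-the-envelope check of the figures above, assuming (as in
     * the text) that corruption maps a packet onto a uniformly random
     * checksum value. */
    #include <stdio.h>

    int main(void)
    {
        double gateways = 500.0;   /* assumed number of gateways, 1991 */
        double rate     = 200.0;   /* packets/sec per gateway (R) */
        double p_undet  = 1e-7;    /* per-packet prob. of an error undetected
                                      below the transport level (P) */

        double nasty_per_sec  = gateways * rate * p_undet;
        double nasty_per_hour = nasty_per_sec * 3600.0;

        /* expected seconds between nasty packets that also fool the checksum */
        double t16 = 65536.0      / nasty_per_sec;
        double t32 = 4294967296.0 / nasty_per_sec;

        printf("nasty packets per hour: %.1f\n", nasty_per_hour);     /* ~36 */
        printf("16-bit miss every %.1f days\n",  t16 / 86400.0);      /* ~76 days */
        printf("32-bit miss every %.0f years\n", t32 / (86400.0 * 365.0));
        return 0;
    }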

Of course, one could argue that such a packet may still fail the addressing
checks, the sequence number checks, etc. in the protocol. However,
our measurements of VMTP indicate that about 1/2 of packets are minimal size
and 1/4 are maximal size (for the Ethernet at least). With small packets,
the probability that a random modification hits the data rather than the
header is .5, whereas for large packets it is .95. Thus, this argument
reduces the risk by far less than a factor of two. Moreover, it is
disquieting to fall back on these incidental checks to help us when we expect
the "real" error detection mechanism is going to fail.
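
As a rough check of the "far less than a factor of two" claim, here is a small
sketch; the 3/4 data fraction assumed for the remaining quarter of packets is
purely my own guess for illustration:

    #include <stdio.h>

    int main(void)
    {
        /* fraction of random corruptions that land in the data, where the
         * addressing and sequence checks cannot see them */
        double p_data_hit = 0.50 * 0.50    /* minimal-size packets        */
                          + 0.25 * 0.75    /* mid-size packets (assumed)  */
                          + 0.25 * 0.95;   /* maximal-size packets        */

        /* even if every header corruption were caught by those checks,
         * the risk drops only by a factor of 1/p_data_hit */
        printf("fraction hitting data: %.2f\n", p_data_hit);     /* ~0.68 */
        printf("best-case reduction factor: %.2f\n", 1.0 / p_data_hit);
        return 0;
    }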

One might also argue that my assumed error rate per packet is too high.
However, the probability of correct operation is the PRODUCT of
the probabilities for the sequence of components involved in a packet
transmission and delivery - so this error rate could get large fast.
Moreover, one requires a more careful analysis to get a tighter bound.
Can anyone offer a significant improvement in bounds without making
some disconcerting assumptions??

In the same breath, one could argue that my analysis is far too optimistic.
For instance, if the checksum algorithm uniformly maps all possible
packets onto all possible checksums, then the fact that normal packets
may be dramatically skewed to particular types, such as ASCII text
packets to and from an important server, may mean that far fewer
corruptions will be detected. In particular, a systematic hardware error
that corrupts a packet into a nasty packet (one with a correct checksum) is
far more likely to encounter the same packet (or a similar enough one) again
than if random packets were being generated.

This "analysis" deals with pleasant, sunny day expected behavior.
A more frightening situation is dealing with burst errors from failing
components. For instance, suppose a gateway or network interface fails
by curdling random portions of packets. We could see 200 undetected
(below transport level) errors per second. Again, assuming random
mods, with a 16-bit checksum we can expect to get a corrupted packet
that passes the transport-level checksum every 5.5 minutes,
versus roughly every 8.2 months with a 32-bit checksum.
If we consider a military situation where components may get wounded in action,
it is easy to conceive of several things generating garbage at the worst
possible time, given a very high exposure using a 16-bit checksum.
The 32-bit checksum appears to give a needed margin of safety, especially
since a variety of factors could cause one to effectively lose a factor of 2
or more. (As John Gill asked, are you sure 32 bits is enough?!)
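
Using the same random-mapping assumption, the failing-gateway numbers work out
as follows (a sketch, just to show where the 5.5 minute and 8.2 month figures
come from):

    #include <stdio.h>

    int main(void)
    {
        double nasty_per_sec = 200.0;  /* corrupted packets reaching transport */

        double t16 = 65536.0      / nasty_per_sec;  /* secs between 16-bit misses */
        double t32 = 4294967296.0 / nasty_per_sec;  /* secs between 32-bit misses */

        printf("16-bit: one undetected packet every %.1f minutes\n", t16 / 60.0);
        printf("32-bit: one undetected packet every %.1f months\n",
               t32 / (86400.0 * 30.44));            /* ~8.2 months */
        return 0;
    }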

Overall, after studying this issue briefly and thinking through the above,
I am even more convinced that we need to move to at least a 32-bit checksum.
The increase in expected packet rate is the key thing making 16-bit
checksums unacceptable.

2) Checksum algorithm

There appears to be a deficiency of theory behind checksums that are suitable
for software implementation. Calculating 32-bit CRCs in software is
infeasible - either too slow or requiring an enormous lookup table.
The ideal algorithm:
i) is very fast to calculate and check in software.
ii) can be shown to detect common types of packet corruption.
iii) can be shown to be unupdateable "through" an encryption scheme,
  i.e. an intruder cannot update the (encrypted) checksum field to compensate
  for modifications to other encrypted data.

Lots of schemes can be proposed, but the more complex the scheme, the more
difficult it is to analyze with respect to (ii) and (iii).
The current algorithm, assuming we extend it to a 32-bit version,
does not detect appending or inserting 0's in a packet,
or the reordering of 16-bit (or 32-bit) units. These seem like not unlikely
failures in a large internetwork with gateways of different byte orders,
flakey hardware, etc.
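
For concreteness, here is a sketch in C of the 16-bit ones-complement sum,
with a small demonstration of the two blind spots just mentioned (the sample
data is made up):

    #include <stdio.h>

    /* ones-complement sum of nwords 16-bit words, folding the carry */
    unsigned short cksum(const unsigned short *buf, int nwords)
    {
        unsigned long sum = 0;
        while (nwords-- > 0) {
            sum += *buf++;
            sum = (sum & 0xffff) + (sum >> 16);   /* end-around carry */
        }
        return (unsigned short) (~sum & 0xffff);
    }

    int main(void)
    {
        unsigned short a[] = { 0x1234, 0x5678, 0x9abc };
        unsigned short b[] = { 0x9abc, 0x1234, 0x5678 };          /* reordered  */
        unsigned short c[] = { 0x1234, 0x5678, 0x9abc, 0x0000 };  /* 0 appended */

        /* addition is commutative and a zero word adds nothing, so all
         * three calls print the same checksum */
        printf("%04x %04x %04x\n", cksum(a, 3), cksum(b, 3), cksum(c, 4));
        return 0;
    }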

Another candidate is "add-and-rotate", as used by RDP and suggested
(without guarantee) by John Gill. It also has the property that any single
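
As a strawman, here is a minimal sketch of an add-and-rotate style checksum;
the 32-bit width and the rotate-by-one step are illustrative choices on my
part, not necessarily what RDP actually uses:

    #include <stdio.h>

    /* rotate a 32-bit value left by n bits (0 < n < 32) */
    unsigned long rotl32(unsigned long x, int n)
    {
        x &= 0xffffffffUL;
        return ((x << n) | (x >> (32 - n))) & 0xffffffffUL;
    }

    unsigned long add_and_rotate(const unsigned char *buf, int len)
    {
        unsigned long sum = 0;
        int i;
        for (i = 0; i < len; i++) {
            sum = rotl32(sum, 1);                 /* byte position now matters */
            sum = (sum + buf[i]) & 0xffffffffUL;
        }
        return sum;
    }

    int main(void)
    {
        unsigned char a[] = { 0x12, 0x34, 0x56, 0x78 };
        unsigned char b[] = { 0x56, 0x78, 0x12, 0x34 };  /* same bytes, reordered */

        /* unlike the plain ones-complement sum, reordering changes the result */
        printf("%08lx %08lx\n", add_and_rotate(a, 4), add_and_rotate(b, 4));
        return 0;
    }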


