VMTP Checksum Discussion

David Cheriton (cheriton@pescadero.stanford.edu)
Mon, 1 Dec 86 21:25:58 pst

the TCP algorithm).
he "rotate" provides some detection of zeros and reordering. However,
clearly there are occasions that it will fail to detect reordering of
32 long word blocks, or insertion of 32 long blocks of zeros.
(It also requires some assembly language in most languages for efficient
implementation.) Other properties are not clear to me.

Another algorithm is the ISO transport algorithm, which is easily extended
to 32-bits or 64-bits. I.e. R = Sum(1..L) Wi and S = Sum(1..L)(L-i+1)Wi
where Wi is the ith (32-bit) word and there are L bits in the packet.
According to Fletcher's analysis (See ref. below.), this algorithm
detects all single bit errors and detects all but .001526 percent of all
possible errors, same as CRC. It also detects all burst errors of length
C bits, where C is the number of bits in the checksum. Fletcher claims:
"it represents a very reasonable tradeoff between effectiveness and efficiency".

I did some basic performance tests on these algorithms.
The following gives the performance for the three algorithms for checksumming
a VMTP packet with 1024 bytes of data (assumed to be long word aligned).

Algorithm MC 68010 (10 MHz) MC 68020 (16 MHz)
add-and-rotate 916 usec 320 usec.
ISO Transport (32-bit) 914 usec 336 usec.
ISO (16-bit) 1245 usec 446 usec.
TCP 890 usec 323 usec.

The TCP figure is really not too accurate - it lacks a test for carry on
each 32-bit add. The first ISO version uses 32-bit adds to develop the two sums
using 2's complement arithmetic (and then I just store the sum of the two
sums - a proper scheme here would require 64-bits for the checksum field.)
(Fletcher notes that the 1's complement version has slightly better properties
but is more expensive to compute.)
The second one uses 16-bit adds, which means a register swap is necessary as
well as doubling the number of adds required. The results here appear to
contradict Jon Crowcroft's comments on performance. Perhaps his factor of
4 comes from a poor implementation??
The ISO 32-bit version and the TCP version are in (portable) non-asm C,
whereas the others are done in assembly language. An assembly version
of ISO 32-bit appears about 80 usecs faster on a 68020.

All these algorithms have the property of requiring only one
memory reference (to get the 32-bit byte out of the packet) per word
in the packet, which is the minimal memory reference cost, given 32-bit
data paths. The numbers seem to indicate that a small number of instructions
per long word (that do not increase # memory references) make no significant
difference in performance of the algorithm. This is particularly true on
machines like the 68020, where the inner loop fits into the on-chip
instruction cache, and this is the future. For instance, the ISO and TCP
versions differ by essentially one addition, and thus by 13 usecs
(for 1108 register-register adds!)

Thus, I argue that there is no significant performance penalty to adding a
few non-memory reference instructions for a rotate or extra addition
to get better detection capabilities.

3) Location of the checksum

I propose to place the VMTP checksum into a trailer portion of the packet.
This allows a high-performance network interface, such as the one we are
developing here at Stanford, to add the checksum as the packet is transmitted
to the network, avoiding 2 extra memory references. It does not appear to
add any additional software overhead without such an interface, unless one
is trying to use a relatively simple DMA net interface and DMA packets
on transmission and reception without copy. Since this is in general not
feasible without scatter-gather, and with, the trailer presents no problem,
I dont think this is a real disadvantage.

My vision of the future is of network interfaces that implement much or
all of the transport level, providing a service similar to the disk interface.
I.e. one does not generally worry in software about errors between main memory
and the disk. Note that the error checking is still end-to-end at least
between network interfaces in the end machines. Also, with request-response
protocols, the response is an application-level end-to-end acknowledgement of
the request.

For these interfaces, performance is determined by the number of memory
references. The next generation of network interfaces must minimize
the number of memory references to transmit and receive packets or they
will be slow. In fact, one of the big challenges (the only one?) is to
implement "intelligence" on the network interface without being killed
by the delay of extra copies (i.e. memory references.)
  Clearly, the value of the checksum is not known until the entire packet
has been scanned. Putting the checksum at the end allows the checksum
to be calculated as part of transmitting the packet from the network
interface to the network. The only other copy into which it might be
incorporated, between the host and the network interface, has several
problems. First, the data is not necessarily packetized at that point,
and definitely not in our network interface design. Second, the data rate
on that path can be so high (and must be in multiprocessors) that fitting
a checksum calculation into the copy is difficult. Moreover, one may
want to do encryption at the same time. For instance, we move data
at 320 megabits/sec. in our design for the VMEbus. Much higher speeds
are planned.

Currently, we are dealing with very "1st" generation network interfaces
that are dealing with multi-megabit data rates. I think the next generation
can be much more reliable, efficient, and powerful. We need to design
the transport protocol such that it does not introduce unnecessary
difficulties for such interfaces.
Thus, I feel this minor modification to the protocol supports the right way
of doing network interfaces, and that it aids these interfaces in dealing
with an intrinsic problem, not one that will go away with a few more bells
and whistles on the interfaces.

One issue a trailer checksum raises is alignment relative to encryption.
I have tried to keep the header as a multiple of 64 bits.
With a null data segment and 32-bit checksum, the VMTP packet is still a
multiple of 64 bits. If we force the datasegment to also be a multiple
of 64 bits, this all remains true still. This is what I propose to do.

In summary, I recommend we go for a 32-bit checksum and use the ISO
checksum algorithm, modified to use 32-bit units.
The ISO checksum is a standard and has an article analyzing its
properties (to some extent). (If we also start the checksum
using a non-zero initial value, it should do well at detecting a packet
of all zeroes.) Finally, we place the checksum in a single 32-bit trailer
portion of the packet.


David Cheriton


J.G. Fletcher, An Arithmetic Checksum for Serial Transmission,
IEEE Trans. on Com. COM-30 (1) 247-251, Jan, 1982.

This archive was generated by hypermail 2.0b3 on Thu Mar 09 2000 - 14:37:00 GMT