Re: Checksums (again)

Charley Kline (
Fri, 1 Apr 88 15:53:44 CST

When I first started the CTSS TCP/IP project, Dave Clark suggested that
I publish the checksum algorithm. His contention was that most checksummers
are dreadfully machine-dependent, incomprehensible, and artful things. Well,
I didn't put much effort into art, but I thought I'd oblige him.

What I do to compute checksums on a Cray is a vector operation
in assembler of all things. I can do up to 512 bytes of packet at a
time. The core of the algorithm looks as follows. A1 holds the address
of the 512-byte block of memory to checksum (I'm leaving out many
details having to do with short blocks, because they are ugly).

        A0 A1
        VL 64 use full vectors
        S1 <32 form 32-bit mask from the right.
        A2 32
        V1 ,A0,1 load packet into V1
        V2 S1&V1 Form right-hand 32-bits in V2.
        V3 V1>A2 Form left-hand 32-bits in V3.
        V1 V2+V3 Add the two together.
        A2 63 Prepare to collapse into a scalar.
        S1 0
        S4 <16 Form 16-bit mask from the right.
        A4 16
        A2 A2-1
        A0 A2
        S1 S1+S2
        JAN CK$LOOP
        S2 S1&S4 Form right-hand 16-bits in S2
        S1 S1>A4 Form left-hand 16-bits in S1
        S1 S1+S2
        S2 S1&S4 Form right-hand 16-bits in S2
        S1 S1>A4 Form left-hand 16-bits in S1
        S1 S1+S2
        S1 #S1 Take one's complement
        CMR At this point, S1 contains the checksum.

Note that this takes full advantage of the fact that the one's complement
sum is an Abelian Group--I can add 32-bit pieces instead of 16-bit pieces,
as has been mentioned. I can also take advantage of the fact that the sum of
the checksums of several blocks is the same as the checksum of the sum
by carrying out this checksum on several blocks and adding up the answers.

First I load two copies of the packet into two vector registers. One
gets vector-shifted right 32 bits; the other gets vector-ANDED with a
32 bit mask. Then the two are added together. Since all these operations
chain, I can produce one result per clock cycle. Then I collapse
the results in the vector by adding each element to a scalar register.
Finally, the end-around carry and conversion to 16-bits. It's all
terribly fast. Now if only context-switching and IPC under CTSS were
as fast...

Charley Kline, University of Illinois Computing Services

"Birth. School. Work. Death."

This archive was generated by hypermail 2.0b3 on Thu Mar 09 2000 - 14:41:54 GMT