Previous: Purpose of this document
Up: Appendix 1: Packetization of H.261 video streams
Next: Usage of RTP

Structure of the packet stream

H.261 codecs produce a bit stream. In fact, H.261 and companion recommendations specify several levels of encoding:

  1. Images are first divided into blocks of 8x8 pixels. Blocks which have moved are encoded by computing the discrete cosine transform (DCT) of their pixel values; the resulting coefficients are then quantized and Huffman encoded.

  2. The bits resulting from the Huffman encoding are then arranged in 512-bit frames, containing 2 bits of synchronization, 492 bits of data and 18 bits of error correcting code.

  3. The 512-bit frames are then interleaved with an audio stream and transmitted over px64 kbps circuits according to specification H.221.

When transmitting over the Internet, we will directly consider the output of the Huffman encoding. We will not carry the 512-bit frames, as protection against errors can be obtained by other means. Similarly, we will not attempt to multiplex audio and video signals in the same packets, as UDP and RTP provide a much more efficient way to achieve multiplexing.

Directly transmitting the result of the Huffman encoding over an unreliable stream of UDP datagrams would, however, offer very poor resistance to errors. The H.261 coding is in fact organized as a sequence of images, or frames, which are themselves organized as a set of Groups of Blocks (GOB). Each GOB holds a set of 3 lines of 11 macro blocks (MB). Each MB carries information on a group of 16x16 pixels: luminance information is specified for 4 blocks of 8x8 pixels, while chrominance information is given only by two 8x8 color-difference blocks, "red" and "blue". These components and the codes representing their sampled values are as defined in CCIR Recommendation 601.
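The sizes implied by this hierarchy can be worked through in a short Python sketch. The picture formats used here (CIF at 352x288 and QCIF at 176x144 luminance pixels) are not stated in this section; they are assumed from the H.261 recommendation, and all function and constant names are illustrative only.

```python
# Sizes implied by the GOB / MB / block hierarchy described above.
MB_SIZE = 16            # a macro block covers 16x16 luminance pixels
MBS_PER_ROW = 11        # macro blocks per line of a GOB
ROWS_PER_GOB = 3        # lines of macro blocks per GOB
BLOCKS_PER_MB = 4 + 2   # 4 luminance blocks + 2 chrominance blocks (8x8 each)

GOB_WIDTH = MBS_PER_ROW * MB_SIZE          # pixels covered horizontally
GOB_HEIGHT = ROWS_PER_GOB * MB_SIZE        # pixels covered vertically
MBS_PER_GOB = MBS_PER_ROW * ROWS_PER_GOB   # macro blocks in one GOB

# Assumed picture formats (luminance width, height); not given in this section.
CIF = (352, 288)
QCIF = (176, 144)


def gobs_per_picture(width, height):
    """Number of GOBs needed to tile a picture of the given luminance size."""
    if width % GOB_WIDTH or height % GOB_HEIGHT:
        raise ValueError("picture size is not a whole number of GOBs")
    return (width // GOB_WIDTH) * (height // GOB_HEIGHT)
```

Under these assumptions a GOB covers 176x48 luminance pixels and holds 33 macro blocks, so a CIF picture is made of 12 GOBs and a QCIF picture of 3.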

This grouping is used to specify information at each level of the hierarchy:

The result of this structure is that one needs to receive the information present in the frame header to decode the GOBs, as well as the information present in the GOB header to decode the MBs. Without precautions, this would mean that one has to receive all the packets that carry an image in order to properly decode its components. In fact, experience has shown that:

  1. It would be unrealistic to carry an image in a single packet: video images can sometimes be very large.

  2. GOB information typically fits in a packet. In fact, several GOBs can often be grouped in a packet.

Once we have taken the decision to correlate GOB synchronization and packetization, a number of decisions remain to be taken, subject to the following conditions:

  1. The algorithm should be easy to implement when packetizing the output stream of a hardware codec.

  2. The algorithm should not induce rendition delays -- we should not have to wait for a following packet to display an image.

  3. The algorithm should allow for efficient resynchronization in case of packet losses.

  4. It should be easy to depacketize the data stream and direct it to a hardware codec's input.

  5. When the hardware decoder operates at a fixed bit rate, one should be able to maintain synchronization, e.g. by adding padding bits when the packet arrival rate is slower than the bit rate.

The H.261 Huffman encoding includes a special "GOB start" pattern, composed of 15 zeroes followed by a single 1, that cannot be imitated by any other code words. That pattern marks the separation between two GOBs, and is in fact used as an indicator that the current GOB is terminated. The encoding also includes a stuffing pattern, composed of seven zeroes followed by four ones; that stuffing pattern can only be entered between the encoding of MBs, or just before the GOB separator.
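Because the GOB start pattern can begin at any bit position, a packetizer has to search for it bit by bit rather than octet by octet. The following Python sketch illustrates such a search on a buffer of octets. It assumes that bits are read most significant bit first within each octet; the function names are hypothetical and the string-based approach is chosen for clarity, not speed.

```python
GOB_START = "0" * 15 + "1"   # 15 zeroes followed by a single 1


def to_bits(data):
    """Expand octets into a string of '0'/'1' characters, MSB first."""
    return "".join(format(octet, "08b") for octet in data)


def find_gob_starts(data):
    """Return the bit offsets at which a GOB start pattern begins."""
    bits = to_bits(data)
    offsets = []
    pos = bits.find(GOB_START)
    while pos != -1:
        offsets.append(pos)
        # The pattern cannot overlap itself, so skip past the whole match.
        pos = bits.find(GOB_START, pos + len(GOB_START))
    return offsets
```

For example, `find_gob_starts(bytes([0xF0, 0x00, 0x1F]))` reports a pattern starting at bit offset 4, which shows that the separator need not be aligned on an octet boundary.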

The first conclusion of the analysis is that the packets should contain all the GOB data, including the "GOB start" pattern that separates the current GOB from the next one. Since this pattern is well known, however, we could equally use a single bit in the data header to indicate that a GOB-start pattern must be added at the decoder side.

Not encoding the GOB-start pattern has two advantages:

Another problem posed by the specifics of the H.261 compression is that the GOB data have no particular reason to fit in an integer number of octets. The data header will thus contain two three-bit integers, EBIT and SBIT:

Although only the EBIT counter would really be needed for software coders, the SBIT counter was inserted to ease the packetization of the output of hardware coders. A sample packetization procedure is found in annex A.
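The role of the two counters can be illustrated with a minimal Python sketch. The definitions of SBIT and EBIT are not reproduced in this section, so the sketch assumes that SBIT is the number of leading bits to ignore in the first data octet and EBIT the number of trailing bits to ignore in the last data octet. The function names are hypothetical, and this is not the procedure of annex A.

```python
def packetize_bits(stream, first_bit, last_bit):
    """Cut bits [first_bit, last_bit) out of an octet stream.

    Returns (sbit, ebit, payload). The payload holds the whole octets that
    cover the range; sbit is the number of leading bits to ignore in the
    first octet and ebit the number of trailing bits to ignore in the last
    (assumed meanings, see the text above).
    """
    if not 0 <= first_bit < last_bit <= 8 * len(stream):
        raise ValueError("bit range outside the stream")
    first_octet = first_bit // 8
    last_octet = (last_bit + 7) // 8      # one past the last octet used
    sbit = first_bit % 8
    ebit = (8 - last_bit % 8) % 8
    return sbit, ebit, bytes(stream[first_octet:last_octet])


def extract_bits(sbit, ebit, payload):
    """Inverse operation: recover the significant bits as a '0'/'1' string."""
    bits = "".join(format(octet, "08b") for octet in payload)
    return bits[sbit:len(bits) - ebit]
```

Both counters always fall in the range 0 to 7, which is why three bits suffice for each. A hardware coder's output can be cut at GOB boundaries without shifting any bits: the packetizer copies whole octets and records how many bits at each end belong to neighbouring GOBs.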

At the receiving sites, GOB synchronization can be used in conjunction with the synchronization service of the RTP protocol. In case of losses, the decoders could become desynchronized. The "S" bit of the H.261 option field will be set to indicate that the packet includes the beginning of the encoding of a GOB, i.e. the quantifier common to all macro blocks. The receiver will detect losses by looking at the RTP sequence numbers. It is recommended that the receiver resequence out-of-order packets in order to limit the effect of packet loss: some misordering of packets in the network seems likely, even when there is no loss, and one would not want to drop a frame because of that. In case of losses, the receiver will ignore all packets whose "S" bit is not set. Once a packet with the S bit set has been received, the receiver will prepend the GOB start code to that packet and resume decoding.
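The recovery rule described above can be sketched as a small state machine in Python. The sketch assumes that packets have already been resequenced, that the RTP sequence number is a 16-bit counter that wraps around, and that the GOB start pattern is not carried in the payload but signalled by the S bit. The class and method names are hypothetical.

```python
GOB_START = "0" * 15 + "1"   # pattern re-inserted in front of each GOB


class GobResynchronizer:
    """Discards data after a loss until a packet with the S bit arrives."""

    def __init__(self):
        self.expected_seq = None   # next sequence number we expect
        self.in_sync = False       # False until an S-bit packet is seen

    def accept(self, seq, s_bit, bits):
        """Return the bits to hand to the decoder, or None to discard.

        seq   -- sequence number of the packet (assumed 16 bits wide)
        s_bit -- True if the packet begins with the start of a GOB
        bits  -- payload as a '0'/'1' string, without the GOB start pattern
        """
        lost = self.expected_seq is not None and seq != self.expected_seq
        self.expected_seq = (seq + 1) % 65536
        if lost:
            self.in_sync = False          # a gap: stop feeding the decoder
        if s_bit:
            self.in_sync = True
            return GOB_START + bits       # prepend the GOB start code
        return bits if self.in_sync else None
```

With this design a single lost packet costs at most the remainder of the GOB that was being decoded; decoding resumes at the next packet that carries the S bit.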

An example packetization program is given in the original of this Internet draft.

Mon Jan 10 18:40:57 GMT 1994