1. Introduction

Mbone audio and video tools provide relatively low levels of compression because they must be low-cost, software-based and compatible with a wide range of target platforms. This strategy has enabled multimedia conferencing technology to attain a critical mass of users over the Mbone (the multicast backbone over the Internet). Higher quality audio and video can, however, be provided at reasonable bandwidths if state-of-the-art compression algorithms are used. These, combined with the emergence of low-cost, high-performance PCs with usable multi-tasking operating systems, mean that high quality audio and video over SuperJANET could soon be accessible to a very large range of users and applications.

The Mbone presents a unique environment for communication because it supports large-scale multi-way communication in a dynamic, heterogeneous manner: multicast enables multi-way communication that scales, and flexible communication management (users can join and leave a multicast group at will). High quality coding algorithms that cater for this heterogeneous environment by being scalable and robust to problems such as packet loss will translate directly into a flexible solution for a variety of user/application requirements and cost choices.
By adapting state-of-the-art coding algorithms to the requirements of the Internet and Mbone, this project will evaluate the extent to which high quality video and audio can be successfully transferred over SuperJANET for a range of applications. It is important to note that the project will jointly address video and high quality audio coding, in particular wideband music coding, which is expected to have a significant impact on future services. In this context, the aims and objectives of this project are:

2. Application Areas

There are many applications where the provision of high quality audio and video (often in addition to lower quality communication facilities) is essential. This project will consider the following application areas.
Music-on-demand uses multimedia servers to store high quality audio and video. The service provides both quality music and supporting images or video. After the selection of a virtual CD, the remote user has direct control over the material. As well as track changing, fast forwarding and so on, a user will want to make cost choices about the relative quality of the individual media. For example, 48 kHz music might be used in conjunction with 10 fps CIF-quality video, or high quality stills of musicians. Lower quality music and video might prove attractive to other users, and revenue generation would benefit from users' ability to vary quality choices during play. For example, low quality audio and video play-back of excerpts from selected tracks might prove attractive as a browsing facility before purchase.

In music recording, a session musician is often unable or unprepared to travel to the studio. In such cases it has become common to use ISDN services to link with the remote musician. The link cannot be both real-time and high quality: typically the studio sends a moderate-quality current mix to the musician, who plays along to it, recording the contribution locally for subsequent off-line transmission back to the studio. Delays are acceptable for the return of the high quality individual instrument. Another current ISDN application is mix approval from a producer remote from the studio; premium quality is not essential and some latency can be tolerated. In both cases, early discussions with an ISDN music service provider, H2O Enterprises, indicate that the market would like an Internet alternative to ISDN and that a separate video feed would be an undoubted bonus. A typical ISDN bit-rate is 364 kb/s (stereo), though with greater compression, lower bit-rates are achievable for similar quality.

ISDN is also used for remote voice-overs for commercial radio. These are non-real-time and use lossless compression of 15 kHz bandwidth audio over 64 or 128 kb/s channels. With high quality lossy coding (e.g. MPEG-1 Layer 3) the quality could be maintained, with the bonuses of lower cost over the Internet and real-time performance. Again, a video service would be a distinct bonus.

Project Coven (collaborators UCL, Division Ltd, etc.) trials a virtual travel agency, where users watch a slide show with audio commentary before investigating the remote location in a virtual world. Audio and video are currently played from local disk, and the application has performance problems during play-back. Coven already uses RAT for audio communication, and is very interested in pre-recorded, networked high quality video/audio with synchronised play-back using RAT and vic. MANICOREL, another virtual reality project using RAT and vic, is also willing to trial the results of this programme.
As a distance learning application we will interact with Project ReLaTe (Remote Language Teaching - managed by Exeter, with collaborators at UCL), which pilots distance language learning activities. ReLaTe uses RAT and vic to provide audio and video for interactive communication, and as play-back vehicles for pre-recorded material. Material is stored on UCL's multimedia server, but is currently limited to toll quality speech with low frame rate video. ReLaTe is interested in, and willing to provide feedback on, the use of high quality audio and video in distance learning.
 

3. Background and State of the Art

3.1 IP Multicast and Typical Mbone Performance

The Mbone is a backbone over some of the high speed parts of the Internet, where datagrams of varying length are individually routed at network nodes. A multicast address causes Mbone routers to set up a source distribution tree to interested receivers, which provides scalable multi-way communication. Routers suffer from both temporary and persistent congestion, both of which result in packet loss. Loss generally exhibits a random pattern, although router anomalies also lead to regular loss bursts.
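The contrast between random loss and regular loss bursts can be illustrated with a simple two-state (Gilbert) loss model. This is an illustrative sketch only; the parameter values are hypothetical and are not measured Mbone statistics:

```python
import random

def gilbert_losses(n, p_loss=0.05, p_stay_lost=0.3, seed=1):
    """Simulate packet loss with a two-state Gilbert model.

    p_loss      - probability of entering the 'lost' state from 'good'
    p_stay_lost - probability of remaining in the 'lost' state once there;
                  this controls burstiness (0 gives purely random loss).
    Returns a list of booleans, where True means the packet was lost.
    """
    rng = random.Random(seed)
    lost_state = False
    losses = []
    for _ in range(n):
        if lost_state:
            lost_state = rng.random() < p_stay_lost
        else:
            lost_state = rng.random() < p_loss
        losses.append(lost_state)
    return losses

losses = gilbert_losses(10000)
rate = sum(losses) / len(losses)
```

Raising `p_stay_lost` lengthens the average burst without changing `p_loss`, which is why burst loss defeats simple single-packet repair schemes even at modest overall loss rates.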

The current FIFO routing strategy will be replaced in the near term by mechanisms such as RED (random early detection), which provides early indication of congestion to individual flows, and WFQ (weighted fair queueing) or CBQ (class-based queueing), which attempt to prevent badly behaved traffic from hogging output bandwidth and to minimise queueing delay for delay-sensitive traffic.

Within the duration of a 2 year project, RSVP (the Resource Reservation Protocol) will be deployed in the Mbone to guarantee QoS levels to multicast receivers. Rather than a single QoS being used for traffic such as audio or video, a range of QoS levels will be used by receivers. This means that a range of delay and packet loss conditions will persist within a single multicast group; the acceptability of this model is demonstrated by the explosion in Internet telephony - users are prepared to accept different conditions depending upon cost. Different users in a multicast group will make different cost/performance decisions.

3.2 Existing Tools

RAT and vic provide audio and video coding tools for a wide range of SuperJANET applications. They represent stable research platforms on which to build high quality video and audio demonstrations for a range of different applications. UCL produces RAT, and has undertaken development work within vic.

RAT has been developed to provide packet loss robustness for large multicast multimedia conferences, and is currently used successfully by a variety of SuperJANET pilots. RAT currently provides toll quality speech. It provides robustness both to packet loss (through redundancy) and to the scheduling problems which occur on general purpose operating systems through lack of real-time support. An adaptive bandwidth management strategy (Performance Optimised Multicast - POM) is currently being developed, which uses multiple levels of redundancy (transient loss protection) and receiver-driven layered multicast (a stream is split across a number of multicast groups, and receivers can join a subset). A hierarchical speech codec is also being developed under RAT to provide speech communication in the 0-7 kHz range.
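The redundancy mechanism can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming a one-packet redundancy offset; RAT's actual packet format and redundancy scheduling differ:

```python
def add_redundancy(frames, red_encode):
    """Pair each primary audio frame with a redundant (typically
    lower-quality) copy of the previous frame, carried piggy-back
    in the same packet."""
    packets = []
    prev = None
    for frame in frames:
        packets.append((frame, red_encode(prev) if prev is not None else None))
        prev = frame
    return packets

def recover(packets, received):
    """Reconstruct the frame sequence after loss: a frame lost in
    packet i can be repaired (at reduced quality) from the redundant
    copy carried in packet i+1, if that packet arrived."""
    n = len(packets)
    out = [None] * n
    for i in range(n):
        if received[i]:
            out[i] = packets[i][0]
        elif i + 1 < n and received[i + 1] and packets[i + 1][1] is not None:
            out[i] = packets[i + 1][1]  # repaired from redundancy
    return out
```

Note that a single redundant copy repairs isolated losses only; two consecutive lost packets still leave a gap, which is why POM's multiple levels of redundancy target transient loss bursts.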

vic has been developed by the Lawrence Berkeley Laboratory to provide Mbone video communications. vic uses a variety of coding algorithms to provide CIF or QCIF images at frame rates from 0 to 25 frames per second using (amongst other schemes) H.261 (inter-frames are not sent, since their use results in significantly degraded video quality during packet loss). vic does not currently employ motion compensation, and bit-rates of approximately 200 kb/s (up to 20 fps) are the limit of operation on a Sparc Ultra (bus to video card). A simple scalable wavelet codec has already been researched by the authors of vic, but this is not part of the current release. vic also provides voice activity switching of video at the receiver, in order to save screen real estate and to improve received quality.
Music via file transfer is currently provided over the Internet by RealAudio. RealAudio is believed to use AC-3, although little has been published to confirm this. RealAudio also employs some loss repair techniques, and TCP/IP provides automatic back-off in the presence of loss.

3.3 Requirements for Internet Audio and Video Coding

To be considered high quality, a music audio signal requires a bare minimum bandwidth of 12 kHz (with some high frequency loss). More realistically this should be 15 kHz, the bandwidth used for FM radio transmission, rising to a more conventional bandwidth of >20 kHz. This gives a sampling frequency between 25 kHz and the professional standard of 48 kHz. The perceived signal-to-noise ratio (SNR) must be equivalent to that obtained from CD technology: ~100 dB. There are, however, some scenarios of interest to Internet audio where versions of a signal might be transmitted below this quality threshold.

Good music coders now exist which provide quality at the levels sought (MPEG-1 and -2, Dolby AC-2 and AC-3, ASPEC, etc.), all based on time-frequency modelling of the signal. Depending on the complexity of the encoding process, transparent (the SNR criterion is met with no perceptible high frequency loss) or near-transparent quality is obtained at bit-rates from 64 kb/s up to 192 kb/s per channel. Current work, especially in MPEG-2 AAC and MPEG-4, intends to drive this rate down to 32 kb/s or less. A figure of 8-16 kb/s has been quoted as a target, but is unlikely to be reached in the next few years. A codec recently developed at King’s College is based on MPEG models, but uses a Wavelet Packet Transform in place of polyphase filtering or the MDCT. Experiments have already shown that, at the low bit-rates needed for the Internet, Wavelets demonstrate good subjective and objective performance advantages over MPEG equivalents. Furthermore, the complexity of the coder can be tailored by changing the Wavelet basis, and can be considerably lower than MPEG. This is an essential feature if codecs are to be software-based and platform-independent.

Perceived video quality is highly dependent on application and sequence content. For example, a conferencing facility would normally require only head-and-shoulders views with a fixed background, whereas a sports scene might include zoom, pan and high levels of complex motion which are more difficult to code. Head-and-shoulders material can be coded at sub-QCIF resolution at 10 kb/s for mobile applications with moderate (some might say poor) quality, whereas broadcast applications using MPEG-2 require 3-5 Mb/s to achieve a quality comparable to VHS. Typical ISDN-based video conferencing services usually employ H.261 at 128 kb/s or higher. Improved codecs for lower bit-rates are now emerging from research laboratories and through standardisation processes such as H.263L and MPEG-4.

This project will address target video bit-rates between 64 and 512 kb/s, comparing standard DCT-based codecs with more recent innovations developed at Bristol (including a morphological segmentation codec and an enhanced Embedded Zerotree Wavelet coder). As with audio, it is important to demonstrate the performance limits achievable with a software codec, while also evaluating the potential of more complex, hardware-assisted solutions.

3.4 Scalability and Robustness

Scalable coders are well suited to the changing bandwidth allocation needed for Internet audio and video transmission. A scalable coder is one which provides both a coarsely coded bitstream and at least one finer resolution stream used to add detail (typically SNR or signal bandwidth). As network traffic increases, only the coarse code continues to be transmitted. Scalable approaches will integrate well with the use of RSVP (the Resource Reservation Protocol) over the Mbone and with different charging models. They are also cost-effective when separate multicast groups are used for communication within a heterogeneous group of participants. Suitable novel scalable coders offering enhanced compression performance already exist at King’s and Bristol, though these will need modification for datagram network use and integration into RAT and vic.
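The base-plus-refinement principle can be sketched as a toy SNR-scalable split on integer samples. This is an illustration of the layering idea only, not the King’s or Bristol codecs:

```python
def split_layers(samples, coarse_step=16):
    """Split integer samples into a coarse base layer (heavy
    quantisation) plus a refinement layer carrying the residual:
    SNR scalability in miniature."""
    base = [(s // coarse_step) * coarse_step for s in samples]
    refine = [s - b for s, b in zip(samples, base)]
    return base, refine

def reconstruct(base, refine=None):
    """Decode from the base layer alone (network congested, or
    receiver joined only the base multicast group), or from base
    plus refinement for full quality."""
    if refine is None:
        return list(base)
    return [b + r for b, r in zip(base, refine)]
```

A receiver-driven layered multicast scheme would send `base` and `refine` on separate multicast groups, letting each receiver join only as many layers as its cost and bandwidth choices allow.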

Of primary importance for high quality is the need for error resilience (in the presence of packet loss) in the bitstream. Current approaches favour the incorporation of signal redundancy through packet duplication at full or reduced resolution. This project will examine restructuring the bitstream (based on knowledge of packet loss statistics) so that a packet loss (effectively a large burst error) appears at the decoder as a more uniform distribution of bit errors. These are more amenable to error correction and concealment, using techniques already developed for multimedia transmission over radio channels.

Interleaving across packets (packet striping) will be employed to shuffle the bitstream prior to transmission, so that consecutive samples in the stream are not consecutive in the original signal. If this is done over multiple packets, the effects of any lost packet will be spread and more easily corrected. This impacts latency, and the technique will be examined both for low latency applications (needing interleaving over limited timescales), such as teleconferencing, and for high latency applications (less restricted interleaving), such as audio/video on demand. Interleaving will be combined with coding methods already proven in wireless applications, such as the error resilient entropy code (EREC) and pyramid vector quantisation (PVQ). This approach releases some of the bandwidth otherwise occupied by redundancy, which then becomes available for the enhancement of signal quality. The incorporation of layered channel coding, or of reduced amounts of redundancy, to complement these approaches will also be investigated. Such techniques are seen as complementary to RSVP, which guarantees a known level of quality of service.
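The packet striping idea can be sketched as follows. This is a minimal illustration with hypothetical function names; real packet sizes and the EREC bitstream structure are considerably more involved:

```python
def stripe(samples, n_packets):
    """Packet striping: sample i goes to packet i % n_packets, so
    consecutive samples never share a packet."""
    return [samples[i::n_packets] for i in range(n_packets)]

def unstripe(packets, lost=()):
    """Reassemble the stream; a lost packet leaves isolated gaps
    (every n_packets-th sample) which a decoder could conceal by
    interpolating from the surviving neighbours, rather than one
    long unrecoverable run."""
    k = len(packets)
    out = [None] * sum(len(p) for p in packets)
    for i, p in enumerate(packets):
        if i in lost:
            continue
        for j, s in enumerate(p):
            out[j * k + i] = s
    return out
```

Losing one of three packets here removes every third sample, which is exactly the burst-to-dispersed-error conversion described above; the cost is that the sender must buffer `n_packets` worth of data, which is the latency trade-off the project will examine.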
 

4. Work Programme

The work in this project will address both low latency and high latency applications of high quality audio and video. We propose to collaborate with existing user groups who employ RAT and vic to obtain feedback about the use of scalable high quality audio and video (see attached letters of support). We will also address the new high quality music applications outlined in section 2. The programme will be of 2 years' duration, with each site contributing 18 person-months of effort over this period. The project is split into the following work packages.

WP1 Baseline Engineering

Objectives: To modify RAT and vic in readiness for integration of the new codecs developed in WP2, 3, 5 and 6.
Description: Prior to the integration of high quality codecs within RAT and vic, a number of architectural issues need to be resolved. vic will be provided with a receive buffer to eliminate the effects of jitter from the rendering of video frames. RAT will be rebuilt to manipulate variable-sized audio packets and to provide stereo operation. Inter-media synchronisation facilities must be provided, using an intermediary agent to negotiate play-out delays. Audio activity control of video bandwidth will be incorporated to enable higher quality video reception in large multi-way conferences.
Deliverables: Release of RAT to all application piloting projects; integration of video improvements into the vic release schedule.
Effort per site: 5 person-months at UCL, 1 person-month at Bristol, 1 person-month at KCL

WP2 Loss Resilient Video Coding

Objective: to demonstrate the performance improvements possible with loss resilient coding methods.
Description: Existing wavelet and DCT codecs will be adapted to use interleaving and EREC. A range of interleaving orders will be evaluated, together with an evaluation of the best packet and slot sizes for optimising EREC performance. The higher the order of interleave, the higher the order of interpolation that may be used in the decoder, and thus the higher the accuracy of reconstructed samples. However, high order interleaving implies higher latency, and may be constrained by the application. Most current Internet codecs do not employ motion compensation because of error propagation caused by lost packets. Since the objective of EREC is to reduce this problem, a full evaluation of a motion compensated codec will be performed and compared to existing solutions. Testing will be done with standard video sequences and simulated loss conditions.
Deliverables: software modules for integration into vic; internal report on capabilities and performance under various simulated loss conditions and with different interleaving parameters.
Effort per site: 6 person-months at UoB

WP3 Loss Resilient Audio Coding

Objective: to produce a Wavelet based codec (mono and stereo) incorporating loss resilience.
Description: an existing Wavelet-based codec will be adapted to use interleaving and EREC; a range of interleaving orders will be implemented; and the codec will be tested in isolation on a range of workstations. The interleaving of signals that have been transformed into the wavelet domain must be addressed: for example, it is not only lost wavelet coefficients that must be interpolated, but also scale factors, etc. Where possible within the timescales, codecs of different complexity will be constructed to run in real time on a range of workstations. Although MPEG-2, MPEG-4 and Dolby AC-3 allow for more than two channels of audio, this project proposes only to examine mono and stereo coding.
Deliverables: software modules for integration into RAT; internal report on capabilities and performance under various simulated loss conditions and different interleaving parameters.
Effort per site: 6 person-months at KCL

WP4 Integrated Loss Resilient Audio and Video Coding

Objective: to integrate the audio and video codecs from WP2 and 3 into RAT and vic respectively.
Description: Processing power profiling of the high quality codecs will be performed in isolation from RAT and vic. The new codecs will be integrated into RAT and vic, and their performance assessed within the tools using artificially congested mini-routers. Subjective performance assessment will be achieved using MOS audio and video quality ratings based on controlled experiments. The operation of the high quality audio and video codecs will also be evaluated in the context of UCL's multimedia server.
Deliverables: Internal Report on the performance of loss resilient audio and video coding, Release of new RAT version, Integration of video changes into vic releases.
Effort per site: 5 person-months at UCL, 1 person-month at KCL, 2 person-months at Bristol.

WP5 Scalable Video

Objective: to modify, and subsequently evaluate, an existing wavelet codec compatible with Internet scalability requirements.
Description: the loss resilient methods produced under WP2 will be added to Bristol’s existing zerotree wavelet codec. The codec will be enhanced to incorporate a sub-pixel, overlapping-block motion estimation scheme. Particular attention will be paid to mapping the scaling procedure to the requirements of Performance Optimised Multicast (POM) and to RSVP and RED. The system will be evaluated using standard video sequences and simulated congestion conditions.
Deliverables: software modules for integration into vic; internal report on capabilities and performance under various simulated loss conditions and different interleaving parameters.
Effort per site: 6 person-months at Bristol

WP6 Scalable Audio

Objective: to produce a wavelet based scalable stereo audio codec incorporating loss resilience using both EREC and interleaving.
Description: the codec produced under WP3 will be modified to incorporate SNR and bandwidth scalability.  Following the approach of WP3 an investigation of different amounts of protection and interleaving for different scale layers will be made.
Deliverables: software modules for integration into RAT; internal report on capabilities and performance under various simulated loss conditions and different interleaving parameters.
Effort per site: 6 person-months at KCL

WP7 Integrated Scalable Audio and Video Coding

Objective: to integrate the robust scalable codecs from WP5 and 6 into RAT and vic and evaluate their performance.
Description: the codecs produced under WP5 and 6 will be integrated into RAT and vic. Receiver-driven layered multicast control for both RAT and vic will be via the congestion control mechanism developed for RAT. Statistics reporting mechanisms in RAT and vic will be enhanced to produce detailed information to support quality assessment. Other issues to be addressed include: i) the application-dependent perceptual trade-offs in quality between audio and video, and how these can be used to influence dynamic bit allocation; ii) improved synchronisation of audio and video; and iii) activity control of bandwidth, such as voice switching of video content, to keep within changing cost envelopes and reduce network traffic. Trials will be undertaken for both latency insensitive (e.g. recorded sources) and latency sensitive (interactive) sessions, and the results used to produce recommendations for interleaving and scaling strategies. Although real-time trials will be based on software codecs in RAT and vic, more complex algorithms (for example, motion compensated encoding) will be evaluated in non-real-time trials.
Deliverables: Demonstration of quality vs cost/ bandwidth trade-offs for typical audio visual information.
Effort per site: 5 person-months at UCL, 1 person-month at KCL, 2 person-months at Bristol.

WP8 Application Trials

Objective: to trial and assess the integrated tools from WP7 on a range of applications with varying audio and video quality requirements, and varying encoder and decoder complexities. Some applications require real-time performance only for decoding.
Description: the tools produced under WP7 will be trialled on the applications outlined in section 2. Broadcasts of music/video over SuperJANET will be made from the UCL multimedia server, using RAT and vic to obtain remote network performance statistics, together with solicited feedback from users. Projects Coven and ReLaTe will be exploited to provide an assessment of performance. In addition, a performance assessment of high quality video and audio clips using RSVP could be provided (if supported) by project proposal HIPMUCID. Virtual CD trials will be developed primarily between Bristol and King's, and the ISDN-replacement audio application trials will be undertaken with H2O Enterprises.
Deliverables: Report on subjective quality assessment and user feedback.
Effort per site: 3 person-months at UCL, 3 person-months at KCL, 1 person-month at Bristol
 

APPENDIX: Track Record of Applicants

The three partners in this consortium each bring specialised expertise to the team forming a complete mix of audio and music coding (KCL), video coding and error resilience (UoB) and audio and video transmission over packet networks, particularly IP and multicast (UCL). Each partner has significant funding from EPSRC and industry.

King’s has two EPSRC grants on audio coding, one in wavelets primarily for scalable music coding and one in speech coding looking at high compression for wideband speech. There is also industrial funding for Wavelet coding research and long-standing expertise in audio and music technology with a variety of funding.

The Image Communications Group at Bristol has numerous industry-collaborative projects relating to image and video coding at rates from below 20 kb/s up to broadcast rates. It currently has 4 EPSRC grants in these areas, including two relating to scalable, robust video and image coding for integrated fixed and wireless networks.

The multicast multimedia group at UCL has two EPSRC grants specifically on audio (and its interaction with video) over IP networks (Projects RAT and MEDAL), and numerous other relevant projects such as MERCI, ReLaTe etc. Vicky Hardman is the prime system architect of RAT, and leads the multicast audio research group at UCL.

King’s and Bristol already collaborate on scalable Wavelet coders for joint audio and video coding, and King’s and UCL collaborate on Wavelet audio coders for the Mbone. Bristol and UCL are also members of the Virtual Centre of Excellence in Digital Broadcasting and Multimedia Technology.