Multicast Audio: The Next Generation
Colin Perkins (email@example.com)
Vicky Hardman (firstname.lastname@example.org)
Isidor Kouvelas (email@example.com)
Martina Angela Sasse (firstname.lastname@example.org)
Department of Computer Science
University College London
London WC1E 6BT
The paper summarizes recent technical developments in second-generation
multicast audio tools (packet loss protection and repair, improved
scheduling and cross media synchronisation) and the resulting improvements
in terms of performance and usability. More sophisticated applications of
Internet multimedia conferencing will, however, require further
improvements in terms of both reliability and quality. We outline a number
of technical solutions (improved redundancy and repair, higher-quality
audio, heterogeneous distribution and improved acoustic effects) required
to achieve these improvements which should therefore be incorporated into the
next generation of multicast audio tools.
In the past few years, multimedia conferencing systems have become
increasingly popular: from expensive video-conferencing suites used by
executives for special occasions, the technology has moved onto user's
desktops for regular use. With the growth in bandwidth, and improvement in
compression techniques, conferencing over the Internet is emerging as an
inexpensive alternative to ISDN based solutions. With the introduction of
IP multicast technology it is possible to conduct ad-hoc conferences
between large numbers of participants, at comparatively low cost. As a
result of this increased experience, a number of problems which were
apparent in older versions of multicast tools have been solved. At the same
time, a number of limitations have emerged, which currently reduce the
applicability of multicast tools.
The State of the Art
Limitations of Current Audio Tools
In this paper, we provide a brief overview of the state of the art in
multicast audio conferencing, together with a discussion of the recent work
undertaken at UCL leading to the current release of our second-generation
audio tool, RAT. This is followed by a discussion of the limitations of
these tools, as applied to a number of demanding application areas.
Finally, we highlight those areas where we believe further development work
In the past few years, many multicast audio conferencing tools have become
available. These tools typically provide an 8kHz audio sampling rate, with
the option of using several compression algorithms, to transmit data at bit
rates of between 66.8 kbits/second (PCM u-law) and 9.8 kbits/second (LPC),
including IP/UDP/RTP headers, although silence suppression may reduce this
further. The choice of audio compression algorithm affects the sound
quality, but many algorithms are available which provide quality comparable
to that of the telephone system, provided the underlying network is not
Audio quality is impaired by packet loss on the network, caused by
congestion, and by the lack of real-time support in general-purpose
operating systems. Acoustic aspects of packet network audio systems also
produce problems, since one channel of restricted bandwidth is not the
natural means of human audio communication. A number of enhancements have
been made to current audio tools, when compared to first generation tools,
which attempt to solve these limitations. These enhancements are now
discussed in turn:
The techniques described above allow second generation audio conferencing
tools, such as RAT, to perform significantly better than first generation
multicast audio tools. Experience within the MICE  and ReLaTe 
projects has shown that with the application of these techniques, it is
possible to conduct conferences over low-quality network links (it has been
demonstrated that audio is intelligible at up to 30% packet loss), and on
heavily loaded workstations (for example, when decoding multiple video
streams). In addition, the use of cross-media synchronisation provides a
first step towards more sophisticated, integrated, real-time conferencing
applications on the Internet. It is, however, clear that the quality of
even these second generation tools is insufficient for many applications,
and there are a number of areas where these tools may be further enhanced:
this is discussed further in the remainder of this paper.
- Packet Loss Protection
- The most noticeable problem with multicast audio conferencing tools is
that of packet-loss, caused by congestion in the networks.
Hardman et al.  suggested that low-bit-rate redundancy could
be used to improve the quality of multicast audio in the face of
packet loss. This proposal was first implemented in a multicast audio
tool called RAT (Robust-Audio Tool), developed at UCL , and has
been shown to significantly improve perceived audio quality. It has
since been incorporated into both the UCL/Exeter ReLaTe and INRIA
FreePhone systems, and is currently the subject of an IETF
standardization effort .
- Packet Repetition
- In addition to redundancy, the use of packet repetition and waveform
substitution can also improve perceived audio quality in the presence
of packet loss. Packet repetition is implemented in RAT to patch
single-packet loss (if redundancy is not used, or if the redundant
copy of a packet is also lost), and relies on the short-term
self-similarity of speech waveforms to provide improved audio quality
compared to silence substitution.
- Adaption to Workstation Scheduling Jitter
- The problem of network timing jitter, and subsequent requirement for
an adaptive playout buffer to be included in conferencing tools has
been known for many years . Experience has shown that most modern
operating systems provide poor support for real-time processes, and
this manifests itself as disrupted audio when a workstation is under
heavy load. Recent work  has illustrated a technique whereby a
workstation's audio device driver buffers may be used to adapt to
scheduling jitter, in a manner analogous to that required for adapting
to network jitter, resulting in improved performance under conditions
of heavy load.
- Cross-Media Synchronisation
- The use of the IETF standard Real-Time Transport Protocol (RTP) 
allows for the association of multiple media streams, and for the
mapping of media clocks between streams. This allows for lip
synchronisation between audio and video streams, and has been
demonstrated at UCL  as part of the ReLaTe project, using RAT and a
specially modified version of the vic video tool .
The enhancements described above are indicative of the progress which can
be made in the development of multimedia conferencing applications which
operate over `best-effort' packet networks, such as the Internet Mbone.
The study of network characteristics, exploitation of knowledge about human
perception, and application of improved computational techniques can lead
to considerable improvement in existing tools and the perceived quality of
Whilst these second generation audio tools are well suited to many
application areas, it is clear that a number of applications exist which
demand still higher performance, and additional features. Current work at
UCL is focusing on two such applications: distance learning, and the
integration of multicast audio with networked virtual reality.
One of the more demanding environments for the use of multicast audio is
distance learning, and in particular remote language teaching. Since 1995
the ReLaTe project  has studied the problems inherent in this, fueling
the development of features such as packet-loss robustness (through the
application of redundancy), and cross-media synchronisation. Extensive user
trials have been conducted as part of this project [18,19,20] and these
illustrate a number of points which, despite progress made to date, still
cause considerable annoyance to users:
- Manual Control
- In current audio tools, features such as redundant audio and packet
repetition are manually controlled: a user can configure the audio
tool to match the characteristics of a particular network. Often
however, the user does not have the necessary knowledge to perform
this task well, resulting in these features being under-used.
- Narrow-Band and Poor Quality Audio
- Whilst audio sampled at 8kHz is sufficient for communication, such
narrow-band speech does not sound natural, leading to user frustration
after prolonged exposure, and making language learning difficult.
- Lack of Distance Cues and Acoustic Feedback
- Current multicast audio tools provide little feedback to the user
regarding distance, loudness, etc., and provide only manual gain
control. Experience has shown that most speakers have difficulty in
judging how loud they are speaking, because the channel sounds
acoustically `dead'. This results in some speakers shouting, whilst
others are too quiet.
- Monaural Sound
- In a natural environment, listeners use sound localization as an aid
to speaker identification, and to focus on one sound in the presence
of other sound sources. When combined with narrow bandwidth, monaural
sound results in reduced speech intelligibility and restricts users
ability to identify and focus on individual speakers .
The ability of multicast conferencing tools to scale to large numbers of
participants, and to cope easily with groups comprising many senders, has
resulted in interest from the virtual-reality and distributed interactive
simulation communities. Not only is multicast data transfer well suited to
distribution of world-data for the use in these simulations, but audio
conferencing tools have immediate applicability in this environment. The
major limitation of current multicast audio tools when applied to this
environment is the use of narrow band, monaural sound. When used in an
immersive virtual environment, the use of sound localization is of
fundamental importance, and current multicast audio tools do not support
To summarize, it is clear that future development must focus primarily on
high-quality audio: wide-band, stereo, and the use of sound localization,
and acoustic feedback techniques are all important. Further, automatic
control of features such as packet repetition, waveform substitution and
redundancy is a must. The next section of this paper will describe
several impending developments in the field which attempt to solve these
In the following, we highlight a number of areas where current Internet
multicast audio conferencing applications can profitably be enhanced. In
some cases these enhancements will involve extensive research (for example,
efficient, scalable, multi-party control of redundancy), whilst in other
cases integration of existing work in other fields is sufficient.
Improved Control of Redundancy
The use of low-bit-rate redundancy as a means of overcoming the problem of
packet-loss is now well understood, and has been implemented in a number of
audio conferencing tools. However, the majority of those tools require the
user to manually enable redundancy and select the audio compression scheme
to be used. If the use of multiple redundant copies of a packet, each
compressed using one of a range of possible compression algorithms, is
allowed, it becomes difficult to manually select the appropriate combination
of redundancy and compression schemes to give optimal quality for a given
network condition. A clear improvement which may be made in this area is
the provision of automatic control mechanisms, which sense and adapt to
network conditions. Whilst some limited amount work has been conducted in
this area , it is clear that further work is required before a general
purpose, scalable, solution is found.
In addition, layered coding algorithms provide for a greater range of
transmission rates, allowing such a scheme to function more effectively.
Some work has been conducted with the combined use of layered encoding and
layered compression schemes, applied to video conferencing , and it
seems clear that some variation on this technique may also be profitably
applied to audio conferencing.
Improved Waveform Substitution/Packet Repetition
The use of waveform substitution and packet repetition schemes provides a
receiver-based solution to the problem of packet loss, which complements
the use of packet redundancy. The aim of such techniques is to generate, at
the receiver, a replacement for a lost packet, based on the contents of the
surrounding packets. This process is based on the observation that speech
waveforms often exhibit a degree of short-term self-similarity, and hence
generation of a replacement packet with similar spectral qualities to the
lost packet is possible, provided that packets are relatively short (at
above 30ms packet duration this technique breaks down).
Previous work [1,17] has shown that the use of simple packet-repetition
schemes can result in a significant perceived improvement in audio quality,
at relatively low loss rates, and it is clear that simple techniques such
as these have considerable potential for improving the perceived quality of
audio in the presence of packet loss. In addition, a number of waveform
substitution algorithms have been presented in the literature (see  for
example), which have been shown to perform better than simple packet
repetition: the incorporation of these techniques into multicast audio
conferencing tools would be a worthwhile future development.
Assuming the existence of a good quality network, existing multicast audio
tools provide toll-quality audio, and are typically optimized for speech
compression, performing poorly with other classes of sound (music, for
example). It is clear that such quality is not sufficient for certain
classes of application: for example, remote language teaching applications
would benefit from high-quality audio, as would music transmissions.
The provision of high-quality audio is a two-stage process: wide-band
sampling (16kHz, 32kHz) is required, and improved audio compression schemes
must be integrated into the multicast audio tools. The first of these is a
simple matter, and the real-time transport protocol (RTP) audio/video
profile , provides the facilities necessary for the transport of such
wide band audio.
The provision of improved audio compression schemes is more complex. A
number of compression schemes for high-quality audio have been defined, for
example MPEG-audio  but these are typically optimized for broadcast
media, and operate on large data packets. This results in increased coder
delay, when compared to toll-quality compression algorithms, and this delay
limits interactivity. Further, in a lossy packet network, loss of a single
large packet will have a more noticeable effect on audio quality than the
loss of a smaller packet (since a larger gap will occur). The development
of high-quality audio compression schemes which have low-delay and are
insensitive to packet loss is an area of active research.
Improved Acoustic Feedback
Improved acoustic feed-back can be provided by reverberation, which
simulates normal echos which follow closely after the original sound, and
which are a result of reflections off walls, etc. More simply, however,
improved acoustic feedback can be provided by a single echo after the
original sound, which is similar to side-tone  provided in normal
telephone hand-sets. Reverberation or side-tone should also be combined
with automatic gain control , to cope with impedance mismatch caused by
different head-sets, in order to maximize use of the input dynamic range.
Sound localization  is the artificial manipulation of a single input channel
to produce 3D spatial sound. The sound can be presented over headphones or
via loudspeakers, but the effect is better than stereo, since the sound
appears to emanate from a particular location (outside the listener's head
if using headphones). A low-cost approach to sound localization is being
developed for RAT , and the mechanism can be used to separate conference
participants out in space. This feature substantially eases speaker
identification, and improves speech intelligibility.
To summarize, it is clear that whilst existing, first generation, audio
tools are sufficient for many needs, there are a number of application
areas which require more advanced tools. For example, in the realm of
distance education and language teaching, it is clear that wide-band audio
and automatic, hands-free, control of features such as redundancy and
waveform substitution provide a much enhanced learning environment. When
coupled with 3D audio to distinguish sound sources, these features provide
a better environment for large audio-conferences by facilitating enhanced
participant identification. The use of 3D audio also fits naturally into
shared visualization systems, where users manipulate objects in 3D space.
The developments we envisage have applicability in three broad areas:
It is, therefore, seen that the improvements we propose will provide a
valuable addition to multicast-audio conferencing tools, and extend the
realm in which such tools are applicable into new areas.
- With the current use of manual configuration, users tend to select
tool control parameters at the start of a session, and continue using
these throughout, even if network conditions are such that these are
inappropriate. This may result in unnecessary network traffic, for no
gain. The use of automatic control of redundancy, packet repetition
and waveform substitution will allow applications to adapt faster to
network conditions, and reduce the amount of resource wastage. When
coupled with enhancements from other fields, such as for example, RTP
header compression , it will be possible to conduct conferences
over more heterogeneous networks, even including low bandwidth modem
- The ReLaTe project has demonstrated that many users find integrated
user interfaces easier to use than separate systems for audio, video
and shared workspace applications. In addition, virtual-reality and
distributed interactive simulation applications require a greater
degree of integration between audio and other media tools. The use of
wide-band, stereo, sound localization, and acoustic feedback techniques
will also improve the user experience.
- Throughout the above, the role of usability assessment should not be
underestimated. Extensive user trials have been conducted as part of
the ReLaTe remote language teaching project and these have had a
major influence on the development of the UCL Robust-Audio Tool (RAT).
The improvements suggested in this paper should go a long way to
improving the usability of multicast audio conferencing tools, but
future assessment will be required to evaluate the scale of these
The authors wish to thank Orion Hodson for his comments on an early draft
of this paper. In addition, this work has benefited from many discussions
with our colleagues at UCL: Mark Handley (now at ISI), Jon Crowcroft, Peter
Kirstein, and the numerous other members of the multimedia research group.
Work on the RAT and ReLaTe systems has been funded by the BT/JISC
SuperJANET applications project ReLaTe, EPSRC projects RAT (GR/K72780) and
MEDAL (GR/L06614), and the Commission of the European Communities
Telematics for Research project MERCI. Audio redundancy was developed in
consultation with partners of project MICE (ESPRIT 7602/6), in particular
Jean Bolot and colleagues at INRIA Sophia Antipolis, France.
 V. Hardman, M. A. Sasse, M. Handley & A. Watson,
Reliable Audio for Use over the Internet,
Proceedings INET'95, Honalulu, Hawaii, June 1995.
 V. Hardman, M. A. Sasse & I. Kouvelas,
Successful Multi-party Audio Communication over the Internet,
To appear in Communications of the ACM.
 J. Buckett, I. Campbell, T. J. Watson, M. A. Sasse, V. Hardman & A. Watson,
ReLaTe: Remote Language Teaching over SuperJANET,
Proceedings of UKERNA Networkshop 1995.
 I. Kouvelas, V. Hardman & A. Watson,
Lip Synchronisation for use over the Internet: Analysis and Implementation,
Proceedings IEEE GLOBECOM'96, November 1996.
 I. Kouvelas & V. Hardman,
Overcoming Workstation Scheduling Problems in a Real-Time Audio Tool,
USENIX Annual Technical Conference, January 1997.
 V. Hardman & M. Iken,
Enhanced Reality Audio in Interactive Networked Environments,
Proceedings of the Framework for Interactive Virtual Environments
conference, Pisa, Italy, December 1996.
 C. Perkins, I. Kouvelas, V. Hardman, M. Handley, J.-C. Bolot, A. Vega-Garcia, S. Fosse-Parisis,
RTP Payload for Redundant Audio Data,
IETF Audio/Video Transport Working Group, Work in progress, January 1997.
 H. Schulzrinne, S. Casner, R. Frederick & V. Jacobson,
RTP: A Transport Protocol for Real-Time Applications,
IETF Audio/Video Transport Working Group, RFC 1889, February 1996.
 S. McCanne & V. Jacobson,
vic: A flexible framework for packet video,
Proceedings ACM Multimedia'95, November 1995.
 R. Ramjee, J. Kurose, D. Towsley & H. Schulzrinne,
Adaptive Playout Mechanisms for Packetized Audio Applications in Wide-Area Networks,
 J.-C. Bolot & A. Vega-Garcia,
Control Mechanisms for Packet Audio in the Internet,
Proceedings IEEE INFOCOM'96, March 1996.
 S. McCanne, V. Jacobson & M. Vetterli,
Receiver-driven Layered Multicast,
Proceedings ACM SIGCOMM'96, August 1996.
 H. Schulzrinne,
RTP Profile for Audio and Video Conferences with Minimal Control,
IETF Audio/Video Transport Working Group, RFC 1890, February 1996.
 ISO MPEG 1991,
Coding of Moving Pictures and Associated Audio,
ISO/MPEG-90/176 Committee Draft Standard 1172, ISO Geneva, 1991.
 O. J. Wasem, D. J. Goodman, C. A. Dvorak & H. G. Page,
The Effect of Waveform Substitution on the Quality of PCM Packet Communications,
IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 26,
No. 3, March 1988.
 J. D. Johnston,
Transform Coding of Audio Signals Using Perceptual Noise Criteria,
IEEE Journal on Selected Areas in Communications, Vol. 6, No. 2,
 A. Watson,
Loss of audio information in multimedia video-conferencing: an
investigation into methods of assessing different means of
compensating for this loss,
MSc Thesis, Department of Ergonomics, University of London, 1994.
 J. Hughes & M. A. Sasse,
Internet Multimedia Conferencing - Results from the ReLaTe Project,
To be presented at ICDE World Conference, June 2-6, 1997, Pennsylvania State
 A. Watson & M. A. Sasse,
Assessing the Usability and Effectiveness of a Remote Language Teaching System,
Proceedings of ED-MEDIA 96 - World Conference on Educational Multimedia and Hypermedia,
June 17-22 1996, Boston, Mass. pp 685-690.
 A. Watson & M. A. Sasse,
Evaluating Audio and Video Quality in Low-Cost Multimedia Conferencing Systems,
Interacting with Computers, Vol. 8 (3), pp. 255-275, 1996.
 P. T. Kirstein, M. A. Sasse & M. J. Handley,
Recent Activities in the MICE Conferencing Project,
Proceedings INET'95, Honalulu, Hawaii, June 1995.
 S. Casner & V. Jacobson,
Compressing IP/UDP/RTP Headers for Low-Speed Serial Links,
IETF Audio/Video Transport Working Group, Work in progress, December
 E. C. Cherry,
Some experiments on the recognition of speech with one and two ears,
JASA, 25, 975-9, 1953.
 H. Sharifi,
An Investigation of Acoustic Enhancing Features for Use in Audio Tools,
MSc Thesis, Department of Computer Science, University of College London, 1996.
 F. L. Wightman & D. J. Kistler,
Headphone Simulation of Free-Field Listening. 1: Stimulus Synthesis,
Journal of the Acoustical Society of America, Vol. 85, No. 2, Pages
 D. L. Richards,
Telecommunication by Speech,
Butterworth & Co., UK, 1973.