Digital Amphitheatre
20 June 2000
The Digital Amphitheatre provides a shared virtual environment in which
one hundred people can conduct an online meeting. The user interface
mimics a lecture room, with background substitution replacing much of
the clutter inherent in typical conferencing environments to give a
greater sense of presence. The system is agent-based, with aggregation
points within the network combining video streams to lower the
end-system load.
Concept
The aim of our user interface is to create a digital meeting place, an
environment where participants in the meeting feel that they are
interacting with each other rather than operating a complex
teleconferencing system. We envisage an auditorium, with seating for
the audience and a
panel of speakers, much as one might find in a typical meeting or
seminar. To implement this on a flat display, we reflect the audience,
so that the participant sees a view from the stage showing their presence
with the other audience members, but we show the speaker and panelists as
if viewed from the audience. The figure illustrates the concept, with a
mock-up we used in our early design.
In this mock-up, the images of the participants have been processed to
remove their background. Each participant is seated in an amphitheatre
seat. The seating follows the rules of perspective, such that seats and
participants become smaller towards the back. The use of
background substitution and natural seating provides the illusion of
presence, and allows a large number of video images to be composited
whilst maintaining a visually pleasing aspect.
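The perspective seating described above can be sketched as a simple layout computation. The row count, seat size, and shrink factor below are illustrative assumptions, not values from the actual system:

```python
# Hypothetical sketch of the amphitheatre's perspective seating.
# Row 0 is the front row; each deeper row shrinks its seats by a
# constant factor, mimicking linear perspective on a flat display.

def seat_layout(rows, seats_per_row, front_seat_px=96, shrink=0.8):
    """Return (x, y, size) for every seat, with back rows smaller."""
    seats = []
    y = 0
    for row in range(rows):
        # Seats shrink geometrically towards the back of the hall.
        size = int(front_seat_px * shrink ** row)
        row_width = seats_per_row * size
        x0 = -row_width // 2  # centre each row horizontally
        for i in range(seats_per_row):
            seats.append((x0 + i * size, y, size))
        y -= size  # stack rows upwards on screen
    return seats

layout = seat_layout(rows=4, seats_per_row=10)
```

A receiver could then composite each participant's background-substituted video into the seat rectangle assigned to them, scaled to the seat's size.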
While the participants are scattered throughout the amphitheatre, the
speaker appears in the middle of the front row amongst other panel
members. The speaker occupies a relatively large video frame (possibly
with high frame rate) as do other panelists. Both speaker and panelists
have their names written in front of them, as they would in an actual
panel session.
There is no moderator, so it is possible for everyone to talk at the
same time, although the result would of course be an unintelligible
jumble of sound. Again, our model is based on real-life conferences,
where floor time is dictated by social norms. In a more futuristic
version of the Digital Amphitheatre, we foresee the software detecting
a raised hand as a cue for requesting floor time.
We envision a system capable of supporting several hundred, perhaps one
thousand, simultaneous interactive users. The benefits of such a system
are clear: large organizations can hold regular meetings involving all
levels of management without incurring high travel costs; long-distance
educational programs can meet as if within a lecture hall, with
students and lecturers joining from geographically disparate locations;
and it could host political and other debates.
System Architecture
Video teleconferencing among small groups of people is now
common, and is supported by a number of commercial and open-source
tools. However, large structured meetings on the scale that we are
envisioning have not yet been attempted. There are a number of reasons for
this: processing such a large number of video streams presents a
formidable challenge, both in the network and for the end-user
application, and display technology is often a limiting factor.
Processing hundreds of video streams can easily overwhelm most
workstations, in terms of bus access, interrupt processing, context
switching, packet handling and demultiplexing, decoding, display
processing and rendering.
Many current teleconferencing tools, especially research-oriented ones
such as the popular “Mbone conferencing” toolset, have been designed
with scaling properties in mind. However, their focus has been mainly
on attaining scalability via multicast, and thereby reducing network
load. This approach does not address the problem of the end-system
bottleneck, and in fact it aggravates it.
End-users generate video content in parallel, and this content moves
through the network in parallel, but once received at its destination
it must be processed by an inherently serial system. All the video
flows must be simultaneously reconstructed, decompressed and rendered,
creating a performance bottleneck in the end-system.
Given that the processing limitations of end-systems are the main
bottleneck and deterrent to very large scale video conferencing, what
are the possible solutions? Our experience shows that the brute-force
technique of simply using faster end-systems is not viable, as even
the fastest available workstations cannot keep up with hundreds of
video streams.
The implication is that we must distribute the processing, leveraging
the increased communication ability rather than drinking from the
firehose of the full set of input streams. Parts of processing must be
pushed into the network infrastructure, offloading functions from the
end-system to agents within the network. The question remains as to how
much and which parts of the process can be off-loaded from the end-system,
and exactly what are the tradeoffs involved.
Implementation Outline
To support a large number of video streams in the digital amphitheatre
we adopted an agent-based approach, distributing the processing required
to build the user interface throughout the network. There are several
parts to the system: background substitution
at the transmitter, spatial tiling agents within
the network, and user interface composition at the
receiver.
Each transmitter runs the background substitution algorithm on its own
video stream, replacing the actual background with a synthetic image
supplied during session initiation. Each audience member participates
by unicasting video to the closest tiling agent. The agent, in turn,
tiles together all the video streams it receives, and sends the
resulting stream to a multicast group. All participants join this
group, receiving and displaying the combined audience video. The
panelists and speaker send directly to the multicast group, thus
circumventing the tiling agents.
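The tiling agent's compositing step can be sketched as follows. This is a minimal stand-in that tiles equally-sized frames, represented here as plain lists of pixel rows, into a near-square grid; the real agent operates on the incoming video streams, and all details below are illustrative:

```python
import math

def tile_frames(frames, tile_w, tile_h):
    """Composite equally-sized frames into one grid image, row-major.
    A simplified sketch of the tiling agent's job: combine many
    incoming audience streams into a single outgoing stream."""
    n = len(frames)
    cols = math.ceil(math.sqrt(n))      # near-square grid
    grid_rows = math.ceil(n / cols)
    blank = [[0] * tile_w for _ in range(tile_h)]  # pad unused slots
    out = [[0] * (cols * tile_w) for _ in range(grid_rows * tile_h)]
    for idx in range(grid_rows * cols):
        frame = frames[idx] if idx < n else blank
        r, c = divmod(idx, cols)
        for y in range(tile_h):
            # Copy one row of the source frame into its grid slot.
            out[r * tile_h + y][c * tile_w:(c + 1) * tile_w] = frame[y]
    return out
```

Because the agent sends one combined stream instead of forwarding each source separately, the receivers decode a single flow per agent rather than one per participant.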
The receivers compose the tiled audience segments, speaker and
panelists into a single display. Audio is received directly via a
single multicast group, since it is expected that the audio rate will
be low (silence is suppressed, so there are typically only a small
number of active audio senders).
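As a rough illustration of why only a few audio senders are active at once, silence suppression can be approximated by a simple per-frame energy gate. The threshold and frame representation here are assumptions for the sketch, not the tools' actual algorithm:

```python
def active_speakers(frames, threshold=0.01):
    """Energy-based silence suppression sketch: only senders whose
    current audio frame exceeds the energy threshold transmit.
    `frames` maps a sender id to a list of audio samples in [-1, 1]."""
    def energy(samples):
        # Mean squared amplitude of the frame.
        return sum(s * s for s in samples) / len(samples)
    return [sid for sid, samples in frames.items()
            if energy(samples) > threshold]
```

In a large audience most frames fall below the threshold, so the aggregate audio rate stays low even with hundreds of potential senders.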
In addition to distributed processing based on media agents, control
protocols are needed to announce and set up the session, enabling
the participants to find the tiling agents and each other. The
session can be announced using SAP, SIP, a web page or even email.
The announced session has a single piece of information within it:
an anycast address, which should be contacted via SIP to obtain the
details needed to join the session.
On sending a SIP request to that anycast address, the routing system
will ensure the response comes from the closest member of the anycast
group. This will be a SIP server, co-located with a tiling agent,
which responds to the request and returns both the multicast group
used for the audience and the unicast address of the closest tiling
agent. An audience member then participates by sending video to the
tiling agent's unicast address, while the speaker and panelists send
theirs directly to the multicast group.
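The join flow can be summarised as a small sketch, assuming a hypothetical request helper and invented addresses; the real exchange uses SIP messages to the announced anycast address:

```python
# Illustrative sketch of session setup. `sip_request_fn` stands in for
# a SIP exchange with the announced anycast address; the reply fields
# and all addresses below are invented for the example.

def join_session(sip_request_fn, anycast_addr):
    """Contact the anycast address; the routing system delivers the
    request to the closest SIP server, which returns the audience
    multicast group and the nearest tiling agent's unicast address."""
    reply = sip_request_fn(anycast_addr)
    return reply["multicast_group"], reply["tiling_agent"]

def video_destination(role, multicast_group, tiling_agent):
    """Audience video goes to the tiling agent; the speaker and
    panelists bypass the agents and send to the multicast group."""
    return tiling_agent if role == "audience" else multicast_group
```

The anycast indirection means a joining user needs only the single announced address; locality to a tiling agent falls out of the routing system rather than any explicit discovery step.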
This architecture spreads the processing load throughout the network,
while maintaining a simple method of joining the session.
Publications
-
Ladan Gharai, Colin Perkins, Ron Riley, and Allison Mankin,
Large Scale Video Conferencing: A Digital Amphitheatre,
Proceedings of the 8th International Conference on Distributed Multimedia Systems,
San Francisco, CA, USA,
September 2002.
-
Ladan Gharai, Colin Perkins, and Allison Mankin,
Large Group Teleconferencing: Techniques and Considerations,
Proceedings of the 3rd International Conference on Internet Computing,
Las Vegas, NV, USA,
June 2002.
-
Ladan Gharai, Colin Perkins, and Allison Mankin,
Scaling Video Conferencing through Spatial Tiling,
Proceedings of the 11th International Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV 2001),
Port Jefferson, NY, USA,
June 2001.
DOI: 10.1145/378344.378364
- Allison Mankin, Ladan Gharai, Ron Riley, Maryann Perez Maher and Jaroslav Flidr,
The Design of a Digital Amphitheatre,
Proceedings of the 10th International Workshop on Network and Operating System Support for Digital Audio and Video (NOSSDAV 2000),
Chapel Hill, NC, USA,
June 2000.
Acknowledgements
This work was supported by the DARPA Information Processing
Technology Office. Any opinions, findings, conclusions or recommendations
expressed in this material are those of the authors and do not necessarily
reflect the views of the Defense Advanced Research Projects Agency, or the
United States Government.