
Digital Amphitheatre

The Digital Amphitheatre provides a shared virtual environment in which one hundred people can conduct an online meeting. The user interface mimics a lecture room, with background substitution replacing much of the clutter inherent in typical conferencing environments to give a greater sense of presence. The system is agent-based, with aggregation points within the network combining video streams to reduce the load on end-systems.

Concept

The aim of our user interface is to create a digital meeting place: an environment where participants feel that they are interacting with each other, rather than operating a complex teleconferencing system. We envisage an auditorium, with seating for the audience and a panel of speakers, much as one might find at a typical meeting or seminar. To implement this on a flat display, we mirror the audience, so that each participant sees a view from the stage showing themselves among the other audience members, but we show the speaker and panelists as if viewed from the audience. The figure illustrates the concept, with a mock-up we used in our early design.

[Figure: mock-up of the user interface]

In this mock-up, the images of the participants have been processed to remove their backgrounds. Each participant is seated in an amphitheatre seat. The seating follows the rules of perspective, so that seats and participants appear smaller towards the back of the room. The use of background substitution and natural seating provides the illusion of presence, and allows a large number of video images to be composited whilst maintaining a visually pleasing appearance.
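To make the background substitution concrete, the following is a minimal sketch in Python with NumPy. It is an illustration under simplifying assumptions rather than our actual algorithm: it keys each frame against a pre-captured image of the empty scene, and pastes the surviving foreground pixels onto the synthetic amphitheatre background. The threshold value and frame shapes are placeholders.

    import numpy as np

    def substitute_background(frame, reference, synthetic, threshold=30):
        # frame, reference, synthetic: uint8 RGB arrays of the same shape (H, W, 3).
        # reference is a pre-captured image of the empty scene; pixels close
        # to it are treated as background and replaced by the synthetic image.
        diff = np.abs(frame.astype(np.int16) - reference.astype(np.int16)).sum(axis=2)
        foreground = diff > threshold            # True where the participant is
        out = synthetic.copy()
        out[foreground] = frame[foreground]      # keep only participant pixels
        return out

A production matting algorithm would smooth the mask and adapt to lighting changes; the point here is simply where the substitution sits in the pipeline, at the sender, before any video is transmitted.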

While the participants are scattered throughout the amphitheatre, the speaker appears in the middle of the front row, amongst the other panel members. The speaker occupies a relatively large video frame (possibly with a higher frame rate), as do the other panelists. Both speaker and panelists have their names displayed in front of them, as they would at an actual panel session.

There is no moderator, so it is possible for everyone to talk at the same time, although the result would of course be an unintelligible jumble of sound. Again, our model is based on real-life conferences, where floor time is dictated by social norms. In a more futuristic version of the Digital Amphitheatre, we foresee the software detecting a raised hand as a cue for requesting floor time.

We envision a system capable of supporting several hundred, perhaps one thousand, simultaneous interactive users. The benefits of such a system are clear: large organizations can hold regular meetings involving all levels of management without incurring high travel costs; long-distance educational programs can meet as if in a lecture hall, with students and lecturers joining from geographically disparate locations; and political and other debates can be hosted in the same way.

System Architecture

Video teleconferencing among small groups of people is now common, and is supported by a number of commercial and open-source tools. However, large structured meetings on the scale we are envisioning have not yet been attempted. There are a number of reasons for this: processing such a large number of video streams presents a formidable challenge, both in the network and for the end-user application, and display technology is often a limiting factor. Processing hundreds of video streams can easily overwhelm most workstations, in terms of bus access, interrupt processing, context switching, packet handling and demultiplexing, decoding, display processing and rendering.

[Figure: parallel content can overwhelm a host]

Many current teleconferencing tools, especially research-oriented ones such as the popular “Mbone conferencing” toolset, have been designed with scaling properties in mind. However, their focus has mainly been on scaling via multicast, thereby reducing network load. This approach does not address the end-system bottleneck; in fact, it aggravates it. End-users generate video content in parallel, and this content moves through the network in parallel, but once received at its destination it must be processed by an inherently serial system. All the video flows must be simultaneously reconstructed, decompressed and rendered, creating a performance bottleneck in the end-system.

Given that the processing limitations of end-systems are the main bottleneck and deterrent to very large-scale video conferencing, what are the possible solutions? Our experience shows that the brute-force technique of using faster end-systems is not viable, as even the fastest available workstations cannot keep up with hundreds of video streams.
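A back-of-envelope calculation illustrates the scale of the problem; the stream count, resolution, and frame rate below are illustrative assumptions, not measurements:

    # Raw pixel throughput if a single receiver decoded every audience stream.
    streams       = 900        # audience members
    width, height = 176, 144   # QCIF, a typical conferencing resolution
    fps           = 10         # modest per-participant frame rate

    pixels_per_second = streams * width * height * fps
    print(f"{pixels_per_second / 1e6:.0f} Mpixel/s")   # roughly 228 Mpixel/s

And that figure counts only decoded pixels; packet handling, demultiplexing and rendering overheads come on top of it.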

The implication is that we must distribute the processing, leveraging the increased communication ability rather than drinking from the firehose of the full set of input streams. Parts of the processing must be pushed into the network infrastructure, offloading functions from the end-system to agents within the network. The question remains as to how much, and which parts, of the processing can be off-loaded from the end-system, and exactly what tradeoffs are involved.

Implementation Outline

[Figure: spatial tiling produces happiness]

To support a large number of video streams in the digital amphitheatre we adopted an agent-based approach, distributing the processing required to build the user interface throughout the network. There are several parts to the system: background substitution at the transmitter, spatial tiling agents within the network, and user interface composition at the receiver.

Each transmitter performs the background substitution algorithm on its own video stream, replacing the actual background with a synthetic image supplied during session initiation. Each audience member participates by unicasting video to the closest tiling agent. The agent, in turn, tiles together all the video streams it receives, and sends the resulting stream to a multicast group. All participants join this group, receiving and displaying the combined audience video. The panelists and speaker send directly to the multicast group, bypassing the tiling agents.
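The tiling operation itself is straightforward once frames have been decoded. The following sketch shows the core of a tiling agent in Python with NumPy, keeping the latest frame from each sender and compositing them into a fixed grid; networking, RTP handling, and codecs are omitted, and the tile and grid sizes are assumptions.

    import numpy as np

    TILE_W, TILE_H = 176, 144   # per-participant tile size (assumed)
    COLS, ROWS     = 5, 4       # 5x4 grid: 20 audience streams per agent (assumed)

    class TilingAgent:
        def __init__(self):
            # One slot per participant, initially mid-grey.
            blank = np.full((TILE_H, TILE_W, 3), 128, dtype=np.uint8)
            self.slots = [blank.copy() for _ in range(COLS * ROWS)]

        def update(self, slot, frame):
            # Store the newest decoded frame for one participant.
            assert frame.shape == (TILE_H, TILE_W, 3)
            self.slots[slot] = frame

        def composite(self):
            # Tile all slots into one large frame, ready to be re-encoded
            # and sent to the audience multicast group.
            rows = [np.hstack(self.slots[r * COLS:(r + 1) * COLS])
                    for r in range(ROWS)]
            return np.vstack(rows)

Because each agent handles only the fan-in from its nearby participants, the per-agent load stays bounded no matter how large the meeting grows.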

The receivers compose the tiled audience segments, speaker and panelists into a single display. Audio is received directly via a single multicast group, since the audio rate is expected to be low (silence is suppressed, so there are typically only a small number of active audio senders).
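Silence suppression is what keeps a single audio group workable. A minimal energy-gate sketch follows; the threshold is a placeholder, and real tools use adaptive thresholds and hangover periods to avoid clipping speech:

    import numpy as np

    SILENCE_THRESHOLD = 500.0   # RMS level below which nothing is sent (assumed)

    def should_send(samples):
        # samples: one 20 ms frame of int16 PCM audio. Frames that fail
        # the gate are simply not transmitted, so idle participants add
        # no traffic to the audio multicast group.
        rms = np.sqrt(np.mean(samples.astype(np.float64) ** 2))
        return rms > SILENCE_THRESHOLD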

In addition to distributed processing based on media agents, control protocols are needed to announce and set up the session, enabling the participants to find the tiling agents and each other. The session can be announced using SAP, SIP, a web page, or even email. The announced session contains a single piece of information: an anycast address, which should be contacted via SIP to obtain the details needed to join the session.

When a participant sends a SIP request to that anycast address, the routing system ensures the response comes from the closest member of the anycast group. This will be a SIP server, co-located with a tiling agent, which responds to the request and returns both the multicast group used for the audience and the unicast address of the closest tiling agent. A user can then participate, sending video to the tiling agent's unicast address as an audience member, or to the multicast group as a panelist.
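The join sequence thus reduces to a single request and reply. The sketch below stands in for the SIP exchange with a plain UDP message, purely to show the shape of the handshake; the anycast address, port, and reply format are all assumptions made for illustration.

    import socket

    # Hypothetical anycast address taken from the session announcement.
    ANYCAST_ADDR = ("192.0.2.1", 5060)

    def join_session():
        # In the real system this would be a SIP request/response; here a
        # one-line reply of the form "<multicast-group> <tiling-agent-ip>"
        # stands in for it.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(2.0)
            s.sendto(b"JOIN digital-amphitheatre", ANYCAST_ADDR)
            reply, _ = s.recvfrom(512)
        group, agent = reply.decode().split()
        # Audience members unicast their video to `agent`, the closest
        # tiling agent, and receive the composited audience video on `group`.
        return group, agent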

This architecture spreads the processing load throughout the network, while maintaining a simple method of joining the session.


Acknowledgements

This work was supported by the DARPA Information Processing Technology Office. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency, or the United States Government.