CLUE, 81st IETF, Quebec City, Canada

Date:                   Thursday, July 28, 2011

Location:            Quebec City, Canada

Chairs:               Mary Barnes, Paul Kyzivat

Note Takers:       Magnus Westerlund, Stephen Botzko, Marshall Eubanks

Minutes Editor:  Paul Kyzivat

Jabber Scribe:   Peter Saint-Andre

Recorded playback:

http://www.ietf.org/meeting/81/remote-participation.html#Meetecho

Agenda bash, Status and items of interest

Presenter:    Mary Barnes

Slides:          http://www.ietf.org/proceedings/81/slides/clue-2.pdf

Agenda bash: No discussion, ok

 

Definitions

Presenter:    Stephan Wenger

Slides:          http://www.ietf.org/proceedings/81/slides/clue-3.pptx

Summary of action items:

•   WG/Stephan: start a conversation on the list on a new definition for layout (re: Issue #2).

Conclusion:

The document will continue to be updated. We'll decide later which document should include the definitions (e.g., Requirements, Framework, or wherever).

Detailed Discussions:

Stephan?: Can we work without this admittedly crude tool?

Roni Even: Good tool to start. Still some definitions that need ironing out. Eventually this should go into the framework document.

Stephan?: Will do a revision and discuss the open issues.

Christer Holmberg: Just want to make sure we go through the open terminology issues

Charles Eckel: We don't need this doc on its own; just roll it into the other documents. Hopefully it will be rolled into only one.

James Polk: Are we talking about copying it out into all the documents?

Mary Barnes: Try to ensure that the terminology only goes into one document. Not be copied into all of them.

Stephan: Open issues #1 Left/Right

No one objected to interpreting them in context. Authors have to ensure that it is understandable when using the terms.

Action: accepted

Stephan: Open Issue #2 Layout

Layout is a render side only:

Christer Holmberg: Is this the receiver of the media? MCUs will render a layout in binaural.

Stephan Wenger: Rendering produces sound waves and photons, which MCUs don't do.

Roni Even: There are two aspects: Both within a screen and the physical relation between devices.

Brian Rosen: Thinks we need two terms, one for physical arrangement, one for models of ?

Christer: I agree with two terms.

Agreement to create two terms. 

Stephan: I will post something to the list.

Mary: Suggests the WG (folks in support of two terms) start a conversation on the list on a new definition.

Stephan: Open Issue #3 MCU

Magnus: Suggests starting with a central node that doesn't imply specific media processing.

Brian Rosen: ?

Eric Burger: Mixes the media under the control of the focus.

?: From devices that mix media to devices that don't …

Magnus: What about relays? Are we ensuring that we don't get media plane …

?: RFC4353 requires media from the mixer be sent to each participant. Is this an issue?

Some say yes, others no.

Eric: Very loose on the MCU definition. No reason for tightening it now.

Chairs: Unless text is proposed, keep the definition as it is.

Stephan: Open Issue #4 Media

?: Does the definition need to exclude FECC or other non-rendered RTP streams?

Action: leave as is, but will need more discussion on the list.

John Elwell: DTMF is also a media type.

Brian Rosen: Suggest that "timed" in "timed text" is removed.

Christer: If we talk about SDP, then this becomes a bit more complete.

Stephan: Should we include camera control

Christer: No, maybe, we likely need a wider term, like media plane or data plane.

Eric: Not too tight - we need something that allows for smell-vision.

Roni Even: ?

?: Are we excluding MSRP?

Brian Rosen: I want text chatting. What about adding "typically"?

Stephan: Wait for more input. Likely want to keep the definition with small modifications to allow for other media protocols, like MSRP.

 

Summary: Rendering Negotiation (Christer Holmberg)

Presenter:    Christer Holmberg

Slides:          http://www.ietf.org/proceedings/81/slides/clue-4.ppt

Conclusion:

Needs more discussion on the mailing list.

Detailed Discussions:

[This was a summary/report on what was discussed at the ad hoc meeting on Rendering Type Negotiation that took place Tuesday.]

Keith Drage: Be careful with using Wikipedia. What is meant by signaling in the definition?

Charles Eckel: What is the difference between rendering and composing?

Christer: Fine if we can use only one word.

Brian Rosen: I would prefer Rendering, as you can render a single stream, but not compose a single source.

Roni: Composing is not a good term.

Mark Duckworth: I thought Stephan had a good idea with sound waves, while composition and layout models better describe what is happening prior to that.

Christer: Just want to find terms that separate the cases.

John Elwell: A better question: how is this related to the framework?

Mary: Whatever we do needs to fit in.

Roni: There can be different composition algorithms.

Mark: I disagree - binaural is a format. The composition algorithm is how the sources actually are placed within the sound field.

Roni: Disagree with that. A composition algorithm is like 2-by-2 video composition.

Mark: Example 2 - the most active speaker slide: This is a good example of what I mean, that there are different levels of concepts.

Roni: Agrees with Mark, there are input selection algorithms, not composition choices.

Stephen Botzko: We need to agree what the things really mean before determining if the requirements are agreeable.

Mary: We need more discussion on the mailing list.

Details from the Tuesday ad hoc meeting:

Presenter:       Christer Holmberg

Note Taker:      Roni Even

Definition:

What is rendering – definition – no comments

Usage – offer and answer.

Use case – binaural audio as example

Stephan: Everyone knows that there are different algorithms for audio rendering; most are clear, but not for video (three screens). A registry is not enough without a definition of the syntax. There is no intuitive understanding of the algorithm.

In video the number of options is big.

Jonathan says we need not a registry but a syntax, like XML.

John: What is the offer - is it receive or support?

Christer: Maybe a capability, not offer/answer.

Mark: is this for central rendering?

Christer: the endpoint will do the rendering.

Mark: An MCU will need to negotiate between both sides.

Christer: yes

Steve: Need more use cases; how does it change mid-call?

Paul: The receiver may want to ask for a separate one.

Christer: Does not care whether it is advertising capabilities or the receiver asks for one.

Mark: it is in the framework – to advertise.

Stephan: Based on the framework draft we can do all you want, but the requirement here is more complicated. Two modes of audio rendering is an example that can grow into more complicated usages.

John: This is getting more asymmetric than offer/answer. It will need a separate description for each direction.

Christer: will update the presentation.

Mark: The framework handles audio format. Need to look at what is layout and what is rendering, and what audio streams you want in the rendering, so it has more than one dimension.

Christer: Need to clarify that this is what I want to have.

Mark: binaural is not a layout.

Charles: The framework has this but needs more information about how to do it. Complexity, asymmetry.

Mary: need detailed use cases for requirements.

Christer: Will clean up the presentation for tomorrow based on the feedback.

Mary: read the framework.

 

Requirements (Allyn Romanow)

Presenter:    Allyn Romanow

Slides:          http://www.ietf.org/proceedings/81/slides/clue-0.pptx

Summary of action items:

•   Reqmt-3a: consensus to leave as a MUST.

•   Reqmt-4/5: merge? (unresolved)

•   Reqmt-8: leave for now.

•   Reqmt-10: leave in.

•   Reqmt-13: leave in.

•   Reqmt-13a: delete.

•   Reqmt-14: defer. Stephan to draft definitions for segment and site switching.

•   Resubmit this document as a working group document.

Conclusion: 

The document was agreed as a WG document. It will be updated based on the above action items and submitted as a WG -00 document.

Detailed Discussions:

Reqmt-3a:

Stephan Wenger: IPR concerns

Eric Burger: ?

Roni: Must be able to do it, not must do it - optional to use it.

Stephan Wenger: A MUST in a requirement requires at least a MAY in the solution. That way one may avoid IPR from companies if we steer into it. Wants a SHOULD so that we can choose later not to.

Roni: Freedom of choice is good.

Eric Burger: +1

John E: We should state our intentions for the requirements.

Cullen Jennings: Requirements are informational and not binding.

Stephen B: Should leave it as it is; we really need this for telepresence. If we find an issue, deal with it later.

Christian?:  Include multiple mono streams.

Stephan: Some solutions include multiple mono streams.

Allyn: Rough consensus to leave it as a MUST.

Reqmt-3b:

Still needs to be deferred, as the layout discussion hasn't concluded.

Reqmt-4/5:

?: Merge Reqmt-4/5

Reqmt-7:

Marshall Eubanks: What is meant by actual size?

Stephen Botzko: Advertise what the capture sizes really are so a renderer can make intelligent choices.

Jonathan Lennox: If we delete Reqmt-7 then Reqmt-8 isn't covered anymore.

Stephen B: Do not really need Reqmt-8.

Roni: I don't want to hear that it isn't needed later, so please leave it.

Allyn: Consensus to leave Reqmt-8 for now.

Reqmt-10:

Magnus: Want to keep it, as bandwidth on different paths in a centralized-node media plane is going to be a reality.

Wenger: ?

Eckel: Reqmt-10 is too vague. Need it in all conferences.

Burger: A guide on how to build a good telepresence system.

Andrew Allen: What is the scope for the WG: only the work in CLUE, or a complete system?

Mary: Mostly the latter, but not all.

Marshall Eubanks: The goal is to build an interoperable system. Reqmt-10 is about interoperability.

Cullen: Agree with that.

John: We need to figure out what is needed, and then we may pick what already exists.

Mary: Hum about the requirement:

Action: hum taken - leave it in. No opposing view.

Reqmt-13:

Action: leave in.

Reqmt-13a:

Jonathan Lennox: How much control is there …

Decision to delete 13a was agreed.

Jonathan L has a somewhat different requirement to propose.

Reqmt-14:

Wenger: What is site switching?

Roni: Site switching is selecting all streams from a particular site, rather than a single camera. This goes back to enabling one to select the streams.

Wenger: Where is this described? (use cases). The requirement says that you need to support at least one of the methods. I find the requirements unnecessarily complex. They need to be improved.

Marshall Eubanks: Need to enable segment switching when a single media stream is changed, even within a sub-part of a composite screen.

Stephen B: We know we need to reword it, but there is no reason to do it before the layout discussion is done.

Action: Still deferred, current requirement is inadequate.

Action: Stephan to draft definitions for segment and site switching.

Reqmt-15:

Magnus: Unclear if the requirement includes protocol support for transferring the indicator downstream from the source node.

Stephen Botzko: If the audio stream is common for a room, one might need to indicate which of the 3 camera captures contains the active talker.

Lennox: If someone wants the requirement, it should be reworded to state the actual requirement, which is ?

Allyn: Delete the current requirement and invite new requirements that better cover this.

Adoption of document:

HUM to accept this as a working group document:

Action: Agreed to be a working group document, no opposing view.

 

Framework

Presenters:  Mark Duckworth, Andy Pepperell, Brian Baldino

Slides:          http://www.ietf.org/proceedings/81/slides/clue-1.pptx

Summary of action items:

•   Chairs: Schedule an interim meeting to complete discussion of the Framework - i.e., for Brian Baldino to do his presentation on examples.

Conclusion: 

WG to continue discussion of the framework on the mailing list.

 

Detailed Discussions:

First/Second Row discussion:  Microphone and video, may need extension to 2D.

Comments were made that some messages in encoding groups were in fact codec dependent.

Allyn:

What we are doing here - telepresence deals with multiple streams, while our standards deal with single streams. Challenges - we want something:

-       immediately usable (or at least relatively quickly)

-       extensible

-       and simple and practical to implement

The Framework clusters around 2 concepts:

-       media capture information that needs to be passed

-       how the provider figures out what streams to send

Process:

-       provider provides capabilities

-       consumer chooses from these

Optimization - before negotiation, the consumer may send info about itself to the provider, so the provider can tailor what it provides.

Properties

-       Media captures

-       Encode groups

-       Simultaneous transmission sets

I want to take a minute to set context - any proposed framework was going to be difficult to communicate. We thought we should do so in stages and start simple. We would like it if people would focus on the concepts first.

Mark Duckworth:

Media Capture and Attributes

A media capture is a source of audio or video media.

They can be:

-       media from a camera or a microphone (a capture device)

-       media from a combination of media devices

-       or, this could be done remotely.

A capture set is a way to group media captures that have some relationship.

Some attributes include things like:

-       is the video auto switched or composed?

-       is the audio mixed?

-       audio channel format (mono/stereo, …)

-       what is the spatial scale / image width on video

Attributes include a "purpose" - say, main versus presentation
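To make the attribute list concrete, here is a minimal sketch in Python (purely illustrative; the field names are hypothetical, not actual CLUE syntax) of a media capture carrying such attributes:

    from dataclasses import dataclass
    from enum import Enum

    class Purpose(Enum):
        MAIN = "main"
        PRESENTATION = "presentation"

    @dataclass
    class MediaCapture:
        capture_id: str
        media_type: str                 # "audio" or "video"
        auto_switched: bool = False     # video: switched rather than composed
        audio_mixed: bool = False       # audio: mixed or not
        channel_format: str = "mono"    # e.g. "mono", "stereo"
        image_width_m: float = 0.0      # spatial scale hint for video
        purpose: Purpose = Purpose.MAIN

    # Example: a fixed camera capture and a mixed stereo room audio capture.
    vc1 = MediaCapture("VC1", "video", image_width_m=1.2)
    ac1 = MediaCapture("AC1", "audio", audio_mixed=True, channel_format="stereo")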

I want to introduce a capture scene -

Imagine a given scene with people - cameras - camera views.

Types include:

-       one camera per screen

-       merging cameras in some fashion

-       switched based on voice with a composed PiP

-       etc.

Chris: you don't assume that the whole scene is always shown?

Mark: of course

Roni: What about other models? (lists some)

Mark: this is just one example.

A capture set is a representation of a group of video captures. It has N "rows." Each row is a set of media captures. Ordering within the rows is important - it's how the left-to-right order is imposed.
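As a purely illustrative sketch (the capture IDs are hypothetical, and this is not the draft's syntax), such a capture set could be modeled in Python as rows of ordered capture IDs:

    # A capture set: N rows, each row an ordered (left-to-right) list of
    # media capture IDs. All IDs are hypothetical.
    capture_set = [
        ["VC1", "VC2", "VC3"],  # row 1: three camera captures, left to right
        ["VC4"],                # row 2: a single composed view of the room
        ["AC1"],                # row 3: the room audio capture
    ]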

Stephan: when you have pan/zoom cameras, and they are set differently for different members of the capture set, then it's hard to understand what they are. Do different captures identify what part of the scene they capture?

Mark: that's a non-goal.

The case where cameras cover the same scene was raised.

Allyn indicated that a "Regions" concept would be added to the next draft.

Cullen: when we chartered this WG …

Allyn: we were trying to start with something simple

Stephen Botzko: the goal here is to achieve interoperability. Having some approximate idea of adjacency may be more powerful.

Roni: this talks about a simple architecture

Christian: we still have a concept of left/right?

Stephan Wenger: I am willing to put in the work, but I am not willing to let you off the hook when there are requirements that are relevant for me.

Mark: Matching audio and video - when they are part of the same capture set - that includes time synchronization and spatial relationships

Spatial relationships - audio direction should roughly match video directions

For audio, we are calling this audio channel formats - a receiver can map these onto its loudspeakers to approximate the spatial relationship, in a way better than just going to mono, but not requiring identical audio formats.
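As a sketch of that idea (the downmix coefficients are arbitrary example values, not anything from the draft), a receiver with two loudspeakers might map a three-channel left/center/right capture like this:

    import math

    # Two loudspeakers approximating a left/center/right capture by
    # downmixing, rather than collapsing everything to mono.
    downmix = {
        "speaker_left":  {"left": 1.0, "center": 1 / math.sqrt(2), "right": 0.0},
        "speaker_right": {"left": 0.0, "center": 1 / math.sqrt(2), "right": 1.0},
    }

    def render_sample(channels):
        """channels: per-channel sample values, e.g. {'left': 0.2, ...}"""
        return {spk: sum(coef * channels[ch] for ch, coef in mix.items())
                for spk, mix in downmix.items()}

    print(render_sample({"left": 0.2, "center": 0.5, "right": -0.1}))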

Allyn: the point is whether or not the draft deals with everything we need to capture the framework, not whether it deals with all of the details.

Andy Pepperell:

Choosing streams

Basic Message flow

media stream consumer and media stream provider

(of course, typically side each has both)

msc communicates with msp

msc : consumer capability advertisement

msp : media capture advertisement

Initial message msc : consumer capability advertisement (AKA "the hint")

-       Physical factors

-       User preferences

-       Software limitations

-       etc.

Next (the second message, from msp to msc) is the media capture advertisement from the msp

-       most recently received consumer capability advertisement

-       provider fixed parameters, such as the number of cameras

-       dynamic factors - active speaker, presentation source status,

-       simultaneous transmission sets, etc.

Third message (msc to msp)

Stream configure message from the msc

-       based on media capture advertisement

-       consumer fixed characteristics

-       dynamic factors

This is the trigger for actual media transfer from the provider.
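A minimal sketch of this three-message exchange, assuming hypothetical message contents (none of the field names below are actual CLUE syntax):

    def consumer_capability_advertisement():
        # Message 1 (consumer -> provider): "the hint".
        return {"screens": 3, "loudspeakers": 2, "max_decodes": 4}

    def media_capture_advertisement(hint):
        # Message 2 (provider -> consumer): tailored to the latest hint,
        # plus fixed parameters (e.g. number of cameras) and dynamic factors.
        captures = ["VC1", "VC2", "VC3"]
        return {"captures": captures[:hint["screens"]],
                "active_speaker": "VC2"}

    def stream_configure(advertisement):
        # Message 3 (consumer -> provider): choose streams; this triggers
        # actual media transfer from the provider.
        return {"requested": advertisement["captures"]}

    hint = consumer_capability_advertisement()
    adv = media_capture_advertisement(hint)
    cfg = stream_configure(adv)
    print(cfg)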

Question - why not use the terms sender and receiver?

Andy: we thought this was a little different case and that might confuse people.

Mark: and, this is not the sending and receiving of media

Andy: simultaneous transmission sets

Suppose that the same camera provides a digital zoom of one sub-scene and also the entire scene - that's why you need simultaneous transmission sets.
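A small sketch of that constraint (capture IDs hypothetical): each simultaneous transmission set lists captures that can be sent at the same time, and a consumer's selection must fit within one of them:

    # VC1 is the full-scene view and VC2 a digital zoom from the same
    # camera, so they cannot be transmitted at the same time.
    simultaneous_sets = [
        {"VC1", "VC3"},  # full scene plus an independent third camera
        {"VC2", "VC3"},  # zoomed sub-scene plus the third camera
    ]

    def valid_selection(requested, sets):
        # The requested captures must fit inside some simultaneous set.
        return any(requested <= s for s in sets)

    print(valid_selection({"VC1", "VC2"}, simultaneous_sets))  # False
    print(valid_selection({"VC2", "VC3"}, simultaneous_sets))  # True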

Encoding groups - part of the media capture set advertisement by the media stream provider. Each capture has an associated:

-       Encoding group structure - within an encoding group, there is the possibility of multiple encodes or multiple potential encodes

-       the usual sort of video encode attributes (advertised by the provider to the consumer): bandwidth, max bandwidth, etc.
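For illustration only (field names are hypothetical, not the draft's syntax), an encoding group might bound its individual encodes with a shared cap, as sketched here:

    encoding_group = {
        "id": "EG1",
        "max_total_bandwidth_kbps": 6000,   # shared cap across all encodes
        "encodings": [
            {"id": "ENC1", "max_width": 1920, "max_height": 1080,
             "max_bandwidth_kbps": 4000},
            {"id": "ENC2", "max_width": 1280, "max_height": 720,
             "max_bandwidth_kbps": 2000},
        ],
    }

    def fits(group, requested_kbps):
        # The sum of requested encode bandwidths must stay within the cap.
        return sum(requested_kbps) <= group["max_total_bandwidth_kbps"]

    print(fits(encoding_group, [4000, 2000]))  # True: exactly at the cap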

Roni: from the consumer side you are talking about screens.

You also have to have some way of linking an encoding group with a specific codec.

Marshall Eubanks: So, if something changes in the middle of a session, the provider will have to send a new media capture advertisement to the consumer, which will then have to send a new stream configure message to get the changed stream.

Andy: Yes

Marshall: So the consumer will have to be listening to the provider for MCAs at any time?

Andy: Yes.

Marshall: And, of course, you will need error messages.

Andy: Of course. The msc might get it wrong.

Brian Baldino did not present due to lack of time.