<?xml version='1.0'?>

<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
    <!ENTITY rfc2119 SYSTEM
      'reference.RFC.2119.xml'>
    <!ENTITY rtp SYSTEM
      'reference.RFC.3550.xml'>
    <!ENTITY clueusecases SYSTEM
      'reference.I-D.ietf-clue-telepresence-use-cases.xml'>
    <!ENTITY cluerequirements SYSTEM
      'reference.I-D.ietf-clue-telepresence-requirements.xml'>
    <!ENTITY clueframework SYSTEM
      'reference.I-D.ietf-clue-framework.xml'>
    <!ENTITY westerlundrtpmux SYSTEM
      'reference.I-D.westerlund-avtcore-multiplex-architecture.xml'>
    <!ENTITY rtcwebmux SYSTEM
      'reference.I-D.lennox-rtcweb-rtp-media-type-mux.xml'>
    <!ENTITY contentattr SYSTEM
      'reference.RFC.4796.xml'>
    <!ENTITY hdrext SYSTEM
      'reference.RFC.5285.xml'>
    <!ENTITY ccm SYSTEM
      'reference.RFC.5104.xml'>
    <!ENTITY rtptopo SYSTEM
      'reference.RFC.5117.xml'>
]>

<?rfc toc="yes" ?>
<?rfc strict="yes" ?>
<?rfc compact="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc symrefs="yes" ?>

<rfc category='std' ipr='trust200902' docName='draft-lennox-clue-rtp-usage-04'>
    <front>
        <title abbrev='RTP Usage for Telepresence'>
		Real-Time Transport Protocol (RTP) Usage for Telepresence Sessions
        </title>

        <author initials='J.' surname='Lennox'
                fullname='Jonathan Lennox'>
            <organization abbrev='Vidyo'>
               Vidyo, Inc.
            </organization>
            <address>
               <postal>
                   <street>433 Hackensack Avenue</street>
                   <street>Seventh Floor</street>
                   <city>Hackensack</city> <region>NJ</region>
                   <code>07601</code>
                   <country>US</country>
               </postal>
               <email>jonathan@vidyo.com</email>
            </address>
        </author>
		<author initials="P." surname="Witty"
                fullname="Paul Witty">
            <organization></organization>
              <address>
               <postal>
                   <street></street>
                   <city></city> <region>England</region>
                   <country>UK</country>
               </postal>
               <email>paul.witty@balliol.oxon.org</email>
            </address>
        </author>
        <author initials="A." surname="Romanow"
                fullname="Allyn Romanow">
            <organization>Cisco Systems</organization>
    
            <address>
               <postal>
				   <street> </street>
                   <city>San Jose</city> <region>CA</region>
                   <code>95134</code>
                   <country>USA</country>
               </postal>
               <email>allyn@cisco.com</email>
            </address>
        </author>

	

        <date />
        <area>RAI</area>
        <workgroup>CLUE</workgroup>

        <keyword>I-D</keyword>
        <keyword>Internet-Draft</keyword>
		<!-- TODO: more keywords -->

        <abstract>
		<t>
		  This document describes mechanisms and recommended practice
		  for transmitting the media streams of telepresence sessions
		  using the Real-Time Transport Protocol (RTP).
		</t>
        </abstract>

    </front>

<middle>

<section title='Introduction' anchor='introduction'>

<t>Telepresence systems, of the architecture described by
<xref target='I-D.ietf-clue-telepresence-use-cases' />
and <xref target='I-D.ietf-clue-telepresence-requirements' />, will send and
receive multiple media streams, where the number of streams in use is
potentially large and asymmetric between endpoints, and streams can
come and go dynamically.  These characteristics lead to a number of architectural
design choices which, while still in the scope of potential
architectures envisioned by the <xref target='RFC3550'>Real-Time
Transport Protocol</xref>, must be fairly different from those
typically implemented by the current generation of voice or video
conferencing systems.</t>

<t>Furthermore, captures, as defined by
  the <xref target='I-D.ietf-clue-framework'>CLUE Framework</xref>,
  are a somewhat different concept than RTP's concept of media
  streams, so there is a need to communicate the associations between
  them.</t>

<t>This document makes recommendations, for this telepresence
  architecture, about how streams should be encoded and transmitted in
  RTP, and how their relation to captures should be communicated.</t>

</section>

<section title='Terminology'>

<t>The key words "MUST", "MUST NOT", "REQUIRED", 
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", 
and "OPTIONAL" in this document
are to be interpreted as described in <xref
target='RFC2119'>RFC 2119</xref> and indicate requirement levels for
compliant implementations.</t>

</section>

<section title='RTP requirements for CLUE'>

<t>CLUE will permit a SIP call to include multiple media streams: easily dozens
at a time (given, e.g., a continuous presence screen in a multi-point
conference), potentially out of a possible pool of hundreds. Furthermore,
endpoints will have an asymmetric number of media streams.</t>

<t>Two main backwards-compatibility issues exist.  First, on an initial SIP
offer we cannot be sure that the far end will support CLUE, and therefore a
CLUE endpoint must not offer a selection of RTP sessions which would confuse a
CLUEless endpoint.  Second, there exist many SIP devices in the network
through which calls may be routed; even if we know that the far end supports
CLUE, re-offering with a larger selection of RTP sessions may fall foul of one
of these middleboxes.</t>

<t>We also desire to simplify NAT and firewall traversal by allowing endpoints
to deal with only a single static address/port mapping per media type rather
than multiple mappings which change dynamically over the duration of the
call.</t>

<t>A SIP call in common usage today will typically offer one or two video RTP
sessions (one for presentation, one for main video), and one audio session. 
Each of these RTP sessions will be used to send either zero or one media stream
in either direction, with the presence of these streams negotiated in the SDP
(offering a particular session as send only, receive only, or send and receive),
and through BFCP (for presentation video).</t>

<t>In a CLUE environment this model -- sending zero or one source (in each
direction) per RTP session -- doesn't scale as discussed above, and mapping
asymmetric numbers of sources to sessions is needlessly complex.</t>

<t>Therefore, telepresence systems SHOULD use a single RTP session per media
type, as shown in <xref target='rtp-multiplex' />, except where there is a need
to give sessions different transport treatment.  All sources of the same media
type, although from distinct captures, are sent over this single RTP session.</t>

<figure anchor="rtp-multiplex" title="Multiplexing multiple media streams
into one RTP session">
<artwork>
<![CDATA[
Camera 1   -.__                                        _,'Screen 1
               `--._   , =-----------...........     ,'
                    `'+.._`\ _________________ _\,'
                     /    '|        RTP          |
Camera 2 ------------+----,''''''''''''''''''''':-------- Screen 2
                     \  _ ----------------------.'.
                  _,.-''-----------------------,/  `-._
             _,.-'                                     `.. Screen 3
Camera 3  ,-'                                             `
]]>
</artwork>
</figure>

<t>During call setup, a single RTP session is negotiated for each media type. 
In SDP, only one media line is negotiated per media type, and multiple media
streams are sent over the same UDP flow negotiated using that SDP media line.</t>

<t>
   A number of protocol issues involved in multiplexing RTP streams
   into a single session are discussed in
   <xref target='I-D.westerlund-avtcore-multiplex-architecture' /> and
   <xref target='I-D.lennox-rtcweb-rtp-media-type-mux' />. In the rest
   of this
   document we concentrate on examining the mapping of RTP streams to requested
   CLUE captures in the specific context of telepresence systems.
</t>

<t>The CLUE architecture requires more than simply source
  multiplexing, as defined by <xref target='RFC3550' />.
  The key issue is how a receiver interprets the multiplexed
  streams it receives, and correlates them with the captures it has
  requested.  
  In some cases, the
  <xref target='I-D.ietf-clue-framework'>CLUE Framework</xref>'s
  concept of the "capture" maps cleanly to the RTP concept of an SSRC,
  but in many cases it does not.
</t>

<!--
[This paragraph was redundant and less clear than some other stuff elsewhere]
<t>
  Conceptually, each different capture requested could be viewed as an
  independent unidirectional RTP session, with the stream (or
  streams) received for one capture handled entirely independently from those
  streams received for different captures.  
  However, because any stream could be
  sent over any capture, and for the other reasons described above,
  we still have all the restrictions of source
  multiplexing.</t>
-->

<t>
   First, we enumerate the cases that need to be supported. We then
   examine the two most obvious approaches to mapping streams to captures,
   showing their pros and cons, and finally describe a third possible
   alternative.
</t>

</section>

<section title='RTCP requirements for CLUE'>

<t>When sending media streams, we are also required to send corresponding RTCP
information.  However, while a unidirectional RTP stream (as identified by a
single SSRC) will contain a single stream of media, the associated RTCP stream
will include sender information about the stream, but will also include feedback
for streams sent in the opposite direction.  In a simple point-to-point call, it
may be possible to naively forward on RTCP in a similar manner to RTP, but in
more complicated use cases where multipoint devices are switching streams to
multiple receivers, this simple approach is insufficient.</t>

<t>As an example, receiver report messages are sent with the source SSRC of a
single media stream sent in the same direction as the RTCP, but contain within
the message zero or more receiver report blocks for streams sent in the other
direction.  Forwarding on the receiver report packets to the same endpoints
which are receiving the media stream tagged with that SSRC will provide no
useful information to endpoints receiving the messages, and does not guarantee
that the reports will ever reach the origin of the media streams on which they
are reporting.</t>

<t>CLUE therefore requires devices to deal more intelligently with received RTCP
messages, which will require full packet inspection, including SRTCP
decryption.  The low rate of RTCP transmission and reception makes this
feasible.</t>

<t>RTCP also carries information to establish clock synchronization between
  multiple RTP streams.  For CLUE, this information will be crucial, not
  only for traditional lip-sync between video and audio, but also for
  synchronized playout of multiple video streams from the same room.
  This information needs to be provided even in the case of switched
  captures, to provide clock synchronization for sources that are
  temporarily being shown for a switched capture.</t>
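<t>To illustrate the synchronization role of RTCP, the following minimal
  Python sketch (the function name and simplifications are ours; 32-bit
  RTP timestamp wraparound is ignored) shows how a receiver could map an
  RTP timestamp to sender wallclock time using the NTP/RTP timestamp
  pair carried in the most recent RTCP Sender Report, as defined
  by <xref target='RFC3550' />:</t>

<figure><artwork>
<![CDATA[
```python
def rtp_to_wallclock(rtp_ts, sr_ntp_seconds, sr_rtp_ts, clock_rate):
    """Map an RTP timestamp to sender wallclock time (in seconds),
    using the NTP/RTP timestamp pair from the latest Sender Report.
    Streams that share a sender wallclock can then be aligned for
    synchronized playout (lip-sync, multi-screen video)."""
    return sr_ntp_seconds + (rtp_ts - sr_rtp_ts) / clock_rate
```
]]>
</artwork></figure>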

</section>

<section title='Multiplexing multiple streams or multiple sessions?'>

<t>It may not be immediately obvious whether this problem is best described as
multiplexing multiple RTP sessions onto a single transport layer, or as
multiplexing multiple media streams onto a single RTP session. Certainly, the
different captures represent independent purposes for the media that is sent;
however, as any stream may be switched into any of the multiplexed captures, we
maintain the requirement that all media streams within a CLUE call must have a
unique SSRC -- this is also a requirement for the above use of RTCP.</t>

<t>Because of this, CLUE's use of RTP can best be described as multiplexing
multiple streams onto one RTP session, but with additional data about the
streams to identify their intended destinations.  A solution to perform this
multiplexing may also be sufficient to multiplex multiple RTP sessions onto
one transport session, but this is not a requirement.</t>

</section>

<section title='Use of multiple transport flows'>

<t>Most existing videoconferencing systems use separate RTP sessions
  for main and presentation video sources, distinguished by the
  <xref target='RFC4796'>SDP content attribute</xref>.  The use of
  <xref target='I-D.ietf-clue-framework'>the CLUE telepresence
  framework</xref> to describe multiplexed streams can remove the
  need to establish separate RTP sessions (and transport flows) for
  these sessions, as the relevant information can be provided by CLUE
  messaging instead.</t>

<t>However, it can still be useful in many cases to establish multiple
  RTP sessions (and transport flows) for a single CLUE session.
  Two clear cases would be for disaggregated media (where media is
  being sent to devices with different transport addresses), or
  scenarios where different sources should get different
  quality-of-service treatment.  To support such scenarios, the use of
  multiple RTP sessions, negotiated as separate SDP m= lines with
  different transport addresses, would be necessary.</t>
  
<t>To support this case, CLUE messaging needs to be able to indicate the RTP
session in which a requested capture is intended to be received.</t>

</section>

<section title='Use Cases' anchor='usecases'>

<t>There are three distinct use cases relevant for telepresence systems:
static stream choice, dynamically changing streams chosen from a finite set, and
dynamically changing streams chosen from an unbounded set.</t>

<t>Static stream choice:</t>

<t>In this case, the streams sent over the multiplex are constant over
  the complete session. An example is a triple-camera system to MCU in
  which left, center and right streams are sent for the duration of
  the session.</t>
<t>This case applies to endpoint to endpoint and endpoint to multipoint
  device connections, and equivalently to a transcoding multipoint device
  to endpoint connection.</t>
<t>This is illustrated in <xref target='p2p-static-streams' />.</t>

<figure anchor="p2p-static-streams" title="Point to Point Static Streams">
<artwork>
<![CDATA[
            ,'''''''''''|                           +-----------Y
            |           |                           |           |
            | +--------+|"""""""""""""""""""""""""""|+--------+ |
            | |EndPoint||---------------------------||EndPoint| |
            | +--------+|"""""""""""""""""""""""""""|+--------+ |
            |           |                           |           |
            "-----------'                           "------------
]]>
</artwork>
</figure>

<t>Dynamic streams from a finite set:</t>

<t>In this case, the receiver has requested a smaller number of
  streams than the number of media sources that are available, and
  expects the sender to switch the sources being sent
  based on criteria chosen by the sender.  (This is called
  auto-switched in the <xref target='I-D.ietf-clue-framework'>CLUE
  Framework</xref>.)</t>

<t>An example is a triple-camera system to two-screen system, in which
  the sender needs to switch either LC -> LR, or CR -> LR.  (Note in
  particular, in this example, that the center camera stream could be sent as
  either the left or the right auto-switched capture.)</t>

<t>This case applies to endpoint to endpoint and endpoint to multipoint
  device connections, and to a transcoding device to endpoint connection.</t>

<t>This is illustrated in <xref target='p2p-finite-streams' />.</t>


<figure anchor="p2p-finite-streams" title="Point to Point Finite Source Streams">
<artwork>
<![CDATA[
            ,'''''''''''|                           +-----------Y
            |           |                           |+--------+ |
            | +--------+|"""""""""""""""""""""""""""||EndPoint| |
            | |EndPoint||                           |+--------+_|
            | +--------+''''''''''                   '''''''''''
            |           |........
            "-----------'
]]>
</artwork>
</figure>


<t>Dynamic streams from an unbounded set:</t>

<t>This case describes a switched multipoint device to endpoint, in
  which the multipoint device can choose to send any streams received
  from any other endpoints within the conference 
  to the endpoint.</t>
<t>For example, in an MCU to triple-screen system, the MCU could send
  LCR of a triple-camera system -> LCR, or CCC of three
  single-camera endpoints -> LCR.</t>
<t>This is illustrated in <xref target='multipoint-unbounded-streams' />.</t>

<figure anchor="multipoint-unbounded-streams" title="Multipoint Unbounded Streams">
<artwork>
<![CDATA[
           +-+--+--+
           | |EP|  `-.
           | +--+  |`.`-.
           +-------`. `. `.
                     `-.`. `-.
                        `.`-. `-.
                          `-.`.  `-.-------+              +------+
           +--+--+---+       `.`.|  +---+  ---------------| +--+ |
           |  |EP|   +----.....:=.  |MCU|  ...............| |EP| |
           |  +--+   |"""""""""--|  +---+  |______________| +--+ |
           +---------+"""""""""";'.'.'.'---+              +------+
                              .'.'.'.'
                            .'.'.'.'
                           / /.'.'
                         .'.::-'
            +--+--+--+ .'.::'
            |  |EP|  .'.::'
            |  +--+  .::'
            +--------.'
]]>
</artwork>
</figure>
<!-- " (close quote to keep emacs xml mode happy) -->

<t>Within any of these cases, every stream within the multiplexed
  session MUST have a unique SSRC.  The SSRC is chosen at random
  <xref target='RFC3550' /> to ensure uniqueness (within the
  conference), and contains no meaningful information.</t>

<t>Any source may choose to restart a stream at any time, resulting in
  a new SSRC. For example, a transcoding MCU might, for reasons of load
  balancing, transfer an encoder onto a different DSP, discarding all
  encoder context in the process, sending an RTCP BYE message for the
  old SSRC, and picking a new SSRC for the stream when it is restarted
  on the new DSP.</t>
<t>Because of this possibility of changing the SSRC at any time, all
  our use cases can be considered as simplifications of the third and most difficult
  case, that of dynamic streams from an unbounded set. Thus, this is
  the primary case we will consider.</t>

</section>

<section title='Other implementation constraints'>

<t>To cope with receivers with limited decoding resources, for example a
  hardware-based telepresence endpoint with a fixed number of decoding modules,
  each capable of handling only a single stream, it is particularly important
  to ensure that the number of streams which the transmitter expects the
  receiver to decode never exceeds the maximum number the receiver has
  requested. Otherwise, the receiver will be forced to drop some of the
  received streams, causing a poor user experience and potentially higher
  bandwidth usage, should retransmission of I-frames be required.</t>
  
<t>On a change of stream, such a receiver can be expected to have a one-out,
  one-in policy, so that the decoder of the stream currently being received on
  a given capture is stopped before starting the decoder for the stream
  replacing it.  The sender MUST therefore indicate to the receiver which stream
  will be replaced upon a stream change.</t>

</section>

<section title='Requirements of a solution' anchor='solution-requirements'>

<t>This section lists, more briefly, the requirements a media
  architecture for Clue telepresence needs to achieve, summarizing the
  discussion of previous sections.  In this section, RFC 2119 language
  refers to requirements on a solution, not an implementation; thus,
  requirements keywords are not written in capital letters.</t>

<t><list style="hanging">

<t hangText='Media-1:'>It must not be necessary for a Clue session to
  use more than a single transport flow for transport of a given media
  type (video or audio).</t>

<t hangText='Media-2:'>It must, however, be possible for a Clue
  session to use multiple transport flows for a given media type where
  it is considered valuable (for example, for distributed media, or
  differential quality-of-service).</t>

<t hangText='Media-3:'>It must be possible for a Clue endpoint or MCU
  to simultaneously send sources corresponding to static, to
  composited, and to switched captures, in the same transport flow.
  (Any given device might not necessarily be able send all of these
  source types; but for those that can, it must be possible for them
  to be sent simultaneously.)</t>

<t hangText='Media-4:'>It must be possible for an original source to
  move among switched captures (i.e. at one time be sent for one
  switched capture, and at a later time be sent for another one).</t>

<t hangText='Media-5:'>It must be possible for a source to be placed
  into a switched capture even if the source is a "late joiner",
  i.e. was added to the conference after the receiver requested the
  switched source.</t>

<t hangText='Media-6:'>Whenever a given source is assigned to a
  switched capture, it must be immediately possible for a receiver to
  determine the switched capture it corresponds to, and thus that any
  previous source is no longer being mapped to that switched
  capture.</t>

<t hangText='Media-7:'>It must be possible for a receiver to identify
  the actual source that is currently being mapped to a switched
  capture, and correlate it with out-of-band (non-Clue) information
  such as rosters.</t>

<t hangText='Media-8:'>It must be possible for a source to move among
  switched captures without requiring a refresh of decoder state
  (e.g., for video, a fresh I-frame), when this is unnecessary.
  However, it must also be possible for a receiver to indicate when a
  refresh of decoder state is in fact necessary.</t>

<t hangText='Media-9:'>If a given source is being sent on the same
  transport flow for more than one reason (e.g. if it corresponds to
  more than one switched capture at once, or to a static capture), it
  should be possible for a sender to send only one copy of the source.</t>

<t hangText='Media-10:'>On the network, media flows should, as much as
  possible, look and behave like currently-defined usages of existing
  protocols; established semantics of existing protocols must not be
  redefined.</t>

<t hangText='Media-11:'>The solution should seek to minimize the
  processing burden for boxes that distribute media to decoding
  hardware.</t>

<t hangText='Media-12:'>If multiple sources from a single
  synchronization context are being sent simultaneously, it must be
  possible for a receiver to associate and synchronize them properly,
  even for sources that are mapped to switched captures.</t>

</list></t>

</section>

<section title='Mapping streams to requested captures'>

<t>The goal of any scheme is to allow the receiver to match the
received streams to the requested captures.  As discussed in
  <xref target='usecases' />, during
the lifetime of the transmission of one capture, we may see one or multiple
media streams which belong to this capture, and during the lifetime of one media
stream, it may be assigned to one or more captures.</t>

<t>Topologically, the requirements
  in <xref target='solution-requirements' /> are best addressed by
  implementing static and switched captures with an RTP Media Translator,
  i.e. the topology that <xref target='RFC5117'>RTP Topologies</xref>
  defines as Topo-Media-Translator.  (A composited capture would be
  the topology described by Topo-Mixer; an MCU can easily produce
  either or both as appropriate, simultaneously.)  The MCU selectively forwards
  certain sources, corresponding to those sources which it currently
  assigns to the requested switched captures.</t>

<t>Demultiplexing of streams is done by SSRC; each stream is known to have a
  unique SSRC. However, this SSRC contains no information about capture IDs. 
  There are two obvious choices for providing the mapping from SSRC to
  captures: sending the mapping outside of the media stream, or tagging media
  packets with the capture ID.  (There may be other choices, e.g., payload type
  number, which might be appropriate for multiplexing one audio stream with
  one video stream on the same RTP session, but this is not relevant for
  the cases discussed here.)</t>

<t>(An alternative architecture would be to map all captures directly to SSRCs,
  and then to use a Topo-Mixer topology to represent switched captures
  as a "mixed" source with a single contributing CSRC.
  However, such an architecture would not be able to
  satisfy the requirements Media-8, Media-9, or Media-12 described in
  <xref target='solution-requirements' />, without substantial changes
  to the semantics of RTP.)</t>

<section title='Sending SSRC to capture ID mapping outside the media stream'>

<t>Every RTP packet includes an SSRC, which can be used to demultiplex the
  streams.  However, although the SSRC uniquely identifies a stream, it does not
  indicate which of the requested captures that stream is tied to.  If more than
  one capture is requested, a mapping from SSRC to capture ID is therefore
  required so that the media receiver can treat each received stream
  correctly.</t>

<t>As described above, the receiver may need to know in advance of receiving
the media stream how to allocate its decoding resources. Although
implementations MAY cache incoming media received before knowing which
capture it applies to, this is optional, and other
implementations may choose to discard media, potentially requiring an expensive
state refresh, such as an <xref target='RFC5104'>Full Intra Request
(FIR)</xref>.</t>

<t>In addition, a receiver will have to store lookup tables of SSRCs
  to stream IDs/decoders etc.  Because of the large SSRC space (32
  bits), this will have to be in the form of something like a hash
  map, and a lookup will have to be performed for every incoming
  packet, which may prove costly for e.g. MCUs processing large numbers of
incoming streams.</t>
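<t>As a concrete (hypothetical) sketch of this per-packet lookup, the
  following Python fragment extracts the SSRC from the fixed RTP header
  (bytes 8-11, per RFC 3550) and consults a hash map populated out of
  band; the names are illustrative only:</t>

<figure><artwork>
<![CDATA[
```python
import struct

# SSRC -> capture ID, populated from out-of-band mapping messages.
ssrc_to_capture = {}

def capture_for_packet(packet: bytes):
    """Return the capture ID for an RTP packet, or None if unknown.
    The SSRC occupies bytes 8-11 of the fixed RTP header (RFC 3550);
    this lookup must run for every incoming packet."""
    if len(packet) < 12:
        raise ValueError("shorter than the fixed RTP header")
    (ssrc,) = struct.unpack_from("!I", packet, 8)
    return ssrc_to_capture.get(ssrc)
```
]]>
</artwork></figure>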

<t>Consider the choices for where to put the mapping from SSRC to capture ID.
  This mapping could be sent in the CLUE messaging.  The use of a reliable
  transport means that the sender can be sure that the mapping will not be
  lost, but if this reliability is achieved through retransmission,
  the time taken for the mapping to reach all receivers (particularly in a very
  large scale conference, e.g., with thousands of users) could result in
  very poor switching times, providing a bad user experience.</t>

<t>A second option for sending the mapping is in RTCP, for instance
  as a new SDES item.  This is likely to follow the same path as media, and
  therefore if the mapping data is sent slightly in advance of the media, it can
  be expected to be received in advance of the media.  However, because RTCP is
  lossy and, due to its timing rules, cannot always be sent
  immediately, the mapping
  may not be received for some time, resulting in the
  receiver of the media not knowing how to route the received media.  A system
  of acks and retransmissions could mitigate this, but this results in the same
  high switching latency behaviour as discussed for using CLUE as a transport
  for the mapping.</t>
  
<figure anchor="sdes-mux-id" title="SDES item for encoding of the Capture ID">
<artwork>
<![CDATA[
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  CaptureID=9  |   length=4    |           Capture ID          :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

]]>
</artwork>
</figure>
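<t>A sketch of encoding this hypothetical SDES item, following the generic
  SDES item layout of RFC 3550 (one type octet, one length octet, then the
  value), with the item type value 9 taken from the figure above; the
  function name is invented for illustration:</t>

<figure><artwork>
<![CDATA[
```python
def encode_capture_id_item(capture_id: str, item_type: int = 9) -> bytes:
    """Encode the hypothetical CaptureID SDES item: one type octet,
    one length octet, then the UTF-8 capture ID (max 255 octets)."""
    value = capture_id.encode("utf-8")
    if len(value) > 255:
        raise ValueError("SDES item values are limited to 255 octets")
    return bytes([item_type, len(value)]) + value
```
]]>
</artwork></figure>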

</section>

<section title='Sending capture IDs in the media stream'>

<t>The second option is to tag each media packet with the capture ID. This
  means that a receiver immediately knows how to interpret received media, even
  when an unknown SSRC is seen.  As long as the media carries a known capture
  ID, it can be assumed that this media stream will replace the stream
currently being received with that capture ID.</t>

<t>This gives significant advantages to switching latency, as a switch
  between sources can be achieved without any form of negotiation with
  the receiver.  There is no chance of receiving media without knowing
  to which switched capture it belongs.</t>

<t>However, the disadvantage of carrying a capture ID in the stream is that it
introduces additional processing costs for every media packet, as capture IDs
are scoped only within one hop (i.e., within a cascaded conference a capture
ID that is used from the source to the first MCU is not meaningful between two
MCUs, or between an MCU and a receiver), and so they may need to be added or
modified at every stage.</t> 

<t>Because capture IDs are chosen by the media sender, offering a particular
capture to multiple recipients with the same ID allows the sender to
produce only one version of the stream (assuming outgoing payload type numbers
match). This reduces the cost in the multicast case, although it does not
necessarily help in the switching case.</t>

<t>An additional issue with putting capture IDs in the RTP packets comes from
cases where a non-CLUE aware endpoint is being switched by an MCU to a CLUE
endpoint.  In this case, we may require up to an additional 12 bytes in the RTP
header, which may push a media packet over the MTU.  However, as the MTU on
either side of the switch may not match, it is possible that this could happen
even without adding extra data into the RTP packet.  The 12 additional bytes
per packet could also be a significant bandwidth increase in the case of
very low bandwidth audio codecs.</t>

<section title="Multiplex ID shim">

<t> As in draft-westerlund-avtcore-transport-multiplexing</t>

</section>

<section title="RTP header extension">

<t>The capture ID could be carried within the RTP header extension field,
using <xref target='RFC5285' />.  This is negotiated in the SDP, e.g.:</t>

<t>a=extmap:1 urn:ietf:params:rtp-hdrext:clue-capture-id</t>

<t>Packets tagged by the sender with the capture ID will then contain a header
extension as shown below.</t>

<figure anchor="hdrext-cap-id" title="RTP header extension for encoding
of the capture ID">
<artwork>
<![CDATA[
     0                   1                   2                   3
     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |  ID=1 |  L=3  |                 capture id                    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |  capture id   |
    +-+-+-+-+-+-+-+-+
]]>
</artwork>
</figure>
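<t>A minimal sketch of building one such extension element using the
  one-byte-header format of <xref target='RFC5285' />, in which the L
  field carries the data length minus one (so L=3 in the figure above
  corresponds to a four-octet capture ID); the function name is
  illustrative:</t>

<figure><artwork>
<![CDATA[
```python
def encode_one_byte_element(ext_id: int, data: bytes) -> bytes:
    """Encode one RFC 5285 one-byte-header extension element: the
    high nibble of the first octet is the extension ID, the low
    nibble is len(data) - 1, followed by the data itself."""
    if not 1 <= ext_id <= 14:
        raise ValueError("one-byte-header IDs must be in 1..14")
    if not 1 <= len(data) <= 16:
        raise ValueError("data must be 1..16 octets")
    return bytes([(ext_id << 4) | (len(data) - 1)]) + data
```
]]>
</artwork></figure>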

<t>Adding or modifying the capture ID can be an expensive operation,
  particularly if SRTP is used to authenticate the packet.  Any
  modification to the contents of the RTP header requires
  reauthentication of the complete packet, and this could prove to be a
  limiting factor in the throughput of a multipoint device.  However,
  reauthentication may be required in any case due to the nature of SDP:
  SDP permits the receiver to choose payload types, so a switching node
  that rewrites the payload type in the packet header must similarly
  reauthenticate the packet.</t>
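<t>To see where the cost comes from, recall that the SRTP authentication tag
(RFC 3711) is an HMAC computed over the entire RTP packet, header included,
plus the rollover counter; any header rewrite therefore invalidates the tag.
A minimal sketch, assuming the default HMAC-SHA1 transform with an 80-bit
tag:</t>

<figure><artwork><![CDATA[
```python
import hmac, hashlib

def srtp_auth_tag(auth_key, rtp_packet, roc, tag_len=10):
    # RFC 3711: the tag covers the whole packet plus the 32-bit
    # rollover counter (ROC), so changing any header byte, such as
    # an inserted capture ID, forces a full recomputation.
    mac = hmac.new(auth_key, rtp_packet + roc.to_bytes(4, "big"),
                   hashlib.sha1)
    return mac.digest()[:tag_len]
```
]]></artwork></figure>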

</section>

<section title='Combined approach'>

<t>The two major flaws of the above methods (the high switching latency
  of SSRC multiplexing, and the high computational cost imposed on
  switching nodes) can be mitigated with a combined method.  In this,
  the multiplex ID is included in the packets belonging to the first
  frame of media (typically an IDR/GDR), but following this only the
  SSRC is used to demultiplex.</t>
  
<section title="Behaviour of receivers">

<t>A receiver of a stream should demultiplex on SSRC if it knows the capture
ID for the given SSRC; otherwise it should look within the packet for the
capture ID.  An issue arises when a stream switches from one capture to
another - for example, in the second use case described in <xref
target='usecases' />, where the transmitter chooses to switch the center
stream from the receiver's right capture to the left capture: the receiver
will already hold an incorrect mapping from that stream's SSRC to a
capture ID.</t>

<t>In this case the receiver should, at the RTP level, detect the presence of
the capture ID and update its SSRC-to-capture-ID map.  By the time the change
is detected, the demultiplexer may already have sent a packet to the wrong
physical device; this could be avoided by checking for the presence of a
capture ID in every packet, but doing so has performance implications.  If a
packet is received where the receiver does not already know the mapping
between SSRC and capture ID, and the packet does not contain a capture ID,
the receiver may discard it, and MUST request a transmission of the capture
ID (see below).</t>
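<t>A minimal sketch of this receiver behaviour (the class and callback names
are hypothetical; request_capture_id stands in for whatever request mechanism
is chosen):</t>

<figure><artwork><![CDATA[
```python
class CaptureDemuxer:
    def __init__(self, request_capture_id):
        self.ssrc_to_capture = {}
        self.request_capture_id = request_capture_id  # e.g. sends a FIR

    def demux(self, ssrc, capture_id_in_packet):
        if capture_id_in_packet is not None:
            # Always trust an explicit in-packet capture ID: this also
            # repairs a stale mapping after the sender switches this
            # SSRC to a different capture.
            self.ssrc_to_capture[ssrc] = capture_id_in_packet
            return capture_id_in_packet
        if ssrc in self.ssrc_to_capture:
            return self.ssrc_to_capture[ssrc]
        # Unknown SSRC and no in-packet ID: discard the packet and
        # request transmission of the capture ID.
        self.request_capture_id(ssrc)
        return None
```
]]></artwork></figure>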

</section>

<section title="Choosing when to send capture IDs">

<t>The updated capture ID needs to be known as soon as possible after a switch
of SSRCs, as the receiver may be unable to allocate resources to decode the
incoming stream, and may discard the received packets.  The incoming stream
can be assumed to be undecodable until the capture ID is received.</t>

<t>In common video codecs (e.g. H.264), decoder refresh frames (either IDR or
GDR) also have this property, in that it is impossible to decode any video
without first receiving the refresh point.  It therefore seems natural to
include the capture ID within every packet of an IDR or GDR.</t>

<t>For most audio codecs, where every packet can be decoded independently,
there is no such obvious place to put this information.  The simplest solution
is to place the capture ID within the first n packets of a stream after a
switch, where n is sufficiently large that at least one packet can be expected
to reach the receiver.  For example, n=50 with 20ms audio packets gives 1
second of capture IDs, which should give reasonable confidence of arrival.</t>
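<t>A sender-side sketch of this tagging scheme (illustrative names; the
packet count corresponds to the n discussed above):</t>

<figure><artwork><![CDATA[
```python
class CaptureIdTagger:
    def __init__(self, n=50):
        self.n = n               # e.g. 50 x 20ms packets = 1 second
        self.remaining = 0
        self.capture_id = None

    def switch_capture(self, capture_id):
        self.capture_id = capture_id
        self.remaining = self.n  # tag the next n packets

    def tag_for_next_packet(self):
        if self.remaining > 0:
            self.remaining -= 1
            return self.capture_id
        return None              # steady state: demux on SSRC alone
```
]]></artwork></figure>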

<t>In the case where a stream is switched between captures, it may be
desirable, for reasons of coding efficiency, to avoid sending a new IDR frame
for this stream, if the receiver's architecture allows the same decoding state
to be used for its various captures.  In this case, the capture ID could be
sent for a small number of frames after the source switches capture,
similarly to audio.</t>

</section>

<section title="Requesting Capture ID retransmits">

<t>There will, unfortunately, always be cases where a receiver misses the
beginning of a stream, and therefore does not have the mapping.  One proposal
is to send the capture ID as an SDES item in every SDES packet; this should
ensure that the capture ID is received within ~5 seconds of receiving a
stream.  However, a faster method of requesting the transmission of a capture
ID would be preferable.</t>

<t>Again, we look towards the existing solution to this problem for video.
RFC 5104 provides the Full Intra Request (FIR) feedback message, which
requests that the encoder provide the stream such that receivers need only
the media from that point onwards.  A video receiver that missed the start of
the stream will naturally need to make this request, so by always including
the capture ID in refresh frames, we can be sure that the receiver will have
all the information it needs to decode the stream (both a refresh point and a
capture ID).</t>

<t>For audio, we can reuse this message.  If a receiver receives an audio
stream for which it has no SSRC-to-capture-ID mapping, it should send a FIR
message for the received SSRC.  Upon receiving this, an audio encoder must
tag outgoing media packets with the capture ID for a short period of time.</t>
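<t>For concreteness, a sketch of constructing such a FIR message.  The layout
(PT=206, FMT=4, a zero media source field, and one FCI entry per target SSRC)
follows RFC 5104, section 4.3.1; the function name is illustrative:</t>

<figure><artwork><![CDATA[
```python
import struct

def build_fir(sender_ssrc, target_ssrc, seq_nr):
    # RTCP payload-specific feedback packet: PT=206, FMT=4 is FIR
    # (RFC 5104, section 4.3.1).
    fci = struct.pack("!IB3x", target_ssrc, seq_nr & 0xFF)
    # RTCP length field: packet length in 32-bit words minus one, i.e.
    # sender SSRC + media SSRC (2 words) plus the FCI entries.
    length_field = 2 + len(fci) // 4
    header = struct.pack("!BBH", (2 << 6) | 4, 206, length_field)
    # The media source field is zero for FIR; the target SSRCs are
    # carried in the FCI entries instead.
    return header + struct.pack("!II", sender_ssrc, 0) + fci
```
]]></artwork></figure>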

<t>Alternately, a new RTCP feedback message could be defined which
  would explicitly request a refresh of the capture ID mapping.</t>

</section>

</section>

</section>

<section title="Recommendations">

<t>We recommend that endpoints MUST support the RTP header extension method
of sharing capture IDs, with the extension included in every media packet.
In low bandwidth situations this may be considered excessive overhead, in
which case endpoints MAY support the combined approach.</t>

<t>This will be advertised in the SDP (in a way yet to be determined); if a
receiver advertises support for the combined approach, transmitters that
support sending the combined approach SHOULD use it in preference.</t>

</section>

</section>

<section title='Security Considerations' anchor='security'>

<t>The security considerations for multiplexed RTP do not appear to differ
  from those for non-multiplexed RTP.</t>

<t>Capture IDs need to be integrity-protected in secure environments;
  however, they do not appear to need confidentiality.</t>

</section>

<section title='IANA Considerations' anchor='iana'>

<t>Depending on the decisions taken, the new RTP header extension element,
  the new RTCP SDES item, and/or the new AVPF feedback message will need
  to be registered with IANA.</t>

</section>

</middle>

<back>

<references title='Normative References'>

&rfc2119;

&rtp;

</references>

<references title='Informative References'>

&clueusecases;

&cluerequirements;

&clueframework;

&contentattr;

&hdrext;

&rtcwebmux;

&westerlundrtpmux;

&ccm;

&rtptopo;

</references>

<!--
<section title='Open issues'>

<t><list style='symbols'>

<t></t>

</list></t>

</section>
-->

<!-- 
<section title='Changes From Earlier Versions'>

<t>Note to the RFC-Editor: please remove this section prior to publication
as an RFC.</t>

<section title='Changes From Draft -00'>

<t><list style='symbols'>

<t></t>


</list></t>

</section>

</section>

-->

</back>

</rfc>

