<?xml version='1.0'?>

<!DOCTYPE rfc SYSTEM "rfc2629.dtd" [
    <!ENTITY rfc2119 SYSTEM
      'reference.RFC.2119.xml'>
    <!ENTITY rtp SYSTEM
      'reference.RFC.3550.xml'>
    <!ENTITY ice SYSTEM
      'reference.RFC.5245.xml'>
    <!ENTITY sdp SYSTEM 
      'reference.RFC.4566.xml'>
    <!ENTITY offeranswer SYSTEM 
      'reference.RFC.3264.xml'>
    <!ENTITY prack SYSTEM 
      'reference.RFC.3262.xml'>
    <!ENTITY avp SYSTEM
      'reference.RFC.3551.xml'>
    <!ENTITY avpf SYSTEM
      'reference.RFC.4585.xml'>
    <!ENTITY srtp SYSTEM
      'reference.RFC.3711.xml'>
    <!ENTITY savpf SYSTEM
      'reference.RFC.5124.xml'>
    <!ENTITY sourcedesc SYSTEM
      'reference.RFC.5576.xml'>
    <!ENTITY sdpgroup SYSTEM
      'reference.RFC.5888.xml'>
    <!ENTITY sdpbundle SYSTEM
      'reference.I-D.holmberg-mmusic-sdp-bundle-negotiation.xml'>
    <!ENTITY sdpcapneg SYSTEM
      'reference.RFC.5939.xml'>
    <!ENTITY rtpfec SYSTEM
      'reference.RFC.5109.xml'>
    <!ENTITY layergroup SYSTEM
      'reference.RFC.5583.xml'>
    <!ENTITY rtprtx SYSTEM
      'reference.RFC.4588.xml'>
]>

<?rfc toc="yes" ?>
<?rfc strict="yes" ?>
<?rfc compact="yes" ?>
<?rfc sortrefs="yes" ?>
<?rfc symrefs="yes" ?>

<rfc category='std' ipr='trust200902' docName='draft-lennox-rtcweb-rtp-media-type-mux-00'>
    <front>
        <title abbrev='Multiplexing Media Types in RTP'>
		Multiplexing Multiple Media Types in a Single Real-Time Transport Protocol (RTP) Session
        </title>

        <author initials='J.' surname='Lennox'
                fullname='Jonathan Lennox'>
            <organization abbrev='Vidyo'>
               Vidyo, Inc.
            </organization>
            <address>
               <postal>
                   <street>433 Hackensack Avenue</street>
                   <street>Seventh Floor</street>
                   <city>Hackensack</city> <region>NJ</region>
                   <code>07601</code>
                   <country>US</country>
               </postal>
               <email>jonathan@vidyo.com</email>
            </address>
        </author>

        <author initials="J." surname="Rosenberg"
                fullname="Jonathan Rosenberg">
            <organization>Skype</organization>
    
            <address>
                <email>jdrosen@skype.net</email>
                <uri>http://www.jdrosen.net</uri>
            </address>
        </author>

        <date />
        <area>RAI</area>
        <workgroup>RTCWEB</workgroup>

        <keyword>I-D</keyword>
        <keyword>Internet-Draft</keyword>
		<!-- TODO: more keywords -->

        <abstract>
		<t>
		  This document describes mechanisms and recommended practice
		  for transmitting media streams of multiple media types
		  (e.g., audio and video) over a single Real-Time Transport
		  Protocol (RTP) session, primarily for the use of Real-Time
		  Communication for the Web (rtcweb).
		</t>
        </abstract>

    </front>

<middle>

<section title='Introduction' anchor='introduction'>

<t>Classically, multimedia sessions using
  the <xref target='RFC3550'>Real-Time Transport Protocol (RTP)</xref>
  have transported different media types (most commonly, audio and
  video) in different RTP sessions, each with its own transport
  flow.  At the time RTP was designed, this was a reasonable design
  decision, reducing system variability and adding flexibility
  (<xref target='RFC3550' /> discusses the motivation for this
  design decision in Section 5.2).</t>

<t>However, the de facto architecture of the Internet has changed
  substantially since RTP was originally designed, nearly twenty years
  ago.  In particular, Network Address Translators (NATs) and
  firewalls are now ubiquitous, and IPv4 address space scarcity is
  becoming more severe.  As a consequence, both the network resources
  an application consumes and its probability of failure grow with
  the number of distinct transport flows it uses.</t>

<t>Furthermore, applications have developed mechanisms
  (notably <xref target='RFC5245'>Interactive Connectivity
  Establishment (ICE)</xref>) to traverse NATs and firewalls.  The time
  such mechanisms need to perform the traversal process is
  proportional to the number of distinct transport flows in use.</t>

<t>As a result, in the modern Internet, it is advisable and useful to
  revisit the transport-layer separation of media in a multimedia
  session.  Fortunately, the architecture of RTP allows this to be
  done in a straightforward and natural way: by placing multiple
  sources of different media types in the same RTP session.</t>

<t>Since this is architecturally somewhat different from existing RTP
  deployments, however, this decision has some consequences that may
  be non-obvious.  Furthermore, it is somewhat complex to negotiate
  such flows in signaling protocols that assumed the older
  architecture, most notably <xref target='RFC4566'>the Session
  Description Protocol (SDP)</xref>.  The rest of this document
  discusses these issues.</t>

</section>

<section title='Terminology'>

<t>The key words "MUST", "MUST NOT", "REQUIRED", 
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", 
and "OPTIONAL" in this document
are to be interpreted as described in <xref
target='RFC2119'>RFC 2119</xref> and indicate requirement levels for
compliant implementations.</t>

</section>

<section title='Transmitting multiple types of media in a single RTP session'>

<t><xref target='RFC3550'>RTP</xref> supports the notion of multiple
  sources within a session. Historically, this was typically used for
  distinct users within a group to send media of the same type. Each
  source has its own synchronization source (SSRC) value and has a
  distinct sequence number and timestamp space. This document
  specifies that this same mechanism is used to allow sources of
  multiple media types in the same RTP session, even if they come from
  the same user. For example, in a call containing audio and video
  between two users, each sending a single audio and a single video
  source, there would be a single RTP session containing two sources
  (one audio, one video) from each user, for a total of four sources
  (and thus four SSRC values) within the RTP session.</t>

<t>Transmitting multiple types of media in a
  single <xref target='RFC3550'>RTP</xref> session is done using the
  same RTP mechanisms as are used to transmit multiple sources of the
  same media type on a session.  Notably:

<list style='symbols'>
<t>Each stream (of every media type) is a distinct source
  (distinct stream of consecutive packets to be sent to a decoder) and
  is given a distinct synchronization source ID (SSRC), and has its
  own distinct timestamp and sequence number space.</t>
<t>Every media type (full media type and subtype, e.g. video/h264 or audio/pcmu)
  has a distinct payload type value.  The same payload type
  value mappings apply across all sources in the session.</t>
<t>RTP SSRCs, initial sequence numbers, and initial timestamps are
  chosen at random, independently for each source (of each media type).</t> 
<t>RTCP bandwidth is five percent of the total RTP session bandwidth.</t>
<t>RTP session bandwidth and RTCP bandwidth are divided among all the
  sources in the session.</t>
<t>RTCP sender report (SR) or receiver report (RR) packets, and source
  description (SDES) packets, are sent periodically for every source in the
  session.</t>
</list>
</t>
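<figure>
<preamble>The rules above can be illustrated with a minimal Python sketch.  It is not a normative implementation: the Source class, the add_source and build_session helpers, and the user labels are hypothetical names introduced only for illustration.  It models the two-user audio and video call described earlier, giving each source its own randomly chosen SSRC, initial sequence number, and initial timestamp.</preamble>
<artwork><![CDATA[
```python
import secrets
from dataclasses import dataclass, field


@dataclass
class Source:
    owner: str        # illustrative label for the sending user
    media_type: str   # top-level media type: "audio" or "video"
    # SSRC, initial sequence number and initial timestamp are random
    # and chosen independently for each source.
    ssrc: int = field(default_factory=lambda: secrets.randbits(32))
    seq: int = field(default_factory=lambda: secrets.randbits(16))
    timestamp: int = field(default_factory=lambda: secrets.randbits(32))


def add_source(sources, owner, media_type):
    """Append a new source, redrawing the SSRC on a collision."""
    used = {s.ssrc for s in sources}
    while True:
        source = Source(owner, media_type)
        if source.ssrc not in used:
            sources.append(source)
            return source


def build_session():
    """Two users, each sending one audio and one video source."""
    sources = []
    for owner in ("alice", "bob"):
        for media_type in ("audio", "video"):
            add_source(sources, owner, media_type)
    return sources
```
]]></artwork>
<postamble>Calling build_session() yields four sources with four distinct SSRC values in the one RTP session, regardless of media type.</postamble>
</figure>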

<t>In other words, no special RTP mechanisms are specifically needed
for senders of multiplexed media.  The only constraint is that senders
MUST NOT change the top-level media type (e.g. audio or video)
of a given source.  (It remains valid to change a source's subtype,
e.g. switching between audio/pcmu and audio/g729.)</t>

<t>For a receiver, the primary complexity of multiplexing is knowing
how to process a received source.  Without multiplexing, all sources
in an RTP session can (in theory) be processed in the same manner; e.g.,
all audio sources can be fed to an audio mixer, and all video sources
displayed on a screen.  With multiplexing, however, receivers must
apply additional knowledge.</t>

<t>If the streams being multiplexed are simply audio and video,
this processing decision can be made based simply on a source's
payload type.  For more complex situations (for example, simultaneous
  live-video and shared-application sources, both sent as video), signaling-level
descriptions of sources would be needed, using a mechanism such as
<xref target='RFC5576'>SDP Source Descriptions</xref>.</t>
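<figure>
<preamble>For the simple audio and video case, a receiver's decision might be sketched as follows in Python.  This is an illustration under stated assumptions, not a prescribed design: the payload type map, the Demultiplexer class, and the make_packet helper are hypothetical.  The sketch reads the payload type and SSRC from the fixed RTP header, routes on the top-level media type, and rejects a source that changes its top-level type while still permitting a subtype change.</preamble>
<artwork><![CDATA[
```python
import struct

# Hypothetical dynamic payload type map, as negotiated in signaling.
PAYLOAD_TYPES = {96: "video/H264", 97: "audio/PCMU", 98: "audio/G729"}


def make_packet(payload_type, ssrc, seq=1, timestamp=0):
    """Build a bare 12-byte RTP fixed header (version 2, no payload)."""
    return struct.pack("!BBHII", 0x80, payload_type, seq, timestamp, ssrc)


def parse_rtp_header(packet):
    """Return (payload_type, ssrc) from the 12-byte fixed RTP header."""
    _, m_pt, _, _, ssrc = struct.unpack("!BBHII", packet[:12])
    return m_pt & 0x7F, ssrc  # mask off the marker bit


class Demultiplexer:
    def __init__(self):
        self.kind_by_ssrc = {}  # SSRC -> top-level media type

    def route(self, packet):
        """Return "audio" or "video" for the packet's source."""
        payload_type, ssrc = parse_rtp_header(packet)
        kind = PAYLOAD_TYPES[payload_type].split("/")[0]
        known = self.kind_by_ssrc.setdefault(ssrc, kind)
        if known != kind:
            raise ValueError("source changed its top-level media type")
        return kind
```
]]></artwork>
<postamble>A source that switches from payload type 97 to 98 stays on the audio path; the same SSRC arriving with payload type 96 raises an error.</postamble>
</figure>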

<t>Additionally, due to the large difference in typical bitrate between
  different media (video can easily use a bit rate an order of
  magnitude or more larger than audio), some complications arise with
  RTCP timing.  Because RTCP bandwidth is shared evenly among all
sources in a session, the RTCP for an audio source can end up being
sent significantly more frequently than it would in a non-multiplexed
session.  (The RTCP for video will, correspondingly, be sent slightly
less frequently; this is not nearly as serious an issue.)</t>
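<figure>
<preamble>The effect can be seen in a simplified calculation, sketched here in Python.  All numbers are assumptions chosen for illustration (64 kbit/s audio, 1 Mbit/s video, a 100-byte average RTCP packet), and the rtcp_interval function is a hypothetical helper.  It deliberately ignores the sender and receiver bandwidth split, randomization, and timer reconsideration of the full RTP algorithm.</preamble>
<artwork><![CDATA[
```python
def rtcp_interval(session_bw_bps, n_sources, avg_rtcp_bytes=100,
                  rtcp_fraction=0.05, minimum_s=0.0):
    """Approximate per-source RTCP reporting interval in seconds.

    RTCP bandwidth is a fraction of the session bandwidth and is
    shared evenly among all sources in the session.
    """
    rtcp_bw_bps = session_bw_bps * rtcp_fraction
    interval = n_sources * avg_rtcp_bytes * 8 / rtcp_bw_bps
    return max(interval, minimum_s)


AUDIO_BPS = 64_000      # assumed audio bit rate per session
VIDEO_BPS = 1_000_000   # assumed video bit rate per session

# Two users, audio in its own session: two sources.
audio_only = rtcp_interval(AUDIO_BPS, n_sources=2)

# Same two users, audio and video multiplexed: four sources.
multiplexed = rtcp_interval(AUDIO_BPS + VIDEO_BPS, n_sources=4)
```
]]></artwork>
<postamble>Under these assumed numbers, each audio source reports about every 0.5 seconds in the audio-only session but about every 0.06 seconds in the multiplexed session, unless a minimum interval (the minimum_s argument) is applied.</postamble>
</figure>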

<t>For RTP sessions that use RTP's recommended minimum fixed timing interval
of 5 seconds, this problem is not likely to arise, as most sessions'
bandwidth is not so low that RTCP timing exceeds this limit.  The <xref
target='RFC3551'>RTP/AVP</xref> and <xref
target='RFC3711'>RTP/SAVP</xref> profiles use this minimum interval by
default, and do not have a mechanism in SDP to negotiate an alternate
interval.</t>

<t>For sessions using the <xref target='RFC4585'>RTP/AVPF</xref> and <xref
target='RFC5124'>RTP/SAVPF</xref> profiles, however,
endpoints SHOULD set the minimum RTCP regular
  reporting interval trr-int to 5000 (5 seconds), unless they explicitly need
  it to be lower.  This minimizes the excessive RTCP bandwidth
  consumption, as well as aiding compatibility with AVP endpoints.
  Since this value only affects regular RTCP reports, not RTCP
  feedback, this does not prevent AVPF feedback messages from being
sent as needed.</t>
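<figure>
<preamble>As an illustration, the following Python sketch builds and reads back the SDP attribute that carries trr-int under the AVPF profile.  The function names are hypothetical; the attribute form used, "a=rtcp-fb:" followed by a payload type or "*" and then "trr-int" and a value in milliseconds, is the one assumed here from the AVPF specification.</preamble>
<artwork><![CDATA[
```python
def trr_int_attribute(trr_int_ms=5000, payload_type="*"):
    """Return an SDP attribute line setting trr-int (milliseconds)."""
    return f"a=rtcp-fb:{payload_type} trr-int {trr_int_ms}"


def parse_trr_int(sdp_lines):
    """Return the first trr-int value found (in ms), or None."""
    for line in sdp_lines:
        parts = line.strip().split()
        if (len(parts) == 3 and parts[0].startswith("a=rtcp-fb:")
                and parts[1] == "trr-int" and parts[2].isdigit()):
            return int(parts[2])
    return None
```
]]></artwork>
<postamble>With the default arguments, trr_int_attribute() produces the line "a=rtcp-fb:* trr-int 5000", matching the 5-second interval recommended above.</postamble>
</figure>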

<section title='Optimizations'>

<t>For multiple sources in the same session, several optimizations are
possible.  (Most of these optimizations also apply to multiple sources
of the same type in a session.)  In all cases, endpoints MUST be
  prepared for their peers to be using these optimizations.</t>

<t>An endpoint sending multiple sources MAY, as needed, reallocate
media bandwidth among the RTP sources it is sending.  This includes adding
or removing sources as more or less bandwidth becomes available.</t>

<t>An endpoint MAY choose to send multiple sources' RTCP messages in a
single compound RTCP packet (though such compound packets SHOULD NOT
exceed the path MTU, if avoidable and if it is known).  This will
reduce the average
compound RTCP packet size, and thus increase the frequency with which
RTCP messages can be sent.  Regular (non-feedback) RTCP compound
  packets MUST still begin with an SR or RR packet, but otherwise may
  contain RTCP packets in any order.  Receivers MUST be prepared to
  receive such compound packets.</t>
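<figure>
<preamble>One possible way to aggregate RTCP from several sources is sketched below in Python.  It is a minimal sketch under assumptions, not a required algorithm: each "bundle" is taken to be the already-encoded RTCP packets of one source, beginning with that source's SR or RR, and the 1200-byte default MTU is an arbitrary illustrative value.  Because every bundle starts with a report, every resulting compound packet does too.</preamble>
<artwork><![CDATA[
```python
RTCP_SR, RTCP_RR = 200, 201  # RTCP packet type values


def pack_compound(bundles, mtu=1200):
    """Greedily concatenate per-source bundles into compound packets.

    A bundle larger than the MTU is still sent, alone, since the
    limit is only a SHOULD NOT.
    """
    compounds, current = [], b""
    for bundle in bundles:
        # The second octet of an RTCP packet is its packet type.
        if bundle[1] not in (RTCP_SR, RTCP_RR):
            raise ValueError("each bundle must begin with an SR or RR")
        if current and len(current) + len(bundle) > mtu:
            compounds.append(current)
            current = b""
        current += bundle
    if current:
        compounds.append(current)
    return compounds
```
]]></artwork>
<postamble>For example, three 500-byte bundles with a 1200-byte MTU are sent as two compound packets, of 1000 and 500 bytes.</postamble>
</figure>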

<t>An endpoint SHOULD NOT send reception reports from one of
its own sources about another one ("cross-reports").  Such reports are
useless (they would always indicate zero loss and jitter) and use up
bandwidth that could more profitably be used to send information about
remote sources.  Endpoints receiving reception reports MUST be
  prepared for their peers not to send reception reports about
  their own sources.
(A naive RTCP monitor might think that
there is a network disconnection between these sources; however,
architecturally it is very unclear if such monitors actually exist, or
would care about a disconnection of this sort.)</t>

<t>Similarly, an endpoint sending multiple sources SHOULD NOT send
reception reports about a remote source from more than one of its local
sources.  Instead, it SHOULD pick one of its local sources as the
"reporting" source for each remote source, which sends full report
  blocks; all its other
  sources SHOULD be treated as if they were disconnected, and never saw that
  remote source.  An endpoint MAY choose different local
  sources as the reporting source for different remote sources (for
  example, it could choose to send reports about remote audio sources
  from its local audio source, and reports about remote video sources
  from its local video source), or it MAY choose a single local source
  for all its reports.  If the reporting source leaves the
  session (sends BYE), another reporting source MUST be chosen.  This
  "reporting" source SHOULD also
  be the source for any AVPF feedback messages about its
  remote sources.  Endpoints interpreting reception reports
  MUST be prepared to receive RTCP SR or RR messages where only one
  remote source is reporting about its sources.</t>
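<figure>
<preamble>The choice of reporting source might be sketched as follows in Python.  The assign_reporters function and its policy (prefer a local source of the same top-level media type, otherwise fall back to the lowest-numbered local SSRC) are assumptions made for illustration; this document does not mandate any particular policy.</preamble>
<artwork><![CDATA[
```python
def assign_reporters(local, remote):
    """Map each remote SSRC to the local SSRC that reports on it.

    local, remote: dicts of SSRC -> top-level media type.
    """
    if not local:
        raise ValueError("no local source is available to report")
    fallback = min(local)
    first_of_type = {}
    for ssrc in sorted(local):
        first_of_type.setdefault(local[ssrc], ssrc)
    return {r_ssrc: first_of_type.get(kind, fallback)
            for r_ssrc, kind in remote.items()}


def after_bye(local, leaving_ssrc, remote):
    """Re-run the assignment once a local source has sent BYE."""
    remaining = {s: k for s, k in local.items() if s != leaving_ssrc}
    return assign_reporters(remaining, remote)
```
]]></artwork>
<postamble>Each remote source is reported on by exactly one local source; when that local source leaves the session, after_bye selects a replacement from those that remain.</postamble>
</figure>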

</section>

</section>

<section title='Backward compatibility'>

<t>In some circumstances, the offerer in <xref target='RFC3264'>an
offer/answer exchange</xref> will
not know whether the peer which will receive its offer supports media
type multiplexing.</t>

<t>In scenarios where endpoints can rely on their peers
  supporting <xref target='RFC5245'>Interactive Connectivity
  Establishment (ICE)</xref>, even if they might not support multiplexing,
  this should not be a problem.  An
  endpoint could construct a list of ICE candidates for its single
  session, and then offer that list, for backward compatibility,
  toward each of the peers; it would disambiguate the flows based on
  the ufrag fields in the received ICE connectivity checks.  (This
  would result in the chosen ICE candidates participating in multiple
  RTP sessions, in much the same manner as following a forked SIP
  offer.)  For RTCWeb, it is currently anticipated that ICE will be
  required in all cases, for consent verification.</t>

<t>The more difficult case is if an offerer cannot rely on its
potential peers supporting any features beyond baseline RTP (i.e.,
  neither ICE nor multiplexing).
In this case, it would either need to be prepared to use only a
single media type (e.g., audio) with such a peer, or else it would need to
do the pre-offer steps to set up all the non-multiplexed sessions.
Notably, this would
include opening local ports, and doing ICE address gathering
(collecting candidate addresses from STUN and/or TURN servers) for
each session, even if it is anticipated that in most cases
backward compatibility is not going to be necessary.</t>

<t>If the signaling protocol in use supports sending additional ICE
candidates for an ongoing ICE exchange, or updating the destination of
  a non-ICE RTP session, it is instead possible
for an offerer to do such gathering lazily, e.g. opening only local host
candidates for the non-default RTP sessions, and gathering and offering
additional candidates or public relay addresses once it becomes clear
  that they are needed.
(With SIP, sending
updated candidates or RTP destinations prior to the call being answered is possible only if
  both peers support
the <xref target='RFC3262'>SIP 100rel feature</xref>, i.e. PRACK and
  UPDATE; otherwise, the initial offer cannot be updated until 
after the 200 OK response to the initial INVITE.)</t>

</section>


<section title='Signaling'>

<t>There is a need to signal multiplexed media in the <xref
target='RFC4566'>Session Description Protocol (SDP)</xref> -- for
inter-domain federation in the case of RTCWeb, as well as for "pure"
SIP endpoints that also want to use media-multiplexed sessions.</t>

<t>To signal multiplexed sessions, two approaches seem to present
themselves: either using the <xref target='RFC5888'>SDP grouping
framework</xref>, as in <xref
target='I-D.holmberg-mmusic-sdp-bundle-negotiation' />, or directly
representing the multiplexed sessions in SDP.</t>

<t>Directly encoded multiplexed sessions would have some grammar issues in
SDP, as the syntax of SDP mixes together top-level media types and
transport information in the m= line, splitting media types to be
partially described in the m= line and partially in the a=rtpmap
attribute.  New SDP attributes would need to be invented to describe
the top-level media types for each source.</t>

<figure anchor="mux_mline_example" title="Hypothetical syntax for
										  describing multiplexed media
										  lines in SDP">
<artwork>
m=multiplex 49170 RTP/AVP 96 97
a=mediamap:96 video
a=rtpmap:96 H264/90000
a=mediamap:97 audio
a=rtpmap:97 pcmu/8000
</artwork>
</figure>


<t>If single-pass backward compatibility is (ever) a goal, directly
  encoding multiplexed sessions in SDP m= lines becomes much more complex,
  as it would require <xref
  target="RFC5939">SDP Capability Negotiation</xref> in
  order to offer both the legacy and the multiplexed streams.</t>

<t>Using SDP grouping seems to rule out the possibility of
  non-backward-compatible multiplexed streams.  Other than that,
  however, it seems that it would be the easier path to signal
  multiplexed sessions.</t>

</section>

<section title='Protocols with SSRC semantics'>

<t>There are some RTP protocols that impose semantics on SSRC values.
  Most notably, several protocols (for instance,
  <xref target='RFC5109'>FEC</xref>, <xref target='RFC5583'>layered codecs</xref>, or
  <xref target='RFC4588'>RTP retransmission</xref>)
  have modes that require that sources in multiple RTP sessions have
  the same SSRC value.</t>

<t>When multiplexing, this is impossible.  Fortunately, in each case, there are
  alternative ways to do this, by <xref target='RFC5576'>explicitly signaling RTP SSRC
  values</xref>.  Thus, when multiplexing, these modes need to be used
  instead.</t>
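<figure>
<preamble>As a rough illustration of explicit SSRC signaling, the Python sketch below emits SDP source-level attributes that associate related sources within one session.  The ssrc_group_lines helper, the SSRC values, and the CNAME are hypothetical; the "a=ssrc" and "a=ssrc-group" attribute forms and the FID and FEC semantics are assumed here to be those of the SDP source description mechanism cited above.</preamble>
<artwork><![CDATA[
```python
def ssrc_group_lines(semantics, ssrcs, cname):
    """Return SDP lines declaring the SSRCs and grouping them.

    semantics: grouping semantics, e.g. "FID" (flow identification,
    as used with retransmission) or "FEC".
    """
    if semantics not in ("FID", "FEC"):
        raise ValueError("this sketch only handles FID and FEC")
    lines = [f"a=ssrc:{ssrc} cname:{cname}" for ssrc in ssrcs]
    ids = " ".join(str(ssrc) for ssrc in ssrcs)
    lines.append(f"a=ssrc-group:{semantics} {ids}")
    return lines
```
]]></artwork>
<postamble>For instance, ssrc_group_lines("FID", [1111, 2222], "user@example.com") declares two sources with a common CNAME and ties them together, without relying on matching SSRC values across sessions.</postamble>
</figure>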

<t>It is unclear how to signal this in a backward-compatible way
  (falling back to session-multiplexed modes) if SDP
  grouping semantics are used to describe multiplexed sources in SDP.</t>

</section>

<section title='Security Considerations' anchor='security'>

<t>The security considerations of a muxed stream appear to be similar
  to those of multiple sources of the same media type in an RTP session.</t>

<t>Notably, it is crucial that SSRC values are never used more than
  once with the same SRTP keys.</t> 

</section>

<section title='IANA Considerations' anchor='iana'>

<t>The IANA actions required depend on the decision about how muxed
  streams are signaled.</t>

</section>

</middle>

<back>

<references title='Normative References'>

&rfc2119;

&rtp;

&avpf;

</references>

<references title='Informative References'>

&ice;

&sdp;

&offeranswer;

&avp;

&srtp;

&savpf;

&sdpbundle;

&sourcedesc;

&sdpgroup;

&sdpcapneg;

&rtpfec;

&layergroup;

&rtprtx;

&prack;

</references>

<!--
<section title='Open issues'>

<t><list style='symbols'>

<t></t>

</list></t>

</section>
-->

<!-- 
<section title='Changes From Earlier Versions'>

<t>Note to the RFC-Editor: please remove this section prior to publication
as an RFC.</t>

<section title='Changes From Draft -00'>

<t><list style='symbols'>

<t></t>


</list></t>

</section>

</section>

-->

</back>

</rfc>
