<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc='yes'?>
<?rfc tocdepth='5'?>
<?rfc symrefs='yes'?>

<?rfc compact='yes'?>
<?rfc subcompact='no'?>

<rfc ipr="trust200902" category="info" docName="draft-rosenberg-rtcweb-rtpmux-00">


    <front>
        <title abbrev="RTP Mux">
Multiplexing of Real-Time Transport Protocol (RTP) Traffic for Browser based Real-Time Communications (RTC)</title>
    
        <author initials="J.R." surname="Rosenberg"
                fullname="Jonathan Rosenberg">
            <organization>Skype</organization>
    
            <address>
                <email>jdrosen@skype.net</email>
                <uri>http://www.jdrosen.net</uri>
            </address>
        </author>

        <author initials="C.J." surname="Jennings"
                fullname="Cullen Jennings">
            <organization>Cisco</organization>
    
            <address>
                <email>fluffy@cisco.com</email>
            </address>
        </author>

        <author initials="J.P." surname="Peterson"
                fullname="Jon Peterson">
            <organization>Neustar</organization>
    
            <address>
                <email>jon.peterson@neustar.biz</email>
            </address>
        </author>

        <author initials="M.K." surname="Kaufman"
                fullname="Matthew Kaufman">
            <organization>Skype</organization>
    
            <address>
                <email>matthew.kaufman@skype.net</email>
            </address>
        </author>

        <author initials="E.R." surname="Rescorla"
                fullname="Eric Rescorla">
            <organization>RTFM</organization>
    
            <address>
                <email>ekr@rtfm.com</email>
            </address>
        </author>

        <author initials="T.T" surname="Terriberry"
                fullname="Tim Terriberry">
            <organization>Mozilla</organization>
    
            <address>
                <email>tterriberry@mozilla.com</email>
            </address>
        </author>



        <date month="July" day="4" year="2011" />
    
        <area>RAI</area>
        <workgroup>RTCWEB</workgroup>
        <keyword>SIP</keyword>
        <keyword>RTP</keyword>
        <abstract>
            <t>This document argues that multiplexing of voice and
            video traffic over a single RTP session should be
            specified as the baseline mode of operation for
            multimedia traffic in RTC web. </t>
        </abstract>
    </front>

<middle>
<section title="Introduction">

<t>
The RTCweb working group is chartered to specify a framework and
protocols for enabling real-time communications services within a
browser, without the need for plugins
<xref target="I-D.rosenberg-rtcweb-framework"/>. It is envisioned that this
will enable many use cases
<xref target="I-D.ietf-rtcweb-use-cases-and-requirements"/>, the most basic of which
is a video call between two users on the web.
</t>

<t>
In order to enable this functionality, the specifications produced by
the IETF will mandate a specific set of protocols that must be
implemented within the browser. It is anticipated that these protocols
will include the Real-Time Transport Protocol <xref
target="RFC3550"/>, and either in full or in part, Interactive
Connectivity Establishment (ICE) <xref target="RFC5245"/>.
</t>

<t>
The usage of RTP raises the question of multiplexing - whether or not
RTCP and RTP should run on the same port, and furthermore, whether or
not voice, video, and possibly data, should also run on the same
port. To provide guidance on this, Perkins et al. produced <xref
target="I-D.perkins-rtcweb-rtp-usage"/>, which recommends that
voice and video utilize different RTP sessions, and thus different UDP
ports. 
</t>

<t>
This document argues against this conclusion, and advocates that a
single transport session (i.e., a single UDP port) be used to carry
voice and video traffic, using the SSRC for demultiplexing.
</t>

</section>

<section title="RTP Muxing with SSRC">

<t>
This document recommends that all of the associated media content of
the call (the voice, the video, and the RTCP traffic for both) utilize
a single transport session (i.e., a single UDP port). In cases where
there are multiple video streams (for example, screen sharing), the
single transport session would carry all of the video. Furthermore, it
recommends that voice and video traffic be demultiplexed by assigning a
different SSRC to each stream. This recommendation applies to the case
of a single unicast communications session between a pair of
endpoints; this document does not consider, for example, the case of
running a multi-user service such as a gateway.
</t>
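<t>
As a minimal sketch of what SSRC-based demultiplexing could look like
at a receiver, the following Python reads the SSRC from the fixed RTP
header defined in <xref target="RFC3550"/> and looks it up in a table.
The class SsrcDemux, the helper make_rtp, and the stream labels are
hypothetical names introduced only for this illustration; how the
table is populated is left to out-of-band signaling.
</t>

<figure title="Sketch: Demultiplexing by SSRC (illustrative only)"><artwork>
<![CDATA[
```python
import struct

RTP_HEADER_LEN = 12  # fixed RTP header size (RFC 3550), no CSRC entries


def rtp_ssrc(packet: bytes) -> int:
    """Return the SSRC carried in the fixed RTP header of a packet."""
    if len(packet) < RTP_HEADER_LEN:
        raise ValueError("packet shorter than the fixed RTP header")
    if packet[0] >> 6 != 2:
        raise ValueError("not RTP version 2")
    # The SSRC occupies bytes 8..11 of the header, in network byte order.
    (ssrc,) = struct.unpack_from("!I", packet, 8)
    return ssrc


class SsrcDemux:
    """Hypothetical receiver-side table mapping SSRC values to labels."""

    def __init__(self):
        self._streams = {}

    def register(self, ssrc: int, label: str) -> None:
        self._streams[ssrc] = label

    def route(self, packet: bytes) -> str:
        # Unknown SSRCs are reported rather than guessed at.
        return self._streams.get(rtp_ssrc(packet), "unknown")


def make_rtp(ssrc: int, payload_type: int = 0,
             seq: int = 0, ts: int = 0) -> bytes:
    """Build a header-only RTP packet, for illustration and testing."""
    return struct.pack("!BBHII", 0x80, payload_type & 0x7F, seq, ts, ssrc)
```
]]></artwork></figure>

<t>
In this sketch, all packets arrive on one UDP port and the lookup on
the SSRC alone decides whether a packet belongs to the voice or the
video stream.
</t>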

<t>
To enable multiplexing, we propose that the 32-bit SSRC value in the
RTP header be broken up into the following sub-fields:
</t>

<figure title="SSRC Field"><artwork>
<![CDATA[
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  |          Magic Cookie         |Type |     StreamID          |x|
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
]]></artwork></figure>

<t>
The Magic Cookie is two bytes, with a value of 0xf7b3. It is meant to
help DPI applications determine, with high confidence, that an RTP
packet uses the encoding format defined here. The Type is a 3-bit
value, corresponding to the top-level MIME type of the media (mapping
table TBD). It too is meant to help DPI applications which want to
separate voice and video. The StreamID is a 12-bit field which
represents the unique ID for this stream. It is signaled between
participants out of band. The final bit, 'x', is set to zero and is
reserved for future usage.
</t>
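<t>
The field layout above can be sketched in Python as a pair of
pack/unpack helpers. This is an illustration under stated assumptions,
not a normative encoding: the function names are hypothetical, and
because the Type mapping table is still TBD, the Type values used in
any example are placeholders rather than assigned code points.
</t>

<figure title="Sketch: Packing and Unpacking the SSRC Sub-Fields"><artwork>
<![CDATA[
```python
MAGIC_COOKIE = 0xF7B3  # 16-bit cookie value given in the text


def pack_ssrc(media_type: int, stream_id: int) -> int:
    """Compose a 32-bit SSRC: cookie(16) | type(3) | stream ID(12) | x(1)."""
    if not 0 <= media_type < 8:
        raise ValueError("Type must fit in 3 bits")
    if not 0 <= stream_id < 4096:
        raise ValueError("StreamID must fit in 12 bits")
    # The reserved 'x' bit (bit 0) is left as zero.
    return (MAGIC_COOKIE << 16) | (media_type << 13) | (stream_id << 1)


def unpack_ssrc(ssrc: int):
    """Return (type, stream_id, x), or None if the cookie does not match."""
    if (ssrc >> 16) & 0xFFFF != MAGIC_COOKIE:
        return None
    return ((ssrc >> 13) & 0x7, (ssrc >> 1) & 0xFFF, ssrc & 0x1)
```
]]></artwork></figure>

<t>
An SSRC that does not begin with the cookie is reported as None, which
mirrors the intended use of the cookie as a format indicator.
</t>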

</section>

<section title="Arguments in Favor of Multiplexing">

<t>
This section outlines several arguments in favor of multiplexing.
</t>

<section title="NAT Resource Preservation">

<t>
Today's Internet is full of Network Address Translators (NAT), a
situation which is likely to get worse as IPv4 address exhaustion
continues. When NAT is in use, the constraint on the number of
endpoints behind the NAT is based on the number of parallel transport
sessions that need to be supported. If, for example, a NAT has a
single external IP address, it can support 64k UDP sessions while
having an endpoint-independent mapping behavior <xref
target="RFC4787"/>. Thus, in the presence of NAT, parallel transport
sessions become the scarce resource.
</t>

<t>
If rtcweb specifies that audio and video run on separate ports, this
will double the number of transport session resources consumed in
intervening NATs. While the usage of the port as an application-layer
demux point made sense when RTP was designed back in 1992 (the year
the first RTP draft was published), the Internet has changed
substantially since then. Continuing to perpetuate this design
favors preservation of legacy behavior over protection of
resources in the modern Internet. We feel that this optimizes in the
wrong direction.
</t>
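<t>
The arithmetic behind this concern can be sketched as follows. The
calculation is a rough upper bound that assumes the full 64k port
space of each external address is available to calls and that every
transport session consumes exactly one external port; real NATs
reserve ports and carry other traffic, so actual capacity is lower.
The function name is hypothetical.
</t>

<figure title="Sketch: Upper Bound on Concurrent Calls Behind a NAT"><artwork>
<![CDATA[
```python
UDP_PORTS_PER_ADDRESS = 65536  # the "64k" UDP sessions per external address


def max_concurrent_calls(ports_per_call: int,
                         external_addresses: int = 1) -> int:
    """Upper bound on simultaneous calls through a NAT.

    Assumes each transport session of a call consumes one external port.
    """
    if ports_per_call < 1:
        raise ValueError("a call needs at least one port")
    return (UDP_PORTS_PER_ADDRESS * external_addresses) // ports_per_call
```
]]></artwork></figure>

<t>
Under these assumptions, a call using one multiplexed port allows
twice as many simultaneous calls as a call using separate voice and
video ports, and four times as many as a call that also uses separate
RTCP ports for each.
</t>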

<t>
Given that we anticipate widespread usage of rtcweb, this design
choice may create a non-trivial load on the transport session capacity
of the Internet at large. Real-time video communications on the
Internet have seen huge growth in recent years. For Skype, approximately 40%
of its Skype-to-Skype calls are video based. A recent report by
Sandvine states that Skype alone is the third largest source of
upload traffic on the Internet as a whole, largely attributed to Skype
video calling <eref
target="http://www.sandvine.com/downloads/documents/05-17-2011_phenomena/Sandvine%20Global%20Internet%20Phenomena%20Spotlight%20-%20Netflix%20Rising.pdf"/>. The
conclusion from this is that the costs of a separate voice and video
port cannot be ignored. 
</t>

<t>
Simply put, the usage of transport ports for application
demultiplexing should be considered harmful for the Internet.
</t>

</section>

<section title="Improved Failure Modes">

<t>
The usage of separate transport sessions for the audio, video or other
content of the call introduces a variety of partial failure modes. The
transport session for one type of media might get established, but a
NAT capacity problem might cause the transport session for another
type of media to fail. Usage of a single transport session means that
the conversation succeeds or fails atomically. We consider this a
feature. 
</t>

</section>

<section title="Setup Time">

<t>
The rtcweb group is considering the usage of ICE to create p2p
sessions. ICE provides firewall and NAT traversal in addition to
providing a handshake necessary to assure mutual consent for
communications. 
</t>

<t>
Unfortunately, ICE requires time to perform its setup operations. This
time grows in proportion to the number of transport sessions which
must be opened in order to support the call. By using a different port
for video traffic, call setup times will increase. The precise amount
of this increase depends on the type of NAT and varies depending on
packet loss. However, in a simple, ideal case of no packet loss and
direct connectivity between endpoints, this value is XXX [[fill in]].
</t>
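<t>
The proportionality claim can be illustrated with a deliberately
simplified model. It does not supply the figure left open above. It
assumes that every local candidate is paired with every remote
candidate for each transport session and that checks are sent one per
pacing interval; the function names and the 20 ms default pacing value
are assumptions chosen for illustration only.
</t>

<figure title="Sketch: Simplified Model of ICE Check Volume"><artwork>
<![CDATA[
```python
def connectivity_checks(transport_sessions: int,
                        local_candidates: int,
                        remote_candidates: int) -> int:
    """Candidate pairs to check under full local-by-remote pairing."""
    return transport_sessions * local_candidates * remote_candidates


def pacing_floor_ms(checks: int, pacing_ms: int = 20) -> int:
    """Time needed just to send the checks, one per pacing interval.

    pacing_ms is an illustrative assumption, not a measured value.
    """
    return checks * pacing_ms
```
]]></artwork></figure>

<t>
In this model, four transport sessions (voice RTP, voice RTCP, video
RTP, video RTCP) generate four times as many checks as one multiplexed
session with the same candidates, which is the linear growth described
above.
</t>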

</section>

<section title="Complexity">

<t>
ICE is not a simple protocol. One of its significant complexities is
its requirement to support calls for multiple media streams, each of
which runs on a separate port, and multiple components for each stream
(e.g., RTCP). If the concept of streams and components were
eliminated, ICE would be a simpler protocol. 
</t>

<t>
If, within rtcweb, a single transport connection were utilized,
browsers could implement a simplified version of the ICE protocol.
</t>

</section>

</section>

<section title="Responding to draft-perkins-rtcweb-rtp-usage">

<t>
<xref target="I-D.perkins-rtcweb-rtp-usage"/> outlines several arguments
for continuing to use a separate port for audio and video. In this
section, we respond to those arguments.
</t>

<section title="Requires Additional Signaling">

<t>
<xref target="I-D.perkins-rtcweb-rtp-usage"/> argues that multiplexing of
voice and video on the same RTP session would require a demux point to
be specified (for example, the SSRC), and require additional signaling
to be specified to accomplish this.
</t>

<t>
Firstly, this conclusion is only partly true. For communications
sessions between rtcweb users within the same domain, no signaling
specifications are required. This is true in general with rtcweb; one
of its benefits is that it does not require standardized signaling.
</t>

<t>
Secondly, it is not yet clear that rtcweb will be able to interoperate
with existing VoIP endpoints without a media intermediary to terminate
ICE traffic. It is our position that interoperability without a media
intermediary should only be provided for basic voice services, and even then,
only when RTCP is supported. In the case of basic voice endpoints,
where there is no video, RTP multiplexing of voice and video is
irrelevant, and thus no signaling complexity is introduced. 
</t>

<t>
Thirdly, the primary place where there will be a need for signaling
enhancements is for calling between rtcweb endpoints in
different domains. In such a case, an SDP extension is required, and
one can be specified. It is trivial to do so.
</t>

<t>
Finally, this document does recommend that it be possible to utilize a
separate transport session for voice and for video, and that, in the
worst case, this mode can be used for calls between an rtcweb endpoint
and a legacy endpoint.
</t>


</section>

<section title="QoS and Traffic Engineering">

<t>
<xref target="I-D.perkins-rtcweb-rtp-usage"/> argues that multiplexing of
voice and video on the same RTP session would mean that it would not
be possible to apply QoS techniques which rely on the 5-tuple
separately to voice and video.
</t>

<t>
Firstly, the public Internet lacks any QoS mechanism, so this argument
is moot on the public Internet.
</t>

<t>
Secondly, private enterprise networks which do provide QoS most often
use diffserv. Diffserv is compatible with utilization of a common port
for voice and video traffic. Typically, different DSCPs are used for
voice and video (Cisco recommends EF for audio and AF41 for video in
enterprise telephony deployments), and this practice is compatible
with usage of the same port - each packet would be marked
appropriately. It is also possible to use the same DSCP for voice and
video.
</t>
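<t>
Per-packet marking on a shared port can be sketched as below. The DSCP
values for EF (46) and AF41 (34) are the standard code points; the
mapping table and the helper names are hypothetical, and the use of
the IP_TOS socket option is an assumption about the host platform (it
is not available everywhere, and some operating systems ignore or
restrict it).
</t>

<figure title="Sketch: Per-Packet DSCP Marking on One UDP Socket"><artwork>
<![CDATA[
```python
import socket

DSCP_EF = 46    # Expedited Forwarding
DSCP_AF41 = 34  # Assured Forwarding, class 4, low drop precedence

# Hypothetical mapping following the recommendation cited in the text.
DSCP_BY_MEDIA = {"audio": DSCP_EF, "video": DSCP_AF41}


def tos_byte(dscp: int) -> int:
    """Place a 6-bit DSCP in the upper bits of the IPv4 TOS/DS octet."""
    if not 0 <= dscp < 64:
        raise ValueError("DSCP must fit in 6 bits")
    return dscp << 2


def send_marked(sock: socket.socket, payload: bytes, addr, media: str):
    """Send one datagram, marked according to its media type.

    Assumes an IPv4 UDP socket on a platform that honors IP_TOS.
    """
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS,
                    tos_byte(DSCP_BY_MEDIA[media]))
    sock.sendto(payload, addr)
```
]]></artwork></figure>

<t>
Both media types leave from the same socket, and therefore the same
5-tuple, while each packet carries the marking appropriate to its
media type.
</t>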

<t>
Carrier networks, such as mobile operator networks, typically provide
QoS through traffic engineering, using a combination of MPLS tunnels
and diffserv markings. MPLS tunnels do use 5-tuples as classifiers to
determine which traffic to put in what kind of tunnel. If there is a
need for using separate MPLS tunnels for voice and video, the DSCP
codepoint itself can be used as a differentiator.
</t>

<t>
It is true that it would not be possible to utilize RSVP to separately
establish QoS treatment for the voice and the video traffic. However,
there is very little real deployment of RSVP: none within the public
Internet and relatively little within corporate networks. As such,
this argument is mostly theoretical.
</t>

<t>
Finally, DPI is used within some operator networks to perform traffic
classification. It would always be possible to use DPI to assign
different treatment to voice and video traffic.
</t>


</section>

<section title="Scalability">

<t>
<xref target="I-D.perkins-rtcweb-rtp-usage"/> argues that multiplexing of
voice and video on the same RTP session would mean that layered coding
using multicast for each layer would not be possible.
</t>

<t>
Firstly, most layered coding today uses unicast and a switch or mixer
of some sort to discard layers. That architecture is completely
compatible with the usage of a single transport session for voice and
video. The limitation applies only to the use of IP multicast for
real-time communications. The usage of multicast on the Internet has
substantially diminished over time. There is some usage today in
private networks but primarily for streaming media distribution. The
usage for real-time communications is quite rare. As such, we find
this to be a theoretical corner case.
</t>

</section>

<section title="RTP Retransmission">

<t>
<xref target="I-D.perkins-rtcweb-rtp-usage"/> argues that multiplexing of
voice and video on the same RTP session would not be interoperable
with endpoints doing RTP retransmission per <xref target="RFC4588"/>. 
</t>

<t>
As pointed out above, interoperability with existing endpoints without
the usage of a media intermediary is not a given at this point, and we
argue it should only be supported for the common case - a basic,
voice-only RTP-capable endpoint. There is, to our knowledge,
relatively little deployment of RFC 4588, at least for real-time
communications. It is certainly not a common feature in basic RTP
endpoints and never a baseline requirement for
interoperability. Consequently, if there is a need to interoperate
with an endpoint supporting RFC 4588, and it is desired to avoid a
media intermediary, RFC 4588 can just be turned off for the session.
</t>

<t>
As such, we find the interoperability argument here not compelling.
</t>

</section>

<section title="Forward Error Correction">

<t>
<xref target="I-D.perkins-rtcweb-rtp-usage"/> argues that multiplexing of
voice and video on the same RTP session will limit the applicability
of FEC <xref target="RFC5109"/> to when the RTP packets are half of
the path MTU.
</t>

<t>
There are two cases to consider - interoperability with existing
endpoints and usage for calls between rtcweb endpoints.
</t>

<t>
For interoperability with existing endpoints, we argue the same thing
here as for retransmits. FEC is not commonly used in legacy voice
endpoints, and if it is supported, is never a required
feature. Consequently, if present, its usage can be disabled when
interoperating with an rtcweb endpoint. If FEC is included as part of
the rtcweb specifications, the lower bandwidth of voice means that FEC
packets could be sent on the same port, using <xref
target="RFC2198"/>, without approaching the path MTU.
</t>

<t>
For communications between rtcweb endpoints, this is only an issue if
FEC is included as part of the rtcweb specification. If the group
decides to do that (there is some value for real-time video), it
should define a mechanism which allows for FEC packets to be sent
using a separate SSRC.
</t>

</section>

<section title="RTCP Issues">

<t>
<xref target="I-D.perkins-rtcweb-rtp-usage"/> argues that multiplexing of
voice and video on the same RTP session will introduce complications
in the usage of RTCP, primarily when considering RTCP extensions.
</t>

<t>
It is our belief that normal RTCP operation as defined in the RTCP
specification will work fine with multiplexed voice and video
traffic. SRs and RRs are already generated per SSRC to handle multiple
senders, and RTCP in general supports feedback for multiple SSRCs
within a session. These mechanisms work as defined when each SSRC
happens to represent a different media stream instead of a different
user. 
</t>

<t>
The only complication that arises is for RTCP extensions which are
defined to be media
dependent. <xref target="I-D.perkins-rtcweb-rtp-usage"/> points out, as an
example, the usage of RTCP extended report blocks (XR)
<xref target="RFC3611"/>. However, XR works fine in conjunction with
multiplexing of voice and video within the same port. Each of the
seven report blocks defined in <xref target="RFC3611"/> includes the
SSRC of the source as part of the block, and thus will
work. <xref target="I-D.perkins-rtcweb-rtp-usage"/> indicates that
"SSRC purpose tagging needs not only to be one the media side, but 
also on the RTCP reporting". However, we do not believe this to be
accurate. Since the XR blocks report the SSRC source already, the
specifications provide all that is needed. The XR report is merely
included when it is relevant.
</t>

<t>
Furthermore, the discussion around XR assumes that we need to support
it for interoperability with existing VoIP endpoints, or that we are
utilizing it for rtcweb itself. As with FEC and retransmissions, in
the case of interoperability, if there is an issue, XR can simply be
disabled. <xref target="RFC3611"/> does specify that XR
can be sent without prior signaling. In the worst case, XR packets
received by an rtcweb endpoint are simply discarded. In terms of usage of
RTCP XR for communications between rtcweb endpoints, we would argue
that a much more flexible solution would be to provide JavaScript APIs
which allow the application to have access to the same data used to
generate the XR, and then the application itself can use this data as
it sees fit, including sending it back to the sender through some kind
of application data packet.
</t>

</section>

</section>

<section title="Arguing Against a Shim">

<t>
It has been proposed on the mailing list that an alternative approach
for multiplexing on the same port would be to specify a new
multiplexing protocol that has a small shim, which could then be used
to separate voice and video traffic as a layer between UDP and
RTP. Such a shim could then also be used to enable non-RTP data
traffic as well.
</t>

<t>
We believe that such a shim would be a mistake, for the same reason
that shims have been avoided in the multiplexing of RTCP, STUN, and
DTLS on the same port as RTP:
</t>

<t>
<list style="symbols">

<t>The shim would break interoperability with a great deal of existing
network inspection gear - firewalls, packet sniffers, traffic
analyzers, and so on - which know how to extract, parse, and process
RTP packets. 
</t>

<t>The shim would add complexity through yet another layer of
multiplexing.
</t>

<t>The shim would increase packet overhead further.
</t>

<t>A shim is a mistake which cannot be undone later.
If multiplexing on a single port truly causes interoperability issues,
clients can fall back to using multiple ports, possibly even in the
preponderance of cases. However, once a shim is inserted,
interoperability will always require an intermediary to strip it out,
forever.
</t>

</list>
</t>
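<t>
For context, the shim-free multiplexing of STUN, DTLS, RTP, and RTCP
mentioned above works by inspecting the leading bytes of each
datagram. The sketch below follows the first-byte ranges described in
RFC 5764 and uses the second-byte range 192-223 as a heuristic for
RTCP, in the spirit of RFC 5761. The function name and the returned
labels are hypothetical, and the RTCP heuristic assumes that RTP
payload types 64-95 are not in use on the session.
</t>

<figure title="Sketch: Classifying Packets on a Shared Port Without a Shim"><artwork>
<![CDATA[
```python
def classify(packet: bytes) -> str:
    """Guess the protocol of a datagram received on a shared UDP port."""
    if not packet:
        return "unknown"
    first = packet[0]
    if first in (0, 1):
        return "stun"   # STUN message types begin with two zero bits
    if 20 <= first <= 63:
        return "dtls"   # DTLS record content types
    if 128 <= first <= 191:
        # Version 2 RTP or RTCP; the second byte tells them apart,
        # assuming RTP payload types 64-95 are avoided.
        if len(packet) > 1 and 192 <= packet[1] <= 223:
            return "rtcp"
        return "rtp"
    return "unknown"
```
]]></artwork></figure>

<t>
No bytes are added to any packet in this approach, which is why
existing inspection gear continues to recognize the RTP packets it
sees.
</t>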

</section>

<section title="Conclusion">

<t>
In conclusion, we feel that the benefits of multiplexing voice and
video on a single RTP session (and thus a single transport connection)
outweigh the drawbacks. The primary benefit is the impact on NAT
capacity, which is becoming an important issue in the modern
Internet. Furthermore, the unique nature of backwards compatibility
for rtcweb lessens many of the interoperability concerns, and the
traditional arguments around multicast and RSVP are simply no longer
relevant, as those technologies have faded from use.
</t>

</section>

</middle>

<back>

<references title="Informative References">
<?rfc include="reference.I-D.perkins-rtcweb-rtp-usage"?>
<?rfc include="reference.I-D.rosenberg-rtcweb-framework"?>
<?rfc include="reference.I-D.ietf-rtcweb-use-cases-and-requirements"?>
<?rfc include="reference.RFC.5245"?>
<?rfc include="reference.RFC.3550"?>
<?rfc include="reference.RFC.4787"?>
<?rfc include="reference.RFC.4588"?>
<?rfc include="reference.RFC.5109"?>
<?rfc include="reference.RFC.3611"?>
<?rfc include="reference.RFC.2198"?>
</references>

</back>

</rfc>

