<?xml version="1.0" encoding="US-ASCII"?>
<!DOCTYPE rfc SYSTEM "rfc2629.dtd">
<?rfc toc="yes"?>
<?rfc tocompact="yes"?>
<?rfc tocdepth="3"?>
<?rfc tocindent="yes"?>
<?rfc symrefs="yes"?>
<?rfc sortrefs="yes"?>
<?rfc comments="yes"?>
<?rfc inline="yes"?>
<?rfc compact="yes"?>
<?rfc subcompact="no"?>
<rfc category="info" docName="draft-alvestrand-rtp-sess-neutral-01"
     ipr="trust200902">
  <front>
    <title abbrev="RTP session neutrality">Why RTP Sessions Should Be Content
    Neutral</title>

    <author fullname="Harald T. Alvestrand" initials="H. T. "
            surname="Alvestrand">
      <organization>Google</organization>

      <address>
        <postal>
          <street>Kungsbron 2</street>

          <city>Stockholm</city>

          <region></region>

          <code>11122</code>

          <country>Sweden</country>
        </postal>

        <email>harald@alvestrand.no</email>
      </address>
    </author>

    <date day="17" month="June" year="2012" />

    <abstract>
      <t>This document is not intended for publication as an RFC.</t>

      <t>It gives the underpinning arguments for why the idea that RTP
      sessions and MIME top level types are related is a deeply broken
      paradigm, and that we need to get away from it.</t>

      <t>These arguments are solely the opinion of the listed author.</t>
    </abstract>

    <note title="Requirements Language">
      <t>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in <xref
      target="RFC2119">RFC 2119</xref>.</t>
    </note>
  </front>

  <middle>
    <section title="Introduction">
      <t>The RTP universe of functionality can, for the purposes of this
      argument, be reduced to two components: The RTP wire protocol <xref
      target="RFC3550"></xref> (consisting of the RTP packet format, the RTCP
      reporting format, and the handling rules for RTP sessions), and the SDP
      session description language. For the purposes of this argument, the SDP
      functionality for describing non-RTP sessions is ignored, as is the
      ability to negotiate RTP sessions by other means than SDP.</t>

      <t>This document argues that the RTP mechanisms of multiple RTP sessions
      make sense for a lot of purposes, but does NOT make sense for a mandated
      separation between different top-level MIME media types.</t>
    </section>

    <section title="Stuff that can be carried over RTP">
      <t>RTP, according to its own description an application layer framework
      component, is a suitable protocol for framing data that needs to travel
      across the network in a time-sensitive fashion, with the idea that it is
      going to be presented at the receiving end in a time sequence. Normally,
      the data (usually called "media") is streamed across the network at a
      rate approximately equal to the speed at which it is intended to be
      presented ("real time data").</t>

      <t>Examples of data carried over RTP include:</t>

      <t><list style="symbols">
          <t>G.711 Audio - 64 Kbits/second, completely fixed bitrate</t>

          <t>GSM AMR Audio - 4.75 to 12.2 Kbits/second, variable bitrate</t>

          <t>OPUS audio compressed into near-incomprehensibility - 6
          kbits/second, variable bitrate</t>

          <t>OPUS audio carrying high fidelity music - 500 kbits/second,
          variable bitrate</t>

          <t>QQVGA (160x120) video at 15 FPS in H.264 compression - 50
          Kbits/second, variable bitrate, lots of schemes for error
          concealment and correction</t>

          <t>HD video at 1920x1080@60 in H.264 compression - 1.4
          Mbits/second</t>

          <t>Real-time text (T.140) - very few bits/second</t>

          <t><xref target="RFC4733"></xref> DTMF tone signalling - very few
          bits/second</t>
        </list>Schemes designed to increase the reliability of data carried
      across RTP include:</t>

      <t><list style="symbols">
          <t>Forward error correction (FEC)</t>

          <t>Duplicated streams, codec-independent</t>

          <t>Duplicate sending of important information within the codec </t>

          <t>NAK-based resends signalled over RTCP</t>

          <t>Stream reset requests signalled over RTCP</t>
        </list>Some of these are only applicable to media types (in
      particular, "send me a new I-frame" doesn't make sense if you don't have
      I-frames). Others can be used with any type of data.</t>

      <t></t>
    </section>

    <section title="What the network can do to &quot;help&quot; a flow">
      <t>The network can apply various things to help the session data arrive
      according to policy:</t>

      <t><list style="symbols">
          <t>Capacity reservation for specific flows</t>

          <t>Priority queueing, sending certain types of data faster than
          others</t>

          <t>Filtering or blocking certain types of communication that the
          managers deem inappropriate</t>
        </list>The network can do these things in multiple ways, including
      so-called "deep packet inspection", but the most common techniques
      require being able to identify either the requested handling of the
      packets (DiffServ using DSCP codepoints) or recognizing the flow based
      on its 5-tuple (source and destination address and port + protocol),
      possibly correlating the 5-tuple with information carried to the router
      through some kind of management interface (either connected to the
      session setup protocol or managed via some other interface such as
      RSVP/IntServ), and behaving accordingly.</t>

      <t>All techniques have limitations; DSCP requires a certain trust in the
      endpoints using the codepoints for "deserving traffic"; deep packet
      inspection requires that packets be unencrypted, and stream control
      requires that 5-tuples be related back to their putative purpose either
      by heuristics or by being connected to management protocols.</t>
    </section>

    <section title="The definition of an RTP &quot;session&quot;">
      <t>An RTP session is defined in RFC 3550 section 3:</t>

      <t>"RTP Session: An association among a set of participants
      communicating with RTP. A participant may be involved in multiple RTP
      sessions at the same time. In a multimedia session, each medium is
      typically carried in a separate RTP session with its own RTCP packets
      unless the encoding itself multiplexes multiple media into a single data
      stream. A participant distinguishes multiple RTP sessions by reception
      of different sessions using different pairs of destination transport
      addresses, where a pair of transport addresses comprises one network
      address plus a pair of ports for RTP and RTCP. All participants in an
      RTP session may share a common destination transport address pair, as in
      the case of IP multicast, or the pairs may be different for each
      participant, as in the case of individual unicast network addresses and
      port pairs. In the unicast case, a participant may receive from all
      other participants in the session using the same pair of ports, or may
      use a distinct pair of ports for each.</t>

      <t>The distinguishing feature of an RTP session is that each maintains a
      full, separate space of SSRC identifiers (defined next). The set of
      participants included in one RTP session consists of those that can
      receive an SSRC identifier transmitted by any one of the participants
      either in RTP as the SSRC or a CSRC (also defined below) or in RTCP. For
      example, consider a three- party conference implemented using unicast
      UDP with each participant receiving from the other two on separate port
      pairs. If each participant sends RTCP feedback about data received from
      one other participant only back to that participant, then the conference
      is composed of three separate point-to-point RTP sessions. If each
      participant provides RTCP feedback about its reception of one other
      participant to both of the other participants, then the conference is
      composed of one multi-party RTP session. The latter case simulates the
      behavior that would occur with IP multicast communication among the
      three participants.</t>

      <t>The RTP framework allows the variations defined here, but a
      particular control protocol or application design will usually impose
      constraints on these variations."</t>

      <t>An RTP session is thus characterized by:</t>

      <t><list style="symbols">
          <t>A single SSRC space</t>

          <t>A single reporting space - all participants see all RTCP
          messages</t>

          <t>Non overlapping transport addresses</t>
        </list></t>

      <t>As we can see here, it is not possible to tell from a single packet
      whether it belongs to the same session as another packet or not; if we
      observe two packets with the same source and destination addresses, it
      seems safe to assume that they belong to the same session, but for all
      other cases, deciding whether or not two packets or packet streams are
      in the same session requires knowledge of the configuration of the
      session.</t>
    </section>

    <section title="Proper and improper use of RTP sessions">
      <t>Section 5.2 of RFC 3550 gives the canonical statement of RTP session
      (mis)use:</t>

      <t>"In RTP, multiplexing is provided by the destination transport
      address (network address and port number) which is different for each
      RTP session. For example, in a teleconference composed of audio and
      video media encoded separately, each medium SHOULD be carried in a
      separate RTP session with its own destination transport address."</t>

      <t>This sentence makes two very important leaps of faith:</t>

      <t><list style="symbols">
          <t>That distinguishing sessions by destination transport address is
          necessary and sufficient</t>

          <t>That it is appropriate to give strong guidance about the
          distribution of media streams across RTP sessions</t>
        </list>Both of these are shaky.</t>

      <t>As the cost of connecting ports has increased due to NATs, firewalls
      and IPv4 exhaustion, there has been a strong push towards using fewer
      ports, and indeed fewer 5-tuples, so that it is not uncommon to see
      flows that can be distinguished only by source address; there have also
      been proposals floated for putting multiple RTP sessions across one
      5-tuple [draft-westerlund-avtcore-transport-multiplexing].</t>

      <t>The cost of ports is also one factor pushing towards multiple media
      types in one RTP session; however, the more important underlying
      challenge is that this distinction is neither necessary nor sufficient
      to distinguish the cases in which RTP media streams want to have
      differential treatment from the network, and thus need to assign streams
      either to the same session (to guarantee the same treatment) or to
      different sessions (to allow for differential treatment).</t>

      <t>Consider the list of scenarios above, and imagine RTP being used
      for:</t>

      <t><list style="symbols">
          <t>A videoconference between 3 people who know each other well,
          using low end equipment and barely-sufficient bandwidth pipes</t>

          <t>A Berlin Philharmonic concert broadcast featuring Brahms' "Tragic
          Overture"</t>

          <t>A point-to-point transmission of a Manchester United vs Liverpool
          football match</t>

          <t>A professor's lecture, with "talking head" presentation
          simultaneous with slides, and opportunity for students to ask
          questions</t>
        </list>In each of these contexts, the tradeoff between audio and video
      is different; in the Brahms case, the audio (which is the point of the
      transmission) is likely to be transmitted at higher bandwidth than the
      video, and if one of them has to have his bandwidth reduced, the video
      should be reduced in quality before the audio is. In contrast, in the
      football match, spectators care about seeing the action; as long as they
      can understand the commentator's voice, the audio quality is "good
      enough".</t>

      <t>In the lecture case, quality of the lecturer's slides and voice is
      critical; video from students is almost irrelevant to the larger
      purpose.</t>

      <t>A logical arrangement of media streams in RTP sessions would be to
      group them by importance, and send them with appropriate traffic
      engineering structuring; in the lecturer case, the slides and the
      professor's voice would be carried in a high priority media stream,
      while the professor's picture would have second priority, and voice and
      video from students would be made available on an "if it works, it
      works" basis. Someone may easily decide that the student feedback track
      is not worth listening to, or remove the talking head of the professor;
      it would be strange indeed to try to listen to the lecture without
      viewing the supporting material.</t>

      <t>This illustrates two points:</t>

      <t><list style="symbols">
          <t>The RTP session mechanism, using the 5-tuple as the unit of
          differentiation, is a simple, effective and readily deployed
          mechanism for separating streams that require different treatment
          from the network in easily distinguished partitions.</t>

          <t>The assignment of media to such partitions is application
          dependent, and the decision on how to group and how to prioritize
          needs to be taken by the application developer.</t>
        </list></t>
    </section>

    <section title="The Pernicious Effect of SDP on the Media Type System">
      <t>In the list of reasons to argue against the inappropriate advice
      quoted above from RFC 3550, its pernicious influence on the MIME type
      system bears mentioning.</t>

      <t>The MIME type system, as described in <xref target="RFC4288"></xref>,
      consists of a two-level hierarchy: A top level media type (text, audio,
      video, application and so on), and a media subtype that identifies (to
      some level of precision) the format of the data being carried.</t>

      <t>The system has mostly been respected, with some types (for instance
      PDF) forever being borderline between the various categories, but over
      the years, a few types have been entered into the system with their top
      level types being decided, not by the nature of their content, but by
      *the type with which their proponents wished to have them multiplexed in
      an RTP session*.</t>

      <t>This includes the types that designate repair mechanism
      (audio/parityfec, audio/red), timed data transfer (audio/clearmode) and
      that ultimate triumph of expediency over cleanliness: audio/t140c,
      audio/3gpp-tt and video/3gpp-tt: Text types registered as audio and
      video.</t>

      <t>For each of these, there is a fairly natural fit in the normal MIME
      hierarchy (application/ for the mechanism types and text/ for the text
      types); the assignment of them to the "media" top level types has been
      done as an expediency in order to get around the stultifying results of
      the advice given in RFC 3550.</t>
    </section>

    <section title="The Mixer Fallacy">
      <t>One of the arguments in favour of the RFC 3550 separation has been
      that a mixer can be deployed that knows nothing of the semantics of the
      media streams; it can "just mix them".</t>

      <t>This applies partly to exactly one type of application: The audio
      conference bridge.</t>

      <t>For a video mixer, it does not apply; external logic (such as
      listening to the audio voume of the corresponding audio channel, or
      explicit flow control) is needed to select the right video stream to
      send out. And even for larger audio bridges, it is common to have
      functions like floor control, remote mute and other participant
      management tools in order to control the bridge - as soon as such tools
      are introduced, they are as relevant for a multi-media-type RTP session
      as they are to a single-media-type RTP session.</t>
    </section>

    <section title="Corrective Actions">
      <t>There are not many protocol changes that really need to be taken to
      solve this problem.</t>

      <t>The basic mechanism of RTP is media type independent. There are some
      RTCP issues with dealing with RTP flows of wildly varying bandwidth, but
      as can be seen from the table of media types in the introduction, this
      issue isn't solved by separating them; the bandwidth ranges of the types
      overlap.</t>

      <t>The thing that binds most in the current protocol suite is the
      conservation of the inappropriate binding in the SDP media
      description/negotiation format, where the MIME type is represented in
      two pieces, one of which is tied to the RTP session rather than to the
      payload type it is associated with, and there are fairly well-understood
      ways to get around that, such as the BUNDLE grouping extension <xref
      target="I-D.ietf-mmusic-sdp-bundle-negotiation"></xref>. Better designed
      negotiation protocols would not have this problem at all.</t>

      <t>In order to get out of the bind that SDP places us in, a change such
      as BUNDLE should be adopted, and the IETF should record that the advice
      from RFC 3550 is to be considered *advice*, not command: It is sometimes
      appropriate to separate media streams according to top level type, and
      sometimes not appropriate to do so. The application is the one that
      needs to make this decision.</t>
    </section>

    <section anchor="IANA" title="IANA Considerations">
      <t>This document makes no request of IANA.</t>

      <t>Note to RFC Editor: this section may be removed on publication as an
      RFC.</t>
    </section>

    <section anchor="Security" title="Security Considerations">
      <t>This note does not discuss any change that the author thinks would
      have any significant influence on the security of RTP traffic.</t>
    </section>

    <section anchor="Acknowledgements" title="Acknowledgements">
      <t>This note has benefited greatly from exchanges with Colin Perkins,
      whose unwavering support of a sharply differing viewpoint has served to
      inform the arguments presented in this document. Magnus Westerlund and
      Christer Holmberg also deserve special mention for engaging
      constructively in the discussion.</t>
    </section>
  </middle>

  <back>
    <references title="Normative References">
      <?rfc include="reference.RFC.2119"?>
    </references>

    <references title="Informative References">
      <?rfc include='reference.RFC.3550'?>

      <?rfc include='reference.RFC.4288'?>

      <?rfc include='reference.RFC.4733'?>

      <?rfc include='reference.I-D.westerlund-avtcore-transport-multiplexing'?>

      <?rfc include='reference.I-D.ietf-mmusic-sdp-bundle-negotiation'?>
    </references>

    <section title="Change log">
      <t></t>

      <section title="Version -00 to -01">
        <t>Version number bump, since the debate is ongoing. A few nits fixed.
        Added the "Mixer Fallacy" section. Updated reference to "bundle" to
        new draft name.</t>

        <t>This should be the last version, since the author is in the process
        of working with the authors of <xref
        target="I-D.westerlund-avtcore-transport-multiplexing"></xref> to
        achieve a jointly agreeable text. Hopefully this will take lesss than
        6 months.</t>
      </section>
    </section>
  </back>
</rfc>
