INTERNET-DRAFT                                       Katsushi Kobayashi
 draft-ietf-avt-dv-audio-02.txt        Communication Research Laboratory
                                                          Akimichi Ogawa
                                                         Keio University
                                                          Stephen Casner
                                                           Cisco Systems
                                                         Carsten Bormann
                                                 Universitaet Bremen TZI
                                                           June 26, 2000
                                                   Expires December 2000

 RTP Payload Format for 12-bit DAT, 20- and 24-bit Linear Sampled Audio

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

1. Abstract

   This document specifies the packetization scheme for encapsulating
   12-bit nonlinear, 20-bit linear and 24-bit linear audio data streams
   into a payload of the Real-time Transport Protocol (RTP).  It also
   specifies how to announce, in SDP, that the audio data was
   preemphasized before sampling.  The treatment of preemphasized audio
   data specified in this document could also be used with other audio
   formats such as L16.

2. Introduction


Kobayashi, et al.         Expires December 2000                 [Page 1]


Internet Draft                                             June 26, 2000


   This document describes audio data sampled in 12-bit nonlinear,
   20-bit linear and 24-bit linear form, and specifies the
   encapsulation of that audio data into the Real-time Transport
   Protocol (RTP), version 2 [1,2].  These audio formats are used in DAT
   and DV video devices [3,4].  The packetization scheme for audio data
   in 16-bit linear encoding (L16) is already specified [2,5].  The
   packetization scheme specified in this document closely follows that
   format, so this document specifies only the differences from L16.
   The reader is advised to consult RFC 1890 along with this
   specification.  This document also specifies an out-of-band method
   for signaling whether analog preemphasis was applied to the audio
   data.

   2.1 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [6].

3. The need for RTP encapsulation of 12-, 20- and 24-bit audio

   Many high quality digital audio and visual systems, such as DAT and
   DV, adopt sample-based audio encoding, and several audio formats are
   defined for different uses.  To transport the audio data with RTP, an
   RTP encapsulation needs to be defined for each specific format.  So
   far, only 16-bit linear audio encapsulation has been defined, as L16.
   Other encoding formats are already in use, such as the 12-bit
   nonlinear, 20-bit linear and 24-bit linear formats used in the DAT
   and DV video world.  This specification defines the RTP payload
   encapsulation format needed to use these encodings in the RTP
   environment.

   The format of 12-bit nonlinear audio defined in IEC61119 is the same
   as 16-bit linear audio except for the packing of each sampled data
   element [3].  An element of 12-bit nonlinear audio data can be
   obtained from the corresponding 16-bit linear one. It would be easy
   to convert 12-bit nonlinear audio into 16-bit linear form at the RTP
   sender and transmit it using the L16 audio format already defined.
   However, 16-bit samples consume 33% more data than 12-bit samples,
   and the extra bits carry no additional information, so this would
   waste network bandwidth.
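
   As a rough illustration of the bandwidth argument above, the
   following Python sketch compares the raw payload bit rate of 12-bit
   and 16-bit samples.  The function name payload_bitrate is
   hypothetical, the 32 kHz stereo figures are borrowed from the DAT12
   rtpmap example in this document, and RTP, UDP and IP header overhead
   is ignored.

```python
def payload_bitrate(rate_hz, channels, bits_per_sample):
    """Raw audio payload bit rate in bits per second (headers excluded)."""
    return rate_hz * channels * bits_per_sample


# Assumed example stream: 32 kHz, 2 channels.
dat12 = payload_bitrate(32000, 2, 12)   # 12-bit nonlinear samples
l16 = payload_bitrate(32000, 2, 16)     # same audio expanded to L16
increase_percent = 100 * (l16 - dat12) / dat12

print(dat12, l16, round(increase_percent, 1))
```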

4. 12-bit nonlinear audio encapsulation

   The 12-bit nonlinear audio format in DAT and DV, called LP (Long
   Play) audio, is specified in IEC61119 [3]. Each sample of 12-bit
   nonlinear audio is derived from a single sample of 16-bit linear
   audio. The conversion detail between 16 and 12 bits is shown in Table
   1. The 12-bit samples are packed contiguously into payload octets
   starting with the most significant bit. When there is an odd number
   of samples in the payload, the four LSBs of the last octet are
   unused. Parameters other than quantization, e.g., sampling frequency
   and audio channel assignment, are the same as in L16.

   When conveying encoding information in an SDP [7] session
   description, the 12-bit nonlinear audio payload format specified here
   is given the encoding name "DAT12". Thus, the media format
   representation might be:

      m=audio 49230 RTP/AVP 97 98
      a=rtpmap:97 DAT12/32000/2
      a=rtpmap:98 L16/48000/2
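
   The 12-bit packing rule described above can be sketched in Python as
   follows.  This is a minimal illustration, not a reference
   implementation: the function names pack_dat12 and unpack_dat12 are
   hypothetical, samples are assumed to be signed 12-bit values that are
   already in nonlinear form, and the unused four LSBs after an odd
   number of samples are set to zero here, which is an assumption since
   the text only says they are unused.

```python
def pack_dat12(samples):
    """Pack signed 12-bit values into payload octets, MSB first."""
    out = bytearray()
    acc = 0      # bit accumulator
    nbits = 0    # number of pending bits held in acc
    for s in samples:
        if not -2048 <= s <= 2047:
            raise ValueError("sample out of 12-bit range")
        acc = (acc << 12) | (s & 0xFFF)   # two's complement, 12 bits
        nbits += 12
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1           # keep only the pending bits
    if nbits:
        # Odd number of samples: 4 bits remain; the 4 LSBs of the
        # last octet are unused (zero in this sketch).
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack_dat12(payload, count=None):
    """Recover signed 12-bit values; count defaults to all whole samples."""
    total_bits = len(payload) * 8
    if count is None:
        count = total_bits // 12
    if count * 12 > total_bits:
        raise ValueError("payload too short for requested sample count")
    value = int.from_bytes(payload, "big")
    samples = []
    for i in range(count):
        code = (value >> (total_bits - 12 * (i + 1))) & 0xFFF
        samples.append(code - 0x1000 if code & 0x800 else code)
    return samples
```

   For example, three samples occupy five octets, with the low nibble
   of the fifth octet left unused.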


    16 bits linear (X)                          12 bits nonlinear (Y)
   ------------------------------------------------------------
     32,767 (7FFFh) Y = INT(X/64) + (600h)        2,047 (7FFh)
     16,384 (4000h)                               1,792 (700h)
   ------------------------------------------------------------
     16,383 (3FFFh) Y = INT(X/32) + (500h)        1,791 (6FFh)
      8,192 (2000h)                               1,536 (600h)
   ------------------------------------------------------------
      8,191 (1FFFh) Y = INT(X/16) + (400h)        1,535 (5FFh)
      4,096 (1000h)                               1,280 (500h)
   ------------------------------------------------------------
      4,095 (0FFFh) Y = INT(X/8) + (300h)         1,279 (4FFh)
      2,048 (0800h)                               1,024 (400h)
   ------------------------------------------------------------
      2,047 (07FFh) Y = INT(X/4) + (200h)         1,023 (3FFh)
      1,024 (0400h)                                 768 (300h)
   ------------------------------------------------------------
      1,023 (03FFh) Y = INT(X/2) + (100h)           767 (2FFh)
        512 (0200h)                                 512 (200h)
   ------------------------------------------------------------
        511 (01FFh) Y = X                           511 (1FFh)
          0 (0000h)                                   0 (000h)
   ------------------------------------------------------------
         -1 (FFFFh) Y = X                            -1 (FFFh)
       -512 (FE00h)                                -512 (E00h)
   ------------------------------------------------------------
       -513 (FDFFh) Y = INT((X + 1)/2) - (101h)    -513 (DFFh)
     -1,024 (FC00h)                                -768 (D00h)
   ------------------------------------------------------------
     -1,025 (FBFFh) Y = INT((X + 1)/4) - (201h)    -769 (CFFh)
     -2,048 (F800h)                              -1,024 (C00h)
   ------------------------------------------------------------
     -2,049 (F7FFh) Y = INT((X + 1)/8) - (301h)  -1,025 (BFFh)
     -4,096 (F000h)                              -1,280 (B00h)
   ------------------------------------------------------------
     -4,097 (EFFFh) Y = INT((X + 1)/16) - (401h) -1,281 (AFFh)
     -8,192 (E000h)                              -1,536 (A00h)
   ------------------------------------------------------------
     -8,193 (DFFFh) Y = INT((X + 1)/32) - (501h) -1,537 (9FFh)
    -16,384 (C000h)                              -1,792 (900h)
   ------------------------------------------------------------
    -16,385 (BFFFh) Y = INT((X + 1)/64) - (601h) -1,793 (8FFh)
    -32,768 (8000h)                              -2,048 (800h)
   ------------------------------------------------------------
    Table 1. Conversion from 16 bits to 12 bits [3]
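
   The conversion in Table 1 can be expressed compactly in code.  The
   Python sketch below is an illustration under stated assumptions: the
   function name linear16_to_dat12 is hypothetical, and INT is taken to
   truncate toward zero, which is what the endpoint values in the table
   imply (for example, X = -1,024 maps to -768).

```python
def linear16_to_dat12(x):
    """Map a signed 16-bit linear sample X to the 12-bit value Y of Table 1."""
    if not -32768 <= x <= 32767:
        raise ValueError("sample out of 16-bit range")
    if -512 <= x <= 511:
        return x                          # Y = X in the innermost segment
    for seg in range(1, 7):               # six segments on each side
        divisor = 1 << seg                # 2, 4, 8, 16, 32, 64
        limit = 512 << seg                # magnitude bound of the segment
        if 0 < x < limit:
            # Y = INT(X / divisor) + seg * 100h
            return x // divisor + 0x100 * seg
        if -limit <= x < 0:
            # Y = INT((X + 1) / divisor) - (seg * 100h + 1), where INT
            # truncates toward zero: negate, floor-divide, negate again.
            return -((-(x + 1)) // divisor) - (0x100 * seg + 1)
    raise AssertionError("unreachable for 16-bit input")
```

   The mapping is monotonic, so larger 16-bit inputs never produce
   smaller 12-bit outputs.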

5. 20- and 24-bit linear audio encapsulation

   The 20- and 24-bit linear audio encodings are simply an extension of
   the L16 linear audio encoding [2].  The 20- or 24-bit uncompressed
   audio data samples are represented as signed values in two's
   complement notation. The samples are packed contiguously into payload
   octets starting with the most significant bit.  For the 20-bit
   encoding, when there is an odd number of samples in the payload, the
   four LSBs of the last octet are unused.  When conveying encoding
   information in an SDP session description, the 20- and 24-bit linear
   audio payload formats specified here are given the encoding names
   "L20" and "L24", respectively. The SDP audio media description might
   be:

      m=audio 49230 RTP/AVP 99 100
      a=rtpmap:99 L20/48000/2
      a=rtpmap:100 L24/48000
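
   A minimal Python sketch of the L20 and L24 packing follows.  It
   assumes that, as in L16, the samples belonging to one sampling
   instant are interleaved channel by channel; the function name
   pack_linear_frames and the frame representation (one tuple of signed
   samples per sampling instant) are illustrative assumptions, and the
   unused four LSBs are written as zero.

```python
def pack_linear_frames(frames, bits):
    """Pack signed two's complement samples of width `bits`, MSB first.

    frames: iterable of tuples, one tuple per sampling instant holding
    one sample per channel (channels interleaved sample by sample).
    """
    if bits not in (20, 24):
        raise ValueError("this sketch covers L20 and L24 only")
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    mask = (1 << bits) - 1
    acc = 0      # all samples concatenated as one big integer
    nbits = 0    # number of bits accumulated so far
    for frame in frames:
        for sample in frame:
            if not lo <= sample <= hi:
                raise ValueError("sample out of range")
            acc = (acc << bits) | (sample & mask)
            nbits += bits
    pad = -nbits % 8   # 4 for an odd number of 20-bit samples, else 0
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")
```

   One stereo L20 frame therefore fills exactly five octets, while a
   single mono L20 sample takes three octets with four unused bits.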

6. Preemphasized audio data

   To improve the high-frequency characteristics of audio, analog
   preemphasis is often applied to the signal before quantization.  If
   analog preemphasis was applied before the payload data was sampled,
   the type of preemphasis should be conveyed with a format-specific
   parameters (a=fmtp) line, as below:

          a=fmtp:<payload type> emphasis=<emphasis type>

   The parameter <emphasis type> is one of the following:

          o none            <no emphasis>
          o 50/15           <50/15 micro sec. CD-type emphasis>

   For backward compatibility, if preemphasis is not applied, the
   attribute concerning emphasis MUST NOT be described in the SDP record.
   An example SDP record showing preemphasis applied only to payload
   type 99 might be as follows:

      m=audio 49230 RTP/AVP 99 100
      a=rtpmap:99 L20/48000/2
      a=fmtp:99 emphasis=50/15
      a=rtpmap:100 L24/48000
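
   A receiver could recover the emphasis setting per payload type along
   the lines of the Python sketch below.  The function name
   emphasis_by_payload_type is hypothetical, and splitting fmtp
   parameters on whitespace or ";" is an assumption, since this
   document does not define how several parameters share one line.
   Payload types without an emphasis parameter default to "none".

```python
import re

ALLOWED_EMPHASIS = ("none", "50/15")


def emphasis_by_payload_type(sdp_text):
    """Return {payload type: emphasis type} for one SDP media section."""
    result = {}
    for line in sdp_text.splitlines():
        line = line.strip()
        if line.startswith("a=rtpmap:"):
            pt = int(line[len("a=rtpmap:"):].split()[0])
            result.setdefault(pt, "none")       # default: no emphasis
        elif line.startswith("a=fmtp:"):
            pt_text, _, params = line[len("a=fmtp:"):].partition(" ")
            # Assumption: parameters are separated by whitespace or ";".
            for param in re.split(r"[;\s]+", params.strip()):
                name, _, value = param.partition("=")
                if name == "emphasis":
                    if value not in ALLOWED_EMPHASIS:
                        raise ValueError("unknown emphasis type: " + value)
                    result[int(pt_text)] = value
    return result


example = """\
m=audio 49230 RTP/AVP 99 100
a=rtpmap:99 L20/48000/2
a=fmtp:99 emphasis=50/15
a=rtpmap:100 L24/48000
"""
print(emphasis_by_payload_type(example))
```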

   This preemphasis attribute could also be used with other audio
   formats such as L16.

7. Non-AIFF-C audio channel convention

   Existing RTP conventions for audio follow the AIFF-C convention when
   sending more than two audio channels within a single RTP stream.
   However, some applications are not covered by this convention.  For
   example, although a "woofer" channel is defined in some DV audio
   formats, AIFF-C cannot specify such a frequency-dependent channel.
   Thus, explicit audio channel allocation information is needed when
   the contents of the audio stream are beyond the scope of AIFF-C.  In
   audio payload formats, the a=fmtp line is used to show the order of
   audio channels, as below:

          a=fmtp:<payload type> channels=<channel convention>
                 [<symbol1[/symbol2[..]]>]

   The template is wrapped here for layout; in an SDP record the
   attribute occupies a single line.

   The first parameter, <channel convention>, specifies which audio
   channel assignment convention is used.  It is one of the
   following:

         o AIFF-C (default)
         o DV

   The second parameter, <symbol1[/symbol2[..]]>, shows what types of
   channel are carried within the stream and how the audio data is
   arranged.  Each <symbol> specifies the type of audio content on the
   corresponding channel, and the order of the symbols, separated by
   the "/" delimiter, gives the order of the channels.  The set of
   valid <symbol> values depends on the channel convention and is
   defined for each convention.  If the arrangement and contents of the
   channels are already determined by the other encoding information,
   i.e., the channel convention, the encoding type, and the number of
   audio channels, the <symbol> value MUST NOT be defined.  In the case
   of AIFF-C, the channel order parameter is not specified, since the
   channel order is determined by the number of audio channels.  When
   using the DV audio convention, the symbols for audio contents
   described in the DV video specification are used [4].  The symbols
   and their meanings are also listed in the Appendix.

   An example of an SDP description using these attributes is:

      v=0
      o=ikob 2890844526 2890842807 IN IP4 126.16.64.4
      s=POI (Audio only)
      i=A Seminar of how to make Presentation on the Internet
      u=http://www.koganei.wide.ad.jp/~ikob/POI/index.html
      e=ikob@koganei.wide.ad.jp (Katsushi Kobayashi)
      c=IN IP4 224.2.17.12/127
      t=2873397496 2873404696
      m=audio 49170 RTP/AVP 112 113
      a=rtpmap:112 L16/48000/2
      a=rtpmap:113 DAT12/32000/4
      a=fmtp:113 channels=DV L/R/C/WO

   This description shows a session in which audio data is sent in the
   L16 and DAT12 formats using payload types 112 and 113, respectively.
   With the DAT12 encoding, the audio data contains four channels
   following the DV audio convention, and the channels are encoded in
   the order left, right, center, and woofer.
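
   Parsing of the channels attribute can be sketched in Python as
   follows.  The function name parse_channels_fmtp is hypothetical, the
   DV symbol set is copied from the Appendix of this document, and the
   sketch assumes the attribute line carries only the channels
   parameter, as in the example above.

```python
# Channel symbols listed in the Appendix (Table 2).
DV_SYMBOLS = {
    "L", "R", "M", "C", "S", "LS", "LS1", "LS2", "RS", "RS1", "RS2",
    "LC", "RC", "WO", "Lmix", "Rmix", "T", "Q1", "Q2",
}


def parse_channels_fmtp(line):
    """Return (payload type, channel convention, [symbols])."""
    prefix = "a=fmtp:"
    line = line.strip()
    if not line.startswith(prefix):
        raise ValueError("not an a=fmtp line")
    fields = line[len(prefix):].split()
    if len(fields) < 2 or not fields[1].startswith("channels="):
        raise ValueError("no channels parameter")
    payload_type = int(fields[0])
    convention = fields[1][len("channels="):]
    if convention not in ("AIFF-C", "DV"):
        raise ValueError("unknown channel convention: " + convention)
    symbols = fields[2].split("/") if len(fields) > 2 else []
    if convention == "AIFF-C" and symbols:
        # For AIFF-C the order follows from the channel count alone.
        raise ValueError("symbols are not specified for AIFF-C")
    unknown = [s for s in symbols if s not in DV_SYMBOLS]
    if unknown:
        raise ValueError("unknown DV channel symbols: " + ", ".join(unknown))
    return payload_type, convention, symbols


print(parse_channels_fmtp("a=fmtp:113 channels=DV L/R/C/WO"))
```

   A fuller implementation would also check that the number of symbols
   matches the channel count given in the corresponding a=rtpmap line.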

   The channels attribute defined here provides a generic out-of-band
   notification method for non-AIFF-C encodings.  However, if the
   multi-channel audio data can be sent in the AIFF-C convention after
   simple processing on the sender side, such as data shuffling, the
   sender MUST use the AIFF-C convention.  The channels attribute can be
   used only when the channel arrangement cannot be expressed in the
   AIFF-C manner.  Moreover, encoding multiple audio channels within a
   single RTP stream can be used only when each channel's data is
   indispensable for playout, as the L and R channels are in stereo.
   Independent audio channels SHOULD be sent in different RTP sessions.
   If a receiver wishes to hear all channels, it SHOULD join every RTP
   session.


8. MIME registration

   This document defines some new RTP payload names and associated MIME
   types: DAT12, L20 and L24.  The registration forms for these MIME
   types are shown below:

   8.1 DAT12 registration form

   MIME media type name: audio

     MIME subtype name: DAT12


     Required parameters:
        rate: number of samples per second -- Permissible values for
          rate are 8000, 11025, 16000, 22050, 24000, 32000, 44100,
          and 48000 samples per second. Other values might also
          be acceptable.

     Optional parameters:
        channels: how many audio streams are interleaved; defaults
          to 1; stereo would be 2, etc.  Interleaving takes place
          between individual 12-bit samples.

        emphasis: type of preemphasis; defaults to none. Permissible
          values for emphasis are 50/15 and none.

     Encoding considerations: DAT12 audio can be transmitted
     with RTP as specified in "draft-ietf-avt-dv-audio-02".

     Security considerations: None

     Interoperability considerations: NONE

     Published specification: IEC61119 Standard.
                              draft-ietf-avt-dv-audio-02

     Applications which use this media type:
                              Audio communication.

     Additional information: None

       Magic number(s): None
       File extension(s): None
       Macintosh File Type Code(s): None

     Person & email address to contact for further information:
       Katsushi Kobayashi
       e-mail: ikob@koganei.wide.ad.jp

     Intended usage: COMMON

     Author/Change controller:
       Katsushi Kobayashi
       e-mail: ikob@koganei.wide.ad.jp

   8.2 L20 registration form

   MIME media type name: audio


     MIME subtype name: L20

     Required parameters:
        rate: number of samples per second -- Permissible values for
          rate are 8000, 11025, 16000, 22050, 24000, 32000, 44100,
          48000, and 96000 samples per second. Other values might also
          be acceptable.

     Optional parameters:
        channels: how many audio streams are interleaved; defaults
          to 1; stereo would be 2, etc.  Interleaving takes place
          between individual 20-bit samples.

        emphasis: type of preemphasis; defaults to none. Permissible
          values for emphasis are 50/15 and none.

     Encoding considerations: L20 audio can be transmitted
     with RTP as specified in "draft-ietf-avt-dv-audio-02".

     Security considerations: None

     Interoperability considerations: NONE

     Published specification: draft-ietf-avt-dv-audio-02

     Applications which use this media type:
                              Audio communication.

     Additional information: None

       Magic number(s): None
       File extension(s): None
       Macintosh File Type Code(s): None

     Person & email address to contact for further information:
       Katsushi Kobayashi
       e-mail: ikob@koganei.wide.ad.jp

     Intended usage: COMMON

     Author/Change controller:
       Katsushi Kobayashi
       e-mail: ikob@koganei.wide.ad.jp

   8.3 L24 registration form

   MIME media type name: audio


     MIME subtype name: L24

     Required parameters:
        rate: number of samples per second -- Permissible values for
          rate are 8000, 11025, 16000, 22050, 24000, 32000, 44100,
          48000, and 96000 samples per second. Other values might also
          be acceptable.

     Optional parameters:
        channels: how many audio streams are interleaved; defaults
          to 1; stereo would be 2, etc.  Interleaving takes place
          between individual 24-bit samples.

        emphasis: type of preemphasis; defaults to none. Permissible
          values for emphasis are 50/15 and none.

     Encoding considerations: L24 audio can be transmitted
     with RTP as specified in "draft-ietf-avt-dv-audio-02".

     Security considerations: None

     Interoperability considerations: NONE

     Published specification: draft-ietf-avt-dv-audio-02

     Applications which use this media type:
                              Audio communication.

     Additional information: None

       Magic number(s): None
       File extension(s): None
       Macintosh File Type Code(s): None

     Person & email address to contact for further information:
       Katsushi Kobayashi
       e-mail: ikob@koganei.wide.ad.jp

     Intended usage: COMMON

     Author/Change controller:
       Katsushi Kobayashi
       e-mail: ikob@koganei.wide.ad.jp

9. Security Considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [1], and any appropriate RTP profile.  This implies
   that confidentiality of the media streams is achieved by encryption.
   Because the data compression used along with this payload format is
   applied end-to-end, encryption may be performed after compression
   so there is no conflict between the two operations.

   A potential denial-of-service threat exists for data encodings using
   compression techniques that have non-uniform receiver-end
   computational load.  The attacker can inject pathological datagrams
   into the stream which are complex to decode and cause the receiver to
   be overloaded.  However, this encoding does not exhibit any
   significant non-uniformity.

   As with any IP-based protocol, in some circumstances a receiver may
   be overloaded simply by the receipt of too many packets, either
   desired or undesired.  Network-layer authentication may be used to
   discard packets from undesired sources, but the processing cost of
   the authentication itself may be too high.  In a multicast
   environment, pruning of specific sources may be implemented in future
   versions of IGMP [8] and in multicast routing protocols to allow a
   receiver to select which sources are allowed to reach it.

10. Full Copyright Statement

   Copyright (C) The Internet Society (1999). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.

   However, this document itself may not be modified in any way, such as
   by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the procedures
   for copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

11. Authors' Addresses

   Katsushi Kobayashi
   Communication Research Laboratory
   4-2-1 Nukii-kita machi, Koganei
   Tokyo 184-8795 JAPAN
   EMail: ikob@koganei.wide.ad.jp

   Akimichi Ogawa
   Keio University
   5322 Endo, Fujisawa
   Kanagawa 252 JAPAN
   EMail: akimichi@sfc.wide.ad.jp

   Stephen L. Casner
   Cisco Systems, Inc.
   170 West Tasman Drive
   San Jose, CA 95134-1706
   United States
   EMail: casner@cisco.com

   Carsten Bormann
   Universitaet Bremen FB3 TZI
   Postfach 330440
   D-28334 Bremen, GERMANY
   Phone: +49.421.218-7024
   Fax: +49.421.218-7000
   EMail: cabo@tzi.org

12. Bibliography


   [1] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson,
       "RTP: A Transport Protocol for Real-Time Applications",
       RFC 1889, January 1996.

   [2] Schulzrinne, H., "RTP Profile for Audio and Video Conferences
       with Minimal Control", RFC 1890, January 1996.

   [3] IEC61119, Digital audio tape cassette system (DAT), November
       1992.

   [4] IEC61834, Helical-scan digital video cassette recording system
       using 6,35 mm magnetic tape for consumer use (525-60, 625-50,
       1125-60 and 1250-50 systems), August 1998.

   [5] Salsman, J., "The Audio/L16 MIME content type", RFC 2586, May
       1999.

   [6] Bradner, S., "Key words for use in RFCs to Indicate Requirement
       Levels", RFC 2119, March 1997.

   [7] Handley, M. and V. Jacobson, "SDP: Session Description
       Protocol", RFC 2327, April 1998.

   [8] Deering, S., "Host Extensions for IP Multicasting", STD 5,
       RFC 1112, August 1989.


Appendix.

   The audio channel symbols for each channel are specified in Table 2.
   These symbols are taken from the notation used in IEC61834-4, the DV
   video specification, chapter 8.1 [4].  For the exact meaning of each
   symbol, consult the original DV video specification.

       L: Left channel of stereo
       R: Right channel of stereo
       M: Monaural signal
       C: Center channel of 3,4,6 or 8 ch stereo
       S: Surround channel of 4 ch stereo
       LS, LS1, LS2: Left surround channel
       RS, RS1, RS2: Right surround channel
       LC: Left center channel of 8 ch stereo
       RC: Right center channel of 8 ch stereo
       WO: Woofer channel
       Lmix: L + 0.7071C + 0.7071LS
       Rmix: R + 0.7071C + 0.7071RS
       T: 0.7071C
       Q1: 0.7071LS + 0.7071RS
       Q2: 0.7071LS - 0.7071RS

       Table 2. Audio channel symbols in DV video [4]
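
   The derived symbols in Table 2 (Lmix, Rmix, T, Q1 and Q2) are linear
   combinations of the basic channels.  The Python sketch below simply
   evaluates those formulas for one sampling instant.  The function name
   dv_mix_signals and the use of floating-point samples are illustrative
   assumptions; the authoritative definitions are in the DV video
   specification [4].

```python
K = 0.7071   # weighting coefficient used in Table 2


def dv_mix_signals(l, r, c, ls, rs):
    """Evaluate the derived channel formulas of Table 2 for one instant."""
    return {
        "Lmix": l + K * c + K * ls,
        "Rmix": r + K * c + K * rs,
        "T": K * c,
        "Q1": K * ls + K * rs,
        "Q2": K * ls - K * rs,
    }


mixed = dv_mix_signals(l=1.0, r=0.5, c=1.0, ls=1.0, rs=-1.0)
```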

