INTERNET-DRAFT                                        Katsushi Kobayashi
draft-ietf-avt-dv-audio-02.txt         Communication Research Laboratory
                                                          Akimichi Ogawa
                                                         Keio University
                                                          Stephen Casner
                                                           Cisco Systems
                                                         Carsten Bormann
                                                 Universitaet Bremen TZI
                                                           June 26, 2000
                                                   Expires December 2000

  RTP Payload Format for 12-bit DAT, 20- and 24-bit Linear Sampled Audio

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

1. Abstract

   This document specifies the packetization scheme for encapsulating
   12-bit nonlinear, 20-bit linear and 24-bit linear audio data streams
   into the payload of the Real-time Transport Protocol (RTP). It also
   specifies how to announce in SDP that the audio data was
   preemphasized before sampling. The treatment of preemphasized audio
   data specified in this document could also be used with other audio
   formats such as L16.

2. Introduction

   This document describes the sampling of audio data in 12 bits
   nonlinear, 20 bits linear and 24 bits linear, and specifies the
   encapsulation of the audio data into the Real-time Transport
   Protocol (RTP), version 2 [1,2]. The audio formats are used in DAT
   and DV video devices [3,4].

   The packetization scheme for audio data in 16 bits linear encoding
   (L16) is already specified [2,5]. The packetization scheme specified
   in this document basically follows that format, so this document
   specifies only the differences from L16. The reader is advised to
   consult RFC1890 along with this specification.

   This document also specifies an out-of-band method for signaling
   whether analog preemphasis has been applied to the audio data.

2.1 Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [6].

3. The need for the RTP encapsulation for 12-, 20- and 24-bit audio

   Many high quality digital audio and visual systems, such as DAT and
   DV, adopt sample-based audio encoding. Various audio formats are
   defined to suit different situations. To transport the audio data
   with RTP, an RTP encapsulation needs to be defined for each specific
   format. Only 16 bits linear audio encapsulation has been defined, as
   L16. Other encoding formats have already appeared, such as the 12
   bits nonlinear, 20 bits linear and 24 bits linear formats used in
   the DAT and DV video world. This specification defines RTP payload
   encapsulation formats so that these encodings can be used in the RTP
   environment.

   The format of 12-bit nonlinear audio defined in IEC61119 is the same
   as 16-bit linear audio except for the packing of each sampled data
   element [3]. An element of 12-bit nonlinear audio data can be
   obtained from the corresponding 16-bit linear one. It would be easy
   to convert 12-bit nonlinear audio into 16-bit linear form at the RTP
   sender and transmit it using the L16 audio format already defined.
   However, 16-bit samples consume 33% more data than 12-bit samples,
   and the extra bits waste network bandwidth on meaningless data.

4. 12-bit nonlinear audio encapsulation

   The 12-bit nonlinear audio format in DAT and DV, called LP (Long
   Play) audio, is specified in IEC61119 [3]. Each sample of 12-bit
   nonlinear audio is derived from a single sample of 16-bit linear
   audio. The conversion between 16 and 12 bits is shown in Table 1.
   The 12-bit samples are packed contiguously into payload octets
   starting with the most significant bit. When there is an odd number
   of samples in the payload, the four LSBs of the last octet are
   unused. Parameters other than quantization, e.g., sampling frequency
   and audio channel assignment, are the same as in L16.

   When conveying encoding information in an SDP [7] session
   description, the 12-bit nonlinear audio payload format specified
   here is given the encoding name "DAT12". Thus, the media format
   representation might be:

      m=audio 49230 RTP/AVP 97 98
      a=rtpmap:97 DAT12/32000/2
      a=rtpmap:98 L16/48000/2

   16 bits linear (X)                     12 bits nonlinear (Y)
   ------------------------------------------------------------
    32,767 (7FFFh)  Y = INT(X/64) + (600h)         2,047 (7FFh)
    16,384 (4000h)                                 1,792 (700h)
   ------------------------------------------------------------
    16,383 (3FFFh)  Y = INT(X/32) + (500h)         1,791 (6FFh)
     8,192 (2000h)                                 1,536 (600h)
   ------------------------------------------------------------
     8,191 (1FFFh)  Y = INT(X/16) + (400h)         1,535 (5FFh)
     4,096 (1000h)                                 1,280 (500h)
   ------------------------------------------------------------
     4,095 (0FFFh)  Y = INT(X/8) + (300h)          1,279 (4FFh)
     2,048 (0800h)                                 1,024 (400h)
   ------------------------------------------------------------
     2,047 (07FFh)  Y = INT(X/4) + (200h)          1,023 (3FFh)
     1,024 (0400h)                                   768 (300h)
   ------------------------------------------------------------
     1,023 (03FFh)  Y = INT(X/2) + (100h)            767 (2FFh)
       512 (0200h)                                   512 (200h)
   ------------------------------------------------------------
       511 (01FFh)  Y = X                            511 (1FFh)
         0 (0000h)                                     0 (000h)
   ------------------------------------------------------------
        -1 (FFFFh)  Y = X                             -1 (FFFh)
      -512 (FE00h)                                  -512 (E00h)
   ------------------------------------------------------------
      -513 (FDFFh)  Y = INT((X + 1)/2) - (101h)     -513 (DFFh)
    -1,024 (FC00h)                                  -768 (D00h)
   ------------------------------------------------------------
    -1,025 (FBFFh)  Y = INT((X + 1)/4) - (201h)     -769 (CFFh)
    -2,048 (F800h)                                -1,024 (C00h)
   ------------------------------------------------------------
    -2,049 (F7FFh)  Y = INT((X + 1)/8) - (301h)   -1,025 (BFFh)
    -4,096 (F000h)                                -1,280 (B00h)
   ------------------------------------------------------------
    -4,097 (EFFFh)  Y = INT((X + 1)/16) - (401h)  -1,281 (AFFh)
    -8,192 (E000h)                                -1,536 (A00h)
   ------------------------------------------------------------
    -8,193 (DFFFh)  Y = INT((X + 1)/32) - (501h)  -1,537 (9FFh)
   -16,384 (C000h)                                -1,792 (900h)
   ------------------------------------------------------------
   -16,385 (BFFFh)  Y = INT((X + 1)/64) - (601h)  -1,793 (8FFh)
   -32,768 (8000h)                                -2,048 (800h)
   ------------------------------------------------------------

        Table 1. Conversion from 16 bits to 12 bits [3]

5. 20- and 24-bit linear audio encapsulation

   The 20- and 24-bit linear audio encodings are simply an extension of
   the L16 linear audio encoding [2]. The 20- or 24-bit uncompressed
   audio data samples are represented as signed values in two's
   complement notation. The samples are packed contiguously into
   payload octets starting with the most significant bit. For the
   20-bit encoding, when there is an odd number of samples in the
   payload, the four LSBs of the last octet are unused.

   When conveying encoding information in an SDP session description,
   the 20- and 24-bit linear audio payload formats specified here are
   given the encoding names "L20" and "L24", respectively. The SDP
   audio media description might be:

      m=audio 49230 RTP/AVP 99 100
      a=rtpmap:99 L20/48000/2
      a=rtpmap:100 L24/48000

6. Preemphasized audio data

   To improve the high-frequency characteristics of audio, analog
   preemphasis is often applied to the signal before quantization. If
   analog preemphasis was applied before the payload data was sampled,
   the type of preemphasis should be conveyed with a format-specific
   parameters (a=fmtp) line as below:

      a=fmtp:<payload type> emphasis=<emphasis type>

   The <emphasis type> parameter takes one of the following values:

      o none
      o 50/15 (50/15 microsecond CD-type emphasis)

   For backward compatibility, if preemphasis has not been applied, the
   emphasis attribute MUST NOT be included in the SDP record. An
   example SDP record showing preemphasis applied only to payload type
   99 might be as follows:

      m=audio 49230 RTP/AVP 99 100
      a=rtpmap:99 L20/48000/2
      a=fmtp:99 emphasis=50/15
      a=rtpmap:100 L24/48000

   This preemphasis attribute could also be used with other audio
   formats such as L16.

7. Non-AIFF-C audio channel convention

   The existing RTP conventions for audio follow the AIFF-C convention
   when more than two audio channels are sent within a single RTP
   stream. However, some applications are not covered by this
   convention. For example, although a "woofer" channel is defined in
   some DV audio formats, AIFF-C cannot describe such a
   frequency-dependent channel. Thus, explicit audio channel allocation
   information is needed when the contents of the audio stream are
   beyond the scope of AIFF-C.

   In audio payload formats, the a=fmtp line is used to show the order
   of the audio channels, as below:

      a=fmtp:<payload type> channels=<channel convention> [<channel order>]

   The first parameter, <channel convention>, specifies which audio
   channel assignment convention is used. It takes one of the following
   values:

      o AIFF-C (default)
      o DV

   The second parameter, <channel order>, shows what type of channel is
   carried within the stream and the arrangement of the encoded audio
   data.
   The value of <channel order> specifies the type of audio content on
   each channel; the order of the audio contents is given as a sequence
   of symbols separated by the "/" delimiter. The set of symbols
   depends on the channel convention and is defined for each audio
   channel convention. If the arrangement of the channels and their
   contents is already determined by the other encoding information,
   i.e., the type of channel convention, the type of encoding, and the
   number of audio channels, the <channel order> value MUST NOT be
   defined. In the case of AIFF-C, the channel order parameter is not
   specified, since the channel order is determined solely by the
   number of audio channels. When the DV audio convention is used, the
   symbols for audio contents described in the DV video specification
   are used [4]. The symbols and their meanings are also given in the
   Appendix.

   An example of an SDP description using these attributes is:

      v=0
      o=ikob 2890844526 2890842807 IN IP4 126.16.64.4
      s=POI (Audio only)
      i=A Seminar of how to make Presentation on the Internet
      u=http://www.koganei.wide.ad.jp/~ikob/POI/index.html
      e=ikob@koganei.wide.ad.jp (Katsushi Kobayashi)
      c=IN IP4 224.2.17.12/127
      t=2873397496 2873404696
      m=audio 49170 RTP/AVP 112 113
      a=rtpmap:112 L16/48000/2
      a=rtpmap:113 DAT12/32000/4
      a=fmtp:113 channels=DV L/R/C/WO

   This description shows a session in which audio data is sent in the
   L16 and DAT12 formats using payload types 112 and 113, respectively.
   With the DAT12 encoding, the audio data contains 4-channel stereo
   data following the DV audio convention, and the channels are encoded
   in the order left, right, center, and woofer.

   The channels attribute defined here provides a generic out-of-band
   notification method for non-AIFF-C encodings. However, if the
   multi-channel audio data could be sent in the AIFF-C convention
   after simple processing on the sender side, such as data shuffling,
   the sender MUST use AIFF-C.

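   As an illustration only, a receiver might split the channels
   attribute described above along the following lines. This is a
   minimal Python sketch, not part of this specification: the function
   name parse_channels_fmtp and its error handling are assumptions, and
   the attribute layout is taken from the "a=fmtp:113" line of the
   example.

```python
def parse_channels_fmtp(line: str):
    """Split 'a=fmtp:<pt> channels=<convention> [<order>]' into its parts.

    Returns (payload_type, convention, order), where order is a list of
    channel symbols and is empty when no channel order is given.
    """
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an a=fmtp line")
    # The payload type is everything up to the first space.
    pt_text, _, params = line[len(prefix):].partition(" ")
    key, _, value = params.strip().partition("=")
    if key != "channels":
        raise ValueError("not a channels attribute")
    fields = value.split()
    if not fields:
        raise ValueError("missing channel convention")
    convention = fields[0]
    # Channel symbols are separated by the "/" delimiter.
    order = fields[1].split("/") if len(fields) > 1 else []
    return int(pt_text), convention, order
```

   For the example above, parse_channels_fmtp("a=fmtp:113 channels=DV
   L/R/C/WO") returns payload type 113, the convention "DV", and the
   symbol list L, R, C, WO.
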
   The channels attribute can be used only when the channel assignment
   cannot be expressed in the AIFF-C manner. Moreover, multi-channel
   audio data can be encoded within a single RTP stream only when every
   audio channel is indispensable for playout, as the L and R channels
   are in stereo. Independent audio channels SHOULD be sent in
   different RTP sessions. A receiver that expects to hear all channels
   SHOULD join every RTP session.

8. MIME registration

   This document defines some new RTP payload names and associated MIME
   types: DAT12, L20 and L24. The registration forms for these MIME
   types are shown below:

8.1 DAT12 registration form

   MIME media type name: audio

   MIME subtype name: DAT12

   Required parameters:
      rate: number of samples per second -- Permissible values for
      rate are 8000, 11025, 16000, 22050, 24000, 32000, 44100, and
      48000 samples per second. Other values might also be acceptable.

   Optional parameters:
      channels: how many audio streams are interleaved -- defaults to
      1; stereo would be 2, etc. Interleaving takes place between
      individual 12 bits samples.

      emphasis: type of preemphasis -- defaults to none. Permissible
      values for emphasis are 50/15 and none.

   Encoding considerations: DAT12 audio can be transmitted with RTP as
      specified in "draft-ietf-avt-dv-audio-02".

   Security considerations: None

   Interoperability considerations: None

   Published specification: IEC61119 Standard and
      draft-ietf-avt-dv-audio-02

   Applications which use this media type: Audio communication.

   Additional information: None

   Magic number(s): None

   File extension(s): None

   Macintosh File Type Code(s): None

   Person & email address to contact for further information:
      Katsushi Kobayashi
      e-mail: ikob@koganei.wide.ad.jp

   Intended usage: COMMON

   Author/Change controller:
      Katsushi Kobayashi
      e-mail: ikob@koganei.wide.ad.jp

8.2 L20 registration form

   MIME media type name: audio

   MIME subtype name: L20

   Required parameters:
      rate: number of samples per second -- Permissible values for
      rate are 8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000,
      and 96000 samples per second. Other values might also be
      acceptable.

   Optional parameters:
      channels: how many audio streams are interleaved -- defaults to
      1; stereo would be 2, etc. Interleaving takes place between
      individual 20 bits samples.

      emphasis: type of preemphasis -- defaults to none. Permissible
      values for emphasis are 50/15 and none.

   Encoding considerations: L20 audio can be transmitted with RTP as
      specified in "draft-ietf-avt-dv-audio-02".

   Security considerations: None

   Interoperability considerations: None

   Published specification: draft-ietf-avt-dv-audio-02

   Applications which use this media type: Audio communication.

   Additional information: None

   Magic number(s): None

   File extension(s): None

   Macintosh File Type Code(s): None

   Person & email address to contact for further information:
      Katsushi Kobayashi
      e-mail: ikob@koganei.wide.ad.jp

   Intended usage: COMMON

   Author/Change controller:
      Katsushi Kobayashi
      e-mail: ikob@koganei.wide.ad.jp

8.3 L24 registration form

   MIME media type name: audio

   MIME subtype name: L24

   Required parameters:
      rate: number of samples per second -- Permissible values for
      rate are 8000, 11025, 16000, 22050, 24000, 32000, 44100, 48000,
      and 96000 samples per second. Other values might also be
      acceptable.

   Optional parameters:
      channels: how many audio streams are interleaved -- defaults to
      1; stereo would be 2, etc. Interleaving takes place between
      individual 24 bits samples.

      emphasis: type of preemphasis -- defaults to none. Permissible
      values for emphasis are 50/15 and none.

   Encoding considerations: L24 audio can be transmitted with RTP as
      specified in "draft-ietf-avt-dv-audio-02".

   Security considerations: None

   Interoperability considerations: None

   Published specification: draft-ietf-avt-dv-audio-02

   Applications which use this media type: Audio communication.

   Additional information: None

   Magic number(s): None

   File extension(s): None

   Macintosh File Type Code(s): None

   Person & email address to contact for further information:
      Katsushi Kobayashi
      e-mail: ikob@koganei.wide.ad.jp

   Intended usage: COMMON

   Author/Change controller:
      Katsushi Kobayashi
      e-mail: ikob@koganei.wide.ad.jp

9. Security Considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [1], and any appropriate RTP profile. This implies
   that confidentiality of the media streams is achieved by encryption.
   Because the data compression used along with this payload format is
   applied end-to-end, encryption may be performed after compression so
   there is no conflict between the two operations.

   A potential denial-of-service threat exists for data encodings using
   compression techniques that have non-uniform receiver-end
   computational load. The attacker can inject pathological datagrams
   into the stream which are complex to decode and cause the receiver
   to be overloaded. However, this encoding does not exhibit any
   significant non-uniformity.

   As with any IP-based protocol, in some circumstances a receiver may
   be overloaded simply by the receipt of too many packets, either
   desired or undesired. Network-layer authentication may be used to
   discard packets from undesired sources, but the processing cost of
   the authentication itself may be too high. In a multicast
   environment, pruning of specific sources may be implemented in
   future versions of IGMP [8] and in multicast routing protocols to
   allow a receiver to select which sources are allowed to reach it.

10. Full Copyright Statement

   Copyright (C) The Internet Society (1999).
   All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

11. Authors' Addresses

   Katsushi Kobayashi
   Communication Research Laboratory
   4-2-1 Nukii-kita machi, Koganei
   Tokyo 184-8795 JAPAN
   EMail: ikob@koganei.wide.ad.jp

   Akimichi Ogawa
   Keio University
   5322 Endo, Fujisawa
   Kanagawa 252 JAPAN
   EMail: akimichi@sfc.wide.ad.jp

   Stephen L. Casner
   Cisco Systems, Inc.
   170 West Tasman Drive
   San Jose, CA 95134-1706
   United States
   EMail: casner@cisco.com

   Carsten Bormann
   Universitaet Bremen FB3 TZI
   Postfach 330440
   D-28334 Bremen, GERMANY
   Phone: +49.421.218-7024
   Fax: +49.421.218-7000
   EMail: cabo@tzi.org

12. Bibliography

   [1] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson,
       "RTP: A Transport Protocol for Real-Time Applications",
       RFC 1889, January 1996.

   [2] Schulzrinne, H., "RTP Profile for Audio and Video Conferences
       with Minimal Control", RFC 1890, January 1996.

   [3] IEC61119, Digital audio tape cassette system (DAT), November
       1992.

   [4] IEC61834, Helical-scan digital video cassette recording system
       using 6,35 mm magnetic tape for consumer use (525-60, 625-50,
       1125-60 and 1250-50 systems), August 1998.

   [5] Salsman, J., "The Audio/L16 MIME content type", RFC 2586, May
       1999.

   [6] Bradner, S., "Key words for use in RFCs to Indicate Requirement
       Levels", RFC 2119, March 1997.

   [7] Handley, M. and V. Jacobson, "SDP: Session Description
       Protocol", RFC 2327, April 1998.

   [8] Deering, S., "Host Extensions for IP Multicasting", STD 5,
       RFC 1112, August 1989.

Appendix.

   The audio channel symbols for each channel are specified in Table 2.
   These symbols are taken directly from the notation used in chapter
   8.1 of the IEC61834-4 DV video specification [4]. For the exact
   meaning of each symbol, consult the original DV video specification.

      L:             Left channel of stereo
      R:             Right channel of stereo
      M:             Monaural signal
      C:             Center channel of 3, 4, 6 or 8 ch stereo
      S:             Surround channel of 4 ch stereo
      LS, LS1, LS2:  Left surround channel
      RS, RS1, RS2:  Right surround channel
      LC:            Left center channel of 8 ch stereo
      RC:            Right center channel of 8 ch stereo
      WO:            Woofer channel
      Lmix:          L + 0.7071C + 0.7071LS
      Rmix:          R + 0.7071C + 0.7071RS
      T:             0.7071C
      Q1:            0.7071LS + 0.7071RS
      Q2:            0.7071LS - 0.7071RS

      Table 2. Channel symbols of audio channels in DV video [4]

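   As a worked illustration of Sections 4 and 5, the Table 1 conversion
   and the most-significant-bit-first packing might be sketched in
   Python as follows. This sketch is not part of the specification. The
   function names compress_16_to_12 and pack_samples are hypothetical,
   INT() is assumed to truncate toward zero (the reading under which
   the endpoints of Table 1 agree), and the unused four LSBs of a final
   octet are assumed to be sent as zero, which the text leaves open.
   Samples are assumed to be already interleaved by channel.

```python
def compress_16_to_12(x: int) -> int:
    """Map one 16-bit linear sample to 12 bits nonlinear, per Table 1."""
    if not -32768 <= x <= 32767:
        raise ValueError("sample out of 16-bit range")
    # With INT() truncating toward zero, the negative half of Table 1 is
    # the one's-complement mirror of the positive half, so fold negative
    # inputs onto 0..32767 first.
    mag = x if x >= 0 else ~x
    if mag < 512:
        y = mag  # Y = X
    else:
        # Segment 1 covers 512..1,023 (divide by 2, add 100h); segment 6
        # covers 16,384..32,767 (divide by 64, add 600h).
        seg = mag.bit_length() - 9
        y = (mag >> seg) + (seg << 8)
    return y if x >= 0 else ~y


def pack_samples(samples, bits: int) -> bytes:
    """Pack signed samples of width `bits` contiguously, MSB first."""
    mask = (1 << bits) - 1
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for s in samples:
        acc = (acc << bits) | (s & mask)  # two's complement truncation
        nbits += bits
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1
    if nbits:
        # Odd sample count for 12- or 20-bit data: pad the unused LSBs
        # of the last octet with zeros (an assumption of this sketch).
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)
```

   For example, converting the 16-bit samples 32,767 and -32,768 and
   packing the results with pack_samples(..., 12) yields the three
   octets 7Fh, F8h, 00h.
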