Internet Draft                                                Adam H. Li
draft-li-avt-vocoder-00.txt                                         UCLA
November 9, 2001                                                  Editor
Expires: May 9, 2002


   An RTP Payload Format for EVRC, SMV and Other Frame-Based Vocoders


STATUS OF THIS MEMO

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as work in progress.

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.


ABSTRACT

   This document describes the RTP payload format for Enhanced Variable
   Rate Codec (EVRC) speech and Selectable Mode Vocoder (SMV) speech.
   Other vocoders that share common characteristics with EVRC and SMV
   can be supported by following the procedures specified in this
   document.

   The payload format supports several packet formats for different
   application scenarios.  A bundled/interleaved format is included to
   reduce the effect of packet loss on speech quality.  A non-bundled
   format is also supported for conversational applications.


Table of Contents

   1. Introduction
   2. Background
   3. The Codecs Supported
   3.1. EVRC Codec

Adam H. Li                                                      [Page 1]

INTERNET-DRAFT    An RTP Payload Format for Vocoders    November 9, 2001

   3.2. SMV Codec
   3.3. Other Frame-Based Vocoders
   4. RTP/Vocoder Packet Format
   4.1. Type 1 RTP/Vocoder Packet Format
   4.2. Type 2 RTP/Vocoder Packet Format
   4.3. Detection Between the Type 1 and Type 2 Packets
   5. Packet Table of Content Entries and Codec Data Frame Format
   5.1. Packet Table of Content entries
   5.2. The Codec Data Frame
   6. Interleaving Codec Data Frames in Type 1 Packets
   6.1. Finding Interleave Group Boundaries
   6.2. Reconstructing Interleaved Speech
   6.3. Receiving Invalid Interleaving Values
   6.4. Additional Receiver Responsibilities
   7. Bundling Codec Data Frames in Type 1 Packets
   8. Handling Lost RTP Packets
   9. Implementation Issues
   9.1. Interleaving Length
   9.2. Flow Control
   10. IANA Considerations
   10.1. Storage Mode
   10.2. EVRC MIME Registration
   10.3. SMV MIME Registration
   11. Mapping to SDP Parameters
   12. Security Considerations
   13. Adding Support of Other Frame-Based Vocoders
   14. Acknowledgements
   15. References
   16. Authors' Address


1. Introduction

   This document describes how compressed speech may be formatted for
   use as an RTP payload type.  The codec data supported includes that
   produced by the EVRC codec [1], the SMV codec [2], and other
   frame-based codecs that share the same characteristics.  Methods are
   provided to packetize the codec data frames into RTP packets, in
   interleaved/bundled and zero-header formats.  The sender may choose
   among these formats the one best suited to the application scenario,
   based on the network conditions, bandwidth restrictions, delay
   requirements, and packet-loss tolerance.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [3].


2. Background

   The 3rd Generation Partnership Project 2 (3GPP2) has published two
   standards which define the speech compression algorithms for CDMA
   applications: EVRC [1] and SMV [2].  The EVRC codec is currently
   deployed in millions of first and second generation CDMA handsets.
   The SMV codec is the preferred speech codec standard for CDMA2000,
   and will be deployed in third generation handsets in addition to the
   EVRC codec.  Improvements and new codecs are likely to keep emerging
   as technology improves.  Future handsets will support multiple
   codecs.

   The formats of EVRC and SMV codec frames are very similar.  In
   addition, many other vocoders share common characteristics and have
   many similar application scenarios.  This parallelism enables a
   single RTP payload format to be designed for both EVRC and SMV, as
   well as for other vocoders that possess the same common properties.
   This can simplify the protocol for transporting vocoder data frames
   through RTP and reduce the complexity of implementation.


3. The Codecs Supported

3.1. EVRC codec

   The EVRC codec [1] compresses each 20 milliseconds of 8000 Hz,
   16-bit sampled input speech into one of three different size output
   frames: Rate 1 (171 bits), Rate 1/2 (80 bits), or Rate 1/8 (16
   bits).  In addition, there are two zero bit codec frame types: null
   frames and erasure frames.  Null frames are produced as a result of
   the vocoder running at rate 0.  Null frames are zero bits long and
   are normally not transmitted.  Erasure frames are the frames that
   the receiver substitutes to the codec for lost or damaged frames.

   The codec chooses the output frame rate based on analysis of the
   input speech and the current operating mode (either normal or one of
   several reduced rates).  For typical speech patterns, this results
   in an average output of 4.2 kilobits/second for normal mode and
   lower for reduced rate modes.

3.2. SMV codec

   The SMV codec [2] compresses each 20 milliseconds of 8000 Hz, 16-bit
   sampled input speech into one of four different size output frames:
   Rate 1 (171 bits), Rate 1/2 (80 bits), Rate 1/4 (40 bits), or Rate
   1/8 (16 bits).  In addition, there are two zero bit codec frame
   types: null frames and erasure frames.  Null frames are produced as
   a result of the vocoder running at rate 0.  Null frames are zero
   bits long and are normally not transmitted.  Erasure frames are the
   frames that the receiver substitutes to the codec for lost or
   damaged frames.

   The SMV codec can operate in four modes.  Each mode operates in all
   of the rates (full rate to 1/8 rate) for varying percentages of
   time, based on the characteristics of the speech samples.  The SMV
   mode can change on a frame-by-frame basis.  The SMV codec does not
   need any information other than the codec data frames to correctly
   decode the data of the various modes.  The encoding mode is not
   transmitted.

   The percentages of the different frame rates and the average data
   rate (ADR) for the four SMV modes are shown in the table below.

               Mode 0      Mode 1      Mode 2      Mode 3
   -------------------------------------------------------------
   Rate 1      68.90%      38.14%      15.43%      07.49%
   Rate 1/2    06.03%      15.82%      38.34%      46.28%
   Rate 1/4    00.00%      17.37%      16.38%      16.38%
   Rate 1/8    25.07%      28.67%      29.85%      29.85%
   -------------------------------------------------------------
   ADR         7205 bps    5182 bps    4073 bps    3692 bps

   The SMV codec chooses the output frame rate based on an analysis of
   the input speech and the current operating mode.  For typical speech
   patterns, this results in an average output of 4.2k bits/second for
   Mode 0 and lower for other reduced rate modes.

   SMV is more bandwidth efficient than EVRC.  EVRC is equivalent in
   performance to SMV mode 1.

3.3. Other Frame-Based Vocoders

   Other frame-based vocoders can be supported by the RTP payload
   format defined in this document, as long as they have the following
   properties:

   o  the codec is frame-based;
   o  blank and erasure frames are allowed;
   o  the total number of rates is less than 17;
   o  the maximum full rate frame can be transported in a single RTP
      packet using this specific format.

   Vocoders with the characteristics listed above can be transported
   using the format specified in this document by following the steps
   listed in Section 13.


4. RTP/Vocoder Packet Format

   The RTP timestamp is in units of 1/8000 of a second for EVRC and
   SMV.  For other vocoders that use this payload format, this value
   can be different, and thus needs to be defined explicitly for each
   vocoder.

   The RTP payload data for a vocoder MUST be transmitted in packets of
   one of the following two types.

4.1. Type 1 RTP/Vocoder Packet Format

   This format is intended for the situation where the sender and the
   receiver use interleaving/bundling to send one or more codec data
   frames per packet.
   The RTP packet for this format is as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        RTP Header [4]                         |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |R|R| LLL | NNN | FFF |  Count  |  TOC  |  ...  |  TOC  |padding|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       one or more codec data frames, one per TOC entry        |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The RTP header has the expected values as described in [4].  The M
   bit should be set as specified in the applicable RTP profile, for
   example, RFC 1890 [5].  Note that RFC 1890 [5] specifies that if the
   sender does not suppress silence, the M bit will always be zero.
   When multiple codec data frames are present in a single RTP packet,
   the timestamp is, as always, that of the oldest data represented in
   the RTP packet.  The assignment of an RTP payload type for this new
   packet format is outside the scope of this document, and will not be
   specified here.  It is expected that the RTP profile for a
   particular class of applications will assign a payload type for this
   encoding, or if that is not done, then a payload type in the dynamic
   range shall be chosen by the sender.

   The first octet of a Type 1 format packet is the Interleave Byte.
   The bits within the Interleave Byte are specified as follows:

   Reserved (RR): 2 bits
      Reserved bits.  MUST be set to zero by sender, SHOULD be ignored
      by receiver.

   Interleave Length (LLL): 3 bits
      Indicates the length of interleave.  MUST have a value between 0
      and 7 inclusive (where a value of 0 indicates bundling, a special
      case of interleaving).  See Section 6 and Section 7 for more
      detailed discussion.

   Interleave Index (NNN): 3 bits
      Indicates the index within an interleaving group.  MUST have a
      value less than or equal to the value of LLL.
      Values of NNN greater than the value of LLL are invalid.

   Flow Control (FFF): 3 bits
      The flow control field is used to signal flow control information
      between receiver and sender.  See Section 9.2 for more details.

   Frame Count (Count): 5 bits
      Indicates the number of ToC fields that follow this field.  The
      number of ToC fields is the value of the frame count field plus
      one: a value of zero indicates that one ToC field follows, and a
      value of 31 indicates that 32 ToC fields follow.

   Padding (padding): 0 or 4 bits
      The padding ensures that the first codec data frame starts on an
      octet boundary.  MUST be set to zero by sender, SHOULD be ignored
      by receiver.  When the number of ToC fields is odd, the sender
      MUST add 4 bits of padding following the last ToC.  When the
      number of ToC fields is even, the sender MUST NOT send the
      padding bits.

   The Table of Content field (ToC) contains the index(es) for the
   codec data frame(s) in the packet.  There is one entry for each
   codec data frame.  The detailed formats of the ToC field and codec
   data frame are specified in Section 5.

   More than one codec data frame MAY be included in a single Type 1
   RTP/Vocoder packet by a sender.  Multiple data frames may be
   included within a Type 1 packet with the interleaving/bundling
   format as described in Section 6 and Section 7.

   The number of codec data frames in a Type 1 RTP/Vocoder packet is
   given by the Frame Count field.  Because the codec data frames have
   differing lengths, the receiver must examine the ToC fields of the
   packet to locate each codec data frame.

4.2. Type 2 RTP/Vocoder Packet Format

   The Type 2 RTP/Vocoder Packet Format is designed for maximum
   efficiency and low latency in transmission of the vocoder data.

   Exactly one codec data frame MUST be sent in each Type 2 RTP/Vocoder
   packet.  There MUST NOT be a ToC field preceding the codec data.
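
   Because a Type 2 payload is exactly one codec data frame with no
   ToC, a receiver can map the payload length to the frame rate.  The
   following is a minimal sketch, assuming the EVRC/SMV frame sizes
   listed in Section 5.1; the function name type2_rate is hypothetical,
   and rejecting every other length (including a zero-length payload)
   is an assumption of this sketch rather than a rule stated in this
   document.  Rate 1/4 applies to SMV only.

```python
# Codec data frame sizes in octets for EVRC/SMV (Section 5.1 table).
# Rate 1/4 (5 octets) is produced by SMV only.
SIZE_TO_RATE = {2: "1/8", 5: "1/4", 10: "1/2", 22: "1"}


def type2_rate(payload: bytes) -> str:
    """Return the rate label of the single frame in a Type 2 payload.

    Assumption of this sketch: any other payload length is rejected.
    """
    try:
        return SIZE_TO_RATE[len(payload)]
    except KeyError:
        raise ValueError(
            f"unexpected Type 2 payload length: {len(payload)}"
        ) from None
```

   For example, a 22-octet payload maps to Rate 1 and a 10-octet
   payload to Rate 1/2.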
   The codec rate for the data frame can be determined at the receiver
   from the length of the codec data frame, since there is only one
   codec data frame in each Type 2 packet.

   The Flow Control signal (see Section 4.1) cannot be sent in-band
   with Type 2 packets, because this type has no ToC field.

   Use of the RTP header fields for the Type 2 RTP/Vocoder Packet
   Format is the same as described in Section 4.1 for the Type 1
   RTP/Vocoder Packet Format.  The detailed format of the codec data
   frame is specified in Section 5.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                        RTP Header [4]                         |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |                                                               |
   +           ONLY one codec data frame           +-+-+-+-+-+-+-+-+
   |                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3. Detection Between the Type 1 and Type 2 Packets

   All receivers MUST be able to process both types of packets.  The
   sender MAY choose to use one or both types of packets.

   A receiver MUST have prior knowledge of the packet type to correctly
   decode the RTP packets.  The packet types used in an RTP session
   MUST be specified by the sender, and signaled through out-of-band
   means, for example by SDP during the setup of a session.

   When packets of both types are used during the same session, the two
   types can be distinguished by using a different payload type value
   for each packet type at the sender and checking the payload type
   field in the RTP header at the receiver.  The association of payload
   type number with packet type is also done out-of-band, for example
   by SDP during the setup of a session.


5. Packet Table of Content Entries and Codec Data Frame Format

5.1. Packet Table of Content entries

   For each of the codec data frames in Type 1 packets, there is a
   corresponding Table of Content (ToC) entry.
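
   As an illustration of how a receiver might reach the ToC entries,
   the sketch below unpacks the Type 1 fields of Section 4.1.  It
   assumes the layout drawn there: the Interleave Byte (RR, LLL, NNN),
   then one octet holding FFF and Count, then Count+1 four-bit ToC
   entries, with 4 bits of padding when the number of entries is odd.
   The function name parse_type1_header and the returned dictionary
   keys are hypothetical.

```python
def parse_type1_header(payload: bytes) -> dict:
    """Unpack the Type 1 header fields and ToC entries (Section 4.1).

    Assumed layout, taken from the packet diagram:
    octet 0 = RR|LLL|NNN, octet 1 = FFF|Count, then 4-bit ToC entries.
    """
    if len(payload) < 2:
        raise ValueError("payload too short for Type 1 header")
    lll = (payload[0] >> 3) & 0x07   # interleave length
    nnn = payload[0] & 0x07          # interleave index
    if nnn > lll:
        # Invalid per Section 4.1; Section 6.3 treats the packet as lost.
        raise ValueError("NNN greater than LLL")
    fff = payload[1] >> 5            # flow control
    n_toc = (payload[1] & 0x1F) + 1  # Count field plus one
    toc_octets = (n_toc + 1) // 2    # odd entry counts carry 4 pad bits
    if len(payload) < 2 + toc_octets:
        raise ValueError("payload too short for ToC")
    toc = []
    for i in range(n_toc):
        octet = payload[2 + i // 2]
        toc.append(octet >> 4 if i % 2 == 0 else octet & 0x0F)
    # The reserved bits (RR) are ignored, as a receiver SHOULD do.
    return {"lll": lll, "nnn": nnn, "fff": fff,
            "toc": toc, "data_offset": 2 + toc_octets}
```

   The data_offset value is where the first codec data frame would
   start; walking the frames then needs the per-type sizes from the
   frame type table in this section.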
   Each ToC entry indicates the type (rate) of the corresponding codec
   data frame.

   Type 2 packets MUST NOT have the ToC field, and there is always
   exactly one codec data frame in each Type 2 packet.

   Each ToC entry is 4 bits in size.  The format of the entry is
   indicated below:

    0 1 2 3
   +-+-+-+-+
   |fr type|
   +-+-+-+-+

   Frame Type: 4 bits
      The frame type indicates the type of the corresponding codec data
      frame in the RTP packet.

   For EVRC and SMV codecs, the frame type values and the size of the
   associated codec data frame are described in the table below:

   Value   Rate      Total codec data frame size (in octets)
   ---------------------------------------------------------
     0     Blank      0    (0 bit)
     1     1/8        2    (16 bits)
     2     1/4        5    (40 bits)
     3     1/2       10    (80 bits)
     4     1         22    (171 bits; 5 padded at end with zeros)
     5     Erasure    0    (SHOULD NOT be transmitted by sender)

   All values not listed in the above table MUST be considered
   reserved.  Receipt of a ToC entry with a reserved value in Frame
   Type MUST be considered invalid data.

   For other vocoders that use this payload format, the corresponding
   frame types of codec data frames need to be specified.

5.2. The Codec Data Frame

   The output of the vocoder MUST be converted into codec data frames
   for inclusion in the RTP payload.  The conversion for EVRC and SMV
   codecs is specified below.  For other vocoders that use this payload
   format, the corresponding conversion of the vocoder's output data
   frame needs to be specified.

   The codec output data bits as numbered in EVRC and SMV are packed
   into octets.
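
   The packing into octets can be sketched as follows: codec bit 1
   goes into the most significant bit of the first octet, bits
   continue in order, and any unused low-order bits of the last octet
   stay zero.  This is a minimal illustration; the function name
   pack_codec_bits and the list-of-bits input are assumptions of the
   sketch, not part of the format.

```python
def pack_codec_bits(bits) -> bytes:
    """Pack codec output bits (codec bit 1 first) MSB-first into octets.

    Unused bits of the last octet are left as zero.
    """
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)
```

   For a Rate 1 frame, 171 input bits yield 22 octets, with the five
   low-order bits of the last octet zero.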
   The lowest numbered bit (bit 1 for all rates) is placed in the most
   significant bit (internet bit 0) of octet 1 of the codec data frame,
   the second lowest bit is placed in the second most significant bit
   of the first octet, the third lowest in the third most significant
   bit of the first octet, and so on.  This continues until all of the
   bits have been placed in the codec data frame.

   The remaining unused bits of the last octet of the codec data frame
   MUST be set to zero.  Note that this is only applicable to Rate 1
   frames (171 bits), as the Rate 1/2 (80 bits), Rate 1/4 (40 bits) and
   Rate 1/8 (16 bits) frames fit exactly into a whole number of octets.

   Following is a detailed listing showing a Rate 1 EVRC/SMV codec
   output frame converted into a codec data frame:

   The codec data frame for an EVRC/SMV codec Rate 1 frame is 22 bytes
   long.  Bits 1 through 171 from the EVRC/SMV codec Rate 1 frame are
   placed as indicated, with bits marked with "Z" set to zero.
   EVRC/SMV codec Rate 1/8, Rate 1/4 and Rate 1/2 frames are converted
   similarly, but do not require zero padding because they align on
   octet boundaries.

   Rate 1 codec data frame (bytes 0 - 3)

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
   |0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|2|2|2|2|2|2|2|2|2|2|3|3|3|
   |1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Rate 1 codec data frame (bytes 18 - 21)

    1           1                   1                   1
    4           5                   6                   7
    4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| | | | | |
   |4|4|4|4|4|5|5|5|5|5|5|5|5|5|5|6|6|6|6|6|6|6|6|6|6|7|7|Z|Z|Z|Z|Z|
   |5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1| | | | | |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+


6. Interleaving Codec Data Frames in Type 1 Packets

   As indicated in Section 4.1, more than one codec data frame MAY be
   included in a single Type 1 packet by a sender.  This is
   accomplished by interleaving/bundling.

   Interleaving/bundling of codec data frames is signaled by setting
   the LLL value in the Interleave Byte to a value between 0 and 7
   inclusive.  The special case with the LLL value set to 0 is a
   reduced and simplified case of interleaving.  This is sometimes
   called bundling, because multiple consecutive codec data frames are
   included in one RTP packet in this case.  The discussion of general
   interleaving applies to the bundling case with reduced complexity.
   The bundling case is discussed in detail in Section 7.

   Senders MAY support interleaving/bundling.  All receivers MUST
   support interleaving/bundling.

   Given a time-ordered sequence of output frames from the codec
   numbered 0..n, a bundling value B, and an interleave length L where
   n = B * (L+1) - 1, the output frames are placed into RTP packets as
   follows (the values of the fields LLL and NNN are indicated for each
   RTP packet):

   First RTP Packet in Interleave group:
      LLL=L, NNN=0
      Frame 0, Frame L+1, Frame 2(L+1), Frame 3(L+1), ... for a total
      of B frames

   Second RTP Packet in Interleave group:
      LLL=L, NNN=1
      Frame 1, Frame 1+L+1, Frame 1+2(L+1), Frame 1+3(L+1), ... for a
      total of B frames

   This continues to the last RTP packet in the interleave group:

   (L+1)th RTP Packet in Interleave group:
      LLL=L, NNN=L
      Frame L, Frame L+L+1, Frame L+2(L+1), Frame L+3(L+1), ... for a
      total of B frames

   Senders MUST transmit in timestamp-increasing order.  Furthermore,
   within each interleave group, the RTP packets making up the
   interleave group MUST be transmitted in value-increasing order of
   the NNN field.  While this does not guarantee reduced end-to-end
   delay on the receiving end, when packets are delivered in order by
   the underlying transport, delay will be reduced to the minimum
   possible.

   Receivers MAY signal the maximum number of codec data frames (i.e.,
   the maximum acceptable bundling value B) they can handle in a single
   RTP packet using the OPTIONAL maxptime RTP mode parameter identified
   in Section 10.

   Receivers MAY signal the maximum interleave length (i.e., the
   maximum acceptable LLL value in the Interleave Byte) they will
   accept using the OPTIONAL maxinterleave RTP mode parameter
   identified in Section 10.

   Additionally, senders have the following restrictions:

   o  MUST NOT bundle more codec data frames in a single RTP packet
      than indicated by maxptime (see Section 10) if it is signaled.

   o  SHOULD NOT bundle more codec data frames in a single RTP packet
      than will fit in the MTU of the underlying network.  For the
      purpose of computing the maximum bundling value, all codec data
      frames MUST be assumed to have the Rate 1 size.

   o  Once a session has begun with a given maximum interleaving value
      set by maxinterleave (see Section 10), MUST NOT use an
      interleaving value (LLL) that exceeds the signaled maximum.

   o  MAY change the interleaving value only between interleave groups.

6.1. Finding Interleave Group Boundaries

   Given an RTP packet with sequence number S, interleave length (field
   LLL) L, interleave index value (field NNN) N, and bundling value B,
   the interleave group consists of RTP packets with sequence numbers
   from S-N to S-N+L inclusive.  (The sequence numbers used here are
   for illustrative purposes.  When wrap-around happens, the sequence
   numbers need to be adjusted accordingly.)  In other words, the
   interleave group always consists of L+1 RTP packets with sequential
   sequence numbers.  The bundling value for all RTP packets in an
   interleave group MUST be the same.

   The receiver determines the expected bundling value for all RTP
   packets in an interleave group from the number of codec data frames
   bundled in the first RTP packet of the interleave group received.
   Note that this may not be the first RTP packet of the interleave
   group sent if packets are delivered out of order by the underlying
   transport.

   On receipt of an RTP packet in an interleave group with other than
   the expected bundling value, the receiver MAY discard codec data
   frames off the end of the RTP packet or add erasure codec data
   frames to the end of the packet in order to manufacture a substitute
   packet with the expected bundling value.  The receiver MAY instead
   choose to discard the whole interleave group.

6.2. Reconstructing Interleaved Speech

   Given an RTP sequence number ordered set of RTP packets in an
   interleave group numbered 0..L, where L is the interleave length and
   B is the bundling value, and codec data frames within each RTP
   packet that are numbered in order from first to last with the
   numbers 0..B-1, the original, time-ordered sequence of output frames
   from the codec may be reconstructed as follows:

   First L+1 frames:
      Frame 0 from packet 0 of interleave group
      Frame 0 from packet 1 of interleave group
      And so on up to...
      Frame 0 from packet L of interleave group

   Second L+1 frames:
      Frame 1 from packet 0 of interleave group
      Frame 1 from packet 1 of interleave group
      And so on up to...
      Frame 1 from packet L of interleave group

   And so on up to...

   Bth L+1 frames:
      Frame B-1 from packet 0 of interleave group
      Frame B-1 from packet 1 of interleave group
      And so on up to...
      Frame B-1 from packet L of interleave group

6.3. Receiving Invalid Interleaving Values

   On receipt of an RTP packet with an invalid value of the LLL or NNN
   field, the RTP packet MUST be treated as lost by the receiver for
   the purpose of generating erasure frames as described in Section 8.

6.4. Additional Receiver Responsibilities

   Assume that the receiver has begun playing frames from an interleave
   group.  The time has come to play frame x from packet n of the
   interleave group.  Further assume that packet n of the interleave
   group has not been received.  As described in Section 8, an erasure
   frame will be sent to the receiving vocoder.

   Now, assume that packet n of the interleave group arrives before
   frame x+1 of that packet is needed.  Receivers SHOULD use frame x+1
   of the newly received packet n rather than substituting an erasure
   frame.  In other words, just because packet n was not available the
   first time it was needed to reconstruct the interleaved speech, the
   receiver SHOULD NOT assume it is not available when it is
   subsequently needed for interleaved speech reconstruction.


7. Bundling Codec Data Frames in Type 1 Packets

   As discussed in Section 6, the bundling of codec data frames is a
   special reduced case of interleaving with the LLL value in the
   Interleave Byte set to 0.

   Bundling codec data frames means that multiple data frames are
   included consecutively in a packet, because the interleaving length
   (LLL) is 0.
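
   The placement rule of Section 6 and the reconstruction order of
   Section 6.2, with bundling (LLL = 0) as the degenerate case, can be
   sketched as follows.  This is an illustration only: the function
   names, the dictionary used to stand in for a packet, and the
   "ERASURE" placeholder are hypothetical.  A lost packet is modeled as
   None and contributes B erasure frames, as Section 8 requires.

```python
def interleave(frames, L, B):
    """Place frames 0..n, n = B*(L+1) - 1, into L+1 packets of B frames."""
    if not 0 <= L <= 7:
        raise ValueError("LLL must be between 0 and 7")
    if len(frames) != B * (L + 1):
        raise ValueError("need exactly B*(L+1) frames")
    # Packet NNN=n carries frames n, n+(L+1), n+2(L+1), ...
    return [{"LLL": L, "NNN": n, "frames": frames[n::L + 1]}
            for n in range(L + 1)]


def deinterleave(packets, B, erasure="ERASURE"):
    """Rebuild time order from packets listed by NNN (None = lost)."""
    out = []
    for b in range(B):          # b-th frame of every packet ...
        for pkt in packets:     # ... taken in increasing NNN order
            out.append(erasure if pkt is None else pkt["frames"][b])
    return out
```

   With L = 2 and B = 2, frames 0..5 are carried as (0, 3), (1, 4) and
   (2, 5); losing the second packet leaves frames 1 and 4 as isolated
   erasures rather than a consecutive gap.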
   The interleaving group is thus reduced to a single RTP packet, and
   the reconstruction of the codec data frames from RTP packets becomes
   a much simpler process.

   Furthermore, the additional restrictions on senders are reduced to:

   o  MUST NOT bundle more codec data frames in a single RTP packet
      than indicated by maxptime (see Section 10) if it is signaled.

   o  SHOULD NOT bundle more codec data frames in a single RTP packet
      than will fit in the MTU of the underlying network.  For the
      purpose of computing the maximum bundling value, all codec data
      frames MUST be assumed to have the Rate 1 size.


8. Handling Lost RTP Packets

   The vocoders covered by this payload format SHOULD support the
   notion of erasure frames.  These are frames that for whatever reason
   are not available.  When reconstructing or playing back speech,
   erasure frames MUST be fed to the receiving vocoder for all of the
   missing packets.

   Receivers MUST use the timestamp clock to determine how many codec
   data frames are missing.  For EVRC/SMV vocoders, each codec data
   frame advances the timestamp clock by exactly 160 counts.

   Since the interleaving length/bundling value may vary, the timestamp
   clock is the only reliable way to calculate exactly how many codec
   data frames are missing when a packet is dropped.

   Specifically, when reconstructing interleaved speech, a missing RTP
   packet in the interleave group MUST be treated as containing B
   erasure codec data frames, where B is the bundling value for that
   interleave group.


9. Implementation Issues

9.1. Interleaving Length

   The vocoder interpolates the missing speech content when given an
   erasure frame.  However, the best quality is perceived by the
   listener when erasure frames are not consecutive.  This makes
   interleaving desirable, as it increases speech quality when packet
   loss may occur.

   On the other hand, interleaving can greatly increase the end-to-end
   delay.
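
   As a rough indicator of that cost, the sketch below computes how
   much speech one interleave group spans, using the relation from
   Section 6 that a group holds B * (L+1) frames and the 20 millisecond
   EVRC/SMV frame duration.  The function name is hypothetical, and the
   result is only the speech span of a group, not a full end-to-end
   delay figure, which also depends on the network and on playout
   buffering.

```python
FRAME_MS = 20  # duration of one EVRC/SMV codec data frame


def interleave_group_span_ms(L: int, B: int) -> int:
    """Milliseconds of speech covered by one interleave group.

    A group holds B*(L+1) frames (Section 6: n = B*(L+1) - 1).
    """
    if not 0 <= L <= 7:
        raise ValueError("LLL must be between 0 and 7")
    if B < 1:
        raise ValueError("bundling value must be at least 1")
    return B * (L + 1) * FRAME_MS
```

   For instance, L = 0 with B = 1 spans a single 20 ms frame, while
   L = 5 with B = 10 (the bundling allowed by the default maxptime of
   200 ms) spans 1200 ms of speech.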
   Where an interactive session is desired, either Type 1 with
   interleaving length 0 or Type 2 RTP payload types are RECOMMENDED.

   When end-to-end delay is not a concern, an interleaving length
   (field LLL) of 4 or 5 is RECOMMENDED.

   Signaling the parameters maxptime and maxinterleave at the initial
   setup of the session guarantees that the receiver can allocate a
   well-known amount of buffer space at the beginning of the session
   that will be sufficient for all future reception in that session.
   Less buffer space may be required at some point in the future if the
   sender decreases the bundling value or interleaving length, but
   never more buffer space.  This prevents the possibility of the
   receiver needing to allocate more buffer space (with the possible
   result that none is available).

9.2. Flow Control

   The Flow Control signal requests a reduction of the codec rate in
   the reverse direction.  All implementations are RECOMMENDED to honor
   the Flow Control request.  If an implementation responds to the Flow
   Control request, it MUST be able to process and react to the flow
   control field in Type 1 format packets.

   The Flow Control signal SHOULD only be used in one-to-one sessions.
   In multiparty sessions, any received Flow Control signals SHOULD be
   ignored.  In addition, the Flow Control signal MAY also be sent
   through non-RTP means, which is out of the scope of this
   specification.

   The 3-bit flow control field is used to signal the receiver to
   reduce the output bit rate of its audio encoder.  If the Flow
   Control field is set to a non-zero value in RTP packets from node A
   to node B, it is a request for node B to reduce the rate of its
   audio encoder and therefore the bit rate of the RTP stream from node
   B to node A.  Once a node sets this field to a non-zero value, it
   SHOULD continue to set the field to the same value in subsequent
   packets until the need for the requested rate reduction has changed.
   A silent receiver MAY send an RTP packet containing a blank frame
   for the purpose of communicating flow control information to the
   receiver.  The blank frame packet MAY be retransmitted if the sender
   detects that the receiver has not reduced its rate.  The receiver of
   such a frame SHOULD respect the rate reduction request until it
   receives an updated value in a subsequent packet.

   Each codec type SHOULD define its own interpretation of the Flow
   Control field.  Codecs SHOULD follow the convention that higher
   values of the 3-bit field correspond to an equal or lower average
   output bit rate.

   For the EVRC codec, the flow control field MUST be interpreted
   according to Tables 2.2.1.2-1 and 2.2.1.2-2 of the EVRC codec
   specifications [1].  Values above '100' are currently reserved.  If
   an unknown value above '100' is received, it MUST be handled as if
   '100' were received.

   For the SMV codec, the flow control field MUST be interpreted
   according to Table 2.2-2 of the SMV codec specifications [2].
   Values above '101' are currently reserved.  If an unknown value
   above '101' is received, it MUST be handled as if '101' were
   received.


10. IANA Considerations

   Two new MIME sub-types as described in this section are to be
   registered.

   The MIME names for the EVRC and SMV codecs are allocated from the
   IETF tree, since all the vocoders covered are expected to be widely
   used for Voice-over-IP applications.

   The RTP mode has been described in the previous sections.

10.1. Storage Mode

   The storage mode is used for storing speech frames, e.g. as a file
   or e-mail attachment.

   The file begins with a magic number to identify the vocoder that is
   used.  The magic number for EVRC corresponds to the ASCII character
   string "#!EVRC\n", i.e., "0x23 0x21 0x45 0x56 0x52 0x43 0x0A" in
   network byte order.  The magic number for SMV corresponds to the
   ASCII character string "#!SMV\n", i.e., "0x23 0x21 0x53 0x4d 0x56
   0x0a" in network byte order.
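
   A reader of stored data might recognize the two magic numbers as in
   the following minimal sketch.  The function name detect_vocoder and
   its return convention (vocoder name plus the offset at which the
   stored frames begin) are assumptions made for illustration.

```python
# Magic numbers from Section 10.1, as raw octets.
MAGIC = {
    bytes([0x23, 0x21, 0x45, 0x56, 0x52, 0x43, 0x0A]): "EVRC",  # "#!EVRC\n"
    bytes([0x23, 0x21, 0x53, 0x4D, 0x56, 0x0A]): "SMV",         # "#!SMV\n"
}


def detect_vocoder(data: bytes):
    """Return (vocoder name, offset of the first stored frame)."""
    for magic, name in MAGIC.items():
        if data.startswith(magic):
            return name, len(magic)
    raise ValueError("no EVRC or SMV magic number found")
```

   For example, data beginning with the octets of "#!SMV\n" is
   reported as SMV with the stored frames starting at offset 6.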
The codec data frames are stored in consecutive order, with a single TOC entry field, padded to one octet, prefixing each codec data frame.

Speech frames lost in transmission and non-received frames MUST be stored as erasure frames (frame type 5, see definition in Section 5.1) to maintain synchronization with the original media.

10.2. EVRC MIME Registration

Media Type Name: audio

Media Subtype Name: EVRC

Required Parameters:

   ptype: Indicates the Type of the RTP/Vocoder packets. The valid values are 1 (Type 1) or 2 (Type 2).

Optional parameters for RTP mode:

   ptime: Defined as usual for RTP audio [6].

   maxptime: The maximum amount of media which can be encapsulated in each packet, expressed as time in milliseconds. The time SHALL be calculated as the sum of the time the media present in the packet represents. The time SHOULD be a multiple of the duration of a single codec data frame (20 msec). If not signaled, the default maxptime value SHALL be 200 milliseconds.

   maxinterleave: Maximum number for interleaving length (field LLL in the Interleaving Byte). The interleaving lengths used in the entire session MUST NOT exceed this maximum value. If not signaled, the maxinterleave length SHALL be 5.

Optional parameters for storage mode: none

Encoding considerations for RTP mode: see Section 6 and Section 7 of RFC xxxx.

Encoding considerations for storage mode: see Section 10.1 of RFC xxxx.

Security considerations: see Section 12 "Security Considerations" of RFC xxxx.

Public specification: RFC xxxx.

Additional information for storage mode:

   Magic number: #!EVRC\n
   File extensions: evc, EVC
   Macintosh file type code: none
   Object identifier or OID: none

Intended usage: COMMON. It is expected that many VoIP applications (as well as mobile applications) will use this type.

Person & email address to contact for further information: The authors of this document.

Author/Change controller: This registration is part of the IETF registration tree.

10.3. SMV MIME Registration

Media Type Name: audio

Media Subtype Name: SMV

Required Parameters:

   ptype: Indicates the Type of the RTP/Vocoder packets. The valid values are 1 (Type 1) or 2 (Type 2).

Optional parameters for RTP mode:

   ptime: Defined as usual for RTP audio [6].

   maxptime: The maximum amount of media which can be encapsulated in each packet, expressed as time in milliseconds. The time SHALL be calculated as the sum of the time the media present in the packet represents. The time SHOULD be a multiple of the duration of a single codec data frame (20 msec). If not signaled, the default maxptime value SHALL be 200 milliseconds.

   maxinterleave: Maximum number for interleaving length (field LLL in the Interleaving Byte). The interleaving lengths used in the entire session MUST NOT exceed this maximum value. If not signaled, the maxinterleave length SHALL be 5.

Optional parameters for storage mode: none

Encoding considerations for RTP mode: see Section 6 and Section 7 of RFC xxxx.

Encoding considerations for storage mode: see Section 10.1 of RFC xxxx.

Security considerations: see Section 12 "Security Considerations" of RFC xxxx.

Public specification: RFC xxxx.

Additional information for storage mode:

   Magic number: #!SMV\n
   File extensions: smv, SMV
   Macintosh file type code: none
   Object identifier or OID: none

Intended usage: COMMON. It is expected that many VoIP applications (as well as mobile applications) will use this type.

Person & email address to contact for further information: The authors of this document.

Author/Change controller: This registration is part of the IETF registration tree.

11. Mapping to SDP Parameters

Please note that this section applies to the RTP mode only.
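The parameter defaults stated in the registrations (maxptime of 200 milliseconds, maxinterleave of 5, and a 20 msec codec data frame) can be resolved into per-session limits as in the Python sketch below. The function name session_limits, the returned dictionary keys, and the choice to pass maxptime as a separate argument are assumptions made for illustration. They are not defined by this document.

```python
FRAME_MS = 20  # duration of a single codec data frame, per the registrations


def session_limits(fmtp, maxptime=None):
    """Resolve limits from a parameter string such as 'ptype=1; maxinterleave=2'.

    maxptime (milliseconds) is passed separately; None selects the default.
    """
    params = {}
    for item in fmtp.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            params[key.strip().lower()] = value.strip()

    if "ptype" not in params:
        raise ValueError("ptype is a required parameter")
    ptype = int(params["ptype"])
    if ptype not in (1, 2):
        raise ValueError("ptype must be 1 (Type 1) or 2 (Type 2)")

    maxptime = 200 if maxptime is None else int(maxptime)  # default: 200 ms
    maxinterleave = int(params.get("maxinterleave", 5))    # default: 5
    return {
        "ptype": ptype,
        "maxptime": maxptime,
        "maxinterleave": maxinterleave,
        # Upper bound on codec data frames carried in one packet.
        "max_frames_per_packet": maxptime // FRAME_MS,
    }
```

With the defaults, a packet carries at most 200 / 20 = 10 codec data frames, which is the kind of fixed bound a receiver can use when sizing its buffers at session setup.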
Parameters are mapped to SDP [6] as usual. Example usage in SDP:

   m=audio 49120 RTP/AVP 97
   a=rtpmap:97 EVRC
   a=fmtp:97 ptype=1; maxinterleave=2
   a=maxptime:80

12. Security Considerations

RTP packets using the payload format defined in this specification are subject to the security considerations discussed in the RTP specification [4], and any appropriate profile (for example [5]). This implies that confidentiality of the media streams is achieved by encryption. Because the data compression used with this payload format is applied end-to-end, encryption may be performed after compression, so there is no conflict between the two operations.

A potential denial-of-service threat exists for data encoding using compression techniques that have non-uniform receiver-end computational load. The attacker can inject pathological datagrams into the stream which are complex to decode and cause the receiver to become overloaded. However, this encoding does not exhibit any significant non-uniformity.

As with any IP-based protocol, in some circumstances a receiver may be overloaded simply by the receipt of too many packets, either desired or undesired. Network-layer authentication may be used to discard packets from undesired sources, but the processing cost of the authentication itself may be too high. In a multicast environment, pruning of specific sources may be implemented in future versions of IGMP [7] and in multicast routing protocols to allow a receiver to select which sources are allowed to reach it.

Interleaving MAY affect encryption. Depending on the encryption scheme used, there MAY be restrictions on, for example, the time when keys can be changed.

13. Adding Support of Other Frame-Based Vocoders

As described above, the RTP payload format defined in this document is flexible enough to be extended to support other frame-based vocoders.
The additional supported vocoders must be frame-based and possess the common properties described in Section 3.3.

The following need to be done in order for any eligible vocoder to use the RTP payload format defined in this document:

   o Define the unit used for the RTP time stamp;
   o Define the meaning of the flow control bits;
   o Define the corresponding codec data frame type values for the ToC;
   o Define the conversion procedure for the vocoder's output data frames;
   o Define a magic number for storage mode, and complete the corresponding MIME registration.

14. Acknowledgements

The editor thanks the following authors for contributions to this document: J. D. Villasenor, D. S. Park, J. H. Park, K. Miller, S. C. Greer, D. Leon, N. Leung, K. J. McKay, M. Lioy, M. L. Espelien, R. Gellens, T. Hiller, P. J. McCann, S. S. Mathai, M. D. Turner, A. Rajkumar, D. Gal, M. Westerlund, L.-E. Jonsson, G. Sherwood, and T. Zeng.

15. References

[1] 3GPP2 C.S0014, "Enhanced Variable Rate Codec, Speech Service Option 3 for Wideband Spread Spectrum Digital Systems", January 1997.

[2] 3GPP2 C.S0030, "Selectable Mode Vocoder", August 2001.

[3] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[4] Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, January 1996.

[5] Schulzrinne, H., "RTP Profile for Audio and Video Conferences with Minimal Control", RFC 1890, January 1996.

[6] Handley, M. and V. Jacobson, "SDP: Session Description Protocol", RFC 2327, April 1998.

[7] Deering, S., "Host Extensions for IP Multicasting", STD 5, RFC 1112, August 1989.

16. Authors' Address

Adam H. Li
Image Communication Lab
Electrical Engineering Department
University of California
Los Angeles, CA 90095
USA
Phone: +1 310 825 5178
Email: adamli@icsl.ucla.edu

John D. Villasenor
Image Communication Lab
Electrical Engineering Department
University of California
Los Angeles, CA 90095
USA
Phone: +1 310 825 0228
Email: villa@icsl.ucla.edu

Dong-Seek Park
Samsung Electronics
Suwon, Kyungki 442-742
Korea
Phone: +82 31 200 3674
Email: dspark@samsung.com

Jeong-Hoon Park
Samsung Electronics
Suwon, Kyungki 442-742
Korea
Phone: +82 31 200 3747
Email: dspark@samsung.com

Keith Miller
Nokia
6000 Connection Drive
Irving, Texas 75039
USA
Phone: +1 972 894 4296
Email: keith.miller@nokia.com

S. Craig Greer
Nokia
6000 Connection Drive
Irving, Texas 75039
USA
Phone: +1 972 894 4867
Email: craig.greer@nokia.com

David Leon
Nokia
6000 Connection Drive
Irving, Texas 75039
USA
Phone: +1 972 374 1860
Email: david.leon@nokia.com

Marcello Lioy
QUALCOMM, Incorporated
5775 Morehouse Drive
San Diego, CA 92121
USA
Phone: +1 858 651 8220
Email: mlioy@qualcomm.com

Nikolai Leung
QUALCOMM, Incorporated
7710 Takoma Ave.
Takoma Park, MD 20912
USA
Phone: +1 703 346 8351
Email: nleung@qualcomm.com

Kyle J. McKay
QUALCOMM, Incorporated
5775 Morehouse Drive
San Diego, CA 92121-1714
USA
Phone: +1 858 587 1121
Email: kylem@qualcomm.com

Magdalena L. Espelien
QUALCOMM, Incorporated
5775 Morehouse Drive
San Diego, CA 92121-1714
USA
Phone: +1 858 651-6733
Email: magda@qualcomm.com

Randall Gellens
QUALCOMM Incorporated
5775 Morehouse Drive
San Diego, CA 92121-1714
USA
Phone: +1 858 651-5115
Email: rg+ietf@qualcomm.com

Tom Hiller
Lucent Technologies
263 Shuman Drive, Room 2F-218
Naperville, IL 60137
USA
Phone: +1 630 979 7673
Email: tom.hiller@lucent.com

Peter J. McCann
Lucent Technologies
263 Shuman Drive, Room 2Z-305
Naperville, IL 60137
USA
Phone: +1 630 713 9359
Email: mccap@lucent.com

Stinson S. Mathai
Lucent Technologies
263 Shuman Blvd, Room 1E-550
Naperville, IL 60566
USA
Phone: +1 630 713 5190
Email: smathai@lucent.com

Michael D. Turner
Lucent Technologies
67 Whippany Rd, Room 2A-203
Whippany, NJ 07981
USA
Phone: +1 973 386 3579
Email: mdturner@lucent.com

Ajay Rajkumar
Lucent Technologies
67 Whippany Rd, Room 1A-235
Whippany, NJ 07981
USA
Phone: +1 973 386 5249
Email: ajayrajkumar@lucent.com

Dan Gal
Lucent Technologies
67 Whippany Rd
Whippany, NJ 07981
USA
Phone: +1 973 428 7734
Email: dgal@lucent.com

Magnus Westerlund
Ericsson Radio Systems AB
Torshamnsgatan 23
SE-164 80 Stockholm
Sweden
Phone: +46 8 4048287
Email: magnus.westerlund@ericsson.com

Lars-Erik Jonsson
Ericsson Erisoft AB
Box 920
SE-971 28 Luleå
Sweden
Phone: +46 920 20 21 07
Email: lars-erik.jonsson@ericsson.com

Greg Sherwood
PacketVideo Corporation
4820 Eastgate Mall
San Diego, CA 92121
USA
Email: sherwood@packetvideo.com

Thomas Zeng
PacketVideo Corporation
4820 Eastgate Mall
San Diego, CA 92121
USA
Email: zeng@packetvideo.com