Internet Draft                                              S. Wenger
Document: draft-ietf-avt-rtp-h264-03.txt              M.M. Hannuksela
Expires: December 2003                                 T. Stockhammer
                                                        M. Westerlund
                                                            D. Singer
                                                         October 2003
                                                   Expires April 2004


                  RTP Payload Format for H.264 Video

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC 2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts
   as reference material or to cite them other than as "work in
   progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Copyright Notice

   Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

   This memo describes an RTP Payload format for the ITU-T
   Recommendation H.264 video codec. This codec was designed as a
   joint project of the Video Coding Experts Group (VCEG) of ITU-T
   and the Moving Picture Experts Group (MPEG) of ISO/IEC.
   Recommendation H.264 was approved by ITU-T in May 2003, and the
   approved draft specification is available for public review.
   ISO/IEC International Standard 14496-10 will be technically
   identical to ITU-T Recommendation H.264.

Wenger et. al.          Expires August 2003                  [Page 1]
Internet Draft                                          26 June, 2003

Table of Contents

   1. Introduction
      1.1. The H.264 codec
      1.2. Parameter Set Concept
      1.3. Network Abstraction Layer Unit Types
   2. Conventions
   3. Scope
   4. Definitions and Abbreviations
      4.1. Definitions
      4.2. Abbreviations
   5. RTP Payload Format
      5.1. RTP Header Usage
      5.2. Common structure of the RTP payload format
      5.3. NAL Unit Octet Usage
      5.4. Packetization Modes
      5.5. Decoding Order Number (DON)
      5.6. Single NAL Unit Packet
      5.7. Aggregation Packets
      5.8. Fragmentation Units (FUs)
   6. Packetization Rules
      6.1. Common Packetization Rules
      6.2. Single NAL Unit Mode
      6.3. Non-Interleaved Mode
      6.4. Interleaved Mode
   7. De-Packetization Process (Informative)
      7.1. Single NAL Unit and Non-Interleaved Mode
      7.2. Interleaved Mode
      7.3. Additional De-Packetization Guidelines
   8. Payload Format Parameters
      8.1. MIME Registration
      8.2. SDP Parameters
   9. Security Considerations
   10. Informative Appendix: Application Examples
      10.1. Video Telephony according to ITU-T Recommendation H.241
            Annex A
      10.2. Video Telephony, No Slice Data Partitioning, No NAL Unit
            Aggregation
      10.3. Video Telephony, Interleaved Packetization Using NAL Unit
            Aggregation
      10.4. Video Telephony, with Data Partitioning
      10.5. Video Telephony or Streaming, with FUs and Forward Error
            Correction
      10.6. Low-Bit-Rate Streaming
      10.7. Robust Packet Scheduling in Video Streaming
   11. Informative Appendix: Rationale for Decoding Order Number
      11.1. Introduction
      11.2. Example of Multi-Picture Slice Interleaving
      11.3. Example of Robust Packet Scheduling
      11.4. Robust Transmission Scheduling of Redundant Coded Slices
      11.5. Remarks on Other Design Possibilities
   12. Open Issues
   13. Full Copyright Statement
   14. Intellectual Property Notice
   15. References
      15.1. Normative References
      15.2. Informative References
   Annex A: Changes relative to draft-ietf-avt-rtp-h264-02.txt

1. Introduction

1.1. The H.264 codec

   This memo specifies an RTP payload format for the video coding
   standard known as ITU-T Recommendation H.264 [1] and ISO/IEC
   International Standard 14496 Part 10 (also known as MPEG-4
   Advanced Video Coding) [2]. Recommendation H.264 was approved by
   ITU-T in May 2003, and the approved draft specification is
   available for public review [8].
   In this memo the H.264 acronym is used for the codec and the
   standard, but the memo is equally applicable to the ISO/IEC
   counterpart of the coding standard.

   The H.264 video codec has a very broad application range that
   covers all forms of digital compressed video, from low-bit-rate
   Internet streaming applications to HDTV broadcast and Digital
   Cinema applications with near-lossless coding. Most, if not all,
   relevant companies in all of these fields (including
   video conferencing, streaming, TV broadcast, and Digital Cinema)
   have participated in the standardization, which gives hope that
   this wide application range is more than an illusion and may
   materialize, probably in a relatively short time frame. The
   overall performance of H.264 is such that bit rate savings of 50%
   or more, compared to the current state of technology, are
   reported. Digital Satellite TV quality, for example, was reported
   to be achievable at 1.5 Mbit/s, compared to the current operation
   point of MPEG-2 video at around 3.5 Mbit/s [9].

   The codec specification [1] itself distinguishes conceptually
   between a video coding layer (VCL) and a network abstraction
   layer (NAL). The VCL contains the signal processing functionality
   of the codec, such as transform, quantization, motion
   search/compensation, and the loop filter. It follows the general
   concept of most of today's video codecs: a macroblock-based coder
   that utilizes inter picture prediction with motion compensation,
   and transform coding of the residual signal. The VCL encoder
   outputs slices: bit strings that contain the macroblock data of
   an integer number of macroblocks, and the information of the
   slice header (containing the spatial address of the first
   macroblock in the slice, the initial quantization parameter, and
   similar). Macroblocks in slices are ordered in scan order unless
   a different
   macroblock allocation is specified, using the so-called Flexible
   Macroblock Ordering syntax. In-picture prediction is used only
   within a slice. More information is provided in [8].

   The NAL encoder encapsulates the slice output of the VCL encoder
   into Network Abstraction Layer Units (NAL units), which are
   suitable for transmission over packet networks or for use in
   packet-oriented multiplex environments. Annex B of H.264 defines
   an encapsulation process to transmit such NAL units over
   byte-stream-oriented networks. In the scope of this memo Annex B
   is not relevant.

   Internally, the NAL uses NAL units. A NAL unit consists of a
   one-byte header and the payload byte string. The header also
   serves as the RTP payload header and indicates the type of the
   NAL unit, the (potential) presence of bit errors or syntax
   violations in the NAL unit payload, and information regarding the
   relative importance of the NAL unit for the decoding process.
   This RTP payload specification is designed to be unaware of the
   bit string in the NAL unit payload.

   One of the main properties of H.264 is the complete decoupling of
   the transmission time, the decoding time, and the sampling or
   presentation time of slices and pictures. The decoding process
   specified in H.264 is unaware of time, and the H.264 syntax does
   not carry information such as the number of skipped frames (as is
   common in the form of the Temporal Reference in earlier video
   compression standards). Also, there are NAL units that affect
   many pictures and are, hence, inherently timeless. For this
   reason, the handling of the RTP timestamp requires some special
   considerations for those NAL units for which the sampling or
   presentation time is not defined or, at transmission time,
   unknown.

1.2. Parameter Set Concept

   One very fundamental design concept of H.264 is to generate
   self-contained packets, to make mechanisms such as the header
   duplication of RFC 2429 [11] or MPEG-4's HEC [12] unnecessary.
   This was achieved by decoupling information that is relevant to
   more than one slice from the media stream. This higher layer meta
   information should be sent reliably, asynchronously, and in
   advance of the RTP packet stream that contains the slice packets.
   (Provisions for sending this information in-band are also
   available for applications that do not have an out-of-band
   transport channel appropriate for the purpose.) The combination
   of the higher-level parameters is called a parameter set. The
   H.264 specification includes two types of parameter sets:
   sequence parameter set and picture parameter set. An active
   sequence parameter set remains unchanged throughout a coded video
   sequence, and an active picture parameter set remains unchanged
   within a coded picture. The sequence and picture parameter set
   structures contain information such as picture size, optional
   coding modes employed, and macroblock to slice group map.

   In order to be able to change picture parameters (such as the
   picture size) without the need to transmit parameter set updates
   synchronously to the slice packet stream, the encoder and decoder
   can maintain a list of more than one sequence and picture
   parameter set. Each slice header contains a codeword that
   indicates the sequence and picture parameter set to be used.

   This mechanism allows the transmission of parameter sets to be
   decoupled from the packet stream, so that they can be transmitted
   by external means, e.g. as a side effect of the capability
   exchange, or through a (reliable or unreliable) control protocol.
   It is even possible that they are never transmitted but are fixed
   by an application design specification.

1.3. Network Abstraction Layer Unit Types

   Tutorial information on the NAL design can be found in [13], [14]
   and [15].

   All NAL units consist of a single NAL unit type octet, which also
   serves as the payload header. The payload of a NAL unit follows
   immediately.

   The syntax and semantics of the NAL unit type octet are specified
   in [1], but the essential properties of the NAL unit type octet
   are summarized below. The NAL unit type octet has the following
   format:

      +---------------+
      |0|1|2|3|4|5|6|7|
      +-+-+-+-+-+-+-+-+
      |F|NRI|  Type   |
      +---------------+

   The semantics of the components of the NAL unit type octet, as
   specified in the H.264 specification, are described briefly
   below.

   F: 1 bit
      forbidden_zero_bit. The H.264 specification declares a value
      of 1 as a syntax violation.

   NRI: 2 bits
      nal_ref_idc. A value of 00 indicates that the content of the
      NAL unit is not used to reconstruct reference pictures for
      inter picture prediction. Such NAL units can be discarded
      without risking the integrity of the reference pictures.
      Values greater than 00 indicate that the decoding of the NAL
      unit is required to maintain the integrity of the reference
      pictures.

   Type: 5 bits
      nal_unit_type. The NAL unit payload type as defined in Table
      7-1 of [1], and later within this memo. For a reference of all
      currently defined NAL unit types and their semantics please
      refer to section 7.4.1 in [1].

   This memo introduces new NAL unit types, which are presented in
   section 5.2. Note that the NAL unit types defined in this memo
   are marked as unspecified in [1]. Moreover, this specification
   extends the semantics of F and NRI as described in section 5.3.

2. Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
   NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   RFC 2119 [3].

   This specification uses the notion of setting and clearing a bit
   when handling bit fields. Setting a bit is the same as assigning
   that bit the value of 1 (On). Clearing a bit is the same as
   assigning that bit the value of 0 (Off).

3. Scope

   This payload specification can only be used to carry the "naked"
   H.264 NAL unit stream over RTP. Likely, the first applications of
   this specification will be in the conversational multimedia
   field: video telephony or video conferencing. This draft is not
   intended for use in conjunction with the byte stream format of
   Annex B of H.264.

4. Definitions and Abbreviations

4.1. Definitions

   This document uses the definitions of [1]. The following terms
   defined in [1] are summed up below for convenience:

   access unit: A set of NAL units always containing a primary
      coded picture. In addition to the primary coded picture, an
      access unit may also contain one or more redundant coded
      pictures or other NAL units not containing slices or slice
      data partitions of a coded picture. The decoding of an access
      unit always results in a decoded picture.

   coded video sequence: A sequence of access units that consists,
      in decoding order, of an IDR access unit followed by zero or
      more non-IDR access units including all subsequent access
      units up to but not including any subsequent IDR access unit.

   instantaneous decoding refresh (IDR) access unit: An access unit
      in which the primary coded picture is an IDR picture.

   instantaneous decoding refresh (IDR) picture: A coded picture
      containing only slices with I or SI slice types that causes a
      "reset" in the decoding process. After the decoding of an IDR
      picture all following coded pictures in decoding order can be
      decoded without inter prediction from any picture decoded
      prior to the IDR picture.

   primary coded picture: The coded representation of a picture to
      be used by the decoding process for a bitstream conforming to
      H.264. The primary coded picture contains all macroblocks of
      the picture.

   redundant coded picture: A coded representation of a picture or
      a part of a picture. The content of a redundant coded picture
      shall not be used by the decoding process for a bitstream
      conforming to H.264. The content of a redundant coded picture
      may be used by the decoding process for a bitstream that
      contains errors or losses.

   VCL NAL unit: A collective term used to refer to coded slice and
      coded data partition NAL units.

   In addition, the following definitions apply:

   decoding order number (DON): A field in the payload structure or
      a derived variable indicating NAL unit decoding order. Values
      of DON are in the range of 0 to 65535, inclusive. After
      reaching the maximum value, the value of DON wraps around to
      0.

   NAL unit decoding order: A NAL unit order that conforms to the
      constraints on NAL unit order given in section 7.4.1.2 in [1].

   transmission order: The order of packets in ascending RTP
      sequence number order (in modulo arithmetic). Within an
      aggregation packet, the NAL unit transmission order is the
      same as the order of appearance of NAL units in the packet.

4.2. Abbreviations

   DON:     Decoding Order Number
   DONB:    Decoding Order Number Base
   DOND:    Decoding Order Number Difference
   FU:      Fragmentation Unit
   IDR:     Instantaneous Decoding Refresh
   IEC:     International Electrotechnical Commission
   ISO:     International Organization for Standardization
   ITU-T:   International Telecommunication Union,
            Telecommunication Standardization Sector
   MTAP:    Multi-Time Aggregation Packet
   MTAP16:  MTAP with 16-bit timestamp offset
   MTAP24:  MTAP with 24-bit timestamp offset
   NAL:     Network Abstraction Layer
   NALU:    NAL Unit
   SEI:     Supplemental Enhancement Information
   STAP:    Single-Time Aggregation Packet
   STAP-A:  STAP type A
   STAP-B:  STAP type B
   TS:      Timestamp
   VCL:     Video Coding Layer

5. RTP Payload Format

5.1. RTP Header Usage

   The format of the RTP header is specified in RFC 3550 [4] and
   reprinted in Figure 1 for convenience. This payload format uses
   the fields of the header in a manner consistent with that
   specification.

   When encapsulating one NAL unit per RTP packet, the RECOMMENDED
   RTP payload format is specified in section 5.6. The RTP payload
   (and the settings for some RTP header bits) for aggregation
   packets and fragmentation units are specified in sections 5.7 and
   5.8, respectively.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |            contributing source (CSRC) identifiers             |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 1. RTP header according to RFC 3550.

   The RTP header information is set as follows:

   Version (V): 2 bits
      Set to 2 according to RFC 3550.

   Padding (P): 1 bit
      Used according to RFC 3550.

   Extension (X): 1 bit
      Used according to RFC 3550 and profile definitions.

   CSRC count (CC): 4 bits
      Used according to RFC 3550.

   Marker bit (M): 1 bit
      Set for the very last packet of the access unit indicated by
      the RTP timestamp, in line with the normal use of the M bit in
      video formats and to allow an efficient playout buffer
      handling. Decoders MAY use this bit as an early indication of
      the last packet of an access unit, but MUST NOT rely on this
      property.

         Informative note: Only one M bit is associated with an
         aggregation packet carrying multiple NAL units, and thus if
         a gateway has re-packetized an aggregation packet into
         several packets, it cannot reliably set the M bit of those
         packets.

   Payload type (PT): 7 bits
      The assignment of an RTP payload type for this new packet
      format is outside the scope of this document, and will not be
      specified here. The assignment of a payload type needs to be
      performed either through the profile used or in a dynamic way.

   Sequence number (SN): 16 bits
      Increased by one for each sent packet. Set to a random value
      during startup as per RFC 3550.

   Timestamp: 32 bits
      The RTP timestamp is set to the sampling timestamp of the
      content. A 90 kHz clock rate MUST be used.

      If the NAL unit has no timing properties of its own (e.g.
      parameter set and SEI NAL units), the RTP timestamp is set to
      the RTP timestamp of the primary coded picture of the access
      unit in which the NAL unit is included according to section
      7.4.1.2 of [1].

      The setting of the RTP timestamp for MTAPs is defined in
      section 5.7.2.

      If the content is a part of a coded frame that was sampled as
      two fields having distinct sampling times and that is supposed
      to be displayed as fields having distinct display times, the
      RTP timestamp MUST be set to the sampling timestamp of the
      latest sampled field.
      In addition, the picture timing supplemental enhancement
      information (SEI) message (subclauses D.1.2 and D.2.2 of [1])
      SHOULD be used to convey the timestamps for display, and the
      last clock timestamp in decoding order conveyed in a picture
      timing SEI message MUST correspond to the RTP timestamp of the
      primary coded picture of the same access unit.

         Informative note: Displaying coded frames as fields is
         commonly needed in an operation known as 3:2 pulldown,
         where film content that consists of coded frames is
         displayed on a display using interlaced scanning. The
         picture timing SEI message enables carriage of multiple
         timestamps for the same coded picture, and therefore allows
         the 3:2 pulldown process to be controlled precisely. The
         picture timing SEI message mechanism is necessary, because
         only one timestamp per coded frame can be conveyed in the
         RTP timestamp.

      Receivers SHOULD ignore any picture timing SEI messages
      included in access units that have only one display timestamp.
      Instead, receivers SHOULD use the RTP timestamp for
      synchronizing the display process. RTP senders SHOULD NOT
      transmit picture timing SEI messages for pictures that are not
      supposed to be displayed as multiple fields.

   Synchronization source (SSRC) identifier: 32 bits
      Used according to RFC 3550.

   Contributing source (CSRC) identifiers: 0 to 15 items, 32 bits
   each
      Used according to RFC 3550.

5.2. Common structure of the RTP payload format

   The payload format defines a number of different payload
   structures, each serving a different need. Which structure a
   received RTP packet contains is evident from the first byte of
   the payload. This byte is always structured as a NAL unit header.
   The NAL unit type field indicates which structure is present. The
   possible structures are:

   Single NAL Unit Packet:
      Contains only a single NAL unit in the payload.
      The NAL header type field will be equal to the original NAL
      unit type, i.e., in the range of 1 to 23, inclusive. Specified
      in section 5.6.

   Aggregation packet:
      Packet type used to aggregate multiple NAL units into a single
      RTP payload. This packet exists in four versions: the
      Single-Time Aggregation Packet type A (STAP-A), the
      Single-Time Aggregation Packet type B (STAP-B), the Multi-Time
      Aggregation Packet (MTAP) with 16-bit offset (MTAP16), and the
      Multi-Time Aggregation Packet (MTAP) with 24-bit offset
      (MTAP24). The NAL unit type numbers assigned for STAP-A,
      STAP-B, MTAP16, and MTAP24 are 24, 25, 26, and 27,
      respectively. Specified in section 5.7.

   Fragmentation unit:
      Used to fragment a single NAL unit over multiple RTP packets.
      Exists in two versions, identified by the NAL unit type
      numbers 28 and 29. Specified in section 5.8.

   Table 1. Summary of NAL unit types and their payload structures.

   Type   Packet    Type name                        Section
   ---------------------------------------------------------
   1-23   NAL unit  Single NAL unit packet           5.6
   24     STAP-A    Single-time aggregation packet   5.7.1
   25     STAP-B    Single-time aggregation packet   5.7.1
   26     MTAP16    Multi-time aggregation packet    5.7.2
   27     MTAP24    Multi-time aggregation packet    5.7.2
   28     FU-A      Fragmentation unit               5.8
   29     FU-B      Fragmentation unit               5.8

5.3. NAL Unit Octet Usage

   The structure and semantics of the NAL unit octet were introduced
   in section 1.3. For convenience, the format of the NAL unit type
   octet is reprinted below:

      +---------------+
      |0|1|2|3|4|5|6|7|
      +-+-+-+-+-+-+-+-+
      |F|NRI|  Type   |
      +---------------+

   This section specifies the semantics of F and NRI according to
   this specification.

   F: 1 bit
      forbidden_zero_bit. A value of 0 indicates that the NAL unit
      type octet and payload SHOULD NOT contain bit errors or other
      syntax violations.
      A value of 1 indicates that the NAL unit type octet and
      payload MAY contain bit errors or other syntax violations.
      Network elements, such as gateways, MAY set the F bit to
      indicate detected bit errors in the NAL unit. The H.264
      specification requires that the F bit is equal to 0. Thus,
      receivers MUST NOT pass NAL units in which the F bit is equal
      to 1 to the decoder, when the decoder is incapable of handling
      erroneous bitstreams. Otherwise, when the decoder is capable
      of handling erroneous bitstreams, receivers SHOULD pass NAL
      units in which the F bit is equal to 1 to the decoder. When
      the F bit is set, the decoder is advised that bit errors or
      any other syntax violation may be present in the payload or in
      the NAL unit type octet. The simplest decoder reaction to a
      NAL unit in which the F bit is equal to 1 is to discard such a
      NAL unit and to conceal the lost data in the discarded NAL
      unit.

   NRI: 2 bits
      nal_ref_idc. The semantics of value 00 and a non-zero value
      remain unchanged compared to the H.264 specification. In other
      words, a value of 00 indicates that the content of the NAL
      unit is not used to reconstruct reference pictures for inter
      picture prediction. Such NAL units can be discarded without
      risking the integrity of the reference pictures. Values above
      00 indicate that the decoding of the NAL unit is required to
      maintain the integrity of the reference pictures.

      In addition to the specification above, according to this RTP
      payload specification, values of NRI greater than 00 indicate
      the relative transport priority, as determined by the encoder.
      Intelligent network elements can use this information to
      protect more important NAL units better than less important
      NAL units. 11 is the highest transport priority, followed by
      10, then by 01 and, finally, 00 is the lowest.

         Informative note: Any non-zero value of NRI is handled
         identically in H.264 decoders. Therefore, receivers need
         not manipulate the value of NRI when passing NAL units to
         the decoder.
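
   Informative example (not part of the payload format): the octet
   layout above and the type ranges of Table 1 can be sketched in a
   few lines of Python. The names NalHeader, parse_nal_octet,
   payload_structure, and may_pass_to_decoder are hypothetical
   helpers chosen for this illustration. The sketch assumes that bit
   0 in the figure is the most significant bit of the octet, and it
   reports types that Table 1 does not list (0, 30, 31) without
   interpreting them.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    """Fields of the one-octet NAL unit header (hypothetical helper type)."""
    f: int         # forbidden_zero_bit, 1 bit
    nri: int       # nal_ref_idc, 2 bits
    nal_type: int  # nal_unit_type, 5 bits


def parse_nal_octet(octet: int) -> NalHeader:
    """Split the first payload octet into F, NRI and Type (MSB first)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("a NAL unit header is exactly one octet")
    return NalHeader(f=(octet >> 7) & 0x1,
                     nri=(octet >> 5) & 0x3,
                     nal_type=octet & 0x1F)


# Packet names for the types this memo defines (Table 1).
_PACKET_NAMES = {24: "STAP-A", 25: "STAP-B", 26: "MTAP16",
                 27: "MTAP24", 28: "FU-A", 29: "FU-B"}


def payload_structure(nal_type: int) -> str:
    """Map a NAL unit type number to the payload structure of Table 1."""
    if 1 <= nal_type <= 23:
        return "single NAL unit packet"
    # Assumption: anything outside Table 1 is only reported, not handled.
    return _PACKET_NAMES.get(nal_type, "not listed in Table 1")


def may_pass_to_decoder(header: NalHeader, decoder_handles_errors: bool) -> bool:
    """Apply the F-bit rule: F=1 units reach only error-tolerant decoders."""
    return header.f == 0 or decoder_handles_errors


if __name__ == "__main__":
    hdr = parse_nal_octet(0x65)  # binary 0110 0101
    print(hdr, payload_structure(hdr.nal_type))
```

   For instance, the octet 0x7C (binary 0111 1100) parses to F=0,
   NRI=3, Type=28, which Table 1 identifies as an FU-A.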

5.4. Packetization Modes

   This memo specifies three packetization modes:

   o Single NAL unit mode
   o Non-interleaved mode
   o Interleaved mode

   The single NAL unit mode is targeted for conversational systems
   that comply with ITU-T Recommendation H.241 [16] (see section
   10.1). The non-interleaved mode is targeted for conversational
   systems that may not comply with ITU-T Recommendation H.241. In
   the non-interleaved mode NAL units are transmitted in NAL unit
   decoding order. The interleaved mode is targeted for systems that
   do not require very low end-to-end latency. The interleaved mode
   allows transmission of NAL units out of NAL unit decoding order.

   The packetization mode in use MAY be signaled by the value of the
   optional packetization-mode MIME parameter or by external means.
   The packetization mode in use governs which NAL unit types are
   allowed in RTP payloads. Table 2 summarizes the allowed NAL unit
   types for each packetization mode. "No" in the "type 1-23" row
   indicates that an RTP payload cannot contain a single NAL unit
   whose type is in the range of 1 to 23, inclusive. Packetization
   modes are explained in detail in section 6.

   Table 2. Summary of allowed NAL unit types for each packetization
   mode (yes = allowed, no = disallowed).

   Type   Packet    Single NAL    Non-Interleaved    Interleaved
                    Unit Mode          Mode             Mode
   -------------------------------------------------------------
   1-23   NAL unit     yes             yes               no
   24     STAP-A       no              yes               no
   25     STAP-B       no              no                yes
   26     MTAP16       no              no                yes
   27     MTAP24       no              no                yes
   28     FU-A         no              yes               yes
   29     FU-B         no              no                yes

5.5. Decoding Order Number (DON)

   In the interleaved packetization mode, the transmission order of
   NAL units is allowed to differ from the decoding order of the NAL
   units. Decoding order number (DON) is a field in the payload
   structure or a derived variable that indicates the NAL unit
   decoding order.
   Rationale and example use cases for transmission out of decoding
   order and for the use of DON are given in section 11.

   The coupling of transmission and decoding order is controlled by
   the optional interleaving-depth MIME parameter as follows. When
   the value of the optional interleaving-depth MIME parameter is
   equal to 0 and transmission of NAL units out of their decoding
   order is disallowed by external means, the transmission order of
   NAL units MUST conform to the NAL unit decoding order. When the
   value of the optional interleaving-depth MIME parameter is
   greater than 0 or transmission of NAL units out of their decoding
   order is allowed by external means,

   o the order of NAL units in an MTAP16 and an MTAP24 is NOT
     REQUIRED to be the NAL unit decoding order, and
   o the order of NAL units composed by decapsulating STAP-Bs,
     MTAPs, and FUs in two consecutive packets is NOT REQUIRED to be
     the NAL unit decoding order.

   The RTP payload structures for a single NAL unit packet, an
   STAP-A, and an FU-A do not include DON. STAP-B and FU-B
   structures include DON, and the structure of MTAPs enables
   derivation of DON as specified in section 5.7.2.

      Informative note: If a transmitter wants to encapsulate one
      NAL unit per packet and transmit packets out of their decoding
      order, the STAP-B packet type can be used.

   In the single NAL unit packetization mode, the transmission order
   of NAL units MUST be the same as their NAL unit decoding order.
   In the non-interleaved packetization mode, the transmission order
   of NAL units in single NAL unit packets, STAP-As, and FU-As MUST
   be the same as their NAL unit decoding order. The NAL units
   within an STAP MUST appear in the NAL unit decoding order.

      Informative note: Because H.264 allows the decoding order to
      be different from the display order, values of RTP timestamps
      may not be monotonically non-decreasing as a function of RTP
      sequence numbers.
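
   Informative example (not part of the payload format): the
   per-mode restrictions on packet types, summarized in Table 2 of
   section 5.4, lend themselves to a simple lookup. The following
   Python sketch is one possible encoding; the mode strings
   ("single-nal", "non-interleaved", "interleaved") and the function
   name is_type_allowed are hypothetical labels for this
   illustration and are not values defined by this memo.

```python
# Allowed NAL unit type numbers per packetization mode, transcribed
# from Table 2. The mode keys are hypothetical labels, not MIME values.
SINGLE_NAL_TYPES = frozenset(range(1, 24))  # types 1-23

ALLOWED_TYPES = {
    "single-nal": SINGLE_NAL_TYPES,
    "non-interleaved": SINGLE_NAL_TYPES | {24, 28},   # + STAP-A, FU-A
    "interleaved": frozenset({25, 26, 27, 28, 29}),   # STAP-B, MTAPs, FUs
}


def is_type_allowed(mode: str, nal_type: int) -> bool:
    """Return True if Table 2 allows this type in the given mode."""
    try:
        return nal_type in ALLOWED_TYPES[mode]
    except KeyError:
        raise ValueError(f"unknown packetization mode: {mode!r}") from None


if __name__ == "__main__":
    for mode in ALLOWED_TYPES:
        print(mode, sorted(t for t in range(32) if is_type_allowed(mode, t)))
```

   A receiver could use such a check to reject packets whose type
   does not match the negotiated mode; how it reacts to a mismatch
   is outside the scope of this sketch.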
   Signaling of the value of DON for NAL units carried in STAP-B,
   MTAP, and a series of fragmentation units starting with an FU-B
   is specified in sections 5.7.1, 5.7.2, and 5.8, respectively. The
   DON value of the first NAL unit in transmission order MAY be set
   to any value. Values of DON are in the range of 0 to 65535,
   inclusive. After reaching the maximum value, the value of DON
   wraps around to 0.

   The decoding order of two NAL units contained in any STAP-B,
   MTAP, or a series of fragmentation units starting with an FU-B is
   determined as follows. Let the value of DON of one NAL unit be D1
   and the value of DON of another NAL unit be D2.

   If D1 equals D2, then the two NAL units may be decoded in either
   order.

   If D1 < D2 and D2 - D1 < 32768, or if D1 > D2 and D1 - D2 >=
   32768, then the NAL unit having a value of DON equal to D1
   precedes the NAL unit having a value of DON equal to D2 in NAL
   unit decoding order.

   If D1 < D2 and D2 - D1 >= 32768, or if D1 > D2 and D1 - D2 <
   32768, then the NAL unit having a value of DON equal to D2
   precedes the NAL unit having a value of DON equal to D1 in NAL
   unit decoding order.

   Values of DON related fields (DON, DONB, and DOND, see section
   5.7) MUST be such that the decoding order determined by the
   values of DON as specified above conforms to the NAL unit
   decoding order. If the order of two consecutive NAL units in the
   NAL unit stream is switched and the new order still conforms to
   the NAL unit decoding order, the NAL units MAY have the same
   value of DON. For example, when arbitrary slice order is allowed
   by the video coding profile in use, all the coded slice NAL units
   of a coded picture are allowed to have the same value of DON.
   Consequently, NAL units having the same value of DON can be
   decoded in any order, and two NAL units having a different value
   of DON should be passed to the decoder in the order specified
   above.
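
   Informative example (not part of the payload format): the
   comparison rule for D1 and D2 given above can be transcribed
   directly into code. The Python sketch below does so literally
   rather than through a modular-arithmetic shortcut, because the
   rule is asymmetric when the two values differ by exactly 32768.
   The names don_compare and sort_by_don are hypothetical, and the
   sorting helper assumes that all values passed to it lie within a
   window of fewer than 32768 consecutive DON values, since the
   pairwise rule is not a total order over the full 16-bit range.

```python
from functools import cmp_to_key
from typing import List

DON_HALF_RANGE = 32768  # half of the 0..65535 DON value space


def don_compare(d1: int, d2: int) -> int:
    """Return -1 if D1 precedes D2, 1 if D2 precedes D1, 0 if equal.

    A literal transcription of the rule in section 5.5.
    """
    if d1 == d2:
        return 0  # either decoding order is acceptable
    if d1 < d2:
        return -1 if d2 - d1 < DON_HALF_RANGE else 1
    return -1 if d1 - d2 >= DON_HALF_RANGE else 1


def sort_by_don(dons: List[int]) -> List[int]:
    """Order DON values by decoding order (assumes a window < 32768)."""
    return sorted(dons, key=cmp_to_key(don_compare))


if __name__ == "__main__":
    # DON wraps from 65535 to 0, so 65534 and 65535 precede 0 and 1.
    print(sort_by_don([1, 65535, 0, 65534]))
```

   With the input shown, the values on either side of the
   wrap-around point are ordered 65534, 65535, 0, 1.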
When two consecutive NAL units in the NAL unit decoding order have a different value of DON, the value of DON for the second NAL unit in decoding order SHOULD be the value of DON for the first NAL unit in decoding order incremented by one.

An example decapsulation process to recover the NAL unit decoding order is given in section 7.

   Informative note: Receivers should not expect that the absolute difference of values of DON for two consecutive NAL units in the NAL unit decoding order is equal to one, even in case of error-free transmission. An increment by one is not required, because at the time of associating values of DON to NAL units, it may not be known whether all NAL units are delivered to the receiver. For example, a gateway may not forward coded slice NAL units of non-reference pictures or SEI NAL units when there is a shortage of bitrate in the network to which the packets are forwarded. In another example, a live broadcast is interrupted by pre-encoded content such as commercials from time to time. The first intra picture of a pre-encoded clip is transmitted in advance to ensure that it is readily available in the receiver. At the time of transmitting the first intra picture, the originator does not exactly know how many NAL units are going to be encoded before the first intra picture of the pre-encoded clip follows in decoding order. Thus, the values of DON for the NAL units of the first intra picture of the pre-encoded clip have to be estimated at the time of transmitting them, and gaps in values of DON may occur.

5.6. Single NAL Unit Packet

The single NAL unit packet defined here MUST contain one and only one NAL unit of the types defined in [1]. This means that neither an aggregation packet nor a fragmentation unit can be used within a single NAL unit packet. A NAL unit stream composed by decapsulating single NAL unit packets in RTP sequence number order MUST conform to the NAL unit decoding order.
The structure of the single NAL unit packet is shown in Figure 2.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                        Single NAL unit                        |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 2. RTP payload format for single NAL unit packet.

5.7. Aggregation Packets

Aggregation packets are the NAL unit aggregation scheme of this payload specification. The scheme is introduced to reflect the dramatically different MTU sizes of two key target networks -- wireline IP networks (with an MTU size that is often limited by the Ethernet MTU size -- roughly 1500 bytes), and IP or non-IP (e.g. ITU-T H.324/M) based wireless communication systems with preferred transmission unit sizes of 254 bytes or less. In order to prevent media transcoding between the two worlds, and to avoid undesirable packetization overhead, a NAL unit aggregation scheme is introduced.

Two types of aggregation packets are defined by this specification:

   o Single-time aggregation packet (STAP) aggregates NAL units with identical NALU-time. Two types of STAPs are defined, one without DON (STAP-A) and another one including DON (STAP-B).

   o Multi-time aggregation packet (MTAP) aggregates NAL units with potentially differing NALU-time. Two different MTAPs are defined that differ in the length of the NAL unit timestamp offset.

The term NALU-time is defined as the value that the RTP timestamp would have if that NAL unit were transported in its own RTP packet.

Each NAL unit to be carried in an aggregation packet is encapsulated in an aggregation unit.
Please see below for the three different aggregation units and their characteristics.

The structure of the RTP payload format for aggregation packets is presented in Figure 3.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F|NRI|  type   |                                               |
   +-+-+-+-+-+-+-+-+                                               |
   |                                                               |
   |             one or more aggregation units                     |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 3. RTP payload format for aggregation packets.

MTAPs and STAPs share the following packetization rules: The RTP timestamp MUST be set to the earliest of the NALU-times of all the NAL units to be aggregated. The type field of the NAL unit type octet MUST be set to the appropriate value as indicated in Table 3. The F bit MUST be cleared if all F bits of the aggregated NAL units are zero, otherwise it MUST be set. The value of NRI MUST be the maximum of the NRI values of all the NAL units carried in the aggregation packet.

   Table 3. Type field for STAPs and MTAPs

   Type   Packet    Timestamp offset   DON related fields
                    field length       (DON, DONB, DOND)
                    (in bits)          present
   --------------------------------------------------------
   24     STAP-A       0                 no
   25     STAP-B       0                 yes
   26     MTAP16      16                 yes
   27     MTAP24      24                 yes

The marker bit in the RTP header MUST be set to the value the marker bit of the last NAL unit of the aggregated packet would have if it were transported in its own RTP packet.

The payload of an aggregation packet consists of one or more aggregation units. See sections 5.7.1 and 5.7.2 for the three different types of aggregation units. An aggregation packet can carry as many aggregation units as necessary. However, the total amount of data in an aggregation packet MUST fit into an IP packet, and the size SHOULD be chosen such that the resulting IP packet is smaller than the MTU size.
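As an illustration of the F, NRI, and type rules above, the following Python sketch derives the first payload octet of an aggregation packet from the NAL unit type octets of the NAL units to be aggregated. The function and constant names are assumptions of this example, not part of this memo; the bit layout (F in the most significant bit, then two bits of NRI, then five bits of type) follows Figure 3.

```python
# Type values from Table 3.
STAP_A, STAP_B, MTAP16, MTAP24 = 24, 25, 26, 27


def aggregation_header(nal_type_octets, packet_type):
    """Build the F/NRI/type octet of an aggregation packet
    (hypothetical helper).

    nal_type_octets: first octet of each NAL unit to be aggregated.
    packet_type: one of the type values listed in Table 3.
    """
    f_bit = 0
    nri = 0
    for octet in nal_type_octets:
        f_bit |= (octet >> 7) & 0x01          # set if any F bit is set
        nri = max(nri, (octet >> 5) & 0x03)   # maximum NRI of all units
    return (f_bit << 7) | (nri << 5) | packet_type


# Two NAL units with NRI 3 and NRI 2, both with F = 0, in an STAP-A.
print(hex(aggregation_header([0x67, 0x41], STAP_A)))  # 0x78
```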
An aggregation packet MUST NOT contain fragmentation units specified in section 5.8.

5.7.1. Single-Time Aggregation Packet

Single-time aggregation packet (STAP) SHOULD be used whenever aggregating NAL units that share the same NALU-time. The payload of an STAP-A does not include DON and consists of at least one single-time aggregation unit as presented in Figure 4. The payload of an STAP-B consists of a 16-bit unsigned decoding order number (DON) followed by at least one single-time aggregation unit as presented in Figure 5.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   :                                               |
   +-+-+-+-+-+-+-+-+                                               |
   |                                                               |
   |                single-time aggregation units                  |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 4. Payload format for STAP-A.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   :  decoding order number (DON)  |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   |                single-time aggregation units                  |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 5. Payload format for STAP-B.

A single-time aggregation unit consists of 16-bit unsigned size information that indicates the size of the following NAL unit in bytes (excluding these two octets, but including the NAL unit type octet of the NAL unit), followed by the NAL unit itself including its NAL unit type byte. A single-time aggregation unit is byte-aligned within the RTP payload but it may not be aligned on a 32-bit word boundary. Figure 6 presents the structure of the single-time aggregation unit.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   :        NAL unit size          |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   |                           NAL unit                            |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 6. Structure for single-time aggregation unit.

The DON field specifies the value of DON for the first NAL unit in an STAP-B in transmission order. The value of DON for each successive NAL unit in appearance order in an STAP-B is equal to (the value of DON of the previous NAL unit in the STAP-B + 1) % 65536, in which '%' stands for the modulo operation.

Figure 7 presents an example of an RTP packet that contains an STAP-B. The STAP contains two single-time aggregation units, labeled as 1 and 2 in the figure.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |STAP-B NAL HDR | DON                           | NALU 1 Size   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | NALU 1 Size   | NALU 1 HDR    | NALU 1 Data                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               +
   :                                                               |
   +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               | NALU 2 Size                   | NALU 2 HDR    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       NALU 2 Data                             |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 7. An example of an RTP packet including an STAP-B and two single-time aggregation units.

5.7.2. Multi-Time Aggregation Packets (MTAPs)

The NAL unit payload of MTAPs consists of a 16-bit unsigned decoding order number base (DONB) and one or more multi-time aggregation units as presented in Figure 8.
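For comparison with the MTAP structure, the STAP-B layout of section 5.7.1 (Figures 5 to 7) can be decapsulated as sketched below in Python. This is a minimal sketch, not a normative procedure: the function name parse_stap_b is hypothetical, the 16-bit fields are assumed to be in network byte order, and RTP padding is assumed to have been removed before the payload is passed in.

```python
import struct

STAP_B = 25  # type value from Table 3


def parse_stap_b(payload: bytes):
    """Split an STAP-B RTP payload into (DON, NAL unit) pairs
    (hypothetical helper; network byte order is assumed)."""
    if len(payload) < 3 or (payload[0] & 0x1F) != STAP_B:
        raise ValueError("not an STAP-B payload")
    (don,) = struct.unpack_from("!H", payload, 1)
    offset = 3
    units = []
    while offset < len(payload):
        if offset + 2 > len(payload):
            raise ValueError("truncated NAL unit size field")
        (size,) = struct.unpack_from("!H", payload, offset)
        offset += 2
        if size == 0 or offset + size > len(payload):
            raise ValueError("invalid NAL unit size")
        units.append((don, payload[offset:offset + size]))
        offset += size
        don = (don + 1) % 65536  # DON of each successive NAL unit
    if not units:
        raise ValueError("STAP-B without aggregation units")
    return units


example = (bytes([0x60 | STAP_B]) + struct.pack("!H", 65535)
           + struct.pack("!H", 2) + b"\x67\xaa"
           + struct.pack("!H", 3) + b"\x41\x01\x02")
for don, nal in parse_stap_b(example):
    print(don, nal.hex())
```

In the example the first NAL unit carries DON 65535, so the second one wraps around to DON 0.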
DONB MUST contain the value of DON for the first NAL unit in the NAL unit decoding order among the NAL units of the MTAP.

   Informative note: The first NAL unit in the NAL unit decoding order is not necessarily the first NAL unit in the order the NAL units are encapsulated in an MTAP.

The choice between the different MTAP types (MTAP16 and MTAP24) is application dependent -- the larger the timestamp offset is, the higher is the flexibility of the MTAP, but the higher is also the overhead.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   :  decoding order number base   |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   |                 multi-time aggregation units                  |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 8. NAL unit payload format for MTAPs.

Two different multi-time aggregation units are defined in this specification. Both of them consist of 16-bit unsigned size information of the following NAL unit, an 8-bit unsigned decoding order number delta (DOND), and n bits of timestamp offset (TS offset) for this NAL unit, whereby n can be 16 or 24. The structures of the multi-time aggregation units for MTAP16 and MTAP24 are presented in Figure 9 and Figure 10 respectively. Note that the starting or ending position of an aggregation unit within a packet is NOT REQUIRED to be on a 32-bit word boundary. DON of the following NAL unit is equal to (DONB + DOND) % 65536, in which % denotes the modulo operation. This memo does not specify how the NAL units within an MTAP are ordered, but, in most cases, NAL unit decoding order SHOULD be used.

The timestamp offset field MUST be set to a value equal to the value of the following formula: (the NALU-time of the NAL unit - the RTP timestamp of the packet).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :        NAL unit size          |      DOND     |  TS offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  TS offset    |                                               |
   +-+-+-+-+-+-+-+-+              NAL unit                         |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 9. Multi-time aggregation unit for MTAP16

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :        NAL unit size          |      DOND     |  TS offset    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |         TS offset             |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                           NAL unit                            |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 10. Multi-time aggregation unit for MTAP24

For the "earliest" multi-time aggregation unit in an MTAP the timestamp offset MUST be zero. Hence, the RTP timestamp of the MTAP itself is identical to the earliest NALU-time.

   Informative note: The "earliest" multi-time aggregation unit is the one that would have the smallest RTP timestamp among all the aggregation units of an MTAP if the aggregation units were encapsulated in single NAL unit packets. Such an "earliest" aggregation unit may not be the first one in the order the aggregation units are encapsulated in an MTAP. The "earliest" NAL unit need not be the same as the first NAL unit in the NAL unit decoding order either.

Figure 11 presents an example of an RTP packet that contains a multi-time aggregation packet of type MTAP16 that contains two multi-time aggregation units, labeled as 1 and 2 in the figure.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |MTAP16 NAL HDR |  decoding order number base   | NALU 1 Size   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  NALU 1 Size  |  NALU 1 DOND  |       NALU 1 TS offset        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  NALU 1 HDR   |  NALU 1 DATA                                  |
   +-+-+-+-+-+-+-+-+                                               +
   :                                                               |
   +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               | NALU 2 SIZE                   |  NALU 2 DOND  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       NALU 2 TS offset        |  NALU 2 HDR   |  NALU 2 DATA  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 11. An example of an RTP packet including a multi-time aggregation packet of type MTAP16 and two multi-time aggregation units.

5.8. Fragmentation Units (FUs)

This payload type allows fragmenting a NAL unit into several RTP packets. Doing so on the application layer instead of relying on lower layer fragmentation (e.g. by IP) has the following advantages:

   o The payload format is capable of transporting, over an IPv4 network, NAL units bigger than 64 kbytes. Such NAL units may be present in pre-recorded video, particularly in High Definition formats (there is a limit on the number of slices per picture, which results in a limit on the number of NAL units per picture, which may result in big NAL units).

   o The fragmentation mechanism allows fragmenting a single picture and applying generic forward error correction as described in section 10.5.

Fragmentation is defined only for a single NAL unit, and not for any aggregation packets. A fragment of a NAL unit consists of an integer number of consecutive octets of that NAL unit.
Each octet of the NAL unit MUST be part of exactly one fragment of that NAL unit. Fragments of the same NAL unit MUST be sent in consecutive order with ascending RTP sequence numbers (with no other RTP packets within the same RTP packet stream being sent between the first and last fragment). Similarly, a NAL unit MUST be reassembled in RTP sequence number order.

When a NAL unit is fragmented and conveyed within fragmentation units (FUs), it is referred to as a fragmented NAL unit. STAPs and MTAPs MUST NOT be fragmented. FUs MUST NOT be nested, i.e., an FU MUST NOT contain another FU.

The RTP timestamp of an RTP packet carrying an FU is set to the NALU-time of the fragmented NAL unit.

Figure 12 presents the RTP payload format for FU-As. An FU-A consists of a fragmentation unit indicator of one octet, a fragmentation unit header of one octet, and a fragmentation unit payload.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | FU indicator  |   FU header   |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                                                               |
   |                         FU payload                            |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 12. RTP payload format for FU-A.

Figure 13 presents the RTP payload format for FU-Bs. An FU-B consists of a fragmentation unit indicator of one octet, a fragmentation unit header of one octet, a decoding order number (DON), and a fragmentation unit payload. In other words, the structure of FU-B is the same as the structure of FU-A except for the additional DON field.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | FU indicator  |   FU header   |               DON             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                         FU payload                            |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 13. RTP payload format for FU-B.

NAL unit type FU-B MUST be used in the interleaved packetization mode for the first fragmentation unit of a fragmented NAL unit. NAL unit type FU-B MUST NOT be used in any other case.

The FU indicator octet has the following format:

   +---------------+
   |0|1|2|3|4|5|6|7|
   +-+-+-+-+-+-+-+-+
   |F|NRI|  Type   |
   +---------------+

Values equal to 28 and 29 in the Type field of the FU indicator octet identify an FU-A and an FU-B respectively. The use of the F bit is described in section 1.3. The value of the NRI field MUST be set according to the value of the NRI field in the fragmented NAL unit.

The FU header has the following format:

   +---------------+
   |0|1|2|3|4|5|6|7|
   +-+-+-+-+-+-+-+-+
   |S|E|R|  Type   |
   +---------------+

   S: 1 bit
      The Start bit, when one, indicates the start of a fragmented NAL unit. Otherwise, when the following FU payload is not the start of a fragmented NAL unit payload, the Start bit is set to zero.

   E: 1 bit
      The End bit, when one, indicates the end of a fragmented NAL unit, i.e., the last byte of the payload is also the last byte of the fragmented NAL unit. Otherwise, when the following FU payload is not the last fragment of a fragmented NAL unit, the End bit is set to zero.

   R: 1 bit
      The Reserved bit MUST be equal to 0 and MUST be ignored by the receiver.

   Type: 5 bits
      The NAL unit payload type as defined in table 7-1 of [1].
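The FU indicator and FU header octets can be illustrated with a short Python sketch that splits a NAL unit into FU-A payloads and reassembles it. The function names are hypothetical, and two points are assumptions of this example rather than statements of this memo: the F bit of the FU indicator is simply copied from the fragmented NAL unit, and the NAL unit type octet is carried only through the FU indicator and FU header, not in the FU payload.

```python
FU_A = 28  # Type value of the FU indicator for an FU-A


def fragment_fu_a(nal_unit: bytes, max_fragment: int):
    """Split one NAL unit into FU-A payloads (hypothetical helper)."""
    header, body = nal_unit[0], nal_unit[1:]
    if max_fragment < 1 or len(body) <= max_fragment:
        # A fragmented NAL unit needs at least two FUs, because the
        # Start and End bits must not both be set in one FU header.
        raise ValueError("NAL unit too small to fragment")
    indicator = (header & 0xE0) | FU_A   # F and NRI copied, Type = 28
    chunks = [body[i:i + max_fragment]
              for i in range(0, len(body), max_fragment)]
    payloads = []
    for index, chunk in enumerate(chunks):
        start = 0x80 if index == 0 else 0
        end = 0x40 if index == len(chunks) - 1 else 0
        fu_header = start | end | (header & 0x1F)  # R bit stays 0
        payloads.append(bytes([indicator, fu_header]) + chunk)
    return payloads


def reassemble_fu_a(payloads):
    """Rebuild the NAL unit from FU-A payloads given in RTP sequence
    number order with no fragment missing (hypothetical helper)."""
    first = payloads[0]
    header = (first[0] & 0xE0) | (first[1] & 0x1F)
    return bytes([header]) + b"".join(p[2:] for p in payloads)


nal = bytes([0x65]) + bytes(range(10))
fragments = fragment_fu_a(nal, 4)
print([f[:2].hex() for f in fragments])
print(reassemble_fu_a(fragments) == nal)
```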
The value of DON in FU-Bs is selected as described in section 5.5.

   Informative note: The DON field in FU-Bs allows gateways to fragment NAL units to FU-Bs without organizing the incoming NAL units to the NAL unit decoding order.

A fragmented NAL unit MUST NOT be transmitted in one FU, i.e., Start bit and End bit MUST NOT both be set to one in the same FU header.

The FU payload consists of fragments of the payload of the fragmented NAL unit such that if the fragmentation unit payloads of consecutive FUs are sequentially concatenated, the payload of the fragmented NAL unit is reconstructed. Note that the NAL unit type octet of the fragmented NAL unit is not included as such in the fragmentation unit payload, but rather the information of the NAL unit type octet of the fragmented NAL unit is conveyed in F and NRI fields of the FU indicator octet of the fragmentation unit and in the type field of the FU header. An FU payload MAY have any number of octets and MAY be empty.

If a fragmentation unit is lost, the receiver SHOULD discard all following fragmentation units in transmission order corresponding to the same fragmented NAL unit.

6. Packetization Rules

The packetization modes are introduced in section 5.4. The packetization rules that are common to more than one of the packetization modes are specified in section 6.1. The packetization rules for the single NAL unit mode, the non-interleaved mode, and the interleaved mode are specified in sections 6.2, 6.3, and 6.4 respectively.

6.1. Common Packetization Rules
All senders MUST enforce the following packetization rules regardless of the packetization mode in use:

   o Coded slice NAL units or coded slice data partition NAL units belonging to the same coded picture (and hence sharing the same RTP timestamp value) MAY be sent in any order permitted by the applicable profile defined in [1], although, for delay-critical systems, they SHOULD be sent in their original coding order to minimize the delay. Note that the coding order is not necessarily the scan order, but the order in which the NAL units become available to the RTP stack.

   o Sequence and picture parameter set NAL units MUST NOT be sent in an RTP session whose parameter sets were already changed by control protocol messages during the lifetime of the RTP session.

   o Network elements such as gateways MUST NOT duplicate any NAL unit except for sequence or picture parameter set NAL units, because neither this memo nor the H.264 specification provides means to identify duplicated NAL units. Sequence and picture parameter set NAL units MAY be duplicated to make their correct reception more probable, but any such duplication MUST NOT affect the contents of any active sequence or picture parameter set.

Senders using the non-interleaved mode or the interleaved mode MUST enforce the following packetization rule:

   o Network elements such as gateways MAY convert single NAL unit packets into one aggregation packet, convert an aggregation packet into several single NAL unit packets, or mix both concepts. However, when doing so they SHOULD take into account at least the following parameters: path MTU size, unequal protection mechanisms (e.g. through packet-based FEC according to RFC 2733 [21], carried by RFC 2198 [20], especially for sequence and picture parameter set NAL units and coded slice data partition A NAL units), bearable latency of the system, and buffering capabilities of the receiver.
6.2. Single NAL Unit Mode

This mode is in use when the value of the optional packetization-mode MIME parameter is equal to 0 or packetization-mode is not present or no other packetization mode is signaled by external means. All receivers MUST support this mode. It is primarily intended for low-delay applications that are compatible with systems using ITU-T Recommendation H.241 [16] (see section 10.1). Only single NAL unit packets MAY be used in this mode. STAPs, MTAPs, and FUs MUST NOT be used. The transmission order of single NAL unit packets MUST comply with the NAL unit decoding order.

6.3. Non-Interleaved Mode

This mode is in use when the value of the optional packetization-mode MIME parameter is equal to 1 or the mode is turned on by external means. This mode SHOULD be supported. It is primarily intended for low-delay applications. Only single NAL unit packets, STAP-As and FU-As MAY be used in this mode. STAP-Bs, MTAPs, and FU-Bs MUST NOT be used. The transmission order of NAL units MUST comply with the NAL unit decoding order.

6.4. Interleaved Mode

This mode is in use when the value of the optional packetization-mode MIME parameter is equal to 2 or the mode is turned on by external means. Some receivers MAY support this mode. STAP-Bs, MTAPs, FU-As, and FU-Bs MAY be used. STAP-As and single NAL unit packets MUST NOT be used. The transmission order of packets and NAL units is constrained as specified in section 5.5.

7. De-Packetization Process (Informative)

The de-packetization process is implementation dependent. Hence, the following description should be seen as an example of a suitable implementation. Other schemes may be used as well. Optimizations relative to the described algorithms are likely possible. Section 7.1 presents the de-packetization process for the single NAL unit and non-interleaved packetization modes, whereas section 7.2 describes the process for the interleaved mode. Section 7.3 includes additional decapsulation guidelines for intelligent receivers.

7.1. Single NAL Unit and Non-Interleaved Mode

The receiver includes a receiver buffer to compensate for transmission delay jitter. The receiver stores incoming packets in reception order into the receiver buffer. Packets are decapsulated in RTP sequence number order. If a decapsulated packet is a single NAL unit packet, the NAL unit contained in the packet is passed to the decoder immediately after decapsulation. If a decapsulated packet is an STAP-A, the NAL units contained in the packet are passed to the decoder in the order they are encapsulated in the packet immediately after decapsulation. If a decapsulated packet is an FU-A, all the fragments of the fragmented NAL unit are concatenated and passed to the decoder.

   Note: If the decoder supports Arbitrary Slice Order, coded slices of a picture can be passed to the decoder in any order regardless of their reception and transmission order.

7.2. Interleaved Mode

The general concept behind these de-packetization rules is to reorder NAL units from transmission order to the NAL unit decoding order.

The receiver includes a receiver buffer, which is used to reorder packets from transmission order to the NAL unit decoding order. The receiver may use the following guidelines when determining the size of the receiver buffer. The optional interleaving-depth MIME parameter indicates the size of the receiver buffer as the number of VCL NAL units. The size of the receiver buffer in bytes may be estimated by multiplying the optional init-buf-time MIME parameter with the bandwidth-value of the corresponding media level SDP description parameter, if available (and taking into account the necessary conversions between the units of init-buf-time and bandwidth-value).
The receiver should also take buffering for transmission delay jitter into account and either reserve a separate buffer for transmission delay jitter buffering or combine the buffer for transmission delay jitter with the receiver buffer.

The receiver stores incoming NAL units in reception order into the receiver buffer as follows. NAL units of aggregation packets are stored into the receiver buffer individually. The value of DON is calculated and stored for all NAL units.

Hereinafter, let N be the value of the optional interleaving-depth MIME parameter (see section 8.1) incremented by 1. Furthermore, let function AbsDON be the same as specified in section 8.1 and function don_diff(m,n) be specified as follows:

   If DON(m) == DON(n), don_diff(m,n) = 0

   If (DON(m) < DON(n) and DON(n) - DON(m) < 32768),
   don_diff(m,n) = DON(n) - DON(m)

   If (DON(m) > DON(n) and DON(m) - DON(n) >= 32768),
   don_diff(m,n) = 65536 - DON(m) + DON(n)

   If (DON(m) < DON(n) and DON(n) - DON(m) >= 32768),
   don_diff(m,n) = - (DON(m) + 65536 - DON(n))

   If (DON(m) > DON(n) and DON(m) - DON(n) < 32768),
   don_diff(m,n) = - (DON(m) - DON(n))

where DON(i) is the decoding order number of the NAL unit having index i in the transmission order. The decoding order number is specified in section 5.5 of this RTP payload specification.

A positive value of don_diff(m,n) indicates that the NAL unit having transmission order index n follows, in decoding order, the NAL unit having transmission order index m.

There are two buffering states in the receiver: initial buffering and buffering while playing. Initial buffering occurs when the RTP session is initialized. After initial buffering, decoding and playback are started and the buffering-while-playing mode is used.
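The don_diff function defined in this section translates directly into code. The following Python sketch follows the five cases literally; taking the two DON values as arguments instead of transmission order indices is a simplification assumed by this example.

```python
def don_diff(don_m: int, don_n: int) -> int:
    """Signed DON distance from NAL unit m to NAL unit n, following
    the case analysis of don_diff(m,n) (arguments are DON values)."""
    if don_m == don_n:
        return 0
    if don_m < don_n:
        gap = don_n - don_m
        if gap < 32768:
            return gap
        return -(don_m + 65536 - don_n)
    gap = don_m - don_n
    if gap >= 32768:
        return 65536 - don_m + don_n
    return -gap


# n follows m across the wraparound point, so the result is positive.
print(don_diff(65535, 1))   # 2
print(don_diff(1, 65535))   # -2
print(don_diff(20, 10))     # -10
```

A shorter formulation based on modular arithmetic is possible, but it has to treat a distance of exactly 32768 specially, because the definition above yields +32768 when DON(m) > DON(n) and -32768 when DON(m) < DON(n).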
Initial buffering lasts until one of the following conditions is fulfilled:

   o There are N VCL NAL units in the receiver buffer.

   o If max-don-diff is present, don_diff(m,n) is greater than the value of max-don-diff, in which n corresponds to the NAL unit having the greatest value of AbsDON among the received NAL units and m corresponds to the NAL unit having the smallest value of AbsDON among the received NAL units.

   o Initial buffering has lasted for the duration equal to or greater than the value of the optional init-buf-time MIME parameter.

The NAL units to be removed from the receiver buffer are determined as follows:

   o If the receiver buffer contains at least N VCL NAL units, NAL units are removed from the receiver buffer and passed to the decoder in the order specified below until the buffer contains N-1 VCL NAL units.

   o If max-don-diff is present, all NAL units m for which don_diff(m,n) is greater than max-don-diff are removed from the receiver buffer and passed to the decoder in the order specified below. Herein, n corresponds to the NAL unit having the greatest value of AbsDON among the received NAL units.

   o Variable ts is set to the value of a system timer that was initialized to 0 when the first packet of the NAL unit stream was received. If the receiver buffer contains a NAL unit whose reception time tr fulfills the condition that ts - tr > init-buf-time, NAL units are passed to the decoder (and removed from the receiver buffer) in the order specified below until the receiver buffer contains no NAL unit whose reception time tr fulfills the specified condition.

Note that transmission delay jitter should be taken into account in the calculations with timestamps.

The order in which NAL units are passed to the decoder is specified as follows:

   o Let PDON be a variable that is initialized to 0 at the beginning of an RTP session.
   o For each NAL unit associated with a value of DON, a DON distance is calculated as follows. If the value of DON of the NAL unit is larger than the value of PDON, the DON distance is equal to DON - PDON. Otherwise, the DON distance is equal to 65535 - PDON + DON + 1.

   o NAL units are delivered to the decoder in ascending order of DON distance. If several NAL units share the same value of DON distance, they can be passed to the decoder in any order.

   o When a desired number of NAL units have been passed to the decoder, the value of PDON is set to the value of DON for the last NAL unit passed to the decoder.

7.3. Additional De-Packetization Guidelines

The following additional de-packetization rules may be used to implement an operational H.264 de-packetizer:

   o Intelligent RTP receivers (e.g. in gateways) may identify lost coded slice data partitions A (DPAs). If a lost DPA is found, a gateway may decide not to send the corresponding coded slice data partitions B and C, as their information is meaningless for H.264 decoders. In this way a network element can reduce network load by discarding useless packets, without parsing a complex bitstream.

   o Intelligent RTP receivers (e.g. in gateways) may identify lost FUs. If a lost FU is found, a gateway may decide not to send the following FUs of the same NAL unit, as their information is meaningless for H.264 decoders. In this way a network element can reduce network load by discarding useless packets, without parsing a complex bitstream.

   o Intelligent receivers may discard all packets in which the value of the NRI field of the NAL unit type octet is equal to 0. However, they should process those packets if possible, because the user experience may suffer if the packets are discarded.

8. Payload Format Parameters

This section specifies the parameters that MAY be used to select optional features of the payload format. The parameters are specified here as part of the MIME subtype registration for the ITU-T H.264 | ISO/IEC 14496-10 codec. A mapping of the parameters into the Session Description Protocol (SDP) [5] is also provided for those applications that use SDP. Equivalent parameters could be defined elsewhere for use with control protocols that do not use MIME or SDP.

8.1. MIME Registration

The MIME subtype for the ITU-T H.264 | ISO/IEC 14496-10 codec is allocated from the IETF tree.

The receiver MUST ignore any unspecified parameter.

Media Type name: video

Media subtype name: H264

Required parameters: none

Optional parameters:

   profile-level-id: A string of profile-level elements without any delimiters, in which each profile-level element is a base16 [6] (hexadecimal) representation of the following three bytes in the sequence parameter set NAL unit specified in [1]: 1) profile_idc, 2) a byte herein referred to as profile-iop, composed of the values of constrained_set0_flag, constrained_set1_flag, constrained_set2_flag, and reserved_zero_5bits in bit-significance order starting from the most significant bit, and 3) level_idc. Note that reserved_zero_5bits is required to be equal to 0 in [1], but other values for it may be specified in the future by ITU-T or ISO/IEC.

   If the profile-level-id parameter is used for indicating properties of a NAL unit stream, it indicates the profiles that are in use in the stream and the highest level that is in use for each signaled profile. The profile-iop byte for each signaled profile indicates whether the NAL unit stream also obeys all constraints of the indicated profiles as follows.
If bit 7 (the most significant bit), bit 6, or bit 5 of profile-iop is equal to 1, all constraints of the Baseline profile, the Main profile, or the Extended profile, respectively, are obeyed in the NAL unit stream.

If the profile-level-id parameter is used for capability exchange or session setup procedure, it indicates the profiles that the codec supports and the highest level that is supported for each signaled profile. The profile-iop byte for each signaled profile indicates whether the codec has such additional limitations that only the common subset of the algorithmic features and limitations of the profiles signaled with the profile-iop byte and the profile indicated by profile_idc is supported by the codec.

For example, if a codec supports the Baseline Profile at level 3 and below and the Main Profile at level 2.1 and below without any additional limitations, the profile-level-id becomes 42A01E4D4015. If a codec supports only the common subset of the coding tools of the Baseline profile and the Main profile at level 2.1 and below, the profile-level-id becomes 42E015.

If no profile-level-id is present, the Baseline Profile without additional constraints at Level 1 MUST be implied.

parameter-sets: This parameter MAY be used to convey such sequence and picture parameter set NAL units, herein referred to as the initial parameter set NAL units, that MUST precede any other NAL units in decoding order. The parameter MUST NOT be used to indicate codec capability in any capability exchange procedure. The value of the parameter is the base64 [6] representation of the initial parameter set NAL units as specified in sections 7.3.2.1 and 7.3.2.2 of [1]. The parameter sets are conveyed in decoding order and no framing of the parameter set NAL units takes place. A comma is used to separate any pair of parameter sets in the list.
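The construction of a parameter-sets value described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the two byte strings are placeholders rather than real sequence or picture parameter sets.

```python
import base64


def build_parameter_sets(nal_units):
    """Base64-encode each parameter set NAL unit and join them with commas.

    The NAL units are expected in decoding order, without any framing.
    """
    return ",".join(base64.b64encode(nal).decode("ascii") for nal in nal_units)


def parse_parameter_sets(value):
    """Split a parameter-sets value at commas and decode each element."""
    return [base64.b64decode(item) for item in value.split(",") if item]


# Placeholder bytes for illustration only (hypothetical, not valid parameter sets).
sps = bytes([0x67, 0x42, 0xA0, 0x1E])
pps = bytes([0x68, 0xCE, 0x38, 0x80])

value = build_parameter_sets([sps, pps])
recovered = parse_parameter_sets(value)
```

Because the comma is not part of the standard base64 alphabet, splitting the value at commas cannot cut through an encoded NAL unit.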
Note that the number of bytes in a parameter set NAL unit is typically less than 10 bytes, but a picture parameter set NAL unit can contain several hundred bytes.

packetization-mode: When the value of packetization-mode is equal to 0 or packetization-mode is not present, single NAL unit packets MUST be present in the stream, but STAPs, MTAPs, and FUs MUST NOT be present in the stream. This mode is in use in standards using ITU-T Recommendation H.241 [16] (see section 10.1). When the value of packetization-mode is equal to 1, single NAL unit packets, STAP-As, and FU-As MAY be present in the stream, but STAP-Bs, MTAPs, and FU-Bs MUST NOT be present in the stream. When the value of packetization-mode is equal to 2, STAP-Bs, MTAPs, FU-As, and FU-Bs MAY be present in the stream, but single NAL unit packets and STAP-As MUST NOT be present in the stream. The value of packetization-mode MUST be an integer in the range of 0 to 2, inclusive.

interleaving-depth: This parameter MUST NOT be present when packetization-mode is not present or the value of packetization-mode is equal to 0 or 1. This parameter MUST be present when the value of packetization-mode is equal to 2.

This parameter signals the properties of a NAL unit stream or the capabilities of a receiver implementation. When the parameter is used to signal the properties of a NAL unit stream, it specifies the maximum number of VCL NAL units that precede any VCL NAL unit in the NAL unit stream in transmission order and follow the VCL NAL unit in decoding order. Consequently, it is guaranteed that receivers can reconstruct NAL unit decoding order when the buffer size for NAL unit decoding order recovery is at least the value of interleaving-depth + 1 in terms of VCL NAL units.
When the parameter is used to signal the capabilities of a receiver implementation, the receiver is able to correctly reconstruct the NAL unit decoding order of NAL unit streams that are characterized by the same value of interleaving-depth. When the receiver buffers a number of VCL NAL units that is equal to or greater than the value of interleaving-depth, it is able to reconstruct NAL unit decoding order from the transmission order.

If the parameter is not present, then a value of 0 MUST be used for interleaving-depth. The value of interleaving-depth MUST be an integer in the range of 0 to 32767, inclusive.

init-buf-time: This parameter MAY be used to signal the properties of a NAL unit stream or the capabilities of a receiver implementation.

When the parameter is used to signal the properties of a NAL unit stream, it signals the initial buffering time that a receiver MUST buffer before starting decoding to recover the NAL unit decoding order from the transmission order. The parameter is the maximum value of (transmission time of a NAL unit - decoding time of the NAL unit) assuming reliable and instantaneous transmission, the same timeline for transmission and decoding, and starting of decoding when the first packet arrives.

An example of specifying the value of init-buf-time follows: A NAL unit stream is sent in the following interleaved order, in which the value corresponds to the decoding time and the transmission order is from left to right:

0  2  1  3  5  4  6  8  7 ...

Assuming a steady transmission rate of NAL units, the transmission times are:

0  1  2  3  4  5  6  7  8 ...

Subtracting the decoding time from the transmission time column-wise results in the following series:

0 -1  1  0 -1  1  0 -1  1 ...

Thus, the value of init-buf-time in this example is 1 in terms of intervals of NAL unit transmission times.
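The arithmetic of this example can be reproduced with a short sketch. The function name is hypothetical, and the conversion to 90-kHz clock ticks assumes a rate of 30 NAL units per second, which is an assumption made here for illustration and is not taken from the example.

```python
def max_transmission_lead(decoding_times):
    """Return the largest (transmission time - decoding time) and all differences.

    Assumes the i-th NAL unit in transmission order is sent at time i,
    i.e., a steady transmission rate of one NAL unit per interval.
    """
    differences = [tx_time - dec_time
                   for tx_time, dec_time in enumerate(decoding_times)]
    return max(differences), differences


# Decoding times listed in transmission order, as in the example.
decoding_times = [0, 2, 1, 3, 5, 4, 6, 8, 7]
intervals, differences = max_transmission_lead(decoding_times)

# Hypothetical conversion: at 30 NAL units per second (an assumption),
# one interval lasts 90000 / 30 = 3000 ticks of a 90-kHz clock.
TICKS_PER_INTERVAL = 90000 // 30
init_buf_time = intervals * TICKS_PER_INTERVAL
```

Here intervals evaluates to 1, matching the value derived in the example, and the differences list reproduces the series 0, -1, 1, 0, -1, 1, 0, -1, 1.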
When the parameter is used to signal the capabilities of a receiver implementation, it signals the duration of initial buffering that the receiver is capable of handling in any circumstances.

The parameter is coded as a decimal representation in clock ticks of a 90-kHz clock. If the parameter is not present, then a value of 0 MUST be used for init-buf-time. The value of init-buf-time MUST be an integer in the range of 0 to 4 294 967 295, inclusive.

Receivers SHOULD take transmission delay jitter buffering, including buffering for the delay jitter caused by mixers, translators, gateways, proxies, traffic-shapers, and other network elements, into account in addition to the signaled init-buf-time.

max-don-diff: This parameter MAY be used to signal the properties of a NAL unit stream. It MUST NOT be used to signal transmitter or receiver or codec capabilities. The parameter MUST NOT be present if the value of packetization-mode is equal to 0 or 1. max-don-diff is an integer in the range of 0 to 32767, inclusive. If max-don-diff is not present, the value of the parameter is unspecified. max-don-diff is calculated as follows:

max-don-diff = max{AbsDON(i) - AbsDON(j)}, for any i and any j>i,

where i and j indicate the index of the NAL unit in the transmission order and AbsDON denotes such decoding order number of the NAL unit that does not wrap around to 0 after 65535. In other words, AbsDON is calculated as follows: Let m and n be consecutive NAL units in transmission order. For the very first NAL unit in transmission order (whose index is 0), AbsDON(0) = DON(0). For other NAL units, AbsDON is calculated as follows:
If DON(m) == DON(n),
AbsDON(n) = AbsDON(m)

If (DON(m) < DON(n) and DON(n) - DON(m) < 32768),
AbsDON(n) = AbsDON(m) + DON(n) - DON(m)

If (DON(m) > DON(n) and DON(m) - DON(n) >= 32768),
AbsDON(n) = AbsDON(m) + 65536 - DON(m) + DON(n)

If (DON(m) < DON(n) and DON(n) - DON(m) >= 32768),
AbsDON(n) = AbsDON(m) - (DON(m) + 65536 - DON(n))

If (DON(m) > DON(n) and DON(m) - DON(n) < 32768),
AbsDON(n) = AbsDON(m) - (DON(m) - DON(n))

where DON(i) is the decoding order number of the NAL unit having index i in the transmission order. The decoding order number is specified in section 5.5 of this RTP payload specification.

Informative note: Receivers MAY use max-don-diff to determine which NAL units in the receiver buffer can be passed to the decoder.

Encoding considerations: This type is only defined for transfer via RTP (RFC 3550).

Security considerations: See section 9 of RFC XXXX. [Ed.Note: to be replaced with the RFC number of this specification]

Public specification: Please refer to RFC XXXX [Ed.Note: to be replaced with the RFC number of this specification] and its section 15.

Additional information: None

File extensions: none

Macintosh file type code: none

Object identifier or OID: none

Person & email address to contact for further information: stewe@cs.tu-berlin.de

Intended usage: COMMON.

Author/Change controller: stewe@cs.tu-berlin.de
IETF Audio/Video transport working group

8.2. SDP Parameters

The MIME media type video/H264 string is mapped to fields in the Session Description Protocol (SDP) [5] as follows:

o The media name in the "m=" line of SDP MUST be video.

o The encoding name in the "a=rtpmap" line of SDP MUST be H264 (the MIME subtype).

o The clock rate in the "a=rtpmap" line MUST be 90000.
o The optional parameters "profile-level-id", "parameter-sets", "packetization-mode", "interleaving-depth", "init-buf-time", and "max-don-diff", if any, SHALL be included in the "a=fmtp" line of SDP. These parameters are expressed as a MIME media type string, in the form of a semicolon-separated list of parameter=value pairs.

An example of media representation in SDP is as follows (Baseline Profile, Level 3.0, more than one slice group, arbitrary slice ordering, and redundant slices are in use):

m=video 49170/2 RTP/AVP 98
a=rtpmap:98 H264/90000
a=fmtp:98 profile-level-id=42A01E

9. Security Considerations

RTP packets using the payload format defined in this specification are subject to the security considerations discussed in the RTP specification [4], and any appropriate RTP profile (for example [17]). This implies that confidentiality of the media streams is achieved by encryption. Because the data compression used with this payload format is applied end-to-end, encryption may be performed after compression so there is no conflict between the two operations.

A potential denial-of-service threat exists for data encodings using compression techniques that have non-uniform receiver-end computational load. The attacker can inject pathological datagrams into the stream that are complex to decode and cause the receiver to be overloaded. H.264 is particularly vulnerable to such attacks because it is extremely simple to generate datagrams containing NAL units that affect the decoding process of many future NAL units.

As with any IP-based protocol, in some circumstances a receiver may be overloaded simply by the receipt of too many packets, either desired or undesired. Network-layer authentication may be used to discard packets from undesired sources, but the processing cost of the authentication itself may be too high.
In a multicast environment, pruning of specific sources may be implemented in future versions of IGMP [18] and in multicast routing protocols to allow a receiver to select which sources are allowed to reach it.

Decoders MUST exercise caution with respect to the handling of user data SEI messages, particularly if they contain active elements, and MUST restrict their domain of applicability to the presentation containing the stream.

10. Informative Appendix: Application Examples

This payload specification is very flexible in its use, to cover the extremely wide application space that is anticipated for H.264. However, such great flexibility also makes it difficult for an implementer to decide on a reasonable packetization scheme. Some information on how to apply this specification to real-world scenarios is likely to appear in the form of academic publications and a test model software and description in the near future. However, some preliminary usage scenarios are described here as well.

10.1. Video Telephony according to ITU-T Recommendation H.241 Annex A

H.323-based video telephony systems that use H.264 as an optional video compression scheme are required to support H.241 Annex A [16] as a packetization scheme. The packetization mechanism defined in this Annex is technically identical with a small subset of this specification.

When operating according to H.241 Annex A, parameter set NAL units are sent in-band. Only Single NAL unit packets are used. A typical packet stream generated by such a system consists of all sequence and picture parameter sets used for the future video sequence, possibly sent in more than one copy to raise the likelihood of their arrival at the receiver, followed by the packets carrying the NAL units of the IDR picture, and followed by packets carrying the subsequent pictures.
Many such systems do not send IDR pictures regularly, but only when required by user interaction or by control protocol means, e.g., when switching between video channels in a Multipoint Control Unit.

10.2. Video Telephony, No Slice Data Partitioning, No NAL Unit Aggregation

The RTP part of this scheme is implemented and tested (though not the control-protocol part, see below).

In most real-world video telephony applications, the picture parameters such as picture size or optional modes never change during the lifetime of a connection. Hence, all necessary parameter sets (usually only one) are sent as a side effect of the capability exchange/announcement process, e.g., according to the SDP syntax specified in section 8.2 of this document. Since all necessary parameter set information is established before the RTP session starts, there is no need for sending any parameter set NAL units. Slice data partitioning is not used either. Hence, the RTP packet stream consists basically of NAL units that carry single coded slices.

The encoder chooses the size of coded slice NAL units such that they offer the best performance. Often, this is done by adapting the coded slice size to the MTU size of the IP network. For small picture sizes this may result in a one-picture-per-one-packet strategy. Intra refresh algorithms clean up the loss of packets and the resulting drift-related artifacts.

10.3. Video Telephony, Interleaved Packetization Using NAL Unit Aggregation

This scheme allows better error concealment and is widely used in H.263-based designs using RFC 2429 packetization [11]. It is also implemented and good results were reported [13].

The VCL encoder codes the source picture such that all macroblocks (MBs) of one MB line are assigned to one slice. All slices with even MB row addresses are combined into one STAP, and all slices with odd MB row addresses into another STAP.
Those STAPs are transmitted as RTP packets. The establishment of the parameter sets is performed as discussed above.

Note that the use of STAPs is essential here, because the high number of individual slices (18 for a CIF picture) would lead to unacceptably high IP/UDP/RTP header overhead (unless the source coding tool FMO is used, which is not assumed in this scenario). Furthermore, some wireless video transmission systems, such as H.324M and the IP-based video telephony specified in 3GPP, are likely to use relatively small transport packet sizes. For example, a typical MTU size of an H.223 AL3 SDU is around 100 bytes [19]. Coding individual slices according to this packetization scheme provides a further advantage in communication between wired and wireless networks, as individual slices are likely to be smaller than the preferred maximum packet size of wireless systems. Consequently, a gateway can convert the STAPs used in a wired network to several RTP packets with only one NAL unit that are preferred in a wireless network and vice versa.

10.4. Video Telephony, with Data Partitioning

This scheme is implemented and was shown to offer good performance especially at higher packet loss rates [13].

Data Partitioning is known to be useful only when some form of unequal error protection is available. Normally, in single-session RTP environments, even error characteristics are assumed, i.e., the packet loss probability of all packets of the session is the same statistically. However, there are means to reduce the packet loss probability of individual packets in an RTP session. RFC 2198 [20], for example, allows carrying a redundant copy of an essential packet in the next RTP packet. Packet-based Forward Error Correction [21] carried in RFC 2198 is also an appropriate means to protect high priority information.
In all cases, the incurred overhead is substantial, but in the same order of magnitude as the number of bits that would otherwise have to be spent for intra information. However, this mechanism does not add any delay to the system.

Again, the complete parameter set establishment is performed through control protocol means.

10.5. Video Telephony or Streaming, with FUs and Forward Error Correction

This scheme is implemented and was shown to provide good performance especially at higher packet loss rates [22].

The most efficient means to combat packet losses for scenarios where retransmissions are not applicable is forward error correction (FEC). Although end-to-end solutions are usually not preferable, they are unavoidable in some scenarios. For example, RFC 2733 [21] provides means to use generic FEC in packet-loss environments. A binary forward error correcting code is generated by applying the XOR operation to the bits at the same bit position in different packets. The binary code can be specified by the parameters (n,k) in which k is the number of information packets used in the connection and n is the total number of packets generated for k information packets, i.e., n-k parity packets are generated for k information packets.

When using a code with parameters (n,k) within the RFC 2733 framework, the following properties are well-known:

a) RFC 2733 can only be applied over a sequence of RTP packets, not over one RTP packet.

b) RFC 2733 is most bit-rate efficient if XOR-connected packets have equal length.

c) At the same packet loss probability p and for a fixed k, the greater the value of n is, the smaller the residual error probability becomes. For example, for packet loss probability 10%, k=1, and n=2, the residual error probability is about 1%, whereas for n=3, the residual error probability is about 0.1%.
d) At the same packet loss probability p and for a fixed code rate k/n, the greater the value of n is, the smaller the residual error probability becomes. For example, at a packet loss probability of p=10%, k=1 and n=2, the residual error rate is about 1%, whereas for an extended Golay code with k=12 and n=24, the residual error rate is about 0.01%.

To apply RFC 2733 in combination with H.264 baseline coded video without using FUs, several options might be considered:

1) The video encoder produces NAL units where each video frame is coded in a single slice. Applying FEC, one could use a simple code, e.g., (n=2, k=1), i.e., each NAL unit would basically just be repeated. The disadvantage is obviously the bad code performance according to (d) and the low flexibility, as only (n, k=1) codes can be used.

2) The video encoder produces NAL units where each video frame is encoded in a single slice. Applying FEC, one could use a better code, e.g., (n=24, k=12), over a sequence of NAL units. The disadvantage is obviously that in case of losses a significant delay is introduced and packets of completely different length might be connected, which decreases bit-rate efficiency according to (b).

3) The video encoder produces NAL units, where a certain frame contains k slices of possibly almost equal length. Then, applying FEC, a better code, e.g., (n=24, k=12), over the sequence of NAL units for each frame can be used. The delay compared to (2) is reduced, but several disadvantages are obvious. Firstly, the coding efficiency of the encoded video is lowered significantly, as slice-structured coding reduces intra-frame prediction and additional slice overhead is necessary. Secondly, pre-encoded content or, when operating over a gateway, the incoming video is usually not coded with k slices in a way that allows FEC to be applied.
Finally, the encoding of video producing k slices of equal length is not straightforward and might require more than one encoding pass.

Many of the mentioned disadvantages can be avoided by applying FUs in combination with FEC. Each NAL unit can be split into any number of FUs of basically equal length, and therefore FEC with a reasonable k and n can be applied even if the encoder made no effort to produce slices of equal length. For example, a coded slice NAL unit containing an entire frame can be split into k FUs and a parity check code (n=k+1, k) can be applied.

The presented technique makes it possible to achieve good transmission error tolerance even if no additional source coding layer redundancy, such as periodic intra frames, is present. Consequently, the same coded video sequence can be used for achieving the maximum compression efficiency and quality over error-free transmission and for transmission over error-prone networks. Furthermore, the technique allows the application of FEC to pre-encoded sequences without adding delay. In this case, pre-encoded sequences that are not encoded for error-prone networks can still be transmitted almost reliably without adding extensive delays. In addition, FUs of equal length result in a bit-rate efficient use of RFC 2733.

If the error probability depends on the length of the transmitted packet, e.g., in case of mobile transmission [15], the benefits of applying FUs with FEC are even more obvious. Basically, the flexibility of the size of FUs allows applying appropriate FEC for each NAL unit and even unequal error protection of NAL units.

The incurred overhead when using FUs and FEC is substantial, but in the same order of magnitude as the number of bits that have to be spent for intra coded macroblocks if no FEC is applied.
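The idea of splitting one NAL unit into k fragments of equal length and protecting them with a single XOR parity, i.e., an (n=k+1, k) parity check code, can be sketched as follows. This is a simplified illustration only: it works on raw byte strings, uses hypothetical function names, pads the last fragment with zero bytes as an assumption, and does not produce FU headers or RFC 2733 FEC packets.

```python
def split_equal(payload, k):
    """Split payload into k fragments of equal length (zero-padded at the end)."""
    size = -(-len(payload) // k)  # ceiling division
    padded = payload.ljust(size * k, b"\x00")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


def xor_parity(fragments):
    """XOR equal-length fragments byte by byte into one parity fragment."""
    parity = bytearray(len(fragments[0]))
    for fragment in fragments:
        for i, byte in enumerate(fragment):
            parity[i] ^= byte
    return bytes(parity)


def recover(received, parity):
    """Restore a single lost fragment (marked as None) from the parity."""
    missing = received.index(None)
    present = [f for f in received if f is not None]
    restored = list(received)
    restored[missing] = xor_parity(present + [parity])
    return restored


# Hypothetical payload standing in for a coded slice NAL unit.
nal_unit = bytes(range(10))
fragments = split_equal(nal_unit, 4)
parity = xor_parity(fragments)

# Simulate the loss of the third fragment and recover it.
received = list(fragments)
received[2] = None
recovered = recover(received, parity)
```

With one parity fragment, any single loss among the n = k+1 transmitted fragments can be repaired; two or more losses in the same group cannot.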
In [22] it was shown that, at the same error rate and the same overall bit-rate including the overhead, the FEC-based approach can enhance the quality.

10.6. Low-Bit-Rate Streaming

This scheme has been implemented with H.263 and non-standard RTP packetization and gave good results [23]. There is no technical reason why similarly good results could not be achieved with H.264.

In today's Internet streaming, some of the offered bit-rates are relatively low in order to allow terminals with dial-up modems to access the content. In wired IP networks, relatively large packets, say 500 - 1500 bytes, are preferred to smaller and more frequently occurring packets in order to reduce network congestion. Moreover, use of large packets decreases the amount of RTP/UDP/IP header overhead. For low-bit-rate video, the use of large packets means that sometimes up to a few pictures should be encapsulated in one packet.

However, loss of a packet including many coded pictures would have drastic consequences in visual quality, as there is practically no other way to conceal a loss of an entire picture than to repeat the previous one. One way to construct relatively large packets and maintain possibilities for successful loss concealment is to construct MTAPs that contain slices from several pictures in an interleaved manner. An MTAP should not contain spatially adjacent slices from the same picture or spatially overlapping slices from any picture. If a packet is lost, it is likely that a lost slice is surrounded by spatially adjacent slices of the same picture and spatially corresponding slices of the temporally previous and succeeding pictures. Consequently, concealment of the lost slice is likely to succeed relatively well.

10.7. Robust Packet Scheduling in Video Streaming

This scheme has been implemented with MPEG-4 Part 2 and simulated in a wireless streaming environment [24].
There is no technical reason why similar or better results could not be achieved with H.264.

Streaming clients typically have a receiver buffer that is capable of storing a relatively large amount of data. Initially, when a streaming session is established, a client does not start playing the stream back immediately, but rather it typically buffers the incoming data for a few seconds. This buffering helps to maintain continuous playback, because, in case of occasional increased transmission delays or network throughput drops, the client can decode and play buffered data. Otherwise, without initial buffering, the client has to freeze the display, stop decoding, and wait for incoming data. The buffering is also necessary for either automatic or selective retransmission at any protocol level. If any part of a picture is lost, a retransmission mechanism may be used to resend the lost data. If the retransmitted data is received before its scheduled decoding or playback time, the loss is perfectly recovered.

Coded pictures can be ranked according to their importance in the subjective quality of the decoded sequence. For example, non-reference pictures, such as conventional B pictures, are subjectively least important, because their absence does not affect decoding of any other pictures. In addition to non-reference pictures, the ITU-T H.264 | ISO/IEC 14496-10 standard includes a temporal scalability method called sub-sequences [25]. Subjective ranking can also be made on coded slice data partition or slice group basis. Coded slices and coded slice data partitions that are subjectively the most important can be sent earlier than their decoding order indicates, whereas coded slices and coded slice data partitions that are subjectively the least important can be sent later than their natural coding order indicates.
Consequently, any retransmitted parts of the most important slices and coded slice data partitions are more likely to be received before their scheduled decoding or playback time compared to the least important slices and slice data partitions.

11. Informative Appendix: Rationale for Decoding Order Number

11.1. Introduction

The Decoding Order Number (DON) concept was introduced mainly to enable efficient multi-picture slice interleaving (see section 10.6) and robust packet scheduling (see section 10.7). In both of these applications NAL units are transmitted out of decoding order. DON indicates the decoding order of NAL units and should be used in the receiver to recover the decoding order. Example use cases for efficient multi-picture slice interleaving and for robust packet scheduling are given in sections 11.2 and 11.3 respectively. Section 11.4 describes the benefits of the DON concept in error resiliency achieved by redundant coded pictures. Section 11.5 summarizes considered alternatives to DON and justifies why DON was chosen for this RTP payload specification.

11.2. Example of Multi-Picture Slice Interleaving

An example of multi-picture slice interleaving follows. A subset of a coded video sequence is depicted below in output order. R denotes a reference picture, N denotes a non-reference picture, and the number indicates a relative output time.

... R1 N2 R3 N4 R5 ...

The decoding order of these pictures is from left to right as follows:

... R1 R3 N2 R5 N4 ...

The NAL units of pictures R1, R3, N2, R5, and N4 are marked with a DON equal to 1, 2, 3, 4, and 5, respectively.
Each reference picture consists of three slice groups that are scattered as follows (a number denotes the slice group number for each macroblock in a QCIF frame):

0 1 2 0 1 2 0 1 2 0 1
2 0 1 2 0 1 2 0 1 2 0
1 2 0 1 2 0 1 2 0 1 2
0 1 2 0 1 2 0 1 2 0 1
2 0 1 2 0 1 2 0 1 2 0
1 2 0 1 2 0 1 2 0 1 2
0 1 2 0 1 2 0 1 2 0 1
2 0 1 2 0 1 2 0 1 2 0
1 2 0 1 2 0 1 2 0 1 2

For the sake of simplicity, we assume that all the macroblocks of a slice group are included in one slice. Three MTAPs are constructed from three consecutive reference pictures so that each MTAP contains three aggregation units, each of which contains all the macroblocks from one slice group. The first MTAP contains slice group 0 of picture R1, slice group 1 of picture R3, and slice group 2 of picture R5. The second MTAP contains slice group 1 of picture R1, slice group 2 of picture R3, and slice group 0 of picture R5. The third MTAP contains slice group 2 of picture R1, slice group 0 of picture R3, and slice group 1 of picture R5. Each non-reference picture is encapsulated into an STAP-B.

Consequently, the transmission order of NAL units is the following:

R1, slice group 0, DON 1, carried in MTAP, RTP SN: N
R3, slice group 1, DON 2, carried in MTAP, RTP SN: N
R5, slice group 2, DON 4, carried in MTAP, RTP SN: N
R1, slice group 1, DON 1, carried in MTAP, RTP SN: N+1
R3, slice group 2, DON 2, carried in MTAP, RTP SN: N+1
R5, slice group 0, DON 4, carried in MTAP, RTP SN: N+1
R1, slice group 2, DON 1, carried in MTAP, RTP SN: N+2
R3, slice group 0, DON 2, carried in MTAP, RTP SN: N+2
R5, slice group 1, DON 4, carried in MTAP, RTP SN: N+2
N2, DON 3, carried in STAP-B, RTP SN: N+3
N4, DON 5, carried in STAP-B, RTP SN: N+4

The receiver is able to organize the NAL units back in decoding order based on the value of DON associated with each NAL unit.
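The recovery of decoding order in this example can be sketched as follows. The sketch extends the 16-bit DON values to non-wrapping AbsDON values using the five-case rule given with the max-don-diff parameter in section 8.1, and then sorts the received NAL units. The function names and the tuple layout (picture, DON, RTP sequence number offset) are assumptions made for illustration.

```python
def abs_dons(dons):
    """Map DON values, given in transmission order, to non-wrapping AbsDON values."""
    result = []
    for i, don in enumerate(dons):
        if i == 0:
            result.append(don)  # AbsDON(0) = DON(0)
            continue
        prev_don, prev_abs = dons[i - 1], result[-1]
        if don == prev_don:
            result.append(prev_abs)
        elif prev_don < don and don - prev_don < 32768:
            result.append(prev_abs + don - prev_don)
        elif prev_don > don and prev_don - don >= 32768:
            result.append(prev_abs + 65536 - prev_don + don)  # wrapped forward
        elif prev_don < don and don - prev_don >= 32768:
            result.append(prev_abs - (prev_don + 65536 - don))  # wrapped backward
        else:  # prev_don > don and prev_don - don < 32768
            result.append(prev_abs - (prev_don - don))
    return result


def decoding_order(nal_units):
    """Sort (picture, don, sn_offset) tuples into decoding order (stable sort)."""
    extended = abs_dons([don for _, don, _ in nal_units])
    paired = sorted(zip(extended, range(len(nal_units))))
    return [nal_units[index] for _, index in paired]


# Transmission order of the example: (picture, DON, RTP SN offset from N).
received = [("R1", 1, 0), ("R3", 2, 0), ("R5", 4, 0),
            ("R1", 1, 1), ("R3", 2, 1), ("R5", 4, 1),
            ("R1", 1, 2), ("R3", 2, 2), ("R5", 4, 2),
            ("N2", 3, 3), ("N4", 5, 4)]

ordered = decoding_order(received)
pictures = [picture for picture, _, _ in ordered]
```

NAL units that share a DON value keep their transmission order here, which is permitted because units with equal DON may be passed to the decoder in any order.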
If one of the MTAPs is lost, the spatially adjacent and temporally co-located macroblocks are received and can be used to conceal the loss efficiently. If one of the STAPs is lost, the effect of the loss does not propagate temporally.

11.3. Example of Robust Packet Scheduling

An example of robust packet scheduling follows. The communication system used in the example consists of the following components in the order that the video is processed from source to sink:

   o camera and capturing
   o pre-encoding buffer
   o encoder
   o encoded picture buffer
   o transmitter
   o transmission channel
   o receiver
   o receiver buffer
   o decoder
   o decoded picture buffer
   o display

The video communication system used in the example operates as follows. Note that processing of the video stream happens gradually and at the same time in all components of the system. The source video sequence is shot and captured to a pre-encoding buffer. The pre-encoding buffer can be used to order pictures from sampling order to encoding order or to analyze multiple uncompressed frames for bit-rate control purposes, for example. In some cases, the pre-encoding buffer may not exist; instead, the sampled pictures are encoded right away. The encoder encodes pictures from the pre-encoding buffer and stores the output, i.e., coded pictures, in the encoded picture buffer. The transmitter encapsulates the coded pictures from the encoded picture buffer into transmission packets and sends them to a receiver through a transmission channel. The receiver stores the received packets in the receiver buffer. The receiver buffering process typically includes buffering for transmission delay jitter. The receiver buffer can also be used to recover the correct decoding order of coded data. The decoder reads coded data from the receiver buffer and produces decoded pictures as output into the decoded picture buffer.
The decoded picture buffer is used to recover the output (or display) order of pictures. Finally, pictures are displayed.

In the following example figures, I denotes an IDR picture, R denotes a reference picture, N denotes a non-reference picture, and the number after I, R, or N indicates the sampling time relative to the previous IDR picture in decoding order. Values below the sequence of pictures indicate scaled system clock timestamps. The system clock is initialized arbitrarily in this example, and time runs from left to right. Each I, R, and N picture is mapped onto the same timeline as in the previous processing step, if any, assuming that encoding, transmission, and decoding take no time. Thus, events happening at the same time are located in the same column throughout all example figures.

A subset of a sequence of coded pictures is depicted below in sampling order.

   ...  N58 N59 I00 N01 N02 R03 N04 N05 R06 ... N58 N59 I00 N01 ...
   ... --|---|---|---|---|---|---|---|---|- ... -|---|---|---|- ...
   ...   58  59  60  61  62  63  64  65  66 ... 128 129 130 131 ...

The sampled pictures are buffered in the pre-encoding buffer to arrange them in encoding order. In this example, we assume that the non-reference pictures are predicted from both the previous and the next reference picture in output order. Thus, the pre-encoding buffer has to contain at least two pictures, and the buffering causes a delay of two picture intervals. The output of the pre-encoding buffering process and the encoding (and decoding) order of the pictures are as follows:

   ... N58 N59 I00 R03 N01 N02 R06 N04 N05 ...
   ... -|---|---|---|---|---|---|---|---|- ...
   ...  60  61  62  63  64  65  66  67  68 ...

The encoder or the transmitter can set the value of DON for each picture to the value of DON for the previous picture in decoding order plus one.

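The reordering performed by the pre-encoding buffer and the DON assignment just described can be sketched as follows. This is a minimal illustration under stated assumptions, not a normative procedure: the function names are hypothetical, the placement of non-reference pictures immediately before an IDR picture is an assumption chosen to reproduce the figure above, and DON is assumed to be a 16-bit value that wraps around.

```python
DON_MODULUS = 1 << 16  # assumption: DON is a 16-bit value that wraps around


def to_decoding_order(sampling_order):
    """Reorder picture labels such as "N01", "R03", "I00".

    Non-reference pictures (N) wait for the next reference picture (R).
    Assumption made to match the figure: an IDR picture (I) is preceded
    by any waiting non-reference pictures.
    """
    decoding_order, waiting = [], []
    for pic in sampling_order:
        kind = pic[0]
        if kind == "N":
            waiting.append(pic)
        elif kind == "I":
            decoding_order += waiting + [pic]
            waiting = []
        else:  # "R": the reference picture goes first, then the held pictures
            decoding_order += [pic] + waiting
            waiting = []
    return decoding_order + waiting


def assign_don(decoding_order, first_don=0):
    """Give each picture the DON of its predecessor plus one."""
    return [(pic, (first_don + i) % DON_MODULUS)
            for i, pic in enumerate(decoding_order)]


sampling = ["N58", "N59", "I00", "N01", "N02", "R03", "N04", "N05", "R06"]
decoding = to_decoding_order(sampling)
dons = assign_don(decoding, first_don=65533)
```

With the sampling order from the first figure, decoding reproduces the encoding order shown in the second figure. The starting value 65533 is arbitrary and only exercises the assumed wraparound.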
For the sake of simplicity, let us assume that:

   o the frame rate of the sequence is constant,
   o each picture consists of only one slice,
   o each slice is encapsulated in a single NAL unit packet,
   o pictures are transmitted in decoding order, and
   o pictures are transmitted at constant intervals (equal to 1 / frame rate).

Thus, pictures are received in decoding order:

   ... N58 N59 I00 R03 N01 N02 R06 N04 N05 ...
   ... -|---|---|---|---|---|---|---|---|- ...
   ...  60  61  62  63  64  65  66  67  68 ...

The optional interleaving-depth MIME type parameter is set to 0, because the transmission (or reception) order is identical to the decoding order.

The decoder has to buffer for one picture interval initially in its decoded picture buffer to organize pictures from decoding order to output order as depicted below:

   ... N58 N59 I00 N01 N02 R03 N04 N05 R06 ...
   ... -|---|---|---|---|---|---|---|---|- ...
   ...  61  62  63  64  65  66  67  68  69 ...

The amount of required initial buffering in the decoded picture buffer can be signaled in the buffering period SEI message or with the num_reorder_frames syntax element of H.264 video usability information. num_reorder_frames indicates the maximum number of frames, complementary field pairs, or non-paired fields that precede any frame, complementary field pair, or non-paired field in the sequence in decoding order and follow it in output order. For the sake of simplicity, we assume that num_reorder_frames is used to indicate the initial buffering in the decoded picture buffer. In this example, num_reorder_frames is equal to 1.

It can be observed that if the IDR picture I00 is lost during transmission and a retransmission request is issued when the value of the system clock is 62, there is one picture interval of time (until the system clock reaches timestamp 63) to receive the retransmitted IDR picture I00.

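The num_reorder_frames value of this example can be checked with a small sketch that applies the definition quoted above to picture labels. It assumes that every picture is a frame; the function name max_reorder is hypothetical. Applying the same count to the transmission order and the decoding order gives 0 when the two orders are identical, which agrees with the interleaving-depth value of this example under the one-NAL-unit-per-picture simplification.

```python
def max_reorder(first_order, second_order):
    """Largest number of pictures that precede some picture in first_order
    and follow that same picture in second_order."""
    position = {pic: i for i, pic in enumerate(second_order)}
    worst = 0
    for i, pic in enumerate(first_order):
        ahead = sum(1 for earlier in first_order[:i]
                    if position[earlier] > position[pic])
        worst = max(worst, ahead)
    return worst


decoding = ["N58", "N59", "I00", "R03", "N01", "N02", "R06", "N04", "N05"]
output = ["N58", "N59", "I00", "N01", "N02", "R03", "N04", "N05", "R06"]

# Decoding order versus output order, as in the num_reorder_frames definition.
num_reorder_frames = max_reorder(decoding, output)

# Pictures are sent in decoding order here, so the two orders are the same.
interleaving_depth = max_reorder(decoding, decoding)
```

The quadratic scan is kept for readability; it is adequate for the handful of pictures in the example.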
Let us then assume that IDR pictures are transmitted two frame intervals earlier than their decoding position, i.e., the pictures are transmitted as follows:

   ...  I00 N58 N59 R03 N01 N02 R06 N04 N05 ...
   ... --|---|---|---|---|---|---|---|---|- ...
   ...   62  63  64  65  66  67  68  69  70 ...

The optional interleaving-depth MIME type parameter is set equal to 1 according to its definition. (The value of interleaving-depth in this example can be derived as follows: Picture I00 is the only picture preceding picture N58 or N59 in transmission order and following it in decoding order. Except for pictures I00, N58, and N59, the transmission order is the same as the decoding order of pictures. Since a coded picture is encapsulated into exactly one NAL unit, the value of interleaving-depth is equal to the maximum number of pictures preceding any picture in transmission order and following the picture in decoding order.)

The receiver buffering process holds two pictures at a time according to the value of the interleaving-depth parameter and orders pictures from the reception order to the correct decoding order based on the value of DON associated with each picture. The output of the receiver buffering process is the following:

   ... N58 N59 I00 R03 N01 N02 R06 N04 N05 ...
   ... -|---|---|---|---|---|---|---|---|- ...
   ...  63  64  65  66  67  68  69  70  71 ...

Again, an initial buffering delay of one picture interval is needed to organize pictures from decoding order to output order as depicted below:

   ... N58 N59 I00 N01 N02 R03 N04 N05 ...
   ... -|---|---|---|---|---|---|---|- ...
   ...  64  65  66  67  68  69  70  71 ...

It can be observed that the maximum delay that IDR pictures can undergo during transmission, including possible application, transport, or link layer retransmission, is equal to three picture intervals.
Thus, the loss resiliency of IDR pictures is improved in systems supporting retransmission compared to the case in which pictures are transmitted in their decoding order.

11.4. Robust Transmission Scheduling of Redundant Coded Slices

A redundant coded picture is a coded representation of a picture or a part of a picture that is not used in the decoding process if the corresponding primary coded picture is correctly decoded. There should be no noticeable difference between any area of the decoded primary picture and a corresponding area that would result from application of the H.264 decoding process for any redundant picture in the same access unit. A redundant coded slice is a coded slice that is a part of a redundant coded picture.

Redundant coded pictures can be used to provide unequal error protection in error-prone video transmission. If a primary coded representation of a picture is decoded incorrectly, a corresponding redundant coded picture can be decoded. Examples of applications and coding techniques utilizing the redundant coded picture feature include the video redundancy coding [26] and protection of "key pictures" in multicast streaming [27].

One property of many error-prone video communications systems is that transmission errors are often bursty and may therefore affect more than one consecutive transmission packet in transmission order. In low bitrate video communication, it is relatively common that an entire coded picture can be encapsulated into one transmission packet. Consequently, a primary coded picture and the corresponding redundant coded pictures may be transmitted in consecutive packets in transmission order. In order to make the transmission scheme more tolerant of bursty transmission errors, it is beneficial to transmit a primary coded picture apart from the corresponding redundant coded pictures. The DON concept enables this.

11.5. Remarks on Other Design Possibilities

The slice header syntax structure of the H.264 coding standard contains the frame_num syntax element that can indicate the decoding order of coded frames. However, the usage of the frame_num syntax element to recover the decoding order is not feasible or desirable for the following reasons:

   o The receiver is required to parse at least one slice header per coded picture (before passing the coded data to the decoder).

   o Coded slices from multiple coded video sequences cannot be interleaved, because the frame_num syntax element is reset to 0 in each IDR picture.

   o The coded fields of a complementary field pair share the same value of the frame_num syntax element. Thus, the decoding order of the coded fields of a complementary field pair cannot be recovered based on the frame_num syntax element or any other syntax element of the H.264 coding syntax.

The RTP payload format for transport of MPEG-4 elementary streams [28] enables interleaving of access units and transmission of multiple access units in the same RTP packet. An access unit is specified in the H.264 coding standard to consist of all NAL units that are associated with a primary coded picture according to subclause 7.4.1.2 of [1]. Consequently, with that payload format, slices of different pictures cannot be interleaved and the multi-picture slice interleaving technique (see section 10.6) for improved error resilience cannot be used.

12. Open Issues

   o Security section needs review.

   o Is max-don-diff necessary?

   o ITU-T H.241 provides a way for decoders to signal capability for greater processing speed or memory amount than required in the profile and level that is used. H.241 specifies CustomMaxMBPS, CustomMaxFS, CustomMaxDPB, and CustomMaxBRandCPB. Should similar parameters be specified as optional MIME/SDP parameters to enhance the capability exchange procedure of SIP-based video conferencing?

13. Full Copyright Statement

Copyright (C) The Internet Society (2003). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

14. Intellectual Property Notice

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights.
Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

The IETF has been notified of intellectual property rights claimed in regard to some or all of the specification contained in this document. For more information consult the online list of claimed rights at http://www.ietf.org/ipr.

15. References

15.1. Normative References

[1]  ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services", May 2003.

[2]  ISO/IEC International Standard 14496-10:2003.

[3]  S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[4]  H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 3550, July 2003.

[5]  M. Handley and V. Jacobson, "SDP: Session Description Protocol", RFC 2327, April 1998.

[6]  S. Josefsson, "The Base16, Base32, and Base64 Data Encodings", RFC 3548, July 2003.

[7]  ITU-T Recommendation T.35, "Procedure for the allocation of ITU-T defined codes for non-standard facilities", February 2000.

15.2. Informative References

[8]  "Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)", available from ftp://ftp.imtc-files.org/jvt-experts/2003_03_Pattaya/JVT-G50r1.zip, May 2003.

[9]  A. Luthra, G.J. Sullivan, and T. Wiegand (eds.), Special Issue on H.264/AVC, IEEE Transactions on Circuits and Systems for Video Technology, July 2003.

[10] P. Borgwardt, "Handling Interlaced Video in H.26L", VCEG-N57r2, available from ftp://standard.pictel.com/video-site/0109_San/VCEG-N57r2.doc, September 2001.

[11] C. Bormann et al., "RTP Payload Format for the 1998 Version of ITU-T Rec. H.263 Video (H.263+)", RFC 2429, October 1998.

[12] ISO/IEC IS 14496-2.

[13] S. Wenger, "H.26L over IP", IEEE Transactions on Circuits and Systems for Video Technology, July 2003.

[14] S. Wenger, "H.26L over IP: The IP Network Adaptation Layer", Proceedings Packet Video Workshop 02, April 2002.

[15] T. Stockhammer, M.M. Hannuksela, and S. Wenger, "H.26L/JVT Coding Network Abstraction Layer and IP-based Transport", in Proc. ICIP 2002, Rochester, NY, September 2002.

[16] ITU-T Recommendation H.241, "Extended video procedures and control signals for H.300 series terminals", July 2003.

[17] H. Schulzrinne and S. Casner, "RTP Profile for Audio and Video Conferences with Minimal Control", RFC 3551, July 2003.

[18] B. Cain, S. Deering, I. Kouvelas, B. Fenner, and A. Thyagarajan, "Internet Group Management Protocol, Version 3", RFC 3376, October 2002.

[19] ITU-T Recommendation H.223, "Multiplexing protocol for low bit rate multimedia communication", July 2001.

[20] C. Perkins et al., "RTP Payload for Redundant Audio Data", RFC 2198, September 1997.

[21] J. Rosenberg and H. Schulzrinne, "An RTP Payload Format for Generic Forward Error Correction", RFC 2733, December 1999.

[22] T. Stockhammer, T. Wiegand, T. Oelbaum, and F. Obermeier, "Video Coding and Transport Layer Techniques for H.264/AVC-Based Transmission over Packet-Lossy Networks", IEEE International Conference on Image Processing (ICIP 2003), Barcelona, Spain, September 2003.

[23] V. Varsa and M. Karczewicz, "Slice interleaving in compressed video packetization", Packet Video Workshop 2000.

[24] S.H. Kang and A. Zakhor, "Packet scheduling algorithm for wireless video streaming", International Packet Video Workshop 2002, available from http://www.pv2002.org.

[25] M.M. Hannuksela, "Enhanced concept of GOP", JVT-B042, available from ftp://standard.pictel.com/video-site/0201_Gen/JVT-B042.doc, January 2002.

[26] S. Wenger, "Video Redundancy Coding in H.263+", 1997 International Workshop on Audio-Visual Services over Packet Networks, September 1997.

[27] Y.-K. Wang, M.M. Hannuksela, and M. Gabbouj, "Error Resilient Video Coding Using Unequally Protected Key Pictures", in Proc. International Workshop VLBV03, September 2003.

[28] J. van der Meer, D. Mackie, V. Swaminathan, D. Singer, and P. Gentric, "RTP Payload Format for Transport of MPEG-4 Elementary Streams", draft-ietf-avt-mpeg4-simple-08.txt, August 2003.

Authors' Addresses

   Stephan Wenger                      Phone: +49-172-300-0813
   TU Berlin / Teles AG                Email: stewe@cs.tu-berlin.de
   Franklinstr. 28-29
   D-10587 Berlin
   Germany

   Miska M. Hannuksela                 Phone: +358-7180-73151
   Nokia Corporation                   Email: miska.hannuksela@nokia.com
   P.O. Box 100
   33721 Tampere
   Finland

   Thomas Stockhammer                  Phone: +49-89-28923474
   Institute for Communications Eng.   Email: stockhammer@ei.tum.de
   Munich University of Technology
   D-80290 Munich
   Germany

   Magnus Westerlund                   Phone: +46-8-4048287
   Multimedia Technologies             Email: magnus.westerlund@ericsson.com
   Ericsson Research EAB/TVA/A
   Ericsson AB
   Torshamsgatan 23
   S-164 80 Stockholm
   Sweden

   David Singer                        Phone: +1 408 974-3162
   QuickTime Engineering               Email: singer@apple.com
   Apple
   1 Infinite Loop MS 302-3MT
   Cupertino CA 95014
   USA

Annex A: Changes relative to draft-ietf-avt-rtp-h264-02.txt

[This section will be removed in a future version of this draft.]

This memo contains the following technical changes relative to the previous I-D:

   o Assignment of DON values for NAL units in an STAP-B and decoding order of NAL units in an STAP-B corrected and clarified. De-packetization process changed accordingly.

   o Derivation of DON for MTAPs changed to allow wraparound of DON values within one MTAP.

   o The use of RTP timestamp and picture timing SEI message is clarified.

   o Single NAL unit packetization mode introduced for compatibility with ITU-T Recommendation H.241.

   o Packetization modes simplified. Single-picture and multi-picture mode changed to non-interleaved and interleaved modes. Packets including DON can no longer be mixed with packets not including DON, and therefore the derivation of the decoding order becomes easier and more tolerant of transmission delay jitter.

   o The optional packetization-mode MIME parameter introduced to reflect the new packetization modes. The previous parameter for selecting the packetization mode, i.e., mtap-allowed, was deleted.

   o Created two types of fragmentation units, FU-A (not including DON) and FU-B (including DON).

   o Base64 encoding used in the optional parameter-sets MIME parameter instead of hexadecimal encoding to improve compression efficiency.

   o Added an informative note clarifying why values of DON in consecutive NAL units in decoding order are not required to be incremented by one.

   o Section 1.3 ("Network Abstraction Layer Unit Types") appeared in the introduction section but actually specified semantics of F and NRI that are specific to the RTP payload format. These semantics are now specified in section 5.3.

   o A third option, max-don-diff, was added to control the receiver buffering in the interleaved packetization mode. max-don-diff is specified similarly to the maximum displacement parameter in the draft-ietf-avt-mpeg4-simple Internet Draft, but instead of a maximum difference in terms of RTP timestamps, a maximum difference in terms of decoding order numbers is used. This design decision was made due to the following facts:

      1) RTP timestamp indicates the capture/display timestamp.

      2) H.264/AVC allows decoding order different from output order.

      3) The receiver buffer is used to reorder packets from transmission/reception order to decoding order.

      4) Thus, a displacement specified in terms of differences in RTP timestamps cannot be used for reception-to-decoding-order reorganization.

   o Editorial changes and new informative notes.