idnits 2.17.1 draft-wenger-avt-rtp-jvt-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Looks like you're using RFC 2026 boilerplate. This must be updated to follow RFC 3978/3979, as updated by RFC 4748. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** The abstract seems to contain references ([2]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 327: '...and, hence, PSIs SHOULD NOT be used in...' RFC 2119 keyword, line 330: '... such cases PSIs MAY be used. Severe ...' RFC 2119 keyword, line 333: '...his reason, PSIs MUST NOT be used in a...' RFC 2119 keyword, line 336: '...rotocol messages MUST NOT be used that...' RFC 2119 keyword, line 343: '...he PSIs (when used) SHOULD be conveyed...' (39 more instances...) 
Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not match the current year == Using lowercase 'not' together with uppercase 'MUST', 'SHALL', 'SHOULD', or 'RECOMMENDED' is not an accepted usage according to RFC 2119. Please use uppercase 'NOT' together with RFC 2119 keywords (if that is what you mean). Found 'MUST not' in this paragraph: Marker bit (M): 1 bit Set for the very last packet of the picture indicated by the RTP timestamp, in line with the normal use of the M bit and to allow an efficient playout buffer handling. Decoders MAY use this bit as an early indication of the last packet of a coded picture, but MUST not rely on this property because the last packet of the picture may get lost, and because the use of MTAPs does not always preserve the M bit. -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (December 2002) is 7803 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Missing reference section? '2' on line 815 looks like a reference -- Missing reference section? '1' on line 812 looks like a reference -- Missing reference section? '3' on line 817 looks like a reference -- Missing reference section? '4' on line 818 looks like a reference -- Missing reference section? 
'5' on line 819 looks like a reference -- Missing reference section? '6' on line 822 looks like a reference -- Missing reference section? '7' on line 824 looks like a reference -- Missing reference section? '8' on line 826 looks like a reference -- Missing reference section? '9' on line 827 looks like a reference -- Missing reference section? '10' on line 830 looks like a reference -- Missing reference section? '11' on line 831 looks like a reference -- Missing reference section? '12' on line 834 looks like a reference -- Missing reference section? '13' on line 836 looks like a reference Summary: 6 errors (**), 0 flaws (~~), 3 warnings (==), 15 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Internet Draft S. Wenger 2 Document: draft-wenger-avt-rtp-jvt-01.txt M. Hannuksela 3 Expires: December 2002 T. Stockhammer 4 June 2002 5 Expires December 2002 7 RTP payload Format for JVT Video 9 Status of this Memo 11 This document is an Internet-Draft and is in full conformance with all 12 provisions of Section 10 of RFC2026. Internet-Drafts are working 13 documents of the Internet Engineering Task Force (IETF), its areas, and 14 its working groups. Note that other groups may also distribute working 15 documents as Internet-Drafts. 17 Internet-Drafts are draft documents valid for a maximum of six months 18 and may be updated, replaced, or obsoleted by other documents at any 19 time. It is inappropriate to use Internet-Drafts as reference material 20 or to cite them other than as "work in progress." 22 The list of current Internet-Drafts can be accessed at 23 http://www.ietf.org/1id-abstracts.txt 25 The list of Internet-Draft Shadow Directories can be accessed at 26 http://www.ietf.org/shadow.html 28 Abstract 30 This memo describes an RTP Payload format for the JVT codec. 
This 31 codec is being designed as a joint project of the ITU-T SG 16 VCEG and 32 the ISO/IEC JTC1/SC29/WG11 MPEG groups. The most up-to-date draft 33 of the video codec was specified in early May 2002, is due for 34 revision in late July 2002, and is available for public review [2]. 36 1. The JVT codec 38 This memo specifies an RTP payload format for a new video 39 codec that is currently under development by the Joint Video Team 40 (JVT), which is formed of video coding experts of MPEG and the ITU- 41 T. After the likely approval by the two parent bodies, the codec 42 specification will have the status of an ITU-T Recommendation 43 (likely H.264) and become part of the MPEG-4 specification (ISO/IEC 44 14496 Part 10). The current project timeline of the JVT project is 45 such that a technically frozen specification (pending bug fixes) is 46 expected in July 2002 in the form of an ISO/IEC Final Committee 47 Draft (FCD). Before JVT was formed in late 2001, this project used 48 the ITU-T project name H.26L and the JVT project inherited all the 49 technical concepts of the H.26L project. 51 The JVT video codec has a very broad application range that covers 52 the whole range from low bit rate Internet Streaming applications to 53 HDTV broadcast and Digital Cinema applications with near-lossless 54 coding. Most, if not all, relevant companies in all of these fields 55 (including TV broadcast) have participated in the standardization, 56 which gives hope that this wide application range is more than an 57 illusion and may materialize, probably in a relatively short time 58 frame. The overall performance of the JVT codec is such that bit 59 rate savings of 50% or more, compared to the current state of 60 technology, are reported. Digital Satellite TV quality, for 61 example, was reported to be achievable at 1.5 Mbit/s, compared to 62 the current operation point of MPEG 2 video at around 3.5 Mbit/s 63 [1].
65 The codec specification [2] itself distinguishes between a video 66 coding layer (VCL) and a network abstraction layer (NAL). The VCL 67 contains the signal processing functionality of the codec, things 68 such as transform, quantization, motion search/compensation, and the 69 loop filter. It follows the general concept of most of today's 70 video codecs: a macroblock-based coder that utilizes inter picture 71 prediction with motion compensation, and transform coding of the 72 residual signal. The output of the VCL consists of slices: each is a bit string 73 that contains the macroblock data of an integer number of 74 macroblocks, and the information of the slice header (containing the 75 spatial address of the first macroblock in the slice, the initial 76 quantization parameter, and similar). Macroblocks in slices are 77 ordered in scan order unless a different macroblock allocation is 78 specified, using the so-called Flexible Macroblock Ordering syntax. 79 In-picture prediction is used only within a slice. 81 The NAL encapsulates the slice output of the VCL into Network 82 Abstraction Layer Units (NALUs), which are suitable for 83 transmission over packet networks or use in packet oriented 84 multiplex environments. JVT's Annex B defines an encapsulation 85 process to transmit such NALUs over byte-stream oriented networks. 86 In the scope of this memo Annex B is not relevant. 88 Neither the VCL nor the NAL is claimed to be media or network independent - 89 the VCL needs to know transmission characteristics in order to 90 appropriately select the error resilience strength, slice size, 91 etc., whereas the NAL needs information like the importance of a bit 92 string provided by the VCL to select the appropriate application 93 layer protection. 95 Internally, the NAL uses NAL Units or NALUs. A NALU consists of a 96 one-byte header and the payload byte string.
The header also serves 97 as the RTP payload header and indicates the type of the NALU, the 98 (potential) presence of bit errors in the NALU payload, and 99 whether this NALU is required for maintaining the 100 synchronicity of the encoder/decoder loops. This RTP payload 101 specification is designed to be unaware of the bit string in the 102 NALU payload. 104 One of the main properties of the JVT codec is the possibility of 105 using Reference Picture Selection. For each macroblock the 106 reference picture to be used can be selected independently. The 107 reference pictures may be used in a first-in, first-out fashion, but 108 it is also possible to handle the reference picture buffers 109 explicitly. A consequence of this new feature (previously available 110 only in H.263++ [3]) is the complete decoupling of the 111 transmission time, the decoding time, and the sampling or 112 presentation time of slices and pictures. For this reason, the 113 handling of the RTP timestamp requires some special considerations 114 for those NALUs for which the sampling or presentation time is not 115 defined, or, at transmission time, unknown. 117 2. Status of JVT, and Changes relative to the -00 version 119 [This section will be removed in a future version of this draft.] 121 2.1. Status of the JVT standardization, and recent changes to JVT 123 Since the last draft, JVT has met and a new JVT working draft was 124 produced. This JVT working draft is currently in the first stage of 125 the ISO/IEC approval process, the ballot on the so-called Committee 126 Draft. Procedural provisions are taken by interested ISO/IEC 127 members to ensure that changes relative to this draft are still 128 possible, even after the ballot. 130 The meeting brought a lot of changes in the VCL, which do not have a 131 direct influence on this memo. However, there were also numerous 132 changes introduced to the NAL.
They compromise the clean design 133 of the NAL as it was presented at the Minneapolis IETF, in order to 134 save bits in a byte stream environment. This memo reflects the 135 current JVT working draft, but please see the following section on 136 our expectations regarding future changes of the NAL design. 138 The main changes of the JVT NAL relative to the pre-Fairfax design 139 are as follows: 141 - Introduction of a picture header 142 - A means to carry redundant copies of the picture header 143 - Addition of a "Disposable Flag" to the NALU type 145 - Addition of many more slice types to the NALU type (previously 8, now 30) 147 The next JVT meeting will take place in the week after the Japan 148 IETF in Klagenfurt, Austria. This will be the last meeting in which 149 significant changes (anything but bug fixes) can be made. 151 2.2. Authors' comments and expectation regarding JVT NAL design 153 The authors deem many of the changes to the NAL technically 154 problematic, and are working within JVT to fix the freshly 155 introduced and, from the RTP point-of-view, problematic features. 156 The re-introduction of the picture header concept will lead to an 157 undesirable overhead in packet network environments, by making 158 mechanisms such as header repetition necessary. It also breaks the 159 clean Parameter Set concept, making it easier for people to take 160 shortcuts. 162 We know that we can show that the number of bits that can be saved 163 in a byte stream environment through the picture header concept is 164 negligible, and insignificant when compared to the problems the 165 packet world has with this concept. We are confident that we can 166 replace the picture header mechanism with something like a 167 hierarchical Parameter Set concept.
169 If we can convince JVT to go back to the clean JVT NAL design, the 170 number of NALU types (30, plus one for the aggregation packets now) 171 would go down to something more reasonable, freeing codepoint 172 space for future extensions. Otherwise, the draft will require 173 language that recommends the amount of redundant picture header data 174 to be sent. 176 2.3. Changes relative to draft-wenger-avt-rtp-jvt-00.txt 178 This memo reflects the current JVT WD, and hence required alignment 179 with this draft. In addition to editorial changes (mostly to 180 reflect the changed terminology in the JVT draft), the discussion of 181 the NAL unit types was aligned. 183 As a response to the last IETF meeting's request, the RTP timestamp 184 is now the sampling/presentation timestamp. (It is unclear to us 185 how to distinguish between the two.) 187 The RTP clock is now fixed at 90 kHz. 189 Compound Packets are renamed to Aggregation Packets. 191 Since the timestamp now carries vital information, a second type of 192 aggregation packet is necessary. The compound packet of draft- 193 wenger-avt-rtp-jvt-00.txt can now be used only to aggregate packets 194 that share the same RTP timestamp, and is now called Single-Time 195 Aggregation Packet (STAP). Usually, this packet type can only be 196 used to aggregate packets belonging to the same picture. The second 197 aggregated packet type adds a 16-bit timestamp offset to the 198 aggregated packet data structure for each of the aggregated NALUs, 199 and is called Multi-Time Aggregation Packet (MTAP). At a 90 kHz clock, 200 this packet type allows the aggregation of NALUs that are up to roughly 0.7 201 seconds apart (a 16-bit offset spans 65535 clock ticks). It is believed that such a distance is a good 202 compromise between the requirements of the streaming industry (they 203 want to packetize NALUs belonging to more than one picture into one 204 packet) and the overhead constraints (16 bits per NALU). See 205 section 11 (Open issues) for a more flexible concept.
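As a rough check of the MTAP timing range discussed above, the following Python sketch computes the largest time distance that a 16-bit offset can express at the 90 kHz RTP clock, and recovers a NALU-time by subtracting the offset from the MTAP timestamp as described in section 4.3.2. It assumes that the offset is an unsigned count of clock ticks and that the subtraction wraps modulo 2^32 like the RTP timestamp itself; neither detail is stated in this memo, and all names are illustrative only.

```python
RTP_CLOCK_HZ = 90_000   # RTP clock rate fixed by this memo
OFFSET_BITS = 16        # width of the per-NALU timestamp offset in an MTAP


def max_offset_seconds(clock_hz: int = RTP_CLOCK_HZ, bits: int = OFFSET_BITS) -> float:
    """Largest time distance expressible by an unsigned offset of `bits` bits."""
    return ((1 << bits) - 1) / clock_hz


def offset_ticks(seconds: float, clock_hz: int = RTP_CLOCK_HZ) -> int:
    """Convert a time distance to clock ticks; fail if it needs more than 16 bits."""
    ticks = round(seconds * clock_hz)
    if not 0 <= ticks < (1 << OFFSET_BITS):
        raise ValueError("time distance does not fit into a 16-bit offset")
    return ticks


def nalu_time(mtap_timestamp: int, offset: int) -> int:
    """NALU-time = MTAP timestamp minus offset (assumed to wrap modulo 2**32)."""
    return (mtap_timestamp - offset) % (1 << 32)
```

Under these assumptions the upper bound works out to 65535/90000, or about 0.73 seconds, so two NALUs sampled half a second apart (45000 ticks) fit into one MTAP while NALUs a full second apart do not.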
207 In the JVT meeting a "Disposable Flag" was introduced in the NALU 208 header. That bit is documented here as well. 210 3. Scope 212 This payload specification can only be used to carry the "naked" JVT 213 NALU stream over RTP. Likely, the first applications of a Standards 214 Track RFC resulting from this draft will be in the conversational 215 multimedia field, such as video telephony or video conferencing. The draft is 216 not intended for use in conjunction with the Byte Stream format 217 of Annex B of the JVT working draft, the MPEG 4 system layer [4] or 218 other multiplexing schemes. 220 4. NAL basics 222 Tutorial information on the NAL design can be found in [5] and 223 [6]. For the precise definition of the NAL, see [2]. 224 This section tries to provide a very short overview of the concepts 225 used. 227 4.1. Parameter Set Concept 229 One very fundamental design concept of the JVT codec is to generate 230 self-contained packets, to make mechanisms such as the header 231 duplication of RFC2429 [7] or MPEG-4's HEC [8] unnecessary. (Please 232 see section 2.2 regarding the authors' opinion on the picture 233 header.) This is achieved by decoupling information 234 that is relevant for more than one slice from the media stream. 235 This higher layer meta information should be sent reliably and 236 asynchronously from the RTP packet stream that contains the slice 237 packets. The combination of the higher level parameters is called a 238 Parameter Set. The Parameter Set contains information such as 240 o picture size, 241 o display window, 242 o optional coding modes employed, 243 o and others. 245 In order to be able to change picture parameters (such as the 246 picture size) without the need to transmit Parameter Set 247 updates synchronously to the slice packet stream, the encoder and 248 decoder can maintain a list of more than one Parameter Set.
Each 249 slice header contains a codeword that indicates the Parameter Set to 250 be used. 252 This mechanism decouples the transmission of the Parameter 253 Sets from the packet stream, and allows them to be transmitted by external means, 254 e.g. as a side effect of the capability exchange, or through a 255 (reliable or unreliable) control protocol. It may even be possible 256 that they are never transmitted but are fixed by an application 257 design specification. 259 Although, conceptually, the Parameter Set updates are not designed 260 to be sent in the synchronous packet stream, this memo contains a 261 means to convey them in the RTP packet stream. 263 4.2. Network Abstraction Layer Unit (NALU) Types 265 All NALUs begin with a single NALU type octet, which also serves as 266 the payload header. The payload of a NALU follows immediately. 268 The NALU type octet has the following format:
270 +---------------+
271 |0|1|2|3|4|5|6|7|
272 +-+-+-+-+-+-+-+-+
273 |E|  Type   |P|D|
274 +---------------+
276 E: 1 bit 277 The Error Indication bit, when cleared, assures a bit-error free 278 payload of the NALU and of the NALU type octet. When set, the 279 decoder is advised that bit errors may be present in the payload 280 or in the NALU type octet. A prudent reaction of decoders that 281 are incapable of handling bit errors is to discard such packets. 283 Type: 5 bits 284 The NAL Unit payload type as defined in table 8.2 of [2]. 286 P: 1 bit 287 Picture Header Flag. Indicates the presence of a Picture Header 288 at the beginning of the payload. 290 D: 1 bit 291 The Disposable Flag indicates that the payload of the NALU, after 292 decoding, will not be used for future prediction. Hence, the 293 decoder and/or media aware network elements can discard such 294 packets without hurting the codec performance or starting error 295 propagation due to predicted coding. However, the user 296 experience will suffer (most likely due to lower frame rates).
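The field layout of the NALU type octet described above can be illustrated with a small Python sketch. It assumes the usual IETF diagram convention in which bit 0 is the most significant bit, so that E is the top bit and D the bottom bit of the octet; this memo does not state the bit order explicitly. The type and function names are hypothetical and not part of this specification.

```python
from typing import NamedTuple


class NaluTypeOctet(NamedTuple):
    error: bool            # E: bit errors may be present
    nalu_type: int         # Type: 5-bit NAL unit payload type
    picture_header: bool   # P: picture header at start of payload
    disposable: bool       # D: not used for future prediction


def parse_nalu_type_octet(octet: int) -> NaluTypeOctet:
    """Split one octet into E, Type, P and D (bit 0 assumed to be the MSB)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("NALU type octet must be in the range 0..255")
    return NaluTypeOctet(
        error=bool(octet & 0x80),
        nalu_type=(octet >> 2) & 0x1F,
        picture_header=bool(octet & 0x02),
        disposable=bool(octet & 0x01),
    )


def build_nalu_type_octet(fields: NaluTypeOctet) -> int:
    """Inverse of parse_nalu_type_octet."""
    if not 0 <= fields.nalu_type <= 0x1F:
        raise ValueError("Type must fit into 5 bits")
    return (
        (0x80 if fields.error else 0)
        | (fields.nalu_type << 2)
        | (0x02 if fields.picture_header else 0)
        | (0x01 if fields.disposable else 0)
    )
```

A receiver that cannot handle bit errors could, for example, drop any NALU for which `parse_nalu_type_octet(payload[0]).error` is true, in line with the prudent reaction suggested for the E bit.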
298 For a reference of all currently defined NALU types and their 299 semantics please see section 8.2 in [2]. Because we anticipate 300 significant changes to this table, only a few remarks on those NALU 301 types shall be provided here. 303 NAL Units of the type X Picture Header (where X is Intra, Inter, B, 304 SI, or SP) indicate a payload that consists of a picture header of 305 the indicated type. 307 All NAL Unit types called X slice contain exactly one coded slice of 308 the specified type. In some cases it is also assured that not only 309 this slice, but also all other slices of the coded picture are of 310 the same slice type. This can help the resource allocation process 311 at the decoder. An instantaneous decoder refresh picture (IDER 312 picture) is an I or SI picture that can be used as a random access 313 point. 315 NAL units of the types DPB and DPC carry Data Partitions 316 consisting only of Intra and Inter CPBs and coefficients. 318 The Supplemental Enhancement Information type (SEI) is used to carry 319 metadata that is not necessary to keep the loops in encoder and 320 decoder synchronized. A prime example of SEI information is the 321 presentation time in networks that do not have a time property 322 comparable to the RTP timestamp. 324 Parameter Set Information NALUs (PSIs) are used to carry new 325 Parameter Sets or updates to previous Parameter Sets. Normally, the 326 transmission and update of Parameter Sets is a function of a control 327 protocol and, hence, PSIs SHOULD NOT be used in such systems where 328 adequate protocol support is available. However, there are 329 applications where the packet stream has to be self-contained. In 330 such cases PSIs MAY be used. Severe synchronization problems 331 between the RTP stream containing PSIs and control protocol messages 332 can occur if PSIs and control protocol messages are used in the same 333 RTP session.
For this reason, PSIs MUST NOT be used in an RTP 334 session whose Parameter Sets were already changed by control 335 protocol messages during the lifetime of the RTP session. 336 Similarly, control protocol messages MUST NOT be used that affect 337 any RTP session on which at least one PSI was sent. 339 The Parameter Set mechanism is designed to decouple the transmission 340 of picture/GOP/sequence header information from the picture data 341 that is composed of the other NALU types. To successfully decode a 342 picture, all Parameter Sets (referenced by the slice header) need to 343 be available. Hence, the PSIs (when used) SHOULD be conveyed 344 significantly before their content is first referenced. 346 4.3. Aggregation Packets 347 Aggregation packets are the packet aggregation scheme of this 348 payload specification. The scheme is introduced to reflect the 349 dramatically different MTU sizes of two target networks -- wireline 350 IP networks (with an MTU size that is often limited by the Ethernet 351 MTU size -- roughly 1500 bytes), and IP or non-IP (e.g. H.324/M) 352 based wireless networks with preferred transmission unit sizes of 353 254 bytes or less. In order to prevent media transcoding between 354 the two worlds, and to avoid undesirable packetization overhead, a 355 packet aggregation scheme is introduced. 357 Two types of Aggregation packets are defined by this specification: 359 o Single-Time Aggregation Packets (STAPs) aggregate NALUs with 360 identical NALU-time. 361 o Multi-Time Aggregation Packets (MTAPs) aggregate NALUs with 362 potentially differing NALU-time. 364 The term NALU-time is defined as the value the RTP timestamp would 365 have if that NALU were transported in its own RTP packet. 367 MTAP and STAP share the following packetization rules: 369 The Disposable Flag MUST be set if it is set in all aggregated 370 NALUs, otherwise it MUST be cleared. The Type field of the NALU 371 type octet MUST be zero.
The E bit MUST be cleared if all E bits of 372 the aggregated NALUs are zero, otherwise it MUST be set. 374 For MTAPs and STAPs (identified by type = 0 in the NALU type byte) 375 the Picture Header flag is overloaded with a new semantic. A zero 376 in the Picture Header flag indicates a STAP, a one indicates an 377 MTAP. 379 The Marker bit in the RTP header MUST be set to the value the marker 380 bit of the last NALU of the aggregated packet would have if it were 381 transported in its own RTP packet. 383 The NALU Payload of an aggregation packet consists of one or more 384 aggregation units. See sections 4.3.1 and 4.3.2 for the two 385 different types of aggregation units. An aggregation packet can 386 carry as many aggregation units as necessary; however, the total 387 amount of data in an aggregation packet MUST fit into an 388 IP packet, and the size SHOULD be chosen such that the resulting IP 389 packet is smaller than the MTU size. 391 4.3.1. Single-Time Aggregation Packet 393 A Single-Time Aggregation Packet (STAP) SHOULD be used when 394 aggregating NALUs that share the same NALU-time. The Picture Header 395 Flag MUST be set to zero in order to distinguish an STAP from an 396 MTAP. 398 The NALU payload of an STAP consists of Single-Picture Aggregation 399 Units. 401 A Single-Picture Aggregation Unit consists of 16-bit unsigned size 402 information that indicates the size of the following NALU in bytes 403 (excluding these two octets, but including the NALU type octet of 404 the NALU), followed by the NALU itself including its NALU type 405 byte. 407 4.3.2. Multi-Time Aggregation Packet (MTAP) 409 An MTAP has an architecture similar to that of an STAP. It consists of the 410 NALU header byte and one or more Multi-Picture Aggregation Units. 411 The Picture Header flag in the MTAP NALU type byte is set to 1 to 412 distinguish an MTAP from an STAP. 414 This memo does not specify how the NALUs within an MTAP are 415 ordered.
In most cases, the natural "decoding order" SHOULD be 416 used, in particular in conjunction with bi-predicted pictures that 417 use a forward reference picture. However, all other NALU ordering 418 schemes that are legal in JVT video MAY be used as well. 420 A Multi-Picture Aggregation Unit consists of 16-bit unsigned size 421 information for the following NALU (the same as the size information 422 in the STAP). These 16 bits are followed by 16 bits of timing 423 information for this NALU. The timing information field MUST be set 424 so that the RTP timestamp that each NALU in the MTAP would have in its own RTP packet 425 (the NALU-time) can be generated by subtracting the timing 426 information from the RTP timestamp of the MTAP. 428 For the "latest" Multi-Picture Aggregation Unit in an MTAP the 429 timing offset MUST be zero. Hence, the RTP timestamp of the MTAP 430 itself is identical to the latest NALU-time. 432 5. RTP Packetization Process 434 The RTP packetization process of the JVT codec is straightforward 435 and follows the general principles outlined in RFC1889. When using 436 one NALU per RTP packet, the RTP payload consists of the bit buffer 437 containing the NALU. The RTP payload (and the settings for some RTP 438 header bits) for aggregation packets are defined in section 439 4.3 above. There is no specific RTP payload header -- the NALU type 440 byte doubles as the payload header. The RTP header information is 441 set as follows: 443 Timestamp: 32 bits 444 The RTP timestamp is set to the presentation/sampling timestamp 445 of the content. If the NALU has no timing properties of its own (e.g. 446 PSIs, SEI), or if the presentation/sampling time is unknown, the 447 RTP timestamp is set to the RTP timestamp of the last transmitted 448 RTP packet in the session. The setting of the RTP Timestamp for 449 MTAPs is defined in section 4.3.2 above.
451 Marker bit (M): 1 bit 452 Set for the very last packet of the picture indicated by the RTP 453 timestamp, in line with the normal use of the M bit and to allow 454 an efficient playout buffer handling. Decoders MAY use this bit 455 as an early indication of the last packet of a coded picture, but 456 MUST NOT rely on this property because the last packet of the 457 picture may get lost, and because the use of MTAPs does not 458 always preserve the M bit. 460 Sequence No (Seq): 16 bits 461 Increased by one for each sent packet. Set to a random value 462 during startup as per RFC1889. 464 Version (V): 2 bits 465 set to 2 467 Padding (P): 1 bit 468 set to 0 470 Extension (X): 1 bit 471 set to 0 473 Payload Type (PT): 7 bits 474 established dynamically during connection establishment 476 All other RTP header fields are set as per RFC1889. 478 6. Packetization Rules 480 Two sets of packetization rules are distinguished by whether 481 packets belonging to more than a single picture may be placed 482 into a single aggregated packet (using STAPs or MTAPs). 484 6.1. Unrestricted Mode (Multiple Picture Model) 486 This mode MAY be supported by some receivers. Usually, the 487 capability of a receiver to support this mode is indicated by one of 488 the profiles of the JVT codec (this is not yet defined in [2]). The 489 following packetization rules MUST be enforced by the sender: 491 o Single slice packets belonging to the same picture (and hence 492 sharing the same RTP timestamp value) MAY be sent in any order, 493 although, for delay critical systems, they SHOULD be sent in their 494 original coding order to minimize the delay. Note that the coding 495 order is not necessarily the scan order, but the order the NAL 496 packets become available to the RTP stack. 498 o Both MTAPs and STAPs MAY be used. 500 o SEI packets MAY be sent anytime.
502 o PSIs MUST NOT be sent in an RTP session whose Parameter Sets were 503 already changed by control protocol messages during the lifetime 504 of the RTP session. If PSIs are allowed by this condition, they 505 MAY be sent at any time. 507 o All NALU types MAY be mixed freely, provided that the above 508 rules are obeyed. In particular, it is allowed to mix slices in 509 data-partitioned and single-slice mode. 511 o Network elements MAY convert multiple RTP packets carrying 512 individual NALUs into one aggregated RTP packet, convert an 513 aggregated RTP packet into several RTP packets carrying individual 514 NALUs, or mix both concepts. However, when doing so they SHOULD 515 take into account at least the following parameters: path MTU 516 size, unequal protection mechanisms (e.g. through packet 517 duplication, packet-based FEC carried by RFC2198, especially for 518 header and Type A Data Partitioning packets), bearable latency of 519 the system, and buffering capabilities of the receiver. 521 o NALUs of all types MAY be conveyed as aggregation units of an STAP 522 or MTAP rather than individual RTP packets. Special care SHOULD 523 be taken (particularly in gateways) to avoid more than a single 524 copy of identical NALUs in a single STAP/MTAP in order to avoid 525 unnecessary data transfers without any improvement in QoS. 527 6.2. Restricted Mode (Single Picture Model) 529 This mode MUST be supported by all receivers. It is primarily 530 intended for low delay applications. Its main difference from the 531 Unrestricted Mode is to forbid the packetization of data belonging 532 to more than one picture in a single RTP packet. Hence, MTAPs MUST 533 NOT be used. The following packetization rules MUST be enforced by 534 the sender: 536 o All rules of the Unrestricted Mode above, with the following 537 additions: 539 o only STAPs MAY be used, MTAPs MUST NOT be used.
This implies that 540 aggregated packets MUST NOT include slices or data partitions 541 belonging to different pictures. 543 7. De-Packetization Process 545 The de-packetization process is implementation dependent. Hence, 546 the following description should be seen as an example of a suitable 547 implementation. Other schemes MAY be used as well. Optimizations 548 relative to the described algorithms are likely possible. 550 The general concept behind these de-packetization rules is to 551 collect all packets belonging to a picture, bring them into a 552 reasonable order, discard anything that is unusable, and pass the 553 rest to the decoder. Aggregation packets are handled by unloading 554 their payload into individual RTP packets carrying NALUs. Those 555 NALUs are processed as if they were received in separate RTP 556 packets, in the order they were arranged in the Aggregation Packet. 558 The following de-packetization rules MAY be used to implement an 559 operational JVT de-packetizer: 561 o NALUs are presented to the JVT decoder in the order of the 562 RTP sequence number. 564 o NALUs carried in an Aggregation Packet are presented in their 565 order in the Aggregation packet. All NALUs of the Aggregation 566 packet are processed before the next RTP packet is processed. 568 o Intelligent RTP receivers (e.g. in Gateways) MAY identify lost 569 DPAs. If a lost DPA is found, the Gateway MAY decide not to send 570 the DPB and DPC partitions, as their information is meaningless 571 for the JVT Decoder. In this way a network element can reduce 572 network load by discarding useless packets, without parsing a 573 complex bit stream. 575 o Intelligent receivers MAY discard all packets that have the 576 Disposable Flag set. However, they SHOULD process those packets 577 if possible, because the user experience may suffer if the packets 578 are discarded. 580 8. MIME Considerations 582 This section is to be completed later. 584 9.
Security Considerations

   So far, no security considerations beyond those of RFC 1889 have been
   identified.

   Currently, the JVT CD does not allow carrying any type of active
   payload. However, the inclusion of a "user data" mechanism is under
   consideration, which could potentially be used for mechanisms such
   as remote software updates of the video decoder and similar tasks.

10. Informative Appendix: Application Examples

   This payload specification is very flexible in its use, to cover the
   extremely wide application space that is anticipated for the JVT
   codec. However, such great flexibility also makes it difficult
   for an implementer to decide on a reasonable packetization scheme.
   Some information on how to apply this specification to real-world
   scenarios is likely to appear in the form of academic publications
   and a Test Model in the near future. Nevertheless, some preliminary
   usage scenarios are described here as well.

10.1. Video Telephony, no Data Partitioning, no packet aggregation

   The RTP part of this scheme is implemented and tested (though not
   the control-protocol part, see below).

   In most real-world video telephony applications, the picture
   parameters such as picture size or optional modes never change
   during the lifetime of a connection. Hence, all necessary Parameter
   Sets (usually only one) are sent as a side effect of the capability
   exchange/announcement process. An example of such a capability
   exchange with an SDP-like syntax can be found in [9], but other
   schemes such as ASN.1 are possible as well. Since all necessary
   Parameter Set information is established before the RTP session
   starts, there is no need for sending any PSIs. Data Partitioning is
   not used either. Hence, the RTP packet stream consists basically of
   NALUs that carry single slices of video information.
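
   As an illustration of the packet stream described above, the
   following Python sketch emits one RTP packet per single-slice NALU
   of a coded picture and sets the marker bit on the last packet of the
   picture. It is a minimal sketch, not part of this specification: the
   function names, the payload type value 96, and the assumption that
   the NALU bytes are carried unchanged as the RTP payload are
   illustrative choices. Only the 12-byte fixed RTP header layout is
   taken from RFC 1889.

```python
import struct

RTP_VERSION = 2
PAYLOAD_TYPE = 96  # assumption: a dynamic payload type agreed out of band


def rtp_header(seq, timestamp, ssrc, marker):
    """Build the 12-byte fixed RTP header (no CSRC list, no extension)."""
    byte0 = RTP_VERSION << 6                   # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | PAYLOAD_TYPE  # M bit plus 7-bit payload type
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)


def packetize_picture(slice_nalus, first_seq, timestamp, ssrc):
    """Return one RTP packet per single-slice NALU of one coded picture.

    All packets share the picture's RTP timestamp; the marker bit is set
    only on the last packet of the picture.
    """
    packets = []
    last = len(slice_nalus) - 1
    for i, nalu in enumerate(slice_nalus):
        header = rtp_header(first_seq + i, timestamp, ssrc, marker=(i == last))
        packets.append(header + nalu)  # assumption: NALU carried as-is
    return packets
```

   A caller would invoke packetize_picture() once per coded picture,
   passing the next unused sequence number and the picture's 90 kHz
   timestamp. Sequence numbers wrap modulo 65536 in the header.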
   The size of those single-slice NALUs is chosen by the encoder such
   that they offer the best performance. Often, this is done by
   adapting the coded slice size to the MTU size of the IP network.
   For small picture sizes this may result in a
   one-picture-per-one-packet strategy. The loss of packets and the
   resulting drift-related artifacts are cleaned up by Intra refresh
   algorithms.

10.2. Video Telephony, Interleaved Packetization using Packet
      Aggregation

   This scheme allows better error concealment and is widely used in
   H.263-based designs using RFC 2429 packetization. It has also been
   implemented, and good results were reported [5].

   The source picture is coded by the VCL such that all MBs of one MB
   line are assigned to one slice. All slices with even MB row
   addresses are combined into one STAP, and all slices with odd MB row
   addresses into another STAP. Those STAPs are transmitted as RTP
   packets. The establishment of the Parameter Sets is performed as
   discussed above.

   Note that the use of STAPs is essential here, because the high
   number of individual slices (18 for a CIF picture) would lead to
   unacceptably high IP/UDP/RTP header overhead (unless the source
   coding tool FMO is used, which is not assumed in this scenario).
   Furthermore, some wireless video transmission systems, such as
   H.324M and the IP-based video telephony specified in 3GPP, are
   likely to use relatively small transport packet sizes. For example,
   a typical MTU size for an H.223 AL3 SDU is around 100 bytes [10].
   Coding individual slices according to this packetization scheme
   provides a further advantage in communication between wired and
   wireless networks, as individual slices are likely to be smaller
   than the preferred maximum packet size of wireless systems.
   Consequently, a gateway can convert the STAPs used in a wired
   network into several RTP packets, each carrying only one NALU, as
   preferred in a wireless network, and vice versa.

10.3. Video Telephony, with Data Partitioning

   This scheme is implemented and was shown to offer good performance,
   especially at higher packet loss rates [5].

   Data Partitioning is known to be useful only when some form of
   unequal error protection is available. Normally, in single-session
   RTP environments, even error characteristics are assumed --
   statistically, the packet loss probability of all packets of the
   session is the same. However, there are means to reduce the packet
   loss probability of individual packets in an RTP session. One
   simple way is known as Packet Duplication: simply send the
   to-be-protected packet twice, with the same sequence number. If both
   packets survive, the receiver will assume a packet duplication by
   UDP and discard one of the two packets. Other means of unequal
   protection within the same RTP session include the use of RFC 2198
   [11] (for this application it is essentially a packet duplication
   process as well, with some saved bytes for the second RTP header),
   or packet-based Forward Error Correction [12] carried in RFC 2198.

   The implemented software uses the simple packet duplication process
   to increase the delivery probability of all DPA NALUs. The incurred
   overhead is substantial, but in the same order of magnitude as the
   number of bits that would otherwise have to be spent on intra
   information. However, this mechanism does not add any delay to the
   system.

   Again, the complete Parameter Set establishment is performed through
   control protocol means.

10.4. MPEG-2 Transport to RTP Gateway

   This example is not implemented completely, but the basic mechanisms
   are part of the interim file format the JVT group uses and, hence,
   well tested.
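
   The gateway scenario in this section reuses the packet duplication
   process of Section 10.3 for unequal protection. A minimal Python
   sketch of that process, under stated assumptions, is given here. The
   packet representation (sequence number, partition kind, payload) and
   both function names are hypothetical and chosen for illustration
   only; sequence number wrap-around is ignored for brevity.

```python
def duplicate_protected(packets, protected_kinds):
    """Sender side: emit every packet once, protected kinds twice.

    Each packet is a (seq, kind, payload) tuple. The second copy reuses
    the same sequence number, as in simple Packet Duplication.
    """
    out = []
    for packet in packets:
        out.append(packet)
        if packet[1] in protected_kinds:
            out.append(packet)  # identical copy, same sequence number
    return out


def drop_duplicates(received):
    """Receiver side: keep the first packet seen per sequence number."""
    seen = set()
    unique = []
    for packet in received:
        seq = packet[0]
        if seq not in seen:
            seen.add(seq)
            unique.append(packet)
    return unique
```

   With protected_kinds set to {"DPA"}, only the DPA packets are sent
   twice. If one copy of a DPA packet is lost, the surviving copy
   still reaches the decoder; if both arrive, the receiver discards
   one of them.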
   When using JVT video in satellite/cable broadcast environments,
   there is no control protocol available that can be used for the
   transmission of Parameter Sets. Furthermore, a receiver has to be
   able to "tune" into an ongoing packet stream at any time, without
   much delay or artifacts. For this reason, PSIs that contain all
   Parameter Set information are included in the packet stream at each
   Instantaneous Decoder Refresh Point (which is similar to a Key Frame
   in earlier coding standards). IDERP packets are used to signal
   these "key frames" so that a decoder can most easily determine where
   to start in its decoding process.

   Since the byte stream format used in satellite/cable broadcast
   environments does not include timing information in the video
   stream, the gateway needs to use external timing information (e.g.
   from the MPEG-2 system layer) to generate the RTP timestamp. Please
   note that this timestamp is also based on a 90 kHz clock -- hence,
   in most cases, the conversion should be relatively simple.

   The simplest possible MPEG-2 transport to RTP gateway could take the
   NALUs as they come from the MPEG-2 transport stream (after
   de-framing), and send them, each NALU in one RTP packet, with
   increasing RTP sequence numbers. However, any non-zero packet loss
   rate would lead to very poor performance of such a system.
   A Gateway could instead use the protection mechanisms discussed
   above to unequally protect the most important packets, e.g. all PSIs
   (very strong protection) and IDERPs (weak protection), and transmit
   everything else best effort. The Gateway can do this without
   parsing the bit stream, by simply using the NALU type byte.
   A more sophisticated Gateway may be able to combine some small NALUs
   into a large STAP or MTAP in order to save the bytes used for the
   IP/UDP/RTP headers.

   A similar mechanism is, of course, also possible in H.320 to RTP
   gateways.
   Here, however, the system environment does not include
   any timing information, and exact presentation timing is carried in
   the form of SEIs. Hence, in the H.320 to IP data path, the gateway
   has the additional duty to filter out SEIs containing timing
   information and to set the RTP timestamp of the following video
   packets accordingly. In the reverse direction, SEIs need to be
   generated using the RTP timestamp as a guideline.

10.5. Low-Bit-Rate Streaming

   This scheme has been implemented with H.263 and gave good results
   [13]. There is no technical reason why similarly good results could
   not be achieved using the JVT codec.

   In today's Internet streaming, some of the offered bit-rates are
   relatively low in order to allow terminals with dial-up modems to
   access the content. In wired IP networks, relatively large packets,
   say 500 - 1500 bytes, are preferred to smaller and more frequently
   occurring packets in order to reduce network congestion. Moreover,
   use of large packets decreases the amount of RTP/UDP/IP header
   overhead. For low-bit-rate video, the use of large packets means
   that sometimes up to a few pictures should be encapsulated in one
   packet.

   However, loss of such a packet would have drastic consequences in
   visual quality, as there is practically no other way to conceal a
   loss of an entire picture than to repeat the previous one. One way
   to construct relatively large packets and maintain possibilities for
   successful loss concealment is to construct MTAPs that contain
   slices from several pictures in an interleaved manner. An MTAP
   should not contain spatially adjacent slices from the same picture
   or spatially overlapping slices from any picture. If a packet is
   lost, it is likely that a lost slice is surrounded by spatially
   adjacent slices of the same picture and spatially corresponding
   slices of the temporally previous and succeeding pictures.
   Consequently, concealment of the lost slice is likely to succeed
   relatively well.

11. Open Issues

   There are several open issues on which the authors would like to
   receive opinions. They are listed below.

   MTAPs: are they efficient enough? And, is a 16-bit unsigned offset
   to a 90 kHz timestamp enough? Input from the streaming industry is
   needed. One solution would be to create five different xTAP types,
   with 0, 8, 16, 24, and 32 bit timestamps per aggregation unit.
   Another option would be a more complex payload header that signals
   presence (and size) of the timing information per aggregation unit.

   Since JVT will likely be approved as the advanced video codec of
   MPEG-4, it may be desirable to align this payload specification with
   other payload specifications for MPEG-4. The authors of this I-D
   and some authors of the MPEG-4 packetization I-Ds are discussing the
   issue, and there is a chance that in the future changes to this I-D
   will be proposed to AVT to reflect the outcome of these discussions.

12. Full Copyright Statement

   Copyright (C) The Internet Society (2002). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works.
   However, this document itself may not be modified in any way, such
   as by removing the copyright notice or references to the Internet
   Society or other Internet organizations, except as needed for the
   purpose of developing Internet standards in which case the
   procedures for copyrights defined in the Internet Standards process
   must be followed, or as required to translate it into languages
   other than English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

13. Bibliography

   [1]  P. Borgwardt, "Handling Interlaced Video in H.26L", VCEG-N57r2,
        available from
        ftp://standard.pictel.com/video-site/0109_San/VCEG-N57r2.doc,
        September 2001
   [2]  JVT Joint Committee Draft, available from
        ftp://ftp.imtc-files.org/jvt-experts/2002_05_Fairfax/JVT-C167.doc
   [3]  ITU-T Recommendation H.263-2000
   [4]  ISO/IEC IS 14496-1
   [5]  S. Wenger, "H.26L over IP", IEEE Transactions on Circuits and
        Systems for Video Technology, to appear (April 2002)
   [6]  S. Wenger, "H.26L over IP: The IP Network Adaptation Layer",
        Proceedings Packet Video Workshop 02, April 2002, to appear.
   [7]  C. Bormann et al., "RTP Payload Format for the 1998 Version of
        ITU-T Rec. H.263 Video (H.263+)", RFC 2429, October 1998
   [8]  ISO/IEC IS 14496-2
   [9]  S. Wenger, T.
        Stockhammer, "H.26L over IP and H.324 Framework",
        VCEG-N52, available from
        ftp://standard.pictel.com/video-site/0109_San/VCEG-N52.doc,
        September 2001
   [10] ITU-T Recommendation H.223 (1999)
   [11] C. Perkins et al., "RTP Payload for Redundant Audio Data", RFC
        2198, September 1997
   [12] J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for
        Generic Forward Error Correction", RFC 2733, December 1999
   [13] V. Varsa, M. Karczewicz, "Slice interleaving in compressed video
        packetization", Packet Video Workshop 2000

Authors' Addresses

   Stephan Wenger                       Phone: +49-172-300-0813
   TU Berlin / Teles AG                 Email: stewe@cs.tu-berlin.de
   Franklinstr. 28-29
   D-10587 Berlin
   Germany

   Thomas Stockhammer                   Phone: +49-89-28923474
   Institute for Communications Eng.    Email: stockhammer@ei.tum.de
   Munich University of Technology
   D-80290 Munich
   Germany

   Miska M. Hannuksela                  Phone: +358 40 5212845
   Nokia Corporation                    Email: miska.hannuksela@nokia.com
   P.O. Box 68
   33721 Tampere
   Finland