idnits 2.17.1 draft-ietf-payload-rtp-h265-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** There are 153 instances of weird spacing in the document. Is it really formatted ragged-right, rather than justified? ** There are 11 instances of too long lines in the document, the longest one being 14 characters in excess of 72. ** The abstract seems to contain references ([HEVC]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == Line 27 has weird spacing: '... at any ti...' == Line 30 has weird spacing: '... The list ...' == Line 45 has weird spacing: '...fo) in effec...' == Line 46 has weird spacing: '...ication of t...' == Line 47 has weird spacing: '...ly, as they ...' == (148 more instances...) -- The document date (July 1, 2013) is 3945 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Missing Reference: '3GP' is mentioned on line 266, but not defined -- Looks like a reference, but probably isn't: '0' on line 958 == Missing Reference: 'RFC5117' is mentioned on line 2074, but not defined ** Obsolete undefined reference: RFC 5117 (Obsoleted by RFC 7667) == Missing Reference: 'RFC2326' is mentioned on line 2275, but not defined ** Obsolete undefined reference: RFC 2326 (Obsoleted by RFC 7826) == Missing Reference: 'RFC2974' is mentioned on line 2276, but not defined == Missing Reference: 'RFC5583' is mentioned on line 2320, but not defined == Missing Reference: 'RFC3551' is mentioned on line 2480, but not defined == Missing Reference: 'RFC3711' is mentioned on line 2480, but not defined == Missing Reference: 'RFC5124' is mentioned on line 2481, but not defined == Missing Reference: 'I-D.ietf-avt-srtp-not-mandatory' is mentioned on line 2483, but not defined == Missing Reference: 'I-D.ietf-avtcore-rtp-security-options' is mentioned on line 2490, but not defined == Missing Reference: 'RFC 3711' is mentioned on line 2506, but not defined == Missing Reference: 'RFC 3551' is mentioned on line 2530, but not defined == Unused Reference: 'RFC6051' is defined on line 2611, but no explicit reference was found in the text == Unused Reference: '3GPPFF' is defined on line 2651, but no explicit reference was found in the text == Unused Reference: 'RFC5109' is defined on line 2667, but no explicit reference was found in the text -- Possible downref: Non-RFC (?) normative reference: ref. 'HEVC' ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866) Summary: 6 errors (**), 0 flaws (~~), 23 warnings (==), 3 comments (--). Run idnits with the --verbose option for more detailed information about the items above. 
-------------------------------------------------------------------------------- 1 Network Working Group Y.-K. Wang 2 Internet Draft Qualcomm 3 Intended status: Standards track Y. Sanchez 4 Expires: January 2014 T. Schierl 5 Fraunhofer HHI 6 S. Wenger 7 Vidyo 8 M. M. Hannuksela 9 Nokia 10 July 1, 2013 12 RTP Payload Format for High Efficiency Video Coding 13 draft-ietf-payload-rtp-h265-00.txt 15 Status of this Memo 17 This Internet-Draft is submitted to IETF in full conformance with 18 the provisions of BCP 78 and BCP 79. 20 Internet-Drafts are working documents of the Internet Engineering 21 Task Force (IETF), its areas, and its working groups. Note that 22 other groups may also distribute working documents as Internet- 23 Drafts. 25 Internet-Drafts are draft documents valid for a maximum of six 26 months and may be updated, replaced, or obsoleted by other documents 27 at any time. It is inappropriate to use Internet-Drafts as 28 reference material or to cite them other than as "work in progress." 30 The list of current Internet-Drafts can be accessed at 31 http://www.ietf.org/ietf/1id-abstracts.txt. 33 The list of Internet-Draft Shadow Directories can be accessed at 34 http://www.ietf.org/shadow.html. 36 This Internet-Draft will expire on December 11, 2013. 38 Copyright and License Notice 40 Copyright (c) 2013 IETF Trust and the persons identified as the 41 document authors. All rights reserved. 43 This document is subject to BCP 78 and the IETF Trust's Legal 44 Provisions Relating to IETF Documents 45 (http://trustee.ietf.org/license-info) in effect on the date of 46 publication of this document. Please review these documents 47 carefully, as they describe your rights and restrictions with 48 respect to this document. Code Components extracted from this 49 document must include Simplified BSD License text as described in 50 Section 4.e of the Trust Legal Provisions and are provided without 51 warranty as described in the Simplified BSD License. 
53 Abstract 55 This memo describes an RTP payload format for the video coding 56 standard ITU-T Recommendation H.265 and ISO/IEC International 57 Standard 23008-2, both also known as High Efficiency Video Coding 58 (HEVC) [HEVC], developed by the Joint Collaborative Team on Video 59 Coding (JCT-VC). The RTP payload format allows for packetization of 60 one or more Network Abstraction Layer (NAL) units in each RTP packet 61 payload, as well as fragmentation of a NAL unit into multiple RTP 62 packets. Furthermore, it supports transmission of an HEVC stream 63 over a single as well as multiple RTP flows. The payload format has 64 wide applicability in videoconferencing, Internet video streaming, 65 and high bit-rate entertainment-quality video, among others. 67 Table of Contents 69 Status of this Memo...............................................1 70 Abstract..........................................................3 71 Table of Contents.................................................3 72 1 . Introduction..................................................5 73 1.1 . Overview of the HEVC Codec...............................5 74 1.1.1 Coding-Tool Features..................................5 75 1.1.2 Systems and Transport Interfaces......................7 76 1.1.3 Parallel Processing Support..........................13 77 1.1.4 NAL Unit Header......................................15 78 1.2 . Overview of the Payload Format..........................17 79 2 . Conventions..................................................17 80 3 . Definitions and Abbreviations................................17 81 3.1 Definitions...............................................17 82 3.1.1 Definitions from the HEVC Specification..............18 83 3.1.2 Definitions Specific to This Memo....................19 84 3.2 Abbreviations.............................................20 85 4 . 
RTP Payload Format...........................................22 86 4.1 RTP Header Usage..........................................22 87 4.2 Payload Structures........................................23 88 4.3 Transmission Modes........................................24 89 4.4 Decoding Order Number.....................................25 90 4.5 Single NAL Unit Packets...................................27 91 4.6 Aggregation Packets (APs).................................27 92 4.7 Fragmentation Units (FUs).................................32 93 5 . Packetization Rules..........................................36 94 6 . De-packetization Process.....................................37 95 7 . Payload Format Parameters....................................38 96 7.1 Media Type Registration...................................39 97 7.2 SDP Parameters............................................52 98 7.2.1 Mapping of Payload Type Parameters to SDP............53 99 7.2.2 Usage with SDP Offer/Answer Model....................54 100 7.2.3 Usage in Declarative Session Descriptions............58 101 7.2.4 Dependency Signaling in Multi-Session Transmission...60 102 8 . Use with Feedback Messages...................................60 103 8.1 Definition of the SPLI Feedback Message...................62 104 8.2 Use of HEVC with the RPSI Feedback Message................63 105 8.3 Use of HEVC with the SPLI Feedback Message................63 106 9 . Security Considerations......................................63 107 10 . Congestion Control..........................................65 108 11 . IANA Consideration..........................................66 109 12 . Acknowledgements............................................66 110 13 . References..................................................66 111 13.1 Normative References.....................................66 112 13.2 Informative References...................................67 113 14 . 
Authors' Addresses..........................................68 115 1. Introduction 117 1.1. Overview of the HEVC Codec 119 High Efficiency Video Coding [HEVC], formally known as ITU-T 120 Recommendation H.265 and ISO/IEC International Standard 23008-2, was 121 ratified by ITU-T in April 2013 and reportedly provides significant 122 coding efficiency gains over H.264 [H.264]. 124 As both H.264 [H.264] and its RTP payload format [RFC6184] are 125 widely deployed and generally known in the relevant implementer 126 communities, frequently only the differences between those two 127 specifications are highlighted in non-normative, explanatory parts 128 of this memo. Basic familiarity with both specifications is assumed 129 for those parts. However, the normative parts of this memo do not 130 require study of H.264 or its RTP payload format. 132 H.264 and HEVC share a similar hybrid video codec design. 133 Conceptually, both technologies include a video coding layer (VCL), 134 which is often used to refer to the coding-tool features, and a 135 network abstraction layer (NAL), which is often used to refer to the 136 systems and transport interface aspects of the codecs. 138 1.1.1 Coding-Tool Features 140 Similarly to earlier hybrid-video-coding-based standards, including 141 H.264, the following basic video coding design is employed by HEVC. 142 A prediction signal is first formed by either intra or motion- 143 compensated prediction, and the residual (the difference between the 144 original and the prediction) is then coded. The gains in coding 145 efficiency are achieved by redesigning and improving almost all 146 parts of the codec over earlier designs. In addition, HEVC includes 147 several tools to make the implementation on parallel architectures 148 easier. Below is a summary of HEVC coding-tool features. 
150 Quad-tree block and transform structure 152 One of the major tools that contribute significantly to the coding 153 efficiency of HEVC is the usage of flexible coding blocks and 154 transforms, which are defined in a hierarchical quad-tree manner. 155 Unlike H.264, where the basic coding block is a macroblock of fixed 156 size 16x16, HEVC defines a Coding Tree Unit (CTU) of a maximum size 157 of 64x64. Each CTU can be divided into smaller units in a 158 hierarchical quad-tree manner and can represent smaller blocks down 159 to size 4x4. Similarly, the transforms used in HEVC can have 160 different sizes, starting from 4x4 and going up to 32x32. Utilizing 161 large blocks and transforms contributes to the major gain of HEVC, 162 especially at high resolutions. 164 Entropy coding 166 HEVC uses a single entropy coding engine, which is based on Context 167 Adaptive Binary Arithmetic Coding (CABAC), whereas H.264 uses two 168 distinct entropy coding engines. CABAC in HEVC shares many 169 similarities with CABAC of H.264, but contains several improvements. 170 Those include improvements in coding efficiency and lowered 171 implementation complexity, especially for parallel architectures. 173 In-loop filtering 175 H.264 includes an in-loop adaptive deblocking filter, where the 176 blocking artifacts around the transform edges in the reconstructed 177 picture are smoothed to improve the picture quality and compression 178 efficiency. In HEVC, a similar deblocking filter is employed but 179 with somewhat lower complexity. In addition, pictures undergo a 180 subsequent filtering operation called Sample Adaptive Offset (SAO), 181 which is a new design element in HEVC. SAO basically adds a pixel- 182 level offset in an adaptive manner and usually acts as a de-ringing 183 filter. It is observed that SAO improves the picture quality, 184 especially around sharp edges, contributing substantially to visual 185 quality improvements of HEVC. 
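As a non-normative illustration of the quad-tree block structure summarized earlier in this subsection, the following Python sketch recursively divides one 64x64 CTU into leaf blocks. The function name quadtree_leaves, the should_split callback, and the example split rule are hypothetical and chosen only for illustration; a real encoder makes split decisions (typically by rate-distortion optimization) and signals them in the bitstream, and the exact minimum sizes allowed for each block type are defined in [HEVC]. The default minimum of 4 simply mirrors the "down to size 4x4" wording above.

```python
def quadtree_leaves(x, y, size, should_split, min_size=4):
    """Return the (x, y, size) leaf blocks of a quad-tree rooted at one CTU."""
    if size > min_size and should_split(x, y, size):
        half = size // 2
        leaves = []
        # Visit the four quadrants: top-left, top-right, bottom-left, bottom-right.
        for dy in (0, half):
            for dx in (0, half):
                leaves.extend(
                    quadtree_leaves(x + dx, y + dy, half, should_split, min_size)
                )
        return leaves
    return [(x, y, size)]


def split_top_left(x, y, size):
    """Hypothetical rule: keep splitting only the block at the CTU origin."""
    return x == 0 and y == 0


leaves = quadtree_leaves(0, 0, 64, split_top_left)
for leaf in leaves:
    print(leaf)
```

Whatever split rule is used, the leaves always tile the CTU exactly, so their areas sum to 64 * 64 samples.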
187 Motion prediction and coding 189 There have been a number of improvements in this area that are 190 summarized as follows. The first category is motion merge and 191 advanced motion vector prediction (AMVP) modes. The motion 192 information of a prediction block can be inferred from the spatially 193 or temporally neighboring blocks. This is similar to the DIRECT 194 mode in H.264 but includes new aspects to incorporate the flexible 195 quad-tree structure and methods to improve the parallel 196 implementations. In addition, the motion vector predictor can be 197 signaled for improved efficiency. The second category is high- 198 precision interpolation. The interpolation filter length is 199 increased to 8-tap from 6-tap, which improves the coding efficiency 200 but also comes with increased complexity. In addition, the 201 interpolation filter is defined with higher precision without any 202 intermediate rounding operations to further improve the coding 203 efficiency. 205 Intra prediction and intra coding 207 Compared to 8 intra prediction modes in H.264, HEVC supports angular 208 intra prediction with 33 directions. This increased flexibility 209 improves both objective coding efficiency and visual quality as the 210 edges can be better predicted and ringing artifacts around the edges 211 can be reduced. In addition, the reference samples are adaptively 212 smoothed based on the prediction direction. To avoid contouring 213 artifacts a new interpolative prediction generation is included to 214 improve the visual quality. Furthermore, discrete sine transform 215 (DST) is utilized instead of traditional discrete cosine transform 216 (DCT) for 4x4 intra transform blocks. 218 Other coding-tool features 220 HEVC includes some tools for lossless coding and efficient screen 221 content coding, such as skipping the transform for certain blocks. 
222 These tools are particularly useful for example when streaming the 223 user-interface of a mobile device to a large display. 225 1.1.2 Systems and Transport Interfaces 227 HEVC inherited the basic systems and transport interfaces designs, 228 such as the NAL-unit-based syntax structure, the hierarchical syntax 229 and data unit structure from sequence-level parameter sets, multi- 230 picture-level or picture-level parameter sets, slice-level header 231 parameters, lower-level parameters, the supplemental enhancement 232 information (SEI) message mechanism, the hypothetical reference 233 decoder (HRD) based video buffering model, and so on. In the 234 following, a list of differences in these aspects compared to H.264 235 is summarized. 237 Video parameter set 239 A new type of parameter set, called video parameter set (VPS), was 240 introduced. For the first (2013) version of [HEVC], the video 241 parameter set NAL unit is required to be available prior to its 242 activation, while the information contained in the video parameter 243 set is not necessary for operation of the decoding process. For 244 future HEVC extensions, such as the 3D or scalable extensions, the 245 video parameter set is expected to include information necessary for 246 operation of the decoding process, e.g. decoding dependency or 247 information for reference picture set construction of enhancement 248 layers. The VPS provides a "big picture" of a bitstream, including 249 what types of operation points are provided, the profile, tier, and 250 level of the operation points, and some other high-level properties 251 of the bitstream that can be used as the basis for session 252 negotiation and content selection, etc. (see section 7.1). 
254 Profile, tier and level 256 The profile, tier and level syntax structure that can be included in 257 both VPS and sequence parameter set (SPS) includes 12 bytes data to 258 describe the entire bitstream (including all temporally scalable 259 layers, which are referred to as sub-layers in the HEVC 260 specification), and can optionally include more profile, tier and 261 level information pertaining to individual temporally scalable 262 layers. The profile indicator indicates the "best viewed as" 263 profile when the bitstream conforms to multiple profiles, similar to 264 the major brand concept in the ISO base media file format (ISOBMFF) 265 [ISOBMFF] and file formats derived based on ISOBMFF, such as the 266 3GPP file format [3GP]. The profile, tier and level syntax 267 structure also includes the indications of whether the bitstream is 268 free of frame-packed content, whether the bitstream is free of 269 interlaced source content and free of field pictures, i.e., contains 270 only frame pictures of progressive source, such that clients/players 271 with no support of post-processing functionalities for handling of 272 frame-packed or interlaced source content or field pictures can 273 reject those bitstreams. 275 Bitstream and elementary stream 277 HEVC includes a definition of an elementary stream, which is new 278 compared to H.264. An elementary stream consists of a sequence of 279 one or more bitstreams. An elementary stream that consists of two 280 or more bitstreams has typically been formed by splicing together 281 two or more bitstreams (or parts thereof). When an elementary 282 stream contains more than one bitstream, the last NAL unit of the 283 last access unit of a bitstream (except the last bitstream in the 284 elementary stream) must contain an end of bitstream NAL unit and the 285 first access unit of the subsequent bitstream must be an intra 286 random access point (IRAP) access unit. 
This IRAP access unit may 287 be a clean random access (CRA), broken link access (BLA), or 288 instantaneous decoding refresh (IDR) access unit. 290 Random access support 292 HEVC includes signaling in the NAL unit header, through NAL unit types, 293 of IRAP pictures beyond IDR pictures. Three types of IRAP pictures, 294 namely IDR, CRA and BLA pictures, are supported, wherein IDR pictures 295 are conventionally referred to as closed group-of-pictures (closed- 296 GOP) random access points, and CRA and BLA pictures are those 297 conventionally referred to as open-GOP random access points. BLA 298 pictures usually originate from splicing of two bitstreams or parts 299 thereof at a CRA picture, e.g. during stream switching. To enable 300 better systems usage of IRAP pictures, altogether six different NAL 301 unit types are defined to signal the properties of the IRAP pictures, 302 which can be used to better match the stream access point (SAP) 303 types as defined in the ISOBMFF [ISOBMFF], which are utilized for 304 random access support in both 3GP-DASH [3GPDASH] and MPEG DASH 305 [MPEGDASH]. Pictures following an IRAP picture in decoding order 306 and preceding the IRAP picture in output order are referred to as 307 leading pictures associated with the IRAP picture. There are two 308 types of leading pictures, namely random access decodable leading 309 (RADL) pictures and random access skipped leading (RASL) pictures. 310 RADL pictures are decodable when the decoding starts at the 311 associated IRAP picture, and RASL pictures are not decodable when 312 the decoding starts at the associated IRAP picture and are usually 313 discarded. HEVC provides mechanisms to enable the specification of 314 conformance of bitstreams with RASL pictures being discarded, thus 315 providing a standard-compliant way to enable systems components to 316 discard RASL pictures when needed. 
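The handling of leading pictures described above can be sketched in Python as follows. This is a non-normative illustration under stated assumptions: the numeric NAL unit type values are not given in this memo and are taken here from Table 7-1 of the published H.265 text as the editor understands it, so they should be verified against [HEVC]; the function name drop_undecodable_rasl is hypothetical; and each list element stands for one coded picture, whereas a real picture may consist of several NAL units.

```python
from enum import IntEnum


class NalType(IntEnum):
    # Assumed values from Table 7-1 of [HEVC]; verify before relying on them.
    TRAIL_R = 1
    RADL_N = 6
    RADL_R = 7
    RASL_N = 8
    RASL_R = 9
    BLA_W_LP = 16
    BLA_W_RADL = 17
    BLA_N_LP = 18
    IDR_W_RADL = 19
    IDR_N_LP = 20
    CRA_NUT = 21


# The six IRAP NAL unit types mentioned in the text.
IRAP_TYPES = frozenset(range(16, 22))
BLA_TYPES = frozenset({NalType.BLA_W_LP, NalType.BLA_W_RADL, NalType.BLA_N_LP})
RASL_TYPES = frozenset({NalType.RASL_N, NalType.RASL_R})


def drop_undecodable_rasl(nal_types):
    """Drop RASL pictures that follow the first IRAP picture or any BLA picture.

    nal_types lists one NAL unit type per picture in decoding order,
    starting at the point where decoding begins.
    """
    kept = []
    first_irap = True
    discarding = False
    for t in nal_types:
        if t in IRAP_TYPES:
            # RASL pictures are not decodable after the IRAP picture where
            # decoding starts, nor after a BLA picture (a splice point).
            discarding = first_irap or t in BLA_TYPES
            first_irap = False
        if discarding and t in RASL_TYPES:
            continue
        kept.append(t)
    return kept


stream = [NalType.CRA_NUT, NalType.RASL_N, NalType.RADL_R, NalType.TRAIL_R,
          NalType.CRA_NUT, NalType.RASL_R, NalType.TRAIL_R]
kept = drop_undecodable_rasl(stream)
print([t.name for t in kept])
```

In this example only the RASL picture associated with the first CRA picture is removed; the RASL picture after the second CRA picture is kept because its references are available when decoding did not start there.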
318 Temporal scalability support 320 HEVC includes an improved support of temporal scalability, by 321 inclusion of the signaling of TemporalId in the NAL unit header, the 322 restriction that pictures of a particular temporal sub-layer cannot 323 be used for inter prediction reference by pictures of a higher 324 temporal sub-layer, the sub-bitstream extraction process, and the 325 requirement that each sub-bitstream extraction output be a 326 conforming bitstream. Media-aware network elements (MANEs) can 327 utilize the TemporalId in the NAL unit header for stream adaptation 328 purposes based on temporal scalability. 330 Temporal sub-layer switching support 332 HEVC specifies, through NAL unit types present in the NAL unit 333 header, the signaling of temporal sub-layer access (TSA) and 334 stepwise temporal sub-layer access (STSA). A TSA picture and 335 pictures following the TSA picture in decoding order do not use 336 pictures prior to the TSA picture in decoding order with TemporalId 337 greater than or equal to that of the TSA picture for inter 338 prediction reference. A TSA picture enables up-switching, at the 339 TSA picture, to the sub-layer containing the TSA picture or any 340 higher sub-layer, from the immediately lower sub-layer. An STSA 341 picture does not use pictures with the same TemporalId as the STSA 342 picture for inter prediction reference. Pictures following an STSA 343 picture in decoding order with the same TemporalId as the STSA 344 picture do not use pictures prior to the STSA picture in decoding 345 order with the same TemporalId as the STSA picture for inter 346 prediction reference. An STSA picture enables up-switching, at the 347 STSA picture, to the sub-layer containing the STSA picture, from the 348 immediately lower sub-layer. 350 Sub-layer reference or non-reference pictures 352 The concept and signaling of reference/non-reference pictures in 353 HEVC are different from H.264. 
In H.264, if a picture may be used 354 by any other picture for inter prediction reference, it is a 355 reference picture; otherwise it is a non-reference picture, and this 356 is signaled by two bits in the NAL unit header. In HEVC, a picture 357 is called a reference picture only when it is marked as "used for 358 reference". In addition, the concept of a sub-layer reference picture 359 was introduced. If a picture may be used by another picture 360 with the same TemporalId for inter prediction reference, it is a 361 sub-layer reference picture; otherwise it is a sub-layer non- 362 reference picture. Whether a picture is a sub-layer reference 363 picture or sub-layer non-reference picture is signaled through NAL 364 unit type values. 366 Extensibility 368 Besides the TemporalId in the NAL unit header, HEVC also includes 369 the signaling of a six-bit layer ID in the NAL unit header, which 370 must be equal to 0 for a single-layer bitstream. Extension 371 mechanisms have been included in VPS, SPS, PPS, SEI NAL units, slice 372 headers, and so on. All these extension mechanisms enable future 373 extensions in a backward compatible manner, such that bitstreams 374 encoded according to potential future HEVC extensions can be fed to 375 then-legacy decoders (e.g. HEVC version 1 decoders) and the then- 376 legacy decoders can decode and output the base layer bitstream. 378 Bitstream extraction 380 HEVC includes a bitstream extraction process as an integral part of 381 the overall decoding process, as well as specification of the use of 382 the bitstream extraction process in the description of bitstream 383 conformance tests as part of the hypothetical reference decoder 384 (HRD) specification. 386 Reference picture management 388 The reference picture management of HEVC, including reference 389 picture marking and removal from the decoded picture buffer (DPB) as 390 well as reference picture list construction (RPLC), differs from 391 that of H.264. 
Instead of the sliding window plus adaptive memory 392 management control operation (MMCO) based reference picture marking 393 mechanism in H.264, HEVC specifies a reference picture set (RPS) 394 based reference picture management and marking mechanism, and the 395 RPLC is consequently based on the RPS mechanism. A reference 396 picture set is a set of reference pictures associated with 397 a picture, consisting of all reference pictures that are prior to 398 the associated picture in decoding order and that may be used for inter 399 prediction of the associated picture or any picture following the 400 associated picture in decoding order. The reference picture set 401 consists of five lists of reference pictures: RefPicSetStCurrBefore, 402 RefPicSetStCurrAfter, RefPicSetStFoll, RefPicSetLtCurr and 403 RefPicSetLtFoll. RefPicSetStCurrBefore, RefPicSetStCurrAfter and 404 RefPicSetLtCurr contain all reference pictures that may be used in 405 inter prediction of the current picture and that may be used in 406 inter prediction of one or more of the pictures following the 407 current picture in decoding order. RefPicSetStFoll and 408 RefPicSetLtFoll consist of all reference pictures that are not used 409 in inter prediction of the current picture but may be used in inter 410 prediction of one or more of the pictures following the current 411 picture in decoding order. RPS provides an "intra-coded" signaling 412 of the DPB status, instead of an "inter-coded" signaling, mainly for 413 improved error resilience. The RPLC process in HEVC is based on the 414 RPS, by signaling an index to an RPS subset for each reference 415 index. The RPLC process has been simplified compared to that in 416 H.264, by removal of the reference picture list modification (also 417 referred to as reference picture list reordering) process. 419 Ultra low delay support 421 HEVC specifies a sub-picture-level HRD operation, for support of the 422 so-called ultra-low delay. 
The mechanism specifies a standard- 423 compliant way to enable delay reduction below one picture interval. 424 Sub-picture-level coded picture buffer (CPB) and DPB parameters may 425 be signaled, and the use of this information for the derivation 426 of CPB timing (wherein the CPB removal time corresponds to decoding 427 time) and DPB output timing (display time) is specified. Decoders 428 are allowed to operate the HRD at the conventional access-unit- 429 level, even when the sub-picture-level HRD parameters are present. 431 New SEI messages 433 HEVC inherits many H.264 SEI messages with changes in syntax and/or 434 semantics making them applicable to HEVC. The active parameter sets 435 SEI message includes the IDs of the active video parameter set and 436 the active sequence parameter set and can be used to activate VPSs 437 and SPSs. In addition, the SEI message includes the following 438 indications: 1) An indication of whether "full random accessibility" 439 is supported (when supported, all parameter sets needed for decoding 440 of the remainder of the bitstream when random accessing from the 441 beginning of the current coded video sequence by completely 442 discarding all access units earlier in decoding order are present in 443 the remaining bitstream and all coded pictures in the remaining 444 bitstream can be correctly decoded); 2) An indication of whether 445 there is no parameter set within the current coded video sequence 446 that updates another parameter set of the same type preceding in 447 decoding order. An update of a parameter set refers to the use of 448 the same parameter set ID but with some other parameters changed. 449 If this property is true for all coded video sequences in the 450 bitstream, then all parameter sets can be sent out-of-band before 451 session start. 
The region refresh information SEI message can be 452 used together with the recovery point SEI message (present in both 453 H.264 and HEVC) for improved support of gradual decoding refresh 454 (GDR). This supports random access from inter-coded pictures, 455 wherein complete pictures can be correctly decoded or recovered 456 after an indicated number of pictures in output/display order. 458 1.1.3 Parallel Processing Support 460 The reportedly significantly higher computational demand of HEVC 461 over H.264 (especially with respect to encoders, where a complexity 462 increase of a factor of ten has often been reported), in conjunction 463 with the ever increasing video resolution (both spatially and 464 temporally) required by the market, led to the adoption of VCL 465 coding tools specifically targeted to allow for parallelization on 466 the sub-picture level. That is, parallelization occurs, at the 467 minimum, at the granularity of an integer number of CTUs. The 468 targets for this type of high-level parallelization are multicore 469 CPUs and DSPs as well as multiprocessor systems. In a system 470 design, to be useful, these tools require signaling support, which 471 is provided in Section 7 of this memo. This section provides a 472 brief overview of the tools available in [HEVC]. 474 Many of the tools incorporated in HEVC were designed keeping in mind 475 the potential parallel implementations in multi-core/multi-processor 476 architectures. Specifically, for parallelization, four picture 477 partition strategies are available. 479 Slices are segments of the bitstream that can be reconstructed 480 independently from other slices within the same picture (though 481 there may still be interdependencies through loop filtering 482 operations). Slices are the only tool that can be used for 483 parallelization that is also available, in virtually identical form, 484 in H.264. 
Slice-based parallelization does not require much inter- 485 processor or inter-core communication (except for inter-processor or 486 inter-core data sharing for motion compensation when decoding a 487 predictively coded picture, which is typically much heavier than 488 inter-processor or inter-core data sharing due to in-picture 489 prediction), as slices are designed to be independently decodable. 490 However, for the same reason, slices can require some coding 491 overhead. Further, slices (in contrast to some of the other tools 492 mentioned below) also serve as the key mechanism for bitstream 493 partitioning to match Maximum Transfer Unit (MTU) size requirements, 494 due to the in-picture independence of slices and the fact that each 495 regular slice is encapsulated in its own NAL unit. In many cases, 496 the goal of parallelization and the goal of MTU size matching can 497 place contradicting demands on the slice layout in a picture. The 498 realization of this situation led to the development of the more 499 advanced tools mentioned below. This payload format does not 500 contain any specific mechanisms aiding parallelization through 501 slices. 503 Dependent slice segments allow for fragmentation of a coded slice 504 into fragments at CTU boundaries without breaking any in-picture 505 prediction mechanisms. They are complementary to the fragmentation 506 mechanism described in this memo in that they need the cooperation 507 of the encoder. As a dependent slice segment necessarily contains 508 an integer number of CTUs, a decoder using multiple cores operating 509 on CTUs can process a dependent slice segment without communicating 510 parts of the slice segment's bitstream to other cores. 511 Fragmentation, as specified in this memo, in contrast, does not 512 guarantee that a fragment contains an integer number of CTUs. 514 In wavefront parallel processing (WPP), the picture is partitioned 515 into rows of CTUs. 
Entropy decoding and prediction are allowed to 516 use data from CTUs in other partitions. Parallel processing is 517 possible through parallel decoding of CTU rows, where the start of 518 the decoding of a row is delayed by two CTUs, so to ensure that data 519 related to a CTU above and to the right of the subject CTU is 520 available before the subject CTU is being decoded. Using this 521 staggered start (which appears like a wavefront when represented 522 graphically), parallelization is possible with up to as many 523 processors/cores as the picture contains CTU rows. 525 Because in-picture prediction between neighboring CTU rows within a 526 picture is allowed, the required inter-processor/inter-core 527 communication to enable in-picture prediction can be substantial. 528 The WPP partitioning does not result in the creation of more NAL 529 units compared to when it is not applied, thus WPP cannot be used 530 for MTU size matching, though slices can be used in combination for 531 that purpose. 533 Tiles define horizontal and vertical boundaries that partition a 534 picture into tile columns and rows. The scan order of CTUs is 535 changed to be local within a tile (in the order of a CTU raster scan 536 of a tile), before decoding the top-left CTU of the next tile in the 537 order of tile raster scan of a picture. Similar to slices, tiles 538 break in-picture prediction dependencies (including entropy decoding 539 dependencies). However, they do not need to be included into 540 individual NAL units (same as WPP in this regard), hence tiles 541 cannot be used for MTU size matching, though slices can be used in 542 combination for that purpose. 
Each tile can be processed by one 543 processor/core, and the inter-processor/inter-core communication 544 required for in-picture prediction between processing units decoding 545 neighboring tiles is limited to conveying the shared slice header in 546 cases a slice is spanning more than one tile, and loop filtering 547 related sharing of reconstructed samples and metadata. Insofar, 548 tiles are less demanding in terms of inter-processor communication 549 bandwidth compared to WPP due to the in-picture independence between 550 two neighboring partitions. 552 1.1.4 NAL Unit Header 554 HEVC maintains the NAL unit concept of H.264 with modifications. 555 HEVC uses a two-byte NAL unit header, as shown in Figure 1. The 556 payload of a NAL unit refers to the NAL unit excluding the NAL unit 557 header. 559 +---------------+---------------+ 560 |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7| 561 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 562 |F| Type | LayerId | TID | 563 +-------------+-----------------+ 565 Figure 1 The structure of HEVC NAL unit header 567 The semantics of the fields in the NAL unit header are as specified 568 in [HEVC] and described briefly below for convenience. In addition 569 to the name and size of each field, the corresponding syntax element 570 name in [HEVC] is also provided. 572 F: 1 bit 573 forbidden_zero_bit. MUST be zero. HEVC declares a value of 1 as 574 a syntax violation. Note that the inclusion of this bit in the 575 NAL unit header is to enable transport of HEVC video over MPEG-2 576 transport systems (avoidance of start code emulations) [MPEG2S]. 578 Type: 6 bits 579 nal_unit_type. This field specifies the NAL unit type as defined 580 in Table 7-1 of [HEVC]. For a reference of all currently defined 581 NAL unit types and their semantics, please refer to Section 7.4.1 582 in [HEVC]. 584 LayerId: 6 bits 585 nuh_layer_id. MUST be equal to zero. 
It is anticipated that in 586 future scalable or 3D video coding extensions of this 587 specification, this syntax element will be used to identify 588 additional layers that may be present in the coded video 589 sequence, wherein a layer may be, e.g. a spatial scalable layer, 590 a quality scalable layer, a texture view, or a depth view. 592 TID: 3 bits 593 nuh_temporal_id_plus1. This field specifies the temporal 594 identifier of the NAL unit plus 1. The value of TemporalId is 595 equal to TID minus 1. A TID value of 0 is illegal to ensure that 596 there is at least one bit in the NAL unit header equal to 1, so 597 to enable independent considerations of start code emulations in 598 the NAL unit header and in the NAL unit payload data. 600 1.2. Overview of the Payload Format 602 This payload format defines the following processes required for 603 transport of HEVC coded data over RTP [RFC3550]: 605 o Usage of RTP header with this payload format 607 o Packetization of HEVC coded NAL units into RTP packets using three 608 types of payload structures, namely single NAL unit packet, 609 aggregation packet, and fragment unit 611 o Transmission of HEVC NAL units of the same bitstream within a 612 single RTP session or multiple RTP sessions 614 o Media type parameters to be used with the Session Description 615 Protocol (SDP) [RFC4566] 617 2. Conventions 619 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 620 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 621 document are to be interpreted as described in BCP 14, RFC 2119 622 [RFC2119]. 624 This specification uses the notion of setting and clearing a bit 625 when bit fields are handled. Setting a bit is the same as assigning 626 that bit the value of 1 (On). Clearing a bit is the same as 627 assigning that bit the value of 0 (Off). 629 3. Definitions and Abbreviations 631 3.1 Definitions 633 This document uses the terms and definitions of [HEVC]. 
Section 634 3.1.1 lists relevant definitions copied from [HEVC] for convenience. 635 Section 3.1.2 gives definitions specific to this memo. 637 3.1.1 Definitions from the HEVC Specification 639 access unit: A set of NAL units that are associated with each other 640 according to a specified classification rule, are consecutive in 641 decoding order, and contain exactly one coded picture. 643 BLA access unit: An access unit in which the coded picture is a BLA 644 picture. 646 BLA picture: An IRAP picture for which each VCL NAL unit has 647 nal_unit_type equal to BLA_W_LP, BLA_W_RADL, or BLA_N_LP. 649 coded video sequence: A sequence of access units that consists, in 650 decoding order, of an IRAP access unit with NoRaslOutputFlag equal 651 to 1, followed by zero or more access units that are not IRAP access 652 units with NoRaslOutputFlag equal to 1, including all subsequent 653 access units up to but not including any subsequent access unit that 654 is an IRAP access unit with NoRaslOutputFlag equal to 1. 656 Informative note: An IRAP access unit may be an IDR access unit, 657 a BLA access unit, or a CRA access unit. The value of 658 NoRaslOutputFlag is equal to 1 for each IDR access unit, each BLA 659 access unit, and each CRA access unit that is the first access 660 unit in the bitstream in decoding order, is the first access unit 661 that follows an end of sequence NAL unit in decoding order, or 662 has HandleCraAsBlaFlag equal to 1. 664 CRA access unit: An access unit in which the coded picture is a CRA 665 picture. 667 CRA picture: A RAP picture for which each slice has nal_unit_type 668 equal to CRA_NUT. 670 IDR access unit: An access unit in which the coded picture is an IDR 671 picture. 673 IDR picture: A RAP picture for which each slice has nal_unit_type 674 equal to IDR_W_RADL or IDR_N_LP. 676 IRAP access unit: An access unit in which the coded picture is an 677 IRAP picture. 
679 IRAP picture: A coded picture for which each VCL NAL unit has 680 nal_unit_type in the range of BLA_W_LP to RSV_IRAP_VCL23, inclusive. 682 layer: A set of VCL NAL units that all have a particular value of 683 nuh_layer_id and the associated non-VCL NAL units, or one of a set 684 of syntactical structures having a hierarchical relationship. 686 operation point: bitstream created from another bitstream by 687 operation of the sub-bitstream extraction process with the another 688 bitstream, a target highest TemporalId, and a target layer 689 identifier list as inputs. 691 random access: The act of starting the decoding process for a 692 bitstream at a point other than the beginning of the stream. 694 sub-layer: A temporal scalable layer of a temporal scalable 695 bitstream consisting of VCL NAL units with a particular value of the 696 TemporalId variable, and the associated non-VCL NAL units. 698 tile: A rectangular region of coding tree blocks within a particular 699 tile column and a particular tile row in a picture. 701 tile column: A rectangular region of coding tree blocks having a 702 height equal to the height of the picture and a width specified by 703 syntax elements in the picture parameter set. 705 tile row: A rectangular region of coding tree blocks having a height 706 specified by syntax elements in the picture parameter set and a 707 width equal to the width of the picture. 709 3.1.2 Definitions Specific to This Memo 711 media aware network element (MANE): A network element, such as a 712 middlebox or application layer gateway that is capable of parsing 713 certain aspects of the RTP payload headers or the RTP payload and 714 reacting to their contents. 716 Informative note: The concept of a MANE goes beyond normal 717 routers or gateways in that a MANE has to be aware of the 718 signaling (e.g., to learn about the payload type mappings of the 719 media streams), and in that it has to be trusted when working 720 with SRTP. 
The advantage of using MANEs is that they allow 721 packets to be dropped according to the needs of the media coding. 722 For example, if a MANE has to drop packets due to congestion on a 723 certain link, it can identify and remove those packets whose 724 elimination produces the least adverse effect on the user 725 experience. After dropping packets, MANEs must rewrite RTCP 726 packets to match the changes to the RTP packet stream as 727 specified in Section 7 of [RFC3550]. 729 NAL unit decoding order: A NAL unit order that conforms to the 730 constraints on NAL unit order given in Section 7.4.2.4 in [HEVC]. 732 NALU-time: The value that the RTP timestamp would have if the NAL 733 unit would be transported in its own RTP packet. 735 RTP packet stream: A sequence of RTP packets with increasing 736 sequence numbers (except for wrap-around), identical PT and 737 identical SSRC (Synchronization Source), carried in one RTP session. 738 Within the scope of this memo, one RTP packet stream is utilized to 739 transport one or more temporal sub-layers. 741 transmission order: The order of packets in ascending RTP sequence 742 number order (in modulo arithmetic). Within an aggregation packet, 743 the NAL unit transmission order is the same as the order of 744 appearance of NAL units in the packet. 746 base session: an RTP session in Multi-Session Transmission mode that 747 transports a bitstream subset which the rest of RTP sessions in the 748 Multi-Session Transmission depends on. [Ed. (YK): Check the need of 749 this definition after the draft is more complete.] 
3.2 Abbreviations

   AP       Aggregation Packet
   BLA      Broken Link Access
   CRA      Clean Random Access
   CTB      Coding Tree Block
   CTU      Coding Tree Unit
   CVS      Coded Video Sequence
   FU       Fragmentation Unit
   GDR      Gradual Decoding Refresh
   HRD      Hypothetical Reference Decoder
   IDR      Instantaneous Decoding Refresh
   IRAP     Intra Random Access Point
   MANE     Media Aware Network Element
   MST      Multi-Session Transmission
   MTU      Maximum Transfer Unit
   NAL      Network Abstraction Layer
   NALU     Network Abstraction Layer Unit
   PPS      Picture Parameter Set
   RADL     Random Access Decodable Leading (Picture)
   RASL     Random Access Skipped Leading (Picture)
   RPS      Reference Picture Set
   SEI      Supplemental Enhancement Information
   SPS      Sequence Parameter Set
   SST      Single-Session Transmission
   STSA     Step-wise Temporal Sub-layer Access
   TSA      Temporal Sub-layer Access
   VCL      Video Coding Layer
   VPS      Video Parameter Set

4. RTP Payload Format

4.1 RTP Header Usage

   The format of the RTP header is specified in [RFC3550] and reprinted
   in Figure 2 for convenience.  This payload format uses the fields of
   the header in a manner consistent with that specification.

   The RTP payload (and the settings for some RTP header bits) for
   aggregation packets and fragmentation units are specified in
   Sections 4.7 and 4.8, respectively.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |            contributing source (CSRC) identifiers             |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 2 RTP header according to [RFC3550]

   The RTP header fields are set as follows for this RTP payload
   format:

   Marker bit (M): 1 bit

      Set for the last packet of the access unit indicated by the RTP
      timestamp, in line with the normal use of the M bit in video
      formats, to allow efficient playout buffer handling.  Decoders
      can use this bit as an early indication of the last packet of an
      access unit.

   Payload type (PT): 7 bits

      The assignment of an RTP payload type for this new packet format
      is outside the scope of this document and will not be specified
      here.  The assignment of a payload type has to be performed
      either through the profile used or in a dynamic way.

   Sequence number (SN): 16 bits

      Set and used in accordance with [RFC3550].

   Timestamp: 32 bits

      The RTP timestamp is set to the sampling timestamp of the
      content.  A 90 kHz clock rate MUST be used.

      If the NAL unit has no timing properties of its own (e.g.,
      parameter set and SEI NAL units), the RTP timestamp is set to the
      RTP timestamp of the coded picture of the access unit in which
      the NAL unit is included, according to Section 7.4.2.4.4 of
      [HEVC].

      Receivers SHOULD ignore the picture output timing information in
      any picture timing SEI messages or decoding unit information SEI
      messages as specified in [HEVC].  Instead, receivers SHOULD use
      the RTP timestamp for the display process.  Receivers MUST pass
      picture timing SEI messages and decoding unit information SEI
      messages to the decoder and MAY use the field/frame related
      information for the display process, e.g., when frame doubling or
      frame tripling is indicated by the field/frame related
      information.
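
   As a small illustration of the 90 kHz clock requirement, the
   following Python sketch maps a sampling time in seconds to a 32-bit
   RTP timestamp.  The function name rtp_timestamp and the
   initial_offset parameter are assumptions made for this example only;
   [RFC3550] recommends a random initial timestamp value, which the
   offset stands in for.

```python
RTP_CLOCK_HZ = 90_000  # clock rate required by this payload format


def rtp_timestamp(sampling_time_s: float, initial_offset: int = 0) -> int:
    """Map a sampling time in seconds to a 32-bit RTP timestamp.

    initial_offset is a hypothetical parameter standing in for the
    (normally random) initial timestamp value of the RTP stream.
    """
    ticks = round(sampling_time_s * RTP_CLOCK_HZ)
    # The RTP timestamp field is 32 bits wide, so it wraps modulo 2^32.
    return (initial_offset + ticks) % (1 << 32)
```

   With this mapping, two pictures sampled 1/30 of a second apart have
   RTP timestamps that differ by 3000 ticks.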

4.2 Payload Header Usage

   The TID value indicates (among other things) the relative importance
   of an RTP packet, for example because NAL units belonging to higher
   temporal sub-layers are not used for the decoding of lower temporal
   sub-layers.  A lower value of TID indicates a higher importance.
   More important NAL units MAY be better protected against
   transmission losses than less important NAL units.

4.3 Payload Structures

   The first two bytes of the payload of an RTP packet are referred to
   as the payload header.  The payload header consists of the same
   fields (F, Type, LayerId, and TID) as the NAL unit header shown in
   Section 1.1.4, irrespective of the type of the payload structure.

   Three different types of RTP packet payload structures are
   specified.  A receiver can identify the type of an RTP packet
   payload through the Type field in the payload header.

   The three different payload structures are as follows:

   o  Single NAL unit packet: Contains a single NAL unit in the
      payload, and the NAL unit header of the NAL unit also serves as
      the payload header.  This payload structure is specified in
      Section 4.6.

   o  Aggregation packet (AP): Contains more than one NAL unit within
      one access unit.  This payload structure is specified in
      Section 4.7.

   o  Fragmentation unit (FU): Contains a subset of a single NAL unit.
      This payload structure is specified in Section 4.8.

4.4 Transmission Modes

   This memo enables transmission of an HEVC bitstream over a single
   RTP session or multiple RTP sessions.  The concept and working
   principle are inherited from [RFC6190] and follow a similar design.
   If only one RTP session is used for transmission of the HEVC
   bitstream, the transmission mode is referred to as single-session
   transmission (SST); otherwise (more than one RTP session is used for
   transmission of the HEVC bitstream), the transmission mode is
   referred to as multi-session transmission (MST).

   [Ed. (YK): Unify the style of abbreviated words throughout the
   document.]

   SST SHOULD be used for point-to-point unicast scenarios, while MST
   SHOULD be used for point-to-multipoint multicast scenarios where
   different receivers require different operation points of the same
   HEVC bitstream, to improve bandwidth utilization efficiency.

      Informative note: A multicast may degrade to a unicast after all
      but one receiver has left (this justifies the first "SHOULD"
      instead of "MUST"), and there might be scenarios where MST is
      desirable but not possible, e.g., when IP multicast is not
      deployed in a certain network (this justifies the second "SHOULD"
      instead of "MUST").

   The transmission mode is indicated by the tx-mode media parameter
   (see Section 7.1).  If tx-mode is equal to "SST", SST MUST be used.
   Otherwise (tx-mode is equal to "MST"), MST MUST be used.

4.5 Decoding Order Number

   For each NAL unit, the variable AbsDon is derived, representing the
   decoding order number that is indicative of the NAL unit decoding
   order.

   Let NAL unit n be the n-th NAL unit in transmission order within an
   RTP session.

   If tx-mode is equal to "SST" and sprop-depack-buf-nalus is equal
   to 0, AbsDon[n], the value of AbsDon for NAL unit n, is derived as
   equal to n.

   Otherwise (tx-mode is equal to "MST" or sprop-depack-buf-nalus is
   greater than 0), AbsDon[n] is derived as follows, where DON[n] is
   the value of the variable DON for NAL unit n:

   o  If n is equal to 0 (i.e.,
NAL unit n is the very first NAL unit in 958 transmission order), AbsDon[0] is set equal to DON[0]. 960 o Otherwise (n is greater than 0), the following applies for 961 derivation of AbsDon[n]: 963 If DON[n] == DON[n-1], 964 AbsDon[n] = AbsDon[n-1] 966 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768), 967 AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1] 969 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768), 970 AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n] 972 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768), 973 AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - DON[n]) 975 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768), 976 AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n]) 978 For any two NAL units m and n, the following applies: 980 o AbsDon[n] greater than AbsDon[m] indicates that NAL unit n 981 follows NAL unit m in NAL unit decoding order. 983 o When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order 984 of the two NAL units can be in either order. 986 o AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes 987 NAL unit m in decoding order. 989 When two consecutive NAL units in the NAL unit decoding order have 990 different values of AbsDon, the value of AbsDon for the second NAL 991 unit in decoding order MUST be greater than the value of AbsDon for 992 the first NAL unit, and the absolute difference between the two 993 AbsDon values MAY be greater than or equal to 1. 995 Informative note: There are multiple reasons to allow for the 996 absolute difference of the values of AbsDon for two consecutive 997 NAL units in the NAL unit decoding order to be greater than one. 998 An increment by one is not required, as at the time of 999 associating values of AbsDon to NAL units, it may not be known 1000 whether all NAL units are to be delivered to the receiver. For 1001 example, a gateway may not forward coded slice NAL units of 1002 higher sub-layers or some SEI NAL units when there is congestion 1003 in the network. 
      In another example, the first intra picture of a pre-encoded
      clip is transmitted in advance to ensure that it is readily
      available in the receiver, and when transmitting the first intra
      picture, the originator does not exactly know how many NAL units
      will be encoded before the first intra picture of the
      pre-encoded clip follows in decoding order.  Thus, the values of
      AbsDon for the NAL units of the first intra picture of the
      pre-encoded clip have to be estimated when they are transmitted,
      and gaps in values of AbsDon may occur.  Another example is MST,
      where the AbsDon values must indicate cross-layer decoding order
      for NAL units conveyed in all the RTP sessions.

4.6 Single NAL Unit Packets

   A single NAL unit packet contains exactly one NAL unit, and consists
   of a payload header (denoted as PayloadHdr), an optional 16-bit DONL
   field (in network byte order), and the NAL unit payload data (the
   NAL unit excluding its NAL unit header) of the contained NAL unit,
   as shown in Figure 3.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           PayloadHdr          |        DONL (optional)        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                  NAL unit payload data                        |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 3 The structure of a single NAL unit packet

   The payload header SHOULD be an exact copy of the NAL unit header of
   the contained NAL unit.  However, the Type (i.e., nal_unit_type)
   field MAY be changed, e.g., when it is desirable to handle a CRA
   picture as a BLA picture [JCTVC-J0107].

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the contained NAL
   unit.
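
   Because DONL carries only 16 bits, a receiver that uses decoding
   order numbers has to extend successive DON values into AbsDon with
   the five-case rule of Section 4.5.  The following Python sketch is
   one possible transcription of that rule; the function names
   next_abs_don and abs_dons are chosen for this example and are not
   defined by this memo.

```python
from typing import List


def next_abs_don(prev_abs_don: int, prev_don: int, don: int) -> int:
    """Derive AbsDon[n] from AbsDon[n-1], DON[n-1] and DON[n]."""
    if don == prev_don:
        return prev_abs_don
    if don > prev_don and don - prev_don < 32768:
        # Forward step without wrap-around.
        return prev_abs_don + don - prev_don
    if don < prev_don and prev_don - don >= 32768:
        # Forward step across the 16-bit wrap-around.
        return prev_abs_don + 65536 - prev_don + don
    if don > prev_don and don - prev_don >= 32768:
        # Backward step across the 16-bit wrap-around.
        return prev_abs_don - (prev_don + 65536 - don)
    # Remaining case: don < prev_don and prev_don - don < 32768,
    # i.e. a backward step without wrap-around.
    return prev_abs_don - (prev_don - don)


def abs_dons(dons: List[int]) -> List[int]:
    """Derive AbsDon for a list of DON values in transmission order."""
    result: List[int] = []
    for n, don in enumerate(dons):
        if n == 0:
            result.append(don)  # AbsDon[0] is set equal to DON[0]
        else:
            result.append(next_abs_don(result[-1], dons[n - 1], don))
    return result
```

   For example, the DON sequence 65534, 65535, 0, 1 yields the AbsDon
   values 65534, 65535, 65536, 65537, so the 16-bit wrap-around does
   not disturb the decoding order.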

   If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater
   than 0, the DONL field MUST be present, and the variable DON for the
   contained NAL unit is derived as equal to the value of the DONL
   field.  Otherwise (tx-mode is equal to "SST" and
   sprop-depack-buf-nalus is equal to 0), the DONL field MUST NOT be
   present.

4.7 Aggregation Packets (APs)

   Aggregation packets (APs) are introduced to enable the reduction of
   packetization overhead for small NAL units, such as most of the
   non-VCL NAL units, which are often only a few octets in size.

   An AP aggregates NAL units within one access unit.  Each NAL unit to
   be carried in an AP is encapsulated in an aggregation unit.  NAL
   units aggregated in one AP are in NAL unit decoding order.

   An AP consists of a payload header (denoted as PayloadHdr) followed
   by two or more aggregation units, as shown in Figure 4.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           PayloadHdr          |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                                                               |
   |                 two or more aggregation units                 |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 4 The structure of an aggregation packet

   The fields in the payload header are set as follows.  The F bit MUST
   be equal to 0 if the F bit of each aggregated NAL unit is equal to
   zero; otherwise, it MUST be equal to 1.  The Type field MUST be
   equal to 48.  The value of LayerId MUST be equal to the lowest value
   of LayerId of all the aggregated NAL units.  The value of TID MUST
   be the lowest value of TID of all the aggregated NAL units.

      Informative note: All VCL NAL units in an AP have the same TID
      value since they belong to the same access unit.
      However, an AP may contain non-VCL NAL units for which the TID
      value in the NAL unit header differs from the TID value of the
      VCL NAL units in the same AP.

   An AP MUST carry at least two aggregation units and can carry as
   many aggregation units as necessary; however, the total amount of
   data in an AP MUST fit into an IP packet, and the size SHOULD be
   chosen so that the resulting IP packet is smaller than the MTU size,
   so as to avoid IP layer fragmentation.  An AP MUST NOT contain
   Fragmentation Units (FUs) specified in Section 4.8.  APs MUST NOT be
   nested; i.e., an AP MUST NOT contain another AP.

   The first aggregation unit in an AP consists of an optional 16-bit
   DONL field (in network byte order), followed by a 16-bit unsigned
   size field (in network byte order) that indicates the size of the
   NAL unit in bytes (excluding these two octets, but including the NAL
   unit header), followed by the NAL unit itself, including its NAL
   unit header, as shown in Figure 5.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                   :       DONL (optional)         |   NALU size   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   NALU size   |                                               |
   +-+-+-+-+-+-+-+-+         NAL unit                              |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Figure 5 The structure of the first aggregation unit in an AP

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the aggregated NAL
   unit.
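
   As a rough illustration of the aggregation unit layout, the
   following Python sketch walks the aggregation units of an AP in the
   simplest case, where neither DONL nor DOND fields are present
   (tx-mode equal to "SST" and sprop-depack-buf-nalus equal to 0), so
   that every aggregation unit is a 16-bit size followed by the NAL
   unit.  The function name split_ap and its error handling are
   assumptions made for this example, and RTP padding is assumed to
   have been removed already.

```python
import struct
from typing import List

AP_TYPE = 48  # Type value that marks an aggregation packet


def split_ap(payload: bytes) -> List[bytes]:
    """Return the NAL units (headers included) carried in an AP payload.

    Assumes that no DONL/DOND fields are present and that any RTP
    padding has already been stripped from the payload.
    """
    if len(payload) < 2:
        raise ValueError("payload shorter than the two-byte PayloadHdr")
    if (payload[0] >> 1) & 0x3F != AP_TYPE:
        raise ValueError("payload header Type is not 48")
    nal_units: List[bytes] = []
    offset = 2  # skip PayloadHdr
    while offset < len(payload):
        if offset + 2 > len(payload):
            raise ValueError("truncated NALU size field")
        (size,) = struct.unpack_from("!H", payload, offset)  # network order
        offset += 2
        # The size covers the two-byte NAL unit header, so it is >= 2.
        if size < 2 or offset + size > len(payload):
            raise ValueError("invalid NALU size")
        nal_units.append(payload[offset:offset + size])
        offset += size
    if len(nal_units) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    return nal_units
```

   A receiver that does use decoding order numbers would additionally
   have to consume the DONL and DOND fields in front of each size
   field.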
1125 If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater 1126 than 0, the DONL field MUST be present in an aggregation unit that 1127 is the first aggregation unit in an AP, and the variable DON for the 1128 aggregated NAL unit is derived as equal to the value of the DONL 1129 field. Otherwise (tx-mode is equal to "SST" and sprop-depack-buf- 1130 nalus is equal to 0), the DONL field MUST NOT be present in an 1131 aggregation unit that is the first aggregation unit in an AP. 1133 An aggregation unit that is not the first aggregation unit in an AP 1134 consists of an optional 8-bit DOND field followed by a 16-bit 1135 unsigned size information (in network byte order) that indicates the 1136 size of the NAL unit in bytes (excluding these two octets, but 1137 including the NAL unit header), followed by the NAL unit itself, 1138 including its NAL unit header, as shown in Figure 6. 1140 0 1 2 3 1141 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1142 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1143 : DOND(optional)| NALU size | 1144 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1145 | | 1146 | NAL unit | 1147 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1148 | : 1149 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1151 Figure 6 The structure of an aggregation unit that is not the first 1152 aggregation unit in an AP 1154 When present, the DOND field plus 1 specifies the difference between 1155 the decoding order number values of the current aggregated NAL unit 1156 and the preceding aggregated NAL unit in the same AP. 1158 If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater 1159 than 0, the DOND field MUST be present in an aggregation unit that 1160 is not the first aggregation unit in an AP, and the variable DON for 1161 the aggregated NAL unit is derived as equal to the DON of the 1162 preceding aggregated NAL unit in the same AP plus the value of the 1163 DOND field plus 1 modulo 65536. 
Otherwise (tx-mode is equal to 1164 "SST" and sprop-depack-buf-nalus is equal to 0), the DOND field MUST 1165 NOT be present in an aggregation unit that is not the first 1166 aggregation unit in an AP. 1168 Figure 7 presents an example of an AP that contains two aggregation 1169 units, labeled as 1 and 2 in the figure, without the DONL and DOND 1170 fields being present. 1172 0 1 2 3 1173 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1174 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1175 | RTP Header | 1176 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1177 | PayloadHdr | NALU 1 Size | 1178 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1179 | NALU 1 HDR | | 1180 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 1 Data | 1181 | . . . | 1182 | | 1183 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1184 | . . . | NALU 2 Size | NALU 2 HDR | 1185 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1186 | NALU 2 HDR | | 1187 +-+-+-+-+-+-+-+-+ NALU 2 Data | 1188 | . . . | 1189 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1190 | :...OPTIONAL RTP padding | 1191 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1193 Figure 7 An example of an AP packet containing two aggregation units 1194 without the DONL and DOND fields 1196 Figure 8 presents an example of an AP that contains two aggregation 1197 units, labeled as 1 and 2 in the figure, with the DONL and DOND 1198 fields being present. 1200 0 1 2 3 1201 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1202 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1203 | RTP Header | 1204 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1205 | PayloadHdr | NALU 1 DONL | 1206 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1207 | NALU 1 Size | NALU 1 HDR | 1208 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1209 | | 1210 | NALU 1 Data . . . 
| 1211 | | 1212 + . . . +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1213 | | NALU 2 DOND | NALU 2 Size | 1214 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1215 | NALU 2 HDR | | 1216 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 2 Data | 1217 | | 1218 | . . . +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1219 | :...OPTIONAL RTP padding | 1220 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1222 Figure 8 An example of an AP containing two aggregation units with 1223 the DONL and DOND fields 1225 4.8 Fragmentation Units (FUs) 1227 Fragmentation units (FUs) are introduced to enable fragmenting a 1228 single NAL unit into multiple RTP packets, possibly without 1229 cooperation or knowledge of the HEVC encoder. A fragment of a NAL 1230 unit consists of an integer number of consecutive octets of that NAL 1231 unit. Fragments of the same NAL unit MUST be sent in consecutive 1232 order with ascending RTP sequence numbers (with no other RTP packets 1233 within the same RTP packet stream being sent between the first and 1234 last fragment). 1236 When a NAL unit is fragmented and conveyed within FUs, it is 1237 referred to as a fragmented NAL unit. APs MUST NOT be fragmented. 1238 FUs MUST NOT be nested; i.e., an FU MUST NOT contain a subset of 1239 another FU. 1241 The RTP timestamp of an RTP packet carrying an FU is set to the 1242 NALU-time of the fragmented NAL unit. 1244 An FU consists of a payload header (denoted as PayloadHdr), an FU 1245 header of one octet, an optional 16-bit DONL field (in network byte 1246 order), and an FU payload, as shown in Figure 9. 
1248 0 1 2 3 1249 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1250 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1251 | PayloadHdr | FU header | DONL(optional)| 1252 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| 1253 | DONL(optional)| | 1254 |-+-+-+-+-+-+-+-+ | 1255 | FU payload | 1256 | | 1257 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1258 | :...OPTIONAL RTP padding | 1259 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1261 Figure 9 The structure of an FU 1263 The fields in the payload header are set as follows. The Type field 1264 MUST be equal to 49. The fields F, LayerId, and TID MUST be equal 1265 to the fields F, LayerId, and TID, respectively, of the fragmented 1266 NAL unit. 1268 The FU header consists of an S bit, an E bit, and a 6-bit FuType 1269 field, as shown in Figure 10. 1271 +---------------+ 1272 |0|1|2|3|4|5|6|7| 1273 +-+-+-+-+-+-+-+-+ 1274 |S|E| FuType | 1275 +---------------+ 1277 Figure 10 The structure of FU header 1279 The semantics of the FU header fields are as follows: 1280 S: 1 bit 1281 When set to one, the S bit indicates the start of a fragmented 1282 NAL unit i.e., the first byte of the FU payload is also the first 1283 byte of the payload of the fragmented NAL unit. When the FU 1284 payload is not the start of the fragmented NAL unit payload, the 1285 S bit MUST be set to zero. 1287 E: 1 bit 1288 When set to one, the E bit indicates the end of a fragmented NAL 1289 unit, i.e., the last byte of the payload is also the last byte of 1290 the fragmented NAL unit. When the FU payload is not the last 1291 fragment of a fragmented NAL unit, the E bit MUST be set to zero. 1293 FuType: 6 bits 1294 The field FuType MUST be equal to the field Type of the 1295 fragmented NAL unit. 1297 The DONL field, when present, specifies the value of the 16 least 1298 significant bits of the decoding order number of the fragmented NAL 1299 unit. 
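
   To make the field mapping concrete, the following Python sketch
   builds the three octets that precede the optional DONL field in an
   FU (the payload header and the FU header) from the NAL unit header
   of the fragmented NAL unit, and parses an FU header back into its
   fields.  The function names are chosen for this example and are not
   defined by this memo; the DONL field is left out for brevity.

```python
from typing import Tuple

FU_TYPE = 49  # Type value that marks a fragmentation unit


def parse_nal_header(hdr: bytes) -> Tuple[int, int, int, int]:
    """Split a two-byte NAL unit header into (F, Type, LayerId, TID)."""
    f = hdr[0] >> 7
    nal_type = (hdr[0] >> 1) & 0x3F
    layer_id = ((hdr[0] & 0x01) << 5) | (hdr[1] >> 3)
    tid = hdr[1] & 0x07
    return f, nal_type, layer_id, tid


def build_fu_prefix(nal_header: bytes, start: bool, end: bool) -> bytes:
    """Return PayloadHdr (2 octets) plus FU header (1 octet) for an FU."""
    if start and end:
        # The format forbids S and E both being set in one FU header.
        raise ValueError("S and E must not both be set")
    f, nal_type, layer_id, tid = parse_nal_header(nal_header)
    # PayloadHdr: F, LayerId and TID are copied; Type becomes 49.
    byte0 = (f << 7) | (FU_TYPE << 1) | (layer_id >> 5)
    byte1 = ((layer_id & 0x1F) << 3) | tid
    # FU header: S bit, E bit, then the original Type as FuType.
    fu_header = (int(start) << 7) | (int(end) << 6) | nal_type
    return bytes([byte0, byte1, fu_header])


def parse_fu_header(octet: int) -> Tuple[bool, bool, int]:
    """Split an FU header octet into (S, E, FuType)."""
    return bool(octet & 0x80), bool(octet & 0x40), octet & 0x3F
```

   A depacketizer can invert build_fu_prefix by taking F, LayerId, and
   TID from the payload header and FuType from the FU header to
   recreate the NAL unit header of the fragmented NAL unit.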

   If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater
   than 0, and the S bit is equal to 1, the DONL field MUST be present
   in the FU, and the variable DON for the fragmented NAL unit is
   derived as equal to the value of the DONL field. Otherwise (tx-mode
   is equal to "SST" and sprop-depack-buf-nalus is equal to 0, or the S
   bit is equal to 0), the DONL field MUST NOT be present in the FU.

   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
   the S bit and E bit MUST NOT both be set to one in the same FU
   header.

   The FU payload consists of fragments of the payload of the
   fragmented NAL unit so that if the FU payloads of consecutive FUs,
   starting with an FU with the S bit equal to 1 and ending with an FU
   with the E bit equal to 1, are sequentially concatenated, the
   payload of the fragmented NAL unit can be reconstructed. The NAL
   unit header of the fragmented NAL unit is not included as such in
   the FU payload; rather, the information of the NAL unit header of
   the fragmented NAL unit is conveyed in the F, LayerId, and TID
   fields of the payload headers of the FUs and the FuType field of
   the FU headers of the FUs. An FU payload MAY have any number of
   octets and MAY be empty.

      Informative note: Empty FU payloads are allowed to reduce the
      latency of a certain class of senders in nearly lossless
      environments. These senders can be characterized in that they
      packetize fragments of a NAL unit before the NAL unit is
      completely generated and, hence, before the NAL unit size is
      known. If zero-length FU payloads were not allowed, the sender
      would have to generate at least one bit of data of the following
      fragment of the NAL unit before the current FU could be sent.
      Due to the characteristics of HEVC, where sometimes several CTUs
      occupy zero bits, this is undesirable and can add delay.
      However, the (potential) use of zero-length FU payloads should be
      carefully weighed against the increased risk of the loss of at
      least a part of the fragmented NAL unit because of the additional
      packets employed for its transmission.

   If an FU is lost, the receiver SHOULD discard all following
   fragmentation units in transmission order corresponding to the same
   fragmented NAL unit, unless the decoder in the receiver is known to
   be prepared to gracefully handle incomplete NAL units.

   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
   fragments of a NAL unit to an (incomplete) NAL unit, even if
   fragment n of that NAL unit is not received. In this case, the
   forbidden_zero_bit of the NAL unit MUST be set to one to indicate a
   syntax violation.

5. Packetization Rules

   The following packetization rules apply:

   o If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater
     than 0 for an RTP session, the transmission order of NAL units
     carried in the RTP session MAY be different from the NAL unit
     decoding order. Otherwise (tx-mode is equal to "SST" and
     sprop-depack-buf-nalus is equal to 0 for an RTP session), the
     transmission order of NAL units carried in the RTP session MUST
     be the same as the NAL unit decoding order.

   o A NAL unit of a small size SHOULD be encapsulated in an
     aggregation packet together with one or more other NAL units in
     order to avoid the unnecessary packetization overhead for small
     NAL units. For example, non-VCL NAL units such as access unit
     delimiters, parameter sets, or SEI NAL units are typically small
     and can often be aggregated with slice NAL units without
     violating MTU size constraints.

   o Each non-VCL NAL unit SHOULD be encapsulated in an aggregation
     packet together with its associated VCL NAL unit, as typically a
     non-VCL NAL unit would be meaningless without the associated VCL
     NAL unit being available.

   o FUs SHOULD NOT be applied in live-encoding scenarios such as
     video telephony, video conferencing, live streaming, and live
     broadcast, in which cases dependent slice segments SHOULD be used
     when a slice needs to be transported in multiple RTP packets. For
     pre-encoded content where the use of dependent slice segments is
     not possible without transcoding, FUs SHOULD be used for
     transporting one NAL unit in multiple RTP packets for MTU size
     matching.

   o For carrying exactly one NAL unit in an RTP packet, a single NAL
     unit packet MUST be used.

6. De-packetization Process

   The general concept behind de-packetization is to get the NAL units
   out of the RTP packets in an RTP session and all the dependent RTP
   sessions, if any, and pass them to the decoder in the NAL unit
   decoding order.

   The de-packetization process is implementation dependent.
   Therefore, the following description should be seen as an example of
   a suitable implementation. Other schemes may be used as well, as
   long as the output for the same input is the same as the process
   described below. The output is the same when the set of NAL units
   and their order are both identical. Optimizations relative to the
   described algorithms are possible.

   All normal RTP mechanisms related to buffer management apply. In
   particular, duplicated or outdated RTP packets (as indicated by the
   RTP sequence number and the RTP timestamp) are removed. To
   determine the exact time for decoding, factors such as a possible
   intentional delay to allow for proper inter-stream synchronization
   must be factored in.

   NAL units with NAL unit type values in the range of 0 to 47,
   inclusive, may be passed to the decoder. NAL-unit-like structures
   with NAL unit type values in the range of 48 to 63, inclusive, MUST
   NOT be passed to the decoder.

   The receiver includes a receiver buffer, which is used to compensate
   for transmission delay jitter, to reorder NAL units from
   transmission order to the NAL unit decoding order, and to recover
   the NAL unit decoding order in MST, when applicable. In this
   section, the receiver operation is described under the assumption
   that there is no transmission delay jitter. To distinguish it from
   a practical receiver buffer that is also used for compensation of
   transmission delay jitter, the receiver buffer is hereafter called
   the de-packetization buffer in this section. Receivers SHOULD also
   prepare for transmission delay jitter; i.e., either reserve
   separate buffers for transmission delay jitter buffering and
   de-packetization buffering or use a receiver buffer for both
   transmission delay jitter and de-packetization. Moreover, receivers
   SHOULD take transmission delay jitter into account in the buffering
   operation, e.g., by additional initial buffering before starting
   decoding and playback.

   There are two buffering states in the receiver: initial buffering
   and buffering while playing. Initial buffering starts when the
   reception is initialized. After initial buffering, decoding and
   playback are started, and the buffering-while-playing mode is used.

   Regardless of the buffering state, the receiver stores incoming NAL
   units, in reception order, into the de-packetization buffer. NAL
   units carried in single NAL unit packets, APs, and FUs are stored in
   the de-packetization buffer individually, and the value of AbsDon is
   calculated and stored for each NAL unit.
When MST is in use, NAL 1439 units of all RTP packet streams are stored in the same de- 1440 packetization buffer. 1442 Initial buffering lasts until condition A (the number of NAL units 1443 in the de-packetization buffer is greater than the value of sprop- 1444 depack-buf-nalus of the highest RTP session) is true. 1446 After initial buffering, whenever condition A is true, the following 1447 operation is repeatedly applied until condition A becomes false: 1449 o The NAL unit in the de-packetization buffer with the smallest 1450 value of AbsDon is removed from the de-packetization buffer and 1451 passed to the decoder. 1453 When no more NAL units are flowing into the de-packetization buffer, 1454 all NAL units remaining in the de-packetization buffer are removed 1455 from the buffer and passed to the decoder in the order of increasing 1456 AbsDon values. 1458 7. Payload Format Parameters 1460 This section specifies the parameters that MAY be used to select 1461 optional features of the payload format and certain features or 1462 properties of the bitstream. The parameters are specified here as 1463 part of the media type registration for the HEVC codec. A mapping 1464 of the parameters into the Session Description Protocol (SDP) 1465 [RFC4566] is also provided for applications that use SDP. 1466 Equivalent parameters could be defined elsewhere for use with 1467 control protocols that do not use SDP. 1469 7.1 Media Type Registration 1471 The media subtype for the HEVC codec is allocated from the IETF 1472 tree. 1474 The receiver MUST ignore any unspecified parameter. 
1476 Media Type name: video 1478 Media subtype name: H265 1480 Required parameters: none 1482 OPTIONAL parameters: 1484 In the following definitions of parameters, "the stream" or "the 1485 NAL unit stream" refers to all NAL units conveyed in the current 1486 RTP session in SST, and all NAL units conveyed in the current RTP 1487 session and all NAL units conveyed in other RTP sessions that the 1488 current RTP session depends on in MST. 1490 profile-space, profile-id: 1492 The profile-space parameter indicates the context for 1493 interpretation of the profile-id parameter value. The 1494 profile, which specifies the subset of coding tools that may 1495 have been used to generate the stream or that the receiver 1496 supports, as specified in [HEVC], is defined by the 1497 combination of profile-space and profile-id. Note that 1498 profile-space is required to be equal to 0 in [HEVC], but 1499 other values for it may be specified in the future by ITU-T or 1500 ISO/IEC. 1502 If the profile-space and profile-id parameters are used to 1503 indicate properties of a NAL unit stream, it indicates that, 1504 to decode the stream, the minimum subset of coding tools a 1505 decoder has to support is the profile specified by both 1506 parameters. 1508 If the profile-space and profile-id parameters are used for 1509 capability exchange or session setup, it indicates the subset 1510 of coding tools, which is equal to the profile, that the codec 1511 supports for both receiving and sending. 1513 If no profile-space is present, a value of 0 MUST be inferred 1514 and if no profile-id is present the Main profile MUST be 1515 inferred. 1517 The profile-space and profile-id parameters are derived from 1518 the sequence parameter set or video parameter set NAL units, 1519 as specified in [HEVC], as follows. 

         For SST or for the stream corresponding to the highest RTP
         session of MST when MST is applied, the following applies:

            o profile-space = general_profile_space
            o profile-id = general_profile_idc

         For streams not corresponding to the highest RTP session of
         MST when MST is applied, the following applies, with j being
         the value of the sub-layer-id parameter:

            o profile-space = sub_layer_profile_space[j]
            o profile-id = sub_layer_profile_idc[j]

      tier-flag, level-id:

         The tier-flag parameter indicates the context for
         interpretation of the level-id value. The default level,
         which specifies limits on values of syntax elements or on
         arithmetic combinations of values of syntax elements, as
         specified in [HEVC], is defined by the combination of
         tier-flag and level-id.

         If the tier-flag and level-id parameters are used to indicate
         properties of a NAL unit stream, they indicate that, to decode
         the stream, the lowest level the decoder has to support is the
         default level.

         If the tier-flag and level-id parameters are used for
         capability exchange or session setup, the following applies.
         If max-recv-level-id is not present, the default level defined
         by tier-flag and level-id indicates the highest level the
         codec wishes to support. Otherwise, tier-flag and
         max-recv-level-id indicate the highest level the codec
         supports for receiving. For either receiving or sending, all
         levels that are lower than the highest level supported MUST
         also be supported.

         If no tier-flag is present, a value of 0 MUST be inferred, and
         if no level-id is present, a value of 30 (i.e., level 1.0)
         MUST be inferred.

         The tier-flag and level-id parameters are derived from the
         sequence parameter set or video parameter set NAL units, as
         specified in [HEVC], as follows.
1566 For SST or for the stream corresponding to the highest RTP 1567 session of MST when MST is applied, the following applies: 1569 o tier-flag = general_tier_flag 1570 o level-id = general_level_idc 1572 For streams not corresponding to the highest RTP session of 1573 MST when MST is applied, the following applies, with j being 1574 the value of the sub-layer-id parameter: 1576 o tier-flag = sub_layer_tier_flag[j] 1577 o level-id = sub_layer_level_idc[j] 1579 interop-constraints: 1581 A base16 [RFC4648] (hexadecimal) representation of the six 1582 bytes derived from the sequence parameter set or video 1583 parameter set NAL units as specified in [HEVC] consisting of 1584 progressive_source_flag, interlaced_source_flag, 1585 non_packed_constraint_flag, frame_only_constraint_flag, and 1586 reserved_zero_44bits. Note that reserved_zero_44bits is 1587 required to be equal to 0 in [HEVC], but other values for it 1588 may be specified in the future by ITU-T or ISO/IEC. 1590 If no interop-constraints are present, the following MUST be 1591 inferred: 1593 o progressive_source_flag = 1 1594 o interlaced_source_flag = 0 1595 o non_packed_constraint_flag = 1 1596 o frame_only_constraint_flag = 1 1597 o reserved_zero_44bits = 0 1598 For SST or for the stream corresponding to the highest RTP 1599 session of MST when MST is applied, the following applies: 1601 o progressive_source_flag = general_progressive_source_flag 1602 o interlaced_source_flag = general_interlaced_source_flag 1603 o non_packed_constraint_flag = 1604 general_non_packed_constraint_flag 1605 o frame_only_constraint_flag = 1606 general_frame_only_constraint_flag 1607 o reserved_zero_44bits = general_reserved_zero_44bits 1609 For streams not corresponding to the highest RTP session of 1610 MST when MST is applied, the following applies, with j being 1611 the value of the sub-layer-id parameter: 1613 o progressive_source_flag = 1614 sub_layer_progressive_source_flag[j] 1615 o interlaced_source_flag = 1616 
sub_layer_interlaced_source_flag[j] 1617 o non_packed_constraint_flag = 1618 sub_layer_non_packed_constraint_flag[j] 1619 o frame_only_constraint_flag = 1620 sub_layer_frame_only_constraint_flag[j] 1621 o reserved_zero_44bits = sub_layer_reserved_zero_44bits[j] 1623 profile-compatibility-indicator: 1625 A base16 [RFC4648] representation of the four bytes 1626 representing the 32 profile compatibility flags in the 1627 sequence parameter set or video parameter set NAL units. A 1628 decoder conforming to a certain profile may be able to decode 1629 bitstreams conforming to other profiles. The profile- 1630 compatibility-indicator provides exact information of the 1631 ability of a decoder conforming to a certain profile to decode 1632 bitstreams conforming to another profile. More concretely, if 1633 the profile compatibility flag corresponding to the profile, 1634 which a decoder conforms to, is set, then the decoder is able 1635 to decode that bitstream with the flag set, irrespective of 1636 the profile, which a bitstream conforms to (provided that the 1637 decoder supports the highest level of the bitstream). 1639 For SST or for the stream corresponding to highest RTP session 1640 of MST when MST is used with temporal scalability the 1641 following applies with j = 0..31: 1643 o The 32 flags = general_profile_compatibility_flag[j] 1645 For streams not corresponding to the highest RTP session (the 1646 RTP session which no other RTP session depends on) of MST when 1647 MST is used with temporal scalability the following applies 1648 with i being the value of the sub-layer-id parameter and j = 1649 0..31: 1651 o The 32 flags = sub_layer_profile_compatibility_flag[i][j] 1653 sub-layer-id: 1655 This parameter MAY be used to indicate the TID of the highest 1656 sub-layer of the stream. 
         When not present, the value of sub-layer-id is inferred to be
         equal to vps_max_sub_layers_minus1+1 and
         sps_max_sub_layers_minus1+1 in the video parameter set and
         sequence parameter set as defined in [HEVC].

      recv-sub-layer-id:

         This parameter MAY be used to signal a receiver's choice of
         the offered or declared sub-layers in the sprop-vps. The value
         of recv-sub-layer-id indicates the index of the highest
         sub-layer of the stream that a receiver supports. When not
         present, the value of recv-sub-layer-id is inferred to be
         equal to sub-layer-id.

      max-recv-level-id:

         This parameter MAY be used, together with tier-flag, to
         indicate the highest level a receiver supports. The highest
         level the receiver supports is equal to the value of
         max-recv-level-id divided by 30 for the Main or High tier (as
         determined by tier-flag equal to 0 or 1, respectively).

         When max-recv-level-id is not present, the value is inferred
         to be equal to level-id.

         max-recv-level-id MUST NOT be present when the highest level
         the receiver supports is not higher than the default level.

      sprop-vps:

         This parameter MAY be used to convey any video parameter set
         NAL unit of the stream. When present, the parameter MAY be
         used to indicate codec capability and sub-stream
         characteristics (i.e., properties of representations of
         sub-layers as defined in [HEVC]) as well as for out-of-band
         transmission of video parameter sets. The value of the
         parameter is a comma-separated (',') list of base64 [RFC4648]
         representations of the video parameter set NAL units as
         specified in Section 7.3.2.1 of [HEVC].

      sprop-sps:

         This parameter MAY be used to convey sequence parameter set
         NAL units of the stream for out-of-band transmission of
         sequence parameter sets.
         The value of the parameter is a comma-separated (',') list of
         base64 [RFC4648] representations of the sequence parameter set
         NAL units as specified in Section 7.3.2.2 of [HEVC].

      sprop-pps:

         This parameter MAY be used to convey picture parameter set NAL
         units of the stream for out-of-band transmission of picture
         parameter sets. The value of the parameter is a
         comma-separated (',') list of base64 [RFC4648] representations
         of the picture parameter set NAL units as specified in Section
         7.3.2.3 of [HEVC].

      max-ls, max-lps, max-cpb, max-dpb, max-br:

         These parameters MAY be used to signal the capabilities of a
         receiver implementation. These parameters MUST NOT be used for
         any other purpose. The highest level (specified by tier-flag
         and max-recv-level-id) MUST be one that the receiver is fully
         capable of supporting. max-ls, max-lps, max-cpb, max-dpb, and
         max-br MAY be used to indicate capabilities of the receiver
         that extend the required capabilities of the signaled highest
         level, as specified below.

         When more than one parameter from the set (max-ls, max-lps,
         max-cpb, max-dpb, max-br) is present, the receiver MUST
         support all signaled capabilities simultaneously. For
         example, if both max-ls and max-br are present, the signaled
         highest level with the extension of both the frame rate and
         bitrate is supported. That is, the receiver is able to decode
         NAL unit streams in which the luma sample rate is up to max-ls
         (inclusive), the bitrate is up to max-br (inclusive), the
         coded picture buffer size is derived as specified in the
         semantics of the max-br parameter below, and the other
         properties comply with the highest level specified by
         tier-flag and max-recv-level-id.
1739 Informative note: When the OPTIONAL media type parameters 1740 are used to signal the properties of a NAL unit stream, 1741 max-ls, max-lps, max-cpb, max-dpb, and max-br are not 1742 present, and the value of profile-space, profile-id, tier- 1743 flag and level-id must always be such that the NAL unit 1744 stream complies fully with the specified profile and level. 1746 max-ls: 1747 The value of max-ls is an integer indicating the maximum 1748 processing rate in units of luma samples per second. The max- 1749 ls parameter signals that the receiver is capable of decoding 1750 video at a higher rate than is required by the signaled 1751 highest level. 1753 When max-ls is signaled, the receiver MUST be able to decode 1754 NAL unit streams that conform to the signaled highest level, 1755 with the exception that the MaxLumaSR value in Table A-2 of 1756 [HEVC] for the signaled highest level is replaced with the 1757 value of max-ls. The value of max-ls MUST be greater than or 1758 equal to the value of MaxLumaSR given in Table A-2 of [HEVC] 1759 for the highest level. Senders MAY use this knowledge to send 1760 pictures of a given size at a higher picture rate than is 1761 indicated in the signaled highest level. 1763 max-lps: 1764 The value of max-lps is an integer indicating the maximum 1765 picture size in units of luma samples. The max-lps parameter 1766 signals that the receiver is capable of decoding larger 1767 picture sizes than are required by the signaled highest level. 1768 When max-lps is signaled, the receiver MUST be able to decode 1769 NAL unit streams that conform to the signaled highest level, 1770 with the exception that the MaxLumaPS value in Table A-1 of 1771 [HEVC] for the signaled highest level is replaced with the 1772 value of max-lps. The value of max-lps MUST be greater than or 1773 equal to the value of MaxLumaPS given in Table A-1 of [HEVC] 1774 for the highest level. 
Senders MAY use this knowledge to send 1775 larger pictures at a proportionally lower frame rate than is 1776 indicated in the signaled highest level. 1778 max-cpb: 1779 The value of max-cpb is an integer indicating the maximum 1780 coded picture buffer size in units of CpbBrVclFactor bits for 1781 the VCL HRD parameters and in units of CpbBrNalFactor bits for 1782 the NAL HRD parameters, where CpbBrVclFactor and 1783 CpbBrNalFactor are defined in Section A.4 of [HEVC]. The max- 1784 cpb parameter signals that the receiver has more memory than 1785 the minimum amount of coded picture buffer memory required by 1786 the signaled highest level. When max-cpb is signaled, the 1787 receiver MUST be able to decode NAL unit streams that conform 1788 to the signaled highest level, with the exception that the 1789 MaxCPB value in Table A-1 of [HEVC] for the signaled highest 1790 level is replaced with the value of max-cpb. The value of max- 1791 cpb MUST be greater than or equal to the value of MaxCPB given 1792 in Table A-1 of [HEVC] for the highest level. Senders MAY use 1793 this knowledge to construct coded video streams with greater 1794 variation of bitrate than can be achieved with the MaxCPB 1795 value in Table A-1 of [HEVC]. 1797 Informative note: The coded picture buffer is used in the 1798 hypothetical reference decoder (Annex C of HEVC). The use 1799 of the hypothetical reference decoder is recommended in 1800 HEVC encoders to verify that the produced bitstream 1801 conforms to the standard and to control the output bitrate. 1802 Thus, the coded picture buffer is conceptually independent 1803 of any other potential buffers in the receiver, including 1804 de-packetization and de-jitter buffers. The coded picture 1805 buffer need not be implemented in decoders as specified in 1806 Annex C of HEVC, but rather standard-compliant decoders can 1807 have any buffering arrangements provided that they can 1808 decode standard-compliant bitstreams. 
            Thus, in practice, the input buffer for a video decoder
            can be integrated with de-packetization and de-jitter
            buffers of the receiver.

      max-dpb:
         The value of max-dpb is an integer indicating the maximum
         decoded picture buffer size in units of decoded pictures at
         the MaxLumaPS for the highest level, i.e., the number of
         decoded pictures at the maximum picture size defined by the
         highest level. The value of max-dpb MUST be smaller than or
         equal to 16. The max-dpb parameter signals that the receiver
         has more memory than the minimum amount of decoded picture
         buffer memory required by default, which is MaxDpbPicBuf as
         defined in [HEVC] (equal to 6). When max-dpb is signaled, the
         receiver MUST be able to decode NAL unit streams that conform
         to the signaled highest level, with the exception that the
         MaxDpbPicBuf value defined in [HEVC] as 6 is replaced with
         the value of max-dpb. Consequently, a receiver that signals
         max-dpb MUST be capable of storing the following number of
         decoded frames (MaxDpbSize) in its decoded picture buffer:

            if( PicSizeInSamplesY <= ( MaxLumaPS >> 2 ) )
               MaxDpbSize = Min( 4 * max-dpb, 16 )
            else if ( PicSizeInSamplesY <= ( MaxLumaPS >> 1 ) )
               MaxDpbSize = Min( 2 * max-dpb, 16 )
            else if ( PicSizeInSamplesY <= ( ( 3 * MaxLumaPS ) >> 2 ) )
               MaxDpbSize = Min( (4 * max-dpb) / 3, 16 )
            else
               MaxDpbSize = max-dpb

         Here, MaxLumaPS is given in Table A-1 of [HEVC] for the
         highest level, and PicSizeInSamplesY is the current size of
         each decoded picture in units of luma samples as defined in
         [HEVC].

         The value of max-dpb MUST be greater than or equal to the
         value of MaxDpbPicBuf (i.e., 6) as defined in [HEVC]. Senders
         MAY use this knowledge to construct coded video streams with
         improved compression.
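
         The MaxDpbSize derivation above can be transcribed directly.
         The following is a non-normative sketch: the function name
         max_dpb_size is hypothetical, the division in the third
         branch is assumed to be an integer division with truncation,
         and the MaxLumaPS value passed by the caller has to be taken
         from Table A-1 of [HEVC] for the signaled highest level.

```python
def max_dpb_size(pic_size_in_samples_y: int,
                 max_luma_ps: int,
                 max_dpb: int = 6) -> int:
    """Number of decoded pictures a receiver signaling max-dpb stores.

    pic_size_in_samples_y: current picture size in luma samples.
    max_luma_ps: MaxLumaPS of the signaled highest level (caller-supplied).
    max_dpb: value of the max-dpb parameter; 6 mirrors MaxDpbPicBuf.
    """
    if not 6 <= max_dpb <= 16:
        raise ValueError("max-dpb must be in the range 6..16")
    if pic_size_in_samples_y <= (max_luma_ps >> 2):
        return min(4 * max_dpb, 16)
    if pic_size_in_samples_y <= (max_luma_ps >> 1):
        return min(2 * max_dpb, 16)
    if pic_size_in_samples_y <= ((3 * max_luma_ps) >> 2):
        # Assumption: "/" in the pseudocode truncates toward zero.
        return min((4 * max_dpb) // 3, 16)
    return max_dpb
```

         For instance, with an illustrative MaxLumaPS of 2228224 and
         max-dpb equal to 8, a 1280x720 picture (921600 luma samples)
         falls into the second branch, so the sketch yields
         Min( 2 * 8, 16 ), i.e., 16 pictures.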
1847 Informative note: This parameter was added primarily to 1848 complement a similar codepoint in the ITU-T Recommendation 1849 H.245, so as to facilitate signaling gateway designs. The 1850 decoded picture buffer stores reconstructed samples. There 1851 is no relationship between the size of the decoded picture 1852 buffer and the buffers used in RTP, especially de- 1853 packetization and de-jitter buffers. 1855 max-br: 1856 The value of max-br is an integer indicating the maximum video 1857 bitrate in units of CpbBrVclFactor bits per second for the VCL 1858 HRD parameters and in units of CpbBrNalFactor bits per second 1859 for the NAL HRD parameters, where CpbBrVclFactor and 1860 CpbBrNalFactor are defined in Section A.4 of [HEVC]. 1862 The max-br parameter signals that the video decoder of the 1863 receiver is capable of decoding video at a higher bitrate than 1864 is required by the signaled highest level. 1866 When max-br is signaled, the video codec of the receiver MUST 1867 be able to decode NAL unit streams that conform to the 1868 signaled highest level, with the following exceptions in the 1869 limits specified by the highest level: 1871 o The value of max-br replaces the MaxBR value in Table A-2 1872 of [HEVC] for the highest level. 1874 o When the max-cpb parameter is not present, the result of 1875 the following formula replaces the value of MaxCPB in Table A- 1876 1 of [HEVC]: 1878 (MaxCPB of the signaled level) * max-br / (MaxBR of the 1879 signaled highest level). 1881 For example, if a receiver signals capability for Main profile 1882 Level 2 with max-br equal to 2000, this indicates a maximum 1883 video bitrate of 2000 kbits/sec for VCL HRD parameters, a 1884 maximum video bitrate of 2200 kbits/sec for NAL HRD 1885 parameters, and a CPB size of 2000000 bits (2000000 / 1500000 1886 * 1500000). 1888 The value of max-br MUST be greater than or equal to the 1889 value MaxBR given in Table A-2 of [HEVC] for the signaled 1890 highest level. 
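
         The arithmetic of the example above can be reproduced with a
         short non-normative sketch. The function name max_br_limits
         is hypothetical. The default factors of 1000 and 1100 and
         the Level 2 values MaxBR = 1500 and MaxCPB = 1500 are
         assumptions inferred from the figures quoted in the example;
         the authoritative values are those in Annex A of [HEVC]. An
         exact fraction is used for the CPB size because the text
         does not specify rounding.

```python
from fractions import Fraction
from typing import NamedTuple, Optional


class BitrateLimits(NamedTuple):
    vcl_bitrate_bps: int      # maximum bitrate for VCL HRD parameters
    nal_bitrate_bps: int      # maximum bitrate for NAL HRD parameters
    vcl_cpb_bits: Fraction    # CPB size for VCL HRD parameters


def max_br_limits(max_br: int,
                  level_max_br: int,
                  level_max_cpb: int,
                  max_cpb: Optional[int] = None,
                  cpb_br_vcl_factor: int = 1000,
                  cpb_br_nal_factor: int = 1100) -> BitrateLimits:
    """Limits that replace MaxBR and MaxCPB when max-br is signaled."""
    if max_br < level_max_br:
        raise ValueError("max-br must not be below MaxBR of the level")
    if max_cpb is None:
        # MaxCPB is scaled by max-br / MaxBR when max-cpb is absent.
        cpb_units = Fraction(level_max_cpb * max_br, level_max_br)
    else:
        cpb_units = Fraction(max_cpb)
    return BitrateLimits(max_br * cpb_br_vcl_factor,
                         max_br * cpb_br_nal_factor,
                         cpb_units * cpb_br_vcl_factor)


# The Level 2 example from the text, with max-br equal to 2000.
limits = max_br_limits(max_br=2000, level_max_br=1500, level_max_cpb=1500)
```

         Under these assumptions, limits carries 2000000 bits/sec for
         the VCL HRD parameters, 2200000 bits/sec for the NAL HRD
         parameters, and a CPB size of 2000000 bits, matching the
         figures in the example.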
1892 Senders MAY use this knowledge to send higher bitrate video as 1893 allowed in the level definition of Annex A of HEVC to achieve 1894 improved video quality. 1896 Informative note: This parameter was added primarily to 1897 complement a similar codepoint in the ITU-T Recommendation 1898 H.245, so as to facilitate signaling gateway designs. The 1899 assumption that the network is capable of handling such 1900 bitrates at any given time cannot be made from the value of 1901 this parameter. In particular, no conclusion can be drawn 1902 that the signaled bitrate is possible under congestion 1903 control constraints. 1905 tx-mode: 1907 This parameter indicates whether the transmission mode is SST 1908 or MST. 1910 The value of tx-mode MUST be equal to either "MST" or "SST". 1911 When not present, the value of tx-mode is inferred to be equal 1912 to "SST". 1914 If the value is equal to "MST", MST MUST be in use. Otherwise 1915 (the value is equal to "SST"), SST MUST be in use. 1917 The value of tx-mode MUST be equal to "MST" for all RTP 1918 sessions in an MST. 1920 sprop-depack-buf-nalus: 1922 This parameter specifies the maximum number of NAL units that 1923 precede a NAL unit in the de-packetization buffer in reception 1924 order and follow the NAL unit in decoding order. 1926 The value of sprop-depack-buf-nalus MUST be an integer in the 1927 range of 0 to 32767, inclusive. 1929 When not present, the value of sprop-depack-buf-nalus is 1930 inferred to be equal to 0. 1932 When the RTP session depends on one or more other RTP sessions 1933 (in this case tx-mode MUST be equal to "MST"), this parameter 1934 MUST be present and the value of sprop-depack-buf-nalus MUST 1935 be greater than 0. 1937 sprop-depack-buf-bytes: 1939 This parameter signals the required size of the de- 1940 packetization buffer in units of bytes. 
         The value of the parameter MUST be greater than or equal to
         the maximum buffer occupancy (in units of bytes) of the
         de-packetization buffer as specified in Section 6.

         The value of sprop-depack-buf-bytes MUST be an integer in the
         range of 0 to 4294967295, inclusive.

         When the RTP session depends on one or more other RTP sessions
         (in this case tx-mode MUST be equal to "MST") or
         sprop-depack-buf-nalus is present and is greater than 0, this
         parameter MUST be present and the value of
         sprop-depack-buf-bytes MUST be greater than 0.

            Informative note: sprop-depack-buf-bytes indicates the
            required size of the de-packetization buffer only. When
            network jitter can occur, an appropriately sized jitter
            buffer has to be available as well.

      depack-buf-cap:

         This parameter signals the capabilities of a receiver
         implementation and indicates the amount of de-packetization
         buffer space in units of bytes that the receiver has available
         for reconstructing the NAL unit decoding order. A receiver is
         able to handle any stream for which the value of the
         sprop-depack-buf-bytes parameter is smaller than or equal to
         this parameter.

         When not present, the value of depack-buf-cap is inferred to
         be equal to 0. The value of depack-buf-cap MUST be an integer
         in the range of 0 to 4294967295, inclusive.

            Informative note: depack-buf-cap indicates the maximum
            possible size of the de-packetization buffer of the
            receiver only. When network jitter can occur, an
            appropriately sized jitter buffer has to be available as
            well.

      segmentation-id:

         This parameter MAY be used to signal the segmentation tools
         that are present in the stream and that can be used for
         parallelization. The value of segmentation-id MUST be an
         integer in the range of 0 to 3, inclusive. When not present,
         the value of segmentation-id is inferred to be equal to 0.
1987 When segmentation-id is equal to 0, no information about the 1988 segmentation tools is provided. When segmentation-id is equal 1989 to 1, it indicates that slices are present in the stream. 1990 When segmentation-id is equal to 2, it indicates that tiles 1991 are present in the stream. When segmentation-id is equal to 1992 3, it indicates that WPP is used in the stream. 1994 spatial-segmentation-idc: 1996 A base16 [RFC4648] representation of the syntax element 1997 min_spatial_segmentation_idc as specified in [HEVC]. This 1998 parameter MAY be used to describe parallelization capabilities 1999 of the stream. 2001 Encoding considerations: 2003 This type is only defined for transfer via RTP (RFC 3550). 2005 Security considerations: 2007 See Section 9 of RFC XXXX. 2009 Public specification: 2011 Please refer to Section 13 of RFC XXXX. 2013 Additional information: None 2015 File extensions: none 2017 Macintosh file type code: none 2019 Object identifier or OID: none 2021 Person & email address to contact for further information: 2023 Intended usage: COMMON 2025 Author: See Section 14 of RFC XXXX. 2027 Change controller: 2029 IETF Audio/Video Transport Payloads working group delegated 2030 from the IESG. 2032 7.2 SDP Parameters 2034 The receiver MUST ignore any parameter unspecified in this memo. 2036 7.2.1 Mapping of Payload Type Parameters to SDP 2038 The media type video/H265 string is mapped to fields in the Session 2039 Description Protocol (SDP) [RFC4566] as follows: 2041 o The media name in the "m=" line of SDP MUST be video. 2043 o The encoding name in the "a=rtpmap" line of SDP MUST be H265 (the 2044 media subtype). 2046 o The clock rate in the "a=rtpmap" line MUST be 90000. 
2048 o The OPTIONAL parameters "profile-space", "profile-id", "tier- 2049 flag", "level-id", "interop-constraints", "profile-compatibility- 2050 indicator", "sub-layer-id", "recv-sub-layer-id", "max-recv-level- 2051 id", "max-ls", "max-lps", "max-cpb", "max-dpb", "max-br", "tx- 2052 mode", "sprop-depack-buf-nalus", "sprop-depack-buf-bytes", 2053 "depack-buf-cap", "segmentation-id", and "spatial-segmentation- 2054 idc", when present, MUST be included in the "a=fmtp" line of SDP. 2055 These parameters are expressed as a media type string, in the form 2056 of a semicolon-separated list of parameter=value pairs. 2058 o The OPTIONAL parameters "sprop-vps", "sprop-sps", and "sprop- 2059 pps", when present, MUST be included in the "a=fmtp" line of SDP 2060 or conveyed using the "fmtp" source attribute as specified in 2061 section 6.3 of [RFC5576]. For a particular media format (i.e., 2062 RTP payload type), "sprop-vps", "sprop-sps", or "sprop-pps" MUST 2063 NOT be both included in the "a=fmtp" line of SDP and conveyed 2064 using the "fmtp" source attribute. When included in the "a=fmtp" 2065 line of SDP, these parameters are expressed as a media type 2066 string, in the form of a semicolon-separated list of 2067 parameter=value pairs. When conveyed using the "fmtp" source 2068 attribute, these parameters are only associated with the given 2069 source and payload type as parts of the "fmtp" source attribute. 2071 Informative note: Conveyance of "sprop-vps", "sprop-sps", and 2072 "sprop-pps" using the "fmtp" source attribute allows for out- 2073 of-band transport of parameter sets in topologies like Topo- 2074 Video-switch-MCU as specified in [RFC5117]. 2076 An example of media representation in SDP is as follows: 2078 m=video 49170 RTP/AVP 98 2079 a=rtpmap:98 H265/90000 2080 a=fmtp:98 profile-id=ST; 2081 sprop-vps=