idnits 2.17.1

draft-schierl-payload-rtp-h265-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There are 152 instances of weird spacing in the document. Is it really
     formatted ragged-right, rather than justified?

  ** There are 11 instances of too long lines in the document, the longest one
     being 14 characters in excess of 72.

  ** The abstract seems to contain references ([HEVC]), which it shouldn't.
     Please replace those with straight textual mentions of the documents in
     question.

  == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses
     in the document. If these are example addresses, they should be changed.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 28 has weird spacing: '... at any ti...'

  == Line 31 has weird spacing: '... The list ...'

  == Line 46 has weird spacing: '...fo) in effec...'

  == Line 47 has weird spacing: '...ication of t...'

  == Line 48 has weird spacing: '...ly, as they ...'

  == (147 more instances...)

  -- The document date (June 11, 2013) is 3972 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: '3GP' is mentioned on line 267, but not defined

  -- Looks like a reference, but probably isn't: '0' on line 950

  == Missing Reference: 'RFC5117' is mentioned on line 2071, but not defined

  ** Obsolete undefined reference: RFC 5117 (Obsoleted by RFC 7667)

  == Missing Reference: 'RFC2326' is mentioned on line 2271, but not defined

  ** Obsolete undefined reference: RFC 2326 (Obsoleted by RFC 7826)

  == Missing Reference: 'RFC2974' is mentioned on line 2272, but not defined

  == Missing Reference: 'RFC5583' is mentioned on line 2316, but not defined

  == Missing Reference: 'RFC3551' is mentioned on line 2474, but not defined

  == Missing Reference: 'RFC3711' is mentioned on line 2474, but not defined

  == Missing Reference: 'RFC5124' is mentioned on line 2475, but not defined

  == Missing Reference: 'I-D.ietf-avt-srtp-not-mandatory' is mentioned on
     line 2477, but not defined

  == Missing Reference: 'I-D.ietf-avtcore-rtp-security-options' is mentioned
     on line 2484, but not defined

  == Missing Reference: 'RFC 3711' is mentioned on line 2500, but not defined

  == Missing Reference: 'RFC 3551' is mentioned on line 2524, but not defined

  == Unused Reference: 'RFC6051' is defined on line 2601, but no explicit
     reference was found in the text

  == Unused Reference: '3GPPFF' is defined on line 2641, but no explicit
     reference was found in the text

  == Unused Reference: 'RFC5109' is defined on line 2653, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. 'HEVC'

  ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866)

     Summary: 6 errors (**), 0 flaws (~~), 23 warnings (==), 3 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.
--------------------------------------------------------------------------------

1 Network Working Group                                   T. Schierl
2 Internet Draft                                          Fraunhofer HHI
3 Intended status: Standards track                        S. Wenger
4 Expires: December 2013                                  Vidyo
5                                                         Y.-K. Wang
6                                                         Qualcomm
7                                                         M. M. Hannuksela
8                                                         Nokia
9                                                         Y. Sanchez
10                                                        Fraunhofer HHI
11                                                        June 11, 2013

13 RTP Payload Format for High Efficiency Video Coding
14 draft-schierl-payload-rtp-h265-03.txt

16 Status of this Memo

18 This Internet-Draft is submitted to IETF in full conformance with
19 the provisions of BCP 78 and BCP 79.

21 Internet-Drafts are working documents of the Internet Engineering
22 Task Force (IETF), its areas, and its working groups. Note that
23 other groups may also distribute working documents as Internet-
24 Drafts.

26 Internet-Drafts are draft documents valid for a maximum of six
27 months and may be updated, replaced, or obsoleted by other documents
28 at any time. It is inappropriate to use Internet-Drafts as
29 reference material or to cite them other than as "work in progress."

31 The list of current Internet-Drafts can be accessed at
32 http://www.ietf.org/ietf/1id-abstracts.txt.

34 The list of Internet-Draft Shadow Directories can be accessed at
35 http://www.ietf.org/shadow.html.

37 This Internet-Draft will expire on December 11, 2013.

39 Copyright and License Notice

41 Copyright (c) 2013 IETF Trust and the persons identified as the
42 document authors. All rights reserved.

44 This document is subject to BCP 78 and the IETF Trust's Legal
45 Provisions Relating to IETF Documents
46 (http://trustee.ietf.org/license-info) in effect on the date of
47 publication of this document. Please review these documents
48 carefully, as they describe your rights and restrictions with
49 respect to this document. Code Components extracted from this
50 document must include Simplified BSD License text as described in
51 Section 4.e of the Trust Legal Provisions and are provided without
52 warranty as described in the Simplified BSD License.

54 Abstract

56 This memo describes an RTP payload format for the video coding
57 standard ITU-T Recommendation H.265 and ISO/IEC International
58 Standard 23008-2, both also known as High Efficiency Video Coding
59 (HEVC) [HEVC], developed by the Joint Collaborative Team on Video
60 Coding (JCT-VC). The RTP payload format allows for packetization of
61 one or more Network Abstraction Layer (NAL) units in each RTP packet
62 payload, as well as fragmentation of a NAL unit into multiple RTP
63 packets. Furthermore, it supports transmission of an HEVC stream
64 over a single as well as multiple RTP flows. The payload format has
65 wide applicability in videoconferencing, Internet video streaming,
66 and high bit-rate entertainment-quality video, among others.

68 Table of Contents

70 Status of this Memo
71 Abstract
72 Table of Contents
73 1. Introduction
74    1.1. Overview of the HEVC Codec
75       1.1.1 Coding-Tool Features
76       1.1.2 Systems and Transport Interfaces
77       1.1.3 Parallel Processing Support
78       1.1.4 NAL Unit Header
79    1.2. Overview of the Payload Format
80 2. Conventions
81 3. Definitions and Abbreviations
82    3.1 Definitions
83       3.1.1 Definitions from the HEVC Specification
84       3.1.2 Definitions Specific to This Memo
85    3.2 Abbreviations
86 4. RTP Payload Format
87    4.1 RTP Header Usage
88    4.2 Payload Structures
89    4.3 Transmission Modes
90    4.4 Decoding Order Number
91    4.5 Single NAL Unit Packets
92    4.6 Aggregation Packets (APs)
93    4.7 Fragmentation Units (FUs)
94 5. Packetization Rules
95 6. De-packetization Process
96 7. Payload Format Parameters
97    7.1 Media Type Registration
98    7.2 SDP Parameters
99       7.2.1 Mapping of Payload Type Parameters to SDP
100       7.2.2 Usage with SDP Offer/Answer Model
101       7.2.3 Usage in Declarative Session Descriptions
102       7.2.4 Dependency Signaling in Multi-Session Transmission
103 8. Use with Feedback Messages
104    8.1 Definition of the SPLI Feedback Message
105    8.2 Use of HEVC with the RPSI Feedback Message
106    8.3 Use of HEVC with the SPLI Feedback Message
107 9. Security Considerations
108 10. Congestion Control
109 11. IANA Consideration
110 12. Acknowledgements
111 13. References
112    13.1 Normative References
113    13.2 Informative References
114 14. Authors' Addresses

116 1. Introduction

118 1.1. Overview of the HEVC Codec

120 High Efficiency Video Coding [HEVC], formally known as ITU-T
121 Recommendation H.265 and ISO/IEC International Standard 23008-2, was
122 ratified by ITU-T in April 2013 and reportedly provides significant
123 coding efficiency gains over H.264 [H.264].

125 As both H.264 [H.264] and its RTP payload format [RFC6184] are
126 widely deployed and generally known in the relevant implementer
127 community, frequently only the differences between those two
128 specifications are highlighted in non-normative, explanatory parts
129 of this memo. Basic familiarity with both specifications is assumed
130 for those parts. However, the normative parts of this memo do not
131 require study of H.264 or its RTP payload format.

133 H.264 and HEVC share a similar hybrid video codec design.
134 Conceptually, both technologies include a video coding layer (VCL),
135 which is often used to refer to the coding-tool features, and a
136 network abstraction layer (NAL), which is often used to refer to the
137 systems and transport interface aspects of the codecs.

139 1.1.1 Coding-Tool Features

141 Similarly to earlier hybrid-video-coding-based standards, including
142 H.264, the following basic video coding design is employed by HEVC.
143 A prediction signal is first formed either by intra or motion
144 compensated prediction, and the residual (the difference between the
145 original and the prediction) is then coded. The gains in coding
146 efficiency are achieved by redesigning and improving almost all
147 parts of the codec over earlier designs. In addition, HEVC includes
148 several tools to make the implementation on parallel architectures
149 easier. Below is a summary of HEVC coding-tool features.

151 Quad-tree block and transform structure

153 One of the major tools that contribute significantly to the coding
154 efficiency of HEVC is the usage of flexible coding blocks and
155 transforms, which are defined in a hierarchical quad-tree manner.
156 Unlike H.264, where the basic coding block is a macroblock of fixed
157 size 16x16, HEVC defines a Coding Tree Unit (CTU) of a maximum size
158 of 64x64. Each CTU can be divided into smaller units in a
159 hierarchical quad-tree manner and can represent smaller blocks down
160 to size 4x4. Similarly, the transforms used in HEVC can have
161 different sizes, starting from 4x4 and going up to 32x32. Utilizing
162 large blocks and transforms contributes to the major gain of HEVC,
163 especially at high resolutions.

165 Entropy coding

167 HEVC uses a single entropy coding engine, which is based on Context
168 Adaptive Binary Arithmetic Coding (CABAC), whereas H.264 uses two
169 distinct entropy coding engines. CABAC in HEVC shares many
170 similarities with CABAC of H.264, but contains several improvements.
171 Those include improvements in coding efficiency and lowered
172 implementation complexity, especially for parallel architectures.

174 In-loop filtering

176 H.264 includes an in-loop adaptive deblocking filter, where the
177 blocking artifacts around the transform edges in the reconstructed
178 picture are smoothed to improve the picture quality and compression
179 efficiency. In HEVC, a similar deblocking filter is employed but
180 with somewhat lower complexity. In addition, pictures undergo a
181 subsequent filtering operation called Sample Adaptive Offset (SAO),
182 which is a new design element in HEVC. SAO basically adds a pixel-
183 level offset in an adaptive manner and usually acts as a de-ringing
184 filter. It is observed that SAO improves the picture quality,
185 especially around sharp edges, contributing substantially to visual
186 quality improvements of HEVC.

188 Motion prediction and coding

190 There have been a number of improvements in this area that are
191 summarized as follows. The first category is motion merge and
192 advanced motion vector prediction (AMVP) modes. The motion
193 information of a prediction block can be inferred from the spatially
194 or temporally neighboring blocks. This is similar to the DIRECT
195 mode in H.264 but includes new aspects to incorporate the flexible
196 quad-tree structure and methods to improve the parallel
197 implementations. In addition, the motion vector predictor can be
198 signaled for improved efficiency. The second category is high-
199 precision interpolation. The interpolation filter length is
200 increased to 8-tap from 6-tap, which improves the coding efficiency
201 but also comes with increased complexity. In addition, the
202 interpolation filter is defined with higher precision without any
203 intermediate rounding operations to further improve the coding
204 efficiency.

206 Intra prediction and intra coding

208 Compared to the 8 directional intra prediction modes in H.264, HEVC
209 supports angular intra prediction with 33 directions. This increased
210 flexibility improves both objective coding efficiency and visual
211 quality as the edges can be better predicted and ringing artifacts
212 around the edges can be reduced. In addition, the reference samples
213 are adaptively smoothed based on the prediction direction. To avoid
214 contouring artifacts, a new interpolative prediction generation is
215 included to improve the visual quality. Furthermore, a discrete sine
216 transform (DST) is utilized instead of the traditional discrete
217 cosine transform (DCT) for 4x4 intra transform blocks.
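The longer interpolation filter mentioned under motion prediction above can be illustrated with a small sketch. It applies a 6-tap and an 8-tap half-sample filter to one row of luma samples. The coefficient sets are the commonly cited half-sample luma filters of H.264 and HEVC; they are an assumption of this example and should be checked against the respective specifications. The function and variable names are hypothetical, and the sketch is one-dimensional, so it does not show the removal of intermediate rounding in two-dimensional interpolation.

```python
# Illustrative sketch only: 1-D half-sample luma interpolation.
# Coefficients are assumed from common descriptions of H.264 and HEVC.
H264_HALF = (1, -5, 20, 20, -5, 1)             # 6 taps, sum 32
HEVC_HALF = (-1, 4, -11, 40, 40, -11, 4, -1)   # 8 taps, sum 64


def half_sample(row, pos, taps):
    """Interpolate the sample halfway between row[pos] and row[pos + 1]."""
    n = len(taps)
    first = pos - (n // 2 - 1)          # leftmost integer sample used
    acc = 0
    for k, coeff in enumerate(taps):
        # Replicate the edge samples when the filter reaches past the row.
        idx = min(max(first + k, 0), len(row) - 1)
        acc += coeff * row[idx]
    norm = sum(taps)
    return (acc + norm // 2) // norm    # normalize with rounding


if __name__ == "__main__":
    edge = [0, 0, 0, 0, 100, 100, 100, 100]
    print(half_sample(edge, 3, H264_HALF), half_sample(edge, 3, HEVC_HALF))
```

The 8-tap filter reads two more integer samples per output sample than the 6-tap filter, which is one source of the increased complexity noted above.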
219 Other coding-tool features

221 HEVC includes some tools for lossless coding and efficient screen
222 content coding, such as skipping the transform coding for certain
223 blocks. These tools are particularly useful, for example, when
224 streaming the user interface of a mobile device to a large display.

226 1.1.2 Systems and Transport Interfaces

228 HEVC inherited the basic systems and transport interface designs,
229 such as the NAL-unit-based syntax structure, the hierarchical syntax
230 and data unit structure from sequence-level parameter sets, multi-
231 picture-level or picture-level parameter sets, slice-level header
232 parameters, lower-level parameters, the supplemental enhancement
233 information (SEI) message mechanism, the hypothetical reference
234 decoder (HRD) based video buffering model, and so on. The
235 differences in these aspects compared to H.264 are summarized
236 below.

238 Video parameter set

240 A new type of parameter set, called video parameter set (VPS), was
241 introduced. For the first (2013) version of [HEVC], the video
242 parameter set NAL unit is required to be available prior to its
243 activation, while the information contained in the video parameter
244 set is not necessary for operation of the decoding process. For
245 future HEVC extensions, such as the 3D or scalable extensions, the
246 video parameter set is expected to include information necessary for
247 operation of the decoding process, e.g. decoding dependency or
248 information for reference picture set construction of enhancement
249 layers. The VPS provides a "big picture" of a bitstream, including
250 what types of operation points are provided, the profile, tier, and
251 level of the operation points, and some other high-level properties
252 of the bitstream that can be used as the basis for session
253 negotiation and content selection, etc. (see section 7.1).

255 Profile, tier and level

257 The profile, tier and level syntax structure that can be included in
258 both VPS and sequence parameter set (SPS) includes 12 bytes of data
259 to describe the entire bitstream (including all temporally scalable
260 layers, which are referred to as sub-layers in the HEVC
261 specification), and can optionally include more profile, tier and
262 level information pertaining to individual temporally scalable
263 layers. The profile indicator indicates the "best viewed as"
264 profile when the bitstream conforms to multiple profiles, similar to
265 the major brand concept in the ISO base media file format (ISOBMFF)
266 [ISOBMFF] and file formats derived based on ISOBMFF, such as the
267 3GPP file format [3GP]. The profile, tier and level syntax
268 structure also includes the indications of whether the bitstream is
269 free of frame-packed content, whether the bitstream is free of
270 interlaced source content and free of field pictures, i.e., contains
271 only frame pictures of progressive source, such that clients/players
272 with no support of post-processing functionalities for handling of
273 frame-packed or interlaced source content or field pictures can
274 reject those bitstreams.

276 Bitstream and elementary stream

278 HEVC includes a definition of an elementary stream, which is new
279 compared to H.264. An elementary stream consists of a sequence of
280 one or more bitstreams. An elementary stream that consists of two
281 or more bitstreams has typically been formed by splicing together
282 two or more bitstreams (or parts thereof). When an elementary
283 stream contains more than one bitstream, the last NAL unit of the
284 last access unit of a bitstream (except the last bitstream in the
285 elementary stream) must be an end of bitstream NAL unit and the
286 first access unit of the subsequent bitstream must be an intra
287 random access point (IRAP) access unit. This IRAP access unit may
288 be a clean random access (CRA), broken link access (BLA), or
289 instantaneous decoding refresh (IDR) access unit.

291 Random access support

293 HEVC includes signaling in the NAL unit header, through NAL unit
294 types, of IRAP pictures beyond IDR pictures. Three types of IRAP
295 pictures, namely IDR, CRA and BLA pictures, are supported, wherein
296 IDR pictures are conventionally referred to as closed group-of-
297 pictures (closed-GOP) random access points, and CRA and BLA pictures
298 are those conventionally referred to as open-GOP random access
299 points. BLA pictures usually originate from splicing of two
300 bitstreams or parts thereof at a CRA picture, e.g. during stream
301 switching. To enable better systems usage of IRAP pictures,
302 altogether six different NAL unit types are defined to signal the
303 properties of the IRAP pictures, which can be used to better match
304 the stream access point (SAP) types as defined in the ISOBMFF
305 [ISOBMFF], which are utilized for random access support in both
306 3GP-DASH [3GPDASH] and MPEG DASH [MPEGDASH]. Pictures following an
307 IRAP picture in decoding order and preceding the IRAP picture in
308 output order are referred to as leading pictures associated with the
309 IRAP picture. There are two types of leading pictures, namely
310 random access decodable leading (RADL) pictures and random access
311 skipped leading (RASL) pictures. RADL pictures are decodable when
312 decoding starts at the associated IRAP picture, and RASL pictures
313 are not decodable when decoding starts at the associated IRAP
314 picture and are usually discarded. HEVC provides mechanisms to
315 enable the specification of conformance of bitstreams with RASL
316 pictures being discarded, thus providing a standard-compliant way to
317 enable systems components to discard RASL pictures when needed.
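The handling of leading pictures described above can be sketched as a small filter that a receiver might run after starting to decode at a random access point. This is a simplified illustration, not part of the payload format: it assumes one NAL unit per picture, the numeric NAL unit type values are assumed from Table 7-1 of [HEVC] and should be verified there, and the function name is hypothetical.

```python
# Illustrative sketch only. NAL unit type values are assumed from
# Table 7-1 of H.265 (version 1); verify before relying on them.
TRAIL_R = 1
RADL_N, RADL_R = 6, 7
RASL_N, RASL_R = 8, 9
BLA_W_LP, BLA_W_RADL, BLA_N_LP = 16, 17, 18
IDR_W_RADL, IDR_N_LP = 19, 20
CRA_NUT = 21

# The six NAL unit types that signal IRAP pictures.
IRAP_TYPES = {BLA_W_LP, BLA_W_RADL, BLA_N_LP, IDR_W_RADL, IDR_N_LP, CRA_NUT}
RASL_TYPES = {RASL_N, RASL_R}


def drop_undecodable_rasl(nal_types):
    """Drop RASL pictures associated with the IRAP picture at which
    decoding starts.

    nal_types lists the NAL unit types of coded pictures in decoding
    order (hypothetical simplification: one NAL unit per picture).
    """
    if not nal_types or nal_types[0] not in IRAP_TYPES:
        raise ValueError("decoding must start at an IRAP picture")
    kept = []
    irap_seen = 0
    for nal_type in nal_types:
        if nal_type in IRAP_TYPES:
            irap_seen += 1
        # RASL pictures that follow the first IRAP picture, and precede
        # the next one, may reference pictures that were never received.
        if nal_type in RASL_TYPES and irap_seen == 1:
            continue
        kept.append(nal_type)
    return kept


if __name__ == "__main__":
    stream = [CRA_NUT, RASL_N, RASL_R, RADL_N, TRAIL_R, CRA_NUT, RASL_R, TRAIL_R]
    print(drop_undecodable_rasl(stream))
```

RADL pictures are kept because they are decodable from the associated IRAP picture, and RASL pictures of later IRAP pictures are kept because their references have been decoded by then.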
319 Temporal scalability support

321 HEVC includes improved support for temporal scalability, by
322 inclusion of the signaling of TemporalId in the NAL unit header, the
323 restriction that pictures of a particular temporal sub-layer cannot
324 be used for inter prediction reference by pictures of a lower
325 temporal sub-layer, the sub-bitstream extraction process, and the
326 requirement that each sub-bitstream extraction output be a
327 conforming bitstream. Media-aware network elements (MANEs) can
328 utilize the TemporalId in the NAL unit header for stream adaptation
329 purposes based on temporal scalability.

331 Temporal sub-layer switching support

333 HEVC specifies, through NAL unit types present in the NAL unit
334 header, the signaling of temporal sub-layer access (TSA) and
335 stepwise temporal sub-layer access (STSA). A TSA picture and
336 pictures following the TSA picture in decoding order do not use
337 pictures prior to the TSA picture in decoding order with TemporalId
338 greater than or equal to that of the TSA picture for inter
339 prediction reference. A TSA picture enables up-switching, at the
340 TSA picture, to the sub-layer containing the TSA picture or any
341 higher sub-layer, from the immediately lower sub-layer. An STSA
342 picture does not use pictures with the same TemporalId as the STSA
343 picture for inter prediction reference. Pictures following an STSA
344 picture in decoding order with the same TemporalId as the STSA
345 picture do not use pictures prior to the STSA picture in decoding
346 order with the same TemporalId as the STSA picture for inter
347 prediction reference. An STSA picture enables up-switching, at the
348 STSA picture, to the sub-layer containing the STSA picture, from the
349 immediately lower sub-layer.

351 Sub-layer reference or non-reference pictures

353 The concept and signaling of reference/non-reference pictures in
354 HEVC are different from H.264. In H.264, if a picture may be used
355 by any other picture for inter prediction reference, it is a
356 reference picture; otherwise it is a non-reference picture, and this
357 is signaled by two bits in the NAL unit header. In HEVC, a picture
358 is called a reference picture only when it is marked as "used for
359 reference". In addition, the concept of sub-layer reference picture
360 was introduced. If a picture may be used by another picture
361 with the same TemporalId for inter prediction reference, it is a
362 sub-layer reference picture; otherwise it is a sub-layer non-
363 reference picture. Whether a picture is a sub-layer reference
364 picture or sub-layer non-reference picture is signaled through NAL
365 unit type values.

367 Extensibility

369 Besides the TemporalId in the NAL unit header, HEVC also includes
370 the signaling of a six-bit layer ID in the NAL unit header, which
371 must be equal to 0 for a single-layer bitstream. Extension
372 mechanisms have been included in the VPS, SPS, PPS, SEI NAL units,
373 slice headers, and so on. All these extension mechanisms enable
374 future extensions in a backward compatible manner, such that
375 bitstreams encoded according to potential future HEVC extensions can
376 be fed to then-legacy decoders (e.g. HEVC version 1 decoders) and the
377 then-legacy decoders can decode and output the base layer bitstream.

379 Bitstream extraction

381 HEVC includes a bitstream extraction process as an integral part of
382 the overall decoding process, as well as specification of the use of
383 the bitstream extraction process in description of bitstream
384 conformance tests as part of the hypothetical reference decoder
385 (HRD) specification.

387 Reference picture management

389 The reference picture management of HEVC, including reference
390 picture marking and removal from the decoded picture buffer (DPB) as
391 well as reference picture list construction (RPLC), differs from
392 that of H.264. Instead of the sliding window plus adaptive memory
393 management control operation (MMCO) based reference picture marking
394 mechanism in H.264, HEVC specifies a reference picture set (RPS)
395 based reference picture management and marking mechanism, and the
396 RPLC is consequently based on the RPS mechanism. A reference
397 picture set consists of a set of reference pictures associated with
398 a picture, consisting of all reference pictures that are prior to
399 the associated picture in decoding order, that may be used for inter
400 prediction of the associated picture or any picture following the
401 associated picture in decoding order. The reference picture set
402 consists of five lists of reference pictures: RefPicSetStCurrBefore,
403 RefPicSetStCurrAfter, RefPicSetStFoll, RefPicSetLtCurr and
404 RefPicSetLtFoll. RefPicSetStCurrBefore, RefPicSetStCurrAfter and
405 RefPicSetLtCurr contain all reference pictures that may be used in
406 inter prediction of the current picture and that may be used in
407 inter prediction of one or more of the pictures following the
408 current picture in decoding order. RefPicSetStFoll and
409 RefPicSetLtFoll consist of all reference pictures that are not used
410 in inter prediction of the current picture but may be used in inter
411 prediction of one or more of the pictures following the current
412 picture in decoding order. RPS provides an "intra-coded" signaling
413 of the DPB status, instead of an "inter-coded" signaling, mainly for
414 improved error resilience. The RPLC process in HEVC is based on the
415 RPS, by signaling an index to an RPS subset for each reference
416 index. The RPLC process has been simplified compared to that in
417 H.264, by removal of the reference picture list modification (also
418 referred to as reference picture list reordering) process.

420 Ultra low delay support

422 HEVC specifies a sub-picture-level HRD operation, for support of the
423 so-called ultra-low delay. The mechanism specifies a standard-
424 compliant way to enable delay reduction below one picture interval.
425 Sub-picture-level coded picture buffer (CPB) and DPB parameters may
426 be signaled, and utilization of this information for the derivation
427 of CPB timing (wherein the CPB removal time corresponds to decoding
428 time) and DPB output timing (display time) is specified. Decoders
429 are allowed to operate the HRD at the conventional access-unit-
430 level, even when the sub-picture-level HRD parameters are present.

432 New SEI messages

434 HEVC inherits many H.264 SEI messages with changes in syntax and/or
435 semantics making them applicable to HEVC. The active parameter sets
436 SEI message includes the IDs of the active video parameter set and
437 the active sequence parameter set and can be used to activate VPSs
438 and SPSs. In addition, the SEI message includes the following
439 indications: 1) An indication of whether "full random accessibility"
440 is supported (when supported, all parameter sets needed for decoding
441 of the remainder of the bitstream when random accessing from the
442 beginning of the current coded video sequence by completely
443 discarding all access units earlier in decoding order are present in
444 the remaining bitstream and all coded pictures in the remaining
445 bitstream can be correctly decoded); 2) An indication of whether
446 there is any parameter set within the current coded video sequence
447 that updates another parameter set of the same type preceding in
448 decoding order. An update of a parameter set refers to the use of
449 the same parameter set ID but with some other parameters changed.
450 If this property is true for all coded video sequences in the
451 bitstream, then all parameter sets can be sent out-of-band before
452 session start. The region refresh information SEI message can be
453 used together with the recovery point SEI message (present in both
454 H.264 and HEVC) for improved support of gradual decoding refresh
455 (GDR). This supports random access from inter-coded pictures,
456 wherein complete pictures can be correctly decoded or recovered
457 after an indicated number of pictures in output/display order.

459 1.1.3 Parallel Processing Support

461 The reportedly significantly higher computational demand of HEVC
462 over H.264 (especially with respect to encoders, where a complexity
463 increase of a factor of ten has often been reported), in conjunction
464 with the ever-increasing video resolution (both spatially and
465 temporally) required by the market, led to the adoption of VCL
466 coding tools specifically targeted to allow for parallelization on
467 the sub-picture level. That is, parallelization occurs, at the
468 minimum, at the granularity of an integer number of CTUs. The
469 targets for this type of high-level parallelization are multicore
470 CPUs and DSPs as well as multiprocessor systems. In a system
471 design, to be useful, these tools require signaling support, which
472 is provided in Section 7 of this memo. This section provides a
473 brief overview of the tools available in [HEVC].

475 Many of the tools incorporated in HEVC were designed keeping in mind
476 the potential parallel implementations in multi-core/multi-processor
477 architectures. Specifically, for parallelization, four picture
478 partition strategies are available.

480 Slices are segments of the bitstream that can be reconstructed
481 independently from other slices within the same picture (though
482 there may still be interdependencies through loop filtering
483 operations). Slices are the only tool that can be used for
484 parallelization that is also available, in virtually identical form,
485 in H.264. Slice-based parallelization does not require much inter-
486 processor or inter-core communication (except for inter-processor or
487 inter-core data sharing for motion compensation when decoding a
488 predictively coded picture, which is typically much heavier than
489 inter-processor or inter-core data sharing due to in-picture
490 prediction), as slices are designed to be independently decodable.
491 However, for the same reason, slices can require some coding
492 overhead. Further, slices (in contrast to some of the other tools
493 mentioned below) also serve as the key mechanism for bitstream
494 partitioning to match Maximum Transfer Unit (MTU) size requirements,
495 due to the in-picture independence of slices and the fact that each
496 regular slice is encapsulated in its own NAL unit. In many cases,
497 the goal of parallelization and the goal of MTU size matching can
498 place contradictory demands on the slice layout in a picture. The
499 realization of this situation led to the development of the more
500 advanced tools mentioned below. This payload format does not
501 contain any specific mechanisms aiding parallelization through
502 slices.

504 Dependent slice segments allow for fragmentation of a coded slice
505 into fragments at CTU boundaries without breaking any in-picture
506 prediction mechanism. They are complementary to the fragmentation
507 mechanism described in this memo in that they need the cooperation
508 of the encoder. As a dependent slice segment necessarily contains
509 an integer number of CTUs, a decoder using multiple cores operating
510 on CTUs can process a dependent slice segment without communicating
511 parts of the slice segment's bitstream to other cores.
512 Fragmentation, as specified in this memo, in contrast, does not
513 guarantee that a fragment contains an integer number of CTUs.

515 In wavefront parallel processing (WPP), the picture is partitioned
516 into rows of CTUs. Entropy decoding and prediction are allowed to
517 use data from CTUs in other partitions. Parallel processing is
518 possible through parallel decoding of CTU rows, where the start of
519 the decoding of a row is delayed by two CTUs, so as to ensure that
520 data related to a CTU above and to the right of the subject CTU is
521 available before the subject CTU is decoded. Using this
522 staggered start (which appears like a wavefront when represented
523 graphically), parallelization is possible with up to as many
524 processors/cores as the picture contains CTU rows.

526 Because in-picture prediction between neighboring CTU rows within a
527 picture is allowed, the required inter-processor/inter-core
528 communication to enable in-picture prediction can be substantial.
529 The WPP partitioning does not result in the creation of more NAL
530 units compared to when it is not applied, thus WPP cannot be used
531 for MTU size matching, though slices can be used in combination for
532 that purpose.

534 Tiles define horizontal and vertical boundaries that partition a
535 picture into tile columns and rows. The scan order of CTUs is
536 changed to be local within a tile (in the order of a CTU raster scan
537 of a tile), before decoding the top-left CTU of the next tile in the
538 order of tile raster scan of a picture. Similar to slices, tiles
539 break in-picture prediction dependencies (including entropy decoding
540 dependencies). However, they do not need to be included in
541 individual NAL units (same as WPP in this regard), hence tiles
542 cannot be used for MTU size matching, though slices can be used in
543 combination for that purpose. Each tile can be processed by one
544 processor/core, and the inter-processor/inter-core communication
545 required for in-picture prediction between processing units decoding
546 neighboring tiles is limited to conveying the shared slice header in
547 cases where a slice spans more than one tile, and loop filtering
548 related sharing of reconstructed samples and metadata. In this
549 respect, tiles are less demanding in terms of inter-processor
550 communication bandwidth compared to WPP due to the in-picture
551 independence between two neighboring partitions.

553 1.1.4 NAL Unit Header

555 HEVC maintains the NAL unit concept of H.264 with modifications.
556 HEVC uses a two-byte NAL unit header, as shown in Figure 1. The
557 payload of a NAL unit refers to the NAL unit excluding the NAL unit
558 header.

560 +---------------+---------------+
561 |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
562 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
563 |F|   Type    |  LayerId  | TID |
564 +-------------+-----------------+

566 Figure 1 The structure of the HEVC NAL unit header

568 The semantics of the fields in the NAL unit header are as specified
569 in [HEVC] and described briefly below for convenience. In addition
570 to the name and size of each field, the corresponding syntax element
571 name in [HEVC] is also provided.

573 F: 1 bit
574 forbidden_zero_bit. MUST be zero. HEVC declares a value of 1 as
575 a syntax violation. Note that the inclusion of this bit in the
576 NAL unit header is to enable transport of HEVC video over MPEG-2
577 transport systems (avoidance of start code emulations) [MPEG2S].

579 Type: 6 bits
580 nal_unit_type. This field specifies the NAL unit type as defined
581 in Table 7-1 of [HEVC]. For a reference of all currently defined
582 NAL unit types and their semantics, please refer to Section 7.4.1
583 in [HEVC].

585 LayerId: 6 bits
586 nuh_layer_id. MUST be equal to zero.
It is anticipated that in 587 future scalable or 3D video coding extensions of this 588 specification, this syntax element will be used to identify 589 additional layers that may be present in the coded video 590 sequence, wherein a layer may be, e.g. a spatial scalable layer, 591 a quality scalable layer, a texture view, or a depth view. 593 TID: 3 bits 594 nuh_temporal_id_plus1. This field specifies the temporal 595 identifier of the NAL unit plus 1. The value of TemporalId is 596 equal to TID minus 1. A TID value of 0 is illegal to ensure that 597 there is at least one bit in the NAL unit header equal to 1, so 598 to enable independent considerations of start code emulations in 599 the NAL unit header and in the NAL unit payload data. 601 1.2. Overview of the Payload Format 603 This payload format defines the following processes required for 604 transport of HEVC coded data over RTP [RFC3550]: 606 o Usage of RTP header with this payload format 608 o Packetization of HEVC coded NAL units into RTP packets using three 609 types of payload structures, namely single NAL unit packet, 610 aggregation packet, and fragment unit 612 o Transmission of HEVC NAL units of the same bitstream within a 613 single RTP session or multiple RTP sessions 615 o Media type parameters to be used with the Session Description 616 Protocol (SDP) [RFC4566] 618 2. Conventions 620 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 621 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 622 document are to be interpreted as described in BCP 14, RFC 2119 623 [RFC2119]. 625 This specification uses the notion of setting and clearing a bit 626 when bit fields are handled. Setting a bit is the same as assigning 627 that bit the value of 1 (On). Clearing a bit is the same as 628 assigning that bit the value of 0 (Off). 630 3. Definitions and Abbreviations 632 3.1 Definitions 634 This document uses the terms and definitions of [HEVC]. 
Section 635 3.1.1 lists relevant definitions copied from [HEVC] for convenience. 636 Section 3.1.2 gives definitions specific to this memo. 638 3.1.1 Definitions from the HEVC Specification 640 access unit: A set of NAL units that are associated with each other 641 according to a specified classification rule, are consecutive in 642 decoding order, and contain exactly one coded picture. 644 BLA access unit: An access unit in which the coded picture is a BLA 645 picture. 647 BLA picture: An IRAP picture for which each VCL NAL unit has 648 nal_unit_type equal to BLA_W_LP, BLA_W_RADL, or BLA_N_LP. 650 coded video sequence: A sequence of access units that consists, in 651 decoding order, of an IRAP access unit with NoRaslOutputFlag equal 652 to 1, followed by zero or more access units that are not IRAP access 653 units with NoRaslOutputFlag equal to 1, including all subsequent 654 access units up to but not including any subsequent access unit that 655 is an IRAP access unit with NoRaslOutputFlag equal to 1. 657 Informative note: An IRAP access unit may be an IDR access unit, 658 a BLA access unit, or a CRA access unit. The value of 659 NoRaslOutputFlag is equal to 1 for each IDR access unit, each BLA 660 access unit, and each CRA access unit that is the first access 661 unit in the bitstream in decoding order, is the first access unit 662 that follows an end of sequence NAL unit in decoding order, or 663 has HandleCraAsBlaFlag equal to 1. 665 CRA access unit: An access unit in which the coded picture is a CRA 666 picture. 668 CRA picture: A RAP picture for which each slice has nal_unit_type 669 equal to CRA_NUT. 671 IDR access unit: An access unit in which the coded picture is an IDR 672 picture. 674 IDR picture: A RAP picture for which each slice has nal_unit_type 675 equal to IDR_W_RADL or IDR_N_LP. 677 IRAP access unit: An access unit in which the coded picture is an 678 IRAP picture. 
680 IRAP picture: A coded picture for which each VCL NAL unit has 681 nal_unit_type in the range of BLA_W_LP to RSV_IRAP_VCL23, inclusive. 683 layer: A set of VCL NAL units that all have a particular value of 684 nuh_layer_id and the associated non-VCL NAL units, or one of a set 685 of syntactical structures having a hierarchical relationship. 687 operation point: bitstream created from another bitstream by 688 operation of the sub-bitstream extraction process with the another 689 bitstream, a target highest TemporalId, and a target layer 690 identifier list as inputs. 692 random access: The act of starting the decoding process for a 693 bitstream at a point other than the beginning of the stream. 695 sub-layer: A temporal scalable layer of a temporal scalable 696 bitstream consisting of VCL NAL units with a particular value of the 697 TemporalId variable, and the associated non-VCL NAL units. 699 tile: A rectangular region of coding tree blocks within a particular 700 tile column and a particular tile row in a picture. 702 tile column: A rectangular region of coding tree blocks having a 703 height equal to the height of the picture and a width specified by 704 syntax elements in the picture parameter set. 706 tile row: A rectangular region of coding tree blocks having a height 707 specified by syntax elements in the picture parameter set and a 708 width equal to the width of the picture. 710 3.1.2 Definitions Specific to This Memo 712 media aware network element (MANE): A network element, such as a 713 middlebox or application layer gateway that is capable of parsing 714 certain aspects of the RTP payload headers or the RTP payload and 715 reacting to their contents. 717 Informative note: The concept of a MANE goes beyond normal 718 routers or gateways in that a MANE has to be aware of the 719 signaling (e.g., to learn about the payload type mappings of the 720 media streams), and in that it has to be trusted when working 721 with SRTP. 
The advantage of using MANEs is that they allow 722 packets to be dropped according to the needs of the media coding. 723 For example, if a MANE has to drop packets due to congestion on a 724 certain link, it can identify and remove those packets whose 725 elimination produces the least adverse effect on the user 726 experience. After dropping packets, MANEs must rewrite RTCP 727 packets to match the changes to the RTP packet stream as 728 specified in Section 7 of [RFC3550]. 730 NAL unit decoding order: A NAL unit order that conforms to the 731 constraints on NAL unit order given in Section 7.4.2.4 in [HEVC]. 733 NALU-time: The value that the RTP timestamp would have if the NAL 734 unit would be transported in its own RTP packet. 736 RTP packet stream: A sequence of RTP packets with increasing 737 sequence numbers (except for wrap-around), identical PT and 738 identical SSRC (Synchronization Source), carried in one RTP session. 739 Within the scope of this memo, one RTP packet stream is utilized to 740 transport one or more temporal sub-layers. 742 transmission order: The order of packets in ascending RTP sequence 743 number order (in modulo arithmetic). Within an aggregation packet, 744 the NAL unit transmission order is the same as the order of 745 appearance of NAL units in the packet. 747 base session: an RTP session in Multi-Session Transmission mode that 748 transports a bitstream subset which the rest of RTP sessions in the 749 Multi-Session Transmission depends on. [Ed. (YK): Check the need of 750 this definition after the draft is more complete.] 
752 3.2 Abbreviations 754 AP Aggregation Packet 756 BLA Broken Link Access 758 CRA Clean Random Access 760 CTB Coding Tree Block 762 CTU Coding Tree Unit 763 CVS Coded Video Sequence 765 FU Fragmentation Unit 767 GDR Gradual Decoding Refresh 769 HRD Hypothetical Reference Decoder 771 IDR Instantaneous Decoding Refresh 773 IRAP Intra Random Access Point 775 MANE Media Aware Network Element 777 MST Multi-Session Transmission 779 MTU Maximum Transfer Unit 781 NAL Network Abstraction Layer 783 NALU Network Abstraction Layer Unit 785 PPS Picture Parameter Set 787 RADL Random Access Decodable Leading (Picture) 789 RASL Random Access Skipped Leading (Picture) 791 RPS Reference Picture Set 793 SEI Supplemental Enhancement Information 795 SPS Sequence Parameter Set 797 SST Single-Session Transmission 799 STSA Step-wise Temporal Sub-layer Access 801 TSA Temporal Sub-layer Access 803 VCL Video Coding Layer 805 VPS Video Parameter Set 807 4. RTP Payload Format 809 4.1 RTP Header Usage 811 The format of the RTP header is specified in [RFC3550] and reprinted 812 in Figure 2 for convenience. This payload format uses the fields of 813 the header in a manner consistent with that specification. 815 The RTP payload (and the settings for some RTP header bits) for 816 aggregation packets and fragmentation units are specified in 817 Sections 4.6 and 4.7, respectively. 819 0 1 2 3 820 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 821 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 822 |V=2|P|X| CC |M| PT | sequence number | 823 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 824 | timestamp | 825 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 826 | synchronization source (SSRC) identifier | 827 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ 828 | contributing source (CSRC) identifiers | 829 | .... 
| 830 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 832 Figure 2 RTP header according to [RFC3550] 834 The RTP header information to be set according to this RTP payload 835 format is set as follows: 837 Marker bit (M): 1 bit 839 Set for the last packet of the access unit indicated by the RTP 840 timestamp, in line with the normal use of the M bit in video 841 formats, to allow an efficient playout buffer handling. Decoders 842 can use this bit as an early indication of the last packet of an 843 access unit. 845 Payload type (PT): 7 bits 847 The assignment of an RTP payload type for this new packet format 848 is outside the scope of this document and will not be specified 849 here. The assignment of a payload type has to be performed 850 either through the profile used or in a dynamic way. 852 Sequence number (SN): 16 bits 854 Set and used in accordance with RFC 3550. 856 Timestamp: 32 bits 858 The RTP timestamp is set to the sampling timestamp of the 859 content. A 90 kHz clock rate MUST be used. 861 If the NAL unit has no timing properties of its own (e.g., 862 parameter set and SEI NAL units), the RTP timestamp is set to the 863 RTP timestamp of the coded picture of the access unit in which 864 the NAL unit is included, according to Section 7.4.2.4.4 of 865 [HEVC]. 867 Receivers SHOULD ignore the picture output timing information in 868 any picture timing SEI messages or decoding unit information SEI 869 messages as specified in [HEVC]. Instead, receivers SHOULD use 870 the RTP timestamp for the display process. Receivers MUST pass 871 picture timing SEI messages and decoding unit information SEI 872 messages to the decoder and MAY use the field/frame related 873 information for the display process e.g. when frame doubling or 874 frame tripling is indicated by the field/frame related 875 information. 877 4.2 Payload Structures 879 The first two bytes of the payload of an RTP packet are referred to 880 as the payload header. 
The payload header consists of the same 881 fields (F, Type, LayerId, and TID) as the NAL unit header as shown 882 in section 1.1.4, irrespective of the type of the payload structure. 884 Three different types of RTP packet payload structures are 885 specified. A receiver can identify the type of an RTP packet 886 payload through the Type field in the payload header. 888 The three different payload structures are as follows: 890 o Single NAL unit packet: Contains a single NAL unit in the 891 payload, and the NAL unit header of the NAL unit also serves as 892 the payload header. This payload structure is specified in 893 section 4.5. 895 o Aggregation packet (AP): Contains one or more NAL units within 896 one access unit. This payload structure is specified in section 897 4.6. 899 o Fragmentation unit (FU): Contains a subset of a single NAL unit. 900 This payload structure is specified in section 4.7. 902 4.3 Transmission Modes 904 This memo enables transmission of an HEVC bitstream over a single 905 RTP session or multiple RTP sessions. The concept and working 906 principle are inherited from [RFC6190] and follow a similar design. 907 If only one RTP session is used for transmission of the HEVC 908 bitstream, the transmission mode is referred to as single-session 909 transmission (SST); otherwise (more than one RTP session is used for 910 transmission of the HEVC bitstream), the transmission mode is 911 referred to as multi-session transmission (MST). 913 [Ed. (YK): Unify the style of abbreviated words throughout the 914 document.] 916 SST SHOULD be used for point-to-point unicast scenarios, while MST 917 SHOULD be used for point-to-multipoint multicast scenarios where 918 different receivers require different operation points of the same 919 HEVC bitstream, to improve bandwidth utilization efficiency.
921 Informative note: A multicast may degrade to a unicast after all 922 but one receiver has left (this is a justification of the first 923 "SHOULD" instead of "MUST"), and there might be scenarios where 924 MST is desirable but not possible, e.g., when IP multicast is not 925 deployed in a certain network (this is a justification of the 926 second "SHOULD" instead of "MUST"). 928 The transmission mode is indicated by the tx-mode media parameter 929 (see section 7.1). If tx-mode is equal to "SST", SST MUST be used. 930 Otherwise (tx-mode is equal to "MST"), MST MUST be used. 932 4.4 Decoding Order Number 934 For each NAL unit, the variable AbsDon is derived, representing the 935 decoding order number that is indicative of the NAL unit decoding 936 order. 938 Let NAL unit n be the n-th NAL unit in transmission order within an 939 RTP session. 941 If tx-mode is equal to "SST" and sprop-depack-buf-nalus is equal 942 to 0, AbsDon[n], the value of AbsDon for NAL unit n, is derived as 943 equal to n. 945 Otherwise (tx-mode is equal to "MST" or sprop-depack-buf-nalus is 946 greater than 0), AbsDon[n] is derived as follows, where DON[n] is 947 the value of the variable DON for NAL unit n: 949 o If n is equal to 0 (i.e. NAL unit n is the very first NAL unit in 950 transmission order), AbsDon[0] is set equal to DON[0].
952 o Otherwise (n is greater than 0), the following applies for 953 derivation of AbsDon[n]: 955 If DON[n] == DON[n-1], 956 AbsDon[n] = AbsDon[n-1] 958 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768), 959 AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1] 961 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768), 962 AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n] 964 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768), 965 AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - DON[n]) 967 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768), 968 AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n]) 970 For any two NAL units m and n, the following applies: 972 o AbsDon[n] greater than AbsDon[m] indicates that NAL unit n 973 follows NAL unit m in NAL unit decoding order. 975 o When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order 976 of the two NAL units can be in either order. 978 o AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes 979 NAL unit m in decoding order. 981 When two consecutive NAL units in the NAL unit decoding order have 982 different values of AbsDon, the value of AbsDon for the second NAL 983 unit in decoding order MUST be greater than the value of AbsDon for 984 the first NAL unit, and the absolute difference between the two 985 AbsDon values MAY be greater than or equal to 1. 987 Informative note: There are multiple reasons to allow for the 988 absolute difference of the values of AbsDon for two consecutive 989 NAL units in the NAL unit decoding order to be greater than one. 990 An increment by one is not required, as at the time of 991 associating values of AbsDon to NAL units, it may not be known 992 whether all NAL units are to be delivered to the receiver. For 993 example, a gateway may not forward coded slice NAL units of 994 higher sub-layers or some SEI NAL units when there is congestion 995 in the network. 
In another example, the first intra picture of a 996 pre-encoded clip is transmitted in advance to ensure that it is 997 readily available in the receiver, and when transmitting the 998 first intra picture, the originator does not know exactly how 999 many NAL units will be encoded before the first intra picture of 1000 the pre-encoded clip follows in decoding order. Thus, the values 1001 of AbsDon for the NAL units of the first intra picture of the 1002 pre-encoded clip have to be estimated when they are transmitted, 1003 and gaps in values of AbsDon may occur. Another example is MST 1004 where the AbsDon values must indicate cross-layer decoding order 1005 for NAL units conveyed in all the RTP sessions. 1007 4.5 Single NAL Unit Packets 1009 A single NAL unit packet contains exactly one NAL unit, and consists 1010 of a payload header (denoted as PayloadHdr), an optional 16-bit DONL 1011 field (in network byte order), and the NAL unit payload data (the 1012 NAL unit excluding its NAL unit header) of the contained NAL unit, 1013 as shown in Figure 3. 1015 0 1 2 3 1016 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1017 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1018 | PayloadHdr | DONL (optional) | 1019 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1020 | | 1021 | NAL unit payload data | 1022 | | 1023 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1024 | :...OPTIONAL RTP padding | 1025 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1027 Figure 3 The structure of a single NAL unit packet 1029 The payload header MUST be an exact copy of the NAL unit header of 1030 the contained NAL unit. 1032 The DONL field, when present, specifies the value of the 16 least 1033 significant bits of the decoding order number of the contained NAL 1034 unit.
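Informative example: The DON value carried in a DONL field feeds the AbsDon derivation of section 4.4. The following Python sketch mirrors the five cases of that derivation one by one. The function names next_abs_don and abs_dons are illustrative only and are not defined by this memo; the sketch assumes the DON values are supplied in transmission order as 16-bit unsigned integers.

```python
def next_abs_don(prev_abs_don, prev_don, don):
    """Return AbsDon[n] from AbsDon[n-1], DON[n-1] and DON[n] (section 4.4)."""
    if don == prev_don:
        return prev_abs_don
    if don > prev_don and don - prev_don < 32768:
        # Forward step without wrap-around.
        return prev_abs_don + don - prev_don
    if don < prev_don and prev_don - don >= 32768:
        # Forward step across the 16-bit wrap-around.
        return prev_abs_don + 65536 - prev_don + don
    if don > prev_don and don - prev_don >= 32768:
        # Backward step across the 16-bit wrap-around.
        return prev_abs_don - (prev_don + 65536 - don)
    # Remaining case: don < prev_don and prev_don - don < 32768.
    return prev_abs_don - (prev_don - don)


def abs_dons(dons):
    """Derive the AbsDon list for DON values given in transmission order."""
    if not dons:
        return []
    out = [dons[0]]  # AbsDon[0] is set equal to DON[0]
    for prev, cur in zip(dons, dons[1:]):
        out.append(next_abs_don(out[-1], prev, cur))
    return out
```

For instance, the DON sequence 65534, 65535, 0, 1 crosses the wrap-around point, and the derived AbsDon values keep increasing past 65535.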
1036 If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater 1037 than 0, the DONL field MUST be present, and the variable DON for the 1038 contained NAL unit is derived as equal to the value of the DONL 1039 field. Otherwise (tx-mode is equal to "SST" and sprop-depack-buf- 1040 nalus is equal to 0), the DONL field MUST NOT be present. 1042 4.6 Aggregation Packets (APs) 1044 Aggregation packets (APs) are introduced to enable the reduction of 1045 packetization overhead for small NAL units, such as most of the non- 1046 VCL NAL units, which are often only a few octets in size. 1048 An AP aggregates NAL units within one access unit. Each NAL unit to 1049 be carried in an AP is encapsulated in an aggregation unit. NAL 1050 units aggregated in one AP are in NAL unit decoding order. 1052 An AP consists of a payload header (denoted as PayloadHdr) followed 1053 by one or more aggregation units, as shown in Figure 4. 1055 0 1 2 3 1056 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1057 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1058 | PayloadHdr | | 1059 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 1060 | | 1061 | one or more aggregation units | 1062 | | 1063 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1064 | :...OPTIONAL RTP padding | 1065 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1067 Figure 4 The structure of an aggregation packet 1069 The fields in the payload header are set as follows. The F bit MUST 1070 be equal to 0 if the F bit of each aggregated NAL unit is equal to 1071 zero; otherwise, it MUST be equal to 1. The Type field MUST be 1072 equal to 48. The value of LayerId MUST be equal to the lowest value 1073 of LayerId of all the aggregated NAL units. The value of TID MUST 1074 be the lowest value of TID of all the aggregated NAL units. 1076 Informative Note: All VCL NAL units in an AP have the same TID 1077 value since they belong to the same access unit. 
However, an AP 1078 may contain non-VCL NAL units for which the TID value in the NAL 1079 unit header may be different than the TID value of the VCL NAL 1080 units in the same AP. 1082 An AP can carry as many aggregation units as necessary; however, the 1083 total amount of data in an AP obviously MUST fit into an IP packet, 1084 and the size SHOULD be chosen so that the resulting IP packet is 1085 smaller than the MTU size so to avoid IP layer fragmentation. An AP 1086 MUST NOT contain Fragmentation Units (FUs) specified in section 4.7. 1087 APs MUST NOT be nested; i.e., an AP MUST NOT contain another AP. 1089 The first aggregation unit in an AP consists of an optional 16-bit 1090 DONL field (in network byte order) followed by a 16-bit unsigned 1091 size information (in network byte order) that indicates the size of 1092 the NAL unit in bytes (excluding these two octets, but including the 1093 NAL unit header), followed by the NAL unit itself, including its NAL 1094 unit header, as shown in Figure 5. 1096 0 1 2 3 1097 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1098 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1099 : DONL (optional) | NALU size | 1100 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1101 | NALU size | | 1102 +-+-+-+-+-+-+-+-+ NAL unit | 1103 | | 1104 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1105 | : 1106 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1108 Figure 5 The structure of the first aggregation unit in an AP 1110 The DONL field, when present, specifies the value of the 16 least 1111 significant bits of the decoding order number of the aggregated NAL 1112 unit. 1114 If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater 1115 than 0, the DONL field MUST be present in an aggregation unit that 1116 is the first aggregation unit in an AP, and the variable DON for the 1117 aggregated NAL unit is derived as equal to the value of the DONL 1118 field. 
Otherwise (tx-mode is equal to "SST" and sprop-depack-buf- 1119 nalus is equal to 0), the DONL field MUST NOT be present in an 1120 aggregation unit that is the first aggregation unit in an AP. 1122 An aggregation unit that is not the first aggregation unit in an AP 1123 consists of an optional 8-bit DOND field followed by a 16-bit 1124 unsigned size information (in network byte order) that indicates the 1125 size of the NAL unit in bytes (excluding these two octets, but 1126 including the NAL unit header), followed by the NAL unit itself, 1127 including its NAL unit header, as shown in Figure 6. 1129 0 1 2 3 1130 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1131 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1132 : DOND(optional)| NALU size | 1133 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1134 | | 1135 | NAL unit | 1136 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1137 | : 1138 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1140 Figure 6 The structure of an aggregation unit that is not the first 1141 aggregation unit in an AP 1143 When present, the DOND field plus 1 specifies the difference between 1144 the decoding order number values of the current aggregated NAL unit 1145 and the preceding aggregated NAL unit in the same AP. 1147 If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater 1148 than 0, the DOND field MUST be present in an aggregation unit that 1149 is not the first aggregation unit in an AP, and the variable DON for 1150 the aggregated NAL unit is derived as equal to the DON of the 1151 preceding aggregated NAL unit in the same AP plus the value of the 1152 DOND field plus 1 modulo 65536. Otherwise (tx-mode is equal to 1153 "SST" and sprop-depack-buf-nalus is equal to 0), the DOND field MUST 1154 NOT be present in an aggregation unit that is not the first 1155 aggregation unit in an AP. 
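Informative example: The aggregation unit layout described above can be walked with a short parser. The following Python sketch is a minimal illustration under stated assumptions, not a normative procedure: the function name parse_ap and the don_present flag are hypothetical, the input is assumed to be the RTP payload starting at PayloadHdr with any RTP padding already removed, and don_present is assumed to be true exactly when tx-mode is "MST" or sprop-depack-buf-nalus is greater than 0.

```python
AP_TYPE = 48  # Type value of an aggregation packet in this memo


def parse_ap(payload, don_present):
    """Split an AP payload into a list of (DON, NAL unit) pairs.

    DON is None when the DONL and DOND fields are absent.
    """
    if len(payload) < 2 or (payload[0] >> 1) & 0x3F != AP_TYPE:
        raise ValueError("not an aggregation packet")
    pos, don, units = 2, None, []
    while pos < len(payload):
        if don_present:
            if don is None:
                # First aggregation unit: 16-bit DONL, network byte order.
                if pos + 2 > len(payload):
                    raise ValueError("truncated DONL")
                don = int.from_bytes(payload[pos:pos + 2], "big")
                pos += 2
            else:
                # Later units: DON = previous DON + DOND + 1, modulo 65536.
                don = (don + payload[pos] + 1) % 65536
                pos += 1
        if pos + 2 > len(payload):
            raise ValueError("truncated NALU size")
        size = int.from_bytes(payload[pos:pos + 2], "big")
        pos += 2
        # The size covers the two-byte NAL unit header, so it is at least 2.
        if size < 2 or pos + size > len(payload):
            raise ValueError("bad NALU size")
        units.append((don, bytes(payload[pos:pos + size])))
        pos += size
    if not units:
        raise ValueError("AP carries no aggregation unit")
    return units
```

The sketch rejects a size field that points past the end of the payload rather than guessing at the sender's intent; a real receiver may prefer to keep the aggregation units parsed before the error.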
1157 Figure 7 presents an example of an AP that contains two aggregation 1158 units, labeled as 1 and 2 in the figure, without the DONL and DOND 1159 fields being present. 1161 0 1 2 3 1162 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1163 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1164 | RTP Header | 1165 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1166 | PayloadHdr | NALU 1 Size | 1167 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1168 | NALU 1 HDR | | 1169 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 1 Data | 1170 | . . . | 1171 | | 1172 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1173 | . . . | NALU 2 Size | NALU 2 HDR | 1174 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1175 | NALU 2 HDR | | 1176 +-+-+-+-+-+-+-+-+ NALU 2 Data | 1177 | . . . | 1178 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1179 | :...OPTIONAL RTP padding | 1180 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1182 Figure 7 An example of an AP packet containing two aggregation units 1183 without the DONL and DOND fields 1185 Figure 8 presents an example of an AP that contains two aggregation 1186 units, labeled as 1 and 2 in the figure, with the DONL and DOND 1187 fields being present. 1189 0 1 2 3 1190 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1191 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1192 | RTP Header | 1193 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1194 | PayloadHdr | NALU 1 DONL | 1195 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1196 | NALU 1 Size | NALU 1 HDR | 1197 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1198 | | 1199 | NALU 1 Data . . . | 1200 | | 1201 + . . . 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1202 | | NALU 2 DOND | NALU 2 Size | 1203 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1204 | NALU 2 HDR | | 1205 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 2 Data | 1206 | | 1207 | . . . +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1208 | :...OPTIONAL RTP padding | 1209 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1211 Figure 8 An example of an AP containing two aggregation units with 1212 the DONL and DOND fields 1214 4.7 Fragmentation Units (FUs) 1216 Fragmentation units (FUs) are introduced to enable fragmenting a 1217 single NAL unit into multiple RTP packets, possibly without 1218 cooperation or knowledge of the HEVC encoder. A fragment of a NAL 1219 unit consists of an integer number of consecutive octets of that NAL 1220 unit. Fragments of the same NAL unit MUST be sent in consecutive 1221 order with ascending RTP sequence numbers (with no other RTP packets 1222 within the same RTP packet stream being sent between the first and 1223 last fragment). 1225 When a NAL unit is fragmented and conveyed within FUs, it is 1226 referred to as a fragmented NAL unit. APs MUST NOT be fragmented. 1227 FUs MUST NOT be nested; i.e., an FU MUST NOT contain another FU. 1229 The RTP timestamp of an RTP packet carrying an FU is set to the 1230 NALU-time of the fragmented NAL unit. 1232 An FU consists of a payload header (denoted as PayloadHdr), an FU 1233 header of one octet, an optional 16-bit DONL field (in network byte 1234 order), and an FU payload, as shown in Figure 9.
1236 0 1 2 3 1237 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1238 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1239 | PayloadHdr | FU header | DONL(optional)| 1240 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| 1241 | DONL(optional)| | 1242 |-+-+-+-+-+-+-+-+ | 1243 | FU payload | 1244 | | 1245 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1246 | :...OPTIONAL RTP padding | 1247 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1249 Figure 9 The structure of an FU 1251 The fields in the payload header are set as follows. The Type field 1252 MUST be equal to 49. The fields F, LayerId, and TID MUST be equal 1253 to the fields F, LayerId, and TID, respectively, of the fragmented 1254 NAL unit. 1256 The FU header consists of an S bit, an E bit, and a 6-bit Type 1257 field, as shown in Figure 10. 1259 +---------------+ 1260 |0|1|2|3|4|5|6|7| 1261 +-+-+-+-+-+-+-+-+ 1262 |S|E| Type | 1263 +---------------+ 1265 Figure 10 The structure of FU header 1267 The semantics of the FU header fields are as follows: 1268 S: 1 bit 1269 When set to one, the S bit indicates the start of a fragmented 1270 NAL unit i.e., the first byte of the FU payload is also the first 1271 byte of the payload of the fragmented NAL unit. When the FU 1272 payload is not the start of the fragmented NAL unit payload, the 1273 S bit MUST be set to zero. 1275 E: 1 bit 1276 When set to one, the E bit indicates the end of a fragmented NAL 1277 unit, i.e., the last byte of the payload is also the last byte of 1278 the fragmented NAL unit. When the FU payload is not the last 1279 fragment of a fragmented NAL unit, the E bit MUST be set to zero. 1281 Type: 6 bits 1282 The field Type MUST be equal to the field Type of the fragmented 1283 NAL unit. 1285 The DONL field, when present, specifies the value of the 16 least 1286 significant bits of the decoding order number of the fragmented NAL 1287 unit. 
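Informative example: The FU layout and FU header semantics above can be illustrated with a short parser. The following Python sketch is a minimal illustration, not a normative procedure: the function name parse_fu and the don_present flag are hypothetical, the input is assumed to be the RTP payload starting at PayloadHdr with any RTP padding already removed, and don_present is assumed to be true exactly when tx-mode is "MST" or sprop-depack-buf-nalus is greater than 0. The sketch reads a DONL field only from an FU whose S bit is set, and rebuilds the two-byte NAL unit header from the F, LayerId, and TID fields of the payload header and the Type field of the FU header.

```python
FU_TYPE = 49  # Type value of a fragmentation unit in this memo


def parse_fu(payload, don_present):
    """Return (start, end, nal_header, don, fragment) for one FU."""
    if len(payload) < 3 or (payload[0] >> 1) & 0x3F != FU_TYPE:
        raise ValueError("not a fragmentation unit")
    fu_header = payload[2]
    start = bool(fu_header & 0x80)   # S bit
    end = bool(fu_header & 0x40)     # E bit
    if start and end:
        raise ValueError("S and E bits must not both be set")
    nal_type = fu_header & 0x3F      # Type of the fragmented NAL unit
    # Keep F (bit 7) and the LayerId MSB (bit 0) of the first payload
    # header byte, substitute the original Type, and copy the second byte.
    nal_header = bytes([(payload[0] & 0x81) | (nal_type << 1), payload[1]])
    pos, don = 3, None
    if don_present and start:
        if len(payload) < 5:
            raise ValueError("truncated DONL")
        don = int.from_bytes(payload[3:5], "big")
        pos = 5
    return start, end, nal_header, don, bytes(payload[pos:])
```

A receiver would typically prepend nal_header to the concatenated fragments, starting at the FU with the S bit set and ending at the FU with the E bit set, to recover the fragmented NAL unit.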
1289 If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater 1290 than 0, and the S bit is equal to 1, the DONL field MUST be present 1291 in the FU, and the variable DON for the fragmented NAL unit is 1292 derived as equal to the value of the DONL field. Otherwise (tx-mode 1293 is equal to "SST" and sprop-depack-buf-nalus is equal to 0, or the S 1294 bit is equal to 0), the DONL field MUST NOT be present in the FU. 1296 A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e., 1297 the Start bit and End bit MUST NOT both be set to one in the same FU 1298 header. 1300 The FU payload consists of fragments of the payload of the 1301 fragmented NAL unit so that if the FU payloads of consecutive FUs, 1302 starting with an FU with the S bit equal to 1 and ending with an FU 1303 with the E bit equal to 1, are sequentially concatenated, the 1304 payload of the fragmented NAL unit can be reconstructed. The NAL 1305 unit header of the fragmented NAL unit is not included as such in 1306 the FU payload, but rather the information of the NAL unit header of 1307 the fragmented NAL unit is conveyed in F, LayerId, and TID fields of 1308 the FU payload headers of the FUs and the Type field of the FU 1309 header of the FUs. An FU payload MAY have any number of octets and 1310 MAY be empty. 1312 Informative note: Empty FU payloads are allowed to reduce the 1313 latency of a certain class of senders in nearly lossless 1314 environments. These senders can be characterized in that they 1315 packetize fragments of a NAL unit before the NAL unit is 1316 completely generated and, hence, before the NAL unit size is 1317 known. If zero-length FU payloads were not allowed, the sender 1318 would have to generate at least one bit of data of the following 1319 fragment of the NAL unit before the current FU could be sent. 1320 Due to the characteristics of HEVC, where sometimes several CTUs 1321 occupy zero bits, this is undesirable and can add delay. 
1322 However, the (potential) use of zero-length FU payloads should be 1323 carefully weighed against the increased risk of the loss of at 1324 least a part of the fragmented NAL unit because of the additional 1325 packets employed for its transmission. 1327 If an FU is lost, the receiver SHOULD discard all following 1328 fragmentation units in transmission order corresponding to the same 1329 fragmented NAL unit, unless the decoder in the receiver is known to 1330 be prepared to gracefully handle incomplete NAL units. 1332 A receiver in an endpoint or in a MANE MAY aggregate the first n-1 1333 fragments of a NAL unit to an (incomplete) NAL unit, even if 1334 fragment n of that NAL unit is not received. In this case, the 1335 forbidden_zero_bit of the NAL unit MUST be set to one to indicate a 1336 syntax violation. 1338 5. Packetization Rules 1340 The following packetization rules apply: 1342 o If tx-mode is equal to "MST" or sprop-depack-buf-nalus is greater 1343 than 0 for an RTP session, the transmission order of NAL units 1344 carried in the RTP session MAY be different than the NAL unit 1345 decoding order. Otherwise (tx-mode is equal to "SST" and sprop- 1346 depack-buf-nalus is equal to 0 for an RTP session), the 1347 transmission order of NAL units carried in the RTP session MUST 1348 be the same as the NAL unit decoding order. 1350 o A NAL unit of a small size SHOULD be encapsulated in an 1351 aggregation packet together with one or more other NAL units in 1352 order to avoid the unnecessary packetization overhead for small 1353 NAL units. For example, non-VCL NAL units such as access unit 1354 delimiters, parameter sets, or SEI NAL units are typically small 1355 and can often be aggregated with slice NAL units without 1356 violating MTU size constraints. 
1358 o Each non-VCL NAL unit SHOULD be encapsulated in an aggregation 1359 packet together with its associated VCL NAL unit, as typically a 1360 non-VCL NAL unit would be meaningless without the associated VCL 1361 NAL unit being available. 1363 o The TID value is designed to indicate (among other things) the 1364 relative importance of an RTP packet, for example because NAL 1365 units belonging to higher temporal sub-layers are not used for 1366 the decoding of lower temporal sub-layers. A lower value of TID 1367 indicates a higher importance. More important NAL units MAY be 1368 better protected against transmission losses than less important 1369 NAL units. 1371 o FUs SHOULD NOT be applied in live-encoding scenarios such as 1372 video telephony, video conferencing, live streaming and live 1373 broadcast, in which cases dependent slice segments SHOULD be used 1374 when a slice should be transported in multiple RTP packets. For 1375 pre-encoded content where using of dependent slice segments is 1376 not possible without transcoding, FUs SHOULD be used for 1377 transporting of one NAL unit in multiple RTP packets for MTU size 1378 matching. 1380 6. De-packetization Process 1382 The general concept behind de-packetization is to get the NAL units 1383 out of the RTP packets in an RTP session and all the dependent RTP 1384 sessions, if any, and pass them to the decoder in the NAL unit 1385 decoding order. 1387 The de-packetization process is implementation dependent. 1388 Therefore, the following description should be seen as an example of 1389 a suitable implementation. Other schemes may be used as well as 1390 long as the output for the same input is the same as the process 1391 described below. The output is the same when the set of NAL units 1392 and their order are both identical. Optimizations relative to the 1393 described algorithms are possible. 1395 All normal RTP mechanisms related to buffer management apply. 
In 1396 particular, duplicated or outdated RTP packets (as indicated by the 1397 RTP sequence number and the RTP timestamp) are removed. To 1398 determine the exact time for decoding, factors such as a possible 1399 intentional delay to allow for proper inter-stream synchronization 1400 must be taken into account. 1402 NAL units with NAL unit type values in the range of 0 to 47, 1403 inclusive, may be passed to the decoder. NAL-unit-like structures 1404 with NAL unit type values in the range of 48 to 63, inclusive, MUST 1405 NOT be passed to the decoder. 1407 The receiver includes a receiver buffer, which is used to compensate 1408 for transmission delay jitter, to reorder NAL units from 1409 transmission order to the NAL unit decoding order, and to recover 1410 the NAL unit decoding order in MST, when applicable. In this 1411 section, the receiver operation is described under the assumption 1412 that there is no transmission delay jitter. To distinguish it 1413 from a practical receiver buffer that is also used for compensation 1414 of transmission delay jitter, the receiver buffer is hereafter 1415 called the de-packetization buffer in this section. Receivers 1416 SHOULD also prepare for transmission delay jitter; i.e., either 1417 reserve separate buffers for transmission delay jitter buffering and 1418 de-packetization buffering or use a receiver buffer for both 1419 transmission delay jitter and de-packetization. Moreover, receivers 1420 SHOULD take transmission delay jitter into account in the buffering 1421 operation; e.g., by additional initial buffering before starting of 1422 decoding and playback. 1424 There are two buffering states in the receiver: initial buffering 1425 and buffering while playing. Initial buffering starts when the 1426 reception is initialized. After initial buffering, decoding and 1427 playback are started, and the buffering-while-playing mode is used. 
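As a non-normative illustration, the release rule of this section (while the buffer holds more NAL units than sprop-depack-buf-nalus, pass the NAL unit with the smallest AbsDon to the decoder) can be sketched with a min-heap. The class and method names are hypothetical. AbsDon is assumed to have been computed for each NAL unit before insertion, and transmission delay jitter is assumed to be handled elsewhere.

```python
import heapq


class DepacketizationBuffer:
    """Sketch of the de-packetization buffer (hypothetical helper)."""

    def __init__(self, sprop_depack_buf_nalus: int):
        self.limit = sprop_depack_buf_nalus
        self._heap = []       # entries: (AbsDon, arrival index, NAL unit)
        self._arrival = 0     # tie-breaker that preserves reception order

    def push(self, abs_don: int, nal_unit):
        """Store one NAL unit; return the NAL units released to the decoder."""
        heapq.heappush(self._heap, (abs_don, self._arrival, nal_unit))
        self._arrival += 1
        released = []
        # Condition A: more NAL units buffered than sprop-depack-buf-nalus.
        while len(self._heap) > self.limit:
            released.append(heapq.heappop(self._heap)[2])
        return released

    def flush(self):
        """No more input: drain the buffer in increasing AbsDon order."""
        released = []
        while self._heap:
            released.append(heapq.heappop(self._heap)[2])
        return released
```

In this sketch initial buffering and buffering while playing need no separate state: nothing is released until condition A first becomes true, and afterwards each arrival releases NAL units until condition A is false again.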
1429 Regardless of the buffering state, the receiver stores incoming NAL 1430 units, in reception order, into the de-packetization buffer. NAL 1431 units carried in single NAL unit packets, APs, and FUs are stored in 1432 the de-packetization buffer individually, and the value of AbsDon is 1433 calculated and stored for each NAL unit. When MST is in use, NAL 1434 units of all RTP packet streams are stored in the same de- 1435 packetization buffer. 1437 Initial buffering lasts until condition A (the number of NAL units 1438 in the de-packetization buffer is greater than the value of sprop- 1439 depack-buf-nalus of the highest RTP session) is true. 1441 After initial buffering, whenever condition A is true, the following 1442 operation is repeatedly applied until condition A becomes false: 1444 o The NAL unit in the de-packetization buffer with the smallest 1445 value of AbsDon is removed from the de-packetization buffer and 1446 passed to the decoder. 1448 When no more NAL units are flowing into the de-packetization buffer, 1449 all NAL units remaining in the de-packetization buffer are removed 1450 from the buffer and passed to the decoder in the order of increasing 1451 AbsDon values. 1453 7. Payload Format Parameters 1455 This section specifies the parameters that MAY be used to select 1456 optional features of the payload format and certain features or 1457 properties of the bitstream. The parameters are specified here as 1458 part of the media type registration for the HEVC codec. A mapping 1459 of the parameters into the Session Description Protocol (SDP) 1460 [RFC4566] is also provided for applications that use SDP. 1462 Equivalent parameters could be defined elsewhere for use with 1463 control protocols that do not use SDP. 1465 7.1 Media Type Registration 1467 The media subtype for the HEVC codec is allocated from the IETF 1468 tree. 1470 The receiver MUST ignore any unspecified parameter. 
1472 Media Type name: video 1474 Media subtype name: H265 1476 Required parameters: none 1478 OPTIONAL parameters: 1480 In the following definitions of parameters, "the stream" or "the 1481 NAL unit stream" refers to all NAL units conveyed in the current 1482 RTP session in SST, and all NAL units conveyed in the current RTP 1483 session and all NAL units conveyed in other RTP sessions that the 1484 current RTP session depends on in MST. 1486 profile-space, profile-id: 1488 The profile-space parameter indicates the context for 1489 interpretation of the profile-id parameter value. The 1490 profile, which specifies the subset of coding tools that may 1491 have been used to generate the stream or that the receiver 1492 supports, as specified in [HEVC], is defined by the 1493 combination of profile-space and profile-id. Note that 1494 profile-space is required to be equal to 0 in [HEVC], but 1495 other values for it may be specified in the future by ITU-T or 1496 ISO/IEC. 1498 If the profile-space and profile-id parameters are used to 1499 indicate properties of a NAL unit stream, it indicates that, 1500 to decode the stream, the minimum subset of coding tools a 1501 decoder has to support is the profile specified by both 1502 parameters. 1504 If the profile-space and profile-id parameters are used for 1505 capability exchange or session setup, it indicates the subset 1506 of coding tools, which is equal to the profile, that the codec 1507 supports for both receiving and sending. 1509 If no profile-space is present, a value of 0 MUST be inferred 1510 and if no profile-id is present the Main profile MUST be 1511 inferred. 1513 The profile-space and profile-id parameters are derived from 1514 the sequence parameter set or video parameter set NAL units, 1515 as specified in [HEVC], as follows. 
1517 For SST or for the stream corresponding to the highest RTP 1518 session of MST when MST is applied, the following applies: 1520 o profile_space = general_profile_space 1521 o profile_id = general_profile_idc 1523 For streams not corresponding to the highest RTP session of 1524 MST when MST is applied, the following applies, with j being 1525 the value of the sub-layer-id parameter: 1527 o profile_space = sub_layer_profile_space[j] 1528 o profile_id = sub_layer_profile_idc[j] 1530 tier-flag, level-id: 1532 The tier-flag parameter indicates the context for 1533 interpretation of the level-id value. The default level, 1534 which specifies limits on values of syntax elements or on arithmetic 1535 combinations of values of syntax elements, as specified in 1536 [HEVC], is defined by the combination of tier-flag and level- 1537 id. 1539 If the tier-flag and level-id parameters are used to indicate 1540 properties of a NAL unit stream, it indicates that, to decode 1541 the stream, the lowest level the decoder has to support is the 1542 default level. 1544 If the tier-flag and level-id parameters are used for 1545 capability exchange or session setup, the following applies. 1546 If max-recv-level-id is not present, the default level defined 1547 by tier-flag and level-id indicates the highest level the 1548 codec wishes to support. Otherwise, tier-flag and max-recv- 1549 level-id indicate the highest level the codec supports for 1550 receiving. For either receiving or sending, all levels that 1551 are lower than the highest level supported MUST also be 1552 supported. 1554 If no tier-flag is present, a value of 0 MUST be inferred and 1555 if no level-id is present, a value of 1 MUST be inferred. 1557 The tier-flag and level-id parameters are derived from the 1558 sequence parameter set or video parameter set NAL units, as 1559 specified in [HEVC], as follows. 
1561 For SST or for the stream corresponding to the highest RTP 1562 session of MST when MST is applied, the following applies: 1564 o tier-flag = general_tier_flag 1565 o level-id = general_level_idc 1567 For streams not corresponding to the highest RTP session of 1568 MST when MST is applied, the following applies, with j being 1569 the value of the sub-layer-id parameter: 1571 o tier-flag = sub_layer_tier_flag[j] 1572 o level-id = sub_layer_level_idc[j] 1574 interop-constraints: 1576 A base16 [RFC4648] (hexadecimal) representation of the six 1577 bytes derived from the sequence parameter set or video 1578 parameter set NAL units as specified in [HEVC] consisting of 1579 progressive_source_flag, interlaced_source_flag, 1580 non_packed_constraint_flag, frame_only_constraint_flag, and 1581 reserved_zero_44bits. Note that reserved_zero_44bits is 1582 required to be equal to 0 in [HEVC], but other values for it 1583 may be specified in the future by ITU-T or ISO/IEC. 1585 If no interop-constraints are present, the following MUST be 1586 inferred: 1588 o progressive_source_flag = 1 1589 o interlaced_source_flag = 0 1590 o non_packed_constraint_flag = 1 1591 o frame_only_constraint_flag = 1 1592 o reserved_zero_44bits = 0 1594 For SST or for the stream corresponding to the highest RTP 1595 session of MST when MST is applied, the following applies: 1597 o progressive_source_flag = general_progressive_source_flag 1598 o interlaced_source_flag = general_interlaced_source_flag 1599 o non_packed_constraint_flag = 1600 general_non_packed_constraint_flag 1601 o frame_only_constraint_flag = 1602 general_frame_only_constraint_flag 1603 o reserved_zero_44bits = general_reserved_zero_44bits 1605 For streams not corresponding to the highest RTP session of 1606 MST when MST is applied, the following applies, with j being 1607 the value of the sub-layer-id parameter: 1609 o progressive_source_flag = 1610 sub_layer_progressive_source_flag[j] 1611 o interlaced_source_flag = 1612 
sub_layer_interlaced_source_flag[j] 1613 o non_packed_constraint_flag = 1614 sub_layer_non_packed_constraint_flag[j] 1615 o frame_only_constraint_flag = 1616 sub_layer_frame_only_constraint_flag[j] 1617 o reserved_zero_44bits = sub_layer_reserved_zero_44bits[j] 1619 profile-compatibility-indicator: 1621 A base16 [RFC4648] representation of the four bytes 1622 representing the 32 profile compatibility flags in the 1623 sequence parameter set or video parameter set NAL units. A 1624 decoder conforming to a certain profile may be able to decode 1625 bitstreams conforming to other profiles. The profile- 1626 compatibility-indicator provides exact information of the 1627 ability of a decoder conforming to a certain profile to decode 1628 bitstreams conforming to another profile. More concretely, if 1629 the profile compatibility flag corresponding to the profile, 1630 which a decoder conforms to, is set, then the decoder is able 1631 to decode that bitstream with the flag set, irrespective of 1632 the profile, which a bitstream conforms to (provided that the 1633 decoder supports the highest level of the bitstream). 1635 For SST or for the stream corresponding to highest RTP session 1636 of MST when MST is used with temporal scalability the 1637 following applies with j = 0..31: 1639 o The 32 flags = general_profile_compatibility_flag[j] 1641 For streams not corresponding to the highest RTP session (the 1642 RTP session which no other RTP session depends on) of MST when 1643 MST is used with temporal scalability the following applies 1644 with i being the value of the sub-layer-id parameter and j = 1645 0..31: 1647 o The 32 flags = sub_layer_profile_compatibility_flag[i][j] 1649 sub-layer-id: 1651 This parameter MAY be used to indicate the TID of the highest 1652 sub-layer of the stream. 
When not present, the value of sub- 1653 layer-id is inferred to be equal to 1654 vps_max_sub_layers_minus1+1 and sps_max_sub_layers_minus1+1 in 1655 the video parameter set and sequence parameter set as defined 1656 in [HEVC]. 1658 recv-sub-layer-id: 1660 This parameter MAY be used to signal a receiver's choice of 1661 the offered or declared sub-layers in the sprop-vps. The value 1662 of recv-sub-layer-id indicates the index of the highest sub- 1663 layer of the stream that a receiver supports. When not 1664 present, the value of recv-sub-layer-id is inferred to be 1665 equal to sub-layer-id. 1667 max-recv-level-id: 1669 This parameter MAY be used, together with tier-flag, to 1670 indicate the highest level a receiver supports. The highest 1671 level the receiver supports is equal to the value of max-recv- 1672 level-id divided by 30 for the Main or High tier (as 1673 determined by tier-flag equal to 0 or 1, respectively). 1675 When max-recv-level-id is not present, the value is inferred 1676 to be equal to level-id. 1678 max-recv-level-id MUST NOT be present when the highest level 1679 the receiver supports is not higher than the default level. 1681 sprop-vps: 1683 This parameter MAY be used to convey any video parameter set 1684 NAL unit of the stream. When present, the parameter MAY be 1685 used to indicate codec capability and sub-stream 1686 characteristics (i.e., properties of representations of sub- 1687 layers as defined in [HEVC]) as well as for out-of-band 1688 transmission of video parameter sets. The value of the 1689 parameter is a comma-separated (',') list of base64 [RFC4648] 1690 representations of the video parameter set NAL units as 1691 specified in Section 7.3.2.1 of [HEVC]. 1693 sprop-sps: 1695 This parameter MAY be used to convey sequence parameter set 1696 NAL units of the stream for out-of-band transmission of 1697 sequence parameter sets. 
The value of the parameter is a 1698 comma-separated (',') list of base64 [RFC4648] representations 1699 of the sequence parameter set NAL units as specified in 1700 Section 7.3.2.2 of [HEVC]. 1702 sprop-pps: 1704 This parameter MAY be used to convey picture parameter set NAL 1705 units of the stream for out-of-band transmission of picture 1706 parameter sets. The value of the parameter is a comma- 1707 separated (',') list of base64 [RFC4648] representations of 1708 the picture parameter set NAL units as specified in Section 1709 7.3.2.3 of [HEVC]. 1711 max-ls, max-lps, max-cpb, max-dpb, max-br: 1713 These parameters MAY be used to signal the capabilities of a 1714 receiver implementation. These parameters MUST NOT be used for 1715 any other purpose. The highest level (specified by tier-flag 1716 and max-recv-level-id) MUST be one that the receiver is fully 1717 capable of supporting. max-ls, max-lps, max-cpb, max-dpb, and 1718 max-br MAY be used to indicate capabilities of the receiver 1719 that extend the required capabilities of the signaled highest 1720 level, as specified below. 1722 When more than one parameter from the set (max-ls, max-lps, 1723 max-cpb, max-dpb, max-br) is present, the receiver MUST 1724 support all signaled capabilities simultaneously. For 1725 example, if both max-ls and max-br are present, the signaled 1726 highest level with the extension of both the frame rate and 1727 bitrate is supported. That is, the receiver is able to decode 1728 NAL unit streams in which the luma sample rate is up to max-ls 1729 (inclusive), the bitrate is up to max-br (inclusive), the 1730 coded picture buffer size is derived as specified in the 1731 semantics of the max-br parameter below, and the other 1732 properties comply with the highest level specified by tier- 1733 flag and max-recv-level-id. 
1735 Informative note: When the OPTIONAL media type parameters 1736 are used to signal the properties of a NAL unit stream, 1737 max-ls, max-lps, max-cpb, max-dpb, and max-br are not 1738 present, and the value of profile-space, profile-id, tier- 1739 flag and level-id must always be such that the NAL unit 1740 stream complies fully with the specified profile and level. 1742 max-ls: 1743 The value of max-ls is an integer indicating the maximum 1744 processing rate in units of luma samples per second. The max- 1745 ls parameter signals that the receiver is capable of decoding 1746 video at a higher rate than is required by the signaled 1747 highest level. 1749 When max-ls is signaled, the receiver MUST be able to decode 1750 NAL unit streams that conform to the signaled highest level, 1751 with the exception that the MaxLumaSR value in Table A-2 of 1752 [HEVC] for the signaled highest level is replaced with the 1753 value of max-ls. The value of max-ls MUST be greater than or 1754 equal to the value of MaxLumaSR given in Table A-2 of [HEVC] 1755 for the highest level. Senders MAY use this knowledge to send 1756 pictures of a given size at a higher picture rate than is 1757 indicated in the signaled highest level. 1759 max-lps: 1760 The value of max-lps is an integer indicating the maximum 1761 picture size in units of luma samples. The max-lps parameter 1762 signals that the receiver is capable of decoding larger 1763 picture sizes than are required by the signaled highest level. 1764 When max-lps is signaled, the receiver MUST be able to decode 1765 NAL unit streams that conform to the signaled highest level, 1766 with the exception that the MaxLumaPS value in Table A-1 of 1767 [HEVC] for the signaled highest level is replaced with the 1768 value of max-lps. The value of max-lps MUST be greater than or 1769 equal to the value of MaxLumaPS given in Table A-1 of [HEVC] 1770 for the highest level. 
Senders MAY use this knowledge to send 1771 larger pictures at a proportionally lower frame rate than is 1772 indicated in the signaled highest level. 1774 max-cpb: 1775 The value of max-cpb is an integer indicating the maximum 1776 coded picture buffer size in units of CpbBrVclFactor bits for 1777 the VCL HRD parameters and in units of CpbBrNalFactor bits for 1778 the NAL HRD parameters, where CpbBrVclFactor and 1779 CpbBrNalFactor are defined in Section A.4 of [HEVC]. The max- 1780 cpb parameter signals that the receiver has more memory than 1781 the minimum amount of coded picture buffer memory required by 1782 the signaled highest level. When max-cpb is signaled, the 1783 receiver MUST be able to decode NAL unit streams that conform 1784 to the signaled highest level, with the exception that the 1785 MaxCPB value in Table A-1 of [HEVC] for the signaled highest 1786 level is replaced with the value of max-cpb. The value of max- 1787 cpb MUST be greater than or equal to the value of MaxCPB given 1788 in Table A-1 of [HEVC] for the highest level. Senders MAY use 1789 this knowledge to construct coded video streams with greater 1790 variation of bitrate than can be achieved with the MaxCPB 1791 value in Table A-1 of [HEVC]. 1793 Informative note: The coded picture buffer is used in the 1794 hypothetical reference decoder (Annex C of HEVC). The use 1795 of the hypothetical reference decoder is recommended in 1796 HEVC encoders to verify that the produced bitstream 1797 conforms to the standard and to control the output bitrate. 1798 Thus, the coded picture buffer is conceptually independent 1799 of any other potential buffers in the receiver, including 1800 de-packetization and de-jitter buffers. The coded picture 1801 buffer need not be implemented in decoders as specified in 1802 Annex C of HEVC, but rather standard-compliant decoders can 1803 have any buffering arrangements provided that they can 1804 decode standard-compliant bitstreams. 
Thus, in practice, 1805 the input buffer for a video decoder can be integrated with 1806 de-packetization and de-jitter buffers of the receiver. 1808 max-dpb: 1809 The value of max-dpb is an integer indicating the maximum 1810 decoded picture buffer size in units of decoded pictures at the 1811 MaxLumaPS for the highest level, i.e., the number of decoded 1812 pictures at the maximum picture size defined by the highest 1813 level. The value of max-dpb MUST be smaller than or equal to 1814 16. The max-dpb parameter signals that the receiver has more 1815 memory than the minimum amount of decoded picture buffer 1816 memory required by default, which is MaxDpbPicBuf as defined 1817 in [HEVC] (equal to 6). When max-dpb is signaled, the receiver 1818 MUST be able to decode NAL unit streams that conform to the 1819 signaled highest level, with the exception that the 1820 MaxDpbPicBuf value defined in [HEVC] as 6 is replaced with 1821 the value of max-dpb. Consequently, a receiver that signals 1822 max-dpb MUST be capable of storing the following number of 1823 decoded frames (MaxDpbSize) in its decoded picture buffer: 1825 if( PicSizeInSamplesY <= ( MaxLumaPS >> 2 ) ) 1826 MaxDpbSize = Min( 4 * max-dpb, 16 ) 1827 else if ( PicSizeInSamplesY <= ( MaxLumaPS >> 1 ) ) 1828 MaxDpbSize = Min( 2 * max-dpb, 16 ) 1829 else if ( PicSizeInSamplesY <= ( ( 3 * MaxLumaPS ) >> 2 ) ) 1830 MaxDpbSize = Min( (4 * max-dpb) / 3, 16 ) 1832 else 1833 MaxDpbSize = max-dpb 1835 Wherein MaxLumaPS is given in Table A-1 of [HEVC] for the highest 1836 level and PicSizeInSamplesY is the current size of each 1837 decoded picture in units of luma samples as defined in [HEVC]. 1839 The value of max-dpb MUST be greater than or equal to the 1840 value of MaxDpbPicBuf (i.e., 6) as defined in [HEVC]. Senders 1841 MAY use this knowledge to construct coded video streams with 1842 improved compression. 
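The MaxDpbSize derivation above translates directly into code. The following non-normative sketch assumes that "/" in the third branch denotes integer division with truncation, as is conventional in [HEVC]. The function name is hypothetical, and the caller supplies MaxLumaPS for the highest level.

```python
def max_dpb_size(pic_size_in_samples_y: int, max_luma_ps: int, max_dpb: int) -> int:
    """Number of decoded frames a receiver signaling max-dpb must store."""
    if not 6 <= max_dpb <= 16:
        # max-dpb is bounded below by MaxDpbPicBuf (6) and above by 16.
        raise ValueError("max-dpb must be in the range 6..16")
    if pic_size_in_samples_y <= (max_luma_ps >> 2):
        return min(4 * max_dpb, 16)
    if pic_size_in_samples_y <= (max_luma_ps >> 1):
        return min(2 * max_dpb, 16)
    if pic_size_in_samples_y <= ((3 * max_luma_ps) >> 2):
        return min((4 * max_dpb) // 3, 16)   # assumed integer division
    return max_dpb
```

For instance, with max-dpb equal to 6, a picture of at most one quarter of MaxLumaPS yields Min( 24, 16 ) = 16 frames, while a picture at the full MaxLumaPS yields 6 frames.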
1844 Informative note: This parameter was added primarily to 1845 complement a similar codepoint in the ITU-T Recommendation 1846 H.245, so as to facilitate signaling gateway designs. The 1847 decoded picture buffer stores reconstructed samples. There 1848 is no relationship between the size of the decoded picture 1849 buffer and the buffers used in RTP, especially de- 1850 packetization and de-jitter buffers. 1852 max-br: 1853 The value of max-br is an integer indicating the maximum video 1854 bitrate in units of CpbBrVclFactor bits per second for the VCL 1855 HRD parameters and in units of CpbBrNalFactor bits per second 1856 for the NAL HRD parameters, where CpbBrVclFactor and 1857 CpbBrNalFactor are defined in Section A.4 of [HEVC]. 1859 The max-br parameter signals that the video decoder of the 1860 receiver is capable of decoding video at a higher bitrate than 1861 is required by the signaled highest level. 1863 When max-br is signaled, the video codec of the receiver MUST 1864 be able to decode NAL unit streams that conform to the 1865 signaled highest level, with the following exceptions in the 1866 limits specified by the highest level: 1868 o The value of max-br replaces the MaxBR value in Table A-2 1869 of [HEVC] for the highest level. 1871 o When the max-cpb parameter is not present, the result of 1872 the following formula replaces the value of MaxCPB in Table A- 1873 1 of [HEVC]: 1875 (MaxCPB of the signaled level) * max-br / (MaxBR of the 1876 signaled highest level). 1878 For example, if a receiver signals capability for Main profile 1879 Level 2 with max-br equal to 2000, this indicates a maximum 1880 video bitrate of 2000 kbits/sec for VCL HRD parameters, a 1881 maximum video bitrate of 2200 kbits/sec for NAL HRD 1882 parameters, and a CPB size of 2000000 bits (2000000 / 1500000 1883 * 1500000). 1885 The value of max-br MUST be greater than or equal to the 1886 value MaxBR given in Table A-2 of [HEVC] for the signaled 1887 highest level. 
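The max-br example above can be reproduced with a short calculation. This is a non-normative sketch: the function name and dictionary keys are hypothetical, the level limits (MaxBR and MaxCPB of 1500 units) are those implied by the example, and the scaling factors of 1000 (CpbBrVclFactor) and 1100 (CpbBrNalFactor) are assumed from the 2000 and 2200 kbits/sec figures in the example.

```python
def max_br_limits(max_br, level_max_br, level_max_cpb,
                  vcl_factor=1000, nal_factor=1100, max_cpb=None):
    """Limits implied by a signaled max-br (and optionally max-cpb)."""
    if max_br < level_max_br:
        raise ValueError("max-br must not be below MaxBR of the signaled level")
    if max_cpb is None:
        # max-cpb absent: scale the level's MaxCPB by max-br / MaxBR.
        cpb_units = level_max_cpb * max_br // level_max_br
    else:
        cpb_units = max_cpb
    return {
        "vcl_bitrate": max_br * vcl_factor,     # bits per second
        "nal_bitrate": max_br * nal_factor,     # bits per second
        "vcl_cpb_bits": cpb_units * vcl_factor,
        "nal_cpb_bits": cpb_units * nal_factor,
    }


# Values taken from the example in the text: max-br = 2000.
limits = max_br_limits(max_br=2000, level_max_br=1500, level_max_cpb=1500)
print(limits["vcl_bitrate"], limits["nal_bitrate"], limits["vcl_cpb_bits"])
```

The sketch multiplies before dividing so that the scaled CPB size stays an exact integer whenever the ratio allows it.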
1889 Senders MAY use this knowledge to send higher bitrate video as 1890 allowed in the level definition of Annex A of HEVC to achieve 1891 improved video quality. 1893 Informative note: This parameter was added primarily to 1894 complement a similar codepoint in the ITU-T Recommendation 1895 H.245, so as to facilitate signaling gateway designs. The 1896 assumption that the network is capable of handling such 1897 bitrates at any given time cannot be made from the value of 1898 this parameter. In particular, no conclusion can be drawn 1899 that the signaled bitrate is possible under congestion 1900 control constraints. 1902 tx-mode: 1904 This parameter indicates whether the transmission mode is SST 1905 or MST. 1907 The value of tx-mode MUST be equal to either "MST" or "SST". 1908 When not present, the value of tx-mode is inferred to be equal 1909 to "SST". 1911 If the value is equal to "MST", MST MUST be in use. Otherwise 1912 (the value is equal to "SST"), SST MUST be in use. 1914 The value of tx-mode MUST be equal to "MST" for all RTP 1915 sessions in an MST. 1917 sprop-depack-buf-nalus: 1919 This parameter specifies the maximum number of NAL units that 1920 precede a NAL unit in the de-packetization buffer in reception 1921 order and follow the NAL unit in decoding order. 1923 The value of sprop-depack-buf-nalus MUST be an integer in the 1924 range of 0 to 32767, inclusive. 1926 When not present, the value of sprop-depack-buf-nalus is 1927 inferred to be equal to 0. 1929 When the RTP session depends on one or more other RTP sessions 1930 (in this case tx-mode MUST be equal to "MST"), this parameter 1931 MUST be present and the value of sprop-depack-buf-nalus MUST 1932 be greater than 0. 1934 sprop-depack-buf-bytes: 1936 This parameter signals the required size of the de- 1937 packetization buffer in units of bytes. 
The value of the 1938 parameter MUST be greater than or equal to the maximum buffer 1939 occupancy (in units of bytes) of the de-packetization buffer 1940 as specified in section 6. 1942 The value of sprop-depack-buf-bytes MUST be an integer in the 1943 range of 0 to 4294967295, inclusive. 1945 When the RTP session depends on one or more other RTP sessions 1946 (in this case tx-mode MUST be equal to "MST") or sprop-depack- 1947 buf-nalus is present and is greater than 0, this parameter 1948 MUST be present and the value of sprop-depack-buf-bytes MUST 1949 be greater than 0. 1951 Informative note: sprop-depack-buf-bytes indicates the 1952 required size of the de-packetization buffer only. When 1953 network jitter can occur, an appropriately sized jitter 1954 buffer has to be available as well. 1956 depack-buf-cap: 1958 This parameter signals the capabilities of a receiver 1959 implementation and indicates the amount of de-packetization 1960 buffer space in units of bytes that the receiver has available 1961 for reconstructing the NAL unit decoding order. A receiver is 1962 able to handle any stream for which the value of the sprop- 1963 depack-buf-bytes parameter is smaller than or equal to this 1964 parameter. 1966 When not present, the value of depack-buf-cap is inferred to 1967 be equal to 0. The value of depack-buf-cap MUST be an integer 1968 in the range of 0 to 4294967295, inclusive. 1970 Informative note: depack-buf-cap indicates the maximum 1971 possible size of the de-packetization buffer of the 1972 receiver only. When network jitter can occur, an 1973 appropriately sized jitter buffer has to be available as 1974 well. 1976 segmentation-id: 1978 This parameter MAY be used to signal the segmentation tools 1979 present in the stream and that can be used for 1980 parallelization. The value of segmentation-id MUST be an 1981 integer in the range of 0 to 3, inclusive. When not present, 1982 the value of segmentation-id is inferred to be equal to 0. 

1984        When segmentation-id is equal to 0, no information about the
1985        segmentation tools is provided.  When segmentation-id is equal
1986        to 1, it indicates that slices are present in the stream.
1987        When segmentation-id is equal to 2, it indicates that tiles
1988        are present in the stream.  When segmentation-id is equal to
1989        3, it indicates that WPP is used in the stream.

1991     spatial-segmentation-idc:

1993        A base16 [RFC4648] representation of the syntax element
1994        min_spatial_segmentation_idc as specified in [HEVC].  This
1995        parameter MAY be used to describe parallelization capabilities
1996        of the stream.

1998     Encoding considerations:

2000        This type is only defined for transfer via RTP (RFC 3550).

2002     Security considerations:

2004        See Section 9 of RFC XXXX.

2006     Public specification:

2008        Please refer to Section 13 of RFC XXXX.

2010     Additional information: None

2012     File extensions: none

2014     Macintosh file type code: none

2016     Object identifier or OID: none

2018     Person & email address to contact for further information:

2020     Intended usage: COMMON

2022     Author: See Section 14 of RFC XXXX.

2024     Change controller:

2026        IETF Audio/Video Transport Payloads working group delegated
2027        from the IESG.

2029  7.2 SDP Parameters

2031     The receiver MUST ignore any parameter unspecified in this memo.

2033  7.2.1 Mapping of Payload Type Parameters to SDP

2035     The media type video/H265 string is mapped to fields in the Session
2036     Description Protocol (SDP) [RFC4566] as follows:

2038     o  The media name in the "m=" line of SDP MUST be video.

2040     o  The encoding name in the "a=rtpmap" line of SDP MUST be H265 (the
2041        media subtype).

2043     o  The clock rate in the "a=rtpmap" line MUST be 90000.

2045     o  The OPTIONAL parameters "profile-space", "profile-id", "tier-
2046        flag", "level-id", "interop-constraints", "profile-compatibility-
2047        indicator", "sub-layer-id", "recv-sub-layer-id", "max-recv-level-
2048        id", "max-ls", "max-lps", "max-cpb", "max-dpb", "max-br", "tx-
2049        mode", "sprop-depack-buf-nalus", "sprop-depack-buf-bytes",
2050        "depack-buf-cap", "segmentation-id", and "spatial-segmentation-
2051        idc", when present, MUST be included in the "a=fmtp" line of SDP.
2052        These parameters are expressed as a media type string, in the form
2053        of a semicolon-separated list of parameter=value pairs.

2055     o  The OPTIONAL parameters "sprop-vps", "sprop-sps", and "sprop-
2056        pps", when present, MUST be included in the "a=fmtp" line of SDP
2057        or conveyed using the "fmtp" source attribute as specified in
2058        section 6.3 of [RFC5576].  For a particular media format (i.e.,
2059        RTP payload type), "sprop-vps", "sprop-sps", or "sprop-pps" MUST
2060        NOT be both included in the "a=fmtp" line of SDP and conveyed
2061        using the "fmtp" source attribute.  When included in the "a=fmtp"
2062        line of SDP, these parameters are expressed as a media type
2063        string, in the form of a semicolon-separated list of
2064        parameter=value pairs.  When conveyed using the "fmtp" source
2065        attribute, these parameters are only associated with the given
2066        source and payload type as parts of the "fmtp" source attribute.

2068           Informative note: Conveyance of "sprop-vps", "sprop-sps", and
2069           "sprop-pps" using the "fmtp" source attribute allows for out-
2070           of-band transport of parameter sets in topologies like Topo-
2071           Video-switch-MCU as specified in [RFC5117].

2073     An example of media representation in SDP is as follows:

2075        m=video 49170 RTP/AVP 98
2076        a=rtpmap:98 H265/90000
2077        a=fmtp:98 profile-id=ST;
2078                  sprop-vps=