idnits 2.17.1 draft-ietf-avtcore-rtp-vvc-13.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document date (18 November 2021) is 889 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '0' on line 1390 ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866) ** Downref: Normative reference to an Informational RFC: RFC 7656 -- Possible downref: Non-RFC (?) normative reference: ref. 'VSEI' -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC' -- Obsolete informational reference (is this intentional?): RFC 2326 (Obsoleted by RFC 7826) Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 5 comments (--). Run idnits with the --verbose option for more detailed information about the items above. 
-------------------------------------------------------------------------------- 2 avtcore S. Zhao 3 Internet-Draft S. Wenger 4 Intended status: Standards Track Tencent 5 Expires: 22 May 2022 Y. Sanchez 6 Fraunhofer HHI 7 Y.-K. Wang 8 Bytedance Inc. 9 M. Hannuksela 10 Nokia Technologies 11 18 November 2021 13 RTP Payload Format for Versatile Video Coding (VVC) 14 draft-ietf-avtcore-rtp-vvc-13 16 Abstract 18 This memo describes an RTP payload format for the video coding 19 standard ITU-T Recommendation H.266 and ISO/IEC International 20 Standard 23090-3, both also known as Versatile Video Coding (VVC) and 21 developed by the Joint Video Experts Team (JVET). The RTP payload 22 format allows for packetization of one or more Network Abstraction 23 Layer (NAL) units in each RTP packet payload as well as fragmentation 24 of a NAL unit into multiple RTP packets. The payload format has wide 25 applicability in videoconferencing, Internet video streaming, and 26 high-bitrate entertainment-quality video, among other applications. 28 Status of This Memo 30 This Internet-Draft is submitted in full conformance with the 31 provisions of BCP 78 and BCP 79. 33 Internet-Drafts are working documents of the Internet Engineering 34 Task Force (IETF). Note that other groups may also distribute 35 working documents as Internet-Drafts. The list of current Internet- 36 Drafts is at https://datatracker.ietf.org/drafts/current/. 38 Internet-Drafts are draft documents valid for a maximum of six months 39 and may be updated, replaced, or obsoleted by other documents at any 40 time. It is inappropriate to use Internet-Drafts as reference 41 material or to cite them other than as "work in progress." 43 This Internet-Draft will expire on 22 May 2022. 45 Copyright Notice 47 Copyright (c) 2021 IETF Trust and the persons identified as the 48 document authors. All rights reserved. 
50 This document is subject to BCP 78 and the IETF Trust's Legal 51 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 52 license-info) in effect on the date of publication of this document. 53 Please review these documents carefully, as they describe your rights 54 and restrictions with respect to this document. Code Components 55 extracted from this document must include Simplified BSD License text 56 as described in Section 4.e of the Trust Legal Provisions and are 57 provided without warranty as described in the Simplified BSD License. 59 Table of Contents 61 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 62 1.1. Overview of the VVC Codec . . . . . . . . . . . . . . . . 3 63 1.1.1. Coding-Tool Features (informative) . . . . . . . . . 3 64 1.1.2. Systems and Transport Interfaces (informative) . . . 6 65 1.1.3. High-Level Picture Partitioning (informative) . . . . 11 66 1.1.4. NAL Unit Header . . . . . . . . . . . . . . . . . . . 13 67 1.2. Overview of the Payload Format . . . . . . . . . . . . . 15 68 2. Conventions . . . . . . . . . . . . . . . . . . . . . . . . . 15 69 3. Definitions and Abbreviations . . . . . . . . . . . . . . . . 15 70 3.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 15 71 3.1.1. Definitions from the VVC Specification . . . . . . . 15 72 3.1.2. Definitions Specific to This Memo . . . . . . . . . . 18 73 3.2. Abbreviations . . . . . . . . . . . . . . . . . . . . . . 19 74 4. RTP Payload Format . . . . . . . . . . . . . . . . . . . . . 20 75 4.1. RTP Header Usage . . . . . . . . . . . . . . . . . . . . 20 76 4.2. Payload Header Usage . . . . . . . . . . . . . . . . . . 22 77 4.3. Payload Structures . . . . . . . . . . . . . . . . . . . 22 78 4.3.1. Single NAL Unit Packets . . . . . . . . . . . . . . . 23 79 4.3.2. Aggregation Packets (APs) . . . . . . . . . . . . . . 23 80 4.3.3. Fragmentation Units . . . . . . . . . . . . . . . . . 27 81 4.4. Decoding Order Number . . . . . . . . . . . . . . . . 
. . 30 82 5. Packetization Rules . . . . . . . . . . . . . . . . . . . . . 31 83 6. De-packetization Process . . . . . . . . . . . . . . . . . . 32 84 7. Payload Format Parameters . . . . . . . . . . . . . . . . . . 34 85 7.1. Media Type Registration . . . . . . . . . . . . . . . . . 34 86 7.2. SDP Parameters . . . . . . . . . . . . . . . . . . . . . 45 87 7.2.1. Mapping of Payload Type Parameters to SDP . . . . . . 45 88 7.2.2. Usage with SDP Offer/Answer Model . . . . . . . . . . 46 89 7.2.3. Usage in Declarative Session Descriptions . . . . . . 55 90 7.2.4. Considerations for Parameter Sets . . . . . . . . . . 56 91 8. Use with Feedback Messages . . . . . . . . . . . . . . . . . 56 92 8.1. Picture Loss Indication (PLI) . . . . . . . . . . . . . . 57 93 8.2. Full Intra Request (FIR) . . . . . . . . . . . . . . . . 57 94 9. Security Considerations . . . . . . . . . . . . . . . . . . . 57 95 10. Congestion Control . . . . . . . . . . . . . . . . . . . . . 59 96 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 60 97 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 60 98 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 60 99 13.1. Normative References . . . . . . . . . . . . . . . . . . 60 100 13.2. Informative References . . . . . . . . . . . . . . . . . 62 101 Appendix A. Change History . . . . . . . . . . . . . . . . . . . 63 102 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 64 104 1. Introduction 106 The Versatile Video Coding [VVC] specification, formally published as 107 both ITU-T Recommendation H.266 and ISO/IEC International Standard 108 23090-3, is currently in the ITU-T publication process and the ISO/ 109 IEC approval process. VVC is reported to provide significant coding 110 efficiency gains over HEVC [HEVC], also known as H.265, and other 111 earlier video codecs. 113 This memo specifies an RTP payload format for VVC.
It shares its 114 basic design with the NAL (Network Abstraction Layer) unit based RTP 115 payload formats of AVC Video Coding [RFC6184], Scalable Video Coding 116 (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798] and 117 their respective predecessors. With respect to design philosophy, 118 security, congestion control, and overall implementation complexity, 119 it has similar properties to those earlier payload format 120 specifications. This is a conscious choice, as at least RFC 6184 is 121 widely deployed and generally known in the relevant implementer 122 communities. Certain scalability-related mechanisms known from 123 [RFC6190] were incorporated into this document, as VVC version 1 124 supports temporal, spatial, and signal-to-noise ratio (SNR) 125 scalability. 127 1.1. Overview of the VVC Codec 129 VVC and HEVC share a similar hybrid video codec design. In this 130 memo, we provide a very brief overview of those features of VVC that 131 are, in some form, addressed by the payload format specified herein. 132 Implementers have to read, understand, and apply the ITU-T/ISO/IEC 133 specifications pertaining to VVC to arrive at interoperable, well- 134 performing implementations. 136 Conceptually, both VVC and HEVC include a Video Coding Layer (VCL), 137 which is often used to refer to the coding-tool features, and a NAL, 138 which is often used to refer to the systems and transport interface 139 aspects of the codecs. 141 1.1.1. Coding-Tool Features (informative) 143 Coding tool features are described below with occasional reference to 144 the coding tool set of HEVC, which is well known in the community. 146 Similar to earlier hybrid-video-coding-based standards, including 147 HEVC, the following basic video coding design is employed by VVC. A 148 prediction signal is first formed by either intra- or motion- 149 compensated prediction, and the residual (the difference between the 150 original and the prediction) is then coded. 
The gains in coding 151 efficiency are achieved by redesigning and improving almost all parts 152 of the codec over earlier designs. In addition, VVC includes several 153 tools to make the implementation on parallel architectures easier. 155 Finally, VVC includes temporal, spatial, and SNR scalability as well 156 as multiview coding support. 158 Coding blocks and transform structure 160 Among the major coding-tool differences between HEVC and VVC, one of the 161 important improvements is the more flexible coding tree structure in 162 VVC, i.e., the multi-type tree. In addition to quadtree, binary and 163 ternary trees are also supported, which contributes a significant 164 improvement in coding efficiency. Moreover, the maximum size of 165 the coding tree unit (CTU) is increased from 64x64 to 128x128. To 166 improve the coding efficiency of the chroma signal, separate luma and chroma 167 trees at the CTU level may be employed for intra-slices. The square 168 transforms in HEVC are extended to non-square transforms for 169 rectangular blocks resulting from binary and ternary tree splits. 170 Besides, VVC supports multiple transform sets (MTS), including DCT-2, 171 DST-7, and DCT-8 as well as the non-separable secondary transform. 172 The transforms used in VVC can have different sizes with support for 173 larger transform sizes. For DCT-2, the transform sizes range from 174 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range from 175 4x4 to 32x32. In addition, VVC also supports sub-block transform for 176 both intra and inter coded blocks. For intra coded blocks, intra 177 sub-partitioning (ISP) may be used to allow sub-block based intra 178 prediction and transform. For inter blocks, sub-block transform may 179 be used assuming that only a part of an inter-block has non-zero 180 transform coefficients.
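As an informal illustration of the multi-type tree described above, the following sketch computes the child blocks produced by one quadtree, binary, or ternary split. It is not taken from [VVC]: the function name, the mode strings, and the (x, y, width, height) tuple layout are hypothetical, and the 1:2:1 ratio used for ternary splits reflects the VVC design as generally described rather than anything stated in this memo. The size and depth restrictions that [VVC] places on splits are ignored.

```python
from typing import List, Tuple

# Hypothetical rectangle layout for this sketch: (x, y, width, height).
Rect = Tuple[int, int, int, int]


def split_block(width: int, height: int, mode: str) -> List[Rect]:
    """Return the child rectangles of one split of a width x height block.

    Mode names are illustrative only; they are not VVC syntax elements.
    """
    if mode == "quad":
        hw, hh = width // 2, height // 2
        return [(0, 0, hw, hh), (hw, 0, hw, hh),
                (0, hh, hw, hh), (hw, hh, hw, hh)]
    if mode == "binary_vertical":
        hw = width // 2
        return [(0, 0, hw, height), (hw, 0, hw, height)]
    if mode == "binary_horizontal":
        hh = height // 2
        return [(0, 0, width, hh), (0, hh, width, hh)]
    if mode == "ternary_vertical":
        # Assumed 1:2:1 partitioning of the width.
        q = width // 4
        return [(0, 0, q, height), (q, 0, 2 * q, height),
                (3 * q, 0, q, height)]
    if mode == "ternary_horizontal":
        # Assumed 1:2:1 partitioning of the height.
        q = height // 4
        return [(0, 0, width, q), (0, q, width, 2 * q),
                (0, 3 * q, width, q)]
    raise ValueError(f"unknown split mode: {mode}")
```

Under these assumptions, a vertical ternary split of a 128x128 CTU yields blocks of 32x128, 64x128, and 32x128 samples, which shows why non-square transforms are needed for blocks resulting from binary and ternary splits.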
182 Entropy coding 184 Similar to HEVC, VVC uses a single entropy-coding engine, which is 185 based on context adaptive binary arithmetic coding [CABAC], but with 186 the support of multi-window sizes. The window sizes can be 187 initialized differently for different context models. Due to this 188 design, it adapts more efficiently and achieves better coding 189 efficiency. A joint chroma residual coding scheme is applied to 190 further exploit the correlation between the residuals of two color 191 components. In VVC, different residual coding schemes are applied 192 for regular transform coefficients and residual samples generated 193 using transform-skip mode. 195 In-loop filtering 197 VVC has more feature support in loop filters than HEVC. The 198 deblocking filter in VVC is similar to that in HEVC but operates on a smaller 199 grid. After deblocking and sample adaptive offset (SAO), an adaptive 200 loop filter (ALF) may be used. As a Wiener filter, ALF reduces 201 distortion of decoded pictures. Besides, VVC introduces a new module 202 called luma mapping with chroma scaling to fully utilize the dynamic 203 range of the signal so that the rate-distortion performance of both Standard 204 Dynamic Range (SDR) and High Dynamic Range (HDR) content is improved. 206 Motion prediction and coding 208 Compared to HEVC, VVC introduces several improvements in this area. 209 First, adaptive motion vector resolution (AMVR) 210 can save bit cost for motion vectors by adaptively signalling motion 211 vector resolution. Second, affine motion compensation is included 212 to capture complicated motion like zooming and rotation. Meanwhile, 213 prediction refinement with optical flow for affine mode (PROF) 214 is further deployed to mimic affine motion at the pixel level. 215 Third, decoder-side motion vector refinement (DMVR) is a method 216 to derive motion vectors at the decoder side based on block matching so that 217 fewer bits may be spent on motion vectors. Bi-directional optical
Bi-directional optical 218 flow (BDOF) is a method similar to PROF. BDOF adds a sample-wise 219 offset at the 4x4 sub-block level that is derived with equations based on 220 gradients of the prediction samples and a motion difference relative 221 to CU motion vectors. Furthermore, merge with motion vector 222 difference (MMVD) is a special mode, which further signals a limited 223 set of motion vector differences on top of merge mode. In addition 224 to MMVD, there are three other types of special merge modes, i.e., 225 sub-block merge, triangle, and combined intra-/inter-prediction 226 (CIIP). The sub-block merge list includes one candidate of sub-block 227 temporal motion vector prediction (SbTMVP) and up to four candidates 228 of affine motion vectors. Triangle is based on triangular block 229 motion compensation. CIIP combines intra- and inter-predictions 230 with weighting. Adaptive weighting may be employed with a block- 231 level tool called bi-prediction with CU based weighting (BCW), which 232 provides more flexibility than in HEVC. 234 Intra prediction and intra-coding 236 To capture the diversified local image texture directions with finer 237 granularity, VVC supports 65 angular directions instead of the 33 238 directions in HEVC. The intra mode coding is based on a 6-most- 239 probable-mode scheme, and the 6 most probable modes are derived using 240 the neighboring intra prediction directions. In addition, to deal 241 with the different distributions of intra prediction angles for 242 different block aspect ratios, a wide-angle intra prediction (WAIP) 243 scheme is applied in VVC by including intra prediction angles beyond 244 those present in HEVC. Unlike HEVC, which only allows using the nearest 245 line of reference samples for intra prediction, VVC also 246 allows using two further reference lines, also known as multi- 247 reference-line (MRL) intra prediction.
The additional reference 248 lines can only be used for the 6 most probable intra prediction 249 modes. To capture the strong correlation between different color 250 components, in VVC, a cross-component linear mode (CCLM) is utilized, 251 which assumes a linear relationship between the luma sample values 252 and their associated chroma samples. For intra prediction, VVC also 253 applies a position-dependent prediction combination (PDPC) for 254 refining the prediction samples closer to the intra prediction block 255 boundary. Matrix-based intra prediction (MIP) modes are also used in 256 VVC; they generate an intra prediction block of up to 8x8 using a 257 weighted sum of downsampled neighboring reference samples, where the 258 weights are hardcoded constants. 260 Other coding-tool feature 262 VVC introduces dependent quantization (DQ) to reduce quantization 263 error by state-based switching between two quantizers. 265 1.1.2. Systems and Transport Interfaces (informative) 267 VVC inherits the basic systems and transport interface designs from 268 HEVC and AVC. These include the NAL-unit-based syntax structure, the 269 hierarchical syntax and data unit structure, the supplemental 270 enhancement information (SEI) message mechanism, and the video 271 buffering model based on the hypothetical reference decoder (HRD). 272 The scalability features of VVC are conceptually similar to the 273 scalable variant of HEVC known as SHVC. The hierarchical syntax and 274 data unit structure consists of parameter sets at various levels 275 (decoder, sequence (pertaining to all), sequence (pertaining to a 276 single), picture), picture-level header parameters, slice-level 277 header parameters, and lower-level parameters.
279 A number of key components that influenced the network abstraction 280 layer design of VVC as well as this memo are described below. 282 Decoding capability information 284 The decoding capability information (DCI) includes parameters that stay 285 constant for the lifetime of a VVC bitstream, which in IETF terms can 286 translate to a session. Such information includes profile, level, 287 and sub-profile information to determine a maximum capability interoperability 288 point that is guaranteed never to be exceeded, even if splicing of 289 video sequences occurs within a session. It further includes 290 constraint fields (most of which are flags), which can optionally be 291 set to indicate that the video bitstream will be constrained in the 292 use of certain features as indicated by the values of those fields. 293 With this, a bitstream can be labelled as not using certain tools, 294 which allows, among other things, for resource allocation in a decoder 295 implementation. 297 Video parameter set 299 The video parameter set (VPS) pertains to one or more coded video 300 sequences (CVSs) of multiple layers covering the same range of access 301 units, and includes, among other information, decoding dependency 302 expressed as information for reference picture list construction of 303 enhancement layers. The VPS provides a "big picture" of a scalable 304 sequence, including what types of operation points are provided, the 305 profile, tier, and level of the operation points, and some other 306 high-level properties of the bitstream that can be used as the basis 307 for session negotiation and content selection, etc. One VPS may be 308 referenced by one or more sequence parameter sets.
310 Sequence parameter set 312 The sequence parameter set (SPS) contains syntax elements pertaining 313 to a coded layer video sequence (CLVS), which is a group of pictures 314 belonging to the same layer, starting with a random access point, and 315 followed by pictures that may depend on each other, until the next 316 random access point picture. In MPEG-2, the equivalent of a CVS was 317 a group of pictures (GOP), which normally started with an I frame and 318 was followed by P and B frames. While more complex in its options of 319 random access points, VVC retains this basic concept. One notable 320 difference of VVC is that a CLVS may start with a Gradual Decoding 321 Refresh (GDR) picture, without requiring the presence of traditional 322 random access points in the bitstream, such as instantaneous decoding 323 refresh (IDR) or clean random access (CRA) pictures. In many TV-like 324 applications, a CVS contains a few hundred milliseconds to a few 325 seconds of video. In video conferencing (without switching MCUs 326 involved), a CVS can be as long in duration as the whole session. 328 Picture and adaptation parameter set 329 The picture parameter set and the adaptation parameter set (PPS and 330 APS, respectively) carry information pertaining to zero or more 331 pictures and zero or more slices, respectively. The PPS contains 332 information that is likely to stay constant from picture to picture, 333 at least for pictures of a certain type, whereas the APS contains 334 information, such as adaptive loop filter coefficients, that is 335 likely to change from picture to picture or even within a picture. A 336 single APS is referenced by all slices of the same picture if that 337 APS contains information about luma mapping with chroma scaling 338 (LMCS) or scaling list. Different APSs containing ALF parameters can 339 be referenced by slices of the same picture.
341 Picture header 343 A picture header contains information that is common to all slices 344 that belong to the same picture. Being able to send that information 345 as a separate NAL unit when pictures are split into several slices 346 allows for saving bitrate, compared to repeating the same information 347 in all slices. However, there might be scenarios where low-bitrate 348 video is transmitted using a single slice per picture. Having a 349 separate NAL unit to convey that information incurs overhead 350 in such scenarios. In these cases, the picture header syntax 351 structure is directly included in the slice header, instead of its 352 own NAL unit. Whether the picture header syntax structure is 353 carried in its own NAL unit or in the slice header can only be selected for 354 an entire CLVS, and it can only be carried in the slice header when 355 each picture in the entire CLVS contains only one slice. 357 Profile, tier, and level 359 The profile, tier, and level syntax structures in DCI, VPS, and SPS 360 contain profile, tier, and level information for all layers that refer to 361 the DCI, for layers associated with one or more output layer sets 362 specified by the VPS, and for any layer that refers to the SPS, 363 respectively. 365 Sub-profiles 367 Within the VVC specification, a sub-profile is a 32-bit number, coded 368 according to ITU-T Rec. T.35, that does not carry any semantics. It is 369 carried in the profile_tier_level structure and hence (potentially) 370 present in the DCI, VPS, and SPS. External registration bodies can 371 register a T.35 codepoint with ITU-T registration authorities and 372 associate with their registration a description of bitstream 373 restrictions beyond the profiles defined by ITU-T and ISO/IEC. This 374 would allow encoder manufacturers to label the bitstreams generated 375 by their encoder as complying with such a sub-profile.
It is expected 376 that upstream standardization organizations (such as: DVB and ATSC), 377 as well as walled-garden video services will take advantage of this 378 labelling system. In contrast to "normal" profiles, it is expected 379 that sub-profiles may indicate encoder choices traditionally left 380 open in the (decoder-centric) video coding specs, such as GOP 381 structures, minimum/maximum QP values, and the mandatory use of 382 certain tools or SEI messages. 384 General constraint fields 386 The profile_tier_level structure carries a considerable number of 387 constraint fields (most of which are flags), which an encoder can use 388 to indicate to a decoder that it will not use a certain tool or 389 technology. They were included in reaction to a perceived market 390 need for labelling a bitstream as not exercising a certain tool that 391 has become commercially unviable. 393 Temporal scalability support 395 VVC includes support of temporal scalability, by inclusion of the 396 signalling of TemporalId in the NAL unit header, the restriction that 397 pictures of a particular temporal sublayer cannot be used for inter 398 prediction reference by pictures of a lower temporal sublayer, the 399 sub-bitstream extraction process, and the requirement that each sub- 400 bitstream extraction output be a conforming bitstream. Media-Aware 401 Network Elements (MANEs) can utilize the TemporalId in the NAL unit 402 header for stream adaptation purposes based on temporal scalability. 404 Reference picture resampling (RPR) 406 In AVC and HEVC, the spatial resolution of pictures cannot change 407 unless a new sequence using a new SPS starts, with an Intra random 408 access point (IRAP) picture. VVC enables picture resolution change 409 within a sequence at a position without encoding an IRAP picture, 410 which is always intra-coded. 
This feature is sometimes referred to 411 as reference picture resampling (RPR), as the feature needs 412 resampling of a reference picture used for inter prediction when that 413 reference picture has a different resolution than the current picture 414 being decoded. RPR allows resolution change without the need of 415 coding an IRAP picture and hence avoids a momentary bit rate spike 416 caused by an IRAP picture in streaming or video conferencing 417 scenarios, e.g., to cope with network condition changes. RPR can 418 also be used in application scenarios wherein zooming of the entire 419 video region or some region of interest is needed. 421 Spatial, SNR, and multiview scalability 422 VVC includes support for spatial, SNR, and multiview scalability. 423 Scalable video coding is widely considered to have technical benefits 424 and enrich services for various video applications. Until recently, 425 however, the functionality has not been included in the first version 426 of specifications of the video codecs. In VVC, however, all those 427 forms of scalability are supported in the first version of VVC 428 natively through the signalling of the nuh_layer_id in the NAL unit 429 header, the VPS which associates layers with given nuh_layer_id to 430 each other, reference picture selection, reference picture resampling 431 for spatial scalability, and a number of other mechanisms not 432 relevant for this memo. 434 Spatial scalability 436 With the existence of Reference Picture Resampling (RPR), the 437 additional burden for scalability support is just a 438 modification of the high-level syntax (HLS). The inter-layer 439 prediction is employed in a scalable system to improve the 440 coding efficiency of the enhancement layers. 
In addition to 441 the spatial and temporal motion-compensated predictions that 442 are available in a single-layer codec, the inter-layer 443 prediction in VVC uses the possibly resampled video data of the 444 reconstructed reference picture from a reference layer to 445 predict the current enhancement layer. The resampling process 446 for inter-layer prediction, when used, is performed at the 447 block level, reusing the existing interpolation process for 448 motion compensation in single-layer coding. This means that no 449 additional resampling process is needed to support spatial 450 scalability. 452 SNR scalability 454 SNR scalability is similar to spatial scalability except that 455 the resampling factors are 1:1. In other words, there is no 456 change in resolution, but there is inter-layer prediction. 458 Multiview scalability 460 The first version of VVC also supports multiview scalability, 461 wherein a multi-layer bitstream carries layers representing 462 multiple views, and one or more of the represented views can be 463 output at the same time. 465 SEI messages 467 Supplemental enhancement information (SEI) messages carry information 468 in the bitstream that does not influence the decoding process as 469 specified in the VVC spec but addresses issues of representation/ 470 rendering of the decoded bitstream and labels the bitstream for certain 471 applications, among other similar tasks. The overall concept of SEI 472 messages and many of the messages themselves have been inherited from 473 the AVC and HEVC specs. Except for the SEI messages that affect the 474 specification of the hypothetical reference decoder (HRD), other SEI 475 messages for use in the VVC environment, which are generally useful 476 also in other video coding technologies, are not included in the main 477 VVC specification but in a companion specification [VSEI]. 479 1.1.3.
High-Level Picture Partitioning (informative) 481 VVC inherited the concept of tiles and wavefront parallel processing 482 (WPP) from HEVC, with some minor to moderate differences. The basic 483 concept of slices was kept in VVC but designed in an essentially 484 different form. VVC is the first video coding standard that includes 485 subpictures as a feature, which provides the same functionality as 486 HEVC motion-constrained tile sets (MCTSs) but designed differently to 487 have better coding efficiency and to be friendlier for usage in 488 application systems. More details of these differences are described 489 below. 491 Tiles and WPP 493 Same as in HEVC, a picture can be split into tile rows and tile 494 columns in VVC, in-picture prediction across tile boundaries is 495 disallowed, etc. However, the syntax for signalling of tile 496 partitioning has been simplified, by using a unified syntax design 497 for both the uniform and the non-uniform mode. In addition, 498 signalling of entry point offsets for tiles in the slice header is 499 optional in VVC while it is mandatory in HEVC. The WPP design in VVC 500 has two differences compared to HEVC: i) The CTU row delay is reduced 501 from two CTUs to one CTU; ii) signalling of entry point offsets for 502 WPP in the slice header is optional in VVC while it is mandatory in 503 HEVC. 505 Slices 507 In VVC, the conventional slices based on CTUs (as in HEVC) or 508 macroblocks (as in AVC) have been removed. The main reasoning behind 509 this architectural change is as follows. The advances in video 510 coding since 2003 (the publication year of AVC v1) have been such 511 that slice-based error concealment has become practically impossible, 512 due to the ever-increasing number and efficiency of in-picture and 513 inter-picture prediction mechanisms. 
An error-concealed picture is 514 the decoding result of a transmitted coded picture for which there is 515 some data loss (e.g., loss of some slices) of the coded picture, or for which a 516 reference picture for at least some part of the coded picture is not 517 error-free (e.g., that reference picture was an error-concealed 518 picture). For example, when one of the multiple slices of a picture 519 is lost, it may be error-concealed using an interpolation of the 520 neighboring slices. While advanced video coding prediction 521 mechanisms provide significantly higher coding efficiency, they also 522 make it harder for machines to estimate the quality of an error- 523 concealed picture, which was already a hard problem with the use of 524 simpler prediction mechanisms. Advanced in-picture prediction 525 mechanisms also cause the coding efficiency loss due to splitting a 526 picture into multiple slices to be more significant. Furthermore, 527 network conditions have become significantly better while at the same time 528 techniques for dealing with packet losses have significantly 529 improved. As a result, very few implementations have recently used 530 slices for maximum transmission unit size matching. Instead, 531 substantially all applications where low-delay error resilience is 532 required (e.g., video telephony and video conferencing) rely on 533 system/transport-level error resilience (e.g., retransmission, 534 forward error correction) and/or picture-based error resilience tools 535 (feedback-based error resilience, insertion of IRAPs, scalability 536 with a higher protection level for the base layer, and so on).
537 Considering all the above, nowadays it is very rare that a picture 538 that cannot be correctly decoded is passed to the decoder, and when 539 such a rare case occurs, the system can afford to wait for an error- 540 free picture to be decoded and available for display without 541 resulting in frequent and long periods of picture freezing seen by 542 end users. 544 Slices in VVC have two modes: rectangular slices and raster-scan 545 slices. The rectangular slice, as indicated by its name, covers a 546 rectangular region of the picture. Typically, a rectangular slice 547 consists of several complete tiles. However, it is also possible 548 that a rectangular slice is a subset of a tile and consists of one or 549 more consecutive, complete CTU rows within a tile. A raster-scan 550 slice consists of one or more complete tiles in tile raster scan 551 order; hence, the region covered by a raster-scan slice may 552 have a non-rectangular shape, but it may also happen to have 553 the shape of a rectangle. The concept of slices in VVC is therefore 554 strongly linked to or based on tiles instead of CTUs (as in HEVC) or 555 macroblocks (as in AVC). 557 Subpictures 559 VVC is the first video coding standard that includes support for 560 subpictures as a feature. Each subpicture consists of one or more 561 complete rectangular slices that collectively cover a rectangular 562 region of the picture. A subpicture may be specified to be either 563 extractable (i.e., coded independently of other subpictures of the 564 same picture and of earlier pictures in decoding order) or not 565 extractable. Regardless of whether a subpicture is extractable or 566 not, the encoder can control whether in-loop filtering (including 567 deblocking, SAO, and ALF) is applied across the subpicture boundaries 568 individually for each subpicture. 570 Functionally, subpictures are similar to the motion-constrained tile 571 sets (MCTSs) in HEVC.
   They both allow independent coding and extraction of a rectangular
   subset of a sequence of coded pictures, for use cases like
   viewport-dependent 360-degree video streaming optimization and
   region of interest (ROI) applications.

   There are several important design differences between subpictures
   and MCTSs. First, the subpictures feature in VVC allows the motion
   vectors of a coding block to point outside of the subpicture, even
   when the subpicture is extractable, by applying sample padding at
   the subpicture boundaries in this case, in the same way as at
   picture boundaries. Second, additional changes were introduced for
   the selection and derivation of motion vectors in the merge mode and
   in the decoder-side motion vector refinement process of VVC. This
   allows higher coding efficiency compared to the non-normative motion
   constraints applied at the encoder side for MCTSs. Third, rewriting
   of SHs (and PH NAL units, when present) is not needed when
   extracting one or more extractable subpictures from a sequence of
   pictures to create a sub-bitstream that is a conforming bitstream.
   In sub-bitstream extractions based on HEVC MCTSs, rewriting of SHs
   is needed. Note that in both HEVC MCTS extraction and VVC subpicture
   extraction, rewriting of SPSs and PPSs is needed. However, there are
   typically only a few parameter sets in a bitstream, whereas each
   picture has at least one slice; therefore, rewriting of SHs can be a
   significant burden for application systems. Fourth, slices of
   different subpictures within a picture are allowed to have different
   NAL unit types. Fifth, VVC specifies HRD and level definitions for
   subpicture sequences; thus, the conformance of the sub-bitstream of
   each extractable subpicture sequence can be ensured by encoders.

1.1.4.  NAL Unit Header

   VVC maintains the NAL unit concept of HEVC with modifications. VVC
   uses a two-byte NAL unit header, as shown in Figure 1.
   The payload of a NAL unit refers to the NAL unit excluding the NAL
   unit header.

                  +---------------+---------------+
                  |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                  |F|Z|  LayerId  |  Type   | TID |
                  +---------------+---------------+

               The Structure of the VVC NAL Unit Header

                                Figure 1

   The semantics of the fields in the NAL unit header are as specified
   in VVC and described briefly below for convenience. In addition to
   the name and size of each field, the corresponding syntax element
   name in VVC is also provided.

   F: 1 bit

      forbidden_zero_bit. Required to be zero in VVC. Note that the
      inclusion of this bit in the NAL unit header was to enable
      transport of VVC video over MPEG-2 transport systems (avoidance
      of start code emulations) [MPEG2S]. In the context of this memo,
      the value 1 may be used to indicate a syntax violation, e.g., for
      a NAL unit resulting from aggregating a number of fragmented
      units of a NAL unit but missing the last fragment, as described
      in the last paragraph of Section 4.3.3.

   Z: 1 bit

      nuh_reserved_zero_bit. Required to be zero in VVC, and reserved
      for future extensions by ITU-T and ISO/IEC.

      This memo does not overload the "Z" bit for local extensions,
      because a) overloading the "F" bit is sufficient and b) doing so
      preserves the usefulness of this memo for possible future
      versions of [VVC].

   LayerId: 6 bits

      nuh_layer_id. Identifies the layer a NAL unit belongs to, wherein
      a layer may be, e.g., a spatial scalable layer, a quality
      scalable layer, a layer containing a different view, etc.

   Type: 5 bits

      nal_unit_type. This field specifies the NAL unit type as defined
      in Table 5 of [VVC]. For a reference of all currently defined NAL
      unit types and their semantics, please refer to Section 7.4.2.2
      in [VVC].

   TID: 3 bits

      nuh_temporal_id_plus1. This field specifies the temporal
      identifier of the NAL unit plus 1. The value of TemporalId is
      equal to TID minus 1. A TID value of 0 is illegal to ensure that
      there is at least one bit in the NAL unit header equal to 1, so
      as to enable the consideration of start code emulations in the
      NAL unit payload data independent of the NAL unit header.

1.2.  Overview of the Payload Format

   This payload format defines the following processes required for
   transport of VVC coded data over RTP [RFC3550]:

   *  Usage of RTP header with this payload format

   *  Packetization of VVC coded NAL units into RTP packets using three
      types of payload structures: a single NAL unit packet,
      aggregation packet, and fragmentation unit

   *  Transmission of VVC NAL units of the same bitstream within a
      single RTP stream

   *  Media type parameters to be used with the Session Description
      Protocol (SDP) [RFC4566]

   *  Usage of RTCP feedback messages

2.  Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown above.

3.  Definitions and Abbreviations

3.1.  Definitions

   This document uses the terms and definitions of VVC. Section 3.1.1
   lists relevant definitions from [VVC] for convenience. Section 3.1.2
   provides definitions specific to this memo. All terms and
   definitions listed in Section 3.1.1 are verbatim copies from the
   [VVC] specification.

3.1.1.  Definitions from the VVC Specification

   Access unit (AU): A set of PUs that belong to different layers and
   contain coded pictures associated with the same time for output from
   the DPB.

   Adaptation parameter set (APS): A syntax structure containing syntax
   elements that apply to zero or more slices as determined by zero or
   more syntax elements found in slice headers.
   Bitstream: A sequence of bits, in the form of a NAL unit stream or a
   byte stream, that forms the representation of a sequence of AUs
   forming one or more coded video sequences (CVSs).

   Coded picture: A coded representation of a picture comprising VCL
   NAL units with a particular value of nuh_layer_id within an AU and
   containing all CTUs of the picture.

   Clean random access (CRA) PU: A PU in which the coded picture is a
   CRA picture.

   Clean random access (CRA) picture: An IRAP picture for which each
   VCL NAL unit has nal_unit_type equal to CRA_NUT.

   Coded video sequence (CVS): A sequence of AUs that consists, in
   decoding order, of a CVSS AU, followed by zero or more AUs that are
   not CVSS AUs, including all subsequent AUs up to but not including
   any subsequent AU that is a CVSS AU.

   Coded video sequence start (CVSS) AU: An AU in which there is a PU
   for each layer in the CVS and the coded picture in each PU is a
   CLVSS picture.

   Coded layer video sequence (CLVS): A sequence of PUs with the same
   value of nuh_layer_id that consists, in decoding order, of a CLVSS
   PU, followed by zero or more PUs that are not CLVSS PUs, including
   all subsequent PUs up to but not including any subsequent PU that is
   a CLVSS PU.

   Coded layer video sequence start (CLVSS) PU: A PU in which the coded
   picture is a CLVSS picture.

   Coded layer video sequence start (CLVSS) picture: A coded picture
   that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1
   or a GDR picture with NoOutputBeforeRecoveryFlag equal to 1.

   Coding tree unit (CTU): A CTB of luma samples, two corresponding
   CTBs of chroma samples of a picture that has three sample arrays, or
   a CTB of samples of a monochrome picture or a picture that is coded
   using three separate colour planes and syntax structures used to
   code the samples.

   Decoding Capability Information (DCI): A syntax structure containing
   syntax elements that apply to the entire bitstream.

   Decoded picture buffer (DPB): A buffer holding decoded pictures for
   reference, output reordering, or output delay specified for the
   hypothetical reference decoder.

   Gradual decoding refresh (GDR) picture: A picture for which each VCL
   NAL unit has nal_unit_type equal to GDR_NUT.

   Instantaneous decoding refresh (IDR) PU: A PU in which the coded
   picture is an IDR picture.

   Instantaneous decoding refresh (IDR) picture: An IRAP picture for
   which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or
   IDR_N_LP.

   Intra random access point (IRAP) AU: An AU in which there is a PU
   for each layer in the CVS and the coded picture in each PU is an
   IRAP picture.

   Intra random access point (IRAP) PU: A PU in which the coded picture
   is an IRAP picture.

   Intra random access point (IRAP) picture: A coded picture for which
   all VCL NAL units have the same value of nal_unit_type in the range
   of IDR_W_RADL to CRA_NUT, inclusive.

   Layer: A set of VCL NAL units that all have a particular value of
   nuh_layer_id and the associated non-VCL NAL units.

   Network abstraction layer (NAL) unit: A syntax structure containing
   an indication of the type of data to follow and bytes containing
   that data in the form of an RBSP interspersed as necessary with
   emulation prevention bytes.

   Network abstraction layer (NAL) unit stream: A sequence of NAL
   units.

   Output Layer Set (OLS): A set of layers for which one or more layers
   are specified as the output layers.

   Operation point (OP): A temporal subset of an OLS, identified by an
   OLS index and a highest value of TemporalId.

   Picture parameter set (PPS): A syntax structure containing syntax
   elements that apply to zero or more entire coded pictures as
   determined by a syntax element found in each slice header.
   Picture unit (PU): A set of NAL units that are associated with each
   other according to a specified classification rule, are consecutive
   in decoding order, and contain exactly one coded picture.

   Random access: The act of starting the decoding process for a
   bitstream at a point other than the beginning of the stream.

   Sequence parameter set (SPS): A syntax structure containing syntax
   elements that apply to zero or more entire CLVSs as determined by
   the content of a syntax element found in the PPS referred to by a
   syntax element found in each picture header.

   Slice: An integer number of complete tiles or an integer number of
   consecutive complete CTU rows within a tile of a picture that are
   exclusively contained in a single NAL unit.

   Slice header (SH): A part of a coded slice containing the data
   elements pertaining to all tiles or CTU rows within a tile
   represented in the slice.

   Sublayer: A temporal scalable layer of a temporal scalable bitstream
   consisting of VCL NAL units with a particular value of the
   TemporalId variable, and the associated non-VCL NAL units.

   Subpicture: A rectangular region of one or more slices within a
   picture.

   Sublayer representation: A subset of the bitstream consisting of NAL
   units of a particular sublayer and the lower sublayers.

   Tile: A rectangular region of CTUs within a particular tile column
   and a particular tile row in a picture.

   Tile column: A rectangular region of CTUs having a height equal to
   the height of the picture and a width specified by syntax elements
   in the picture parameter set.

   Tile row: A rectangular region of CTUs having a height specified by
   syntax elements in the picture parameter set and a width equal to
   the width of the picture.
   Video coding layer (VCL) NAL unit: A collective term for coded slice
   NAL units and the subset of NAL units that have reserved values of
   nal_unit_type that are classified as VCL NAL units in this
   Specification.

3.1.2.  Definitions Specific to This Memo

   Media-Aware Network Element (MANE): A network element, such as a
   middlebox, selective forwarding unit, or application-layer gateway
   that is capable of parsing certain aspects of the RTP payload
   headers or the RTP payload and reacting to their contents.

      Informative note: The concept of a MANE goes beyond normal
      routers or gateways in that a MANE has to be aware of the
      signalling (e.g., to learn about the payload type mappings of the
      media streams), and in that it has to be trusted when working
      with Secure RTP (SRTP). The advantage of using MANEs is that they
      allow packets to be dropped according to the needs of the media
      coding. For example, if a MANE has to drop packets due to
      congestion on a certain link, it can identify and remove those
      packets whose elimination produces the least adverse effect on
      the user experience. After dropping packets, MANEs must rewrite
      RTCP packets to match the changes to the RTP stream, as specified
      in Section 7 of [RFC3550].

   NAL unit decoding order: A NAL unit order that conforms to the
   constraints on NAL unit order given in Section 7.4.2.4 in [VVC] and
   follows the order of NAL units in the bitstream.

   RTP stream (see [RFC7656]): Within the scope of this memo, one RTP
   stream is utilized to transport a VVC bitstream, which may contain
   one or more layers, and each layer may contain one or more temporal
   sublayers.

   Transmission order: The order of packets in ascending RTP sequence
   number order (in modulo arithmetic). Within an aggregation packet,
   the NAL unit transmission order is the same as the order of
   appearance of NAL units in the packet.

3.2.  Abbreviations

   AU      Access Unit
   AP      Aggregation Packet
   APS     Adaptation Parameter Set
   CTU     Coding Tree Unit
   CVS     Coded Video Sequence
   DPB     Decoded Picture Buffer
   DCI     Decoding Capability Information
   DON     Decoding Order Number
   FIR     Full Intra Request
   FU      Fragmentation Unit
   GDR     Gradual Decoding Refresh
   HRD     Hypothetical Reference Decoder
   IDR     Instantaneous Decoding Refresh
   IRAP    Intra Random Access Point
   MANE    Media-Aware Network Element
   MTU     Maximum Transfer Unit
   NAL     Network Abstraction Layer
   NALU    Network Abstraction Layer Unit
   OLS     Output Layer Set
   PLI     Picture Loss Indication
   PPS     Picture Parameter Set
   RPS     Reference Picture Set
   RPSI    Reference Picture Selection Indication
   SEI     Supplemental Enhancement Information
   SLI     Slice Loss Indication
   SPS     Sequence Parameter Set
   VCL     Video Coding Layer
   VPS     Video Parameter Set

4.  RTP Payload Format

4.1.  RTP Header Usage

   The format of the RTP header is specified in [RFC3550] (reprinted as
   Figure 2 for convenience). This payload format uses the fields of
   the header in a manner consistent with that specification.

   The RTP payload (and the settings for some RTP header bits) for
   aggregation packets and fragmentation units are specified in
   Section 4.3.2 and Section 4.3.3, respectively.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |            contributing source (CSRC) identifiers             |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   RTP Header According to [RFC3550]

                                Figure 2

   The RTP header information to be set according to this RTP payload
   format is set as follows:

   Marker bit (M): 1 bit

      Set for the last packet, in transmission order, among each set of
      packets that contain NAL units of one access unit. This is in
      line with the normal use of the M bit in video formats to allow
      efficient playout buffer handling.

   Payload Type (PT): 7 bits

      The assignment of an RTP payload type for this new packet format
      is outside the scope of this document and will not be specified
      here. The assignment of a payload type has to be performed either
      through the profile used or in a dynamic way.

   Sequence Number (SN): 16 bits

      Set and used in accordance with [RFC3550].

   Timestamp: 32 bits

      The RTP timestamp is set to the sampling timestamp of the
      content. A 90 kHz clock rate MUST be used. If the NAL unit has no
      timing properties of its own (e.g., parameter set and SEI NAL
      units), the RTP timestamp MUST be set to the RTP timestamp of the
      coded pictures of the access unit in which the NAL unit
      (according to Section 7.4.2.4 of [VVC]) is included. Receivers
      MUST use the RTP timestamp for the display process, even when the
      bitstream contains picture timing SEI messages or decoding unit
      information SEI messages as specified in [VVC].

         Informative note: When picture timing SEI messages are
         present, the RTP sender is responsible for ensuring that the
         RTP timestamps are consistent with the timing information
         carried in the picture timing SEI messages.

   Synchronization source (SSRC): 32 bits

      Used to identify the source of the RTP packets. A single SSRC is
      used for all parts of a single bitstream.

4.2.  Payload Header Usage

   The first two bytes of the payload of an RTP packet are referred to
   as the payload header.
   The payload header consists of the same fields (F, Z, LayerId, Type,
   and TID) as the NAL unit header as shown in Section 1.1.4,
   irrespective of the type of the payload structure.

   The TID value indicates (among other things) the relative importance
   of an RTP packet, for example, because NAL units belonging to higher
   temporal sublayers are not used for the decoding of lower temporal
   sublayers. A lower value of TID indicates a higher importance.
   More-important NAL units MAY be better protected against
   transmission losses than less-important NAL units.

4.3.  Payload Structures

   Three different types of RTP packet payload structures are
   specified. A receiver can identify the type of an RTP packet payload
   through the Type field in the payload header.

   The three different payload structures are as follows:

   *  Single NAL unit packet: Contains a single NAL unit in the
      payload, and the NAL unit header of the NAL unit also serves as
      the payload header. This payload structure is specified in
      Section 4.3.1.

   *  Aggregation Packet (AP): Contains more than one NAL unit within
      one access unit. This payload structure is specified in
      Section 4.3.2.

   *  Fragmentation Unit (FU): Contains a subset of a single NAL unit.
      This payload structure is specified in Section 4.3.3.

4.3.1.  Single NAL Unit Packets

   A single NAL unit packet contains exactly one NAL unit, and consists
   of a payload header (denoted as PayloadHdr), a conditional 16-bit
   DONL field (in network byte order), and the NAL unit payload data
   (the NAL unit excluding its NAL unit header) of the contained NAL
   unit, as shown in Figure 3.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           PayloadHdr          |      DONL (conditional)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                  NAL unit payload data                        |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               The Structure of a Single NAL Unit Packet

                                Figure 3

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the contained NAL
   unit. If sprop-max-don-diff is greater than 0, the DONL field MUST
   be present, and the variable DON for the contained NAL unit is
   derived as equal to the value of the DONL field. Otherwise
   (sprop-max-don-diff is equal to 0), the DONL field MUST NOT be
   present.

4.3.2.  Aggregation Packets (APs)

   Aggregation Packets (APs) can reduce packetization overhead for
   small NAL units, such as most of the non-VCL NAL units, which are
   often only a few octets in size.

   An AP aggregates NAL units of one access unit and it MUST NOT
   contain NAL units from more than one AU. Each NAL unit to be carried
   in an AP is encapsulated in an aggregation unit. NAL units
   aggregated in one AP are included in NAL unit decoding order.

   An AP consists of a payload header (denoted as PayloadHdr) followed
   by two or more aggregation units, as shown in Figure 4.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    PayloadHdr (Type=28)       |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                                                               |
   |             two or more aggregation units                     |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 The Structure of an Aggregation Packet

                                Figure 4

   The fields in the payload header of an AP are set as follows. The F
   bit MUST be equal to 0 if the F bit of each aggregated NAL unit is
   equal to zero; otherwise, it MUST be equal to 1. The Type field MUST
   be equal to 28.

   The value of LayerId MUST be equal to the lowest value of LayerId of
   all the aggregated NAL units. The value of TID MUST be the lowest
   value of TID of all the aggregated NAL units.

      Informative note: All VCL NAL units in an AP have the same TID
      value since they belong to the same access unit. However, an AP
      may contain non-VCL NAL units for which the TID value in the NAL
      unit header may be different than the TID value of the VCL NAL
      units in the same AP.

      Informative note: If a system envisions subpicture-level or
      picture-level modifications, for example by removing subpictures
      or pictures of a particular layer, a good design choice on the
      sender's side would be to aggregate NAL units belonging to only
      the same subpicture or picture of a particular layer.

   An AP MUST carry at least two aggregation units and can carry as
   many aggregation units as necessary; however, the total amount of
   data in an AP MUST fit into an IP packet, and the size SHOULD be
   chosen so that the resulting IP packet is smaller than the MTU size
   so as to avoid IP layer fragmentation. An AP MUST NOT contain FUs
   specified in Section 4.3.3. APs MUST NOT be nested; i.e., an AP
   cannot contain another AP.

   The first aggregation unit in an AP consists of a conditional 16-bit
   DONL field (in network byte order) followed by a 16-bit unsigned
   size field (in network byte order) that indicates the size of the
   NAL unit in bytes (excluding these two octets, but including the NAL
   unit header), followed by the NAL unit itself, including its NAL
   unit header, as shown in Figure 5.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               :       DONL (conditional)      |   NALU size   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   NALU size   |                                               |
   +-+-+-+-+-+-+-+-+         NAL unit                              |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

          The Structure of the First Aggregation Unit in an AP

                                Figure 5

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the aggregated NAL
   unit.

   If sprop-max-don-diff is greater than 0, the DONL field MUST be
   present in an aggregation unit that is the first aggregation unit in
   an AP, and the variable DON for the aggregated NAL unit is derived
   as equal to the value of the DONL field. The variable DON for the
   NAL unit in an aggregation unit that is not the first aggregation
   unit in an AP is derived as equal to the DON of the preceding
   aggregated NAL unit in the same AP plus 1 modulo 65536. Otherwise
   (sprop-max-don-diff is equal to 0), the DONL field MUST NOT be
   present in an aggregation unit that is the first aggregation unit in
   an AP.
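
   The payload header layout and the per-unit DON derivation described
   above can be sketched in Python as follows. This is an illustrative
   sketch under stated assumptions, not a normative algorithm: the
   names parse_header and parse_ap are hypothetical, the payload is
   assumed to have had any RTP padding removed already, and the value
   of sprop-max-don-diff is assumed to be known from signalling. Every
   aggregation unit is read as a 16-bit size followed by the NAL unit,
   with the DONL field read only in front of the first one.

```python
import struct

AP_TYPE = 28  # Type value carried in the payload header of an AP


def parse_header(hdr: bytes) -> dict:
    """Split a two-byte NAL unit header or payload header into fields."""
    b0, b1 = hdr[0], hdr[1]
    return {
        "F": b0 >> 7,           # forbidden_zero_bit
        "Z": (b0 >> 6) & 0x01,  # nuh_reserved_zero_bit
        "LayerId": b0 & 0x3F,   # nuh_layer_id (6 bits)
        "Type": b1 >> 3,        # nal_unit_type (5 bits)
        "TID": b1 & 0x07,       # nuh_temporal_id_plus1 (3 bits)
    }


def parse_ap(payload: bytes, sprop_max_don_diff: int = 0):
    """Return a list of (DON or None, NAL unit bytes) from an AP payload.

    Assumption: 'payload' starts at the payload header and contains no
    RTP padding.
    """
    if parse_header(payload[:2])["Type"] != AP_TYPE:
        raise ValueError("not an aggregation packet")
    pos = 2
    don = None
    units = []
    while pos < len(payload):
        if sprop_max_don_diff > 0:
            if not units:
                # DONL is present only in the first aggregation unit.
                (don,) = struct.unpack_from("!H", payload, pos)
                pos += 2
            else:
                # Later units: DON of the preceding unit plus 1, mod 65536.
                don = (don + 1) % 65536
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        # The size covers the NAL unit including its two-byte header.
        if size < 2 or pos + size > len(payload):
            raise ValueError("malformed aggregation unit")
        units.append((don, payload[pos:pos + size]))
        pos += size
    if len(units) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    return units
```

   Returning the DON alongside each NAL unit keeps the sketch usable
   in both signalling modes: with sprop-max-don-diff equal to 0, every
   entry simply carries None in place of a DON.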

   An aggregation unit that is not the first aggregation unit in an AP
   consists of a 16-bit unsigned size field (in network byte order)
   that indicates the size of the NAL unit in bytes (excluding these
   two octets, but including the NAL unit header), followed by the NAL
   unit itself, including its NAL unit header, as shown in Figure 6.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               :          NALU size            |   NAL unit    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

       The Structure of an Aggregation Unit That Is Not the First
                       Aggregation Unit in an AP

                                Figure 6

   Figure 7 presents an example of an AP that contains two aggregation
   units, labeled as 1 and 2 in the figure, without the DONL field
   being present.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      PayloadHdr (Type=28)     |           NALU 1 Size         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          NALU 1 HDR           |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
   |                   . . .                                       |
   |                                                               |
   +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  . . .        | NALU 2 Size                   | NALU 2 HDR    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | NALU 2 HDR    |                                               |
   +-+-+-+-+-+-+-+-+              NALU 2 Data                      |
   |                   . . .                                       |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    An Example of an AP Containing
              Two Aggregation Units without the DONL Field

                                Figure 7

   Figure 8 presents an example of an AP that contains two aggregation
   units, labeled as 1 and 2 in the figure, with the DONL field being
   present.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      PayloadHdr (Type=28)     |        NALU 1 DONL            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           NALU 1 Size         |            NALU 1 HDR         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                 NALU 1 Data   . . .                           |
   |                                                               |
   +     . . .                     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :          NALU 2 Size          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          NALU 2 HDR           |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
   |                                                               |
   |         . . .                 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    An Example of an AP Containing
               Two Aggregation Units with the DONL Field

                                Figure 8

4.3.3.  Fragmentation Units

   Fragmentation Units (FUs) are introduced to enable fragmenting a
   single NAL unit into multiple RTP packets, possibly without
   cooperation or knowledge of the [VVC] encoder. A fragment of a NAL
   unit consists of an integer number of consecutive octets of that NAL
   unit. Fragments of the same NAL unit MUST be sent in consecutive
   order with ascending RTP sequence numbers (with no other RTP packets
   within the same RTP stream being sent between the first and last
   fragment).

   When a NAL unit is fragmented and conveyed within FUs, it is
   referred to as a fragmented NAL unit. APs MUST NOT be fragmented.
   FUs MUST NOT be nested; i.e., an FU cannot contain a subset of
   another FU.

   The RTP timestamp of an RTP packet carrying an FU is set to the
   NALU-time of the fragmented NAL unit.

   An FU consists of a payload header (denoted as PayloadHdr), an FU
   header of one octet, a conditional 16-bit DONL field (in network
   byte order), and an FU payload, as shown in Figure 9.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   PayloadHdr (Type=29)        |   FU header   | DONL (cond)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   DONL (cond) |                                               |
   +-+-+-+-+-+-+-+-+                                               |
   |                         FU payload                            |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         The Structure of an FU

                                Figure 9

   The fields in the payload header are set as follows. The Type field
   MUST be equal to 29. The fields F, LayerId, and TID MUST be equal to
   the fields F, LayerId, and TID, respectively, of the fragmented NAL
   unit.

   The FU header consists of an S bit, an E bit, a P bit, and a 5-bit
   FuType field, as shown in Figure 10.

                            +---------------+
                            |0|1|2|3|4|5|6|7|
                            +-+-+-+-+-+-+-+-+
                            |S|E|P|  FuType |
                            +---------------+

                       The Structure of FU Header

                               Figure 10

   The semantics of the FU header fields are as follows:

   S: 1 bit

      When set to 1, the S bit indicates the start of a fragmented NAL
      unit, i.e., the first byte of the FU payload is also the first
      byte of the payload of the fragmented NAL unit. When the FU
      payload is not the start of the fragmented NAL unit payload, the
      S bit MUST be set to 0.

   E: 1 bit

      When set to 1, the E bit indicates the end of a fragmented NAL
      unit, i.e., the last byte of the FU payload is also the last byte
      of the fragmented NAL unit. When the FU payload is not the last
      fragment of a fragmented NAL unit, the E bit MUST be set to 0.

   P: 1 bit

      When set to 1, the P bit indicates the last FU of the last VCL
      NAL unit of a coded picture, i.e., the last byte of the FU
      payload is also the last byte of the last VCL NAL unit of the
      coded picture. When the FU payload is not the last fragment of
      the last VCL NAL unit of a coded picture, the P bit MUST be set
      to 0.

   FuType: 5 bits

      The field FuType MUST be equal to the field Type of the
      fragmented NAL unit.

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the fragmented NAL
   unit.

   If sprop-max-don-diff is greater than 0, and the S bit is equal to
   1, the DONL field MUST be present in the FU, and the variable DON
   for the fragmented NAL unit is derived as equal to the value of the
   DONL field. Otherwise (sprop-max-don-diff is equal to 0, or the S
   bit is equal to 0), the DONL field MUST NOT be present in the FU.

   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
   the S bit and E bit must not both be set to 1 in the same FU header.

   The FU payload consists of fragments of the payload of the
   fragmented NAL unit so that if the FU payloads of consecutive FUs,
   starting with an FU with the S bit equal to 1 and ending with an FU
   with the E bit equal to 1, are sequentially concatenated, the
   payload of the fragmented NAL unit can be reconstructed.
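
   The fragmentation and concatenation rules above can be sketched in
   Python as a sender-side split and a receiver-side rebuild. This is
   a minimal sketch under stated assumptions, not a normative
   procedure: the names fragment and reassemble are hypothetical,
   sprop-max-don-diff is assumed to be 0 (so no DONL field is present),
   RTP headers and padding are left out, and the receiver is assumed to
   hold the complete set of FUs in transmission order. The NAL unit
   header is rebuilt from the F, LayerId, and TID fields of the payload
   header together with FuType.

```python
FU_TYPE = 29  # Type value carried in the payload header of an FU


def fragment(nal: bytes, max_chunk: int, last_vcl_of_picture: bool = False):
    """Split one NAL unit into FU packet payloads (DONL omitted)."""
    hdr, body = nal[:2], nal[2:]
    if max_chunk < 1:
        raise ValueError("max_chunk must be positive")
    if len(body) <= max_chunk:
        # A non-fragmented NAL unit is not sent in one FU.
        raise ValueError("NAL unit fits in a single NAL unit packet")
    nal_type = hdr[1] >> 3
    # Payload header: first byte (F, Z, LayerId) and TID are copied,
    # and the Type field is replaced by 29. Copying Z is an assumption
    # of this sketch; it is required to be zero anyway.
    payload_hdr = bytes([hdr[0], (FU_TYPE << 3) | (hdr[1] & 0x07)])
    chunks = [body[i:i + max_chunk] for i in range(0, len(body), max_chunk)]
    fus = []
    for i, chunk in enumerate(chunks):
        s = int(i == 0)                 # start of the fragmented NAL unit
        e = int(i == len(chunks) - 1)   # end of the fragmented NAL unit
        p = int(bool(e and last_vcl_of_picture))
        fu_hdr = (s << 7) | (e << 6) | (p << 5) | nal_type
        fus.append(payload_hdr + bytes([fu_hdr]) + chunk)
    return fus


def reassemble(fus):
    """Rebuild the NAL unit from a complete, in-order list of FU payloads."""
    first, last = fus[0], fus[-1]
    if not (first[2] & 0x80) or not (last[2] & 0x40):
        raise ValueError("missing start or end fragment")
    fu_type = first[2] & 0x1F
    # NAL unit header: first byte as in the payload header, second byte
    # from FuType and TID.
    nal_hdr = bytes([first[0], (fu_type << 3) | (first[1] & 0x07)])
    return nal_hdr + b"".join(fu[3:] for fu in fus)
```

   Because fragment() refuses input that fits in one chunk, it always
   yields at least two FUs, so the S and E bits are never both set in
   the same FU header and no FU payload is empty.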
The NAL unit header of the 1356 fragmented NAL unit is not included as such in the FU payload, but 1357 rather the information of the NAL unit header of the fragmented NAL 1358 unit is conveyed in F, LayerId, and TID fields of the FU payload 1359 headers of the FUs and the FuType field of the FU header of the FUs. 1360 An FU payload MUST NOT be empty. 1362 If an FU is lost, the receiver SHOULD discard all following 1363 fragmentation units in transmission order corresponding to the same 1364 fragmented NAL unit, unless the decoder in the receiver is known to 1365 be prepared to gracefully handle incomplete NAL units. 1367 A receiver in an endpoint or in a MANE MAY aggregate the first n-1 1368 fragments of a NAL unit to an (incomplete) NAL unit, even if fragment 1369 n of that NAL unit is not received. In this case, the 1370 forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a 1371 syntax violation. 1373 4.4. Decoding Order Number 1375 For each NAL unit, the variable AbsDon is derived, representing the 1376 decoding order number that is indicative of the NAL unit decoding 1377 order. 1379 Let NAL unit n be the n-th NAL unit in transmission order within an 1380 RTP stream. 1382 If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon 1383 for NAL unit n, is derived as equal to n. 1385 Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is 1386 derived as follows, where DON[n] is the value of the variable DON for 1387 NAL unit n: 1389 * If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in 1390 transmission order), AbsDon[0] is set equal to DON[0]. 
1392     *  Otherwise (n is greater than 0), the following applies for
1393        derivation of AbsDon[n]:

1395           If DON[n] == DON[n-1],
1396              AbsDon[n] = AbsDon[n-1]

1398           If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768),
1399              AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1]

1401           If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768),
1402              AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n]

1404           If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768),
1405              AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - DON[n])

1407           If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768),
1408              AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n])

1410     For any two NAL units m and n, the following applies:

1412     *  AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows
1413        NAL unit m in NAL unit decoding order.

1415     *  When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order
1416        of the two NAL units can be in either order.

1418     *  AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes
1419        NAL unit m in decoding order.

1421        Informative note: When two consecutive NAL units in the NAL
1422        unit decoding order have different values of AbsDon, the
1423        absolute difference between the two AbsDon values may be
1424        greater than or equal to 1.

1426        Informative note: There are multiple reasons to allow for the
1427        absolute difference of the values of AbsDon for two consecutive
1428        NAL units in the NAL unit decoding order to be greater than
1429        one. An increment by one is not required, as at the time of
1430        associating values of AbsDon to NAL units, it may not be known
1431        whether all NAL units are to be delivered to the receiver. For
1432        example, a gateway might not forward VCL NAL units of higher
1433        sublayers or some SEI NAL units when there is congestion in the
1434        network.
In another example, the first intra-coded picture of 1435 a pre-encoded clip is transmitted in advance to ensure that it 1436 is readily available in the receiver, and when transmitting the 1437 first intra-coded picture, the originator does not know exactly 1438 how many NAL units will be encoded before the first intra-coded 1439 picture of the pre-encoded clip follows in decoding order. 1440 Thus, the values of AbsDon for the NAL units of the first 1441 intra-coded picture of the pre-encoded clip have to be 1442 estimated when they are transmitted, and gaps in values of 1443 AbsDon may occur. 1445 5. Packetization Rules 1447 The following packetization rules apply: 1449 * If sprop-max-don-diff is greater than 0, the transmission order of 1450 NAL units carried in the RTP stream MAY be different from the NAL 1451 unit decoding order. Otherwise (sprop-max-don-diff is equal to 1452 0), the transmission order of NAL units carried in the RTP stream 1453 MUST be the same as the NAL unit decoding order. 1455 * A NAL unit of a small size SHOULD be encapsulated in an 1456 aggregation packet together with one or more other NAL units in order 1457 to avoid the unnecessary packetization overhead for small NAL 1458 units. For example, non-VCL NAL units such as access unit 1459 delimiters, parameter sets, or SEI NAL units are typically small 1460 and can often be aggregated with VCL NAL units without violating 1461 MTU size constraints. 1463 * Each non-VCL NAL unit SHOULD, when possible from an MTU size match 1464 viewpoint, be encapsulated in an aggregation packet together with 1465 its associated VCL NAL unit, as typically a non-VCL NAL unit would 1466 be meaningless without the associated VCL NAL unit being 1467 available. 1469 * For carrying exactly one NAL unit in an RTP packet, a single NAL 1470 unit packet MUST be used. 1472 6.
De-packetization Process 1474 The general concept behind de-packetization is to get the NAL units 1475 out of the RTP packets in an RTP stream and pass them to the decoder 1476 in the NAL unit decoding order. 1478 The de-packetization process is implementation dependent. Therefore, 1479 the following description should be seen as an example of a suitable 1480 implementation. Other schemes may be used as well, as long as the 1481 output for the same input is the same as that of the process described below. 1482 The output is the same when the set of output NAL units and their 1483 order are both identical. Optimizations relative to the described 1484 algorithms are possible. 1486 All normal RTP mechanisms related to buffer management apply. In 1487 particular, duplicated or outdated RTP packets (as indicated by the 1488 RTP sequence number and the RTP timestamp) are removed. To 1489 determine the exact time for decoding, factors such as a possible 1490 intentional delay to allow for proper inter-stream synchronization 1491 MUST be taken into account. 1493 NAL units with NAL unit type values in the range of 0 to 27, 1494 inclusive, may be passed to the decoder. NAL-unit-like structures 1495 with NAL unit type values in the range of 28 to 31, inclusive, MUST 1496 NOT be passed to the decoder. 1498 The receiver includes a receiver buffer, which is used to compensate 1499 for transmission delay jitter within individual RTP streams, and to 1500 reorder NAL units from transmission order to the NAL unit decoding 1501 order. In this section, the receiver operation is described under 1502 the assumption that there is no transmission delay jitter within an 1503 RTP stream. To distinguish it from a practical receiver buffer 1504 that is also used for compensation of transmission delay jitter, the 1505 receiver buffer is hereafter called the de-packetization buffer in 1506 this section.
Receivers should also prepare for transmission delay 1507 jitter; that is, either reserve separate buffers for transmission 1508 delay jitter buffering and de-packetization buffering or use a 1509 receiver buffer for both transmission delay jitter and de- 1510 packetization. Moreover, receivers should take transmission delay 1511 jitter into account in the buffering operation, e.g., by additional 1512 initial buffering before the start of decoding and playback. 1514 The de-packetization process extracts the NAL units from the RTP 1515 packets in an RTP stream as follows. When an RTP packet carries a 1516 single NAL unit packet, the payload of the RTP packet is extracted as 1517 a single NAL unit, excluding the DONL field, i.e., the third and fourth 1518 bytes, when sprop-max-don-diff is greater than 0. When an RTP packet 1519 carries an Aggregation Packet, several NAL units are extracted from 1520 the payload of the RTP packet. In this case, each NAL unit 1521 corresponds to the part of the payload of each aggregation unit that 1522 follows the NALU size field as described in Section 4.3.2. When an 1523 RTP packet carries a Fragmentation Unit (FU), all RTP packets from 1524 the first FU (with the S field equal to 1) of the fragmented NAL unit 1525 up to the last FU (with the E field equal to 1) of the fragmented NAL 1526 unit are collected. The NAL unit is extracted from these RTP packets 1527 by concatenating all FU payloads in the same order as the 1528 corresponding RTP packets and prepending the NAL unit header with the 1529 fields F, LayerId, and TID set equal to the values of the fields 1530 F, LayerId, and TID in the payload header of the FUs, respectively, 1531 and with the NAL unit type set equal to the value of the field FuType 1532 in the FU header of the FUs, as described in Section 4.3.3.
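Informative example (not part of the payload format): the FU case of the extraction described above can be sketched in Python as follows. The sketch assumes the two-byte payload header layout used by this memo (F, Z, and LayerId in the first byte; Type and TID in the second), the one-byte FU header layout S, E, P, FuType, and a payload header Type value of 29 for FUs. The function name reassemble_fu and its arguments are hypothetical. It expects the RTP payloads of one fragmented NAL unit in sequence-number order with no fragment missing, and it does not attempt the optional handling of incomplete NAL units.

```python
from typing import List, Optional

FU_TYPE = 29  # assumed payload-header Type value that identifies an FU


def reassemble_fu(fus: List[bytes], max_don_diff: int = 0) -> Optional[bytes]:
    """Rebuild one NAL unit from the RTP payloads of its FUs.

    `fus` holds the RTP payloads in sequence-number order. Returns None
    when the fragments do not form one complete, well-formed NAL unit.
    """
    if len(fus) < 2:
        return None  # S and E are never both set, so at least two FUs exist
    body = bytearray()
    for i, fu in enumerate(fus):
        # 2-byte payload header + 1-byte FU header + at least 1 payload byte
        if len(fu) < 4 or (fu[1] >> 3) != FU_TYPE:
            return None
        fu_header = fu[2]
        s_bit = bool(fu_header & 0x80)
        e_bit = bool(fu_header & 0x40)
        # S must be set only on the first FU, E only on the last one
        if s_bit != (i == 0) or e_bit != (i == len(fus) - 1):
            return None
        offset = 3
        if s_bit and max_don_diff > 0:
            offset += 2  # skip DONL, present only in the first FU
        if len(fu) <= offset:
            return None  # an FU payload must not be empty
        body += fu[offset:]
    first = fus[0]
    fu_type = first[2] & 0x1F
    # NAL unit header: byte 0 (F, Z, LayerId) is copied from the payload
    # header; byte 1 combines FuType (5 bits) with TID (3 bits).
    header = bytes([first[0], (fu_type << 3) | (first[1] & 0x07)])
    return header + bytes(body)
```

A receiver would call reassemble_fu once it has collected every FU from the one with S equal to 1 up to the one with E equal to 1, passing the negotiated sprop-max-don-diff value so that the DONL bytes are skipped when present.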
1534 When sprop-max-don-diff is equal to 0, the de-packetization buffer 1535 size is zero bytes, and the NAL units carried in the single RTP 1536 stream are directly passed to the decoder in their transmission 1537 order, which is identical to their decoding order. 1539 When sprop-max-don-diff is greater than 0, the process described in 1540 the remainder of this section applies. 1542 There are two buffering states in the receiver: initial buffering and 1543 buffering while playing. Initial buffering starts when the reception 1544 is initialized. After initial buffering, decoding and playback are 1545 started, and the buffering-while-playing mode is used. 1547 Regardless of the buffering state, the receiver stores incoming NAL 1548 units in reception order into the de-packetization buffer. NAL units 1549 carried in RTP packets are stored in the de-packetization buffer 1550 individually, and the value of AbsDon is calculated and stored for 1551 each NAL unit. 1553 Initial buffering lasts until the difference between the greatest and 1554 smallest AbsDon values of the NAL units in the de-packetization 1555 buffer is greater than or equal to the value of sprop-max-don-diff. 1557 After initial buffering, whenever the difference between the greatest 1558 and smallest AbsDon values of the NAL units in the de-packetization 1559 buffer is greater than or equal to the value of sprop-max-don-diff, 1560 the following operation is repeatedly applied until this difference 1561 is smaller than sprop-max-don-diff: 1563 * The NAL unit in the de-packetization buffer with the smallest 1564 value of AbsDon is removed from the de-packetization buffer and 1565 passed to the decoder. 1567 When no more NAL units are flowing into the de-packetization buffer, 1568 all NAL units remaining in the de-packetization buffer are removed 1569 from the buffer and passed to the decoder in the order of increasing 1570 AbsDon values. 1572 7. 
Payload Format Parameters 1574 This section specifies the optional parameters. A mapping of the 1575 parameters to the Session Description Protocol (SDP) [RFC4566] is also 1576 provided for applications that use SDP. 1578 7.1. Media Type Registration 1580 The receiver MUST ignore any parameter unspecified in this memo. 1582 Type name: video 1584 Subtype name: H266 1586 Required parameters: none 1588 Optional parameters: 1590 profile-id, tier-flag, sub-profile-id, interop-constraints, and 1591 level-id: 1593 These parameters indicate the profile, tier, default level, 1594 sub-profile, and some constraints of the bitstream carried by 1595 the RTP stream, or a specific set of the profile, tier, default 1596 level, sub-profile and some constraints the receiver supports. 1598 The subset of coding tools that may have been used to generate 1599 the bitstream or that the receiver supports, as well as some 1600 additional constraints, are indicated collectively by profile- 1601 id, sub-profile-id, and interop-constraints. 1603 Informative note: There are 128 values of profile-id. The 1604 subset of coding tools identified by the profile-id can be 1605 further constrained with up to 255 instances of sub-profile- 1606 id. In addition, 68 bits included in interop-constraints, 1607 which can be extended up to 324 bits, provide means to 1608 further restrict tools from existing profiles. To be able 1609 to support this fine-granular signalling of coding tool 1610 subsets with profile-id, sub-profile-id and interop- 1611 constraints, it would be safe to require symmetric use of 1612 these parameters in SDP offer/answer unless recv-ols-id is 1613 included in the SDP answer for choosing one of the layers 1614 offered. 1616 The tier is indicated by tier-flag. The default level is 1617 indicated by level-id.
The tier and the default level specify 1618 the limits on values of syntax elements or arithmetic 1619 combinations of values of syntax elements that are followed 1620 when generating the bitstream or that the receiver supports. 1622 In SDP offer/answer, when the SDP answer does not include a 1623 recv-ols-id parameter that is less than the sprop-ols-id 1624 parameter in the SDP offer, the following applies: 1626 o The tier-flag, profile-id, sub-profile-id, and interop- 1627 constraints parameters MUST be used symmetrically, i.e., the 1628 value of each of these parameters in the offer MUST be the 1629 same as that in the answer, either explicitly signalled or 1630 implicitly inferred. 1632 o The level-id parameter is changeable as long as the highest 1633 level indicated by the answer is either equal to or lower 1634 than that in the offer. Note that a highest level higher 1635 than level-id in the offer for receiving can be included as 1636 max-recv-level-id. 1638 In SDP offer/answer, when the SDP answer does include a recv- 1639 ols-id parameter that is less than the sprop-ols-id parameter 1640 in the SDP offer, the set of tier-flag, profile-id, sub- 1641 profile-id, interop-constraints, and level-id parameters 1642 included in the answer MUST be consistent with that for the 1643 chosen output layer set as indicated in the SDP offer, with the 1644 exception that the level-id parameter in the SDP answer is 1645 changeable as long as the highest level indicated by the answer 1646 is either lower than or equal to that in the offer. 1648 More specifications of these parameters, including how they 1649 relate to syntax elements specified in [VVC], are provided 1650 below. 1652 profile-id: 1654 When profile-id is not present, a value of 1 (i.e., the Main 10 1655 profile) MUST be inferred.
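Informative example (not part of this memo): the symmetric-use rule for SDP offer/answer stated above, combined with the inference of absent values, can be sketched in Python as follows. The sketch covers only the case in which the answer carries no recv-ols-id lower than the offered sprop-ols-id. The inferred values used are the ones this memo gives for profile-id (1), tier-flag (0), and level-id (51). The names effective and answer_is_acceptable are hypothetical, and each fmtp parameter set is assumed to be already parsed into a dictionary of strings.

```python
from typing import Dict, Optional, Union

# Values this memo says MUST be inferred when the parameter is absent.
DEFAULTS = {"profile-id": "1", "tier-flag": "0", "level-id": "51"}
NUMERIC = {"profile-id", "tier-flag", "level-id"}
SYMMETRIC = ("tier-flag", "profile-id", "sub-profile-id", "interop-constraints")


def effective(params: Dict[str, str], name: str) -> Optional[Union[int, str]]:
    """Return the signalled value, or the inferred one when absent."""
    value = params.get(name, DEFAULTS.get(name))
    if value is not None and name in NUMERIC:
        return int(value)
    return value


def answer_is_acceptable(offer: Dict[str, str], answer: Dict[str, str]) -> bool:
    """Check an answer against an offer under the symmetric-use rule."""
    for name in SYMMETRIC:
        # Simplification: sub-profile-id and interop-constraints are compared
        # as raw strings, and an absent value only matches an absent value.
        if effective(offer, name) != effective(answer, name):
            return False
    # level-id may change, but only downwards or not at all.
    return effective(answer, "level-id") <= effective(offer, "level-id")
```

For instance, an offer with profile-id=1 and level-id=83 would accept an answer that only states level-id=51, since profile-id and tier-flag are inferred identically on both sides.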
1657 When used to indicate properties of a bitstream, profile-id is 1658 derived from the general_profile_idc syntax element that 1659 applies to the bitstream in an instance of the 1660 profile_tier_level( ) syntax structure. 1662 VVC bitstreams transported over RTP using the technologies of 1663 this memo SHOULD contain only a single profile_tier_level( ) 1664 structure in the DCI, unless the sender can ensure that a 1665 receiver can correctly decode the VVC bitstream regardless of 1666 which profile_tier_level( ) structure contained in the DCI was 1667 used for deriving profile-id and other parameters for the SDP 1668 O/A exchange. 1670 As specified in [VVC], a profile_tier_level( ) syntax structure 1671 may be contained in an SPS NAL unit, and one or more 1672 profile_tier_level( ) syntax structures may be contained in a 1673 VPS NAL unit and in a DCI NAL unit. One of the following three 1674 cases applies to the container NAL unit of the 1675 profile_tier_level( ) syntax structure containing syntax 1676 elements used to derive the values of profile-id, tier-flag, 1677 level-id, sub-profile-id, or interop-constraints: 1) The 1678 container NAL unit is an SPS, the bitstream is a single-layer 1679 bitstream, and the profile_tier_level( ) syntax structures in 1680 all SPSs referenced by the CVSs in the bitstream have the same 1681 values respectively for those profile_tier_level( ) syntax 1682 elements; 2) The container NAL unit is a VPS, the 1683 profile_tier_level( ) syntax structure is the one in the VPS 1684 that applies to the OLS corresponding to the bitstream, and the 1685 profile_tier_level( ) syntax structures applicable to the OLS 1686 corresponding to the bitstream in all VPSs referenced by the 1687 CVSs in the bitstream have the same values respectively for 1688 those profile_tier_level( ) syntax elements; 3) The container 1689 NAL unit is a DCI NAL unit and the profile_tier_level( ) syntax 1690 structures in all DCI NAL units in the bitstream have the
same 1691 values respectively for those profile_tier_level( ) syntax 1692 elements. 1694 [VVC] allows for multiple profile_tier_level( ) structures in a 1695 DCI NAL unit, which may contain different values for the syntax 1696 elements used to derive the values of profile-id, tier-flag, 1697 level-id, sub-profile-id, or interop-constraints in the 1698 different entries. However, herein defined is only a single 1699 profile-id, tier-flag, level-id, sub-profile-id, or interop- 1700 constraints. When signalling these parameters and a DCI NAL 1701 unit is present with multiple profile_tier_level( ) structures, 1702 these values SHOULD be the same as the first profile_tier_level 1703 structure in the DCI, unless the sender has ensured that the 1704 receiver can decode the bitstream when a different value is 1705 chosen. 1707 tier-flag, level-id: 1709 The value of tier-flag MUST be in the range of 0 to 1, 1710 inclusive. The value of level-id MUST be in the range of 0 to 1711 255, inclusive. 1713 If the tier-flag and level-id parameters are used to indicate 1714 properties of a bitstream, they indicate the tier and the 1715 highest level the bitstream complies with. 1717 If the tier-flag and level-id parameters are used for 1718 capability exchange, the following applies. If max-recv-level- 1719 id is not present, the default level defined by level-id 1720 indicates the highest level the codec wishes to support. 1721 Otherwise, max-recv-level-id indicates the highest level the 1722 codec supports for receiving. For either receiving or sending, 1723 all levels that are lower than the highest level supported MUST 1724 also be supported. 1726 If no tier-flag is present, a value of 0 MUST be inferred; if 1727 no level-id is present, a value of 51 (i.e., level 3.1) MUST be 1728 inferred. 
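Informative example (not part of this memo): the inferred level-id value of 51 and the "majorNum.minorNum" convention given in the informative note of this parameter description (level-id = majorNum * 16 + minorNum * 3) can be sketched in Python as follows. The function names are hypothetical, the parameters are assumed to be already parsed into a dictionary of strings, and the mapping holds only for levels that follow the stated convention.

```python
from typing import Dict

DEFAULT_LEVEL_ID = 51  # inferred when level-id is absent (level 3.1)


def level_to_id(major: int, minor: int) -> int:
    """Map a level written as major.minor to its level-id value."""
    level_id = major * 16 + minor * 3
    if not 0 <= level_id <= 255:
        raise ValueError("level-id must be in the range of 0 to 255")
    return level_id


def id_to_level(level_id: int) -> str:
    """Map a level-id value back to the major.minor notation."""
    if not 0 <= level_id <= 255:
        raise ValueError("level-id must be in the range of 0 to 255")
    major, rest = divmod(level_id, 16)
    if rest % 3 != 0:
        raise ValueError("value does not follow the major*16 + minor*3 form")
    return f"{major}.{rest // 3}"


def highest_recv_level(params: Dict[str, str]) -> int:
    """Highest level supported for receiving, for capability exchange.

    max-recv-level-id takes precedence when present; otherwise the
    (possibly inferred) level-id is used.
    """
    level_id = int(params.get("level-id", DEFAULT_LEVEL_ID))
    return int(params.get("max-recv-level-id", level_id))
```

Under this convention, level_to_id(3, 1) yields the inferred default of 51, and id_to_level(51) yields "3.1".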
1730 Informative note: The level values currently defined in the 1731 VVC specification are in the form of "majorNum.minorNum", 1732 and the value of the level-id for each of the levels is 1733 equal to majorNum * 16 + minorNum * 3. It is expected that 1734 if any levels are defined in the future, the same convention 1735 will be used, but this cannot be guaranteed. 1737 When used to indicate properties of a bitstream, the tier-flag 1738 and level-id parameters are derived respectively from the 1739 syntax element general_tier_flag, and the syntax element 1740 general_level_idc or sub_layer_level_idc[j], that apply to the 1741 bitstream, in an instance of the profile_tier_level( ) syntax 1742 structure. 1744 If the tier-flag and level-id are derived from the 1745 profile_tier_level( ) syntax structure in a DCI NAL unit, the 1746 following applies: 1748 o tier-flag = general_tier_flag 1750 o level-id = general_level_idc 1752 Otherwise, if the tier-flag and level-id are derived from the 1753 profile_tier_level( ) syntax structure in an SPS or VPS NAL 1754 unit, and the bitstream contains the highest sublayer 1755 representation in the OLS corresponding to the bitstream, the 1756 following applies: 1758 o tier-flag = general_tier_flag 1760 o level-id = general_level_idc 1762 Otherwise, if the tier-flag and level-id are derived from the 1763 profile_tier_level( ) syntax structure in an SPS or VPS NAL 1764 unit, and the bitstream does not contain the highest sublayer 1765 representation in the OLS corresponding to the bitstream, the 1766 following applies, with j being the value of the sprop- 1767 sublayer-id parameter: 1769 o tier-flag = general_tier_flag 1771 o level-id = sub_layer_level_idc[j] 1773 sub-profile-id: 1775 The value of the parameter is a comma-separated (',') list of 1776 data using base64 [RFC4648] representation. 
1778 When used to indicate properties of a bitstream, sub-profile-id 1779 is derived from each of the ptl_num_sub_profiles 1780 general_sub_profile_idc[i] syntax elements that apply to the 1781 bitstream in a profile_tier_level( ) syntax structure. 1783 interop-constraints: 1785 A base64 [RFC4648] representation of the data that includes the 1786 syntax elements ptl_frame_only_constraint_flag and 1787 ptl_multilayer_enabled_flag and the general_constraints_info( ) 1788 syntax structure that apply to the bitstream in an instance of 1789 the profile_tier_level( ) syntax structure. 1791 If the interop-constraints parameter is not present, the 1792 following MUST be inferred: 1794 o ptl_frame_only_constraint_flag = 1 1796 o ptl_multilayer_enabled_flag = 0 1798 o gci_present_flag in the general_constraints_info( ) syntax 1799 structure = 0 1801 Using interop-constraints for capability exchange results in a 1802 requirement on any bitstream to be compliant with the interop- 1803 constraints. 1805 sprop-sublayer-id: 1807 This parameter MAY be used to indicate the highest allowed 1808 value of TID in the bitstream. When not present, the value of 1809 sprop-sublayer-id is inferred to be equal to 6. 1811 The value of sprop-sublayer-id MUST be in the range of 0 to 6, 1812 inclusive. 1814 sprop-ols-id: 1816 This parameter MAY be used to indicate the OLS that the 1817 bitstream applies to. When not present, the value of sprop- 1818 ols-id is inferred to be equal to TargetOlsIdx as specified in 1819 8.1.1 in [VVC]. If this optional parameter is present, sprop- 1820 vps MUST also be present or its content MUST be known a priori 1821 at the receiver. 1823 The value of sprop-ols-id MUST be in the range of 0 to 256, 1824 inclusive. 1826 Informative note: VVC allows having up to 257 output layer 1827 sets indicated in the VPS as the number of output layer sets 1828 minus 2 is indicated with a field of 8 bits. 
1830 recv-sublayer-id: 1832 This parameter MAY be used to signal a receiver's choice of the 1833 offered or declared sublayer representations in the sprop-vps 1834 and sprop-sps. The value of recv-sublayer-id indicates the TID 1835 of the highest sublayer that a receiver supports. When not 1836 present, the value of recv-sublayer-id is inferred to be equal 1837 to the value of the sprop-sublayer-id parameter in the SDP 1838 offer. 1840 The value of recv-sublayer-id MUST be in the range of 0 to 6, 1841 inclusive. 1843 recv-ols-id: 1845 This parameter MAY be used to signal a receiver's choice of the 1846 offered or declared output layer sets in the sprop-vps. The 1847 value of recv-ols-id indicates the OLS index of the bitstream 1848 that a receiver supports. When not present, the value of recv- 1849 ols-id is inferred to be equal to the value of the sprop-ols-id 1850 parameter inferred from or indicated in the SDP offer. The 1851 recv-ols-id parameter must be included only when 1852 sprop-ols-id was received, and its value must refer to an output layer set 1853 in the VPS that includes no layers other than all or a subset 1854 of the layers of the OLS referred to by sprop-ols-id. If this 1855 optional parameter is present, sprop-vps must have been 1856 received or its content must be known a priori at the receiver. 1858 The value of recv-ols-id MUST be in the range of 0 to 256, 1859 inclusive. 1861 max-recv-level-id: 1863 This parameter MAY be used to indicate the highest level a 1864 receiver supports. 1866 The value of max-recv-level-id MUST be in the range of 0 to 1867 255, inclusive. 1869 When max-recv-level-id is not present, the value is inferred to 1870 be equal to level-id. 1872 max-recv-level-id MUST NOT be present when the highest level 1873 the receiver supports is not higher than the default level. 1875 sprop-dci: 1877 This parameter MAY be used to convey a decoding capability 1878 information NAL unit of the bitstream for out-of-band 1879 transmission.
The parameter MAY also be used for capability 1880 exchange. The value of the parameter is a base64 [RFC4648] 1881 representation of the decoding capability information NAL unit 1882 as specified in Section 7.3.2.1 of [VVC]. 1884 sprop-vps: 1886 This parameter MAY be used to convey any video parameter set 1887 NAL unit of the bitstream for out-of-band transmission of video 1888 parameter sets. The parameter MAY also be used for capability 1889 exchange and to indicate sub-stream characteristics (i.e., 1890 properties of output layer sets and sublayer representations as 1891 defined in [VVC]). The value of the parameter is a comma- 1892 separated (',') list of base64 [RFC4648] representations of the 1893 video parameter set NAL units as specified in Section 7.3.2.3 1894 of [VVC]. 1896 The sprop-vps parameter MAY contain one or more video 1897 parameter set NAL units. However, all other video parameter 1898 sets contained in the sprop-vps parameter MUST be consistent 1899 with the first video parameter set in the sprop-vps parameter. 1900 A video parameter set vpsB is said to be consistent with 1901 another video parameter set vpsA if the number of OLSs in vpsA 1902 and vpsB is the same and any decoder that conforms to the 1903 profile, tier, level, and constraints indicated by the data 1904 starting from the syntax element general_profile_idc to the 1905 syntax structure general_constraints_info(), inclusive, in the 1906 profile_tier_level( ) syntax structure corresponding to any OLS 1907 with index olsIdx in vpsA can decode any CVS(s) referencing 1908 vpsB when TargetOlsIdx is equal to olsIdx that conforms to the 1909 profile, tier, level, and constraints indicated by the data 1910 starting from the syntax element general_profile_idc to the 1911 syntax structure general_constraints_info(), inclusive, in the 1912 profile_tier_level( ) syntax structure corresponding to the OLS 1913 with index TargetOlsIdx in vpsB.
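Informative example (not part of this memo): the value format of sprop-vps described above, a comma-separated list of base64 representations of NAL units, can be decoded as sketched below in Python. The function names are hypothetical. The type extraction assumes the two-byte NAL unit header layout in which the NAL unit type occupies the five most significant bits of the second byte. No consistency check between the conveyed parameter sets is attempted.

```python
import base64
from typing import List


def decode_sprop_list(value: str) -> List[bytes]:
    """Split a comma-separated base64 list into raw NAL units."""
    nal_units = []
    for item in value.split(","):
        # validate=True rejects characters outside the base64 alphabet;
        # the raised binascii.Error is a subclass of ValueError.
        nal = base64.b64decode(item.strip(), validate=True)
        if len(nal) < 2:
            raise ValueError("NAL unit shorter than its 2-byte header")
        nal_units.append(nal)
    return nal_units


def nal_unit_type(nal: bytes) -> int:
    """Return the 5-bit NAL unit type from the second header byte."""
    return nal[1] >> 3
```

A receiver could apply nal_unit_type to every decoded entry to confirm that each one is of the kind the parameter is meant to carry before handing it to the decoder.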
1915 sprop-sps: 1917 This parameter MAY be used to convey sequence parameter set NAL 1918 units of the bitstream for out-of-band transmission of sequence 1919 parameter sets. The value of the parameter is a comma- 1920 separated (',') list of base64 [RFC4648] representations of the 1921 sequence parameter set NAL units as specified in 1922 Section 7.3.2.4 of [VVC]. 1924 A sequence parameter set spsB is said to be consistent with 1925 another sequence parameter set spsA if any decoder that 1926 conforms to the profile, tier, level, and constraints indicated 1927 by the data starting from the syntax element 1928 general_profile_idc to the syntax structure 1929 general_constraints_info(), inclusive, in the 1930 profile_tier_level( ) syntax structure in spsA can decode any 1931 CLVS(s) referencing spsB that conforms to the profile, tier, 1932 level, and constraints indicated by the data starting from the 1933 syntax element general_profile_idc to the syntax structure 1934 general_constraints_info(), inclusive, in the 1935 profile_tier_level( ) syntax structure in spsB. 1937 sprop-pps: 1939 This parameter MAY be used to convey picture parameter set NAL 1940 units of the bitstream for out-of-band transmission of picture 1941 parameter sets. The value of the parameter is a comma- 1942 separated (',') list of base64 [RFC4648] representations of the 1943 picture parameter set NAL units as specified in Section 7.3.2.5 1944 of [VVC]. 1946 sprop-sei: 1948 This parameter MAY be used to convey one or more SEI messages 1949 that describe bitstream characteristics. When present, a 1950 decoder can rely on the bitstream characteristics that are 1951 described in the SEI messages for the entire duration of the 1952 session, independently from the persistence scopes of the SEI 1953 messages as specified in [VSEI]. 1955 The value of the parameter is a comma-separated (',') list of 1956 base64 [RFC4648] representations of SEI NAL units as specified 1957 in [VSEI]. 
1959 Informative note: Intentionally, no list of applicable or 1960 inapplicable SEI messages is specified here. Conveying 1961 certain SEI messages in sprop-sei may be sensible in some 1962 application scenarios and meaningless in others. However, a 1963 few examples are described below: 1965 1) In an environment where the bitstream was created from 1966 film-based source material, and no splicing is going to 1967 occur during the lifetime of the session, the film grain 1968 characteristics SEI message is likely meaningful, and 1969 sending it in sprop-sei rather than in the bitstream at each 1970 entry point may help with saving bits and allows one to 1971 configure the renderer only once, avoiding unwanted 1972 artifacts. 1974 2) Examples of SEI messages that would be meaningless to 1975 convey in sprop-sei include the decoded picture hash SEI 1976 message (it is close to impossible that all decoded pictures 1977 have the same hash value) or the filler payload SEI message (as 1978 there is no point in just having more bits in SDP). 1980 max-lsr: 1982 The max-lsr parameter MAY be used to signal the capabilities of a 1983 receiver implementation and MUST NOT be used for any other 1984 purpose. The value of max-lsr is an integer indicating the 1985 maximum processing rate in units of luma samples per second. 1986 The max-lsr parameter signals that the receiver is capable of 1987 decoding video at a higher rate than is required by the highest 1988 level. 1990 Informative note: When the OPTIONAL media type parameters 1991 are used to signal the properties of a bitstream, and max- 1992 lsr is not present, the values of tier-flag, profile-id, 1993 sub-profile-id, interop-constraints, and level-id must always 1994 be such that the bitstream complies fully with the specified 1995 profile, tier, and level.
1997 When max-lsr is signalled, the receiver MUST be able to decode 1998 bitstreams that conform to the highest level, with the 1999 exception that the MaxLumaSr value in Table 136 of [VVC] for 2000 the highest level is replaced with the value of max-lsr. 2001 Senders MAY use this knowledge to send pictures of a given size 2002 at a higher picture rate than is indicated in the highest 2003 level. 2005 When not present, the value of max-lsr is inferred to be equal 2006 to the value of MaxLumaSr given in Table 136 of [VVC] for the 2007 highest level. 2009 The value of max-lsr MUST be in the range of MaxLumaSr to 16 * 2010 MaxLumaSr, inclusive, where MaxLumaSr is given in Table 136 of 2011 [VVC] for the highest level. 2013 max-fps: 2015 The value of max-fps is an integer indicating the maximum 2016 picture rate in units of pictures per 100 seconds that can be 2017 effectively processed by the receiver. The max-fps parameter 2018 MAY be used to signal that the receiver has a constraint in 2019 that it is not capable of processing video effectively at the 2020 full picture rate that is implied by the highest level and, 2021 when present, max-lsr. 2023 The value of max-fps is not necessarily the picture rate at 2024 which the maximum picture size can be sent; it constitutes a 2025 constraint on the maximum picture rate for all resolutions. 2027 Informative note: The max-fps parameter is semantically 2028 different from max-lsr in that max-fps is used to signal a 2029 constraint, lowering the maximum picture rate from what is 2030 implied by other parameters. 2032 The encoder MUST use a picture rate equal to or less than this 2033 value. In cases where the max-fps parameter is absent, the 2034 encoder is free to choose any picture rate according to the 2035 highest level and any signalled optional parameters. 2037 The value of max-fps MUST be smaller than or equal to the full 2038 picture rate that is implied by the highest level and, when 2039 present, max-lsr.
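Informative example (not part of this memo): the interaction between the luma sample rate limit and max-fps described above can be sketched in Python as follows. The function max_picture_rate and its arguments are hypothetical. The caller supplies the applicable luma sample rate, i.e., max-lsr when signalled or otherwise the MaxLumaSr value of the highest level, and the sketch deliberately ignores every other level limit. Because max-fps is expressed in pictures per 100 seconds, it is divided by 100 before the comparison.

```python
from fractions import Fraction
from typing import Optional


def max_picture_rate(width: int, height: int, luma_sample_rate: int,
                     max_fps: Optional[int] = None) -> Fraction:
    """Upper bound on the picture rate, in pictures per second.

    `luma_sample_rate` is in luma samples per second; `max_fps` is the
    max-fps parameter value in pictures per 100 seconds, or None when
    the parameter is absent.
    """
    # Rate allowed by the luma sample rate for this picture size.
    rate = Fraction(luma_sample_rate, width * height)
    if max_fps is not None:
        # max-fps can only lower the bound, for every resolution.
        rate = min(rate, Fraction(max_fps, 100))
    return rate
```

With an illustrative luma sample rate of 124416000 (a number chosen for the arithmetic, not taken from [VVC]) and 1920x1080 pictures, the bound is 60 pictures per second, and signalling max-fps=3000 lowers it to 30.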
2041 sprop-max-don-diff: 2043 If there is no NAL unit naluA that is followed in transmission 2044 order by any NAL unit preceding naluA in decoding order (i.e., 2045 the transmission order of the NAL units is the same as the 2046 decoding order), the value of this parameter MUST be equal to 2047 0. 2049 Otherwise, this parameter specifies the maximum absolute 2050 difference between the decoding order number (i.e., AbsDon) 2051 values of any two NAL units naluA and naluB, where naluA 2052 follows naluB in decoding order and precedes naluB in 2053 transmission order. 2055 The value of sprop-max-don-diff MUST be an integer in the range 2056 of 0 to 32767, inclusive. 2058 When not present, the value of sprop-max-don-diff is inferred 2059 to be equal to 0. 2061 sprop-depack-buf-bytes: 2063 This parameter signals the required size of the de- 2064 packetization buffer in units of bytes. The value of the 2065 parameter MUST be greater than or equal to the maximum buffer 2066 occupancy (in units of bytes) of the de-packetization buffer as 2067 specified in Section 6. 2069 The value of sprop-depack-buf-bytes MUST be an integer in the 2070 range of 0 to 4294967295, inclusive. 2072 When sprop-max-don-diff is present and greater than 0, this 2073 parameter MUST be present and the value MUST be greater than 0. 2074 When not present, the value of sprop-depack-buf-bytes is 2075 inferred to be equal to 0. 2077 Informative note: The value of sprop-depack-buf-bytes 2078 indicates the required size of the de-packetization buffer 2079 only. When network jitter can occur, an appropriately sized 2080 jitter buffer has to be available as well. 2082 depack-buf-cap: 2084 This parameter signals the capabilities of a receiver 2085 implementation and indicates the amount of de-packetization 2086 buffer space in units of bytes that the receiver has available 2087 for reconstructing the NAL unit decoding order from NAL units 2088 carried in the RTP stream. 
         A receiver is able to handle any RTP stream for which the value
         of the sprop-depack-buf-bytes parameter is smaller than or
         equal to this parameter.

         When not present, the value of depack-buf-cap is inferred to be
         equal to 4294967295.  The value of depack-buf-cap MUST be an
         integer in the range of 1 to 4294967295, inclusive.

            Informative note: depack-buf-cap indicates the maximum
            possible size of the de-packetization buffer of the receiver
            only, without allowing for network jitter.

7.2.  SDP Parameters

   The receiver MUST ignore any parameter unspecified in this memo.

7.2.1.  Mapping of Payload Type Parameters to SDP

   The media type video/H266 string is mapped to fields in the Session
   Description Protocol (SDP) [RFC4566] as follows:

   *  The media name in the "m=" line of SDP MUST be video.

   *  The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the
      media subtype).

   *  The clock rate in the "a=rtpmap" line MUST be 90000.

   *  The OPTIONAL parameters profile-id, tier-flag, sub-profile-id,
      interop-constraints, level-id, sprop-sublayer-id, sprop-ols-id,
      recv-sublayer-id, recv-ols-id, max-recv-level-id, max-lsr,
      max-fps, sprop-max-don-diff, sprop-depack-buf-bytes, and
      depack-buf-cap, when present, MUST be included in the "a=fmtp"
      line of SDP.  These parameters are expressed as a media type
      string, in the form of a semicolon-separated list of
      parameter=value pairs.

   *  The OPTIONAL parameters sprop-vps, sprop-sps, sprop-pps,
      sprop-sei, and sprop-dci, when present, MUST be included in the
      "a=fmtp" line of SDP or conveyed using the "fmtp" source attribute
      as specified in Section 6.3 of [RFC5576].  For a particular media
      format (i.e., RTP payload type), sprop-vps, sprop-sps, sprop-pps,
      sprop-sei, or sprop-dci MUST NOT be both included in the "a=fmtp"
      line of SDP and conveyed using the "fmtp" source attribute.
      When included in the "a=fmtp" line of SDP, those parameters are
      expressed as a media type string, in the form of a
      semicolon-separated list of parameter=value pairs.  When conveyed
      in the "a=fmtp" line of SDP for a particular payload type, the
      parameters sprop-vps, sprop-sps, sprop-pps, sprop-sei, and
      sprop-dci MUST be applied to each SSRC with the payload type.
      When conveyed using the "fmtp" source attribute, these parameters
      are only associated with the given source and payload type as
      parts of the "fmtp" source attribute.

   An example of media representation in SDP is as follows:

      m=video 49170 RTP/AVP 98
      a=rtpmap:98 H266/90000
      a=fmtp:98 profile-id=1;
        sprop-vps=