idnits 2.17.1 draft-ietf-avtcore-rtp-vvc-10.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document date (July 09, 2021) is 1015 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '0' on line 1381 ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866) ** Downref: Normative reference to an Informational RFC: RFC 7656 -- Possible downref: Non-RFC (?) normative reference: ref. 'VSEI' -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC' -- Obsolete informational reference (is this intentional?): RFC 2326 (Obsoleted by RFC 7826) Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 5 comments (--). Run idnits with the --verbose option for more detailed information about the items above. 
-------------------------------------------------------------------------------- 2 avtcore S. Zhao 3 Internet-Draft S. Wenger 4 Intended status: Standards Track Tencent 5 Expires: January 10, 2022 Y. Sanchez 6 Fraunhofer HHI 7 Y. Wang 8 Bytedance Inc. 9 July 09, 2021 11 RTP Payload Format for Versatile Video Coding (VVC) 12 draft-ietf-avtcore-rtp-vvc-10 14 Abstract 16 This memo describes an RTP payload format for the video coding 17 standard ITU-T Recommendation H.266 and ISO/IEC International 18 Standard 23090-3, both also known as Versatile Video Coding (VVC) and 19 developed by the Joint Video Experts Team (JVET). The RTP payload 20 format allows for packetization of one or more Network Abstraction 21 Layer (NAL) units in each RTP packet payload as well as fragmentation 22 of a NAL unit into multiple RTP packets. The payload format has wide 23 applicability in videoconferencing, Internet video streaming, and 24 high-bitrate entertainment-quality video, among other applications. 26 Status of This Memo 28 This Internet-Draft is submitted in full conformance with the 29 provisions of BCP 78 and BCP 79. 31 Internet-Drafts are working documents of the Internet Engineering 32 Task Force (IETF). Note that other groups may also distribute 33 working documents as Internet-Drafts. The list of current Internet- 34 Drafts is at https://datatracker.ietf.org/drafts/current/. 36 Internet-Drafts are draft documents valid for a maximum of six months 37 and may be updated, replaced, or obsoleted by other documents at any 38 time. It is inappropriate to use Internet-Drafts as reference 39 material or to cite them other than as "work in progress." 41 This Internet-Draft will expire on January 10, 2022. 43 Copyright Notice 45 Copyright (c) 2021 IETF Trust and the persons identified as the 46 document authors. All rights reserved. 
48 This document is subject to BCP 78 and the IETF Trust's Legal 49 Provisions Relating to IETF Documents 50 (https://trustee.ietf.org/license-info) in effect on the date of 51 publication of this document. Please review these documents 52 carefully, as they describe your rights and restrictions with respect 53 to this document. Code Components extracted from this document must 54 include Simplified BSD License text as described in Section 4.e of 55 the Trust Legal Provisions and are provided without warranty as 56 described in the Simplified BSD License. 58 Table of Contents 60 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 61 1.1. Overview of the VVC Codec . . . . . . . . . . . . . . . . 3 62 1.1.1. Coding-Tool Features (informative) . . . . . . . . . 4 63 1.1.2. Systems and Transport Interfaces (informative) . . . 6 64 1.1.3. High-Level Picture Partitioning (informative) . . . . 11 65 1.1.4. NAL Unit Header . . . . . . . . . . . . . . . . . . . 13 66 1.2. Overview of the Payload Format . . . . . . . . . . . . . 15 67 2. Conventions . . . . . . . . . . . . . . . . . . . . . . . . . 15 68 3. Definitions and Abbreviations . . . . . . . . . . . . . . . . 15 69 3.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 15 70 3.1.1. Definitions from the VVC Specification . . . . . . . 16 71 3.1.2. Definitions Specific to This Memo . . . . . . . . . . 19 72 3.2. Abbreviations . . . . . . . . . . . . . . . . . . . . . . 19 73 4. RTP Payload Format . . . . . . . . . . . . . . . . . . . . . 20 74 4.1. RTP Header Usage . . . . . . . . . . . . . . . . . . . . 20 75 4.2. Payload Header Usage . . . . . . . . . . . . . . . . . . 22 76 4.3. Payload Structures . . . . . . . . . . . . . . . . . . . 22 77 4.3.1. Single NAL Unit Packets . . . . . . . . . . . . . . . 23 78 4.3.2. Aggregation Packets (APs) . . . . . . . . . . . . . . 23 79 4.3.3. Fragmentation Units . . . . . . . . . . . . . . . . . 27 80 4.4. Decoding Order Number . . . . . . . . . . . . . . . 
. . . 30 81 5. Packetization Rules . . . . . . . . . . . . . . . . . . . . . 32 82 6. De-packetization Process . . . . . . . . . . . . . . . . . . 32 83 7. Payload Format Parameters . . . . . . . . . . . . . . . . . . 34 84 7.1. Media Type Registration . . . . . . . . . . . . . . . . . 35 85 7.2. SDP Parameters . . . . . . . . . . . . . . . . . . . . . 48 86 7.2.1. Mapping of Payload Type Parameters to SDP . . . . . . 48 87 7.2.2. Usage with SDP Offer/Answer Model . . . . . . . . . . 49 88 7.2.3. Usage in Declarative Session Descriptions . . . . . . 59 89 7.2.4. Considerations for Parameter Sets . . . . . . . . . . 60 90 8. Use with Feedback Messages . . . . . . . . . . . . . . . . . 60 91 8.1. Picture Loss Indication (PLI) . . . . . . . . . . . . . . 61 92 8.2. Full Intra Request (FIR) . . . . . . . . . . . . . . . . 61 93 9. Security Considerations . . . . . . . . . . . . . . . . . . . 61 94 10. Congestion Control . . . . . . . . . . . . . . . . . . . . . 63 95 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 64 96 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 64 97 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 64 98 13.1. Normative References . . . . . . . . . . . . . . . . . . 64 99 13.2. Informative References . . . . . . . . . . . . . . . . . 66 100 Appendix A. Change History . . . . . . . . . . . . . . . . . . . 67 101 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 68 103 1. Introduction 105 The Versatile Video Coding [VVC] specification, formally published as 106 both ITU-T Recommendation H.266 and ISO/IEC International Standard 107 23090-3, is currently in the ITU-T publication process and the ISO/ 108 IEC approval process. VVC is reported to provide significant coding 109 efficiency gains over HEVC [HEVC] as known as H.265, and other 110 earlier video codecs. 112 This memo specifies an RTP payload format for VVC. 
It shares its 113 basic design with the NAL (Network Abstraction Layer) unit-based RTP 114 payload formats of H.264 Video Coding [RFC6184], Scalable Video 115 Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798] 116 and their respective predecessors. With respect to design 117 philosophy, security, congestion control, and overall implementation 118 complexity, it has similar properties to those earlier payload format 119 specifications. This is a conscious choice, as at least RFC 6184 is 120 widely deployed and generally known in the relevant implementer 121 communities. Certain scalability-related mechanisms known from 122 [RFC6190] were incorporated into this document, as VVC version 1 123 supports temporal, spatial, and signal-to-noise ratio (SNR) 124 scalability. 126 1.1. Overview of the VVC Codec 128 VVC and HEVC share a similar hybrid video codec design. In this 129 memo, we provide a very brief overview of those features of VVC that 130 are, in some form, addressed by the payload format specified herein. 131 Implementers have to read, understand, and apply the ITU-T/ISO/IEC 132 specifications pertaining to VVC to arrive at interoperable, well- 133 performing implementations. 135 Conceptually, both VVC and HEVC include a Video Coding Layer (VCL), 136 which is often used to refer to the coding-tool features, and a NAL, 137 which is often used to refer to the systems and transport interface 138 aspects of the codecs. 140 1.1.1. Coding-Tool Features (informative) 142 Coding tool features are described below with occasional reference to 143 the coding tool set of HEVC, which is well known in the community. 145 Similar to earlier hybrid-video-coding-based standards, including 146 HEVC, the following basic video coding design is employed by VVC. A 147 prediction signal is first formed by either intra- or motion- 148 compensated prediction, and the residual (the difference between the 149 original and the prediction) is then coded. 
The gains in coding 150 efficiency are achieved by redesigning and improving almost all parts 151 of the codec over earlier designs. In addition, VVC includes several 152 tools to make the implementation on parallel architectures easier. 154 Finally, VVC includes temporal, spatial, and SNR scalability as well 155 as multiview coding support. 157 Coding blocks and transform structure 159 Among the major coding-tool differences between HEVC and VVC, one of the 160 important improvements is the more flexible coding tree structure in 161 VVC, i.e., the multi-type tree. In addition to the quadtree, binary and 162 ternary trees are also supported, which contributes a significant 163 improvement in coding efficiency. Moreover, the maximum size of 164 the coding tree unit (CTU) is increased from 64x64 to 128x128. To 165 improve the coding efficiency of the chroma signal, separate luma and chroma 166 trees at the CTU level may be employed for intra-slices. The square 167 transforms in HEVC are extended to non-square transforms for 168 rectangular blocks resulting from binary and ternary tree splits. 169 In addition, VVC supports multiple transform sets (MTS), including DCT-2, 170 DST-7, and DCT-8 as well as the non-separable secondary transform. 171 The transforms used in VVC can have different sizes with support for 172 larger transform sizes. For DCT-2, the transform sizes range from 173 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range from 174 4x4 to 32x32. VVC also supports sub-block transforms for 175 both intra- and inter-coded blocks. For intra-coded blocks, intra 176 sub-partitioning (ISP) may be used to allow sub-block-based intra 177 prediction and transform. For inter-coded blocks, a sub-block transform may 178 be used under the assumption that only a part of an inter-coded block has non-zero 179 transform coefficients. 
181 Entropy coding 183 Similar to HEVC, VVC uses a single entropy-coding engine, which is 184 based on context adaptive binary arithmetic coding [CABAC], but with 185 the support of multi-window sizes. The window sizes can be 186 initialized differently for different context models. Due to such a 187 design, it has more efficient adaptation speed and better coding 188 efficiency. A joint chroma residual coding scheme is applied to 189 further exploit the correlation between the residuals of two color 190 components. In VVC, different residual coding schemes are applied 191 for regular transform coefficients and residual samples generated 192 using transform-skip mode. 194 In-loop filtering 196 VVC has more feature support in loop filters than HEVC. The 197 deblocking filter in VVC is similar to HEVC but operates at a smaller 198 grid. After deblocking and sample adaptive offset (SAO), an adaptive 199 loop filter (ALF) may be used. As a Wiener filter, ALF reduces 200 distortion of decoded pictures. Besides, VVC introduces a new module 201 before deblocking called luma mapping with chroma scaling to fully 202 utilize the dynamic range of signal so that rate-distortion 203 performance of both SDR and HDR content is improved. 205 Motion prediction and coding 207 Compared to HEVC, VVC introduces several improvements in this area. 208 First, there is the adaptive motion vector resolution (AMVR), which 209 can save bit cost for motion vectors by adaptively signaling motion 210 vector resolution. Then the affine motion compensation is included 211 to capture complicated motion like zooming and rotation. Meanwhile, 212 prediction refinement with the optical flow with affine mode (PROF) 213 is further deployed to mimic affine motion at the pixel level. 214 Thirdly the decoder side motion vector refinement (DMVR) is a method 215 to derive MV vector at decoder side based on block matching so that 216 fewer bits may be spent on motion vectors. 
Bi-directional optical 217 flow (BDOF) is a similar method to PROF. BDOF adds a sample-wise 218 offset at the 4x4 sub-block level that is derived with equations based on 219 gradients of the prediction samples and a motion difference relative 220 to CU motion vectors. Furthermore, merge with motion vector 221 difference (MMVD) is a special mode, which further signals a limited 222 set of motion vector differences on top of merge mode. In addition 223 to MMVD, there are three other types of special merge modes, i.e., 224 sub-block merge, triangle, and combined intra-/inter-prediction 225 (CIIP). The sub-block merge list includes one candidate of sub-block 226 temporal motion vector prediction (SbTMVP) and up to four candidates 227 of affine motion vectors. Triangle is based on triangular block 228 motion compensation. CIIP combines intra- and inter-predictions 229 with weighting. Adaptive weighting may be employed with a block- 230 level tool called bi-prediction with CU based weighting (BCW) which 231 provides more flexibility than in HEVC. 233 Intra prediction and intra-coding 234 To capture the diversified local image texture directions with finer 235 granularity, VVC supports 65 angular directions instead of the 33 236 directions in HEVC. The intra mode coding is based on a 6-most- 237 probable-mode scheme, and the 6 most probable modes are derived using 238 the neighboring intra prediction directions. In addition, to deal 239 with the different distributions of intra prediction angles for 240 different block aspect ratios, a wide-angle intra prediction (WAIP) 241 scheme is applied in VVC by including intra prediction angles beyond 242 those present in HEVC. Unlike HEVC, which only allows using the most 243 adjacent line of reference samples for intra prediction, VVC also 244 allows using two further reference lines, also known as multi- 245 reference-line (MRL) intra prediction. 
The additional reference 246 lines can be only used for the 6 most probable intra prediction 247 modes. To capture the strong correlation between different colour 248 components, in VVC, a cross-component linear mode (CCLM) is utilized 249 which assumes a linear relationship between the luma sample values 250 and their associated chroma samples. For intra prediction, VVC also 251 applies a position-dependent prediction combination (PDPC) for 252 refining the prediction samples closer to the intra prediction block 253 boundary. Matrix-based intra prediction (MIP) modes are also used in 254 VVC which generates an up to 8x8 intra prediction block using a 255 weighted sum of downsampled neighboring reference samples, and the 256 weights are hardcoded constants. 258 Other coding-tool feature 260 VVC introduces dependent quantization (DQ) to reduce quantization 261 error by state-based switching between two quantizers. 263 1.1.2. Systems and Transport Interfaces (informative) 265 VVC inherits the basic systems and transport interfaces designs from 266 HEVC and H.264. These include the NAL-unit-based syntax structure, 267 the hierarchical syntax and data unit structure, the supplemental 268 enhancement information (SEI) message mechanism, and the video 269 buffering model based on the hypothetical reference decoder (HRD). 270 The scalability features of VVC are conceptually similar to the 271 scalable variant of HEVC known as SHVC. The hierarchical syntax and 272 data unit structure consists of parameter sets at various levels 273 (decoder, sequence (pertaining to all), sequence (pertaining to a 274 single), picture), picture-level header parameters, slice-level 275 header parameters, and lower-level parameters. 
277 A number of key components that influenced the network abstraction 278 layer design of VVC as well as this memo are described below. 280 Decoding capability information 281 The decoding capability information (DCI) includes parameters that stay 282 constant for the lifetime of a Video Bitstream, which in IETF terms 283 can translate to the lifetime of a session. Such information 284 includes profile, level, and sub-profile information to determine a 285 maximum capability interop point that is guaranteed never to be 286 exceeded, even if splicing of video sequences occurs within a 287 session. It further includes constraint fields (most of which are 288 flags), which can optionally be set to indicate that the video 289 bitstream will be constrained in the use of certain features as 290 indicated by the values of those fields. With this, a bitstream can 291 be labelled as not using certain tools, which allows, among other 292 things, for resource allocation in a decoder implementation. 294 Video parameter set 296 The video parameter set (VPS) pertains to a coded video sequence 297 (CVS) of multiple layers covering the same range of access units, and 298 includes, among other information, decoding dependency expressed as 299 information for reference picture list construction of enhancement 300 layers. The VPS provides a "big picture" of a scalable sequence, 301 including what types of operation points are provided, the profile, 302 tier, and level of the operation points, and some other high-level 303 properties of the bitstream that can be used as the basis for session 304 negotiation and content selection, etc. One VPS may be referenced by 305 one or more sequence parameter sets. 
307 Sequence parameter set 309 The sequence parameter set (SPS) contains syntax elements pertaining 310 to a coded layer video sequence (CLVS), which is a group of pictures 311 belonging to the same layer, starting with a random access point, and 312 followed by pictures that may depend on each other, until the next 313 random access point picture. In MPEG-2, the equivalent of a CVS was 314 a group of pictures (GOP), which normally started with an I frame and 315 was followed by P and B frames. While more complex in its options of 316 random access points, VVC retains this basic concept. One notable 317 difference of VVC is that a CLVS may start with a Gradual Decoding 318 Refresh (GDR) picture, without requiring the presence of traditional 319 random access points in the bitstream, such as instantaneous decoding 320 refresh (IDR) or clean random access (CRA) pictures. In many TV-like 321 applications, a CVS contains a few hundred milliseconds to a few 322 seconds of video. In video conferencing (without switching MCUs 323 involved), a CVS can be as long in duration as the whole session. 325 Picture and adaptation parameter set 327 The picture parameter set and the adaptation parameter set (PPS and 328 APS, respectively) carry information pertaining to zero or more 329 pictures and zero or more slices, respectively. The PPS contains 330 information that is likely to stay constant from picture to picture 331 (at least for pictures of a certain type), whereas the APS contains 332 information, such as adaptive loop filter coefficients, that is 333 likely to change from picture to picture or even within a picture. A 334 single APS is referenced by all slices of the same picture if that 335 APS contains information about luma mapping with chroma scaling 336 (LMCS) or scaling list. Different APSs containing ALF parameters can 337 be referenced by slices of the same picture. 
339 Picture header 341 A picture header (PH) contains information that is common to all slices 342 that belong to the same picture. Being able to send that information 343 as a separate NAL unit when pictures are split into several slices 344 saves bitrate compared to repeating the same information 345 in all slices. However, there might be scenarios where low-bitrate 346 video is transmitted using a single slice per picture. Having a 347 separate NAL unit to convey that information incurs overhead 348 in such scenarios. In those cases, the picture header syntax 349 structure is directly included in the slice header, instead of in its 350 own NAL unit. Whether the picture header syntax structure is 351 carried in its own NAL unit can only be switched for 352 an entire CLVS, and the separate NAL unit can only be omitted when 353 each picture in the entire CLVS contains only one slice. 355 Profile, tier, and level 357 The profile, tier, and level syntax structures in the DCI, VPS, and SPS 358 contain profile, tier, and level information for all layers that refer to 359 the DCI, for layers associated with one or more output layer sets 360 specified by the VPS, and for any layer that refers to the SPS, 361 respectively. 363 Sub-profiles 365 Within the VVC specification, a sub-profile is a 32-bit number, coded 366 according to ITU-T Rec. T.35, that does not carry any semantics. It is 367 carried in the profile_tier_level structure and hence (potentially) 368 present in the DCI, VPS, and SPS. External registration bodies can 369 register a T.35 codepoint with ITU-T registration authorities and 370 associate with their registration a description of bitstream 371 restrictions beyond the profiles defined by ITU-T and ISO/IEC. This 372 would allow encoder manufacturers to label the bitstreams generated 373 by their encoder as complying with such a sub-profile. 
It is expected 374 that upstream standardization organizations (such as DVB and ATSC), 375 as well as walled-garden video services, will take advantage of this 376 labelling system. In contrast to "normal" profiles, it is expected 377 that sub-profiles may indicate encoder choices traditionally left 378 open in the (decoder-centric) video coding specs, such as GOP 379 structures, minimum/maximum QP values, and the mandatory use of 380 certain tools or SEI messages. 382 General constraint fields 384 The profile_tier_level structure carries a considerable number of 385 constraint fields (most of which are flags), which an encoder can use 386 to indicate to a decoder that it will not use a certain tool or 387 technology. They were included in reaction to a perceived market 388 need for labelling a bitstream as not exercising a certain tool that 389 has become commercially unviable. 391 Temporal scalability support 393 VVC includes support for temporal scalability, by inclusion of the 394 signaling of TemporalId in the NAL unit header, the restriction that 395 pictures of a particular temporal sublayer cannot be used for inter 396 prediction reference by pictures of a lower temporal sublayer, the 397 sub-bitstream extraction process, and the requirement that each sub- 398 bitstream extraction output be a conforming bitstream. Media-Aware 399 Network Elements (MANEs) can utilize the TemporalId in the NAL unit 400 header for stream adaptation purposes based on temporal scalability. 402 Reference picture resampling (RPR) 404 In AVC and HEVC, the spatial resolution of pictures cannot change 405 unless a new sequence using a new SPS starts with an intra random access point (IRAP) picture. 406 VVC enables picture resolution change within a sequence at a position 407 without encoding an IRAP picture, which is always intra-coded. 
This 408 feature is sometimes referred to as reference picture resampling 409 (RPR), as the feature needs resampling of a reference picture used 410 for inter prediction when that reference picture has a different 411 resolution than the current picture being decoded. RPR allows 412 resolution change without the need of coding an IRAP picture, which 413 causes a momentary bit rate spike in streaming or video conferencing 414 scenarios, e.g., to cope with network condition changes. RPR can 415 also be used in application scenarios wherein zooming of the entire 416 video region or some region of interest is needed. 418 Spatial, SNR, and multiview scalability 420 VVC includes support for spatial, SNR, and multiview scalability. 421 Scalable video coding is widely considered to have technical benefits 422 and enrich services for various video applications. Until recently, 423 however, the functionality has not been included in the first version 424 of specifications of the video codecs. In VVC, however, all those 425 forms of scalability are supported in the first version of VVC 426 natively through the signaling of the layer_id in the NAL unit 427 header, the VPS which associates layers with given layer_ids to each 428 other, reference picture selection, reference picture resampling for 429 spatial scalability, and a number of other mechanisms not relevant 430 for this memo. 432 Spatial scalability 434 With the existence of Reference Picture Resampling (RPR), the 435 additional burden for scalability support is just a 436 modification of the high-level syntax (HLS). The inter-layer 437 prediction is employed in a scalable system to improve the 438 coding efficiency of the enhancement layers. 
In addition to 439 the spatial and temporal motion-compensated predictions that 440 are available in a single-layer codec, the inter-layer 441 prediction in VVC uses the possibly resampled video data of the 442 reconstructed reference picture from a reference layer to 443 predict the current enhancement layer. The resampling process 444 for inter-layer prediction, when used, is performed at the 445 block level, reusing the existing interpolation process for 446 motion compensation in single-layer coding. This means that no 447 additional resampling process is needed to support spatial 448 scalability. 450 SNR scalability 452 SNR scalability is similar to spatial scalability except that 453 the resampling factors are 1:1. In other words, there is no 454 change in resolution, but there is inter-layer prediction. 456 Multiview scalability 458 The first version of VVC also supports multiview scalability, 459 wherein a multi-layer bitstream carries layers representing 460 multiple views, and one or more of the represented views can be 461 output at the same time. 463 SEI messages 465 Supplemental enhancement information (SEI) messages carry information 466 in the bitstream that does not influence the decoding process as 467 specified in the VVC spec, but addresses issues of representation/ 468 rendering of the decoded bitstream or labels the bitstream for certain 469 applications, among other similar tasks. The overall concept of SEI 470 messages and many of the messages themselves have been inherited from 471 the H.264 and HEVC specs. Except for the SEI messages that affect 472 the specification of the hypothetical reference decoder (HRD), other 473 SEI messages for use in the VVC environment, which are generally 474 useful also in other video coding technologies, are not included in 475 the main VVC specification but in a companion specification [VSEI]. 477 1.1.3. 
High-Level Picture Partitioning (informative) 479 VVC inherited the concept of tiles and wavefront parallel processing 480 (WPP) from HEVC, with some minor to moderate differences. The basic 481 concept of slices was kept in VVC but designed in an essentially 482 different form. VVC is the first video coding standard that includes 483 subpictures as a feature, which provides the same functionality as 484 HEVC motion-constrained tile sets (MCTSs) but designed differently to 485 have better coding efficiency and to be friendlier for usage in 486 application systems. More details of these differences are described 487 below. 489 Tiles and WPP 491 Same as in HEVC, a picture can be split into tile rows and tile 492 columns in VVC, in-picture prediction across tile boundaries is 493 disallowed, etc. However, the syntax for signaling of tile 494 partitioning has been simplified, by using a unified syntax design 495 for both the uniform and the non-uniform mode. In addition, 496 signaling of entry point offsets for tiles in the slice header is 497 optional in VVC while it is mandatory in HEVC. The WPP design in VVC 498 has two differences compared to HEVC: i) The CTU row delay is reduced 499 from two CTUs to one CTU; ii) Signaling of entry point offsets for 500 WPP in the slice header is optional in VVC while it is mandatory in 501 HEVC. 503 Slices 505 In VVC, the conventional slices based on CTUs (as in HEVC) or 506 macroblocks (as in AVC) have been removed. The main reasoning behind 507 this architectural change is as follows. The advances in video 508 coding since 2003 (the publication year of AVC v1) have been such 509 that slice-based error concealment has become practically impossible, 510 due to the ever-increasing number and efficiency of in-picture and 511 inter-picture prediction mechanisms. 
An error-concealed picture is 512 the decoding result of a transmitted coded picture for which there is 513 some data loss (e.g., loss of some slices) of the coded picture or a 514 reference picture for at least some part of the coded picture is not 515 error-free (e.g., that reference picture was an error-concealed 516 picture). For example, when one of the multiple slices of a picture 517 is lost, it may be error-concealed using an interpolation of the 518 neighboring slices. While advanced video coding prediction 519 mechanisms provide significantly higher coding efficiency, they also 520 make it harder for machines to estimate the quality of an error- 521 concealed picture, which was already a hard problem with the use of 522 simpler prediction mechanisms. Advanced in-picture prediction 523 mechanisms also cause the coding efficiency loss due to splitting a 524 picture into multiple slices to be more significant. Furthermore, 525 network conditions become significantly better while at the same time 526 techniques for dealing with packet losses have become significantly 527 improved. As a result, very few implementations have recently used 528 slices for maximum transmission unit size matching. Instead, 529 substantially all applications where low-delay error resilience is 530 required (e.g., video telephony and video conferencing) rely on 531 system/transport-level error resilience (e.g., retransmission, 532 forward error correction) and/or picture-based error resilience tools 533 (feedback-based error resilience, insertion of IRAPs, scalability 534 with higher protection level of the base layer, and so on). 
535 Considering all the above, nowadays it is very rare that a picture 536 that cannot be correctly decoded is passed to the decoder, and when 537 such a rare case occurs, the system can afford to wait for an error- 538 free picture to be decoded and available for display without 539 resulting in frequent and long periods of picture freezing seen by 540 end users. 542 Slices in VVC have two modes: rectangular slices and raster-scan 543 slices. The rectangular slice, as indicated by its name, covers a 544 rectangular region of the picture. Typically, a rectangular slice 545 consists of several complete tiles. However, it is also possible 546 that a rectangular slice is a subset of a tile and consists of one or 547 more consecutive, complete CTU rows within a tile. A raster-scan 548 slice consists of one or more complete tiles in tile raster-scan 549 order; hence, the region covered by a raster-scan slice need not 550 be rectangular, although it may happen to have 551 the shape of a rectangle. The concept of slices in VVC is therefore 552 strongly linked to or based on tiles instead of CTUs (as in HEVC) or 553 macroblocks (as in AVC). 555 Subpictures 557 VVC is the first video coding standard that includes the support of 558 subpictures as a feature. Each subpicture consists of one or more 559 complete rectangular slices that collectively cover a rectangular 560 region of the picture. A subpicture may be specified to be either 561 extractable (i.e., coded independently of other subpictures of the 562 same picture and of earlier pictures in decoding order) or not 563 extractable. Regardless of whether a subpicture is extractable or 564 not, the encoder can control whether in-loop filtering (including 565 deblocking, SAO, and ALF) is applied across the subpicture boundaries 566 individually for each subpicture. 568 Functionally, subpictures are similar to the motion-constrained tile 569 sets (MCTSs) in HEVC. 
They both allow independent coding and extraction of a rectangular
subset of a sequence of coded pictures, for use cases like
viewport-dependent 360-degree video streaming optimization and region
of interest (ROI) applications.

There are several important design differences between subpictures
and MCTSs. First, the subpictures feature in VVC allows motion
vectors of a coding block to point outside of the subpicture, even
when the subpicture is extractable, by applying sample padding at
subpicture boundaries in this case, similar to what is done at
picture boundaries. Second, additional changes were introduced for
the selection and derivation of motion vectors in the merge mode and
in the decoder-side motion vector refinement process of VVC. This
allows higher coding efficiency compared to the non-normative motion
constraints applied at the encoder side for MCTSs. Third, rewriting
of SHs (and PH NAL units, when present) is not needed when extracting
one or more extractable subpictures from a sequence of pictures to
create a sub-bitstream that is a conforming bitstream. In
sub-bitstream extractions based on HEVC MCTSs, rewriting of SHs is
needed. Note that in both HEVC MCTS extraction and VVC subpicture
extraction, rewriting of SPSs and PPSs is needed. However, typically
there are only a few parameter sets in a bitstream, while each
picture has at least one slice; therefore, rewriting of SHs can be a
significant burden for application systems. Fourth, slices of
different subpictures within a picture are allowed to have different
NAL unit types. Fifth, VVC specifies HRD and level definitions for
subpicture sequences; thus, the conformance of the sub-bitstream of
each extractable subpicture sequence can be ensured by encoders.

1.1.4. NAL Unit Header

VVC maintains the NAL unit concept of HEVC with modifications. VVC
uses a two-byte NAL unit header, as shown in Figure 1.
The payload of a NAL unit refers to the NAL unit excluding the NAL
unit header.

+---------------+---------------+
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|Z|  LayerID  |  Type   | TID |
+---------------+---------------+

           The Structure of the VVC NAL Unit Header

                          Figure 1

The semantics of the fields in the NAL unit header are as specified
in VVC and described briefly below for convenience. In addition to
the name and size of each field, the corresponding syntax element
name in VVC is also provided.

F: 1 bit

   forbidden_zero_bit. Required to be zero in VVC. Note that the
   inclusion of this bit in the NAL unit header was to enable
   transport of VVC video over MPEG-2 transport systems (avoidance of
   start code emulations) [MPEG2S]. In the context of this memo, the
   value 1 may be used to indicate a syntax violation, e.g., for a
   NAL unit resulting from aggregating a number of fragmented units
   of a NAL unit but missing the last fragment, as described in the
   last sentence of Section 4.3.3.

Z: 1 bit

   nuh_reserved_zero_bit. Required to be zero in VVC, and reserved
   for future extensions by ITU-T and ISO/IEC.
   This memo does not overload the "Z" bit for local extensions,
   because a) overloading the "F" bit is sufficient and b) leaving
   the "Z" bit untouched preserves the usefulness of this memo to
   possible future versions of [VVC].

LayerId: 6 bits

   nuh_layer_id. Identifies the layer a NAL unit belongs to, wherein
   a layer may be, e.g., a spatial scalable layer, a quality scalable
   layer, a layer containing a different view, etc.

Type: 5 bits

   nal_unit_type. This field specifies the NAL unit type as defined
   in Table 5 of [VVC]. For a reference of all currently defined NAL
   unit types and their semantics, please refer to Section 7.4.2.2 in
   [VVC].

TID: 3 bits

   nuh_temporal_id_plus1. This field specifies the temporal
   identifier of the NAL unit plus 1.
   The value of TemporalId is equal to TID minus 1. A TID value of 0
   is illegal to ensure that there is at least one bit in the NAL
   unit header equal to 1, so as to enable the consideration of start
   code emulations in the NAL unit payload data independent of the
   NAL unit header.

1.2. Overview of the Payload Format

This payload format defines the following processes required for
transport of VVC coded data over RTP [RFC3550]:

o Usage of RTP header with this payload format

o Packetization of VVC coded NAL units into RTP packets using three
  types of payload structures: a single NAL unit packet, aggregation
  packet, and fragmentation unit

o Transmission of VVC NAL units of the same bitstream within a
  single RTP stream

o Media type parameters to be used with the Session Description
  Protocol (SDP) [RFC4566]

o Usage of RTCP feedback messages

2. Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in BCP
14 [RFC2119] [RFC8174] when, and only when, they appear in all
capitals, as shown above.

3. Definitions and Abbreviations

3.1. Definitions

This document uses the terms and definitions of VVC. Section 3.1.1
lists relevant definitions from [VVC] for convenience. Section 3.1.2
provides definitions specific to this memo. All the terms and
definitions from [VVC] used in this memo are verbatim copies of the
[VVC] specification.

3.1.1. Definitions from the VVC Specification

Access unit (AU): A set of PUs that belong to different layers and
contain coded pictures associated with the same time for output from
the DPB.

Adaptation parameter set (APS): A syntax structure containing syntax
elements that apply to zero or more slices as determined by zero or
more syntax elements found in slice headers.

Bitstream: A sequence of bits, in the form of a NAL unit stream or a
byte stream, that forms the representation of a sequence of AUs
forming one or more coded video sequences (CVSs).

Coded picture: A coded representation of a picture comprising VCL NAL
units with a particular value of nuh_layer_id within an AU and
containing all CTUs of the picture.

Clean random access (CRA) PU: A PU in which the coded picture is a
CRA picture.

Clean random access (CRA) picture: An IRAP picture for which each VCL
NAL unit has nal_unit_type equal to CRA_NUT.

Coded video sequence (CVS): A sequence of AUs that consists, in
decoding order, of a CVSS AU, followed by zero or more AUs that are
not CVSS AUs, including all subsequent AUs up to but not including
any subsequent AU that is a CVSS AU.

Coded video sequence start (CVSS) AU: An AU in which there is a PU
for each layer in the CVS and the coded picture in each PU is a CLVSS
picture.

Coded layer video sequence (CLVS): A sequence of PUs with the same
value of nuh_layer_id that consists, in decoding order, of a CLVSS
PU, followed by zero or more PUs that are not CLVSS PUs, including
all subsequent PUs up to but not including any subsequent PU that is
a CLVSS PU.

Coded layer video sequence start (CLVSS) PU: A PU in which the coded
picture is a CLVSS picture.

Coded layer video sequence start (CLVSS) picture: A coded picture
that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or
a GDR picture with NoOutputBeforeRecoveryFlag equal to 1.

Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs
of chroma samples of a picture that has three sample arrays, or a CTB
of samples of a monochrome picture or a picture that is coded using
three separate colour planes and syntax structures used to code the
samples.

Decoding Capability Information (DCI): A syntax structure containing
syntax elements that apply to the entire bitstream.

Decoded picture buffer (DPB): A buffer holding decoded pictures for
reference, output reordering, or output delay specified for the
hypothetical reference decoder.

Gradual decoding refresh (GDR) picture: A picture for which each VCL
NAL unit has nal_unit_type equal to GDR_NUT.

Instantaneous decoding refresh (IDR) PU: A PU in which the coded
picture is an IDR picture.

Instantaneous decoding refresh (IDR) picture: An IRAP picture for
which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or
IDR_N_LP.

Intra random access point (IRAP) AU: An AU in which there is a PU for
each layer in the CVS and the coded picture in each PU is an IRAP
picture.

Intra random access point (IRAP) PU: A PU in which the coded picture
is an IRAP picture.

Intra random access point (IRAP) picture: A coded picture for which
all VCL NAL units have the same value of nal_unit_type in the range
of IDR_W_RADL to CRA_NUT, inclusive.

Layer: A set of VCL NAL units that all have a particular value of
nuh_layer_id and the associated non-VCL NAL units.

Network abstraction layer (NAL) unit: A syntax structure containing
an indication of the type of data to follow and bytes containing that
data in the form of an RBSP interspersed as necessary with emulation
prevention bytes.

Network abstraction layer (NAL) unit stream: A sequence of NAL units.

Operation point (OP): A temporal subset of an OLS, identified by an
OLS index and a highest value of TemporalId.

Picture parameter set (PPS): A syntax structure containing syntax
elements that apply to zero or more entire coded pictures as
determined by a syntax element found in each slice header.

Picture unit (PU): A set of NAL units that are associated with each
other according to a specified classification rule, are consecutive
in decoding order, and contain exactly one coded picture.

Random access: The act of starting the decoding process for a
bitstream at a point other than the beginning of the stream.

Sequence parameter set (SPS): A syntax structure containing syntax
elements that apply to zero or more entire CLVSs as determined by the
content of a syntax element found in the PPS referred to by a syntax
element found in each picture header.

Slice: An integer number of complete tiles or an integer number of
consecutive complete CTU rows within a tile of a picture that are
exclusively contained in a single NAL unit.

Slice header (SH): A part of a coded slice containing the data
elements pertaining to all tiles or CTU rows within a tile
represented in the slice.

Sublayer: A temporal scalable layer of a temporal scalable bitstream
consisting of VCL NAL units with a particular value of the TemporalId
variable, and the associated non-VCL NAL units.

Subpicture: A rectangular region of one or more slices within a
picture.

Sublayer representation: A subset of the bitstream consisting of NAL
units of a particular sublayer and the lower sublayers.

Tile: A rectangular region of CTUs within a particular tile column
and a particular tile row in a picture.

Tile column: A rectangular region of CTUs having a height equal to
the height of the picture and a width specified by syntax elements in
the picture parameter set.

Tile row: A rectangular region of CTUs having a height specified by
syntax elements in the picture parameter set and a width equal to the
width of the picture.

Video coding layer (VCL) NAL unit: A collective term for coded slice
NAL units and the subset of NAL units that have reserved values of
nal_unit_type that are classified as VCL NAL units in this
Specification.

3.1.2. Definitions Specific to This Memo

Media-Aware Network Element (MANE): A network element, such as a
middlebox, selective forwarding unit, or application-layer gateway
that is capable of parsing certain aspects of the RTP payload headers
or the RTP payload and reacting to their contents.

   Informative note: The concept of a MANE goes beyond normal routers
   or gateways in that a MANE has to be aware of the signaling (e.g.,
   to learn about the payload type mappings of the media streams),
   and in that it has to be trusted when working with Secure RTP
   (SRTP). The advantage of using MANEs is that they allow packets
   to be dropped according to the needs of the media coding. For
   example, if a MANE has to drop packets due to congestion on a
   certain link, it can identify and remove those packets whose
   elimination produces the least adverse effect on the user
   experience. After dropping packets, MANEs must rewrite RTCP
   packets to match the changes to the RTP stream, as specified in
   Section 7 of [RFC3550].

NAL unit decoding order: A NAL unit order that conforms to the
constraints on NAL unit order given in Section 7.4.2.4 of [VVC] and
that follows the order of NAL units in the bitstream.

RTP stream (See [RFC7656]): Within the scope of this memo, one RTP
stream is utilized to transport a VVC bitstream, which may contain
one or more layers, and each layer may contain one or more temporal
sublayers.

Transmission order: The order of packets in ascending RTP sequence
number order (in modulo arithmetic). Within an aggregation packet,
the NAL unit transmission order is the same as the order of
appearance of NAL units in the packet.

3.2. Abbreviations

AU      Access Unit
AP      Aggregation Packet
APS     Adaptation Parameter Set
CTU     Coding Tree Unit
CVS     Coded Video Sequence
DPB     Decoded Picture Buffer
DCI     Decoding Capability Information
DON     Decoding Order Number
FIR     Full Intra Request
FU      Fragmentation Unit
GDR     Gradual Decoding Refresh
HRD     Hypothetical Reference Decoder
IDR     Instantaneous Decoding Refresh
MANE    Media-Aware Network Element
MTU     Maximum Transmission Unit
NAL     Network Abstraction Layer
NALU    Network Abstraction Layer Unit
PLI     Picture Loss Indication
PPS     Picture Parameter Set
RPS     Reference Picture Set
RPSI    Reference Picture Selection Indication
SEI     Supplemental Enhancement Information
SLI     Slice Loss Indication
SPS     Sequence Parameter Set
VCL     Video Coding Layer
VPS     Video Parameter Set

4. RTP Payload Format

4.1. RTP Header Usage

The format of the RTP header is specified in [RFC3550] (reprinted as
Figure 2 for convenience). This payload format uses the fields of
the header in a manner consistent with that specification.

The RTP payload (and the settings for some RTP header bits) for
aggregation packets and fragmentation units are specified in
Section 4.3.2 and Section 4.3.3, respectively.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            contributing source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                RTP Header According to [RFC3550]

                            Figure 2

The RTP header information to be set according to this RTP payload
format is set as follows:

Marker bit (M): 1 bit

   Set for the last packet, in transmission order, among each set of
   packets that contain NAL units of one access unit. This is in
   line with the normal use of the M bit in video formats to allow an
   efficient playout buffer handling.

Payload Type (PT): 7 bits

   The assignment of an RTP payload type for this new packet format
   is outside the scope of this document and will not be specified
   here. The assignment of a payload type has to be performed either
   through the profile used or in a dynamic way.

Sequence Number (SN): 16 bits

   Set and used in accordance with [RFC3550].

Timestamp: 32 bits

   The RTP timestamp is set to the sampling timestamp of the content.
   A 90 kHz clock rate MUST be used. If the NAL unit has no timing
   properties of its own (e.g., parameter set and SEI NAL units), the
   RTP timestamp MUST be set to the RTP timestamp of the coded
   pictures of the access unit in which the NAL unit (according to
   Section 7.4.2.4 of [VVC]) is included. Receivers MUST use the RTP
   timestamp for the display process, even when the bitstream
   contains picture timing SEI messages or decoding unit information
   SEI messages as specified in [VVC].

      Informative note: When picture timing SEI messages are present,
      the RTP sender is responsible for ensuring that the RTP
      timestamps are consistent with the timing information carried
      in the picture timing SEI messages.

Synchronization source (SSRC): 32 bits

   Used to identify the source of the RTP packets. A single SSRC is
   used for all parts of a single bitstream.

4.2. Payload Header Usage

The first two bytes of the payload of an RTP packet are referred to
as the payload header.
The payload header consists of the same fields (F, Z, LayerId, Type,
and TID) as the NAL unit header as shown in Section 1.1.4,
irrespective of the type of the payload structure.

The TID value indicates (among other things) the relative importance
of an RTP packet, for example, because NAL units belonging to higher
temporal sublayers are not used for the decoding of lower temporal
sublayers. A lower value of TID indicates a higher importance.
More-important NAL units MAY be better protected against transmission
losses than less-important NAL units.

   For Discussion: quite possibly something similar can be said for
   the LayerId in layered coding, but perhaps not in multiview
   coding. (The relevant part of the spec is relatively new,
   therefore the soft language.) However, for serious layer pruning,
   interpretation of the VPS is required. We can add language about
   the need for stateful interpretation of LayerId vis-a-vis
   stateless interpretation of TID later.

4.3. Payload Structures

Three different types of RTP packet payload structures are specified.
A receiver can identify the type of an RTP packet payload through the
Type field in the payload header.

The three different payload structures are as follows:

o Single NAL unit packet: Contains a single NAL unit in the payload,
  and the NAL unit header of the NAL unit also serves as the payload
  header. This payload structure is specified in Section 4.3.1.

o Aggregation Packet (AP): Contains more than one NAL unit within
  one access unit. This payload structure is specified in
  Section 4.3.2.

o Fragmentation Unit (FU): Contains a subset of a single NAL unit.
  This payload structure is specified in Section 4.3.3.

4.3.1. Single NAL Unit Packets

A single NAL unit packet contains exactly one NAL unit, and consists
of a payload header (denoted as PayloadHdr), a conditional 16-bit
DONL field (in network byte order), and the NAL unit payload data
(the NAL unit excluding its NAL unit header) of the contained NAL
unit, as shown in Figure 3.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           PayloadHdr          |      DONL (conditional)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                  NAL unit payload data                        |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

            The Structure of a Single NAL Unit Packet

                            Figure 3

The DONL field, when present, specifies the value of the 16 least
significant bits of the decoding order number of the contained NAL
unit. If sprop-max-don-diff is greater than 0, the DONL field MUST
be present, and the variable DON for the contained NAL unit is
derived as equal to the value of the DONL field. Otherwise
(sprop-max-don-diff is equal to 0), the DONL field MUST NOT be
present.

4.3.2. Aggregation Packets (APs)

Aggregation Packets (APs) can reduce packetization overhead for small
NAL units, such as most of the non-VCL NAL units, which are often
only a few octets in size.

An AP aggregates NAL units of one access unit. Each NAL unit to be
carried in an AP is encapsulated in an aggregation unit. NAL units
aggregated in one AP are included in NAL unit decoding order.

An AP consists of a payload header (denoted as PayloadHdr) followed
by two or more aggregation units, as shown in Figure 4.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    PayloadHdr (Type=28)       |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                                               |
|             two or more aggregation units                     |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

             The Structure of an Aggregation Packet

                            Figure 4

The fields in the payload header of an AP are set as follows. The F
bit MUST be equal to 0 if the F bit of each aggregated NAL unit is
equal to zero; otherwise, it MUST be equal to 1. The Type field MUST
be equal to 28.

The value of LayerId MUST be equal to the lowest value of LayerId of
all the aggregated NAL units. The value of TID MUST be the lowest
value of TID of all the aggregated NAL units.

   Informative note: All VCL NAL units in an AP have the same TID
   value since they belong to the same access unit. However, an AP
   may contain non-VCL NAL units for which the TID value in the NAL
   unit header may be different than the TID value of the VCL NAL
   units in the same AP.

An AP MUST carry at least two aggregation units and can carry as many
aggregation units as necessary; however, the total amount of data in
an AP MUST fit into an IP packet, and the size SHOULD be chosen so
that the resulting IP packet is smaller than the MTU size, so as to
avoid IP layer fragmentation. An AP MUST NOT contain FUs specified
in Section 4.3.3. APs MUST NOT be nested; i.e., an AP cannot contain
another AP.

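
As an illustration of the header rules above, the following Python
sketch packs and parses the two-byte header layout of Figure 1 and
derives the payload header of an AP from the aggregated NAL units.
It is a non-normative sketch under stated assumptions: the names
NalHeader, parse_header, pack_header, and ap_payload_header are
hypothetical, and setting the Z bit of the AP payload header to 0 is
an assumption, since the text above gives no explicit rule for it.

```python
from dataclasses import dataclass
from typing import Iterable

AP_TYPE = 28  # Type value used by aggregation packets in this memo


@dataclass(frozen=True)
class NalHeader:
    f: int         # forbidden_zero_bit (1 bit)
    z: int         # nuh_reserved_zero_bit (1 bit)
    layer_id: int  # nuh_layer_id (6 bits)
    type: int      # nal_unit_type (5 bits)
    tid: int       # nuh_temporal_id_plus1 (3 bits)


def parse_header(data: bytes) -> NalHeader:
    """Split the first two bytes into the fields shown in Figure 1."""
    b0, b1 = data[0], data[1]
    return NalHeader(b0 >> 7, (b0 >> 6) & 1, b0 & 0x3F, b1 >> 3, b1 & 0x07)


def pack_header(h: NalHeader) -> bytes:
    """Inverse of parse_header."""
    return bytes([(h.f << 7) | (h.z << 6) | h.layer_id,
                  (h.type << 3) | h.tid])


def ap_payload_header(nal_units: Iterable[bytes]) -> bytes:
    """Derive the AP payload header from the NAL units to aggregate."""
    headers = [parse_header(n) for n in nal_units]
    if len(headers) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    return pack_header(NalHeader(
        f=int(any(h.f for h in headers)),           # 1 if any F bit is 1
        z=0,                                        # assumption, see lead-in
        layer_id=min(h.layer_id for h in headers),  # lowest LayerId
        type=AP_TYPE,
        tid=min(h.tid for h in headers),            # lowest TID
    ))
```

A packetizer would prepend the returned two bytes to the aggregation
units; checking the result against the MTU is left to the caller.
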
The first aggregation unit in an AP consists of a conditional 16-bit
DONL field (in network byte order) followed by a 16-bit unsigned size
field (in network byte order) that indicates the size of the NAL unit
in bytes (excluding these two octets, but including the NAL unit
header), followed by the NAL unit itself, including its NAL unit
header, as shown in Figure 5.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               :       DONL (conditional)      |   NALU size   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   NALU size   |                                               |
+-+-+-+-+-+-+-+-+         NAL unit                              |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      The Structure of the First Aggregation Unit in an AP

                            Figure 5

The DONL field, when present, specifies the value of the 16 least
significant bits of the decoding order number of the aggregated NAL
unit.

If sprop-max-don-diff is greater than 0, the DONL field MUST be
present in an aggregation unit that is the first aggregation unit in
an AP, and the variable DON for the aggregated NAL unit is derived as
equal to the value of the DONL field. The variable DON for the NAL
unit in an aggregation unit that is not the first aggregation unit in
an AP is derived as equal to the DON of the preceding aggregated NAL
unit in the same AP plus 1 modulo 65536. Otherwise
(sprop-max-don-diff is equal to 0), the DONL field MUST NOT be
present in an aggregation unit that is the first aggregation unit in
an AP.

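
The DON derivation for aggregated NAL units can be sketched in Python
as follows. This is a non-normative illustration: the function name
parse_ap is hypothetical, the input is assumed to be the complete AP
payload (including the two-byte payload header) with any RTP padding
already removed, and aggregation units after the first one are
assumed to carry only the 16-bit size field and the NAL unit, without
a DONL field.

```python
import struct
from typing import List, Optional, Tuple


def parse_ap(payload: bytes,
             sprop_max_don_diff: int) -> List[Tuple[Optional[int], bytes]]:
    """Return (DON, NAL unit) pairs carried in one AP payload.

    DON is None when sprop-max-don-diff is 0 (no DONL field present).
    """
    pos = 2  # skip the two-byte payload header (Type=28)
    don: Optional[int] = None
    units: List[Tuple[Optional[int], bytes]] = []
    while pos < len(payload):
        if sprop_max_don_diff > 0:
            if not units:
                # Only the first aggregation unit carries DONL.
                if pos + 2 > len(payload):
                    raise ValueError("truncated DONL field")
                (don,) = struct.unpack_from("!H", payload, pos)
                pos += 2
            else:
                # Later units: DON of the preceding unit plus 1, mod 65536.
                don = (don + 1) % 65536
        if pos + 2 > len(payload):
            raise ValueError("truncated NALU size field")
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        # The size covers the NAL unit including its two-byte header.
        if size < 2 or pos + size > len(payload):
            raise ValueError("invalid NALU size")
        units.append((don, payload[pos:pos + size]))
        pos += size
    if len(units) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    return units
```

Rejecting malformed input with an exception is a design choice of the
sketch; a real receiver might instead drop the packet silently.
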
An aggregation unit that is not the first aggregation unit in an AP
consists of a 16-bit unsigned size field (in network byte order) that
indicates the size of the NAL unit in bytes (excluding these two
octets, but including the NAL unit header), followed by the NAL unit
itself, including its NAL unit header, as shown in Figure 6.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               :          NALU size            |   NAL unit    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     The Structure of an Aggregation Unit That Is Not the First
                     Aggregation Unit in an AP

                            Figure 6

Figure 7 presents an example of an AP that contains two aggregation
units, labeled as 1 and 2 in the figure, without the DONL field being
present.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   PayloadHdr (Type=28)        |         NALU 1 Size           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          NALU 1 HDR           |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
|                   . . .                                       |
|                                                               |
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  . . .        | NALU 2 Size                   | NALU 2 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NALU 2 HDR    |                                               |
+-+-+-+-+-+-+-+-+              NALU 2 Data                      |
|                   . . .                                       |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 An Example of an AP Containing
          Two Aggregation Units without the DONL Field

                            Figure 7

Figure 8 presents an example of an AP that contains two aggregation
units, labeled as 1 and 2 in the figure, with the DONL field being
present.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   PayloadHdr (Type=28)        |        NALU 1 DONL            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          NALU 1 Size          |            NALU 1 HDR         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                 NALU 1 Data   . . .                           |
|                                                               |
+     . . .                     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :          NALU 2 Size          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          NALU 2 HDR           |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
|                                                               |
|        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 An Example of an AP Containing
            Two Aggregation Units with the DONL Field

                            Figure 8

4.3.3. Fragmentation Units

Fragmentation Units (FUs) are introduced to enable fragmenting a
single NAL unit into multiple RTP packets, possibly without
cooperation or knowledge of the [VVC] encoder. A fragment of a NAL
unit consists of an integer number of consecutive octets of that NAL
unit. Fragments of the same NAL unit MUST be sent in consecutive
order with ascending RTP sequence numbers (with no other RTP packets
within the same RTP stream being sent between the first and last
fragment).

When a NAL unit is fragmented and conveyed within FUs, it is referred
to as a fragmented NAL unit. APs MUST NOT be fragmented. FUs MUST
NOT be nested; i.e., an FU cannot contain a subset of another FU.

The RTP timestamp of an RTP packet carrying an FU is set to the
NALU-time of the fragmented NAL unit.

An FU consists of a payload header (denoted as PayloadHdr), an FU
header of one octet, a conditional 16-bit DONL field (in network byte
order), and an FU payload, as shown in Figure 9.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   PayloadHdr (Type=29)        |   FU header   | DONL (cond)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| DONL (cond)   |                                               |
+-+-+-+-+-+-+-+-+                                               |
|                         FU payload                            |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     The Structure of an FU

                            Figure 9

The fields in the payload header are set as follows. The Type field
MUST be equal to 29. The fields F, LayerId, and TID MUST be equal to
the fields F, LayerId, and TID, respectively, of the fragmented NAL
unit.

The FU header consists of an S bit, an E bit, a P bit, and a 5-bit
FuType field, as shown in Figure 10.

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|S|E|P|  FuType |
+---------------+

   The Structure of FU Header

           Figure 10

The semantics of the FU header fields are as follows:

S: 1 bit

   When set to 1, the S bit indicates the start of a fragmented NAL
   unit, i.e., the first byte of the FU payload is also the first
   byte of the payload of the fragmented NAL unit. When the FU
   payload is not the start of the fragmented NAL unit payload, the S
   bit MUST be set to 0.

E: 1 bit

   When set to 1, the E bit indicates the end of a fragmented NAL
   unit, i.e., the last byte of the FU payload is also the last byte
   of the fragmented NAL unit. When the FU payload is not the last
   fragment of a fragmented NAL unit, the E bit MUST be set to 0.

P: 1 bit

   When set to 1, the P bit indicates the last NAL unit of a coded
   picture, i.e., the last byte of the FU payload is also the last
   byte of the coded picture. When the FU payload is not the last
   fragment of a coded picture, the P bit MUST be set to 0.

FuType: 5 bits

   The field FuType MUST be equal to the field Type of the fragmented
   NAL unit.

The DONL field, when present, specifies the value of the 16 least
significant bits of the decoding order number of the fragmented NAL
unit.

If sprop-max-don-diff is greater than 0, and the S bit is equal to 1,
the DONL field MUST be present in the FU, and the variable DON for
the fragmented NAL unit is derived as equal to the value of the DONL
field. Otherwise (sprop-max-don-diff is equal to 0, or the S bit is
equal to 0), the DONL field MUST NOT be present in the FU.

A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
the S (start) bit and the E (end) bit must not both be set to 1 in
the same FU header.

The FU payload consists of fragments of the payload of the fragmented
NAL unit so that if the FU payloads of consecutive FUs, starting with
an FU with the S bit equal to 1 and ending with an FU with the E bit
equal to 1, are sequentially concatenated, the payload of the
fragmented NAL unit can be reconstructed. The NAL unit header of the
fragmented NAL unit is not included as such in the FU payload, but
rather the information of the NAL unit header of the fragmented NAL
unit is conveyed in the F, LayerId, and TID fields of the payload
headers of the FUs and the FuType field of the FU headers of the FUs.
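
The FU rules above can be illustrated with a short, non-normative
Python sketch that fragments one NAL unit and rebuilds it again. The
function names fragment and reassemble and their parameters are
hypothetical; the sketch assumes that the FUs handed to reassemble
are complete and in transmission order, that RTP padding has been
removed, and that the Z bit is copied unchanged into the payload
header.

```python
import struct
from typing import List, Optional

FU_TYPE = 29  # Type value used by fragmentation units in this memo


def fragment(nal: bytes, max_chunk: int, don: Optional[int] = None,
             last_of_picture: bool = False) -> List[bytes]:
    """Split one NAL unit into FUs (RTP headers are not included).

    Pass `don` only when sprop-max-don-diff is greater than 0.
    """
    body = nal[2:]  # the FU payload excludes the NAL unit header
    chunks = [body[i:i + max_chunk] for i in range(0, len(body), max_chunk)]
    if len(chunks) < 2:
        raise ValueError("fits in one packet: do not use an FU")
    nal_type = nal[1] >> 3
    # F, Z, LayerId, and TID are copied; Type is replaced by 29.
    payload_hdr = bytes([nal[0], (FU_TYPE << 3) | (nal[1] & 0x07)])
    fus = []
    for i, chunk in enumerate(chunks):
        s = int(i == 0)
        e = int(i == len(chunks) - 1)
        p = int(bool(e and last_of_picture))
        fu_hdr = bytes([(s << 7) | (e << 6) | (p << 5) | nal_type])
        # DONL is present only in the first fragment, and only if used.
        donl = struct.pack("!H", don) if (s and don is not None) else b""
        fus.append(payload_hdr + fu_hdr + donl + chunk)
    return fus


def reassemble(fus: List[bytes], has_donl: bool = False) -> bytes:
    """Rebuild the NAL unit from a complete, in-order run of FUs."""
    first, last = fus[0], fus[-1]
    if not (first[2] & 0x80) or not (last[2] & 0x40):
        raise ValueError("first or last fragment is missing")
    # NAL unit header: byte 0 as is; Type from FuType, TID from PayloadHdr.
    nal_hdr = bytes([first[0], ((first[2] & 0x1F) << 3) | (first[1] & 0x07)])
    body = b""
    for fu in fus:
        skip = 3 + (2 if (has_donl and fu[2] & 0x80) else 0)
        body += fu[skip:]
    return nal_hdr + body
```

Here max_chunk is the number of NAL unit payload bytes per FU; a
sender would typically derive it from the MTU minus the RTP, payload,
and FU header sizes.
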
1351 An FU payload MUST NOT be empty. 1353 If an FU is lost, the receiver SHOULD discard all following 1354 fragmentation units in transmission order corresponding to the same 1355 fragmented NAL unit, unless the decoder in the receiver is known to 1356 be prepared to gracefully handle incomplete NAL units. 1358 A receiver in an endpoint or in a MANE MAY aggregate the first n-1 1359 fragments of a NAL unit to an (incomplete) NAL unit, even if fragment 1360 n of that NAL unit is not received. In this case, the 1361 forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a 1362 syntax violation. 1364 4.4. Decoding Order Number 1366 For each NAL unit, the variable AbsDon is derived, representing the 1367 decoding order number that is indicative of the NAL unit decoding 1368 order. 1370 Let NAL unit n be the n-th NAL unit in transmission order within an 1371 RTP stream. 1373 If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon 1374 for NAL unit n, is derived as equal to n. 1376 Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is 1377 derived as follows, where DON[n] is the value of the variable DON for 1378 NAL unit n: 1380 o If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in 1381 transmission order), AbsDon[0] is set equal to DON[0]. 
1383 o Otherwise (n is greater than 0), the following applies for 1384 derivation of AbsDon[n]: 1386 If DON[n] == DON[n-1], 1387 AbsDon[n] = AbsDon[n-1] 1389 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768), 1390 AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1] 1392 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768), 1393 AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n] 1395 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768), 1396 AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - 1397 DON[n]) 1399 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768), 1400 AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n]) 1402 For any two NAL units m and n, the following applies: 1404 o AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows 1405 NAL unit m in NAL unit decoding order. 1407 o When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order 1408 of the two NAL units can be in either order. 1410 o AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes 1411 NAL unit m in decoding order. 1413 Informative note: When two consecutive NAL units in the NAL 1414 unit decoding order have different values of AbsDon, the 1415 absolute difference between the two AbsDon values may be 1416 greater than or equal to 1. 1418 Informative note: There are multiple reasons to allow for the 1419 absolute difference of the values of AbsDon for two consecutive 1420 NAL units in the NAL unit decoding order to be greater than 1421 one. An increment by one is not required, as at the time of 1422 associating values of AbsDon to NAL units, it may not be known 1423 whether all NAL units are to be delivered to the receiver. For 1424 example, a gateway might not forward VCL NAL units of higher 1425 sublayers or some SEI NAL units when there is congestion in the 1426 network. 
In another example, the first intra-coded picture of 1427 a pre-encoded clip is transmitted in advance to ensure that it 1428 is readily available in the receiver, and when transmitting the 1429 first intra-coded picture, the originator does not know exactly 1430 how many NAL units will be encoded before the first intra-coded 1431 picture of the pre-encoded clip follows in decoding order. 1432 Thus, the values of AbsDon for the NAL units of the first 1433 intra-coded picture of the pre-encoded clip have to be 1434 estimated when they are transmitted, and gaps in values of 1435 AbsDon may occur. 1437 5. Packetization Rules 1439 The following packetization rules apply: 1441 o If sprop-max-don-diff is greater than 0, the transmission order of 1442 NAL units carried in the RTP stream MAY be different from the NAL 1443 unit decoding order. Otherwise (sprop-max-don-diff is equal to 1444 0), the transmission order of NAL units carried in the RTP stream 1445 MUST be the same as the NAL unit decoding order. 1447 o A NAL unit of a small size SHOULD be encapsulated in an 1448 aggregation packet together with one or more other NAL units in order 1449 to avoid the unnecessary packetization overhead for small NAL 1450 units. For example, non-VCL NAL units such as access unit 1451 delimiters, parameter sets, or SEI NAL units are typically small 1452 and can often be aggregated with VCL NAL units without violating 1453 MTU size constraints. 1455 o Each non-VCL NAL unit SHOULD, when possible from an MTU size match 1456 viewpoint, be encapsulated in an aggregation packet together with 1457 its associated VCL NAL unit, as typically a non-VCL NAL unit would 1458 be meaningless without the associated VCL NAL unit being 1459 available. 1461 o For carrying exactly one NAL unit in an RTP packet, a single NAL 1462 unit packet MUST be used. 1464 6.
De-packetization Process 1466 The general concept behind de-packetization is to get the NAL units 1467 out of the RTP packets in an RTP stream and pass them to the decoder 1468 in the NAL unit decoding order. 1470 The de-packetization process is implementation dependent. Therefore, 1471 the following description should be seen as an example of a suitable 1472 implementation. Other schemes may be used as well, as long as the 1473 output for the same input is the same as that of the process described below. 1474 The output is the same when the set of output NAL units and their 1475 order are both identical. Optimizations relative to the described 1476 algorithms are possible. 1478 All normal RTP mechanisms related to buffer management apply. In 1479 particular, duplicated or outdated RTP packets (as indicated by the 1480 RTP sequence number and the RTP timestamp) are removed. To 1481 determine the exact time for decoding, factors such as a possible 1482 intentional delay to allow for proper inter-stream synchronization 1483 MUST be factored in. 1485 NAL units with NAL unit type values in the range of 0 to 27, 1486 inclusive, may be passed to the decoder. NAL-unit-like structures 1487 with NAL unit type values in the range of 28 to 31, inclusive, MUST 1488 NOT be passed to the decoder. 1490 The receiver includes a receiver buffer, which is used to compensate 1491 for transmission delay jitter within an individual RTP stream, and to 1492 reorder NAL units from transmission order to the NAL unit decoding 1493 order. In this section, the receiver operation is described under 1494 the assumption that there is no transmission delay jitter within an 1495 RTP stream. To distinguish it from a practical receiver buffer 1496 that is also used for compensation of transmission delay jitter, the 1497 receiver buffer is hereafter called the de-packetization buffer in 1498 this section.
Receivers should also prepare for transmission delay 1499 jitter; that is, either reserve separate buffers for transmission 1500 delay jitter buffering and de-packetization buffering or use a 1501 receiver buffer for both transmission delay jitter and de- 1502 packetization. Moreover, receivers should take transmission delay 1503 jitter into account in the buffering operation, e.g., by additional 1504 initial buffering before starting decoding and playback. 1506 The de-packetization process extracts the NAL units from the RTP 1507 packets in an RTP stream as follows. When an RTP packet carries a 1508 single NAL unit packet, the payload of the RTP packet is extracted as 1509 a single NAL unit, excluding the DONL field, i.e., third and fourth 1510 bytes, when sprop-max-don-diff is greater than 0. When an RTP packet 1511 carries an Aggregation Packet, several NAL units are extracted from 1512 the payload of the RTP packet. In this case, each NAL unit 1513 corresponds to the part of the payload of each aggregation unit that 1514 follows the NALU size field as described in Section 4.3.2. When an 1515 RTP packet carries a Fragmentation Unit (FU), all RTP packets from 1516 the first FU (with the S field equal to 1) of the fragmented NAL unit 1517 up to the last FU (with the E field equal to 1) of the fragmented NAL 1518 unit are collected. The NAL unit is extracted from these RTP packets 1519 by concatenating all FU payloads in the same order as the 1520 corresponding RTP packets and prepending the NAL unit header with the 1521 fields F, LayerId, and TID set equal to the values of the fields 1522 F, LayerId, and TID in the payload header of the FUs, respectively, 1523 and with the NAL unit type set equal to the value of the field FuType 1524 in the FU header of the FUs, as described in Section 4.3.3.
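The reordering described in the remainder of this section relies on the AbsDon value of each NAL unit, derived as specified in Section 4.4 for the case where sprop-max-don-diff is greater than 0. The following non-normative sketch transcribes that derivation; the function name and the list-based interface are hypothetical and are not part of this memo.

```python
def abs_don_sequence(dons):
    """Derive AbsDon[n] from the 16-bit DON[n] values of the NAL units.

    dons: DON values (0..65535) in transmission order.
    Returns the list of AbsDon values, following the five cases of
    Section 4.4 (sprop-max-don-diff greater than 0).
    """
    abs_dons = []
    for n, don in enumerate(dons):
        if n == 0:
            # The very first NAL unit: AbsDon[0] = DON[0].
            abs_dons.append(don)
            continue
        prev_don = dons[n - 1]
        prev_abs = abs_dons[n - 1]
        if don == prev_don:
            cur = prev_abs
        elif don > prev_don and don - prev_don < 32768:
            # Small forward step, no wrap-around.
            cur = prev_abs + don - prev_don
        elif don < prev_don and prev_don - don >= 32768:
            # Forward step across the 65535 -> 0 wrap-around.
            cur = prev_abs + 65536 - prev_don + don
        elif don > prev_don and don - prev_don >= 32768:
            # Backward step across the wrap-around.
            cur = prev_abs - (prev_don + 65536 - don)
        else:
            # don < prev_don and prev_don - don < 32768: small backward step.
            cur = prev_abs - (prev_don - don)
        abs_dons.append(cur)
    return abs_dons
```

For instance, the DON sequence 65534, 65535, 0, 2, 1 yields the AbsDon sequence 65534, 65535, 65536, 65538, 65537, showing both the wrap-around and a backward step.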
1526 When sprop-max-don-diff is equal to 0, the de-packetization buffer 1527 size is zero bytes, and the NAL units carried in the single RTP 1528 stream are directly passed to the decoder in their transmission 1529 order, which is identical to their decoding order. 1531 When sprop-max-don-diff is greater than 0, the process described in 1532 the remainder of this section applies. 1534 There are two buffering states in the receiver: initial buffering and 1535 buffering while playing. Initial buffering starts when the reception 1536 is initialized. After initial buffering, decoding and playback are 1537 started, and the buffering-while-playing mode is used. 1539 Regardless of the buffering state, the receiver stores incoming NAL 1540 units in reception order into the de-packetization buffer. NAL units 1541 carried in RTP packets are stored in the de-packetization buffer 1542 individually, and the value of AbsDon is calculated and stored for 1543 each NAL unit. 1545 Initial buffering lasts until the difference between the greatest and 1546 smallest AbsDon values of the NAL units in the de-packetization 1547 buffer is greater than or equal to the value of sprop-max-don-diff. 1549 After initial buffering, whenever the difference between the greatest 1550 and smallest AbsDon values of the NAL units in the de-packetization 1551 buffer is greater than or equal to the value of sprop-max-don-diff, 1552 the following operation is repeatedly applied until this difference 1553 is smaller than sprop-max-don-diff: 1555 o The NAL unit in the de-packetization buffer with the smallest 1556 value of AbsDon is removed from the de-packetization buffer and 1557 passed to the decoder. 1559 When no more NAL units are flowing into the de-packetization buffer, 1560 all NAL units remaining in the de-packetization buffer are removed 1561 from the buffer and passed to the decoder in the order of increasing 1562 AbsDon values. 1564 7. 
Payload Format Parameters 1566 This section specifies the optional parameters. A mapping of the 1567 parameters with Session Description Protocol (SDP) [RFC4566] is also 1568 provided for applications that use SDP. 1570 7.1. Media Type Registration 1572 The receiver MUST ignore any parameter unspecified in this memo. 1574 Type name: video 1576 Subtype name: H266 1578 Required parameters: none 1580 Optional parameters: 1582 profile-id, tier-flag, sub-profile-id, interop-constraints, and 1583 level-id: 1585 These parameters indicate the profile, tier, default level, 1586 sub-profile, and some constraints of the bitstream carried by 1587 the RTP stream, or a specific set of the profile, tier, default 1588 level, sub-profile and some constraints the receiver supports. 1590 The subset of coding tools that may have been used to generate 1591 the bitstream or that the receiver supports, as well as some 1592 additional constraints are indicated collectively by profile- 1593 id, sub-profile-id, and interop-constraints. 1595 Informative note: There are 128 values of profile-id. The 1596 subset of coding tools identified by the profile-id can be 1597 further constrained with up to 255 instances of sub-profile- 1598 id. In addition, 68 bits included in interop-constraints, 1599 which can be extended up to 324 bits, provide means to 1600 further restrict tools from existing profiles. To be able 1601 to support this fine-granular signalling of coding tool 1602 subsets with profile-id, sub-profile-id and interop- 1603 constraints, it would be safe to require symmetric use of 1604 these parameters in SDP offer/answer unless recv-ols-id is 1605 included in the SDP answer for choosing one of the layers 1606 offered. 1608 The tier is indicated by tier-flag. The default level is 1609 indicated by level-id.
The tier and the default level specify 1610 the limits on values of syntax elements or arithmetic 1611 combinations of values of syntax elements that are followed 1612 when generating the bitstream or that the receiver supports. 1614 In SDP offer/answer, when the SDP answer does not include the 1615 recv-ols-id parameter that is less than the sprop-ols-id 1616 parameter in the SDP offer, the following applies: 1618 + The tier-flag, profile-id, sub-profile-id, and interop- 1619 constraints parameters MUST be used symmetrically, i.e., the 1620 value of each of these parameters in the offer MUST be the 1621 same as that in the answer, either explicitly signaled or 1622 implicitly inferred. 1624 + The level-id parameter is changeable as long as the highest 1625 level indicated by the answer is either equal to or lower 1626 than that in the offer. Note that a highest level higher 1627 than level-id in the offer for receiving can be included as 1628 max-recv-level-id. 1630 In SDP offer/answer, when the SDP answer does include the recv- 1631 ols-id parameter that is less than the sprop-ols-id parameter 1632 in the SDP offer, the set of tier-flag, profile-id, sub- 1633 profile-id, interop-constraints, and level-id parameters 1634 included in the answer MUST be consistent with that for the 1635 chosen output layer set as indicated in the SDP offer, with the 1636 exception that the level-id parameter in the SDP answer is 1637 changeable as long as the highest level indicated by the answer 1638 is either lower than or equal to that in the offer. 1640 More specifications of these parameters, including how they 1641 relate to syntax elements specified in [VVC] are provided 1642 below. 1644 profile-id: 1646 When profile-id is not present, a value of 1 (i.e., the Main 10 1647 profile) MUST be inferred.
1649 When used to indicate properties of a bitstream, profile-id is 1650 derived from the general_profile_idc syntax element that 1651 applies to the bitstream in an instance of the 1652 profile_tier_level( ) syntax structure. 1654 A profile_tier_level( ) syntax structure may be contained in 1655 SPS, VPS, or DCI NAL units as specified in [VVC]. One of the 1656 following three cases applies to the container NAL unit of the 1657 profile_tier_level( ) syntax structure containing those PTL 1658 syntax elements used to derive the values of profile-id, tier- 1659 flag, level-id, sub-profile-id, or interop-constraints: 1) The 1660 container NAL unit is an SPS, the bitstream is a single-layer 1661 bitstream, and the profile_tier_level( ) syntax structures in 1662 all SPSs referenced by the CVSs in the bitstream have the same 1663 values respectively for those PTL syntax elements; 2) The 1664 container NAL unit is a VPS, the profile_tier_level( ) syntax 1665 structure is the one in the VPS that applies to the OLS 1666 corresponding to the bitstream, and the profile_tier_level( ) 1667 syntax structures applicable to the OLS corresponding to the 1668 bitstream in all VPSs referenced by the CVSs in the bitstream 1669 have the same values respectively for those PTL syntax 1670 elements; 3) The container NAL unit is a DCI NAL unit and the 1671 profile_tier_level( ) syntax structures in all DCI NAL units in 1672 the bitstream have the same values respectively for those PTL 1673 syntax elements. 1675 tier-flag, level-id: 1677 The value of tier-flag MUST be in the range of 0 to 1, 1678 inclusive. The value of level-id MUST be in the range of 0 to 1679 255, inclusive. 1681 If the tier-flag and level-id parameters are used to indicate 1682 properties of a bitstream, they indicate the tier and the 1683 highest level the bitstream complies with. 1685 If the tier-flag and level-id parameters are used for 1686 capability exchange, the following applies.
If max-recv-level- 1687 id is not present, the default level defined by level-id 1688 indicates the highest level the codec wishes to support. 1689 Otherwise, max-recv-level-id indicates the highest level the 1690 codec supports for receiving. For either receiving or sending, 1691 all levels that are lower than the highest level supported MUST 1692 also be supported. 1694 If no tier-flag is present, a value of 0 MUST be inferred; if 1695 no level-id is present, a value of 51 (i.e., level 3.1) MUST be 1696 inferred. 1698 Informative note: The level values currently defined in the 1699 VVC specification are in the form of "majorNum.minorNum", 1700 and the value of the level-id for each of the levels is 1701 equal to majorNum * 16 + minorNum * 3. It is expected that 1702 if any levels are defined in the future, the same convention 1703 will be used, but this cannot be guaranteed. 1705 When used to indicate properties of a bitstream, the tier-flag 1706 and level-id parameters are derived respectively from the 1707 syntax element general_tier_flag, and the syntax element 1708 general_level_idc or sub_layer_level_idc[j], that apply to the 1709 bitstream, in an instance of the profile_tier_level( ) syntax 1710 structure.
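As a worked illustration of the level-id convention given in the informative note above (level-id = majorNum * 16 + minorNum * 3), the following non-normative sketch converts between the two forms. The function names are hypothetical, and the inverse mapping is only meaningful for levels that follow this convention, which the note says cannot be guaranteed for future levels.

```python
def level_id_from_level(major, minor):
    """Map a "majorNum.minorNum" level to level-id (majorNum * 16 + minorNum * 3)."""
    level_id = major * 16 + minor * 3
    if not 0 <= level_id <= 255:
        raise ValueError("level-id must be in the range of 0 to 255")
    return level_id


def level_from_level_id(level_id):
    """Inverse mapping; raises if level_id does not fit the convention."""
    major, remainder = divmod(level_id, 16)
    if remainder % 3 != 0:
        raise ValueError("level-id does not follow majorNum * 16 + minorNum * 3")
    return major, remainder // 3
```

With these helpers, level 3.1 maps to 51, which matches the value inferred when level-id is absent.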
1712 If the tier-flag and level-id are derived from the 1713 profile_tier_level( ) syntax structure in a DCI NAL unit, the 1714 following applies: 1716 + tier-flag = general_tier_flag 1718 + level-id = general_level_idc 1720 Otherwise, if the tier-flag and level-id are derived from the 1721 profile_tier_level( ) syntax structure in an SPS or VPS NAL 1722 unit, and the bitstream contains the highest sub-layer 1723 representation in the OLS corresponding to the bitstream, the 1724 following applies: 1726 + tier-flag = general_tier_flag 1728 + level-id = general_level_idc 1730 Otherwise, if the tier-flag and level-id are derived from the 1731 profile_tier_level( ) syntax structure in an SPS or VPS NAL 1732 unit, and the bitstream does not contain the highest sub-layer 1733 representation in the OLS corresponding to the bitstream, the 1734 following applies, with j being the value of the sprop-sub- 1735 layer-id parameter: 1737 + tier-flag = general_tier_flag 1739 + level-id = sub_layer_level_idc[j] 1741 sub-profile-id: 1743 The value of the parameter is a comma-separated (',') list of 1744 data using base64 [RFC4648] representation. 1746 When used to indicate properties of a bitstream, sub-profile-id 1747 is derived from each of the ptl_num_sub_profiles 1748 general_sub_profile_idc[i] syntax elements that apply to the 1749 bitstream in a profile_tier_level( ) syntax structure. 1751 interop-constraints: 1753 A base64 [RFC4648] representation of the data that includes the 1754 syntax elements ptl_frame_only_constraint_flag and 1755 ptl_multilayer_enabled_flag and the general_constraints_info( ) 1756 syntax structure that apply to the bitstream in an instance of 1757 the profile_tier_level( ) syntax structure.
1759 If the interop-constraints parameter is not present, the 1760 following MUST be inferred: 1762 + ptl_frame_only_constraint_flag = 1 1764 + ptl_multilayer_enabled_flag = 0 1766 + gci_present_flag in the general_constraints_info( ) syntax 1767 structure = 0 1769 Using interop-constraints for capability exchange results in a 1770 requirement on any bitstream to be compliant with the interop- 1771 constraints. 1773 sprop-sub-layer-id: 1775 This parameter MAY be used to indicate the highest allowed 1776 value of TID in the highest layer present in the bitstream. 1777 When not present, the value of sprop-sub-layer-id is inferred 1778 to be equal to 6. 1780 The value of sprop-sub-layer-id MUST be in the range of 0 to 6, 1781 inclusive. 1783 sprop-ols-id: 1785 This parameter MAY be used to indicate the OLS that the 1786 bitstream applies to. When not present, the value of sprop- 1787 ols-id is inferred to be equal to TargetOlsIdx as specified in 1788 8.1.1 in [VVC]. If this optional parameter is present, sprop- 1789 vps MUST also be present or its content MUST be known a priori 1790 at the receiver. 1792 The value of sprop-ols-id MUST be in the range of 0 to 257, 1793 inclusive. 1795 Informative note: VVC allows having up to 258 output layer 1796 sets indicated in the VPS as the number of output layer sets 1797 minus 2 is indicated with a field of 8 bits. 1799 recv-sub-layer-id: 1801 This parameter MAY be used to signal a receiver's choice of the 1802 offered or declared sub-layer representations in the sprop-vps 1803 and sprop-sps. The value of recv-sub-layer-id indicates the 1804 TID of the highest sub-layer in the highest layer of the 1805 bitstream that a receiver supports. When not present, the 1806 value of recv-sub-layer-id is inferred to be equal to the value 1807 of the sprop-sub-layer-id parameter in the SDP offer. 1809 The value of recv-sub-layer-id MUST be in the range of 0 to 6, 1810 inclusive. 
1812 recv-ols-id: 1814 This parameter MAY be used to signal a receiver's choice of the 1815 offered or declared output layer sets in the sprop-vps. The 1816 value of recv-ols-id indicates the OLS index of the bitstream 1817 that a receiver supports. When not present, the value of recv- 1818 ols-id is inferred to be equal to the value of the sprop-ols-id 1819 parameter in the SDP offer. When present, the value of recv- 1820 ols-id must be included only when sprop-ols-id was received and 1821 must refer to an output layer set in the VPS that is in the 1822 same dependency tree as the OLS referred to by sprop-ols-id. 1823 If this optional parameter is present, sprop-vps must have been 1824 received or its content must be known a priori at the receiver. 1826 The value of recv-ols-id MUST be in the range of 0 to 257, 1827 inclusive. 1829 max-recv-level-id: 1831 This parameter MAY be used to indicate the highest level a 1832 receiver supports. 1834 The value of max-recv-level-id MUST be in the range of 0 to 1835 255, inclusive. 1837 When max-recv-level-id is not present, the value is inferred to 1838 be equal to level-id. 1840 max-recv-level-id MUST NOT be present when the highest level 1841 the receiver supports is not higher than the default level. 1843 sprop-dci: 1845 This parameter MAY be used to convey a decoding capability 1846 information NAL unit of the bitstream for out-of-band 1847 transmission. The parameter MAY also be used for capability 1848 exchange. The value of the parameter is a base64 [RFC4648] 1849 representation of the decoding capability information NAL unit 1850 as specified in Section 7.3.2.1 of [VVC]. 1852 sprop-vps: 1854 This parameter MAY be used to convey any video parameter set 1855 NAL unit of the bitstream for out-of-band transmission of video 1856 parameter sets.
The parameter MAY also be used for capability 1857 exchange and to indicate sub-stream characteristics (i.e., 1858 properties of output layer sets and sublayer representations as 1859 defined in [VVC]). The value of the parameter is a comma- 1860 separated (',') list of base64 [RFC4648] representations of the 1861 video parameter set NAL units as specified in Section 7.3.2.3 1862 of [VVC]. 1864 The sprop-vps parameter MAY contain one or more than one video 1865 parameter set NAL unit. However, all other video parameter 1866 sets contained in the sprop-vps parameter MUST be consistent 1867 with the first video parameter set in the sprop-vps parameter. 1868 A video parameter set vpsB is said to be consistent with 1869 another video parameter set vpsA if any decoder that conforms 1870 to the profile, tier, level, and constraints indicated by the 1871 data starting from the syntax element general_profile_space to 1872 the syntax element general_level_idc, inclusive, in the first 1873 profile_tier_level( ) syntax structure in vpsA can decode any 1874 bitstream that conforms to the profile, tier, level, and 1875 constraints indicated by the data starting from the syntax 1876 element general_profile_space to the syntax element 1877 general_level_idc, inclusive, in the first profile_tier_level( 1878 ) syntax structure in vpsB. 1880 sprop-sps: 1882 This parameter MAY be used to convey sequence parameter set NAL 1883 units of the bitstream for out-of-band transmission of sequence 1884 parameter sets. The value of the parameter is a comma- 1885 separated (',') list of base64 [RFC4648] representations of the 1886 sequence parameter set NAL units as specified in 1887 Section 7.3.2.4 of [VVC]. 1889 sprop-pps: 1891 This parameter MAY be used to convey picture parameter set NAL 1892 units of the bitstream for out-of-band transmission of picture 1893 parameter sets. 
The value of the parameter is a comma- 1894 separated (',') list of base64 [RFC4648] representations of the 1895 picture parameter set NAL units as specified in Section 7.3.2.5 1896 of [VVC]. 1898 sprop-sei: 1900 This parameter MAY be used to convey one or more SEI messages 1901 that describe bitstream characteristics. When present, a 1902 decoder can rely on the bitstream characteristics that are 1903 described in the SEI messages for the entire duration of the 1904 session, independently from the persistence scopes of the SEI 1905 messages as specified in [VSEI]. 1907 The value of the parameter is a comma-separated (',') list of 1908 base64 [RFC4648] representations of SEI NAL units as specified 1909 in [VSEI]. 1911 Informative note: Intentionally, no list of applicable or 1912 inapplicable SEI messages is specified here. Conveying 1913 certain SEI messages in sprop-sei may be sensible in some 1914 application scenarios and meaningless in others. However, a 1915 few examples are described below: 1917 1) In an environment where the bitstream was created from 1918 film-based source material, and no splicing is going to 1919 occur during the lifetime of the session, the film grain 1920 characteristics SEI message is likely meaningful, and 1921 sending it in sprop-sei rather than in the bitstream at each 1922 entry point may help with saving bits and allows one to 1923 configure the renderer only once, avoiding unwanted 1924 artifacts. 1926 2) Examples for SEI messages that would be meaningless to be 1927 conveyed in sprop-sei include the decoded picture hash SEI 1928 message (it is close to impossible that all decoded pictures 1929 have the same hashtag), the display orientation SEI message 1930 when the device is a handheld device (as the display 1931 orientation may change when the handheld device is turned 1932 around), or the filler payload SEI message (as there is no 1933 point in just having more bits in SDP). 
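Several of the parameters above (sprop-vps, sprop-sps, sprop-pps, and sprop-sei) carry a comma-separated list of base64 representations. A receiver could split and decode such a value along the following lines. This is a non-normative sketch: the function name is hypothetical, standard base64 with padding is assumed, and the byte strings used in the usage example are placeholders rather than real NAL units.

```python
import base64


def decode_sprop_list(value):
    """Decode a comma-separated list of base64 items into a list of bytes.

    Empty items (e.g., from a trailing comma) are skipped; characters
    outside the base64 alphabet raise binascii.Error (a ValueError subclass).
    """
    units = []
    for item in value.split(","):
        item = item.strip()
        if item:
            units.append(base64.b64decode(item, validate=True))
    return units


# Usage with placeholder data (not real NAL units):
placeholder = ",".join(
    base64.b64encode(data).decode("ascii") for data in (b"\x00\x01\x02", b"\x03\x04")
)
decoded = decode_sprop_list(placeholder)
```

Each decoded item would then be handled as a complete NAL unit of the corresponding type.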
1935 max-lsr: 1937 The max-lsr MAY be used to signal the capabilities of a 1938 receiver implementation and MUST NOT be used for any other 1939 purpose. The value of max-lsr is an integer indicating the 1940 maximum processing rate in units of luma samples per second. 1941 The max-lsr parameter signals that the receiver is capable of 1942 decoding video at a higher rate than is required by the highest 1943 level. 1945 Informative note: When the OPTIONAL media type parameters 1946 are used to signal the properties of a bitstream, and max- 1947 lsr is not present, the values of tier-flag, profile-id, 1948 sub-profile-id, interop-constraints, and level-id must always 1949 be such that the bitstream complies fully with the specified 1950 profile, tier, and level. 1952 When max-lsr is signaled, the receiver MUST be able to decode 1953 bitstreams that conform to the highest level, with the 1954 exception that the MaxLumaSr value in Table 136 of [VVC] for 1955 the highest level is replaced with the value of max-lsr. 1956 Senders MAY use this knowledge to send pictures of a given size 1957 at a higher picture rate than is indicated in the highest 1958 level. 1960 When not present, the value of max-lsr is inferred to be equal 1961 to the value of MaxLumaSr given in Table 136 of [VVC] for the 1962 highest level. 1964 The value of max-lsr MUST be in the range of MaxLumaSr to 16 * 1965 MaxLumaSr, inclusive, where MaxLumaSr is given in Table 136 of 1966 [VVC] for the highest level. 1968 max-fps: 1970 The value of max-fps is an integer indicating the maximum 1971 picture rate in units of pictures per 100 seconds that can be 1972 effectively processed by the receiver. The max-fps parameter 1973 MAY be used to signal that the receiver has a constraint in 1974 that it is not capable of processing video effectively at the 1975 full picture rate that is implied by the highest level and, 1976 when present, max-lsr.
1978 The value of max-fps is not necessarily the picture rate at 1979 which the maximum picture size can be sent; it constitutes a 1980 constraint on maximum picture rate for all resolutions. 1982 Informative note: The max-fps parameter is semantically 1983 different from max-lsr in that max-fps is used to signal a 1984 constraint, lowering the maximum picture rate from what is 1985 implied by other parameters. 1987 The encoder MUST use a picture rate equal to or less than this 1988 value. In cases where the max-fps parameter is absent, the 1989 encoder is free to choose any picture rate according to the 1990 highest level and any signaled optional parameters. 1992 The value of max-fps MUST be smaller than or equal to the full 1993 picture rate that is implied by the highest level and, when 1994 present, max-lsr. 1996 sprop-max-don-diff: 1998 If there is no NAL unit naluA that is followed in transmission 1999 order by any NAL unit preceding naluA in decoding order (i.e., 2000 the transmission order of the NAL units is the same as the 2001 decoding order), the value of this parameter MUST be equal to 2002 0. 2004 Otherwise, this parameter specifies the maximum absolute 2005 difference between the decoding order number (i.e., AbsDon) 2006 values of any two NAL units naluA and naluB, where naluA 2007 follows naluB in decoding order and precedes naluB in 2008 transmission order. 2010 The value of sprop-max-don-diff MUST be an integer in the range 2011 of 0 to 32767, inclusive. 2013 When not present, the value of sprop-max-don-diff is inferred 2014 to be equal to 0. 2016 sprop-depack-buf-bytes: 2018 This parameter signals the required size of the de- 2019 packetization buffer in units of bytes. The value of the 2020 parameter MUST be greater than or equal to the maximum buffer 2021 occupancy (in units of bytes) of the de-packetization buffer as 2022 specified in Section 6.
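To make the notion of maximum buffer occupancy concrete, the following non-normative sketch simulates the de-packetization buffer of Section 6 for a given arrival sequence and reports the peak number of buffered bytes. The function name and the (AbsDon, size) input form are hypothetical. Measuring occupancy immediately after each NAL unit is stored, before any removal, is an assumption of this sketch rather than a rule stated in this memo.

```python
import heapq


def simulate_depack_buffer(arrivals, max_don_diff):
    """Simulate the Section 6 buffer for sprop-max-don-diff greater than 0.

    arrivals: list of (abs_don, size_in_bytes) in reception order.
    Returns (abs_dons_in_output_order, max_occupancy_in_bytes).
    """
    if max_don_diff <= 0:
        raise ValueError("the buffering process applies only when sprop-max-don-diff > 0")
    heap = []  # entries are (abs_don, arrival_index, size); smallest AbsDon first
    occupancy = 0
    max_occupancy = 0
    output = []
    for index, (abs_don, size) in enumerate(arrivals):
        heapq.heappush(heap, (abs_don, index, size))
        occupancy += size
        max_occupancy = max(max_occupancy, occupancy)
        # Remove the smallest AbsDon while the spread is >= sprop-max-don-diff.
        while max(entry[0] for entry in heap) - heap[0][0] >= max_don_diff:
            don, _, removed_size = heapq.heappop(heap)
            occupancy -= removed_size
            output.append(don)
    # No more input: flush the remainder in increasing AbsDon order.
    while heap:
        don, _, _ = heapq.heappop(heap)
        output.append(don)
    return output, max_occupancy
```

For example, NAL units arriving with AbsDon values 0, 2, 1, 3 and sizes 10, 20, 30, 40 bytes, with sprop-max-don-diff equal to 1, are output in the order 0, 1, 2, 3, and the peak occupancy under the stated assumption is 60 bytes.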
2024 The value of sprop-depack-buf-bytes MUST be an integer in the 2025 range of 0 to 4294967295, inclusive. 2027 When sprop-max-don-diff is present and greater than 0, this 2028 parameter MUST be present and the value MUST be greater than 0. 2029 When not present, the value of sprop-depack-buf-bytes is 2030 inferred to be equal to 0. 2032 Informative note: The value of sprop-depack-buf-bytes 2033 indicates the required size of the de-packetization buffer 2034 only. When network jitter can occur, an appropriately sized 2035 jitter buffer has to be available as well. 2037 depack-buf-cap: 2039 This parameter signals the capabilities of a receiver 2040 implementation and indicates the amount of de-packetization 2041 buffer space in units of bytes that the receiver has available 2042 for reconstructing the NAL unit decoding order from NAL units 2043 carried in the RTP stream. A receiver is able to handle any 2044 RTP stream for which the value of the sprop-depack-buf-bytes 2045 parameter is smaller than or equal to this parameter. 2047 When not present, the value of depack-buf-cap is inferred to be 2048 equal to 4294967295. The value of depack-buf-cap MUST be an 2049 integer in the range of 1 to 4294967295, inclusive. 2051 Informative note: depack-buf-cap indicates the maximum 2052 possible size of the de-packetization buffer of the receiver 2053 only, without allowing for network jitter. 2055 7.2. SDP Parameters 2057 The receiver MUST ignore any parameter unspecified in this memo. 2059 7.2.1. Mapping of Payload Type Parameters to SDP 2061 The media type video/H266 string is mapped to fields in the Session 2062 Description Protocol (SDP) [RFC4566] as follows: 2064 o The media name in the "m=" line of SDP MUST be video. 2066 o The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the 2067 media subtype). 2069 o The clock rate in the "a=rtpmap" line MUST be 90000. 
2071 o The OPTIONAL parameters profile-id, tier-flag, sub-profile-id, 2072 interop-constraints, level-id, sprop-sub-layer-id, sprop-ols-id, 2073 recv-sub-layer-id, recv-ols-id, max-recv-level-id, max-lsr, max- 2074 fps, sprop-max-don-diff, sprop-depack-buf-bytes and depack-buf- 2075 cap, when present, MUST be included in the "a=fmtp" line of SDP. 2076 These parameters are expressed as a media type string, in the form of 2077 a semicolon-separated list of parameter=value pairs. 2079 o The OPTIONAL parameters sprop-vps, sprop-sps, sprop-pps, sprop-sei, 2080 and sprop-dci, when present, MUST be included in the "a=fmtp" line 2081 of SDP or conveyed using the "fmtp" source attribute as specified 2082 in Section 6.3 of [RFC5576]. For a particular media format (i.e., 2083 RTP payload type), sprop-vps, sprop-sps, sprop-pps, sprop-sei, or 2084 sprop-dci MUST NOT be both included in the "a=fmtp" line of SDP 2085 and conveyed using the "fmtp" source attribute. When included in 2086 the "a=fmtp" line of SDP, those parameters are expressed as a 2087 media type string, in the form of a semicolon-separated list of 2088 parameter=value pairs. When conveyed in the "a=fmtp" line of SDP 2089 for a particular payload type, the parameters sprop-vps, sprop- 2090 sps, sprop-pps, sprop-sei, and sprop-dci MUST be applied to each 2091 SSRC with the payload type. When conveyed using the "fmtp" source 2092 attribute, these parameters are only associated with the given 2093 source and payload type as parts of the "fmtp" source attribute. 2095 An example of media representation in SDP is as follows: 2097 m=video 49170 RTP/AVP 98 2098 a=rtpmap:98 H266/90000 2099 a=fmtp:98 profile-id=1; 2100 sprop-vps=