idnits 2.17.1 draft-ietf-avtcore-rtp-vvc-12.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document date (25 October 2021) is 913 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '0' on line 1381 ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866) ** Downref: Normative reference to an Informational RFC: RFC 7656 -- Possible downref: Non-RFC (?) normative reference: ref. 'VSEI' -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC' -- Obsolete informational reference (is this intentional?): RFC 2326 (Obsoleted by RFC 7826) Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 5 comments (--). Run idnits with the --verbose option for more detailed information about the items above. 
-------------------------------------------------------------------------------- 2 avtcore S. Zhao 3 Internet-Draft S. Wenger 4 Intended status: Standards Track Tencent 5 Expires: 28 April 2022 Y. Sanchez 6 Fraunhofer HHI 7 Y.-K. Wang 8 Bytedance Inc. 9 25 October 2021 11 RTP Payload Format for Versatile Video Coding (VVC) 12 draft-ietf-avtcore-rtp-vvc-12 14 Abstract 16 This memo describes an RTP payload format for the video coding 17 standard ITU-T Recommendation H.266 and ISO/IEC International 18 Standard 23090-3, both also known as Versatile Video Coding (VVC) and 19 developed by the Joint Video Experts Team (JVET). The RTP payload 20 format allows for packetization of one or more Network Abstraction 21 Layer (NAL) units in each RTP packet payload as well as fragmentation 22 of a NAL unit into multiple RTP packets. The payload format has wide 23 applicability in videoconferencing, Internet video streaming, and 24 high-bitrate entertainment-quality video, among other applications. 26 Status of This Memo 28 This Internet-Draft is submitted in full conformance with the 29 provisions of BCP 78 and BCP 79. 31 Internet-Drafts are working documents of the Internet Engineering 32 Task Force (IETF). Note that other groups may also distribute 33 working documents as Internet-Drafts. The list of current Internet- 34 Drafts is at https://datatracker.ietf.org/drafts/current/. 36 Internet-Drafts are draft documents valid for a maximum of six months 37 and may be updated, replaced, or obsoleted by other documents at any 38 time. It is inappropriate to use Internet-Drafts as reference 39 material or to cite them other than as "work in progress." 41 This Internet-Draft will expire on 28 April 2022. 43 Copyright Notice 45 Copyright (c) 2021 IETF Trust and the persons identified as the 46 document authors. All rights reserved. 
48 This document is subject to BCP 78 and the IETF Trust's Legal 49 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 50 license-info) in effect on the date of publication of this document. 51 Please review these documents carefully, as they describe your rights 52 and restrictions with respect to this document. Code Components 53 extracted from this document must include Simplified BSD License text 54 as described in Section 4.e of the Trust Legal Provisions and are 55 provided without warranty as described in the Simplified BSD License.

57 Table of Contents

59 1. Introduction
60   1.1. Overview of the VVC Codec
61     1.1.1. Coding-Tool Features (informative)
62     1.1.2. Systems and Transport Interfaces (informative)
63     1.1.3. High-Level Picture Partitioning (informative)
64     1.1.4. NAL Unit Header
65   1.2. Overview of the Payload Format
66 2. Conventions
67 3. Definitions and Abbreviations
68   3.1. Definitions
69     3.1.1. Definitions from the VVC Specification
70     3.1.2. Definitions Specific to This Memo
71   3.2. Abbreviations
72 4. RTP Payload Format
73   4.1. RTP Header Usage
74   4.2. Payload Header Usage
75   4.3. Payload Structures
76     4.3.1. Single NAL Unit Packets
77     4.3.2. Aggregation Packets (APs)
78     4.3.3. Fragmentation Units
79   4.4. Decoding Order Number
80 5. Packetization Rules
81 6. De-packetization Process
82 7. Payload Format Parameters
83   7.1. Media Type Registration
84   7.2. SDP Parameters
85     7.2.1. Mapping of Payload Type Parameters to SDP
86     7.2.2. Usage with SDP Offer/Answer Model
87     7.2.3. Usage in Declarative Session Descriptions
88     7.2.4. Considerations for Parameter Sets
89 8. Use with Feedback Messages
90   8.1. Picture Loss Indication (PLI)
91   8.2. Full Intra Request (FIR)
92 9. Security Considerations
93 10. Congestion Control
94 11. IANA Considerations
95 12. Acknowledgements
96 13. References
97   13.1. Normative References
98   13.2. Informative References
99 Appendix A. Change History
100 Authors' Addresses

102 1. Introduction 104 The Versatile Video Coding [VVC] specification, formally published as 105 both ITU-T Recommendation H.266 and ISO/IEC International Standard 106 23090-3, is currently in the ITU-T publication process and the ISO/ 107 IEC approval process. VVC is reported to provide significant coding 108 efficiency gains over HEVC [HEVC], also known as H.265, and other 109 earlier video codecs. 111 This memo specifies an RTP payload format for VVC.
It shares its 112 basic design with the NAL (Network Abstraction Layer) unit-based RTP 113 payload formats of H.264 Video Coding [RFC6184], Scalable Video 114 Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798] 115 and their respective predecessors. With respect to design 116 philosophy, security, congestion control, and overall implementation 117 complexity, it has similar properties to those earlier payload format 118 specifications. This is a conscious choice, as at least RFC 6184 is 119 widely deployed and generally known in the relevant implementer 120 communities. Certain scalability-related mechanisms known from 121 [RFC6190] were incorporated into this document, as VVC version 1 122 supports temporal, spatial, and signal-to-noise ratio (SNR) 123 scalability. 125 1.1. Overview of the VVC Codec 127 VVC and HEVC share a similar hybrid video codec design. In this 128 memo, we provide a very brief overview of those features of VVC that 129 are, in some form, addressed by the payload format specified herein. 130 Implementers have to read, understand, and apply the ITU-T/ISO/IEC 131 specifications pertaining to VVC to arrive at interoperable, well- 132 performing implementations. 134 Conceptually, both VVC and HEVC include a Video Coding Layer (VCL), 135 which is often used to refer to the coding-tool features, and a NAL, 136 which is often used to refer to the systems and transport interface 137 aspects of the codecs. 139 1.1.1. Coding-Tool Features (informative) 141 Coding tool features are described below with occasional reference to 142 the coding tool set of HEVC, which is well known in the community. 144 Similar to earlier hybrid-video-coding-based standards, including 145 HEVC, the following basic video coding design is employed by VVC. A 146 prediction signal is first formed by either intra- or motion- 147 compensated prediction, and the residual (the difference between the 148 original and the prediction) is then coded. 
The gains in coding 149 efficiency are achieved by redesigning and improving almost all parts 150 of the codec over earlier designs. In addition, VVC includes several 151 tools to make the implementation on parallel architectures easier. 153 Finally, VVC includes temporal, spatial, and SNR scalability as well 154 as multiview coding support. 156 Coding blocks and transform structure 158 Among major coding-tool differences between HEVC and VVC, one of the 159 important improvements is the more flexible coding tree structure in 160 VVC, i.e., the multi-type tree. In addition to quadtree, binary and 161 ternary trees are also supported, which contribute a significant 162 improvement in coding efficiency. Moreover, the maximum size of 163 the coding tree unit (CTU) is increased from 64x64 to 128x128. To 164 improve the coding efficiency of the chroma signal, separate luma and chroma 165 trees at CTU level may be employed for intra-slices. The square 166 transforms in HEVC are extended to non-square transforms for 167 rectangular blocks resulting from binary and ternary tree splits. 168 Besides, VVC supports multiple transform sets (MTS), including DCT-2, 169 DST-7, and DCT-8 as well as the non-separable secondary transform. 170 The transforms used in VVC can have different sizes with support for 171 larger transform sizes. For DCT-2, the transform sizes range from 172 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range from 173 4x4 to 32x32. In addition, VVC also supports sub-block transforms for 174 both intra and inter coded blocks. For intra coded blocks, intra 175 sub-partitioning (ISP) may be used to allow sub-block based intra 176 prediction and transform. For inter blocks, sub-block transform may 177 be used assuming that only a part of an inter-block has non-zero 178 transform coefficients.
180 Entropy coding 182 Similar to HEVC, VVC uses a single entropy-coding engine, which is 183 based on context adaptive binary arithmetic coding [CABAC], but with 184 the support of multi-window sizes. The window sizes can be 185 initialized differently for different context models. Due to such a 186 design, it adapts faster and achieves better coding 187 efficiency. A joint chroma residual coding scheme is applied to 188 further exploit the correlation between the residuals of two color 189 components. In VVC, different residual coding schemes are applied 190 for regular transform coefficients and residual samples generated 191 using transform-skip mode. 193 In-loop filtering 195 VVC supports more in-loop filter features than HEVC. The 196 deblocking filter in VVC is similar to that of HEVC but operates on a smaller 197 grid. After deblocking and sample adaptive offset (SAO), an adaptive 198 loop filter (ALF) may be used. As a Wiener filter, ALF reduces 199 distortion of decoded pictures. Besides, VVC introduces a new module 200 before deblocking called luma mapping with chroma scaling to fully 201 utilize the dynamic range of the signal so that rate-distortion 202 performance of both SDR and HDR content is improved. 204 Motion prediction and coding 206 Compared to HEVC, VVC introduces several improvements in this area. 207 First, there is adaptive motion vector resolution (AMVR), which 208 can save bit cost for motion vectors by adaptively signaling motion 209 vector resolution. Second, affine motion compensation is included 210 to capture complicated motion like zooming and rotation. Meanwhile, 211 prediction refinement with the optical flow with affine mode (PROF) 212 is further deployed to mimic affine motion at the pixel level. 213 Third, decoder-side motion vector refinement (DMVR) is a method 214 to derive the motion vector at the decoder side based on block matching so that 215 fewer bits may be spent on motion vectors.
Bi-directional optical 216 flow (BDOF) is a similar method to PROF. BDOF adds a sample wise 217 offset at 4x4 sub-block level that is derived with equations based on 218 gradients of the prediction samples and a motion difference relative 219 to CU motion vectors. Furthermore, merge with motion vector 220 difference (MMVD) is a special mode, which further signals a limited 221 set of motion vector differences on top of merge mode. In addition 222 to MMVD, there are three other types of special merge modes, i.e., 223 sub-block merge, triangle, and combined intra-/inter-prediction 224 (CIIP). Sub-block merge list includes one candidate of sub-block 225 temporal motion vector prediction (SbTMVP) and up to four candidates 226 of affine motion vectors. Triangle is based on triangular block 227 motion compensation. CIIP combines intra- and inter-prediction 228 with weighting. Adaptive weighting may be employed with a block- 229 level tool called bi-prediction with CU based weighting (BCW) which 230 provides more flexibility than in HEVC. 232 Intra prediction and intra-coding 234 To capture the diversified local image texture directions with finer 235 granularity, VVC supports 65 angular directions instead of the 33 236 directions in HEVC. The intra mode coding is based on a 6-most- 237 probable-mode scheme, and the 6 most probable modes are derived using 238 the neighboring intra prediction directions. In addition, to deal 239 with the different distributions of intra prediction angles for 240 different block aspect ratios, a wide-angle intra prediction (WAIP) 241 scheme is applied in VVC by including intra prediction angles beyond 242 those present in HEVC. Unlike HEVC, which only allows using the most 243 adjacent line of reference samples for intra prediction, VVC also 244 allows using two further reference lines, also known as multi- 245 reference-line (MRL) intra prediction.
The additional reference 246 lines can only be used for the 6 most probable intra prediction 247 modes. To capture the strong correlation between different colour 248 components, in VVC, a cross-component linear mode (CCLM) is utilized 249 which assumes a linear relationship between the luma sample values 250 and their associated chroma samples. For intra prediction, VVC also 251 applies a position-dependent prediction combination (PDPC) for 252 refining the prediction samples closer to the intra prediction block 253 boundary. Matrix-based intra prediction (MIP) modes are also used in 254 VVC, which generate an up to 8x8 intra prediction block using a 255 weighted sum of downsampled neighboring reference samples, and the 256 weights are hardcoded constants. 258 Other coding-tool feature 260 VVC introduces dependent quantization (DQ) to reduce quantization 261 error by state-based switching between two quantizers. 263 1.1.2. Systems and Transport Interfaces (informative) 265 VVC inherits the basic systems and transport interfaces designs from 266 HEVC and H.264. These include the NAL-unit-based syntax structure, 267 the hierarchical syntax and data unit structure, the supplemental 268 enhancement information (SEI) message mechanism, and the video 269 buffering model based on the hypothetical reference decoder (HRD). 270 The scalability features of VVC are conceptually similar to the 271 scalable variant of HEVC known as SHVC. The hierarchical syntax and 272 data unit structure consists of parameter sets at various levels 273 (decoder, sequence (pertaining to all), sequence (pertaining to a 274 single), picture), picture-level header parameters, slice-level 275 header parameters, and lower-level parameters.
277 A number of key components that influenced the network abstraction 278 layer design of VVC as well as this memo are described below. 280 Decoding capability information 282 The decoding capability information includes parameters that stay 283 constant for the lifetime of a Video Bitstream, which in IETF terms 284 can translate to the lifetime of a session. Such information 285 includes profile, level, and sub-profile information to determine a 286 maximum capability interop point that is guaranteed never to be 287 exceeded, even if splicing of video sequences occurs within a 288 session. It further includes constraint fields (most of which are 289 flags), which can optionally be set to indicate that the video 290 bitstream will be constrained in the use of certain features as 291 indicated by the values of those fields. With this, a bitstream can 292 be labelled as not using certain tools, which allows among other 293 things for resource allocation in a decoder implementation. 295 Video parameter set 297 The video parameter set (VPS) pertains to one or more coded video 298 sequences (CVSs) of multiple layers covering the same range of access 299 units, and includes, among other information, decoding dependency 300 expressed as information for reference picture list construction of 301 enhancement layers. The VPS provides a "big picture" of a scalable 302 sequence, including what types of operation points are provided, the 303 profile, tier, and level of the operation points, and some other 304 high-level properties of the bitstream that can be used as the basis 305 for session negotiation and content selection, etc. One VPS may be 306 referenced by one or more sequence parameter sets.
308 Sequence parameter set 310 The sequence parameter set (SPS) contains syntax elements pertaining 311 to a coded layer video sequence (CLVS), which is a group of pictures 312 belonging to the same layer, starting with a random access point, and 313 followed by pictures that may depend on each other, until the next 314 random access point picture. In MPEG-2, the equivalent of a CVS was 315 a group of pictures (GOP), which normally started with an I frame and 316 was followed by P and B frames. While more complex in its options of 317 random access points, VVC retains this basic concept. One remarkable 318 difference of VVC is that a CLVS may start with a Gradual Decoding 319 Refresh (GDR) picture, without requiring presence of traditional 320 random access points in the bitstream, such as instantaneous decoding 321 refresh (IDR) or clean random access (CRA) pictures. In many TV-like 322 applications, a CVS contains a few hundred milliseconds to a few 323 seconds of video. In video conferencing (without switching MCUs 324 involved), a CVS can be as long in duration as the whole session. 326 Picture and adaptation parameter set 327 The picture parameter set and the adaptation parameter set (PPS and 328 APS, respectively) carry information pertaining to zero or more 329 pictures and zero or more slices, respectively. The PPS contains 330 information that is likely to stay constant from picture to picture, 331 at least for pictures of a certain type, whereas the APS contains 332 information, such as adaptive loop filter coefficients, that is 333 likely to change from picture to picture or even within a picture. A 334 single APS is referenced by all slices of the same picture if that 335 APS contains information about luma mapping with chroma scaling 336 (LMCS) or scaling list. Different APSs containing ALF parameters can 337 be referenced by slices of the same picture.
339 Picture header 341 A Picture Header contains information that is common to all slices 342 that belong to the same picture. Being able to send that information 343 as a separate NAL unit when pictures are split into several slices 344 allows for saving bitrate, compared to repeating the same information 345 in all slices. However, there might be scenarios where low-bitrate 346 video is transmitted using a single slice per picture. Having a 347 separate NAL unit to convey that information incurs an overhead 348 for such scenarios. For such scenarios, the picture header syntax 349 structure is directly included in the slice header, instead of in its 350 own NAL unit. The mode of the picture header syntax structure being 351 included in its own NAL unit or not can only be switched on/off for 352 an entire CLVS, and can only be switched off when in the entire CLVS 353 each picture contains only one slice. 355 Profile, tier, and level 357 The profile, tier and level syntax structures in DCI, VPS and SPS 358 contain profile, tier, level information for all layers that refer to 359 the DCI, for layers associated with one or more output layer sets 360 specified by the VPS, and for any layer that refers to the SPS, 361 respectively. 363 Sub-profiles 365 Within the VVC specification, a sub-profile is a 32-bit number, coded 366 according to ITU-T Rec. T.35, that does not carry any semantics. It is 367 carried in the profile_tier_level structure and hence (potentially) 368 present in the DCI, VPS, and SPS. External registration bodies can 369 register a T.35 codepoint with ITU-T registration authorities and 370 associate with their registration a description of bitstream 371 restrictions beyond the profiles defined by ITU-T and ISO/IEC. This 372 would allow encoder manufacturers to label the bitstreams generated 373 by their encoder as complying with such a sub-profile.
It is expected 374 that upstream standardization organizations (such as DVB and ATSC), 375 as well as walled-garden video services, will take advantage of this 376 labelling system. In contrast to "normal" profiles, it is expected 377 that sub-profiles may indicate encoder choices traditionally left 378 open in the (decoder-centric) video coding specs, such as GOP 379 structures, minimum/maximum QP values, and the mandatory use of 380 certain tools or SEI messages. 382 General constraint fields 384 The profile_tier_level structure carries a considerable number of 385 constraint fields (most of which are flags), which an encoder can use 386 to indicate to a decoder that it will not use a certain tool or 387 technology. They were included in reaction to a perceived market 388 need for labelling a bitstream as not exercising a certain tool that 389 has become commercially unviable. 391 Temporal scalability support 393 VVC includes support of temporal scalability, by inclusion of the 394 signaling of TemporalId in the NAL unit header, the restriction that 395 pictures of a particular temporal sublayer cannot be used for inter 396 prediction reference by pictures of a lower temporal sublayer, the 397 sub-bitstream extraction process, and the requirement that each sub- 398 bitstream extraction output be a conforming bitstream. Media-Aware 399 Network Elements (MANEs) can utilize the TemporalId in the NAL unit 400 header for stream adaptation purposes based on temporal scalability. 402 Reference picture resampling (RPR) 404 In AVC and HEVC, the spatial resolution of pictures cannot change 405 unless a new sequence using a new SPS starts, with an IRAP picture. 406 VVC enables picture resolution change within a sequence at a position 407 without encoding an IRAP picture, which is always intra-coded.
This 408 feature is sometimes referred to as reference picture resampling 409 (RPR), as the feature needs resampling of a reference picture used 410 for inter prediction when that reference picture has a different 411 resolution than the current picture being decoded. RPR allows 412 resolution change without the need of coding an IRAP picture, which 413 causes a momentary bit rate spike in streaming or video conferencing 414 scenarios, e.g., to cope with network condition changes. RPR can 415 also be used in application scenarios wherein zooming of the entire 416 video region or some region of interest is needed. 418 Spatial, SNR, and multiview scalability 420 VVC includes support for spatial, SNR, and multiview scalability. 421 Scalable video coding is widely considered to have technical benefits 422 and enrich services for various video applications. Until recently, 423 however, the functionality has not been included in the first version 424 of specifications of the video codecs. In VVC, however, all those 425 forms of scalability are supported in the first version of VVC 426 natively through the signaling of the layer_id in the NAL unit 427 header, the VPS which associates layers with given layer_ids to each 428 other, reference picture selection, reference picture resampling for 429 spatial scalability, and a number of other mechanisms not relevant 430 for this memo. 432 Spatial scalability 434 With the existence of Reference Picture Resampling (RPR), the 435 additional burden for scalability support is just a 436 modification of the high-level syntax (HLS). The inter-layer 437 prediction is employed in a scalable system to improve the 438 coding efficiency of the enhancement layers. 
In addition to 439 the spatial and temporal motion-compensated predictions that 440 are available in a single-layer codec, the inter-layer 441 prediction in VVC uses the possibly resampled video data of the 442 reconstructed reference picture from a reference layer to 443 predict the current enhancement layer. The resampling process 444 for inter-layer prediction, when used, is performed at the 445 block-level, reusing the existing interpolation process for 446 motion compensation in single-layer coding. It means that no 447 additional resampling process is needed to support spatial 448 scalability. 450 SNR scalability 452 SNR scalability is similar to spatial scalability except that 453 the resampling factors are 1:1. In other words, there is no 454 change in resolution, but there is inter-layer prediction. 456 Multiview scalability 458 The first version of VVC also supports multiview scalability, 459 wherein a multi-layer bitstream carries layers representing 460 multiple views, and one or more of the represented views can be 461 output at the same time. 463 SEI messages 465 Supplemental enhancement information (SEI) messages carry information 466 in the bitstream that does not influence the decoding process as 467 specified in the VVC spec, but addresses issues of representation/ 468 rendering of the decoded bitstream, labels the bitstream for certain 469 applications, and serves other, similar tasks. The overall concept of SEI 470 messages and many of the messages themselves have been inherited from 471 the H.264 and HEVC specs. Except for the SEI messages that affect 472 the specification of the hypothetical reference decoder (HRD), other 473 SEI messages for use in the VVC environment, which are generally 474 useful also in other video coding technologies, are not included in 475 the main VVC specification but in a companion specification [VSEI]. 477 1.1.3.
High-Level Picture Partitioning (informative) 479 VVC inherited the concept of tiles and wavefront parallel processing 480 (WPP) from HEVC, with some minor to moderate differences. The basic 481 concept of slices was kept in VVC but designed in an essentially 482 different form. VVC is the first video coding standard that includes 483 subpictures as a feature, which provides the same functionality as 484 HEVC motion-constrained tile sets (MCTSs) but designed differently to 485 have better coding efficiency and to be friendlier for usage in 486 application systems. More details of these differences are described 487 below. 489 Tiles and WPP 491 Same as in HEVC, a picture can be split into tile rows and tile 492 columns in VVC, in-picture prediction across tile boundaries is 493 disallowed, etc. However, the syntax for signaling of tile 494 partitioning has been simplified, by using a unified syntax design 495 for both the uniform and the non-uniform mode. In addition, 496 signaling of entry point offsets for tiles in the slice header is 497 optional in VVC while it is mandatory in HEVC. The WPP design in VVC 498 has two differences compared to HEVC: i) The CTU row delay is reduced 499 from two CTUs to one CTU; ii) Signaling of entry point offsets for 500 WPP in the slice header is optional in VVC while it is mandatory in 501 HEVC. 503 Slices 505 In VVC, the conventional slices based on CTUs (as in HEVC) or 506 macroblocks (as in AVC) have been removed. The main reasoning behind 507 this architectural change is as follows. The advances in video 508 coding since 2003 (the publication year of AVC v1) have been such 509 that slice-based error concealment has become practically impossible, 510 due to the ever-increasing number and efficiency of in-picture and 511 inter-picture prediction mechanisms. 
An error-concealed picture is 512 the decoding result of a transmitted coded picture for which there is 513 some data loss (e.g., loss of some slices) of the coded picture or a 514 reference picture for at least some part of the coded picture is not 515 error-free (e.g., that reference picture was an error-concealed 516 picture). For example, when one of the multiple slices of a picture 517 is lost, it may be error-concealed using an interpolation of the 518 neighboring slices. While advanced video coding prediction 519 mechanisms provide significantly higher coding efficiency, they also 520 make it harder for machines to estimate the quality of an error- 521 concealed picture, which was already a hard problem with the use of 522 simpler prediction mechanisms. Advanced in-picture prediction 523 mechanisms also cause the coding efficiency loss due to splitting a 524 picture into multiple slices to be more significant. Furthermore, 525 network conditions have become significantly better while at the same time 526 techniques for dealing with packet losses have become significantly 527 improved. As a result, very few implementations have recently used 528 slices for maximum transmission unit size matching. Instead, 529 substantially all applications where low-delay error resilience is 530 required (e.g., video telephony and video conferencing) rely on 531 system/transport-level error resilience (e.g., retransmission, 532 forward error correction) and/or picture-based error resilience tools 533 (feedback-based error resilience, insertion of IRAPs, scalability 534 with higher protection level of the base layer, and so on).
535 Considering all the above, nowadays it is very rare that a picture 536 that cannot be correctly decoded is passed to the decoder, and when 537 such a rare case occurs, the system can afford to wait for an error- 538 free picture to be decoded and available for display without 539 resulting in frequent and long periods of picture freezing seen by 540 end users. 542 Slices in VVC have two modes: rectangular slices and raster-scan 543 slices. The rectangular slice, as indicated by its name, covers a 544 rectangular region of the picture. Typically, a rectangular slice 545 consists of several complete tiles. However, it is also possible 546 that a rectangular slice is a subset of a tile and consists of one or 547 more consecutive, complete CTU rows within a tile. A raster-scan 548 slice consists of one or more complete tiles in a tile raster scan 549 order, hence the region covered by a raster-scan slice may 550 have a non-rectangular shape, although it may also happen to have 551 the shape of a rectangle. The concept of slices in VVC is therefore 552 strongly linked to or based on tiles instead of CTUs (as in HEVC) or 553 macroblocks (as in AVC). 555 Subpictures 557 VVC is the first video coding standard that includes the support of 558 subpictures as a feature. Each subpicture consists of one or more 559 complete rectangular slices that collectively cover a rectangular 560 region of the picture. A subpicture may be either specified to be 561 extractable (i.e., coded independently of other subpictures of the 562 same picture and of earlier pictures in decoding order) or not 563 extractable. Regardless of whether a subpicture is extractable or 564 not, the encoder can control whether in-loop filtering (including 565 deblocking, SAO, and ALF) is applied across the subpicture boundaries 566 individually for each subpicture. 568 Functionally, subpictures are similar to the motion-constrained tile 569 sets (MCTSs) in HEVC.
   They both allow independent coding and extraction of a rectangular
   subset of a sequence of coded pictures, for use cases like
   viewport-dependent 360-degree video streaming optimization and region
   of interest (ROI) applications.

   There are several important design differences between subpictures
   and MCTSs. First, the subpictures feature in VVC allows motion
   vectors of a coding block to point outside of the subpicture, even
   when the subpicture is extractable, by applying sample padding at
   subpicture boundaries in this case, similar to what is done at
   picture boundaries. Second, additional changes were introduced for
   the selection and derivation of motion vectors in the merge mode and
   in the decoder-side motion vector refinement process of VVC. This
   allows higher coding efficiency compared to the non-normative motion
   constraints applied at the encoder side for MCTSs. Third, rewriting
   of SHs (and PH NAL units, when present) is not needed when extracting
   one or more extractable subpictures from a sequence of pictures to
   create a sub-bitstream that is a conforming bitstream. In
   sub-bitstream extractions based on HEVC MCTSs, rewriting of SHs is
   needed. Note that in both HEVC MCTS extraction and VVC subpicture
   extraction, rewriting of SPSs and PPSs is needed. However, typically
   there are only a few parameter sets in a bitstream, while each
   picture has at least one slice; therefore, rewriting of SHs can be a
   significant burden for application systems. Fourth, slices of
   different subpictures within a picture are allowed to have different
   NAL unit types. Fifth, VVC specifies HRD and level definitions for
   subpicture sequences; thus, the conformance of the sub-bitstream of
   each extractable subpicture sequence can be ensured by encoders.

1.1.4. NAL Unit Header

   VVC maintains the NAL unit concept of HEVC with modifications. VVC
   uses a two-byte NAL unit header, as shown in Figure 1.
   The payload of a NAL unit refers to the NAL unit excluding the NAL
   unit header.

   +---------------+---------------+
   |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |F|Z|  LayerId  |  Type   | TID |
   +---------------+---------------+

   The Structure of the VVC NAL Unit Header.

   Figure 1

   The semantics of the fields in the NAL unit header are as specified
   in VVC and described briefly below for convenience. In addition to
   the name and size of each field, the corresponding syntax element
   name in VVC is also provided.

   F: 1 bit

      forbidden_zero_bit. Required to be zero in VVC. Note that the
      inclusion of this bit in the NAL unit header was to enable
      transport of VVC video over MPEG-2 transport systems (avoidance of
      start code emulations) [MPEG2S]. In the context of this memo, the
      value 1 may be used to indicate a syntax violation, e.g., for a
      NAL unit resulting from aggregating a number of fragmented units
      of a NAL unit but missing the last fragment, as described in the
      last sentence of Section 4.3.3.

   Z: 1 bit

      nuh_reserved_zero_bit. Required to be zero in VVC, and reserved
      for future extensions by ITU-T and ISO/IEC.

      This memo does not overload the "Z" bit for local extensions, as
      a) overloading the "F" bit is sufficient and b) to preserve the
      usefulness of this memo to possible future versions of [VVC].

   LayerId: 6 bits

      nuh_layer_id. Identifies the layer a NAL unit belongs to, wherein
      a layer may be, e.g., a spatial scalable layer, a quality scalable
      layer, a layer containing a different view, etc.

   Type: 5 bits

      nal_unit_type. This field specifies the NAL unit type as defined
      in Table 5 of [VVC]. For a reference of all currently defined NAL
      unit types and their semantics, please refer to Section 7.4.2.2 in
      [VVC].

   TID: 3 bits

      nuh_temporal_id_plus1. This field specifies the temporal
      identifier of the NAL unit plus 1.
      The value of TemporalId is equal to TID minus 1. A TID value of 0
      is illegal to ensure that there is at least one bit in the NAL
      unit header equal to 1, so as to enable the consideration of start
      code emulations in the NAL unit payload data independent of the
      NAL unit header.

1.2. Overview of the Payload Format

   This payload format defines the following processes required for
   transport of VVC coded data over RTP [RFC3550]:

   * Usage of RTP header with this payload format

   * Packetization of VVC coded NAL units into RTP packets using three
     types of payload structures: a single NAL unit packet, an
     aggregation packet, and a fragmentation unit

   * Transmission of VVC NAL units of the same bitstream within a
     single RTP stream

   * Media type parameters to be used with the Session Description
     Protocol (SDP) [RFC4566]

   * Usage of RTCP feedback messages

2. Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown above.

3. Definitions and Abbreviations

3.1. Definitions

   This document uses the terms and definitions of VVC. Section 3.1.1
   lists relevant definitions from [VVC] for convenience. Section 3.1.2
   provides definitions specific to this memo. All terms and definitions
   from VVC used in this memo are verbatim copies from the [VVC]
   specification.

3.1.1. Definitions from the VVC Specification

   Access unit (AU): A set of PUs that belong to different layers and
   contain coded pictures associated with the same time for output from
   the DPB.

   Adaptation parameter set (APS): A syntax structure containing syntax
   elements that apply to zero or more slices as determined by zero or
   more syntax elements found in slice headers.

   Bitstream: A sequence of bits, in the form of a NAL unit stream or a
   byte stream, that forms the representation of a sequence of AUs
   forming one or more coded video sequences (CVSs).

   Coded picture: A coded representation of a picture comprising VCL NAL
   units with a particular value of nuh_layer_id within an AU and
   containing all CTUs of the picture.

   Clean random access (CRA) PU: A PU in which the coded picture is a
   CRA picture.

   Clean random access (CRA) picture: An IRAP picture for which each VCL
   NAL unit has nal_unit_type equal to CRA_NUT.

   Coded video sequence (CVS): A sequence of AUs that consists, in
   decoding order, of a CVSS AU, followed by zero or more AUs that are
   not CVSS AUs, including all subsequent AUs up to but not including
   any subsequent AU that is a CVSS AU.

   Coded video sequence start (CVSS) AU: An AU in which there is a PU
   for each layer in the CVS and the coded picture in each PU is a CLVSS
   picture.

   Coded layer video sequence (CLVS): A sequence of PUs with the same
   value of nuh_layer_id that consists, in decoding order, of a CLVSS
   PU, followed by zero or more PUs that are not CLVSS PUs, including
   all subsequent PUs up to but not including any subsequent PU that is
   a CLVSS PU.

   Coded layer video sequence start (CLVSS) PU: A PU in which the coded
   picture is a CLVSS picture.

   Coded layer video sequence start (CLVSS) picture: A coded picture
   that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or
   a GDR picture with NoOutputBeforeRecoveryFlag equal to 1.

   Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs
   of chroma samples of a picture that has three sample arrays, or a CTB
   of samples of a monochrome picture or a picture that is coded using
   three separate colour planes and syntax structures used to code the
   samples.

   Decoding Capability Information (DCI): A syntax structure containing
   syntax elements that apply to the entire bitstream.

   Decoded picture buffer (DPB): A buffer holding decoded pictures for
   reference, output reordering, or output delay specified for the
   hypothetical reference decoder.

   Gradual decoding refresh (GDR) picture: A picture for which each VCL
   NAL unit has nal_unit_type equal to GDR_NUT.

   Instantaneous decoding refresh (IDR) PU: A PU in which the coded
   picture is an IDR picture.

   Instantaneous decoding refresh (IDR) picture: An IRAP picture for
   which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or
   IDR_N_LP.

   Intra random access point (IRAP) AU: An AU in which there is a PU for
   each layer in the CVS and the coded picture in each PU is an IRAP
   picture.

   Intra random access point (IRAP) PU: A PU in which the coded picture
   is an IRAP picture.

   Intra random access point (IRAP) picture: A coded picture for which
   all VCL NAL units have the same value of nal_unit_type in the range
   of IDR_W_RADL to CRA_NUT, inclusive.

   Layer: A set of VCL NAL units that all have a particular value of
   nuh_layer_id and the associated non-VCL NAL units.

   Network abstraction layer (NAL) unit: A syntax structure containing
   an indication of the type of data to follow and bytes containing that
   data in the form of an RBSP interspersed as necessary with emulation
   prevention bytes.

   Network abstraction layer (NAL) unit stream: A sequence of NAL units.

   Operation point (OP): A temporal subset of an OLS, identified by an
   OLS index and a highest value of TemporalId.

   Picture parameter set (PPS): A syntax structure containing syntax
   elements that apply to zero or more entire coded pictures as
   determined by a syntax element found in each slice header.

   Picture unit (PU): A set of NAL units that are associated with each
   other according to a specified classification rule, are consecutive
   in decoding order, and contain exactly one coded picture.

   Random access: The act of starting the decoding process for a
   bitstream at a point other than the beginning of the stream.

   Sequence parameter set (SPS): A syntax structure containing syntax
   elements that apply to zero or more entire CLVSs as determined by the
   content of a syntax element found in the PPS referred to by a syntax
   element found in each picture header.

   Slice: An integer number of complete tiles or an integer number of
   consecutive complete CTU rows within a tile of a picture that are
   exclusively contained in a single NAL unit.

   Slice header (SH): A part of a coded slice containing the data
   elements pertaining to all tiles or CTU rows within a tile
   represented in the slice.

   Sublayer: A temporal scalable layer of a temporal scalable bitstream
   consisting of VCL NAL units with a particular value of the TemporalId
   variable, and the associated non-VCL NAL units.

   Subpicture: A rectangular region of one or more slices within a
   picture.

   Sublayer representation: A subset of the bitstream consisting of NAL
   units of a particular sublayer and the lower sublayers.

   Tile: A rectangular region of CTUs within a particular tile column
   and a particular tile row in a picture.

   Tile column: A rectangular region of CTUs having a height equal to
   the height of the picture and a width specified by syntax elements in
   the picture parameter set.

   Tile row: A rectangular region of CTUs having a height specified by
   syntax elements in the picture parameter set and a width equal to the
   width of the picture.

   Video coding layer (VCL) NAL unit: A collective term for coded slice
   NAL units and the subset of NAL units that have reserved values of
   nal_unit_type that are classified as VCL NAL units in this
   Specification.

3.1.2. Definitions Specific to This Memo

   Media-Aware Network Element (MANE): A network element, such as a
   middlebox, selective forwarding unit, or application-layer gateway
   that is capable of parsing certain aspects of the RTP payload headers
   or the RTP payload and reacting to their contents.

      Informative note: The concept of a MANE goes beyond normal routers
      or gateways in that a MANE has to be aware of the signaling (e.g.,
      to learn about the payload type mappings of the media streams),
      and in that it has to be trusted when working with Secure RTP
      (SRTP). The advantage of using MANEs is that they allow packets
      to be dropped according to the needs of the media coding. For
      example, if a MANE has to drop packets due to congestion on a
      certain link, it can identify and remove those packets whose
      elimination produces the least adverse effect on the user
      experience. After dropping packets, MANEs must rewrite RTCP
      packets to match the changes to the RTP stream, as specified in
      Section 7 of [RFC3550].

   NAL unit decoding order: A NAL unit order that conforms to the
   constraints on NAL unit order given in Section 7.4.2.4 in [VVC] and
   follows the order of NAL units in the bitstream.

   RTP stream (See [RFC7656]): Within the scope of this memo, one RTP
   stream is utilized to transport a VVC bitstream, which may contain
   one or more layers, and each layer may contain one or more temporal
   sublayers.

   Transmission order: The order of packets in ascending RTP sequence
   number order (in modulo arithmetic). Within an aggregation packet,
   the NAL unit transmission order is the same as the order of
   appearance of NAL units in the packet.

3.2. Abbreviations

   AU     Access Unit

   AP     Aggregation Packet

   APS    Adaptation Parameter Set

   CTU    Coding Tree Unit

   CVS    Coded Video Sequence

   DPB    Decoded Picture Buffer

   DCI    Decoding Capability Information

   DON    Decoding Order Number

   FIR    Full Intra Request

   FU     Fragmentation Unit

   GDR    Gradual Decoding Refresh

   HRD    Hypothetical Reference Decoder

   IDR    Instantaneous Decoding Refresh

   MANE   Media-Aware Network Element

   MTU    Maximum Transmission Unit

   NAL    Network Abstraction Layer

   NALU   Network Abstraction Layer Unit

   PLI    Picture Loss Indication

   PPS    Picture Parameter Set

   RPS    Reference Picture Set

   RPSI   Reference Picture Selection Indication

   SEI    Supplemental Enhancement Information

   SLI    Slice Loss Indication

   SPS    Sequence Parameter Set

   VCL    Video Coding Layer

   VPS    Video Parameter Set

4. RTP Payload Format

4.1. RTP Header Usage

   The format of the RTP header is specified in [RFC3550] (reprinted as
   Figure 2 for convenience). This payload format uses the fields of
   the header in a manner consistent with that specification.

   The RTP payload (and the settings for some RTP header bits) for
   aggregation packets and fragmentation units are specified in
   Section 4.3.2 and Section 4.3.3, respectively.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |            contributing source (CSRC) identifiers             |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   RTP Header According to [RFC3550]

   Figure 2

   The RTP header information to be set according to this RTP payload
   format is set as follows:

   Marker bit (M): 1 bit

      Set for the last packet, in transmission order, among each set of
      packets that contain NAL units of one access unit. This is in
      line with the normal use of the M bit in video formats to allow
      efficient playout buffer handling.

   Payload Type (PT): 7 bits

      The assignment of an RTP payload type for this new packet format
      is outside the scope of this document and will not be specified
      here. The assignment of a payload type has to be performed either
      through the profile used or in a dynamic way.

   Sequence Number (SN): 16 bits

      Set and used in accordance with [RFC3550].

   Timestamp: 32 bits

      The RTP timestamp is set to the sampling timestamp of the content.
      A 90 kHz clock rate MUST be used. If the NAL unit has no timing
      properties of its own (e.g., parameter set and SEI NAL units), the
      RTP timestamp MUST be set to the RTP timestamp of the coded
      pictures of the access unit in which the NAL unit (according to
      Section 7.4.2.4 of [VVC]) is included. Receivers MUST use the RTP
      timestamp for the display process, even when the bitstream
      contains picture timing SEI messages or decoding unit information
      SEI messages as specified in [VVC].

         Informative note: When picture timing SEI messages are present,
         the RTP sender is responsible for ensuring that the RTP
         timestamps are consistent with the timing information carried
         in the picture timing SEI messages.

   Synchronization source (SSRC): 32 bits

      Used to identify the source of the RTP packets. A single SSRC is
      used for all parts of a single bitstream.

4.2. Payload Header Usage

   The first two bytes of the payload of an RTP packet are referred to
   as the payload header.
   The payload header consists of the same fields (F, Z, LayerId, Type,
   and TID) as the NAL unit header as shown in Section 1.1.4,
   irrespective of the type of the payload structure.

   The TID value indicates (among other things) the relative importance
   of an RTP packet, for example, because NAL units belonging to higher
   temporal sublayers are not used for the decoding of lower temporal
   sublayers. A lower value of TID indicates a higher importance.
   More-important NAL units MAY be better protected against transmission
   losses than less-important NAL units.

4.3. Payload Structures

   Three different types of RTP packet payload structures are specified.
   A receiver can identify the type of an RTP packet payload through the
   Type field in the payload header.

   The three different payload structures are as follows:

   * Single NAL unit packet: Contains a single NAL unit in the payload,
     and the NAL unit header of the NAL unit also serves as the payload
     header. This payload structure is specified in Section 4.3.1.

   * Aggregation Packet (AP): Contains more than one NAL unit within
     one access unit. This payload structure is specified in
     Section 4.3.2.

   * Fragmentation Unit (FU): Contains a subset of a single NAL unit.
     This payload structure is specified in Section 4.3.3.

4.3.1. Single NAL Unit Packets

   A single NAL unit packet contains exactly one NAL unit, and consists
   of a payload header (denoted as PayloadHdr), a conditional 16-bit
   DONL field (in network byte order), and the NAL unit payload data
   (the NAL unit excluding its NAL unit header) of the contained NAL
   unit, as shown in Figure 3.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           PayloadHdr          |      DONL (conditional)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                  NAL unit payload data                        |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The Structure of a Single NAL Unit Packet

   Figure 3

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the contained NAL
   unit. If sprop-max-don-diff is greater than 0, the DONL field MUST
   be present, and the variable DON for the contained NAL unit is
   derived as equal to the value of the DONL field. Otherwise
   (sprop-max-don-diff is equal to 0), the DONL field MUST NOT be
   present.

4.3.2. Aggregation Packets (APs)

   Aggregation Packets (APs) can reduce packetization overhead for small
   NAL units, such as most of the non-VCL NAL units, which are often
   only a few octets in size.

   An AP aggregates NAL units of one access unit and it can only contain
   NAL units from one AU. Each NAL unit to be carried in an AP is
   encapsulated in an aggregation unit. NAL units aggregated in one AP
   are included in NAL unit decoding order.

   An AP consists of a payload header (denoted as PayloadHdr) followed
   by two or more aggregation units, as shown in Figure 4.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    PayloadHdr (Type=28)       |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                                                               |
   |             two or more aggregation units                     |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The Structure of an Aggregation Packet

   Figure 4

   The fields in the payload header of an AP are set as follows. The F
   bit MUST be equal to 0 if the F bit of each aggregated NAL unit is
   equal to zero; otherwise, it MUST be equal to 1. The Type field MUST
   be equal to 28.

   The value of LayerId MUST be equal to the lowest value of LayerId of
   all the aggregated NAL units. The value of TID MUST be the lowest
   value of TID of all the aggregated NAL units.

      Informative note: All VCL NAL units in an AP have the same TID
      value since they belong to the same access unit. However, an AP
      may contain non-VCL NAL units for which the TID value in the NAL
      unit header may be different than the TID value of the VCL NAL
      units in the same AP.

      Informative note: If a system envisions subpicture-level or
      picture-level modifications, for example by removing subpictures
      or pictures of a particular layer, a good design choice on the
      sender's side would be to aggregate NAL units belonging to only
      the same subpicture or picture of a particular layer.

   An AP MUST carry at least two aggregation units and can carry as many
   aggregation units as necessary; however, the total amount of data in
   an AP obviously MUST fit into an IP packet, and the size SHOULD be
   chosen so that the resulting IP packet is smaller than the MTU size
   so as to avoid IP layer fragmentation. An AP MUST NOT contain FUs
   specified in Section 4.3.3.
   APs MUST NOT be nested; i.e., an AP cannot contain another AP.

   The first aggregation unit in an AP consists of a conditional 16-bit
   DONL field (in network byte order) followed by a 16-bit unsigned size
   field (in network byte order) that indicates the size of the NAL unit
   in bytes (excluding these two octets, but including the NAL unit
   header), followed by the NAL unit itself, including its NAL unit
   header, as shown in Figure 5.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               :       DONL (conditional)      |   NALU size   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   NALU size   |                                               |
   +-+-+-+-+-+-+-+-+         NAL unit                              |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The Structure of the First Aggregation Unit in an AP

   Figure 5

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the aggregated NAL
   unit.

   If sprop-max-don-diff is greater than 0, the DONL field MUST be
   present in an aggregation unit that is the first aggregation unit in
   an AP, and the variable DON for the aggregated NAL unit is derived as
   equal to the value of the DONL field. The variable DON for a NAL
   unit carried in an aggregation unit that is not the first aggregation
   unit in the AP is derived as equal to the DON of the preceding
   aggregated NAL unit in the same AP plus 1 modulo 65536. Otherwise
   (sprop-max-don-diff is equal to 0), the DONL field MUST NOT be
   present in an aggregation unit that is the first aggregation unit in
   an AP.
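
   The following Python sketch is illustrative only and is not part of
   the payload format. It shows one way a receiver might walk the
   aggregation units of an AP and derive the DON values described above.
   The function names, the dictionary layout, and the assumption that
   RTP padding has already been removed from the payload are choices
   made for this example. The layout of aggregation units after the
   first one (a size field followed by the NAL unit) is taken from
   Figure 6.

```python
import struct

AP_TYPE = 28  # Type value that marks an aggregation packet


def parse_payload_header(hdr):
    """Split the two-byte payload header (same layout as Figure 1)."""
    if len(hdr) < 2:
        raise ValueError("payload header needs two bytes")
    return {
        "F": hdr[0] >> 7,
        "Z": (hdr[0] >> 6) & 0x01,
        "LayerId": hdr[0] & 0x3F,
        "Type": hdr[1] >> 3,
        "TID": hdr[1] & 0x07,
    }


def parse_ap(payload, sprop_max_don_diff):
    """Return [(DON or None, NAL unit bytes), ...] for one AP payload.

    Assumption of this sketch: RTP padding was stripped beforehand.
    """
    if parse_payload_header(payload)["Type"] != AP_TYPE:
        raise ValueError("not an aggregation packet")
    pos, don, units = 2, None, []
    if sprop_max_don_diff > 0:
        # DONL is carried only by the first aggregation unit.
        if len(payload) < pos + 2:
            raise ValueError("missing DONL field")
        (don,) = struct.unpack_from("!H", payload, pos)
        pos += 2
    while pos < len(payload):
        if len(payload) < pos + 2:
            raise ValueError("truncated NALU size field")
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        nalu = payload[pos:pos + size]
        # The size covers the two-byte NAL unit header, so it is >= 2.
        if size < 2 or len(nalu) != size:
            raise ValueError("truncated or malformed aggregation unit")
        units.append((don, nalu))
        pos += size
        if don is not None:
            # Each following NAL unit: previous DON plus 1 modulo 65536.
            don = (don + 1) % 65536
    if len(units) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    return units
```

   With sprop-max-don-diff equal to 0, the sketch returns None in place
   of each DON, since no DONL field is present in that case.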

   An aggregation unit that is not the first aggregation unit in an AP
   consists of a 16-bit unsigned size field (in network byte order) that
   indicates the size of the NAL unit in bytes (excluding these two
   octets, but including the NAL unit header), followed by the NAL unit
   itself, including its NAL unit header, as shown in Figure 6.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               :          NALU size            |   NAL unit    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The Structure of an Aggregation Unit That Is Not the First
   Aggregation Unit in an AP

   Figure 6

   Figure 7 presents an example of an AP that contains two aggregation
   units, labeled as 1 and 2 in the figure, without the DONL field being
   present.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   PayloadHdr (Type=28)        |         NALU 1 Size           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          NALU 1 HDR           |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
   |                   . . .                                       |
   |                                                               |
   +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  . . .        | NALU 2 Size                   | NALU 2 HDR    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | NALU 2 HDR    |                                               |
   +-+-+-+-+-+-+-+-+              NALU 2 Data                      |
   |                   . . .                                       |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   An Example of an AP Containing
   Two Aggregation Units without the DONL Field

   Figure 7

   Figure 8 presents an example of an AP that contains two aggregation
   units, labeled as 1 and 2 in the figure, with the DONL field being
   present.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   PayloadHdr (Type=28)        |        NALU 1 DONL            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          NALU 1 Size          |            NALU 1 HDR         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                 NALU 1 Data   . . .                           |
   |                                                               |
   +        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :          NALU 2 Size          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          NALU 2 HDR           |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
   |                                                               |
   |        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   An Example of an AP Containing
   Two Aggregation Units with the DONL Field

   Figure 8

4.3.3. Fragmentation Units

   Fragmentation Units (FUs) are introduced to enable fragmenting a
   single NAL unit into multiple RTP packets, possibly without
   cooperation or knowledge of the [VVC] encoder. A fragment of a NAL
   unit consists of an integer number of consecutive octets of that NAL
   unit. Fragments of the same NAL unit MUST be sent in consecutive
   order with ascending RTP sequence numbers (with no other RTP packets
   within the same RTP stream being sent between the first and last
   fragment).
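
   The following Python sketch is illustrative only and is not part of
   the payload format. It shows how a sender might split one NAL unit
   into FU payloads, using the FU layout and FU header fields (S, E, P,
   FuType) given in Figure 9 and Figure 10 later in this section. The
   function name and its parameters are assumptions of this example; in
   particular, max_chunk counts only NAL unit payload bytes per FU and
   does not include the payload header, the FU header, or the DONL
   field.

```python
import struct

FU_TYPE = 29  # Type value that marks a fragmentation unit


def fragment_nal_unit(nalu, max_chunk, don=None, last_vcl_of_picture=False):
    """Split one NAL unit into a list of FU payloads (bytes objects).

    don: DON of the NAL unit, or None when sprop-max-don-diff is 0.
    last_vcl_of_picture: True if this is the last VCL NAL unit of a
    coded picture, so that the P bit is set on the final FU.
    """
    header, body = nalu[:2], nalu[2:]
    if max_chunk < 1:
        raise ValueError("max_chunk must be positive")
    if len(body) <= max_chunk:
        # One FU would need S and E both set, which is not allowed.
        raise ValueError("fits in one packet; use a single NAL unit packet")
    # PayloadHdr: first byte (F, Z, LayerId) and TID are copied from the
    # NAL unit header; the Type field is replaced by 29.
    payload_hdr = bytes([header[0], (FU_TYPE << 3) | (header[1] & 0x07)])
    nal_type = header[1] >> 3  # carried in FuType
    chunks = [body[i:i + max_chunk] for i in range(0, len(body), max_chunk)]
    fus = []
    for i, chunk in enumerate(chunks):
        s = 1 if i == 0 else 0
        e = 1 if i == len(chunks) - 1 else 0
        p = 1 if (e and last_vcl_of_picture) else 0
        fu_header = bytes([(s << 7) | (e << 6) | (p << 5) | nal_type])
        # DONL appears only in the first FU, and only when DON is in use.
        donl = struct.pack("!H", don % 65536) if (s and don is not None) else b""
        fus.append(payload_hdr + fu_header + donl + chunk)
    return fus
```

   The returned payloads are meant to be sent in list order in RTP
   packets with consecutive sequence numbers, as required above.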

   When a NAL unit is fragmented and conveyed within FUs, it is referred
   to as a fragmented NAL unit. APs MUST NOT be fragmented. FUs MUST
   NOT be nested; i.e., an FU cannot contain a subset of another FU.

   The RTP timestamp of an RTP packet carrying an FU is set to the
   NALU-time of the fragmented NAL unit.

   An FU consists of a payload header (denoted as PayloadHdr), an FU
   header of one octet, a conditional 16-bit DONL field (in network byte
   order), and an FU payload, as shown in Figure 9.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   PayloadHdr (Type=29)        |   FU header   | DONL (cond)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-|
   |   DONL (cond) |                                               |
   |-+-+-+-+-+-+-+-+                                               |
   |                         FU payload                            |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The Structure of an FU

   Figure 9

   The fields in the payload header are set as follows. The Type field
   MUST be equal to 29. The fields F, LayerId, and TID MUST be equal to
   the fields F, LayerId, and TID, respectively, of the fragmented NAL
   unit.

   The FU header consists of an S bit, an E bit, a P bit, and a 5-bit
   FuType field, as shown in Figure 10.

   +---------------+
   |0|1|2|3|4|5|6|7|
   +-+-+-+-+-+-+-+-+
   |S|E|P|  FuType |
   +---------------+

   The Structure of FU Header

   Figure 10

   The semantics of the FU header fields are as follows:

   S: 1 bit

      When set to 1, the S bit indicates the start of a fragmented NAL
      unit, i.e., the first byte of the FU payload is also the first
      byte of the payload of the fragmented NAL unit. When the FU
      payload is not the start of the fragmented NAL unit payload, the S
      bit MUST be set to 0.

   E: 1 bit

      When set to 1, the E bit indicates the end of a fragmented NAL
      unit, i.e., the last byte of the payload is also the last byte of
      the fragmented NAL unit. When the FU payload is not the last
      fragment of a fragmented NAL unit, the E bit MUST be set to 0.

   P: 1 bit

      When set to 1, the P bit indicates the last FU of the last VCL NAL
      unit of a coded picture, i.e., the last byte of the FU payload is
      also the last byte of the last VCL NAL unit of the coded picture.
      When the FU payload is not the last fragment of the last VCL NAL
      unit of a coded picture, the P bit MUST be set to 0.

   FuType: 5 bits

      The field FuType MUST be equal to the field Type of the fragmented
      NAL unit.

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the fragmented NAL
   unit.

   If sprop-max-don-diff is greater than 0, and the S bit is equal to 1,
   the DONL field MUST be present in the FU, and the variable DON for
   the fragmented NAL unit is derived as equal to the value of the DONL
   field. Otherwise (sprop-max-don-diff is equal to 0, or the S bit is
   equal to 0), the DONL field MUST NOT be present in the FU.

   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
   the S bit and the E bit must not both be set to 1 in the same FU
   header.

   The FU payload consists of fragments of the payload of the fragmented
   NAL unit so that if the FU payloads of consecutive FUs, starting with
   an FU with the S bit equal to 1 and ending with an FU with the E bit
   equal to 1, are sequentially concatenated, the payload of the
   fragmented NAL unit can be reconstructed.
   The NAL unit header of the fragmented NAL unit is not included as
   such in the FU payload, but rather the information of the NAL unit
   header of the fragmented NAL unit is conveyed in the F, LayerId, and
   TID fields of the FU payload headers of the FUs and the FuType field
   of the FU header of the FUs. An FU payload MUST NOT be empty.

   If an FU is lost, the receiver SHOULD discard all following
   fragmentation units in transmission order corresponding to the same
   fragmented NAL unit, unless the decoder in the receiver is known to
   be prepared to gracefully handle incomplete NAL units.

   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
   fragments of a NAL unit to an (incomplete) NAL unit, even if fragment
   n of that NAL unit is not received. In this case, the
   forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a
   syntax violation.

4.4. Decoding Order Number

   For each NAL unit, the variable AbsDon is derived, representing the
   decoding order number that is indicative of the NAL unit decoding
   order.

   Let NAL unit n be the n-th NAL unit in transmission order within an
   RTP stream.

   If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon
   for NAL unit n, is derived as equal to n.

   Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is
   derived as follows, where DON[n] is the value of the variable DON for
   NAL unit n:

   * If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in
     transmission order), AbsDon[0] is set equal to DON[0].
1383 * Otherwise (n is greater than 0), the following applies for 1384 derivation of AbsDon[n]: 1386 If DON[n] == DON[n-1], 1387 AbsDon[n] = AbsDon[n-1] 1389 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768), 1390 AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1] 1392 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768), 1393 AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n] 1395 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768), 1396 AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - 1397 DON[n]) 1399 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768), 1400 AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n]) 1402 For any two NAL units m and n, the following applies: 1404 * AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows 1405 NAL unit m in NAL unit decoding order. 1407 * When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order 1408 of the two NAL units can be in either order. 1410 * AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes 1411 NAL unit m in decoding order. 1413 Informative note: When two consecutive NAL units in the NAL 1414 unit decoding order have different values of AbsDon, the 1415 absolute difference between the two AbsDon values may be 1416 greater than or equal to 1. 1418 Informative note: There are multiple reasons to allow for the 1419 absolute difference of the values of AbsDon for two consecutive 1420 NAL units in the NAL unit decoding order to be greater than 1421 one. An increment by one is not required, as at the time of 1422 associating values of AbsDon to NAL units, it may not be known 1423 whether all NAL units are to be delivered to the receiver. For 1424 example, a gateway might not forward VCL NAL units of higher 1425 sublayers or some SEI NAL units when there is congestion in the 1426 network. 
In another example, the first intra-coded picture of 1427 a pre-encoded clip is transmitted in advance to ensure that it 1428 is readily available in the receiver, and when transmitting the 1429 first intra-coded picture, the originator does not exactly know 1430 how many NAL units will be encoded before the first intra-coded 1431 picture of the pre-encoded clip follows in decoding order. 1432 Thus, the values of AbsDon for the NAL units of the first 1433 intra-coded picture of the pre-encoded clip have to be 1434 estimated when they are transmitted, and gaps in values of 1435 AbsDon may occur. 1437 5. Packetization Rules 1439 The following packetization rules apply: 1441 * If sprop-max-don-diff is greater than 0, the transmission order of 1442 NAL units carried in the RTP stream MAY be different than the NAL 1443 unit decoding order. Otherwise (sprop-max-don-diff is equal to 1444 0), the transmission order of NAL units carried in the RTP stream 1445 MUST be the same as the NAL unit decoding order. 1447 * A NAL unit of a small size SHOULD be encapsulated in an 1448 aggregation packet together with one or more other NAL units in order 1449 to avoid the unnecessary packetization overhead for small NAL 1450 units. For example, non-VCL NAL units such as access unit 1451 delimiters, parameter sets, or SEI NAL units are typically small 1452 and can often be aggregated with VCL NAL units without violating 1453 MTU size constraints. 1455 * Each non-VCL NAL unit SHOULD, when possible from an MTU size match 1456 viewpoint, be encapsulated in an aggregation packet together with 1457 its associated VCL NAL unit, as typically a non-VCL NAL unit would 1458 be meaningless without the associated VCL NAL unit being 1459 available. 1461 * For carrying exactly one NAL unit in an RTP packet, a single NAL 1462 unit packet MUST be used. 1464 6.
De-packetization Process 1466 The general concept behind de-packetization is to get the NAL units 1467 out of the RTP packets in an RTP stream and pass them to the decoder 1468 in the NAL unit decoding order. 1470 The de-packetization process is implementation dependent. Therefore, 1471 the following description should be seen as an example of a suitable 1472 implementation. Other schemes may be used as well, as long as the 1473 output for the same input is the same as the process described below. 1474 The output is the same when the set of output NAL units and their 1475 order are both identical. Optimizations relative to the described 1476 algorithms are possible. 1478 All normal RTP mechanisms related to buffer management apply. In 1479 particular, duplicated or outdated RTP packets (as indicated by the 1480 RTP sequence number and the RTP timestamp) are removed. To 1481 determine the exact time for decoding, factors such as a possible 1482 intentional delay to allow for proper inter-stream synchronization 1483 MUST be factored in. 1485 NAL units with NAL unit type values in the range of 0 to 27, 1486 inclusive, may be passed to the decoder. NAL-unit-like structures 1487 with NAL unit type values in the range of 28 to 31, inclusive, MUST 1488 NOT be passed to the decoder. 1490 The receiver includes a receiver buffer, which is used to compensate 1491 for transmission delay jitter within an individual RTP stream, and to 1492 reorder NAL units from transmission order to the NAL unit decoding 1493 order. In this section, the receiver operation is described under 1494 the assumption that there is no transmission delay jitter within an 1495 RTP stream. To make a difference from a practical receiver buffer 1496 that is also used for compensation of transmission delay jitter, the 1497 receiver buffer is hereafter called the de-packetization buffer in 1498 this section.
Receivers should also prepare for transmission delay 1499 jitter; that is, either reserve separate buffers for transmission 1500 delay jitter buffering and de-packetization buffering or use a 1501 receiver buffer for both transmission delay jitter and de- 1502 packetization. Moreover, receivers should take transmission delay 1503 jitter into account in the buffering operation, e.g., by additional 1504 initial buffering before starting of decoding and playback. 1506 The de-packetization process extracts the NAL units from the RTP 1507 packets in an RTP stream as follows. When an RTP packet carries a 1508 single NAL unit packet, the payload of the RTP packet is extracted as 1509 a single NAL unit, excluding the DONL field, i.e., the third and fourth 1510 bytes, when sprop-max-don-diff is greater than 0. When an RTP packet 1511 carries an Aggregation Packet, several NAL units are extracted from 1512 the payload of the RTP packet. In this case, each NAL unit 1513 corresponds to the part of the payload of each aggregation unit that 1514 follows the NALU size field as described in Section 4.3.2. When an 1515 RTP packet carries a Fragmentation Unit (FU), all RTP packets from 1516 the first FU (with the S field equal to 1) of the fragmented NAL unit 1517 up to the last FU (with the E field equal to 1) of the fragmented NAL 1518 unit are collected. The NAL unit is extracted from these RTP packets 1519 by concatenating all FU payloads in the same order as the 1520 corresponding RTP packets and prepending the NAL unit header with the 1521 fields F, LayerId, and TID set equal to the values of the fields 1522 F, LayerId, and TID in the payload header of the FUs, respectively, 1523 and with the NAL unit type set equal to the value of the field FuType 1524 in the FU header of the FUs, as described in Section 4.3.3.
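The FU case of the extraction described above can be sketched as follows. This is a minimal illustration, not a reference implementation. The function name reassemble_fu and its arguments are hypothetical. The byte layout is an assumption taken from the FU definition of this payload format: a two-byte payload header laid out like the NAL unit header (F, Z, LayerId in the first byte; Type and TID in the second), followed by a one-byte FU header with S, E, and P in the three most significant bits and FuType in the five least significant bits, followed by a two-byte DONL field only in the first FU and only when sprop-max-don-diff is greater than 0. Loss detection across RTP sequence numbers is out of scope here.

```python
def reassemble_fu(fu_packets, donl_present=False):
    """Rebuild one NAL unit from the RTP payloads of its FUs.

    fu_packets: list of bytes objects, one per FU, in RTP order.
    donl_present: True when sprop-max-don-diff > 0 (assumed known
    from signaling), in which case the first FU carries a DONL.
    """
    if len(fu_packets) < 2:
        # S and E cannot both be set in one FU, so at least two
        # fragments are needed for a complete NAL unit.
        raise ValueError("a fragmented NAL unit needs at least two FUs")

    nal_unit = bytearray()
    last = len(fu_packets) - 1
    for index, packet in enumerate(fu_packets):
        fu_header = packet[2]
        s_bit = bool(fu_header & 0x80)
        e_bit = bool(fu_header & 0x40)
        fu_type = fu_header & 0x1F

        # S only on the first fragment, E only on the last one.
        if s_bit != (index == 0) or e_bit != (index == last):
            raise ValueError("unexpected S/E bits in fragment %d" % index)

        offset = 3
        if s_bit and donl_present:
            offset += 2  # skip the DONL field of the first FU

        fu_payload = packet[offset:]
        if not fu_payload:
            raise ValueError("an FU payload must not be empty")

        if index == 0:
            # Rebuild the two-byte NAL unit header: F, Z, LayerId are
            # copied from the payload header; the type comes from
            # FuType; TID is the low three bits of the second byte.
            nal_unit.append(packet[0])
            nal_unit.append((fu_type << 3) | (packet[1] & 0x07))
        nal_unit += fu_payload
    return bytes(nal_unit)
```

In this sketch the rebuilt header is placed in front of the concatenated FU payloads, so the result can be handed to a decoder like any NAL unit received in a single NAL unit packet.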
1526 When sprop-max-don-diff is equal to 0, the de-packetization buffer 1527 size is zero bytes, and the NAL units carried in the single RTP 1528 stream are directly passed to the decoder in their transmission 1529 order, which is identical to their decoding order. 1531 When sprop-max-don-diff is greater than 0, the process described in 1532 the remainder of this section applies. 1534 There are two buffering states in the receiver: initial buffering and 1535 buffering while playing. Initial buffering starts when the reception 1536 is initialized. After initial buffering, decoding and playback are 1537 started, and the buffering-while-playing mode is used. 1539 Regardless of the buffering state, the receiver stores incoming NAL 1540 units in reception order into the de-packetization buffer. NAL units 1541 carried in RTP packets are stored in the de-packetization buffer 1542 individually, and the value of AbsDon is calculated and stored for 1543 each NAL unit. 1545 Initial buffering lasts until the difference between the greatest and 1546 smallest AbsDon values of the NAL units in the de-packetization 1547 buffer is greater than or equal to the value of sprop-max-don-diff. 1549 After initial buffering, whenever the difference between the greatest 1550 and smallest AbsDon values of the NAL units in the de-packetization 1551 buffer is greater than or equal to the value of sprop-max-don-diff, 1552 the following operation is repeatedly applied until this difference 1553 is smaller than sprop-max-don-diff: 1555 * The NAL unit in the de-packetization buffer with the smallest 1556 value of AbsDon is removed from the de-packetization buffer and 1557 passed to the decoder. 1559 When no more NAL units are flowing into the de-packetization buffer, 1560 all NAL units remaining in the de-packetization buffer are removed 1561 from the buffer and passed to the decoder in the order of increasing 1562 AbsDon values. 1564 7. 
Payload Format Parameters 1566 This section specifies the optional parameters. A mapping of the 1567 parameters with Session Description Protocol (SDP) [RFC4566] is also 1568 provided for applications that use SDP. 1570 7.1. Media Type Registration 1572 The receiver MUST ignore any parameter unspecified in this memo. 1574 Type name: video 1576 Subtype name: H266 1578 Required parameters: none 1580 Optional parameters: 1582 profile-id, tier-flag, sub-profile-id, interop-constraints, and 1583 level-id: 1585 These parameters indicate the profile, tier, default level, 1586 sub-profile, and some constraints of the bitstream carried by 1587 the RTP stream, or a specific set of the profile, tier, default 1588 level, sub-profile and some constraints the receiver supports. 1590 The subset of coding tools that may have been used to generate 1591 the bitstream or that the receiver supports, as well as some 1592 additional constraints are indicated collectively by profile- 1593 id, sub-profile-id, and interop-constraints. 1595 Informative note: There are 128 values of profile-id. The 1596 subset of coding tools identified by the profile-id can be 1597 further constrained with up to 255 instances of sub-profile- 1598 id. In addition, 68 bits included in interop-constraints, 1599 which can be extended up to 324 bits, provide means to 1600 further restrict tools from existing profiles. To be able 1601 to support this fine-granular signalling of coding tool 1602 subsets with profile-id, sub-profile-id and interop- 1603 constraints, it would be safe to require symmetric use of 1604 these parameters in SDP offer/answer unless recv-ols-id is 1605 included in the SDP answer for choosing one of the layers 1606 offered. 1608 The tier is indicated by tier-flag. The default level is 1609 indicated by level-id.
The tier and the default level specify 1610 the limits on values of syntax elements or arithmetic 1611 combinations of values of syntax elements that are followed 1612 when generating the bitstream or that the receiver supports. 1614 In SDP offer/answer, when the SDP answer does not include the 1615 recv-ols-id parameter that is less than the sprop-ols-id 1616 parameter in the SDP offer, the following applies: 1618 o The tier-flag, profile-id, sub-profile-id, and interop- 1619 constraints parameters MUST be used symmetrically, i.e., the 1620 value of each of these parameters in the offer MUST be the 1621 same as that in the answer, either explicitly signaled or 1622 implicitly inferred. 1624 o The level-id parameter is changeable as long as the highest 1625 level indicated by the answer is either equal to or lower 1626 than that in the offer. Note that a highest level higher 1627 than level-id in the offer for receiving can be included as 1628 max-recv-level-id. 1630 In SDP offer/answer, when the SDP answer does include the recv- 1631 ols-id parameter that is less than the sprop-ols-id parameter 1632 in the SDP offer, the set of tier-flag, profile-id, sub- 1633 profile-id, interop-constraints, and level-id parameters 1634 included in the answer MUST be consistent with that for the 1635 chosen output layer set as indicated in the SDP offer, with the 1636 exception that the level-id parameter in the SDP answer is 1637 changeable as long as the highest level indicated by the answer 1638 is either lower than or equal to that in the offer. 1640 More specifications of these parameters, including how they 1641 relate to syntax elements specified in [VVC] are provided 1642 below. 1644 profile-id: 1646 When profile-id is not present, a value of 1 (i.e., the Main 10 1647 profile) MUST be inferred.
1649 When used to indicate properties of a bitstream, profile-id is 1650 derived from the general_profile_idc syntax element that 1651 applies to the bitstream in an instance of the 1652 profile_tier_level( ) syntax structure. 1654 VVC bitstreams transported over RTP using the technologies of 1655 this memo SHOULD contain only a single PTL structure in the 1656 DCI, unless the sender can assure that a receiver can correctly 1657 decode the VVC bitstream regardless of what PTL structure 1658 was used in the SDP O/A exchange. 1660 As specified in [VVC], a profile_tier_level( ) syntax structure 1661 may be contained in an SPS NAL unit, and one or more 1662 profile_tier_level( ) syntax structures may be contained in a 1663 VPS NAL unit and in a DCI NAL unit. One of the following three 1664 cases applies to the container NAL unit of the 1665 profile_tier_level( ) syntax structure containing those PTL 1666 syntax elements used to derive the values of profile-id, tier- 1667 flag, level-id, sub-profile-id, or interop-constraints: 1) The 1668 container NAL unit is an SPS, the bitstream is a single-layer 1669 bitstream, and the profile_tier_level( ) syntax structures in 1670 all SPSs referenced by the CVSs in the bitstream have the same 1671 values respectively for those PTL syntax elements; 2) The 1672 container NAL unit is a VPS, the profile_tier_level( ) syntax 1673 structure is the one in the VPS that applies to the OLS 1674 corresponding to the bitstream, and the profile_tier_level( ) 1675 syntax structures applicable to the OLS corresponding to the 1676 bitstream in all VPSs referenced by the CVSs in the bitstream 1677 have the same values respectively for those PTL syntax 1678 elements; 3) The container NAL unit is a DCI NAL unit and the 1679 profile_tier_level( ) syntax structures in all DCI NAL units in 1680 the bitstream have the same values respectively for those PTL 1681 syntax elements.
1683 [VVC] allows for multiple profile_tier_level( ) structures in a 1684 DCI NAL unit, which may contain different values for the syntax 1685 elements used to derive the values of profile-id, tier-flag, 1686 level-id, sub-profile-id, or interop-constraints in the 1687 different entries. However, herein defined is only a single 1688 profile-id, tier-flag, level-id, sub-profile-id, or interop- 1689 constraints. When signaling these parameters, when a DCI NAL 1690 unit is present with multiple profile_tier_level( ) structures, 1691 these values SHOULD be the same as the first profile_tier_level 1692 structure in the DCI, unless the sender has ensured that the 1693 receiver can decode the bitstream when a different value is 1694 chosen. 1696 tier-flag, level-id: 1698 The value of tier-flag MUST be in the range of 0 to 1, 1699 inclusive. The value of level-id MUST be in the range of 0 to 1700 255, inclusive. 1702 If the tier-flag and level-id parameters are used to indicate 1703 properties of a bitstream, they indicate the tier and the 1704 highest level the bitstream complies with. 1706 If the tier-flag and level-id parameters are used for 1707 capability exchange, the following applies. If max-recv-level- 1708 id is not present, the default level defined by level-id 1709 indicates the highest level the codec wishes to support. 1710 Otherwise, max-recv-level-id indicates the highest level the 1711 codec supports for receiving. For either receiving or sending, 1712 all levels that are lower than the highest level supported MUST 1713 also be supported. 1715 If no tier-flag is present, a value of 0 MUST be inferred; if 1716 no level-id is present, a value of 51 (i.e., level 3.1) MUST be 1717 inferred. 1719 Informative note: The level values currently defined in the 1720 VVC specification are in the form of "majorNum.minorNum", 1721 and the value of the level-id for each of the levels is 1722 equal to majorNum * 16 + minorNum * 3. 
It is expected that 1723 if any levels are defined in the future, the same convention 1724 will be used, but this cannot be guaranteed. 1726 When used to indicate properties of a bitstream, the tier-flag 1727 and level-id parameters are derived respectively from the 1728 syntax element general_tier_flag, and the syntax element 1729 general_level_idc or sub_layer_level_idc[j], that apply to the 1730 bitstream, in an instance of the profile_tier_level( ) syntax 1731 structure. 1733 If the tier-flag and level-id are derived from the 1734 profile_tier_level( ) syntax structure in a DCI NAL unit, the 1735 following applies: 1737 o tier-flag = general_tier_flag 1739 o level-id = general_level_idc 1741 Otherwise, if the tier-flag and level-id are derived from the 1742 profile_tier_level( ) syntax structure in an SPS or VPS NAL 1743 unit, and the bitstream contains the highest sublayer 1744 representation in the OLS corresponding to the bitstream, the 1745 following applies: 1747 o tier-flag = general_tier_flag 1749 o level-id = general_level_idc 1751 Otherwise, if the tier-flag and level-id are derived from the 1752 profile_tier_level( ) syntax structure in an SPS or VPS NAL 1753 unit, and the bitstream does not contain the highest sublayer 1754 representation in the OLS corresponding to the bitstream, the 1755 following applies, with j being the value of the sprop- 1756 sublayer-id parameter: 1758 o tier-flag = general_tier_flag 1760 o level-id = sub_layer_level_idc[j] 1762 sub-profile-id: 1764 The value of the parameter is a comma-separated (',') list of 1765 data using base64 [RFC4648] representation. 1767 When used to indicate properties of a bitstream, sub-profile-id 1768 is derived from each of the ptl_num_sub_profiles 1769 general_sub_profile_idc[i] syntax elements that apply to the 1770 bitstream in a profile_tier_level( ) syntax structure.
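Returning to the level-id convention noted under tier-flag and level-id (level-id = majorNum * 16 + minorNum * 3), the mapping can be illustrated with a short sketch. The helper names level_id_from_level and level_from_level_id are hypothetical, and the reverse mapping assumes the convention holds, which, as the note says, cannot be guaranteed for levels defined in the future.

```python
def level_id_from_level(major_num, minor_num):
    """Map a level "majorNum.minorNum" to its level-id value."""
    value = major_num * 16 + minor_num * 3
    if not 0 <= value <= 255:
        raise ValueError("level-id must be in the range 0 to 255")
    return value


def level_from_level_id(level_id):
    """Invert the convention; only valid while minorNum * 3 < 16."""
    major_num, remainder = divmod(level_id, 16)
    if remainder % 3 != 0:
        raise ValueError("level-id does not follow the convention")
    return major_num, remainder // 3
```

For the inferred default, level 3.1 gives 3 * 16 + 1 * 3 = 51, which matches the default level-id value stated above.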
1772 interop-constraints: 1774 A base64 [RFC4648] representation of the data that includes the 1775 syntax elements ptl_frame_only_constraint_flag and 1776 ptl_multilayer_enabled_flag and the general_constraints_info( ) 1777 syntax structure that apply to the bitstream in an instance of 1778 the profile_tier_level( ) syntax structure. 1780 If the interop-constraints parameter is not present, the 1781 following MUST be inferred: 1783 o ptl_frame_only_constraint_flag = 1 1785 o ptl_multilayer_enabled_flag = 0 1787 o gci_present_flag in the general_constraints_info( ) syntax 1788 structure = 0 1790 Using interop-constraints for capability exchange results in a 1791 requirement on any bitstream to be compliant with the interop- 1792 constraints. 1794 sprop-sublayer-id: 1796 This parameter MAY be used to indicate the highest allowed 1797 value of TID in the highest layer present in the bitstream. 1798 When not present, the value of sprop-sublayer-id is inferred to 1799 be equal to 6. 1801 The value of sprop-sublayer-id MUST be in the range of 0 to 6, 1802 inclusive. 1804 sprop-ols-id: 1806 This parameter MAY be used to indicate the OLS that the 1807 bitstream applies to. When not present, the value of sprop- 1808 ols-id is inferred to be equal to TargetOlsIdx as specified in 1809 8.1.1 in [VVC]. If this optional parameter is present, sprop- 1810 vps MUST also be present or its content MUST be known a priori 1811 at the receiver. 1813 The value of sprop-ols-id MUST be in the range of 0 to 256, 1814 inclusive. 1816 Informative note: VVC allows having up to 258 output layer 1817 sets indicated in the VPS as the number of output layer sets 1818 minus 2 is indicated with a field of 8 bits. 1820 recv-sublayer-id: 1822 This parameter MAY be used to signal a receiver's choice of the 1823 offered or declared sublayer representations in the sprop-vps 1824 and sprop-sps. 
The value of recv-sublayer-id indicates the TID 1825 of the highest sublayer in the highest layer of the bitstream 1826 that a receiver supports. When not present, the value of recv- 1827 sublayer-id is inferred to be equal to the value of the sprop- 1828 sublayer-id parameter in the SDP offer. 1830 The value of recv-sublayer-id MUST be in the range of 0 to 6, 1831 inclusive. 1833 recv-ols-id: 1835 This parameter MAY be used to signal a receiver's choice of the 1836 offered or declared output layer sets in the sprop-vps. The 1837 value of recv-ols-id indicates the OLS index of the bitstream 1838 that a receiver supports. When not present, the value of recv- 1839 ols-id is inferred to be equal to the value of the sprop-ols-id 1840 parameter inferred from or indicated in the SDP offer. When 1841 present, the value of recv-ols-id must be included only when 1842 sprop-ols-id was received and must refer to an output layer set 1843 in the VPS that includes no layers other than all or a subset 1844 of the layers of the OLS referred to by sprop-ols-id. If this 1845 optional parameter is present, sprop-vps must have been 1846 received or its content must be known a priori at the receiver. 1848 The value of recv-ols-id MUST be in the range of 0 to 257, 1849 inclusive. 1851 max-recv-level-id: 1853 This parameter MAY be used to indicate the highest level a 1854 receiver supports. 1856 The value of max-recv-level-id MUST be in the range of 0 to 1857 255, inclusive. 1859 When max-recv-level-id is not present, the value is inferred to 1860 be equal to level-id. 1862 max-recv-level-id MUST NOT be present when the highest level 1863 the receiver supports is not higher than the default level. 1865 sprop-dci: 1867 This parameter MAY be used to convey a decoding capability 1868 information NAL unit of the bitstream for out-of-band 1869 transmission. The parameter MAY also be used for capability 1870 exchange.
The value of the parameter is a base64 [RFC4648] 1871 representation of the decoding capability information NAL unit 1872 as specified in Section 7.3.2.1 of [VVC]. 1874 sprop-vps: 1876 This parameter MAY be used to convey any video parameter set 1877 NAL unit of the bitstream for out-of-band transmission of video 1878 parameter sets. The parameter MAY also be used for capability 1879 exchange and to indicate sub-stream characteristics (i.e., 1880 properties of output layer sets and sublayer representations as 1881 defined in [VVC]). The value of the parameter is a comma- 1882 separated (',') list of base64 [RFC4648] representations of the 1883 video parameter set NAL units as specified in Section 7.3.2.3 1884 of [VVC]. 1886 The sprop-vps parameter MAY contain one or more video 1887 parameter set NAL units. However, all other video parameter 1888 sets contained in the sprop-vps parameter MUST be consistent 1889 with the first video parameter set in the sprop-vps parameter. 1890 A video parameter set vpsB is said to be consistent with 1891 another video parameter set vpsA if the number of OLSs in vpsA 1892 and vpsB is the same and any decoder that conforms to the 1893 profile, tier, level, and constraints indicated by the data 1894 starting from the syntax element general_profile_idc to the 1895 syntax structure general_constraints_info(), inclusive, in the 1896 profile_tier_level( ) syntax structure corresponding to any OLS 1897 with index olsIdx in vpsA can decode any CVS(s) referencing 1898 vpsB when TargetOlsIdx is equal to olsIdx that conforms to the 1899 profile, tier, level, and constraints indicated by the data 1900 starting from the syntax element general_profile_idc to the 1901 syntax structure general_constraints_info(), inclusive, in the 1902 profile_tier_level( ) syntax structure corresponding to the OLS 1903 with index TargetOlsIdx in vpsB.
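As an illustration of the value syntax used by sprop-vps, a receiver might split the comma-separated list and decode each base64 item into the bytes of one NAL unit. The following is a minimal sketch using the Python standard library; the function name decode_sprop_list is hypothetical, and it performs no check that the decoded bytes are in fact video parameter set NAL units or that they are mutually consistent.

```python
import base64


def decode_sprop_list(value):
    """Decode a comma-separated list of base64 items into bytes objects."""
    nal_units = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray separators in this sketch
        # validate=True rejects characters outside the base64 alphabet.
        nal_units.append(base64.b64decode(item, validate=True))
    return nal_units
```

The same splitting and decoding would apply to any other parameter whose value is defined as a comma-separated list of base64 representations.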
1905 sprop-sps: 1907 This parameter MAY be used to convey sequence parameter set NAL 1908 units of the bitstream for out-of-band transmission of sequence 1909 parameter sets. The value of the parameter is a comma- 1910 separated (',') list of base64 [RFC4648] representations of the 1911 sequence parameter set NAL units as specified in 1912 Section 7.3.2.4 of [VVC]. 1914 A sequence parameter set spsB is said to be consistent with 1915 another sequence parameter set spsA if any decoder that 1916 conforms to the profile, tier, level, and constraints indicated 1917 by the data starting from the syntax element 1918 general_profile_idc to the syntax structure 1919 general_constraints_info(), inclusive, in the 1920 profile_tier_level( ) syntax structure in spsA can decode any 1921 CLVS(s) referencing spsB that conforms to the profile, tier, 1922 level, and constraints indicated by the data starting from the 1923 syntax element general_profile_idc to the syntax structure 1924 general_constraints_info(), inclusive, in the 1925 profile_tier_level( ) syntax structure in spsB. 1927 sprop-pps: 1929 This parameter MAY be used to convey picture parameter set NAL 1930 units of the bitstream for out-of-band transmission of picture 1931 parameter sets. The value of the parameter is a comma- 1932 separated (',') list of base64 [RFC4648] representations of the 1933 picture parameter set NAL units as specified in Section 7.3.2.5 1934 of [VVC]. 1936 sprop-sei: 1938 This parameter MAY be used to convey one or more SEI messages 1939 that describe bitstream characteristics. When present, a 1940 decoder can rely on the bitstream characteristics that are 1941 described in the SEI messages for the entire duration of the 1942 session, independently from the persistence scopes of the SEI 1943 messages as specified in [VSEI]. 1945 The value of the parameter is a comma-separated (',') list of 1946 base64 [RFC4648] representations of SEI NAL units as specified 1947 in [VSEI]. 
1949 Informative note: Intentionally, no list of applicable or 1950 inapplicable SEI messages is specified here. Conveying 1951 certain SEI messages in sprop-sei may be sensible in some 1952 application scenarios and meaningless in others. However, a 1953 few examples are described below: 1955 1) In an environment where the bitstream was created from 1956 film-based source material, and no splicing is going to 1957 occur during the lifetime of the session, the film grain 1958 characteristics SEI message is likely meaningful, and 1959 sending it in sprop-sei rather than in the bitstream at each 1960 entry point may help with saving bits and allows one to 1961 configure the renderer only once, avoiding unwanted 1962 artifacts. 1964 2) Examples for SEI messages that would be meaningless to be 1965 conveyed in sprop-sei include the decoded picture hash SEI 1966 message (it is close to impossible that all decoded pictures 1967 have the same hash) or the filler payload SEI message (as 1968 there is no point in just having more bits in SDP). 1970 max-lsr: 1972 The max-lsr MAY be used to signal the capabilities of a 1973 receiver implementation and MUST NOT be used for any other 1974 purpose. The value of max-lsr is an integer indicating the 1975 maximum processing rate in units of luma samples per second. 1976 The max-lsr parameter signals that the receiver is capable of 1977 decoding video at a higher rate than is required by the highest 1978 level. 1980 Informative note: When the OPTIONAL media type parameters 1981 are used to signal the properties of a bitstream, and max- 1982 lsr is not present, the values of tier-flag, profile-id, 1983 sub-profile-id, interop-constraints, and level-id must always 1984 be such that the bitstream complies fully with the specified 1985 profile, tier, and level.
1987 When max-lsr is signaled, the receiver MUST be able to decode 1988 bitstreams that conform to the highest level, with the 1989 exception that the MaxLumaSr value in Table 136 of [VVC] for 1990 the highest level is replaced with the value of max-lsr. 1991 Senders MAY use this knowledge to send pictures of a given size 1992 at a higher picture rate than is indicated in the highest 1993 level. 1995 When not present, the value of max-lsr is inferred to be equal 1996 to the value of MaxLumaSr given in Table 136 of [VVC] for the 1997 highest level. 1999 The value of max-lsr MUST be in the range of MaxLumaSr to 16 * 2000 MaxLumaSr, inclusive, where MaxLumaSr is given in Table 136 of 2001 [VVC] for the highest level. 2003 max-fps: 2005 The value of max-fps is an integer indicating the maximum 2006 picture rate in units of pictures per 100 seconds that can be 2007 effectively processed by the receiver. The max-fps parameter 2008 MAY be used to signal that the receiver has a constraint in 2009 that it is not capable of processing video effectively at the 2010 full picture rate that is implied by the highest level and, 2011 when present, max-lsr. 2013 The value of max-fps is not necessarily the picture rate at 2014 which the maximum picture size can be sent; it constitutes a 2015 constraint on maximum picture rate for all resolutions. 2017 Informative note: The max-fps parameter is semantically 2018 different from max-lsr in that max-fps is used to signal a 2019 constraint, lowering the maximum picture rate from what is 2020 implied by other parameters. 2022 The encoder MUST use a picture rate equal to or less than this 2023 value. In cases where the max-fps parameter is absent, the 2024 encoder is free to choose any picture rate according to the 2025 highest level and any signaled optional parameters. 2027 The value of max-fps MUST be smaller than or equal to the full 2028 picture rate that is implied by the highest level and, when 2029 present, max-lsr.
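The units and ranges of max-lsr and max-fps described above can be made concrete with a small sketch. The function names are hypothetical, and the per-level MaxLumaSr value is passed in by the caller rather than reproduced here, since it has to be taken from the level limits table in [VVC].

```python
def max_lsr_in_range(max_lsr, level_max_luma_sr):
    """Check that max-lsr lies in [MaxLumaSr, 16 * MaxLumaSr].

    level_max_luma_sr is the MaxLumaSr limit of the highest level,
    supplied by the caller from [VVC].
    """
    return level_max_luma_sr <= max_lsr <= 16 * level_max_luma_sr


def max_fps_to_pictures_per_second(max_fps):
    """Convert max-fps (pictures per 100 seconds) to pictures per second."""
    return max_fps / 100.0


def encoder_rate_allowed(pictures_per_second, max_fps=None):
    """True if the encoder's picture rate respects a signaled max-fps."""
    if max_fps is None:
        return True  # parameter absent: no max-fps constraint applies
    return pictures_per_second * 100 <= max_fps
```

For example, a max-fps value of 3000 corresponds to 30 pictures per second, so an encoder running at 60 pictures per second would exceed the signaled constraint.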
2031 sprop-max-don-diff: 2033 If there is no NAL unit naluA that is followed in transmission 2034 order by any NAL unit preceding naluA in decoding order (i.e., 2035 the transmission order of the NAL units is the same as the 2036 decoding order), the value of this parameter MUST be equal to 2037 0. 2039 Otherwise, this parameter specifies the maximum absolute 2040 difference between the decoding order number (i.e., AbsDon) 2041 values of any two NAL units naluA and naluB, where naluA 2042 follows naluB in decoding order and precedes naluB in 2043 transmission order. 2045 The value of sprop-max-don-diff MUST be an integer in the range 2046 of 0 to 32767, inclusive. 2048 When not present, the value of sprop-max-don-diff is inferred 2049 to be equal to 0. 2051 sprop-depack-buf-bytes: 2053 This parameter signals the required size of the de- 2054 packetization buffer in units of bytes. The value of the 2055 parameter MUST be greater than or equal to the maximum buffer 2056 occupancy (in units of bytes) of the de-packetization buffer as 2057 specified in Section 6. 2059 The value of sprop-depack-buf-bytes MUST be an integer in the 2060 range of 0 to 4294967295, inclusive. 2062 When sprop-max-don-diff is present and greater than 0, this 2063 parameter MUST be present and the value MUST be greater than 0. 2064 When not present, the value of sprop-depack-buf-bytes is 2065 inferred to be equal to 0. 2067 Informative note: The value of sprop-depack-buf-bytes 2068 indicates the required size of the de-packetization buffer 2069 only. When network jitter can occur, an appropriately sized 2070 jitter buffer has to be available as well. 2072 depack-buf-cap: 2074 This parameter signals the capabilities of a receiver 2075 implementation and indicates the amount of de-packetization 2076 buffer space in units of bytes that the receiver has available 2077 for reconstructing the NAL unit decoding order from NAL units 2078 carried in the RTP stream. 
         A receiver is able to handle any RTP stream for which the
         value of the sprop-depack-buf-bytes parameter is smaller than
         or equal to this parameter.

         When not present, the value of depack-buf-cap is inferred to be
         equal to 4294967295.  The value of depack-buf-cap MUST be an
         integer in the range of 1 to 4294967295, inclusive.

            Informative note: depack-buf-cap indicates the maximum
            possible size of the de-packetization buffer of the receiver
            only, without allowing for network jitter.

7.2.  SDP Parameters

   The receiver MUST ignore any parameter unspecified in this memo.

7.2.1.  Mapping of Payload Type Parameters to SDP

   The media type video/H266 string is mapped to fields in the Session
   Description Protocol (SDP) [RFC4566] as follows:

   *  The media name in the "m=" line of SDP MUST be video.

   *  The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the
      media subtype).

   *  The clock rate in the "a=rtpmap" line MUST be 90000.

   *  The OPTIONAL parameters profile-id, tier-flag, sub-profile-id,
      interop-constraints, level-id, sprop-sublayer-id, sprop-ols-id,
      recv-sublayer-id, recv-ols-id, max-recv-level-id, max-lsr,
      max-fps, sprop-max-don-diff, sprop-depack-buf-bytes, and
      depack-buf-cap, when present, MUST be included in the "a=fmtp"
      line of SDP.  These parameters are expressed as a media type
      string, in the form of a semicolon-separated list of
      parameter=value pairs.

   *  The OPTIONAL parameters sprop-vps, sprop-sps, sprop-pps,
      sprop-sei, and sprop-dci, when present, MUST be included in the
      "a=fmtp" line of SDP or conveyed using the "fmtp" source attribute
      as specified in Section 6.3 of [RFC5576].  For a particular media
      format (i.e., RTP payload type), sprop-vps, sprop-sps, sprop-pps,
      sprop-sei, or sprop-dci MUST NOT be both included in the "a=fmtp"
      line of SDP and conveyed using the "fmtp" source attribute.
      When included in the "a=fmtp" line of SDP, those parameters are
      expressed as a media type string, in the form of a
      semicolon-separated list of parameter=value pairs.  When conveyed
      in the "a=fmtp" line of SDP for a particular payload type, the
      parameters sprop-vps, sprop-sps, sprop-pps, sprop-sei, and
      sprop-dci MUST be applied to each SSRC with the payload type.
      When conveyed using the "fmtp" source attribute, these parameters
      are only associated with the given source and payload type as
      parts of the "fmtp" source attribute.

   An example of media representation in SDP is as follows:

   m=video 49170 RTP/AVP 98
   a=rtpmap:98 H266/90000
   a=fmtp:98 profile-id=1;
     sprop-vps=