idnits 2.17.1 draft-ietf-avtcore-rtp-vvc-08.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- No issues found here. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document date (7 March 2021) is 1145 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '0' on line 1372 ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866) ** Downref: Normative reference to an Informational RFC: RFC 7656 -- Possible downref: Non-RFC (?) normative reference: ref. 'VSEI' -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC' Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 avtcore S. 
Zhao 3 Internet-Draft S. Wenger 4 Intended status: Standards Track Tencent 5 Expires: 8 September 2021 Y. Sanchez 6 Fraunhofer HHI 7 Y.-K. Wang 8 Bytedance Inc. 9 7 March 2021 11 RTP Payload Format for Versatile Video Coding (VVC) 12 draft-ietf-avtcore-rtp-vvc-08 14 Abstract 16 This memo describes an RTP payload format for the video coding 17 standard ITU-T Recommendation H.266 and ISO/IEC International 18 Standard 23090-3, both also known as Versatile Video Coding (VVC) and 19 developed by the Joint Video Experts Team (JVET). The RTP payload 20 format allows for packetization of one or more Network Abstraction 21 Layer (NAL) units in each RTP packet payload as well as fragmentation 22 of a NAL unit into multiple RTP packets. The payload format has wide 23 applicability in videoconferencing, Internet video streaming, and 24 high-bitrate entertainment-quality video, among other applications. 26 Status of This Memo 28 This Internet-Draft is submitted in full conformance with the 29 provisions of BCP 78 and BCP 79. 31 Internet-Drafts are working documents of the Internet Engineering 32 Task Force (IETF). Note that other groups may also distribute 33 working documents as Internet-Drafts. The list of current Internet- 34 Drafts is at https://datatracker.ietf.org/drafts/current/. 36 Internet-Drafts are draft documents valid for a maximum of six months 37 and may be updated, replaced, or obsoleted by other documents at any 38 time. It is inappropriate to use Internet-Drafts as reference 39 material or to cite them other than as "work in progress." 41 This Internet-Draft will expire on 8 September 2021. 43 Copyright Notice 45 Copyright (c) 2021 IETF Trust and the persons identified as the 46 document authors. All rights reserved. 48 This document is subject to BCP 78 and the IETF Trust's Legal 49 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 50 license-info) in effect on the date of publication of this document. 
51 Please review these documents carefully, as they describe your rights 52 and restrictions with respect to this document. Code Components 53 extracted from this document must include Simplified BSD License text 54 as described in Section 4.e of the Trust Legal Provisions and are 55 provided without warranty as described in the Simplified BSD License. 57 Table of Contents 59 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 60 1.1. Overview of the VVC Codec . . . . . . . . . . . . . . . . 3 61 1.1.1. Coding-Tool Features (informative) . . . . . . . . . 3 62 1.1.2. Systems and Transport Interfaces (informative) . . . 6 63 1.1.3. High-Level Picture Partitioning (informative) . . . . 11 64 1.1.4. NAL Unit Header . . . . . . . . . . . . . . . . . . . 13 65 1.2. Overview of the Payload Format . . . . . . . . . . . . . 14 66 2. Conventions . . . . . . . . . . . . . . . . . . . . . . . . . 15 67 3. Definitions and Abbreviations . . . . . . . . . . . . . . . . 15 68 3.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 15 69 3.1.1. Definitions from the VVC Specification . . . . . . . 15 70 3.1.2. Definitions Specific to This Memo . . . . . . . . . . 18 71 3.2. Abbreviations . . . . . . . . . . . . . . . . . . . . . . 19 72 4. RTP Payload Format . . . . . . . . . . . . . . . . . . . . . 20 73 4.1. RTP Header Usage . . . . . . . . . . . . . . . . . . . . 20 74 4.2. Payload Header Usage . . . . . . . . . . . . . . . . . . 21 75 4.3. Payload Structures . . . . . . . . . . . . . . . . . . . 22 76 4.3.1. Single NAL Unit Packets . . . . . . . . . . . . . . . 22 77 4.3.2. Aggregation Packets (APs) . . . . . . . . . . . . . . 23 78 4.3.3. Fragmentation Units . . . . . . . . . . . . . . . . . 27 79 4.4. Decoding Order Number . . . . . . . . . . . . . . . . . . 30 80 5. Packetization Rules . . . . . . . . . . . . . . . . . . . . . 31 81 6. De-packetization Process . . . . . . . . . . . . . . . . . . 32 82 7. Payload Format Parameters . . . . . . . . . . . . 
. . . . . . 34 83 7.1. Media Type Registration . . . . . . . . . . . . . . . . . 34 84 7.2. SDP Parameters . . . . . . . . . . . . . . . . . . . . . 44 85 7.2.1. Mapping of Payload Type Parameters to SDP . . . . . . 44 86 7.2.2. Usage with SDP Offer/Answer Model . . . . . . . . . . 44 87 8. Use with Feedback Messages . . . . . . . . . . . . . . . . . 45 88 8.1. Picture Loss Indication (PLI) . . . . . . . . . . . . . . 45 89 8.2. Full Intra Request (FIR) . . . . . . . . . . . . . . . . 45 90 9. Security Considerations . . . . . . . . . . . . . . . . . . . 46 91 10. Congestion Control . . . . . . . . . . . . . . . . . . . . . 47 92 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 48 93 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 48 94 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 48 95 13.1. Normative References . . . . . . . . . . . . . . . . . . 48 96 13.2. Informative References . . . . . . . . . . . . . . . . . 50 97 Appendix A. Change History . . . . . . . . . . . . . . . . . . . 51 98 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 51 100 1. Introduction 102 The Versatile Video Coding [VVC] specification, formally published as 103 both ITU-T Recommendation H.266 and ISO/IEC International Standard 104 23090-3, is currently in the ITU-T publication process and the ISO/ 105 IEC approval process. VVC is reported to provide significant coding 106 efficiency gains over HEVC [HEVC], also known as H.265, and other 107 earlier video codecs. 109 This memo specifies an RTP payload format for VVC. It shares its 110 basic design with the NAL (Network Abstraction Layer) unit-based RTP 111 payload formats of H.264 Video Coding [RFC6184], Scalable Video 112 Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798], 113 and their respective predecessors.
With respect to design 114 philosophy, security, congestion control, and overall implementation 115 complexity, it has similar properties to those earlier payload format 116 specifications. This is a conscious choice, as at least RFC 6184 is 117 widely deployed and generally known in the relevant implementer 118 communities. Certain mechanisms known from [RFC6190] were 119 incorporated in VVC, as VVC version 1 supports temporal, spatial, and 120 signal-to-noise ratio (SNR) scalability. 122 1.1. Overview of the VVC Codec 124 VVC and HEVC share a similar hybrid video codec design. In this 125 memo, we provide a very brief overview of those features of VVC that 126 are, in some form, addressed by the payload format specified herein. 127 Implementers have to read, understand, and apply the ITU-T/ISO/IEC 128 specifications pertaining to VVC to arrive at interoperable, well- 129 performing implementations. 131 Conceptually, both VVC and HEVC include a Video Coding Layer (VCL), 132 which is often used to refer to the coding-tool features, and a NAL, 133 which is often used to refer to the systems and transport interface 134 aspects of the codecs. 136 1.1.1. Coding-Tool Features (informative) 138 Coding tool features are described below with occasional reference to 139 the coding tool set of HEVC, which is well known in the community. 141 Similar to earlier hybrid-video-coding-based standards, including 142 HEVC, the following basic video coding design is employed by VVC. A 143 prediction signal is first formed by either intra- or motion- 144 compensated prediction, and the residual (the difference between the 145 original and the prediction) is then coded. The gains in coding 146 efficiency are achieved by redesigning and improving almost all parts 147 of the codec over earlier designs. In addition, VVC includes several 148 tools to make the implementation on parallel architectures easier. 
150 Finally, VVC includes temporal, spatial, and SNR scalability as well 151 as multiview coding support. 153 Coding blocks and transform structure 155 Among major coding-tool differences between HEVC and VVC, one of the 156 important improvements is the more flexible coding tree structure in 157 VVC, i.e., multi-type tree. In addition to quadtree, binary and 158 ternary trees are also supported, which contributes a significant 159 improvement in coding efficiency. Moreover, the maximum size of the 160 coding tree unit (CTU) is increased from 64x64 to 128x128. To 161 improve the coding efficiency of the chroma signal, separate luma and 162 chroma trees at the CTU level may be employed for intra-slices. The square 163 transforms in HEVC are extended to non-square transforms for 164 rectangular blocks resulting from binary and ternary tree splits. 165 Besides, VVC supports multiple transform sets (MTS), including DCT-2, 166 DST-7, and DCT-8 as well as the non-separable secondary transform. 167 The transforms used in VVC can have different sizes with support for 168 larger transform sizes. For DCT-2, the transform sizes range from 169 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range from 170 4x4 to 32x32. In addition, VVC also supports sub-block transform for 171 both intra and inter coded blocks. For intra coded blocks, intra 172 sub-partitioning (ISP) may be used to allow sub-block based intra 173 prediction and transform. For inter blocks, sub-block transform may 174 be used assuming that only a part of an inter-block has non-zero 175 transform coefficients. 177 Entropy coding 179 Similar to HEVC, VVC uses a single entropy-coding engine, which is 180 based on context adaptive binary arithmetic coding [CABAC], but with 181 the support of multi-window sizes. The window sizes can be 182 initialized differently for different context models. Owing to this 183 design, it offers more efficient adaptation and better coding 184 efficiency.
A joint chroma residual coding scheme is applied to 185 further exploit the correlation between the residuals of two color 186 components. In VVC, different residual coding schemes are applied 187 for regular transform coefficients and residual samples generated 188 using transform-skip mode. 190 In-loop filtering 191 VVC has more feature support in loop filters than HEVC. The 192 deblocking filter in VVC is similar to that of HEVC but operates on a smaller 193 grid. After deblocking and sample adaptive offset (SAO), an adaptive 194 loop filter (ALF) may be used. As a Wiener filter, ALF reduces 195 distortion of decoded pictures. Besides, VVC introduces a new module 196 before deblocking called luma mapping with chroma scaling to fully 197 utilize the dynamic range of the signal so that rate-distortion 198 performance of both SDR and HDR content is improved. 200 Motion prediction and coding 202 Compared to HEVC, VVC introduces several improvements in this area. 203 First, adaptive motion vector resolution (AMVR) 204 can reduce the bit cost of motion vectors by adaptively signaling motion 205 vector resolution. Second, affine motion compensation is included 206 to capture complicated motion such as zooming and rotation. Meanwhile, 207 prediction refinement with the optical flow with affine mode (PROF) 208 is further deployed to mimic affine motion at the pixel level. 209 Third, decoder-side motion vector refinement (DMVR) is a method 210 to derive motion vectors at the decoder side based on block matching so that 211 fewer bits may be spent on motion vectors. Bi-directional optical 212 flow (BDOF) is a similar method to PROF. BDOF adds a sample-wise 213 offset at the 4x4 sub-block level that is derived with equations based on 214 gradients of the prediction samples and a motion difference relative 215 to CU motion vectors.
Furthermore, merge with motion vector 216 difference (MMVD) is a special mode, which further signals a limited 217 set of motion vector differences on top of merge mode. In addition 218 to MMVD, there are three other types of special merge modes, i.e., 219 sub-block merge, triangle, and combined intra-/inter-prediction 220 (CIIP). The sub-block merge list includes one candidate of sub-block 221 temporal motion vector prediction (SbTMVP) and up to four candidates 222 of affine motion vectors. Triangle is based on triangular block 223 motion compensation. CIIP combines intra- and inter-predictions 224 with weighting. Adaptive weighting may be employed with a block- 225 level tool called bi-prediction with CU based weighting (BCW) which 226 provides more flexibility than in HEVC. 228 Intra prediction and intra-coding 230 To capture the diversified local image texture directions with finer 231 granularity, VVC supports 65 angular directions instead of 33 232 directions in HEVC. The intra mode coding is based on a 6-most- 233 probable-mode scheme, and the 6 most probable modes are derived using 234 the neighboring intra prediction directions. In addition, to deal 235 with the different distributions of intra prediction angles for 236 different block aspect ratios, a wide-angle intra prediction (WAIP) 237 scheme is applied in VVC by including intra prediction angles beyond 238 those present in HEVC. Unlike HEVC, which only allows using the most 239 adjacent line of reference samples for intra prediction, VVC also 240 allows using two further reference lines, also known as multi- 241 reference-line (MRL) intra prediction. The additional reference 242 lines can only be used for the 6 most probable intra prediction 243 modes. To capture the strong correlation between different colour 244 components, in VVC, a cross-component linear mode (CCLM) is utilized 245 which assumes a linear relationship between the luma sample values 246 and their associated chroma samples.
For intra prediction, VVC also 247 applies a position-dependent prediction combination (PDPC) for 248 refining the prediction samples closer to the intra prediction block 249 boundary. Matrix-based intra prediction (MIP) modes are also used in 250 VVC; these generate an up to 8x8 intra prediction block using a 251 weighted sum of downsampled neighboring reference samples, and the 252 weights are hardcoded constants. 254 Other coding-tool feature 256 VVC introduces dependent quantization (DQ) to reduce quantization 257 error by state-based switching between two quantizers. 259 1.1.2. Systems and Transport Interfaces (informative) 261 VVC inherits the basic systems and transport interfaces designs from 262 HEVC and H.264. These include the NAL-unit-based syntax structure, 263 the hierarchical syntax and data unit structure, the supplemental 264 enhancement information (SEI) message mechanism, and the video 265 buffering model based on the hypothetical reference decoder (HRD). 266 The scalability features of VVC are conceptually similar to the 267 scalable variant of HEVC known as SHVC. The hierarchical syntax and 268 data unit structure consists of parameter sets at various levels 269 (decoder, sequence (pertaining to all layers), sequence (pertaining to a 270 single layer), picture), picture-level header parameters, slice-level 271 header parameters, and lower-level parameters. 273 A number of key components that influenced the network abstraction 274 layer design of VVC as well as this memo are described below. 276 Decoding capability information 278 The decoding capability information includes parameters that stay 279 constant for the lifetime of a Video Bitstream, which in IETF terms 280 can translate to the lifetime of a session. Such information 281 includes profile, level, and sub-profile information to determine a 282 maximum capability interop point that is guaranteed to be never 283 exceeded, even if splicing of video sequences occurs within a 284 session.
It further includes constraint fields (most of which are 285 flags), which can optionally be set to indicate that the video 286 bitstream will be constrained in the use of certain features as 287 indicated by the values of those fields. With this, a bitstream can 288 be labelled as not using certain tools, which allows among other 289 things for resource allocation in a decoder implementation. 291 Video parameter set 293 The video parameter set (VPS) pertains to one or more coded video sequences 294 (CVSs) of multiple layers covering the same range of access units, and 295 includes, among other information, decoding dependency expressed as 296 information for reference picture list construction of enhancement 297 layers. The VPS provides a "big picture" of a scalable sequence, 298 including what types of operation points are provided, the profile, 299 tier, and level of the operation points, and some other high-level 300 properties of the bitstream that can be used as the basis for session 301 negotiation and content selection, etc. One VPS may be referenced by 302 one or more sequence parameter sets. 304 Sequence parameter set 306 The sequence parameter set (SPS) contains syntax elements pertaining 307 to a coded layer video sequence (CLVS), which is a group of pictures 308 belonging to the same layer, starting with a random access point, and 309 followed by pictures that may depend on each other, until the next 310 random access point picture. In MPEG-2, the equivalent of a CVS was 311 a group of pictures (GOP), which normally started with an I frame and 312 was followed by P and B frames. While more complex in its options of 313 random access points, VVC retains this basic concept. One remarkable 314 difference of VVC is that a CLVS may start with a Gradual Decoding 315 Refresh (GDR) picture, without requiring presence of traditional 316 random access points in the bitstream, such as instantaneous decoding 317 refresh (IDR) or clean random access (CRA) pictures.
In many TV-like 318 applications, a CVS contains a few hundred milliseconds to a few 319 seconds of video. In video conferencing (without switching MCUs 320 involved), a CVS can be as long in duration as the whole session. 322 Picture and adaptation parameter set 324 The picture parameter set and the adaptation parameter set (PPS and 325 APS, respectively) carry information pertaining to zero or more 326 pictures and zero or more slices, respectively. The PPS contains 327 information that is likely to stay constant from picture to picture, 328 at least for pictures of a certain type, whereas the APS contains 329 information, such as adaptive loop filter coefficients, that is 330 likely to change from picture to picture or even within a picture. A 331 single APS is referenced by all slices of the same picture if that 332 APS contains information about luma mapping with chroma scaling 333 (LMCS) or scaling list. Different APSs containing ALF parameters can 334 be referenced by slices of the same picture. 336 Picture header 338 A Picture Header contains information that is common to all slices 339 that belong to the same picture. Being able to send that information 340 as a separate NAL unit when pictures are split into several slices 341 allows for saving bitrate, compared to repeating the same information 342 in all slices. However, there might be scenarios where low-bitrate 343 video is transmitted using a single slice per picture. Having a 344 separate NAL unit to convey that information incurs overhead 345 in such scenarios. In that case, the picture header syntax 346 structure is directly included in the slice header, instead of in its 347 own NAL unit. The mode of the picture header syntax structure being 348 included in its own NAL unit or not can only be switched on/off for 349 an entire CLVS, and can only be switched off when in the entire CLVS 350 each picture contains only one slice.
352 Profile, tier, and level 354 The profile, tier and level syntax structures in DCI, VPS and SPS 355 contain profile, tier, level information for all layers that refer to 356 the DCI, for layers associated with one or more output layer sets 357 specified by the VPS, and for any layer that refers to the SPS, 358 respectively. 360 Sub-profiles 362 Within the VVC specification, a sub-profile is a 32-bit number, coded 363 according to ITU-T Rec. T.35, that does not carry any semantics. It is 364 carried in the profile_tier_level structure and hence (potentially) 365 present in the DCI, VPS, and SPS. External registration bodies can 366 register a T.35 codepoint with ITU-T registration authorities and 367 associate with their registration a description of bitstream 368 restrictions beyond the profiles defined by ITU-T and ISO/IEC. This 369 would allow encoder manufacturers to label the bitstreams generated 370 by their encoder as complying with such a sub-profile. It is expected 371 that upstream standardization organizations (such as DVB and ATSC), 372 as well as walled-garden video services, will take advantage of this 373 labelling system. In contrast to "normal" profiles, it is expected 374 that sub-profiles may indicate encoder choices traditionally left 375 open in the (decoder-centric) video coding specs, such as GOP 376 structures, minimum/maximum QP values, and the mandatory use of 377 certain tools or SEI messages. 379 General constraint fields 381 The profile_tier_level structure carries a considerable number of 382 constraint fields (most of which are flags), which an encoder can use 383 to indicate to a decoder that it will not use a certain tool or 384 technology. They were included in reaction to a perceived market 385 need for labelling a bitstream as not exercising a certain tool that 386 has become commercially unviable.
388 Temporal scalability support 390 VVC includes support of temporal scalability, by inclusion of the 391 signaling of TemporalId in the NAL unit header, the restriction that 392 pictures of a particular temporal sublayer cannot be used for inter 393 prediction reference by pictures of a lower temporal sublayer, the 394 sub-bitstream extraction process, and the requirement that each sub- 395 bitstream extraction output be a conforming bitstream. Media-Aware 396 Network Elements (MANEs) can utilize the TemporalId in the NAL unit 397 header for stream adaptation purposes based on temporal scalability. 399 Reference picture resampling (RPR) 401 In AVC and HEVC, the spatial resolution of pictures cannot change 402 unless a new sequence using a new SPS starts, with an IRAP picture. 403 VVC enables picture resolution change within a sequence at a position 404 without encoding an IRAP picture, which is always intra-coded. This 405 feature is sometimes referred to as reference picture resampling 406 (RPR), as the feature needs resampling of a reference picture used 407 for inter prediction when that reference picture has a different 408 resolution than the current picture being decoded. RPR allows 409 resolution change without the need of coding an IRAP picture, which 410 causes a momentary bit rate spike in streaming or video conferencing 411 scenarios, e.g., to cope with network condition changes. RPR can 412 also be used in application scenarios wherein zooming of the entire 413 video region or some region of interest is needed. 415 Spatial, SNR, and multiview scalability 417 VVC includes support for spatial, SNR, and multiview scalability. 418 Scalable video coding is widely considered to have technical benefits 419 and enrich services for various video applications. Until recently, 420 however, the functionality has not been included in the first version 421 of specifications of the video codecs. 
In VVC, however, all those 422 forms of scalability are supported in the first version of VVC 423 natively through the signaling of the layer_id in the NAL unit 424 header, the VPS which associates layers with given layer_ids to each 425 other, reference picture selection, reference picture resampling for 426 spatial scalability, and a number of other mechanisms not relevant 427 for this memo. 429 Spatial scalability 430 With the existence of Reference Picture Resampling (RPR), the 431 additional burden for scalability support is just a 432 modification of the high-level syntax (HLS). The inter-layer 433 prediction is employed in a scalable system to improve the 434 coding efficiency of the enhancement layers. In addition to 435 the spatial and temporal motion-compensated predictions that 436 are available in a single-layer codec, the inter-layer 437 prediction in VVC uses the possibly resampled video data of the 438 reconstructed reference picture from a reference layer to 439 predict the current enhancement layer. The resampling process 440 for inter-layer prediction, when used, is performed at the 441 block-level, reusing the existing interpolation process for 442 motion compensation in single-layer coding. It means that no 443 additional resampling process is needed to support spatial 444 scalability. 446 SNR scalability 448 SNR scalability is similar to spatial scalability except that 449 the resampling factors are 1:1. In other words, there is no 450 change in resolution, but there is inter-layer prediction. 452 Multiview scalability 454 The first version of VVC also supports multiview scalability, 455 wherein a multi-layer bitstream carries layers representing 456 multiple views, and one or more of the represented views can be 457 output at the same time. 
459 SEI messages 461 Supplemental enhancement information (SEI) messages are information 462 in the bitstream that do not influence the decoding process as 463 specified in the VVC spec, but address issues of representation/ 464 rendering of the decoded bitstream, label the bitstream for certain 465 applications, among other, similar tasks. The overall concept of SEI 466 messages and many of the messages themselves have been inherited from 467 the H.264 and HEVC specs. Except for the SEI messages that affect 468 the specification of the hypothetical reference decoder (HRD), other 469 SEI messages for use in the VVC environment, which are generally 470 useful also in other video coding technologies, are not included in 471 the main VVC specification but in a companion specification [VSEI]. 473 1.1.3. High-Level Picture Partitioning (informative) 475 VVC inherited the concept of tiles and wavefront parallel processing 476 (WPP) from HEVC, with some minor to moderate differences. The basic 477 concept of slices was kept in VVC but designed in an essentially 478 different form. VVC is the first video coding standard that includes 479 subpictures as a feature, which provides the same functionality as 480 HEVC motion-constrained tile sets (MCTSs) but designed differently to 481 have better coding efficiency and to be friendlier for usage in 482 application systems. More details of these differences are described 483 below. 485 Tiles and WPP 487 As in HEVC, a picture can be split into tile rows and tile 488 columns in VVC, in-picture prediction across tile boundaries is 489 disallowed, and so on. However, the syntax for signaling of tile 490 partitioning has been simplified, by using a unified syntax design 491 for both the uniform and the non-uniform mode. In addition, 492 signaling of entry point offsets for tiles in the slice header is 493 optional in VVC while it is mandatory in HEVC.
The WPP design in VVC 494 has two differences compared to HEVC: i) The CTU row delay is reduced 495 from two CTUs to one CTU; ii) Signaling of entry point offsets for 496 WPP in the slice header is optional in VVC while it is mandatory in 497 HEVC. 499 Slices 501 In VVC, the conventional slices based on CTUs (as in HEVC) or 502 macroblocks (as in AVC) have been removed. The main reasoning behind 503 this architectural change is as follows. The advances in video 504 coding since 2003 (the publication year of AVC v1) have been such 505 that slice-based error concealment has become practically impossible, 506 due to the ever-increasing number and efficiency of in-picture and 507 inter-picture prediction mechanisms. An error-concealed picture is 508 the decoding result of a transmitted coded picture for which there is 509 some data loss (e.g., loss of some slices) of the coded picture or a 510 reference picture for at least some part of the coded picture is not 511 error-free (e.g., that reference picture was an error-concealed 512 picture). For example, when one of the multiple slices of a picture 513 is lost, it may be error-concealed using an interpolation of the 514 neighboring slices. While advanced video coding prediction 515 mechanisms provide significantly higher coding efficiency, they also 516 make it harder for machines to estimate the quality of an error- 517 concealed picture, which was already a hard problem with the use of 518 simpler prediction mechanisms. Advanced in-picture prediction 519 mechanisms also cause the coding efficiency loss due to splitting a 520 picture into multiple slices to be more significant. Furthermore, 521 network conditions become significantly better while at the same time 522 techniques for dealing with packet losses have become significantly 523 improved. As a result, very few implementations have recently used 524 slices for maximum transmission unit size matching. 
Instead, 525 substantially all applications where low-delay error resilience is 526 required (e.g., video telephony and video conferencing) rely on 527 system/transport-level error resilience (e.g., retransmission, 528 forward error correction) and/or picture-based error resilience tools 529 (feedback-based error resilience, insertion of IRAPs, scalability 530 with higher protection level of the base layer, and so on). 531 Considering all the above, nowadays it is very rare that a picture 532 that cannot be correctly decoded is passed to the decoder, and when 533 such a rare case occurs, the system can afford to wait for an error- 534 free picture to be decoded and available for display without 535 resulting in frequent and long periods of picture freezing seen by 536 end users. 538 Slices in VVC have two modes: rectangular slices and raster-scan 539 slices. The rectangular slice, as indicated by its name, covers a 540 rectangular region of the picture. Typically, a rectangular slice 541 consists of several complete tiles. However, it is also possible 542 that a rectangular slice is a subset of a tile and consists of one or 543 more consecutive, complete CTU rows within a tile. A raster-scan 544 slice consists of one or more complete tiles in a tile raster scan 545 order, hence the region covered by a raster-scan slice can 546 have a non-rectangular shape, although it may also happen to have 547 the shape of a rectangle. The concept of slices in VVC is therefore 548 strongly linked to or based on tiles instead of CTUs (as in HEVC) or 549 macroblocks (as in AVC). 551 Subpictures 553 VVC is the first video coding standard that includes the support of 554 subpictures as a feature. Each subpicture consists of one or more 555 complete rectangular slices that collectively cover a rectangular 556 region of the picture.
   A subpicture may be specified to be either extractable (i.e., coded
   independently of other subpictures of the same picture and of
   earlier pictures in decoding order) or not extractable. Regardless
   of whether a subpicture is extractable or not, the encoder can
   control whether in-loop filtering (including deblocking, SAO, and
   ALF) is applied across the subpicture boundaries individually for
   each subpicture.

   Functionally, subpictures are similar to the motion-constrained tile
   sets (MCTSs) in HEVC. They both allow independent coding and
   extraction of a rectangular subset of a sequence of coded pictures,
   for use cases like viewport-dependent 360-degree video streaming
   optimization and region of interest (ROI) applications.

   There are several important design differences between subpictures
   and MCTSs. First, the subpictures feature in VVC allows motion
   vectors of a coding block to point outside of the subpicture, even
   when the subpicture is extractable, by applying sample padding at
   subpicture boundaries in this case, similarly to picture boundaries.
   Second, additional changes were introduced for the selection and
   derivation of motion vectors in the merge mode and in the decoder
   side motion vector refinement process of VVC. This allows higher
   coding efficiency compared to the non-normative motion constraints
   applied at the encoder side for MCTSs. Third, rewriting of SHs (and
   PH NAL units, when present) is not needed when extracting one or more
   extractable subpictures from a sequence of pictures to create a
   sub-bitstream that is a conforming bitstream. In sub-bitstream
   extractions based on HEVC MCTSs, rewriting of SHs is needed. Note
   that in both HEVC MCTSs extraction and VVC subpictures extraction,
   rewriting of SPSs and PPSs is needed.
   However, typically there are only a few parameter sets in a
   bitstream, while each picture has at least one slice; therefore,
   rewriting of SHs can be a significant burden for application systems.
   Fourth, slices of different subpictures within a picture are allowed
   to have different NAL unit types. Fifth, VVC specifies HRD and level
   definitions for subpicture sequences; thus, the conformance of the
   sub-bitstream of each extractable subpicture sequence can be ensured
   by encoders.

1.1.4. NAL Unit Header

   VVC maintains the NAL unit concept of HEVC with modifications. VVC
   uses a two-byte NAL unit header, as shown in Figure 1. The payload
   of a NAL unit refers to the NAL unit excluding the NAL unit header.

                  +---------------+---------------+
                  |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                  |F|Z| LayerID   |  Type   | TID |
                  +---------------+---------------+

                The Structure of the VVC NAL Unit Header.

                                Figure 1

   The semantics of the fields in the NAL unit header are as specified
   in VVC and described briefly below for convenience. In addition to
   the name and size of each field, the corresponding syntax element
   name in VVC is also provided.

   F: 1 bit
      forbidden_zero_bit. Required to be zero in VVC. Note that the
      inclusion of this bit in the NAL unit header was to enable
      transport of VVC video over MPEG-2 transport systems (avoidance of
      start code emulations) [MPEG2S]. In the context of this memo, the
      value 1 may be used to indicate a syntax violation, e.g., for a
      NAL unit resulting from aggregating a number of fragmented units
      of a NAL unit but missing the last fragment, as described in
      Section TBD.

   Z: 1 bit
      nuh_reserved_zero_bit. Required to be zero in VVC, and reserved
      for future extensions by ITU-T and ISO/IEC.

      This memo does not overload the "Z" bit for local extensions, as
      a) overloading the "F" bit is sufficient and b) to preserve the
      usefulness of this memo to possible future versions of [VVC].

   LayerId: 6 bits
      nuh_layer_id. Identifies the layer a NAL unit belongs to, wherein
      a layer may be, e.g., a spatial scalable layer, a quality scalable
      layer, a layer containing a different view, etc.

   Type: 5 bits
      nal_unit_type. This field specifies the NAL unit type as defined
      in Table 5 of [VVC]. For a reference of all currently defined NAL
      unit types and their semantics, please refer to Section 7.4.2.2 in
      [VVC].

   TID: 3 bits
      nuh_temporal_id_plus1. This field specifies the temporal
      identifier of the NAL unit plus 1. The value of TemporalId is
      equal to TID minus 1. A TID value of 0 is illegal to ensure that
      there is at least one bit in the NAL unit header equal to 1, so as
      to enable independent considerations of start code emulations in
      the NAL unit header and in the NAL unit payload data.

1.2. Overview of the Payload Format

   This payload format defines the following processes required for
   transport of VVC coded data over RTP [RFC3550]:

   *  Usage of RTP header with this payload format

   *  Packetization of VVC coded NAL units into RTP packets using three
      types of payload structures: a single NAL unit packet, an
      aggregation packet, and a fragmentation unit

   *  Transmission of VVC NAL units of the same bitstream within a
      single RTP stream

   *  Media type parameters to be used with the Session Description
      Protocol (SDP) [RFC4566]

   *  Usage of RTCP feedback messages

2. Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown above.

3. Definitions and Abbreviations

3.1. Definitions

   This document uses the terms and definitions of VVC. Section 3.1.1
   lists relevant definitions from [VVC] for convenience. Section 3.1.2
   provides definitions specific to this memo.

3.1.1. Definitions from the VVC Specification

   Access unit (AU): A set of PUs that belong to different layers and
   contain coded pictures associated with the same time for output from
   the DPB.

   Adaptation parameter set (APS): A syntax structure containing syntax
   elements that apply to zero or more slices as determined by zero or
   more syntax elements found in slice headers.

   Bitstream: A sequence of bits, in the form of a NAL unit stream or a
   byte stream, that forms the representation of a sequence of AUs
   forming one or more coded video sequences (CVSs).

   Coded picture: A coded representation of a picture comprising VCL NAL
   units with a particular value of nuh_layer_id within an AU and
   containing all CTUs of the picture.

   Clean random access (CRA) PU: A PU in which the coded picture is a
   CRA picture.

   Clean random access (CRA) picture: An IRAP picture for which each VCL
   NAL unit has nal_unit_type equal to CRA_NUT.

   Coded video sequence (CVS): A sequence of AUs that consists, in
   decoding order, of a CVSS AU, followed by zero or more AUs that are
   not CVSS AUs, including all subsequent AUs up to but not including
   any subsequent AU that is a CVSS AU.

   Coded video sequence start (CVSS) AU: An AU in which there is a PU
   for each layer in the CVS and the coded picture in each PU is a CLVSS
   picture.
723 Coded layer video sequence (CLVS): A sequence of PUs with the same 724 value of nuh_layer_id that consists, in decoding order, of a CLVSS 725 PU, followed by zero or more PUs that are not CLVSS PUs, including 726 all subsequent PUs up to but not including any subsequent PU that is 727 a CLVSS PU. 729 Coded layer video sequence start (CLVSS) PU: A PU in which the coded 730 picture is a CLVSS picture. 732 Coded layer video sequence start (CLVSS) picture: A coded picture 733 that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or 734 a GDR picture with NoOutputBeforeRecoveryFlag equal to 1. 736 Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs 737 of chroma samples of a picture that has three sample arrays, or a CTB 738 of samples of a monochrome picture or a picture that is coded using 739 three separate colour planes and syntax structures used to code the 740 samples. 742 Decoding Capability Information (DCI): A syntax structure containing 743 syntax elements that apply to the entire bitstream. 745 Decoded picture buffer (DPB): A buffer holding decoded pictures for 746 reference, output reordering, or output delay specified for the 747 hypothetical reference decoder. 749 Gradual decoding refresh (GDR) picture: A picture for which each VCL 750 NAL unit has nal_unit_type equal to GDR_NUT. 752 Instantaneous decoding refresh (IDR) PU: A PU in which the coded 753 picture is an IDR picture. 755 Instantaneous decoding refresh (IDR) picture: An IRAP picture for 756 which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or 757 IDR_N_LP. 759 Intra random access point (IRAP) AU: An AU in which there is a PU for 760 each layer in the CVS and the coded picture in each PU is an IRAP 761 picture. 763 Intra random access point (IRAP) PU: A PU in which the coded picture 764 is an IRAP picture. 
766 Intra random access point (IRAP) picture: A coded picture for which 767 all VCL NAL units have the same value of nal_unit_type in the range 768 of IDR_W_RADL to CRA_NUT, inclusive. 770 Layer: A set of VCL NAL units that all have a particular value of 771 nuh_layer_id and the associated non-VCL NAL units. 773 Network abstraction layer (NAL) unit: A syntax structure containing 774 an indication of the type of data to follow and bytes containing that 775 data in the form of an RBSP interspersed as necessary with emulation 776 prevention bytes. 778 Network abstraction layer (NAL) unit stream: A sequence of NAL units. 780 Operation point (OP): A temporal subset of an OLS, identified by an 781 OLS index and a highest value of TemporalId. 783 Picture parameter set (PPS): A syntax structure containing syntax 784 elements that apply to zero or more entire coded pictures as 785 determined by a syntax element found in each slice header. 787 Picture unit (PU): A set of NAL units that are associated with each 788 other according to a specified classification rule, are consecutive 789 in decoding order, and contain exactly one coded picture. 791 Random access: The act of starting the decoding process for a 792 bitstream at a point other than the beginning of the stream. 794 Sequence parameter set (SPS): A syntax structure containing syntax 795 elements that apply to zero or more entire CLVSs as determined by the 796 content of a syntax element found in the PPS referred to by a syntax 797 element found in each picture header. 799 Slice: An integer number of complete tiles or an integer number of 800 consecutive complete CTU rows within a tile of a picture that are 801 exclusively contained in a single NAL unit. 803 Slice header (SH): A part of a coded slice containing the data 804 elements pertaining to all tiles or CTU rows within a tile 805 represented in the slice. 
   Sublayer: A temporal scalable layer of a temporal scalable bitstream
   consisting of VCL NAL units with a particular value of the TemporalId
   variable, and the associated non-VCL NAL units.

   Subpicture: A rectangular region of one or more slices within a
   picture.

   Sublayer representation: A subset of the bitstream consisting of NAL
   units of a particular sublayer and the lower sublayers.

   Tile: A rectangular region of CTUs within a particular tile column
   and a particular tile row in a picture.

   Tile column: A rectangular region of CTUs having a height equal to
   the height of the picture and a width specified by syntax elements in
   the picture parameter set.

   Tile row: A rectangular region of CTUs having a height specified by
   syntax elements in the picture parameter set and a width equal to the
   width of the picture.

   Video coding layer (VCL) NAL unit: A collective term for coded slice
   NAL units and the subset of NAL units that have reserved values of
   nal_unit_type that are classified as VCL NAL units in this
   Specification.

3.1.2. Definitions Specific to This Memo

   Media-Aware Network Element (MANE): A network element, such as a
   middlebox, selective forwarding unit, or application-layer gateway
   that is capable of parsing certain aspects of the RTP payload headers
   or the RTP payload and reacting to their contents.

      Informative note: The concept of a MANE goes beyond normal routers
      or gateways in that a MANE has to be aware of the signaling (e.g.,
      to learn about the payload type mappings of the media streams),
      and in that it has to be trusted when working with Secure RTP
      (SRTP). The advantage of using MANEs is that they allow packets
      to be dropped according to the needs of the media coding. For
      example, if a MANE has to drop packets due to congestion on a
      certain link, it can identify and remove those packets whose
      elimination produces the least adverse effect on the user
      experience. After dropping packets, MANEs must rewrite RTCP
      packets to match the changes to the RTP stream, as specified in
      Section 7 of [RFC3550].

   NAL unit decoding order: A NAL unit order that conforms to the
   constraints on NAL unit order given in Section 7.4.2.4 of [VVC] and
   follows the order of NAL units in the bitstream.

   RTP stream (see [RFC7656]): Within the scope of this memo, one RTP
   stream is utilized to transport a VVC bitstream, which may contain
   one or more layers, and each layer may contain one or more temporal
   sublayers.

   Transmission order: The order of packets in ascending RTP sequence
   number order (in modulo arithmetic). Within an aggregation packet,
   the NAL unit transmission order is the same as the order of
   appearance of NAL units in the packet.

3.2. Abbreviations

   AU         Access Unit
   AP         Aggregation Packet
   APS        Adaptation Parameter Set
   CTU        Coding Tree Unit
   CVS        Coded Video Sequence
   DPB        Decoded Picture Buffer
   DCI        Decoding Capability Information
   DON        Decoding Order Number
   FIR        Full Intra Request
   FU         Fragmentation Unit
   GDR        Gradual Decoding Refresh
   HRD        Hypothetical Reference Decoder
   IDR        Instantaneous Decoding Refresh
   MANE       Media-Aware Network Element
   MTU        Maximum Transmission Unit
   NAL        Network Abstraction Layer
   NALU       Network Abstraction Layer Unit
   PLI        Picture Loss Indication
   PPS        Picture Parameter Set
   RPS        Reference Picture Set
   RPSI       Reference Picture Selection Indication
   SEI        Supplemental Enhancement Information
   SLI        Slice Loss Indication
   SPS        Sequence Parameter Set
   VCL        Video Coding Layer
   VPS        Video Parameter Set

4. RTP Payload Format

4.1. RTP Header Usage

   The format of the RTP header is specified in [RFC3550] (reprinted as
   Figure 2 for convenience). This payload format uses the fields of
   the header in a manner consistent with that specification.

   The RTP payload (and the settings for some RTP header bits) for
   aggregation packets and fragmentation units are specified in
   Section 4.3.2 and Section 4.3.3, respectively.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |            contributing source (CSRC) identifiers             |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   RTP Header According to [RFC3550]

                                Figure 2

   The RTP header information to be set according to this RTP payload
   format is set as follows:

   Marker bit (M): 1 bit
      Set for the last packet, in transmission order, among each set of
      packets that contain NAL units of one access unit. This is in
      line with the normal use of the M bit in video formats to allow
      efficient playout buffer handling.

   Payload Type (PT): 7 bits
      The assignment of an RTP payload type for this new packet format
      is outside the scope of this document and will not be specified
      here. The assignment of a payload type has to be performed either
      through the profile used or in a dynamic way.

   Sequence Number (SN): 16 bits
      Set and used in accordance with [RFC3550].

   Timestamp: 32 bits
      The RTP timestamp is set to the sampling timestamp of the content.
      A 90 kHz clock rate MUST be used.
If the NAL unit has no timing 974 properties of its own (e.g., parameter set and SEI NAL units), the 975 RTP timestamp MUST be set to the RTP timestamp of the coded 976 pictures of the access unit in which the NAL unit (according to 977 Section 7.4.2.4 of [VVC]) is included. Receivers MUST use the RTP 978 timestamp for the display process, even when the bitstream 979 contains picture timing SEI messages or decoding unit information 980 SEI messages as specified in [VVC]. 982 Synchronization source (SSRC): 32 bits 984 Used to identify the source of the RTP packets. A single SSRC is 985 used for all parts of a single bitstream. 987 4.2. Payload Header Usage 989 The first two bytes of the payload of an RTP packet are referred to 990 as the payload header. The payload header consists of the same 991 fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown 992 in Section 1.1.4, irrespective of the type of the payload structure. 994 The TID value indicates (among other things) the relative importance 995 of an RTP packet, for example, because NAL units belonging to higher 996 temporal sublayers are not used for the decoding of lower temporal 997 sublayers. A lower value of TID indicates a higher importance. 998 More-important NAL units MAY be better protected against transmission 999 losses than less-important NAL units. 1001 For Discussion: quite possibly something similar can be said for 1002 the Layer_id in layered coding, but perhaps not in multiview 1003 coding. (The relevant part of the spec is relatively new, 1004 therefore the soft language). However, for serious layer pruning, 1005 interpretation of the VPS is required. We can add language about 1006 the need for stateful interpretation of LayerID vis-a-vis 1007 stateless interpretation of TID later. 1009 4.3. Payload Structures 1011 Three different types of RTP packet payload structures are specified. 
   A receiver can identify the type of an RTP packet payload through the
   Type field in the payload header.

   The three different payload structures are as follows:

   *  Single NAL unit packet: Contains a single NAL unit in the payload,
      and the NAL unit header of the NAL unit also serves as the payload
      header. This payload structure is specified in Section 4.3.1.

   *  Aggregation Packet (AP): Contains more than one NAL unit within
      one access unit. This payload structure is specified in
      Section 4.3.2.

   *  Fragmentation Unit (FU): Contains a subset of a single NAL unit.
      This payload structure is specified in Section 4.3.3.

4.3.1. Single NAL Unit Packets

   A single NAL unit packet contains exactly one NAL unit, and consists
   of a payload header (denoted as PayloadHdr), a conditional 16-bit
   DONL field (in network byte order), and the NAL unit payload data
   (the NAL unit excluding its NAL unit header) of the contained NAL
   unit, as shown in Figure 3.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           PayloadHdr          |      DONL (conditional)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                  NAL unit payload data                        |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               The Structure of a Single NAL Unit Packet

                                Figure 3

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the contained NAL
   unit. If sprop-max-don-diff is greater than 0, the DONL field MUST
   be present, and the variable DON for the contained NAL unit is
   derived as equal to the value of the DONL field. Otherwise
   (sprop-max-don-diff is equal to 0), the DONL field MUST NOT be
   present.

4.3.2. Aggregation Packets (APs)

   Aggregation Packets (APs) can reduce packetization overhead for small
   NAL units, such as most of the non-VCL NAL units, which are often
   only a few octets in size.

   An AP aggregates NAL units of one access unit. Each NAL unit to be
   carried in an AP is encapsulated in an aggregation unit. NAL units
   aggregated in one AP are included in NAL unit decoding order.

   An AP consists of a payload header (denoted as PayloadHdr) followed
   by two or more aggregation units, as shown in Figure 4.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    PayloadHdr (Type=28)       |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                                                               |
   |             two or more aggregation units                     |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 The Structure of an Aggregation Packet

                                Figure 4

   The fields in the payload header of an AP are set as follows. The F
   bit MUST be equal to 0 if the F bit of each aggregated NAL unit is
   equal to zero; otherwise, it MUST be equal to 1. The Type field MUST
   be equal to 28.

   The value of LayerId MUST be equal to the lowest value of LayerId of
   all the aggregated NAL units. The value of TID MUST be the lowest
   value of TID of all the aggregated NAL units.

      Informative note: All VCL NAL units in an AP have the same TID
      value since they belong to the same access unit. However, an AP
      may contain non-VCL NAL units for which the TID value in the NAL
      unit header may be different than the TID value of the VCL NAL
      units in the same AP.
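   As a non-normative illustration of the payload header rules above,
   the following Python sketch unpacks the two-byte header layout of
   Figure 1 and derives the PayloadHdr of an AP from the headers of the
   NAL units to be aggregated. The names NalHeader, parse_nal_header,
   pack_nal_header, and ap_payload_header are hypothetical and exist
   only for this example; the sketch also assumes that the Z bit of the
   AP payload header is simply written as 0.

```python
from typing import NamedTuple, Sequence

AP_TYPE = 28  # Type value of an aggregation packet (Section 4.3.2)


class NalHeader(NamedTuple):
    """Fields of the two-byte header shown in Figure 1."""
    f: int         # forbidden_zero_bit (1 bit)
    z: int         # nuh_reserved_zero_bit (1 bit)
    layer_id: int  # nuh_layer_id (6 bits)
    type: int      # nal_unit_type (5 bits)
    tid: int       # nuh_temporal_id_plus1 (3 bits)


def parse_nal_header(data: bytes) -> NalHeader:
    """Split the first two bytes of a NAL unit into its header fields."""
    if len(data) < 2:
        raise ValueError("a VVC NAL unit header is two bytes long")
    b0, b1 = data[0], data[1]
    return NalHeader(f=b0 >> 7, z=(b0 >> 6) & 1, layer_id=b0 & 0x3F,
                     type=b1 >> 3, tid=b1 & 0x07)


def pack_nal_header(h: NalHeader) -> bytes:
    """Inverse of parse_nal_header."""
    return bytes([(h.f << 7) | (h.z << 6) | h.layer_id,
                  (h.type << 3) | h.tid])


def ap_payload_header(nal_units: Sequence[bytes]) -> bytes:
    """Derive the AP PayloadHdr from the NAL units to be aggregated."""
    if len(nal_units) < 2:
        raise ValueError("an AP carries two or more aggregation units")
    headers = [parse_nal_header(n) for n in nal_units]
    return pack_nal_header(NalHeader(
        f=int(any(h.f for h in headers)),          # 0 only if every F is 0
        z=0,                                       # assumption: left as 0
        layer_id=min(h.layer_id for h in headers), # lowest LayerId
        type=AP_TYPE,
        tid=min(h.tid for h in headers),           # lowest TID
    ))
```

   The same parse_nal_header helper applies to the payload header of any
   RTP packet of this format, since the payload header carries the same
   fields as the NAL unit header.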
   An AP MUST carry at least two aggregation units and can carry as many
   aggregation units as necessary; however, the total amount of data in
   an AP MUST fit into an IP packet, and the size SHOULD be chosen so
   that the resulting IP packet is smaller than the MTU size so as to
   avoid IP layer fragmentation. An AP MUST NOT contain FUs specified
   in Section 4.3.3. APs MUST NOT be nested; i.e., an AP cannot contain
   another AP.

   The first aggregation unit in an AP consists of a conditional 16-bit
   DONL field (in network byte order) followed by a 16-bit unsigned size
   information (in network byte order) that indicates the size of the
   NAL unit in bytes (excluding these two octets, but including the NAL
   unit header), followed by the NAL unit itself, including its NAL unit
   header, as shown in Figure 5.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               :       DONL (conditional)      |   NALU size   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   NALU size   |                                               |
   +-+-+-+-+-+-+-+-+         NAL unit                              |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

          The Structure of the First Aggregation Unit in an AP

                                Figure 5

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the aggregated NAL
   unit.
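   A receiver-side reading of this layout can be sketched as follows.
   This is a non-normative Python example under stated assumptions: the
   input is the RTP payload of an AP with any RTP padding already
   removed; parse_ap and its use_donl flag (standing in for
   "sprop-max-don-diff is greater than 0") are hypothetical names; and
   aggregation units after the first are taken to carry only the size
   field and the NAL unit, with DON advancing by 1 modulo 65536, as
   specified in the remainder of this section.

```python
import struct
from typing import List, Optional, Tuple

AP_TYPE = 28  # Type value of an aggregation packet


def parse_ap(payload: bytes,
             use_donl: bool) -> List[Tuple[Optional[int], bytes]]:
    """Return (DON, NAL unit) pairs carried in an AP payload.

    DON is None when use_donl is False (no DONL field on the wire).
    """
    if len(payload) < 2 or (payload[1] >> 3) != AP_TYPE:
        raise ValueError("not an aggregation packet")
    pos = 2  # skip the two-byte PayloadHdr
    don: Optional[int] = None
    units: List[Tuple[Optional[int], bytes]] = []
    while pos < len(payload):
        if use_donl:
            if not units:
                # Only the first aggregation unit carries DONL.
                if pos + 2 > len(payload):
                    raise ValueError("truncated DONL field")
                (don,) = struct.unpack_from("!H", payload, pos)
                pos += 2
            else:
                # Later units: previous DON plus 1, modulo 65536.
                don = (don + 1) % 65536
        if pos + 2 > len(payload):
            raise ValueError("truncated NALU size field")
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        # The size counts the NAL unit including its two-byte header.
        if size < 2 or pos + size > len(payload):
            raise ValueError("invalid NALU size")
        units.append((don, payload[pos:pos + size]))
        pos += size
    if len(units) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    return units
```

   Rejecting malformed size fields before slicing keeps a corrupted or
   hostile packet from yielding a silently truncated NAL unit.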
   If sprop-max-don-diff is greater than 0, the DONL field MUST be
   present in an aggregation unit that is the first aggregation unit in
   an AP, and the variable DON for the aggregated NAL unit is derived as
   equal to the value of the DONL field. The variable DON for the NAL
   unit in an aggregation unit that is not the first aggregation unit in
   an AP is derived as equal to the DON of the preceding aggregated NAL
   unit in the same AP plus 1, modulo 65536. Otherwise
   (sprop-max-don-diff is equal to 0), the DONL field MUST NOT be
   present in an aggregation unit that is the first aggregation unit in
   an AP.

   An aggregation unit that is not the first aggregation unit in an AP
   consists of a 16-bit unsigned size information (in network byte
   order) that indicates the size of the NAL unit in bytes (excluding
   these two octets, but including the NAL unit header), followed by the
   NAL unit itself, including its NAL unit header, as shown in Figure 6.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               :       NALU size               |   NAL unit    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

       The Structure of an Aggregation Unit That Is Not the First
                       Aggregation Unit in an AP

                                Figure 6

   Figure 7 presents an example of an AP that contains two aggregation
   units, labeled as 1 and 2 in the figure, without the DONL field being
   present.
    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      PayloadHdr (Type=28)     |           NALU 1 Size         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            NALU 1 HDR         |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
   |                     . . .                                     |
   |                                                               |
   +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  . . .        | NALU 2 Size                   | NALU 2 HDR    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | NALU 2 HDR    |                                               |
   +-+-+-+-+-+-+-+-+              NALU 2 Data                      |
   |                     . . .                                     |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     An Example of an AP Containing
              Two Aggregation Units without the DONL Field

                                Figure 7

   Figure 8 presents an example of an AP that contains two aggregation
   units, labeled as 1 and 2 in the figure, with the DONL field being
   present.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      PayloadHdr (Type=28)     |        NALU 1 DONL            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           NALU 1 Size         |            NALU 1 HDR         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                 NALU 1 Data   . . .                           |
   |                                                               |
   +        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :          NALU 2 Size          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          NALU 2 HDR           |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
   |                                                               |
   |        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     An Example of an AP Containing
               Two Aggregation Units with the DONL Field

                                Figure 8

4.3.3. Fragmentation Units

   Fragmentation Units (FUs) are introduced to enable fragmenting a
   single NAL unit into multiple RTP packets, possibly without
   cooperation or knowledge of the [VVC] encoder. A fragment of a NAL
   unit consists of an integer number of consecutive octets of that NAL
   unit. Fragments of the same NAL unit MUST be sent in consecutive
   order with ascending RTP sequence numbers (with no other RTP packets
   within the same RTP stream being sent between the first and last
   fragment).

   When a NAL unit is fragmented and conveyed within FUs, it is referred
   to as a fragmented NAL unit. APs MUST NOT be fragmented. FUs MUST
   NOT be nested; i.e., an FU cannot contain a subset of another FU.

   The RTP timestamp of an RTP packet carrying an FU is set to the
   NALU-time of the fragmented NAL unit.

   An FU consists of a payload header (denoted as PayloadHdr), an FU
   header of one octet, a conditional 16-bit DONL field (in network byte
   order), and an FU payload, as shown in Figure 9.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   PayloadHdr (Type=29)        |   FU header   | DONL (cond)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-|
   | DONL (cond)   |                                               |
   |-+-+-+-+-+-+-+-+                                               |
   |                         FU payload                            |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         The Structure of an FU

                                Figure 9

   The fields in the payload header are set as follows. The Type field
   MUST be equal to 29.
   The fields F, LayerId, and TID MUST be equal to the fields F,
   LayerId, and TID, respectively, of the fragmented NAL unit.

   The FU header consists of an S bit, an E bit, an R bit and a 5-bit
   FuType field, as shown in Figure 10.

                           +---------------+
                           |0|1|2|3|4|5|6|7|
                           +-+-+-+-+-+-+-+-+
                           |S|E|R|  FuType |
                           +---------------+

                       The Structure of FU Header

                               Figure 10

   The semantics of the FU header fields are as follows:

   S: 1 bit
      When set to 1, the S bit indicates the start of a fragmented NAL
      unit, i.e., the first byte of the FU payload is also the first
      byte of the payload of the fragmented NAL unit. When the FU
      payload is not the start of the fragmented NAL unit payload, the S
      bit MUST be set to 0.

   E: 1 bit
      When set to 1, the E bit indicates the end of a fragmented NAL
      unit, i.e., the last byte of the payload is also the last byte of
      the fragmented NAL unit. When the FU payload is not the last
      fragment of a fragmented NAL unit, the E bit MUST be set to 0.

   Reserved: 1 bit
      editor-note 24: to be removed upon wg consensus

      When set to 1, the R bit indicates the last NAL unit of a coded
      picture, i.e., the last byte of the FU payload is also the last
      byte of the coded picture. When the FU payload is not the last
      fragment of a coded picture, the R bit MUST be set to 0.

   FuType: 5 bits
      The field FuType MUST be equal to the field Type of the fragmented
      NAL unit.

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the fragmented NAL
   unit.

   If sprop-max-don-diff is greater than 0, and the S bit is equal to 1,
   the DONL field MUST be present in the FU, and the variable DON for
   the fragmented NAL unit is derived as equal to the value of the DONL
   field.
Otherwise (sprop-max-don-diff is equal to 0, or the S bit is 1327 equal to 0), the DONL field MUST NOT be present in the FU. 1329 A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e., 1330 the Start bit and End bit must not both be set to 1 in the same FU 1331 header. 1333 The FU payload consists of fragments of the payload of the fragmented 1334 NAL unit so that if the FU payloads of consecutive FUs, starting with 1335 an FU with the S bit equal to 1 and ending with an FU with the E bit 1336 equal to 1, are sequentially concatenated, the payload of the 1337 fragmented NAL unit can be reconstructed. The NAL unit header of the 1338 fragmented NAL unit is not included as such in the FU payload, but 1339 rather the information of the NAL unit header of the fragmented NAL 1340 unit is conveyed in F, LayerId, and TID fields of the FU payload 1341 headers of the FUs and the FuType field of the FU header of the FUs. 1342 An FU payload MUST NOT be empty. 1344 If an FU is lost, the receiver SHOULD discard all following 1345 fragmentation units in transmission order corresponding to the same 1346 fragmented NAL unit, unless the decoder in the receiver is known to 1347 be prepared to gracefully handle incomplete NAL units. 1349 A receiver in an endpoint or in a MANE MAY aggregate the first n-1 1350 fragments of a NAL unit to an (incomplete) NAL unit, even if fragment 1351 n of that NAL unit is not received. In this case, the 1352 forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a 1353 syntax violation. 1355 4.4. Decoding Order Number 1357 For each NAL unit, the variable AbsDon is derived, representing the 1358 decoding order number that is indicative of the NAL unit decoding 1359 order. 1361 Let NAL unit n be the n-th NAL unit in transmission order within an 1362 RTP stream. 1364 If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon 1365 for NAL unit n, is derived as equal to n. 

   Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is
   derived as follows, where DON[n] is the value of the variable DON for
   NAL unit n:

   *  If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in
      transmission order), AbsDon[0] is set equal to DON[0].

   *  Otherwise (n is greater than 0), the following applies for
      derivation of AbsDon[n]:

      If DON[n] == DON[n-1],
         AbsDon[n] = AbsDon[n-1]

      If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768),
         AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1]

      If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768),
         AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n]

      If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768),
         AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - DON[n])

      If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768),
         AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n])

   For any two NAL units m and n, the following applies:

   *  AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows
      NAL unit m in NAL unit decoding order.

   *  When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order
      of the two NAL units can be in either order.

   *  AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes
      NAL unit m in decoding order.

      Informative note: When two consecutive NAL units in the NAL
      unit decoding order have different values of AbsDon, the
      absolute difference between the two AbsDon values may be
      greater than or equal to 1.

      Informative note: There are multiple reasons to allow for the
      absolute difference of the values of AbsDon for two consecutive
      NAL units in the NAL unit decoding order to be greater than
      one. An increment by one is not required, as at the time of
      associating values of AbsDon to NAL units, it may not be known
      whether all NAL units are to be delivered to the receiver.
      For example, a gateway might not forward VCL NAL units of higher
      sublayers or some SEI NAL units when there is congestion in the
      network. In another example, the first intra-coded picture of
      a pre-encoded clip is transmitted in advance to ensure that it
      is readily available in the receiver, and when transmitting the
      first intra-coded picture, the originator does not exactly know
      how many NAL units will be encoded before the first intra-coded
      picture of the pre-encoded clip follows in decoding order.
      Thus, the values of AbsDon for the NAL units of the first
      intra-coded picture of the pre-encoded clip have to be
      estimated when they are transmitted, and gaps in values of
      AbsDon may occur.

5.  Packetization Rules

   The following packetization rules apply:

   *  If sprop-max-don-diff is greater than 0, the transmission order of
      NAL units carried in the RTP stream MAY be different than the NAL
      unit decoding order. Otherwise (sprop-max-don-diff is equal to
      0), the transmission order of NAL units carried in the RTP stream
      MUST be the same as the NAL unit decoding order.

   *  A NAL unit of a small size SHOULD be encapsulated in an
      aggregation packet together with one or more other NAL units in
      order to avoid the unnecessary packetization overhead for small
      NAL units. For example, non-VCL NAL units such as access unit
      delimiters, parameter sets, or SEI NAL units are typically small
      and can often be aggregated with VCL NAL units without violating
      MTU size constraints.

   *  Each non-VCL NAL unit SHOULD, when possible from an MTU size match
      viewpoint, be encapsulated in an aggregation packet together with
      its associated VCL NAL unit, as typically a non-VCL NAL unit would
      be meaningless without the associated VCL NAL unit being
      available.

   *  For carrying exactly one NAL unit in an RTP packet, a single NAL
      unit packet MUST be used.
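
   As a rough illustration of the AbsDon derivation in Section 4.4, the
   following Python sketch applies the five cases to a list of DON
   values given in transmission order. The function name
   derive_abs_don and its argument names are not part of this memo;
   they are assumptions made for the example. When sprop-max-don-diff
   is 0, no DON is carried, so only the number of NAL units is used.

```python
def derive_abs_don(dons, sprop_max_don_diff):
    """Return AbsDon[n] for NAL units listed in transmission order.

    dons holds the 16-bit DON value of each NAL unit. Hypothetical
    helper: the name and signature are not defined by the memo.
    """
    if sprop_max_don_diff == 0:
        # AbsDon[n] is simply n; the DON values are not consulted.
        return list(range(len(dons)))

    abs_dons = []
    for n, don in enumerate(dons):
        if n == 0:
            abs_dons.append(don)  # AbsDon[0] = DON[0]
            continue
        prev = dons[n - 1]
        if don == prev:
            delta = 0
        elif don > prev and don - prev < 32768:
            delta = don - prev                 # plain forward step
        elif don < prev and prev - don >= 32768:
            delta = 65536 - prev + don         # forward step with wrap
        elif don > prev and don - prev >= 32768:
            delta = -(prev + 65536 - don)      # backward step with wrap
        else:  # don < prev and prev - don < 32768
            delta = -(prev - don)              # plain backward step
        abs_dons.append(abs_dons[-1] + delta)
    return abs_dons
```

   For instance, the DON sequence 65534, 65535, 0, 1 wraps around the
   16-bit range and yields the AbsDon values 65534, 65535, 65536, 65537.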

6.  De-packetization Process

   The general concept behind de-packetization is to get the NAL units
   out of the RTP packets in an RTP stream and pass them to the decoder
   in the NAL unit decoding order.

   The de-packetization process is implementation dependent. Therefore,
   the following description should be seen as an example of a suitable
   implementation. Other schemes may be used as well, as long as the
   output for the same input is the same as the process described below.
   The output is the same when the set of output NAL units and their
   order are both identical. Optimizations relative to the described
   algorithms are possible.

   All normal RTP mechanisms related to buffer management apply. In
   particular, duplicated or outdated RTP packets (as indicated by the
   RTP sequence number and the RTP timestamp) are removed. To
   determine the exact time for decoding, factors such as a possible
   intentional delay to allow for proper inter-stream synchronization
   MUST be factored in.

   NAL units with NAL unit type values in the range of 0 to 27,
   inclusive, may be passed to the decoder. NAL-unit-like structures
   with NAL unit type values in the range of 28 to 31, inclusive, MUST
   NOT be passed to the decoder.

   The receiver includes a receiver buffer, which is used to compensate
   for transmission delay jitter within an individual RTP stream and to
   reorder NAL units from transmission order to the NAL unit decoding
   order. In this section, the receiver operation is described under
   the assumption that there is no transmission delay jitter within an
   RTP stream. To distinguish it from a practical receiver buffer
   that is also used for compensation of transmission delay jitter, the
   receiver buffer is hereafter called the de-packetization buffer in
   this section.
   Receivers should also prepare for transmission delay
   jitter; that is, either reserve separate buffers for transmission
   delay jitter buffering and de-packetization buffering or use a
   receiver buffer for both transmission delay jitter and
   de-packetization. Moreover, receivers should take transmission delay
   jitter into account in the buffering operation, e.g., by additional
   initial buffering before starting decoding and playback.

   When sprop-max-don-diff is equal to 0, the de-packetization buffer
   size is zero bytes, and the process described in the remainder of
   this paragraph applies. The NAL units carried in the single RTP
   stream are directly passed to the decoder in their transmission
   order, which is identical to their decoding order.

   When sprop-max-don-diff is greater than 0, the process described in
   the remainder of this section applies.

   There are two buffering states in the receiver: initial buffering and
   buffering while playing. Initial buffering starts when the reception
   is initialized. After initial buffering, decoding and playback are
   started, and the buffering-while-playing mode is used.

   Regardless of the buffering state, the receiver stores incoming NAL
   units in reception order into the de-packetization buffer. NAL units
   carried in RTP packets are stored in the de-packetization buffer
   individually, and the value of AbsDon is calculated and stored for
   each NAL unit.

   Initial buffering lasts until condition A (the difference between the
   greatest and smallest AbsDon values of the NAL units in the
   de-packetization buffer is greater than or equal to the value of
   sprop-max-don-diff) or condition B (the number of NAL units in the
   de-packetization buffer is greater than the value of
   sprop-depack-buf-nalus) is true.

   After initial buffering, whenever condition A or condition B is true,
   the following operation is repeatedly applied until both condition A
   and condition B become false:

   *  The NAL unit in the de-packetization buffer with the smallest
      value of AbsDon is removed from the de-packetization buffer and
      passed to the decoder.

   When no more NAL units are flowing into the de-packetization buffer,
   all NAL units remaining in the de-packetization buffer are removed
   from the buffer and passed to the decoder in the order of increasing
   AbsDon values.

7.  Payload Format Parameters

   This section specifies the optional parameters. A mapping of the
   parameters to the Session Description Protocol (SDP) [RFC4566] is
   also provided for applications that use SDP.

7.1.  Media Type Registration

   The receiver MUST ignore any parameter unspecified in this memo.

   Type name:  video

   Subtype name:  H266

   Required parameters:  none

   Optional parameters:

      profile-id, tier-flag, sub-profile-id, interop-constraints, and
      level-id:

         These parameters indicate the profile, tier, default level,
         sub-profile, and some constraints of the bitstream carried by
         the RTP stream, or a specific set of the profile, tier, default
         level, sub-profile and some constraints the receiver supports.

         The subset of coding tools that may have been used to generate
         the bitstream or that the receiver supports, as well as some
         additional constraints, are indicated collectively by
         profile-id, sub-profile-id, and interop-constraints.

            Informative note: There are 128 values of profile-id. The
            subset of coding tools identified by the profile-id can be
            further constrained with up to 255 instances of
            sub-profile-id. In addition, 68 bits included in
            interop-constraints, which can be extended up to 324 bits,
            provide means to further restrict tools from existing
            profiles.
            To be able to support this fine-granular signalling of
            coding tool subsets with profile-id, sub-profile-id and
            interop-constraints, it would be safe to require symmetric
            use of these parameters in SDP offer/answer unless
            recv-ols-id is included in the SDP answer for choosing one
            of the layers offered.

         The tier is indicated by tier-flag. The default level is
         indicated by level-id. The tier and the default level specify
         the limits on values of syntax elements or arithmetic
         combinations of values of syntax elements that are followed
         when generating the bitstream or that the receiver supports.

         In SDP offer/answer, when the SDP answer does not include the
         recv-ols-id parameter that is less than the sprop-ols-id
         parameter in the SDP offer, the following applies:

         o  The tier-flag, profile-id, sub-profile-id, and
            interop-constraints parameters MUST be used symmetrically,
            i.e., the value of each of these parameters in the offer
            MUST be the same as that in the answer, either explicitly
            signaled or implicitly inferred.

         o  The level-id parameter is changeable as long as the highest
            level indicated by the answer is either equal to or lower
            than that in the offer. Note that a highest level higher
            than level-id in the offer for receiving can be included as
            max-recv-level-id.

         In SDP offer/answer, when the SDP answer does include the
         recv-ols-id parameter that is less than the sprop-ols-id
         parameter in the SDP offer, the set of tier-flag, profile-id,
         sub-profile-id, interop-constraints, and level-id parameters
         included in the answer MUST be consistent with that for the
         chosen output layer set as indicated in the SDP offer, with the
         exception that the level-id parameter in the SDP answer is
         changeable as long as the highest level indicated by the answer
         is either lower than or equal to that in the offer.
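
         The offer/answer rules above, for the case in which the answer
         does not select a lower output layer set, might be checked
         roughly as follows. This is a sketch only: the function names
         and the dictionary representation of the fmtp parameters are
         assumptions, the values are assumed to be already parsed and
         normalized, and the defaults are the inferred values stated in
         this section for profile-id (1), tier-flag (0), and level-id
         (51).

```python
# Inferred values when a parameter is absent (profile-id, tier-flag and
# level-id defaults as stated in this section).
DEFAULTS = {"profile-id": 1, "tier-flag": 0, "level-id": 51}

# Parameters whose values have to match between offer and answer.
SYMMETRIC = ("tier-flag", "profile-id", "sub-profile-id",
             "interop-constraints")


def effective(params, name):
    """Return the signaled value, or the inferred default (None if none)."""
    return params.get(name, DEFAULTS.get(name))


def answer_selects_lower_ols(offer, answer):
    """True when the answer carries recv-ols-id below the offered sprop-ols-id."""
    return ("recv-ols-id" in answer and "sprop-ols-id" in offer
            and answer["recv-ols-id"] < offer["sprop-ols-id"])


def answer_is_consistent(offer, answer):
    """Check symmetric use and the level-id rule (hypothetical helper)."""
    if answer_selects_lower_ols(offer, answer):
        # The rules for a chosen output layer set apply instead.
        raise ValueError("answer selects a lower output layer set")
    if any(effective(offer, p) != effective(answer, p) for p in SYMMETRIC):
        return False
    # The answer may keep or lower the level, but not raise it.
    return effective(answer, "level-id") <= effective(offer, "level-id")
```

         With an offer of {"profile-id": 1, "level-id": 83}, an answer
         of {"level-id": 51} passes this check, while an answer of
         {"level-id": 99} does not.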

         More specifications of these parameters, including how they
         relate to syntax elements specified in [VVC], are provided
         below.

      profile-id:

         When profile-id is not present, a value of 1 (i.e., the Main 10
         profile) MUST be inferred.

         When used to indicate properties of a bitstream, profile-id is
         derived from the general_profile_idc syntax element that
         applies to the bitstream in an instance of the
         profile_tier_level( ) syntax structure.

         A profile_tier_level( ) syntax structure may be contained in an
         SPS, VPS, or DCI NAL unit as specified in [VVC]. One of the
         following three cases applies to the container NAL unit of the
         profile_tier_level( ) syntax structure containing those PTL
         syntax elements used to derive the values of profile-id,
         tier-flag, level-id, sub-profile-id, or interop-constraints:
         1) The container NAL unit is an SPS, the bitstream is a
         single-layer bitstream, and the profile_tier_level( ) syntax
         structures in all SPSs referenced by the CVSs in the bitstream
         have the same values respectively for those PTL syntax
         elements; 2) The container NAL unit is a VPS, the
         profile_tier_level( ) syntax structure is the one in the VPS
         that applies to the OLS corresponding to the bitstream, and the
         profile_tier_level( ) syntax structures applicable to the OLS
         corresponding to the bitstream in all VPSs referenced by the
         CVSs in the bitstream have the same values respectively for
         those PTL syntax elements; 3) The container NAL unit is a DCI
         NAL unit and the profile_tier_level( ) syntax structures in all
         DCI NAL units in the bitstream have the same values
         respectively for those PTL syntax elements.

      tier-flag, level-id:

         The value of tier-flag MUST be in the range of 0 to 1,
         inclusive. The value of level-id MUST be in the range of 0 to
         255, inclusive.

         If the tier-flag and level-id parameters are used to indicate
         properties of a bitstream, they indicate the tier and the
         highest level the bitstream complies with.

         If the tier-flag and level-id parameters are used for
         capability exchange, the following applies. If
         max-recv-level-id is not present, the default level defined by
         level-id indicates the highest level the codec wishes to
         support. Otherwise, max-recv-level-id indicates the highest
         level the codec supports for receiving. For either receiving
         or sending, all levels that are lower than the highest level
         supported MUST also be supported.

         If no tier-flag is present, a value of 0 MUST be inferred; if
         no level-id is present, a value of 51 (i.e., level 3.1) MUST be
         inferred.

            Informative note: The level values currently defined in the
            VVC specification are in the form of "majorNum.minorNum",
            and the value of the level-id for each of the levels is
            equal to majorNum * 16 + minorNum * 3. It is expected that
            if any levels are defined in the future, the same convention
            will be used, but this cannot be guaranteed.

         When used to indicate properties of a bitstream, the tier-flag
         and level-id parameters are derived respectively from the
         syntax element general_tier_flag, and the syntax element
         general_level_idc or sub_layer_level_idc[j], that apply to the
         bitstream, in an instance of the profile_tier_level( ) syntax
         structure.
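
         The level-id convention from the informative note above can be
         written out as a small calculation. The helper names below are
         illustrative and are not defined by this memo or by [VVC]; as
         the note says, levels defined in the future might not follow
         the same convention.

```python
def level_to_level_id(major_num, minor_num):
    """Map level "majorNum.minorNum" to level-id per the stated convention."""
    return major_num * 16 + minor_num * 3


def level_id_to_level(level_id):
    """Inverse mapping (hypothetical helper).

    Assumes minorNum * 3 is below 16, so that the two parts can be
    separated again; raises ValueError for values outside the convention.
    """
    major_num, remainder = divmod(level_id, 16)
    if remainder % 3 != 0:
        raise ValueError("level-id does not follow the convention")
    return major_num, remainder // 3
```

         Level 3.1, the inferred default, maps to 3 * 16 + 1 * 3 = 51.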

         If the tier-flag and level-id are derived from the
         profile_tier_level( ) syntax structure in a DCI NAL unit, the
         following applies:

         o  tier-flag = general_tier_flag

         o  level-id = general_level_idc

         Otherwise, if the tier-flag and level-id are derived from the
         profile_tier_level( ) syntax structure in an SPS or VPS NAL
         unit, and the bitstream contains the highest sub-layer
         representation in the OLS corresponding to the bitstream, the
         following applies:

         o  tier-flag = general_tier_flag

         o  level-id = general_level_idc

         Otherwise, if the tier-flag and level-id are derived from the
         profile_tier_level( ) syntax structure in an SPS or VPS NAL
         unit, and the bitstream does not contain the highest sub-layer
         representation in the OLS corresponding to the bitstream, the
         following applies, with j being the value of the
         sprop-sub-layer-id parameter:

         o  tier-flag = general_tier_flag

         o  level-id = sub_layer_level_idc[j]

      sub-profile-id:

         The value of the parameter is a comma-separated (',') list of
         values.

   editor-note 11: What is the value? integer, base32?

         When used to indicate properties of a bitstream, sub-profile-id
         is derived from each of the ptl_num_sub_profiles
         general_sub_profile_idc[i] syntax elements that apply to the
         bitstream in a profile_tier_level( ) syntax structure.

      interop-constraints:

         A base16 [RFC4648] (hexadecimal) representation of the data
         that includes the syntax elements
         ptl_frame_only_constraint_flag and ptl_multilayer_enabled_flag
         and the general_constraints_info( ) syntax structure that apply
         to the bitstream in an instance of the profile_tier_level( )
         syntax structure.

         If the interop-constraints parameter is not present, the
         following MUST be inferred:

         o  ptl_frame_only_constraint_flag = 0

         o  ptl_multilayer_enabled_flag = 1

         o  gci_present_flag in the general_constraints_info( ) syntax
            structure = 1

   editor-note 14: Double check the default values. Currently, no
   constraints, but actually, with the Main 10 profile as default
   multi-layer not possible.

         Using interop-constraints for capability exchange results in a
         requirement on any bitstream to be compliant with the
         interop-constraints.

      sprop-sub-layer-id:

         This parameter MAY be used to indicate the highest allowed
         value of TID in the bitstream. When not present, the value of
         sprop-sub-layer-id is inferred to be equal to 6.

         The value of sprop-sub-layer-id MUST be in the range of 0 to 6,
         inclusive.

      sprop-ols-id:

         This parameter MAY be used to indicate the OLS that the
         bitstream applies to. When not present, the value of
         sprop-ols-id is inferred to be equal to TargetOlsIdx as
         specified in 8.1.1 in [VVC]. If this optional parameter is
         present, sprop-vps MUST also be present or its content MUST be
         known a priori at the receiver.

         The value of sprop-ols-id MUST be in the range of 0 to 257,
         inclusive.

      recv-sub-layer-id:

         This parameter MAY be used to signal a receiver's choice of the
         offered or declared sub-layer representations in the sprop-vps
         and sprop-sps. The value of recv-sub-layer-id indicates the
         TID of the highest sub-layer of the bitstream that a receiver
         supports. When not present, the value of recv-sub-layer-id is
         inferred to be equal to the value of the sprop-sub-layer-id
         parameter in the SDP offer.

         The value of recv-sub-layer-id MUST be in the range of 0 to 6,
         inclusive.

      recv-ols-id:

         This parameter MAY be used to signal a receiver's choice of the
         offered or declared output layer sets in the sprop-vps. The
         value of recv-ols-id indicates the OLS index of the bitstream
         that a receiver supports. When not present, the value of
         recv-ols-id is inferred to be equal to the value of the
         sprop-ols-id parameter in the SDP offer. When present, the
         value of recv-ols-id must be included only when sprop-ols-id
         was received and must refer to an output layer set in the VPS
         that is in the same dependency tree as the OLS referred to by
         sprop-ols-id. If this optional parameter is present, sprop-vps
         must have been received or its content must be known a priori
         at the receiver.

         The value of recv-ols-id MUST be in the range of 0 to 257,
         inclusive.

      max-recv-level-id:

         This parameter MAY be used to indicate the highest level a
         receiver supports.

         The value of max-recv-level-id MUST be in the range of 0 to
         255, inclusive.

         When max-recv-level-id is not present, the value is inferred to
         be equal to level-id.

         max-recv-level-id MUST NOT be present when the highest level
         the receiver supports is not higher than the default level.

      sprop-dci:

         This parameter MAY be used to convey a decoding capability
         information NAL unit of the bitstream for out-of-band
         transmission. The parameter MAY also be used for capability
         exchange. The value of the parameter is a base64 [RFC4648]
         representation of the decoding capability information NAL unit
         as specified in Section 7.3.2.1 of [VVC].

      sprop-vps:

         This parameter MAY be used to convey any video parameter set
         NAL unit of the bitstream for out-of-band transmission of video
         parameter sets.
         The parameter MAY also be used for capability
         exchange and to indicate sub-stream characteristics (i.e.,
         properties of output layer sets and sublayer representations as
         defined in [VVC]). The value of the parameter is a
         comma-separated (',') list of base64 [RFC4648] representations
         of the video parameter set NAL units as specified in
         Section 7.3.2.3 of [VVC].

         The sprop-vps parameter MAY contain one or more than one video
         parameter set NAL unit. However, all other video parameter
         sets contained in the sprop-vps parameter MUST be consistent
         with the first video parameter set in the sprop-vps parameter.
         A video parameter set vpsB is said to be consistent with
         another video parameter set vpsA if any decoder that conforms
         to the profile, tier, level, and constraints indicated by the
         12 bytes of data starting from the syntax element
         general_profile_space to the syntax element general_level_idc,
         inclusive, in the first profile_tier_level( ) syntax structure
         in vpsA can decode any bitstream that conforms to the profile,
         tier, level, and constraints indicated by the 12 bytes of data
         starting from the syntax element general_profile_space to the
         syntax element general_level_idc, inclusive, in the first
         profile_tier_level( ) syntax structure in vpsB.

      sprop-sei:

         This parameter MAY be used to convey one or more SEI messages
         that describe bitstream characteristics. When present, a
         decoder can rely on the bitstream characteristics that are
         described in the SEI messages for the entire duration of the
         session, independently from the persistence scopes of the SEI
         messages as specified in [VSEI].

         The value of the parameter is a comma-separated (',') list of
         base64 [RFC4648] representations of SEI NAL units as specified
         in [VSEI].
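
         A receiver might split and decode the value of sprop-vps or
         sprop-sei along the following lines. This is a minimal sketch;
         the function name is an assumption, and the bytes used in the
         usage lines are arbitrary placeholders rather than real NAL
         units.

```python
import base64


def parse_sprop_list(value):
    """Split a comma-separated list of base64 items into byte strings."""
    nal_units = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate stray separators in this sketch
        # validate=True rejects characters outside the base64 alphabet.
        nal_units.append(base64.b64decode(item, validate=True))
    return nal_units


# Usage with placeholder bytes (not real VPS or SEI NAL units).
placeholders = [b"\x01\x02\x03", b"\x04\x05"]
value = ",".join(base64.b64encode(p).decode("ascii") for p in placeholders)
decoded = parse_sprop_list(value)
```

         Splitting on ',' is safe here because the comma is not part of
         the base64 alphabet.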

            Informative note: Intentionally, no list of applicable or
            inapplicable SEI messages is specified here. Conveying
            certain SEI messages in sprop-sei may be sensible in some
            application scenarios and meaningless in others. However, a
            few examples are described below:

            1) In an environment where the bitstream was created from
            film-based source material, and no splicing is going to
            occur during the lifetime of the session, the film grain
            characteristics SEI message is likely meaningful, and
            sending it in sprop-sei rather than in the bitstream at each
            entry point may help with saving bits and allows one to
            configure the renderer only once, avoiding unwanted
            artifacts.

            2) Examples of SEI messages that would be meaningless to
            convey in sprop-sei include the decoded picture hash SEI
            message (it is close to impossible that all decoded pictures
            have the same hash), the display orientation SEI message
            when the device is a handheld device (as the display
            orientation may change when the handheld device is turned
            around), or the filler payload SEI message (as there is no
            point in just having more bits in SDP).

      max-lsr:

         The max-lsr MAY be used to signal the capabilities of a
         receiver implementation and MUST NOT be used for any other
         purpose. The value of max-lsr is an integer indicating the
         maximum processing rate in units of luma samples per second.
         The max-lsr parameter signals that the receiver is capable of
         decoding video at a higher rate than is required by the highest
         level.

            Informative note: When the OPTIONAL media type parameters
            are used to signal the properties of a bitstream, and
            max-lsr is not present, the values of tier-flag, profile-id,
            sub-profile-id, interop-constraints, and level-id must
            always be such that the bitstream complies fully with the
            specified profile, tier, and level.

         When max-lsr is signaled, the receiver MUST be able to decode
         bitstreams that conform to the highest level, with the
         exception that the MaxLumaSr value in Table 136 of [VVC] for
         the highest level is replaced with the value of max-lsr.
         Senders MAY use this knowledge to send pictures of a given size
         at a higher picture rate than is indicated in the highest
         level.

         When not present, the value of max-lsr is inferred to be equal
         to the value of MaxLumaSr given in Table 136 of [VVC] for the
         highest level.

         The value of max-lsr MUST be in the range of MaxLumaSr to 16 *
         MaxLumaSr, inclusive, where MaxLumaSr is given in Table 136 of
         [VVC] for the highest level.

      max-fps:

         The value of max-fps is an integer indicating the maximum
         picture rate in units of pictures per 100 seconds that can be
         effectively processed by the receiver. The max-fps parameter
         MAY be used to signal that the receiver has a constraint in
         that it is not capable of processing video effectively at the
         full picture rate that is implied by the highest level and,
         when present, max-lsr.

         The value of max-fps is not necessarily the picture rate at
         which the maximum picture size can be sent; it constitutes a
         constraint on maximum picture rate for all resolutions.

            Informative note: The max-fps parameter is semantically
            different from max-lsr in that max-fps is used to signal a
            constraint, lowering the maximum picture rate from what is
            implied by other parameters.

         The encoder MUST use a picture rate equal to or less than this
         value. In cases where the max-fps parameter is absent, the
         encoder is free to choose any picture rate according to the
         highest level and any signaled optional parameters.

         The value of max-fps MUST be smaller than or equal to the full
         picture rate that is implied by the highest level and, when
         present, max-lsr.
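
         The interaction of max-lsr and max-fps can be illustrated with
         a simplified calculation. The sketch below considers only the
         luma sample rate and ignores every other level limit; the
         function names are assumptions, and the MaxLumaSr value in the
         usage lines is a made-up number, not a value taken from
         Table 136 of [VVC].

```python
def max_lsr_in_range(max_lsr, level_max_luma_sr):
    """Check MaxLumaSr <= max-lsr <= 16 * MaxLumaSr for the highest level."""
    return level_max_luma_sr <= max_lsr <= 16 * level_max_luma_sr


def picture_rate_bound(luma_samples_per_picture, level_max_luma_sr,
                       max_lsr=None, max_fps=None):
    """Upper bound on the picture rate in pictures per second.

    Hypothetical helper: only the luma sample rate and max-fps are
    taken into account.
    """
    # max-lsr, when signaled, replaces MaxLumaSr of the highest level.
    luma_rate = level_max_luma_sr if max_lsr is None else max_lsr
    rate = luma_rate / luma_samples_per_picture
    if max_fps is not None:
        # max-fps is expressed in pictures per 100 seconds.
        rate = min(rate, max_fps / 100)
    return rate


# Usage with a made-up MaxLumaSr of 1,000,000 luma samples per second
# and pictures of 10,000 luma samples each.
base = picture_rate_bound(10_000, 1_000_000)                      # 100.0
raised = picture_rate_bound(10_000, 1_000_000, max_lsr=2_000_000)  # 200.0
capped = picture_rate_bound(10_000, 1_000_000, max_fps=3000)       # 30.0
```

         In this reading, max-lsr can only raise the bound derived from
         the level, while max-fps can only lower it.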

      sprop-max-don-diff:

         If there is no NAL unit naluA that is followed in transmission
         order by any NAL unit preceding naluA in decoding order (i.e.,
         the transmission order of the NAL units is the same as the
         decoding order), the value of this parameter MUST be equal to
         0.

         Otherwise, this parameter specifies the maximum absolute
         difference between the decoding order number (i.e., AbsDon)
         values of any two NAL units naluA and naluB, where naluA
         follows naluB in decoding order and precedes naluB in
         transmission order.

         The value of sprop-max-don-diff MUST be an integer in the range
         of 0 to 32767, inclusive.

         When not present, the value of sprop-max-don-diff is inferred
         to be equal to 0.

      sprop-depack-buf-bytes:

         This parameter signals the required size of the
         de-packetization buffer in units of bytes. The value of the
         parameter MUST be greater than or equal to the maximum buffer
         occupancy (in units of bytes) of the de-packetization buffer as
         specified in Section 6.

         The value of sprop-depack-buf-bytes MUST be an integer in the
         range of 0 to 4294967295, inclusive.

         When sprop-max-don-diff is present and greater than 0, this
         parameter MUST be present and the value MUST be greater than 0.
         When not present, the value of sprop-depack-buf-bytes is
         inferred to be equal to 0.

            Informative note: The value of sprop-depack-buf-bytes
            indicates the required size of the de-packetization buffer
            only. When network jitter can occur, an appropriately sized
            jitter buffer has to be available as well.

      depack-buf-cap:

         This parameter signals the capabilities of a receiver
         implementation and indicates the amount of de-packetization
         buffer space in units of bytes that the receiver has available
         for reconstructing the NAL unit decoding order from NAL units
         carried in the RTP stream.
         A receiver is able to handle any
         RTP stream for which the value of the sprop-depack-buf-bytes
         parameter is smaller than or equal to this parameter.

         When not present, the value of depack-buf-cap is inferred to be
         equal to 4294967295. The value of depack-buf-cap MUST be an
         integer in the range of 1 to 4294967295, inclusive.

            Informative note: depack-buf-cap indicates the maximum
            possible size of the de-packetization buffer of the receiver
            only, without allowing for network jitter.

   editor-note 19: sprop-depack-buf-nalus not included but mentioned in
   section 6 for startup in de-packetization process. We should decide
   on whether it needs to be included or not.

7.2.  SDP Parameters

   The receiver MUST ignore any parameter unspecified in this memo.

7.2.1.  Mapping of Payload Type Parameters to SDP

   The media type video/H266 string is mapped to fields in the Session
   Description Protocol (SDP) [RFC4566] as follows:

   *  The media name in the "m=" line of SDP MUST be video.

   *  The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the
      media subtype).

   *  The clock rate in the "a=rtpmap" line MUST be 90000.

   *  The OPTIONAL parameters profile-id, tier-flag, sub-profile-id,
      interop-constraints, level-id, sprop-sub-layer-id, sprop-ols-id,
      recv-sub-layer-id, recv-ols-id, max-recv-level-id, max-lsr,
      max-fps, sprop-max-don-diff, sprop-depack-buf-bytes and
      depack-buf-cap, when present, MUST be included in the "a=fmtp"
      line of SDP. These parameters are expressed as a media type
      string, in the form of a semicolon-separated list of
      parameter=value pairs.

   editor-note 20: To Be updated

   An example of media representation in SDP is as follows:

      m=video 49170 RTP/AVP 98
      a=rtpmap:98 H266/90000
      a=fmtp:98 profile-id=1; sprop-vps=