idnits 2.17.1

draft-ietf-avtcore-rtp-vvc-09.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  -- The document date (2 June 2021) is 1052 days in the past.  Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 1370

  ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866)

  ** Downref: Normative reference to an Informational RFC: RFC 7656

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VSEI'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC'

     Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

avtcore                                                          S. Zhao
Internet-Draft                                                 S. Wenger
Intended status: Standards Track                                 Tencent
Expires: 4 December 2021                                      Y. Sanchez
                                                          Fraunhofer HHI
                                                              Y.-K. Wang
                                                          Bytedance Inc.
                                                             2 June 2021

          RTP Payload Format for Versatile Video Coding (VVC)
                     draft-ietf-avtcore-rtp-vvc-09

Abstract

This memo describes an RTP payload format for the video coding
standard ITU-T Recommendation H.266 and ISO/IEC International
Standard 23090-3, both also known as Versatile Video Coding (VVC) and
developed by the Joint Video Experts Team (JVET).  The RTP payload
format allows for packetization of one or more Network Abstraction
Layer (NAL) units in each RTP packet payload as well as fragmentation
of a NAL unit into multiple RTP packets.  The payload format has wide
applicability in videoconferencing, Internet video streaming, and
high-bitrate entertainment-quality video, among other applications.

Status of This Memo

This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF).  Note that other groups may also distribute
working documents as Internet-Drafts.  The list of current
Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time.  It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

This Internet-Draft will expire on 4 December 2021.

Copyright Notice

Copyright (c) 2021 IETF Trust and the persons identified as the
document authors.  All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(https://trustee.ietf.org/license-info) in effect on the date of
publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document.  Code Components
extracted from this document must include Simplified BSD License text
as described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Simplified BSD License.

Table of Contents

1.  Introduction
  1.1.  Overview of the VVC Codec
    1.1.1.  Coding-Tool Features (informative)
    1.1.2.  Systems and Transport Interfaces (informative)
    1.1.3.  High-Level Picture Partitioning (informative)
    1.1.4.  NAL Unit Header
  1.2.  Overview of the Payload Format
2.  Conventions
3.  Definitions and Abbreviations
  3.1.  Definitions
    3.1.1.  Definitions from the VVC Specification
    3.1.2.  Definitions Specific to This Memo
  3.2.  Abbreviations
4.  RTP Payload Format
  4.1.  RTP Header Usage
  4.2.  Payload Header Usage
  4.3.  Payload Structures
    4.3.1.  Single NAL Unit Packets
    4.3.2.  Aggregation Packets (APs)
    4.3.3.  Fragmentation Units
  4.4.  Decoding Order Number
5.  Packetization Rules
6.  De-packetization Process
7.  Payload Format Parameters
  7.1.  Media Type Registration
  7.2.  SDP Parameters
    7.2.1.  Mapping of Payload Type Parameters to SDP
    7.2.2.  Usage with SDP Offer/Answer Model
8.  Use with Feedback Messages
  8.1.  Picture Loss Indication (PLI)
  8.2.  Full Intra Request (FIR)
9.  Security Considerations
10. Congestion Control
11. IANA Considerations
12. Acknowledgements
13. References
  13.1.  Normative References
  13.2.  Informative References
Appendix A.  Change History
Authors' Addresses

1.  Introduction

The Versatile Video Coding [VVC] specification, formally published as
both ITU-T Recommendation H.266 and ISO/IEC International Standard
23090-3, is currently in the ITU-T publication process and the
ISO/IEC approval process.  VVC is reported to provide significant
coding efficiency gains over HEVC [HEVC], also known as H.265, and
other earlier video codecs.

This memo specifies an RTP payload format for VVC.  It shares its
basic design with the NAL (Network Abstraction Layer) unit-based RTP
payload formats of H.264 Video Coding [RFC6184], Scalable Video
Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC)
[RFC7798], and their respective predecessors.
With respect to design philosophy, security, congestion control, and
overall implementation complexity, it has similar properties to those
earlier payload format specifications.  This is a conscious choice,
as at least RFC 6184 is widely deployed and generally known in the
relevant implementer communities.  Certain mechanisms known from
[RFC6190] were incorporated in VVC, as VVC version 1 supports
temporal, spatial, and signal-to-noise ratio (SNR) scalability.

1.1.  Overview of the VVC Codec

VVC and HEVC share a similar hybrid video codec design.  In this
memo, we provide a very brief overview of those features of VVC that
are, in some form, addressed by the payload format specified herein.
Implementers have to read, understand, and apply the ITU-T/ISO/IEC
specifications pertaining to VVC to arrive at interoperable,
well-performing implementations.

Conceptually, both VVC and HEVC include a Video Coding Layer (VCL),
which is often used to refer to the coding-tool features, and a NAL,
which is often used to refer to the systems and transport interface
aspects of the codecs.

1.1.1.  Coding-Tool Features (informative)

Coding tool features are described below with occasional reference to
the coding tool set of HEVC, which is well known in the community.

Similar to earlier hybrid-video-coding-based standards, including
HEVC, the following basic video coding design is employed by VVC.  A
prediction signal is first formed by either intra- or
motion-compensated prediction, and the residual (the difference
between the original and the prediction) is then coded.  The gains in
coding efficiency are achieved by redesigning and improving almost
all parts of the codec over earlier designs.  In addition, VVC
includes several tools to make the implementation on parallel
architectures easier.

Finally, VVC includes temporal, spatial, and SNR scalability as well
as multiview coding support.

Coding blocks and transform structure

Among major coding-tool differences between HEVC and VVC, one of the
important improvements is the more flexible coding tree structure in
VVC, i.e., the multi-type tree.  In addition to the quadtree, binary
and ternary trees are also supported, which contributes a significant
improvement in coding efficiency.  Moreover, the maximum size of the
coding tree unit (CTU) is increased from 64x64 to 128x128.  To
improve the coding efficiency of the chroma signal, separate luma and
chroma trees at the CTU level may be employed for intra-slices.  The
square transforms in HEVC are extended to non-square transforms for
rectangular blocks resulting from binary and ternary tree splits.
In addition, VVC supports multiple transform sets (MTS), including
DCT-2, DST-7, and DCT-8, as well as the non-separable secondary
transform.  The transforms used in VVC can have different sizes with
support for larger transform sizes.  For DCT-2, the transform sizes
range from 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes
range from 4x4 to 32x32.  VVC also supports sub-block transforms for
both intra- and inter-coded blocks.  For intra-coded blocks, intra
sub-partitioning (ISP) may be used to allow sub-block-based intra
prediction and transform.  For inter blocks, a sub-block transform
may be used assuming that only a part of an inter block has non-zero
transform coefficients.

Entropy coding

Similar to HEVC, VVC uses a single entropy-coding engine, which is
based on context adaptive binary arithmetic coding [CABAC], but with
the support of multi-window sizes.  The window sizes can be
initialized differently for different context models.  With this
design, the engine adapts more efficiently and achieves better coding
efficiency.
A joint chroma residual coding scheme is applied to further exploit
the correlation between the residuals of two color components.  In
VVC, different residual coding schemes are applied for regular
transform coefficients and residual samples generated using
transform-skip mode.

In-loop filtering

VVC has more feature support in loop filters than HEVC.  The
deblocking filter in VVC is similar to that of HEVC but operates on a
smaller grid.  After deblocking and sample adaptive offset (SAO), an
adaptive loop filter (ALF) may be used.  As a Wiener filter, ALF
reduces distortion of decoded pictures.  In addition, VVC introduces
a new module before deblocking, called luma mapping with chroma
scaling, to fully utilize the dynamic range of the signal so that the
rate-distortion performance of both SDR and HDR content is improved.

Motion prediction and coding

Compared to HEVC, VVC introduces several improvements in this area.
First, adaptive motion vector resolution (AMVR) can reduce the bit
cost of motion vectors by adaptively signaling the motion vector
resolution.  Second, affine motion compensation is included to
capture complicated motion such as zooming and rotation.  Meanwhile,
prediction refinement with optical flow for affine mode (PROF) is
further deployed to mimic affine motion at the pixel level.  Third,
decoder-side motion vector refinement (DMVR) is a method to derive
motion vectors at the decoder side based on block matching so that
fewer bits may be spent on motion vectors.  Bi-directional optical
flow (BDOF) is a similar method to PROF.  BDOF adds a sample-wise
offset at the 4x4 sub-block level that is derived with equations
based on gradients of the prediction samples and a motion difference
relative to CU motion vectors.
Furthermore, merge with motion vector difference (MMVD) is a special
mode, which further signals a limited set of motion vector
differences on top of merge mode.  In addition to MMVD, there are
three other types of special merge modes, i.e., sub-block merge,
triangle, and combined intra-/inter-prediction (CIIP).  The sub-block
merge list includes one sub-block temporal motion vector prediction
(SbTMVP) candidate and up to four affine motion vector candidates.
Triangle is based on triangular block motion compensation.  CIIP
combines intra- and inter-predictions with weighting.  Adaptive
weighting may be employed with a block-level tool called
bi-prediction with CU-based weighting (BCW), which provides more
flexibility than in HEVC.

Intra prediction and intra-coding

To capture the diversified local image texture directions with finer
granularity, VVC supports 65 angular directions instead of the 33
directions in HEVC.  The intra mode coding is based on a
6-most-probable-mode scheme, and the 6 most probable modes are
derived using the neighboring intra prediction directions.  In
addition, to deal with the different distributions of intra
prediction angles for different block aspect ratios, a wide-angle
intra prediction (WAIP) scheme is applied in VVC by including intra
prediction angles beyond those present in HEVC.  Unlike HEVC, which
only allows using the nearest line of reference samples for intra
prediction, VVC also allows using two further reference lines, known
as multi-reference-line (MRL) intra prediction.  The additional
reference lines can only be used for the 6 most probable intra
prediction modes.  To capture the strong correlation between
different color components, VVC uses a cross-component linear mode
(CCLM), which assumes a linear relationship between the luma sample
values and their associated chroma samples.
For intra prediction, VVC also applies a position-dependent
prediction combination (PDPC) for refining the prediction samples
closer to the intra prediction block boundary.  Matrix-based intra
prediction (MIP) modes are also used in VVC; they generate an intra
prediction block of up to 8x8 samples using a weighted sum of
downsampled neighboring reference samples, where the weights are
hardcoded constants.

Other coding-tool features

VVC introduces dependent quantization (DQ) to reduce quantization
error by state-based switching between two quantizers.

1.1.2.  Systems and Transport Interfaces (informative)

VVC inherits the basic systems and transport interface designs from
HEVC and H.264.  These include the NAL-unit-based syntax structure,
the hierarchical syntax and data unit structure, the supplemental
enhancement information (SEI) message mechanism, and the video
buffering model based on the hypothetical reference decoder (HRD).
The scalability features of VVC are conceptually similar to the
scalable variant of HEVC known as SHVC.  The hierarchical syntax and
data unit structure consists of parameter sets at various levels
(decoder, sequence (pertaining to all layers), sequence (pertaining
to a single layer), picture), picture-level header parameters,
slice-level header parameters, and lower-level parameters.

A number of key components that influenced the network abstraction
layer design of VVC as well as this memo are described below.

Decoding capability information

The decoding capability information (DCI) includes parameters that
stay constant for the lifetime of a Video Bitstream, which in IETF
terms can translate to the lifetime of a session.  Such information
includes profile, level, and sub-profile information to determine a
maximum capability interop point that is guaranteed to be never
exceeded, even if splicing of video sequences occurs within a
session.
It further includes constraint fields (most of which are flags),
which can optionally be set to indicate that the video bitstream will
be constrained in the use of certain features as indicated by the
values of those fields.  With this, a bitstream can be labelled as
not using certain tools, which allows, among other things, for
resource allocation in a decoder implementation.

Video parameter set

The video parameter set (VPS) pertains to a coded video sequence
(CVS) of multiple layers covering the same range of access units and
includes, among other information, decoding dependency expressed as
information for reference picture list construction of enhancement
layers.  The VPS provides a "big picture" of a scalable sequence,
including what types of operation points are provided, the profile,
tier, and level of the operation points, and some other high-level
properties of the bitstream that can be used as the basis for session
negotiation and content selection, etc.  One VPS may be referenced by
one or more sequence parameter sets.

Sequence parameter set

The sequence parameter set (SPS) contains syntax elements pertaining
to a coded layer video sequence (CLVS), which is a group of pictures
belonging to the same layer, starting with a random access point, and
followed by pictures that may depend on each other, until the next
random access point picture.  In MPEG-2, the equivalent of a CVS was
a group of pictures (GOP), which normally started with an I frame and
was followed by P and B frames.  While more complex in its options of
random access points, VVC retains this basic concept.  One remarkable
difference of VVC is that a CLVS may start with a Gradual Decoding
Refresh (GDR) picture, without requiring presence of traditional
random access points in the bitstream, such as instantaneous decoding
refresh (IDR) or clean random access (CRA) pictures.
In many TV-like applications, a CVS contains a few hundred
milliseconds to a few seconds of video.  In video conferencing
(without switching MCUs involved), a CVS can be as long in duration
as the whole session.

Picture and adaptation parameter set

The picture parameter set and the adaptation parameter set (PPS and
APS, respectively) carry information pertaining to zero or more
pictures and zero or more slices, respectively.  The PPS contains
information that is likely to stay constant from picture to picture,
at least for pictures of a certain type, whereas the APS contains
information, such as adaptive loop filter coefficients, that is
likely to change from picture to picture or even within a picture.  A
single APS is referenced by all slices of the same picture if that
APS contains information about luma mapping with chroma scaling
(LMCS) or a scaling list.  Different APSs containing ALF parameters
can be referenced by slices of the same picture.

Picture header

A picture header contains information that is common to all slices
that belong to the same picture.  Being able to send that information
as a separate NAL unit when pictures are split into several slices
allows for saving bitrate, compared to repeating the same information
in all slices.  However, there might be scenarios where low-bitrate
video is transmitted using a single slice per picture.  Having a
separate NAL unit to convey that information incurs overhead in such
scenarios.  For such scenarios, the picture header syntax structure
is directly included in the slice header, instead of in its own NAL
unit.  The mode of the picture header syntax structure being included
in its own NAL unit or not can only be switched on/off for an entire
CLVS, and can only be switched off when in the entire CLVS each
picture contains only one slice.

Profile, tier, and level

The profile, tier, and level syntax structures in the DCI, VPS, and
SPS contain profile, tier, and level information for all layers that
refer to the DCI, for layers associated with one or more output layer
sets specified by the VPS, and for any layer that refers to the SPS,
respectively.

Sub-profiles

Within the VVC specification, a sub-profile is a 32-bit number, coded
according to ITU-T Rec. T.35, that does not carry any semantics.  It
is carried in the profile_tier_level structure and hence
(potentially) present in the DCI, VPS, and SPS.  External
registration bodies can register a T.35 codepoint with ITU-T
registration authorities and associate with their registration a
description of bitstream restrictions beyond the profiles defined by
ITU-T and ISO/IEC.  This would allow encoder manufacturers to label
the bitstreams generated by their encoder as complying with such a
sub-profile.  It is expected that upstream standardization
organizations (such as DVB and ATSC), as well as walled-garden video
services, will take advantage of this labelling system.  In contrast
to "normal" profiles, it is expected that sub-profiles may indicate
encoder choices traditionally left open in the (decoder-centric)
video coding specifications, such as GOP structures, minimum/maximum
QP values, and the mandatory use of certain tools or SEI messages.

General constraint fields

The profile_tier_level structure carries a considerable number of
constraint fields (most of which are flags), which an encoder can use
to indicate to a decoder that it will not use a certain tool or
technology.  They were included in reaction to a perceived market
need for labelling a bitstream as not exercising a certain tool that
has become commercially unviable.

Temporal scalability support

VVC includes support of temporal scalability, by inclusion of the
signaling of TemporalId in the NAL unit header, the restriction that
pictures of a particular temporal sublayer cannot be used for inter
prediction reference by pictures of a lower temporal sublayer, the
sub-bitstream extraction process, and the requirement that each
sub-bitstream extraction output be a conforming bitstream.
Media-Aware Network Elements (MANEs) can utilize the TemporalId in
the NAL unit header for stream adaptation purposes based on temporal
scalability.

Reference picture resampling (RPR)

In AVC and HEVC, the spatial resolution of pictures cannot change
unless a new sequence using a new SPS starts, with an intra random
access point (IRAP) picture.  VVC enables picture resolution change
within a sequence at a position without encoding an IRAP picture,
which is always intra-coded.  This feature is sometimes referred to
as reference picture resampling (RPR), as the feature needs
resampling of a reference picture used for inter prediction when that
reference picture has a different resolution than the current picture
being decoded.  RPR allows resolution change without the need to code
an IRAP picture, which causes a momentary bit rate spike in streaming
or video conferencing scenarios, e.g., to cope with network condition
changes.  RPR can also be used in application scenarios wherein
zooming of the entire video region or some region of interest is
needed.

Spatial, SNR, and multiview scalability

VVC includes support for spatial, SNR, and multiview scalability.
Scalable video coding is widely considered to have technical benefits
and enrich services for various video applications.  Until recently,
however, this functionality was not included in the first version of
video codec specifications.
In VVC, however, all those forms of scalability are supported
natively in the first version through the signaling of the layer_id
in the NAL unit header, the VPS which associates layers with given
layer_ids to each other, reference picture selection, reference
picture resampling for spatial scalability, and a number of other
mechanisms not relevant for this memo.

   Spatial scalability
      With the existence of Reference Picture Resampling (RPR), the
      additional burden for scalability support is just a
      modification of the high-level syntax (HLS).  The inter-layer
      prediction is employed in a scalable system to improve the
      coding efficiency of the enhancement layers.  In addition to
      the spatial and temporal motion-compensated predictions that
      are available in a single-layer codec, the inter-layer
      prediction in VVC uses the possibly resampled video data of the
      reconstructed reference picture from a reference layer to
      predict the current enhancement layer.  The resampling process
      for inter-layer prediction, when used, is performed at the
      block level, reusing the existing interpolation process for
      motion compensation in single-layer coding.  This means that no
      additional resampling process is needed to support spatial
      scalability.

   SNR scalability
      SNR scalability is similar to spatial scalability except that
      the resampling factors are 1:1.  In other words, there is no
      change in resolution, but there is inter-layer prediction.

   Multiview scalability
      The first version of VVC also supports multiview scalability,
      wherein a multi-layer bitstream carries layers representing
      multiple views, and one or more of the represented views can be
      output at the same time.
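
As a rough illustration of how the layer_id and TemporalId signaled
in the NAL unit header could drive stream adaptation, the following
Python sketch keeps or drops NAL units by comparing both identifiers
against target limits.  It assumes the two-byte NAL unit header layout
described in Section 1.1.4 (F, Z, 6-bit LayerID, 5-bit Type, 3-bit
TID) and, following the VVC specification, that the TID field carries
TemporalId plus 1.  The names NalHeader, parse_nal_header, and
keep_nal_unit are hypothetical and not part of this memo.  The plain
threshold on the layer identifier is a deliberate simplification: an
actual sub-bitstream extraction process also has to consider layer
dependencies, output layer sets, and parameter sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalHeader:
    """Fields of the two-byte VVC NAL unit header (hypothetical helper type)."""
    forbidden_zero_bit: int   # F, 1 bit
    reserved_zero_bit: int    # Z, 1 bit
    layer_id: int             # LayerID, 6 bits
    nal_unit_type: int        # Type, 5 bits
    temporal_id: int          # value of the 3-bit TID field minus 1


def parse_nal_header(nal: bytes) -> NalHeader:
    """Split the first two bytes of a NAL unit into header fields."""
    if len(nal) < 2:
        raise ValueError("a VVC NAL unit header is two bytes long")
    b0, b1 = nal[0], nal[1]
    tid_plus1 = b1 & 0x07
    if tid_plus1 == 0:
        # Assumption taken from the VVC specification: the TID field
        # (TemporalId plus 1) is never 0 in a valid NAL unit.
        raise ValueError("TID field must not be 0")
    return NalHeader(
        forbidden_zero_bit=b0 >> 7,
        reserved_zero_bit=(b0 >> 6) & 0x01,
        layer_id=b0 & 0x3F,
        nal_unit_type=b1 >> 3,
        temporal_id=tid_plus1 - 1,
    )


def keep_nal_unit(nal: bytes, max_layer_id: int, max_temporal_id: int) -> bool:
    """Simplified forwarding rule: keep the NAL unit only if both its layer
    and its temporal sublayer are at or below the requested limits."""
    header = parse_nal_header(nal)
    return (header.layer_id <= max_layer_id
            and header.temporal_id <= max_temporal_id)
```

A forwarding element could, for example, apply
[n for n in nal_units if keep_nal_unit(n, 0, 1)] to retain only the
base layer up to temporal sublayer 1.  Because pictures of a temporal
sublayer are never used as reference by pictures of a lower sublayer,
dropping the higher sublayers in this way does not break inter
prediction for the units that remain.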

SEI messages

Supplemental enhancement information (SEI) messages carry information
in the bitstream that does not influence the decoding process as
specified in the VVC specification but addresses issues of
representation/rendering of the decoded bitstream, labels the
bitstream for certain applications, and serves other, similar tasks.
The overall concept of SEI messages and many of the messages
themselves have been inherited from the H.264 and HEVC
specifications.  Except for the SEI messages that affect the
specification of the hypothetical reference decoder (HRD), other SEI
messages for use in the VVC environment, which are generally useful
also in other video coding technologies, are not included in the main
VVC specification but in a companion specification [VSEI].

1.1.3.  High-Level Picture Partitioning (informative)

VVC inherited the concept of tiles and wavefront parallel processing
(WPP) from HEVC, with some minor to moderate differences.  The basic
concept of slices was kept in VVC but designed in an essentially
different form.  VVC is the first video coding standard that includes
subpictures as a feature, which provide the same functionality as
HEVC motion-constrained tile sets (MCTSs) but are designed
differently to have better coding efficiency and to be friendlier for
usage in application systems.  More details of these differences are
described below.

Tiles and WPP

As in HEVC, a picture can be split into tile rows and tile columns in
VVC, in-picture prediction across tile boundaries is disallowed, and
so on.  However, the syntax for signaling of tile partitioning has
been simplified, by using a unified syntax design for both the
uniform and the non-uniform mode.  In addition, signaling of entry
point offsets for tiles in the slice header is optional in VVC while
it is mandatory in HEVC.
The WPP design in VVC has two differences compared to HEVC: i) the
CTU row delay is reduced from two CTUs to one CTU; ii) signaling of
entry point offsets for WPP in the slice header is optional in VVC
while it is mandatory in HEVC.

Slices

In VVC, the conventional slices based on CTUs (as in HEVC) or
macroblocks (as in AVC) have been removed.  The main reasoning behind
this architectural change is as follows.  The advances in video
coding since 2003 (the publication year of AVC v1) have been such
that slice-based error concealment has become practically impossible,
due to the ever-increasing number and efficiency of in-picture and
inter-picture prediction mechanisms.  An error-concealed picture is
the decoding result of a transmitted coded picture for which there is
some data loss (e.g., loss of some slices) of the coded picture or a
reference picture for at least some part of the coded picture is not
error-free (e.g., that reference picture was an error-concealed
picture).  For example, when one of the multiple slices of a picture
is lost, it may be error-concealed using an interpolation of the
neighboring slices.  While advanced video coding prediction
mechanisms provide significantly higher coding efficiency, they also
make it harder for machines to estimate the quality of an
error-concealed picture, which was already a hard problem with the
use of simpler prediction mechanisms.  Advanced in-picture prediction
mechanisms also cause the coding efficiency loss due to splitting a
picture into multiple slices to be more significant.  Furthermore,
network conditions have become significantly better while at the same
time techniques for dealing with packet losses have become
significantly improved.  As a result, very few implementations have
recently used slices for maximum transmission unit size matching.
Instead, substantially all applications where low-delay error
resilience is required (e.g., video telephony and video conferencing)
rely on system/transport-level error resilience (e.g.,
retransmission, forward error correction) and/or picture-based error
resilience tools (feedback-based error resilience, insertion of
IRAPs, scalability with higher protection level of the base layer,
and so on).  Considering all the above, nowadays it is very rare that
a picture that cannot be correctly decoded is passed to the decoder,
and when such a rare case occurs, the system can afford to wait for
an error-free picture to be decoded and available for display without
resulting in frequent and long periods of picture freezing seen by
end users.

Slices in VVC have two modes: rectangular slices and raster-scan
slices.  The rectangular slice, as indicated by its name, covers a
rectangular region of the picture.  Typically, a rectangular slice
consists of several complete tiles.  However, it is also possible
that a rectangular slice is a subset of a tile and consists of one or
more consecutive, complete CTU rows within a tile.  A raster-scan
slice consists of one or more complete tiles in tile raster scan
order; hence, the region covered by a raster-scan slice may have a
non-rectangular shape, although it may also happen to be rectangular.
The concept of slices in VVC is therefore strongly linked to or based
on tiles instead of CTUs (as in HEVC) or macroblocks (as in AVC).

Subpictures

VVC is the first video coding standard that includes the support of
subpictures as a feature.  Each subpicture consists of one or more
complete rectangular slices that collectively cover a rectangular
region of the picture.
A subpicture may be either specified to be extractable (i.e., coded independently of other subpictures of the same picture and of earlier pictures in decoding order) or not extractable. Regardless of whether a subpicture is extractable or not, the encoder can control whether in-loop filtering (including deblocking, SAO, and ALF) is applied across the subpicture boundaries individually for each subpicture.

Functionally, subpictures are similar to the motion-constrained tile sets (MCTSs) in HEVC. They both allow independent coding and extraction of a rectangular subset of a sequence of coded pictures, for use cases like viewport-dependent 360° video streaming optimization and region of interest (ROI) applications.

There are several important design differences between subpictures and MCTSs. First, the subpictures feature in VVC allows the motion vectors of a coding block to point outside of the subpicture, even when the subpicture is extractable, by applying sample padding at subpicture boundaries in this case, similarly to what is done at picture boundaries. Second, additional changes were introduced for the selection and derivation of motion vectors in the merge mode and in the decoder-side motion vector refinement process of VVC. This allows higher coding efficiency compared to the non-normative motion constraints applied at the encoder side for MCTSs. Third, rewriting of SHs (and PH NAL units, when present) is not needed when extracting one or more extractable subpictures from a sequence of pictures to create a sub-bitstream that is a conforming bitstream. In sub-bitstream extractions based on HEVC MCTSs, rewriting of SHs is needed. Note that in both HEVC MCTSs extraction and VVC subpictures extraction, rewriting of SPSs and PPSs is needed.
However, typically there are only a few parameter sets in a bitstream, while each picture has at least one slice; therefore, rewriting of SHs can be a significant burden for application systems. Fourth, slices of different subpictures within a picture are allowed to have different NAL unit types. Fifth, VVC specifies HRD and level definitions for subpicture sequences; thus, the conformance of the sub-bitstream of each extractable subpicture sequence can be ensured by encoders.

1.1.4. NAL Unit Header

VVC maintains the NAL unit concept of HEVC with modifications. VVC uses a two-byte NAL unit header, as shown in Figure 1. The payload of a NAL unit refers to the NAL unit excluding the NAL unit header.

+---------------+---------------+
|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|F|Z|  LayerID  |  Type   | TID |
+---------------+---------------+

The Structure of the VVC NAL Unit Header.

Figure 1

The semantics of the fields in the NAL unit header are as specified in VVC and described briefly below for convenience. In addition to the name and size of each field, the corresponding syntax element name in VVC is also provided.

F: 1 bit
   forbidden_zero_bit. Required to be zero in VVC. Note that the inclusion of this bit in the NAL unit header was to enable transport of VVC video over MPEG-2 transport systems (avoidance of start code emulations) [MPEG2S]. In the context of this memo the value 1 may be used to indicate a syntax violation, e.g., for a NAL unit resulting from aggregating a number of fragmented units of a NAL unit but missing the last fragment, as described in Section TBD.

Z: 1 bit
   nuh_reserved_zero_bit. Required to be zero in VVC, and reserved for future extensions by ITU-T and ISO/IEC.
   This memo does not overload the "Z" bit for local extensions, because a) overloading the "F" bit is sufficient and b) leaving the "Z" bit untouched preserves the usefulness of this memo for possible future versions of [VVC].

LayerId: 6 bits
   nuh_layer_id. Identifies the layer a NAL unit belongs to, wherein a layer may be, e.g., a spatial scalable layer, a quality scalable layer, a layer containing a different view, etc.

Type: 5 bits
   nal_unit_type. This field specifies the NAL unit type as defined in Table 5 of [VVC]. For a reference of all currently defined NAL unit types and their semantics, please refer to Section 7.4.2.2 in [VVC].

TID: 3 bits
   nuh_temporal_id_plus1. This field specifies the temporal identifier of the NAL unit plus 1. The value of TemporalId is equal to TID minus 1. A TID value of 0 is illegal to ensure that there is at least one bit in the NAL unit header equal to 1, so as to enable independent consideration of start code emulations in the NAL unit header and in the NAL unit payload data.

1.2. Overview of the Payload Format

This payload format defines the following processes required for transport of VVC coded data over RTP [RFC3550]:

* Usage of RTP header with this payload format

* Packetization of VVC coded NAL units into RTP packets using three types of payload structures: a single NAL unit packet, aggregation packet, and fragmentation unit

* Transmission of VVC NAL units of the same bitstream within a single RTP stream

* Media type parameters to be used with the Session Description Protocol (SDP) [RFC4566]

* Usage of RTCP feedback messages

2.
Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown above.

3. Definitions and Abbreviations

3.1. Definitions

This document uses the terms and definitions of VVC. Section 3.1.1 lists relevant definitions from [VVC] for convenience. Section 3.1.2 provides definitions specific to this memo.

3.1.1. Definitions from the VVC Specification

Access unit (AU): A set of PUs that belong to different layers and contain coded pictures associated with the same time for output from the DPB.

Adaptation parameter set (APS): A syntax structure containing syntax elements that apply to zero or more slices as determined by zero or more syntax elements found in slice headers.

Bitstream: A sequence of bits, in the form of a NAL unit stream or a byte stream, that forms the representation of a sequence of AUs forming one or more coded video sequences (CVSs).

Coded picture: A coded representation of a picture comprising VCL NAL units with a particular value of nuh_layer_id within an AU and containing all CTUs of the picture.

Clean random access (CRA) PU: A PU in which the coded picture is a CRA picture.

Clean random access (CRA) picture: An IRAP picture for which each VCL NAL unit has nal_unit_type equal to CRA_NUT.

Coded video sequence (CVS): A sequence of AUs that consists, in decoding order, of a CVSS AU, followed by zero or more AUs that are not CVSS AUs, including all subsequent AUs up to but not including any subsequent AU that is a CVSS AU.

Coded video sequence start (CVSS) AU: An AU in which there is a PU for each layer in the CVS and the coded picture in each PU is a CLVSS picture.
Coded layer video sequence (CLVS): A sequence of PUs with the same value of nuh_layer_id that consists, in decoding order, of a CLVSS PU, followed by zero or more PUs that are not CLVSS PUs, including all subsequent PUs up to but not including any subsequent PU that is a CLVSS PU.

Coded layer video sequence start (CLVSS) PU: A PU in which the coded picture is a CLVSS picture.

Coded layer video sequence start (CLVSS) picture: A coded picture that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or a GDR picture with NoOutputBeforeRecoveryFlag equal to 1.

Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs of chroma samples of a picture that has three sample arrays, or a CTB of samples of a monochrome picture or a picture that is coded using three separate colour planes and syntax structures used to code the samples.

Decoding Capability Information (DCI): A syntax structure containing syntax elements that apply to the entire bitstream.

Decoded picture buffer (DPB): A buffer holding decoded pictures for reference, output reordering, or output delay specified for the hypothetical reference decoder.

Gradual decoding refresh (GDR) picture: A picture for which each VCL NAL unit has nal_unit_type equal to GDR_NUT.

Instantaneous decoding refresh (IDR) PU: A PU in which the coded picture is an IDR picture.

Instantaneous decoding refresh (IDR) picture: An IRAP picture for which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or IDR_N_LP.

Intra random access point (IRAP) AU: An AU in which there is a PU for each layer in the CVS and the coded picture in each PU is an IRAP picture.

Intra random access point (IRAP) PU: A PU in which the coded picture is an IRAP picture.
Intra random access point (IRAP) picture: A coded picture for which all VCL NAL units have the same value of nal_unit_type in the range of IDR_W_RADL to CRA_NUT, inclusive.

Layer: A set of VCL NAL units that all have a particular value of nuh_layer_id and the associated non-VCL NAL units.

Network abstraction layer (NAL) unit: A syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of an RBSP interspersed as necessary with emulation prevention bytes.

Network abstraction layer (NAL) unit stream: A sequence of NAL units.

Operation point (OP): A temporal subset of an OLS, identified by an OLS index and a highest value of TemporalId.

Picture parameter set (PPS): A syntax structure containing syntax elements that apply to zero or more entire coded pictures as determined by a syntax element found in each slice header.

Picture unit (PU): A set of NAL units that are associated with each other according to a specified classification rule, are consecutive in decoding order, and contain exactly one coded picture.

Random access: The act of starting the decoding process for a bitstream at a point other than the beginning of the stream.

Sequence parameter set (SPS): A syntax structure containing syntax elements that apply to zero or more entire CLVSs as determined by the content of a syntax element found in the PPS referred to by a syntax element found in each picture header.

Slice: An integer number of complete tiles or an integer number of consecutive complete CTU rows within a tile of a picture that are exclusively contained in a single NAL unit.

Slice header (SH): A part of a coded slice containing the data elements pertaining to all tiles or CTU rows within a tile represented in the slice.
Sublayer: A temporal scalable layer of a temporal scalable bitstream consisting of VCL NAL units with a particular value of the TemporalId variable, and the associated non-VCL NAL units.

Subpicture: A rectangular region of one or more slices within a picture.

Sublayer representation: A subset of the bitstream consisting of NAL units of a particular sublayer and the lower sublayers.

Tile: A rectangular region of CTUs within a particular tile column and a particular tile row in a picture.

Tile column: A rectangular region of CTUs having a height equal to the height of the picture and a width specified by syntax elements in the picture parameter set.

Tile row: A rectangular region of CTUs having a height specified by syntax elements in the picture parameter set and a width equal to the width of the picture.

Video coding layer (VCL) NAL unit: A collective term for coded slice NAL units and the subset of NAL units that have reserved values of nal_unit_type that are classified as VCL NAL units in this Specification.

3.1.2. Definitions Specific to This Memo

Media-Aware Network Element (MANE): A network element, such as a middlebox, selective forwarding unit, or application-layer gateway that is capable of parsing certain aspects of the RTP payload headers or the RTP payload and reacting to their contents.

   Informative note: The concept of a MANE goes beyond normal routers or gateways in that a MANE has to be aware of the signaling (e.g., to learn about the payload type mappings of the media streams), and in that it has to be trusted when working with Secure RTP (SRTP). The advantage of using MANEs is that they allow packets to be dropped according to the needs of the media coding.
   For example, if a MANE has to drop packets due to congestion on a certain link, it can identify and remove those packets whose elimination produces the least adverse effect on the user experience. After dropping packets, MANEs must rewrite RTCP packets to match the changes to the RTP stream, as specified in Section 7 of [RFC3550].

NAL unit decoding order: A NAL unit order that conforms to the constraints on NAL unit order given in Section 7.4.2.4 of [VVC] and follows the order of NAL units in the bitstream.

RTP stream (See [RFC7656]): Within the scope of this memo, one RTP stream is utilized to transport a VVC bitstream, which may contain one or more layers, and each layer may contain one or more temporal sublayers.

Transmission order: The order of packets in ascending RTP sequence number order (in modulo arithmetic). Within an aggregation packet, the NAL unit transmission order is the same as the order of appearance of NAL units in the packet.

3.2. Abbreviations

AU     Access Unit
AP     Aggregation Packet
APS    Adaptation Parameter Set
CTU    Coding Tree Unit
CVS    Coded Video Sequence
DPB    Decoded Picture Buffer
DCI    Decoding Capability Information
DON    Decoding Order Number
FIR    Full Intra Request
FU     Fragmentation Unit
GDR    Gradual Decoding Refresh
HRD    Hypothetical Reference Decoder
IDR    Instantaneous Decoding Refresh
MANE   Media-Aware Network Element
MTU    Maximum Transfer Unit
NAL    Network Abstraction Layer
NALU   Network Abstraction Layer Unit
PLI    Picture Loss Indication
PPS    Picture Parameter Set
RPS    Reference Picture Set
RPSI   Reference Picture Selection Indication
SEI    Supplemental Enhancement Information
SLI    Slice Loss Indication
SPS    Sequence Parameter Set
VCL    Video Coding Layer
VPS    Video Parameter Set

4. RTP Payload Format

4.1.
RTP Header Usage

The format of the RTP header is specified in [RFC3550] (reprinted as Figure 2 for convenience). This payload format uses the fields of the header in a manner consistent with that specification.

The RTP payload (and the settings for some RTP header bits) for aggregation packets and fragmentation units are specified in Section 4.3.2 and Section 4.3.3, respectively.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       sequence number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           synchronization source (SSRC) identifier            |
+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
|            contributing source (CSRC) identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

RTP Header According to [RFC3550]

Figure 2

The RTP header information to be set according to this RTP payload format is set as follows:

Marker bit (M): 1 bit
   Set for the last packet, in transmission order, among each set of packets that contain NAL units of one access unit. This is in line with the normal use of the M bit in video formats to allow efficient playout buffer handling.

Payload Type (PT): 7 bits
   The assignment of an RTP payload type for this new packet format is outside the scope of this document and will not be specified here. The assignment of a payload type has to be performed either through the profile used or in a dynamic way.

Sequence Number (SN): 16 bits
   Set and used in accordance with [RFC3550].

Timestamp: 32 bits
   The RTP timestamp is set to the sampling timestamp of the content. A 90 kHz clock rate MUST be used.
   If the NAL unit has no timing properties of its own (e.g., parameter set and SEI NAL units), the RTP timestamp MUST be set to the RTP timestamp of the coded pictures of the access unit in which the NAL unit (according to Section 7.4.2.4 of [VVC]) is included. Receivers MUST use the RTP timestamp for the display process, even when the bitstream contains picture timing SEI messages or decoding unit information SEI messages as specified in [VVC].

Synchronization source (SSRC): 32 bits
   Used to identify the source of the RTP packets. A single SSRC is used for all parts of a single bitstream.

4.2. Payload Header Usage

The first two bytes of the payload of an RTP packet are referred to as the payload header. The payload header consists of the same fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown in Section 1.1.4, irrespective of the type of the payload structure.

The TID value indicates (among other things) the relative importance of an RTP packet, for example, because NAL units belonging to higher temporal sublayers are not used for the decoding of lower temporal sublayers. A lower value of TID indicates a higher importance. More-important NAL units MAY be better protected against transmission losses than less-important NAL units.

   For Discussion: quite possibly something similar can be said for the LayerId in layered coding, but perhaps not in multiview coding. (The relevant part of the spec is relatively new, therefore the soft language). However, for serious layer pruning, interpretation of the VPS is required. We can add language about the need for stateful interpretation of LayerId vis-a-vis stateless interpretation of TID later.

4.3. Payload Structures

Three different types of RTP packet payload structures are specified.
A receiver can identify the type of an RTP packet payload through the Type field in the payload header.

The three different payload structures are as follows:

* Single NAL unit packet: Contains a single NAL unit in the payload, and the NAL unit header of the NAL unit also serves as the payload header. This payload structure is specified in Section 4.3.1.

* Aggregation Packet (AP): Contains more than one NAL unit within one access unit. This payload structure is specified in Section 4.3.2.

* Fragmentation Unit (FU): Contains a subset of a single NAL unit. This payload structure is specified in Section 4.3.3.

4.3.1. Single NAL Unit Packets

A single NAL unit packet contains exactly one NAL unit, and consists of a payload header (denoted as PayloadHdr), a conditional 16-bit DONL field (in network byte order), and the NAL unit payload data (the NAL unit excluding its NAL unit header) of the contained NAL unit, as shown in Figure 3.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           PayloadHdr          |      DONL (conditional)       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                  NAL unit payload data                        |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The Structure of a Single NAL Unit Packet

Figure 3

The DONL field, when present, specifies the value of the 16 least significant bits of the decoding order number of the contained NAL unit. If sprop-max-don-diff is greater than 0, the DONL field MUST be present, and the variable DON for the contained NAL unit is derived as equal to the value of the DONL field. Otherwise (sprop-max-don-diff is equal to 0), the DONL field MUST NOT be present.

4.3.2.
Aggregation Packets (APs)

Aggregation Packets (APs) can reduce packetization overhead for small NAL units, such as most of the non-VCL NAL units, which are often only a few octets in size.

An AP aggregates NAL units of one access unit. Each NAL unit to be carried in an AP is encapsulated in an aggregation unit. NAL units aggregated in one AP are included in NAL unit decoding order.

An AP consists of a payload header (denoted as PayloadHdr) followed by two or more aggregation units, as shown in Figure 4.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|    PayloadHdr (Type=28)       |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
|                                                               |
|             two or more aggregation units                     |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The Structure of an Aggregation Packet

Figure 4

The fields in the payload header of an AP are set as follows. The F bit MUST be equal to 0 if the F bit of each aggregated NAL unit is equal to zero; otherwise, it MUST be equal to 1. The Type field MUST be equal to 28.

The value of LayerId MUST be equal to the lowest value of LayerId of all the aggregated NAL units. The value of TID MUST be the lowest value of TID of all the aggregated NAL units.

   Informative note: All VCL NAL units in an AP have the same TID value since they belong to the same access unit. However, an AP may contain non-VCL NAL units for which the TID value in the NAL unit header may be different from the TID value of the VCL NAL units in the same AP.
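
As a rough illustration of the header rules above, the following Python sketch derives the two-byte PayloadHdr of an AP from the NAL unit headers of the units to be aggregated. The names NalHeader, parse_nal_header, pack_header, and build_ap_payload_header are hypothetical helpers introduced for this example and are not defined by this memo or [VVC]; the bit layout follows Figure 1, and setting the Z bit to 0 is an assumption based on the requirement that it be zero in VVC.

```python
from typing import NamedTuple, Sequence

AP_TYPE = 28  # Type value of an aggregation packet payload header


class NalHeader(NamedTuple):
    f: int         # forbidden_zero_bit (1 bit)
    z: int         # nuh_reserved_zero_bit (1 bit)
    layer_id: int  # nuh_layer_id (6 bits)
    type: int      # nal_unit_type (5 bits)
    tid: int       # nuh_temporal_id_plus1 (3 bits)


def parse_nal_header(nal_unit: bytes) -> NalHeader:
    """Split the first two bytes of a NAL unit into the fields of Figure 1."""
    if len(nal_unit) < 2:
        raise ValueError("a VVC NAL unit header is two bytes long")
    b0, b1 = nal_unit[0], nal_unit[1]
    return NalHeader(f=b0 >> 7, z=(b0 >> 6) & 1, layer_id=b0 & 0x3F,
                     type=b1 >> 3, tid=b1 & 0x07)


def pack_header(h: NalHeader) -> bytes:
    """Serialize header fields back into two bytes."""
    return bytes([(h.f << 7) | (h.z << 6) | h.layer_id,
                  (h.type << 3) | h.tid])


def build_ap_payload_header(nal_units: Sequence[bytes]) -> bytes:
    """Derive the AP PayloadHdr from the NAL units to be aggregated."""
    if len(nal_units) < 2:
        raise ValueError("an AP carries two or more aggregation units")
    headers = [parse_nal_header(n) for n in nal_units]
    return pack_header(NalHeader(
        f=int(any(h.f for h in headers)),           # 0 only if every F bit is 0
        z=0,                                        # assumption: Z stays zero
        layer_id=min(h.layer_id for h in headers),  # lowest LayerId
        type=AP_TYPE,
        tid=min(h.tid for h in headers)))           # lowest TID
```

For instance, aggregating a NAL unit with LayerId 0 and TID 1 with another that has LayerId 1 and TID 2 yields a payload header with LayerId 0, Type 28, and TID 1.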
An AP MUST carry at least two aggregation units and can carry as many aggregation units as necessary; however, the total amount of data in an AP MUST fit into an IP packet, and the size SHOULD be chosen so that the resulting IP packet is smaller than the MTU size so as to avoid IP layer fragmentation. An AP MUST NOT contain FUs specified in Section 4.3.3. APs MUST NOT be nested; i.e., an AP cannot contain another AP.

The first aggregation unit in an AP consists of a conditional 16-bit DONL field (in network byte order) followed by 16 bits of unsigned size information (in network byte order) that indicates the size of the NAL unit in bytes (excluding these two octets, but including the NAL unit header), followed by the NAL unit itself, including its NAL unit header, as shown in Figure 5.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               :       DONL (conditional)      |   NALU size   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   NALU size   |                                               |
+-+-+-+-+-+-+-+-+         NAL unit                              |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The Structure of the First Aggregation Unit in an AP

Figure 5

The DONL field, when present, specifies the value of the 16 least significant bits of the decoding order number of the aggregated NAL unit.
If sprop-max-don-diff is greater than 0, the DONL field MUST be present in an aggregation unit that is the first aggregation unit in an AP, and the variable DON for the aggregated NAL unit is derived as equal to the value of the DONL field. The variable DON for the NAL unit in an aggregation unit that is not the first aggregation unit in an AP is derived as equal to the DON of the preceding aggregated NAL unit in the same AP plus 1 modulo 65536. Otherwise (sprop-max-don-diff is equal to 0), the DONL field MUST NOT be present in an aggregation unit that is the first aggregation unit in an AP.

An aggregation unit that is not the first aggregation unit in an AP consists of 16 bits of unsigned size information (in network byte order) that indicates the size of the NAL unit in bytes (excluding these two octets, but including the NAL unit header), followed by the NAL unit itself, including its NAL unit header, as shown in Figure 6.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|               :          NALU size            |   NAL unit    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The Structure of an Aggregation Unit That Is Not the First Aggregation Unit in an AP

Figure 6

Figure 7 presents an example of an AP that contains two aggregation units, labeled as 1 and 2 in the figure, without the DONL field being present.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      PayloadHdr (Type=28)     |           NALU 1 Size         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           NALU 1 HDR          |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
|                   . . .                                       |
|                                                               |
+               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  . . .        | NALU 2 Size                   | NALU 2 HDR    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| NALU 2 HDR    |                                               |
+-+-+-+-+-+-+-+-+              NALU 2 Data                      |
|                   . . .                                       |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

An Example of an AP Packet Containing Two Aggregation Units without the DONL Field

Figure 7

Figure 8 presents an example of an AP that contains two aggregation units, labeled as 1 and 2 in the figure, with the DONL field being present.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                          RTP Header                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|      PayloadHdr (Type=28)     |        NALU 1 DONL            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           NALU 1 Size         |            NALU 1 HDR         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
|                 NALU 1 Data   . . .                           |
|                                                               |
+        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :        NALU 2 Size            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          NALU 2 HDR           |                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
|                                                               |
|        . . .
                                +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

An Example of an AP Containing Two Aggregation Units with the DONL Field

Figure 8

4.3.3. Fragmentation Units

Fragmentation Units (FUs) are introduced to enable fragmenting a single NAL unit into multiple RTP packets, possibly without cooperation or knowledge of the [VVC] encoder. A fragment of a NAL unit consists of an integer number of consecutive octets of that NAL unit. Fragments of the same NAL unit MUST be sent in consecutive order with ascending RTP sequence numbers (with no other RTP packets within the same RTP stream being sent between the first and last fragment).

When a NAL unit is fragmented and conveyed within FUs, it is referred to as a fragmented NAL unit. APs MUST NOT be fragmented. FUs MUST NOT be nested; i.e., an FU cannot contain a subset of another FU.

The RTP timestamp of an RTP packet carrying an FU is set to the NALU-time of the fragmented NAL unit.

An FU consists of a payload header (denoted as PayloadHdr), an FU header of one octet, a conditional 16-bit DONL field (in network byte order), and an FU payload, as shown in Figure 9.

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|  PayloadHdr (Type=29)         |   FU header   | DONL (cond)   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-|
| DONL (cond)   |                                               |
|-+-+-+-+-+-+-+-+                                               |
|                         FU payload                            |
|                                                               |
|                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                               :...OPTIONAL RTP padding        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The Structure of an FU

Figure 9

The fields in the payload header are set as follows. The Type field MUST be equal to 29.
The fields F, LayerId, and TID MUST be equal to the fields F, LayerId, and TID, respectively, of the fragmented NAL unit.

The FU header consists of an S bit, an E bit, a P bit, and a 5-bit FuType field, as shown in Figure 10.

+---------------+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|S|E|P|  FuType |
+---------------+

The Structure of FU Header

Figure 10

The semantics of the FU header fields are as follows:

S: 1 bit
   When set to 1, the S bit indicates the start of a fragmented NAL unit, i.e., the first byte of the FU payload is also the first byte of the payload of the fragmented NAL unit. When the FU payload is not the start of the fragmented NAL unit payload, the S bit MUST be set to 0.

E: 1 bit
   When set to 1, the E bit indicates the end of a fragmented NAL unit, i.e., the last byte of the payload is also the last byte of the fragmented NAL unit. When the FU payload is not the last fragment of a fragmented NAL unit, the E bit MUST be set to 0.

P: 1 bit
   When set to 1, the P bit indicates the last NAL unit of a coded picture, i.e., the last byte of the FU payload is also the last byte of the coded picture. When the FU payload is not the last fragment of a coded picture, the P bit MUST be set to 0.

FuType: 5 bits
   The field FuType MUST be equal to the field Type of the fragmented NAL unit.

The DONL field, when present, specifies the value of the 16 least significant bits of the decoding order number of the fragmented NAL unit.

If sprop-max-don-diff is greater than 0, and the S bit is equal to 1, the DONL field MUST be present in the FU, and the variable DON for the fragmented NAL unit is derived as equal to the value of the DONL field. Otherwise (sprop-max-don-diff is equal to 0, or the S bit is equal to 0), the DONL field MUST NOT be present in the FU.
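
The FU layout of Figures 9 and 10 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a normative packetizer: the function name fragment_nal_unit and its parameters are hypothetical, max_fu_payload counts FU payload bytes only (the DONL overhead of the first FU is ignored for simplicity), and donl is passed as None when sprop-max-don-diff is equal to 0. In line with this section, the two-byte NAL unit header is not copied into the FU payload; its fields travel in the payload header and in FuType.

```python
import struct
from typing import List, Optional

FU_TYPE = 29  # Type value of a fragmentation unit payload header


def fragment_nal_unit(nal_unit: bytes, max_fu_payload: int,
                      last_nal_of_picture: bool = False,
                      donl: Optional[int] = None) -> List[bytes]:
    """Return the RTP payloads (PayloadHdr, FU header, DONL, FU payload)."""
    if max_fu_payload < 1:
        raise ValueError("max_fu_payload must be positive")
    header, body = nal_unit[:2], nal_unit[2:]
    if len(body) <= max_fu_payload:
        # S and E may not both be set in one FU header.
        raise ValueError("unit fits in one packet; do not use an FU")
    nal_type, tid = header[1] >> 3, header[1] & 0x07
    # F and LayerId are copied from the NAL unit header; copying the Z bit
    # along with them is an assumption of this sketch (it is zero in VVC).
    payload_hdr = bytes([header[0], (FU_TYPE << 3) | tid])
    chunks = [body[i:i + max_fu_payload]
              for i in range(0, len(body), max_fu_payload)]
    packets = []
    for i, chunk in enumerate(chunks):
        s = int(i == 0)                    # first fragment
        e = int(i == len(chunks) - 1)      # last fragment
        p = int(bool(e and last_nal_of_picture))
        fu_header = bytes([(s << 7) | (e << 6) | (p << 5) | nal_type])
        # DONL is present only in the FU with S = 1, and only when
        # sprop-max-don-diff is greater than 0 (donl is not None here).
        donl_field = struct.pack("!H", donl) if s and donl is not None else b""
        packets.append(payload_hdr + fu_header + donl_field + chunk)
    return packets
```

A caller would pick max_fu_payload from the path MTU minus the RTP, payload header, and FU header overhead; that calculation is left out of the sketch.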
A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e., the Start bit and End bit must not both be set to 1 in the same FU header.

The FU payload consists of fragments of the payload of the fragmented NAL unit so that if the FU payloads of consecutive FUs, starting with an FU with the S bit equal to 1 and ending with an FU with the E bit equal to 1, are sequentially concatenated, the payload of the fragmented NAL unit can be reconstructed. The NAL unit header of the fragmented NAL unit is not included as such in the FU payload, but rather the information of the NAL unit header of the fragmented NAL unit is conveyed in F, LayerId, and TID fields of the FU payload headers of the FUs and the FuType field of the FU header of the FUs. An FU payload MUST NOT be empty.

If an FU is lost, the receiver SHOULD discard all following fragmentation units in transmission order corresponding to the same fragmented NAL unit, unless the decoder in the receiver is known to be prepared to gracefully handle incomplete NAL units.

A receiver in an endpoint or in a MANE MAY aggregate the first n-1 fragments of a NAL unit to an (incomplete) NAL unit, even if fragment n of that NAL unit is not received. In this case, the forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a syntax violation.

4.4. Decoding Order Number

For each NAL unit, the variable AbsDon is derived, representing the decoding order number that is indicative of the NAL unit decoding order.

Let NAL unit n be the n-th NAL unit in transmission order within an RTP stream.

If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon for NAL unit n, is derived as equal to n.
1365 Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is 1366 derived as follows, where DON[n] is the value of the variable DON for 1367 NAL unit n: 1369 * If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in 1370 transmission order), AbsDon[0] is set equal to DON[0]. 1372 * Otherwise (n is greater than 0), the following applies for 1373 derivation of AbsDon[n]: 1375 If DON[n] == DON[n-1], 1376 AbsDon[n] = AbsDon[n-1] 1378 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768), 1379 AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1] 1381 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768), 1382 AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n] 1384 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768), 1385 AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - 1386 DON[n]) 1388 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768), 1389 AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n]) 1391 For any two NAL units m and n, the following applies: 1393 * AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows 1394 NAL unit m in NAL unit decoding order. 1396 * When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order 1397 of the two NAL units can be in either order. 1399 * AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes 1400 NAL unit m in decoding order. 1402 Informative note: When two consecutive NAL units in the NAL 1403 unit decoding order have different values of AbsDon, the 1404 absolute difference between the two AbsDon values may be 1405 greater than or equal to 1. 1407 Informative note: There are multiple reasons to allow for the 1408 absolute difference of the values of AbsDon for two consecutive 1409 NAL units in the NAL unit decoding order to be greater than 1410 one. An increment by one is not required, as at the time of 1411 associating values of AbsDon to NAL units, it may not be known 1412 whether all NAL units are to be delivered to the receiver. 
For 1413 example, a gateway might not forward VCL NAL units of higher 1414 sublayers or some SEI NAL units when there is congestion in the 1415 network. In another example, the first intra-coded picture of 1416 a pre-encoded clip is transmitted in advance to ensure that it 1417 is readily available in the receiver, and when transmitting the 1418 first intra-coded picture, the originator does not exactly know 1419 how many NAL units will be encoded before the first intra-coded 1420 picture of the pre-encoded clip follows in decoding order. 1421 Thus, the values of AbsDon for the NAL units of the first 1422 intra-coded picture of the pre-encoded clip have to be 1423 estimated when they are transmitted, and gaps in values of 1424 AbsDon may occur. 1426 5. Packetization Rules 1428 The following packetization rules apply: 1430 * If sprop-max-don-diff is greater than 0, the transmission order of 1431 NAL units carried in the RTP stream MAY be different from the NAL 1432 unit decoding order. Otherwise (sprop-max-don-diff is equal to 1433 0), the transmission order of NAL units carried in the RTP stream 1434 MUST be the same as the NAL unit decoding order. 1436 * A NAL unit of a small size SHOULD be encapsulated in an 1437 aggregation packet together with one or more other NAL units in order 1438 to avoid the unnecessary packetization overhead for small NAL 1439 units. For example, non-VCL NAL units such as access unit 1440 delimiters, parameter sets, or SEI NAL units are typically small 1441 and can often be aggregated with VCL NAL units without violating 1442 MTU size constraints. 1444 * Each non-VCL NAL unit SHOULD, when possible from an MTU size match 1445 viewpoint, be encapsulated in an aggregation packet together with 1446 its associated VCL NAL unit, as typically a non-VCL NAL unit would 1447 be meaningless without the associated VCL NAL unit being 1448 available. 1450 * For carrying exactly one NAL unit in an RTP packet, a single NAL 1451 unit packet MUST be used.
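Informative example: the AbsDon derivation of Section 4.4 for the case where sprop-max-don-diff is greater than 0 can be sketched in Python as follows. The function names next_abs_don and abs_dons are illustrative only. DON values are assumed to be the 16-bit values carried in the packets, listed in transmission order.

```python
def next_abs_don(prev_abs_don: int, prev_don: int, don: int) -> int:
    """Derive AbsDon[n] from AbsDon[n-1], DON[n-1] and DON[n]."""
    if don == prev_don:
        return prev_abs_don
    if don > prev_don and don - prev_don < 32768:
        return prev_abs_don + don - prev_don
    if don < prev_don and prev_don - don >= 32768:
        # DON wrapped around 65535 while moving forward.
        return prev_abs_don + 65536 - prev_don + don
    if don > prev_don and don - prev_don >= 32768:
        # DON wrapped around 0 while moving backward.
        return prev_abs_don - (prev_don + 65536 - don)
    # Remaining case: don < prev_don and prev_don - don < 32768.
    return prev_abs_don - (prev_don - don)


def abs_dons(dons: list[int]) -> list[int]:
    """Return AbsDon for every NAL unit, given DON in transmission order."""
    result: list[int] = []
    for n, don in enumerate(dons):
        if n == 0:
            result.append(don)  # AbsDon[0] is set equal to DON[0]
        else:
            result.append(next_abs_don(result[-1], dons[n - 1], don))
    return result
```

With the DON sequence 65534, 65535, 0, 1 the sketch yields AbsDon values 65534, 65535, 65536, 65537, i.e., the 16-bit wraparound is unrolled into a monotonically increasing number.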
1453 6. De-packetization Process 1455 The general concept behind de-packetization is to get the NAL units 1456 out of the RTP packets in an RTP stream and pass them to the decoder 1457 in the NAL unit decoding order. 1459 The de-packetization process is implementation dependent. Therefore, 1460 the following description should be seen as an example of a suitable 1461 implementation. Other schemes may be used as well, as long as the 1462 output for the same input is the same as the process described below. 1463 The output is the same when the set of output NAL units and their 1464 order are both identical. Optimizations relative to the described 1465 algorithms are possible. 1467 All normal RTP mechanisms related to buffer management apply. In 1468 particular, duplicated or outdated RTP packets (as indicated by the 1469 RTP sequence number and the RTP timestamp) are removed. To 1470 determine the exact time for decoding, factors such as a possible 1471 intentional delay to allow for proper inter-stream synchronization 1472 MUST be factored in. 1474 NAL units with NAL unit type values in the range of 0 to 27, 1475 inclusive, may be passed to the decoder. NAL-unit-like structures 1476 with NAL unit type values in the range of 28 to 31, inclusive, MUST 1477 NOT be passed to the decoder. 1479 The receiver includes a receiver buffer, which is used to compensate 1480 for transmission delay jitter within an individual RTP stream and to 1481 reorder NAL units from transmission order to the NAL unit decoding 1482 order. In this section, the receiver operation is described under 1483 the assumption that there is no transmission delay jitter within an 1484 RTP stream. To distinguish it from a practical receiver buffer 1485 that is also used for compensation of transmission delay jitter, the 1486 receiver buffer is hereafter called the de-packetization buffer in 1487 this section.
Receivers should also prepare for transmission delay 1488 jitter; that is, either reserve separate buffers for transmission 1489 delay jitter buffering and de-packetization buffering or use a 1490 receiver buffer for both transmission delay jitter and de- 1491 packetization. Moreover, receivers should take transmission delay 1492 jitter into account in the buffering operation, e.g., by additional 1493 initial buffering before starting of decoding and playback. 1495 When sprop-max-don-diff is equal to 0, the de-packetization buffer 1496 size is zero bytes, and the process described in the remainder of 1497 this paragraph applies. The NAL units carried in the single RTP 1498 stream are directly passed to the decoder in their transmission 1499 order, which is identical to their decoding order. 1501 When sprop-max-don-diff is greater than 0, the process described in 1502 the remainder of this section applies. 1504 There are two buffering states in the receiver: initial buffering and 1505 buffering while playing. Initial buffering starts when the reception 1506 is initialized. After initial buffering, decoding and playback are 1507 started, and the buffering-while-playing mode is used. 1509 Regardless of the buffering state, the receiver stores incoming NAL 1510 units in reception order into the de-packetization buffer. NAL units 1511 carried in RTP packets are stored in the de-packetization buffer 1512 individually, and the value of AbsDon is calculated and stored for 1513 each NAL unit. 1515 Initial buffering lasts until the difference between the greatest and 1516 smallest AbsDon values of the NAL units in the de-packetization 1517 buffer is greater than or equal to the value of sprop-max-don-diff. 
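Informative example: the buffering behaviour of this section can be sketched in Python as follows. The sketch makes one loud assumption: once the difference between the greatest and smallest AbsDon values in the buffer reaches sprop-max-don-diff, NAL units are removed in increasing AbsDon order until that difference falls below sprop-max-don-diff again. This memo's text does not spell out the removal conditions in this section, so that rule is an illustration rather than the normative condition. The function name depacketize and its arguments are hypothetical, and jitter handling is deliberately left out.

```python
import heapq
from typing import Iterable, Iterator, Tuple


def depacketize(
    nal_units: Iterable[Tuple[int, bytes]], max_don_diff: int
) -> Iterator[bytes]:
    """Yield NAL units in decoding order.

    nal_units yields (AbsDon, NAL unit) pairs in reception order.
    """
    if max_don_diff == 0:
        # Transmission order equals decoding order; nothing is buffered.
        for _, nal in nal_units:
            yield nal
        return

    heap = []        # entries are (AbsDon, arrival index, NAL unit)
    greatest = None  # greatest AbsDon currently held in the buffer
    for arrival, (abs_don, nal) in enumerate(nal_units):
        heapq.heappush(heap, (abs_don, arrival, nal))
        greatest = abs_don if greatest is None else max(greatest, abs_don)
        # ASSUMPTION: drain while the AbsDon spread is >= max_don_diff.
        while greatest - heap[0][0] >= max_don_diff:
            yield heapq.heappop(heap)[2]

    # No more NAL units are arriving: flush in increasing AbsDon order.
    while heap:
        yield heapq.heappop(heap)[2]
```

The arrival index in each heap entry only breaks ties between equal AbsDon values, for which either decoding order is allowed.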
1519 After initial buffering, whenever condition A or condition B is true, 1520 the following operation is repeatedly applied until both condition A 1521 and condition B become false: 1523 * The NAL unit in the de-packetization buffer with the smallest 1524 value of AbsDon is removed from the de-packetization buffer and 1525 passed to the decoder. 1527 When no more NAL units are flowing into the de-packetization buffer, 1528 all NAL units remaining in the de-packetization buffer are removed 1529 from the buffer and passed to the decoder in the order of increasing 1530 AbsDon values. 1532 7. Payload Format Parameters 1534 This section specifies the optional parameters. A mapping of the 1535 parameters with Session Description Protocol (SDP) [RFC4566] is also 1536 provided for applications that use SDP. 1538 7.1. Media Type Registration 1540 The receiver MUST ignore any parameter unspecified in this memo. 1542 Type name: video 1544 Subtype name: H266 1546 Required parameters: none 1548 Optional parameters: 1550 profile-id, tier-flag, sub-profile-id, interop-constraints, and 1551 level-id: 1553 These parameters indicate the profile, tier, default level, 1554 sub-profile, and some constraints of the bitstream carried by 1555 the RTP stream, or a specific set of the profile, tier, default 1556 level, sub-profile and some constraints the receiver supports. 1558 The subset of coding tools that may have been used to generate 1559 the bitstream or that the receiver supports, as well as some 1560 additional constraints are indicated collectively by profile- 1561 id, sub-profile-id, and interop-constraints. 1563 Informative note: There are 128 values of profile-id. The 1564 subset of coding tools identified by the profile-id can be 1565 further constrained with up to 255 instances of sub-profile- 1566 id. In addition, 68 bits included in interop-constraints, 1567 which can be extended up to 324 bits, provide means to 1568 further restrict tools from existing profiles.
To be able 1569 to support this fine-granular signalling of coding tool 1570 subsets with profile-id, sub-profile-id and interop- 1571 constraints, it would be safe to require symmetric use of 1572 these parameters in SDP offer/answer unless recv-ols-id is 1573 included in the SDP answer for choosing one of the layers 1574 offered. 1576 The tier is indicated by tier-flag. The default level is 1577 indicated by level-id. The tier and the default level specify 1578 the limits on values of syntax elements or arithmetic 1579 combinations of values of syntax elements that are followed 1580 when generating the bitstream or that the receiver supports. 1582 In SDP offer/answer, when the SDP answer does not include the 1583 recv-ols-id parameter that is less than the sprop-ols-id 1584 parameter in the SDP offer, the following applies: 1586 o The tier-flag, profile-id, sub-profile-id, and interop- 1587 constraints parameters MUST be used symmetrically, i.e., the 1588 value of each of these parameters in the offer MUST be the 1589 same as that in the answer, either explicitly signaled or 1590 implicitly inferred. 1592 o The level-id parameter is changeable as long as the highest 1593 level indicated by the answer is either equal to or lower 1594 than that in the offer. Note that a highest level higher 1595 than level-id in the offer for receiving can be included as 1596 max-recv-level-id. 1598 In SDP offer/answer, when the SDP answer does include the recv- 1599 ols-id parameter that is less than the sprop-ols-id parameter 1600 in the SDP offer, the set of tier-flag, profile-id, sub- 1601 profile-id, interop-constraints, and level-id parameters 1602 included in the answer MUST be consistent with that for the 1603 chosen output layer set as indicated in the SDP offer, with the 1604 exception that the level-id parameter in the SDP answer is 1605 changeable as long as the highest level indicated by the answer 1606 is either lower than or equal to that in the offer.
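Informative example: for the case where the answer does not include a recv-ols-id that is less than the offered sprop-ols-id, the symmetry rules above might be checked along the lines of the following Python sketch. It is a simplification under stated assumptions: parameters are given as already-parsed name/value strings, the inferred defaults used for profile-id (1), tier-flag (0), and level-id (51) are the ones stated in this section, and sub-profile-id and interop-constraints are compared as literal strings, with an absent parameter matching only another absent parameter. The function name answer_is_consistent is hypothetical.

```python
from typing import Mapping, Optional

# Values inferred by this memo when a parameter is absent.
DEFAULTS = {"profile-id": "1", "tier-flag": "0", "level-id": "51"}

# Parameters that have to match between offer and answer.
SYMMETRIC = ("profile-id", "tier-flag", "sub-profile-id", "interop-constraints")


def effective(params: Mapping[str, str], name: str) -> Optional[str]:
    """Return the signaled value, or the inferred default if there is one."""
    return params.get(name, DEFAULTS.get(name))


def answer_is_consistent(offer: Mapping[str, str], answer: Mapping[str, str]) -> bool:
    """Check symmetric parameters and the level-id downgrade rule."""
    for name in SYMMETRIC:
        # SIMPLIFICATION: plain string comparison of explicit or default values.
        if effective(offer, name) != effective(answer, name):
            return False
    # level-id may change, but only to an equal or lower level.
    return int(effective(answer, "level-id")) <= int(effective(offer, "level-id"))
```

An answer that omits profile-id is treated as signaling profile-id=1, so it is consistent with an offer that carries profile-id=1 explicitly.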
1608 More specifications of these parameters, including how they 1609 relate to syntax elements specified in [VVC], are provided 1610 below. 1612 profile-id: 1614 When profile-id is not present, a value of 1 (i.e., the Main 10 1615 profile) MUST be inferred. 1617 When used to indicate properties of a bitstream, profile-id is 1618 derived from the general_profile_idc syntax element that 1619 applies to the bitstream in an instance of the 1620 profile_tier_level( ) syntax structure. 1622 A profile_tier_level( ) syntax structure may be contained in an 1623 SPS, VPS, or DCI NAL unit as specified in [VVC]. One of the 1624 following three cases applies to the container NAL unit of the 1625 profile_tier_level( ) syntax structure containing those PTL 1626 syntax elements used to derive the values of profile-id, tier- 1627 flag, level-id, sub-profile-id, or interop-constraints: 1) The 1628 container NAL unit is an SPS, the bitstream is a single-layer 1629 bitstream, and the profile_tier_level( ) syntax structures in 1630 all SPSs referenced by the CVSs in the bitstream have the same 1631 values respectively for those PTL syntax elements; 2) The 1632 container NAL unit is a VPS, the profile_tier_level( ) syntax 1633 structure is the one in the VPS that applies to the OLS 1634 corresponding to the bitstream, and the profile_tier_level( ) 1635 syntax structures applicable to the OLS corresponding to the 1636 bitstream in all VPSs referenced by the CVSs in the bitstream 1637 have the same values respectively for those PTL syntax 1638 elements; 3) The container NAL unit is a DCI NAL unit and the 1639 profile_tier_level( ) syntax structures in all DCI NAL units in 1640 the bitstream have the same values respectively for those PTL 1641 syntax elements. 1643 tier-flag, level-id: 1645 The value of tier-flag MUST be in the range of 0 to 1, 1646 inclusive. The value of level-id MUST be in the range of 0 to 1647 255, inclusive.
1649 If the tier-flag and level-id parameters are used to indicate 1650 properties of a bitstream, they indicate the tier and the 1651 highest level the bitstream complies with. 1653 If the tier-flag and level-id parameters are used for 1654 capability exchange, the following applies. If max-recv-level- 1655 id is not present, the default level defined by level-id 1656 indicates the highest level the codec wishes to support. 1657 Otherwise, max-recv-level-id indicates the highest level the 1658 codec supports for receiving. For either receiving or sending, 1659 all levels that are lower than the highest level supported MUST 1660 also be supported. 1662 If no tier-flag is present, a value of 0 MUST be inferred; if 1663 no level-id is present, a value of 51 (i.e., level 3.1) MUST be 1664 inferred. 1666 Informative note: The level values currently defined in the 1667 VVC specification are in the form of "majorNum.minorNum", 1668 and the value of the level-id for each of the levels is 1669 equal to majorNum * 16 + minorNum * 3. It is expected that 1670 if any levels are defined in the future, the same convention 1671 will be used, but this cannot be guaranteed. 1673 When used to indicate properties of a bitstream, the tier-flag 1674 and level-id parameters are derived respectively from the 1675 syntax element general_tier_flag, and the syntax element 1676 general_level_idc or sub_layer_level_idc[j], that apply to the 1677 bitstream, in an instance of the profile_tier_level( ) syntax 1678 structure.
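Informative example: the level-id convention from the informative note above (majorNum * 16 + minorNum * 3) can be written out as the following Python sketch. The function names are illustrative. The inverse mapping assumes minorNum is at most 5 so that the minor part stays below 16; as the note says, future levels are not guaranteed to follow this convention.

```python
from typing import Tuple


def level_id(major_num: int, minor_num: int) -> int:
    """Map level "majorNum.minorNum" to the level-id parameter value."""
    value = major_num * 16 + minor_num * 3
    if not 0 <= value <= 255:
        raise ValueError("level-id must be in the range 0 to 255")
    return value


def level_from_id(value: int) -> Tuple[int, int]:
    """Map a level-id back to (majorNum, minorNum) under the same convention."""
    if not 0 <= value <= 255:
        raise ValueError("level-id must be in the range 0 to 255")
    major_num, rest = divmod(value, 16)
    if rest % 3 != 0:
        raise ValueError("level-id does not follow majorNum * 16 + minorNum * 3")
    return major_num, rest // 3
```

For the inferred default, level_id(3, 1) gives 51, matching the statement that a level-id of 51 corresponds to level 3.1.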
1680 If the tier-flag and level-id are derived from the 1681 profile_tier_level( ) syntax structure in a DCI NAL unit, the 1682 following applies: 1684 o tier-flag = general_tier_flag 1686 o level-id = general_level_idc 1688 Otherwise, if the tier-flag and level-id are derived from the 1689 profile_tier_level( ) syntax structure in an SPS or VPS NAL 1690 unit, and the bitstream contains the highest sub-layer 1691 representation in the OLS corresponding to the bitstream, the 1692 following applies: 1694 o tier-flag = general_tier_flag 1696 o level-id = general_level_idc 1698 Otherwise, if the tier-flag and level-id are derived from the 1699 profile_tier_level( ) syntax structure in an SPS or VPS NAL 1700 unit, and the bitstream does not contain the highest sub-layer 1701 representation in the OLS corresponding to the bitstream, the 1702 following applies, with j being the value of the sprop-sub- 1703 layer-id parameter: 1705 o tier-flag = general_tier_flag 1707 o level-id = sub_layer_level_idc[j] 1709 sub-profile-id: 1711 The value of the parameter is a comma-separated (',') list of 1712 data using base64 [RFC4648] (hexadecimal) representation. 1714 When used to indicate properties of a bitstream, sub-profile-id 1715 is derived from each of the ptl_num_sub_profiles 1716 general_sub_profile_idc[i] syntax elements that apply to the 1717 bitstream in a profile_tier_level( ) syntax structure. 1719 interop-constraints: 1721 A base16 [RFC4648] (hexadecimal) representation of the data 1722 that includes the syntax elements 1723 ptl_frame_only_constraint_flag and ptl_multilayer_enabled_flag 1724 and the general_constraints_info( ) syntax structure that apply 1725 to the bitstream in an instance of the profile_tier_level( ) 1726 syntax structure.
1728 If the interop-constraints parameter is not present, the 1729 following MUST be inferred: 1731 o ptl_frame_only_constraint_flag = 0 1733 o ptl_multilayer_enabled_flag = 1 1735 o gci_present_flag in the general_constraints_info( ) syntax 1736 structure = 1 1738 editor-note 14: Double check the default values. Currently, no 1739 constraints, but actually, with the Main 10 profile as default multi- 1740 layer not possible. 1742 Using interop-constraints for capability exchange results in a 1743 requirement on any bitstream to be compliant with the interop- 1744 constraints. 1746 sprop-sub-layer-id: 1748 This parameter MAY be used to indicate the highest allowed 1749 value of TID in the bitstream. When not present, the value of 1750 sprop-sub-layer-id is inferred to be equal to 6. 1752 The value of sprop-sub-layer-id MUST be in the range of 0 to 6, 1753 inclusive. 1755 sprop-ols-id: 1757 This parameter MAY be used to indicate the OLS that the 1758 bitstream applies to. When not present, the value of sprop- 1759 ols-id is inferred to be equal to TargetOlsIdx as specified in 1760 8.1.1 in [VVC]. If this optional parameter is present, sprop- 1761 vps MUST also be present or its content MUST be known a priori 1762 at the receiver. 1764 The value of sprop-ols-id MUST be in the range of 0 to 257, 1765 inclusive. 1767 recv-sub-layer-id: 1769 This parameter MAY be used to signal a receiver's choice of the 1770 offered or declared sub-layer representations in the sprop-vps 1771 and sprop-sps. The value of recv-sub-layer-id indicates the 1772 TID of the highest sub-layer of the bitstream that a receiver 1773 supports. When not present, the value of recv-sub-layer-id is 1774 inferred to be equal to the value of the sprop-sub-layer-id 1775 parameter in the SDP offer. 1777 The value of recv-sub-layer-id MUST be in the range of 0 to 6, 1778 inclusive. 
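Informative example: the integer ranges stated so far for tier-flag, level-id, sprop-sub-layer-id, sprop-ols-id, and recv-sub-layer-id could be checked with a small table-driven Python sketch such as the one below. The table RANGES and the function check_ranges are illustrative names, the input is assumed to be a mapping of already-parsed parameter names to their textual values, and parameters defined later in this section are not covered.

```python
from typing import Dict, List

# Inclusive (low, high) ranges taken from the parameter definitions above.
RANGES = {
    "tier-flag": (0, 1),
    "level-id": (0, 255),
    "sprop-sub-layer-id": (0, 6),
    "sprop-ols-id": (0, 257),
    "recv-sub-layer-id": (0, 6),
}


def check_ranges(params: Dict[str, str]) -> List[str]:
    """Return the names of present parameters whose values are invalid."""
    bad = []
    for name, (low, high) in RANGES.items():
        if name not in params:
            continue  # absent parameters take their inferred defaults
        text = params[name]
        is_number = text.isascii() and text.isdigit()
        if not is_number or not low <= int(text) <= high:
            bad.append(name)
    return bad
```

An empty result means that every listed parameter that is present carries a decimal integer inside its allowed range.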
1780 recv-ols-id: 1782 This parameter MAY be used to signal a receiver's choice of the 1783 offered or declared output layer sets in the sprop-vps. The 1784 value of recv-ols-id indicates the OLS index of the bitstream 1785 that a receiver supports. When not present, the value of recv- 1786 ols-id is inferred to be equal to the value of the sprop-ols-id 1787 parameter in the SDP offer. The recv-ols-id parameter 1788 must be included only when sprop-ols-id was received, and its value 1789 must refer to an output layer set in the VPS that is in the 1790 same dependency tree as the OLS referred to by sprop-ols-id. 1791 If this optional parameter is present, sprop-vps must have been 1792 received or its content must be known a priori at the receiver. 1794 The value of recv-ols-id MUST be in the range of 0 to 257, 1795 inclusive. 1797 max-recv-level-id: 1799 This parameter MAY be used to indicate the highest level a 1800 receiver supports. 1802 The value of max-recv-level-id MUST be in the range of 0 to 1803 255, inclusive. 1805 When max-recv-level-id is not present, the value is inferred to 1806 be equal to level-id. 1808 max-recv-level-id MUST NOT be present when the highest level 1809 the receiver supports is not higher than the default level. 1811 sprop-dci: 1813 This parameter MAY be used to convey a decoding capability 1814 information NAL unit of the bitstream for out-of-band 1815 transmission. The parameter MAY also be used for capability 1816 exchange. The value of the parameter is a base64 [RFC4648] 1817 representation of the decoding capability information NAL unit 1818 as specified in Section 7.3.2.1 of [VVC]. 1820 sprop-vps: 1822 This parameter MAY be used to convey any video parameter set 1823 NAL unit of the bitstream for out-of-band transmission of video 1824 parameter sets.
The parameter MAY also be used for capability 1825 exchange and to indicate sub-stream characteristics (i.e., 1826 properties of output layer sets and sublayer representations as 1827 defined in [VVC]). The value of the parameter is a comma- 1828 separated (',') list of base64 [RFC4648] representations of the 1829 video parameter set NAL units as specified in Section 7.3.2.3 1830 of [VVC]. 1832 The sprop-vps parameter MAY contain one or more than one video 1833 parameter set NAL unit. However, all other video parameter 1834 sets contained in the sprop-vps parameter MUST be consistent 1835 with the first video parameter set in the sprop-vps parameter. 1836 A video parameter set vpsB is said to be consistent with 1837 another video parameter set vpsA if any decoder that conforms 1838 to the profile, tier, level, and constraints indicated by the 1839 12 bytes of data starting from the syntax element 1840 general_profile_space to the syntax element general_level_idc, 1841 inclusive, in the first profile_tier_level( ) syntax structure 1842 in vpsA can decode any bitstream that conforms to the profile, 1843 tier, level, and constraints indicated by the 12 bytes of data 1844 starting from the syntax element general_profile_space to the 1845 syntax element general_level_idc, inclusive, in the first 1846 profile_tier_level( ) syntax structure in vpsB. 1848 sprop-sei: 1850 This parameter MAY be used to convey one or more SEI messages 1851 that describe bitstream characteristics. When present, a 1852 decoder can rely on the bitstream characteristics that are 1853 described in the SEI messages for the entire duration of the 1854 session, independently from the persistence scopes of the SEI 1855 messages as specified in [VSEI]. 1857 The value of the parameter is a comma-separated (',') list of 1858 base64 [RFC4648] representations of SEI NAL units as specified 1859 in [VSEI]. 
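Informative example: sprop-vps and sprop-sei both carry a comma-separated list of base64 representations of NAL units. A receiver might turn such a value back into NAL unit octets as in the following Python sketch. The function name decode_sprop_list is illustrative, and the sketch only undoes the base64 encoding; it does not parse or validate the NAL units themselves.

```python
import base64
from typing import List


def decode_sprop_list(value: str) -> List[bytes]:
    """Decode a comma-separated list of base64 strings into NAL units."""
    units = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            raise ValueError("empty entry in comma-separated list")
        # validate=True rejects characters outside the base64 alphabet;
        # the resulting binascii.Error is a subclass of ValueError.
        units.append(base64.b64decode(item, validate=True))
    return units
```

Each returned element is one NAL unit, in the order in which the entries appear in the parameter value.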
1861 Informative note: Intentionally, no list of applicable or 1862 inapplicable SEI messages is specified here. Conveying 1863 certain SEI messages in sprop-sei may be sensible in some 1864 application scenarios and meaningless in others. However, a 1865 few examples are described below: 1867 1) In an environment where the bitstream was created from 1868 film-based source material, and no splicing is going to 1869 occur during the lifetime of the session, the film grain 1870 characteristics SEI message is likely meaningful, and 1871 sending it in sprop-sei rather than in the bitstream at each 1872 entry point may help with saving bits and allows one to 1873 configure the renderer only once, avoiding unwanted 1874 artifacts. 1876 2) Examples for SEI messages that would be meaningless to be 1877 conveyed in sprop-sei include the decoded picture hash SEI 1878 message (it is close to impossible that all decoded pictures 1879 have the same hash), the display orientation SEI message 1880 when the device is a handheld device (as the display 1881 orientation may change when the handheld device is turned 1882 around), or the filler payload SEI message (as there is no 1883 point in just having more bits in SDP). 1885 max-lsr: 1887 The max-lsr MAY be used to signal the capabilities of a 1888 receiver implementation and MUST NOT be used for any other 1889 purpose. The value of max-lsr is an integer indicating the 1890 maximum processing rate in units of luma samples per second. 1891 The max-lsr parameter signals that the receiver is capable of 1892 decoding video at a higher rate than is required by the highest 1893 level. 1895 Informative note: When the OPTIONAL media type parameters 1896 are used to signal the properties of a bitstream, and max- 1897 lsr is not present, the values of tier-flag, profile-id, 1898 sub-profile-id, interop-constraints, and level-id must always 1899 be such that the bitstream complies fully with the specified 1900 profile, tier, and level.
1902 When max-lsr is signaled, the receiver MUST be able to decode 1903 bitstreams that conform to the highest level, with the 1904 exception that the MaxLumaSr value in Table 136 of [VVC] for 1905 the highest level is replaced with the value of max-lsr. 1906 Senders MAY use this knowledge to send pictures of a given size 1907 at a higher picture rate than is indicated in the highest 1908 level. 1910 When not present, the value of max-lsr is inferred to be equal 1911 to the value of MaxLumaSr given in Table 136 of [VVC] for the 1912 highest level. 1914 The value of max-lsr MUST be in the range of MaxLumaSr to 16 * 1915 MaxLumaSr, inclusive, where MaxLumaSr is given in Table 136 of 1916 [VVC] for the highest level. 1918 max-fps: 1920 The value of max-fps is an integer indicating the maximum 1921 picture rate in units of pictures per 100 seconds that can be 1922 effectively processed by the receiver. The max-fps parameter 1923 MAY be used to signal that the receiver has a constraint in 1924 that it is not capable of processing video effectively at the 1925 full picture rate that is implied by the highest level and, 1926 when present, max-lsr. 1928 The value of max-fps is not necessarily the picture rate at 1929 which the maximum picture size can be sent; it constitutes a 1930 constraint on maximum picture rate for all resolutions. 1932 Informative note: The max-fps parameter is semantically 1933 different from max-lsr in that max-fps is used to signal a 1934 constraint, lowering the maximum picture rate from what is 1935 implied by other parameters. 1937 The encoder MUST use a picture rate equal to or less than this 1938 value. In cases where the max-fps parameter is absent, the 1939 encoder is free to choose any picture rate according to the 1940 highest level and any signaled optional parameters. 1942 The value of max-fps MUST be smaller than or equal to the full 1943 picture rate that is implied by the highest level and, when 1944 present, max-lsr.
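Informative example: the interplay of max-lsr and max-fps can be sketched in Python as follows. The sketch treats max-lsr as a budget of luma samples per second and max-fps as a cap expressed in pictures per 100 seconds. It ignores every other level limit (for example, the maximum picture size), and the MaxLumaSr value of the highest level is passed in by the caller rather than looked up, since Table 136 of [VVC] is not reproduced here. All function and argument names are hypothetical.

```python
from typing import Optional


def max_lsr_in_range(max_lsr: int, level_max_luma_sr: int) -> bool:
    """Check that max-lsr lies between MaxLumaSr and 16 * MaxLumaSr, inclusive."""
    return level_max_luma_sr <= max_lsr <= 16 * level_max_luma_sr


def max_picture_rate(
    luma_samples_per_picture: int, max_lsr: int, max_fps: Optional[int] = None
) -> float:
    """Upper bound on pictures per second for one picture size."""
    rate = max_lsr / luma_samples_per_picture
    if max_fps is not None:
        # max-fps is signaled in pictures per 100 seconds.
        rate = min(rate, max_fps / 100)
    return rate
```

With an illustrative max-lsr of 124416000 (a made-up figure, not taken from [VVC]) and 1920x1080 pictures, the luma sample budget alone allows 60 pictures per second, and a signaled max-fps of 3000 lowers that bound to 30 pictures per second.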
1946 sprop-max-don-diff: 1948 If there is no NAL unit naluA that is followed in transmission 1949 order by any NAL unit preceding naluA in decoding order (i.e., 1950 the transmission order of the NAL units is the same as the 1951 decoding order), the value of this parameter MUST be equal to 1952 0. 1954 Otherwise, this parameter specifies the maximum absolute 1955 difference between the decoding order number (i.e., AbsDon) 1956 values of any two NAL units naluA and naluB, where naluA 1957 follows naluB in decoding order and precedes naluB in 1958 transmission order. 1960 The value of sprop-max-don-diff MUST be an integer in the range 1961 of 0 to 32767, inclusive. 1963 When not present, the value of sprop-max-don-diff is inferred 1964 to be equal to 0. 1966 sprop-depack-buf-bytes: 1968 This parameter signals the required size of the de- 1969 packetization buffer in units of bytes. The value of the 1970 parameter MUST be greater than or equal to the maximum buffer 1971 occupancy (in units of bytes) of the de-packetization buffer as 1972 specified in Section 6. 1974 The value of sprop-depack-buf-bytes MUST be an integer in the 1975 range of 0 to 4294967295, inclusive. 1977 When sprop-max-don-diff is present and greater than 0, this 1978 parameter MUST be present and the value MUST be greater than 0. 1979 When not present, the value of sprop-depack-buf-bytes is 1980 inferred to be equal to 0. 1982 Informative note: The value of sprop-depack-buf-bytes 1983 indicates the required size of the de-packetization buffer 1984 only. When network jitter can occur, an appropriately sized 1985 jitter buffer has to be available as well. 1987 depack-buf-cap: 1989 This parameter signals the capabilities of a receiver 1990 implementation and indicates the amount of de-packetization 1991 buffer space in units of bytes that the receiver has available 1992 for reconstructing the NAL unit decoding order from NAL units 1993 carried in the RTP stream. 
A receiver is able to handle any 1994 RTP stream for which the value of the sprop-depack-buf-bytes 1995 parameter is smaller than or equal to this parameter. 1997 When not present, the value of depack-buf-cap is inferred to be 1998 equal to 4294967295. The value of depack-buf-cap MUST be an 1999 integer in the range of 1 to 4294967295, inclusive. 2001 Informative note: depack-buf-cap indicates the maximum 2002 possible size of the de-packetization buffer of the receiver 2003 only, without allowing for network jitter. 2005 7.2. SDP Parameters 2007 The receiver MUST ignore any parameter unspecified in this memo. 2009 7.2.1. Mapping of Payload Type Parameters to SDP 2011 The media type video/H266 string is mapped to fields in the Session 2012 Description Protocol (SDP) [RFC4566] as follows: 2014 * The media name in the "m=" line of SDP MUST be video. 2016 * The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the 2017 media subtype). 2019 * The clock rate in the "a=rtpmap" line MUST be 90000. 2021 * The OPTIONAL parameters profile-id, tier-flag, sub-profile-id, 2022 interop-constraints, level-id, sprop-sub-layer-id, sprop-ols-id, 2023 recv-sub-layer-id, recv-ols-id, max-recv-level-id, max-lsr, max- 2024 fps, sprop-max-don-diff, sprop-depack-buf-bytes and depack-buf- 2025 cap, when present, MUST be included in the "a=fmtp" line of SDP. 2026 This parameter is expressed as a media type string, in the form of 2027 a semicolon-separated list of parameter=value pairs. 2029 * The OPTIONAL parameter sprop-vps, when present, MUST be included 2030 in the "a=fmtp" line of SDP or conveyed using the "fmtp" source 2031 attribute as specified in Section 6.3 of [RFC5576]. For a 2032 particular media format (i.e., RTP payload type), sprop-vps MUST 2033 NOT be both included in the "a=fmtp" line of SDP and conveyed 2034 using the "fmtp" source attribute. 
When included in the "a=fmtp" 2035 line of SDP, sprop-vps is expressed as a media type string, in the 2036 form of a parameter=value pair. When conveyed in the "a=fmtp" 2037 line of SDP for a particular payload type, the parameter sprop-vps 2038 MUST be applied to each SSRC with the payload type. When conveyed 2039 using the "fmtp" source attribute, sprop-vps is only associated 2040 with the given source and payload type as parts of the "fmtp" 2041 source attribute. 2043 An example of media representation in SDP is as follows: 2045 m=video 49170 RTP/AVP 98 2046 a=rtpmap:98 H266/90000 2047 a=fmtp:98 profile-id=1; sprop-vps=