avtcore                                                          S. Zhao
Internet-Draft                                                 S. Wenger
Intended status: Standards Track                                 Tencent
Expires: 6 November 2022                                      Y. Sanchez
                                                          Fraunhofer HHI
                                                                 Y. Wang
                                                          Bytedance Inc.
                                                                      M.
                                                           M. Hannuksela
                                                      Nokia Technologies
                                                              5 May 2022

          RTP Payload Format for Versatile Video Coding (VVC)
                     draft-ietf-avtcore-rtp-vvc-16

Abstract

   This memo describes an RTP payload format for the video coding standard ITU-T Recommendation H.266 and ISO/IEC International Standard 23090-3, both also known as Versatile Video Coding (VVC) and developed by the Joint Video Experts Team (JVET). The RTP payload format allows for packetization of one or more Network Abstraction Layer (NAL) units in each RTP packet payload as well as fragmentation of a NAL unit into multiple RTP packets. The payload format has wide applicability in videoconferencing, Internet video streaming, and high-bitrate entertainment-quality video, among other applications.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 6 November 2022.

Copyright Notice

   Copyright (c) 2022 IETF Trust and the persons identified as the document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document.
   Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Overview of the VVC Codec
       1.1.1.  Coding-Tool Features (informative)
       1.1.2.  Systems and Transport Interfaces (informative)
       1.1.3.  High-Level Picture Partitioning (informative)
       1.1.4.  NAL Unit Header
     1.2.  Overview of the Payload Format
   2.  Conventions
   3.  Definitions and Abbreviations
     3.1.  Definitions
       3.1.1.  Definitions from the VVC Specification
       3.1.2.  Definitions Specific to This Memo
     3.2.  Abbreviations
   4.  RTP Payload Format
     4.1.  RTP Header Usage
     4.2.  Payload Header Usage
     4.3.  Payload Structures
       4.3.1.  Single NAL Unit Packets
       4.3.2.  Aggregation Packets (APs)
       4.3.3.  Fragmentation Units
     4.4.  Decoding Order Number
   5.  Packetization Rules
   6.  De-packetization Process
   7.  Payload Format Parameters
     7.1.  Media Type Registration
     7.2.  Optional Parameters Definition
     7.3.  SDP Parameters
       7.3.1.  Mapping of Payload Type Parameters to SDP
       7.3.2.  Usage with SDP Offer/Answer Model
       7.3.3.  Usage in Declarative Session Descriptions
       7.3.4.  Considerations for Parameter Sets
   8.  Use with Feedback Messages
     8.1.  Picture Loss Indication (PLI)
     8.2.  Full Intra Request (FIR)
   9.  Security Considerations
   10. Congestion Control
   11. IANA Considerations
   12. Acknowledgements
   13. References
     13.1.  Normative References
     13.2.  Informative References
   Appendix A.  Change History
   Authors' Addresses

1.  Introduction

   The Versatile Video Coding specification was formally published as both ITU-T Recommendation H.266 [VVC] and ISO/IEC International Standard 23090-3 [ISO23090-3]. VVC is reported to provide significant coding efficiency gains over High Efficiency Video Coding [HEVC], also known as H.265, and other earlier video codecs.

   This memo specifies an RTP payload format for VVC. It shares its basic design with the NAL (Network Abstraction Layer) unit based RTP payload formats of AVC Video Coding [RFC6184], Scalable Video Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798] and their respective predecessors.
   With respect to design philosophy, security, congestion control, and overall implementation complexity, it has similar properties to those earlier payload format specifications. This is a conscious choice, as at least RFC 6184 is widely deployed and generally known in the relevant implementer communities. Certain scalability-related mechanisms known from [RFC6190] were incorporated into this document, as VVC version 1 supports temporal, spatial, and signal-to-noise ratio (SNR) scalability.

1.1.  Overview of the VVC Codec

   VVC and HEVC share a similar hybrid video codec design. In this memo, we provide a very brief overview of those features of VVC that are, in some form, addressed by the payload format specified herein. Implementers have to read, understand, and apply the ITU-T/ISO/IEC specifications pertaining to VVC to arrive at interoperable, well-performing implementations.

   Conceptually, both VVC and HEVC include a Video Coding Layer (VCL), which is often used to refer to the coding-tool features, and a NAL, which is often used to refer to the systems and transport interface aspects of the codecs.

1.1.1.  Coding-Tool Features (informative)

   Coding tool features are described below with occasional reference to the coding tool set of HEVC, which is well known in the community.

   Similar to earlier hybrid-video-coding-based standards, including HEVC, the following basic video coding design is employed by VVC. A prediction signal is first formed by either intra- or motion-compensated prediction, and the residual (the difference between the original and the prediction) is then coded. The gains in coding efficiency are achieved by redesigning and improving almost all parts of the codec over earlier designs. In addition, VVC includes several tools to make the implementation on parallel architectures easier.
   Finally, VVC includes temporal, spatial, and SNR scalability as well as multiview coding support.

   Coding blocks and transform structure

   Among major coding-tool differences between HEVC and VVC, one of the important improvements is the more flexible coding tree structure in VVC, i.e., the multi-type tree. In addition to the quadtree, binary and ternary trees are also supported, which contributes a significant improvement in coding efficiency. Moreover, the maximum size of a coding tree unit (CTU) is increased from 64x64 to 128x128. To improve the coding efficiency of the chroma signal, luma-chroma separated trees at the CTU level may be employed for intra-slices. The square transforms in HEVC are extended to non-square transforms for rectangular blocks resulting from binary and ternary tree splits. Besides, VVC supports multiple transform sets (MTS), including DCT-2, DST-7, and DCT-8 as well as the non-separable secondary transform. The transforms used in VVC can have different sizes with support for larger transform sizes. For DCT-2, the transform sizes range from 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range from 4x4 to 32x32. In addition, VVC also supports sub-block transform for both intra- and inter-coded blocks. For intra-coded blocks, intra sub-partitioning (ISP) may be used to allow sub-block-based intra prediction and transform. For inter blocks, sub-block transform may be used assuming that only a part of an inter-block has non-zero transform coefficients.

   Entropy coding

   Similar to HEVC, VVC uses a single entropy-coding engine, which is based on context adaptive binary arithmetic coding [CABAC], but with the support of multi-window sizes. The window sizes can be initialized differently for different context models. Due to such a design, it has more efficient adaptation speed and better coding efficiency.
   A joint chroma residual coding scheme is applied to further exploit the correlation between the residuals of two color components. In VVC, different residual coding schemes are applied for regular transform coefficients and residual samples generated using transform-skip mode.

   In-loop filtering

   VVC has more feature support in loop filters than HEVC. The deblocking filter in VVC is similar to HEVC but operates at a smaller grid. After deblocking and sample adaptive offset (SAO), an adaptive loop filter (ALF) may be used. As a Wiener filter, ALF reduces distortion of decoded pictures. Besides, VVC introduces a new module called luma mapping with chroma scaling to fully utilize the dynamic range of the signal so that rate-distortion performance of both Standard Dynamic Range (SDR) and High Dynamic Range (HDR) content is improved.

   Motion prediction and coding

   Compared to HEVC, VVC introduces several improvements in this area. First, there is the adaptive motion vector resolution (AMVR), which can save bit cost for motion vectors by adaptively signaling motion vector resolution. Then, affine motion compensation is included to capture complicated motion like zooming and rotation. Meanwhile, prediction refinement with the optical flow with affine mode (PROF) is further deployed to mimic affine motion at the pixel level. Third, decoder-side motion vector refinement (DMVR) is a method to derive the motion vector (MV) at the decoder side based on block matching so that fewer bits may be spent on motion vectors. Bi-directional optical flow (BDOF) is a similar method to PROF. BDOF adds a sample-wise offset at the 4x4 sub-block level that is derived with equations based on gradients of the prediction samples and a motion difference relative to CU motion vectors.
   Furthermore, merge with motion vector difference (MMVD) is a special mode, which further signals a limited set of motion vector differences on top of merge mode. In addition to MMVD, there are another three types of special merge modes, i.e., sub-block merge, triangle, and combined intra-/inter-prediction (CIIP). The sub-block merge list includes one candidate of sub-block temporal motion vector prediction (SbTMVP) and up to four candidates of affine motion vectors. Triangle is based on triangular block motion compensation. CIIP combines intra- and inter-predictions with weighting. Adaptive weighting may be employed with a block-level tool called bi-prediction with CU based weighting (BCW), which provides more flexibility than in HEVC.

   Intra prediction and intra-coding

   To capture the diversified local image texture directions with finer granularity, VVC supports 65 angular directions instead of 33 directions in HEVC. The intra mode coding is based on a 6-most-probable-mode scheme, and the 6 most probable modes are derived using the neighboring intra prediction directions. In addition, to deal with the different distributions of intra prediction angles for different block aspect ratios, a wide-angle intra prediction (WAIP) scheme is applied in VVC by including intra prediction angles beyond those present in HEVC. Unlike HEVC, which only allows using the most adjacent line of reference samples for intra prediction, VVC also allows using two further reference lines, also known as multi-reference-line (MRL) intra prediction. The additional reference lines can be used only for the 6 most probable intra prediction modes. To capture the strong correlation between different colour components, in VVC, a cross-component linear mode (CCLM) is utilized, which assumes a linear relationship between the luma sample values and their associated chroma samples.
   For intra prediction, VVC also applies a position-dependent prediction combination (PDPC) for refining the prediction samples closer to the intra prediction block boundary. Matrix-based intra prediction (MIP) modes are also used in VVC; they generate an intra prediction block of up to 8x8 samples using a weighted sum of downsampled neighboring reference samples, and the weights are hardcoded constants.

   Other coding-tool features

   VVC introduces dependent quantization (DQ) to reduce quantization error by state-based switching between two quantizers.

1.1.2.  Systems and Transport Interfaces (informative)

   VVC inherits the basic systems and transport interfaces designs from HEVC and AVC. These include the NAL-unit-based syntax structure, the hierarchical syntax and data unit structure, the supplemental enhancement information (SEI) message mechanism, and the video buffering model based on the hypothetical reference decoder (HRD). The scalability features of VVC are conceptually similar to the scalable variant of HEVC known as SHVC. The hierarchical syntax and data unit structure consists of parameter sets at various levels (decoder, sequence (pertaining to all layers), sequence (pertaining to a single layer), picture), picture-level header parameters, slice-level header parameters, and lower-level parameters.

   A number of key components that influenced the network abstraction layer design of VVC as well as this memo are described below.

   Decoding capability information

   The decoding capability information includes parameters that stay constant for the lifetime of a VVC bitstream, which in IETF terms can translate to a session. Such information includes profile, level, and sub-profile information to determine a maximum capability interoperability point that is guaranteed to be never exceeded, even if splicing of video sequences occurs within a session.
   It further includes constraint fields (most of which are flags), which can optionally be set to indicate that the video bitstream will be constrained in the use of certain features as indicated by the values of those fields. With this, a bitstream can be labeled as not using certain tools, which allows among other things for resource allocation in a decoder implementation.

   Video parameter set

   The video parameter set (VPS) pertains to one or more coded video sequences (CVSs) of multiple layers covering the same range of access units, and includes, among other information, decoding dependency expressed as information for reference picture list construction of enhancement layers. The VPS provides a "big picture" of a scalable sequence, including what types of operation points are provided, the profile, tier, and level of the operation points, and some other high-level properties of the bitstream that can be used as the basis for session negotiation and content selection, etc. One VPS may be referenced by one or more sequence parameter sets.

   Sequence parameter set

   The sequence parameter set (SPS) contains syntax elements pertaining to a coded layer video sequence (CLVS), which is a group of pictures belonging to the same layer, starting with a random access point, and followed by pictures that may depend on each other, until the next random access point picture. In MPEG-2, the equivalent of a CVS was a group of pictures (GOP), which normally started with an I frame and was followed by P and B frames. While more complex in its options of random access points, VVC retains this basic concept.
   One remarkable difference of VVC is that a CLVS may start with a Gradual Decoding Refresh (GDR) picture, without requiring the presence of traditional random access points in the bitstream, such as instantaneous decoding refresh (IDR) or clean random access (CRA) pictures. In many TV-like applications, a CVS contains a few hundred milliseconds to a few seconds of video. In video conferencing (without switching MCUs involved), a CVS can be as long in duration as the whole session.

   Picture and adaptation parameter set

   The picture parameter set and the adaptation parameter set (PPS and APS, respectively) carry information pertaining to zero or more pictures and zero or more slices, respectively. The PPS contains information that is likely to stay constant from picture to picture, at least for pictures of a certain type, whereas the APS contains information, such as adaptive loop filter coefficients, that is likely to change from picture to picture or even within a picture. A single APS is referenced by all slices of the same picture if that APS contains information about luma mapping with chroma scaling (LMCS) or a scaling list. Different APSs containing ALF parameters can be referenced by slices of the same picture.

   Picture header

   A Picture Header contains information that is common to all slices that belong to the same picture. Being able to send that information as a separate NAL unit when pictures are split into several slices allows for saving bitrate, compared to repeating the same information in all slices. However, there might be scenarios where low-bitrate video is transmitted using a single slice per picture. Having a separate NAL unit to convey that information incurs overhead in such scenarios. For such scenarios, the picture header syntax structure is directly included in the slice header, instead of its own NAL unit.
   The mode of the picture header syntax structure being included in its own NAL unit or not can only be switched on/off for an entire CLVS, and can only be switched off when in the entire CLVS each picture contains only one slice.

   Profile, tier, and level

   The profile, tier, and level syntax structures in DCI, VPS, and SPS contain profile, tier, and level information for all layers that refer to the DCI, for layers associated with one or more output layer sets specified by the VPS, and for any layer that refers to the SPS, respectively.

   Sub-profiles

   Within the VVC specification, a sub-profile is a 32-bit number, coded according to ITU-T Rec. T.35, that does not carry any semantics. It is carried in the profile_tier_level structure and hence (potentially) present in the DCI, VPS, and SPS. External registration bodies can register a T.35 codepoint with ITU-T registration authorities and associate with their registration a description of bitstream restrictions beyond the profiles defined by ITU-T and ISO/IEC. This would allow encoder manufacturers to label the bitstreams generated by their encoder as complying with such a sub-profile. It is expected that upstream standardization organizations (such as DVB and ATSC), as well as walled-garden video services, will take advantage of this labeling system. In contrast to "normal" profiles, it is expected that sub-profiles may indicate encoder choices traditionally left open in the (decoder-centric) video coding specs, such as GOP structures, minimum/maximum QP values, and the mandatory use of certain tools or SEI messages.

   General constraint fields

   The profile_tier_level structure carries a considerable number of constraint fields (most of which are flags), which an encoder can use to indicate to a decoder that it will not use a certain tool or technology.
   They were included in reaction to a perceived market need for labeling a bitstream as not exercising a certain tool that has become commercially unviable.

   Temporal scalability support

   VVC includes support of temporal scalability, by inclusion of the signaling of TemporalId in the NAL unit header, the restriction that pictures of a particular temporal sublayer cannot be used for inter prediction reference by pictures of a lower temporal sublayer, the sub-bitstream extraction process, and the requirement that each sub-bitstream extraction output be a conforming bitstream. Media-Aware Network Elements (MANEs) can utilize the TemporalId in the NAL unit header for stream adaptation purposes based on temporal scalability.

   Reference picture resampling (RPR)

   In AVC and HEVC, the spatial resolution of pictures cannot change unless a new sequence using a new SPS starts, with an intra random access point (IRAP) picture. VVC enables picture resolution change within a sequence at a position without encoding an IRAP picture, which is always intra-coded. This feature is sometimes referred to as reference picture resampling (RPR), as the feature needs resampling of a reference picture used for inter prediction when that reference picture has a different resolution than the current picture being decoded. RPR allows resolution change without the need of coding an IRAP picture and hence avoids a momentary bit rate spike caused by an IRAP picture in streaming or video conferencing scenarios, e.g., to cope with network condition changes. RPR can also be used in application scenarios wherein zooming of the entire video region or some region of interest is needed.

   Spatial, SNR, and multiview scalability

   VVC includes support for spatial, SNR, and multiview scalability.
   Scalable video coding is widely considered to have technical benefits and enrich services for various video applications. Until recently, however, this functionality was not included in the first version of video codec specifications. In VVC, all those forms of scalability are supported natively in the first version through the signaling of the nuh_layer_id in the NAL unit header, the VPS which associates layers with given nuh_layer_id to each other, reference picture selection, reference picture resampling for spatial scalability, and a number of other mechanisms not relevant for this memo.

      Spatial scalability

         With the existence of Reference Picture Resampling (RPR), the additional burden for scalability support is just a modification of the high-level syntax (HLS). The inter-layer prediction is employed in a scalable system to improve the coding efficiency of the enhancement layers. In addition to the spatial and temporal motion-compensated predictions that are available in a single-layer codec, the inter-layer prediction in VVC uses the possibly resampled video data of the reconstructed reference picture from a reference layer to predict the current enhancement layer. The resampling process for inter-layer prediction, when used, is performed at the block level, reusing the existing interpolation process for motion compensation in single-layer coding. This means that no additional resampling process is needed to support spatial scalability.

      SNR scalability

         SNR scalability is similar to spatial scalability except that the resampling factors are 1:1. In other words, there is no change in resolution, but there is inter-layer prediction.
      Multiview scalability

         The first version of VVC also supports multiview scalability, wherein a multi-layer bitstream carries layers representing multiple views, and one or more of the represented views can be output at the same time.

   SEI messages

   Supplemental enhancement information (SEI) messages are information in the bitstream that does not influence the decoding process as specified in the VVC spec, but addresses issues of representation/rendering of the decoded bitstream and labels the bitstream for certain applications, among other, similar tasks. The overall concept of SEI messages and many of the messages themselves have been inherited from the AVC and HEVC specs. Except for the SEI messages that affect the specification of the hypothetical reference decoder (HRD), other SEI messages for use in the VVC environment, which are generally useful also in other video coding technologies, are not included in the main VVC specification but in a companion specification [VSEI].

1.1.3.  High-Level Picture Partitioning (informative)

   VVC inherited the concept of tiles and wavefront parallel processing (WPP) from HEVC, with some minor to moderate differences. The basic concept of slices was kept in VVC but designed in an essentially different form. VVC is the first video coding standard that includes subpictures as a feature, which provide the same functionality as HEVC motion-constrained tile sets (MCTSs) but are designed differently to have better coding efficiency and to be friendlier for usage in application systems. More details of these differences are described below.

   Tiles and WPP

   As in HEVC, a picture can be split into tile rows and tile columns in VVC, in-picture prediction across tile boundaries is disallowed, etc.
   However, the syntax for signaling of tile partitioning has been simplified, by using a unified syntax design for both the uniform and the non-uniform mode. In addition, signaling of entry point offsets for tiles in the slice header is optional in VVC while it is mandatory in HEVC. The WPP design in VVC has two differences compared to HEVC: i) the CTU row delay is reduced from two CTUs to one CTU; ii) signaling of entry point offsets for WPP in the slice header is optional in VVC while it is mandatory in HEVC.

   Slices

   In VVC, the conventional slices based on CTUs (as in HEVC) or macroblocks (as in AVC) have been removed. The main reasoning behind this architectural change is as follows. The advances in video coding since 2003 (the publication year of AVC v1) have been such that slice-based error concealment has become practically impossible, due to the ever-increasing number and efficiency of in-picture and inter-picture prediction mechanisms. An error-concealed picture is the decoding result of a transmitted coded picture for which there is some data loss (e.g., loss of some slices) of the coded picture, or for which a reference picture for at least some part of the coded picture is not error-free (e.g., that reference picture was an error-concealed picture). For example, when one of the multiple slices of a picture is lost, it may be error-concealed using an interpolation of the neighboring slices. While advanced video coding prediction mechanisms provide significantly higher coding efficiency, they also make it harder for machines to estimate the quality of an error-concealed picture, which was already a hard problem with the use of simpler prediction mechanisms. Advanced in-picture prediction mechanisms also cause the coding efficiency loss due to splitting a picture into multiple slices to be more significant.
   Furthermore, network conditions have become significantly better, while at the same time techniques for dealing with packet losses have improved significantly. As a result, very few implementations have recently used slices for maximum transmission unit size matching. Instead, substantially all applications where low-delay error resilience is required (e.g., video telephony and video conferencing) rely on system/transport-level error resilience (e.g., retransmission, forward error correction) and/or picture-based error resilience tools (feedback-based error resilience, insertion of IRAPs, scalability with higher protection level of the base layer, and so on). Considering all the above, nowadays it is very rare that a picture that cannot be correctly decoded is passed to the decoder, and when such a rare case occurs, the system can afford to wait for an error-free picture to be decoded and available for display without resulting in frequent and long periods of picture freezing seen by end users.

   Slices in VVC have two modes: rectangular slices and raster-scan slices. The rectangular slice, as indicated by its name, covers a rectangular region of the picture. Typically, a rectangular slice consists of several complete tiles. However, it is also possible that a rectangular slice is a subset of a tile and consists of one or more consecutive, complete CTU rows within a tile. A raster-scan slice consists of one or more complete tiles in a tile raster scan order; hence, the region covered by a raster-scan slice need not be rectangular, although it may happen to have the shape of a rectangle. The concept of slices in VVC is therefore strongly linked to or based on tiles instead of CTUs (as in HEVC) or macroblocks (as in AVC).
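
   As an illustration of the raster-scan slice mode described above, the following Python sketch checks whether a run of consecutive tiles in tile raster-scan order happens to cover a rectangular region. The function name and its parameters (a tile-column count, a first tile index, and a tile count) are assumptions made for this example only; they are not syntax elements or processes defined by VVC.

```python
def raster_scan_slice_is_rectangular(tile_cols, first_tile, num_tiles):
    """Return True if a run of tiles in raster-scan order covers a rectangle.

    tile_cols  -- number of tile columns in the picture (assumed > 0)
    first_tile -- raster-scan index of the first tile in the slice
    num_tiles  -- number of consecutive tiles in the slice (assumed > 0)
    All three names are hypothetical and used for illustration only.
    """
    last_tile = first_tile + num_tiles - 1
    first_row, first_col = divmod(first_tile, tile_cols)
    last_row, last_col = divmod(last_tile, tile_cols)
    if first_row == last_row:
        # A run that stays inside one tile row is always a rectangle.
        return True
    # A run spanning several tile rows is a rectangle only if it starts
    # at the first tile column and ends at the last tile column.
    return first_col == 0 and last_col == tile_cols - 1
```

   For a picture with four tile columns, a slice made of tiles 0 to 7 covers two complete tile rows and is rectangular, whereas a slice made of tiles 2 to 5 wraps around the end of a tile row and is not.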
559 Subpictures 561 VVC is the first video coding standard that includes the support of 562 subpictures as a feature. Each subpicture consists of one or more 563 complete rectangular slices that collectively cover a rectangular 564 region of the picture. A subpicture may be either specified to be 565 extractable (i.e., coded independently of other subpictures of the 566 same picture and of earlier pictures in decoding order) or not 567 extractable. Regardless of whether a subpicture is extractable or 568 not, the encoder can control whether in-loop filtering (including 569 deblocking, SAO, and ALF) is applied across the subpicture boundaries 570 individually for each subpicture. 572 Functionally, subpictures are similar to the motion-constrained tile 573 sets (MCTSs) in HEVC. They both allow independent coding and 574 extraction of a rectangular subset of a sequence of coded pictures, 575 for use cases like viewport-dependent 360-degree video streaming 576 optimization and region of interest (ROI) applications. 578 There are several important design differences between subpictures 579 and MCTSs. First, the subpictures feature in VVC allows motion 580 vectors of a coding block to point outside of the subpicture, even 581 when the subpicture is extractable, by applying sample padding at 582 subpicture boundaries in this case, similarly to picture 583 boundaries. Second, additional changes were introduced for the 584 selection and derivation of motion vectors in the merge mode and in 585 the decoder-side motion vector refinement process of VVC. This 586 allows higher coding efficiency compared to the non-normative motion 587 constraints applied at the encoder side for MCTSs. Third, rewriting 588 of SHs (and PH NAL units, when present) is not needed when extracting 589 one or more extractable subpictures from a sequence of pictures to 590 create a sub-bitstream that is a conforming bitstream.
In sub- 591 bitstream extractions based on HEVC MCTSs, rewriting of SHs is 592 needed. Note that in both HEVC MCTSs extraction and VVC subpictures 593 extraction, rewriting of SPSs and PPSs is needed. However, typically 594 there are only a few parameter sets in a bitstream, while each 595 picture has at least one slice, therefore rewriting of SHs can be a 596 significant burden for application systems. Fourth, slices of 597 different subpictures within a picture are allowed to have different 598 NAL unit types. Fifth, VVC specifies HRD and level definitions for 599 subpicture sequences, thus the conformance of the sub-bitstream of 600 each extractable subpicture sequence can be ensured by encoders. 602 1.1.4. NAL Unit Header 604 VVC maintains the NAL unit concept of HEVC with modifications. VVC 605 uses a two-byte NAL unit header, as shown in Figure 1. The payload 606 of a NAL unit refers to the NAL unit excluding the NAL unit header. 608 +---------------+---------------+ 609 |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7| 610 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 611 |F|Z| LayerID | Type | TID | 612 +---------------+---------------+ 614 The Structure of the VVC NAL Unit Header. 616 Figure 1 618 The semantics of the fields in the NAL unit header are as specified 619 in VVC and described briefly below for convenience. In addition to 620 the name and size of each field, the corresponding syntax element 621 name in VVC is also provided. 623 F: 1 bit 625 forbidden_zero_bit. Required to be zero in VVC. Note that the 626 inclusion of this bit in the NAL unit header was to enable 627 transport of VVC video over MPEG-2 transport systems (avoidance of 628 start code emulations) [MPEG2S]. In the context of this memo the 629 value 1 may be used to indicate a syntax violation, e.g., for a 630 NAL unit resulted from aggregating a number of fragmented units of 631 a NAL unit but missing the last fragment, as described in the last 632 sentence of section 4.3.3. 
634 Z: 1 bit 636 nuh_reserved_zero_bit. Required to be zero in VVC, and reserved 637 for future extensions by ITU-T and ISO/IEC. 638 This memo does not overload the "Z" bit for local extensions, as 639 a) overloading the "F" bit is sufficient and b) to preserve the 640 usefulness of this memo to possible future versions of [VVC]. 642 LayerId: 6 bits 644 nuh_layer_id. Identifies the layer a NAL unit belongs to, wherein 645 a layer may be, e.g., a spatial scalable layer, a quality scalable 646 layer, a layer containing a different view, etc. 648 Type: 5 bits 650 nal_unit_type. This field specifies the NAL unit type as defined 651 in Table 5 of [VVC]. For a reference of all currently defined NAL 652 unit types and their semantics, please refer to Section 7.4.2.2 in 653 [VVC]. 655 TID: 3 bits 657 nuh_temporal_id_plus1. This field specifies the temporal 658 identifier of the NAL unit plus 1. The value of TemporalId is 659 equal to TID minus 1. A TID value of 0 is illegal to ensure that 660 there is at least one bit in the NAL unit header equal to 1, so to 661 enable the consideration of start code emulations in the NAL unit 662 payload data independent of the NAL unit header. 664 1.2. Overview of the Payload Format 666 This payload format defines the following processes required for 667 transport of VVC coded data over RTP [RFC3550]: 669 * Usage of RTP header with this payload format 670 * Packetization of VVC coded NAL units into RTP packets using three 671 types of payload structures: a single NAL unit packet, aggregation 672 packet, and fragment unit 674 * Transmission of VVC NAL units of the same bitstream within a 675 single RTP stream 677 * Media type parameters to be used with the Session Description 678 Protocol (SDP) [RFC8866] 680 * Usage of RTCP feedback messages 682 2. 
Conventions 684 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 685 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 686 "OPTIONAL" in this document are to be interpreted as described in BCP 687 14 [RFC2119] [RFC8174] when, and only when, they appear in all 688 capitals, as shown here. 690 3. Definitions and Abbreviations 692 3.1. Definitions 694 This document uses the terms and definitions of VVC. Section 3.1.1 695 lists relevant definitions from [VVC] for convenience. Section 3.1.2 696 provides definitions specific to this memo. All the used terms and 697 definitions in this memo are verbatim copies of [VVC] specification. 699 3.1.1. Definitions from the VVC Specification 701 Access unit (AU): A set of PUs that belong to different layers and 702 contain coded pictures associated with the same time for output from 703 the DPB. 705 Adaptation parameter set (APS): A syntax structure containing syntax 706 elements that apply to zero or more slices as determined by zero or 707 more syntax elements found in slice headers. 709 Bitstream: A sequence of bits, in the form of a NAL unit stream or a 710 byte stream, that forms the representation of a sequence of AUs 711 forming one or more coded video sequences (CVSs). 713 Coded picture: A coded representation of a picture comprising VCL NAL 714 units with a particular value of nuh_layer_id within an AU and 715 containing all CTUs of the picture. 717 Clean random access (CRA) PU: A PU in which the coded picture is a 718 CRA picture. 720 Clean random access (CRA) picture: An IRAP picture for which each VCL 721 NAL unit has nal_unit_type equal to CRA_NUT. 723 Coded video sequence (CVS): A sequence of AUs that consists, in 724 decoding order, of a CVSS AU, followed by zero or more AUs that are 725 not CVSS AUs, including all subsequent AUs up to but not including 726 any subsequent AU that is a CVSS AU. 
728 Coded video sequence start (CVSS) AU: An AU in which there is a PU 729 for each layer in the CVS and the coded picture in each PU is a CLVSS 730 picture. 732 Coded layer video sequence (CLVS): A sequence of PUs with the same 733 value of nuh_layer_id that consists, in decoding order, of a CLVSS 734 PU, followed by zero or more PUs that are not CLVSS PUs, including 735 all subsequent PUs up to but not including any subsequent PU that is 736 a CLVSS PU. 738 Coded layer video sequence start (CLVSS) PU: A PU in which the coded 739 picture is a CLVSS picture. 741 Coded layer video sequence start (CLVSS) picture: A coded picture 742 that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or 743 a GDR picture with NoOutputBeforeRecoveryFlag equal to 1. 745 Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs 746 of chroma samples of a picture that has three sample arrays, or a CTB 747 of samples of a monochrome picture or a picture that is coded using 748 three separate colour planes and syntax structures used to code the 749 samples. 751 Decoding Capability Information (DCI): A syntax structure containing 752 syntax elements that apply to the entire bitstream. 754 Decoded picture buffer (DPB): A buffer holding decoded pictures for 755 reference, output reordering, or output delay specified for the 756 hypothetical reference decoder. 758 Gradual decoding refresh (GDR) picture: A picture for which each VCL 759 NAL unit has nal_unit_type equal to GDR_NUT. 761 Instantaneous decoding refresh (IDR) PU: A PU in which the coded 762 picture is an IDR picture. 764 Instantaneous decoding refresh (IDR) picture: An IRAP picture for 765 which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or 766 IDR_N_LP. 768 Intra random access point (IRAP) AU: An AU in which there is a PU for 769 each layer in the CVS and the coded picture in each PU is an IRAP 770 picture. 
772 Intra random access point (IRAP) PU: A PU in which the coded picture 773 is an IRAP picture. 775 Intra random access point (IRAP) picture: A coded picture for which 776 all VCL NAL units have the same value of nal_unit_type in the range 777 of IDR_W_RADL to CRA_NUT, inclusive. 779 Layer: A set of VCL NAL units that all have a particular value of 780 nuh_layer_id and the associated non-VCL NAL units. 782 Network abstraction layer (NAL) unit: A syntax structure containing 783 an indication of the type of data to follow and bytes containing that 784 data in the form of an RBSP interspersed as necessary with emulation 785 prevention bytes. 787 Network abstraction layer (NAL) unit stream: A sequence of NAL units. 789 Output Layer Set (OLS): A set of layers for which one or more layers 790 are specified as the output layers. 792 Operation point (OP): A temporal subset of an OLS, identified by an 793 OLS index and a highest value of TemporalId. 795 Picture parameter set (PPS): A syntax structure containing syntax 796 elements that apply to zero or more entire coded pictures as 797 determined by a syntax element found in each slice header. 799 Picture unit (PU): A set of NAL units that are associated with each 800 other according to a specified classification rule, are consecutive 801 in decoding order, and contain exactly one coded picture. 803 Random access: The act of starting the decoding process for a 804 bitstream at a point other than the beginning of the stream. 806 Sequence parameter set (SPS): A syntax structure containing syntax 807 elements that apply to zero or more entire CLVSs as determined by the 808 content of a syntax element found in the PPS referred to by a syntax 809 element found in each picture header. 811 Slice: An integer number of complete tiles or an integer number of 812 consecutive complete CTU rows within a tile of a picture that are 813 exclusively contained in a single NAL unit. 
815 Slice header (SH): A part of a coded slice containing the data 816 elements pertaining to all tiles or CTU rows within a tile 817 represented in the slice. 819 Sublayer: A temporal scalable layer of a temporal scalable bitstream 820 consisting of VCL NAL units with a particular value of the TemporalId 821 variable, and the associated non-VCL NAL units. 823 Subpicture: A rectangular region of one or more slices within a 824 picture. 826 Sublayer representation: A subset of the bitstream consisting of NAL 827 units of a particular sublayer and the lower sublayers. 829 Tile: A rectangular region of CTUs within a particular tile column 830 and a particular tile row in a picture. 832 Tile column: A rectangular region of CTUs having a height equal to 833 the height of the picture and a width specified by syntax elements in 834 the picture parameter set. 836 Tile row: A rectangular region of CTUs having a height specified by 837 syntax elements in the picture parameter set and a width equal to the 838 width of the picture. 840 Video coding layer (VCL) NAL unit: A collective term for coded slice 841 NAL units and the subset of NAL units that have reserved values of 842 nal_unit_type that are classified as VCL NAL units in this 843 Specification. 845 3.1.2. Definitions Specific to This Memo 847 Media-Aware Network Element (MANE): A network element, such as a 848 middlebox, selective forwarding unit, or application-layer gateway 849 that is capable of parsing certain aspects of the RTP payload headers 850 or the RTP payload and reacting to their contents. 852 Informative note: The concept of a MANE goes beyond normal routers 853 or gateways in that a MANE has to be aware of the signaling (e.g., 854 to learn about the payload type mappings of the media streams), 855 and in that it has to be trusted when working with Secure RTP 856 (SRTP). The advantage of using MANEs is that they allow packets 857 to be dropped according to the needs of the media coding.
For 858 example, if a MANE has to drop packets due to congestion on a 859 certain link, it can identify and remove those packets whose 860 elimination produces the least adverse effect on the user 861 experience. After dropping packets, MANEs must rewrite RTCP 862 packets to match the changes to the RTP stream, as specified in 863 Section 7 of [RFC3550]. 865 NAL unit decoding order: A NAL unit order that conforms to the 866 constraints on NAL unit order given in Section 7.4.2.4 in [VVC] 867 and that follows the order of NAL units in the bitstream. 869 RTP stream (See [RFC7656]): Within the scope of this memo, one RTP 870 stream is utilized to transport a VVC bitstream, which may contain 871 one or more layers, and each layer may contain one or more temporal 872 sublayers. 874 Transmission order: The order of packets in ascending RTP sequence 875 number order (in modulo arithmetic). Within an aggregation packet, 876 the NAL unit transmission order is the same as the order of 877 appearance of NAL units in the packet. 879 3.2. Abbreviations 881 AU Access Unit 883 AP Aggregation Packet 885 APS Adaptation Parameter Set 887 CTU Coding Tree Unit 889 CVS Coded Video Sequence 891 DPB Decoded Picture Buffer 893 DCI Decoding Capability Information 895 DON Decoding Order Number 897 FIR Full Intra Request 899 FU Fragmentation Unit 901 GDR Gradual Decoding Refresh 903 HRD Hypothetical Reference Decoder 905 IDR Instantaneous Decoding Refresh 906 IRAP Intra Random Access Point 908 MANE Media-Aware Network Element 910 MTU Maximum Transfer Unit 912 NAL Network Abstraction Layer 914 NALU Network Abstraction Layer Unit 916 OLS Output Layer Set 918 PLI Picture Loss Indication 920 PPS Picture Parameter Set 922 RPSI Reference Picture Selection Indication 924 SEI Supplemental Enhancement Information 926 SLI Slice Loss Indication 928 SPS Sequence Parameter Set 930 VCL Video Coding Layer 932 VPS Video Parameter Set 934 4. RTP Payload Format 936 4.1.
RTP Header Usage 938 The format of the RTP header is specified in [RFC3550] (reprinted as 939 Figure 2 for convenience). This payload format uses the fields of 940 the header in a manner consistent with that specification. 942 The RTP payload (and the settings for some RTP header bits) for 943 aggregation packets and fragmentation units are specified in 944 Section 4.3.2 and Section 4.3.3, respectively. 946 0 1 2 3 947 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 948 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 949 |V=2|P|X| CC |M| PT | sequence number | 950 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 951 | timestamp | 952 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 953 | synchronization source (SSRC) identifier | 954 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ 955 | contributing source (CSRC) identifiers | 956 | .... | 957 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 959 RTP Header According to [RFC3550] 961 Figure 2 963 The RTP header information to be set according to this RTP payload 964 format is set as follows: 966 Marker bit (M): 1 bit 968 Set for the last packet, in transmission order, among each set of 969 packets that contain NAL units of one access unit. This is in 970 line with the normal use of the M bit in video formats to allow an 971 efficient playout buffer handling. 973 Payload Type (PT): 7 bits 975 The assignment of an RTP payload type for this new packet format 976 is outside the scope of this document and will not be specified 977 here. The assignment of a payload type has to be performed either 978 through the profile used or in a dynamic way. 980 Sequence Number (SN): 16 bits 982 Set and used in accordance with [RFC3550]. 984 Timestamp: 32 bits 985 The RTP timestamp is set to the sampling timestamp of the content. 986 A 90 kHz clock rate MUST be used. 
If the NAL unit has no timing 987 properties of its own (e.g., parameter set and SEI NAL units), the 988 RTP timestamp MUST be set to the RTP timestamp of the coded 989 pictures of the access unit in which the NAL unit (according to 990 Section 7.4.2.4 of [VVC]) is included. Receivers MUST use the RTP 991 timestamp for the display process, even when the bitstream 992 contains picture timing SEI messages or decoding unit information 993 SEI messages as specified in [VVC]. 995 Informative note: When picture timing SEI messages are present, 996 the RTP sender is responsible to ensure that the RTP timestamps 997 are consistent with the timing information carried in the 998 picture timing SEI messages. 1000 Synchronization source (SSRC): 32 bits 1002 Used to identify the source of the RTP packets. A single SSRC is 1003 used for all parts of a single bitstream. 1005 4.2. Payload Header Usage 1007 The first two bytes of the payload of an RTP packet are referred to 1008 as the payload header. The payload header consists of the same 1009 fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown 1010 in Section 1.1.4, irrespective of the type of the payload structure. 1012 The TID value indicates (among other things) the relative importance 1013 of an RTP packet, for example, because NAL units belonging to higher 1014 temporal sublayers are not used for the decoding of lower temporal 1015 sublayers. A lower value of TID indicates a higher importance. 1016 More-important NAL units MAY be better protected against transmission 1017 losses than less-important NAL units. 1019 4.3. Payload Structures 1021 Three different types of RTP packet payload structures are specified. 1022 A receiver can identify the type of an RTP packet payload through the 1023 Type field in the payload header. 
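The following is a minimal, non-normative sketch of how a receiver might read the two-byte payload header and select the payload structure from the Type field. It assumes the bit layout of Figure 1 and the Type values 28 (aggregation packet) and 29 (fragmentation unit) given in Sections 4.3.2 and 4.3.3; the names PayloadHeader, parse_payload_header, and payload_structure are hypothetical and not defined by this memo.

```python
from dataclasses import dataclass

AP_TYPE = 28  # Type value of an aggregation packet (Section 4.3.2)
FU_TYPE = 29  # Type value of a fragmentation unit (Section 4.3.3)


@dataclass(frozen=True)
class PayloadHeader:
    """Hypothetical container for the fields F, Z, LayerId, Type, and TID."""
    f: int
    z: int
    layer_id: int
    type: int
    tid: int


def parse_payload_header(payload: bytes) -> PayloadHeader:
    """Split the first two payload bytes into the fields of Figure 1."""
    if len(payload) < 2:
        raise ValueError("payload is shorter than the two-byte payload header")
    b0, b1 = payload[0], payload[1]
    return PayloadHeader(
        f=b0 >> 7,            # bit 0 of the first byte
        z=(b0 >> 6) & 0x01,   # bit 1 of the first byte
        layer_id=b0 & 0x3F,   # low 6 bits of the first byte
        type=b1 >> 3,         # high 5 bits of the second byte
        tid=b1 & 0x07,        # low 3 bits of the second byte
    )


def payload_structure(hdr: PayloadHeader) -> str:
    """Map the Type field to one of the three payload structures."""
    if hdr.type == AP_TYPE:
        return "aggregation packet"
    if hdr.type == FU_TYPE:
        return "fragmentation unit"
    return "single NAL unit packet"
```

A caller would typically run parse_payload_header on the RTP payload first and branch on payload_structure before any further depacketization.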
1025 The three different payload structures are as follows: 1027 * Single NAL unit packet: Contains a single NAL unit in the payload, 1028 and the NAL unit header of the NAL unit also serves as the payload 1029 header. This payload structure is specified in Section 4.3.1. 1031 * Aggregation Packet (AP): Contains more than one NAL unit within 1032 one access unit. This payload structure is specified in 1033 Section 4.3.2. 1035 * Fragmentation Unit (FU): Contains a subset of a single NAL unit. 1036 This payload structure is specified in Section 4.3.3. 1038 4.3.1. Single NAL Unit Packets 1040 A single NAL unit packet contains exactly one NAL unit, and consists 1041 of a payload header (denoted as PayloadHdr), a conditional 16-bit 1042 DONL field (in network byte order), and the NAL unit payload data 1043 (the NAL unit excluding its NAL unit header) of the contained NAL 1044 unit, as shown in Figure 3. 1046 0 1 2 3 1047 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1048 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1049 | PayloadHdr | DONL (conditional) | 1050 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1051 | | 1052 | NAL unit payload data | 1053 | | 1054 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1055 | :...OPTIONAL RTP padding | 1056 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1058 The Structure of a Single NAL Unit Packet 1060 Figure 3 1062 The DONL field, when present, specifies the value of the 16 least 1063 significant bits of the decoding order number of the contained NAL 1064 unit. If sprop-max-don-diff is greater than 0, the DONL field MUST 1065 be present, and the variable DON for the contained NAL unit is 1066 derived as equal to the value of the DONL field. Otherwise (sprop- 1067 max-don-diff is equal to 0), the DONL field MUST NOT be present. 1069 4.3.2.
Aggregation Packets (APs) 1071 Aggregation Packets (APs) can reduce packetization overhead for small 1072 NAL units, such as most of the non-VCL NAL units, which are often 1073 only a few octets in size. 1075 An AP aggregates NAL units of one access unit and it MUST NOT contain 1076 NAL units from more than one AU. Each NAL unit to be carried in an 1077 AP is encapsulated in an aggregation unit. NAL units aggregated in 1078 one AP are included in NAL unit decoding order. 1080 An AP consists of a payload header (denoted as PayloadHdr) followed 1081 by two or more aggregation units, as shown in Figure 4. 1083 0 1 2 3 1084 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1085 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1086 | PayloadHdr (Type=28) | | 1087 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 1088 | | 1089 | two or more aggregation units | 1090 | | 1091 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1092 | :...OPTIONAL RTP padding | 1093 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1095 The Structure of an Aggregation Packet 1097 Figure 4 1099 The fields in the payload header of an AP are set as follows. The F 1100 bit MUST be equal to 0 if the F bit of each aggregated NAL unit is 1101 equal to zero; otherwise, it MUST be equal to 1. The Type field MUST 1102 be equal to 28. 1104 The value of LayerId MUST be equal to the lowest value of LayerId of 1105 all the aggregated NAL units. The value of TID MUST be the lowest 1106 value of TID of all the aggregated NAL units. 1108 Informative note: All VCL NAL units in an AP have the same TID 1109 value since they belong to the same access unit. However, an AP 1110 may contain non-VCL NAL units for which the TID value in the NAL 1111 unit header may be different than the TID value of the VCL NAL 1112 units in the same AP. 
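The rules above for the AP payload header can be sketched as follows. This is an illustrative sketch only: the function name ap_payload_header is hypothetical, each input is assumed to be a complete NAL unit starting with its two-byte NAL unit header, and the Z bit is assumed to be written as 0.

```python
from typing import Sequence

AP_TYPE = 28  # Type value required for an aggregation packet


def ap_payload_header(nal_units: Sequence[bytes]) -> bytes:
    """Derive the two-byte AP payload header from the aggregated NAL units."""
    if len(nal_units) < 2:
        raise ValueError("an AP carries two or more aggregation units")
    f = 0
    layer_id = 0x3F  # start at the maximum 6-bit value, then take the minimum
    tid = 0x07       # start at the maximum 3-bit value, then take the minimum
    for nalu in nal_units:
        if len(nalu) < 2:
            raise ValueError("NAL unit is shorter than its two-byte header")
        f |= nalu[0] >> 7                        # F is 1 if any aggregated F bit is 1
        layer_id = min(layer_id, nalu[0] & 0x3F)  # lowest LayerId of all NAL units
        tid = min(tid, nalu[1] & 0x07)            # lowest TID of all NAL units
    # Z bit (bit 6 of the first byte) is left at 0 in this sketch.
    return bytes([(f << 7) | layer_id, (AP_TYPE << 3) | tid])
```

Taking the minimum over all aggregated NAL units, rather than only the VCL NAL units, matters because non-VCL NAL units in the same AP may carry a different TID.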
1114 Informative Note: If a system envisions sub-picture level or 1115 picture level modifications, for example by removing sub-pictures 1116 or pictures of a particular layer, a good design choice on the 1117 sender's side would be to aggregate NAL units belonging to only 1118 the same sub-picture or picture of a particular layer. 1120 An AP MUST carry at least two aggregation units and can carry as many 1121 aggregation units as necessary; however, the total amount of data in 1122 an AP obviously MUST fit into an IP packet, and the size SHOULD be 1123 chosen so that the resulting IP packet is smaller than the MTU size 1124 so to avoid IP layer fragmentation. An AP MUST NOT contain FUs 1125 specified in Section 4.3.3. APs MUST NOT be nested; i.e., an AP can 1126 not contain another AP. 1128 The first aggregation unit in an AP consists of a conditional 16-bit 1129 DONL field (in network byte order) followed by a 16-bit unsigned size 1130 information (in network byte order) that indicates the size of the 1131 NAL unit in bytes (excluding these two octets, but including the NAL 1132 unit header), followed by the NAL unit itself, including its NAL unit 1133 header, as shown in Figure 5. 1135 0 1 2 3 1136 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1137 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1138 | : DONL (conditional) | NALU size | 1139 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1140 | NALU size | | 1141 +-+-+-+-+-+-+-+-+ NAL unit | 1142 | | 1143 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1144 | : 1145 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1147 The Structure of the First Aggregation Unit in an AP 1149 Figure 5 1151 The DONL field, when present, specifies the value of the 16 least 1152 significant bits of the decoding order number of the aggregated NAL 1153 unit. 
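As an illustration of the layout in Figure 5, the sketch below serializes the first aggregation unit of an AP. It is a hedged example rather than a reference implementation: the function name pack_first_aggregation_unit and its don parameter are hypothetical, and the caller is assumed to pass don only when the DONL field is to be present.

```python
import struct
from typing import Optional


def pack_first_aggregation_unit(nal_unit: bytes, don: Optional[int] = None) -> bytes:
    """Serialize [DONL (conditional)] + 16-bit NALU size + NAL unit."""
    if not 2 <= len(nal_unit) <= 0xFFFF:
        raise ValueError("NAL unit size must fit the 16-bit size field")
    out = bytearray()
    if don is not None:
        # DONL carries the 16 least significant bits of the decoding order number.
        out += struct.pack("!H", don & 0xFFFF)
    # The size counts the NAL unit including its header, excluding these two octets.
    out += struct.pack("!H", len(nal_unit))
    out += nal_unit
    return bytes(out)
```

Both 16-bit fields use network byte order, which is what the "!H" format of struct.pack produces.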
1155 If sprop-max-don-diff is greater than 0, the DONL field MUST be 1156 present in an aggregation unit that is the first aggregation unit in 1157 an AP, and the variable DON for the aggregated NAL unit is derived as 1158 equal to the value of the DONL field. The variable DON for the NAL unit in an 1159 aggregation unit that is not the first aggregation unit in an AP 1160 is derived as equal to the DON of the preceding 1161 aggregated NAL unit in the same AP plus 1 modulo 65536. Otherwise 1162 (sprop-max-don-diff is equal to 0), the DONL field MUST NOT be 1163 present in an aggregation unit that is the first aggregation unit in 1164 an AP. 1166 An aggregation unit that is not the first aggregation unit in an AP 1167 consists of a 16-bit unsigned size information 1168 (in network byte order) that indicates the size of the NAL unit in 1169 bytes (excluding these two octets, but including the NAL unit 1170 header), followed by the NAL unit itself, including its NAL unit 1171 header, as shown in Figure 6. 1173 0 1 2 3 1174 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1175 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1176 | : NALU size | NAL unit | 1177 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 1178 | | 1179 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1180 | : 1181 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1183 The Structure of an Aggregation Unit That Is Not the First 1184 Aggregation Unit in an AP 1186 Figure 6 1188 Figure 7 presents an example of an AP that contains two aggregation 1189 units, labeled as 1 and 2 in the figure, without the DONL field being 1190 present.
1192 0 1 2 3 1193 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1194 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1195 | RTP Header | 1196 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1197 | PayloadHdr (Type=28) | NALU 1 Size | 1198 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1199 | NALU 1 HDR | | 1200 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 1 Data | 1201 | . . . | 1202 | | 1203 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1204 | . . . | NALU 2 Size | NALU 2 HDR | 1205 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1206 | NALU 2 HDR | | 1207 +-+-+-+-+-+-+-+-+ NALU 2 Data | 1208 | . . . | 1209 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1210 | :...OPTIONAL RTP padding | 1211 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1213 An Example of an AP Packet Containing 1214 Two Aggregation Units without the DONL Field 1216 Figure 7 1218 Figure 8 presents an example of an AP that contains two aggregation 1219 units, labeled as 1 and 2 in the figure, with the DONL field being 1220 present. 1222 0 1 2 3 1223 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1224 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1225 | RTP Header | 1226 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1227 | PayloadHdr (Type=28) | NALU 1 DONL | 1228 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1229 | NALU 1 Size | NALU 1 HDR | 1230 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1231 | | 1232 | NALU 1 Data . . . | 1233 | | 1234 + . . . +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1235 | : NALU 2 Size | 1236 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1237 | NALU 2 HDR | | 1238 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 2 Data | 1239 | | 1240 | . . . 
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1241 | :...OPTIONAL RTP padding | 1242 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1244 An Example of an AP Containing 1245 Two Aggregation Units with the DONL Field 1247 Figure 8 1249 4.3.3. Fragmentation Units 1251 Fragmentation Units (FUs) are introduced to enable fragmenting a 1252 single NAL unit into multiple RTP packets, possibly without 1253 cooperation or knowledge of the [VVC] encoder. A fragment of a NAL 1254 unit consists of an integer number of consecutive octets of that NAL 1255 unit. Fragments of the same NAL unit MUST be sent in consecutive 1256 order with ascending RTP sequence numbers (with no other RTP packets 1257 within the same RTP stream being sent between the first and last 1258 fragment). 1260 When a NAL unit is fragmented and conveyed within FUs, it is referred 1261 to as a fragmented NAL unit. APs MUST NOT be fragmented. FUs MUST 1262 NOT be nested; i.e., an FU can not contain a subset of another FU. 1264 The RTP timestamp of an RTP packet carrying an FU is set to the NALU- 1265 time of the fragmented NAL unit. 1267 An FU consists of a payload header (denoted as PayloadHdr), an FU 1268 header of one octet, a conditional 16-bit DONL field (in network byte 1269 order), and an FU payload, as shown in Figure 9. 1271 0 1 2 3 1272 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1273 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1274 | PayloadHdr (Type=29) | FU header | DONL (cond) | 1275 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| 1276 | DONL (cond) | | 1277 |-+-+-+-+-+-+-+-+ | 1278 | FU payload | 1279 | | 1280 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1281 | :...OPTIONAL RTP padding | 1282 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1284 The Structure of an FU 1286 Figure 9 1288 The fields in the payload header are set as follows. The Type field 1289 MUST be equal to 29. 
The fields F, LayerId, and TID MUST be equal to 1290 the fields F, LayerId, and TID, respectively, of the fragmented NAL 1291 unit. 1293 The FU header consists of an S bit, an E bit, a P bit, and a 5-bit 1294 FuType field, as shown in Figure 10. 1296 +---------------+ 1297 |0|1|2|3|4|5|6|7| 1298 +-+-+-+-+-+-+-+-+ 1299 |S|E|P| FuType | 1300 +---------------+ 1302 The Structure of FU Header 1304 Figure 10 1306 The semantics of the FU header fields are as follows: 1308 S: 1 bit 1310 When set to 1, the S bit indicates the start of a fragmented NAL 1311 unit, i.e., the first byte of the FU payload is also the first 1312 byte of the payload of the fragmented NAL unit. When the FU 1313 payload is not the start of the fragmented NAL unit payload, the S 1314 bit MUST be set to 0. 1316 E: 1 bit 1318 When set to 1, the E bit indicates the end of a fragmented NAL 1319 unit, i.e., the last byte of the FU payload is also the last byte of 1320 the fragmented NAL unit. When the FU payload is not the last 1321 fragment of a fragmented NAL unit, the E bit MUST be set to 0. 1323 P: 1 bit 1325 When set to 1, the P bit indicates the last FU of the last VCL NAL 1326 unit of a coded picture, i.e., the last byte of the FU payload is 1327 also the last byte of the last VCL NAL unit of the coded picture. 1328 When the FU payload is not the last fragment of the last VCL NAL 1329 unit of a coded picture, the P bit MUST be set to 0. 1331 FuType: 5 bits 1333 The field FuType MUST be equal to the field Type of the fragmented 1334 NAL unit. 1336 The DONL field, when present, specifies the value of the 16 least 1337 significant bits of the decoding order number of the fragmented NAL 1338 unit. 1340 If sprop-max-don-diff is greater than 0, and the S bit is equal to 1, 1341 the DONL field MUST be present in the FU, and the variable DON for 1342 the fragmented NAL unit is derived as equal to the value of the DONL 1343 field.
   Otherwise (sprop-max-don-diff is equal to 0, or the S bit is
   equal to 0), the DONL field MUST NOT be present in the FU.

   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
   the Start bit and End bit must not both be set to 1 in the same FU
   header.

   The FU payload consists of fragments of the payload of the fragmented
   NAL unit so that if the FU payloads of consecutive FUs, starting with
   an FU with the S bit equal to 1 and ending with an FU with the E bit
   equal to 1, are sequentially concatenated, the payload of the
   fragmented NAL unit can be reconstructed.  The NAL unit header of the
   fragmented NAL unit is not included as such in the FU payload, but
   rather the information of the NAL unit header of the fragmented NAL
   unit is conveyed in the F, LayerId, and TID fields of the FU payload
   headers of the FUs and the FuType field of the FU header of the FUs.
   An FU payload MUST NOT be empty.

   If an FU is lost, the receiver SHOULD discard all following
   fragmentation units in transmission order corresponding to the same
   fragmented NAL unit, unless the decoder in the receiver is known to
   be prepared to gracefully handle incomplete NAL units.

   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
   fragments of a NAL unit to an (incomplete) NAL unit, even if fragment
   n of that NAL unit is not received.  In this case, the
   forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a
   syntax violation.

4.4.  Decoding Order Number

   For each NAL unit, the variable AbsDon is derived, representing the
   decoding order number that is indicative of the NAL unit decoding
   order.

   Let NAL unit n be the n-th NAL unit in transmission order within an
   RTP stream.

   If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon
   for NAL unit n, is derived as equal to n.

   Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is
   derived as follows, where DON[n] is the value of the variable DON for
   NAL unit n:

   *  If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in
      transmission order), AbsDon[0] is set equal to DON[0].

   *  Otherwise (n is greater than 0), the following applies for
      derivation of AbsDon[n]:

         If DON[n] == DON[n-1],
            AbsDon[n] = AbsDon[n-1]

         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768),
            AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1]

         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768),
            AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n]

         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768),
            AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - DON[n])

         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768),
            AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n])

   For any two NAL units m and n, the following applies:

   *  AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows
      NAL unit m in NAL unit decoding order.

   *  When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order
      of the two NAL units can be in either order.

   *  AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes
      NAL unit m in decoding order.

      Informative note: When two consecutive NAL units in the NAL
      unit decoding order have different values of AbsDon, the
      absolute difference between the two AbsDon values may be
      greater than or equal to 1.

      Informative note: There are multiple reasons to allow for the
      absolute difference of the values of AbsDon for two consecutive
      NAL units in the NAL unit decoding order to be greater than
      one.  An increment by one is not required, as at the time of
      associating values of AbsDon to NAL units, it may not be known
      whether all NAL units are to be delivered to the receiver.
      For example, a gateway might not forward VCL NAL units of higher
      sublayers or some SEI NAL units when there is congestion in the
      network.  In another example, the first intra-coded picture of
      a pre-encoded clip is transmitted in advance to ensure that it
      is readily available in the receiver, and when transmitting the
      first intra-coded picture, the originator does not exactly know
      how many NAL units will be encoded before the first intra-coded
      picture of the pre-encoded clip follows in decoding order.
      Thus, the values of AbsDon for the NAL units of the first
      intra-coded picture of the pre-encoded clip have to be
      estimated when they are transmitted, and gaps in values of
      AbsDon may occur.

5.  Packetization Rules

   The following packetization rules apply:

   *  If sprop-max-don-diff is greater than 0, the transmission order of
      NAL units carried in the RTP stream MAY be different than the NAL
      unit decoding order.  Otherwise (sprop-max-don-diff is equal to
      0), the transmission order of NAL units carried in the RTP stream
      MUST be the same as the NAL unit decoding order.

   *  A NAL unit of a small size SHOULD be encapsulated in an
      aggregation packet together with one or more other NAL units in
      order to avoid the unnecessary packetization overhead for small
      NAL units.  For example, non-VCL NAL units such as access unit
      delimiters, parameter sets, or SEI NAL units are typically small
      and can often be aggregated with VCL NAL units without violating
      MTU size constraints.

   *  Each non-VCL NAL unit SHOULD, when possible from an MTU size match
      viewpoint, be encapsulated in an aggregation packet together with
      its associated VCL NAL unit, as typically a non-VCL NAL unit would
      be meaningless without the associated VCL NAL unit being
      available.

   *  For carrying exactly one NAL unit in an RTP packet, a single NAL
      unit packet MUST be used.
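
      Informative note: The following Python sketch is not part of
      this specification.  It illustrates, under stated assumptions,
      how a sender might choose between a single NAL unit packet and
      the FUs of Section 4.3.3.  It assumes sprop-max-don-diff equal
      to 0 (no DONL field is written), a two-byte NAL unit header
      laid out as F, Z, LayerId (first byte) and Type, TID (second
      byte), and it leaves out aggregation packets.  The names
      packetize, max_payload, and last_vcl_of_picture are
      hypothetical.

```python
FU_TYPE = 29  # payload header Type value used by Fragmentation Units


def packetize(nal: bytes, max_payload: int,
              last_vcl_of_picture: bool = False) -> list[bytes]:
    """Return the RTP payload(s) for one NAL unit (sketch, no DONL)."""
    if len(nal) <= max_payload:
        # Exactly one NAL unit in the packet: single NAL unit packet.
        return [nal]

    chunk = max_payload - 3  # room left after PayloadHdr (2) + FU header (1)
    if chunk < 1:
        raise ValueError("max_payload too small to carry an FU")

    nal_type = nal[1] >> 3  # 5-bit Type of the fragmented NAL unit
    # Payload header: first byte (F, Z, LayerId) copied, Type=29, TID copied.
    payload_hdr = bytes([nal[0], (FU_TYPE << 3) | (nal[1] & 0x07)])
    body = nal[2:]  # the NAL unit header itself is not carried in FU payloads
    pieces = [body[i:i + chunk] for i in range(0, len(body), chunk)]

    payloads = []
    for idx, piece in enumerate(pieces):
        s = idx == 0                    # first fragment
        e = idx == len(pieces) - 1      # last fragment
        p = e and last_vcl_of_picture   # last FU of the picture's last VCL NAL
        fu_hdr = (s << 7) | (e << 6) | (p << 5) | nal_type
        payloads.append(payload_hdr + bytes([fu_hdr]) + piece)
    return payloads
```

      Because fragmentation is only entered when the NAL unit does
      not fit into max_payload, the sketch always yields at least two
      FUs, so S and E are never both set in one FU header.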

6.  De-packetization Process

   The general concept behind de-packetization is to get the NAL units
   out of the RTP packets in an RTP stream and pass them to the decoder
   in the NAL unit decoding order.

   The de-packetization process is implementation dependent.  Therefore,
   the following description should be seen as an example of a suitable
   implementation.  Other schemes may be used as well, as long as the
   output for the same input is the same as the process described below.
   The output is the same when the set of output NAL units and their
   order are both identical.  Optimizations relative to the described
   algorithms are possible.

   All normal RTP mechanisms related to buffer management apply.  In
   particular, duplicated or outdated RTP packets (as indicated by the
   RTP sequence number and the RTP timestamp) are removed.  To determine
   the exact time for decoding, factors such as a possible intentional
   delay to allow for proper inter-stream synchronization MUST be
   factored in.

   NAL units with NAL unit type values in the range of 0 to 27,
   inclusive, may be passed to the decoder.  NAL-unit-like structures
   with NAL unit type values in the range of 28 to 31, inclusive, MUST
   NOT be passed to the decoder.

   The receiver includes a receiver buffer, which is used to compensate
   for transmission delay jitter within an individual RTP stream, and to
   reorder NAL units from transmission order to the NAL unit decoding
   order.  In this section, the receiver operation is described under
   the assumption that there is no transmission delay jitter within an
   RTP stream.  To distinguish it from a practical receiver buffer
   that is also used for compensation of transmission delay jitter, the
   receiver buffer is hereafter called the de-packetization buffer in
   this section.
   Receivers should also prepare for transmission delay
   jitter; that is, either reserve separate buffers for transmission
   delay jitter buffering and de-packetization buffering or use a
   receiver buffer for both transmission delay jitter and
   de-packetization.  Moreover, receivers should take transmission delay
   jitter into account in the buffering operation, e.g., by additional
   initial buffering before starting of decoding and playback.

   The de-packetization process extracts the NAL units from the RTP
   packets in an RTP stream as follows.  When an RTP packet carries a
   single NAL unit packet, the payload of the RTP packet is extracted as
   a single NAL unit, excluding the DONL field, i.e., third and fourth
   bytes, when sprop-max-don-diff is greater than 0.  When an RTP packet
   carries an Aggregation Packet, several NAL units are extracted from
   the payload of the RTP packet.  In this case, each NAL unit
   corresponds to the part of the payload of each aggregation unit that
   follows the NALU size field as described in Section 4.3.2.  When an
   RTP packet carries a Fragmentation Unit (FU), all RTP packets from
   the first FU (with the S field equal to 1) of the fragmented NAL unit
   up to the last FU (with the E field equal to 1) of the fragmented NAL
   unit are collected.  The NAL unit is extracted from these RTP packets
   by concatenating all FU payloads in the same order as the
   corresponding RTP packets and prepending the NAL unit header with the
   fields F, LayerId, and TID set equal to the values of the fields
   F, LayerId, and TID in the payload header of the FUs, respectively,
   and with the NAL unit type set equal to the value of the field FuType
   in the FU header of the FUs, as described in Section 4.3.3.
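
      Informative note: The following Python sketch is not part of
      this specification.  It illustrates the FU case of the
      extraction described above under the assumption that
      sprop-max-don-diff is equal to 0 (so no FU carries a DONL
      field) and that the caller has already collected the RTP
      payloads of one fragmented NAL unit in RTP sequence number
      order with no losses.  The function name reassemble_fus is
      hypothetical.

```python
def reassemble_fus(fus: list[bytes]) -> bytes:
    """Rebuild one NAL unit from its FU payloads (sketch, no DONL)."""
    # A fragmented NAL unit spans at least two FUs; each FU carries a
    # 2-byte PayloadHdr, a 1-byte FU header, and a non-empty FU payload.
    if len(fus) < 2 or any(len(fu) < 4 for fu in fus):
        raise ValueError("need at least two well-formed FUs")

    first, last = fus[0], fus[-1]
    if not first[2] & 0x80 or not last[2] & 0x40:
        raise ValueError("run must start with S=1 and end with E=1")
    if any(fu[2] & 0x80 for fu in fus[1:]) or any(fu[2] & 0x40 for fu in fus[:-1]):
        raise ValueError("unexpected S or E bit inside the run")

    fu_type = first[2] & 0x1F  # Type of the original NAL unit
    # Rebuild the NAL unit header: first byte (F, Z, LayerId) is taken
    # from the payload header, Type comes from FuType, TID is copied.
    nal_hdr = bytes([first[0], (fu_type << 3) | (first[1] & 0x07)])
    # Prepend the header to the concatenated FU payloads.
    return nal_hdr + b"".join(fu[3:] for fu in fus)
```

      A receiver that detects a missing fragment would not call such
      a helper on the incomplete run unless its decoder is known to
      handle incomplete NAL units, as discussed in Section 4.3.3.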

   When sprop-max-don-diff is equal to 0, the de-packetization buffer
   size is zero bytes, and the NAL units carried in the single RTP
   stream are directly passed to the decoder in their transmission
   order, which is identical to their decoding order.

   When sprop-max-don-diff is greater than 0, the process described in
   the remainder of this section applies.

   There are two buffering states in the receiver: initial buffering and
   buffering while playing.  Initial buffering starts when the reception
   is initialized.  After initial buffering, decoding and playback are
   started, and the buffering-while-playing mode is used.

   Regardless of the buffering state, the receiver stores incoming NAL
   units in reception order into the de-packetization buffer.  NAL units
   carried in RTP packets are stored in the de-packetization buffer
   individually, and the value of AbsDon is calculated and stored for
   each NAL unit.

   Initial buffering lasts until the difference between the greatest and
   smallest AbsDon values of the NAL units in the de-packetization
   buffer is greater than or equal to the value of sprop-max-don-diff.

   After initial buffering, whenever the difference between the greatest
   and smallest AbsDon values of the NAL units in the de-packetization
   buffer is greater than or equal to the value of sprop-max-don-diff,
   the following operation is repeatedly applied until this difference
   is smaller than sprop-max-don-diff:

   *  The NAL unit in the de-packetization buffer with the smallest
      value of AbsDon is removed from the de-packetization buffer and
      passed to the decoder.

   When no more NAL units are flowing into the de-packetization buffer,
   all NAL units remaining in the de-packetization buffer are removed
   from the buffer and passed to the decoder in the order of increasing
   AbsDon values.

7.  Payload Format Parameters

   This section specifies the optional parameters.  A mapping of the
   parameters with Session Description Protocol (SDP) [RFC4566] is also
   provided for applications that use SDP.

7.1.  Media Type Registration

   The receiver MUST ignore any parameter unspecified in this memo.

   Type name:  video

   Subtype name:  H266

   Required parameters:  N/A

   Optional parameters:

      profile-id, tier-flag, sub-profile-id, interop-constraints,
      level-id, sprop-sublayer-id, sprop-ols-id, recv-sublayer-id,
      recv-ols-id, max-recv-level-id, sprop-dci, sprop-vps, sprop-sps,
      sprop-pps, sprop-sei, max-lsr, max-fps, sprop-max-don-diff,
      sprop-depack-buf-bytes, depack-buf-cap (Refer to Section 7.2 for
      definitions).

   Encoding considerations:

      This type is only defined for transfer via RTP (RFC 3550).

   Security considerations:

      See Section 9 of RFC XXXX.

   Interoperability considerations:  N/A

   Published specification:

      Please refer to RFC XXXX and its Section 13.

   Applications that use this media type:

      Any application that relies on VVC-based video services over RTP

   Fragment identifier considerations:  N/A

   Additional information:  N/A

   Person & email address to contact for further information:

      Stephan Wenger (stewe@stewe.org)

   Intended usage:  COMMON

   Restrictions on usage:  N/A

   Author:  See Authors' Addresses section of RFC XXXX.

   Change controller:

      IETF Audio/Video Transport Core Maintenance Working Group
      delegated from the IESG.

7.2.  Optional Parameters Definition

   profile-id, tier-flag, sub-profile-id, interop-constraints, and
   level-id:

      These parameters indicate the profile, tier, default level,
      sub-profile, and some constraints of the bitstream carried by the
      RTP stream, or a specific set of the profile, tier, default level,
      sub-profile and some constraints the receiver supports.

      The subset of coding tools that may have been used to generate the
      bitstream or that the receiver supports, as well as some
      additional constraints, are indicated collectively by profile-id,
      sub-profile-id, and interop-constraints.

         Informative note: There are 128 values of profile-id.  The
         subset of coding tools identified by the profile-id can be
         further constrained with up to 255 instances of sub-profile-id.
         In addition, 68 bits included in interop-constraints, which can
         be extended up to 324 bits, provide means to further restrict
         tools from existing profiles.  To be able to support this
         fine-granular signaling of coding tool subsets with profile-id,
         sub-profile-id and interop-constraints, it would be safe to
         require symmetric use of these parameters in SDP offer/answer
         unless recv-ols-id is included in the SDP answer for choosing
         one of the layers offered.

      The tier is indicated by tier-flag.  The default level is
      indicated by level-id.  The tier and the default level specify the
      limits on values of syntax elements or arithmetic combinations of
      values of syntax elements that are followed when generating the
      bitstream or that the receiver supports.

      In SDP offer/answer, when the SDP answer does not include the
      recv-ols-id parameter that is less than the sprop-ols-id parameter
      in the SDP offer, the following applies:

      -  The tier-flag, profile-id, sub-profile-id, and
         interop-constraints parameters MUST be used symmetrically,
         i.e., the value of each of these parameters in the offer MUST
         be the same as that in the answer, either explicitly signaled
         or implicitly inferred.

      -  The level-id parameter is changeable as long as the highest
         level indicated by the answer is either equal to or lower than
         that in the offer.
         Note that a highest level higher than
         level-id in the offer for receiving can be included as
         max-recv-level-id.

      In SDP offer/answer, when the SDP answer does include the
      recv-ols-id parameter that is less than the sprop-ols-id parameter
      in the SDP offer, the set of tier-flag, profile-id,
      sub-profile-id, interop-constraints, and level-id parameters
      included in the answer MUST be consistent with that for the
      chosen output layer set as indicated in the SDP offer, with the
      exception that the level-id parameter in the SDP answer is
      changeable as long as the highest level indicated by the answer
      is either lower than or equal to that in the offer.

      More specifications of these parameters, including how they relate
      to syntax elements specified in [VVC], are provided below.

   profile-id:

      When profile-id is not present, a value of 1 (i.e., the Main 10
      profile) MUST be inferred.

      When used to indicate properties of a bitstream, profile-id is
      derived from the general_profile_idc syntax element that applies
      to the bitstream in an instance of the profile_tier_level( )
      syntax structure.

      VVC bitstreams transported over RTP using the technologies of this
      memo SHOULD contain only a single profile_tier_level( ) structure
      in the DCI, unless the sender can assure that a receiver can
      correctly decode the VVC bitstream regardless of which
      profile_tier_level( ) structure contained in the DCI was used for
      deriving profile-id and other parameters for the SDP offer/answer
      exchange.

      As specified in [VVC], a profile_tier_level( ) syntax structure
      may be contained in an SPS NAL unit, and one or more
      profile_tier_level( ) syntax structures may be contained in a VPS
      NAL unit and in a DCI NAL unit.
      One of the following three cases
      applies to the container NAL unit of the profile_tier_level( )
      syntax structure containing syntax elements used to derive the
      values of profile-id, tier-flag, level-id, sub-profile-id, or
      interop-constraints: 1) The container NAL unit is an SPS, the
      bitstream is a single-layer bitstream, and the
      profile_tier_level( ) syntax structures in all SPSs referenced by
      the CVSs in the bitstream have the same values respectively for
      those profile_tier_level( ) syntax elements; 2) The container NAL
      unit is a VPS, the profile_tier_level( ) syntax structure is the
      one in the VPS that applies to the OLS corresponding to the
      bitstream, and the profile_tier_level( ) syntax structures
      applicable to the OLS corresponding to the bitstream in all VPSs
      referenced by the CVSs in the bitstream have the same values
      respectively for those profile_tier_level( ) syntax elements;
      3) The container NAL unit is a DCI NAL unit and the
      profile_tier_level( ) syntax structures in all DCI NAL units in
      the bitstream have the same values respectively for those
      profile_tier_level( ) syntax elements.

      [VVC] allows for multiple profile_tier_level( ) structures in a
      DCI NAL unit, which may contain different values for the syntax
      elements used to derive the values of profile-id, tier-flag,
      level-id, sub-profile-id, or interop-constraints in the different
      entries.  However, this memo defines only a single profile-id,
      tier-flag, level-id, sub-profile-id, or interop-constraints.  When
      signaling these parameters and a DCI NAL unit is present with
      multiple profile_tier_level( ) structures, these values SHOULD be
      the same as the first profile_tier_level structure in the DCI,
      unless the sender has ensured that the receiver can decode the
      bitstream when a different value is chosen.

   tier-flag, level-id:

      The value of tier-flag MUST be in the range of 0 to 1, inclusive.
      The value of level-id MUST be in the range of 0 to 255, inclusive.

      If the tier-flag and level-id parameters are used to indicate
      properties of a bitstream, they indicate the tier and the highest
      level the bitstream complies with.

      If the tier-flag and level-id parameters are used for capability
      exchange, the following applies.  If max-recv-level-id is not
      present, the default level defined by level-id indicates the
      highest level the codec wishes to support.  Otherwise,
      max-recv-level-id indicates the highest level the codec supports
      for receiving.  For either receiving or sending, all levels that
      are lower than the highest level supported MUST also be supported.

      If no tier-flag is present, a value of 0 MUST be inferred; if no
      level-id is present, a value of 51 (i.e., level 3.1) MUST be
      inferred.

         Informative note: The level values currently defined in the VVC
         specification are in the form of "majorNum.minorNum", and the
         value of the level-id for each of the levels is equal to
         majorNum * 16 + minorNum * 3.  It is expected that if any
         levels are defined in the future, the same convention will be
         used, but this cannot be guaranteed.

      When used to indicate properties of a bitstream, the tier-flag and
      level-id parameters are derived respectively from the syntax
      element general_tier_flag, and the syntax element
      general_level_idc or sub_layer_level_idc[j], that apply to the
      bitstream, in an instance of the profile_tier_level( ) syntax
      structure.
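
         Informative note: The following Python sketch is not part of
         this specification.  It works through the level-id convention
         stated in the informative note above (majorNum * 16 +
         minorNum * 3).  The function names level_id and level_from_id
         are hypothetical, and the inverse mapping assumes that the
         convention holds for the value at hand, which the note above
         says cannot be guaranteed for future levels.

```python
def level_id(major: int, minor: int) -> int:
    """level-id for level "major.minor" under the stated convention."""
    return major * 16 + minor * 3


def level_from_id(value: int) -> tuple[int, int]:
    """Inverse mapping; rejects values that do not fit the convention."""
    major, rest = divmod(value, 16)
    if rest % 3 != 0:
        raise ValueError("level-id does not follow majorNum*16 + minorNum*3")
    return major, rest // 3


# The default inferred when level-id is absent is 51, i.e., level 3.1.
assert level_id(3, 1) == 51
assert level_from_id(51) == (3, 1)
```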

      If the tier-flag and level-id are derived from the
      profile_tier_level( ) syntax structure in a DCI NAL unit, the
      following applies:

      -  tier-flag = general_tier_flag

      -  level-id = general_level_idc

      Otherwise, if the tier-flag and level-id are derived from the
      profile_tier_level( ) syntax structure in an SPS or VPS NAL unit,
      and the bitstream contains the highest sublayer representation in
      the OLS corresponding to the bitstream, the following applies:

      -  tier-flag = general_tier_flag

      -  level-id = general_level_idc

      Otherwise, if the tier-flag and level-id are derived from the
      profile_tier_level( ) syntax structure in an SPS or VPS NAL
      unit, and the bitstream does not contain the highest sublayer
      representation in the OLS corresponding to the bitstream, the
      following applies, with j being the value of the
      sprop-sublayer-id parameter:

      -  tier-flag = general_tier_flag

      -  level-id = sub_layer_level_idc[j]

   sub-profile-id:

      The value of the parameter is a comma-separated (',') list of data
      using base64 [RFC4648] representation.

      When used to indicate properties of a bitstream, sub-profile-id is
      derived from each of the ptl_num_sub_profiles
      general_sub_profile_idc[i] syntax elements that apply to the
      bitstream in a profile_tier_level( ) syntax structure.

   interop-constraints:

      A base64 [RFC4648] representation of the data that includes the
      syntax elements ptl_frame_only_constraint_flag and
      ptl_multilayer_enabled_flag and the general_constraints_info( )
      syntax structure that apply to the bitstream in an instance of the
      profile_tier_level( ) syntax structure.

      If the interop-constraints parameter is not present, the following
      MUST be inferred:

      -  ptl_frame_only_constraint_flag = 1

      -  ptl_multilayer_enabled_flag = 0

      -  gci_present_flag in the general_constraints_info( ) syntax
         structure = 0

      Using interop-constraints for capability exchange results in a
      requirement on any bitstream to be compliant with the
      interop-constraints.

   sprop-sublayer-id:

      This parameter MAY be used to indicate the highest allowed value
      of TID in the bitstream.  When not present, the value of
      sprop-sublayer-id is inferred to be equal to 6.

      The value of sprop-sublayer-id MUST be in the range of 0 to 6,
      inclusive.

   sprop-ols-id:

      This parameter MAY be used to indicate the OLS that the bitstream
      applies to.  When not present, the value of sprop-ols-id is
      inferred to be equal to TargetOlsIdx as specified in 8.1.1 in
      [VVC].  If this optional parameter is present, sprop-vps MUST also
      be present or its content MUST be known a priori at the receiver.

      The value of sprop-ols-id MUST be in the range of 0 to 256,
      inclusive.

         Informative note: VVC allows having up to 257 output layer sets
         indicated in the VPS as the number of output layer sets minus 2
         is indicated with a field of 8 bits.

   recv-sublayer-id:

      This parameter MAY be used to signal a receiver's choice of the
      offered or declared sublayer representations in the sprop-vps and
      sprop-sps.  The value of recv-sublayer-id indicates the TID of the
      highest sublayer that a receiver supports.  When not present, the
      value of recv-sublayer-id is inferred to be equal to the value of
      the sprop-sublayer-id parameter in the SDP offer.

      The value of recv-sublayer-id MUST be in the range of 0 to 6,
      inclusive.

   recv-ols-id:

      This parameter MAY be used to signal a receiver's choice of the
      offered or declared output layer sets in the sprop-vps.
      The value
      of recv-ols-id indicates the OLS index of the bitstream that a
      receiver supports.  When not present, the value of recv-ols-id is
      inferred to be equal to the value of the sprop-ols-id parameter
      inferred from or indicated in the SDP offer.  When present, the
      value of recv-ols-id must be included only when sprop-ols-id was
      received and must refer to an output layer set in the VPS that
      includes no layers other than all or a subset of the layers of the
      OLS referred to by sprop-ols-id.  If this optional parameter is
      present, sprop-vps must have been received or its content must be
      known a priori at the receiver.

      The value of recv-ols-id MUST be in the range of 0 to 256,
      inclusive.

   max-recv-level-id:

      This parameter MAY be used to indicate the highest level a
      receiver supports.

      The value of max-recv-level-id MUST be in the range of 0 to 255,
      inclusive.

      When max-recv-level-id is not present, the value is inferred to be
      equal to level-id.

      max-recv-level-id MUST NOT be present when the highest level the
      receiver supports is not higher than the default level.

   sprop-dci:

      This parameter MAY be used to convey a decoding capability
      information NAL unit of the bitstream for out-of-band
      transmission.  The parameter MAY also be used for capability
      exchange.  The value of the parameter is a base64 [RFC4648]
      representation of the decoding capability information NAL unit as
      specified in Section 7.3.2.1 of [VVC].

   sprop-vps:

      This parameter MAY be used to convey any video parameter set NAL
      unit of the bitstream for out-of-band transmission of video
      parameter sets.  The parameter MAY also be used for capability
      exchange and to indicate sub-stream characteristics (i.e.,
      properties of output layer sets and sublayer representations as
      defined in [VVC]).
      The value of the parameter is a
      comma-separated (',') list of base64 [RFC4648] representations of
      the video parameter set NAL units as specified in Section 7.3.2.3
      of [VVC].

      The sprop-vps parameter MAY contain one or more than one video
      parameter set NAL units.  However, all other video parameter sets
      contained in the sprop-vps parameter MUST be consistent with the
      first video parameter set in the sprop-vps parameter.  A video
      parameter set vpsB is said to be consistent with another video
      parameter set vpsA if the number of OLSs in vpsA and vpsB is the
      same and any decoder that conforms to the profile, tier, level,
      and constraints indicated by the data starting from the syntax
      element general_profile_idc to the syntax structure
      general_constraints_info(), inclusive, in the
      profile_tier_level( ) syntax structure corresponding to any OLS
      with index olsIdx in vpsA can decode any CVS(s) referencing vpsB
      when TargetOlsIdx is equal to olsIdx that conforms to the profile,
      tier, level, and constraints indicated by the data starting from
      the syntax element general_profile_idc to the syntax structure
      general_constraints_info(), inclusive, in the
      profile_tier_level( ) syntax structure corresponding to the OLS
      with index TargetOlsIdx in vpsB.

   sprop-sps:

      This parameter MAY be used to convey sequence parameter set NAL
      units of the bitstream for out-of-band transmission of sequence
      parameter sets.  The value of the parameter is a comma-separated
      (',') list of base64 [RFC4648] representations of the sequence
      parameter set NAL units as specified in Section 7.3.2.4 of [VVC].
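
         Informative note: The following Python sketch is not part of
         this specification.  It shows one way a receiver might turn
         the value of a list-valued parameter such as sprop-vps or
         sprop-sps into individual NAL units by splitting at commas and
         decoding each item as base64.  The function name
         decode_sprop_list is hypothetical, and the sketch assumes that
         the value has already been extracted from the SDP fmtp line.

```python
import base64


def decode_sprop_list(value: str) -> list[bytes]:
    """Split a comma-separated sprop-* value and base64-decode each item."""
    nal_units = []
    for item in value.split(","):
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently discarding them.
        nal_units.append(base64.b64decode(item, validate=True))
    return nal_units
```

         Each returned byte string would then be handled like a NAL
         unit received out of band; checking its NAL unit type against
         the parameter it came from is left to the caller.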

      A sequence parameter set spsB is said to be consistent with
      another sequence parameter set spsA if any decoder that conforms
      to the profile, tier, level, and constraints indicated by the data
      starting from the syntax element general_profile_idc to the syntax
      structure general_constraints_info(), inclusive, in the
      profile_tier_level( ) syntax structure in spsA can decode any
      CLVS(s) referencing spsB that conforms to the profile, tier,
      level, and constraints indicated by the data starting from the
      syntax element general_profile_idc to the syntax structure
      general_constraints_info(), inclusive, in the
      profile_tier_level( ) syntax structure in spsB.

   sprop-pps:

      This parameter MAY be used to convey picture parameter set NAL
      units of the bitstream for out-of-band transmission of picture
      parameter sets.  The value of the parameter is a comma-separated
      (',') list of base64 [RFC4648] representations of the picture
      parameter set NAL units as specified in Section 7.3.2.5 of [VVC].

   sprop-sei:

      This parameter MAY be used to convey one or more SEI messages that
      describe bitstream characteristics.  When present, a decoder can
      rely on the bitstream characteristics that are described in the
      SEI messages for the entire duration of the session, independently
      from the persistence scopes of the SEI messages as specified in
      [VSEI].

      The value of the parameter is a comma-separated (',') list of
      base64 [RFC4648] representations of SEI NAL units as specified in
      [VSEI].

         Informative note: Intentionally, no list of applicable or
         inapplicable SEI messages is specified here.  Conveying certain
         SEI messages in sprop-sei may be sensible in some application
         scenarios and meaningless in others.
         However, a few examples
         are described below:

         1) In an environment where the bitstream was created from
         film-based source material, and no splicing is going to occur
         during the lifetime of the session, the film grain
         characteristics SEI message is likely meaningful, and sending
         it in sprop-sei rather than in the bitstream at each entry
         point may help with saving bits and allows one to configure the
         renderer only once, avoiding unwanted artifacts.

         2) Examples for SEI messages that would be meaningless to be
         conveyed in sprop-sei include the decoded picture hash SEI
         message (it is close to impossible that all decoded pictures
         have the same hash value) or the filler payload SEI message (as
         there is no point in just having more bits in SDP).

   max-lsr:

      The max-lsr MAY be used to signal the capabilities of a receiver
      implementation and MUST NOT be used for any other purpose.  The
      value of max-lsr is an integer indicating the maximum processing
      rate in units of luma samples per second.  The max-lsr parameter
      signals that the receiver is capable of decoding video at a higher
      rate than is required by the highest level.

         Informative note: When the OPTIONAL media type parameters are
         used to signal the properties of a bitstream, and max-lsr is
         not present, the values of tier-flag, profile-id,
         sub-profile-id, interop-constraints, and level-id must always
         be such that the bitstream complies fully with the specified
         profile, tier, and level.

      When max-lsr is signaled, the receiver MUST be able to decode
      bitstreams that conform to the highest level, with the exception
      that the MaxLumaSr value in Table 136 of [VVC] for the highest
      level is replaced with the value of max-lsr.  Senders MAY use this
      knowledge to send pictures of a given size at a higher picture
      rate than is indicated in the highest level.
2032 When not present, the value of max-lsr is inferred to be equal to 2033 the value of MaxLumaSr given in Table 136 of [VVC] for the highest 2034 level. 2036 The value of max-lsr MUST be in the range of MaxLumaSr to 16 * 2037 MaxLumaSr, inclusive, where MaxLumaSr is given in Table 136 of 2038 [VVC] for the highest level. 2040 max-fps: 2042 The value of max-fps is an integer indicating the maximum picture 2043 rate in units of pictures per 100 seconds that can be effectively 2044 processed by the receiver. The max-fps parameter MAY be used to 2045 signal that the receiver has a constraint in that it is not 2046 capable of processing video effectively at the full picture rate 2047 that is implied by the highest level and, when present, max-lsr. 2049 The value of max-fps is not necessarily the picture rate at which 2050 the maximum picture size can be sent; it constitutes a constraint 2051 on the maximum picture rate for all resolutions. 2053 Informative note: The max-fps parameter is semantically 2054 different from max-lsr in that max-fps is used to signal a 2055 constraint, lowering the maximum picture rate from what is 2056 implied by other parameters. 2058 The encoder MUST use a picture rate equal to or less than this 2059 value. In cases where the max-fps parameter is absent, the 2060 encoder is free to choose any picture rate according to the 2061 highest level and any signaled optional parameters. 2063 The value of max-fps MUST be smaller than or equal to the full 2064 picture rate that is implied by the highest level and, when 2065 present, max-lsr. 2067 sprop-max-don-diff: 2069 If there is no NAL unit naluA that is followed in transmission 2070 order by any NAL unit preceding naluA in decoding order (i.e., the 2071 transmission order of the NAL units is the same as the decoding 2072 order), the value of this parameter MUST be equal to 0.
2074 Otherwise, this parameter specifies the maximum absolute 2075 difference between the decoding order number (i.e., AbsDon) values 2076 of any two NAL units naluA and naluB, where naluA follows naluB in 2077 decoding order and precedes naluB in transmission order. 2079 The value of sprop-max-don-diff MUST be an integer in the range of 2080 0 to 32767, inclusive. 2082 When not present, the value of sprop-max-don-diff is inferred to 2083 be equal to 0. 2085 sprop-depack-buf-bytes: 2087 This parameter signals the required size of the de-packetization 2088 buffer in units of bytes. The value of the parameter MUST be 2089 greater than or equal to the maximum buffer occupancy (in units of 2090 bytes) of the de-packetization buffer as specified in Section 6. 2092 The value of sprop-depack-buf-bytes MUST be an integer in the 2093 range of 0 to 4294967295, inclusive. 2095 When sprop-max-don-diff is present and greater than 0, this 2096 parameter MUST be present and the value MUST be greater than 0. 2097 When not present, the value of sprop-depack-buf-bytes is inferred 2098 to be equal to 0. 2100 Informative note: The value of sprop-depack-buf-bytes indicates 2101 the required size of the de-packetization buffer only. When 2102 network jitter can occur, an appropriately sized jitter buffer 2103 has to be available as well. 2105 depack-buf-cap: 2107 This parameter signals the capabilities of a receiver 2108 implementation and indicates the amount of de-packetization buffer 2109 space in units of bytes that the receiver has available for 2110 reconstructing the NAL unit decoding order from NAL units carried 2111 in the RTP stream. A receiver is able to handle any RTP stream 2112 for which the value of the sprop-depack-buf-bytes parameter is 2113 smaller than or equal to this parameter. 2115 When not present, the value of depack-buf-cap is inferred to be 2116 equal to 4294967295. The value of depack-buf-cap MUST be an 2117 integer in the range of 1 to 4294967295, inclusive. 
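The value ranges, inferred defaults, and the capability comparison described for sprop-max-don-diff, sprop-depack-buf-bytes, and depack-buf-cap can be sketched in Python as follows. This is a minimal, non-normative sketch: the function name and the use of None to represent an absent parameter are assumptions made for illustration only.

```python
from typing import Optional

U32_MAX = 4294967295  # upper bound stated for both buffer parameters


def receiver_can_handle(sprop_max_don_diff: Optional[int],
                        sprop_depack_buf_bytes: Optional[int],
                        depack_buf_cap: Optional[int]) -> bool:
    """Return True if the stream's buffer need fits the receiver's capability.

    None means "parameter not present"; the inferred values from the text
    are applied. ValueError is raised when a stated constraint is violated.
    """
    don_diff = 0 if sprop_max_don_diff is None else sprop_max_don_diff
    if not 0 <= don_diff <= 32767:
        raise ValueError("sprop-max-don-diff out of range 0..32767")

    # When sprop-max-don-diff > 0, sprop-depack-buf-bytes must be present
    # and greater than 0.
    if don_diff > 0 and not sprop_depack_buf_bytes:
        raise ValueError("sprop-depack-buf-bytes must be present and > 0")

    buf = 0 if sprop_depack_buf_bytes is None else sprop_depack_buf_bytes
    if not 0 <= buf <= U32_MAX:
        raise ValueError("sprop-depack-buf-bytes out of range")

    cap = U32_MAX if depack_buf_cap is None else depack_buf_cap
    if not 1 <= cap <= U32_MAX:
        raise ValueError("depack-buf-cap out of range")

    # The receiver can handle the stream if the required buffer size does
    # not exceed its advertised capability.
    return buf <= cap
```

For example, under these assumptions a stream declaring sprop-max-don-diff=4 and sprop-depack-buf-bytes=8192 would not fit a receiver advertising depack-buf-cap=4096, whereas a stream with neither sprop parameter present fits any receiver.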
2119 Informative note: depack-buf-cap indicates the maximum possible 2120 size of the de-packetization buffer of the receiver only, 2121 without allowing for network jitter. 2123 7.3. SDP Parameters 2125 The receiver MUST ignore any parameter unspecified in this memo. 2127 7.3.1. Mapping of Payload Type Parameters to SDP 2129 The media type video/H266 string is mapped to fields in the Session 2130 Description Protocol (SDP) [RFC8866] as follows: 2132 * The media name in the "m=" line of SDP MUST be video. 2134 * The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the 2135 media subtype). 2137 * The clock rate in the "a=rtpmap" line MUST be 90000. 2139 * The OPTIONAL parameters profile-id, tier-flag, sub-profile-id, 2140 interop-constraints, level-id, sprop-sublayer-id, sprop-ols-id, 2141 recv-sublayer-id, recv-ols-id, max-recv-level-id, max-lsr, max- 2142 fps, sprop-max-don-diff, sprop-depack-buf-bytes, and depack-buf- 2143 cap, when present, MUST be included in the "a=fmtp" line of SDP. 2144 The fmtp line is expressed as a media type string, in the form of 2145 a semicolon-separated list of parameter=value pairs. 2147 * The OPTIONAL parameters sprop-vps, sprop-sps, sprop-pps, sprop-sei, 2148 and sprop-dci, when present, MUST be included in the "a=fmtp" line 2149 of SDP or conveyed using the "fmtp" source attribute as specified 2150 in Section 6.3 of [RFC5576]. For a particular media format (i.e., 2151 RTP payload type), sprop-vps, sprop-sps, sprop-pps, sprop-sei, or 2152 sprop-dci MUST NOT be both included in the "a=fmtp" line of SDP 2153 and conveyed using the "fmtp" source attribute. When included in 2154 the "a=fmtp" line of SDP, those parameters are expressed as a 2155 media type string, in the form of a semicolon-separated list of 2156 parameter=value pairs.
When conveyed in the "a=fmtp" line of SDP 2157 for a particular payload type, the parameters sprop-vps, sprop- 2158 sps, sprop-pps, sprop-sei, and sprop-dci MUST be applied to each 2159 SSRC with the payload type. When conveyed using the "fmtp" source 2160 attribute, these parameters are only associated with the given 2161 source and payload type as parts of the "fmtp" source attribute. 2163 Informative note: Conveyance of sprop-vps, sprop-sps, and 2164 sprop-pps using the "fmtp" source attribute allows for out-of- 2165 band transport of parameter sets in topologies like Topo-Video- 2166 switch-MCU as specified in [RFC7667]. 2168 A general usage of media representation in SDP is as follows: 2170 m=video 49170 RTP/AVP 98 2171 a=rtpmap:98 H266/90000 2172 a=fmtp:98 profile-id=1; 2173 sprop-vps=