avtcore                                                          S. Zhao
Internet-Draft                                                 S. Wenger
Intended status: Standards Track                                 Tencent
Expires: August 28, 2020                               February 25, 2020


          RTP Payload Format for Versatile Video Coding (VVC)
                     draft-ietf-avtcore-rtp-vvc-00

Abstract

   This memo describes an RTP payload format for the video coding
   standard ITU-T Recommendation H.266 and ISO/IEC International
   Standard 23090-3, both also known as Versatile Video Coding (VVC)
   and developed by the Joint Video Experts Team (JVET). The RTP
   payload format allows for packetization of one or more Network
   Abstraction Layer (NAL) units in each RTP packet payload as well as
   fragmentation of a NAL unit into multiple RTP packets. The payload
   format has wide applicability in videoconferencing, Internet video
   streaming, and high-bitrate entertainment-quality video, among other
   applications.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 28, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Overview of the VVC Codec
       1.1.1.  Coding-Tool Features (informative)
       1.1.2.  Systems and Transport Interfaces
       1.1.3.  Parallel Processing Support (informative)
       1.1.4.  NAL Unit Header
     1.2.  Overview of the Payload Format
   2.  Conventions
   3.  Definitions and Abbreviations
     3.1.  Definitions
       3.1.1.  Definitions from the VVC Specification
       3.1.2.  Definitions Specific to This Memo
     3.2.  Abbreviations
   4.  RTP Payload Format
     4.1.  RTP Header Usage
     4.2.  Payload Header Usage
     4.3.  Payload Structures
       4.3.1.  Single NAL Unit Packets
       4.3.2.  Aggregation Packets (APs)
       4.3.3.  Fragmentation Units
     4.4.  Decoding Order Number
   5.  Packetization Rules
   6.  De-packetization Process
   7.  Payload Format Parameters
   8.  Use with Feedback Messages
     8.1.  Picture Loss Indication (PLI)
     8.2.  Slice Loss Indication (SLI)
     8.3.  Reference Picture Selection Indication (RPSI)
     8.4.  Full Intra Request (FIR)
   9.  Frame marking
   10. Security Considerations
   11. Congestion Control
   12. IANA Considerations
   13. Acknowledgements
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Change History
   Authors' Addresses

1. Introduction

   The Versatile Video Coding [VVC] specification, formally published as
   both ITU-T Recommendation H.266 and ISO/IEC International Standard
   23090-3 [ISO23090-3], is currently in the ISO/IEC approval process
   and is planned for ratification in mid-2020. H.266 is reported to
   provide significant coding efficiency gains over H.265 and earlier
   video codec formats.

   This memo describes an RTP payload format for VVC. It shares its
   basic design with the NAL (Network Abstraction Layer) unit-based RTP
   payload formats of H.264 Video Coding [RFC6184], Scalable Video
   Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC)
   [RFC7798], and their respective predecessors.
   With respect to design philosophy, security, congestion control, and
   overall implementation complexity, it has similar properties to those
   earlier payload format specifications. This is a conscious choice,
   as at least RFC 6184 is widely deployed and generally known in the
   relevant implementer communities. Certain mechanisms known from
   [RFC6190] were incorporated in VVC, as VVC version 1 supports
   temporal, spatial, and signal-to-noise ratio (SNR) scalability.

1.1. Overview of the VVC Codec

   [VVC] and [HEVC] share a similar hybrid video codec design. In this
   memo, we provide a very brief overview of those features of VVC that
   are, in some form, addressed by the payload format specified herein.
   Implementers have to read, understand, and apply the ITU-T/ISO/IEC
   specifications pertaining to [VVC] to arrive at interoperable,
   well-performing implementations.

   Conceptually, both [VVC] and [HEVC] include a Video Coding Layer
   (VCL), which is often used to refer to the coding-tool features, and
   a NAL, which is often used to refer to the systems and transport
   interface aspects of the codecs.

1.1.1. Coding-Tool Features (informative)

   Coding-tool features are described below with occasional reference to
   the coding-tool set of [HEVC], which is well known in the community.

   Similar to earlier hybrid-video-coding-based standards, including
   HEVC, the following basic video coding design is employed by VVC. A
   prediction signal is first formed by either intra- or
   motion-compensated prediction, and the residual (the difference
   between the original and the prediction) is then coded. The gains in
   coding efficiency are achieved by redesigning and improving almost
   all parts of the codec over earlier designs. In addition, [VVC]
   includes several tools to make the implementation on parallel
   architectures easier.

   Finally, [VVC] includes temporal, spatial, and SNR scalability as
   well as multiview coding support.

   Coding blocks and transform structure

   Among the major coding-tool differences between HEVC and VVC, one of
   the important improvements is the more flexible coding tree structure
   in VVC, i.e., the multi-type tree. In addition to the quadtree,
   binary and ternary trees are also supported, which contributes a
   significant improvement in coding efficiency. Moreover, the maximum
   size of a Coding Tree Unit (CTU) is increased from 64x64 to 128x128.
   To improve the coding efficiency of the chroma signal, separate luma
   and chroma trees at the CTU level may be employed for intra slices.
   The square transforms in HEVC are extended to non-square transforms
   for rectangular blocks resulting from binary and ternary tree splits.
   In addition, [VVC] supports multiple transform sets (MTS), including
   DCT-2, DST-7, and DCT-8, as well as the non-separable secondary
   transform. The transforms used in [VVC] can have different sizes,
   with support for larger transform sizes. For DCT-2, the transform
   sizes range from 2x2 to 64x64, and for DST-7 and DCT-8, the transform
   sizes range from 4x4 to 32x32. [VVC] also supports sub-block
   transforms for both intra- and inter-coded blocks. For intra-coded
   blocks, intra sub-partitioning (ISP) may be used to allow
   sub-block-based intra prediction and transform. For inter-coded
   blocks, a sub-block transform may be used, assuming that only a part
   of an inter block has non-zero transform coefficients.

   Entropy coding

   Similar to HEVC, [VVC] uses a single entropy-coding engine, which is
   based on Context Adaptive Binary Arithmetic Coding (CABAC) [CABAC],
   but with support for multiple window sizes. The window sizes can be
   initialized differently for different context models. Due to this
   design, it adapts more efficiently and achieves better coding
   efficiency.
   A joint chroma residual coding scheme is applied to further exploit
   the correlation between the residuals of two color components. In
   VVC, different residual coding schemes are applied for regular
   transform coefficients and residual samples generated using
   transform-skip mode.

   In-loop filtering

   [VVC] has more feature support in loop filters than HEVC. The
   deblocking filter in [VVC] is similar to that of HEVC but operates on
   a smaller grid. After deblocking and sample adaptive offset (SAO),
   an adaptive loop filter (ALF) may be used. As a Wiener filter, ALF
   reduces distortion of decoded pictures. In addition, [VVC]
   introduces a new module before deblocking, called luma mapping with
   chroma scaling, to fully utilize the dynamic range of the signal so
   that the rate-distortion performance of both SDR and HDR content is
   improved.

   Motion prediction and coding

   Compared to HEVC, [VVC] introduces several improvements in this area.
   First, adaptive motion vector resolution (AMVR) can save bit cost for
   motion vectors by adaptively signaling the motion vector resolution.
   Second, affine motion compensation is included to capture complicated
   motion such as zooming and rotation. In addition, prediction
   refinement with optical flow for affine mode (PROF) is deployed to
   mimic affine motion at the pixel level. Third, decoder-side motion
   vector refinement (DMVR) is a method to derive motion vectors at the
   decoder side so that fewer bits need to be spent on motion vectors.
   Bi-directional optical flow (BDOF) is a method similar to DMVR but
   operates at the 4x4 sub-block level. Another difference is that DMVR
   is based on block matching, while BDOF derives motion vectors with
   equations. Furthermore, merge with motion vector difference (MMVD)
   is a special mode that further signals a limited set of motion vector
   differences on top of merge mode.
   In addition to MMVD, there are three other types of special merge
   modes, i.e., sub-block merge, triangle, and combined intra-/inter-
   prediction (CIIP). The sub-block merge list includes one candidate
   of sub-block temporal motion vector prediction (SbTMVP) and up to
   four candidates of affine motion vectors. Triangle is based on
   triangular block motion compensation. CIIP combines intra and inter
   predictions with weighting. Adaptive weighting may be employed with
   a block-level tool called bi-prediction with CU-based weighting
   (BCW), which provides more flexibility than in HEVC.

   Intra prediction and intra-coding

   To capture the diversified local image texture directions with finer
   granularity, [VVC] supports 65 angular directions instead of the 33
   directions in HEVC. Intra mode coding is based on a scheme with 6
   most probable modes, which are derived using the neighboring intra
   prediction directions. In addition, to deal with the different
   distributions of intra prediction angles for different block aspect
   ratios, a wide-angle intra prediction (WAIP) scheme is applied in
   [VVC] by including intra prediction angles beyond those present in
   HEVC. Unlike HEVC, which only allows using the most adjacent line of
   reference samples for intra prediction, [VVC] also allows using two
   further reference lines, known as multi-reference-line (MRL) intra
   prediction. The additional reference lines can only be used with the
   6 most probable intra prediction modes. To capture the strong
   correlation between different colour components, VVC uses a
   cross-component linear mode (CCLM), which assumes a linear
   relationship between the luma sample values and their associated
   chroma samples.
   For intra prediction, [VVC] also applies a position-dependent
   prediction combination (PDPC) for refining the prediction samples
   closer to the intra prediction block boundary. Matrix-based intra
   prediction (MIP) modes are also used in [VVC]; they generate an intra
   prediction block of up to 8x8 using a weighted sum of downsampled
   neighboring reference samples, where the weightings are hardcoded
   constants.

   Other coding-tool feature

   [VVC] introduces dependent quantization (DQ) to reduce quantization
   error by state-based switching between two quantizers.

1.1.2. Systems and Transport Interfaces

   [VVC] inherits the basic systems and transport interface designs
   from HEVC and H.264. These include the NAL-unit-based syntax
   structure, the hierarchical syntax and data unit structure, the
   Supplemental Enhancement Information (SEI) message mechanism, and the
   video buffering model based on the Hypothetical Reference Decoder
   (HRD). The scalability features of [VVC] are conceptually similar to
   the scalable variant of HEVC known as SHVC. The hierarchical syntax
   and data unit structure consists of parameter sets at various levels
   (decoder, sequence (pertaining to all layers), sequence (pertaining
   to a single layer), picture), slice-level header parameters, and
   lower-level parameters.

   A number of key components that influenced the Network Abstraction
   Layer design of [VVC] as well as this memo are described below.

   Decoding Capability Information

   The Decoding Capability Information (DCI) includes parameters that
   stay constant for the lifetime of a video bitstream, which in IETF
   terms can translate to the lifetime of a session. The DCI can
   include profile, level, and sub-profile information to determine a
   maximum-complexity interop point that is guaranteed never to be
   exceeded, even if splicing of video sequences occurs within a
   session.
   It further optionally includes constraint flags, which indicate that
   the video bitstream will be constrained in the use of certain
   features as indicated by the values of those flags. With this, a
   bitstream can be labelled as not using certain tools, which allows,
   among other things, for resource allocation in a decoder
   implementation.

   Video parameter set

   The Video Parameter Set (VPS) pertains to a Coded Video Sequence
   (CVS) of multiple layers covering the same range of picture units,
   and includes, among other information, decoding dependency expressed
   as information for reference picture set construction of enhancement
   layers. The VPS provides a "big picture" of a scalable sequence,
   including what types of operation points are provided, the profile,
   tier, and level of the operation points, and some other high-level
   properties of the bitstream that can be used as the basis for session
   negotiation and content selection, etc. One VPS may be referenced by
   one or more Sequence Parameter Sets.

   Sequence parameter set

   The Sequence Parameter Set (SPS) contains syntax elements pertaining
   to a coded layer video sequence (CLVS), which is a group of pictures
   belonging to the same layer, starting with a random access point, and
   followed by pictures that may depend on each other and the random
   access point picture. In MPEG-2, the equivalent of a CVS was a
   Group of Pictures (GOP), which normally started with an I frame and
   was followed by P and B frames. While more complex in its options of
   random access points, VVC retains this basic concept. In many
   TV-like applications, a CVS contains a few hundred milliseconds to a
   few seconds of video. In video conferencing (without switching MCUs
   involved), a CVS can be as long in duration as the whole session.

   Picture and Adaptation parameter set

   The Picture Parameter Set and the Adaptation Parameter Set (PPS and
   APS, respectively) carry information pertaining to zero or more
   pictures and zero or more slices, respectively. The PPS contains
   information that is likely to stay constant from picture to picture
   (at least for pictures of a certain type), whereas the APS contains
   information, such as adaptive loop filter coefficients, that is
   likely to change from picture to picture.

   Profile, tier, and level

   The profile, tier, and level syntax structures in the DCI, VPS, and
   SPS contain profile, tier, and level information for all layers that
   refer to the DCI, for layers associated with one or more output layer
   sets specified by the VPS, and for the lowest layer among the layers
   that refer to the SPS, respectively.

   Sub-Profiles

   Within the [VVC] specification, a sub-profile is a 32-bit number
   coded according to ITU-T Rec. T.35 that does not itself carry any
   semantics. It is carried in the profile_tier_level structure and
   hence (potentially) present in the DCI, VPS, and SPS. External
   registration bodies can register a T.35 codepoint with ITU-T
   registration authorities and associate with their registration a
   description of bitstream complexity restrictions beyond the profiles
   defined by ITU-T and ISO/IEC. This would allow encoder manufacturers
   to label the bitstreams generated by their encoder as complying with
   such a sub-profile. It is expected that upstream standardization
   organizations (such as DVB and ATSC), as well as walled-garden video
   services, will take advantage of this labelling system. In contrast
   to "normal" profiles, it is expected that sub-profiles may indicate
   encoder choices traditionally left open in the (decoder-centric)
   video coding specs, such as GOP structures, minimum/maximum QP
   values, and the mandatory use of certain tools or SEI messages.

   Constraint Flags

   The profile_tier_level structure optionally carries a considerable
   number of constraint flags, which an encoder can use to indicate to a
   decoder that it will not use a certain tool or technology. They were
   included in reaction to a perceived market need for labelling a
   bitstream as not exercising a certain tool that has become
   commercially unviable.

   Temporal scalability support

      Editor notes: this needs to be updated along with new VVC drafts
      in the future

   [VVC] includes support for temporal scalability through the signaling
   of TemporalId in the NAL unit header, the restriction that pictures
   of a particular temporal sub-layer cannot be used for inter
   prediction reference by pictures of a lower temporal sub-layer, the
   sub-bitstream extraction process, and the requirement that each
   sub-bitstream extraction output be a conforming bitstream.
   Media-Aware Network Elements (MANEs) can utilize the TemporalId in
   the NAL unit header for stream adaptation purposes based on temporal
   scalability.

   Spatial, SNR, View Scalability

   [VVC] includes support for spatial, SNR, and view scalability.
   Scalable video coding is widely considered to have technical benefits
   and to enrich services for various video applications. Until
   recently, however, the functionality has not been included in the
   main profiles of video codecs and has not been widely deployed due to
   additional costs. In VVC, however, all those forms of scalability
   are supported natively through the signaling of the layer_id in the
   NAL unit header, the VPS which associates layers with given layer_ids
   to each other, reference picture selection, reference picture
   resampling for spatial scalability, and a number of other mechanisms
   not relevant for this memo. Scalability support can be implemented
   in a single decoding "loop" and is widely considered a comparatively
   lightweight operation.
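
   As an illustration of the MANE behaviour described above, the
   following Python sketch reads nuh_layer_id and TemporalId from the
   two-byte NAL unit header of Section 1.1.4 (Figure 1) and drops NAL
   units above a target temporal sub-layer. The names NalHeader,
   parse_nal_header, keep_up_to_temporal_id, and max_tid are assumptions
   made for this example only; they are not defined by [VVC] or by this
   memo, and the bit layout is the one shown in Figure 1 of this draft
   version. A real MANE would also have to handle RTP payload
   structures, which this sketch ignores.

```python
from typing import Iterable, List, NamedTuple


class NalHeader(NamedTuple):
    forbidden_zero_bit: int  # F, 1 bit
    reserved_zero_bit: int   # Z, 1 bit
    layer_id: int            # nuh_layer_id, 6 bits
    nal_unit_type: int       # nal_unit_type, 5 bits
    temporal_id: int         # nuh_temporal_id_plus1 minus 1


def parse_nal_header(nal: bytes) -> NalHeader:
    """Split the first two bytes of a NAL unit into the Figure 1 fields."""
    if len(nal) < 2:
        raise ValueError("NAL unit shorter than the two-byte header")
    b0, b1 = nal[0], nal[1]
    tid_plus1 = b1 & 0x07
    if tid_plus1 == 0:
        # A TID value of 0 is illegal (see Section 1.1.4).
        raise ValueError("nuh_temporal_id_plus1 must not be 0")
    return NalHeader(
        forbidden_zero_bit=b0 >> 7,
        reserved_zero_bit=(b0 >> 6) & 0x01,
        layer_id=b0 & 0x3F,
        nal_unit_type=b1 >> 3,
        temporal_id=tid_plus1 - 1,
    )


def keep_up_to_temporal_id(nal_units: Iterable[bytes],
                           max_tid: int) -> List[bytes]:
    """Keep only NAL units whose TemporalId does not exceed max_tid."""
    return [n for n in nal_units
            if parse_nal_header(n).temporal_id <= max_tid]
```

   For example, keep_up_to_temporal_id(units, 0) would forward only the
   lowest temporal sub-layer. Because pictures of a sub-layer are never
   used as reference by pictures of a lower sub-layer, the remaining
   NAL units can still be decoded.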

      Spatial Scalability

         With the existence of Reference Picture Resampling in the
         "main" profile of VVC, the additional burden for scalability
         support is just a minor modification of the high-level syntax
         (HLS). Technically, inter-layer prediction is employed in a
         scalable system to improve the coding efficiency of the
         enhancement layers. In addition to the spatial and temporal
         motion-compensated predictions that are available in a
         single-layer codec, the inter-layer prediction in [VVC] uses
         the resampled video data of the reconstructed reference picture
         from a reference layer to predict the current enhancement
         layer. The resampling process for inter-layer prediction is
         then performed at the block level, by modifying the existing
         interpolation process for motion compensation. This means that
         no additional resampling process is needed to support
         scalability.

      SNR Scalability

         SNR scalability is similar to spatial scalability except that
         the resampling factors are 1:1; in other words, there is no
         change in resolution, but there is inter-layer prediction.

   SEI Messages

   Supplemental Enhancement Information (SEI) messages are codepoints
   in the bitstream that do not influence the decoding process as
   specified in the [VVC] spec, but address issues of representation/
   rendering of the decoded bitstream, label the bitstream for certain
   applications, and perform other, similar tasks. The overall concept
   of SEI messages and many of the messages themselves have been
   inherited from the H.264 and HEVC specs. In the [VVC] environment,
   some of the SEI messages considered to be generally useful also in
   other video coding technologies have been moved out of the main
   specification into a companion document (TO DO: add reference once
   ITU designation is known).

1.1.3. Parallel Processing Support (informative)

   Compared to HEVC, the [VVC] design to support parallelization offers
   numerous improvements. Some of those improvements are still
   undergoing changes in JVET. Information, to the extent relevant for
   this memo, will be added in future versions of this memo as the
   standardization in JVET progresses and the technology stabilizes.

      Editor notes: update on sub-picture/slice/tile is needed following
      new VVC draft

1.1.4. NAL Unit Header

   [VVC] maintains the NAL unit concept of HEVC with modifications. VVC
   uses a two-byte NAL unit header, as shown in Figure 1. The payload
   of a NAL unit refers to the NAL unit excluding the NAL unit header.

            +---------------+---------------+
            |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |F|Z|  LayerID  |  Type   | TID |
            +---------------+---------------+

              The Structure of the VVC NAL Unit Header.

                               Figure 1

   The semantics of the fields in the NAL unit header are as specified
   in [VVC] and described briefly below for convenience. In addition to
   the name and size of each field, the corresponding syntax element
   name in [VVC] is also provided.

   F: 1 bit

      forbidden_zero_bit. Required to be zero in VVC. Note that the
      inclusion of this bit in the NAL unit header was to enable
      transport of [VVC] video over MPEG-2 transport systems (avoidance
      of start code emulations) [MPEG2S]. In the context of this memo,
      the value 1 may be used to indicate a syntax violation, e.g., for
      a NAL unit resulting from aggregating a number of fragmented units
      of a NAL unit but missing the last fragment, as described in
      Section TBD.

   Z: 1 bit

      nuh_reserved_zero_bit. Required to be zero in VVC, and reserved
      for future extensions by ITU-T and ISO/IEC.
      This memo does not overload the "Z" bit for local extensions, as
      a) overloading the "F" bit is sufficient and b) to preserve the
      usefulness of this memo to possible future versions of [VVC].

   LayerId: 6 bits

      nuh_layer_id. Identifies the layer a NAL unit belongs to, wherein
      a layer may be, e.g., a spatial scalable layer or a quality
      scalable layer.

   Type: 5 bits

      nal_unit_type. This field specifies the NAL unit type as defined
      in Table 7-1 of [VVC]. For a reference of all currently defined
      NAL unit types and their semantics, please refer to
      Section 7.4.2.2 in [VVC].

   TID: 3 bits

      nuh_temporal_id_plus1. This field specifies the temporal
      identifier of the NAL unit plus 1. The value of TemporalId is
      equal to TID minus 1. A TID value of 0 is illegal to ensure that
      there is at least one bit in the NAL unit header equal to 1, so as
      to enable independent considerations of start code emulations in
      the NAL unit header and in the NAL unit payload data.

1.2. Overview of the Payload Format

   This payload format defines the following processes required for
   transport of [VVC] coded data over RTP [RFC3550]:

   o  Usage of RTP header with this payload format

   o  Packetization of [VVC] coded NAL units into RTP packets using
      three types of payload structures: a single NAL unit packet,
      aggregation packet, and fragmentation unit

   o  Transmission of [VVC] NAL units of the same bitstream within a
      single RTP stream

   o  Media type parameters to be used with the Session Description
      Protocol (SDP) [RFC4566]

   o  Frame-marking mapping [FrameMarking]

2. Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown above.

3. Definitions and Abbreviations

3.1. Definitions

   This document uses the terms and definitions of [VVC]. Section 3.1.1
   lists relevant definitions from [VVC] for convenience. Section 3.1.2
   provides definitions specific to this memo.

3.1.1. Definitions from the VVC Specification

      Editor notes:

   Access unit (AU): A set of PUs that belong to different layers and
   contain coded pictures associated with the same time for output from
   the DPB.

   Adaptation parameter set (APS): A syntax structure containing syntax
   elements that apply to zero or more slices as determined by zero or
   more syntax elements found in slice headers.

   Bitstream: A sequence of bits, in the form of a NAL unit stream or a
   byte stream, that forms the representation of a sequence of AUs
   forming one or more coded video sequences (CVSs).

   Coded picture: A coded representation of a picture comprising VCL NAL
   units with a particular value of nuh_layer_id within an AU and
   containing all CTUs of the picture.

   Clean random access (CRA) PU: A PU in which the coded picture is a
   CRA picture.

   Clean random access (CRA) picture: An IRAP picture for which each VCL
   NAL unit has nal_unit_type equal to CRA_NUT.

   Coded video sequence (CVS): A sequence of AUs that consists, in
   decoding order, of a CVSS AU, followed by zero or more AUs that are
   not CVSS AUs, including all subsequent AUs up to but not including
   any subsequent AU that is a CVSS AU.

   Coded video sequence start (CVSS) AU: An AU in which there is a PU
   for each layer in the CVS and the coded picture in each PU is a CLVSS
   picture.

   Coded layer video sequence (CLVS): A sequence of PUs with the same
   value of nuh_layer_id that consists, in decoding order, of a CLVSS
   PU, followed by zero or more PUs that are not CLVSS PUs, including
   all subsequent PUs up to but not including any subsequent PU that is
   a CLVSS PU.

   Coded layer video sequence start (CLVSS) PU: A PU in which the coded
   picture is a CLVSS picture.

   Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs
   of chroma samples of a picture that has three sample arrays, or a CTB
   of samples of a monochrome picture or a picture that is coded using
   three separate colour planes and syntax structures used to code the
   samples.

   Decoding Capability Information (DCI): A syntax structure containing
   syntax elements that apply to the entire bitstream.

   Decoded picture buffer (DPB): A buffer holding decoded pictures for
   reference, output reordering, or output delay specified for the
   hypothetical reference decoder.

   Instantaneous decoding refresh (IDR) PU: A PU in which the coded
   picture is an IDR picture.

   Instantaneous decoding refresh (IDR) picture: An IRAP picture for
   which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or
   IDR_N_LP.

   Intra random access point (IRAP) AU: An AU in which there is a PU for
   each layer in the CVS and the coded picture in each PU is an IRAP
   picture.

   Intra random access point (IRAP) PU: A PU in which the coded picture
   is an IRAP picture.

   Layer: A set of VCL NAL units that all have a particular value of
   nuh_layer_id and the associated non-VCL NAL units.

   Network abstraction layer (NAL) unit: A syntax structure containing
   an indication of the type of data to follow and bytes containing that
   data in the form of an RBSP interspersed as necessary with emulation
   prevention bytes.

   Network abstraction layer (NAL) unit stream: A sequence of NAL units.

   Operation point (OP): A temporal subset of an OLS, identified by an
   OLS index and a highest value of TemporalId.

   Picture parameter set (PPS): A syntax structure containing syntax
   elements that apply to zero or more entire coded pictures as
   determined by a syntax element found in each slice header.
625 Picture unit (PU): A set of NAL units that are associated with each 626 other according to a specified classification rule, are consecutive 627 in decoding order, and contain exactly one coded picture. 629 Random access: The act of starting the decoding process for a 630 bitstream at a point other than the beginning of the stream. 632 Sequence parameter set (SPS): A syntax structure containing syntax 633 elements that apply to zero or more entire CLVSs as determined by the 634 content of a syntax element found in the PPS referred to by a syntax 635 element found in each picture header. 637 Slice: An integer number of complete tiles or an integer number of 638 consecutive complete CTU rows within a tile of a picture that are 639 exclusively contained in a single NAL unit. 641 Sub-layer: A temporal scalable layer of a temporal scalable bitstream 642 consisting of VCL NAL units with a particular value of the TemporalId 643 variable, and the associated non-VCL NAL units. 645 Subpicture: A rectangular region of one or more slices within a 646 picture. 648 Sub-layer representation: A subset of the bitstream consisting of NAL 649 units of a particular sub-layer and the lower sub-layers. 651 Tile: A rectangular region of CTUs within a particular tile column 652 and a particular tile row in a picture. 654 Tile column: A rectangular region of CTUs having a height equal to 655 the height of the picture and a width specified by syntax elements in 656 the picture parameter set. 658 Tile row: A rectangular region of CTUs having a height specified by 659 syntax elements in the picture parameter set and a width equal to the 660 width of the picture. 662 Video coding layer (VCL) NAL unit: A collective term for coded slice 663 NAL units and the subset of NAL units that have reserved values of 664 nal_unit_type that are classified as VCL NAL units in this 665 Specification. 667 3.1.2. 
Definitions Specific to This Memo 669 Media-Aware Network Element (MANE): A network element, such as a 670 middlebox, selective forwarding unit, or application-layer gateway 671 that is capable of parsing certain aspects of the RTP payload headers 672 or the RTP payload and reacting to their contents. 674 Editor Notes: the following informative note needs to be updated along 675 with the frame marking update 677 Informative note: The concept of a MANE goes beyond normal routers 678 or gateways in that a MANE has to be aware of the signaling (e.g., 679 to learn about the payload type mappings of the media streams), 680 and in that it has to be trusted when working with Secure RTP 681 (SRTP). The advantage of using MANEs is that they allow packets 682 to be dropped according to the needs of the media coding. For 683 example, if a MANE has to drop packets due to congestion on a 684 certain link, it can identify and remove those packets whose 685 elimination produces the least adverse effect on the user 686 experience. After dropping packets, MANEs must rewrite RTCP 687 packets to match the changes to the RTP stream, as specified in 688 Section 7 of [RFC3550]. 690 NAL unit decoding order: A NAL unit order that conforms to the 691 constraints on NAL unit order given in Section 7.4.2.4 in [VVC], 692 i.e., the order of NAL units in the bitstream. 694 NAL unit output order: A NAL unit order in which NAL units of 695 different access units are in the output order of the decoded 696 pictures corresponding to the access units, as specified in [VVC], 697 and in which NAL units within an access unit are in their decoding 698 order. 700 RTP stream: See [RFC7656]. Within the scope of this memo, one RTP 701 stream is utilized to transport one or more temporal sub-layers. 703 Transmission order: The order of packets in ascending RTP sequence 704 number order (in modulo arithmetic). 
Within an aggregation packet, 705 the NAL unit transmission order is the same as the order of 706 appearance of NAL units in the packet. 708 3.2. Abbreviations 710 AU Access Unit 712 AP Aggregation Packet 714 CTU Coding Tree Unit 716 CVS Coded Video Sequence 718 DPB Decoded Picture Buffer 720 DCI Decoding capability information 722 DON Decoding Order Number 724 DONB Decoding Order Number Base 726 FIR Full Intra Request 728 FU Fragmentation Unit 730 HRD Hypothetical Reference Decoder 732 IDR Instantaneous Decoding Refresh 734 MANE Media-Aware Network Element 736 MTU Maximum Transfer Unit 738 NAL Network Abstraction Layer 740 NALU Network Abstraction Layer Unit 742 PLI Picture Loss Indication 744 PPS Picture Parameter Set 746 RPS Reference Picture Set 748 RPSI Reference Picture Selection Indication 750 SEI Supplemental Enhancement Information 752 SLI Slice Loss Indication 754 SPS Sequence Parameter Set 755 VCL Video Coding Layer 757 VPS Video Parameter Set 759 4. RTP Payload Format 761 4.1. RTP Header Usage 763 The format of the RTP header is specified in [RFC3550] (reprinted as 764 Figure 2 for convenience). This payload format uses the fields of 765 the header in a manner consistent with that specification. 767 The RTP payload (and the settings for some RTP header bits) for 768 aggregation packets and fragmentation units are specified in 769 Section 4.3.2 and Section 4.3.3, respectively. 771 0 1 2 3 772 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 773 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 774 |V=2|P|X| CC |M| PT | sequence number | 775 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 776 | timestamp | 777 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 778 | synchronization source (SSRC) identifier | 779 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ 780 | contributing source (CSRC) identifiers | 781 | .... 
| 782 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 784 RTP Header According to [RFC3550] 786 Figure 2 788 The RTP header information to be set according to this RTP payload 789 format is set as follows: 791 Marker bit (M): 1 bit 793 Set for the last packet of the access unit, carried in the current 794 RTP stream. This is in line with the normal use of the M bit in 795 video formats to allow an efficient playout buffer handling. 797 Editor notes: The informative note below needs updating once 798 the NAL unit type table is stable in the [VVC] spec. 800 Informative note: The content of a NAL unit does not tell 801 whether or not the NAL unit is the last NAL unit, in decoding 802 order, of an access unit. An RTP sender implementation may 803 obtain this information from the video encoder. If, however, 804 the implementation cannot obtain this information directly from 805 the encoder, e.g., when the bitstream was pre-encoded, and also 806 there is no timestamp allocated for each NAL unit, then the 807 sender implementation can inspect subsequent NAL units in 808 decoding order to determine whether or not the NAL unit is the 809 last NAL unit of an access unit as follows. A NAL unit is 810 determined to be the last NAL unit of an access unit if it is 811 the last NAL unit of the bitstream. A NAL unit naluX is also 812 determined to be the last NAL unit of an access unit if both 813 the following conditions are true: 1) the next VCL NAL unit 814 naluY in decoding order has the high-order bit of the first 815 byte after its NAL unit header equal to 1 or nal_unit_type 816 equal to 19, and 2) all NAL units between naluX and naluY, when 817 present, have nal_unit_type in the range of 13 to 17, inclusive, 818 equal to 20, equal to 23 or equal to 26. 820 Payload Type (PT): 7 bits 822 The assignment of an RTP payload type for this new packet format 823 is outside the scope of this document and will not be specified 824 here. 
The assignment of a payload type has to be performed either 825 through the profile used or in a dynamic way. 827 Sequence Number (SN): 16 bits 829 Set and used in accordance with [RFC3550]. 831 Timestamp: 32 bits 833 The RTP timestamp is set to the sampling timestamp of the content. 834 A 90 kHz clock rate MUST be used. If the NAL unit has no timing 835 properties of its own (e.g., parameter set and SEI NAL units), the 836 RTP timestamp MUST be set to the RTP timestamp of the coded 837 picture of the access unit in which the NAL unit (according to 838 Annex D of VVC) is included. Receivers MUST use the RTP timestamp 839 for the display process, even when the bitstream contains picture 840 timing SEI messages or decoding unit information SEI messages as 841 specified in VVC. 843 Synchronization source (SSRC): 32 bits 845 Used to identify the source of the RTP packets. A single SSRC is 846 used for all parts of a single bitstream. 848 4.2. Payload Header Usage 850 The first two bytes of the payload of an RTP packet are referred to 851 as the payload header. The payload header consists of the same 852 fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown 853 in Section 1.1.4, irrespective of the type of the payload structure. 855 The TID value indicates (among other things) the relative importance 856 of an RTP packet, for example, because NAL units belonging to higher 857 temporal sub-layers are not used for the decoding of lower temporal 858 sub-layers. A lower value of TID indicates a higher importance. 859 More-important NAL units MAY be better protected against transmission 860 losses than less-important NAL units. 862 For Discussion: quite possibly something similar can be said for 863 the Layer_id in layered coding, but perhaps not in multiview 864 coding. (The relevant part of the spec is relatively new, 865 therefore the soft language). However, for serious layer pruning, 866 interpretation of the VPS is required. 
We can add language about 867 the need for stateful interpretation of LayerID vis-a-vis 868 stateless interpretation of TID later. 870 4.3. Payload Structures 872 Three different types of RTP packet payload structures are specified. 873 A receiver can identify the type of an RTP packet payload through the 874 Type field in the payload header. 876 The three different payload structures are as follows: 878 o Single NAL unit packet: Contains a single NAL unit in the payload, 879 and the NAL unit header of the NAL unit also serves as the payload 880 header. This payload structure is specified in Section 4.3.1. 882 o Aggregation Packet (AP): Contains more than one NAL unit within 883 one access unit. This payload structure is specified in 884 Section 4.3.2. 886 o Fragmentation Unit (FU): Contains a subset of a single NAL unit. 887 This payload structure is specified in Section 4.3.3. 889 4.3.1. Single NAL Unit Packets 891 Editor notes: it's better to add a section to describe DONL and 892 sprop-max_don_diff 894 A single NAL unit packet contains exactly one NAL unit, and consists 895 of a payload header (denoted as PayloadHdr), a conditional 16-bit 896 DONL field (in network byte order), and the NAL unit payload data 897 (the NAL unit excluding its NAL unit header) of the contained NAL 898 unit, as shown in Figure 3. 900 0 1 2 3 901 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 902 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 903 | PayloadHdr | DONL (conditional) | 904 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 905 | | 906 | NAL unit payload data | 907 | | 908 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 909 | :...OPTIONAL RTP padding | 910 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 912 The Structure of a Single NAL Unit Packet 914 Figure 3 916 The DONL field, when present, specifies the value of the 16 least 917 significant bits of the decoding order number of the contained NAL 918 unit. 
If sprop-max-don-diff is greater than 0 for any of the RTP 919 streams, the DONL field MUST be present, and the variable DON for the 920 contained NAL unit is derived as equal to the value of the DONL 921 field. Otherwise (sprop-max-don-diff is equal to 0 for all the RTP 922 streams), the DONL field MUST NOT be present. 924 4.3.2. Aggregation Packets (APs) 926 Aggregation Packets (APs) can reduce packetization overhead for 927 small NAL units, such as most of the non-VCL NAL units, which are 928 often only a few octets in size. 930 An AP aggregates NAL units of one access unit. Each NAL unit to be 931 carried in an AP is encapsulated in an aggregation unit. NAL units 932 aggregated in one AP are included in NAL unit decoding order. 934 An AP consists of a payload header (denoted as PayloadHdr) followed 935 by two or more aggregation units, as shown in Figure 4. 937 0 1 2 3 938 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 939 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 940 | PayloadHdr (Type=28) | | 941 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 942 | | 943 | two or more aggregation units | 944 | | 945 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 946 | :...OPTIONAL RTP padding | 947 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 949 The Structure of an Aggregation Packet 951 Figure 4 953 The fields in the payload header of an AP are set as follows. The F 954 bit MUST be equal to 0 if the F bit of each aggregated NAL unit is 955 equal to zero; otherwise, it MUST be equal to 1. The Type field MUST 956 be equal to 28. 958 The value of LayerId MUST be equal to the lowest value of LayerId of 959 all the aggregated NAL units. The value of TID MUST be the lowest 960 value of TID of all the aggregated NAL units. 962 Informative note: All VCL NAL units in an AP have the same TID 963 value since they belong to the same access unit. 
However, an AP 964 may contain non-VCL NAL units for which the TID value in the NAL 965 unit header may be different than the TID value of the VCL NAL 966 units in the same AP. 968 An AP MUST carry at least two aggregation units and can carry as many 969 aggregation units as necessary; however, the total amount of data in 970 an AP obviously MUST fit into an IP packet, and the size SHOULD be 971 chosen so that the resulting IP packet is smaller than the MTU size 972 so as to avoid IP layer fragmentation. An AP MUST NOT contain FUs 973 specified in Section 4.3.3. APs MUST NOT be nested; i.e., an AP 974 cannot contain another AP. 976 The first aggregation unit in an AP consists of a conditional 16-bit 977 DONL field (in network byte order) followed by a 16-bit unsigned size 978 information (in network byte order) that indicates the size of the 979 NAL unit in bytes (excluding these two octets, but including the NAL 980 unit header), followed by the NAL unit itself, including its NAL unit 981 header, as shown in Figure 5. 983 0 1 2 3 984 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 985 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 986 | : DONL (conditional) | NALU size | 987 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 988 | NALU size | | 989 +-+-+-+-+-+-+-+-+ NAL unit | 990 | | 991 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 992 | : 993 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 995 The Structure of the First Aggregation Unit in an AP 997 Figure 5 999 The DONL field, when present, specifies the value of the 16 least 1000 significant bits of the decoding order number of the aggregated NAL 1001 unit. 1003 If sprop-max-don-diff is greater than 0 for any of the RTP streams, 1004 the DONL field MUST be present in an aggregation unit that is the 1005 first aggregation unit in an AP, and the variable DON for the 1006 aggregated NAL unit is derived as equal to the value of the DONL 1007 field. 
Otherwise (sprop-max-don-diff is equal to 0 for all the RTP 1008 streams), the DONL field MUST NOT be present in an aggregation unit 1009 that is the first aggregation unit in an AP. 1011 An aggregation unit that is not the first aggregation unit in an AP 1012 will be followed immediately by a 16-bit unsigned size information 1013 (in network byte order) that indicates the size of the NAL unit in 1014 bytes (excluding these two octets, but including the NAL unit 1015 header), followed by the NAL unit itself, including its NAL unit 1016 header, as shown in Figure 6. 1018 0 1 2 3 1019 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1020 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1021 | : NALU size | NAL unit | 1022 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 1023 | | 1024 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1025 | : 1026 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1028 The Structure of an Aggregation Unit That Is Not the First 1029 Aggregation Unit in an AP 1031 Figure 6 1033 Figure 7 presents an example of an AP that contains two aggregation 1034 units, labeled as 1 and 2 in the figure, without the DONL field being 1035 present. 1037 0 1 2 3 1038 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1039 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1040 | RTP Header | 1041 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1042 | PayloadHdr (Type=28) | NALU 1 Size | 1043 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1044 | NALU 1 HDR | | 1045 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 1 Data | 1046 | . . . | 1047 | | 1048 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1049 | . . . | NALU 2 Size | NALU 2 HDR | 1050 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1051 | NALU 2 HDR | | 1052 +-+-+-+-+-+-+-+-+ NALU 2 Data | 1053 | . . . 
| 1054 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1055 | :...OPTIONAL RTP padding | 1056 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1058 An Example of an AP Packet Containing 1059 Two Aggregation Units without the DONL Field 1061 Figure 7 1063 Figure 8 presents an example of an AP that contains two aggregation 1064 units, labeled as 1 and 2 in the figure, with the DONL field being 1065 present. 1067 0 1 2 3 1068 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1069 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1070 | RTP Header | 1071 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1072 | PayloadHdr (Type=28) | NALU 1 DONL | 1073 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1074 | NALU 1 Size | NALU 1 HDR | 1075 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1076 | | 1077 | NALU 1 Data . . . | 1078 | | 1079 + . . . +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1080 | : NALU 2 Size | 1081 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1082 | NALU 2 HDR | | 1083 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 2 Data | 1084 | | 1085 | . . . +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1086 | :...OPTIONAL RTP padding | 1087 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1089 An Example of an AP Containing 1090 Two Aggregation Units with the DONL Field 1092 Figure 8 1094 4.3.3. Fragmentation Units 1096 Fragmentation Units (FUs) are introduced to enable fragmenting a 1097 single NAL unit into multiple RTP packets, possibly without 1098 cooperation or knowledge of the [VVC] encoder. A fragment of a NAL 1099 unit consists of an integer number of consecutive octets of that NAL 1100 unit. Fragments of the same NAL unit MUST be sent in consecutive 1101 order with ascending RTP sequence numbers (with no other RTP packets 1102 within the same RTP stream being sent between the first and last 1103 fragment). 
1105 When a NAL unit is fragmented and conveyed within FUs, it is referred 1106 to as a fragmented NAL unit. APs MUST NOT be fragmented. FUs MUST 1107 NOT be nested; i.e., an FU cannot contain a subset of another FU. 1109 The RTP timestamp of an RTP packet carrying an FU is set to the NALU- 1110 time of the fragmented NAL unit. 1112 An FU consists of a payload header (denoted as PayloadHdr), an FU 1113 header of one octet, a conditional 16-bit DONL field (in network byte 1114 order), and an FU payload, as shown in Figure 9. 1116 0 1 2 3 1117 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1118 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1119 | PayloadHdr (Type=29) | FU header | DONL (cond) | 1120 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| 1121 | DONL (cond) | | 1122 |-+-+-+-+-+-+-+-+ | 1123 | FU payload | 1124 | | 1125 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1126 | :...OPTIONAL RTP padding | 1127 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1129 The Structure of an FU 1131 Figure 9 1133 The fields in the payload header are set as follows. The Type field 1134 MUST be equal to 29. The fields F, LayerId, and TID MUST be equal to 1135 the fields F, LayerId, and TID, respectively, of the fragmented NAL 1136 unit. 1138 The FU header consists of an S bit, an E bit, an R bit and a 5-bit 1139 FuType field, as shown in Figure 10. 1141 +---------------+ 1142 |0|1|2|3|4|5|6|7| 1143 +-+-+-+-+-+-+-+-+ 1144 |S|E|R| FuType | 1145 +---------------+ 1147 The Structure of FU Header 1149 Figure 10 1151 The semantics of the FU header fields are as follows: 1153 S: 1 bit 1155 When set to 1, the S bit indicates the start of a fragmented NAL 1156 unit, i.e., the first byte of the FU payload is also the first 1157 byte of the payload of the fragmented NAL unit. When the FU 1158 payload is not the start of the fragmented NAL unit payload, the S 1159 bit MUST be set to 0. 
1161 E: 1 bit 1162 When set to 1, the E bit indicates the end of a fragmented NAL 1163 unit, i.e., the last byte of the payload is also the last byte of 1164 the fragmented NAL unit. When the FU payload is not the last 1165 fragment of a fragmented NAL unit, the E bit MUST be set to 0. 1167 Reserved: 1 bit 1169 Placeholder 1171 FuType: 5 bits 1173 The field FuType MUST be equal to the field Type of the fragmented 1174 NAL unit. 1176 The DONL field, when present, specifies the value of the 16 least 1177 significant bits of the decoding order number of the fragmented NAL 1178 unit. 1180 If sprop-max-don-diff is greater than 0 for any of the RTP streams, 1181 and the S bit is equal to 1, the DONL field MUST be present in the 1182 FU, and the variable DON for the fragmented NAL unit is derived as 1183 equal to the value of the DONL field. Otherwise (sprop-max-don-diff 1184 is equal to 0 for all the RTP streams, or the S bit is equal to 0), 1185 the DONL field MUST NOT be present in the FU. 1187 A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e., 1188 the Start bit and End bit must not both be set to 1 in the same FU 1189 header. 1191 The FU payload consists of fragments of the payload of the fragmented 1192 NAL unit so that if the FU payloads of consecutive FUs, starting with 1193 an FU with the S bit equal to 1 and ending with an FU with the E bit 1194 equal to 1, are sequentially concatenated, the payload of the 1195 fragmented NAL unit can be reconstructed. The NAL unit header of the 1196 fragmented NAL unit is not included as such in the FU payload, but 1197 rather the information of the NAL unit header of the fragmented NAL 1198 unit is conveyed in F, LayerId, and TID fields of the FU payload 1199 headers of the FUs and the FuType field of the FU header of the FUs. 1200 An FU payload MUST NOT be empty. 
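The FU construction and reconstruction rules above can be illustrated with a minimal, non-normative Python sketch. It assumes a two-byte NAL unit header laid out as F (1 bit), Z (1 bit), LayerId (6 bits), Type (5 bits), TID (3 bits); that layout is defined outside this section and is an assumption here. Copying the Z bit into the FU payload header is likewise an assumption, and the helper names (parse_nal_header, build_header, fragment, reassemble) are hypothetical and not part of this memo.

```python
import struct

FU_TYPE = 29  # Type value of the FU payload header (Section 4.3.3)


def parse_nal_header(hdr):
    """Split a 2-byte header into (F, Z, LayerId, Type, TID).

    ASSUMPTION: bit layout F(1) Z(1) LayerId(6) | Type(5) TID(3).
    """
    b0, b1 = hdr[0], hdr[1]
    return (b0 >> 7, (b0 >> 6) & 1, b0 & 0x3F, b1 >> 3, b1 & 0x07)


def build_header(f, z, layer_id, typ, tid):
    """Inverse of parse_nal_header (same assumed layout)."""
    return bytes([(f << 7) | (z << 6) | layer_id, (typ << 3) | tid])


def fragment(nal, max_fu_payload, donl=None):
    """Return the RTP payloads of the FUs carrying one NAL unit.

    donl is the 16-bit DONL value, or None when sprop-max-don-diff is 0.
    """
    f, z, layer_id, typ, tid = parse_nal_header(nal[:2])
    body = nal[2:]  # the NAL unit header itself is not part of any FU payload
    if len(body) <= max_fu_payload:
        # A single FU would need S and E both set, which is not allowed.
        raise ValueError("NAL unit fits in one packet; do not use FUs")
    payload_hdr = build_header(f, z, layer_id, FU_TYPE, tid)
    chunks = [body[i:i + max_fu_payload]
              for i in range(0, len(body), max_fu_payload)]
    fus = []
    for i, chunk in enumerate(chunks):
        s = 1 if i == 0 else 0
        e = 1 if i == len(chunks) - 1 else 0
        fu_hdr = bytes([(s << 7) | (e << 6) | typ])  # R bit left at 0
        # DONL is carried only in the FU that has the S bit set.
        don = struct.pack("!H", donl) if (donl is not None and s) else b""
        fus.append(payload_hdr + fu_hdr + don + chunk)
    return fus


def reassemble(fus, has_donl=False):
    """Rebuild the NAL unit from all of its FUs, given in transmission order."""
    if not (fus[0][2] & 0x80) or not (fus[-1][2] & 0x40):
        raise ValueError("first FU needs S=1 and last FU needs E=1")
    f, z, layer_id, _, tid = parse_nal_header(fus[0][:2])
    fu_type = fus[0][2] & 0x1F
    body = bytearray()
    for fu in fus:
        s = fu[2] >> 7
        offset = 3 + (2 if (has_donl and s) else 0)
        body += fu[offset:]
    # Header fields come from the payload header plus FuType.
    return build_header(f, z, layer_id, fu_type, tid) + bytes(body)


nal = build_header(0, 0, 0, 1, 1) + bytes(range(10))
fus = fragment(nal, 4)
restored = reassemble(fus)
```

In this sketch max_fu_payload counts FU payload octets only; a real sender would also budget for the RTP header, the three header octets of the FU, and the optional DONL field when choosing the fragment size.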
1202 If an FU is lost, the receiver SHOULD discard all following 1203 fragmentation units in transmission order corresponding to the same 1204 fragmented NAL unit, unless the decoder in the receiver is known to 1205 be prepared to gracefully handle incomplete NAL units. 1207 A receiver in an endpoint or in a MANE MAY aggregate the first n-1 1208 fragments of a NAL unit to an (incomplete) NAL unit, even if fragment 1209 n of that NAL unit is not received. In this case, the 1210 forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a 1211 syntax violation. 1213 4.4. Decoding Order Number 1215 For each NAL unit, the variable AbsDon is derived, representing the 1216 decoding order number that is indicative of the NAL unit decoding 1217 order. 1219 Let NAL unit n be the n-th NAL unit in transmission order within an 1220 RTP stream. 1222 If sprop-max-don-diff is equal to 0 for all the RTP streams carrying 1223 the [VVC] bitstream, AbsDon[n], the value of AbsDon for NAL unit n, 1224 is derived as equal to n. 1226 Otherwise (sprop-max-don-diff is greater than 0 for any of the RTP 1227 streams), AbsDon[n] is derived as follows, where DON[n] is the value 1228 of the variable DON for NAL unit n: 1230 o If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in 1231 transmission order), AbsDon[0] is set equal to DON[0]. 
1233 o Otherwise (n is greater than 0), the following applies for 1234 derivation of AbsDon[n]: 1236 If DON[n] == DON[n-1], 1237 AbsDon[n] = AbsDon[n-1] 1239 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768), 1240 AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1] 1242 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768), 1243 AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n] 1245 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768), 1246 AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - 1247 DON[n]) 1249 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768), 1250 AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n]) 1252 For any two NAL units m and n, the following applies: 1254 o AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows 1255 NAL unit m in NAL unit decoding order. 1257 o When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order 1258 of the two NAL units can be in either order. 1260 o AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes 1261 NAL unit m in decoding order. 1263 Informative note: When two consecutive NAL units in the NAL 1264 unit decoding order have different values of AbsDon, the 1265 absolute difference between the two AbsDon values may be 1266 greater than or equal to 1. 1268 Informative note: There are multiple reasons to allow for the 1269 absolute difference of the values of AbsDon for two consecutive 1270 NAL units in the NAL unit decoding order to be greater than 1271 one. An increment by one is not required, as at the time of 1272 associating values of AbsDon to NAL units, it may not be known 1273 whether all NAL units are to be delivered to the receiver. For 1274 example, a gateway might not forward VCL NAL units of higher 1275 sub-layers or some SEI NAL units when there is congestion in 1276 the network. 
In another example, the first intra-coded picture 1277 of a pre-encoded clip is transmitted in advance to ensure that 1278 it is readily available in the receiver, and when transmitting 1279 the first intra-coded picture, the originator does not exactly 1280 know how many NAL units will be encoded before the first intra- 1281 coded picture of the pre-encoded clip follows in decoding 1282 order. Thus, the values of AbsDon for the NAL units of the 1283 first intra-coded picture of the pre-encoded clip have to be 1284 estimated when they are transmitted, and gaps in values of 1285 AbsDon may occur. 1287 5. Packetization Rules 1289 The following packetization rules apply: 1291 o If sprop-max-don-diff is greater than 0 for any of the RTP 1292 streams, the transmission order of NAL units carried in the RTP 1293 stream MAY be different than the NAL unit decoding order and the 1294 NAL unit output order. 1296 o A NAL unit of a small size SHOULD be encapsulated in an 1297 aggregation packet together with one or more other NAL units in order 1298 to avoid the unnecessary packetization overhead for small NAL 1299 units. For example, non-VCL NAL units such as access unit 1300 delimiters, parameter sets, or SEI NAL units are typically small 1301 and can often be aggregated with VCL NAL units without violating 1302 MTU size constraints. 1304 o Each non-VCL NAL unit SHOULD, when possible from an MTU size match 1305 viewpoint, be encapsulated in an aggregation packet together with 1306 its associated VCL NAL unit, as typically a non-VCL NAL unit would 1307 be meaningless without the associated VCL NAL unit being 1308 available. 1310 o For carrying exactly one NAL unit in an RTP packet, a single NAL 1311 unit packet MUST be used. 1313 6. De-packetization Process 1315 The general concept behind de-packetization is to get the NAL units 1316 out of the RTP packets in an RTP stream and pass them to the decoder 1317 in the NAL unit decoding order.
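As a non-normative illustration of this concept, the following Python sketch combines the AbsDon derivation of Section 4.4 with the de-packetization buffer behaviour described later in this section (conditions A and B), for the case where sprop-max-don-diff is greater than 0. The function names and the representation of the input as (DON, NAL unit) pairs in reception order are assumptions made for the example; jitter handling, packet loss, and the parsing of APs and FUs are out of scope.

```python
import bisect


def next_abs_don(prev_don, prev_abs, don):
    """AbsDon[n] from DON[n-1], AbsDon[n-1] and DON[n] (Section 4.4)."""
    if don == prev_don:
        return prev_abs
    if don > prev_don and don - prev_don < 32768:
        return prev_abs + don - prev_don
    if don < prev_don and prev_don - don >= 32768:
        return prev_abs + 65536 - prev_don + don   # forward, DON wrapped
    if don > prev_don and don - prev_don >= 32768:
        return prev_abs - (prev_don + 65536 - don)  # backward across wrap
    return prev_abs - (prev_don - don)              # backward, no wrap


def abs_don_sequence(dons):
    """AbsDon values for 16-bit DON values given in transmission order."""
    result = []
    for n, don in enumerate(dons):
        if n == 0:
            result.append(don)  # AbsDon[0] = DON[0]
        else:
            result.append(next_abs_don(dons[n - 1], result[-1], don))
    return result


def depacketize(nal_units, max_don_diff, depack_buf_nalus):
    """Reorder (don, nal) pairs from reception order to decoding order.

    max_don_diff and depack_buf_nalus stand for the values of
    sprop-max-don-diff (> 0 here) and sprop-depack-buf-nalus.
    """
    buf = []   # kept sorted by (AbsDon, arrival index)
    out = []
    prev_don = prev_abs = None
    for idx, (don, nal) in enumerate(nal_units):
        if prev_don is None:
            abs_don = don
        else:
            abs_don = next_abs_don(prev_don, prev_abs, don)
        prev_don, prev_abs = don, abs_don
        bisect.insort(buf, (abs_don, idx, nal))
        # Condition A: AbsDon spread in the buffer >= sprop-max-don-diff.
        # Condition B: number of buffered NAL units > sprop-depack-buf-nalus.
        while buf and (buf[-1][0] - buf[0][0] >= max_don_diff
                       or len(buf) > depack_buf_nalus):
            out.append(buf.pop(0)[2])  # smallest AbsDon goes to the decoder
    # No more input: flush in increasing AbsDon order.
    out.extend(item[2] for item in buf)
    return out


wrapped = abs_don_sequence([65534, 65535, 0, 2, 1])
ordered = depacketize([(1, "b"), (0, "a"), (3, "d"), (2, "c")], 2, 10)
```

The sorted list keeps the sketch short; an implementation handling many buffered NAL units would more likely use a heap plus a running maximum. Initial buffering and buffering while playing collapse into the same loop here, because nothing is released until condition A or condition B first becomes true.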
1319 The de-packetization process is implementation dependent. Therefore, 1320 the following description should be seen as an example of a suitable 1321 implementation. Other schemes may be used as well, as long as the 1322 output for the same input is the same as the process described below. 1323 The output is the same when the set of output NAL units and their 1324 order are both identical. Optimizations relative to the described 1325 algorithms are possible. 1327 All normal RTP mechanisms related to buffer management apply. In 1328 particular, duplicated or outdated RTP packets (as indicated by the 1329 RTP sequence number and the RTP timestamp) are removed. To 1330 determine the exact time for decoding, factors such as a possible 1331 intentional delay to allow for proper inter-stream synchronization 1332 MUST be factored in. 1334 NAL units with NAL unit type values in the range of 0 to 27, 1335 inclusive, may be passed to the decoder. NAL-unit-like structures 1336 with NAL unit type values in the range of 28 to 31, inclusive, MUST 1337 NOT be passed to the decoder. 1339 The receiver includes a receiver buffer, which is used to compensate 1340 for transmission delay jitter within individual RTP streams and 1341 across RTP streams, and to reorder NAL units from transmission order to 1342 the NAL unit decoding order. In this section, the receiver operation 1343 is described under the assumption that there is no transmission delay 1344 jitter within an RTP stream and across RTP streams. To distinguish it 1345 from a practical receiver buffer that is also used for 1346 compensation of transmission delay jitter, the receiver buffer is 1347 hereafter called the de-packetization buffer in this section. 
1348 Receivers should also prepare for transmission delay jitter; that is, 1349 either reserve separate buffers for transmission delay jitter 1350 buffering and de-packetization buffering or use a receiver buffer for 1351 both transmission delay jitter and de-packetization. Moreover, 1352 receivers should take transmission delay jitter into account in the 1353 buffering operation, e.g., by additional initial buffering before 1354 starting of decoding and playback. 1356 When sprop-max-don-diff is equal to 0 for all the received RTP 1357 streams, the de-packetization buffer size is zero bytes, and the 1358 process described in the remainder of this paragraph applies. 1359 The NAL units carried in the single RTP stream are directly passed to 1360 the decoder in their transmission order, which is identical to their 1361 decoding order. When there are several NAL units of the same RTP 1362 stream with the same NTP timestamp, the order to pass them to the 1363 decoder is their transmission order. 1365 Informative note: The mapping between RTP and NTP timestamps is 1366 conveyed in RTCP SR packets. In addition, the mechanisms for 1367 faster media timestamp synchronization discussed in [RFC6051] may 1368 be used to speed up the acquisition of the RTP-to-wall-clock 1369 mapping. 1371 When sprop-max-don-diff is greater than 0 for any of the received RTP 1372 streams, the process described in the remainder of this section 1373 applies. 1375 There are two buffering states in the receiver: initial buffering and 1376 buffering while playing. Initial buffering starts when the reception 1377 is initialized. After initial buffering, decoding and playback are 1378 started, and the buffering-while-playing mode is used. 1380 Regardless of the buffering state, the receiver stores incoming NAL 1381 units, in reception order, into the de-packetization buffer.
NAL 1382 units carried in RTP packets are stored in the de-packetization 1383 buffer individually, and the value of AbsDon is calculated and stored 1384 for each NAL unit. 1386 Initial buffering lasts until condition A (the difference between the 1387 greatest and smallest AbsDon values of the NAL units in the de- 1388 packetization buffer is greater than or equal to the value of sprop- 1389 max-don-diff) or condition B (the number of NAL units in the de- 1390 packetization buffer is greater than the value of sprop-depack-buf- 1391 nalus) is true. 1393 After initial buffering, whenever condition A or condition B is true, 1394 the following operation is repeatedly applied until both condition A 1395 and condition B become false: 1397 o The NAL unit in the de-packetization buffer with the smallest 1398 value of AbsDon is removed from the de-packetization buffer and 1399 passed to the decoder. 1401 When no more NAL units are flowing into the de-packetization buffer, 1402 all NAL units remaining in the de-packetization buffer are removed 1403 from the buffer and passed to the decoder in the order of increasing 1404 AbsDon values. 1406 7. Payload Format Parameters 1408 Placeholder 1410 8. Use with Feedback Messages 1412 The following subsections define the use of the Picture Loss 1413 Indication (PLI), Slice Loss Indication (SLI), Reference Picture 1414 Selection Indication (RPSI), and Full Intra Request (FIR) feedback 1415 messages with VVC. The PLI, SLI, and RPSI messages are defined in 1416 [RFC4585], and the FIR message is defined in [RFC5104]. 1418 8.1. Picture Loss Indication (PLI) 1420 As specified in RFC 4585, Section 6.3.1, the reception of a PLI by a 1421 media sender indicates "the loss of an undefined amount of coded 1422 video data belonging to one or more pictures". 
   Without having any specific knowledge of the setup of the bitstream
   (such as use and location of in-band parameter sets, non-IRAP decoder
   refresh points, picture structures, and so forth), a [VVC] sender
   SHOULD react to the reception of a PLI by sending an IRAP picture and
   the relevant parameter sets, potentially with sufficient redundancy
   to ensure correct reception.  However, sometimes information about
   the bitstream structure is known.  For example, state could have been
   established outside of the mechanisms defined in this document that
   parameter sets are conveyed out of band only, and stay static for the
   duration of the session.  In that case, it is unnecessary to send
   them in-band as a result of the reception of a PLI.  Other examples
   could be devised based on a priori knowledge of different aspects of
   the bitstream structure.  In all cases, the timing and congestion
   control mechanisms of RFC 4585 MUST be observed.

8.2.  Slice Loss Indication (SLI)

   For further study.  This section may be removed, as there are no
   known implementations of SLI in [HEVC]-based systems.

8.3.  Reference Picture Selection Indication (RPSI)

   Feedback-based reference picture selection has been shown as a
   powerful tool to stop temporal error propagation for improved error
   resilience [Girod99] [Wang05].  In one approach, the decoder side
   tracks errors in the decoded pictures and informs the encoder side
   that a particular picture that has been decoded relatively earlier is
   correct and still present in the decoded picture buffer; it requests
   the encoder to use that correct picture-availability information when
   encoding the next picture, so as to stop further temporal error
   propagation.  For this approach, the decoder side should use the RPSI
   feedback message.

   Encoders can encode some long-term reference pictures as specified in
   [VVC] for purposes described in the previous paragraph without the
   need of a huge decoded picture buffer.  As shown in [Wang05], with a
   flexible reference picture management scheme, as in VVC, even a
   decoded picture buffer size of two picture storage buffers would work
   for the approach described in the previous paragraph.

   The text above is copied from RFC 7798.  If we keep the RPSI message,
   it needs adaptation to the [VVC] syntax.  Doing so shouldn't be too
   hard, as the [VVC] reference picture mechanism is not too different
   from the [HEVC] one.

8.4.  Full Intra Request (FIR)

   The purpose of the FIR message is to force an encoder to send an
   independent decoder refresh point as soon as possible, while
   observing applicable congestion-control-related constraints, such as
   those set out in [RFC8082].

   Upon reception of a FIR, a sender MUST send an IDR picture.
   Parameter sets MUST also be sent, except when there is a priori
   knowledge that the parameter sets have been correctly established.  A
   typical example for that is an understanding between sender and
   receiver, established by means outside this document, that parameter
   sets are exclusively sent out-of-band.

9.  Frame Marking

   Placeholder

10.  Security Considerations

   The scope of this Security Considerations section is limited to the
   payload format itself and to one feature of [VVC] that may pose a
   particularly serious security risk if implemented naively.  The
   payload format, in isolation, does not form a complete system.
   Implementers are advised to read and understand relevant
   security-related documents, especially those pertaining to RTP (see
   the Security Considerations section in [RFC3550]), and the security
   of the call-control stack chosen (that may make use of the media type
   registration of this memo).  Implementers should also consider known
   security vulnerabilities of video coding and decoding implementations
   in general and avoid those.

   Within this RTP payload format, and with the exception of the user
   data SEI message as described below, no security threats other than
   those common to RTP payload formats are known.  In other words,
   neither the various media-plane-based mechanisms, nor the signaling
   part of this memo, seems to pose a security risk beyond those common
   to all RTP-based systems.

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [RFC3550], and in any applicable RTP profile such as
   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or
   RTP/SAVPF [RFC5124].  However, as "Securing the RTP Framework: Why
   RTP Does Not Mandate a Single Media Security Solution" [RFC7202]
   discusses, it is not an RTP payload format's responsibility to
   discuss or mandate what solutions are used to meet the basic security
   goals like confidentiality, integrity, and source authenticity for
   RTP in general.  This responsibility lies with anyone using RTP in an
   application.  They can find guidance on available security mechanisms
   and important considerations in "Options for Securing RTP Sessions"
   [RFC7201].  The rest of this section discusses the security-impacting
   properties of the payload format itself.

   Because the data compression used with this payload format is applied
   end-to-end, any encryption needs to be performed after compression.
   A potential denial-of-service threat exists for data encodings using
   compression techniques that have non-uniform receiver-end
   computational load.  The attacker can inject pathological datagrams
   into the bitstream that are complex to decode and that cause the
   receiver to be overloaded.  [VVC] is particularly vulnerable to such
   attacks, as it is extremely simple to generate datagrams containing
   NAL units that affect the decoding process of many future NAL units.
   Therefore, the usage of data origin authentication and data integrity
   protection of at least the RTP packet is RECOMMENDED, for example,
   with SRTP [RFC3711].

   Like HEVC [RFC7798], [VVC] includes a user data Supplemental
   Enhancement Information (SEI) message.  This SEI message allows
   inclusion of an arbitrary bitstring into the video bitstream.  Such a
   bitstring could include JavaScript, machine code, and other active
   content.  [VVC] leaves the handling of this SEI message to the
   receiving system.  In order to avoid harmful side effects of the user
   data SEI message, decoder implementations cannot naively trust its
   content.  For example, it would be a bad and insecure implementation
   practice to forward any JavaScript a decoder implementation detects
   to a web browser.  The safest way to deal with user data SEI messages
   is to simply discard them, but that can have negative side effects on
   the quality of experience by the user.

   End-to-end security with authentication, integrity, or
   confidentiality protection will prevent a MANE from performing
   media-aware operations other than discarding complete packets.  In
   the case of confidentiality protection, it will even be prevented
   from discarding packets in a media-aware way.  To be allowed to
   perform such operations, a MANE is required to be a trusted entity
   that is included in the security context establishment.

11.  Congestion Control

   Congestion control for RTP SHALL be used in accordance with RTP
   [RFC3550] and with any applicable RTP profile, e.g., AVP [RFC3551].
   If best-effort service is being used, an additional requirement is
   that users of this payload format MUST monitor packet loss to ensure
   that the packet loss rate is within an acceptable range.  Packet loss
   is considered acceptable if a TCP flow across the same network path,
   and experiencing the same network conditions, would achieve an
   average throughput, measured on a reasonable timescale, that is not
   less than the throughput that all RTP streams combined are achieving.
   This condition can be satisfied by implementing congestion-control
   mechanisms to adapt the transmission rate or the number of layers
   subscribed for a layered multicast session, or by arranging for a
   receiver to leave the session if the loss rate is unacceptably high.

   The bitrate adaptation necessary for obeying the congestion control
   principle is easily achievable when real-time encoding is used, for
   example, by adequately tuning the quantization parameter.  However,
   when pre-encoded content is being transmitted, bandwidth adaptation
   requires the pre-coded bitstream to be tailored for such adaptivity.
   The key mechanisms available in [VVC] are temporal scalability and
   spatial/SNR scalability.  A media sender can remove NAL units
   belonging to higher temporal sub-layers (i.e., those NAL units with a
   high value of TID) or higher spatio-SNR layers (as indicated by
   interpreting the VPS) until the sending bitrate drops to an
   acceptable range.

   The mechanisms mentioned above generally work within a defined
   profile and level and, therefore, no renegotiation of the channel is
   required.  Only when non-downgradable parameters (such as profile)
   are required to be changed does it become necessary to terminate and
   restart the RTP stream(s).
   This may be accomplished by using different RTP payload types.

   MANEs MAY remove certain unusable packets from the RTP stream when
   that RTP stream was damaged due to previous packet losses.  This can
   help reduce the network load in certain special cases.  For example,
   MANEs can remove those FUs where the leading FUs belonging to the
   same NAL unit have been lost or those dependent slice segments when
   the leading slice segments belonging to the same slice have been
   lost, because the trailing FUs or dependent slice segments are
   meaningless to most decoders.  MANEs can also remove higher temporal
   scalable layers if the outbound transmission (from the MANE's
   viewpoint) experiences congestion.

12.  IANA Considerations

   Placeholder

13.  Acknowledgements

   Dr. Byeongdoo Choi is thanked for the video-codec-related technical
   discussion and other aspects in this memo.  Xin Zhao and Dr. Xiang Li
   are thanked for their contributions on [VVC] specification
   descriptive content.  Spencer Dawkins is thanked for his valuable
   review comments that led to great improvements of this memo.  Some
   parts of this specification share text with the RTP payload format
   for HEVC [RFC7798].  We thank the authors of that specification for
   their excellent work.

14.  References

14.1.  Normative References

   [H.266]    "ITU-T, Versatile Video Coding", n.d.

   [ISO23090-3]
              "ISO/IEC DIS Information technology --- Coded
              representation of immersive media --- Part 3: Versatile
              video coding", n.d.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
              July 2003.

   [RFC3551]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
              Video Conferences with Minimal Control", STD 65, RFC 3551,
              DOI 10.17487/RFC3551, July 2003.

   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
              RFC 3711, DOI 10.17487/RFC3711, March 2004.

   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
              Description Protocol", RFC 4566, DOI 10.17487/RFC4566,
              July 2006.

   [RFC4585]  Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey,
              "Extended RTP Profile for Real-time Transport Control
              Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585,
              DOI 10.17487/RFC4585, July 2006.

   [RFC5104]  Wenger, S., Chandra, U., Westerlund, M., and B. Burman,
              "Codec Control Messages in the RTP Audio-Visual Profile
              with Feedback (AVPF)", RFC 5104, DOI 10.17487/RFC5104,
              February 2008.

   [RFC5124]  Ott, J. and E. Carrara, "Extended Secure RTP Profile for
              Real-time Transport Control Protocol (RTCP)-Based Feedback
              (RTP/SAVPF)", RFC 5124, DOI 10.17487/RFC5124, February
              2008.

   [RFC7656]  Lennox, J., Gross, K., Nandakumar, S., Salgueiro, G., and
              B. Burman, Ed., "A Taxonomy of Semantics and Mechanisms
              for Real-Time Transport Protocol (RTP) Sources", RFC 7656,
              DOI 10.17487/RFC7656, November 2015.

   [RFC8082]  Wenger, S., Lennox, J., Burman, B., and M. Westerlund,
              "Using Codec Control Messages in the RTP Audio-Visual
              Profile with Feedback with Layered Codecs", RFC 8082,
              DOI 10.17487/RFC8082, March 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [VVC]      "Versatile Video Coding (Draft 8), Joint Video Experts
              Team (JVET)", January 2020.

14.2.  Informative References

   [CABAC]    Sole, J., et al., "Transform coefficient coding in HEVC",
              IEEE Transactions on Circuits and Systems for Video
              Technology, DOI 10.1109/TCSVT.2012.2223055, December
              2012.

   [FrameMarking]
              Berger, E., Nandakumar, S., and M. Zanaty, "Frame Marking
              RTP Header Extension", Work in Progress,
              draft-berger-avtext-framemarking, 2015.

   [Girod99]  Girod, B., et al., "Feedback-based error control for
              mobile video transmission", Proceedings of the IEEE,
              DOI 10.1109/5.790632, October 1999.

   [HEVC]     "High efficiency video coding", ITU-T Recommendation
              H.265, April 2013.

   [MPEG2S]   ISO/IEC, "Information technology - Generic coding of
              moving pictures and associated audio information - Part
              1: Systems", ISO International Standard 13818-1, 2013.

   [RFC6051]  Perkins, C. and T. Schierl, "Rapid Synchronisation of RTP
              Flows", RFC 6051, DOI 10.17487/RFC6051, November 2010.

   [RFC6184]  Wang, Y., Even, R., Kristensen, T., and R. Jesup, "RTP
              Payload Format for H.264 Video", RFC 6184,
              DOI 10.17487/RFC6184, May 2011.

   [RFC6190]  Wenger, S., Wang, Y., Schierl, T., and A. Eleftheriadis,
              "RTP Payload Format for Scalable Video Coding", RFC 6190,
              DOI 10.17487/RFC6190, May 2011.

   [RFC7201]  Westerlund, M. and C. Perkins, "Options for Securing RTP
              Sessions", RFC 7201, DOI 10.17487/RFC7201, April 2014.

   [RFC7202]  Perkins, C. and M. Westerlund, "Securing the RTP
              Framework: Why RTP Does Not Mandate a Single Media
              Security Solution", RFC 7202, DOI 10.17487/RFC7202, April
              2014.

   [RFC7798]  Wang, Y., Sanchez, Y., Schierl, T., Wenger, S., and M.
              Hannuksela, "RTP Payload Format for High Efficiency Video
              Coding (HEVC)", RFC 7798, DOI 10.17487/RFC7798, March
              2016.

   [Wang05]   Wang, YK., Zhu, C., and H. Li, "Error resilient video
              coding using flexible reference frames", Visual
              Communications and Image Processing 2005 (VCIP 2005),
              July 2005.

Appendix A.  Change History

   draft-zhao-payload-rtp-vvc-00 ........ initial version

Authors' Addresses

   Shuai Zhao
   Tencent
   2747 Park Blvd
   Palo Alto 94588
   USA

   Email: shuai.zhao@ieee.org

   Stephan Wenger
   Tencent
   2747 Park Blvd
   Palo Alto 94588

   Email: stewe@stewe.org