avtcore                                                          S. Zhao
Internet-Draft                                                 S. Wenger
Intended status: Standards Track                                 Tencent
Expires: October 1, 2020                                      Y. Sanchez
                                                          Fraunhofer HHI
                                                          March 30, 2020


          RTP Payload Format for Versatile Video Coding (VVC)
                     draft-ietf-avtcore-rtp-vvc-01

Abstract

   This memo describes an RTP payload format for the video coding
   standard ITU-T Recommendation H.266 and ISO/IEC International
   Standard 23090-3, both also known as Versatile Video Coding (VVC)
   and developed by the Joint Video Experts Team (JVET). The RTP
   payload format allows for packetization of one or more Network
   Abstraction Layer (NAL) units in each RTP packet payload as well as
   fragmentation of a NAL unit into multiple RTP packets. The payload
   format has wide applicability in videoconferencing, Internet video
   streaming, and high-bitrate entertainment-quality video, among other
   applications.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current
   Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on October 1, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
     1.1.  Overview of the VVC Codec
       1.1.1.  Coding-Tool Features (informative)
       1.1.2.  Systems and Transport Interfaces
       1.1.3.  Parallel Processing Support (informative)
       1.1.4.  NAL Unit Header
     1.2.  Overview of the Payload Format
   2.  Conventions
   3.  Definitions and Abbreviations
     3.1.  Definitions
       3.1.1.  Definitions from the VVC Specification
       3.1.2.  Definitions Specific to This Memo
     3.2.  Abbreviations
   4.  RTP Payload Format
     4.1.  RTP Header Usage
     4.2.  Payload Header Usage
     4.3.  Payload Structures
       4.3.1.  Single NAL Unit Packets
       4.3.2.  Aggregation Packets (APs)
       4.3.3.  Fragmentation Units
     4.4.  Decoding Order Number
   5.  Packetization Rules
   6.  De-packetization Process
   7.  Payload Format Parameters
   8.  Use with Feedback Messages
     8.1.  Picture Loss Indication (PLI)
     8.2.  Slice Loss Indication (SLI)
     8.3.  Reference Picture Selection Indication (RPSI)
     8.4.  Full Intra Request (FIR)
   9.  Frame marking
   10. Security Considerations
   11. Congestion Control
   12. IANA Considerations
   13. Acknowledgements
   14. References
     14.1.  Normative References
     14.2.  Informative References
   Appendix A.  Change History
   Authors' Addresses

1.  Introduction

   The Versatile Video Coding [VVC] specification, formally published as
   both ITU-T Recommendation H.266 and ISO/IEC International Standard
   23090-3 [ISO23090-3], is currently in the ISO/IEC approval process
   and is planned for ratification in mid-2020. H.266 is reported to
   provide significant coding efficiency gains over H.265 and earlier
   video codec formats.

   This memo describes an RTP payload format for VVC. It shares its
   basic design with the NAL (Network Abstraction Layer) unit-based RTP
   payload formats of H.264 Video Coding [RFC6184], Scalable Video
   Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC)
   [RFC7798], and their respective predecessors.
   With respect to design philosophy, security, congestion control, and
   overall implementation complexity, it has similar properties to those
   earlier payload format specifications. This is a conscious choice,
   as at least RFC 6184 is widely deployed and generally known in the
   relevant implementer communities. Certain mechanisms known from
   [RFC6190] were incorporated in VVC, as VVC version 1 supports
   temporal, spatial, and signal-to-noise ratio (SNR) scalability.

1.1.  Overview of the VVC Codec

   [VVC] and [HEVC] share a similar hybrid video codec design. In this
   memo, we provide a very brief overview of those features of VVC that
   are, in some form, addressed by the payload format specified herein.
   Implementers have to read, understand, and apply the ITU-T/ISO/IEC
   specifications pertaining to [VVC] to arrive at interoperable,
   well-performing implementations.

   Conceptually, both [VVC] and [HEVC] include a Video Coding Layer
   (VCL), which is often used to refer to the coding-tool features, and
   a NAL, which is often used to refer to the systems and transport
   interface aspects of the codecs.

1.1.1.  Coding-Tool Features (informative)

   Coding-tool features are described below with occasional reference to
   the coding-tool set of [HEVC], which is well known in the community.

   Similar to earlier hybrid-video-coding-based standards, including
   HEVC, the following basic video coding design is employed by VVC. A
   prediction signal is first formed by either intra- or
   motion-compensated prediction, and the residual (the difference
   between the original and the prediction) is then coded. The gains in
   coding efficiency are achieved by redesigning and improving almost
   all parts of the codec over earlier designs. In addition, [VVC]
   includes several tools to make the implementation on parallel
   architectures easier.

   Finally, [VVC] includes temporal, spatial, and SNR scalability as
   well as multiview coding support.

   Coding blocks and transform structure

   Among the major coding-tool differences between HEVC and VVC, one of
   the important improvements is the more flexible coding tree structure
   in VVC, i.e., the multi-type tree. In addition to the quadtree,
   binary and ternary trees are also supported, which contributes
   significantly to coding efficiency. Moreover, the maximum size of a
   Coding Tree Unit (CTU) is increased from 64x64 to 128x128. To
   improve the coding efficiency of the chroma signal, separate luma
   and chroma trees at the CTU level may be employed for intra slices.
   The square transforms in HEVC are extended to non-square transforms
   for rectangular blocks resulting from binary and ternary tree splits.
   In addition, [VVC] supports multiple transform sets (MTS), including
   DCT-2, DST-7, and DCT-8, as well as the non-separable secondary
   transform. The transforms used in [VVC] can have different sizes,
   with support for larger transform sizes. For DCT-2, the transform
   sizes range from 2x2 to 64x64, and for DST-7 and DCT-8, the transform
   sizes range from 4x4 to 32x32. [VVC] also supports a sub-block
   transform for both intra- and inter-coded blocks. For intra-coded
   blocks, intra sub-partitioning (ISP) may be used to allow
   sub-block-based intra prediction and transform. For inter-coded
   blocks, the sub-block transform may be used assuming that only a part
   of an inter block has non-zero transform coefficients.

   Entropy coding

   Similar to HEVC, [VVC] uses a single entropy-coding engine, which is
   based on Context Adaptive Binary Arithmetic Coding (CABAC) [CABAC],
   but with support for multiple window sizes. The window sizes can be
   initialized differently for different context models. Due to this
   design, it has more efficient adaptation speed and better coding
   efficiency.
   A joint chroma residual coding scheme is applied to further exploit
   the correlation between the residuals of two color components. In
   VVC, different residual coding schemes are applied for regular
   transform coefficients and for residual samples generated using
   transform-skip mode.

   In-loop filtering

   [VVC] has more feature support in loop filters than HEVC. The
   deblocking filter in [VVC] is similar to that of HEVC but operates
   on a smaller grid. After deblocking and sample adaptive offset
   (SAO), an adaptive loop filter (ALF) may be used. As a Wiener
   filter, ALF reduces distortion of decoded pictures. In addition,
   [VVC] introduces a new module before deblocking, called luma mapping
   with chroma scaling, to fully utilize the dynamic range of the signal
   so that the rate-distortion performance of both SDR and HDR content
   is improved.

   Motion prediction and coding

   Compared to HEVC, [VVC] introduces several improvements in this area.
   First, adaptive motion vector resolution (AMVR) can save bit cost
   for motion vectors by adaptively signaling the motion vector
   resolution. Second, affine motion compensation is included to
   capture complicated motion such as zooming and rotation. In
   addition, prediction refinement with optical flow for affine mode
   (PROF) is deployed to mimic affine motion at the pixel level. Third,
   decoder-side motion vector refinement (DMVR) is a method to derive
   motion vectors at the decoder side based on block matching, so that
   fewer bits may be spent on motion vectors. Bi-directional optical
   flow (BDOF) is a similar method to PROF. BDOF adds a sample-wise
   offset at the 4x4 sub-block level that is derived with equations
   based on gradients of the prediction samples and a motion difference
   relative to CU motion vectors.
   Furthermore, merge with motion vector difference (MMVD) is a special
   mode that signals a limited set of motion vector differences on top
   of merge mode. In addition to MMVD, there are three other special
   merge modes: sub-block merge, triangle, and combined intra/inter
   prediction (CIIP). The sub-block merge list includes one candidate
   of sub-block temporal motion vector prediction (SbTMVP) and up to
   four candidates of affine motion vectors. Triangle is based on
   triangular block motion compensation. CIIP combines intra and inter
   predictions with weighting. Adaptive weighting may be employed with
   a block-level tool called bi-prediction with CU-based weighting
   (BCW), which provides more flexibility than in HEVC.

   Intra prediction and intra-coding

   To capture the diversified local image texture directions with finer
   granularity, [VVC] supports 65 angular directions instead of the 33
   directions in HEVC. The intra mode coding is based on a scheme of 6
   most probable modes, which are derived using the neighboring intra
   prediction directions. In addition, to deal with the different
   distributions of intra prediction angles for different block aspect
   ratios, a wide-angle intra prediction (WAIP) scheme is applied in
   [VVC] by including intra prediction angles beyond those present in
   HEVC. Unlike HEVC, which only allows using the most adjacent line of
   reference samples for intra prediction, [VVC] also allows using two
   further reference lines, which is known as multi-reference-line
   (MRL) intra prediction. The additional reference lines can only be
   used for the 6 most probable intra prediction modes. To capture the
   strong correlation between different colour components, VVC uses a
   cross-component linear mode (CCLM), which assumes a linear
   relationship between the luma sample values and their associated
   chroma samples.
   For intra prediction, [VVC] also applies a position-dependent
   prediction combination (PDPC) for refining the prediction samples
   closer to the intra prediction block boundary. Matrix-based intra
   prediction (MIP) modes are also used in [VVC]; they generate an intra
   prediction block of up to 8x8 samples using a weighted sum of
   downsampled neighboring reference samples, where the weights are
   hardcoded constants.

   Other coding-tool features

   [VVC] introduces dependent quantization (DQ) to reduce quantization
   error by state-based switching between two quantizers.

1.1.2.  Systems and Transport Interfaces

   [VVC] inherits the basic systems and transport interface designs
   from HEVC and H.264. These include the NAL-unit-based syntax
   structure, the hierarchical syntax and data unit structure, the
   Supplemental Enhancement Information (SEI) message mechanism, and the
   video buffering model based on the Hypothetical Reference Decoder
   (HRD). The scalability features of [VVC] are conceptually similar to
   the scalable variant of HEVC known as SHVC. The hierarchical syntax
   and data unit structure consists of parameter sets at various levels
   (decoder, sequence (pertaining to all layers), sequence (pertaining
   to a single layer), picture), picture-level header parameters,
   slice-level header parameters, and lower-level parameters.

   A number of key components that influenced the Network Abstraction
   Layer design of [VVC] as well as this memo are described below.

   Decoding Capability Information

   The decoding capability information includes parameters that stay
   constant for the lifetime of a video bitstream, which in IETF terms
   can translate to the lifetime of a session.
   Decoding capability information can include profile, level, and
   sub-profile information to determine a maximum-complexity
   interoperability point that is guaranteed never to be exceeded, even
   if splicing of video sequences occurs within a session. It further
   includes constraint flags, which can optionally be set to indicate
   that the video bitstream will be constrained in the use of certain
   features as indicated by the values of those flags. With this, a
   bitstream can be labelled as not using certain tools, which allows,
   among other things, for resource allocation in a decoder
   implementation.

   Video parameter set

   The Video Parameter Set (VPS) pertains to a Coded Video Sequence
   (CVS) of multiple layers covering the same range of picture units,
   and includes, among other information, decoding dependency expressed
   as information for reference picture set construction of enhancement
   layers. The VPS provides a "big picture" of a scalable sequence,
   including what types of operation points are provided, the profile,
   tier, and level of the operation points, and some other high-level
   properties of the bitstream that can be used as the basis for session
   negotiation and content selection, etc. One VPS may be referenced by
   one or more sequence parameter sets.

   Sequence parameter set

   The Sequence Parameter Set (SPS) contains syntax elements pertaining
   to a coded layer video sequence (CLVS), which is a group of pictures
   belonging to the same layer, starting with a random access point, and
   followed by pictures that may depend on each other and the random
   access point picture. In MPEG-2, the equivalent of a CVS was a
   Group of Pictures (GOP), which normally started with an I frame and
   was followed by P and B frames. While more complex in its options of
   random access points, VVC retains this basic concept.
   One notable difference in VVC is that a CLVS may start with a
   Gradual Decoding Refresh (GDR) picture, without requiring the
   presence of traditional random access points in the bitstream, such
   as Instantaneous Decoding Refresh (IDR) or Clean Random Access (CRA)
   pictures. In many TV-like applications, a CVS contains a few hundred
   milliseconds to a few seconds of video. In video conferencing
   (without switching MCUs involved), a CVS can be as long in duration
   as the whole session.

   Picture and Adaptation parameter set

   The Picture Parameter Set and the Adaptation Parameter Set (PPS and
   APS, respectively) carry information pertaining to zero or more
   pictures and zero or more slices, respectively. The PPS contains
   information that is likely to stay constant from picture to picture
   (at least for pictures of a certain type), whereas the APS contains
   information, such as adaptive loop filter coefficients, that is
   likely to change from picture to picture or even within a picture.
   A single APS can be referenced by the slices of the same picture if
   that APS contains information about luma mapping with chroma scaling
   (LMCS), whereas different APSs can be referenced by slices of the
   same picture if those APSs contain information about ALF.

   Picture Header

   A Picture Header contains information that is common to all slices
   that belong to the same picture. Being able to send that information
   as a separate NAL unit when pictures are split into several slices
   saves bitrate compared to repeating the same information in all
   slices. However, there might be scenarios where low-bitrate video is
   transmitted using a single slice per picture. Having a separate NAL
   unit to convey that information incurs overhead in such scenarios.
   Therefore, VVC specifies signaling that indicates whether Picture
   Headers are present in the CLVS or not.

   Profile, tier, and level

   The profile, tier, and level syntax structures in the DCI, VPS, and
   SPS contain profile, tier, and level information for all layers that
   refer to the DCI, for layers associated with one or more output
   layer sets specified by the VPS, and for any layer that refers to
   the SPS, respectively.

   Sub-Profiles

   Within the [VVC] specification, a sub-profile is a 32-bit number,
   coded according to ITU-T Rec. T.35, that does not carry semantics.
   It is carried in the profile_tier_level structure and hence
   (potentially) present in the DCI, VPS, and SPS. External
   registration bodies can register a T.35 codepoint with ITU-T
   registration authorities and associate with their registration a
   description of bitstream complexity restrictions beyond the profiles
   defined by ITU-T and ISO/IEC. This would allow encoder manufacturers
   to label the bitstreams generated by their encoder as complying with
   such a sub-profile. It is expected that upstream standardization
   organizations (such as DVB and ATSC), as well as walled-garden video
   services, will take advantage of this labelling system. In contrast
   to "normal" profiles, it is expected that sub-profiles may indicate
   encoder choices traditionally left open in the (decoder-centric)
   video coding specs, such as GOP structures, minimum/maximum QP
   values, and the mandatory use of certain tools or SEI messages.

   Constraint Flags

   The profile_tier_level structure carries a considerable number of
   constraint flags, which an encoder can use to indicate to a decoder
   that it will not use a certain tool or technology. They were
   included in reaction to a perceived market need for labelling a
   bitstream as not exercising a certain tool that has become
   commercially unviable.

   Temporal scalability support

      Editor notes: This section will need to be updated along with
      future VVC drafts.

   [VVC] includes support for temporal scalability through the
   signaling of TemporalId in the NAL unit header, the restriction that
   pictures of a particular temporal sub-layer cannot be used for inter
   prediction reference by pictures of a lower temporal sub-layer, the
   sub-bitstream extraction process, and the requirement that each
   sub-bitstream extraction output be a conforming bitstream.
   Media-Aware Network Elements (MANEs) can utilize the TemporalId in
   the NAL unit header for stream adaptation purposes based on temporal
   scalability.

   Spatial, SNR, View Scalability

   [VVC] includes support for spatial, SNR, and view scalability.
   Scalable video coding is widely considered to have technical benefits
   and to enrich services for various video applications. Until
   recently, however, the functionality has not been included in the
   main profiles of video codecs and has not been widely deployed due
   to additional costs. In VVC, however, all those forms of scalability
   are supported natively through the signaling of the layer_id in the
   NAL unit header, the VPS which associates layers with given
   layer_ids to each other, reference picture selection, reference
   picture resampling for spatial scalability, and a number of other
   mechanisms not relevant for this memo. Scalability support can be
   implemented in a single decoding "loop" and is widely considered a
   comparatively lightweight operation.

      Spatial Scalability

         Because Reference Picture Resampling (RPR) is part of the
         "main" profile of VVC, the additional burden for scalability
         support is just a minor modification of the high-level syntax
         (HLS). Technically, inter-layer prediction is employed in a
         scalable system to improve the coding efficiency of the
         enhancement layers.
         In addition to the spatial and temporal motion-compensated
         predictions that are available in a single-layer codec, the
         inter-layer prediction in [VVC] uses the resampled video data
         of the reconstructed reference picture from a reference layer
         to predict the current enhancement layer. The resampling
         process for inter-layer prediction is performed at the block
         level, without modifying the existing interpolation process for
         motion compensation compared to non-scalable RPR. This means
         that no additional resampling process is needed to support
         scalability.

      SNR Scalability

         SNR scalability is similar to spatial scalability except that
         the resampling factors are 1:1; in other words, there is no
         change in resolution, but there is inter-layer prediction.

   SEI Messages

   Supplemental Enhancement Information (SEI) messages are codepoints
   in the bitstream that do not influence the decoding process as
   specified in the [VVC] spec, but address issues of representation/
   rendering of the decoded bitstream, label the bitstream for certain
   applications, and perform other, similar tasks. The overall concept
   of SEI messages and many of the messages themselves have been
   inherited from the H.264 and HEVC specs. In the [VVC] environment,
   some of the SEI messages considered to be generally useful also in
   other video coding technologies have been moved out of the main
   specification into a companion document (TO DO: add reference once
   ITU designation is known).

1.1.3.  Parallel Processing Support (informative)

   Compared to HEVC, the [VVC] design to support parallelization offers
   numerous improvements. Some of those improvements are still
   undergoing changes in JVET. Information, to the extent relevant for
   this memo, will be added in future versions of this memo as the
   standardization in JVET progresses and the technology stabilizes.

      Editor notes: An update on sub-picture/slice/tile is needed
      following the new VVC draft.

1.1.4.  NAL Unit Header

   [VVC] maintains the NAL unit concept of HEVC with modifications. VVC
   uses a two-byte NAL unit header, as shown in Figure 1. The payload
   of a NAL unit refers to the NAL unit excluding the NAL unit header.

                  +---------------+---------------+
                  |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
                  |F|Z|  LayerId  |  Type   | TID |
                  +---------------+---------------+

                The Structure of the VVC NAL Unit Header.

                                Figure 1

   The semantics of the fields in the NAL unit header are as specified
   in [VVC] and described briefly below for convenience. In addition to
   the name and size of each field, the corresponding syntax element
   name in [VVC] is also provided.

   F: 1 bit

      forbidden_zero_bit. Required to be zero in VVC. Note that the
      inclusion of this bit in the NAL unit header was to enable
      transport of [VVC] video over MPEG-2 transport systems (avoidance
      of start code emulations) [MPEG2S]. In the context of this memo,
      the value 1 may be used to indicate a syntax violation, e.g., for
      a NAL unit resulting from aggregating a number of fragmentation
      units of a NAL unit but missing the last fragment, as described
      in Section TBD.

   Z: 1 bit

      nuh_reserved_zero_bit. Required to be zero in VVC, and reserved
      for future extensions by ITU-T and ISO/IEC.
      This memo does not overload the "Z" bit for local extensions,
      because a) overloading the "F" bit is sufficient and b) doing so
      preserves the usefulness of this memo for possible future
      versions of [VVC].

   LayerId: 6 bits

      nuh_layer_id. Identifies the layer a NAL unit belongs to, wherein
      a layer may be, e.g., a spatial scalable layer or a quality
      scalable layer.

   Type: 5 bits

      nal_unit_type. This field specifies the NAL unit type as defined
      in Table 7-1 of VVC.
      For a reference of all currently defined NAL unit types and their
      semantics, please refer to Section 7.4.2.2 in [VVC].

   TID: 3 bits

      nuh_temporal_id_plus1. This field specifies the temporal
      identifier of the NAL unit plus 1. The value of TemporalId is
      equal to TID minus 1. A TID value of 0 is illegal to ensure that
      there is at least one bit in the NAL unit header equal to 1, so
      as to enable independent consideration of start code emulations
      in the NAL unit header and in the NAL unit payload data.

1.2.  Overview of the Payload Format

   This payload format defines the following processes required for
   transport of [VVC] coded data over RTP [RFC3550]:

   o  Usage of the RTP header with this payload format

   o  Packetization of [VVC] coded NAL units into RTP packets using
      three types of payload structures: a single NAL unit packet, an
      aggregation packet, and a fragmentation unit

   o  Transmission of [VVC] NAL units of the same bitstream within a
      single RTP stream

   o  Media type parameters to be used with the Session Description
      Protocol (SDP) [RFC4566]

   o  Frame-marking mapping [FrameMarking]

2.  Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown above.

3.  Definitions and Abbreviations

3.1.  Definitions

   This document uses the terms and definitions of VVC. Section 3.1.1
   lists relevant definitions from [VVC] for convenience. Section 3.1.2
   provides definitions specific to this memo.

3.1.1.  Definitions from the VVC Specification

      Editor notes:

   Access unit (AU): A set of PUs that belong to different layers and
   contain coded pictures associated with the same time for output from
   the DPB.

   Adaptation parameter set (APS): A syntax structure containing syntax
   elements that apply to zero or more slices as determined by zero or
   more syntax elements found in slice headers.

   Bitstream: A sequence of bits, in the form of a NAL unit stream or a
   byte stream, that forms the representation of a sequence of AUs
   forming one or more coded video sequences (CVSs).

   Coded picture: A coded representation of a picture comprising VCL NAL
   units with a particular value of nuh_layer_id within an AU and
   containing all CTUs of the picture.

   Clean random access (CRA) PU: A PU in which the coded picture is a
   CRA picture.

   Clean random access (CRA) picture: An IRAP picture for which each VCL
   NAL unit has nal_unit_type equal to CRA_NUT.

   Coded video sequence (CVS): A sequence of AUs that consists, in
   decoding order, of a CVSS AU, followed by zero or more AUs that are
   not CVSS AUs, including all subsequent AUs up to but not including
   any subsequent AU that is a CVSS AU.

   Coded video sequence start (CVSS) AU: An AU in which there is a PU
   for each layer in the CVS and the coded picture in each PU is a CLVSS
   picture.

   Coded layer video sequence (CLVS): A sequence of PUs with the same
   value of nuh_layer_id that consists, in decoding order, of a CLVSS
   PU, followed by zero or more PUs that are not CLVSS PUs, including
   all subsequent PUs up to but not including any subsequent PU that is
   a CLVSS PU.

   Coded layer video sequence start (CLVSS) PU: A PU in which the coded
   picture is a CLVSS picture.

   Coded layer video sequence start (CLVSS) picture: A coded picture
   that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or
   a GDR picture with NoOutputBeforeRecoveryFlag equal to 1.
610 Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs 611 of chroma samples of a picture that has three sample arrays, or a CTB 612 of samples of a monochrome picture or a picture that is coded using 613 three separate colour planes and syntax structures used to code the 614 samples. 616 Decoding Capability Information (DCI): A syntax structure containing 617 syntax elements that apply to the entire bitstream. 619 Decoded picture buffer (DPB): A buffer holding decoded pictures for 620 reference, output reordering, or output delay specified for the 621 hypothetical reference decoder. 623 Gradual decoding refresh (GDR) picture: A picture for which each VCL 624 NAL unit has nal_unit_type equal to GDR_NUT. 626 Instantaneous decoding refresh (IDR) PU: A PU in which the coded 627 picture is an IDR picture. 629 Instantaneous decoding refresh (IDR) picture: An IRAP picture for 630 which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or 631 IDR_N_LP. 633 Intra random access point (IRAP) AU: An AU in which there is a PU for 634 each layer in the CVS and the coded picture in each PU is an IRAP 635 picture. 637 Intra random access point (IRAP) PU: A PU in which the coded picture 638 is an IRAP picture. 640 Intra random access point (IRAP) picture: A coded picture for which 641 all VCL NAL units have the same value of nal_unit_type in the range 642 of IDR_W_RADL to CRA_NUT, inclusive. 644 Layer: A set of VCL NAL units that all have a particular value of 645 nuh_layer_id and the associated non-VCL NAL units. 647 Network abstraction layer (NAL) unit: A syntax structure containing 648 an indication of the type of data to follow and bytes containing that 649 data in the form of an RBSP interspersed as necessary with emulation 650 prevention bytes. 652 Network abstraction layer (NAL) unit stream: A sequence of NAL units. 654 Operation point (OP): A temporal subset of an OLS, identified by an 655 OLS index and a highest value of TemporalId. 
657 Picture parameter set (PPS): A syntax structure containing syntax 658 elements that apply to zero or more entire coded pictures as 659 determined by a syntax element found in each slice header. 661 Picture unit (PU): A set of NAL units that are associated with each 662 other according to a specified classification rule, are consecutive 663 in decoding order, and contain exactly one coded picture. 665 Random access: The act of starting the decoding process for a 666 bitstream at a point other than the beginning of the stream. 668 Sequence parameter set (SPS): A syntax structure containing syntax 669 elements that apply to zero or more entire CLVSs as determined by the 670 content of a syntax element found in the PPS referred to by a syntax 671 element found in each picture header. 673 Slice: An integer number of complete tiles or an integer number of 674 consecutive complete CTU rows within a tile of a picture that are 675 exclusively contained in a single NAL unit. 677 Sub-layer: A temporal scalable layer of a temporal scalable bitstream 678 consisting of VCL NAL units with a particular value of the TemporalId 679 variable, and the associated non-VCL NAL units. 681 Subpicture: A rectangular region of one or more slices within a 682 picture. 684 Sub-layer representation: A subset of the bitstream consisting of NAL 685 units of a particular sub-layer and the lower sub-layers. 687 Tile: A rectangular region of CTUs within a particular tile column 688 and a particular tile row in a picture. 690 Tile column: A rectangular region of CTUs having a height equal to 691 the height of the picture and a width specified by syntax elements in 692 the picture parameter set. 694 Tile row: A rectangular region of CTUs having a height specified by 695 syntax elements in the picture parameter set and a width equal to the 696 width of the picture.
698 Video coding layer (VCL) NAL unit: A collective term for coded slice 699 NAL units and the subset of NAL units that have reserved values of 700 nal_unit_type that are classified as VCL NAL units in this 701 Specification. 703 3.1.2. Definitions Specific to This Memo 705 Media-Aware Network Element (MANE): A network element, such as a 706 middlebox, selective forwarding unit, or application-layer gateway 707 that is capable of parsing certain aspects of the RTP payload headers 708 or the RTP payload and reacting to their contents. 710 Editor notes: The following informative note needs to be updated along 711 with the frame marking update. 713 Informative note: The concept of a MANE goes beyond normal routers 714 or gateways in that a MANE has to be aware of the signaling (e.g., 715 to learn about the payload type mappings of the media streams), 716 and in that it has to be trusted when working with Secure RTP 717 (SRTP). The advantage of using MANEs is that they allow packets 718 to be dropped according to the needs of the media coding. For 719 example, if a MANE has to drop packets due to congestion on a 720 certain link, it can identify and remove those packets whose 721 elimination produces the least adverse effect on the user 722 experience. After dropping packets, MANEs must rewrite RTCP 723 packets to match the changes to the RTP stream, as specified in 724 Section 7 of [RFC3550]. 726 NAL unit decoding order: A NAL unit order that conforms to the 727 constraints on NAL unit order given in Section 7.4.2.4 in [VVC], 728 i.e., that follows the order of NAL units in the bitstream. 730 NAL unit output order: A NAL unit order in which NAL units of 731 different access units are in the output order of the decoded 732 pictures corresponding to the access units, as specified in [VVC], 733 and in which NAL units within an access unit are in their decoding 734 order. 736 RTP stream: See [RFC7656].
Within the scope of this memo, one RTP 737 stream is utilized to transport one or more temporal sub-layers. 739 Transmission order: The order of packets in ascending RTP sequence 740 number order (in modulo arithmetic). Within an aggregation packet, 741 the NAL unit transmission order is the same as the order of 742 appearance of NAL units in the packet. 744 3.2. Abbreviations 746 AU Access Unit 748 AP Aggregation Packet 750 CTU Coding Tree Unit 751 CVS Coded Video Sequence 753 DPB Decoded Picture Buffer 755 DCI Decoding capability information 757 DON Decoding Order Number 759 FIR Full Intra Request 761 FU Fragmentation Unit 763 HRD Hypothetical Reference Decoder 765 IDR Instantaneous Decoding Refresh 767 MANE Media-Aware Network Element 769 MTU Maximum Transfer Unit 771 NAL Network Abstraction Layer 773 NALU Network Abstraction Layer Unit 775 PLI Picture Loss Indication 777 PPS Picture Parameter Set 779 RPS Reference Picture Set 781 RPSI Reference Picture Selection Indication 783 SEI Supplemental Enhancement Information 785 SLI Slice Loss Indication 787 SPS Sequence Parameter Set 789 VCL Video Coding Layer 791 VPS Video Parameter Set 793 4. RTP Payload Format 794 4.1. RTP Header Usage 796 The format of the RTP header is specified in [RFC3550] (reprinted as 797 Figure 2 for convenience). This payload format uses the fields of 798 the header in a manner consistent with that specification. 800 The RTP payload (and the settings for some RTP header bits) for 801 aggregation packets and fragmentation units are specified in 802 Section 4.3.2 and Section 4.3.3, respectively. 
804 0 1 2 3 805 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 806 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 807 |V=2|P|X| CC |M| PT | sequence number | 808 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 809 | timestamp | 810 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 811 | synchronization source (SSRC) identifier | 812 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ 813 | contributing source (CSRC) identifiers | 814 | .... | 815 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 817 RTP Header According to [RFC3550] 819 Figure 2 821 The RTP header information to be set according to this RTP payload 822 format is set as follows: 824 Marker bit (M): 1 bit 826 Set for the last packet of the access unit, carried in the current 827 RTP stream. This is in line with the normal use of the M bit in 828 video formats to allow an efficient playout buffer handling. 830 Editor notes: The informative note below needs updating once 831 the NAL unit type table is stable in the [VVC] spec. 833 Informative note: The content of a NAL unit does not tell 834 whether or not the NAL unit is the last NAL unit, in decoding 835 order, of an access unit. An RTP sender implementation may 836 obtain this information from the video encoder. If, however, 837 the implementation cannot obtain this information directly from 838 the encoder, e.g., when the bitstream was pre-encoded, and also 839 there is no timestamp allocated for each NAL unit, then the 840 sender implementation can inspect subsequent NAL units in 841 decoding order to determine whether or not the NAL unit is the 842 last NAL unit of an access unit as follows. A NAL unit is 843 determined to be the last NAL unit of an access unit if it is 844 the last NAL unit of the bitstream.
A NAL unit naluX is also 845 determined to be the last NAL unit of an access unit if both 846 the following conditions are true: 1) the next VCL NAL unit 847 naluY in decoding order has the high-order bit of the first 848 byte after its NAL unit header equal to 1 or nal_unit_type 849 equal to 19, and 2) all NAL units between naluX and naluY, when 850 present, have nal_unit_type in the range of 13 to 17, inclusive, 851 equal to 20, equal to 23 or equal to 26. 853 Payload Type (PT): 7 bits 855 The assignment of an RTP payload type for this new packet format 856 is outside the scope of this document and will not be specified 857 here. The assignment of a payload type has to be performed either 858 through the profile used or in a dynamic way. 860 Sequence Number (SN): 16 bits 862 Set and used in accordance with [RFC3550]. 864 Timestamp: 32 bits 866 The RTP timestamp is set to the sampling timestamp of the content. 867 A 90 kHz clock rate MUST be used. If the NAL unit has no timing 868 properties of its own (e.g., parameter set and SEI NAL units), the 869 RTP timestamp MUST be set to the RTP timestamp of the coded 870 picture of the access unit in which the NAL unit (according to 871 Annex D of VVC) is included. Receivers MUST use the RTP timestamp 872 for the display process, even when the bitstream contains picture 873 timing SEI messages or decoding unit information SEI messages as 874 specified in VVC. 876 Synchronization source (SSRC): 32 bits 878 Used to identify the source of the RTP packets. A single SSRC is 879 used for all parts of a single bitstream. 881 4.2. Payload Header Usage 883 The first two bytes of the payload of an RTP packet are referred to 884 as the payload header. The payload header consists of the same 885 fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown 886 in Section 1.1.4, irrespective of the type of the payload structure.
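
Informative example: The two-byte payload header described above can be unpacked as sketched below in Python. This is an illustration only and not part of the payload format. The bit layout (F: 1 bit, Z: 1 bit, LayerId: 6 bits, Type: 5 bits, TID: 3 bits, most significant bit first) is assumed to match the NAL unit header referenced in Section 1.1.4, and the names PayloadHdr, parse_payload_hdr, and temporal_id are hypothetical.

```python
from typing import NamedTuple


class PayloadHdr(NamedTuple):
    f: int         # F bit, 1 bit
    z: int         # Z bit, 1 bit
    layer_id: int  # LayerId, 6 bits (width assumed from the NAL unit header)
    type: int      # Type, 5 bits
    tid: int       # TID (nuh_temporal_id_plus1), 3 bits


def parse_payload_hdr(payload: bytes) -> PayloadHdr:
    """Split the first two payload bytes into the five header fields."""
    if len(payload) < 2:
        raise ValueError("payload is shorter than the two-byte payload header")
    b0, b1 = payload[0], payload[1]
    hdr = PayloadHdr(
        f=b0 >> 7,
        z=(b0 >> 6) & 0x01,
        layer_id=b0 & 0x3F,
        type=b1 >> 3,
        tid=b1 & 0x07,
    )
    if hdr.tid == 0:
        # A TID value of 0 is illegal in the NAL unit header.
        raise ValueError("TID value of 0 is illegal")
    return hdr


def temporal_id(hdr: PayloadHdr) -> int:
    """TemporalId is equal to TID minus 1."""
    return hdr.tid - 1
```

A receiver or MANE could call parse_payload_hdr() on every RTP payload before looking at anything else, since the same two-byte layout applies to all payload structures.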
888 The TID value indicates (among other things) the relative importance 889 of an RTP packet, for example, because NAL units belonging to higher 890 temporal sub-layers are not used for the decoding of lower temporal 891 sub-layers. A lower value of TID indicates a higher importance. 892 More-important NAL units MAY be better protected against transmission 893 losses than less-important NAL units. 895 For Discussion: quite possibly something similar can be said for 896 the Layer_id in layered coding, but perhaps not in multiview 897 coding. (The relevant part of the spec is relatively new, 898 therefore the soft language). However, for serious layer pruning, 899 interpretation of the VPS is required. We can add language about 900 the need for stateful interpretation of LayerID vis-a-vis 901 stateless interpretation of TID later. 903 4.3. Payload Structures 905 Three different types of RTP packet payload structures are specified. 906 A receiver can identify the type of an RTP packet payload through the 907 Type field in the payload header. 909 The three different payload structures are as follows: 911 o Single NAL unit packet: Contains a single NAL unit in the payload, 912 and the NAL unit header of the NAL unit also serves as the payload 913 header. This payload structure is specified in Section 4.3.1. 915 o Aggregation Packet (AP): Contains more than one NAL unit within 916 one access unit. This payload structure is specified in 917 Section 4.3.2. 919 o Fragmentation Unit (FU): Contains a subset of a single NAL unit. 920 This payload structure is specified in Section 4.3.3. 922 4.3.1.
Single NAL Unit Packets 924 Editor notes: it is better to add a section to describe DONL and 925 sprop-max-don-diff 927 A single NAL unit packet contains exactly one NAL unit, and consists 928 of a payload header (denoted as PayloadHdr), a conditional 16-bit 929 DONL field (in network byte order), and the NAL unit payload data 930 (the NAL unit excluding its NAL unit header) of the contained NAL 931 unit, as shown in Figure 3. 933 0 1 2 3 934 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 935 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 936 | PayloadHdr | DONL (conditional) | 937 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 938 | | 939 | NAL unit payload data | 940 | | 941 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 942 | :...OPTIONAL RTP padding | 943 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 945 The Structure of a Single NAL Unit Packet 947 Figure 3 949 The DONL field, when present, specifies the value of the 16 least 950 significant bits of the decoding order number of the contained NAL 951 unit. If sprop-max-don-diff is greater than 0 for any of the RTP 952 streams, the DONL field MUST be present, and the variable DON for the 953 contained NAL unit is derived as equal to the value of the DONL 954 field. Otherwise (sprop-max-don-diff is equal to 0 for all the RTP 955 streams), the DONL field MUST NOT be present. 957 4.3.2. Aggregation Packets (APs) 959 Aggregation Packets (APs) can reduce packetization overhead for 960 small NAL units, such as most of the non-VCL NAL units, which are 961 often only a few octets in size. 963 An AP aggregates NAL units of one access unit. Each NAL unit to be 964 carried in an AP is encapsulated in an aggregation unit. NAL units 965 aggregated in one AP are included in NAL unit decoding order. 967 An AP consists of a payload header (denoted as PayloadHdr) followed 968 by two or more aggregation units, as shown in Figure 4.
970 0 1 2 3 971 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 972 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 973 | PayloadHdr (Type=28) | | 974 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 975 | | 976 | two or more aggregation units | 977 | | 978 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 979 | :...OPTIONAL RTP padding | 980 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 982 The Structure of an Aggregation Packet 984 Figure 4 986 The fields in the payload header of an AP are set as follows. The F 987 bit MUST be equal to 0 if the F bit of each aggregated NAL unit is 988 equal to zero; otherwise, it MUST be equal to 1. The Type field MUST 989 be equal to 28. 991 The value of LayerId MUST be equal to the lowest value of LayerId of 992 all the aggregated NAL units. The value of TID MUST be the lowest 993 value of TID of all the aggregated NAL units. 995 Informative note: All VCL NAL units in an AP have the same TID 996 value since they belong to the same access unit. However, an AP 997 may contain non-VCL NAL units for which the TID value in the NAL 998 unit header may be different than the TID value of the VCL NAL 999 units in the same AP. 1001 An AP MUST carry at least two aggregation units and can carry as many 1002 aggregation units as necessary; however, the total amount of data in 1003 an AP obviously MUST fit into an IP packet, and the size SHOULD be 1004 chosen so that the resulting IP packet is smaller than the MTU size 1005 so to avoid IP layer fragmentation. An AP MUST NOT contain FUs 1006 specified in Section 4.3.3. APs MUST NOT be nested; i.e., an AP can 1007 not contain another AP. 
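
Informative example: The rules above for setting the fields of the AP payload header can be sketched in Python as follows. This is an illustration under stated assumptions, not a normative procedure: the two-byte header layout (F: 1 bit, Z: 1 bit, LayerId: 6 bits, Type: 5 bits, TID: 3 bits) is assumed from the NAL unit header, the Z bit of the AP header is set to 0 because the text above does not constrain it, and the function name ap_payload_hdr is hypothetical.

```python
from typing import Sequence

AP_TYPE = 28  # Type value of an aggregation packet
FU_TYPE = 29  # Type value of a fragmentation unit (Section 4.3.3)


def ap_payload_hdr(nal_units: Sequence[bytes]) -> bytes:
    """Derive the two-byte AP payload header from the aggregated NAL units."""
    if len(nal_units) < 2:
        raise ValueError("an AP MUST carry at least two aggregation units")
    f = 0
    layer_id = 0x3F  # start from the largest 6-bit value and take the minimum
    tid = 0x07       # start from the largest 3-bit value and take the minimum
    for nalu in nal_units:
        if len(nalu) < 2:
            raise ValueError("NAL unit is shorter than its two-byte header")
        b0, b1 = nalu[0], nalu[1]
        if (b1 >> 3) in (AP_TYPE, FU_TYPE):
            raise ValueError("an AP MUST NOT contain FUs or other APs")
        f |= b0 >> 7                         # F is 1 if any aggregated F bit is 1
        layer_id = min(layer_id, b0 & 0x3F)  # lowest LayerId of all NAL units
        tid = min(tid, b1 & 0x07)            # lowest TID of all NAL units
    # The Z bit is left as 0 in this sketch (assumption).
    return bytes([(f << 7) | layer_id, (AP_TYPE << 3) | tid])
```

A sender would still have to check separately that the complete AP, including the size fields of the aggregation units, fits the path MTU.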
1009 The first aggregation unit in an AP consists of a conditional 16-bit 1010 DONL field (in network byte order) followed by a 16-bit unsigned size 1011 information (in network byte order) that indicates the size of the 1012 NAL unit in bytes (excluding these two octets, but including the NAL 1013 unit header), followed by the NAL unit itself, including its NAL unit 1014 header, as shown in Figure 5. 1016 0 1 2 3 1017 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1018 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1019 | : DONL (conditional) | NALU size | 1020 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1021 | NALU size | | 1022 +-+-+-+-+-+-+-+-+ NAL unit | 1023 | | 1024 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1025 | : 1026 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1028 The Structure of the First Aggregation Unit in an AP 1030 Figure 5 1032 The DONL field, when present, specifies the value of the 16 least 1033 significant bits of the decoding order number of the aggregated NAL 1034 unit. 1036 If sprop-max-don-diff is greater than 0 for any of the RTP streams, 1037 the DONL field MUST be present in an aggregation unit that is the 1038 first aggregation unit in an AP, and the variable DON for the 1039 aggregated NAL unit is derived as equal to the value of the DONL 1040 field. Otherwise (sprop-max-don-diff is equal to 0 for all the RTP 1041 streams), the DONL field MUST NOT be present in an aggregation unit 1042 that is the first aggregation unit in an AP. 1044 An aggregation unit that is not the first aggregation unit in an AP 1045 will be followed immediately by a 16-bit unsigned size information 1046 (in network byte order) that indicates the size of the NAL unit in 1047 bytes (excluding these two octets, but including the NAL unit 1048 header), followed by the NAL unit itself, including its NAL unit 1049 header, as shown in Figure 6. 
1051 0 1 2 3 1052 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1053 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1054 | : NALU size | NAL unit | 1055 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 1056 | | 1057 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1058 | : 1059 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1061 The Structure of an Aggregation Unit That Is Not the First 1062 Aggregation Unit in an AP 1064 Figure 6 1066 Figure 7 presents an example of an AP that contains two aggregation 1067 units, labeled as 1 and 2 in the figure, without the DONL field being 1068 present. 1070 0 1 2 3 1071 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1072 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1073 | RTP Header | 1074 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1075 | PayloadHdr (Type=28) | NALU 1 Size | 1076 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1077 | NALU 1 HDR | | 1078 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 1 Data | 1079 | . . . | 1080 | | 1081 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1082 | . . . | NALU 2 Size | NALU 2 HDR | 1083 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1084 | NALU 2 HDR | | 1085 +-+-+-+-+-+-+-+-+ NALU 2 Data | 1086 | . . . | 1087 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1088 | :...OPTIONAL RTP padding | 1089 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1091 An Example of an AP Packet Containing 1092 Two Aggregation Units without the DONL Field 1094 Figure 7 1096 Figure 8 presents an example of an AP that contains two aggregation 1097 units, labeled as 1 and 2 in the figure, with the DONL field being 1098 present. 
1100 0 1 2 3 1101 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1102 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1103 | RTP Header | 1104 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1105 | PayloadHdr (Type=28) | NALU 1 DONL | 1106 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1107 | NALU 1 Size | NALU 1 HDR | 1108 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1109 | | 1110 | NALU 1 Data . . . | 1111 | | 1112 + . . . +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1113 | : NALU 2 Size | 1114 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1115 | NALU 2 HDR | | 1116 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 2 Data | 1117 | | 1118 | . . . +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1119 | :...OPTIONAL RTP padding | 1120 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1122 An Example of an AP Containing 1123 Two Aggregation Units with the DONL Field 1125 Figure 8 1127 4.3.3. Fragmentation Units 1129 Fragmentation Units (FUs) are introduced to enable fragmenting a 1130 single NAL unit into multiple RTP packets, possibly without 1131 cooperation or knowledge of the [VVC] encoder. A fragment of a NAL 1132 unit consists of an integer number of consecutive octets of that NAL 1133 unit. Fragments of the same NAL unit MUST be sent in consecutive 1134 order with ascending RTP sequence numbers (with no other RTP packets 1135 within the same RTP stream being sent between the first and last 1136 fragment). 1138 When a NAL unit is fragmented and conveyed within FUs, it is referred 1139 to as a fragmented NAL unit. APs MUST NOT be fragmented. FUs MUST 1140 NOT be nested; i.e., an FU can not contain a subset of another FU. 1142 The RTP timestamp of an RTP packet carrying an FU is set to the NALU- 1143 time of the fragmented NAL unit. 
1145 An FU consists of a payload header (denoted as PayloadHdr), an FU 1146 header of one octet, a conditional 16-bit DONL field (in network byte 1147 order), and an FU payload, as shown in Figure 9. 1149 0 1 2 3 1150 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1151 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1152 | PayloadHdr (Type=29) | FU header | DONL (cond) | 1153 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| 1154 | DONL (cond) | | 1155 |-+-+-+-+-+-+-+-+ | 1156 | FU payload | 1157 | | 1158 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1159 | :...OPTIONAL RTP padding | 1160 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1162 The Structure of an FU 1164 Figure 9 1166 The fields in the payload header are set as follows. The Type field 1167 MUST be equal to 29. The fields F, LayerId, and TID MUST be equal to 1168 the fields F, LayerId, and TID, respectively, of the fragmented NAL 1169 unit. 1171 The FU header consists of an S bit, an E bit, an R bit and a 5-bit 1172 FuType field, as shown in Figure 10. 1174 +---------------+ 1175 |0|1|2|3|4|5|6|7| 1176 +-+-+-+-+-+-+-+-+ 1177 |S|E|R| FuType | 1178 +---------------+ 1180 The Structure of FU Header 1182 Figure 10 1184 The semantics of the FU header fields are as follows: 1186 S: 1 bit 1188 When set to 1, the S bit indicates the start of a fragmented NAL 1189 unit, i.e., the first byte of the FU payload is also the first 1190 byte of the payload of the fragmented NAL unit. When the FU 1191 payload is not the start of the fragmented NAL unit payload, the S 1192 bit MUST be set to 0. 1194 E: 1 bit 1195 When set to 1, the E bit indicates the end of a fragmented NAL 1196 unit, i.e., the last byte of the payload is also the last byte of 1197 the fragmented NAL unit. When the FU payload is not the last 1198 fragment of a fragmented NAL unit, the E bit MUST be set to 0.
1200 Reserved: 1 bit 1202 Placeholder 1204 FuType: 5 bits 1206 The field FuType MUST be equal to the field Type of the fragmented 1207 NAL unit. 1209 The DONL field, when present, specifies the value of the 16 least 1210 significant bits of the decoding order number of the fragmented NAL 1211 unit. 1213 If sprop-max-don-diff is greater than 0 for any of the RTP streams, 1214 and the S bit is equal to 1, the DONL field MUST be present in the 1215 FU, and the variable DON for the fragmented NAL unit is derived as 1216 equal to the value of the DONL field. Otherwise (sprop-max-don-diff 1217 is equal to 0 for all the RTP streams, or the S bit is equal to 0), 1218 the DONL field MUST NOT be present in the FU. 1220 A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e., 1221 the Start bit and End bit must not both be set to 1 in the same FU 1222 header. 1224 The FU payload consists of fragments of the payload of the fragmented 1225 NAL unit so that if the FU payloads of consecutive FUs, starting with 1226 an FU with the S bit equal to 1 and ending with an FU with the E bit 1227 equal to 1, are sequentially concatenated, the payload of the 1228 fragmented NAL unit can be reconstructed. The NAL unit header of the 1229 fragmented NAL unit is not included as such in the FU payload, but 1230 rather the information of the NAL unit header of the fragmented NAL 1231 unit is conveyed in F, LayerId, and TID fields of the FU payload 1232 headers of the FUs and the FuType field of the FU header of the FUs. 1233 An FU payload MUST NOT be empty. 1235 If an FU is lost, the receiver SHOULD discard all following 1236 fragmentation units in transmission order corresponding to the same 1237 fragmented NAL unit, unless the decoder in the receiver is known to 1238 be prepared to gracefully handle incomplete NAL units. 
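
Informative example: The FU rules above can be sketched in Python as a pair of helper functions, one splitting a NAL unit into FU packet payloads and one rebuilding it. This is a minimal illustration and not a definitive implementation. It assumes that sprop-max-don-diff is equal to 0 (so no DONL field is present), that the reserved R bit is sent as 0, that the Z bit of the NAL unit header is copied into the FU payload header, and that the receiver gets all FUs of the NAL unit in transmission order with none lost. The function names fragment_nal_unit and reassemble_nal_unit and the parameter max_fu_payload are hypothetical.

```python
from typing import List, Sequence

FU_TYPE = 29  # Type value of a fragmentation unit


def fragment_nal_unit(nalu: bytes, max_fu_payload: int) -> List[bytes]:
    """Split one NAL unit into FU packet payloads (without DONL)."""
    if max_fu_payload < 1:
        raise ValueError("an FU payload MUST NOT be empty")
    b0, b1 = nalu[0], nalu[1]
    body = nalu[2:]  # the NAL unit header itself is not carried in FU payloads
    chunks = [body[i:i + max_fu_payload]
              for i in range(0, len(body), max_fu_payload)]
    if len(chunks) < 2:
        # S and E must not both be set in one FU header.
        raise ValueError("use a single NAL unit packet instead of one FU")
    # Payload header: Type = 29; F, LayerId, and TID copied from the NAL unit.
    payload_hdr = bytes([b0, (FU_TYPE << 3) | (b1 & 0x07)])
    fu_type = b1 >> 3  # FuType equals the Type field of the fragmented NAL unit
    packets = []
    for i, chunk in enumerate(chunks):
        s = 1 if i == 0 else 0
        e = 1 if i == len(chunks) - 1 else 0
        fu_header = (s << 7) | (e << 6) | fu_type  # R bit (bit 2) left as 0
        packets.append(payload_hdr + bytes([fu_header]) + chunk)
    return packets


def reassemble_nal_unit(fus: Sequence[bytes]) -> bytes:
    """Rebuild the NAL unit from all of its FUs, given in transmission order."""
    first, last = fus[0], fus[-1]
    if not first[2] & 0x80:
        raise ValueError("first FU does not have the S bit set")
    if not last[2] & 0x40:
        raise ValueError("last FU does not have the E bit set")
    # Restore the NAL unit header: Type comes from FuType, the rest from
    # the FU payload header.
    b0 = first[0]
    b1 = ((first[2] & 0x1F) << 3) | (first[1] & 0x07)
    return bytes([b0, b1]) + b"".join(fu[3:] for fu in fus)
```

A real receiver would additionally track RTP sequence numbers to detect a lost FU and then either discard the remaining fragments or hand an incomplete NAL unit to a decoder that is known to tolerate it.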
1240 A receiver in an endpoint or in a MANE MAY aggregate the first n-1 1241 fragments of a NAL unit to an (incomplete) NAL unit, even if fragment 1242 n of that NAL unit is not received. In this case, the 1243 forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a 1244 syntax violation. 1246 4.4. Decoding Order Number 1248 For each NAL unit, the variable AbsDon is derived, representing the 1249 decoding order number that is indicative of the NAL unit decoding 1250 order. 1252 Let NAL unit n be the n-th NAL unit in transmission order within an 1253 RTP stream. 1255 If sprop-max-don-diff is equal to 0 for all the RTP streams carrying 1256 the [VVC] bitstream, AbsDon[n], the value of AbsDon for NAL unit n, 1257 is derived as equal to n. 1259 Otherwise (sprop-max-don-diff is greater than 0 for any of the RTP 1260 streams), AbsDon[n] is derived as follows, where DON[n] is the value 1261 of the variable DON for NAL unit n: 1263 o If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in 1264 transmission order), AbsDon[0] is set equal to DON[0]. 1266 o Otherwise (n is greater than 0), the following applies for 1267 derivation of AbsDon[n]: 1269 If DON[n] == DON[n-1], 1270 AbsDon[n] = AbsDon[n-1] 1272 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768), 1273 AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1] 1275 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768), 1276 AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n] 1278 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768), 1279 AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - 1280 DON[n]) 1282 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768), 1283 AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n]) 1285 For any two NAL units m and n, the following applies: 1287 o AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows 1288 NAL unit m in NAL unit decoding order. 1290 o When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order 1291 of the two NAL units can be in either order. 
1293 o AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes 1294 NAL unit m in NAL unit decoding order. 1296 Informative note: When two consecutive NAL units in the NAL 1297 unit decoding order have different values of AbsDon, the 1298 absolute difference between the two AbsDon values may be 1299 greater than or equal to 1. 1301 Informative note: There are multiple reasons to allow for the 1302 absolute difference of the values of AbsDon for two consecutive 1303 NAL units in the NAL unit decoding order to be greater than 1304 one. An increment by one is not required, as at the time of 1305 associating values of AbsDon to NAL units, it may not be known 1306 whether all NAL units are to be delivered to the receiver. For 1307 example, a gateway might not forward VCL NAL units of higher 1308 sub-layers or some SEI NAL units when there is congestion in 1309 the network. In another example, the first intra-coded picture 1310 of a pre-encoded clip is transmitted in advance to ensure that 1311 it is readily available in the receiver, and when transmitting 1312 the first intra-coded picture, the originator does not exactly 1313 know how many NAL units will be encoded before the first 1314 intra-coded picture of the pre-encoded clip follows in decoding 1315 order. Thus, the values of AbsDon for the NAL units of the 1316 first intra-coded picture of the pre-encoded clip have to be 1317 estimated when they are transmitted, and gaps in values of 1318 AbsDon may occur. 1320 5. Packetization Rules 1322 The following packetization rules apply: 1324 o If sprop-max-don-diff is greater than 0 for any of the RTP 1325 streams, the transmission order of NAL units carried in the RTP 1326 stream MAY be different than the NAL unit decoding order and the 1327 NAL unit output order.
1329 o A NAL unit of a small size SHOULD be encapsulated in an 1330 aggregation packet together with one or more other NAL units in order 1331 to avoid the unnecessary packetization overhead for small NAL 1332 units. For example, non-VCL NAL units such as access unit 1333 delimiters, parameter sets, or SEI NAL units are typically small 1334 and can often be aggregated with VCL NAL units without violating 1335 MTU size constraints. 1337 o Each non-VCL NAL unit SHOULD, when possible from an MTU size match 1338 viewpoint, be encapsulated in an aggregation packet together with 1339 its associated VCL NAL unit, as typically a non-VCL NAL unit would 1340 be meaningless without the associated VCL NAL unit being 1341 available. 1343 o For carrying exactly one NAL unit in an RTP packet, a single NAL 1344 unit packet MUST be used. 1346 6. De-packetization Process 1348 The general concept behind de-packetization is to get the NAL units 1349 out of the RTP packets in an RTP stream and pass them to the decoder 1350 in the NAL unit decoding order. 1352 The de-packetization process is implementation dependent. Therefore, 1353 the following description should be seen as an example of a suitable 1354 implementation. Other schemes may be used as well, as long as the 1355 output for the same input is the same as the process described below. 1356 The output is the same when the set of output NAL units and their 1357 order are both identical. Optimizations relative to the described 1358 algorithms are possible. 1360 All normal RTP mechanisms related to buffer management apply. In 1361 particular, duplicated or outdated RTP packets (as indicated by the 1362 RTP sequence number and the RTP timestamp) are removed. To 1363 determine the exact time for decoding, factors such as a possible 1364 intentional delay to allow for proper inter-stream synchronization 1365 MUST be factored in. 1367 NAL units with NAL unit type values in the range of 0 to 27, 1368 inclusive, may be passed to the decoder.
NAL-unit-like structures 1369 with NAL unit type values in the range of 28 to 31, inclusive, MUST 1370 NOT be passed to the decoder. 1372 The receiver includes a receiver buffer, which is used to compensate 1373 for transmission delay jitter within individual RTP streams and 1374 across RTP streams, and to reorder NAL units from transmission order to 1375 the NAL unit decoding order. In this section, the receiver operation 1376 is described under the assumption that there is no transmission delay 1377 jitter within an RTP stream and across RTP streams. To distinguish it 1378 from a practical receiver buffer that is also used for 1379 compensation of transmission delay jitter, the receiver buffer is 1380 hereafter called the de-packetization buffer in this section. 1381 Receivers should also prepare for transmission delay jitter; that is, 1382 either reserve separate buffers for transmission delay jitter 1383 buffering and de-packetization buffering or use a receiver buffer for 1384 both transmission delay jitter and de-packetization. Moreover, 1385 receivers should take transmission delay jitter into account in the 1386 buffering operation, e.g., by additional initial buffering before 1387 the start of decoding and playback. 1389 When sprop-max-don-diff is equal to 0 for all the received RTP 1390 streams, the de-packetization buffer size is zero bytes, and the 1391 process described in the remainder of this paragraph applies. 1392 The NAL units carried in the single RTP stream are directly passed to 1393 the decoder in their transmission order, which is identical to their 1394 decoding order. When there are several NAL units of the same RTP 1395 stream with the same NTP timestamp, the order to pass them to the 1396 decoder is their transmission order. 1398 Informative note: The mapping between RTP and NTP timestamps is 1399 conveyed in RTCP SR packets.
      In addition, the mechanisms for faster media timestamp
      synchronization discussed in [RFC6051] may be used to speed up
      the acquisition of the RTP-to-wall-clock mapping.

   When sprop-max-don-diff is greater than 0 for any of the received
   RTP streams, the process described in the remainder of this section
   applies.

   There are two buffering states in the receiver: initial buffering
   and buffering while playing. Initial buffering starts when the
   reception is initialized. After initial buffering, decoding and
   playback are started, and the buffering-while-playing mode is used.

   Regardless of the buffering state, the receiver stores incoming NAL
   units, in reception order, into the de-packetization buffer. NAL
   units carried in RTP packets are stored in the de-packetization
   buffer individually, and the value of AbsDon is calculated and
   stored for each NAL unit.

   Initial buffering lasts until condition A (the difference between
   the greatest and smallest AbsDon values of the NAL units in the
   de-packetization buffer is greater than or equal to the value of
   sprop-max-don-diff) or condition B (the number of NAL units in the
   de-packetization buffer is greater than the value of
   sprop-depack-buf-nalus) is true.

   After initial buffering, whenever condition A or condition B is
   true, the following operation is repeatedly applied until both
   condition A and condition B become false:

   o  The NAL unit in the de-packetization buffer with the smallest
      value of AbsDon is removed from the de-packetization buffer and
      passed to the decoder.

   When no more NAL units are flowing into the de-packetization buffer,
   all NAL units remaining in the de-packetization buffer are removed
   from the buffer and passed to the decoder in the order of increasing
   AbsDon values.

7. Payload Format Parameters

   Placeholder

8.
Use with Feedback Messages

   The following subsections define the use of the Picture Loss
   Indication (PLI), Slice Loss Indication (SLI), Reference Picture
   Selection Indication (RPSI), and Full Intra Request (FIR) feedback
   messages with VVC. The PLI, SLI, and RPSI messages are defined in
   [RFC4585], and the FIR message is defined in [RFC5104].

8.1. Picture Loss Indication (PLI)

   As specified in RFC 4585, Section 6.3.1, the reception of a PLI by a
   media sender indicates "the loss of an undefined amount of coded
   video data belonging to one or more pictures". Without having any
   specific knowledge of the setup of the bitstream (such as use and
   location of in-band parameter sets, non-IRAP decoder refresh points,
   picture structures, and so forth), a reaction to the reception of a
   PLI by a [VVC] sender SHOULD be to send an IRAP picture and relevant
   parameter sets, potentially with sufficient redundancy so as to
   ensure correct reception. However, sometimes information about the
   bitstream structure is known. For example, state could have been
   established outside of the mechanisms defined in this document that
   parameter sets are conveyed out of band only, and stay static for
   the duration of the session. In that case, it is unnecessary to
   send them in-band as a result of the reception of a PLI. Other
   examples could be devised based on a priori knowledge of different
   aspects of the bitstream structure. In all cases, the timing and
   congestion control mechanisms of RFC 4585 MUST be observed.

8.2. Slice Loss Indication (SLI)

   For further study. This subsection may be removed, as there are no
   known implementations of SLI in [HEVC]-based systems.

8.3.
Reference Picture Selection Indication (RPSI)

   Feedback-based reference picture selection has been shown to be a
   powerful tool to stop temporal error propagation for improved error
   resilience [Girod99] [Wang05]. In one approach, the decoder side
   tracks errors in the decoded pictures and informs the encoder side
   that a particular picture that has been decoded relatively earlier
   is correct and still present in the decoded picture buffer; it
   requests the encoder to use that correct picture-availability
   information when encoding the next picture, so as to stop further
   temporal error propagation. For this approach, the decoder side
   should use the RPSI feedback message.

   Encoders can encode some long-term reference pictures as specified
   in [VVC] for purposes described in the previous paragraph without
   the need of a huge decoded picture buffer. As shown in [Wang05],
   with a flexible reference picture management scheme, as in VVC, even
   a decoded picture buffer size of two picture storage buffers would
   work for the approach described in the previous paragraph.

   The text above is copied from RFC 7798. If we keep the RPSI
   message, it needs adaptation to the [VVC] syntax. Doing so should
   not be too hard, as the [VVC] reference picture mechanism is not too
   different from the [HEVC] one.

8.4. Full Intra Request (FIR)

   The purpose of the FIR message is to force an encoder to send an
   independent decoder refresh point as soon as possible, while
   observing applicable congestion-control-related constraints, such as
   those set out in [RFC8082].

   Upon reception of a FIR, a sender MUST send an IDR picture.
   Parameter sets MUST also be sent, except when there is a priori
   knowledge that the parameter sets have been correctly established.
   A typical example for that is an understanding between sender and
   receiver, established by means outside this document, that parameter
   sets are exclusively sent out-of-band.

9. Frame marking

   placeholder

10. Security Considerations

   The scope of this Security Considerations section is limited to the
   payload format itself and to one feature of [VVC] that may pose a
   particularly serious security risk if implemented naively. The
   payload format, in isolation, does not form a complete system.
   Implementers are advised to read and understand relevant security-
   related documents, especially those pertaining to RTP (see the
   Security Considerations section in [RFC3550]), and the security of
   the call-control stack chosen (that may make use of the media type
   registration of this memo). Implementers should also consider known
   security vulnerabilities of video coding and decoding
   implementations in general and avoid those.

   Within this RTP payload format, and with the exception of the user
   data SEI message as described below, no security threats other than
   those common to RTP payload formats are known. In other words,
   neither the various media-plane-based mechanisms, nor the signaling
   part of this memo, seems to pose a security risk beyond those common
   to all RTP-based systems.

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [RFC3550], and in any applicable RTP profile such as
   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or
   RTP/SAVPF [RFC5124].
   However, as "Securing the RTP Framework: Why RTP Does Not Mandate a
   Single Media Security Solution" [RFC7202] discusses, it is not an
   RTP payload format's responsibility to discuss or mandate what
   solutions are used to meet the basic security goals like
   confidentiality, integrity, and source authenticity for RTP in
   general. This responsibility lies with anyone using RTP in an
   application. They can find guidance on available security
   mechanisms and important considerations in "Options for Securing RTP
   Sessions" [RFC7201]. The rest of this section discusses the
   security-impacting properties of the payload format itself.

   Because the data compression used with this payload format is
   applied end-to-end, any encryption needs to be performed after
   compression. A potential denial-of-service threat exists for data
   encodings using compression techniques that have non-uniform
   receiver-end computational load. The attacker can inject
   pathological datagrams into the bitstream that are complex to decode
   and that cause the receiver to be overloaded. [VVC] is particularly
   vulnerable to such attacks, as it is extremely simple to generate
   datagrams containing NAL units that affect the decoding process of
   many future NAL units. Therefore, the usage of data origin
   authentication and data integrity protection of at least the RTP
   packet is RECOMMENDED, for example, with SRTP [RFC3711].

   Like HEVC [RFC7798], [VVC] includes a user data Supplemental
   Enhancement Information (SEI) message. This SEI message allows
   inclusion of an arbitrary bitstring into the video bitstream. Such
   a bitstring could include JavaScript, machine code, and other active
   content. [VVC] leaves the handling of this SEI message to the
   receiving system. In order to avoid harmful side effects of the
   user data SEI message, decoder implementations cannot naively trust
   its content.
   For example, it would be a bad and insecure implementation practice
   to forward any JavaScript a decoder implementation detects to a web
   browser. The safest way to deal with user data SEI messages is to
   simply discard them, but that can have negative side effects on the
   quality of experience by the user.

   End-to-end security with authentication, integrity, or
   confidentiality protection will prevent a MANE from performing
   media-aware operations other than discarding complete packets. In
   the case of confidentiality protection, it will even be prevented
   from discarding packets in a media-aware way. To be allowed to
   perform such operations, a MANE is required to be a trusted entity
   that is included in the security context establishment.

11. Congestion Control

   Congestion control for RTP SHALL be used in accordance with RTP
   [RFC3550] and with any applicable RTP profile, e.g., AVP [RFC3551].
   If best-effort service is being used, an additional requirement is
   that users of this payload format MUST monitor packet loss to ensure
   that the packet loss rate is within an acceptable range. Packet
   loss is considered acceptable if a TCP flow across the same network
   path, and experiencing the same network conditions, would achieve an
   average throughput, measured on a reasonable timescale, that is not
   less than all RTP streams combined are achieving. This condition
   can be satisfied by implementing congestion-control mechanisms to
   adapt the transmission rate, the number of layers subscribed for a
   layered multicast session, or by arranging for a receiver to leave
   the session if the loss rate is unacceptably high.

   The bitrate adaptation necessary for obeying the congestion control
   principle is easily achievable when real-time encoding is used, for
   example, by adequately tuning the quantization parameter.
   However, when pre-encoded content is being transmitted, bandwidth
   adaptation requires the pre-coded bitstream to be tailored for such
   adaptivity. The key mechanisms available in [VVC] are temporal
   scalability and spatial/SNR scalability. A media sender can remove
   NAL units belonging to higher temporal sub-layers (i.e., those NAL
   units with a high value of TID) or higher spatio-SNR layers (as
   indicated by interpreting the VPS) until the sending bitrate drops
   to an acceptable range.

   The mechanisms mentioned above generally work within a defined
   profile and level and, therefore, no renegotiation of the channel is
   required. Only when non-downgradable parameters (such as profile)
   are required to be changed does it become necessary to terminate and
   restart the RTP stream(s). This may be accomplished by using
   different RTP payload types.

   MANEs MAY remove certain unusable packets from the RTP stream when
   that RTP stream was damaged due to previous packet losses. This can
   help reduce the network load in certain special cases. For example,
   MANEs can remove those FUs where the leading FUs belonging to the
   same NAL unit have been lost or those dependent slice segments when
   the leading slice segments belonging to the same slice have been
   lost, because the trailing FUs or dependent slice segments are
   meaningless to most decoders. MANEs can also remove higher temporal
   scalable layers if the outbound transmission (from the MANE's
   viewpoint) experiences congestion.

12. IANA Considerations

   Placeholder

13. Acknowledgements

   Dr. Byeongdoo Choi is thanked for the video codec related technical
   discussion and other aspects in this memo. Xin Zhao and Dr. Xiang
   Li are thanked for their contributions on [VVC] specification
   descriptive content.
   Spencer Dawkins is thanked for his valuable review comments that led
   to great improvements of this memo. Some parts of this
   specification share text with the RTP payload format for HEVC
   [RFC7798]. We thank the authors of that specification for their
   excellent work.

14. References

14.1. Normative References

   [H.266]    "ITU-T, Versatile Video Coding", n.d.

   [ISO23090-3]
              "ISO/IEC DIS Information technology --- Coded
              representation of immersive media --- Part 3: Versatile
              video coding", n.d.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
              July 2003.

   [RFC3551]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
              Video Conferences with Minimal Control", STD 65,
              RFC 3551, DOI 10.17487/RFC3551, July 2003.

   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
              Norrman, "The Secure Real-time Transport Protocol
              (SRTP)", RFC 3711, DOI 10.17487/RFC3711, March 2004.

   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
              Description Protocol", RFC 4566, DOI 10.17487/RFC4566,
              July 2006.

   [RFC4585]  Ott, J., Wenger, S., Sato, N., Burmeister, C., and J.
              Rey, "Extended RTP Profile for Real-time Transport
              Control Protocol (RTCP)-Based Feedback (RTP/AVPF)",
              RFC 4585, DOI 10.17487/RFC4585, July 2006.

   [RFC5104]  Wenger, S., Chandra, U., Westerlund, M., and B. Burman,
              "Codec Control Messages in the RTP Audio-Visual Profile
              with Feedback (AVPF)", RFC 5104, DOI 10.17487/RFC5104,
              February 2008.

   [RFC5124]  Ott, J. and E.
              Carrara, "Extended Secure RTP Profile for Real-time
              Transport Control Protocol (RTCP)-Based Feedback
              (RTP/SAVPF)", RFC 5124, DOI 10.17487/RFC5124, February
              2008.

   [RFC7656]  Lennox, J., Gross, K., Nandakumar, S., Salgueiro, G., and
              B. Burman, Ed., "A Taxonomy of Semantics and Mechanisms
              for Real-Time Transport Protocol (RTP) Sources",
              RFC 7656, DOI 10.17487/RFC7656, November 2015.

   [RFC8082]  Wenger, S., Lennox, J., Burman, B., and M. Westerlund,
              "Using Codec Control Messages in the RTP Audio-Visual
              Profile with Feedback with Layered Codecs", RFC 8082,
              DOI 10.17487/RFC8082, March 2017.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [VVC]      "Versatile Video Coding (Draft 8), Joint Video Experts
              Team (JVET)", January 2020.

14.2. Informative References

   [CABAC]    Sole, J., et al., "Transform coefficient coding in HEVC",
              IEEE Transactions on Circuits and Systems for Video
              Technology, DOI 10.1109/TCSVT.2012.2223055, December
              2012.

   [FrameMarking]
              Berger, E., Nandakumar, S., and M. Zanaty, "Frame Marking
              RTP Header Extension", Work in Progress,
              draft-berger-avtext-framemarking, 2015.

   [Girod99]  Girod, B., et al., "Feedback-based error control for
              mobile video transmission", Proceedings of the IEEE,
              DOI 10.1109/5.790632, October 1999.

   [HEVC]     "High efficiency video coding, ITU-T Recommendation
              H.265", April 2013.

   [MPEG2S]   ISO/IEC, "Information technology - Generic coding of
              moving pictures and associated audio information - Part
              1: Systems, ISO International Standard 13818-1", 2013.

   [RFC6051]  Perkins, C. and T. Schierl, "Rapid Synchronisation of RTP
              Flows", RFC 6051, DOI 10.17487/RFC6051, November 2010.

   [RFC6184]  Wang, Y., Even, R., Kristensen, T., and R.
              Jesup, "RTP Payload Format for H.264 Video", RFC 6184,
              DOI 10.17487/RFC6184, May 2011.

   [RFC6190]  Wenger, S., Wang, Y., Schierl, T., and A. Eleftheriadis,
              "RTP Payload Format for Scalable Video Coding", RFC 6190,
              DOI 10.17487/RFC6190, May 2011.

   [RFC7201]  Westerlund, M. and C. Perkins, "Options for Securing RTP
              Sessions", RFC 7201, DOI 10.17487/RFC7201, April 2014.

   [RFC7202]  Perkins, C. and M. Westerlund, "Securing the RTP
              Framework: Why RTP Does Not Mandate a Single Media
              Security Solution", RFC 7202, DOI 10.17487/RFC7202, April
              2014.

   [RFC7798]  Wang, Y., Sanchez, Y., Schierl, T., Wenger, S., and M.
              Hannuksela, "RTP Payload Format for High Efficiency Video
              Coding (HEVC)", RFC 7798, DOI 10.17487/RFC7798, March
              2016.

   [Wang05]   Wang, Y.-K., Zhu, C., and H. Li, "Error resilient video
              coding using flexible reference frames", Visual
              Communications and Image Processing 2005 (VCIP 2005),
              July 2005.

Appendix A. Change History

   draft-zhao-payload-rtp-vvc-00 ........ initial version

   draft-zhao-payload-rtp-vvc-01 ........ editorial clarifications and
   corrections

Authors' Addresses

   Shuai Zhao
   Tencent
   2747 Park Blvd
   Palo Alto 94588
   USA

   Email: shuai.zhao@ieee.org

   Stephan Wenger
   Tencent
   2747 Park Blvd
   Palo Alto 94588

   Email: stewe@stewe.org

   Yago Sanchez
   Fraunhofer HHI
   Einsteinufer 37
   Berlin 10587
   Germany

   Email: yago.sanchez@hhi.fraunhofer.de