idnits 2.17.1 draft-ietf-avtcore-rtp-vvc-02.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The abstract seems to contain references ([ISO23090-3]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document date (July 11, 2020) is 1383 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '0' on line 1267 -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO23090-3' ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866) ** Downref: Normative reference to an Informational RFC: RFC 7656 -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC' Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--). 
Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 avtcore S. Zhao 3 Internet-Draft S. Wenger 4 Intended status: Standards Track Tencent 5 Expires: January 12, 2021 Y. Sanchez 6 Fraunhofer HHI 7 July 11, 2020 9 RTP Payload Format for Versatile Video Coding (VVC) 10 draft-ietf-avtcore-rtp-vvc-02 12 Abstract 14 This memo describes an RTP payload format for the video coding 15 standard ITU-T Recommendation [H.266] and ISO/IEC International 16 Standard [ISO23090-3], both also known as Versatile Video Coding 17 (VVC) and developed by the Joint Video Experts Team (JVET). The RTP 18 payload format allows for packetization of one or more Network 19 Abstraction Layer (NAL) units in each RTP packet payload as well as 20 fragmentation of a NAL unit into multiple RTP packets. The payload 21 format has wide applicability in videoconferencing, Internet video 22 streaming, and high-bitrate entertainment-quality video, among other 23 applications. 25 Status of This Memo 27 This Internet-Draft is submitted in full conformance with the 28 provisions of BCP 78 and BCP 79. 30 Internet-Drafts are working documents of the Internet Engineering 31 Task Force (IETF). Note that other groups may also distribute 32 working documents as Internet-Drafts. The list of current Internet- 33 Drafts is at https://datatracker.ietf.org/drafts/current/. 35 Internet-Drafts are draft documents valid for a maximum of six months 36 and may be updated, replaced, or obsoleted by other documents at any 37 time. It is inappropriate to use Internet-Drafts as reference 38 material or to cite them other than as "work in progress." 40 This Internet-Draft will expire on January 12, 2021. 42 Copyright Notice 44 Copyright (c) 2020 IETF Trust and the persons identified as the 45 document authors. All rights reserved. 
47 This document is subject to BCP 78 and the IETF Trust's Legal 48 Provisions Relating to IETF Documents 49 (https://trustee.ietf.org/license-info) in effect on the date of 50 publication of this document. Please review these documents 51 carefully, as they describe your rights and restrictions with respect 52 to this document. Code Components extracted from this document must 53 include Simplified BSD License text as described in Section 4.e of 54 the Trust Legal Provisions and are provided without warranty as 55 described in the Simplified BSD License. 57 Table of Contents 59 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 60 1.1. Overview of the VVC Codec . . . . . . . . . . . . . . . . 3 61 1.1.1. Coding-Tool Features (informative) . . . . . . . . . 4 62 1.1.2. Systems and Transport Interfaces . . . . . . . . . . 6 63 1.1.3. Parallel Processing Support (informative) . . . . . . 10 64 1.1.4. NAL Unit Header . . . . . . . . . . . . . . . . . . . 11 65 1.2. Overview of the Payload Format . . . . . . . . . . . . . 12 66 2. Conventions . . . . . . . . . . . . . . . . . . . . . . . . . 12 67 3. Definitions and Abbreviations . . . . . . . . . . . . . . . . 12 68 3.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 12 69 3.1.1. Definitions from the VVC Specification . . . . . . . 13 70 3.1.2. Definitions Specific to This Memo . . . . . . . . . . 16 71 3.2. Abbreviations . . . . . . . . . . . . . . . . . . . . . . 16 72 4. RTP Payload Format . . . . . . . . . . . . . . . . . . . . . 17 73 4.1. RTP Header Usage . . . . . . . . . . . . . . . . . . . . 18 74 4.2. Payload Header Usage . . . . . . . . . . . . . . . . . . 19 75 4.3. Payload Structures . . . . . . . . . . . . . . . . . . . 20 76 4.3.1. Single NAL Unit Packets . . . . . . . . . . . . . . . 20 77 4.3.2. Aggregation Packets (APs) . . . . . . . . . . . . . . 21 78 4.3.3. Fragmentation Units . . . . . . . . . . . . . . . . . 25 79 4.4. Decoding Order Number . . . . . . . . . . . . . . . 
. . . 28 80 5. Packetization Rules . . . . . . . . . . . . . . . . . . . . . 29 81 6. De-packetization Process . . . . . . . . . . . . . . . . . . 30 82 7. Payload Format Parameters . . . . . . . . . . . . . . . . . . 32 83 7.1. Media Type Registration . . . . . . . . . . . . . . . . . 32 84 7.2. SDP Parameters . . . . . . . . . . . . . . . . . . . . . 32 85 7.2.1. Mapping of Payload Type Parameters to SDP . . . . . . 32 86 7.2.2. Usage with SDP Offer/Answer Model . . . . . . . . . . 33 87 8. Use with Feedback Messages . . . . . . . . . . . . . . . . . 33 88 8.1. Picture Loss Indication (PLI) . . . . . . . . . . . . . . 33 89 8.2. Slice Loss Indication (SLI) . . . . . . . . . . . . . . . 33 90 8.3. Reference Picture Selection Indication (RPSI) . . . . . . 33 91 8.4. Full Intra Request (FIR) . . . . . . . . . . . . . . . . 34 92 9. Frame Marking . . . . . . . . . . . . . . . . . . . . . . . . 34 93 9.1. Frame Marking Short Extension . . . . . . . . . . . . . . 35 94 9.2. Frame Marking Long Extension . . . . . . . . . . . . . . 36 95 10. Security Considerations . . . . . . . . . . . . . . . . . . . 37 96 11. Congestion Control . . . . . . . . . . . . . . . . . . . . . 38 97 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 39 98 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 39 99 14. References . . . . . . . . . . . . . . . . . . . . . . . . . 39 100 14.1. Normative References . . . . . . . . . . . . . . . . . . 39 101 14.2. Informative References . . . . . . . . . . . . . . . . . 41 102 Appendix A. Change History . . . . . . . . . . . . . . . . . . . 42 103 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 43 105 1. Introduction 107 The Versatile Video Coding [VVC] specification, formally published as 108 both ITU-T Recommendation H.266 and ISO/IEC International Standard 109 23090-3 [ISO23090-3], is currently in the ISO/IEC approval process 110 and is planned for ratification in mid 2020. 
H.266 is reported to 111 provide significant coding efficiency gains over H.265 and earlier 112 video codec formats. 114 This memo describes an RTP payload format for VVC. It shares its 115 basic design with the NAL (Network Abstraction Layer) unit-based RTP 116 payload formats of H.264 Video Coding [RFC6184], Scalable Video 117 Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798], 118 and their respective predecessors. With respect to design 119 philosophy, security, congestion control, and overall implementation 120 complexity, it has similar properties to those earlier payload format 121 specifications. This is a conscious choice, as at least RFC 6184 is 122 widely deployed and generally known in the relevant implementer 123 communities. Certain mechanisms known from [RFC6190] were 124 incorporated in VVC, as VVC version 1 supports temporal, spatial, and 125 signal-to-noise ratio (SNR) scalability. 127 1.1. Overview of the VVC Codec 129 [VVC] and [HEVC] share a similar hybrid video codec design. In this 130 memo, we provide a very brief overview of those features of VVC that 131 are, in some form, addressed by the payload format specified herein. 132 Implementers have to read, understand, and apply the ITU-T/ISO/IEC 133 specifications pertaining to [VVC] to arrive at interoperable, well- 134 performing implementations. 136 Conceptually, both [VVC] and [HEVC] include a Video Coding Layer 137 (VCL), which is often used to refer to the coding-tool features, and 138 a NAL, which is often used to refer to the systems and transport 139 interface aspects of the codecs. 141 1.1.1. Coding-Tool Features (informative) 143 Coding tool features are described below with occasional reference to 144 the coding tool set of [HEVC], which is well known in the community. 146 Similar to earlier hybrid-video-coding-based standards, including 147 HEVC, the following basic video coding design is employed by VVC. 
A 148 prediction signal is first formed by either intra- or motion- 149 compensated prediction, and the residual (the difference between the 150 original and the prediction) is then coded. The gains in coding 151 efficiency are achieved by redesigning and improving almost all parts 152 of the codec over earlier designs. In addition, [VVC] includes 153 several tools to make the implementation on parallel architectures 154 easier. 156 Finally, [VVC] includes temporal, spatial, and SNR scalability as 157 well as multiview coding support. 159 Coding blocks and transform structure 161 Among the major coding-tool differences between HEVC and VVC, one of the 162 important improvements is the more flexible coding tree structure in 163 VVC, i.e., the multi-type tree. In addition to quadtree, binary and 164 ternary trees are also supported, which contributes a significant 165 improvement in coding efficiency. Moreover, the maximum size of 166 a Coding Tree Unit (CTU) is increased from 64x64 to 128x128. To 167 improve the coding efficiency of the chroma signal, separate luma and chroma 168 trees at the CTU level may be employed for intra-slices. The square 169 transforms in HEVC are extended to non-square transforms for 170 rectangular blocks resulting from binary and ternary tree splits. 171 Besides, [VVC] supports multiple transform sets (MTS), including DCT- 172 2, DST-7, and DCT-8 as well as the non-separable secondary transform. 173 The transforms used in [VVC] can have different sizes with support 174 for larger transform sizes. For DCT-2, the transform sizes range 175 from 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range 176 from 4x4 to 32x32. In addition, [VVC] also supports sub-block 177 transform for both intra and inter coded blocks. For intra coded 178 blocks, intra sub-partitioning (ISP) may be used to allow sub-block 179 based intra prediction and transform. 
For inter blocks, sub-block 180 transform may be used assuming that only a part of an inter-block has 181 non-zero transform coefficients. 183 Entropy coding 185 Similar to HEVC, [VVC] uses a single entropy-coding engine, which is 186 based on Context Adaptive Binary Arithmetic Coding (CABAC) [CABAC], 187 but with the support of multi-window sizes. The window sizes can be 188 initialized differently for different context models. Due to such a 189 design, it has more efficient adaptation speed and better coding 190 efficiency. A joint chroma residual coding scheme is applied to 191 further exploit the correlation between the residuals of two color 192 components. In VVC, different residual coding schemes are applied 193 for regular transform coefficients and residual samples generated 194 using transform-skip mode. 196 In-loop filtering 198 [VVC] has more feature support in loop filters than HEVC. The 199 deblocking filter in [VVC] is similar to HEVC but operates at a 200 smaller grid. After deblocking and sample adaptive offset (SAO), an 201 adaptive loop filter (ALF) may be used. As a Wiener filter, ALF 202 reduces distortion of decoded pictures. Besides, [VVC] introduces a 203 new module before deblocking called luma mapping with chroma scaling 204 to fully utilize the dynamic range of the signal so that rate-distortion 205 performance of both SDR and HDR content is improved. 207 Motion prediction and coding 209 Compared to HEVC, [VVC] introduces several improvements in this area. 210 First, there is adaptive motion vector resolution (AMVR), which 211 can save bit cost for motion vectors by adaptively signaling motion 212 vector resolution. Second, affine motion compensation is included 213 to capture complicated motion like zooming and rotation. Meanwhile, 214 prediction refinement with the optical flow with affine mode (PROF) 215 is further deployed to mimic affine motion at the pixel level. 
216 Third, decoder-side motion vector refinement (DMVR) is a method 217 to derive motion vectors at the decoder side based on block matching so that 218 fewer bits may be spent on motion vectors. Bi-directional optical 219 flow (BDOF) is a similar method to PROF. BDOF adds a sample-wise 220 offset at 4x4 sub-block level that is derived with equations based on 221 gradients of the prediction samples and a motion difference relative 222 to CU motion vectors. Furthermore, merge with motion vector 223 difference (MMVD) is a special mode, which further signals a limited 224 set of motion vector differences on top of merge mode. In addition 225 to MMVD, there are another three types of special merge modes, i.e., 226 sub-block merge, triangle, and combined intra-/inter-prediction 227 (CIIP). The sub-block merge list includes one candidate of sub-block 228 temporal motion vector prediction (SbTMVP) and up to four candidates 229 of affine motion vectors. Triangle is based on triangular block 230 motion compensation. CIIP combines intra- and inter-predictions 231 with weighting. Adaptive weighting may be employed with a block- 232 level tool called bi-prediction with CU based weighting (BCW) which 233 provides more flexibility than in HEVC. 235 Intra prediction and intra-coding 236 To capture the diversified local image texture directions with finer 237 granularity, [VVC] supports 65 angular directions instead of 33 238 directions in HEVC. The intra mode coding is based on a 6 most 239 probable mode scheme, and the 6 most probable modes are derived using 240 the neighboring intra prediction directions. In addition, to deal 241 with the different distributions of intra prediction angles for 242 different block aspect ratios, a wide-angle intra prediction (WAIP) 243 scheme is applied in [VVC] by including intra prediction angles 244 beyond those present in HEVC. 
Unlike HEVC, which only allows using 245 the most adjacent line of reference samples for intra prediction, 246 [VVC] also allows using two further reference lines, also known as 247 multi-reference-line (MRL) intra prediction. The additional 248 reference lines can only be used for the 6 most probable intra prediction 249 modes. To capture the strong correlation between different colour 250 components, in VVC, a cross-component linear mode (CCLM) is utilized 251 which assumes a linear relationship between the luma sample values 252 and their associated chroma samples. For intra prediction, [VVC] 253 also applies a position-dependent prediction combination (PDPC) for 254 refining the prediction samples closer to the intra prediction block 255 boundary. Matrix-based intra prediction (MIP) modes are also used in 256 [VVC], which generate an up to 8x8 intra prediction block using a 257 weighted sum of downsampled neighboring reference samples, and the 258 weightings are hardcoded constants. 260 Other coding-tool feature 262 [VVC] introduces dependent quantization (DQ) to reduce quantization 263 error by state-based switching between two quantizers. 265 1.1.2. Systems and Transport Interfaces 267 [VVC] inherits the basic systems and transport interfaces designs 268 from HEVC and H.264. These include the NAL-unit-based syntax 269 structure, the hierarchical syntax and data unit structure, the 270 Supplemental Enhancement Information (SEI) message mechanism, and the 271 video buffering model based on the Hypothetical Reference Decoder 272 (HRD). The scalability features of [VVC] are conceptually similar to 273 the scalable variant of HEVC known as SHVC. The hierarchical syntax 274 and data unit structure consists of parameter sets at various levels 275 (decoder, sequence (pertaining to all), sequence (pertaining to a 276 single), picture), picture-level header parameters, slice-level 277 header parameters, and lower-level parameters. 
279 A number of key components that influenced the Network Abstraction 280 Layer design of [VVC] as well as this memo are described below. 282 Decoding Capability Information 283 The Decoding Capability Information (DCI) includes parameters that stay 284 constant for the lifetime of a Video Bitstream, which in IETF terms 285 can translate to the lifetime of a session. Decoding capability 286 information can include profile, level, and sub-profile information 287 to determine a maximum complexity interop point that is guaranteed to 288 be never exceeded, even if splicing of video sequences occurs within 289 a session. It further includes constraint flags, which can 290 optionally be set to indicate that the video bitstream will be 291 constrained in the use of certain features as indicated by the values 292 of those flags. With this, a bitstream can be labelled as not using 293 certain tools, which allows among other things for resource 294 allocation in a decoder implementation. 296 Video parameter set 298 The Video Parameter Set (VPS) pertains to a Coded Video Sequence 299 (CVS) of multiple layers covering the same range of picture units, 300 and includes, among other information, decoding dependency expressed 301 as information for reference picture set construction of enhancement 302 layers. The VPS provides a "big picture" of a scalable sequence, 303 including what types of operation points are provided, the profile, 304 tier, and level of the operation points, and some other high-level 305 properties of the bitstream that can be used as the basis for session 306 negotiation and content selection, etc. One VPS may be referenced by 307 one or more Sequence parameter sets. 
309 Sequence parameter set 311 The Sequence Parameter Set (SPS) contains syntax elements pertaining 312 to a coded layer video sequence (CLVS), which is a group of pictures 313 belonging to the same layer, starting with a random access point, and 314 followed by pictures that may depend on each other and the random 315 access point picture. In MPEG-2, the equivalent of a CVS was a 316 Group of Pictures (GOP), which normally started with an I frame and 317 was followed by P and B frames. While more complex in its options of 318 random access points, VVC retains this basic concept. One remarkable 319 difference of VVC is that a CLVS may start with a Gradual Decoding 320 Refresh (GDR) picture, without requiring presence of traditional 321 random access points in the bitstream, such as Instantaneous Decoding 322 Refresh (IDR) or Clean Random Access (CRA) pictures. In many TV-like 323 applications, a CVS contains a few hundred milliseconds to a few 324 seconds of video. In video conferencing (without switching MCUs 325 involved), a CVS can be as long in duration as the whole session. 327 Picture and Adaptation parameter set 329 The Picture Parameter Set and the Adaptation Parameter Set (PPS and 330 APS, respectively) carry information pertaining to zero or more 331 pictures and zero or more slices, respectively. The PPS contains 332 information that is likely to stay constant from picture to picture, 333 at least for pictures of a certain type, whereas the APS contains 334 information, such as adaptive loop filter coefficients, that is 335 likely to change from picture to picture or even within a picture. A 336 single APS can be referenced by slices of the same picture if that 337 APS contains information about luma mapping with chroma scaling 338 (LMCS), but different APSs can be referenced by slices of the same 339 picture if those APSs contain information about ALF. 
341 Picture Header 343 A Picture Header contains information that is common to all slices 344 that belong to the same picture. Being able to send that information 345 as a separate NAL unit when pictures are split into several slices 346 allows for saving bitrate, compared to repeating the same information 347 in all slices. However, there might be scenarios where low-bitrate 348 video is transmitted using a single slice per picture. Having a 349 separate NAL unit to convey that information incurs an overhead 350 for such scenarios. Therefore, VVC specifies signaling that 351 indicates whether Picture Headers are present in the CLVS or not. 353 Profile, tier, and level 355 The profile, tier and level syntax structures in DCI, VPS and SPS 356 contain profile, tier, level information for all layers that refer to 357 the DCI, for layers associated with one or more output layer sets 358 specified by the VPS, and for any layer that refers to the SPS, 359 respectively. 361 Sub-Profiles 363 Within the [VVC] specification, a sub-profile is a 32-bit number 364 coded according to ITU-T Rec. T.35, that does not carry semantics. 365 It is carried in the profile_tier_level structure and hence 366 (potentially) present in the DCI, VPS, and SPS. External 367 registration bodies can register a T.35 codepoint with ITU-T 368 registration authorities and associate with their registration a 369 description of bitstream complexity restrictions beyond the profiles 370 defined by ITU-T and ISO/IEC. This would allow encoder manufacturers 371 to label the bitstreams generated by their encoder as complying with 372 such a sub-profile. It is expected that upstream standardization 373 organizations (such as DVB and ATSC), as well as walled-garden video 374 services will take advantage of this labelling system. 
In contrast 375 to "normal" profiles, it is expected that sub-profiles may indicate 376 encoder choices traditionally left open in the (decoder-centric) 377 video coding specs, such as GOP structures, minimum/maximum QP 378 values, and the mandatory use of certain tools or SEI messages. 380 Constraint Flags 382 The profile_tier_level structure carries a considerable number of 383 constraint flags, which an encoder can use to indicate to a decoder 384 that it will not use a certain tool or technology. They were 385 included in reaction to a perceived market need for labelling a 386 bitstream as not exercising a certain tool that has become 387 commercially unviable. 389 Temporal scalability support 391 Editor notes: needs to be updated along with the new VVC draft in the 392 future 394 [VVC] includes support of temporal scalability, by inclusion of the 395 signaling of TemporalId in the NAL unit header, the restriction that 396 pictures of a particular temporal sub-layer cannot be used for inter 397 prediction reference by pictures of a lower temporal sub-layer, the 398 sub-bitstream extraction process, and the requirement that each sub- 399 bitstream extraction output be a conforming bitstream. Media-Aware 400 Network Elements (MANEs) can utilize the TemporalId in the NAL unit 401 header for stream adaptation purposes based on temporal scalability. 403 Spatial, SNR, View Scalability 405 [VVC] includes support for spatial, SNR, and View scalability. 406 Scalable video coding is widely considered to have technical benefits 407 and enrich services for various video applications. Until recently, 408 however, the functionality has not been included in the main profiles 409 of video codecs and not widely deployed due to additional costs. 
In 410 VVC, however, all those forms of scalability are supported natively 411 through the signaling of the layer_id in the NAL unit header, the VPS 412 which associates layers with given layer_ids to each other, reference 413 picture selection, reference picture resampling for spatial 414 scalability, and a number of other mechanisms not relevant for this 415 memo. Scalability support can be implemented in a single decoding 416 "loop" and is widely considered a comparatively lightweight 417 operation. 419 Spatial Scalability 421 With the existence of Reference Picture Resampling (RPR), in 422 the "main" profile of VVC, the additional burden for 423 scalability support is just a minor modification of the high- 424 level syntax (HLS). In technical aspects, the inter-layer 425 prediction is employed in a scalable system to improve the 426 coding efficiency of the enhancement layers. In addition to 427 the spatial and temporal motion-compensated predictions that 428 are available in a single-layer codec, the inter-layer 429 prediction in [VVC] uses the resampled video data of the 430 reconstructed reference picture from a reference layer to 431 predict the current enhancement layer. Then, the resampling 432 process for inter-layer prediction is performed at the block- 433 level, without modifying the existing interpolation process for 434 motion compensation compared to non-scalable RPR. It means 435 that no additional resampling process is needed to support 436 scalability. 438 SNR Scalability 440 SNR scalability is similar to Spatial Scalability except that 441 the resampling factors are 1:1; in other words, there is no 442 change in resolution, but there is inter-layer prediction. 
444 SEI Messages 446 Supplemental Enhancement Information (SEI) messages are codepoints 447 in the bitstream that do not influence the decoding process as 448 specified in the [VVC] spec, but address issues of representation/ 449 rendering of the decoded bitstream, label the bitstream for certain 450 applications, among other, similar tasks. The overall concept of SEI 451 messages and many of the messages themselves have been inherited from 452 the H.264 and HEVC specs. In the [VVC] environment, some of the SEI 453 messages considered to be generally useful also in other video coding 454 technologies have been moved out of the main specification into a 455 companion document (TO DO: add reference once ITU designation is 456 known). 458 1.1.3. Parallel Processing Support (informative) 460 Compared to HEVC, the [VVC] design to support parallelization offers 461 numerous improvements. Some of those improvements are still 462 undergoing changes in JVET. Information, to the extent relevant for 463 this memo, will be added in future versions of this memo as the 464 standardization in JVET progresses and the technology stabilizes. 466 Editor notes: update on sub-picture/slice/tile is needed following 467 new VVC draft 469 1.1.4. NAL Unit Header 471 [VVC] maintains the NAL unit concept of HEVC with modifications. VVC 472 uses a two-byte NAL unit header, as shown in Figure 1. The payload 473 of a NAL unit refers to the NAL unit excluding the NAL unit header.
475 +---------------+---------------+
476 |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
477 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
478 |F|Z|  LayerID  |  Type   | TID |
479 +---------------+---------------+
481 The Structure of the VVC NAL Unit Header. 483 Figure 1 485 The semantics of the fields in the NAL unit header are as specified 486 in [VVC] and described briefly below for convenience. In addition to 487 the name and size of each field, the corresponding syntax element 488 name in [VVC] is also provided. 
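The two-byte layout in Figure 1 can be unpacked with a few shifts and masks. The following is a minimal, non-normative sketch in Python. The names NalUnitHeader and parse_nal_unit_header are hypothetical and are not defined by [VVC] or by this memo. The sketch assumes that bit 0 in Figure 1 is the most significant bit of each byte, and it rejects a TID of 0 following the field description given below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalUnitHeader:
    """Fields of the two-byte VVC NAL unit header (hypothetical helper type)."""
    f: int              # forbidden_zero_bit, 1 bit
    z: int              # nuh_reserved_zero_bit, 1 bit
    layer_id: int       # nuh_layer_id, 6 bits
    nal_unit_type: int  # nal_unit_type, 5 bits
    tid: int            # nuh_temporal_id_plus1, 3 bits

    @property
    def temporal_id(self) -> int:
        # TemporalId is defined as TID minus 1.
        return self.tid - 1


def parse_nal_unit_header(data: bytes) -> NalUnitHeader:
    """Split the first two bytes of a NAL unit into the Figure 1 fields."""
    if len(data) < 2:
        raise ValueError("a VVC NAL unit header is two bytes long")
    b0, b1 = data[0], data[1]
    header = NalUnitHeader(
        f=(b0 >> 7) & 0x01,
        z=(b0 >> 6) & 0x01,
        layer_id=b0 & 0x3F,
        nal_unit_type=(b1 >> 3) & 0x1F,
        tid=b1 & 0x07,
    )
    if header.tid == 0:
        # A TID of 0 is illegal, so at least one header bit is always 1.
        raise ValueError("nuh_temporal_id_plus1 must not be 0")
    return header
```

For example, the bytes 0x05 0x7A would yield layer_id 5, nal_unit_type 15, TID 2, and therefore TemporalId 1. A receiver might additionally treat F equal to 1 as a hint that the NAL unit is damaged, in line with the use of the F bit described in this memo.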
490 F: 1 bit 492 forbidden_zero_bit. Required to be zero in VVC. Note that the 493 inclusion of this bit in the NAL unit header was to enable 494 transport of [VVC] video over MPEG-2 transport systems (avoidance 495 of start code emulations) [MPEG2S]. In the context of this memo 496 the value 1 may be used to indicate a syntax violation, e.g., for 497 a NAL unit resulting from aggregating a number of fragmented units 498 of a NAL unit but missing the last fragment, as described in 499 Section TBD. 501 Z: 1 bit 503 nuh_reserved_zero_bit. Required to be zero in VVC, and reserved 504 for future extensions by ITU-T and ISO/IEC. 505 This memo does not overload the "Z" bit for local extensions, as 506 a) overloading the "F" bit is sufficient and b) to preserve the 507 usefulness of this memo to possible future versions of [VVC]. 509 LayerId: 6 bits 511 nuh_layer_id. Identifies the layer a NAL unit belongs to, wherein 512 a layer may be, e.g., a spatial scalable layer or a quality scalable 513 layer. 515 Type: 5 bits 516 nal_unit_type. This field specifies the NAL unit type as defined 517 in Table 7-1 of VVC. For a reference of all currently defined NAL 518 unit types and their semantics, please refer to Section 7.4.2.2 in 519 [VVC]. 521 TID: 3 bits 523 nuh_temporal_id_plus1. This field specifies the temporal 524 identifier of the NAL unit plus 1. The value of TemporalId is 525 equal to TID minus 1. A TID value of 0 is illegal to ensure that 526 there is at least one bit in the NAL unit header equal to 1, so as to 527 enable independent considerations of start code emulations in the 528 NAL unit header and in the NAL unit payload data. 530 1.2. 
Overview of the Payload Format 532 This payload format defines the following processes required for 533 transport of [VVC] coded data over RTP [RFC3550]: 535 o Usage of RTP header with this payload format 537 o Packetization of [VVC] coded NAL units into RTP packets using 538 three types of payload structures: a single NAL unit packet, 539 aggregation packet, and fragmentation unit 541 o Transmission of [VVC] NAL units of the same bitstream within a 542 single RTP stream 544 o Media type parameters to be used with the Session Description 545 Protocol (SDP) [RFC4566] 547 o Frame-marking mapping [FrameMarking] 549 2. Conventions 551 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 552 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 553 "OPTIONAL" in this document are to be interpreted as described in BCP 554 14 [RFC2119] [RFC8174] when, and only when, they appear in all 555 capitals, as shown above. 557 3. Definitions and Abbreviations 559 3.1. Definitions 561 This document uses the terms and definitions of VVC. Section 3.1.1 562 lists relevant definitions from [VVC] for convenience. Section 3.1.2 563 provides definitions specific to this memo. 565 3.1.1. Definitions from the VVC Specification 567 Editor notes: 569 Access unit (AU): A set of PUs that belong to different layers and 570 contain coded pictures associated with the same time for output from 571 the DPB. 573 Adaptation parameter set (APS): A syntax structure containing syntax 574 elements that apply to zero or more slices as determined by zero or 575 more syntax elements found in slice headers. 577 Bitstream: A sequence of bits, in the form of a NAL unit stream or a 578 byte stream, that forms the representation of a sequence of AUs 579 forming one or more coded video sequences (CVSs). 581 Coded picture: A coded representation of a picture comprising VCL NAL 582 units with a particular value of nuh_layer_id within an AU and 583 containing all CTUs of the picture. 
585 Clean random access (CRA) PU: A PU in which the coded picture is a 586 CRA picture. 588 Clean random access (CRA) picture: An IRAP picture for which each VCL 589 NAL unit has nal_unit_type equal to CRA_NUT. 591 Coded video sequence (CVS): A sequence of AUs that consists, in 592 decoding order, of a CVSS AU, followed by zero or more AUs that are 593 not CVSS AUs, including all subsequent AUs up to but not including 594 any subsequent AU that is a CVSS AU. 596 Coded video sequence start (CVSS) AU: An AU in which there is a PU 597 for each layer in the CVS and the coded picture in each PU is a CLVSS 598 picture. 600 Coded layer video sequence (CLVS): A sequence of PUs with the same 601 value of nuh_layer_id that consists, in decoding order, of a CLVSS 602 PU, followed by zero or more PUs that are not CLVSS PUs, including 603 all subsequent PUs up to but not including any subsequent PU that is 604 a CLVSS PU. 606 Coded layer video sequence start (CLVSS) PU: A PU in which the coded 607 picture is a CLVSS picture. 609 Coded layer video sequence start (CLVSS) picture: A coded picture 610 that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or 611 a GDR picture with NoOutputBeforeRecoveryFlag equal to 1. 613 Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs 614 of chroma samples of a picture that has three sample arrays, or a CTB 615 of samples of a monochrome picture or a picture that is coded using 616 three separate colour planes and syntax structures used to code the 617 samples. 619 Decoding Capability Information (DCI): A syntax structure containing 620 syntax elements that apply to the entire bitstream. 622 Decoded picture buffer (DPB): A buffer holding decoded pictures for 623 reference, output reordering, or output delay specified for the 624 hypothetical reference decoder. 626 Gradual decoding refresh (GDR) picture: A picture for which each VCL 627 NAL unit has nal_unit_type equal to GDR_NUT. 

   Instantaneous decoding refresh (IDR) PU: A PU in which the coded
   picture is an IDR picture.

   Instantaneous decoding refresh (IDR) picture: An IRAP picture for
   which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or
   IDR_N_LP.

   Intra random access point (IRAP) AU: An AU in which there is a PU for
   each layer in the CVS and the coded picture in each PU is an IRAP
   picture.

   Intra random access point (IRAP) PU: A PU in which the coded picture
   is an IRAP picture.

   Intra random access point (IRAP) picture: A coded picture for which
   all VCL NAL units have the same value of nal_unit_type in the range
   of IDR_W_RADL to CRA_NUT, inclusive.

   Layer: A set of VCL NAL units that all have a particular value of
   nuh_layer_id and the associated non-VCL NAL units.

   Network abstraction layer (NAL) unit: A syntax structure containing
   an indication of the type of data to follow and bytes containing that
   data in the form of an RBSP interspersed as necessary with emulation
   prevention bytes.

   Network abstraction layer (NAL) unit stream: A sequence of NAL units.

   Operation point (OP): A temporal subset of an OLS, identified by an
   OLS index and a highest value of TemporalId.

   Picture parameter set (PPS): A syntax structure containing syntax
   elements that apply to zero or more entire coded pictures as
   determined by a syntax element found in each slice header.

   Picture unit (PU): A set of NAL units that are associated with each
   other according to a specified classification rule, are consecutive
   in decoding order, and contain exactly one coded picture.

   Random access: The act of starting the decoding process for a
   bitstream at a point other than the beginning of the stream.

   Sequence parameter set (SPS): A syntax structure containing syntax
   elements that apply to zero or more entire CLVSs as determined by the
   content of a syntax element found in the PPS referred to by a syntax
   element found in each picture header.

   Slice: An integer number of complete tiles or an integer number of
   consecutive complete CTU rows within a tile of a picture that are
   exclusively contained in a single NAL unit.

   Sub-layer: A temporal scalable layer of a temporal scalable bitstream
   consisting of VCL NAL units with a particular value of the TemporalId
   variable, and the associated non-VCL NAL units.

   Subpicture: A rectangular region of one or more slices within a
   picture.

   Sub-layer representation: A subset of the bitstream consisting of NAL
   units of a particular sub-layer and the lower sub-layers.

   Tile: A rectangular region of CTUs within a particular tile column
   and a particular tile row in a picture.

   Tile column: A rectangular region of CTUs having a height equal to
   the height of the picture and a width specified by syntax elements in
   the picture parameter set.

   Tile row: A rectangular region of CTUs having a height specified by
   syntax elements in the picture parameter set and a width equal to the
   width of the picture.

   Video coding layer (VCL) NAL unit: A collective term for coded slice
   NAL units and the subset of NAL units that have reserved values of
   nal_unit_type that are classified as VCL NAL units in this
   Specification.

3.1.2.  Definitions Specific to This Memo

   Media-Aware Network Element (MANE): A network element, such as a
   middlebox, selective forwarding unit, or application-layer gateway
   that is capable of parsing certain aspects of the RTP payload headers
   or the RTP payload and reacting to their contents.

      Editor Notes: the following informative note needs to be updated
      along with the frame marking update

      Informative note: The concept of a MANE goes beyond normal routers
      or gateways in that a MANE has to be aware of the signaling (e.g.,
      to learn about the payload type mappings of the media streams),
      and in that it has to be trusted when working with Secure RTP
      (SRTP).  The advantage of using MANEs is that they allow packets
      to be dropped according to the needs of the media coding.  For
      example, if a MANE has to drop packets due to congestion on a
      certain link, it can identify and remove those packets whose
      elimination produces the least adverse effect on the user
      experience.  After dropping packets, MANEs must rewrite RTCP
      packets to match the changes to the RTP stream, as specified in
      Section 7 of [RFC3550].

   NAL unit decoding order: A NAL unit order that conforms to the
   constraints on NAL unit order given in Section 7.4.2.4 in [VVC], that
   is, the order of NAL units in the bitstream.

   NAL unit output order: A NAL unit order in which NAL units of
   different access units are in the output order of the decoded
   pictures corresponding to the access units, as specified in [VVC],
   and in which NAL units within an access unit are in their decoding
   order.

   RTP stream: See [RFC7656].  Within the scope of this memo, one RTP
   stream is utilized to transport one or more temporal sub-layers.

   Transmission order: The order of packets in ascending RTP sequence
   number order (in modulo arithmetic).  Within an aggregation packet,
   the NAL unit transmission order is the same as the order of
   appearance of NAL units in the packet.

3.2.  Abbreviations

   AU         Access Unit

   AP         Aggregation Packet

   CTU        Coding Tree Unit

   CVS        Coded Video Sequence

   DPB        Decoded Picture Buffer

   DCI        Decoding Capability Information

   DON        Decoding Order Number

   FIR        Full Intra Request

   FU         Fragmentation Unit

   HRD        Hypothetical Reference Decoder

   IDR        Instantaneous Decoding Refresh

   MANE       Media-Aware Network Element

   MTU        Maximum Transfer Unit

   NAL        Network Abstraction Layer

   NALU       Network Abstraction Layer Unit

   PLI        Picture Loss Indication

   PPS        Picture Parameter Set

   RPS        Reference Picture Set

   RPSI       Reference Picture Selection Indication

   SEI        Supplemental Enhancement Information

   SLI        Slice Loss Indication

   SPS        Sequence Parameter Set

   VCL        Video Coding Layer

   VPS        Video Parameter Set

4.  RTP Payload Format

4.1.  RTP Header Usage

   The format of the RTP header is specified in [RFC3550] (reprinted as
   Figure 2 for convenience).  This payload format uses the fields of
   the header in a manner consistent with that specification.

   The RTP payload (and the settings for some RTP header bits) for
   aggregation packets and fragmentation units are specified in
   Section 4.3.2 and Section 4.3.3, respectively.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |            contributing source (CSRC) identifiers             |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                  RTP Header According to [RFC3550]

                               Figure 2

   The RTP header information to be set according to this RTP payload
   format is set as follows:

   Marker bit (M): 1 bit

      Set for the last packet of the access unit, carried in the current
      RTP stream.  This is in line with the normal use of the M bit in
      video formats to allow efficient playout buffer handling.

         Editor notes: The informative note below needs updating once
         the NAL unit type table is stable in the [VVC] spec.

         Informative note: The content of a NAL unit does not tell
         whether or not the NAL unit is the last NAL unit, in decoding
         order, of an access unit.  An RTP sender implementation may
         obtain this information from the video encoder.  If, however,
         the implementation cannot obtain this information directly from
         the encoder, e.g., when the bitstream was pre-encoded, and also
         there is no timestamp allocated for each NAL unit, then the
         sender implementation can inspect subsequent NAL units in
         decoding order to determine whether or not the NAL unit is the
         last NAL unit of an access unit as follows.  A NAL unit is
         determined to be the last NAL unit of an access unit if it is
         the last NAL unit of the bitstream.  A NAL unit naluX is also
         determined to be the last NAL unit of an access unit if both
         the following conditions are true: 1) the next VCL NAL unit
         naluY in decoding order has the high-order bit of the first
         byte after its NAL unit header equal to 1 or nal_unit_type
         equal to 19, and 2) all NAL units between naluX and naluY, when
         present, have nal_unit_type in the range of 13 to 17,
         inclusive, equal to 20, equal to 23 or equal to 26.

   Payload Type (PT): 7 bits

      The assignment of an RTP payload type for this new packet format
      is outside the scope of this document and will not be specified
      here.
      The assignment of a payload type has to be performed either
      through the profile used or in a dynamic way.

   Sequence Number (SN): 16 bits

      Set and used in accordance with [RFC3550].

   Timestamp: 32 bits

      The RTP timestamp is set to the sampling timestamp of the content.
      A 90 kHz clock rate MUST be used.  If the NAL unit has no timing
      properties of its own (e.g., parameter set and SEI NAL units), the
      RTP timestamp MUST be set to the RTP timestamp of the coded
      picture of the access unit in which the NAL unit (according to
      Annex D of VVC) is included.  Receivers MUST use the RTP timestamp
      for the display process, even when the bitstream contains picture
      timing SEI messages or decoding unit information SEI messages as
      specified in VVC.

   Synchronization source (SSRC): 32 bits

      Used to identify the source of the RTP packets.  A single SSRC is
      used for all parts of a single bitstream.

4.2.  Payload Header Usage

   The first two bytes of the payload of an RTP packet are referred to
   as the payload header.  The payload header consists of the same
   fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown
   in Section 1.1.4, irrespective of the type of the payload structure.

   The TID value indicates (among other things) the relative importance
   of an RTP packet, for example, because NAL units belonging to higher
   temporal sub-layers are not used for the decoding of lower temporal
   sub-layers.  A lower value of TID indicates a higher importance.
   More-important NAL units MAY be better protected against transmission
   losses than less-important NAL units.

      For Discussion: quite possibly something similar can be said for
      the Layer_id in layered coding, but perhaps not in multiview
      coding.  (The relevant part of the spec is relatively new,
      therefore the soft language).  However, for serious layer pruning,
      interpretation of the VPS is required.
      We can add language about the need for stateful interpretation of
      LayerID vis-a-vis stateless interpretation of TID later.

4.3.  Payload Structures

   Three different types of RTP packet payload structures are specified.
   A receiver can identify the type of an RTP packet payload through the
   Type field in the payload header.

   The three different payload structures are as follows:

   o  Single NAL unit packet: Contains a single NAL unit in the payload,
      and the NAL unit header of the NAL unit also serves as the payload
      header.  This payload structure is specified in Section 4.3.1.

   o  Aggregation Packet (AP): Contains more than one NAL unit within
      one access unit.  This payload structure is specified in
      Section 4.3.2.

   o  Fragmentation Unit (FU): Contains a subset of a single NAL unit.
      This payload structure is specified in Section 4.3.3.

4.3.1.  Single NAL Unit Packets

      Editor notes: it would be better to add a section describing DONL
      and sprop-max-don-diff.  sprop-max-don-diff is used but not yet
      specified, because the parameters in Section 7 are not yet
      specified.  A value of sprop-max-don-diff greater than 0 indicates
      that the transmission order may not correspond to the decoding
      order and that the DON is included in the payload header.

   A single NAL unit packet contains exactly one NAL unit, and consists
   of a payload header (denoted as PayloadHdr), a conditional 16-bit
   DONL field (in network byte order), and the NAL unit payload data
   (the NAL unit excluding its NAL unit header) of the contained NAL
   unit, as shown in Figure 3.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           PayloadHdr          |      DONL (conditional)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                  NAL unit payload data                        |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

              The Structure of a Single NAL Unit Packet

                               Figure 3

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the contained NAL
   unit.  If sprop-max-don-diff is greater than 0, the DONL field MUST
   be present, and the variable DON for the contained NAL unit is
   derived as equal to the value of the DONL field.  Otherwise (sprop-
   max-don-diff is equal to 0), the DONL field MUST NOT be present.

4.3.2.  Aggregation Packets (APs)

   Aggregation Packets (APs) can reduce packetization overhead for
   small NAL units, such as most of the non-VCL NAL units, which are
   often only a few octets in size.

   An AP aggregates NAL units of one access unit.  Each NAL unit to be
   carried in an AP is encapsulated in an aggregation unit.  NAL units
   aggregated in one AP are included in NAL unit decoding order.

   An AP consists of a payload header (denoted as PayloadHdr) followed
   by two or more aggregation units, as shown in Figure 4.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    PayloadHdr (Type=28)       |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                                                               |
   |             two or more aggregation units                     |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               The Structure of an Aggregation Packet

                               Figure 4

   The fields in the payload header of an AP are set as follows.  The F
   bit MUST be equal to 0 if the F bit of each aggregated NAL unit is
   equal to zero; otherwise, it MUST be equal to 1.  The Type field MUST
   be equal to 28.

   The value of LayerId MUST be equal to the lowest value of LayerId of
   all the aggregated NAL units.  The value of TID MUST be the lowest
   value of TID of all the aggregated NAL units.

      Informative note: All VCL NAL units in an AP have the same TID
      value since they belong to the same access unit.  However, an AP
      may contain non-VCL NAL units for which the TID value in the NAL
      unit header may be different than the TID value of the VCL NAL
      units in the same AP.

   An AP MUST carry at least two aggregation units and can carry as many
   aggregation units as necessary; however, the total amount of data in
   an AP obviously MUST fit into an IP packet, and the size SHOULD be
   chosen so that the resulting IP packet is smaller than the MTU size
   so as to avoid IP layer fragmentation.  An AP MUST NOT contain FUs
   specified in Section 4.3.3.  APs MUST NOT be nested; i.e., an AP
   cannot contain another AP.
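
   The payload header rules for an AP given above can be illustrated
   with a short Python sketch.  This is not part of the payload format:
   the names NalHeaderFields and ap_payload_header are hypothetical and
   introduced here only for illustration, the fields are handled as
   plain integers rather than packed bits, and the Z field is left out
   because the text above states no AP rule for it.

```python
from dataclasses import dataclass
from typing import Sequence

AP_TYPE = 28  # Type value of an aggregation packet
FU_TYPE = 29  # Type value of a fragmentation unit


@dataclass(frozen=True)
class NalHeaderFields:
    """Hypothetical container for the F, LayerId, Type, and TID fields."""
    f: int
    layer_id: int
    nal_type: int
    tid: int


def ap_payload_header(nal_headers: Sequence[NalHeaderFields]) -> NalHeaderFields:
    """Derive the AP payload header fields from the aggregated NAL units."""
    if len(nal_headers) < 2:
        raise ValueError("an AP MUST carry at least two aggregation units")
    if any(h.nal_type in (AP_TYPE, FU_TYPE) for h in nal_headers):
        raise ValueError("an AP MUST NOT contain FUs or other APs")
    return NalHeaderFields(
        # F is 0 only if the F bit of every aggregated NAL unit is 0.
        f=0 if all(h.f == 0 for h in nal_headers) else 1,
        # LayerId and TID take the lowest value among the aggregated units.
        layer_id=min(h.layer_id for h in nal_headers),
        nal_type=AP_TYPE,
        tid=min(h.tid for h in nal_headers),
    )
```

   A sender would call ap_payload_header with the header fields of the
   NAL units it intends to aggregate, in decoding order, before
   serializing the AP.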

   The first aggregation unit in an AP consists of a conditional 16-bit
   DONL field (in network byte order) followed by a 16-bit unsigned size
   information (in network byte order) that indicates the size of the
   NAL unit in bytes (excluding these two octets, but including the NAL
   unit header), followed by the NAL unit itself, including its NAL unit
   header, as shown in Figure 5.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               :       DONL (conditional)      |   NALU size   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   NALU size   |                                               |
   +-+-+-+-+-+-+-+-+         NAL unit                              |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

        The Structure of the First Aggregation Unit in an AP

                               Figure 5

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the aggregated NAL
   unit.

   If sprop-max-don-diff is greater than 0, the DONL field MUST be
   present in an aggregation unit that is the first aggregation unit in
   an AP, and the variable DON for the aggregated NAL unit is derived as
   equal to the value of the DONL field.  Otherwise (sprop-max-don-diff
   is equal to 0), the DONL field MUST NOT be present in an aggregation
   unit that is the first aggregation unit in an AP.

   An aggregation unit that is not the first aggregation unit in an AP
   consists of a 16-bit unsigned size information (in network byte
   order) that indicates the size of the NAL unit in bytes (excluding
   these two octets, but including the NAL unit header), followed by the
   NAL unit itself, including its NAL unit header, as shown in Figure 6.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               :          NALU size            |   NAL unit    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     The Structure of an Aggregation Unit That Is Not the First
                      Aggregation Unit in an AP

                               Figure 6

   Figure 7 presents an example of an AP that contains two aggregation
   units, labeled as 1 and 2 in the figure, without the DONL field being
   present.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   PayloadHdr (Type=28)        |         NALU 1 Size           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          NALU 1 HDR           |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
   |                   . . .                                       |
   |                                                               |
   +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  . . .        | NALU 2 Size                   | NALU 2 HDR    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | NALU 2 HDR    |                                               |
   +-+-+-+-+-+-+-+-+              NALU 2 Data                      |
   |                   . . .                                       |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                An Example of an AP Packet Containing
            Two Aggregation Units without the DONL Field

                               Figure 7

   Figure 8 presents an example of an AP that contains two aggregation
   units, labeled as 1 and 2 in the figure, with the DONL field being
   present.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   PayloadHdr (Type=28)        |        NALU 1 DONL            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           NALU 1 Size         |            NALU 1 HDR         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                 NALU 1 Data   . . .                           |
   |                                                               |
   +        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :          NALU 2 Size          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          NALU 2 HDR           |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
   |                                                               |
   |        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   An Example of an AP Containing
              Two Aggregation Units with the DONL Field

                               Figure 8

4.3.3.  Fragmentation Units

   Fragmentation Units (FUs) are introduced to enable fragmenting a
   single NAL unit into multiple RTP packets, possibly without
   cooperation or knowledge of the [VVC] encoder.  A fragment of a NAL
   unit consists of an integer number of consecutive octets of that NAL
   unit.  Fragments of the same NAL unit MUST be sent in consecutive
   order with ascending RTP sequence numbers (with no other RTP packets
   within the same RTP stream being sent between the first and last
   fragment).

   When a NAL unit is fragmented and conveyed within FUs, it is referred
   to as a fragmented NAL unit.  APs MUST NOT be fragmented.  FUs MUST
   NOT be nested; i.e., an FU cannot contain a subset of another FU.

   The RTP timestamp of an RTP packet carrying an FU is set to the NALU-
   time of the fragmented NAL unit.
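
   The idea that each fragment is a run of consecutive octets can be
   illustrated with a minimal Python sketch.  It only shows the
   splitting step and is not a definitive implementation: the function
   name split_into_fragments and the max_fragment_size parameter are
   hypothetical, and the caller is assumed to have derived the size
   limit from its own MTU budget.  How the resulting fragments are
   wrapped into FUs is specified in the remainder of this section.

```python
from typing import List


def split_into_fragments(octets: bytes, max_fragment_size: int) -> List[bytes]:
    """Split octets into consecutive, non-empty fragments, in order."""
    if max_fragment_size < 1:
        raise ValueError("max_fragment_size must be at least 1")
    if len(octets) <= max_fragment_size:
        # Data that fits into one packet is not fragmented; it would be
        # carried in a single NAL unit packet instead.
        raise ValueError("data fits into one fragment; no fragmentation needed")
    return [
        octets[start:start + max_fragment_size]
        for start in range(0, len(octets), max_fragment_size)
    ]
```

   Concatenating the returned fragments in list order yields the
   original octets again, which is the property a receiver relies on
   when it reassembles a fragmented NAL unit.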

   An FU consists of a payload header (denoted as PayloadHdr), an FU
   header of one octet, a conditional 16-bit DONL field (in network byte
   order), and an FU payload, as shown in Figure 9.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    PayloadHdr (Type=29)       |   FU header   |  DONL (cond)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  DONL (cond)  |                                               |
   +-+-+-+-+-+-+-+-+                                               |
   |                         FU payload                            |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                       The Structure of an FU

                               Figure 9

   The fields in the payload header are set as follows.  The Type field
   MUST be equal to 29.  The fields F, LayerId, and TID MUST be equal to
   the fields F, LayerId, and TID, respectively, of the fragmented NAL
   unit.

   The FU header consists of an S bit, an E bit, an R bit and a 5-bit
   FuType field, as shown in Figure 10.

   +---------------+
   |0|1|2|3|4|5|6|7|
   +-+-+-+-+-+-+-+-+
   |S|E|R|  FuType |
   +---------------+

                     The Structure of FU Header

                              Figure 10

   The semantics of the FU header fields are as follows:

   S: 1 bit

      When set to 1, the S bit indicates the start of a fragmented NAL
      unit, i.e., the first byte of the FU payload is also the first
      byte of the payload of the fragmented NAL unit.  When the FU
      payload is not the start of the fragmented NAL unit payload, the S
      bit MUST be set to 0.

   E: 1 bit

      When set to 1, the E bit indicates the end of a fragmented NAL
      unit, i.e., the last byte of the payload is also the last byte of
      the fragmented NAL unit.  When the FU payload is not the last
      fragment of a fragmented NAL unit, the E bit MUST be set to 0.

   R: 1 bit

      Reserved.  Placeholder

   FuType: 5 bits

      The field FuType MUST be equal to the field Type of the fragmented
      NAL unit.

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the fragmented NAL
   unit.

   If sprop-max-don-diff is greater than 0, and the S bit is equal to 1,
   the DONL field MUST be present in the FU, and the variable DON for
   the fragmented NAL unit is derived as equal to the value of the DONL
   field.  Otherwise (sprop-max-don-diff is equal to 0, or the S bit is
   equal to 0), the DONL field MUST NOT be present in the FU.

   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
   the Start bit and End bit must not both be set to 1 in the same FU
   header.

   The FU payload consists of fragments of the payload of the fragmented
   NAL unit so that if the FU payloads of consecutive FUs, starting with
   an FU with the S bit equal to 1 and ending with an FU with the E bit
   equal to 1, are sequentially concatenated, the payload of the
   fragmented NAL unit can be reconstructed.  The NAL unit header of the
   fragmented NAL unit is not included as such in the FU payload, but
   rather the information of the NAL unit header of the fragmented NAL
   unit is conveyed in F, LayerId, and TID fields of the FU payload
   headers of the FUs and the FuType field of the FU header of the FUs.
   An FU payload MUST NOT be empty.

   If an FU is lost, the receiver SHOULD discard all following
   fragmentation units in transmission order corresponding to the same
   fragmented NAL unit, unless the decoder in the receiver is known to
   be prepared to gracefully handle incomplete NAL units.

   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
   fragments of a NAL unit to an (incomplete) NAL unit, even if fragment
   n of that NAL unit is not received.
   In this case, the forbidden_zero_bit of the NAL unit MUST be set to 1
   to indicate a syntax violation.

4.4.  Decoding Order Number

   For each NAL unit, the variable AbsDon is derived, representing the
   decoding order number that is indicative of the NAL unit decoding
   order.

   Let NAL unit n be the n-th NAL unit in transmission order within an
   RTP stream.

   If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon
   for NAL unit n, is derived as equal to n.

   Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is
   derived as follows, where DON[n] is the value of the variable DON for
   NAL unit n:

   o  If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in
      transmission order), AbsDon[0] is set equal to DON[0].

   o  Otherwise (n is greater than 0), the following applies for
      derivation of AbsDon[n]:

      If DON[n] == DON[n-1],
         AbsDon[n] = AbsDon[n-1]

      If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768),
         AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1]

      If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768),
         AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n]

      If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768),
         AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - DON[n])

      If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768),
         AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n])

   For any two NAL units m and n, the following applies:

   o  AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows
      NAL unit m in NAL unit decoding order.

   o  When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order
      of the two NAL units can be in either order.

   o  AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes
      NAL unit m in decoding order.
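
   The AbsDon derivation for the case where sprop-max-don-diff is
   greater than 0 can be transcribed almost directly into Python.  The
   sketch below is illustrative only; the function names next_abs_don
   and abs_don_sequence are hypothetical, and the input is assumed to be
   the DON values of the NAL units in transmission order.

```python
from typing import Iterable, List, Optional

DON_MODULUS = 65536  # DON is carried in 16 bits
HALF_RANGE = 32768   # wrap-around threshold used in the derivation


def next_abs_don(prev_abs_don: int, prev_don: int, don: int) -> int:
    """Derive AbsDon[n] from AbsDon[n-1], DON[n-1], and DON[n]."""
    if don == prev_don:
        return prev_abs_don
    if don > prev_don and don - prev_don < HALF_RANGE:
        return prev_abs_don + don - prev_don
    if don < prev_don and prev_don - don >= HALF_RANGE:
        # DON wrapped around from a high value to a low value.
        return prev_abs_don + DON_MODULUS - prev_don + don
    if don > prev_don and don - prev_don >= HALF_RANGE:
        # The NAL unit precedes the previous one across a wrap-around.
        return prev_abs_don - (prev_don + DON_MODULUS - don)
    # Remaining case: don < prev_don and prev_don - don < HALF_RANGE.
    return prev_abs_don - (prev_don - don)


def abs_don_sequence(dons: Iterable[int]) -> List[int]:
    """Return AbsDon for each NAL unit, given DON in transmission order."""
    result: List[int] = []
    prev_don: Optional[int] = None
    for don in dons:
        if prev_don is None:
            result.append(don)  # AbsDon[0] is set equal to DON[0]
        else:
            result.append(next_abs_don(result[-1], prev_don, don))
        prev_don = don
    return result
```

   For example, the DON values 65534, 65535, 0, 1, 65535 map to the
   AbsDon values 65534, 65535, 65536, 65537, 65535: the third and fourth
   NAL units continue past the 16-bit wrap-around, and the last one is
   placed before them in decoding order.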

      Informative note: When two consecutive NAL units in the NAL unit
      decoding order have different values of AbsDon, the absolute
      difference between the two AbsDon values may be greater than or
      equal to 1.

      Informative note: There are multiple reasons to allow for the
      absolute difference of the values of AbsDon for two consecutive
      NAL units in the NAL unit decoding order to be greater than one.
      An increment by one is not required, as at the time of associating
      values of AbsDon to NAL units, it may not be known whether all NAL
      units are to be delivered to the receiver.  For example, a gateway
      might not forward VCL NAL units of higher sub-layers or some SEI
      NAL units when there is congestion in the network.
      In another example, the first intra-coded picture of a pre-encoded
      clip is transmitted in advance to ensure that it is readily
      available in the receiver, and when transmitting the first intra-
      coded picture, the originator does not exactly know how many NAL
      units will be encoded before the first intra-coded picture of the
      pre-encoded clip follows in decoding order.  Thus, the values of
      AbsDon for the NAL units of the first intra-coded picture of the
      pre-encoded clip have to be estimated when they are transmitted,
      and gaps in values of AbsDon may occur.

5.  Packetization Rules

   The following packetization rules apply:

   o  If sprop-max-don-diff is greater than 0, the transmission order of
      NAL units carried in the RTP stream MAY be different than the NAL
      unit decoding order and the NAL unit output order.

   o  A NAL unit of a small size SHOULD be encapsulated in an
      aggregation packet together with one or more other NAL units in
      order to avoid the unnecessary packetization overhead for small
      NAL units.
      For example, non-VCL NAL units such as access unit
      delimiters, parameter sets, or SEI NAL units are typically small
      and can often be aggregated with VCL NAL units without violating
      MTU size constraints.

   o  Each non-VCL NAL unit SHOULD, when possible from an MTU size match
      viewpoint, be encapsulated in an aggregation packet together with
      its associated VCL NAL unit, as typically a non-VCL NAL unit would
      be meaningless without the associated VCL NAL unit being
      available.

   o  For carrying exactly one NAL unit in an RTP packet, a single NAL
      unit packet MUST be used.

6.  De-packetization Process

   The general concept behind de-packetization is to get the NAL units
   out of the RTP packets in an RTP stream and pass them to the decoder
   in the NAL unit decoding order.

   The de-packetization process is implementation dependent.  Therefore,
   the following description should be seen as an example of a suitable
   implementation.  Other schemes may be used as well, as long as the
   output for the same input is the same as the process described below.
   The output is the same when the set of output NAL units and their
   order are both identical.  Optimizations relative to the described
   algorithms are possible.

   All normal RTP mechanisms related to buffer management apply.  In
   particular, duplicated or outdated RTP packets (as indicated by the
   RTP sequence number and the RTP timestamp) are removed.  To
   determine the exact time for decoding, factors such as a possible
   intentional delay to allow for proper inter-stream synchronization
   MUST be factored in.

   NAL units with NAL unit type values in the range of 0 to 27,
   inclusive, may be passed to the decoder.  NAL-unit-like structures
   with NAL unit type values in the range of 28 to 31, inclusive, MUST
   NOT be passed to the decoder.
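
   The type ranges in the preceding paragraph amount to a simple check
   on the 5-bit Type value.  A minimal Python sketch, with the
   hypothetical function name may_pass_to_decoder chosen here for
   illustration only:

```python
def may_pass_to_decoder(nal_type: int) -> bool:
    """Return True if a structure with this Type value may reach the decoder."""
    if not 0 <= nal_type <= 31:
        raise ValueError("Type is a 5-bit value in the range 0..31")
    # Values 0..27 are NAL units that may be passed to the decoder;
    # values 28..31 are NAL-unit-like structures that MUST NOT be passed on.
    return nal_type <= 27
```

   A de-packetizer would apply such a check after it has unpacked
   aggregation packets and reassembled fragmented NAL units, so that
   only the contained NAL units are handed to the decoder.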

   The receiver includes a receiver buffer, which is used to compensate
   for transmission delay jitter within individual RTP streams and
   across RTP streams, and to reorder NAL units from transmission order
   to the NAL unit decoding order.  In this section, the receiver
   operation is described under the assumption that there is no
   transmission delay jitter within an RTP stream and across RTP
   streams.  To distinguish it from a practical receiver buffer, which
   is also used to compensate for transmission delay jitter, the
   receiver buffer is hereafter called the de-packetization buffer in
   this section.  Receivers should also prepare for transmission delay
   jitter; that is, either reserve separate buffers for transmission
   delay jitter buffering and de-packetization buffering or use a
   receiver buffer for both transmission delay jitter and
   de-packetization.  Moreover, receivers should take transmission delay
   jitter into account in the buffering operation, e.g., by additional
   initial buffering before decoding and playback start.

   When sprop-max-don-diff is equal to 0, the de-packetization buffer
   size is zero bytes, and the process described in the remainder of
   this paragraph applies.
   The NAL units carried in the single RTP stream are directly passed to
   the decoder in their transmission order, which is identical to their
   decoding order.  When there are several NAL units of the same RTP
   stream with the same NTP timestamp, the order to pass them to the
   decoder is their transmission order.

      Informative note: The mapping between RTP and NTP timestamps is
      conveyed in RTCP SR packets.  In addition, the mechanisms for
      faster media timestamp synchronization discussed in [RFC6051] may
      be used to speed up the acquisition of the RTP-to-wall-clock
      mapping.
1405 When sprop-max-don-diff is greater than 0, the process described in 1406 the remainder of this section applies. 1408 There are two buffering states in the receiver: initial buffering and 1409 buffering while playing. Initial buffering starts when the reception 1410 is initialized. After initial buffering, decoding and playback are 1411 started, and the buffering-while-playing mode is used. 1413 Regardless of the buffering state, the receiver stores incoming NAL 1414 units, in reception order, into the de-packetization buffer. NAL 1415 units carried in RTP packets are stored in the de-packetization 1416 buffer individually, and the value of AbsDon is calculated and stored 1417 for each NAL unit. 1419 Initial buffering lasts until condition A (the difference between the 1420 greatest and smallest AbsDon values of the NAL units in the de- 1421 packetization buffer is greater than or equal to the value of sprop- 1422 max-don-diff) or condition B (the number of NAL units in the de- 1423 packetization buffer is greater than the value of sprop-depack-buf- 1424 nalus) is true. 1426 After initial buffering, whenever condition A or condition B is true, 1427 the following operation is repeatedly applied until both condition A 1428 and condition B become false: 1430 o The NAL unit in the de-packetization buffer with the smallest 1431 value of AbsDon is removed from the de-packetization buffer and 1432 passed to the decoder. 1434 When no more NAL units are flowing into the de-packetization buffer, 1435 all NAL units remaining in the de-packetization buffer are removed 1436 from the buffer and passed to the decoder in the order of increasing 1437 AbsDon values. 1439 7. Payload Format Parameters 1441 This section specifies the optional parameters. A mapping of the 1442 parameters to the Session Description Protocol (SDP) [RFC4566] is also 1443 provided for applications that use SDP. 1445 7.1.
Media Type Registration 1447 The receiver MUST ignore any parameter unspecified in this memo. 1449 Type name: Video 1451 Subtype name: H266 1453 Required parameters: none 1455 Optional parameters: 1457 Editor's notes: To be added 1459 7.2. SDP Parameters 1461 The receiver MUST ignore any parameter unspecified in this memo. 1463 7.2.1. Mapping of Payload Type Parameters to SDP 1465 The media type video/H266 string is mapped to fields in the Session 1466 Description Protocol (SDP) [RFC4566] as follows: 1468 o The media name in the "m=" line of SDP MUST be video. 1470 o The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the 1471 media subtype). 1473 o The clock rate in the "a=rtpmap" line MUST be 90000. 1475 o OPTIONAL PARAMETERS: 1477 Editor's notes: To be discussed here 1479 7.2.1.1. SDP Example 1481 An example of media representation in SDP is as follows: 1483 m=video 49170 RTP/AVP 98 1484 a=rtpmap:98 H266/90000 1485 a=fmtp:98 profile-id=1; sprop-vps=