idnits 2.17.1

draft-ietf-avtcore-rtp-vvc-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).

  -- The document date (October 27, 2020) is 1275 days in the past. Is this
     intentional?

  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 1269

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO23090-3'

  ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866)

  ** Downref: Normative reference to an Informational RFC: RFC 7656

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC'

     Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------

2    avtcore                                                          S.
                                                                         Zhao
3    Internet-Draft                                                 S. Wenger
4    Intended status: Standards Track                                 Tencent
5    Expires: April 30, 2021                                       Y. Sanchez
6                                                              Fraunhofer HHI
7                                                            October 27, 2020

9              RTP Payload Format for Versatile Video Coding (VVC)
10                        draft-ietf-avtcore-rtp-vvc-03

12   Abstract

14      This memo describes an RTP payload format for the video coding
15      standard ITU-T Recommendation H.266 and ISO/IEC International
16      Standard ISO23090-3, both also known as Versatile Video Coding (VVC)
17      and developed by the Joint Video Experts Team (JVET). The RTP
18      payload format allows for packetization of one or more Network
19      Abstraction Layer (NAL) units in each RTP packet payload as well as
20      fragmentation of a NAL unit into multiple RTP packets. The payload
21      format has wide applicability in videoconferencing, Internet video
22      streaming, and high-bitrate entertainment-quality video, among other
23      applications.

25   Status of This Memo

27      This Internet-Draft is submitted in full conformance with the
28      provisions of BCP 78 and BCP 79.

30      Internet-Drafts are working documents of the Internet Engineering
31      Task Force (IETF). Note that other groups may also distribute
32      working documents as Internet-Drafts. The list of current Internet-
33      Drafts is at https://datatracker.ietf.org/drafts/current/.

35      Internet-Drafts are draft documents valid for a maximum of six months
36      and may be updated, replaced, or obsoleted by other documents at any
37      time. It is inappropriate to use Internet-Drafts as reference
38      material or to cite them other than as "work in progress."

40      This Internet-Draft will expire on April 30, 2021.

42   Copyright Notice

44      Copyright (c) 2020 IETF Trust and the persons identified as the
45      document authors. All rights reserved.

47      This document is subject to BCP 78 and the IETF Trust's Legal
48      Provisions Relating to IETF Documents
49      (https://trustee.ietf.org/license-info) in effect on the date of
50      publication of this document.
        Please review these documents
51      carefully, as they describe your rights and restrictions with respect
52      to this document. Code Components extracted from this document must
53      include Simplified BSD License text as described in Section 4.e of
54      the Trust Legal Provisions and are provided without warranty as
55      described in the Simplified BSD License.

57   Table of Contents

59      1.  Introduction
60        1.1.  Overview of the VVC Codec
61          1.1.1.  Coding-Tool Features (informative)
62          1.1.2.  Systems and Transport Interfaces
63          1.1.3.  Parallel Processing Support (informative)
64          1.1.4.  NAL Unit Header
65        1.2.  Overview of the Payload Format
66      2.  Conventions
67      3.  Definitions and Abbreviations
68        3.1.  Definitions
69          3.1.1.  Definitions from the VVC Specification
70          3.1.2.  Definitions Specific to This Memo
71        3.2.  Abbreviations
72      4.  RTP Payload Format
73        4.1.  RTP Header Usage
74        4.2.  Payload Header Usage
75        4.3.  Payload Structures
76          4.3.1.  Single NAL Unit Packets
77          4.3.2.  Aggregation Packets (APs)
78          4.3.3.  Fragmentation Units
79        4.4.  Decoding Order Number
80      5.  Packetization Rules
81      6.  De-packetization Process
82      7.  Payload Format Parameters
83        7.1.  Media Type Registration
84        7.2.  SDP Parameters
85          7.2.1.  Mapping of Payload Type Parameters to SDP
86          7.2.2.  Usage with SDP Offer/Answer Model
87      8.  Use with Feedback Messages
88        8.1.  Picture Loss Indication (PLI)
89        8.2.  Slice Loss Indication (SLI)
90        8.3.  Reference Picture Selection Indication (RPSI)
91        8.4.  Full Intra Request (FIR)
92      9.  Frame Marking
93        9.1.  Frame Marking Short Extension
94        9.2.  Frame Marking Long Extension
95      10. Security Considerations
96      11. Congestion Control
97      12. IANA Considerations
98      13. Acknowledgements
99      14. References
100       14.1.  Normative References
101       14.2.  Informative References
102     Appendix A.  Change History
103     Authors' Addresses

105  1.  Introduction

107     The Versatile Video Coding [VVC] specification, formally published as
108     both ITU-T Recommendation H.266 and ISO/IEC International Standard
109     23090-3 [ISO23090-3], is currently in the ITU-T publication process
110     and the ISO/IEC approval process. [H.266] is reported to provide
111     significant coding efficiency gains over H.265 and earlier video
112     codec formats.

114     This memo specifies an RTP payload format for VVC.
        It shares its
115     basic design with the NAL (Network Abstraction Layer) unit-based RTP
116     payload formats of H.264 Video Coding [RFC6184], Scalable Video
117     Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798],
118     and their respective predecessors. With respect to design
119     philosophy, security, congestion control, and overall implementation
120     complexity, it has similar properties to those earlier payload format
121     specifications. This is a conscious choice, as at least RFC 6184 is
122     widely deployed and generally known in the relevant implementer
123     communities. Certain mechanisms known from [RFC6190] were
124     incorporated in VVC, as VVC version 1 supports temporal, spatial, and
125     signal-to-noise ratio (SNR) scalability.

127  1.1.  Overview of the VVC Codec

129     [VVC] and [HEVC] share a similar hybrid video codec design. In this
130     memo, we provide a very brief overview of those features of VVC that
131     are, in some form, addressed by the payload format specified herein.
132     Implementers have to read, understand, and apply the ITU-T/ISO/IEC
133     specifications pertaining to [VVC] to arrive at interoperable, well-
134     performing implementations.

136     Conceptually, both [VVC] and [HEVC] include a Video Coding Layer
137     (VCL), which is often used to refer to the coding-tool features, and
138     a NAL, which is often used to refer to the systems and transport
139     interface aspects of the codecs.

141  1.1.1.  Coding-Tool Features (informative)

143     Coding tool features are described below with occasional reference to
144     the coding tool set of [HEVC], which is well known in the community.

146     Similar to earlier hybrid-video-coding-based standards, including
147     HEVC, the following basic video coding design is employed by VVC. A
148     prediction signal is first formed by either intra- or motion-
149     compensated prediction, and the residual (the difference between the
150     original and the prediction) is then coded.
        The gains in coding
151     efficiency are achieved by redesigning and improving almost all parts
152     of the codec over earlier designs. In addition, [VVC] includes
153     several tools to make the implementation on parallel architectures
154     easier.

156     Finally, [VVC] includes temporal, spatial, and SNR scalability as
157     well as multiview coding support.

159     Coding blocks and transform structure

161     Among major coding-tool differences between HEVC and VVC, one of the
162     important improvements is the more flexible coding tree structure in
163     VVC, i.e., multi-type tree. In addition to quadtree, binary and
164     ternary trees are also supported, which contributes a significant
165     improvement in coding efficiency. Moreover, the maximum size of a
166     coding tree unit (CTU) is increased from 64x64 to 128x128. To
167     improve the coding efficiency of the chroma signal, separate luma and
168     chroma trees at CTU level may be employed for intra slices. The square
169     transforms in HEVC are extended to non-square transforms for
170     rectangular blocks resulting from binary and ternary tree splits.
171     Besides, [VVC] supports multiple transform sets (MTS), including DCT-
172     2, DST-7, and DCT-8 as well as the non-separable secondary transform.
173     The transforms used in [VVC] can have different sizes with support
174     for larger transform sizes. For DCT-2, the transform sizes range
175     from 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range
176     from 4x4 to 32x32. In addition, [VVC] also supports sub-block
177     transform for both intra and inter coded blocks. For intra coded
178     blocks, intra sub-partitioning (ISP) may be used to allow sub-block
179     based intra prediction and transform. For inter blocks, sub-block
180     transform may be used assuming that only a part of an inter-block has
181     non-zero transform coefficients.
183     Entropy coding

185     Similar to HEVC, VVC uses a single entropy-coding engine, which is
186     based on context adaptive binary arithmetic coding [CABAC], but with
187     the support of multi-window sizes. The window sizes can be
188     initialized differently for different context models. This design
189     provides faster adaptation and better coding
190     efficiency. A joint chroma residual coding scheme is applied to
191     further exploit the correlation between the residuals of two color
192     components. In VVC, different residual coding schemes are applied
193     for regular transform coefficients and residual samples generated
194     using transform-skip mode.

196     In-loop filtering

198     VVC supports more in-loop filter features than HEVC. The
199     deblocking filter in VVC is similar to HEVC but operates on a smaller
200     grid. After deblocking and sample adaptive offset (SAO), an adaptive
201     loop filter (ALF) may be used. As a Wiener filter, ALF reduces
202     distortion of decoded pictures. Besides, VVC introduces a new module
203     before deblocking called luma mapping with chroma scaling to fully
204     utilize the dynamic range of the signal so that rate-distortion
205     performance of both SDR and HDR content is improved.

207     Motion prediction and coding

209     Compared to HEVC, [VVC] introduces several improvements in this area.
210     First, there is the adaptive motion vector resolution (AMVR), which
211     can save bit cost for motion vectors by adaptively signaling motion
212     vector resolution. Then the affine motion compensation is included
213     to capture complicated motion like zooming and rotation. Meanwhile,
214     prediction refinement with optical flow for affine mode (PROF)
215     is further deployed to mimic affine motion at the pixel level.
216     Third, decoder-side motion vector refinement (DMVR) is a method
217     to derive motion vectors at the decoder side based on block matching
218     so that fewer bits may be spent on motion vectors.
        Bi-directional optical
219     flow (BDOF) is a similar method to PROF. BDOF adds a sample-wise
220     offset at 4x4 sub-block level that is derived with equations based on
221     gradients of the prediction samples and a motion difference relative
222     to CU motion vectors. Furthermore, merge with motion vector
223     difference (MMVD) is a special mode, which further signals a limited
224     set of motion vector differences on top of merge mode. In addition
225     to MMVD, there are three other types of special merge modes, i.e.,
226     sub-block merge, triangle, and combined intra-/inter-prediction
227     (CIIP). The sub-block merge list includes one candidate of sub-block
228     temporal motion vector prediction (SbTMVP) and up to four candidates
229     of affine motion vectors. Triangle is based on triangular block
230     motion compensation. CIIP combines intra- and inter-predictions
231     with weighting. Adaptive weighting may be employed with a block-
232     level tool called bi-prediction with CU based weighting (BCW) which
233     provides more flexibility than in HEVC.

235     Intra prediction and intra-coding

236     To capture the diversified local image texture directions with finer
237     granularity, [VVC] supports 65 angular directions instead of 33
238     directions in HEVC. The intra mode coding is based on a 6-most-
239     probable-mode scheme, and the 6 most probable modes are derived using
240     the neighboring intra prediction directions. In addition, to deal
241     with the different distributions of intra prediction angles for
242     different block aspect ratios, a wide-angle intra prediction (WAIP)
243     scheme is applied in [VVC] by including intra prediction angles
244     beyond those present in HEVC. Unlike HEVC which only allows using
245     the most adjacent line of reference samples for intra prediction,
246     [VVC] also allows using two further reference lines, known as
247     multi-reference-line (MRL) intra prediction.
        The additional
248     reference lines can only be used for the 6 most probable intra
249     prediction modes. To capture the strong correlation between
250     different colour components, in VVC, a cross-component linear mode
251     (CCLM) is utilized which assumes a linear relationship between the
252     luma sample values and their associated chroma samples. For intra
253     prediction, [VVC] also applies a position-dependent prediction
254     combination (PDPC) for refining the prediction samples closer to the
255     intra prediction block boundary. Matrix-based intra prediction (MIP)
256     modes are also used in [VVC], which generate an up to 8x8 intra
257     prediction block using a weighted sum of downsampled neighboring
258     reference samples, where the weights are hardcoded constants.

260     Other coding-tool feature

262     [VVC] introduces dependent quantization (DQ) to reduce quantization
263     error by state-based switching between two quantizers.

265  1.1.2.  Systems and Transport Interfaces

267     [VVC] inherits the basic systems and transport interface designs
268     from HEVC and H.264. These include the NAL-unit-based syntax
269     structure, the hierarchical syntax and data unit structure, the
270     supplemental enhancement information (SEI) message mechanism, and the
271     video buffering model based on the hypothetical reference decoder
272     (HRD). The scalability features of [VVC] are conceptually similar to
273     the scalable variant of HEVC known as SHVC. The hierarchical syntax
274     and data unit structure consists of parameter sets at various levels
275     (decoder, sequence (pertaining to all layers), sequence (pertaining
276     to a single layer), picture), picture-level header parameters,
277     slice-level header parameters, and lower-level parameters.
279     A number of key components that influenced the network abstraction
280     layer design of [VVC] as well as this memo are described below.

282     Decoding Capability Information

283     The decoding capability information includes parameters that stay
284     constant for the lifetime of a Video Bitstream, which in IETF terms
285     can translate to the lifetime of a session. Such information
286     includes profile, level, and sub-profile information to determine a
287     maximum capability interop point that is guaranteed never to be
288     exceeded, even if splicing of video sequences occurs within a
289     session. It further includes constraint fields (most of which are
290     flags), which can optionally be set to indicate that the video
291     bitstream will be constrained in the use of certain features as
292     indicated by the values of those fields. With this, a bitstream can
293     be labelled as not using certain tools, which allows among other
294     things for resource allocation in a decoder implementation.

296     Video parameter set

298     The video parameter set (VPS) pertains to a coded video sequence
299     (CVS) of multiple layers covering the same range of access units, and
300     includes, among other information, decoding dependency expressed as
301     information for reference picture list construction of enhancement
302     layers. The VPS provides a "big picture" of a scalable sequence,
303     including what types of operation points are provided, the profile,
304     tier, and level of the operation points, and some other high-level
305     properties of the bitstream that can be used as the basis for session
306     negotiation and content selection, etc. One VPS may be referenced by
307     one or more sequence parameter sets.
309     Sequence parameter set

311     The sequence parameter set (SPS) contains syntax elements pertaining
312     to a coded layer video sequence (CLVS), which is a group of pictures
313     belonging to the same layer, starting with a random access point, and
314     followed by pictures that may depend on each other, until the next
315     random access point picture. In MPEG-2, the equivalent of a CVS was
316     a group of pictures (GOP), which normally started with an I frame and
317     was followed by P and B frames. While more complex in its options of
318     random access points, VVC retains this basic concept. One remarkable
319     difference of VVC is that a CLVS may start with a Gradual Decoding
320     Refresh (GDR) picture, without requiring the presence of traditional
321     random access points in the bitstream, such as instantaneous decoding
322     refresh (IDR) or clean random access (CRA) pictures. In many TV-like
323     applications, a CVS contains a few hundred milliseconds to a few
324     seconds of video. In video conferencing (without switching MCUs
325     involved), a CVS can be as long in duration as the whole session.

327     Picture and adaptation parameter set

329     The picture parameter set and the adaptation parameter set (PPS and
330     APS, respectively) carry information pertaining to zero or more
331     pictures and zero or more slices, respectively. The PPS contains
332     information that is likely to stay constant from picture to picture
333     (at least for pictures of a certain type), whereas the APS contains
334     information, such as adaptive loop filter coefficients, that is
335     likely to change from picture to picture or even within a picture. A
336     single APS is referenced by all slices of the same picture if that
337     APS contains information about luma mapping with chroma scaling
338     (LMCS) or scaling list. Different APSs containing ALF parameters can
339     be referenced by slices of the same picture.
341     Picture Header

343     A Picture Header contains information that is common to all slices
344     that belong to the same picture. Being able to send that information
345     as a separate NAL unit when pictures are split into several slices
346     allows for saving bitrate, compared to repeating the same information
347     in all slices. However, there might be scenarios where low-bitrate
348     video is transmitted using a single slice per picture. Having a
349     separate NAL unit to convey that information incurs overhead
350     in such scenarios. In these cases, the picture header syntax
351     structure is directly included in the slice header, instead of in its
352     own NAL unit. The mode of the picture header syntax structure being
353     included in its own NAL unit or not can only be switched on/off for
354     an entire CLVS, and can only be switched off when in the entire CLVS
355     each picture contains only one slice.

357     Profile, tier, and level

359     The profile, tier and level syntax structures in DCI, VPS and SPS
360     contain profile, tier, level information for all layers that refer to
361     the DCI, for layers associated with one or more output layer sets
362     specified by the VPS, and for any layer that refers to the SPS,
363     respectively.

365     Sub-Profiles

367     Within the VVC specification, a sub-profile is a 32-bit number, coded
368     according to ITU-T Rec. T.35, that does not carry any semantics. It is
369     carried in the profile_tier_level structure and hence (potentially)
370     present in the DCI, VPS, and SPS. External registration bodies can
371     register a T.35 codepoint with ITU-T registration authorities and
372     associate with their registration a description of bitstream
373     restrictions beyond the profiles defined by ITU-T and ISO/IEC. This
374     would allow encoder manufacturers to label the bitstreams generated
375     by their encoder as complying with such a sub-profile.
        It is expected
376     that upstream standardization organizations (such as DVB and ATSC),
377     as well as walled-garden video services will take advantage of this
378     labelling system. In contrast to "normal" profiles, it is expected
379     that sub-profiles may indicate encoder choices traditionally left
380     open in the (decoder-centric) video coding specs, such as GOP
381     structures, minimum/maximum QP values, and the mandatory use of
382     certain tools or SEI messages.

384     Constraint Fields

386     The profile_tier_level structure carries a considerable number of
387     constraint fields (most of which are flags), which an encoder can use
388     to indicate to a decoder that it will not use a certain tool or
389     technology. They were included in reaction to a perceived market
390     need for labelling a bitstream as not exercising a certain tool that
391     has become commercially unviable.

393     Temporal scalability support

395        Editor's note: this needs to be updated along with the new VVC
396        draft in the future

398     [VVC] includes support of temporal scalability, by inclusion of the
399     signaling of TemporalId in the NAL unit header, the restriction that
400     pictures of a particular temporal sublayer cannot be used for inter
401     prediction reference by pictures of a lower temporal sublayer, the
402     sub-bitstream extraction process, and the requirement that each sub-
403     bitstream extraction output be a conforming bitstream. Media-Aware
404     Network Elements (MANEs) can utilize the TemporalId in the NAL unit
405     header for stream adaptation purposes based on temporal scalability.

407     Reference picture resampling (RPR)

409        Editor's note: to be updated

411     Spatial, SNR, and multiview scalability

413     [VVC] includes support for spatial, SNR, and multiview scalability.
414     Scalable video coding is widely considered to have technical benefits
415     and enrich services for various video applications.
        Until recently,
416     however, this functionality was not included in the first version
417     of video codec specifications. In VVC, however, all those
418     forms of scalability are supported natively through the signaling of
419     the layer_id in the NAL unit header, the VPS which associates layers
420     with given layer_ids to each other, reference picture selection,
421     reference picture resampling for spatial scalability, and a number of
422     other mechanisms not relevant for this memo. Scalability support can
423     be implemented in a single decoding "loop" and is widely considered a
424     comparatively lightweight operation.

426        Spatial Scalability

427           With the existence of Reference Picture Resampling (RPR) in
428           the "main" profile of VVC, the additional burden for
429           scalability support is just a modification of the high-level
430           syntax (HLS). Inter-layer prediction is employed in a
431           scalable system to improve the coding efficiency of the
432           enhancement layers. In addition to the spatial and temporal
433           motion-compensated predictions that are available in a single-
434           layer codec, the inter-layer prediction in VVC uses the
435           possibly resampled video data of the reconstructed reference
436           picture from a reference layer to predict the current
437           enhancement layer. The resampling process for inter-layer
438           prediction, when used, is performed at the block level, reusing
439           the existing interpolation process for motion compensation in
440           single-layer coding. This means that no additional resampling
441           process is needed to support spatial scalability.

443        SNR Scalability

445           SNR scalability is similar to spatial scalability except that
446           the resampling factors are 1:1. In other words, there is no
447           change in resolution, but there is inter-layer prediction.
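
As a rough illustration of how a MANE might use TemporalId and the layer identifier for stream adaptation, the sketch below parses the two-byte NAL unit header (field layout as in Figure 1 of Section 1.1.4) and decides whether to forward a NAL unit. The names NalHeader, parse_nal_header, and should_forward are hypothetical and not defined by [VVC] or this memo. The simple "at or below a threshold" rule is an assumption made for illustration; it is not the sub-bitstream extraction process of [VVC], which also takes output layer sets into account.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalHeader:
    """Fields of the two-byte VVC NAL unit header (hypothetical container)."""
    forbidden_zero_bit: int  # F, 1 bit
    reserved_zero_bit: int   # Z, 1 bit
    layer_id: int            # nuh_layer_id, 6 bits
    nal_unit_type: int       # nal_unit_type, 5 bits
    temporal_id: int         # nuh_temporal_id_plus1 minus 1


def parse_nal_header(nal: bytes) -> NalHeader:
    """Split the first two bytes of a NAL unit into header fields."""
    if len(nal) < 2:
        raise ValueError("a NAL unit header needs at least two bytes")
    b0, b1 = nal[0], nal[1]
    tid_plus1 = b1 & 0x07
    if tid_plus1 == 0:
        # A TID value of 0 is illegal per the field description.
        raise ValueError("nuh_temporal_id_plus1 must not be 0")
    return NalHeader(
        forbidden_zero_bit=b0 >> 7,
        reserved_zero_bit=(b0 >> 6) & 0x01,
        layer_id=b0 & 0x3F,
        nal_unit_type=b1 >> 3,
        temporal_id=tid_plus1 - 1,
    )


def should_forward(nal: bytes, max_temporal_id: int, max_layer_id: int) -> bool:
    """Simplified policy (an assumption): keep NAL units whose TemporalId
    and layer identifier do not exceed the selected thresholds."""
    header = parse_nal_header(nal)
    return (header.temporal_id <= max_temporal_id
            and header.layer_id <= max_layer_id)
```

For example, a header with LayerID 1, Type 19, and TID 3 is the byte pair 0x01 0x9B; it parses to TemporalId 2 and would be dropped when the target is max_temporal_id=1.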
449     SEI Messages

451     Supplemental enhancement information (SEI) messages are information
452     in the bitstream that does not influence the decoding process as
453     specified in the VVC spec, but addresses issues of representation/
454     rendering of the decoded bitstream, labels the bitstream for certain
455     applications, among other, similar tasks. The overall concept of SEI
456     messages and many of the messages themselves have been inherited from
457     the H.264 and HEVC specs. Except for the SEI messages that affect
458     the specification of the hypothetical reference decoder (HRD), other
459     SEI messages for use in the VVC environment, which are generally
460     useful also in other video coding technologies, are not included in
461     the main VVC specification.

463  1.1.3.  Parallel Processing Support (informative)

465     Compared to HEVC, the [VVC] design to support parallelization offers
466     numerous improvements.

468        Editor's note: an update on sub-picture/slice/tile is needed
469        following the new VVC draft

471  1.1.4.  NAL Unit Header

473     [VVC] maintains the NAL unit concept of HEVC with modifications. VVC
474     uses a two-byte NAL unit header, as shown in Figure 1. The payload
475     of a NAL unit refers to the NAL unit excluding the NAL unit header.

477                     +---------------+---------------+
478                     |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
479                     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
480                     |F|Z|  LayerID  |  Type   | TID |
481                     +---------------+---------------+

483                 The Structure of the VVC NAL Unit Header.

485                                 Figure 1

487     The semantics of the fields in the NAL unit header are as specified
488     in [VVC] and described briefly below for convenience. In addition to
489     the name and size of each field, the corresponding syntax element
490     name in [VVC] is also provided.

492     F: 1 bit

494        forbidden_zero_bit. Required to be zero in VVC. Note that the
495        inclusion of this bit in the NAL unit header was to enable
496        transport of [VVC] video over MPEG-2 transport systems (avoidance
497        of start code emulations) [MPEG2S].
           In the context of this memo
498        the value 1 may be used to indicate a syntax violation, e.g., for
499        a NAL unit resulting from aggregating a number of fragmented units
500        of a NAL unit but missing the last fragment, as described in
501        Section TBD.

503     Z: 1 bit

505        nuh_reserved_zero_bit. Required to be zero in VVC, and reserved
506        for future extensions by ITU-T and ISO/IEC.
507        This memo does not overload the "Z" bit for local extensions, as
508        a) overloading the "F" bit is sufficient and b) doing so preserves
509        the usefulness of this memo to possible future versions of [VVC].

511     LayerId: 6 bits

513        nuh_layer_id. Identifies the layer a NAL unit belongs to, wherein
514        a layer may be, e.g., a spatial scalable layer or a quality
515        scalable layer.

517     Type: 5 bits

518        nal_unit_type. This field specifies the NAL unit type as defined
519        in Table 7-1 of VVC. For a reference of all currently defined NAL
520        unit types and their semantics, please refer to Section 7.4.2.2 in
521        [VVC].

523     TID: 3 bits

525        nuh_temporal_id_plus1. This field specifies the temporal
526        identifier of the NAL unit plus 1. The value of TemporalId is
527        equal to TID minus 1. A TID value of 0 is illegal to ensure that
528        there is at least one bit in the NAL unit header equal to 1, so as
529        to enable independent considerations of start code emulations in
530        the NAL unit header and in the NAL unit payload data.

532  1.2.  Overview of the Payload Format

534     This payload format defines the following processes required for
535     transport of [VVC] coded data over RTP [RFC3550]:

537     o  Usage of RTP header with this payload format

539     o  Packetization of [VVC] coded NAL units into RTP packets using
540        three types of payload structures: a single NAL unit packet,
541        aggregation packet, and fragmentation unit

543     o  Transmission of [VVC] NAL units of the same bitstream within a
544        single RTP stream.
546     o  Media type parameters to be used with the Session Description
547        Protocol (SDP) [RFC4566]

549     o  Frame-marking mapping [FrameMarking]

551  2.  Conventions

553     The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
554     "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
555     "OPTIONAL" in this document are to be interpreted as described in BCP
556     14 [RFC2119] [RFC8174] when, and only when, they appear in all
557     capitals, as shown above.

559  3.  Definitions and Abbreviations

561  3.1.  Definitions

563     This document uses the terms and definitions of VVC. Section 3.1.1
564     lists relevant definitions from [VVC] for convenience. Section 3.1.2
565     provides definitions specific to this memo.

567  3.1.1.  Definitions from the VVC Specification

569     Editor notes:

571     Access unit (AU): A set of PUs that belong to different layers and
572     contain coded pictures associated with the same time for output from
573     the DPB.

575     Adaptation parameter set (APS): A syntax structure containing syntax
576     elements that apply to zero or more slices as determined by zero or
577     more syntax elements found in slice headers.

579     Bitstream: A sequence of bits, in the form of a NAL unit stream or a
580     byte stream, that forms the representation of a sequence of AUs
581     forming one or more coded video sequences (CVSs).

583     Coded picture: A coded representation of a picture comprising VCL NAL
584     units with a particular value of nuh_layer_id within an AU and
585     containing all CTUs of the picture.

587     Clean random access (CRA) PU: A PU in which the coded picture is a
588     CRA picture.

590     Clean random access (CRA) picture: An IRAP picture for which each VCL
591     NAL unit has nal_unit_type equal to CRA_NUT.

593     Coded video sequence (CVS): A sequence of AUs that consists, in
594     decoding order, of a CVSS AU, followed by zero or more AUs that are
595     not CVSS AUs, including all subsequent AUs up to but not including
596     any subsequent AU that is a CVSS AU.
598 Coded video sequence start (CVSS) AU: An AU in which there is a PU 599 for each layer in the CVS and the coded picture in each PU is a CLVSS 600 picture. 602 Coded layer video sequence (CLVS): A sequence of PUs with the same 603 value of nuh_layer_id that consists, in decoding order, of a CLVSS 604 PU, followed by zero or more PUs that are not CLVSS PUs, including 605 all subsequent PUs up to but not including any subsequent PU that is 606 a CLVSS PU. 608 Coded layer video sequence start (CLVSS) PU: A PU in which the coded 609 picture is a CLVSS picture. 611 Coded layer video sequence start (CLVSS) picture: A coded picture 612 that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or 613 a GDR picture with NoOutputBeforeRecoveryFlag equal to 1. 615 Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs 616 of chroma samples of a picture that has three sample arrays, or a CTB 617 of samples of a monochrome picture or a picture that is coded using 618 three separate colour planes and syntax structures used to code the 619 samples. 621 Decoding Capability Information (DCI): A syntax structure containing 622 syntax elements that apply to the entire bitstream. 624 Decoded picture buffer (DPB): A buffer holding decoded pictures for 625 reference, output reordering, or output delay specified for the 626 hypothetical reference decoder. 628 Gradual decoding refresh (GDR) picture: A picture for which each VCL 629 NAL unit has nal_unit_type equal to GDR_NUT. 631 Instantaneous decoding refresh (IDR) PU: A PU in which the coded 632 picture is an IDR picture. 634 Instantaneous decoding refresh (IDR) picture: An IRAP picture for 635 which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or 636 IDR_N_LP. 638 Intra random access point (IRAP) AU: An AU in which there is a PU for 639 each layer in the CVS and the coded picture in each PU is an IRAP 640 picture. 
642 Intra random access point (IRAP) PU: A PU in which the coded picture 643 is an IRAP picture. 645 Intra random access point (IRAP) picture: A coded picture for which 646 all VCL NAL units have the same value of nal_unit_type in the range 647 of IDR_W_RADL to CRA_NUT, inclusive. 649 Layer: A set of VCL NAL units that all have a particular value of 650 nuh_layer_id and the associated non-VCL NAL units. 652 Network abstraction layer (NAL) unit: A syntax structure containing 653 an indication of the type of data to follow and bytes containing that 654 data in the form of an RBSP interspersed as necessary with emulation 655 prevention bytes. 657 Network abstraction layer (NAL) unit stream: A sequence of NAL units. 659 Operation point (OP): A temporal subset of an OLS, identified by an 660 OLS index and a highest value of TemporalId. 662 Picture parameter set (PPS): A syntax structure containing syntax 663 elements that apply to zero or more entire coded pictures as 664 determined by a syntax element found in each slice header. 666 Picture unit (PU): A set of NAL units that are associated with each 667 other according to a specified classification rule, are consecutive 668 in decoding order, and contain exactly one coded picture. 670 Random access: The act of starting the decoding process for a 671 bitstream at a point other than the beginning of the stream. 673 Sequence parameter set (SPS): A syntax structure containing syntax 674 elements that apply to zero or more entire CLVSs as determined by the 675 content of a syntax element found in the PPS referred to by a syntax 676 element found in each picture header. 678 Slice: An integer number of complete tiles or an integer number of 679 consecutive complete CTU rows within a tile of a picture that are 680 exclusively contained in a single NAL unit. 
682 sublayer: A temporal scalable layer of a temporal scalable bitstream 683 consisting of VCL NAL units with a particular value of the TemporalId 684 variable, and the associated non-VCL NAL units. 686 Subpicture: A rectangular region of one or more slices within a 687 picture. 689 sublayer representation: A subset of the bitstream consisting of NAL 690 units of a particular sublayer and the lower sublayers. 692 Tile: A rectangular region of CTUs within a particular tile column 693 and a particular tile row in a picture. 695 Tile column: A rectangular region of CTUs having a height equal to 696 the height of the picture and a width specified by syntax elements in 697 the picture parameter set. 699 Tile row: A rectangular region of CTUs having a height specified by 700 syntax elements in the picture parameter set and a width equal to the 701 width of the picture. 703 Video coding layer (VCL) NAL unit: A collective term for coded slice 704 NAL units and the subset of NAL units that have reserved values of 705 nal_unit_type that are classified as VCL NAL units in this 706 Specification. 708 3.1.2. Definitions Specific to This Memo 710 Media-Aware Network Element (MANE): A network element, such as a 711 middlebox, selective forwarding unit, or application-layer gateway 712 that is capable of parsing certain aspects of the RTP payload headers 713 or the RTP payload and reacting to their contents. 715 Editor Notes: the following informative note needs to be updated along 716 with the frame marking update 718 Informative note: The concept of a MANE goes beyond normal routers 719 or gateways in that a MANE has to be aware of the signaling (e.g., 720 to learn about the payload type mappings of the media streams), 721 and in that it has to be trusted when working with Secure RTP 722 (SRTP). The advantage of using MANEs is that they allow packets 723 to be dropped according to the needs of the media coding.
For 724 example, if a MANE has to drop packets due to congestion on a 725 certain link, it can identify and remove those packets whose 726 elimination produces the least adverse effect on the user 727 experience. After dropping packets, MANEs must rewrite RTCP 728 packets to match the changes to the RTP stream, as specified in 729 Section 7 of [RFC3550]. 731 NAL unit decoding order: A NAL unit order that conforms to the 732 constraints on NAL unit order given in Section 7.4.2.4 in [VVC] 733 and that follows the order of NAL units in the bitstream. 735 NAL unit output order: A NAL unit order in which NAL units of 736 different access units are in the output order of the decoded 737 pictures corresponding to the access units, as specified in [VVC], 738 and in which NAL units within an access unit are in their decoding 739 order. 741 RTP stream: See [RFC7656]. Within the scope of this memo, one RTP 742 stream is utilized to transport one or more temporal sublayers. 744 Transmission order: The order of packets in ascending RTP sequence 745 number order (in modulo arithmetic). Within an aggregation packet, 746 the NAL unit transmission order is the same as the order of 747 appearance of NAL units in the packet. 749 3.2.
Abbreviations 751 AU Access Unit 753 AP Aggregation Packet 755 CTU Coding Tree Unit 756 CVS Coded Video Sequence 758 DPB Decoded Picture Buffer 760 DCI Decoding capability information 762 DON Decoding Order Number 764 FIR Full Intra Request 766 FU Fragmentation Unit 768 HRD Hypothetical Reference Decoder 770 IDR Instantaneous Decoding Refresh 772 MANE Media-Aware Network Element 774 MTU Maximum Transfer Unit 776 NAL Network Abstraction Layer 778 NALU Network Abstraction Layer Unit 780 PLI Picture Loss Indication 782 PPS Picture Parameter Set 784 RPS Reference Picture Set 786 RPSI Reference Picture Selection Indication 788 SEI Supplemental Enhancement Information 790 SLI Slice Loss Indication 792 SPS Sequence Parameter Set 794 VCL Video Coding Layer 796 VPS Video Parameter Set 798 4. RTP Payload Format 799 4.1. RTP Header Usage 801 The format of the RTP header is specified in [RFC3550] (reprinted as 802 Figure 2 for convenience). This payload format uses the fields of 803 the header in a manner consistent with that specification. 805 The RTP payload (and the settings for some RTP header bits) for 806 aggregation packets and fragmentation units are specified in 807 Section 4.3.2 and Section 4.3.3, respectively. 809 0 1 2 3 810 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 811 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 812 |V=2|P|X| CC |M| PT | sequence number | 813 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 814 | timestamp | 815 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 816 | synchronization source (SSRC) identifier | 817 +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+ 818 | contributing source (CSRC) identifiers | 819 | .... 
| 820 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 822 RTP Header According to [RFC3550] 824 Figure 2 826 The RTP header information to be set according to this RTP payload 827 format is set as follows: 829 Marker bit (M): 1 bit 831 Set for the last packet of the access unit, carried in the current 832 RTP stream. This is in line with the normal use of the M bit in 833 video formats to allow efficient playout buffer handling. 835 Editor notes: The informative note below needs updating once 836 the NAL unit type table is stable in the [VVC] spec. 838 Informative note: The content of a NAL unit does not tell 839 whether or not the NAL unit is the last NAL unit, in decoding 840 order, of an access unit. An RTP sender implementation may 841 obtain this information from the video encoder. If, however, 842 the implementation cannot obtain this information directly from 843 the encoder, e.g., when the bitstream was pre-encoded, and also 844 there is no timestamp allocated for each NAL unit, then the 845 sender implementation can inspect subsequent NAL units in 846 decoding order to determine whether or not the NAL unit is the 847 last NAL unit of an access unit as follows. A NAL unit is 848 determined to be the last NAL unit of an access unit if it is 849 the last NAL unit of the bitstream. A NAL unit naluX is also 850 determined to be the last NAL unit of an access unit if both 851 the following conditions are true: 1) the next VCL NAL unit 852 naluY in decoding order has the high-order bit of the first 853 byte after its NAL unit header equal to 1 or nal_unit_type 854 equal to 19, and 2) all NAL units between naluX and naluY, when 855 present, have nal_unit_type in the range of 13 to 17, inclusive, 856 equal to 20, equal to 23 or equal to 26. 858 Payload Type (PT): 7 bits 860 The assignment of an RTP payload type for this new packet format 861 is outside the scope of this document and will not be specified 862 here.
The assignment of a payload type has to be performed either 863 through the profile used or in a dynamic way. 865 Sequence Number (SN): 16 bits 867 Set and used in accordance with [RFC3550]. 869 Timestamp: 32 bits 871 The RTP timestamp is set to the sampling timestamp of the content. 872 A 90 kHz clock rate MUST be used. If the NAL unit has no timing 873 properties of its own (e.g., parameter set and SEI NAL units), the 874 RTP timestamp MUST be set to the RTP timestamp of the coded 875 picture of the access unit in which the NAL unit (according to 876 Annex D of VVC) is included. Receivers MUST use the RTP timestamp 877 for the display process, even when the bitstream contains picture 878 timing SEI messages or decoding unit information SEI messages as 879 specified in VVC. 881 Synchronization source (SSRC): 32 bits 883 Used to identify the source of the RTP packets. A single SSRC is 884 used for all parts of a single bitstream. 886 4.2. Payload Header Usage 888 The first two bytes of the payload of an RTP packet are referred to 889 as the payload header. The payload header consists of the same 890 fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown 891 in Section 1.1.4, irrespective of the type of the payload structure. 893 The TID value indicates (among other things) the relative importance 894 of an RTP packet, for example, because NAL units belonging to higher 895 temporal sublayers are not used for the decoding of lower temporal 896 sublayers. A lower value of TID indicates a higher importance. 897 More-important NAL units MAY be better protected against transmission 898 losses than less-important NAL units. 900 For Discussion: quite possibly something similar can be said for 901 the Layer_id in layered coding, but perhaps not in multiview 902 coding. (The relevant part of the spec is relatively new, 903 therefore the soft language). However, for serious layer pruning, 904 interpretation of the VPS is required. 
We can add language about 905 the need for stateful interpretation of LayerID vis-a-vis 906 stateless interpretation of TID later. 908 4.3. Payload Structures 910 Three different types of RTP packet payload structures are specified. 911 A receiver can identify the type of an RTP packet payload through the 912 Type field in the payload header. 914 The three different payload structures are as follows: 916 o Single NAL unit packet: Contains a single NAL unit in the payload, 917 and the NAL unit header of the NAL unit also serves as the payload 918 header. This payload structure is specified in Section 4.3.1. 920 o Aggregation Packet (AP): Contains more than one NAL unit within 921 one access unit. This payload structure is specified in 922 Section 4.3.2. 924 o Fragmentation Unit (FU): Contains a subset of a single NAL unit. 925 This payload structure is specified in Section 4.3.3. 927 4.3.1. Single NAL Unit Packets 929 Editor notes: it is better to add a section to describe DONL and 930 sprop-max-don-diff. sprop-max-don-diff is used but not specified, 931 as the parameters in Section 7 are not yet specified. A value of 932 sprop-max-don-diff greater than 0 indicates that the transmission 933 order may not correspond to the decoding order and that the DON 934 is included in the payload header. 936 A single NAL unit packet contains exactly one NAL unit, and consists 937 of a payload header (denoted as PayloadHdr), a conditional 16-bit 938 DONL field (in network byte order), and the NAL unit payload data 939 (the NAL unit excluding its NAL unit header) of the contained NAL 940 unit, as shown in Figure 3.
942 0 1 2 3 943 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 944 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 945 | PayloadHdr | DONL (conditional) | 946 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 947 | | 948 | NAL unit payload data | 949 | | 950 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 951 | :...OPTIONAL RTP padding | 952 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 954 The Structure of a Single NAL Unit Packet 956 Figure 3 958 The DONL field, when present, specifies the value of the 16 least 959 significant bits of the decoding order number of the contained NAL 960 unit. If sprop-max-don-diff is greater than 0, the DONL field MUST 961 be present, and the variable DON for the contained NAL unit is 962 derived as equal to the value of the DONL field. Otherwise (sprop- 963 max-don-diff is equal to 0), the DONL field MUST NOT be present. 965 4.3.2. Aggregation Packets (APs) 967 Aggregation Packets (APs) can reduce packetization overhead for 968 small NAL units, such as most of the non-VCL NAL units, which are 969 often only a few octets in size. 971 An AP aggregates NAL units of one access unit. Each NAL unit to be 972 carried in an AP is encapsulated in an aggregation unit. NAL units 973 aggregated in one AP are included in NAL unit decoding order. 975 An AP consists of a payload header (denoted as PayloadHdr) followed 976 by two or more aggregation units, as shown in Figure 4.
978 0 1 2 3 979 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 980 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 981 | PayloadHdr (Type=28) | | 982 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 983 | | 984 | two or more aggregation units | 985 | | 986 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 987 | :...OPTIONAL RTP padding | 988 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 990 The Structure of an Aggregation Packet 992 Figure 4 994 The fields in the payload header of an AP are set as follows. The F 995 bit MUST be equal to 0 if the F bit of each aggregated NAL unit is 996 equal to zero; otherwise, it MUST be equal to 1. The Type field MUST 997 be equal to 28. 999 The value of LayerId MUST be equal to the lowest value of LayerId of 1000 all the aggregated NAL units. The value of TID MUST be the lowest 1001 value of TID of all the aggregated NAL units. 1003 Informative note: All VCL NAL units in an AP have the same TID 1004 value since they belong to the same access unit. However, an AP 1005 may contain non-VCL NAL units for which the TID value in the NAL 1006 unit header may be different than the TID value of the VCL NAL 1007 units in the same AP. 1009 An AP MUST carry at least two aggregation units and can carry as many 1010 aggregation units as necessary; however, the total amount of data in 1011 an AP MUST fit into an IP packet, and the size SHOULD be 1012 chosen so that the resulting IP packet is smaller than the MTU size 1013 so as to avoid IP layer fragmentation. An AP MUST NOT contain FUs 1014 specified in Section 4.3.3. APs MUST NOT be nested; i.e., an AP can 1015 not contain another AP.
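The AP payload header rules above can be sketched in Python as follows. This is a minimal illustration, not a normative procedure. It assumes that the two header bytes carry the fields in the order they are listed in this memo (F, Z, LayerId in the first byte; Type, TID in the second), that no DONL field is present (sprop-max-don-diff equal to 0), and that each aggregation unit is a 16-bit size followed by the NAL unit. The names NalHeader, parse_header, pack_header, and build_ap are hypothetical helpers introduced only for this example.

```python
import struct
from typing import List, NamedTuple

AP_TYPE = 28  # Type value of an aggregation packet
FU_TYPE = 29  # Type value of a fragmentation unit


class NalHeader(NamedTuple):
    f: int
    z: int
    layer_id: int
    nal_type: int
    tid: int


def parse_header(data: bytes) -> NalHeader:
    """Split the first two bytes into F, Z, LayerId, Type, and TID."""
    if len(data) < 2:
        raise ValueError("header needs two bytes")
    b0, b1 = data[0], data[1]
    return NalHeader(b0 >> 7, (b0 >> 6) & 1, b0 & 0x3F, b1 >> 3, b1 & 0x07)


def pack_header(h: NalHeader) -> bytes:
    """Inverse of parse_header."""
    return bytes([(h.f << 7) | (h.z << 6) | h.layer_id,
                  (h.nal_type << 3) | h.tid])


def build_ap(nal_units: List[bytes]) -> bytes:
    """Build an AP payload (PayloadHdr plus aggregation units, no DONL)."""
    if len(nal_units) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    headers = [parse_header(n) for n in nal_units]
    if any(h.nal_type in (AP_TYPE, FU_TYPE) for h in headers):
        raise ValueError("an AP must not contain FUs or other APs")
    payload_hdr = NalHeader(
        f=int(any(h.f for h in headers)),          # 0 only if every F bit is 0
        z=0,
        layer_id=min(h.layer_id for h in headers),  # lowest LayerId
        nal_type=AP_TYPE,
        tid=min(h.tid for h in headers),            # lowest TID
    )
    out = bytearray(pack_header(payload_hdr))
    for nal in nal_units:
        if len(nal) > 0xFFFF:
            raise ValueError("NAL unit too large for a 16-bit size field")
        out += struct.pack("!H", len(nal)) + nal    # size in network byte order
    return bytes(out)
```

The sketch does not check the result against the MTU; a sender would typically do so before choosing between an AP and separate packets.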
1017 The first aggregation unit in an AP consists of a conditional 16-bit 1018 DONL field (in network byte order) followed by a 16-bit unsigned size 1019 information (in network byte order) that indicates the size of the 1020 NAL unit in bytes (excluding these two octets, but including the NAL 1021 unit header), followed by the NAL unit itself, including its NAL unit 1022 header, as shown in Figure 5. 1024 0 1 2 3 1025 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1026 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1027 | : DONL (conditional) | NALU size | 1028 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1029 | NALU size | | 1030 +-+-+-+-+-+-+-+-+ NAL unit | 1031 | | 1032 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1033 | : 1034 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1036 The Structure of the First Aggregation Unit in an AP 1038 Figure 5 1040 The DONL field, when present, specifies the value of the 16 least 1041 significant bits of the decoding order number of the aggregated NAL 1042 unit. 1044 If sprop-max-don-diff is greater than 0, the DONL field MUST be 1045 present in an aggregation unit that is the first aggregation unit in 1046 an AP, and the variable DON for the aggregated NAL unit is derived as 1047 equal to the value of the DONL field. Otherwise (sprop-max-don-diff 1048 is equal to 0), the DONL field MUST NOT be present in an aggregation 1049 unit that is the first aggregation unit in an AP. 1051 An aggregation unit that is not the first aggregation unit in an AP 1052 will be followed immediately by a 16-bit unsigned size information 1053 (in network byte order) that indicates the size of the NAL unit in 1054 bytes (excluding these two octets, but including the NAL unit 1055 header), followed by the NAL unit itself, including its NAL unit 1056 header, as shown in Figure 6. 
1058 0 1 2 3 1059 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1060 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1061 | : NALU size | NAL unit | 1062 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 1063 | | 1064 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1065 | : 1066 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1068 The Structure of an Aggregation Unit That Is Not the First 1069 Aggregation Unit in an AP 1071 Figure 6 1073 Figure 7 presents an example of an AP that contains two aggregation 1074 units, labeled as 1 and 2 in the figure, without the DONL field being 1075 present. 1077 0 1 2 3 1078 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1079 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1080 | RTP Header | 1081 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1082 | PayloadHdr (Type=28) | NALU 1 Size | 1083 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1084 | NALU 1 HDR | | 1085 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 1 Data | 1086 | . . . | 1087 | | 1088 + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1089 | . . . | NALU 2 Size | NALU 2 HDR | 1090 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1091 | NALU 2 HDR | | 1092 +-+-+-+-+-+-+-+-+ NALU 2 Data | 1093 | . . . | 1094 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1095 | :...OPTIONAL RTP padding | 1096 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1098 An Example of an AP Packet Containing 1099 Two Aggregation Units without the DONL Field 1101 Figure 7 1103 Figure 8 presents an example of an AP that contains two aggregation 1104 units, labeled as 1 and 2 in the figure, with the DONL field being 1105 present. 
1107 0 1 2 3 1108 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1109 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1110 | RTP Header | 1111 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1112 | PayloadHdr (Type=28) | NALU 1 DONL | 1113 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1114 | NALU 1 Size | NALU 1 HDR | 1115 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1116 | | 1117 | NALU 1 Data . . . | 1118 | | 1119 + . . . +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1120 | : NALU 2 Size | 1121 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1122 | NALU 2 HDR | | 1123 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ NALU 2 Data | 1124 | | 1125 | . . . +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1126 | :...OPTIONAL RTP padding | 1127 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1129 An Example of an AP Containing 1130 Two Aggregation Units with the DONL Field 1132 Figure 8 1134 4.3.3. Fragmentation Units 1136 Fragmentation Units (FUs) are introduced to enable fragmenting a 1137 single NAL unit into multiple RTP packets, possibly without 1138 cooperation or knowledge of the [VVC] encoder. A fragment of a NAL 1139 unit consists of an integer number of consecutive octets of that NAL 1140 unit. Fragments of the same NAL unit MUST be sent in consecutive 1141 order with ascending RTP sequence numbers (with no other RTP packets 1142 within the same RTP stream being sent between the first and last 1143 fragment). 1145 When a NAL unit is fragmented and conveyed within FUs, it is referred 1146 to as a fragmented NAL unit. APs MUST NOT be fragmented. FUs MUST 1147 NOT be nested; i.e., an FU can not contain a subset of another FU. 1149 The RTP timestamp of an RTP packet carrying an FU is set to the NALU- 1150 time of the fragmented NAL unit. 
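A sender-side fragmentation step consistent with this section might look like the following Python sketch. It is an illustration under stated assumptions: the FU layout (PayloadHdr with Type 29, a one-octet FU header with S, E, R, and FuType, and a DONL field only in the first fragment) is taken from the description of the FU structure in this section, the R bit is assumed to be sent as 0, and the helper name fragment_nal_unit and its parameters are hypothetical.

```python
import struct
from typing import List, Optional

FU_TYPE = 29  # Type value of a fragmentation unit


def fragment_nal_unit(nal: bytes, max_fu_payload: int,
                      donl: Optional[int] = None) -> List[bytes]:
    """Split one NAL unit into FU payloads of at most max_fu_payload bytes.

    donl is given only when sprop-max-don-diff is greater than 0; it is
    then written into the first fragment (S bit equal to 1) only.
    """
    if max_fu_payload < 1:
        raise ValueError("max_fu_payload must be positive")
    if len(nal) < 3:
        raise ValueError("NAL unit has no payload to fragment")
    b0, b1 = nal[0], nal[1]
    nal_type, tid = b1 >> 3, b1 & 0x07
    body = nal[2:]  # the NAL unit header itself is not carried in FU payloads
    if len(body) <= max_fu_payload:
        # A single fragment would need S and E both set, which is not allowed.
        raise ValueError("fits in one packet; use a single NAL unit packet")
    # PayloadHdr: first byte (F, Z, LayerId) is copied, Type becomes 29,
    # TID is copied from the fragmented NAL unit.
    payload_hdr = bytes([b0, (FU_TYPE << 3) | tid])
    chunks = [body[i:i + max_fu_payload]
              for i in range(0, len(body), max_fu_payload)]
    packets = []
    for idx, chunk in enumerate(chunks):
        s = 1 if idx == 0 else 0
        e = 1 if idx == len(chunks) - 1 else 0
        fu_header = (s << 7) | (e << 6) | nal_type  # R bit left as 0
        pkt = bytearray(payload_hdr)
        pkt.append(fu_header)
        if s and donl is not None:
            pkt += struct.pack("!H", donl & 0xFFFF)  # 16 LSBs of the DON
        pkt += chunk
        packets.append(bytes(pkt))
    return packets
```

The resulting payloads are meant to be sent in consecutive RTP packets with ascending sequence numbers and the same RTP timestamp, as required above.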
1152 An FU consists of a payload header (denoted as PayloadHdr), an FU 1153 header of one octet, a conditional 16-bit DONL field (in network byte 1154 order), and an FU payload, as shown in Figure 9. 1156 0 1 2 3 1157 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1158 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1159 | PayloadHdr (Type=29) | FU header | DONL (cond) | 1160 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-| 1161 | DONL (cond) | | 1162 |-+-+-+-+-+-+-+-+ | 1163 | FU payload | 1164 | | 1165 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1166 | :...OPTIONAL RTP padding | 1167 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1169 The Structure of an FU 1171 Figure 9 1173 The fields in the payload header are set as follows. The Type field 1174 MUST be equal to 29. The fields F, LayerId, and TID MUST be equal to 1175 the fields F, LayerId, and TID, respectively, of the fragmented NAL 1176 unit. 1178 The FU header consists of an S bit, an E bit, an R bit and a 5-bit 1179 FuType field, as shown in Figure 10. 1181 +---------------+ 1182 |0|1|2|3|4|5|6|7| 1183 +-+-+-+-+-+-+-+-+ 1184 |S|E|R| FuType | 1185 +---------------+ 1187 The Structure of FU Header 1189 Figure 10 1191 The semantics of the FU header fields are as follows: 1193 S: 1 bit 1194 When set to 1, the S bit indicates the start of a fragmented NAL 1195 unit, i.e., the first byte of the FU payload is also the first 1196 byte of the payload of the fragmented NAL unit. When the FU 1197 payload is not the start of the fragmented NAL unit payload, the S 1198 bit MUST be set to 0. 1200 E: 1 bit 1202 When set to 1, the E bit indicates the end of a fragmented NAL 1203 unit, i.e., the last byte of the payload is also the last byte of 1204 the fragmented NAL unit. When the FU payload is not the last 1205 fragment of a fragmented NAL unit, the E bit MUST be set to 0. 
1207 Reserved: 1 bit 1209 Placeholder 1211 FuType: 5 bits 1213 The field FuType MUST be equal to the field Type of the fragmented 1214 NAL unit. 1216 The DONL field, when present, specifies the value of the 16 least 1217 significant bits of the decoding order number of the fragmented NAL 1218 unit. 1220 If sprop-max-don-diff is greater than 0, and the S bit is equal to 1, 1221 the DONL field MUST be present in the FU, and the variable DON for 1222 the fragmented NAL unit is derived as equal to the value of the DONL 1223 field. Otherwise (sprop-max-don-diff is equal to 0, or the S bit is 1224 equal to 0), the DONL field MUST NOT be present in the FU. 1226 A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e., 1227 the Start bit and End bit must not both be set to 1 in the same FU 1228 header. 1230 The FU payload consists of fragments of the payload of the fragmented 1231 NAL unit so that if the FU payloads of consecutive FUs, starting with 1232 an FU with the S bit equal to 1 and ending with an FU with the E bit 1233 equal to 1, are sequentially concatenated, the payload of the 1234 fragmented NAL unit can be reconstructed. The NAL unit header of the 1235 fragmented NAL unit is not included as such in the FU payload, but 1236 rather the information of the NAL unit header of the fragmented NAL 1237 unit is conveyed in F, LayerId, and TID fields of the FU payload 1238 headers of the FUs and the FuType field of the FU header of the FUs. 1239 An FU payload MUST NOT be empty. 1241 If an FU is lost, the receiver SHOULD discard all following 1242 fragmentation units in transmission order corresponding to the same 1243 fragmented NAL unit, unless the decoder in the receiver is known to 1244 be prepared to gracefully handle incomplete NAL units. 1246 A receiver in an endpoint or in a MANE MAY aggregate the first n-1 1247 fragments of a NAL unit to an (incomplete) NAL unit, even if fragment 1248 n of that NAL unit is not received. 
In this case, the 1249 forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a 1250 syntax violation. 1252 4.4. Decoding Order Number 1254 For each NAL unit, the variable AbsDon is derived, representing the 1255 decoding order number that is indicative of the NAL unit decoding 1256 order. 1258 Let NAL unit n be the n-th NAL unit in transmission order within an 1259 RTP stream. 1261 If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon 1262 for NAL unit n, is derived as equal to n. 1264 Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is 1265 derived as follows, where DON[n] is the value of the variable DON for 1266 NAL unit n: 1268 o If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in 1269 transmission order), AbsDon[0] is set equal to DON[0]. 1271 o Otherwise (n is greater than 0), the following applies for 1272 derivation of AbsDon[n]: 1274 If DON[n] == DON[n-1], 1275 AbsDon[n] = AbsDon[n-1] 1277 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768), 1278 AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1] 1280 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768), 1281 AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n] 1283 If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768), 1284 AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - 1285 DON[n]) 1287 If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768), 1288 AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n]) 1290 For any two NAL units m and n, the following applies: 1292 o AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows 1293 NAL unit m in NAL unit decoding order. 1295 o When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order 1296 of the two NAL units can be in either order. 1298 o AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes 1299 NAL unit m in decoding order. 
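The AbsDon derivation above, for sprop-max-don-diff greater than 0, can be written out directly. The following Python sketch mirrors the five cases one by one; the function name derive_abs_don is hypothetical, and the input is assumed to be the list of 16-bit DON values in transmission order.

```python
from typing import List


def derive_abs_don(dons: List[int]) -> List[int]:
    """Return AbsDon[n] for each DON[n], following the five cases above."""
    abs_don: List[int] = []
    for n, don in enumerate(dons):
        if n == 0:
            abs_don.append(don)  # AbsDon[0] = DON[0]
            continue
        prev_don, prev_abs = dons[n - 1], abs_don[n - 1]
        if don == prev_don:
            cur = prev_abs
        elif don > prev_don and don - prev_don < 32768:
            cur = prev_abs + don - prev_don             # forward, no wrap
        elif don < prev_don and prev_don - don >= 32768:
            cur = prev_abs + 65536 - prev_don + don     # forward, DON wrapped
        elif don > prev_don and don - prev_don >= 32768:
            cur = prev_abs - (prev_don + 65536 - don)   # backward across wrap
        else:
            cur = prev_abs - (prev_don - don)           # backward, no wrap
        abs_don.append(cur)
    return abs_don
```

For example, the DON sequence 65534, 65535, 0, 2, 1 yields AbsDon values 65534, 65535, 65536, 65538, 65537: the wrap from 65535 to 0 is treated as a step forward, and the final value steps back by one.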
1301 Informative note: When two consecutive NAL units in the NAL unit 1302 decoding order have different values of AbsDon, the absolute 1303 difference between the two AbsDon values may be greater than or 1304 equal to 1. 1306 Informative note: There are multiple reasons to allow for the 1307 absolute difference of the values of AbsDon for two consecutive 1308 NAL units in the NAL unit decoding order to be greater than one. 1309 An increment by one is not required, as at the time of associating 1310 values of AbsDon to NAL units, it may not be known whether all NAL 1311 units are to be delivered to the receiver. For example, a gateway 1312 might not forward VCL NAL units of higher sublayers or some SEI 1313 NAL units when there is congestion in the network. 1314 In another example, the first intra-coded picture of a pre-encoded 1315 clip is transmitted in advance to ensure that it is readily 1316 available in the receiver, and when transmitting the first intra- 1317 coded picture, the originator does not exactly know how many NAL 1318 units will be encoded before the first intra-coded picture of the 1319 pre-encoded clip follows in decoding order. Thus, the values of 1320 AbsDon for the NAL units of the first intra-coded picture of the 1321 pre-encoded clip have to be estimated when they are transmitted, 1322 and gaps in values of AbsDon may occur. 1324 5. Packetization Rules 1326 The following packetization rules apply: 1328 o If sprop-max-don-diff is greater than 0, the transmission order of 1329 NAL units carried in the RTP stream MAY be different than the NAL 1330 unit decoding order and the NAL unit output order. 1332 o A NAL unit of a small size SHOULD be encapsulated in an 1333 aggregation packet together with one or more other NAL units in order 1334 to avoid the unnecessary packetization overhead for small NAL 1335 units.
For example, non-VCL NAL units such as access unit 1336 delimiters, parameter sets, or SEI NAL units are typically small 1337 and can often be aggregated with VCL NAL units without violating 1338 MTU size constraints. 1340 o Each non-VCL NAL unit SHOULD, when possible from an MTU size match 1341 viewpoint, be encapsulated in an aggregation packet together with 1342 its associated VCL NAL unit, as typically a non-VCL NAL unit would 1343 be meaningless without the associated VCL NAL unit being 1344 available. 1346 o For carrying exactly one NAL unit in an RTP packet, a single NAL 1347 unit packet MUST be used. 1349 6. De-packetization Process 1351 The general concept behind de-packetization is to get the NAL units 1352 out of the RTP packets in an RTP stream and pass them to the decoder 1353 in the NAL unit decoding order. 1355 The de-packetization process is implementation dependent. Therefore, 1356 the following description should be seen as an example of a suitable 1357 implementation. Other schemes may be used as well, as long as the 1358 output for the same input is the same as the process described below. 1359 The output is the same when the set of output NAL units and their 1360 order are both identical. Optimizations relative to the described 1361 algorithms are possible. 1363 All normal RTP mechanisms related to buffer management apply. In 1364 particular, duplicated or outdated RTP packets (as indicated by the 1365 RTP sequence number and the RTP timestamp) are removed. To 1366 determine the exact time for decoding, factors such as a possible 1367 intentional delay to allow for proper inter-stream synchronization 1368 MUST be factored in. 1370 NAL units with NAL unit type values in the range of 0 to 27, 1371 inclusive, may be passed to the decoder. NAL-unit-like structures 1372 with NAL unit type values in the range of 28 to 31, inclusive, MUST 1373 NOT be passed to the decoder.
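As an illustration of the general concept, the following Python sketch extracts NAL units from a sequence of RTP payloads. It is a simplified example under several assumptions: sprop-max-don-diff is equal to 0 (no DONL fields), payloads arrive in transmission order without loss, RTP padding has already been removed, and the header bit positions follow the field order used in this memo. The function name depacketize is hypothetical.

```python
import struct
from typing import Iterable, Iterator, List

AP_TYPE = 28  # aggregation packet
FU_TYPE = 29  # fragmentation unit


def depacketize(payloads: Iterable[bytes]) -> Iterator[bytes]:
    """Yield NAL units from RTP payloads given in transmission order."""
    fragments: List[bytes] = []  # pieces of the NAL unit being reassembled
    for p in payloads:
        nal_type = p[1] >> 3
        if nal_type <= 27:
            # Single NAL unit packet: the payload is the NAL unit itself.
            yield p
        elif nal_type == AP_TYPE:
            pos = 2  # skip the AP payload header
            while pos < len(p):
                (size,) = struct.unpack_from("!H", p, pos)
                pos += 2
                yield p[pos:pos + size]
                pos += size
        elif nal_type == FU_TYPE:
            fu_header = p[2]
            start, end = fu_header & 0x80, fu_header & 0x40
            if start:
                # Rebuild the NAL unit header: first byte from PayloadHdr,
                # Type from FuType, TID from PayloadHdr.
                rebuilt = bytes([p[0], ((fu_header & 0x1F) << 3) | (p[1] & 0x07)])
                fragments = [rebuilt]
            if fragments:  # fragments without a preceding start are skipped
                fragments.append(p[3:])
                if end:
                    yield b"".join(fragments)
                    fragments = []
        # Structures with Type 30 or 31 are never passed to the decoder.
```

A practical receiver would additionally detect sequence number gaps and discard or flag incomplete fragmented NAL units, which this sketch omits.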
   The receiver includes a receiver buffer, which is used to compensate
   for transmission delay jitter within individual RTP streams and
   across RTP streams, and to reorder NAL units from transmission order
   to the NAL unit decoding order. In this section, the receiver
   operation is described under the assumption that there is no
   transmission delay jitter within an RTP stream and across RTP
   streams. To distinguish it from a practical receiver buffer that is
   also used for compensation of transmission delay jitter, the
   receiver buffer is hereafter called the de-packetization buffer in
   this section. Receivers should also prepare for transmission delay
   jitter; that is, either reserve separate buffers for transmission
   delay jitter buffering and de-packetization buffering or use a
   receiver buffer for both transmission delay jitter and
   de-packetization. Moreover, receivers should take transmission delay
   jitter into account in the buffering operation, e.g., by additional
   initial buffering before starting decoding and playback.

   When sprop-max-don-diff is equal to 0, the de-packetization buffer
   size is zero bytes, and the process described in the remainder of
   this paragraph applies. The NAL units carried in the single RTP
   stream are directly passed to the decoder in their transmission
   order, which is identical to their decoding order. When there are
   several NAL units of the same RTP stream with the same NTP
   timestamp, the order to pass them to the decoder is their
   transmission order.

      Informative note: The mapping between RTP and NTP timestamps is
      conveyed in RTCP SR packets. In addition, the mechanisms for
      faster media timestamp synchronization discussed in [RFC6051] may
      be used to speed up the acquisition of the RTP-to-wall-clock
      mapping.
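
   For the case in which sprop-max-don-diff is greater than 0, the
   buffering rules given in this section (conditions A and B, removal
   of the NAL unit with the smallest AbsDon, and the final flush) can
   be sketched as follows. This is one possible illustration and not a
   normative algorithm. The class and method names are hypothetical,
   the caller is assumed to have computed AbsDon for each NAL unit
   already, and ties in AbsDon are resolved by reception order, which
   is an assumption of this sketch.

```python
class DepacketizationBuffer:
    """Illustrative de-packetization buffer (hypothetical names)."""

    def __init__(self, sprop_max_don_diff: int, sprop_depack_buf_nalus: int):
        self.max_don_diff = sprop_max_don_diff
        self.max_nalus = sprop_depack_buf_nalus
        self.entries = []   # (abs_don, arrival_index, nal_unit)
        self.arrivals = 0   # reception counter, used only to break ties

    def _condition_a(self) -> bool:
        # Greatest minus smallest AbsDon >= sprop-max-don-diff.
        if not self.entries:
            return False
        dons = [entry[0] for entry in self.entries]
        return max(dons) - min(dons) >= self.max_don_diff

    def _condition_b(self) -> bool:
        # Number of buffered NAL units > sprop-depack-buf-nalus.
        return len(self.entries) > self.max_nalus

    def push(self, abs_don: int, nal_unit) -> list:
        """Store one NAL unit; return the NAL units released to the decoder."""
        self.entries.append((abs_don, self.arrivals, nal_unit))
        self.arrivals += 1
        released = []
        # Nothing is released until A or B first holds (initial buffering).
        # Afterwards, release the smallest AbsDon until both are false.
        while self._condition_a() or self._condition_b():
            smallest = min(self.entries)
            self.entries.remove(smallest)
            released.append(smallest[2])
        return released

    def flush(self) -> list:
        """No more input: release everything in increasing AbsDon order."""
        remaining = [entry[2] for entry in sorted(self.entries)]
        self.entries.clear()
        return remaining
```

   The linear scans keep the sketch short. An implementation concerned
   with efficiency could keep the entries in a priority queue instead,
   since only the entry with the smallest AbsDon is ever removed.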
   When sprop-max-don-diff is greater than 0, the process described in
   the remainder of this section applies.

   There are two buffering states in the receiver: initial buffering and
   buffering while playing. Initial buffering starts when the reception
   is initialized. After initial buffering, decoding and playback are
   started, and the buffering-while-playing mode is used.

   Regardless of the buffering state, the receiver stores incoming NAL
   units, in reception order, into the de-packetization buffer. NAL
   units carried in RTP packets are stored in the de-packetization
   buffer individually, and the value of AbsDon is calculated and stored
   for each NAL unit.

   Initial buffering lasts until condition A (the difference between the
   greatest and smallest AbsDon values of the NAL units in the
   de-packetization buffer is greater than or equal to the value of
   sprop-max-don-diff) or condition B (the number of NAL units in the
   de-packetization buffer is greater than the value of
   sprop-depack-buf-nalus) is true.

   After initial buffering, whenever condition A or condition B is true,
   the following operation is repeatedly applied until both condition A
   and condition B become false:

   o  The NAL unit in the de-packetization buffer with the smallest
      value of AbsDon is removed from the de-packetization buffer and
      passed to the decoder.

   When no more NAL units are flowing into the de-packetization buffer,
   all NAL units remaining in the de-packetization buffer are removed
   from the buffer and passed to the decoder in the order of increasing
   AbsDon values.

7. Payload Format Parameters

   This section specifies the optional parameters. A mapping of the
   parameters with Session Description Protocol (SDP) [RFC4566] is also
   provided for applications that use SDP.

7.1.
Media Type Registration

   The receiver MUST ignore any parameter unspecified in this memo.

   Type name: Video

   Subtype name: H266

   Required parameters: none

   Optional parameters:

      Editor's notes: To be added

7.2. SDP Parameters

   The receiver MUST ignore any parameter unspecified in this memo.

7.2.1. Mapping of Payload Type Parameters to SDP

   The media type video/H266 string is mapped to fields in the Session
   Description Protocol (SDP) [RFC4566] as follows:

   o  The media name in the "m=" line of SDP MUST be video.

   o  The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the
      media subtype).

   o  The clock rate in the "a=rtpmap" line MUST be 90000.

   o  OPTIONAL PARAMETERS:

      Editor's notes: To be discussed here

7.2.1.1. SDP Example

   An example of media representation in SDP is as follows:

      m=video 49170 RTP/AVP 98
      a=rtpmap:98 H266/90000
      a=fmtp:98 profile-id=1; sprop-vps=