idnits 2.17.1 draft-ietf-avtcore-rtp-evc-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The abstract seems to contain references ([EVC]), which it shouldn't. Please replace those with straight textual mentions of the documents in question. Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year == The document seems to lack the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords -- however, there's a paragraph with a matching beginning. Boilerplate error? (The document does seem to have the reference to RFC 2119 which the ID-Checklist requires). -- The document date (4 February 2021) is 1170 days in the past. Is this intentional? Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Looks like a reference, but probably isn't: '0' on line 1113 -- Possible downref: Non-RFC (?) normative reference: ref. 'EVC' -- Possible downref: Non-RFC (?) normative reference: ref. 
'ISO23094-1' ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866) ** Downref: Normative reference to an Informational RFC: RFC 7656 == Outdated reference: A later version (-18) exists of draft-ietf-avtcore-rtp-vvc-07 Summary: 3 errors (**), 0 flaws (~~), 3 warnings (==), 4 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 avtcore S. Zhao 3 Internet-Draft S. Wenger 4 Intended status: Standards Track Tencent 5 Expires: 8 August 2021 Y. Lim 6 Samsung Electronics 7 4 February 2021 9 RTP Payload Format for Essential Video Coding (EVC) 10 draft-ietf-avtcore-rtp-evc-01 12 Abstract 14 This memo describes an RTP payload format for the video coding 15 standard ISO/IEC International Standard 23094-1 [EVC], also known as 16 Essential Video Coding [EVC] and developed by ISO/IEC JTC1/SC29/WG11 17 (MPEG). The RTP payload format allows for packetization of one or 18 more Network Abstraction Layer (NAL) units in each RTP packet payload 19 as well as fragmentation of a NAL unit into multiple RTP packets. 20 The payload format has wide applicability in videoconferencing, 21 Internet video streaming, and high-bitrate entertainment-quality 22 video, among other applications. 24 Status of This Memo 26 This Internet-Draft is submitted in full conformance with the 27 provisions of BCP 78 and BCP 79. 29 Internet-Drafts are working documents of the Internet Engineering 30 Task Force (IETF). Note that other groups may also distribute 31 working documents as Internet-Drafts. The list of current Internet- 32 Drafts is at https://datatracker.ietf.org/drafts/current/. 34 Internet-Drafts are draft documents valid for a maximum of six months 35 and may be updated, replaced, or obsoleted by other documents at any 36 time. It is inappropriate to use Internet-Drafts as reference 37 material or to cite them other than as "work in progress." 
39 This Internet-Draft will expire on 8 August 2021. 41 Copyright Notice 43 Copyright (c) 2021 IETF Trust and the persons identified as the 44 document authors. All rights reserved. 46 This document is subject to BCP 78 and the IETF Trust's Legal 47 Provisions Relating to IETF Documents (https://trustee.ietf.org/ 48 license-info) in effect on the date of publication of this document. 49 Please review these documents carefully, as they describe your rights 50 and restrictions with respect to this document. Code Components 51 extracted from this document must include Simplified BSD License text 52 as described in Section 4.e of the Trust Legal Provisions and are 53 provided without warranty as described in the Simplified BSD License. 55 Table of Contents 57 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 58 1.1. Overview of the EVC Codec . . . . . . . . . . . . . . . . 3 59 1.1.1. Coding-Tool Features (informative) . . . . . . . . . 4 60 1.1.2. Systems and Transport Interfaces . . . . . . . . . . 6 61 1.1.3. Parallel Processing Support (informative) . . . . . . 8 62 1.1.4. NAL Unit Header . . . . . . . . . . . . . . . . . . . 8 63 1.2. Overview of the Payload Format . . . . . . . . . . . . . 9 64 2. Conventions . . . . . . . . . . . . . . . . . . . . . . . . . 10 65 3. Definitions and Abbreviations . . . . . . . . . . . . . . . . 10 66 3.1. Definitions . . . . . . . . . . . . . . . . . . . . . . . 10 67 3.1.1. Definitions from the EVC Specification . . . . . . . 10 68 3.1.2. Definitions Specific to This Memo . . . . . . . . . . 12 69 3.2. Abbreviations . . . . . . . . . . . . . . . . . . . . . . 13 70 4. RTP Payload Format . . . . . . . . . . . . . . . . . . . . . 14 71 4.1. RTP Header Usage . . . . . . . . . . . . . . . . . . . . 15 72 4.2. Payload Header Usage . . . . . . . . . . . . . . . . . . 16 73 4.3. Payload Structures . . . . . . . . . . . . . . . . . . . 17 74 4.3.1. Single NAL Unit Packets . . . . . . . . . . . . . . . 17 75 4.3.2. 
Aggregation Packets (APs) . . . . . . . . . . . . . . 18 76 4.3.3. Fragmentation Units . . . . . . . . . . . . . . . . . 22 77 4.4. Decoding Order Number . . . . . . . . . . . . . . . . . . 25 78 5. Packetization Rules . . . . . . . . . . . . . . . . . . . . . 26 79 6. De-packetization Process . . . . . . . . . . . . . . . . . . 27 80 7. Payload Format Parameters . . . . . . . . . . . . . . . . . . 29 81 7.1. Media Type Registration . . . . . . . . . . . . . . . . . 29 82 7.2. SDP Parameters . . . . . . . . . . . . . . . . . . . . . 29 83 7.2.1. Mapping of Payload Type Parameters to SDP . . . . . . 29 84 7.2.2. Usage with SDP Offer/Answer Model . . . . . . . . . . 30 85 7.2.3. SDP Example . . . . . . . . . . . . . . . . . . . . . 30 86 8. Use with Feedback Messages . . . . . . . . . . . . . . . . . 30 87 8.1. Picture Loss Indication (PLI) . . . . . . . . . . . . . . 30 88 8.2. Full Intra Request (FIR) . . . . . . . . . . . . . . . . 30 89 9. Security Considerations . . . . . . . . . . . . . . . . . . . 30 90 10. Congestion Control . . . . . . . . . . . . . . . . . . . . . 31 91 11. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 32 92 12. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 32 93 13. References . . . . . . . . . . . . . . . . . . . . . . . . . 32 94 13.1. Normative References . . . . . . . . . . . . . . . . . . 32 95 13.2. Informative References . . . . . . . . . . . . . . . . . 34 96 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 35 98 1. Introduction 100 The [EVC] specification, which is formally designated as ISO/IEC 101 International Standard 23094-1 [ISO23094-1], was published in 102 October 2020. One goal of MPEG is to keep [EVC]'s Baseline profile 103 essentially royalty free by using technologies published more 104 than 20 years ago or otherwise freely available for use, whereas more 105 advanced profiles follow a reasonable and non-discriminatory 106 licensing terms policy.
Both the Baseline profile and higher profiles of 107 [EVC] are reported to provide coding efficiency gains over [HEVC] and 108 [AVC] under certain configurations. 110 This memo describes an RTP payload format for [EVC]. It shares its 111 basic design with the NAL unit-based RTP payload formats of H.264 112 Video Coding [RFC6184], Scalable Video Coding (SVC) [RFC6190], High 113 Efficiency Video Coding (HEVC) [RFC7798], and Versatile Video Coding 114 (VVC) [I-D.ietf-avtcore-rtp-vvc]. With respect to design philosophy, 115 security, congestion control, and overall implementation complexity, 116 it has similar properties to those earlier payload format 117 specifications. This is a conscious choice, as at least RFC 6184 is 118 widely deployed and generally known in the relevant implementer 119 communities. Certain mechanisms known from [RFC6190] were 120 incorporated because EVC supports temporal scalability. [EVC] currently 121 does not offer higher forms of scalability. 123 1.1. Overview of the EVC Codec 125 [EVC], [AVC], [HEVC] and [VVC] share a similar hybrid video codec 126 design. In this memo, we provide a very brief overview of those 127 features of [EVC] that are, in some form, addressed by the payload 128 format specified herein. Implementers have to read, understand, and 129 apply the ISO/IEC specifications pertaining to [EVC] to arrive at 130 interoperable, well-performing implementations. The EVC standard has 131 a Baseline profile and, on top of that, a Main profile, the latter 132 including more advanced features. Syntax elements allow encoders 133 to indicate which of the many independent coding tools 134 are exercised in the bitstream, in a spirit similar to the 135 general_constraint_flags of [VVC].
137 Conceptually, [EVC], [AVC], [HEVC] and [VVC] all include a Video 138 Coding Layer (VCL), which is often used to refer to the coding-tool 139 features, and a Network Abstraction Layer (NAL), which is often used 140 to refer to the systems and transport interface aspects of the 141 codecs. 143 1.1.1. Coding-Tool Features (informative) 145 Coding blocks and transform structure 147 [EVC] uses a traditional quad-tree coding structure, which divides 148 the encoded image into blocks of up to 128x128 luma samples, which 149 can be recursively divided into smaller blocks. The Main profile 150 adds two advanced coding structure tools: Binary Ternary Tree (BTT), 151 which allows non-square coding units, and Split Unit Coding Order 152 (SUCO), which allows the processing order of split units to change 153 from the traditional left-to-right scanning order to a right-to-left 154 scanning order. In the Main profile, the picture can be divided 155 into slices and tiles, and these slices can be independently encoded 156 and/or decoded in parallel. 158 When a block is predicted using intra prediction or inter 159 prediction, residual data is usually added to the prediction 160 block. The 161 residual data is obtained by applying an inverse quantization process 162 and an inverse transform. [EVC] includes an integer discrete cosine 163 transform (DCT2) and scalar quantization. For the Main profile, 164 Improved Quantization and Transform (IQT) uses a different mapping/ 165 clipping function for quantization. An inverse zig-zag scanning 166 order is used for coefficient coding. Advanced Coefficient Coding 167 (ADCC) in the Main profile can code coefficient values more 168 efficiently, for example, by signaling the position of the last non-zero coefficient. 169 In the Main profile, Adaptive Transform Selection (ATS) is also 170 available, which allows integer versions of DST7 or DCT8 to be applied, and 171 not just DCT2.
173 Entropy coding 175 [EVC] uses a binary arithmetic coding mechanism similar to that of [AVC]. 176 The mechanism includes a binarization step and a probability update 177 defined by a lookup table. In the Main profile, the derivation 178 process of syntax elements based on adjacent blocks makes the context 179 modeling and initialization process more efficient. 181 In-loop filtering 183 The Baseline profile of [EVC] uses the deblocking filter defined in 184 H.263 Annex J. In the Main profile, 185 an Advanced Deblocking Filter (ADDB) 186 can be used instead, which can further reduce artifacts compared to the deblocking filter in the Baseline profile. The Main profile 187 also defines two additional in-loop filters that can be used to 188 improve the quality of decoded pictures before output and/or for 189 inter prediction. A Walsh-Hadamard Transform Domain Filter (HTDF) is 190 applied to the luma samples before deblocking, and a scanning 191 process is used to determine 4 adjacent samples for filtering. An 192 Adaptive Loop Filter (ALF) allows up to 25 193 different filters to be signaled for the luma component, and the best filter can be 194 selected through the classification process for each 4x4 block. The 195 filter parameters of the ALF are signaled in the Adaptation 196 Parameter Set (APS). 198 Inter-prediction 200 The basis of [EVC] inter prediction is motion compensation using 201 interpolation filters with a quarter sample resolution. In the Baseline 202 profile, a motion vector is signaled using one of three 203 spatially neighboring motion vectors and a temporally collocated 204 motion vector as a predictor. The motion vector difference may be 205 signaled relative to the selected predictor, but for the case where 206 no motion vector difference is signaled and there is no residual 207 data in the block, there is a specific mode called skip mode. The 208 Main profile includes six additional tools to provide improved inter 209 prediction.
With advanced Motion Interpolation and Signaling (AMIS), 210 adjacent blocks can be conceptually merged to indicate that they use 211 the same motion, but more advanced schemes can also be used to create 212 predictions from the basic model list of candidate predictors. The 213 Merge with Motion Vector Difference (MMVD) tool uses a process 214 similar to the concept of merging neighboring blocks, but also allows 215 the use of expressions that include a starting point, motion 216 amplitude, and direction of motion to send a motion vector signal. 218 Using Advanced Motion Vector Prediction (AMVP), candidate motion 219 vector predictions for the block can be derived from its neighboring 220 blocks in the same picture and collocated blocks in the reference 221 picture. The Adaptive Motion Vector Resolution (AMVR) tool provides 222 a way to reduce the accuracy of a motion vector from a quarter sample 223 to half sample, full sample, double sample, or quad sample, which 224 provides the efficiency advantage, such as when sending large motion 225 vector differences. The Main profile also includes the Decoder-side 226 Motion Vector Refinement (DMVR), which uses a bilateral template 227 matching process to refine the motion vectors in a bidirectional 228 fashion. 230 Intra prediction and intra-coding 232 Intra prediction in [EVC] is performed on adjacent samples of coding 233 units in a partitioned structure. For the Baseline profile, all 234 coding units are square, and there are five different prediction 235 modes: DC (mean value of the neighborhood), horizontal, vertical, and 236 two different diagonal directions. In the Main profile, intra 237 prediction can be applied to any rectangular coding unit, and there 238 are 28 additional direction modes available in the so-called Enhanced 239 Intra Prediction Directions (EIPD). 
In the Main profile, an encoder 240 can also use Intra Block Copy (IBC), where a previously decoded 241 sample block of the same picture is used as a predictor. A 242 displacement vector in integer sample precision is signaled to 243 indicate the location of the prediction block in the current picture 244 for this mode. 246 Decoded picture buffer management 248 In [EVC], decoded pictures can be stored in a decoded picture buffer 249 (DPB) for predicting pictures that follow them in decoding order. In 250 the Baseline profile, the management of the DPB (i.e., the process of 251 adding and deleting reference pictures) is controlled by the 252 information in the SPS. For the Main profile, if a Reference Picture 253 List (RPL) scheme is used, DPB management can be controlled by 254 information that is signaled at the picture level. 256 1.1.2. Systems and Transport Interfaces 258 [EVC] inherited the basic systems and transport interface designs 259 from [AVC] and [HEVC]. These include the NAL-unit-based syntax 260 structure, the hierarchical syntax and data unit structure, and the 261 Supplemental Enhancement Information (SEI) message mechanism. The 262 hierarchical syntax and data unit structure consists of a sequence- 263 level parameter set (SPS), two picture-level parameter sets (PPS and 264 APS, each of which can apply to one or more pictures), slice-level 265 header parameters, and lower-level parameters. 267 A number of key components that influenced the Network Abstraction 268 Layer design of [EVC] as well as this memo are described below. 270 Sequence parameter set 272 The Sequence Parameter Set (SPS) contains syntax elements pertaining 273 to a coded video sequence (CVS), which is a group of pictures, 274 starting with a random access point, and followed by pictures that 275 may depend on each other and the random access point picture.
In 276 MPEG-2, the equivalent of a CVS was a Group of Pictures (GOP), which 277 normally started with an I frame and was followed by P and B frames. 278 While more complex in its options of random access points, EVC 279 retains this basic concept. In many TV-like applications, a CVS 280 contains a few hundred milliseconds to a few seconds of video. In 281 video conferencing (without switching MCUs involved), a CVS can be as 282 long in duration as the whole session. 284 Picture and adaptation parameter set 285 The Picture Parameter Set and the Adaptation Parameter Set (PPS and 286 APS, respectively) carry information pertaining to a single picture. 287 The PPS contains information that is likely to stay constant from 288 picture to picture (at least for pictures of a certain type), whereas 289 the APS contains information, such as adaptive loop filter 290 coefficients, that is likely to change from picture to picture. 292 Profile, level and toolsets 294 Profiles and levels follow the same design considerations as known 295 from [AVC], [HEVC], and in fact video codecs as old as MPEG-1 Visual. 296 A profile defines a set of tools (not to be confused with the "toolset" 297 discussed below) that a decoder compliant with this profile has to 298 support. In [EVC], profiles are defined in Annex A. Formally, they 299 are defined as a set of constraints that a bitstream needs to conform 300 to. In [EVC], the Baseline profile is much more severely constrained 301 than the Main profile, reducing implementation complexity. Levels relate 302 to bitstream complexity in dimensions such as maximum sample decoding 303 rate, maximum picture size, and similar parameters that are directly 304 related to computational complexity. 306 Profiles and levels are signaled in the highest parameter set 307 available, the SPS. 309 [EVC] contains another mechanism related to the use of coding tools, 310 known as the toolset syntax element.
This syntax element, 311 carried as toolset_idc_h and toolset_idc_l in the SPS, is a bitmask that 312 allows encoders to indicate which coding tools they are using, within 313 the menu of tools offered by the profile that is also signaled. 314 No decoder conformance point is associated with the toolset, but a 315 bitstream that uses a coding tool that is indicated as not used 316 in the toolset syntax element would obviously be non-compliant. 317 While MPEG specifically rules out the use of the toolset syntax 318 element as a conformance point, walled garden implementations could 319 do so without incurring the interoperability problems MPEG fears, and 320 create bitstreams and decoders that do not support one or more given 321 tools. That, in turn, may be useful to mitigate certain 322 patent-related risks. 324 Bitstream and elementary stream 326 Above the Coded Video Sequence (CVS), [EVC] defines a video bitstream 327 that can be used in the MPEG systems context as an elementary stream. 328 For the purpose of this memo, this is not relevant. 330 Random access support 332 [EVC] supports a random access mechanism based solely on IDR access 333 units. 335 Temporal scalability support 337 [EVC] includes support for temporal scalability through the 338 generalized reference picture selection approach known since 339 [AVC]/SVC. Up to six temporal layers are supported. The temporal 340 layer is signaled in the NAL unit header (which co-serves as the 341 payload header in this memo), in the nuh_temporal_id field. 343 Reference picture management 345 placeholder 347 SEI Message 349 [EVC] inherits many of [HEVC]'s SEI Messages, occasionally with 350 changes in syntax and/or semantics making them applicable to EVC. 352 1.1.3. Parallel Processing Support (informative) 354 Placeholder 356 1.1.4. NAL Unit Header 358 [EVC] maintains the NAL unit concept of [HEVC] with different 359 parameter options. EVC also uses a two-byte NAL unit header, as 360 shown in Figure 1.
The payload of a NAL unit refers to the NAL unit 361 excluding the NAL unit header.
363 +---------------+---------------+
364 |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
365 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
366 |F|   Type    | TID | Reserve |E|
367 +-------------+-----------------+
369 The Structure of the EVC NAL Unit Header 371 Figure 1 373 The semantics of the fields in the NAL unit header are as specified 374 in [EVC] and described briefly below for convenience. In addition to 375 the name and size of each field, the corresponding syntax element 376 name in [EVC] is also provided. 378 F: 1 bit 379 forbidden_zero_bit. Required to be zero in [EVC]. Note that 380 this bit was included in the NAL unit header to 381 enable transport of EVC video over MPEG-2 transport systems 382 (avoidance of start code emulations) [MPEG2S]. In the context of 383 this memo, the value 1 may be used to indicate a syntax violation, 384 e.g., for a NAL unit resulting from aggregating a number of 385 fragmented units of a NAL unit but missing the last fragment, as 386 described in Section xxx. (section # placeholder) 388 Type: 6 bits 390 nal_unit_type_plus1. This field specifies the NAL unit type as 391 defined in Table 4 of [EVC]. If the value of this field is less 392 than or equal to 23, the NAL unit is a VCL NAL unit. Otherwise, 393 the NAL unit is a non-VCL NAL unit. For a reference of all 394 currently defined NAL unit types and their semantics, please refer 395 to Section 7.4.2.2 in [EVC]. 397 TID: 3 bits 399 nuh_temporal_id. This field specifies the temporal identifier of 400 the NAL unit. The value of TemporalId is equal to TID. 401 TemporalId shall be equal to 0 if the NAL unit is of the IDR NAL unit type (NAL 402 unit type 1). 404 Reserve: 5 bits 406 nuh_reserved_zero_5bits. This field shall be equal to the version 407 of the [EVC] specification. Values of nuh_reserved_zero_5bits 408 greater than 0 are reserved for future use by ISO/IEC.
Decoders 409 conforming to a profile specified in [EVC] Annex A shall ignore 410 (i.e., remove from the bitstream and discard) all NAL units with 411 values of nuh_reserved_zero_5bits greater than 0. 413 E: 1 bit 415 nuh_extension_flag. This field shall be equal to the version of the 416 [EVC] specification. The value of nuh_extension_flag equal to 1 is 417 reserved for future use by ISO/IEC. Decoders conforming to a 418 profile specified in Annex A shall ignore (i.e., remove from the 419 bitstream and discard) all NAL units with values of 420 nuh_extension_flag equal to 1. 422 1.2. Overview of the Payload Format 424 This payload format defines the following processes required for 425 transport of [EVC] coded data over RTP [RFC3550]: 427 * Usage of RTP header with this payload format 429 * Packetization of [EVC] coded NAL units into RTP packets using 430 three types of payload structures: single NAL unit packets, aggregation packets, 431 and fragmentation units 433 * Transmission of [EVC] NAL units of the same bitstream within a 434 single RTP stream 436 * Media type parameters to be used with the Session Description 437 Protocol (SDP) [RFC4566] 439 2. Conventions 441 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 442 "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and 443 "OPTIONAL" in this document are to be interpreted as described in BCP 444 14 [RFC2119] [RFC8174] when, and only when, they appear in all 445 capitals, as shown above. 447 3. Definitions and Abbreviations 449 3.1. Definitions 451 This document uses the terms and definitions of EVC. Section 3.1.1 452 lists relevant definitions from [EVC] for convenience. Section 3.1.2 453 provides definitions specific to this memo. 455 3.1.1. Definitions from the EVC Specification 457 Access Unit: A set of NAL units that are associated with each other 458 according to a specified classification rule, are consecutive in 459 decoding order, and contain exactly one coded picture.
461 Bitstream: A sequence of bits, in the form of a NAL unit stream or a 462 byte stream, that forms the representation of coded pictures and 463 associated data forming one or more coded video sequences (CVSs). 465 Coded Picture: A coded representation of a picture containing all 466 CTUs of the picture. 468 Coded Video Sequence (CVS): A sequence of access units that consists, 469 in decoding order, of an IDR access unit, followed by zero or more 470 access units that are not IDR access units, including all subsequent 471 access units up to but not including any subsequent access unit that 472 is an IDR access unit. 474 Coding Tree Block (CTB): An NxN block of samples for some value of N 475 such that the division of a component into CTBs is a partitioning. 477 Coding Tree Unit (CTU): A CTB of luma samples, two corresponding CTBs 478 of chroma samples of a picture that has three sample arrays, or a CTB 479 of samples of a monochrome picture or a picture that is coded using 480 three separate colour planes and syntax structures used to code the 481 samples. 483 Decoded Picture: A decoded picture is derived by decoding a coded 484 picture. 486 Decoded Picture Buffer (DPB): A buffer holding decoded pictures for 487 reference, output reordering, or output delay specified for the 488 hypothetical reference decoder in Annex C of [EVC] specification. 490 Dynamic Range Adjustment (DRA): A mapping process that is applied to 491 decoded picture prior to cropping and output as part of the decoding 492 process and is controlled by parameters conveyed in an Adaptation 493 Parameter Set (APS). 495 Hypothetical Reference Decoder (HRD): A hypothetical decoder model 496 that specifies constraints on the variability of conforming NAL unit 497 streams or conforming byte streams that an encoding process may 498 produce. 500 Instantaneous Decoding Refresh (IDR) access unit: An access unit in 501 which the coded picture is an IDR picture. 
503 Instantaneous Decoding Refresh (IDR) picture: A coded picture for 504 which each VCL NAL unit has NalUnitType equal to IDR_NUT. 506 Level: A defined set of constraints on the values that may be taken 507 by the syntax elements and variables of this document, or the value 508 of a transform coefficient prior to scaling. 510 Network Abstraction Layer (NAL) unit: A syntax structure containing 511 an indication of the type of data to follow and bytes containing that 512 data in the form of an RBSP interspersed as necessary. 514 Network Abstraction Layer (NAL) Unit Stream: A sequence of NAL units. 516 Non-IDR Picture: A coded picture that is not an IDR picture. 518 Non-VCL NAL Unit: A NAL unit that is not a VCL NAL unit. 520 Picture Parameter Set (PPS): A syntax structure containing syntax 521 elements that apply to zero or more entire coded pictures as 522 determined by a syntax element found in each slice header. 524 Picture Order Count (POC): A variable that is associated with each 525 picture, uniquely identifies the associated picture among all 526 pictures in the CVS, and, when the associated picture is to be output 527 from the decoded picture buffer, indicates the position of the 528 associated picture in output order relative to the output order 529 positions of the other pictures in the same CVS that are to be output 530 from the decoded picture buffer. 532 Raw Byte Sequence Payload (RBSP): A syntax structure containing an 533 integer number of bytes that is encapsulated in a NAL unit and that 534 is either empty or has the form of a string of data bits containing 535 syntax elements followed by an RBSP stop bit and zero or more 536 subsequent bits equal to 0. 538 Sequence Parameter Set (SPS): A syntax structure containing syntax 539 elements that apply to zero or more entire CVSs as determined by the 540 content of a syntax element found in the PPS referred to by a syntax 541 element found in each slice header. 
543 Tile row: A rectangular region of CTUs having a height specified by 544 syntax elements in the PPS and a width equal to the width of the 545 picture. 547 Tile scan: A specific sequential ordering of CTUs partitioning a 548 picture in which the CTUs are ordered consecutively in CTU raster 549 scan in a tile whereas tiles in a picture are ordered consecutively 550 in a raster scan of the tiles of the picture. 552 Video coding layer (VCL) NAL unit: A collective term for coded slice 553 NAL units and the subset of NAL units that have reserved values of 554 NalUnitType that are classified as VCL NAL units in this document. 556 3.1.2. Definitions Specific to This Memo 558 Media-Aware Network Element (MANE): A network element, such as a 559 middlebox, selective forwarding unit, or application-layer gateway 560 that is capable of parsing certain aspects of the RTP payload headers 561 or the RTP payload and reacting to their contents. 563 Informative note: The concept of a MANE goes beyond normal routers 564 or gateways in that a MANE has to be aware of the signaling (e.g., 565 to learn about the payload type mappings of the media streams), 566 and in that it has to be trusted when working with Secure RTP 567 (SRTP). The advantage of using MANEs is that they allow packets 568 to be dropped according to the needs of the media coding. For 569 example, if a MANE has to drop packets due to congestion on a 570 certain link, it can identify and remove those packets whose 571 elimination produces the least adverse effect on the user 572 experience. After dropping packets, MANEs must rewrite RTCP 573 packets to match the changes to the RTP stream, as specified in 574 Section 7 of [RFC3550]. 576 NAL unit decoding order: A NAL unit order that conforms to the 577 constraints on NAL unit order given in Sections 8.2 and 8.3 of [EVC], 578 that is, the order of NAL units in the bitstream.
580 NAL unit output order: A NAL unit order in which NAL units of 581 different access units are in the output order of the decoded 582 pictures corresponding to the access units, as specified in [EVC], 583 and in which NAL units within an access unit are in their decoding 584 order. 586 RTP stream: See [RFC7656]. Within the scope of this memo, one RTP 587 stream is utilized to transport one or more temporal sub-layers. 589 Transmission order: The order of packets in ascending RTP sequence 590 number order (in modulo arithmetic). Within an aggregation packet, 591 the NAL unit transmission order is the same as the order of 592 appearance of NAL units in the packet. 594 3.2. Abbreviations 596 APS Adaptation Parameter Set 598 ATS Adaptive Transform Selection 600 B Bi-predictive 602 CBR Constant Bit Rate 604 CPB Coded Picture Buffer 606 CTB Coding Tree Block 608 CTU Coding Tree Unit 610 CVS Coded Video Sequence 612 DPB Decoded Picture Buffer 614 HRD Hypothetical Reference Decoder 615 HSS Hypothetical Stream Scheduler 617 I Intra 619 IDR Instantaneous Decoding Refresh 621 LSB Least Significant Bit 623 LTRP Long-Term Reference Picture 625 MMVD Merge with Motion Vector Difference 627 MSB Most Significant Bit 629 NAL Network Abstraction Layer 631 P Predictive 633 POC Picture Order Count 635 PPS Picture Parameter Set 637 QP Quantization Parameter 639 RBSP Raw Byte Sequence Payload 641 RGB Same as GBR 643 SAR Sample Aspect Ratio 645 SEI Supplemental Enhancement Information 647 SODB String Of Data Bits 649 SPS Sequence Parameter Set 651 STRP Short-Term Reference Picture 653 VBR Variable Bit Rate 655 VCL Video Coding Layer 657 4. RTP Payload Format 658 4.1. RTP Header Usage 660 The format of the RTP header is specified in [RFC3550] (reprinted as 661 Figure 2 for convenience). This payload format uses the fields of 662 the header in a manner consistent with that specification. 
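Informative sketch (not part of this payload format): the fixed RTP header defined in [RFC3550] could be unpacked as shown here. The function name parse_rtp_header and the dictionary keys are hypothetical, chosen only for illustration. Header extensions and padding are detected but not processed.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Unpack the 12-byte fixed RTP header (RFC 3550) and any CSRC list.

    Hypothetical helper for illustration only.
    """
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two single bytes, 16-bit sequence number,
    # 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    version = b0 >> 6
    if version != 2:
        raise ValueError("unsupported RTP version")
    cc = b0 & 0x0F                      # CSRC count (CC)
    header_len = 12 + 4 * cc
    if len(packet) < header_len:
        raise ValueError("truncated CSRC list")
    csrcs = list(struct.unpack("!%dI" % cc, packet[12:header_len]))
    return {
        "version": version,
        "padding": bool(b0 & 0x20),     # P bit
        "extension": bool(b0 & 0x10),   # X bit
        "csrc_count": cc,
        "marker": bool(b1 & 0x80),      # M bit
        "payload_type": b1 & 0x7F,      # PT, 7 bits
        "sequence_number": seq,
        "timestamp": ts,
        "ssrc": ssrc,
        "csrcs": csrcs,
        "header_length": header_len,    # offset of the RTP payload
    }
```

The returned header_length gives the offset at which the RTP payload begins when no header extension is present.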

   The RTP payload (and the settings for some RTP header bits) for
   aggregation packets and fragmentation units are specified in
   Section 4.3.2 and Section 4.3.3, respectively.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |            contributing source (CSRC) identifiers             |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    RTP Header According to [RFC3550]

                                Figure 2

   The RTP header information to be set according to this RTP payload
   format is set as follows:

   Marker bit (M): 1 bit

      Set for the last packet of the access unit, carried in the current
      RTP stream.  This is in line with the normal use of the M bit in
      video formats to allow an efficient playout buffer handling.

         editor-note 4: The informative note below needs updating once
         the NAL unit type table is stable in the [EVC] spec.

         Informative note: The content of a NAL unit does not tell
         whether or not the NAL unit is the last NAL unit, in decoding
         order, of an access unit.  An RTP sender implementation may
         obtain this information from the video encoder.  If, however,
         the implementation cannot obtain this information directly from
         the encoder, e.g., when the bitstream was pre-encoded, and also
         there is no timestamp allocated for each NAL unit, then the
         sender implementation can inspect subsequent NAL units in
         decoding order to determine whether or not the NAL unit is the
         last NAL unit of an access unit as follows.
         A NAL unit is determined to be the last NAL unit of an access
         unit if it is the last NAL unit of the bitstream.  A NAL unit
         naluX is also determined to be the last NAL unit of an access
         unit if both the following conditions are true: 1) the next VCL
         NAL unit naluY in decoding order has the high-order bit of the
         first byte after its NAL unit header equal to 1 or
         nal_unit_type equal to 27, and 2) all NAL units between naluX
         and naluY, when present, have nal_unit_type in the range of 24
         to 26, inclusive, equal to 28 or in the range of 29 to 55.

   Payload Type (PT): 7 bits

      The assignment of an RTP payload type for this new payload format
      is outside the scope of this document and will not be specified
      here.  The assignment of a payload type has to be performed either
      through the profile used or in a dynamic way.

   Sequence Number (SN): 16 bits

      Set and used in accordance with [RFC3550].

   Timestamp: 32 bits

      The RTP timestamp is set to the sampling timestamp of the content.
      A 90 kHz clock rate MUST be used.  If the NAL unit has no timing
      properties of its own (e.g., parameter sets or certain SEI NAL
      units), the RTP timestamp MUST be set to the RTP timestamp of the
      coded picture of the access unit in which the NAL unit (according
      to Annex D of [EVC]) is included.  Receivers MUST use the RTP
      timestamp for the display process, even when the bitstream
      contains picture timing SEI messages or decoding unit information
      SEI messages as specified in [EVC].

   Synchronization source (SSRC): 32 bits

      Used to identify the source of the RTP packets.  When using SRST,
      by definition a single SSRC is used for all parts of a single
      bitstream.

4.2.  Payload Header Usage

   The first two bytes of the payload of an RTP packet are referred to
   as the payload header.
   The payload header consists of the same
   fields (F, TID, Reserve and E) as the NAL unit header as shown in
   Section 1.1.4, irrespective of the type of the payload structure.

   The TID value indicates (among other things) the relative importance
   of an RTP packet, for example, because NAL units belonging to higher
   temporal sub-layers are not used for the decoding of lower temporal
   sub-layers.  A lower value of TID indicates a higher importance.
   More-important NAL units MAY be better protected against transmission
   losses than less-important NAL units.

4.3.  Payload Structures

   Three different types of RTP packet payload structures are specified.
   A receiver can identify the type of an RTP packet payload through the
   Type field in the payload header.

   The three payload structures are as follows:

   *  Single NAL unit packet: Contains a single NAL unit in the payload,
      and the NAL unit header of the NAL unit also serves as the payload
      header.  This payload structure is specified in Section 4.3.1.

   *  Aggregation Packet (AP): Contains more than one NAL unit within
      one access unit.  This payload structure is specified in
      Section 4.3.2.

   *  Fragmentation Unit (FU): Contains a subset of a single NAL unit.
      This payload structure is specified in Section 4.3.3.

4.3.1.  Single NAL Unit Packets

   A single NAL unit packet contains exactly one NAL unit, and consists
   of a payload header (denoted as PayloadHdr), a conditional 16-bit
   DONL field (in network byte order), and the NAL unit payload data
   (the NAL unit excluding its NAL unit header) of the contained NAL
   unit, as shown in Figure 3.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           PayloadHdr          |      DONL (conditional)       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                     NAL unit payload data                     |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                The Structure of a Single NAL Unit Packet

                                Figure 3

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the contained NAL
   unit.  If sprop-max-don-diff is greater than 0 for any of the RTP
   streams, the DONL field MUST be present, and the variable DON for the
   contained NAL unit is derived as equal to the value of the DONL
   field.  Otherwise (sprop-max-don-diff is equal to 0 for all the RTP
   streams), the DONL field MUST NOT be present.

4.3.2.  Aggregation Packets (APs)

   Aggregation Packets (APs) enable the reduction of packetization
   overhead for small NAL units, such as most of the non-VCL NAL units,
   which are often only a few octets in size.

   An AP aggregates NAL units within one access unit.  Each NAL unit to
   be carried in an AP is encapsulated in an aggregation unit.  NAL
   units aggregated in one AP are in NAL unit decoding order.

   An AP consists of a payload header (denoted as PayloadHdr) followed
   by two or more aggregation units, as shown in Figure 4.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     PayloadHdr (Type=56)      |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
   |                                                               |
   |                 two or more aggregation units                 |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                 The Structure of an Aggregation Packet

                                Figure 4

   The fields in the payload header are set as follows.  The F bit MUST
   be equal to 0 if the F bit of each aggregated NAL unit is equal to
   zero; otherwise, it MUST be equal to 1.  The Type field MUST be equal
   to 56.

   The value of TID MUST be the lowest value of TID of all the
   aggregated NAL units.  The values of Reserve and E MUST match the
   version of the [EVC] specification.

      Informative note: All VCL NAL units in an AP have the same TID
      value since they belong to the same access unit.  However, an AP
      may contain non-VCL NAL units for which the TID value in the NAL
      unit header may be different than the TID value of the VCL NAL
      units in the same AP.

   An AP MUST carry at least two aggregation units and can carry as many
   aggregation units as necessary; however, the total amount of data in
   an AP obviously MUST fit into an IP packet, and the size SHOULD be
   chosen so that the resulting IP packet is smaller than the path MTU
   size so as to avoid IP layer fragmentation.  An AP MUST NOT contain
   FUs specified in Section 4.3.3.  APs MUST NOT be nested; i.e., an AP
   cannot contain another AP.
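
   The rules above for the F, Type, and TID fields of an AP payload
   header can be illustrated with a minimal sketch.  It assumes that
   the F bit and TID of each aggregated NAL unit have already been
   parsed; the function name ap_header_fields and the tuple
   representation are hypothetical and not part of this memo.  The
   Reserve and E fields are left out because their values depend on the
   [EVC] specification version.

```python
AP_TYPE = 56  # Type value that identifies an aggregation packet


def ap_header_fields(nal_units):
    """Derive the F, Type and TID values of an AP payload header.

    nal_units is a list of (f_bit, tid) tuples, one per aggregated NAL
    unit, in NAL unit decoding order (a representation assumed here
    purely for illustration).
    """
    if len(nal_units) < 2:
        raise ValueError("an AP MUST carry at least two aggregation units")
    # F is 0 only if the F bit of every aggregated NAL unit is 0.
    f_bit = 0 if all(f == 0 for f, _ in nal_units) else 1
    # TID is the lowest TID among all aggregated NAL units.
    tid = min(t for _, t in nal_units)
    return {"F": f_bit, "Type": AP_TYPE, "TID": tid}
```

   For example, aggregating a parameter set with TID 0 and a coded
   slice with TID 2 yields an AP payload header with TID 0.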

   The first aggregation unit in an AP consists of a conditional 16-bit
   DONL field (in network byte order) followed by a 16-bit unsigned size
   information (in network byte order) that indicates the size of the
   NAL unit in bytes (excluding these two octets, but including the NAL
   unit header), followed by the NAL unit itself, including its NAL unit
   header, as shown in Figure 5.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               :      DONL (conditional)       |   NALU size   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   NALU size   |                                               |
   +-+-+-+-+-+-+-+-+         NAL unit                              |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

          The Structure of the First Aggregation Unit in an AP

                                Figure 5

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the aggregated NAL
   unit.

   If sprop-max-don-diff is greater than 0 for any of the RTP streams,
   the DONL field MUST be present in an aggregation unit that is the
   first aggregation unit in an AP, and the variable DON for the
   aggregated NAL unit is derived as equal to the value of the DONL
   field.  Otherwise (sprop-max-don-diff is equal to 0 for all the RTP
   streams), the DONL field MUST NOT be present in an aggregation unit
   that is the first aggregation unit in an AP.

   An aggregation unit that is not the first aggregation unit in an AP
   consists of a 16-bit unsigned size information (in network byte
   order) that indicates the size of the NAL unit in bytes (excluding
   these two octets, but including the NAL unit header), followed by
   the NAL unit itself, including its NAL unit header, as shown in
   Figure 6.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               :           NALU size           |   NAL unit    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

       The Structure of an Aggregation Unit That Is Not the First
                        Aggregation Unit in an AP

                                Figure 6

   Figure 7 presents an example of an AP that contains two aggregation
   units, labeled as NALU 1 and NALU 2 in the figure, without the DONL
   field being present.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     PayloadHdr (Type=56)      |          NALU 1 Size          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          NALU 1 HDR           |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 1 Data          |
   |                   . . .                                       |
   |                                                               |
   +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  . . .        |          NALU 2 Size          |  NALU 2 HDR   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  NALU 2 HDR   |                                               |
   +-+-+-+-+-+-+-+-+                  NALU 2 Data                  |
   |                   . . .                                       |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                  An Example of an AP Packet Containing
              Two Aggregation Units without the DONL Field

                                Figure 7

   Figure 8 presents an example of an AP that contains two aggregation
   units, labeled as NALU 1 and NALU 2 in the figure, with the DONL
   field being present.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          RTP Header                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     PayloadHdr (Type=56)      |          NALU 1 DONL          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          NALU 1 Size          |          NALU 1 HDR           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                       NALU 1 Data . . .                       |
   |                                                               |
   +        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :          NALU 2 Size          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          NALU 2 HDR           |                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
   |                                                               |
   |        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     An Example of an AP Containing
                Two Aggregation Units with the DONL Field

                                Figure 8

4.3.3.  Fragmentation Units

   Fragmentation Units (FUs) are introduced to enable fragmenting a
   single NAL unit into multiple RTP packets, possibly without
   cooperation or knowledge of the EVC [EVC] encoder.  A fragment of a
   NAL unit consists of an integer number of consecutive octets of that
   NAL unit.  Fragments of the same NAL unit MUST be sent in consecutive
   order with ascending RTP sequence numbers (with no other RTP packets
   within the same RTP stream being sent between the first and last
   fragment).

   When a NAL unit is fragmented and conveyed within FUs, it is referred
   to as a fragmented NAL unit.  APs MUST NOT be fragmented.  FUs MUST
   NOT be nested; i.e., an FU must not contain a subset of another FU.

   The RTP timestamp of an RTP packet carrying an FU is set to the
   NALU-time of the fragmented NAL unit.

   An FU consists of a payload header (denoted as PayloadHdr), an FU
   header of one octet, a conditional 16-bit DONL field (in network byte
   order), and an FU payload, as shown in Figure 9.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     PayloadHdr (Type=57)      |   FU header   |  DONL (cond)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-|
   |  DONL (cond)  |                                               |
   |-+-+-+-+-+-+-+-+                                               |
   |                          FU payload                           |
   |                                                               |
   |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                               :...OPTIONAL RTP padding        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         The Structure of an FU

                                Figure 9

   The fields in the payload header are set as follows.  The Type field
   MUST be equal to 57.  The fields F, TID, Reserve and E MUST be equal
   to the fields F, TID, Reserve and E, respectively, of the fragmented
   NAL unit.

   The FU header consists of an S bit, an E bit, and a 6-bit FuType
   field, as shown in Figure 10.

   +---------------+
   |0|1|2|3|4|5|6|7|
   +-+-+-+-+-+-+-+-+
   |S|E|  FuType   |
   +---------------+

                       The Structure of FU Header

                                Figure 10

   The semantics of the FU header fields are as follows:

   S: 1 bit

      When set to 1, the S bit indicates the start of a fragmented NAL
      unit, i.e., the first byte of the FU payload is also the first
      byte of the payload of the fragmented NAL unit.  When the FU
      payload is not the start of the fragmented NAL unit payload, the S
      bit MUST be set to 0.

   E: 1 bit

      When set to 1, the E bit indicates the end of a fragmented NAL
      unit, i.e., the last byte of the payload is also the last byte of
      the fragmented NAL unit.  When the FU payload is not the last
      fragment of a fragmented NAL unit, the E bit MUST be set to 0.

   FuType: 6 bits

      The field FuType MUST be equal to the field Type of the fragmented
      NAL unit.

   The DONL field, when present, specifies the value of the 16 least
   significant bits of the decoding order number of the fragmented NAL
   unit.

   If sprop-max-don-diff is greater than 0 for any of the RTP streams,
   and the S bit is equal to 1, the DONL field MUST be present in the
   FU, and the variable DON for the fragmented NAL unit is derived as
   equal to the value of the DONL field.  Otherwise (sprop-max-don-diff
   is equal to 0 for all the RTP streams, or the S bit is equal to 0),
   the DONL field MUST NOT be present in the FU.

   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
   the Start bit and End bit must not both be set to 1 in the same FU
   header.

   The FU payload consists of fragments of the payload of the fragmented
   NAL unit so that if the FU payloads of consecutive FUs, starting with
   an FU with the S bit equal to 1 and ending with an FU with the E bit
   equal to 1, are sequentially concatenated, the payload of the
   fragmented NAL unit can be reconstructed.  The NAL unit header of the
   fragmented NAL unit is not included as such in the FU payload, but
   rather the information of the NAL unit header of the fragmented NAL
   unit is conveyed in F, TID, Reserve and E fields of the FU payload
   headers of the FUs and the FuType field of the FU header of the FUs.
   An FU payload MUST NOT be empty.

   If an FU is lost, the receiver SHOULD discard all following
   fragmentation units in transmission order corresponding to the same
   fragmented NAL unit, unless the decoder in the receiver is known to
   gracefully handle incomplete NAL units.

   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
   fragments of a NAL unit to an (incomplete) NAL unit, even if fragment
   n of that NAL unit is not received.
   In this case, the
   forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a
   syntax violation.

4.4.  Decoding Order Number

   For each NAL unit, the variable AbsDon is derived, representing the
   decoding order number that is indicative of the NAL unit decoding
   order.

   Let NAL unit n be the n-th NAL unit in transmission order within an
   RTP stream.

   If sprop-max-don-diff is equal to 0 for all the RTP streams carrying
   the EVC bitstream, AbsDon[n], the value of AbsDon for NAL unit n, is
   derived as equal to n.

   Otherwise (sprop-max-don-diff is greater than 0 for any of the RTP
   streams), AbsDon[n] is derived as follows, where DON[n] is the value
   of the variable DON for NAL unit n:

   *  If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in
      transmission order), AbsDon[0] is set equal to DON[0].

   *  Otherwise (n is greater than 0), the following applies for
      derivation of AbsDon[n]:

         If DON[n] == DON[n-1],
            AbsDon[n] = AbsDon[n-1]

         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768),
            AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1]

         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768),
            AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n]

         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768),
            AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 - DON[n])

         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768),
            AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n])

   For any two NAL units m and n, the following applies:

   *  AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows
      NAL unit m in NAL unit decoding order.

   *  When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order
      of the two NAL units can be in either order.

   *  AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes
      NAL unit m in decoding order.
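
   The AbsDon derivation for the case where sprop-max-don-diff is
   greater than 0 can be sketched as follows.  This is an illustrative
   transcription of the five cases above, not a normative
   implementation; the function name derive_abs_don and the list-based
   interface are assumptions made for this example.

```python
def derive_abs_don(dons):
    """Return AbsDon for each 16-bit DON value, given in transmission order.

    Covers only the case sprop-max-don-diff > 0; when it is 0, AbsDon[n]
    is simply n.
    """
    abs_dons = []
    for n, don in enumerate(dons):
        if not 0 <= don <= 0xFFFF:
            raise ValueError("DON is a 16-bit value")
        if n == 0:
            abs_dons.append(don)  # AbsDon[0] = DON[0]
            continue
        prev_don, prev_abs = dons[n - 1], abs_dons[n - 1]
        if don == prev_don:
            abs_don = prev_abs
        elif don > prev_don and don - prev_don < 32768:
            abs_don = prev_abs + don - prev_don
        elif don < prev_don and prev_don - don >= 32768:
            # DON wrapped around going forward.
            abs_don = prev_abs + 65536 - prev_don + don
        elif don > prev_don and don - prev_don >= 32768:
            # DON wrapped around going backward.
            abs_don = prev_abs - (prev_don + 65536 - don)
        else:  # don < prev_don and prev_don - don < 32768
            abs_don = prev_abs - (prev_don - don)
        abs_dons.append(abs_don)
    return abs_dons
```

   With DON values 65534, 65535, 0, 1 the sketch yields AbsDon values
   65534, 65535, 65536, 65537, so the 16-bit wraparound does not
   disturb the decoding order.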

         Informative note: When two consecutive NAL units in the NAL
         unit decoding order have different values of AbsDon, the
         absolute difference between the two AbsDon values may be
         greater than or equal to 1.

         Informative note: There are multiple reasons to allow for the
         absolute difference of the values of AbsDon for two consecutive
         NAL units in the NAL unit decoding order to be greater than
         one.  An increment by one is not required, as at the time of
         associating values of AbsDon to NAL units, it may not be known
         whether all NAL units are to be delivered to the receiver.  For
         example, a gateway might not forward VCL NAL units of higher
         sub-layers or some SEI NAL units when there is congestion in
         the network.  In another example, the first intra-coded picture
         of a pre-encoded clip is transmitted in advance to ensure that
         it is readily available in the receiver, and when transmitting
         the first intra-coded picture, the originator does not exactly
         know how many NAL units will be encoded before the first
         intra-coded picture of the pre-encoded clip follows in decoding
         order.  Thus, the values of AbsDon for the NAL units of the
         first intra-coded picture of the pre-encoded clip have to be
         estimated when they are transmitted, and gaps in values of
         AbsDon may occur.

5.  Packetization Rules

   The following packetization rules apply:

   *  If sprop-max-don-diff is greater than 0 for any of the RTP
      streams, the transmission order of NAL units carried in the RTP
      stream MAY be different than the NAL unit decoding order and the
      NAL unit output order.

   *  A NAL unit of a small size SHOULD be encapsulated in an
      aggregation packet together with one or more other NAL units in
      order to avoid unnecessary packetization overhead for small NAL
      units.
      For example, non-VCL NAL units such as access unit
      delimiters, parameter sets, or SEI NAL units are typically small
      and can often be aggregated with VCL NAL units without violating
      MTU size constraints.

   *  Each non-VCL NAL unit SHOULD, when possible from an MTU size match
      viewpoint, be encapsulated in an aggregation packet together with
      its associated VCL NAL unit, as typically a non-VCL NAL unit would
      be meaningless without the associated VCL NAL unit being
      available.

   *  For carrying exactly one NAL unit in an RTP packet, a single NAL
      unit packet MUST be used.

6.  De-packetization Process

   The general concept behind de-packetization is to get the NAL units
   out of the RTP packets in an RTP stream and pass them to the decoder
   in the NAL unit decoding order.

   The de-packetization process is implementation dependent.  Therefore,
   the following description should be seen as an example of a suitable
   implementation.  Other schemes may be used as well, as long as the
   output for the same input is the same as the process described below.
   The output is the same when the set of output NAL units and their
   order are both identical.  Optimizations relative to the described
   algorithms are possible.

   All normal RTP mechanisms related to buffer management apply.  In
   particular, duplicated or outdated RTP packets (as indicated by the
   RTP sequence number and the RTP timestamp) are removed.  To
   determine the exact time for decoding, factors such as a possible
   intentional delay to allow for proper inter-stream synchronization
   must be taken into account.

   NAL units with NAL unit type values in the range of 0 to 55,
   inclusive, may be passed to the decoder.  NAL-unit-like structures
   with NAL unit type values in the range of 56 to 63, inclusive, MUST
   NOT be passed to the decoder.
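
   As a rough illustration of how a receiver might act on the Type
   value, the sketch below classifies a payload and decides whether a
   structure may reach the decoder.  It assumes the 6-bit Type value has
   already been extracted from the payload header (the bit layout is
   given in Section 1.1.4 and is not reproduced here); the function
   names and the returned labels are hypothetical.

```python
AP_TYPE = 56  # aggregation packet (Section 4.3.2)
FU_TYPE = 57  # fragmentation unit (Section 4.3.3)


def payload_structure(type_value):
    """Map a 6-bit Type value to the payload structure it signals."""
    if not 0 <= type_value <= 63:
        raise ValueError("Type is a 6-bit field")
    if type_value == AP_TYPE:
        return "aggregation packet"
    if type_value == FU_TYPE:
        return "fragmentation unit"
    if type_value <= 55:
        return "single NAL unit packet"
    # 58..63: no payload structure is assigned in the text above.
    return "unspecified"


def may_pass_to_decoder(type_value):
    """Only NAL unit types 0..55 may be handed to the decoder."""
    return 0 <= type_value <= 55
```

   NAL units recovered from APs and FUs carry their own type values in
   the range of 0 to 55, so the second check is applied to the
   reconstructed NAL unit and not to the packet that carried it.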

   The receiver includes a receiver buffer, which is used to compensate
   for transmission delay jitter within individual RTP streams and
   across RTP streams, to reorder NAL units from transmission order to
   the NAL unit decoding order.  In this section, the receiver operation
   is described under the assumption that there is no transmission delay
   jitter within an RTP stream.  To distinguish it from a practical
   receiver buffer that is also used for compensation of transmission
   delay jitter, the receiver buffer is hereafter called the
   de-packetization buffer in this section.  Receivers should also
   prepare for transmission delay jitter; that is, either reserve
   separate buffers for transmission delay jitter buffering and
   de-packetization buffering or use a receiver buffer for both
   transmission delay jitter and de-packetization.  Moreover, receivers
   should take transmission delay jitter into account in the buffering
   operation, e.g., by additional initial buffering before decoding and
   playback start.

   When sprop-max-don-diff is equal to 0 for the received RTP stream,
   the de-packetization buffer size is zero bytes, and the process
   described in the remainder of this paragraph applies.  The NAL units
   carried in the RTP stream are directly passed to the decoder in their
   transmission order, which is identical to their decoding order.  When
   there are several NAL units of the same RTP stream with the same NTP
   timestamp, the order to pass them to the decoder is their
   transmission order.

      Informative note: The mapping between RTP and NTP timestamps is
      conveyed in RTCP SR packets.  In addition, the mechanisms for
      faster media timestamp synchronization discussed in [RFC6051] may
      be used to speed up the acquisition of the RTP-to-wall-clock
      mapping.
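
   The pass-through case above and the buffered case specified in the
   remainder of this section can be sketched together as follows.  This
   is a minimal illustration under stated assumptions, not a normative
   implementation: the class name DepacketizationBuffer and its methods
   are hypothetical, AbsDon is assumed to be computed by the caller as
   described in Section 4.4, and transmission delay jitter is ignored.
   The two buffering states are merged, since output starts the first
   time condition A or condition B becomes true.

```python
import heapq


class DepacketizationBuffer:
    """Reorders NAL units from reception order to decoding order."""

    def __init__(self, sprop_max_don_diff, sprop_depack_buf_nalus):
        self.max_don_diff = sprop_max_don_diff
        self.max_nalus = sprop_depack_buf_nalus
        self._heap = []      # entries: (AbsDon, arrival index, NAL unit)
        self._arrivals = 0   # tie-breaker for equal AbsDon values

    def _condition_a(self):
        # Spread between greatest and smallest buffered AbsDon values.
        if not self._heap:
            return False
        greatest = max(entry[0] for entry in self._heap)
        return greatest - self._heap[0][0] >= self.max_don_diff

    def _condition_b(self):
        return len(self._heap) > self.max_nalus

    def push(self, abs_don, nal_unit):
        """Store one NAL unit; return the NAL units released to the decoder."""
        if self.max_don_diff == 0:
            return [nal_unit]  # buffer size is zero: pass straight through
        heapq.heappush(self._heap, (abs_don, self._arrivals, nal_unit))
        self._arrivals += 1
        released = []
        while self._condition_a() or self._condition_b():
            released.append(heapq.heappop(self._heap)[2])
        return released

    def flush(self):
        """Release everything left, in order of increasing AbsDon."""
        released = []
        while self._heap:
            released.append(heapq.heappop(self._heap)[2])
        return released
```

   A heap keeps the NAL unit with the smallest AbsDon at the front; the
   linear scan for the greatest value keeps the sketch short and could
   be replaced by a running maximum in a real receiver.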

   When sprop-max-don-diff is greater than 0 for the received RTP
   stream, the process described in the remainder of this section
   applies.

   There are two buffering states in the receiver: initial buffering and
   buffering while playing.  Initial buffering starts when the reception
   is initialized.  After initial buffering, decoding and playback are
   started, and the buffering-while-playing mode is used.

   Regardless of the buffering state, the receiver stores incoming NAL
   units, in reception order, into the de-packetization buffer.  NAL
   units carried in RTP packets are stored in the de-packetization
   buffer individually, and the value of AbsDon is calculated and stored
   for each NAL unit.

   Initial buffering lasts until condition A (the difference between the
   greatest and smallest AbsDon values of the NAL units in the
   de-packetization buffer is greater than or equal to the value of
   sprop-max-don-diff) or condition B (the number of NAL units in the
   de-packetization buffer is greater than the value of
   sprop-depack-buf-nalus) is true.

   After initial buffering, whenever condition A or condition B is true,
   the following operation is repeatedly applied until both condition A
   and condition B become false:

   *  The NAL unit in the de-packetization buffer with the smallest
      value of AbsDon is removed from the de-packetization buffer and
      passed to the decoder.

   When no more NAL units are flowing into the de-packetization buffer,
   all NAL units remaining in the de-packetization buffer are removed
   from the buffer and passed to the decoder in the order of increasing
   AbsDon values.

7.  Payload Format Parameters

   This section specifies the optional parameters.  A mapping of the
   parameters to the Session Description Protocol (SDP) [RFC4566] is
   also provided for applications that use SDP.

7.1.  Media Type Registration

   The receiver MUST ignore any parameter unspecified in this memo.

   Type name:            video

   Subtype name:         evc

   Required parameters:  none

   Optional parameters:

      editor-note 5: To be updated

7.2.  SDP Parameters

   The receiver MUST ignore any parameter unspecified in this memo.

7.2.1.  Mapping of Payload Type Parameters to SDP

   The media type video/evc string is mapped to fields in the Session
   Description Protocol (SDP) [RFC4566] as follows:

   *  The media name in the "m=" line of SDP MUST be video.

   *  The encoding name in the "a=rtpmap" line of SDP MUST be evc (the
      media subtype).

   *  The clock rate in the "a=rtpmap" line MUST be 90000.

   *  OPTIONAL PARAMETERS:

      editor-note 6: To be updated

7.2.2.  Usage with SDP Offer/Answer Model

   When [EVC] is offered over RTP using SDP in an offer/answer model
   [RFC3264] for negotiation for unicast usage, the following
   limitations and rules apply:

      editor-note 7: to be updated

7.2.3.  SDP Example

      editor-note 8: to be updated

8.  Use with Feedback Messages

   Placeholder

8.1.  Picture Loss Indication (PLI)

   Placeholder

8.2.  Full Intra Request (FIR)

   Placeholder

9.  Security Considerations

   The scope of this Security Considerations section is limited to the
   payload format itself and to one feature of [EVC] that may pose a
   particularly serious security risk if implemented naively.  The
   payload format, in isolation, does not form a complete system.
   Implementers are advised to read and understand relevant security-
   related documents, especially those pertaining to RTP (see the
   Security Considerations section in [RFC3550]), and the security of
   the call-control stack chosen (that may make use of the media type
   registration of this memo).
   Implementers should also consider known
   security vulnerabilities of video coding and decoding implementations
   in general and avoid those.

   Within this RTP payload format, neither the various media-plane-based
   mechanisms, nor the signaling part of this memo, seems to pose a
   security risk beyond those common to all RTP-based systems.

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [RFC3550], and in any applicable RTP profile such as
   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or
   RTP/SAVPF [RFC5124].  However, as "Securing the RTP Framework: Why
   RTP Does Not Mandate a Single Media Security Solution" [RFC7202]
   discusses, it is not an RTP payload format's responsibility to
   discuss or mandate what solutions are used to meet the basic security
   goals like confidentiality, integrity and source authenticity for RTP
   in general.  This responsibility lies with anyone using RTP in an
   application.  They can find guidance on available security mechanisms
   and important considerations in "Options for Securing RTP Sessions"
   [RFC7201].  Applications SHOULD use one or more appropriate strong
   security mechanisms.  The rest of this section discusses the
   security-impacting properties of the payload format itself.

   Because the data compression used with this payload format is applied
   end-to-end, any encryption needs to be performed after compression.
   A potential denial-of-service threat exists for data encodings using
   compression techniques that have non-uniform receiver-end
   computational load.  The attacker can inject pathological datagrams
   into the bitstream that are complex to decode and that cause the
   receiver to be overloaded.
   EVC is particularly vulnerable to such
   attacks, as it is extremely simple to generate datagrams containing
   NAL units that affect the decoding process of many future NAL units.
   Therefore, the usage of data origin authentication and data integrity
   protection of at least the RTP packet is RECOMMENDED, for example,
   with SRTP [RFC3711].

   End-to-end security with authentication, integrity, or
   confidentiality protection will prevent a MANE from performing media-
   aware operations other than discarding complete packets.  In the case
   of confidentiality protection, it will even be prevented from
   discarding packets in a media-aware way.  To be allowed to perform
   such operations, a MANE is required to be a trusted entity that is
   included in the security context establishment.

10.  Congestion Control

   Congestion control for RTP SHALL be used in accordance with RTP
   [RFC3550] and with any applicable RTP profile, e.g., AVP [RFC3551].
   If best-effort service is being used, an additional requirement is
   that users of this payload format MUST monitor packet loss to ensure
   that the packet loss rate is within an acceptable range.  Packet loss
   is considered acceptable if a TCP flow across the same network path,
   and experiencing the same network conditions, would achieve an
   average throughput, measured on a reasonable timescale, that is not
   less than all RTP streams combined is achieving.  This condition can
   be satisfied by implementing congestion-control mechanisms to adapt
   the transmission rate, the number of layers subscribed for a layered
   multicast session, or by arranging for a receiver to leave the
   session if the loss rate is unacceptably high.

   The bitrate adaptation necessary for obeying the congestion control
   principle is easily achievable when real-time encoding is used, for
   example, by adequately tuning the quantization parameter.
   However, when pre-encoded content is being transmitted, bandwidth
   adaptation requires the pre-encoded bitstream to be tailored for such
   adaptivity. The key mechanism available in [EVC] is temporal
   scalability. A media sender can remove NAL units belonging to higher
   temporal sub-layers (i.e., those NAL units with a high value of TID)
   until the sending bitrate drops to an acceptable range.

   The mechanisms mentioned above generally work within a defined
   profile and level and, therefore, no renegotiation of the channel is
   required. Only when non-downgradable parameters (such as profile)
   are required to be changed does it become necessary to terminate and
   restart the RTP stream(s). This may be accomplished by using
   different RTP payload types.

   MANEs MAY remove certain unusable packets from the RTP stream when
   that RTP stream was damaged due to previous packet losses. This can
   help reduce the network load in certain special cases. For example,
   MANEs can remove those FUs where the leading FUs belonging to the
   same NAL unit have been lost or those dependent slice segments when
   the leading slice segments belonging to the same slice have been
   lost, because the trailing FUs or dependent slice segments are
   meaningless to most decoders. MANEs can also remove higher temporal
   sub-layers if the outbound transmission (from the MANE's viewpoint)
   experiences congestion.

11. IANA Considerations

   Placeholder

12. Acknowledgements

   Large parts of this specification share text with the RTP payload
   format for HEVC [RFC7798]. We thank the authors of that
   specification for their excellent work.

13. References

13.1. Normative References

   [EVC]      "ISO/IEC FDIS 23094-1 Essential Video Coding", 2020.

   [ISO23094-1]
              "ISO/IEC DIS Information technology --- General video
              coding --- Part 1 Essential video coding", n.d.
   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3264]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
              with Session Description Protocol (SDP)", RFC 3264,
              DOI 10.17487/RFC3264, June 2002.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
              July 2003.

   [RFC3551]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
              Video Conferences with Minimal Control", STD 65, RFC 3551,
              DOI 10.17487/RFC3551, July 2003.

   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
              RFC 3711, DOI 10.17487/RFC3711, March 2004.

   [RFC4556]  Zhu, L. and B. Tung, "Public Key Cryptography for Initial
              Authentication in Kerberos (PKINIT)", RFC 4556,
              DOI 10.17487/RFC4556, June 2006.

   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
              Description Protocol", RFC 4566, DOI 10.17487/RFC4566,
              July 2006.

   [RFC4585]  Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey,
              "Extended RTP Profile for Real-time Transport Control
              Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585,
              DOI 10.17487/RFC4585, July 2006.

   [RFC5124]  Ott, J. and E. Carrara, "Extended Secure RTP Profile for
              Real-time Transport Control Protocol (RTCP)-Based Feedback
              (RTP/SAVPF)", RFC 5124, DOI 10.17487/RFC5124, February
              2008.

   [RFC7656]  Lennox, J., Gross, K., Nandakumar, S., Salgueiro, G., and
              B. Burman, Ed., "A Taxonomy of Semantics and Mechanisms
              for Real-Time Transport Protocol (RTP) Sources", RFC 7656,
              DOI 10.17487/RFC7656, November 2015.
   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

13.2. Informative References

   [AVC]      "ITU-T Recommendation H.264 - Advanced video coding for
              generic audiovisual services", 2014.

   [HEVC]     "High efficiency video coding, ITU-T Recommendation
              H.265", 2017.

   [I-D.ietf-avtcore-rtp-vvc]
              Zhao, S., Wenger, S., Sanchez, Y., and Y. Wang, "RTP
              Payload Format for Versatile Video Coding (VVC)", Work in
              Progress, Internet-Draft, draft-ietf-avtcore-rtp-vvc-07,
              19 January 2021.

   [MPEG2S]   ISO/IEC, "Information technology - Generic coding of
              moving pictures and associated audio information - Part
              1: Systems, ISO International Standard 13818-1", 2013.

   [RFC6051]  Perkins, C. and T. Schierl, "Rapid Synchronisation of RTP
              Flows", RFC 6051, DOI 10.17487/RFC6051, November 2010.

   [RFC6184]  Wang, Y.-K., Even, R., Kristensen, T., and R. Jesup, "RTP
              Payload Format for H.264 Video", RFC 6184,
              DOI 10.17487/RFC6184, May 2011.

   [RFC6190]  Wenger, S., Wang, Y.-K., Schierl, T., and A.
              Eleftheriadis, "RTP Payload Format for Scalable Video
              Coding", RFC 6190, DOI 10.17487/RFC6190, May 2011.

   [RFC7201]  Westerlund, M. and C. Perkins, "Options for Securing RTP
              Sessions", RFC 7201, DOI 10.17487/RFC7201, April 2014.

   [RFC7202]  Perkins, C. and M. Westerlund, "Securing the RTP
              Framework: Why RTP Does Not Mandate a Single Media
              Security Solution", RFC 7202, DOI 10.17487/RFC7202, April
              2014.

   [RFC7798]  Wang, Y.-K., Sanchez, Y., Schierl, T., Wenger, S., and M.
              M. Hannuksela, "RTP Payload Format for High Efficiency
              Video Coding (HEVC)", RFC 7798, DOI 10.17487/RFC7798,
              March 2016.
   [VVC]      "ISO/IEC FDIS 23090-3 Information technology --- Coded
              representation of immersive media --- Part 3 - Versatile
              video coding", 2020.

Authors' Addresses

   Shuai Zhao
   Tencent
   2747 Park Blvd
   Palo Alto, 94588
   United States of America

   Email: shuai.zhao@ieee.org

   Stephan Wenger
   Tencent
   2747 Park Blvd
   Palo Alto, 94588
   United States of America

   Email: stewe@stewe.org

   Youngkwon Lim
   Samsung Electronics
   6625 Excellence Way
   Plano, 75013
   United States of America

   Email: yklwhite@gmail.com