idnits 2.17.1 

draft-ietf-avtcore-rtp-vvc-10.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document date (July 09, 2021) is 1015 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 1381

  ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866)

  ** Downref: Normative reference to an Informational RFC: RFC 7656

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VSEI'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC'

  -- Obsolete informational reference (is this intentional?): RFC 2326
     (Obsoleted by RFC 7826)


     Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 5 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	avtcore                                                          S. Zhao
3	Internet-Draft                                                 S. Wenger
4	Intended status: Standards Track                                 Tencent
5	Expires: January 10, 2022                                     Y. Sanchez
6	                                                          Fraunhofer HHI
7	                                                                 Y. Wang
8	                                                          Bytedance Inc.
9	                                                           July 09, 2021

11	          RTP Payload Format for Versatile Video Coding (VVC)
12	                     draft-ietf-avtcore-rtp-vvc-10

14	Abstract

16	   This memo describes an RTP payload format for the video coding
17	   standard ITU-T Recommendation H.266 and ISO/IEC International
18	   Standard 23090-3, both also known as Versatile Video Coding (VVC) and
19	   developed by the Joint Video Experts Team (JVET).  The RTP payload
20	   format allows for packetization of one or more Network Abstraction
21	   Layer (NAL) units in each RTP packet payload as well as fragmentation
22	   of a NAL unit into multiple RTP packets.  The payload format has wide
23	   applicability in videoconferencing, Internet video streaming, and
24	   high-bitrate entertainment-quality video, among other applications.

26	Status of This Memo

28	   This Internet-Draft is submitted in full conformance with the
29	   provisions of BCP 78 and BCP 79.

31	   Internet-Drafts are working documents of the Internet Engineering
32	   Task Force (IETF).  Note that other groups may also distribute
33	   working documents as Internet-Drafts.  The list of current Internet-
34	   Drafts is at https://datatracker.ietf.org/drafts/current/.

36	   Internet-Drafts are draft documents valid for a maximum of six months
37	   and may be updated, replaced, or obsoleted by other documents at any
38	   time.  It is inappropriate to use Internet-Drafts as reference
39	   material or to cite them other than as "work in progress."

41	   This Internet-Draft will expire on January 10, 2022.

43	Copyright Notice

45	   Copyright (c) 2021 IETF Trust and the persons identified as the
46	   document authors.  All rights reserved.

48	   This document is subject to BCP 78 and the IETF Trust's Legal
49	   Provisions Relating to IETF Documents
50	   (https://trustee.ietf.org/license-info) in effect on the date of
51	   publication of this document.  Please review these documents
52	   carefully, as they describe your rights and restrictions with respect
53	   to this document.  Code Components extracted from this document must
54	   include Simplified BSD License text as described in Section 4.e of
55	   the Trust Legal Provisions and are provided without warranty as
56	   described in the Simplified BSD License.

58	Table of Contents

60	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
61	     1.1.  Overview of the VVC Codec . . . . . . . . . . . . . . . .   3
62	       1.1.1.  Coding-Tool Features (informative)  . . . . . . . . .   4
63	       1.1.2.  Systems and Transport Interfaces (informative)  . . .   6
64	       1.1.3.  High-Level Picture Partitioning (informative) . . . .  11
65	       1.1.4.  NAL Unit Header . . . . . . . . . . . . . . . . . . .  13
66	     1.2.  Overview of the Payload Format  . . . . . . . . . . . . .  15
67	   2.  Conventions . . . . . . . . . . . . . . . . . . . . . . . . .  15
68	   3.  Definitions and Abbreviations . . . . . . . . . . . . . . . .  15
69	     3.1.  Definitions . . . . . . . . . . . . . . . . . . . . . . .  15
70	       3.1.1.  Definitions from the VVC Specification  . . . . . . .  16
71	       3.1.2.  Definitions Specific to This Memo . . . . . . . . . .  19
72	     3.2.  Abbreviations . . . . . . . . . . . . . . . . . . . . . .  19
73	   4.  RTP Payload Format  . . . . . . . . . . . . . . . . . . . . .  20
74	     4.1.  RTP Header Usage  . . . . . . . . . . . . . . . . . . . .  20
75	     4.2.  Payload Header Usage  . . . . . . . . . . . . . . . . . .  22
76	     4.3.  Payload Structures  . . . . . . . . . . . . . . . . . . .  22
77	       4.3.1.  Single NAL Unit Packets . . . . . . . . . . . . . . .  23
78	       4.3.2.  Aggregation Packets (APs) . . . . . . . . . . . . . .  23
79	       4.3.3.  Fragmentation Units . . . . . . . . . . . . . . . . .  27
80	     4.4.  Decoding Order Number . . . . . . . . . . . . . . . . . .  30
81	   5.  Packetization Rules . . . . . . . . . . . . . . . . . . . . .  32
82	   6.  De-packetization Process  . . . . . . . . . . . . . . . . . .  32
83	   7.  Payload Format Parameters . . . . . . . . . . . . . . . . . .  34
84	     7.1.  Media Type Registration . . . . . . . . . . . . . . . . .  35
85	     7.2.  SDP Parameters  . . . . . . . . . . . . . . . . . . . . .  48
86	       7.2.1.  Mapping of Payload Type Parameters to SDP . . . . . .  48
87	       7.2.2.  Usage with SDP Offer/Answer Model . . . . . . . . . .  49
88	       7.2.3.  Usage in Declarative Session Descriptions . . . . . .  59
89	       7.2.4.  Considerations for Parameter Sets . . . . . . . . . .  60
90	   8.  Use with Feedback Messages  . . . . . . . . . . . . . . . . .  60
91	     8.1.  Picture Loss Indication (PLI) . . . . . . . . . . . . . .  61
92	     8.2.  Full Intra Request (FIR)  . . . . . . . . . . . . . . . .  61
93	   9.  Security Considerations . . . . . . . . . . . . . . . . . . .  61
94	   10. Congestion Control  . . . . . . . . . . . . . . . . . . . . .  63
95	   11. IANA Considerations . . . . . . . . . . . . . . . . . . . . .  64
96	   12. Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  64
97	   13. References  . . . . . . . . . . . . . . . . . . . . . . . . .  64
98	     13.1.  Normative References . . . . . . . . . . . . . . . . . .  64
99	     13.2.  Informative References . . . . . . . . . . . . . . . . .  66
100	   Appendix A.  Change History . . . . . . . . . . . . . . . . . . .  67
101	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  68

103	1.  Introduction

105	   The Versatile Video Coding [VVC] specification, formally published as
106	   both ITU-T Recommendation H.266 and ISO/IEC International Standard
107	   23090-3, is currently in the ITU-T publication process and the ISO/
108	   IEC approval process.  VVC is reported to provide significant coding
109	   efficiency gains over HEVC [HEVC], also known as H.265, and other
110	   earlier video codecs.

112	   This memo specifies an RTP payload format for VVC.  It shares its
113	   basic design with the NAL (Network Abstraction Layer) unit-based RTP
114	   payload formats of H.264 Video Coding [RFC6184], Scalable Video
115	   Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798]
116	   and their respective predecessors.  With respect to design
117	   philosophy, security, congestion control, and overall implementation
118	   complexity, it has similar properties to those earlier payload format
119	   specifications.  This is a conscious choice, as at least RFC 6184 is
120	   widely deployed and generally known in the relevant implementer
121	   communities.  Certain scalability-related mechanisms known from
122	   [RFC6190] were incorporated into this document, as VVC version 1
123	   supports temporal, spatial, and signal-to-noise ratio (SNR)
124	   scalability.

126	1.1.  Overview of the VVC Codec

128	   VVC and HEVC share a similar hybrid video codec design.  In this
129	   memo, we provide a very brief overview of those features of VVC that
130	   are, in some form, addressed by the payload format specified herein.
131	   Implementers have to read, understand, and apply the ITU-T/ISO/IEC
132	   specifications pertaining to VVC to arrive at interoperable, well-
133	   performing implementations.

135	   Conceptually, both VVC and HEVC include a Video Coding Layer (VCL),
136	   which is often used to refer to the coding-tool features, and a NAL,
137	   which is often used to refer to the systems and transport interface
138	   aspects of the codecs.

140	1.1.1.  Coding-Tool Features (informative)

142	   Coding tool features are described below with occasional reference to
143	   the coding tool set of HEVC, which is well known in the community.

145	   Similar to earlier hybrid-video-coding-based standards, including
146	   HEVC, the following basic video coding design is employed by VVC.  A
147	   prediction signal is first formed by either intra- or motion-
148	   compensated prediction, and the residual (the difference between the
149	   original and the prediction) is then coded.  The gains in coding
150	   efficiency are achieved by redesigning and improving almost all parts
151	   of the codec over earlier designs.  In addition, VVC includes several
152	   tools to make the implementation on parallel architectures easier.

154	   Finally, VVC includes temporal, spatial, and SNR scalability as well
155	   as multiview coding support.

157	   Coding blocks and transform structure

159	   Among major coding-tool differences between HEVC and VVC, one of the
160	   important improvements is the more flexible coding tree structure in
161	   VVC, i.e., multi-type tree.  In addition to quadtree, binary and
162	   ternary trees are also supported, which contributes significantly
163	   to coding efficiency.  Moreover, the maximum size of the
164	   coding tree unit (CTU) is increased from 64x64 to 128x128.  To
165	   improve the coding efficiency of the chroma signal, separate luma and
166	   chroma trees at CTU level may be employed for intra-slices.  The square
167	   transforms in HEVC are extended to non-square transforms for
168	   rectangular blocks resulting from binary and ternary tree splits.
169	   Besides, VVC supports multiple transform sets (MTS), including DCT-2,
170	   DST-7, and DCT-8 as well as the non-separable secondary transform.
171	   The transforms used in VVC can have different sizes with support for
172	   larger transform sizes.  For DCT-2, the transform sizes range from
173	   2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range from
174	   4x4 to 32x32.  In addition, VVC also supports sub-block transform for
175	   both intra and inter coded blocks.  For intra coded blocks, intra
176	   sub-partitioning (ISP) may be used to allow sub-block based intra
177	   prediction and transform.  For inter blocks, sub-block transform may
178	   be used assuming that only a part of an inter-block has non-zero
179	   transform coefficients.
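
To illustrate the effect of the larger maximum CTU size mentioned above, the following minimal sketch counts the CTUs needed to cover a picture.  The function name `ctu_grid` and the 1920x1080 example picture are illustrative assumptions, not taken from [VVC]; the only fact used is that partial CTUs at the right and bottom picture edges still count as CTUs.

```python
from typing import Tuple


def ctu_grid(width: int, height: int, ctu_size: int) -> Tuple[int, int, int]:
    """Return (columns, rows, total) of CTUs covering a width x height picture.

    Partial CTUs at the right/bottom edges are counted, hence the
    ceiling division.
    """
    cols = -(-width // ctu_size)   # ceiling division without floats
    rows = -(-height // ctu_size)
    return cols, rows, cols * rows


# Hypothetical 1080p picture with the HEVC and VVC maximum CTU sizes.
print(ctu_grid(1920, 1080, 64))    # (30, 17, 510)
print(ctu_grid(1920, 1080, 128))   # (15, 9, 135)
```

Under these assumptions, the larger CTU reduces the number of top-level units per picture by roughly a factor of four.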

181	   Entropy coding

183	   Similar to HEVC, VVC uses a single entropy-coding engine, which is
184	   based on context adaptive binary arithmetic coding [CABAC], but with
185	   the support of multi-window sizes.  The window sizes can be
186	   initialized differently for different context models.  Due to such a
187	   design, it has more efficient adaptation speed and better coding
188	   efficiency.  A joint chroma residual coding scheme is applied to
189	   further exploit the correlation between the residuals of two color
190	   components.  In VVC, different residual coding schemes are applied
191	   for regular transform coefficients and residual samples generated
192	   using transform-skip mode.

194	   In-loop filtering

196	   VVC has more feature support in loop filters than HEVC.  The
197	   deblocking filter in VVC is similar to HEVC but operates at a smaller
198	   grid.  After deblocking and sample adaptive offset (SAO), an adaptive
199	   loop filter (ALF) may be used.  As a Wiener filter, ALF reduces
200	   distortion of decoded pictures.  Besides, VVC introduces a new module
201	   before deblocking called luma mapping with chroma scaling to fully
202	   utilize the dynamic range of signal so that rate-distortion
203	   performance of both SDR and HDR content is improved.

205	   Motion prediction and coding

207	   Compared to HEVC, VVC introduces several improvements in this area.
208	   First, there is the adaptive motion vector resolution (AMVR), which
209	   can save bit cost for motion vectors by adaptively signaling motion
210	   vector resolution.  Second, affine motion compensation is included
211	   to capture complicated motion like zooming and rotation.  Meanwhile,
212	   prediction refinement with optical flow for affine mode (PROF)
213	   is further deployed to mimic affine motion at the pixel level.
214	   Third, decoder-side motion vector refinement (DMVR) is a method
215	   to derive motion vectors at the decoder by block matching so that
216	   fewer bits may be spent on motion vectors.  Bi-directional optical
217	   flow (BDOF) is a similar method to PROF.  BDOF adds a sample wise
218	   offset at 4x4 sub-block level that is derived with equations based on
219	   gradients of the prediction samples and a motion difference relative
220	   to CU motion vectors.  Furthermore, merge with motion vector
221	   difference (MMVD) is a special mode, which further signals a limited
222	   set of motion vector differences on top of merge mode.  In addition
223	   to MMVD, there are three other types of special merge modes, i.e.,
224	   sub-block merge, triangle, and combined intra-/inter-prediction
225	   (CIIP).  Sub-block merge list includes one candidate of sub-block
226	   temporal motion vector prediction (SbTMVP) and up to four candidates
227	   of affine motion vectors.  Triangle is based on triangular block
228	   motion compensation.  CIIP combines intra- and inter- predictions
229	   with weighting.  Adaptive weighting may be employed with a block-
230	   level tool called bi-prediction with CU based weighting (BCW) which
231	   provides more flexibility than in HEVC.

233	   Intra prediction and intra-coding
234	   To capture the diversified local image texture directions with finer
235	   granularity, VVC supports 65 angular directions instead of 33
236	   directions in HEVC.  The intra mode coding is based on a 6-most-
237	   probable-mode scheme, and the 6 most probable modes are derived using
238	   the neighboring intra prediction directions.  In addition, to deal
239	   with the different distributions of intra prediction angles for
240	   different block aspect ratios, a wide-angle intra prediction (WAIP)
241	   scheme is applied in VVC by including intra prediction angles beyond
242	   those present in HEVC.  Unlike HEVC, which only allows using the
243	   nearest line of reference samples for intra prediction, VVC also
244	   allows using two further reference lines, known as multi-
245	   reference-line (MRL) intra prediction.  The additional reference
246	   lines can only be used for the 6 most probable intra prediction
247	   modes.  To capture the strong correlation between different colour
248	   components, in VVC, a cross-component linear mode (CCLM) is utilized
249	   which assumes a linear relationship between the luma sample values
250	   and their associated chroma samples.  For intra prediction, VVC also
251	   applies a position-dependent prediction combination (PDPC) for
252	   refining the prediction samples closer to the intra prediction block
253	   boundary.  Matrix-based intra prediction (MIP) modes are also used in
254	   VVC, which generate an intra prediction block of up to 8x8 using a
255	   weighted sum of downsampled neighboring reference samples; the
256	   weights are hardcoded constants.

258	   Other coding-tool features

260	   VVC introduces dependent quantization (DQ) to reduce quantization
261	   error by state-based switching between two quantizers.
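
The state-based switching between two quantizers can be sketched as follows.  This is an illustration under stated assumptions, not the normative process: the four-state transition table, the rule that states 0 and 1 select the first quantizer and states 2 and 3 the second, and the reconstruction formula are written as the design is commonly described and must be checked against [VVC].  The names `STATE_TRANSITION` and `dequantize_dq` are hypothetical.

```python
from typing import List

# ASSUMPTION: transition table indexed by [state][parity of |level|],
# as commonly described for VVC dependent quantization; verify in [VVC].
STATE_TRANSITION = [(0, 2), (2, 0), (1, 3), (3, 1)]


def dequantize_dq(levels: List[int], step: float) -> List[float]:
    """Reconstruct coefficients from quantization levels in coding order.

    Quantizer Q0 (states 0, 1) reconstructs at even multiples of `step`;
    quantizer Q1 (states 2, 3) reconstructs at odd multiples, plus zero.
    """
    state = 0
    out = []
    for k in levels:
        sign = (k > 0) - (k < 0)
        offset = 1 if state > 1 else 0          # Q1 shifts toward zero
        out.append((2 * abs(k) - offset) * sign * step)
        # The parity of the current level selects the next state.
        state = STATE_TRANSITION[state][abs(k) & 1]
    return out


print(dequantize_dq([1, 1, 0, 2], 1.0))   # [2.0, 1.0, 0.0, 3.0]
```

The same level value (here, 1) reconstructs to different values depending on the state reached through the preceding levels, which is what allows the two quantizers together to cover a denser set of reconstruction points.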

263	1.1.2.  Systems and Transport Interfaces (informative)

265	   VVC inherits the basic systems and transport interfaces designs from
266	   HEVC and H.264.  These include the NAL-unit-based syntax structure,
267	   the hierarchical syntax and data unit structure, the supplemental
268	   enhancement information (SEI) message mechanism, and the video
269	   buffering model based on the hypothetical reference decoder (HRD).
270	   The scalability features of VVC are conceptually similar to the
271	   scalable variant of HEVC known as SHVC.  The hierarchical syntax and
272	   data unit structure consists of parameter sets at various levels
273	   (decoder, sequence (pertaining to all layers), sequence (pertaining
274	   to a single layer), picture), picture-level header parameters,
275	   slice-level header parameters, and lower-level parameters.

277	   A number of key components that influenced the network abstraction
278	   layer design of VVC as well as this memo are described below.

280	   Decoding capability information
281	   The decoding capability information includes parameters that stay
282	   constant for the lifetime of a Video Bitstream, which in IETF terms
283	   can translate to the lifetime of a session.  Such information
284	   includes profile, level, and sub-profile information to determine a
285	   maximum capability interop point that is guaranteed to be never
286	   exceeded, even if splicing of video sequences occurs within a
287	   session.  It further includes constraint fields (most of which are
288	   flags), which can optionally be set to indicate that the video
289	   bitstream will be constrained in the use of certain features as
290	   indicated by the values of those fields.  With this, a bitstream can
291	   be labelled as not using certain tools, which allows among other
292	   things for resource allocation in a decoder implementation.

294	   Video parameter set

296	   The video parameter set (VPS) pertains to coded video sequences
297	   (CVSs) of multiple layers covering the same range of access units and
298	   includes, among other information, decoding dependency expressed as
299	   information for reference picture list construction of enhancement
300	   layers.  The VPS provides a "big picture" of a scalable sequence,
301	   including what types of operation points are provided, the profile,
302	   tier, and level of the operation points, and some other high-level
303	   properties of the bitstream that can be used as the basis for session
304	   negotiation and content selection, etc.  One VPS may be referenced by
305	   one or more sequence parameter sets.

307	   Sequence parameter set

309	   The sequence parameter set (SPS) contains syntax elements pertaining
310	   to a coded layer video sequence (CLVS), which is a group of pictures
311	   belonging to the same layer, starting with a random access point, and
312	   followed by pictures that may depend on each other, until the next
313	   random access point picture.  In MPEG-2, the equivalent of a CVS was
314	   a group of pictures (GOP), which normally started with an I frame and
315	   was followed by P and B frames.  While more complex in its options of
316	   random access points, VVC retains this basic concept.  One remarkable
317	   difference of VVC is that a CLVS may start with a Gradual Decoding
318	   Refresh (GDR) picture, without requiring presence of traditional
319	   random access points in the bitstream, such as instantaneous decoding
320	   refresh (IDR) or clean random access (CRA) pictures.  In many TV-like
321	   applications, a CVS contains a few hundred milliseconds to a few
322	   seconds of video.  In video conferencing (without switching MCUs
323	   involved), a CVS can be as long in duration as the whole session.

325	   Picture and adaptation parameter set

327	   The picture parameter set and the adaptation parameter set (PPS and
328	   APS, respectively) carry information pertaining to zero or more
329	   pictures and zero or more slices, respectively.  The PPS contains
330	   information that is likely to stay constant from picture to picture
331	   (at least for pictures of a certain type), whereas the APS contains
332	   information, such as adaptive loop filter coefficients, that is
333	   likely to change from picture to picture or even within a picture.  A
334	   single APS is referenced by all slices of the same picture if that
335	   APS contains information about luma mapping with chroma scaling
336	   (LMCS) or scaling list.  Different APSs containing ALF parameters can
337	   be referenced by slices of the same picture.

339	   Picture header

341	   A Picture Header contains information that is common to all slices
342	   that belong to the same picture.  Being able to send that information
343	   as a separate NAL unit when pictures are split into several slices
344	   allows for saving bitrate, compared to repeating the same information
345	   in all slices.  However, there might be scenarios where low-bitrate
346	   video is transmitted using a single slice per picture.  Having a
347	   separate NAL unit to convey that information incurs an overhead
348	   for such scenarios.  For such scenarios, the picture header syntax
349	   structure is directly included in the slice header, instead of in its
350	   own NAL unit.  The mode of the picture header syntax structure being
351	   included in its own NAL unit or not can only be switched on/off for
352	   an entire CLVS, and can only be switched off when in the entire CLVS
353	   each picture contains only one slice.

355	   Profile, tier, and level

357	   The profile, tier and level syntax structures in DCI, VPS and SPS
358	   contain profile, tier, level information for all layers that refer to
359	   the DCI, for layers associated with one or more output layer sets
360	   specified by the VPS, and for any layer that refers to the SPS,
361	   respectively.

363	   Sub-profiles

365	   Within the VVC specification, a sub-profile is a 32-bit number, coded
366	   according to ITU-T Rec. T.35, that does not carry any semantics.  It is
367	   carried in the profile_tier_level structure and hence (potentially)
368	   present in the DCI, VPS, and SPS.  External registration bodies can
369	   register a T.35 codepoint with ITU-T registration authorities and
370	   associate with their registration a description of bitstream
371	   restrictions beyond the profiles defined by ITU-T and ISO/IEC.  This
372	   would allow encoder manufacturers to label the bitstreams generated
373	   by their encoder as complying with such a sub-profile.  It is expected
374	   that upstream standardization organizations (such as DVB and ATSC),
375	   as well as walled-garden video services will take advantage of this
376	   labelling system.  In contrast to "normal" profiles, it is expected
377	   that sub-profiles may indicate encoder choices traditionally left
378	   open in the (decoder-centric) video coding specs, such as GOP
379	   structures, minimum/maximum QP values, and the mandatory use of
380	   certain tools or SEI messages.

382	   General constraint fields

384	   The profile_tier_level structure carries a considerable number of
385	   constraint fields (most of which are flags), which an encoder can use
386	   to indicate to a decoder that it will not use a certain tool or
387	   technology.  They were included in reaction to a perceived market
388	   need for labelling a bitstream as not exercising a certain tool that
389	   has become commercially unviable.

391	   Temporal scalability support

393	   VVC includes support of temporal scalability, by inclusion of the
394	   signaling of TemporalId in the NAL unit header, the restriction that
395	   pictures of a particular temporal sublayer cannot be used for inter
396	   prediction reference by pictures of a lower temporal sublayer, the
397	   sub-bitstream extraction process, and the requirement that each sub-
398	   bitstream extraction output be a conforming bitstream.  Media-Aware
399	   Network Elements (MANEs) can utilize the TemporalId in the NAL unit
400	   header for stream adaptation purposes based on temporal scalability.
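
As a minimal sketch of how a MANE might thin a stream by temporal sublayer, the code below reads TemporalId from each NAL unit header and drops NAL units above a chosen limit.  It assumes the two-byte VVC NAL unit header layout (forbidden zero bit, reserved bit, and 6-bit layer ID in the first byte; 5-bit NAL unit type and 3-bit TemporalId plus 1 in the second byte), which should be checked against the NAL unit header description in this memo and in [VVC].  The names `NalHeader`, `parse_nal_header`, and `drop_above_tid` are hypothetical, and a real MANE has to observe further rules of the sub-bitstream extraction process that are not modeled here.

```python
from typing import Iterable, List, NamedTuple


class NalHeader(NamedTuple):
    layer_id: int
    nal_unit_type: int
    temporal_id: int


def parse_nal_header(nal: bytes) -> NalHeader:
    """Parse the (assumed) two-byte VVC NAL unit header."""
    if len(nal) < 2:
        raise ValueError("NAL unit shorter than its 2-byte header")
    layer_id = nal[0] & 0x3F          # low 6 bits of byte 0
    nal_unit_type = nal[1] >> 3       # high 5 bits of byte 1
    tid_plus1 = nal[1] & 0x07         # low 3 bits of byte 1
    if tid_plus1 == 0:
        raise ValueError("TemporalId plus 1 must not be zero")
    return NalHeader(layer_id, nal_unit_type, tid_plus1 - 1)


def drop_above_tid(nals: Iterable[bytes], max_tid: int) -> List[bytes]:
    """Keep only NAL units whose TemporalId is at most max_tid."""
    return [n for n in nals if parse_nal_header(n).temporal_id <= max_tid]


# Two toy NAL units (header plus one payload byte) with TemporalId 0 and 2.
base = bytes([0x00, (1 << 3) | 1, 0xAA])
high = bytes([0x00, (1 << 3) | 3, 0xBB])
print(drop_above_tid([base, high], max_tid=0) == [base])   # True
```

Because pictures of a lower temporal sublayer never reference pictures of a higher one, dropping by TemporalId in this way does not remove any reference needed by the NAL units that remain.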

402	   Reference picture resampling (RPR)

404	   In AVC and HEVC, the spatial resolution of pictures cannot change
405	   unless a new sequence using a new SPS starts, with an IRAP picture.
406	   VVC enables picture resolution change within a sequence at a position
407	   without encoding an IRAP picture, which is always intra-coded.  This
408	   feature is sometimes referred to as reference picture resampling
409	   (RPR), as the feature needs resampling of a reference picture used
410	   for inter prediction when that reference picture has a different
411	   resolution than the current picture being decoded.  RPR allows
412	   resolution change without the need of coding an IRAP picture, which
413	   causes a momentary bit rate spike in streaming or video conferencing
414	   scenarios, e.g., to cope with network condition changes.  RPR can
415	   also be used in application scenarios wherein zooming of the entire
416	   video region or some region of interest is needed.

418	   Spatial, SNR, and multiview scalability

420	   VVC includes support for spatial, SNR, and multiview scalability.
421	   Scalable video coding is widely considered to have technical benefits
422	   and enrich services for various video applications.  Until recently,
423	   however, the functionality was not included in the first version
424	   of video codec specifications.  In VVC, all those
425	   forms of scalability are supported in the first version
426	   natively through the signaling of the layer_id in the NAL unit
427	   header, the VPS which associates layers with given layer_ids to each
428	   other, reference picture selection, reference picture resampling for
429	   spatial scalability, and a number of other mechanisms not relevant
430	   for this memo.

432	      Spatial scalability

434	         With the existence of Reference Picture Resampling (RPR), the
435	         additional burden for scalability support is just a
436	         modification of the high-level syntax (HLS).  The inter-layer
437	         prediction is employed in a scalable system to improve the
438	         coding efficiency of the enhancement layers.  In addition to
439	         the spatial and temporal motion-compensated predictions that
440	         are available in a single-layer codec, the inter-layer
441	         prediction in VVC uses the possibly resampled video data of the
442	         reconstructed reference picture from a reference layer to
443	         predict the current enhancement layer.  The resampling process
444	         for inter-layer prediction, when used, is performed at the
445	         block-level, reusing the existing interpolation process for
446	         motion compensation in single-layer coding.  It means that no
447	         additional resampling process is needed to support spatial
448	         scalability.

450	      SNR scalability

452	         SNR scalability is similar to spatial scalability except that
453	         the resampling factors are 1:1.  In other words, there is no
454	         change in resolution, but there is inter-layer prediction.

456	      Multiview scalability

458	         The first version of VVC also supports multiview scalability,
459	         wherein a multi-layer bitstream carries layers representing
460	         multiple views, and one or more of the represented views can be
461	         output at the same time.

463	   SEI messages

465	   Supplemental enhancement information (SEI) messages are information
466	   in the bitstream that do not influence the decoding process as
467	   specified in the VVC spec, but address issues of representation/
468	   rendering of the decoded bitstream, label the bitstream for certain
469	   applications, among other, similar tasks.  The overall concept of SEI
470	   messages and many of the messages themselves have been inherited from
471	   the H.264 and HEVC specs.  Except for the SEI messages that affect
472	   the specification of the hypothetical reference decoder (HRD), other
473	   SEI messages for use in the VVC environment, which are generally
474	   useful also in other video coding technologies, are not included in
475	   the main VVC specification but in a companion specification [VSEI].

477	1.1.3.  High-Level Picture Partitioning (informative)

479	   VVC inherited the concept of tiles and wavefront parallel processing
480	   (WPP) from HEVC, with some minor to moderate differences.  The basic
481	   concept of slices was kept in VVC but designed in an essentially
482	   different form.  VVC is the first video coding standard that includes
483	   subpictures as a feature, which provide the same functionality as
484	   HEVC motion-constrained tile sets (MCTSs) but are designed differently to
485	   have better coding efficiency and to be friendlier for usage in
486	   application systems.  More details of these differences are described
487	   below.

489	   Tiles and WPP

491	   As in HEVC, a picture can be split into tile rows and tile
492	   columns in VVC, and in-picture prediction across tile boundaries is
493	   disallowed, etc.  However, the syntax for signaling of tile
494	   partitioning has been simplified, by using a unified syntax design
495	   for both the uniform and the non-uniform mode.  In addition,
496	   signaling of entry point offsets for tiles in the slice header is
497	   optional in VVC while it is mandatory in HEVC.  The WPP design in VVC
498	   has two differences compared to HEVC: i) The CTU row delay is reduced
499	   from two CTUs to one CTU; ii) Signaling of entry point offsets for
500	   WPP in the slice header is optional in VVC while it is mandatory in
501	   HEVC.
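
The effect of the reduced CTU row delay can be illustrated with an idealized schedule.  The sketch below assumes one worker per CTU row, one time slot per CTU, and that a CTU may start once its left neighbour and the CTU `delay - 1` columns to its right in the row above have finished; these are simplifying assumptions for illustration, not properties stated in [VVC].  The function name `wpp_finish_slots` and the 9x15 CTU grid are hypothetical.

```python
def wpp_finish_slots(rows: int, cols: int, delay: int) -> int:
    """Return the number of time slots needed to decode all CTUs.

    start[r][c] is the earliest slot in which CTU (r, c) can begin.
    """
    start = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            ready = 0
            if c > 0:
                # Left neighbour in the same row must be finished.
                ready = start[r][c - 1] + 1
            if r > 0:
                # Dependency in the row above, clamped at the row end.
                dep = min(c + delay - 1, cols - 1)
                ready = max(ready, start[r - 1][dep] + 1)
            start[r][c] = ready
    return start[rows - 1][cols - 1] + 1


# Hypothetical picture of 9 CTU rows by 15 CTU columns.
print(wpp_finish_slots(9, 15, delay=2))   # 31 (two-CTU delay, as in HEVC)
print(wpp_finish_slots(9, 15, delay=1))   # 23 (one-CTU delay, as in VVC)
```

In this toy model, each additional row adds one slot of latency instead of two, so the benefit grows with the number of CTU rows.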

503	   Slices

505	   In VVC, the conventional slices based on CTUs (as in HEVC) or
506	   macroblocks (as in AVC) have been removed.  The main reasoning behind
507	   this architectural change is as follows.  The advances in video
508	   coding since 2003 (the publication year of AVC v1) have been such
509	   that slice-based error concealment has become practically impossible,
510	   due to the ever-increasing number and efficiency of in-picture and
511	   inter-picture prediction mechanisms.  An error-concealed picture is
512	   the decoding result of a transmitted coded picture for which there is
513	   some data loss (e.g., loss of some slices) of the coded picture or a
514	   reference picture for at least some part of the coded picture is not
515	   error-free (e.g., that reference picture was an error-concealed
516	   picture).  For example, when one of the multiple slices of a picture
517	   is lost, it may be error-concealed using an interpolation of the
518	   neighboring slices.  While advanced video coding prediction
519	   mechanisms provide significantly higher coding efficiency, they also
520	   make it harder for machines to estimate the quality of an error-
521	   concealed picture, which was already a hard problem with the use of
522	   simpler prediction mechanisms.  Advanced in-picture prediction
523	   mechanisms also cause the coding efficiency loss due to splitting a
524	   picture into multiple slices to be more significant.  Furthermore,
525	   network conditions have become significantly better, while at the
526	   same time techniques for dealing with packet losses have improved
527	   markedly.  As a result, very few implementations have recently used
528	   slices for maximum transmission unit size matching.  Instead,
529	   substantially all applications where low-delay error resilience is
530	   required (e.g., video telephony and video conferencing) rely on
531	   system/transport-level error resilience (e.g., retransmission,
532	   forward error correction) and/or picture-based error resilience tools
533	   (feedback-based error resilience, insertion of IRAPs, scalability
534	   with higher protection level of the base layer, and so on).
535	   Considering all the above, nowadays it is very rare that a picture
536	   that cannot be correctly decoded is passed to the decoder, and when
537	   such a rare case occurs, the system can afford to wait for an error-
538	   free picture to be decoded and available for display without
539	   resulting in frequent and long periods of picture freezing seen by
540	   end users.

542	   Slices in VVC have two modes: rectangular slices and raster-scan
543	   slices.  The rectangular slice, as indicated by its name, covers a
544	   rectangular region of the picture.  Typically, a rectangular slice
545	   consists of several complete tiles.  However, it is also possible
546	   that a rectangular slice is a subset of a tile and consists of one or
547	   more consecutive, complete CTU rows within a tile.  A raster-scan
548	   slice consists of one or more complete tiles in tile raster-scan
549	   order; hence, the region covered by a raster-scan slice need not be
550	   rectangular, although it may also happen to have the shape of a
551	   rectangle.  The concept of slices in VVC is therefore
552	   strongly linked to or based on tiles instead of CTUs (as in HEVC) or
553	   macroblocks (as in AVC).

555	   Subpictures

557	   VVC is the first video coding standard that includes the support of
558	   subpictures as a feature.  Each subpicture consists of one or more
559	   complete rectangular slices that collectively cover a rectangular
560	   region of the picture.  A subpicture may be either specified to be
561	   extractable (i.e., coded independently of other subpictures of the
562	   same picture and of earlier pictures in decoding order) or not
563	   extractable.  Regardless of whether a subpicture is extractable or
564	   not, the encoder can control whether in-loop filtering (including
565	   deblocking, SAO, and ALF) is applied across the subpicture boundaries
566	   individually for each subpicture.

568	   Functionally, subpictures are similar to the motion-constrained tile
569	   sets (MCTSs) in HEVC.  They both allow independent coding and
570	   extraction of a rectangular subset of a sequence of coded pictures,
571	   for use cases like viewport-dependent 360-degree video streaming
572	   optimization and region of interest (ROI) applications.

574	   There are several important design differences between subpictures
575	   and MCTSs.  First, the subpictures feature in VVC allows motion
576	   vectors of a coding block to point outside of the subpicture even
577	   when the subpicture is extractable, by applying sample padding at
578	   subpicture boundaries in this case, similarly to picture
579	   boundaries.  Second, additional changes were introduced for the
580	   selection and derivation of motion vectors in the merge mode and in
581	   the decoder side motion vector refinement process of VVC.  This
582	   allows higher coding efficiency compared to the non-normative motion
583	   constraints applied at the encoder-side for MCTSs.  Third, rewriting
584	   of SHs (and PH NAL units, when present) is not needed when extracting
585	   one or more extractable subpictures from a sequence of pictures to
586	   create a sub-bitstream that is a conforming bitstream.  In sub-
587	   bitstream extractions based on HEVC MCTSs, rewriting of SHs is
588	   needed.  Note that in both HEVC MCTSs extraction and VVC subpictures
589	   extraction, rewriting of SPSs and PPSs is needed.  However, typically
590	   there are only a few parameter sets in a bitstream, while each
591	   picture has at least one slice, therefore rewriting of SHs can be a
592	   significant burden for application systems.  Fourth, slices of
593	   different subpictures within a picture are allowed to have different
594	   NAL unit types.  Fifth, VVC specifies HRD and level definitions for
595	   subpicture sequences, thus the conformance of the sub-bitstream of
596	   each extractable subpicture sequence can be ensured by encoders.

598	1.1.4.  NAL Unit Header

600	   VVC maintains the NAL unit concept of HEVC with modifications.  VVC
601	   uses a two-byte NAL unit header, as shown in Figure 1.  The payload
602	   of a NAL unit refers to the NAL unit excluding the NAL unit header.

604	                     +---------------+---------------+
605	                     |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
606	                     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
607	                     |F|Z| LayerID   |  Type   | TID |
608	                     +---------------+---------------+

610	                   The Structure of the VVC NAL Unit Header.

612	                                 Figure 1

614	   The semantics of the fields in the NAL unit header are as specified
615	   in VVC and described briefly below for convenience.  In addition to
616	   the name and size of each field, the corresponding syntax element
617	   name in VVC is also provided.

619	   F: 1 bit

621	      forbidden_zero_bit.  Required to be zero in VVC.  Note that the
622	      inclusion of this bit in the NAL unit header was to enable
623	      transport of VVC video over MPEG-2 transport systems (avoidance of
624	      start code emulations) [MPEG2S].  In the context of this memo the
625	      value 1 may be used to indicate a syntax violation, e.g., for a
626	      NAL unit resulting from aggregating a number of fragmented units
627	      of a NAL unit but missing the last fragment, as described in the
628	      last sentence of Section 4.3.3.

630	   Z: 1 bit

632	      nuh_reserved_zero_bit.  Required to be zero in VVC, and reserved
633	      for future extensions by ITU-T and ISO/IEC.
634	      This memo does not overload the "Z" bit for local extensions, as
635	      a) overloading the "F" bit is sufficient and b) to preserve the
636	      usefulness of this memo to possible future versions of [VVC].

638	   LayerId: 6 bits

640	      nuh_layer_id.  Identifies the layer a NAL unit belongs to, wherein
641	      a layer may be, e.g., a spatial scalable layer, a quality scalable
642	      layer, a layer containing a different view, etc.

644	   Type: 5 bits

646	      nal_unit_type.  This field specifies the NAL unit type as defined
647	      in Table 5 of [VVC].  For a reference of all currently defined NAL
648	      unit types and their semantics, please refer to Section 7.4.2.2 in
649	      [VVC].

651	   TID: 3 bits

653	      nuh_temporal_id_plus1.  This field specifies the temporal
654	      identifier of the NAL unit plus 1.  The value of TemporalId is
655	      equal to TID minus 1.  A TID value of 0 is illegal to ensure that
656	      there is at least one bit in the NAL unit header equal to 1, so as
657	      to enable the consideration of start code emulations in the NAL
658	      unit payload data independent of the NAL unit header.
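
   Informative sketch (not part of the payload format): the bit layout
   in Figure 1 can be unpacked as shown below.  The names NalUnitHeader,
   parse_nal_unit_header, and temporal_id are illustrative assumptions
   and do not appear in [VVC] or in this memo.

```python
from typing import NamedTuple


class NalUnitHeader(NamedTuple):
    f: int         # forbidden_zero_bit (1 bit)
    z: int         # nuh_reserved_zero_bit (1 bit)
    layer_id: int  # nuh_layer_id (6 bits)
    nal_type: int  # nal_unit_type (5 bits)
    tid: int       # nuh_temporal_id_plus1 (3 bits)


def parse_nal_unit_header(data: bytes) -> NalUnitHeader:
    """Split the two-byte VVC NAL unit header into its five fields."""
    if len(data) < 2:
        raise ValueError("a VVC NAL unit header is two bytes long")
    b0, b1 = data[0], data[1]
    return NalUnitHeader(
        f=(b0 >> 7) & 0x01,
        z=(b0 >> 6) & 0x01,
        layer_id=b0 & 0x3F,
        nal_type=(b1 >> 3) & 0x1F,
        tid=b1 & 0x07,
    )


def temporal_id(header: NalUnitHeader) -> int:
    """TemporalId is TID minus 1; a TID of 0 is illegal."""
    if header.tid == 0:
        raise ValueError("TID value 0 is illegal")
    return header.tid - 1
```

   For example, the header bytes 0x05 0x0A decode to LayerId 5, Type 1,
   and TID 2, which corresponds to TemporalId 1.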

660	1.2.  Overview of the Payload Format

662	   This payload format defines the following processes required for
663	   transport of VVC coded data over RTP [RFC3550]:

665	   o  Usage of RTP header with this payload format

667	   o  Packetization of VVC coded NAL units into RTP packets using three
668	      types of payload structures: a single NAL unit packet, aggregation
669	      packet, and fragment unit

671	   o  Transmission of VVC NAL units of the same bitstream within a
672	      single RTP stream

674	   o  Media type parameters to be used with the Session Description
675	      Protocol (SDP) [RFC4566]

677	   o  Usage of RTCP feedback messages

679	2.  Conventions

681	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
682	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
683	   "OPTIONAL" in this document are to be interpreted as described in BCP
684	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
685	   capitals, as shown above.

687	3.  Definitions and Abbreviations

689	3.1.  Definitions

691	   This document uses the terms and definitions of VVC.  Section 3.1.1
692	   lists relevant definitions from [VVC] for convenience.  Section 3.1.2
693	   provides definitions specific to this memo.  All terms and
694	   definitions taken from [VVC] are verbatim copies from [VVC].

696	3.1.1.  Definitions from the VVC Specification

698	   Access unit (AU): A set of PUs that belong to different layers and
699	   contain coded pictures associated with the same time for output from
700	   the DPB.

702	   Adaptation parameter set (APS): A syntax structure containing syntax
703	   elements that apply to zero or more slices as determined by zero or
704	   more syntax elements found in slice headers.

706	   Bitstream: A sequence of bits, in the form of a NAL unit stream or a
707	   byte stream, that forms the representation of a sequence of AUs
708	   forming one or more coded video sequences (CVSs).

710	   Coded picture: A coded representation of a picture comprising VCL NAL
711	   units with a particular value of nuh_layer_id within an AU and
712	   containing all CTUs of the picture.

714	   Clean random access (CRA) PU: A PU in which the coded picture is a
715	   CRA picture.

717	   Clean random access (CRA) picture: An IRAP picture for which each VCL
718	   NAL unit has nal_unit_type equal to CRA_NUT.

720	   Coded video sequence (CVS): A sequence of AUs that consists, in
721	   decoding order, of a CVSS AU, followed by zero or more AUs that are
722	   not CVSS AUs, including all subsequent AUs up to but not including
723	   any subsequent AU that is a CVSS AU.

725	   Coded video sequence start (CVSS) AU: An AU in which there is a PU
726	   for each layer in the CVS and the coded picture in each PU is a CLVSS
727	   picture.

729	   Coded layer video sequence (CLVS): A sequence of PUs with the same
730	   value of nuh_layer_id that consists, in decoding order, of a CLVSS
731	   PU, followed by zero or more PUs that are not CLVSS PUs, including
732	   all subsequent PUs up to but not including any subsequent PU that is
733	   a CLVSS PU.

735	   Coded layer video sequence start (CLVSS) PU: A PU in which the coded
736	   picture is a CLVSS picture.

738	   Coded layer video sequence start (CLVSS) picture: A coded picture
739	   that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or
740	   a GDR picture with NoOutputBeforeRecoveryFlag equal to 1.

742	   Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs
743	   of chroma samples of a picture that has three sample arrays, or a CTB
744	   of samples of a monochrome picture or a picture that is coded using
745	   three separate colour planes and syntax structures used to code the
746	   samples.

748	   Decoding Capability Information (DCI): A syntax structure containing
749	   syntax elements that apply to the entire bitstream.

751	   Decoded picture buffer (DPB): A buffer holding decoded pictures for
752	   reference, output reordering, or output delay specified for the
753	   hypothetical reference decoder.

755	   Gradual decoding refresh (GDR) picture: A picture for which each VCL
756	   NAL unit has nal_unit_type equal to GDR_NUT.

758	   Instantaneous decoding refresh (IDR) PU: A PU in which the coded
759	   picture is an IDR picture.

761	   Instantaneous decoding refresh (IDR) picture: An IRAP picture for
762	   which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or
763	   IDR_N_LP.

765	   Intra random access point (IRAP) AU: An AU in which there is a PU for
766	   each layer in the CVS and the coded picture in each PU is an IRAP
767	   picture.

769	   Intra random access point (IRAP) PU: A PU in which the coded picture
770	   is an IRAP picture.

772	   Intra random access point (IRAP) picture: A coded picture for which
773	   all VCL NAL units have the same value of nal_unit_type in the range
774	   of IDR_W_RADL to CRA_NUT, inclusive.

776	   Layer: A set of VCL NAL units that all have a particular value of
777	   nuh_layer_id and the associated non-VCL NAL units.

779	   Network abstraction layer (NAL) unit: A syntax structure containing
780	   an indication of the type of data to follow and bytes containing that
781	   data in the form of an RBSP interspersed as necessary with emulation
782	   prevention bytes.

784	   Network abstraction layer (NAL) unit stream: A sequence of NAL units.

786	   Operation point (OP): A temporal subset of an OLS, identified by an
787	   OLS index and a highest value of TemporalId.

789	   Picture parameter set (PPS): A syntax structure containing syntax
790	   elements that apply to zero or more entire coded pictures as
791	   determined by a syntax element found in each slice header.

793	   Picture unit (PU): A set of NAL units that are associated with each
794	   other according to a specified classification rule, are consecutive
795	   in decoding order, and contain exactly one coded picture.

797	   Random access: The act of starting the decoding process for a
798	   bitstream at a point other than the beginning of the stream.

800	   Sequence parameter set (SPS): A syntax structure containing syntax
801	   elements that apply to zero or more entire CLVSs as determined by the
802	   content of a syntax element found in the PPS referred to by a syntax
803	   element found in each picture header.

805	   Slice: An integer number of complete tiles or an integer number of
806	   consecutive complete CTU rows within a tile of a picture that are
807	   exclusively contained in a single NAL unit.

809	   Slice header (SH): A part of a coded slice containing the data
810	   elements pertaining to all tiles or CTU rows within a tile
811	   represented in the slice.

813	   Sublayer: A temporal scalable layer of a temporal scalable bitstream
814	   consisting of VCL NAL units with a particular value of the TemporalId
815	   variable, and the associated non-VCL NAL units.

817	   Subpicture: A rectangular region of one or more slices within a
818	   picture.

820	   Sublayer representation: A subset of the bitstream consisting of NAL
821	   units of a particular sublayer and the lower sublayers.

823	   Tile: A rectangular region of CTUs within a particular tile column
824	   and a particular tile row in a picture.

826	   Tile column: A rectangular region of CTUs having a height equal to
827	   the height of the picture and a width specified by syntax elements in
828	   the picture parameter set.

830	   Tile row: A rectangular region of CTUs having a height specified by
831	   syntax elements in the picture parameter set and a width equal to the
832	   width of the picture.

834	   Video coding layer (VCL) NAL unit: A collective term for coded slice
835	   NAL units and the subset of NAL units that have reserved values of
836	   nal_unit_type that are classified as VCL NAL units in this
837	   Specification.

839	3.1.2.  Definitions Specific to This Memo

841	   Media-Aware Network Element (MANE): A network element, such as a
842	   middlebox, selective forwarding unit, or application-layer gateway
843	   that is capable of parsing certain aspects of the RTP payload headers
844	   or the RTP payload and reacting to their contents.

846	      Informative note: The concept of a MANE goes beyond normal routers
847	      or gateways in that a MANE has to be aware of the signaling (e.g.,
848	      to learn about the payload type mappings of the media streams),
849	      and in that it has to be trusted when working with Secure RTP
850	      (SRTP).  The advantage of using MANEs is that they allow packets
851	      to be dropped according to the needs of the media coding.  For
852	      example, if a MANE has to drop packets due to congestion on a
853	      certain link, it can identify and remove those packets whose
854	      elimination produces the least adverse effect on the user
855	      experience.  After dropping packets, MANEs must rewrite RTCP
856	      packets to match the changes to the RTP stream, as specified in
857	      Section 7 of [RFC3550].

859	   NAL unit decoding order: A NAL unit order that conforms to the
860	   constraints on NAL unit order given in Section 7.4.2.4 of [VVC],
861	   i.e., that follows the order of NAL units in the bitstream.

863	   RTP stream (See [RFC7656]): Within the scope of this memo, one RTP
864	   stream is utilized to transport a VVC bitstream, which may contain
865	   one or more layers, and each layer may contain one or more temporal
866	   sublayers.

868	   Transmission order: The order of packets in ascending RTP sequence
869	   number order (in modulo arithmetic).  Within an aggregation packet,
870	   the NAL unit transmission order is the same as the order of
871	   appearance of NAL units in the packet.
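
   Informative sketch: "ascending RTP sequence number order (in modulo
   arithmetic)" can be checked with a wrap-aware comparison such as the
   one below.  The helper name seq_precedes and the half-range window
   of 32768 are assumptions made for illustration; they are not
   specified by this memo.

```python
def seq_precedes(a: int, b: int) -> bool:
    """Return True if 16-bit sequence number a comes before b.

    The comparison is done modulo 65536, so 65535 precedes 0.
    Numbers that are half the range or more apart are treated as
    'b is behind a' (an assumption of this sketch).
    """
    diff = (b - a) & 0xFFFF
    return 0 < diff < 0x8000
```

   With this helper, 65535 precedes 0, while 0 does not precede 65535.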

873	3.2.  Abbreviations

875	   AU         Access Unit

877	   AP         Aggregation Packet

879	   APS        Adaptation Parameter Set

881	   CTU        Coding Tree Unit

883	   CVS        Coded Video Sequence

885	   DPB        Decoded Picture Buffer
886	   DCI        Decoding Capability Information

888	   DON        Decoding Order Number

890	   FIR        Full Intra Request

892	   FU         Fragmentation Unit

894	   GDR        Gradual Decoding Refresh

896	   HRD        Hypothetical Reference Decoder

898	   IDR        Instantaneous Decoding Refresh

900	   MANE       Media-Aware Network Element

902	   MTU        Maximum Transmission Unit

904	   NAL        Network Abstraction Layer

906	   NALU       Network Abstraction Layer Unit

908	   PLI        Picture Loss Indication

910	   PPS        Picture Parameter Set

912	   RPS        Reference Picture Set

914	   RPSI       Reference Picture Selection Indication

916	   SEI        Supplemental Enhancement Information

918	   SLI        Slice Loss Indication

920	   SPS        Sequence Parameter Set

922	   VCL        Video Coding Layer

924	   VPS        Video Parameter Set

926	4.  RTP Payload Format

928	4.1.  RTP Header Usage

930	   The format of the RTP header is specified in [RFC3550] (reprinted as
931	   Figure 2 for convenience).  This payload format uses the fields of
932	   the header in a manner consistent with that specification.

934	   The RTP payload (and the settings for some RTP header bits) for
935	   aggregation packets and fragmentation units are specified in
936	   Section 4.3.2 and Section 4.3.3, respectively.

938	       0                   1                   2                   3
939	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
940	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
941	      |V=2|P|X|  CC   |M|     PT      |       sequence number         |
942	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
943	      |                           timestamp                           |
944	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
945	      |           synchronization source (SSRC) identifier            |
946	      +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
947	      |            contributing source (CSRC) identifiers             |
948	      |                             ....                              |
949	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

951	                        RTP Header According to [RFC3550]

953	                                 Figure 2

955	   The RTP header information to be set according to this RTP payload
956	   format is set as follows:

958	   Marker bit (M): 1 bit

960	      Set for the last packet, in transmission order, among each set of
961	      packets that contain NAL units of one access unit.  This is in
962	      line with the normal use of the M bit in video formats to allow an
963	      efficient playout buffer handling.

965	   Payload Type (PT): 7 bits

967	      The assignment of an RTP payload type for this new packet format
968	      is outside the scope of this document and will not be specified
969	      here.  The assignment of a payload type has to be performed either
970	      through the profile used or in a dynamic way.

972	   Sequence Number (SN): 16 bits

974	      Set and used in accordance with [RFC3550].

976	   Timestamp: 32 bits

978	      The RTP timestamp is set to the sampling timestamp of the content.
979	      A 90 kHz clock rate MUST be used.  If the NAL unit has no timing
980	      properties of its own (e.g., parameter set and SEI NAL units), the
981	      RTP timestamp MUST be set to the RTP timestamp of the coded
982	      pictures of the access unit in which the NAL unit (according to
983	      Section 7.4.2.4 of [VVC]) is included.  Receivers MUST use the RTP
984	      timestamp for the display process, even when the bitstream
985	      contains picture timing SEI messages or decoding unit information
986	      SEI messages as specified in [VVC].

988	         Informative note: When picture timing SEI messages are present,
989	         the RTP sender is responsible for ensuring that the RTP
990	         timestamps are consistent with the timing information carried
991	         in the picture timing SEI messages.

993	   Synchronization source (SSRC): 32 bits

995	      Used to identify the source of the RTP packets.  A single SSRC is
996	      used for all parts of a single bitstream.
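
   Informative sketch: the fixed RTP header of Figure 2 and the 90 kHz
   timestamp clock can be handled as follows.  The names RtpHeader,
   parse_rtp_header, and rtp_timestamp are illustrative assumptions.
   The sketch reads only the 12 fixed octets and the CSRC count; it does
   not parse a header extension, and it assumes a constant frame rate
   when computing timestamps.

```python
import struct
from typing import NamedTuple

VVC_CLOCK_RATE = 90000  # a 90 kHz clock rate MUST be used


class RtpHeader(NamedTuple):
    version: int
    padding: int
    extension: int
    csrc_count: int
    marker: int
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    length: int  # fixed header plus CSRC list, in octets


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Parse the fixed RTP header fields shown in Figure 2."""
    if len(packet) < 12:
        raise ValueError("RTP fixed header is 12 octets long")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F
    return RtpHeader(
        version=b0 >> 6,
        padding=(b0 >> 5) & 0x01,
        extension=(b0 >> 4) & 0x01,
        csrc_count=cc,
        marker=b1 >> 7,
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        length=12 + 4 * cc,
    )


def rtp_timestamp(frame_index: int, fps: int, offset: int = 0) -> int:
    """90 kHz timestamp of a frame, wrapped to 32 bits."""
    return (offset + frame_index * VVC_CLOCK_RATE // fps) & 0xFFFFFFFF
```

   At 30 frames per second, consecutive pictures are 3000 timestamp
   ticks apart.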

998	4.2.  Payload Header Usage

1000	   The first two bytes of the payload of an RTP packet are referred to
1001	   as the payload header.  The payload header consists of the same
1002	   fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown
1003	   in Section 1.1.4, irrespective of the type of the payload structure.

1005	   The TID value indicates (among other things) the relative importance
1006	   of an RTP packet, for example, because NAL units belonging to higher
1007	   temporal sublayers are not used for the decoding of lower temporal
1008	   sublayers.  A lower value of TID indicates a higher importance.
1009	   More-important NAL units MAY be better protected against transmission
1010	   losses than less-important NAL units.
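
   Informative sketch: a sender or MANE could use the TID in the
   payload header for stateless thinning, as outlined below.  The
   function name keep_packet and the max_tid threshold are hypothetical
   and are not defined by this memo.

```python
def keep_packet(rtp_payload: bytes, max_tid: int) -> bool:
    """Return True if the payload's TID does not exceed max_tid.

    TID occupies the three least significant bits of the second
    payload header byte.  A lower TID means higher importance, so
    packets with a TID above the threshold are candidates for dropping.
    """
    if len(rtp_payload) < 2:
        raise ValueError("payload header is two bytes long")
    tid = rtp_payload[1] & 0x07
    return tid <= max_tid
```

   For an aggregation packet, the payload header carries the lowest TID
   of the aggregated NAL units, so this check keeps an AP whenever it
   contains at least one NAL unit at or below the threshold.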

1012	      For Discussion: quite possibly something similar can be said for
1013	      the Layer_id in layered coding, but perhaps not in multiview
1014	      coding.  (The relevant part of the spec is relatively new,
1015	      therefore the soft language).  However, for serious layer pruning,
1016	      interpretation of the VPS is required.  We can add language about
1017	      the need for stateful interpretation of LayerID vis-a-vis
1018	      stateless interpretation of TID later.

1020	4.3.  Payload Structures

1022	   Three different types of RTP packet payload structures are specified.
1023	   A receiver can identify the type of an RTP packet payload through the
1024	   Type field in the payload header.

1026	   The three different payload structures are as follows:

1028	   o  Single NAL unit packet: Contains a single NAL unit in the payload,
1029	      and the NAL unit header of the NAL unit also serves as the payload
1030	      header.  This payload structure is specified in Section 4.3.1.

1032	   o  Aggregation Packet (AP): Contains more than one NAL unit within
1033	      one access unit.  This payload structure is specified in
1034	      Section 4.3.2.

1036	   o  Fragmentation Unit (FU): Contains a subset of a single NAL unit.
1037	      This payload structure is specified in Section 4.3.3.
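
   Informative sketch: a receiver can tell the three payload structures
   apart from the Type field alone.  The value 28 for APs is stated in
   Section 4.3.2; the value 29 for FUs is an assumption here, since it
   is given in Section 4.3.3 rather than in the text above.  The
   function name payload_structure is illustrative.

```python
AP_TYPE = 28  # aggregation packet, Section 4.3.2
FU_TYPE = 29  # fragmentation unit; assumed value, see Section 4.3.3


def payload_structure(rtp_payload: bytes) -> str:
    """Classify an RTP payload by the Type field of its payload header."""
    if len(rtp_payload) < 2:
        raise ValueError("payload header is two bytes long")
    nal_type = (rtp_payload[1] >> 3) & 0x1F
    if nal_type == AP_TYPE:
        return "aggregation packet"
    if nal_type == FU_TYPE:
        return "fragmentation unit"
    # Every other Type value is treated as a single NAL unit packet in
    # this sketch; a real receiver may want to reject unused values.
    return "single NAL unit packet"
```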

1039	4.3.1.  Single NAL Unit Packets

1041	   A single NAL unit packet contains exactly one NAL unit, and consists
1042	   of a payload header (denoted as PayloadHdr), a conditional 16-bit
1043	   DONL field (in network byte order), and the NAL unit payload data
1044	   (the NAL unit excluding its NAL unit header) of the contained NAL
1045	   unit, as shown in Figure 3.

1047	      0                   1                   2                   3
1048	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1049	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1050	     |           PayloadHdr          |      DONL (conditional)       |
1051	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1052	     |                                                               |
1053	     |                  NAL unit payload data                        |
1054	     |                                                               |
1055	     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1056	     |                               :...OPTIONAL RTP padding        |
1057	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1059	                  The Structure of a Single NAL Unit Packet

1061	                                 Figure 3

1063	   The DONL field, when present, specifies the value of the 16 least
1064	   significant bits of the decoding order number of the contained NAL
1065	   unit.  If sprop-max-don-diff is greater than 0, the DONL field MUST
1066	   be present, and the variable DON for the contained NAL unit is
1067	   derived as equal to the value of the DONL field.  Otherwise (sprop-
1068	   max-don-diff is equal to 0), the DONL field MUST NOT be present.
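
   Informative sketch: the layout in Figure 3 can be unpacked as shown
   below, assuming that any RTP padding has already been removed and
   that sprop-max-don-diff is known from signaling.  The function name
   parse_single_nal_unit_packet is illustrative.

```python
from typing import Optional, Tuple


def parse_single_nal_unit_packet(
    rtp_payload: bytes, sprop_max_don_diff: int
) -> Tuple[Optional[int], bytes]:
    """Return (DON or None, reconstructed NAL unit).

    The payload header doubles as the NAL unit header, so the NAL
    unit is the payload header followed by the NAL unit payload data,
    with the DONL field (if present) removed.
    """
    if len(rtp_payload) < 2:
        raise ValueError("payload header is two bytes long")
    header = rtp_payload[:2]
    if sprop_max_don_diff > 0:
        if len(rtp_payload) < 4:
            raise ValueError("DONL field is missing")
        don = int.from_bytes(rtp_payload[2:4], "big")  # network byte order
        body = rtp_payload[4:]
    else:
        don = None  # DONL MUST NOT be present
        body = rtp_payload[2:]
    return don, header + body
```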

1070	4.3.2.  Aggregation Packets (APs)

1072	   Aggregation Packets (APs) can reduce packetization overhead for small
1073	   NAL units, such as most of the non-VCL NAL units, which are often
1074	   only a few octets in size.

1076	   An AP aggregates NAL units of one access unit.  Each NAL unit to be
1077	   carried in an AP is encapsulated in an aggregation unit.  NAL units
1078	   aggregated in one AP are included in NAL unit decoding order.

1080	   An AP consists of a payload header (denoted as PayloadHdr) followed
1081	   by two or more aggregation units, as shown in Figure 4.

1083	     0                   1                   2                   3
1084	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1085	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1086	    |    PayloadHdr (Type=28)       |                               |
1087	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
1088	    |                                                               |
1089	    |             two or more aggregation units                     |
1090	    |                                                               |
1091	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1092	    |                               :...OPTIONAL RTP padding        |
1093	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1095	                   The Structure of an Aggregation Packet

1097	                                 Figure 4

1099	   The fields in the payload header of an AP are set as follows.  The F
1100	   bit MUST be equal to 0 if the F bit of each aggregated NAL unit is
1101	   equal to zero; otherwise, it MUST be equal to 1.  The Type field MUST
1102	   be equal to 28.

1104	   The value of LayerId MUST be equal to the lowest value of LayerId of
1105	   all the aggregated NAL units.  The value of TID MUST be the lowest
1106	   value of TID of all the aggregated NAL units.

1108	      Informative note: All VCL NAL units in an AP have the same TID
1109	      value since they belong to the same access unit.  However, an AP
1110	      may contain non-VCL NAL units for which the TID value in the NAL
1111	      unit header may be different than the TID value of the VCL NAL
1112	      units in the same AP.
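
   Informative sketch: the rules above for the F bit, Type, LayerId, and
   TID of an AP payload header can be expressed as follows.  The
   function name build_ap_payload_header is illustrative, and the Z bit
   is set to 0 as an assumption consistent with Section 1.1.4.

```python
from typing import Sequence

AP_TYPE = 28


def build_ap_payload_header(nal_units: Sequence[bytes]) -> bytes:
    """Derive the two-byte AP payload header from the aggregated NAL units."""
    if len(nal_units) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    if any(len(n) < 2 for n in nal_units):
        raise ValueError("each NAL unit starts with a two-byte header")
    # F is 0 only if the F bit of every aggregated NAL unit is 0.
    f = 1 if any(n[0] & 0x80 for n in nal_units) else 0
    # LayerId and TID are the lowest values among the aggregated NAL units.
    layer_id = min(n[0] & 0x3F for n in nal_units)
    tid = min(n[1] & 0x07 for n in nal_units)
    byte0 = (f << 7) | layer_id  # Z bit left at 0
    byte1 = (AP_TYPE << 3) | tid
    return bytes([byte0, byte1])
```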

1114	   An AP MUST carry at least two aggregation units and can carry as many
1115	   aggregation units as necessary; however, the total amount of data in
1116	   an AP MUST fit into an IP packet, and the size SHOULD be chosen so
1117	   that the resulting IP packet is smaller than the MTU size, so as to
1118	   avoid IP layer fragmentation.  An AP MUST NOT contain FUs
1119	   specified in Section 4.3.3.  APs MUST NOT be nested; i.e., an AP
1120	   cannot contain another AP.

1122	   The first aggregation unit in an AP consists of a conditional 16-bit
1123	   DONL field (in network byte order) followed by a 16-bit unsigned size
1124	   information (in network byte order) that indicates the size of the
1125	   NAL unit in bytes (excluding these two octets, but including the NAL
1126	   unit header), followed by the NAL unit itself, including its NAL unit
1127	   header, as shown in Figure 5.

1129	     0                   1                   2                   3
1130	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1131	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1132	    |               :       DONL (conditional)      |   NALU size   |
1133	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1134	    |   NALU size   |                                               |
1135	    +-+-+-+-+-+-+-+-+         NAL unit                              |
1136	    |                                                               |
1137	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1138	    |                               :
1139	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1141	           The Structure of the First Aggregation Unit in an AP

1143	                                 Figure 5

1145	   The DONL field, when present, specifies the value of the 16 least
1146	   significant bits of the decoding order number of the aggregated NAL
1147	   unit.

1149	   If sprop-max-don-diff is greater than 0, the DONL field MUST be
1150	   present in an aggregation unit that is the first aggregation unit in
1151	   an AP, and the variable DON for the aggregated NAL unit is derived as
1152	   equal to the value of the DONL field.  The variable DON for the NAL
1153	   unit in an aggregation unit that is not the first aggregation unit
1154	   in an AP is derived as equal to the DON of the preceding
1155	   aggregated NAL unit in the same AP plus 1 modulo 65536.  Otherwise
1156	   (sprop-max-don-diff is equal to 0), the DONL field MUST NOT be
1157	   present in an aggregation unit that is the first aggregation unit in
1158	   an AP.

1160	   An aggregation unit that is not the first aggregation unit in an AP
1161	   consists of 16-bit unsigned size information
1162	   (in network byte order) that indicates the size of the NAL unit in
1163	   bytes (excluding these two octets, but including the NAL unit
1164	   header), followed by the NAL unit itself, including its NAL unit
1165	   header, as shown in Figure 6.

1167	     0                   1                   2                   3
1168	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1169	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1170	    |               :       NALU size               |   NAL unit    |
1171	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
1172	    |                                                               |
1173	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1174	    |                               :
1175	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1177	         The Structure of an Aggregation Unit That Is Not the First
1178	                          Aggregation Unit in an AP

1180	                                 Figure 6

1182	   Figure 7 presents an example of an AP that contains two aggregation
1183	   units, labeled as 1 and 2 in the figure, without the DONL field being
1184	   present.

1186	     0                   1                   2                   3
1187	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1188	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1189	    |                          RTP Header                           |
1190	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1191	    |   PayloadHdr (Type=28)        |         NALU 1 Size           |
1192	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1193	    |          NALU 1 HDR           |                               |
1194	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
1195	    |                   . . .                                       |
1196	    |                                                               |
1197	    +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1198	    |  . . .        | NALU 2 Size                   | NALU 2 HDR    |
1199	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1200	    | NALU 2 HDR    |                                               |
1201	    +-+-+-+-+-+-+-+-+              NALU 2 Data                      |
1202	    |                   . . .                                       |
1203	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1204	    |                               :...OPTIONAL RTP padding        |
1205	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1207	               An Example of an AP Packet Containing
1208	             Two Aggregation Units without the DONL Field

1210	                                 Figure 7

1212	   Figure 8 presents an example of an AP that contains two aggregation
1213	   units, labeled as 1 and 2 in the figure, with the DONL field being
1214	   present.

1216	     0                   1                   2                   3
1217	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1218	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1219	    |                          RTP Header                           |
1220	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1221	    |   PayloadHdr (Type=28)        |        NALU 1 DONL            |
1222	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1223	    |          NALU 1 Size          |            NALU 1 HDR         |
1224	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1225	    |                                                               |
1226	    |                 NALU 1 Data   . . .                           |
1227	    |                                                               |
1228	    +        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1229	    |                               :          NALU 2 Size          |
1230	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1231	    |          NALU 2 HDR           |                               |
1232	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
1233	    |                                                               |
1234	    |        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1235	    |                               :...OPTIONAL RTP padding        |
1236	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1238	                   An Example of an AP Containing
1239	                 Two Aggregation Units with the DONL Field

1241	                                 Figure 8
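
   The aggregation unit layout above can be illustrated with a minimal
   Python sketch.  It is not part of the payload format; the function
   name parse_ap and its arguments are hypothetical, and it assumes the
   caller has already removed the RTP header and any RTP padding, so
   that the input starts at the two-octet PayloadHdr.

```python
import struct


def parse_ap(payload: bytes, sprop_max_don_diff: int):
    """Split an AP payload into a list of (DON, NAL unit) pairs.

    DON is None when sprop-max-don-diff is 0 (no DONL field present).
    Hypothetical helper: assumes RTP header and padding are stripped.
    """
    if len(payload) < 2:
        raise ValueError("AP too short to hold PayloadHdr")
    pos = 2  # skip the two-octet payload header (Type=28)

    don = None
    if sprop_max_don_diff > 0:
        # DONL is present only in the first aggregation unit.
        if len(payload) < pos + 2:
            raise ValueError("truncated DONL field")
        (don,) = struct.unpack_from("!H", payload, pos)
        pos += 2

    units = []
    while pos < len(payload):
        if len(payload) < pos + 2:
            raise ValueError("truncated NALU size field")
        (size,) = struct.unpack_from("!H", payload, pos)  # network byte order
        pos += 2
        if size == 0 or pos + size > len(payload):
            raise ValueError("NALU size does not fit in the AP")
        units.append((don, payload[pos:pos + size]))
        pos += size
        if don is not None:
            # Later aggregation units: preceding DON plus 1, modulo 65536.
            don = (don + 1) % 65536
    return units
```

   With DONL equal to 65535 in the first aggregation unit, the second
   aggregated NAL unit is assigned DON 0, which shows the modulo 65536
   wraparound.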

1243	4.3.3.  Fragmentation Units

1245	   Fragmentation Units (FUs) are introduced to enable fragmenting a
1246	   single NAL unit into multiple RTP packets, possibly without
1247	   cooperation or knowledge of the [VVC] encoder.  A fragment of a NAL
1248	   unit consists of an integer number of consecutive octets of that NAL
1249	   unit.  Fragments of the same NAL unit MUST be sent in consecutive
1250	   order with ascending RTP sequence numbers (with no other RTP packets
1251	   within the same RTP stream being sent between the first and last
1252	   fragment).

1254	   When a NAL unit is fragmented and conveyed within FUs, it is referred
1255	   to as a fragmented NAL unit.  APs MUST NOT be fragmented.  FUs MUST
1256	   NOT be nested; i.e., an FU cannot contain a subset of another FU.

1258	   The RTP timestamp of an RTP packet carrying an FU is set to the NALU-
1259	   time of the fragmented NAL unit.

1261	   An FU consists of a payload header (denoted as PayloadHdr), an FU
1262	   header of one octet, a conditional 16-bit DONL field (in network byte
1263	   order), and an FU payload, as shown in Figure 9.

1265	     0                   1                   2                   3
1266	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1267	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1268	    |   PayloadHdr (Type=29)        |   FU header   | DONL (cond)   |
1269	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1270	    |   DONL (cond) |                                               |
1271	    +-+-+-+-+-+-+-+-+                                               |
1272	    |                         FU payload                            |
1273	    |                                                               |
1274	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1275	    |                               :...OPTIONAL RTP padding        |
1276	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1278	                          The Structure of an FU

1280	                                 Figure 9

1282	   The fields in the payload header are set as follows.  The Type field
1283	   MUST be equal to 29.  The fields F, LayerId, and TID MUST be equal to
1284	   the fields F, LayerId, and TID, respectively, of the fragmented NAL
1285	   unit.

1287	   The FU header consists of an S bit, an E bit, a P bit, and a 5-bit
1288	   FuType field, as shown in Figure 10.

1290	                           +---------------+
1291	                           |0|1|2|3|4|5|6|7|
1292	                           +-+-+-+-+-+-+-+-+
1293	                           |S|E|P|  FuType |
1294	                           +---------------+

1296	                       The Structure of FU Header

1298	                                 Figure 10

1300	   The semantics of the FU header fields are as follows:

1302	   S: 1 bit
1303	      When set to 1, the S bit indicates the start of a fragmented NAL
1304	      unit, i.e., the first byte of the FU payload is also the first
1305	      byte of the payload of the fragmented NAL unit.  When the FU
1306	      payload is not the start of the fragmented NAL unit payload, the S
1307	      bit MUST be set to 0.

1309	   E: 1 bit

1311	      When set to 1, the E bit indicates the end of a fragmented NAL
1312	      unit, i.e., the last byte of the payload is also the last byte of
1313	      the fragmented NAL unit.  When the FU payload is not the last
1314	      fragment of a fragmented NAL unit, the E bit MUST be set to 0.

1316	   P: 1 bit

1318	      When set to 1, the P bit indicates the last NAL unit of a coded
1319	      picture, i.e., the last byte of the FU payload is also the last
1320	      byte of the coded picture.  When the FU payload is not the last
1321	      fragment of a coded picture, the P bit MUST be set to 0.

1323	   FuType: 5 bits

1325	      The field FuType MUST be equal to the field Type of the fragmented
1326	      NAL unit.

1328	   The DONL field, when present, specifies the value of the 16 least
1329	   significant bits of the decoding order number of the fragmented NAL
1330	   unit.

1332	   If sprop-max-don-diff is greater than 0, and the S bit is equal to 1,
1333	   the DONL field MUST be present in the FU, and the variable DON for
1334	   the fragmented NAL unit is derived as equal to the value of the DONL
1335	   field.  Otherwise (sprop-max-don-diff is equal to 0, or the S bit is
1336	   equal to 0), the DONL field MUST NOT be present in the FU.

1338	   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
1339	   the S bit and E bit must not both be set to 1 in the same FU
1340	   header.
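
   As an illustration of the FU header semantics and the DONL presence
   rule above, the following Python sketch decodes the one-octet FU
   header.  The function names are hypothetical and the code is only an
   example, not a normative procedure.

```python
def parse_fu_header(octet: int):
    """Decode the one-octet FU header into (S, E, P, FuType)."""
    s = (octet >> 7) & 1      # start of the fragmented NAL unit
    e = (octet >> 6) & 1      # end of the fragmented NAL unit
    p = (octet >> 5) & 1      # last NAL unit of a coded picture
    fu_type = octet & 0x1F    # Type of the fragmented NAL unit
    if s and e:
        # A non-fragmented NAL unit must not be carried in one FU.
        raise ValueError("S and E bits are both set in the same FU header")
    return s, e, p, fu_type


def fu_has_donl(s_bit: int, sprop_max_don_diff: int) -> bool:
    """DONL is present only when sprop-max-don-diff > 0 and S is 1."""
    return sprop_max_don_diff > 0 and s_bit == 1
```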

1342	   The FU payload consists of fragments of the payload of the fragmented
1343	   NAL unit so that if the FU payloads of consecutive FUs, starting with
1344	   an FU with the S bit equal to 1 and ending with an FU with the E bit
1345	   equal to 1, are sequentially concatenated, the payload of the
1346	   fragmented NAL unit can be reconstructed.  The NAL unit header of the
1347	   fragmented NAL unit is not included as such in the FU payload, but
1348	   rather the information of the NAL unit header of the fragmented NAL
1349	   unit is conveyed in F, LayerId, and TID fields of the FU payload
1350	   headers of the FUs and the FuType field of the FU header of the FUs.
1351	   An FU payload MUST NOT be empty.
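
   A sender-side sketch of the fragmentation rules above is given below
   in Python.  The function fragment_nal_unit and its parameters are
   hypothetical.  The sketch assumes the two-octet VVC NAL unit header
   layout defined earlier in this memo (F, Z, 6-bit LayerId in the
   first octet; 5-bit Type and 3-bit TID in the second), and it
   simplifies MTU handling by applying max_fu_payload to the FU payload
   only, ignoring the two extra DONL octets in the first FU.

```python
import struct

FU_TYPE = 29  # Type value of the PayloadHdr of an FU


def fragment_nal_unit(nal, max_fu_payload, don=None,
                      last_nal_of_picture=False):
    """Split one NAL unit into FU packet payloads (hypothetical helper).

    `don` is the 16-bit DONL value, or None when sprop-max-don-diff is 0.
    """
    if max_fu_payload < 1:
        raise ValueError("max_fu_payload must be at least 1")
    body = nal[2:]  # NAL unit payload; the header is not copied as such
    chunks = [body[i:i + max_fu_payload]
              for i in range(0, len(body), max_fu_payload)]
    if len(chunks) < 2:
        # One FU would need S=1 and E=1; use a single NAL unit packet.
        raise ValueError("NAL unit does not need fragmentation")

    # PayloadHdr: F, Z, LayerId copied; Type replaced by 29; TID copied.
    payload_hdr = bytes([nal[0], (FU_TYPE << 3) | (nal[1] & 0x07)])
    fu_type = nal[1] >> 3  # Type of the fragmented NAL unit

    fus = []
    for i, chunk in enumerate(chunks):
        s = int(i == 0)
        e = int(i == len(chunks) - 1)
        p = int(bool(e) and last_nal_of_picture)
        fu_hdr = bytes([(s << 7) | (e << 6) | (p << 5) | fu_type])
        donl = struct.pack("!H", don) if (s and don is not None) else b""
        fus.append(payload_hdr + fu_hdr + donl + chunk)
    return fus
```

   Each returned item is the RTP payload of one packet; the packets are
   expected to be sent with consecutive, ascending RTP sequence
   numbers.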

1353	   If an FU is lost, the receiver SHOULD discard all following
1354	   fragmentation units in transmission order corresponding to the same
1355	   fragmented NAL unit, unless the decoder in the receiver is known to
1356	   be prepared to gracefully handle incomplete NAL units.

1358	   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
1359	   fragments of a NAL unit to an (incomplete) NAL unit, even if fragment
1360	   n of that NAL unit is not received.  In this case, the
1361	   forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a
1362	   syntax violation.

1364	4.4.  Decoding Order Number

1366	   For each NAL unit, the variable AbsDon is derived, representing the
1367	   decoding order number that is indicative of the NAL unit decoding
1368	   order.

1370	   Let NAL unit n be the n-th NAL unit in transmission order within an
1371	   RTP stream.

1373	   If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon
1374	   for NAL unit n, is derived as equal to n.

1376	   Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is
1377	   derived as follows, where DON[n] is the value of the variable DON for
1378	   NAL unit n:

1380	   o  If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in
1381	      transmission order), AbsDon[0] is set equal to DON[0].

1383	   o  Otherwise (n is greater than 0), the following applies for
1384	      derivation of AbsDon[n]:

1386	         If DON[n] == DON[n-1],
1387	            AbsDon[n] = AbsDon[n-1]

1389	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768),
1390	            AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1]

1392	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768),
1393	            AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n]

1395	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768),
1396	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 -
1397	            DON[n])

1399	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768),
1400	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n])
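
   The five cases above can be transcribed directly into code.  The
   following Python sketch is illustrative only; the function names are
   hypothetical, and the input is assumed to be the list of DON values
   in transmission order for the case where sprop-max-don-diff is
   greater than 0.

```python
def next_abs_don(prev_abs_don: int, prev_don: int, don: int) -> int:
    """Derive AbsDon[n] from AbsDon[n-1], DON[n-1], and DON[n]."""
    if don == prev_don:
        return prev_abs_don
    if don > prev_don and don - prev_don < 32768:
        return prev_abs_don + don - prev_don
    if don < prev_don and prev_don - don >= 32768:
        # DON wrapped around forward past 65535.
        return prev_abs_don + 65536 - prev_don + don
    if don > prev_don and don - prev_don >= 32768:
        # DON wrapped around backward past 0.
        return prev_abs_don - (prev_don + 65536 - don)
    # don < prev_don and prev_don - don < 32768
    return prev_abs_don - (prev_don - don)


def abs_dons(dons):
    """Return the AbsDon value for each NAL unit in transmission order."""
    result = []
    for n, don in enumerate(dons):
        if n == 0:
            result.append(don)  # AbsDon[0] = DON[0]
        else:
            result.append(next_abs_don(result[-1], dons[n - 1], don))
    return result
```

   For example, the DON sequence 65534, 65535, 0, 1 yields the AbsDon
   sequence 65534, 65535, 65536, 65537.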

1402	   For any two NAL units m and n, the following applies:

1404	   o  AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows
1405	      NAL unit m in NAL unit decoding order.

1407	   o  When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order
1408	      of the two NAL units can be in either order.

1410	   o  AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes
1411	      NAL unit m in decoding order.

1413	         Informative note: When two consecutive NAL units in the NAL
1414	         unit decoding order have different values of AbsDon, the
1415	         absolute difference between the two AbsDon values may be
1416	         greater than or equal to 1.

1418	         Informative note: There are multiple reasons to allow for the
1419	         absolute difference of the values of AbsDon for two consecutive
1420	         NAL units in the NAL unit decoding order to be greater than
1421	         one.  An increment by one is not required, as at the time of
1422	         associating values of AbsDon to NAL units, it may not be known
1423	         whether all NAL units are to be delivered to the receiver.  For
1424	         example, a gateway might not forward VCL NAL units of higher
1425	         sublayers or some SEI NAL units when there is congestion in the
1426	         network.  In another example, the first intra-coded picture of
1427	         a pre-encoded clip is transmitted in advance to ensure that it
1428	         is readily available in the receiver, and when transmitting the
1429	         first intra-coded picture, the originator does not exactly know
1430	         how many NAL units will be encoded before the first intra-coded
1431	         picture of the pre-encoded clip follows in decoding order.
1432	         Thus, the values of AbsDon for the NAL units of the first
1433	         intra-coded picture of the pre-encoded clip have to be
1434	         estimated when they are transmitted, and gaps in values of
1435	         AbsDon may occur.

1437	5.  Packetization Rules

1439	   The following packetization rules apply:

1441	   o  If sprop-max-don-diff is greater than 0, the transmission order of
1442	      NAL units carried in the RTP stream MAY be different than the NAL
1443	      unit decoding order.  Otherwise (sprop-max-don-diff is equal to
1444	      0), the transmission order of NAL units carried in the RTP stream
1445	      MUST be the same as the NAL unit decoding order.

1447	   o  A NAL unit of a small size SHOULD be encapsulated in an
1448	      aggregation packet together with one or more other NAL units in
1449	      order to avoid the unnecessary packetization overhead for small
1450	      NAL units.  For example, non-VCL NAL units such as access unit
1451	      delimiters, parameter sets, or SEI NAL units are typically small
1452	      and can often be aggregated with VCL NAL units without violating
1453	      MTU size constraints.

1455	   o  Each non-VCL NAL unit SHOULD, when possible from an MTU size match
1456	      viewpoint, be encapsulated in an aggregation packet together with
1457	      its associated VCL NAL unit, as typically a non-VCL NAL unit would
1458	      be meaningless without the associated VCL NAL unit being
1459	      available.

1461	   o  For carrying exactly one NAL unit in an RTP packet, a single NAL
1462	      unit packet MUST be used.

1464	6.  De-packetization Process

1466	   The general concept behind de-packetization is to get the NAL units
1467	   out of the RTP packets in an RTP stream and pass them to the decoder
1468	   in the NAL unit decoding order.

1470	   The de-packetization process is implementation dependent.  Therefore,
1471	   the following description should be seen as an example of a suitable
1472	   implementation.  Other schemes may be used as well, as long as the
1473	   output for the same input is the same as the process described below.
1474	   The output is the same when the set of output NAL units and their
1475	   order are both identical.  Optimizations relative to the described
1476	   algorithms are possible.

1478	   All normal RTP mechanisms related to buffer management apply.  In
1479	   particular, duplicated or outdated RTP packets (as indicated by the
1480	   RTP sequence number and the RTP timestamp) are removed.  To
1481	   determine the exact time for decoding, factors such as a possible
1482	   intentional delay to allow for proper inter-stream synchronization
1483	   MUST be factored in.

1485	   NAL units with NAL unit type values in the range of 0 to 27,
1486	   inclusive, may be passed to the decoder.  NAL-unit-like structures
1487	   with NAL unit type values in the range of 28 to 31, inclusive, MUST
1488	   NOT be passed to the decoder.

1490	   The receiver includes a receiver buffer, which is used to compensate
1491	   for transmission delay jitter within individual RTP streams, and to
1492	   reorder NAL units from transmission order to the NAL unit decoding
1493	   order.  In this section, the receiver operation is described under
1494	   the assumption that there is no transmission delay jitter within an
1495	   RTP stream.  To distinguish it from a practical receiver buffer
1496	   that is also used for compensation of transmission delay jitter, the
1497	   receiver buffer is hereafter called the de-packetization buffer in
1498	   this section.  Receivers should also prepare for transmission delay
1499	   jitter; that is, either reserve separate buffers for transmission
1500	   delay jitter buffering and de-packetization buffering or use a
1501	   receiver buffer for both transmission delay jitter and de-
1502	   packetization.  Moreover, receivers should take transmission delay
1503	   jitter into account in the buffering operation, e.g., by additional
1504	   initial buffering before starting of decoding and playback.

1506	   The de-packetization process extracts the NAL units from the RTP
1507	   packets in an RTP stream as follows.  When an RTP packet carries a
1508	   single NAL unit packet, the payload of the RTP packet is extracted as
1509	   a single NAL unit, excluding the DONL field, i.e., third and fourth
1510	   bytes, when sprop-max-don-diff is greater than 0.  When an RTP packet
1511	   carries an Aggregation Packet, several NAL units are extracted from
1512	   the payload of the RTP packet.  In this case, each NAL unit
1513	   corresponds to the part of the payload of each aggregation unit that
1514	   follows the NALU size field as described in Section 4.3.2.  When an
1515	   RTP packet carries a Fragmentation Unit (FU), all RTP packets from
1516	   the first FU (with the S field equal to 1) of the fragmented NAL unit
1517	   up to the last FU (with the E field equal to 1) of the fragmented NAL
1518	   unit are collected.  The NAL unit is extracted from these RTP packets
1519	   by concatenating all FU payloads in the same order as the
1520	   corresponding RTP packets and prepending the NAL unit header with the
1521	   fields F, LayerId, and TID set equal to the values of the fields
1522	   F, LayerId, and TID in the payload header of the FUs, respectively,
1523	   and with the NAL unit type set equal to the value of the field FuType
1524	   in the FU header of the FUs, as described in Section 4.3.3.
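
   The FU case of the extraction process above can be sketched in
   Python as follows.  The function reassemble_fus is hypothetical.  It
   assumes that the caller has collected the RTP payloads of all FUs of
   one fragmented NAL unit in RTP sequence number order, with RTP
   padding removed, and that the NAL unit header uses the two-octet
   layout defined earlier in this memo.

```python
def reassemble_fus(fu_payloads, sprop_max_don_diff):
    """Rebuild one NAL unit from its FUs; returns (DON or None, NAL unit)."""
    if len(fu_payloads) < 2:
        raise ValueError("a fragmented NAL unit spans at least two FUs")
    last = len(fu_payloads) - 1
    don = None
    parts = []
    for i, fu in enumerate(fu_payloads):
        if len(fu) < 3:
            raise ValueError("FU too short for PayloadHdr and FU header")
        s = (fu[2] >> 7) & 1
        e = (fu[2] >> 6) & 1
        if s != int(i == 0) or e != int(i == last):
            raise ValueError("unexpected S/E bits; a fragment may be lost")
        pos = 3  # PayloadHdr (2 octets) + FU header (1 octet)
        if s and sprop_max_don_diff > 0:
            don = int.from_bytes(fu[3:5], "big")  # DONL, network order
            pos = 5
        if len(fu) <= pos:
            raise ValueError("FU payload must not be empty")
        parts.append(fu[pos:])

    first = fu_payloads[0]
    # NAL unit header: F, Z, LayerId from PayloadHdr; Type from FuType;
    # TID from PayloadHdr.
    nal_hdr = bytes([first[0], ((first[2] & 0x1F) << 3) | (first[1] & 0x07)])
    return don, nal_hdr + b"".join(parts)
```

   A receiver that chooses to forward an incomplete NAL unit after a
   lost fragment would need a different code path than the strict
   checks shown here.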

1526	   When sprop-max-don-diff is equal to 0, the de-packetization buffer
1527	   size is zero bytes, and the NAL units carried in the single RTP
1528	   stream are directly passed to the decoder in their transmission
1529	   order, which is identical to their decoding order.

1531	   When sprop-max-don-diff is greater than 0, the process described in
1532	   the remainder of this section applies.

1534	   There are two buffering states in the receiver: initial buffering and
1535	   buffering while playing.  Initial buffering starts when the reception
1536	   is initialized.  After initial buffering, decoding and playback are
1537	   started, and the buffering-while-playing mode is used.

1539	   Regardless of the buffering state, the receiver stores incoming NAL
1540	   units in reception order into the de-packetization buffer.  NAL units
1541	   carried in RTP packets are stored in the de-packetization buffer
1542	   individually, and the value of AbsDon is calculated and stored for
1543	   each NAL unit.

1545	   Initial buffering lasts until the difference between the greatest and
1546	   smallest AbsDon values of the NAL units in the de-packetization
1547	   buffer is greater than or equal to the value of sprop-max-don-diff.

1549	   After initial buffering, whenever the difference between the greatest
1550	   and smallest AbsDon values of the NAL units in the de-packetization
1551	   buffer is greater than or equal to the value of sprop-max-don-diff,
1552	   the following operation is repeatedly applied until this difference
1553	   is smaller than sprop-max-don-diff:

1555	   o  The NAL unit in the de-packetization buffer with the smallest
1556	      value of AbsDon is removed from the de-packetization buffer and
1557	      passed to the decoder.

1559	   When no more NAL units are flowing into the de-packetization buffer,
1560	   all NAL units remaining in the de-packetization buffer are removed
1561	   from the buffer and passed to the decoder in the order of increasing
1562	   AbsDon values.
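
   One possible realization of the buffering process described above,
   for sprop-max-don-diff greater than 0, is sketched below in Python.
   The class and method names are hypothetical, and jitter handling is
   left out, as in the description.  Because the release condition is
   the same in both buffering states, the sketch uses a single loop;
   the first call to push() that returns NAL units marks the end of
   initial buffering.

```python
import heapq


class DepacketizationBuffer:
    """Reorders NAL units into decoding order by AbsDon (sketch only)."""

    def __init__(self, sprop_max_don_diff: int):
        if sprop_max_don_diff <= 0:
            raise ValueError("buffer is only used when the value is > 0")
        self._max_diff = sprop_max_don_diff
        self._heap = []        # entries: (AbsDon, arrival index, NAL unit)
        self._arrivals = 0
        self._greatest = None  # greatest AbsDon currently in the buffer

    def push(self, abs_don: int, nal: bytes):
        """Store one NAL unit; return NAL units released to the decoder."""
        heapq.heappush(self._heap, (abs_don, self._arrivals, nal))
        self._arrivals += 1
        if self._greatest is None or abs_don > self._greatest:
            self._greatest = abs_don
        released = []
        # Remove the smallest AbsDon while (greatest - smallest) is at
        # least sprop-max-don-diff.  The greatest entry is never removed
        # here, so the cached maximum stays valid.
        while self._greatest - self._heap[0][0] >= self._max_diff:
            released.append(heapq.heappop(self._heap)[2])
        return released

    def flush(self):
        """No more input: release everything in increasing AbsDon order."""
        out = [heapq.heappop(self._heap)[2] for _ in range(len(self._heap))]
        self._greatest = None
        return out
```

   The arrival index in each heap entry only breaks ties between equal
   AbsDon values, for which either decoding order is allowed.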

1564	7.  Payload Format Parameters

1566	   This section specifies the optional parameters.  A mapping of the
1567	   parameters with Session Description Protocol (SDP) [RFC4566] is also
1568	   provided for applications that use SDP.

1570	7.1.  Media Type Registration

1572	   The receiver MUST ignore any parameter unspecified in this memo.

1574	   Type name:            video

1576	   Subtype name:         H266

1578	   Required parameters:  none

1580	   Optional parameters:

1582	      profile-id, tier-flag, sub-profile-id, interop-constraints, and
1583	      level-id:

1585	         These parameters indicate the profile, tier, default level,
1586	         sub-profile, and some constraints of the bitstream carried by
1587	         the RTP stream, or a specific set of the profile, tier, default
1588	         level, sub-profile and some constraints the receiver supports.

1590	         The subset of coding tools that may have been used to generate
1591	         the bitstream or that the receiver supports, as well as some
1592	         additional constraints are indicated collectively by profile-
1593	         id, sub-profile-id, and interop-constraints.

1595	            Informative note: There are 128 values of profile-id.  The
1596	            subset of coding tools identified by the profile-id can be
1597	            further constrained with up to 255 instances of sub-profile-
1598	            id.  In addition, 68 bits included in interop-constraints,
1599	            which can be extended up to 324 bits, provide means to
1600	            further restrict tools from existing profiles.  To be able
1601	            to support this fine-granular signalling of coding tool
1602	            subsets with profile-id, sub-profile-id and interop-
1603	            constraints, it would be safe to require symmetric use of
1604	            these parameters in SDP offer/answer unless recv-ols-id is
1605	            included in the SDP answer for choosing one of the layers
1606	            offered.

1608	         The tier is indicated by tier-flag.  The default level is
1609	         indicated by level-id.  The tier and the default level specify
1610	         the limits on values of syntax elements or arithmetic
1611	         combinations of values of syntax elements that are followed
1612	         when generating the bitstream or that the receiver supports.

1614	         In SDP offer/answer, when the SDP answer does not include the
1615	         recv-ols-id parameter that is less than the sprop-ols-id
1616	         parameter in the SDP offer, the following applies:

1618	         +  The tier-flag, profile-id, sub-profile-id, and interop-
1619	            constraints parameters MUST be used symmetrically, i.e., the
1620	            value of each of these parameters in the offer MUST be the
1621	            same as that in the answer, either explicitly signaled or
1622	            implicitly inferred.

1624	         +  The level-id parameter is changeable as long as the highest
1625	            level indicated by the answer is either equal to or lower
1626	            than that in the offer.  Note that a highest level higher
1627	            than level-id in the offer for receiving can be included as
1628	            max-recv-level-id.

1630	         In SDP offer/answer, when the SDP answer does include the recv-
1631	         ols-id parameter that is less than the sprop-ols-id parameter
1632	         in the SDP offer, the set of tier-flag, profile-id, sub-
1633	         profile-id, interop-constraints, and level-id parameters
1634	         included in the answer MUST be consistent with that for the
1635	         chosen output layer set as indicated in the SDP offer, with the
1636	         exception that the level-id parameter in the SDP answer is
1637	         changeable as long as the highest level indicated by the answer
1638	         is either lower than or equal to that in the offer.

1640	         More specifications of these parameters, including how they
1641	         relate to syntax elements specified in [VVC], are provided
1642	         below.

1644	      profile-id:

1646	         When profile-id is not present, a value of 1 (i.e., the Main 10
1647	         profile) MUST be inferred.

1649	         When used to indicate properties of a bitstream, profile-id is
1650	         derived from the general_profile_idc syntax element that
1651	         applies to the bitstream in an instance of the
1652	         profile_tier_level( ) syntax structure.

1654	         A profile_tier_level( ) syntax structure may be contained in an
1655	         SPS, VPS, or DCI NAL unit as specified in [VVC].  One of the
1656	         following three cases applies to the container NAL unit of the
1657	         profile_tier_level( ) syntax structure containing those PTL
1658	         syntax elements used to derive the values of profile-id, tier-
1659	         flag, level-id, sub-profile-id, or interop-constraints: 1) The
1660	         container NAL unit is an SPS, the bitstream is a single-layer
1661	         bitstream, and the profile_tier_level( ) syntax structures in
1662	         all SPSs referenced by the CVSs in the bitstream have the same
1663	         values respectively for those PTL syntax elements; 2) The
1664	         container NAL unit is a VPS, the profile_tier_level( ) syntax
1665	         structure is the one in the VPS that applies to the OLS
1666	         corresponding to the bitstream, and the profile_tier_level( )
1667	         syntax structures applicable to the OLS corresponding to the
1668	         bitstream in all VPSs referenced by the CVSs in the bitstream
1669	         have the same values respectively for those PTL syntax
1670	         elements; 3) The container NAL unit is a DCI NAL unit and the
1671	         profile_tier_level( ) syntax structures in all DCI NAL units in
1672	         the bitstream have the same values respectively for those PTL
1673	         syntax elements.

1675	      tier-flag, level-id:

1677	         The value of tier-flag MUST be in the range of 0 to 1,
1678	         inclusive.  The value of level-id MUST be in the range of 0 to
1679	         255, inclusive.

1681	         If the tier-flag and level-id parameters are used to indicate
1682	         properties of a bitstream, they indicate the tier and the
1683	         highest level the bitstream complies with.

1685	         If the tier-flag and level-id parameters are used for
1686	         capability exchange, the following applies.  If max-recv-level-
1687	         id is not present, the default level defined by level-id
1688	         indicates the highest level the codec wishes to support.
1689	         Otherwise, max-recv-level-id indicates the highest level the
1690	         codec supports for receiving.  For either receiving or sending,
1691	         all levels that are lower than the highest level supported MUST
1692	         also be supported.

1694	         If no tier-flag is present, a value of 0 MUST be inferred; if
1695	         no level-id is present, a value of 51 (i.e., level 3.1) MUST be
1696	         inferred.

1698	            Informative note: The level values currently defined in the
1699	            VVC specification are in the form of "majorNum.minorNum",
1700	            and the value of the level-id for each of the levels is
1701	            equal to majorNum * 16 + minorNum * 3.  It is expected that
1702	            if any levels are defined in the future, the same convention
1703	            will be used, but this cannot be guaranteed.
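
            The convention in the note above can be written as a short
            Python sketch.  The function names are hypothetical, and
            the inverse mapping holds only for level-id values that
            follow this convention.

```python
def level_to_level_id(major: int, minor: int) -> int:
    """Map a "majorNum.minorNum" level to its level-id value."""
    return major * 16 + minor * 3


def level_id_to_level(level_id: int):
    """Inverse mapping; rejects values outside the convention."""
    major, rest = divmod(level_id, 16)
    if rest % 3 != 0:
        raise ValueError("level-id does not follow the convention")
    return major, rest // 3
```

            For instance, level 3.1 maps to 3 * 16 + 1 * 3 = 51, which
            is the value inferred when level-id is absent.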

1705	         When used to indicate properties of a bitstream, the tier-flag
1706	         and level-id parameters are derived respectively from the
1707	         syntax element general_tier_flag, and the syntax element
1708	         general_level_idc or sub_layer_level_idc[j], that apply to the
1709	         bitstream, in an instance of the profile_tier_level( ) syntax
1710	         structure.

1712	         If the tier-flag and level-id are derived from the
1713	         profile_tier_level( ) syntax structure in a DCI NAL unit, the
1714	         following applies:

1716	         +  tier-flag = general_tier_flag

1718	         +  level-id = general_level_idc

1720	         Otherwise, if the tier-flag and level-id are derived from the
1721	         profile_tier_level( ) syntax structure in an SPS or VPS NAL
1722	         unit, and the bitstream contains the highest sub-layer
1723	         representation in the OLS corresponding to the bitstream, the
1724	         following applies:

1726	         +  tier-flag = general_tier_flag

1728	         +  level-id = general_level_idc

1730	         Otherwise, if the tier-flag and level-id are derived from the
1731	         profile_tier_level( ) syntax structure in an SPS or VPS NAL
1732	         unit, and the bitstream does not contain the highest sub-layer
1733	         representation in the OLS corresponding to the bitstream, the
1734	         following applies, with j being the value of the sprop-sub-
1735	         layer-id parameter:

1737	         +  tier-flag = general_tier_flag

1739	         +  level-id = sub_layer_level_idc[j]

1741	      sub-profile-id:

1743	         The value of the parameter is a comma-separated (',') list of
1744	         data using base64 [RFC4648] representation.

1746	         When used to indicate properties of a bitstream, sub-profile-id
1747	         is derived from each of the ptl_num_sub_profiles
1748	         general_sub_profile_idc[i] syntax elements that apply to the
1749	         bitstream in a profile_tier_level( ) syntax structure.

1751	      interop-constraints:

1753	         A base64 [RFC4648] representation of the data that includes the
1754	         syntax elements ptl_frame_only_constraint_flag and
1755	         ptl_multilayer_enabled_flag and the general_constraints_info( )
1756	         syntax structure that apply to the bitstream in an instance of
1757	         the profile_tier_level( ) syntax structure.

1759	         If the interop-constraints parameter is not present, the
1760	         following MUST be inferred:

1762	         +  ptl_frame_only_constraint_flag = 1

1764	         +  ptl_multilayer_enabled_flag = 0

1766	         +  gci_present_flag in the general_constraints_info( ) syntax
1767	            structure = 0

1769	         Using interop-constraints for capability exchange results in a
1770	         requirement on any bitstream to be compliant with the interop-
1771	         constraints.
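
The inference rule for an absent interop-constraints parameter could
be applied as in the sketch below.  The function name and the
dictionary representation of the "a=fmtp" parameters are assumptions
made for illustration.

```python
def inferred_interop_constraints(fmtp_params):
    """Return the inferred flags when interop-constraints is absent.

    Returns None when the parameter is present; the caller then has
    to decode the base64 payload instead of using these defaults.
    """
    if "interop-constraints" in fmtp_params:
        return None
    return {
        "ptl_frame_only_constraint_flag": 1,
        "ptl_multilayer_enabled_flag": 0,
        "gci_present_flag": 0,
    }
```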

1773	      sprop-sub-layer-id:

1775	         This parameter MAY be used to indicate the highest allowed
1776	         value of TID in the highest layer present in the bitstream.
1777	         When not present, the value of sprop-sub-layer-id is inferred
1778	         to be equal to 6.

1780	         The value of sprop-sub-layer-id MUST be in the range of 0 to 6,
1781	         inclusive.

1783	      sprop-ols-id:

1785	         This parameter MAY be used to indicate the OLS that the
1786	         bitstream applies to.  When not present, the value of sprop-
1787	         ols-id is inferred to be equal to TargetOlsIdx as specified in
1788	         8.1.1 in [VVC].  If this optional parameter is present, sprop-
1789	         vps MUST also be present or its content MUST be known a priori
1790	         at the receiver.

1792	         The value of sprop-ols-id MUST be in the range of 0 to 257,
1793	         inclusive.

1795	            Informative note: VVC allows having up to 258 output layer
1796	            sets indicated in the VPS as the number of output layer sets
1797	            minus 2 is indicated with a field of 8 bits.

1799	      recv-sub-layer-id:

1801	         This parameter MAY be used to signal a receiver's choice of the
1802	         offered or declared sub-layer representations in the sprop-vps
1803	         and sprop-sps.  The value of recv-sub-layer-id indicates the
1804	         TID of the highest sub-layer in the highest layer of the
1805	         bitstream that a receiver supports.  When not present, the
1806	         value of recv-sub-layer-id is inferred to be equal to the value
1807	         of the sprop-sub-layer-id parameter in the SDP offer.

1809	         The value of recv-sub-layer-id MUST be in the range of 0 to 6,
1810	         inclusive.

1812	      recv-ols-id:

1814	         This parameter MAY be used to signal a receiver's choice of the
1815	         offered or declared output layer sets in the sprop-vps.  The
1816	         value of recv-ols-id indicates the OLS index of the bitstream
1817	         that a receiver supports.  When not present, the value of recv-
1818	         ols-id is inferred to be equal to the value of the sprop-ols-id
1819	         parameter in the SDP offer.  When present, the value of recv-
1820	         ols-id must be included only when sprop-ols-id was received and
1821	         must refer to an output layer set in the VPS that is in the
1822	         same dependency tree as the OLS referred to by sprop-ols-id.
1823	         If this optional parameter is present, sprop-vps must have been
1824	         received or its content must be known a priori at the receiver.

1826	         The value of recv-ols-id MUST be in the range of 0 to 257,
1827	         inclusive.
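
The value ranges and inference rules of the four parameters above
can be checked as sketched below.  The helper names are hypothetical,
and parameters are assumed to arrive as strings in a dictionary.  The
inference of sprop-ols-id from TargetOlsIdx is not modeled, since it
requires decoder state.

```python
RANGES = {
    "sprop-sub-layer-id": (0, 6),
    "recv-sub-layer-id": (0, 6),
    "sprop-ols-id": (0, 257),
    "recv-ols-id": (0, 257),
}


def checked_int(name, text):
    """Parse a parameter value and enforce its inclusive range."""
    low, high = RANGES[name]
    value = int(text)
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside {low}..{high}")
    return value


def effective_sub_layer_ids(offer, answer):
    """Return (sprop-sub-layer-id, recv-sub-layer-id) after inference."""
    # sprop-sub-layer-id is inferred to be 6 when absent.
    sprop = checked_int("sprop-sub-layer-id",
                        offer.get("sprop-sub-layer-id", "6"))
    if "recv-sub-layer-id" in answer:
        recv = checked_int("recv-sub-layer-id", answer["recv-sub-layer-id"])
    else:
        # Inferred from sprop-sub-layer-id in the SDP offer.
        recv = sprop
    return sprop, recv
```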

1829	      max-recv-level-id:

1831	         This parameter MAY be used to indicate the highest level a
1832	         receiver supports.

1834	         The value of max-recv-level-id MUST be in the range of 0 to
1835	         255, inclusive.

1837	         When max-recv-level-id is not present, the value is inferred to
1838	         be equal to level-id.

1840	         max-recv-level-id MUST NOT be present when the highest level
1841	         the receiver supports is not higher than the default level.

1843	      sprop-dci:

1845	         This parameter MAY be used to convey a decoding capability
1846	         information NAL unit of the bitstream for out-of-band
1847	         transmission.  The parameter MAY also be used for capability
1848	         exchange.  The value of the parameter is a base64 [RFC4648]
1849	         representation of the decoding capability information NAL unit
1850	         as specified in Section 7.3.2.1 of [VVC].

1852	      sprop-vps:

1854	         This parameter MAY be used to convey any video parameter set
1855	         NAL unit of the bitstream for out-of-band transmission of video
1856	         parameter sets.  The parameter MAY also be used for capability
1857	         exchange and to indicate sub-stream characteristics (i.e.,
1858	         properties of output layer sets and sublayer representations as
1859	         defined in [VVC]).  The value of the parameter is a comma-
1860	         separated (',') list of base64 [RFC4648] representations of the
1861	         video parameter set NAL units as specified in Section 7.3.2.3
1862	         of [VVC].

1864	         The sprop-vps parameter MAY contain one or more than one video
1865	         parameter set NAL unit.  However, all other video parameter
1866	         sets contained in the sprop-vps parameter MUST be consistent
1867	         with the first video parameter set in the sprop-vps parameter.
1868	         A video parameter set vpsB is said to be consistent with
1869	         another video parameter set vpsA if any decoder that conforms
1870	         to the profile, tier, level, and constraints indicated by the
1871	         data starting from the syntax element general_profile_space to
1872	         the syntax element general_level_idc, inclusive, in the first
1873	         profile_tier_level( ) syntax structure in vpsA can decode any
1874	         bitstream that conforms to the profile, tier, level, and
1875	         constraints indicated by the data starting from the syntax
1876	         element general_profile_space to the syntax element
1877	         general_level_idc, inclusive, in the first profile_tier_level(
1878	         ) syntax structure in vpsB.

1880	      sprop-sps:

1882	         This parameter MAY be used to convey sequence parameter set NAL
1883	         units of the bitstream for out-of-band transmission of sequence
1884	         parameter sets.  The value of the parameter is a comma-
1885	         separated (',') list of base64 [RFC4648] representations of the
1886	         sequence parameter set NAL units as specified in
1887	         Section 7.3.2.4 of [VVC].

1889	      sprop-pps:

1891	         This parameter MAY be used to convey picture parameter set NAL
1892	         units of the bitstream for out-of-band transmission of picture
1893	         parameter sets.  The value of the parameter is a comma-
1894	         separated (',') list of base64 [RFC4648] representations of the
1895	         picture parameter set NAL units as specified in Section 7.3.2.5
1896	         of [VVC].
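
A receiver could turn the value of sprop-vps, sprop-sps, or sprop-pps
into a list of NAL units as sketched below.  The function name is
hypothetical, and the bytes in the usage example are placeholders,
not valid parameter sets.

```python
import base64


def decode_parameter_sets(value):
    """Split a comma-separated base64 list into raw NAL unit bytes."""
    return [base64.b64decode(item, validate=True)
            for item in value.split(",") if item]


# Placeholder payloads only; real values are complete NAL units.
units = decode_parameter_sets("AQID,BAU=")
```

Each element of the returned list would then be handed to the decoder
before any NAL unit received in the RTP stream.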

1898	      sprop-sei:

1900	         This parameter MAY be used to convey one or more SEI messages
1901	         that describe bitstream characteristics.  When present, a
1902	         decoder can rely on the bitstream characteristics that are
1903	         described in the SEI messages for the entire duration of the
1904	         session, independently from the persistence scopes of the SEI
1905	         messages as specified in [VSEI].

1907	         The value of the parameter is a comma-separated (',') list of
1908	         base64 [RFC4648] representations of SEI NAL units as specified
1909	         in [VSEI].

1911	            Informative note: Intentionally, no list of applicable or
1912	            inapplicable SEI messages is specified here.  Conveying
1913	            certain SEI messages in sprop-sei may be sensible in some
1914	            application scenarios and meaningless in others.  However, a
1915	            few examples are described below:

1917	            1) In an environment where the bitstream was created from
1918	            film-based source material, and no splicing is going to
1919	            occur during the lifetime of the session, the film grain
1920	            characteristics SEI message is likely meaningful, and
1921	            sending it in sprop-sei rather than in the bitstream at each
1922	            entry point may help with saving bits and allows one to
1923	            configure the renderer only once, avoiding unwanted
1924	            artifacts.

1926	            2) Examples for SEI messages that would be meaningless to be
1927	            conveyed in sprop-sei include the decoded picture hash SEI
1928	            message (it is close to impossible that all decoded pictures
1929	            have the same hash), the display orientation SEI message
1930	            when the device is a handheld device (as the display
1931	            orientation may change when the handheld device is turned
1932	            around), or the filler payload SEI message (as there is no
1933	            point in just having more bits in SDP).

1935	      max-lsr:

1937	         The max-lsr MAY be used to signal the capabilities of a
1938	         receiver implementation and MUST NOT be used for any other
1939	         purpose.  The value of max-lsr is an integer indicating the
1940	         maximum processing rate in units of luma samples per second.
1941	         The max-lsr parameter signals that the receiver is capable of
1942	         decoding video at a higher rate than is required by the highest
1943	         level.

1945	            Informative note: When the OPTIONAL media type parameters
1946	            are used to signal the properties of a bitstream, and max-
1947	            lsr is not present, the values of tier-flag, profile-id,
1948	            sub-profile-id, interop-constraints, and level-id must
1949	            always be such that the bitstream complies fully with the
1950	            specified profile, tier, and level.

1952	         When max-lsr is signaled, the receiver MUST be able to decode
1953	         bitstreams that conform to the highest level, with the
1954	         exception that the MaxLumaSr value in Table 136 of [VVC] for
1955	         the highest level is replaced with the value of max-lsr.
1956	         Senders MAY use this knowledge to send pictures of a given size
1957	         at a higher picture rate than is indicated in the highest
1958	         level.

1960	         When not present, the value of max-lsr is inferred to be equal
1961	         to the value of MaxLumaSr given in Table 136 of [VVC] for the
1962	         highest level.

1964	         The value of max-lsr MUST be in the range of MaxLumaSr to 16 *
1965	         MaxLumaSr, inclusive, where MaxLumaSr is given in Table 136 of
1966	         [VVC] for the highest level.
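
The range and inference rules for max-lsr can be sketched as follows.
The function names are hypothetical.  The MaxLumaSr value of the
highest level has to be taken from Table 136 of [VVC] and is passed
in by the caller; the numbers in any call are made up for
illustration.

```python
def effective_max_lsr(level_max_luma_sr, max_lsr=None):
    """Return the luma sample rate limit the sender may assume."""
    if max_lsr is None:
        # Not signaled: inferred to be MaxLumaSr of the highest level.
        return level_max_luma_sr
    if not level_max_luma_sr <= max_lsr <= 16 * level_max_luma_sr:
        raise ValueError("max-lsr outside MaxLumaSr..16*MaxLumaSr")
    return max_lsr


def max_picture_rate(luma_samples_per_picture, level_max_luma_sr,
                     max_lsr=None):
    """Pictures per second allowed by the luma sample rate alone."""
    limit = effective_max_lsr(level_max_luma_sr, max_lsr)
    return limit / luma_samples_per_picture
```

Other level limits and max-fps, when present, still apply on top of
the rate computed here.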

1968	      max-fps:

1970	         The value of max-fps is an integer indicating the maximum
1971	         picture rate in units of pictures per 100 seconds that can be
1972	         effectively processed by the receiver.  The max-fps parameter
1973	         MAY be used to signal that the receiver has a constraint in
1974	         that it is not capable of processing video effectively at the
1975	         full picture rate that is implied by the highest level and,
1976	         when present, max-lsr.

1978	         The value of max-fps is not necessarily the picture rate at
1979	         which the maximum picture size can be sent; it constitutes a
1980	         constraint on maximum picture rate for all resolutions.

1982	            Informative note: The max-fps parameter is semantically
1983	            different from max-lsr in that max-fps is used to signal a
1984	            constraint, lowering the maximum picture rate from what is
1985	            implied by other parameters.

1987	         The encoder MUST use a picture rate equal to or less than this
1988	         value.  In cases where the max-fps parameter is absent, the
1989	         encoder is free to choose any picture rate according to the
1990	         highest level and any signaled optional parameters.

1992	         The value of max-fps MUST be smaller than or equal to the full
1993	         picture rate that is implied by the highest level and, when
1994	         present, max-lsr.
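
Because max-fps is expressed in pictures per 100 seconds, a value of
2997 corresponds to 29.97 pictures per second.  The sketch below uses
exact fractions to avoid rounding at the boundary; the function names
are hypothetical, and only the max-fps constraint is checked, not the
level limits.

```python
from fractions import Fraction


def max_fps_to_rate(max_fps):
    """Convert max-fps (pictures per 100 s) to pictures per second."""
    return Fraction(max_fps, 100)


def picture_rate_allowed(rate, max_fps=None):
    """Check an encoder picture rate (int or Fraction) against max-fps."""
    if max_fps is None:
        return True  # no max-fps constraint was signaled
    return Fraction(rate) <= max_fps_to_rate(max_fps)
```

For example, a rate of 30000/1001 pictures per second slightly
exceeds 29.97 and would therefore not satisfy max-fps=2997.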

1996	      sprop-max-don-diff:

1998	         If there is no NAL unit naluA that is followed in transmission
1999	         order by any NAL unit preceding naluA in decoding order (i.e.,
2000	         the transmission order of the NAL units is the same as the
2001	         decoding order), the value of this parameter MUST be equal to
2002	         0.

2004	         Otherwise, this parameter specifies the maximum absolute
2005	         difference between the decoding order number (i.e., AbsDon)
2006	         values of any two NAL units naluA and naluB, where naluA
2007	         follows naluB in decoding order and precedes naluB in
2008	         transmission order.

2010	         The value of sprop-max-don-diff MUST be an integer in the range
2011	         of 0 to 32767, inclusive.

2013	         When not present, the value of sprop-max-don-diff is inferred
2014	         to be equal to 0.
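
For a finite sequence of NAL units, the definition above can be
evaluated as sketched below.  The input is assumed to be the list of
AbsDon values in transmission order; the function name is
hypothetical.

```python
def max_don_diff(abs_dons_in_transmission_order):
    """Largest AbsDon(naluA) - AbsDon(naluB) over all pairs where
    naluA precedes naluB in transmission order but follows it in
    decoding order; 0 when both orders are identical."""
    best = 0
    running_max = None  # largest AbsDon transmitted so far
    for abs_don in abs_dons_in_transmission_order:
        if running_max is not None and running_max > abs_don:
            best = max(best, running_max - abs_don)
        if running_max is None or abs_don > running_max:
            running_max = abs_don
    return best
```

A sender would signal a value at least as large as the result for
every stream it may send, within the range of 0 to 32767.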

2016	      sprop-depack-buf-bytes:

2018	         This parameter signals the required size of the de-
2019	         packetization buffer in units of bytes.  The value of the
2020	         parameter MUST be greater than or equal to the maximum buffer
2021	         occupancy (in units of bytes) of the de-packetization buffer as
2022	         specified in Section 6.

2024	         The value of sprop-depack-buf-bytes MUST be an integer in the
2025	         range of 0 to 4294967295, inclusive.

2027	         When sprop-max-don-diff is present and greater than 0, this
2028	         parameter MUST be present and the value MUST be greater than 0.
2029	         When not present, the value of sprop-depack-buf-bytes is
2030	         inferred to be equal to 0.

2032	            Informative note: The value of sprop-depack-buf-bytes
2033	            indicates the required size of the de-packetization buffer
2034	            only.  When network jitter can occur, an appropriately sized
2035	            jitter buffer has to be available as well.

2037	      depack-buf-cap:

2039	         This parameter signals the capabilities of a receiver
2040	         implementation and indicates the amount of de-packetization
2041	         buffer space in units of bytes that the receiver has available
2042	         for reconstructing the NAL unit decoding order from NAL units
2043	         carried in the RTP stream.  A receiver is able to handle any
2044	         RTP stream for which the value of the sprop-depack-buf-bytes
2045	         parameter is smaller than or equal to this parameter.

2047	         When not present, the value of depack-buf-cap is inferred to be
2048	         equal to 4294967295.  The value of depack-buf-cap MUST be an
2049	         integer in the range of 1 to 4294967295, inclusive.

2051	            Informative note: depack-buf-cap indicates the maximum
2052	            possible size of the de-packetization buffer of the receiver
2053	            only, without allowing for network jitter.
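
The relationship between sprop-max-don-diff, sprop-depack-buf-bytes,
and depack-buf-cap can be sketched as follows.  The function names
and the dictionary representation of the parameters are assumptions
made for illustration.

```python
U32_MAX = 4294967295


def stream_buffer_params(fmtp):
    """Return (sprop-max-don-diff, sprop-depack-buf-bytes) with the
    inferred defaults applied and the constraints above enforced."""
    don_diff = int(fmtp.get("sprop-max-don-diff", 0))
    if not 0 <= don_diff <= 32767:
        raise ValueError("sprop-max-don-diff out of range")
    present = "sprop-depack-buf-bytes" in fmtp
    buf_bytes = int(fmtp.get("sprop-depack-buf-bytes", 0))
    if not 0 <= buf_bytes <= U32_MAX:
        raise ValueError("sprop-depack-buf-bytes out of range")
    if don_diff > 0 and (not present or buf_bytes == 0):
        raise ValueError("interleaved stream needs a non-zero "
                         "sprop-depack-buf-bytes")
    return don_diff, buf_bytes


def receiver_can_handle(sender_fmtp, receiver_fmtp):
    """True if the receiver's depack-buf-cap covers the stream."""
    _, needed = stream_buffer_params(sender_fmtp)
    cap = int(receiver_fmtp.get("depack-buf-cap", U32_MAX))
    if not 1 <= cap <= U32_MAX:
        raise ValueError("depack-buf-cap out of range")
    return needed <= cap
```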

2055	7.2.  SDP Parameters

2057	   The receiver MUST ignore any parameter unspecified in this memo.

2059	7.2.1.  Mapping of Payload Type Parameters to SDP

2061	   The media type video/H266 string is mapped to fields in the Session
2062	   Description Protocol (SDP) [RFC4566] as follows:

2064	   o  The media name in the "m=" line of SDP MUST be video.

2066	   o  The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the
2067	      media subtype).

2069	   o  The clock rate in the "a=rtpmap" line MUST be 90000.

2071	   o  The OPTIONAL parameters profile-id, tier-flag, sub-profile-id,
2072	      interop-constraints, level-id, sprop-sub-layer-id, sprop-ols-id,
2073	      recv-sub-layer-id, recv-ols-id, max-recv-level-id, max-lsr, max-
2074	      fps, sprop-max-don-diff, sprop-depack-buf-bytes and depack-buf-
2075	      cap, when present, MUST be included in the "a=fmtp" line of SDP.
2076	      These parameters are expressed as a media type string, in the
2077	      form of a semicolon-separated list of parameter=value pairs.

2079	   o  The OPTIONAL parameters sprop-vps, sprop-sps, sprop-pps, sprop-sei,
2080	      and sprop-dci, when present, MUST be included in the "a=fmtp" line
2081	      of SDP or conveyed using the "fmtp" source attribute as specified
2082	      in Section 6.3 of [RFC5576].  For a particular media format (i.e.,
2083	      RTP payload type), sprop-vps, sprop-sps, sprop-pps, sprop-sei, or
2084	      sprop-dci MUST NOT be both included in the "a=fmtp" line of SDP
2085	      and conveyed using the "fmtp" source attribute.  When included in
2086	      the "a=fmtp" line of SDP, those parameters are expressed as a
2087	      media type string, in the form of a semicolon-separated list of
2088	      parameter=value pairs.  When conveyed in the "a=fmtp" line of SDP
2089	      for a particular payload type, the parameters sprop-vps, sprop-
2090	      sps, sprop-pps, sprop-sei, and sprop-dci MUST be applied to each
2091	      SSRC with the payload type.  When conveyed using the "fmtp" source
2092	      attribute, these parameters are only associated with the given
2093	      source and payload type as parts of the "fmtp" source attribute.

2095	   An example of media representation in SDP is as follows:

2097	           m=video 49170 RTP/AVP 98
2098	           a=rtpmap:98 H266/90000
2099	           a=fmtp:98 profile-id=1;
2100	             sprop-vps=<video parameter sets data>;
2101	             sprop-sps=<sequence parameter set data>;
2102	             sprop-pps=<picture parameter set data>;
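
A receiver might parse such an "a=fmtp" line as sketched below,
dropping any parameter that this memo does not specify.  The function
name is hypothetical, and the line is assumed to have been unfolded
into a single string before parsing.

```python
KNOWN_PARAMETERS = frozenset({
    "profile-id", "tier-flag", "sub-profile-id", "interop-constraints",
    "level-id", "sprop-sub-layer-id", "sprop-ols-id",
    "recv-sub-layer-id", "recv-ols-id", "max-recv-level-id", "max-lsr",
    "max-fps", "sprop-max-don-diff", "sprop-depack-buf-bytes",
    "depack-buf-cap", "sprop-dci", "sprop-vps", "sprop-sps",
    "sprop-pps", "sprop-sei",
})


def parse_fmtp(line):
    """Return (payload type, {parameter: value}) for one a=fmtp line."""
    prefix, _, rest = line.partition(" ")
    if not prefix.startswith("a=fmtp:"):
        raise ValueError("not an a=fmtp line")
    payload_type = int(prefix[len("a=fmtp:"):])
    params = {}
    for item in rest.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        # Split on the first '=' only, so base64 padding is preserved.
        name, _, value = item.partition("=")
        if name in KNOWN_PARAMETERS:
            params[name] = value  # unspecified parameters are ignored
    return payload_type, params
```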

2104	7.2.2.  Usage with SDP Offer/Answer Model

2106	   This section describes the negotiation of unicast messages using the
2107	   offer-answer model as described in [RFC3264] and its updates.  The
2108	   section is split into subsections, covering a) media format
2109	   configurations not involving non-temporal scalability; b) scalable
2110	   media format configurations; c) the description of the use of those
2111	   parameters not involving the media configuration itself but rather
2112	   the parameters of the payload format design; and d) multicast.

2114	7.2.2.1.  Non-scalable media format configuration

2116	   A non-scalable VVC media configuration is such a configuration where
2117	   no non-temporal scalability mechanisms are allowed.  In [VVC] version
2118	   1, that implies that general_profile_idc indicates one of the
2119	   following profiles: Main 10, Main 10 Still Picture, Main 10 4:4:4,
2120	   or Main 10 4:4:4 Still Picture, with general_profile_idc values of 1,
2121	   65, 33, and 97, respectively.  Note that non-scalable media
2122	   configurations include temporal scalability, in line with VVC's
2123	   design philosophy and profile structure.

2125	   The following limitations and rules pertaining to the media
2126	   configuration apply:

2128	   o  The parameters identifying a media format configuration for VVC
2129	      are profile-id, tier-flag, sub-profile-id, level-id, and interop-
2130	      constraints.  These media configuration parameters, except level-
2131	      id, MUST be used symmetrically.

2133	      The answerer MUST structure its answer according to one of the
2134	      following three options:

2136	      1) maintain all configuration parameters with the values remaining
2137	      the same as in the offer for the media format (payload type), with
2138	      the exception that the value of level-id is changeable as long as
2139	      the highest level indicated by the answer is not higher than that
2140	      indicated by the offer;

2142	      2) include in the answer the recv-sub-layer-id parameter, with a
2143	      value less than the sprop-sub-layer-id parameter in the offer, for
2144	      the media format (payload type), and maintain all configuration
2145	      parameters with the values remaining the same as in the offer for
2146	      the media format (payload type), with the exception that the value
2147	      of level-id is changeable as long as the highest level indicated
2148	      by the answer is not higher than the level indicated by the sprop-
2149	      sps or sprop-vps in the offer for the chosen sub-layer
2150	      representation; or

2152	      3) remove the media format (payload type) completely (when one or
2153	      more of the parameter values are not supported).

2155	            Informative note: The above requirement for symmetric use
2156	            does not apply for level-id, and does not apply for the
2157	            other bitstream or RTP stream properties and capability
2158	            parameters as described in Section 7.2.2.3 below.

2160	   o  To simplify handling and matching of these configurations, the
2161	      same RTP payload type number used in the offer SHOULD also be used
2162	      in the answer, as specified in [RFC3264].

2164	   o  The same RTP payload type number used in the offer for the media
2165	      subtype H266 MUST be used in the answer when the answer includes
2166	      recv-sub-layer-id.  When the answer does not include recv-sub-
2167	      layer-id, the answer MUST NOT contain a payload type number used
2168	      in the offer for the media subtype H266 unless the configuration
2169	      is exactly the same as in the offer or the configuration in the
2170	      answer only differs from that in the offer with a different value
2171	      of level-id.  The answer MAY contain the recv-sub-layer-id
2172	      parameter if a VVC bitstream contains multiple operation points
2173	      (using temporal scalability and sub-layers) and sprop-sps or
2174	      sprop-vps is included in the offer where information on sub-layers
2175	      is present in the first sequence parameter set or video parameter
2176	      set contained in sprop-sps or sprop-vps respectively.  If the
2177	      sprop-sps or sprop-vps is provided in an offer, an answerer MAY
2178	      select a particular operation point indicated in the first
2179	      sequence parameter set or video parameter set contained in sprop-
2180	      sps or sprop-vps respectively.  When the answer includes a recv-
2181	      sub-layer-id that is less than a sprop-sub-layer-id in the offer,
2182	      the following applies:

2184	      1) When sprop-sps parameter is present, all sequence parameter
2185	      sets contained in the sprop-sps parameter in the SDP answer and
2186	      all sequence parameter sets sent in-band for either the offerer-
2187	      to-answerer direction or the answerer-to-offerer direction MUST be
2188	      consistent with the first sequence parameter set in the sprop-sps
2189	      parameter of the offer (see the semantics of sprop-sps in
2190	      Section 7.1 of this document on one sequence parameter set being
2191	      consistent with another sequence parameter set).

2193	      2) When sprop-vps parameter is present, all video parameter sets
2194	      contained in the sprop-vps parameter in the SDP answer and all
2195	      video parameter sets sent in-band for either the offerer-to-
2196	      answerer direction or the answerer-to-offerer direction MUST be
2197	      consistent with the first video parameter set in the sprop-vps
2198	      parameter of the offer (see the semantics of sprop-vps in
2199	      Section 7.1 of this document on one video parameter set being
2200	      consistent with another video parameter set).

2202	      3) The bitstream sent in either direction MUST conform to the
2203	      profile, tier, level, and constraints of the chosen sub-layer
2204	      representation as indicated by the profile_tier_level( ) syntax
2205	      structure in the first sequence parameter set in the sprop-sps
2206	      parameter or by the first profile_tier_level( ) syntax structure
2207	      in the first video parameter set in the sprop-vps parameter of the
2208	      offer.

2210	            Informative note: When an offerer receives an answer that
2211	            does not include recv-sub-layer-id, it has to compare
2212	            payload types not declared in the offer based on the media
2213	            type (i.e., video/H266) and the above media configuration
2214	            parameters with any payload types it has already declared.
2215	            This will enable it to determine whether the configuration
2216	            in question is new or if it is equivalent to a configuration
2217	            already offered, since a different payload type number may
2218	            be used in the answer.  The ability to perform operation
2219	            point selection enables a receiver to utilize the temporal
2220	            scalable nature of a VVC bitstream.
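
The three answer options for a non-scalable configuration can be
sketched as a classification function.  This is an illustration under
stated assumptions: "level-id" holds the effective integer level
(explicit or default) in both dictionaries, sub_layer_levels is a
hypothetical mapping from TID to the level signaled in sprop-sps or
sprop-vps, and a recv-sub-layer-id larger than sprop-sub-layer-id is
rejected as a choice of this sketch.

```python
SYMMETRIC = ("profile-id", "tier-flag", "sub-profile-id",
             "interop-constraints")


def classify_answer(offer, answer, sub_layer_levels=None):
    """Return 1 or 2 for the matching option, or None.

    None means the answerer falls back to option 3 and removes the
    media format (payload type).
    """
    if any(offer.get(name) != answer.get(name) for name in SYMMETRIC):
        return None
    sprop = int(offer.get("sprop-sub-layer-id", 6))
    recv = int(answer.get("recv-sub-layer-id", sprop))
    if recv > sprop:
        return None  # not covered by the options; rejected here
    if recv < sprop:
        if sub_layer_levels is None or recv not in sub_layer_levels:
            return None
        option, limit = 2, sub_layer_levels[recv]
    else:
        option, limit = 1, offer["level-id"]
    return option if answer["level-id"] <= limit else None
```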

2222	7.2.2.2.  Scalable media format configuration

2224	   A scalable VVC media configuration is such a configuration where non-
2225	   temporal scalability mechanisms are allowed.  In [VVC] version 1,
2226	   that implies that general_profile_idc indicates one of the following
2227	   profiles: Multilayer Main 10, and Multilayer Main 10 4:4:4, with
2228	   general_profile_idc values of 17 and 49, respectively.

2230	   The following limitations and rules pertaining to the media
2231	   configuration apply.  They are listed in an order that would be
2232	   logical for an implementation to follow:

2234	   o  The parameters identifying a media format configuration for
2235	      scalable VVC are profile-id, tier-flag, sub-profile-id, level-id,
2236	      interop-constraints, and sprop-vps.  These media configuration
2237	      parameters, except level-id, MUST be used symmetrically, except as
2238	      noted below.

2240	   o  The answerer MAY include a level-id that MUST be lower than or
2241	      equal to the level-id indicated in the offer (either expressed by
2242	      level-id in the offer, or implied by the default level as
2243	      specified in Section 7.1).

2245	   o  The offerer MUST include sprop-vps including at least one valid
2246	      VPS, so as to allow the answerer to meaningfully interpret sprop-
2247	      ols-id and select recv-ols-id (see below).

2249	   o  The offerer MUST include sprop-ols-id.  The answerer MUST include
2250	      recv-ols-id, and recv-ols-id MUST indicate a supported output
2251	      layer set in the same dependency tree as sprop-ols-id.  If unable,
2252	      the answerer MUST remove the media format.

2254	         Informative note: if an offerer wants to offer more than one
2255	         output layer set, it can do so by offering multiple VVC media
2256	         with different payload types.

2258	   o  The offerer MAY include sprop-sub-layer-id which, in case of
2259	      scalable VVC, is interpreted as the highest sub-layer of the
2260	      highest enhancement layer in the OLS indicated by sprop-ols-id.
2261	      The answerer MAY include recv-sub-layer-id which can be used to
2262	      downgrade the sublayer of the highest enhancement layer.  This
2263	      specification does not support downgrading the sub-layer of any
2264	      layers in the OLS that are not the highest layer.

2266	         Informative note: in other words, using this mechanism, an
2267	         answerer can downgrade only the frame rate for the highest
2268	         spatial/quality layer (typically corresponding to the highest
2269	         resolution or bitrate, hence the most complex to decode), but
2270	         not for lower spatial/quality layers.  The answerer must
2271	         support all sublayers for lower layers in the OLS, or reject
2272	         the offer.  That's not a big burden, as the receiver/decoder
2273	         has the option to discard any sublayers it cannot decode,
2274	         irrespective of what is being signalled through offer/answer.

2276	   o  The answerer MUST maintain all configuration parameters with the
2277	      values being the same as signaled in the sprop-vps for the
2278	      operating point with the largest number of sublayers for the
2279	      chosen output layer set, with the exception that the value of
2280	      level-id is changeable as long as the highest level indicated by
2281	      the answer is not higher than the level indicated by the sprop-vps
2282	      in the offer for the operating point with the largest number of
2283	      sublayers for the chosen output layer set.
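
The presence rules for sprop-vps, sprop-ols-id, and recv-ols-id in a
scalable configuration can be sketched as follows.  Deciding whether
two output layer sets belong to the same dependency tree requires
parsing the VPS, so that decision is delegated to a caller-supplied
function; same_dependency_tree and the other names here are
hypothetical.

```python
def offer_problems(offer):
    """List the mandatory scalable parameters missing from an offer."""
    problems = []
    for name in ("sprop-vps", "sprop-ols-id"):
        if name not in offer:
            problems.append("offer lacks " + name)
    return problems


def keep_media_format(offer, answer, same_dependency_tree):
    """Return False when the media format has to be removed."""
    if offer_problems(offer) or "recv-ols-id" not in answer:
        return False
    sprop_ols = int(offer["sprop-ols-id"])
    recv_ols = int(answer["recv-ols-id"])
    if not (0 <= sprop_ols <= 257 and 0 <= recv_ols <= 257):
        return False
    # The caller inspects the VPS carried in sprop-vps to decide this.
    return bool(same_dependency_tree(sprop_ols, recv_ols))
```

The level-id and sub-layer rules of this section would be checked in
addition to the OLS rules shown here.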

2285	7.2.2.3.  Payload format configuration

2287	   The following limitations and rules pertain to the configuration of
2288	   the payload format mechanisms (mostly buffer management) and apply
2289	   to both scalable and non-scalable VVC.

2291	   o  The parameters sprop-max-don-diff and sprop-depack-buf-bytes
2292	      describe the properties of an RTP stream that the offerer or the
2293	      answerer is sending for the media format configuration.  This
2294	      differs from the normal usage of the offer/answer parameters:
2295	      normally such parameters declare the properties of the bitstream
2296	      or RTP stream that the offerer or the answerer is able to receive.
2297	      When dealing with VVC, the offerer assumes that the answerer will
2298	      be able to receive media encoded using the configuration being
2299	      offered.

2301	         Informative note: The above parameters apply for any RTP
2302	         stream, when present, sent by a declaring entity with the same
2303	         configuration.  In other words, the applicability of the above
2304	         parameters to RTP streams depends on the source endpoint.
2305	         Rather than being bound to the payload type, the values may
2306	         have to be applied to another payload type when being sent, as
2307	         they apply for the configuration.

2309	   o  The capability parameter max-lsr MAY be used to declare further
2310	      capabilities of the offerer or answerer for receiving.  It MUST
2311	      NOT be present when the direction attribute is sendonly.

2313	   o  The capability parameter max-fps MAY be used to declare lower
2314	      capabilities of the offerer or answerer for receiving.  It MUST
2315	      NOT be present when the direction attribute is sendonly.

2317	   o  When an offerer offers an interleaved stream, indicated by the
2318	      presence of sprop-max-don-diff with a value larger than zero, the
2319	      offerer MUST include the size of the de-packetization buffer
2320	      sprop-depack-buf-bytes.

2322	   o  To enable the offerer and answerer to inform each other about
2323	      their capabilities for de-packetization buffering in receiving RTP
2324	      streams, both parties are RECOMMENDED to include depack-buf-cap.

2326	   o  The sprop-dci, sprop-vps, sprop-sps, or sprop-pps, when present
2327	      (included in the "a=fmtp" line of SDP or conveyed using the "fmtp"
2328	      source attribute as specified in Section 6.3 of [RFC5576]), are
2329	      used for out-of-band transport of the parameter sets (DCI, VPS,
2330	      SPS, or PPS, respectively).

2332	   o  The answerer MAY use either out-of-band or in-band transport of
2333	      parameter sets for the bitstream it is sending, regardless of
2334	      whether out-of-band parameter sets transport has been used in the
2335	      offerer-to-answerer direction.  Parameter sets included in an
2336	      answer are independent of those parameter sets included in the
2337	      offer, as they are used for decoding two different bitstreams, one
2338	      from the answerer to the offerer and the other in the opposite
2339	      direction.  In case some RTP packets are sent before the SDP
2340	      offer/answer settles down, in-band parameter sets MUST be used for
2341	      those RTP stream parts sent before the SDP offer/answer.

2343	   o  The following rules apply to transport of parameter sets in the
2344	      offerer-to-answerer direction.

2346	      *  An offer MAY include sprop-dci, sprop-vps, sprop-sps, and/or
2347	         sprop-pps.  If none of these parameters is present in the
2348	         offer, then only in-band transport of parameter sets is used.

2350	      *  If the level to use in the offerer-to-answerer direction is
2351	         equal to the default level in the offer, the answerer MUST be
2352	         prepared to use the parameter sets included in sprop-vps,
2353	         sprop-sps, and sprop-pps (either included in the "a=fmtp" line
2354	         of SDP or conveyed using the "fmtp" source attribute) for
2355	         decoding the incoming bitstream, e.g., by passing these
2356	         parameter set NAL units to the video decoder before passing any
2357	         NAL units carried in the RTP streams.  Otherwise, the answerer
2358	         MUST ignore sprop-vps, sprop-sps, and sprop-pps (either
2359	         included in the "a=fmtp" line of SDP or conveyed using the
2360	         "fmtp" source attribute) and the offerer MUST transmit
2361	         parameter sets in-band.

2363	   o  The following rules apply to transport of parameter sets in the
2364	      answerer-to-offerer direction.

2366	      *  An answer MAY include sprop-dci, sprop-vps, sprop-sps, and/or
2367	         sprop-pps.  If none of these parameters is present in the
2368	         answer, then only in-band transport of parameter sets is used.

2370	      *  The offerer MUST be prepared to use the parameter sets included
2371	         in sprop-vps, sprop-sps, and sprop-pps (either included in the
2372	         "a=fmtp" line of SDP or conveyed using the "fmtp" source
2373	         attribute) for decoding the incoming bitstream, e.g., by
2374	         passing these parameter set NAL units to the video decoder
2375	         before passing any NAL units carried in the RTP streams.

2377	   o  When sprop-dci, sprop-vps, sprop-sps, and/or sprop-pps are
2378	      conveyed using the "fmtp" source attribute as specified in
2379	      Section 6.3 of [RFC5576], the receiver of the parameters MUST
2380	      store the parameter sets included in sprop-dci, sprop-vps, sprop-
2381	      sps, and/or sprop-pps and associate them with the source given as
2382	      part of the "fmtp" source attribute.  Parameter sets associated
2383	      with one source (given as part of the "fmtp" source attribute)
2384	      MUST only be used to decode NAL units conveyed in RTP packets from
2385	      the same source (given as part of the "fmtp" source attribute).
2386	      When this mechanism is in use, SSRC collision detection and
2387	      resolution MUST be performed as specified in [RFC5576].
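
   The storage rule for the "fmtp" source attribute can be sketched as
   follows.  This is a minimal illustration, not a normative
   procedure.  It assumes that each sprop-* value is a comma-separated
   list of base64 [RFC4648] encoded NAL units, as in the HEVC payload
   format [RFC7798]; the names parse_fmtp_params,
   decode_parameter_sets, and ParameterSetStore are hypothetical.

```python
import base64
from collections import defaultdict

SPROP_PARAMS = ("sprop-dci", "sprop-vps", "sprop-sps", "sprop-pps")


def parse_fmtp_params(fmtp_value):
    """Split 'name=value; name=value' into a dict keyed by lowercase name."""
    params = {}
    for item in fmtp_value.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first '=' only: base64 padding may add more '='.
        name, _, value = item.partition("=")
        params[name.strip().lower()] = value.strip()
    return params


def decode_parameter_sets(params):
    """Decode the sprop-* values that are present into lists of NAL units.

    Assumption: each value is a comma-separated list of base64 strings.
    """
    decoded = {}
    for name in SPROP_PARAMS:
        if name in params:
            decoded[name] = [
                base64.b64decode(v) for v in params[name].split(",") if v
            ]
    return decoded


class ParameterSetStore:
    """Keeps out-of-band parameter sets separate for each source (SSRC)."""

    def __init__(self):
        self._by_ssrc = defaultdict(dict)

    def store(self, ssrc, fmtp_value):
        sets = decode_parameter_sets(parse_fmtp_params(fmtp_value))
        self._by_ssrc[ssrc].update(sets)

    def for_source(self, ssrc):
        # Only parameter sets stored for this source are returned.
        return self._by_ssrc.get(ssrc, {})
```

   A decoder front end would call for_source() with the SSRC of the
   incoming RTP packets, so that parameter sets signaled for one source
   are never applied to NAL units from another.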

2389	   Table 1 lists the interpretation of all the parameters that MAY be
2390	   used for the various combinations of offer, answer, and direction
2391	   attributes.  Note that the two columns wherein the recv-ols-id
2392	   parameter is used only apply to answers, whereas the other columns
2393	   apply to both offers and answers.

2395	   Table 1.  Interpretation of parameters for various combinations of
2396	   offers, answers, direction attributes, with and without recv-ols-id.
2397	   Columns that do not indicate offer or answer apply to both.

2399	                                       sendonly --+
2400	               answer: recvonly, recv-ols-id --+  |
2401	                 recvonly w/o recv-ols-id --+  |  |
2402	         answer: sendrecv, recv-ols-id --+  |  |  |
2403	           sendrecv w/o recv-ols-id --+  |  |  |  |
2404	                                      |  |  |  |  |
2405	   profile-id                         C  D  C  D  P
2406	   tier-flag                          C  D  C  D  P
2407	   level-id                           D  D  D  D  P
2408	   sub-profile-id                     C  D  C  D  P
2409	   interop-constraints                C  D  C  D  P
2410	   max-recv-level-id                  R  R  R  R  -
2411	   sprop-max-don-diff                 P  P  -  -  P
2412	   sprop-depack-buf-bytes             P  P  -  -  P
2413	   depack-buf-cap                     R  R  R  R  -
2414	   max-lsr                            R  R  R  R  -
2415	   max-fps                            R  R  R  R  -
2416	   sprop-dci                          P  P  -  -  P
2417	   sprop-vps                          P  P  -  -  P
2418	   sprop-sps                          P  P  -  -  P
2419	   sprop-pps                          P  P  -  -  P
2420	   sprop-sub-layer-id                 P  P  -  -  P
2421	   recv-sub-layer-id                  O  O  O  O  -
2422	   sprop-ols-id                       P  P  -  -  P
2423	   recv-ols-id                        X  O  X  O  -

2425	   Legend:

2427	    C: configuration for sending and receiving bitstreams
2428	    D: changeable configuration, same as C except possible
2429	       to answer with a different but consistent value (see the
2430	       semantics of the six parameters related to profile, tier,
2431	       and level on these parameters being consistent)
2432	    P: properties of the bitstream to be sent
2433	    R: receiver capabilities
2434	    O: operation point selection
2435	    X: MUST NOT be present
2436	    -: not usable, when present MUST be ignored
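
   For implementers, Table 1 can be transcribed into a lookup
   structure.  The sketch below copies the table row by row; the
   function name interpretation and its arguments are hypothetical,
   and the code adds no rules beyond those in the table.

```python
# Column order follows Table 1, left to right:
#   0: sendrecv w/o recv-ols-id      1: answer: sendrecv, recv-ols-id
#   2: recvonly w/o recv-ols-id      3: answer: recvonly, recv-ols-id
#   4: sendonly
TABLE_1 = {
    "profile-id":             "CDCDP",
    "tier-flag":              "CDCDP",
    "level-id":               "DDDDP",
    "sub-profile-id":         "CDCDP",
    "interop-constraints":    "CDCDP",
    "max-recv-level-id":      "RRRR-",
    "sprop-max-don-diff":     "PP--P",
    "sprop-depack-buf-bytes": "PP--P",
    "depack-buf-cap":         "RRRR-",
    "max-lsr":                "RRRR-",
    "max-fps":                "RRRR-",
    "sprop-dci":              "PP--P",
    "sprop-vps":              "PP--P",
    "sprop-sps":              "PP--P",
    "sprop-pps":              "PP--P",
    "sprop-sub-layer-id":     "PP--P",
    "recv-sub-layer-id":      "OOOO-",
    "sprop-ols-id":           "PP--P",
    "recv-ols-id":            "XOXO-",
}


def interpretation(param, direction, answer_has_recv_ols_id=False):
    """Return the Table 1 legend letter for a parameter.

    direction is 'sendrecv', 'recvonly', or 'sendonly'.
    answer_has_recv_ols_id selects the 'answer: ..., recv-ols-id'
    columns and has no effect for 'sendonly'.
    """
    if direction == "sendonly":
        column = 4
    elif direction == "sendrecv":
        column = 1 if answer_has_recv_ols_id else 0
    elif direction == "recvonly":
        column = 3 if answer_has_recv_ols_id else 2
    else:
        raise ValueError("unknown direction: " + direction)
    return TABLE_1[param][column]
```

   For example, interpretation("sprop-sps", "recvonly") yields "-",
   meaning that the parameter is ignored when present.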

2438	   Parameters used for declaring receiver capabilities are, in general,
2439	   downgradable; i.e., they express the upper limit for a sender's
2440	   possible behavior.  Thus, a sender MAY configure its encoder using
2441	   only values lower than or equal to the values of these parameters.

2443	   When the answer does not include a recv-ols-id that is less than the
2444	   sprop-ols-id in the offer, parameters declaring a configuration point
2445	   are not changeable, with the exception of the level-id parameter for
2446	   unicast usage, and these parameters express values a receiver expects
2447	   to be used and MUST be used verbatim in the answer as in the offer.

2449	   When a sender's capabilities are declared with the configuration
2450	   parameters, these parameters express a configuration that is
2451	   acceptable for the sender to receive bitstreams.  In order to achieve
2452	   high interoperability levels, it is often advisable to offer multiple
2453	   alternative configurations.  It is impossible to offer multiple
2454	   configurations in a single payload type.  Thus, when multiple
2455	   configuration offers are made, each offer requires its own RTP
2456	   payload type associated with the offer.  However, it is possible to
2457	   offer multiple operation points using one configuration in a single
2458	   payload type by including sprop-vps in the offer and recv-ols-id in
2459	   the answer.

2461	   A receiver SHOULD understand all media type parameters, even if it
2462	   only supports a subset of the payload format's functionality.  This
2463	   ensures that a receiver is capable of understanding when an offer to
2464	   receive media can be downgraded to what is supported by the receiver
2465	   of the offer.

2467	   An answerer MAY extend the offer with additional media format
2468	   configurations.  However, to enable their usage, in most cases a
2469	   second offer is required from the offerer to provide the bitstream
2470	   property parameters that the media sender will use.  This also has
2471	   the effect that the offerer has to be able to receive this media
2472	   format configuration, not only to send it.

2474	7.2.2.4.  Multicast

2476	   For bitstreams being delivered over multicast, the following rules
2477	   apply:

2479	   o  The media format configuration is identified by profile-id, tier-
2480	      flag, sub-profile-id, level-id, and interop-constraints.  These
2481	      media format configuration parameters, including level-id, MUST be
2482	      used symmetrically; that is, the answerer MUST either maintain all
2483	      configuration parameters or remove the media format (payload type)
2484	      completely.  Note that this implies that the level-id for offer/
2485	      answer in multicast is not changeable.

2487	   o  To simplify the handling and matching of these configurations, the
2488	      same RTP payload type number used in the offer SHOULD also be used
2489	      in the answer, as specified in [RFC3264].  An answer MUST NOT
2490	      contain a payload type number used in the offer unless the
2491	      configuration is the same as in the offer.

2493	   o  Parameter sets received MUST be associated with the originating
2494	      source and MUST only be used in decoding the incoming bitstream
2495	      from the same source.

2497	   o  The rules for other parameters are the same as above for unicast
2498	      as long as the three above rules are obeyed.
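
   The symmetry rule for multicast can be checked mechanically.  The
   sketch below is an illustration only.  It assumes that the offer
   and the answer have already been parsed into dictionaries mapping
   payload type numbers to media type parameters, with default values
   filled in; the function name multicast_answer_violations is
   hypothetical.

```python
CONFIG_PARAMS = (
    "profile-id",
    "tier-flag",
    "sub-profile-id",
    "level-id",
    "interop-constraints",
)


def multicast_answer_violations(offer, answer):
    """Return payload types reused in the answer with a changed configuration.

    offer and answer map payload type number -> dict of parameters.
    A payload type that is absent from the answer was removed by the
    answerer, which is allowed.  A payload type that appears only in
    the answer is not checked here.
    """
    violations = []
    for payload_type, answer_params in answer.items():
        if payload_type not in offer:
            continue
        offer_params = offer[payload_type]
        # For multicast, level-id is compared like every other
        # configuration parameter: it is not changeable.
        if any(
            offer_params.get(name) != answer_params.get(name)
            for name in CONFIG_PARAMS
        ):
            violations.append(payload_type)
    return sorted(violations)
```

   An empty result means that every payload type number carried over
   from the offer keeps the configuration it had in the offer.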

2500	7.2.3.  Usage in Declarative Session Descriptions

2502	   When VVC over RTP is offered with SDP in a declarative style, as in
2503	   Real Time Streaming Protocol (RTSP) [RFC2326] or Session Announcement
2504	   Protocol (SAP) [RFC2974], the following considerations are necessary.

2506	   o  All parameters capable of indicating both bitstream properties and
2507	      receiver capabilities are used to indicate only bitstream
2508	      properties.  For example, in this case, the parameters profile-id,
2509	      tier-flag, and level-id declare the values used by the bitstream, not
2510	      the capabilities for receiving bitstreams.  As a result, the
2511	      following interpretation of the parameters MUST be used:

2513	      *  Declaring actual configuration or bitstream properties:

2515	         +  profile-id

2517	         +  tier-flag

2519	         +  level-id

2521	         +  interop-constraints

2523	         +  sub-profile-id

2525	         +  sprop-dci

2527	         +  sprop-vps

2529	         +  sprop-sps

2531	         +  sprop-pps

2533	         +  sprop-max-don-diff

2535	         +  sprop-depack-buf-bytes

2537	         +  sprop-sub-layer-id
2538	         +  sprop-ols-id

2540	      *  Not usable (when present, they MUST be ignored):

2542	         +  max-lsr

2544	         +  max-fps

2546	         +  max-recv-level-id

2548	         +  depack-buf-cap

2550	         +  recv-sub-layer-id

2552	         +  recv-ols-id

2554	      *  A receiver of the SDP is required to support all parameters and
2555	         values of the parameters provided; otherwise, the receiver MUST
2556	         reject (RTSP) or not participate in (SAP) the session.  It
2557	         falls on the creator of the session to use values that are
2558	         expected to be supported by the receiving application.
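
   A receiver of a declarative session description could sort the
   parameters it finds into the groups listed above.  The sketch below
   is a minimal illustration; the function name
   split_declarative_params is hypothetical, and the treatment of
   parameter names outside both lists (reported so that the caller can
   reject the session) is an assumption.

```python
DECLARATIVE_PROPERTIES = frozenset({
    "profile-id", "tier-flag", "level-id", "interop-constraints",
    "sub-profile-id", "sprop-dci", "sprop-vps", "sprop-sps", "sprop-pps",
    "sprop-max-don-diff", "sprop-depack-buf-bytes", "sprop-sub-layer-id",
    "sprop-ols-id",
})

DECLARATIVE_IGNORED = frozenset({
    "max-lsr", "max-fps", "max-recv-level-id", "depack-buf-cap",
    "recv-sub-layer-id", "recv-ols-id",
})


def split_declarative_params(params):
    """Split parsed parameters into (used, ignored, unknown).

    used:    dict of parameters that declare bitstream properties
    ignored: sorted names that are not usable and are skipped
    unknown: sorted names in neither list
    """
    used = {k: v for k, v in params.items() if k in DECLARATIVE_PROPERTIES}
    ignored = sorted(k for k in params if k in DECLARATIVE_IGNORED)
    unknown = sorted(
        k for k in params
        if k not in DECLARATIVE_PROPERTIES and k not in DECLARATIVE_IGNORED
    )
    return used, ignored, unknown
```

   A receiver that finds a non-empty unknown list, or a value in the
   used group that it cannot support, would reject (RTSP) or not
   participate in (SAP) the session.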

2560	7.2.4.  Considerations for Parameter Sets

2562	   When out-of-band transport of parameter sets is used, parameter sets
2563	   MAY still be additionally transported in-band unless explicitly
2564	   disallowed by an application, and some of these additional parameter
2565	   sets may update some of the out-of-band transported parameter sets.
2566	   Update of a parameter set refers to the sending of a parameter set of
2567	   the same type using the same parameter set ID but with different
2568	   values for at least one other parameter of the parameter set.

2570	8.  Use with Feedback Messages

2572	   The following subsections define the use of the Picture Loss
2573	   Indication (PLI) and Full Intra Request (FIR) feedback messages with
2574	   [VVC].  The PLI is defined in [RFC4585], and the FIR message is
2575	   defined in [RFC5104].  In accordance with this memo, unlike [HEVC], a
2576	   sender MUST NOT send Slice Loss Indication (SLI) or Reference Picture
2577	   Selection Indication (RPSI), and a receiver SHOULD ignore RPSI and
2578	   treat a received SLI as a PLI.

2580	8.1.  Picture Loss Indication (PLI)

2582	   As specified in RFC 4585, Section 6.3.1, the reception of a PLI by a
2583	   media sender indicates "the loss of an undefined amount of coded
2584	   video data belonging to one or more pictures".  Without having any
2585	   specific knowledge of the setup of the bitstream (such as use and
2586	   location of in-band parameter sets, non-IRAP decoder refresh points,
2587	   picture structures, and so forth), a reaction to the reception of a
2588	   PLI by a VVC sender SHOULD be to send an IRAP picture and relevant
2589	   parameter sets, potentially with sufficient redundancy so as to ensure
2590	   correct reception.  However, sometimes information about the
2591	   bitstream structure is known.  For example, state could have been
2592	   established outside of the mechanisms defined in this document that
2593	   parameter sets are conveyed out of band only, and stay static for the
2594	   duration of the session.  In that case, it is obviously unnecessary
2595	   to send them in-band as a result of the reception of a PLI.  Other
2596	   examples could be devised based on a priori knowledge of different
2597	   aspects of the bitstream structure.  In all cases, the timing and
2598	   congestion control mechanisms of RFC 4585 MUST be observed.

2600	8.2.  Full Intra Request (FIR)

2602	   The purpose of the FIR message is to force an encoder to send an
2603	   independent decoder refresh point as soon as possible, while
2604	   observing applicable congestion-control-related constraints, such as
2605	   those set out in [RFC8082].

2607	   Upon reception of a FIR, a sender MUST send an IDR picture.
2608	   Parameter sets MUST also be sent, except when there is a priori
2609	   knowledge that the parameter sets have been correctly established.  A
2610	   typical example for that is an understanding between sender and
2611	   receiver, established by means outside this document, that parameter
2612	   sets are exclusively sent out-of-band.
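
   The default sender reactions described in Sections 8.1 and 8.2 can
   be sketched as a small decision function.  This is an illustration
   only: the class SenderKnowledge, the function react_to_feedback,
   and the action strings are hypothetical, and the timing and
   congestion control rules of RFC 4585 are not modeled.

```python
from dataclasses import dataclass


@dataclass
class SenderKnowledge:
    # True only if sender and receiver have agreed, by means outside
    # the payload format, that parameter sets are conveyed out of band
    # only and stay static for the duration of the session.
    parameter_sets_out_of_band_only: bool = False


def react_to_feedback(message, knowledge):
    """Return the ordered list of actions for a received 'PLI' or 'FIR'."""
    if message == "FIR":
        picture = "IDR"    # a FIR requires an IDR picture
    elif message == "PLI":
        picture = "IRAP"   # default reaction to a PLI is an IRAP picture
    else:
        raise ValueError("unsupported feedback message: " + message)

    actions = []
    if not knowledge.parameter_sets_out_of_band_only:
        # Parameter sets precede the picture that refers to them.
        actions.append("send parameter sets")
    actions.append("send " + picture + " picture")
    return actions
```

   With a priori knowledge that parameter sets are out of band only,
   the in-band parameter sets are omitted and only the refresh picture
   is sent.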

2614	9.  Security Considerations

2616	   The scope of this Security Considerations section is limited to the
2617	   payload format itself and to one feature of [VVC] that may pose a
2618	   particularly serious security risk if implemented naively.  The
2619	   payload format, in isolation, does not form a complete system.
2620	   Implementers are advised to read and understand relevant security-
2621	   related documents, especially those pertaining to RTP (see the
2622	   Security Considerations section in [RFC3550]), and the security of
2623	   the call-control stack chosen (that may make use of the media type
2624	   registration of this memo).  Implementers should also consider known
2625	   security vulnerabilities of video coding and decoding implementations
2626	   in general and avoid those.

2628	   Within this RTP payload format, and with the exception of the user
2629	   data SEI message as described below, no security threats other than
2630	   those common to RTP payload formats are known.  In other words,
2631	   neither the various media-plane-based mechanisms, nor the signaling
2632	   part of this memo, seems to pose a security risk beyond those common
2633	   to all RTP-based systems.

2635	   RTP packets using the payload format defined in this specification
2636	   are subject to the security considerations discussed in the RTP
2637	   specification [RFC3550], and in any applicable RTP profile such as
2638	   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/
2639	   SAVPF [RFC5124].  However, as "Securing the RTP Framework: Why RTP
2640	   Does Not Mandate a Single Media Security Solution" [RFC7202]
2641	   discusses, it is not an RTP payload format's responsibility to
2642	   discuss or mandate what solutions are used to meet the basic security
2643	   goals like confidentiality, integrity and source authenticity for RTP
2644	   in general.  This responsibility lies with anyone using RTP in an
2645	   application.  They can find guidance on available security mechanisms
2646	   and important considerations in "Options for Securing RTP Sessions"
2647	   [RFC7201].  The rest of this section discusses the security-impacting
2648	   properties of the payload format itself.

2650	   Because the data compression used with this payload format is applied
2651	   end-to-end, any encryption needs to be performed after compression.
2652	   A potential denial-of-service threat exists for data encodings using
2653	   compression techniques that have non-uniform receiver-end
2654	   computational load.  The attacker can inject pathological datagrams
2655	   into the bitstream that are complex to decode and that cause the
2656	   receiver to be overloaded.  [VVC] is particularly vulnerable to such
2657	   attacks, as it is extremely simple to generate datagrams containing
2658	   NAL units that affect the decoding process of many future NAL units.
2659	   Therefore, the usage of data origin authentication and data integrity
2660	   protection of at least the RTP packet is RECOMMENDED, for example,
2661	   with SRTP [RFC3711].

2663	   Like HEVC [RFC7798], [VVC] includes a user data Supplemental
2664	   Enhancement Information (SEI) message.  This SEI message allows
2665	   inclusion of an arbitrary bitstring into the video bitstream.  Such a
2666	   bitstring could include JavaScript, machine code, and other active
2667	   content.  [VVC] leaves the handling of this SEI message to the
2668	   receiving system.  In order to avoid harmful side effects of the user
2669	   data SEI message, decoder implementations cannot naively trust its
2670	   content.  For example, it would be a bad and insecure implementation
2671	   practice to forward any JavaScript a decoder implementation detects
2672	   to a web browser.  The safest way to deal with user data SEI messages
2673	   is to simply discard them, but that can have negative side effects on
2674	   the quality of experience by the user.

2676	   End-to-end security with authentication, integrity, or
2677	   confidentiality protection will prevent a MANE from performing media-
2678	   aware operations other than discarding complete packets.  In the case
2679	   of confidentiality protection, it will even be prevented from
2680	   discarding packets in a media-aware way.  To be allowed to perform
2681	   such operations, a MANE is required to be a trusted entity that is
2682	   included in the security context establishment.

2684	10.  Congestion Control

2686	   Congestion control for RTP SHALL be used in accordance with RTP
2687	   [RFC3550] and with any applicable RTP profile, e.g., AVP [RFC3551].
2688	   If best-effort service is being used, an additional requirement is
2689	   that users of this payload format MUST monitor packet loss to ensure
2690	   that the packet loss rate is within an acceptable range.  Packet loss
2691	   is considered acceptable if a TCP flow across the same network path,
2692	   and experiencing the same network conditions, would achieve an
2693	   average throughput, measured on a reasonable timescale, that is not
2694	   less than all RTP streams combined are achieving.  This condition can
2695	   be satisfied by implementing congestion-control mechanisms to adapt
2696	   the transmission rate, the number of layers subscribed for a layered
2697	   multicast session, or by arranging for a receiver to leave the
2698	   session if the loss rate is unacceptably high.
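
   This document does not prescribe how to estimate the throughput
   that a TCP flow would achieve.  As a sketch under a stated
   assumption, the well-known simplified approximation by Mathis et
   al., rate = (MSS / RTT) * sqrt(3 / (2 * p)), can serve as the
   estimate.  The function names below are hypothetical.

```python
import math


def tcp_throughput_estimate_bps(mss_bytes, rtt_s, loss_rate):
    """Estimate steady-state TCP throughput in bits per second.

    Assumption: simplified Mathis approximation,
    rate = (MSS / RTT) * sqrt(3 / (2 * p)).
    """
    if mss_bytes <= 0 or rtt_s <= 0:
        raise ValueError("mss_bytes and rtt_s must be positive")
    if not 0.0 < loss_rate <= 1.0:
        raise ValueError("loss_rate must be in (0, 1]")
    return 8.0 * mss_bytes / rtt_s * math.sqrt(1.5 / loss_rate)


def loss_is_acceptable(rtp_total_bps, mss_bytes, rtt_s, loss_rate):
    """True if the estimated TCP rate is not less than the combined RTP rate."""
    if loss_rate == 0.0:
        return True  # no loss observed: nothing to compare against
    estimate = tcp_throughput_estimate_bps(mss_bytes, rtt_s, loss_rate)
    return estimate >= rtp_total_bps
```

   When loss_is_acceptable() returns False, the sender would lower
   its transmission rate, or a receiver would reduce the number of
   subscribed layers or leave the session.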

2700	   The bitrate adaptation necessary for obeying the congestion control
2701	   principle is easily achievable when real-time encoding is used, for
2702	   example, by adequately tuning the quantization parameter.  However,
2703	   when pre-encoded content is being transmitted, bandwidth adaptation
2704	   requires the pre-coded bitstream to be tailored for such adaptivity.
2705	   The key mechanisms available in [VVC] are temporal scalability, and
2706	   spatial/SNR scalability.  A media sender can remove NAL units
2707	   belonging to higher temporal sublayers (i.e., those NAL units with a
2708	   high value of TID) or higher spatio-SNR layers (as indicated by
2709	   interpreting the VPS) until the sending bitrate drops to an
2710	   acceptable range.
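
   Removal of higher temporal sublayers can be sketched as follows.
   The sketch assumes the two-byte NAL unit header of [VVC], in which
   the three least significant bits of the second byte carry the
   temporal ID plus one.  The function names are hypothetical, and
   removal of spatial/SNR layers, which requires interpreting the VPS,
   is not modeled.

```python
def temporal_id(nal_unit):
    """Return the TID read from the two-byte NAL unit header."""
    if len(nal_unit) < 2:
        raise ValueError("NAL unit shorter than its header")
    tid_plus1 = nal_unit[1] & 0x07
    if tid_plus1 == 0:
        raise ValueError("temporal ID plus one must not be zero")
    return tid_plus1 - 1


def thin_to_bitrate(nal_units, duration_s, target_bps):
    """Drop the highest temporal sublayers until the bitrate fits.

    nal_units:  list of bytes objects, each one complete NAL unit
    duration_s: media duration covered by nal_units, in seconds
    Sublayer 0 is never removed, so the result may still exceed
    target_bps.
    """
    kept = list(nal_units)
    max_tid = max((temporal_id(n) for n in kept), default=0)
    while max_tid > 0:
        bitrate = 8.0 * sum(len(n) for n in kept) / duration_s
        if bitrate <= target_bps:
            break
        # Remove every NAL unit of the current highest sublayer.
        kept = [n for n in kept if temporal_id(n) < max_tid]
        max_tid -= 1
    return kept
```

   Removing whole sublayers from the top down keeps the remaining
   bitstream decodable, because lower sublayers do not depend on
   higher ones.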

2712	   The mechanisms mentioned above generally work within a defined
2713	   profile and level and, therefore, no renegotiation of the channel is
2714	   required.  Only when non-downgradable parameters (such as profile)
2715	   are required to be changed does it become necessary to terminate and
2716	   restart the RTP stream(s).  This may be accomplished by using
2717	   different RTP payload types.

2719	   MANEs MAY remove certain unusable packets from the RTP stream when
2720	   that RTP stream was damaged due to previous packet losses.  This can
2721	   help reduce the network load in certain special cases.  For example,
2722	   MANEs can remove those FUs where the leading FUs belonging to the
2723	   same NAL unit have been lost or those dependent slice segments when
2724	   the leading slice segments belonging to the same slice have been
2725	   lost, because the trailing FUs or dependent slice segments are
2726	   meaningless to most decoders.  MANEs can also remove higher temporal
2727	   scalable layers if the outbound transmission (from the MANE's
2728	   viewpoint) experiences congestion.
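
   The FU example above can be sketched as a filter over received FU
   packets.  This is a simplified illustration: it assumes that the
   input contains only the FU packets of one RTP stream in sequence
   number order, each described by its 16-bit sequence number and its
   start and end indications.  The function name forwardable_fus is
   hypothetical.

```python
def forwardable_fus(fus):
    """Return the sequence numbers of FU packets worth forwarding.

    fus: iterable of (seq, is_start, is_end) tuples for received FU
    packets.  An FU is dropped if the start of its NAL unit, or any
    FU between the start and itself, was not received.
    """
    forward = []
    expected = None  # next sequence number of an intact NAL unit, if any
    for seq, is_start, is_end in fus:
        if is_start:
            expected = seq  # a new NAL unit begins here
        if expected is not None and seq == expected:
            forward.append(seq)
            # After the last FU, wait for the next start indication.
            expected = None if is_end else (seq + 1) & 0xFFFF
        else:
            # A leading FU is missing: the rest of this NAL unit is
            # unusable until the next start indication arrives.
            expected = None
    return forward
```

   For example, if the FU with sequence number 31 is lost from a NAL
   unit fragmented into packets 30 to 33, only packet 30 is forwarded.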

2730	11.  IANA Considerations

2732	   Placeholder

2734	12.  Acknowledgements

2736	   Dr. Byeongdoo Choi is thanked for the video codec related technical
2737	   discussion and other aspects in this memo.  Xin Zhao and Dr. Xiang Li
2738	   are thanked for their contributions on [VVC] specification
2739	   descriptive content.  Spencer Dawkins is thanked for his valuable
2740	   review comments that led to great improvements of this memo.  Some
2741	   parts of this specification share text with the RTP payload format
2742	   for HEVC [RFC7798].  We thank the authors of that specification for
2743	   their excellent work.

2745	13.  References

2747	13.1.  Normative References

2749	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
2750	              Requirement Levels", BCP 14, RFC 2119,
2751	              DOI 10.17487/RFC2119, March 1997,
2752	              <https://www.rfc-editor.org/info/rfc2119>.

2754	   [RFC3264]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
2755	              with Session Description Protocol (SDP)", RFC 3264,
2756	              DOI 10.17487/RFC3264, June 2002,
2757	              <https://www.rfc-editor.org/info/rfc3264>.

2759	   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
2760	              Jacobson, "RTP: A Transport Protocol for Real-Time
2761	              Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
2762	              July 2003, <https://www.rfc-editor.org/info/rfc3550>.

2764	   [RFC3551]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
2765	              Video Conferences with Minimal Control", STD 65, RFC 3551,
2766	              DOI 10.17487/RFC3551, July 2003,
2767	              <https://www.rfc-editor.org/info/rfc3551>.

2769	   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
2770	              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
2771	              RFC 3711, DOI 10.17487/RFC3711, March 2004,
2772	              <https://www.rfc-editor.org/info/rfc3711>.

2774	   [RFC4556]  Zhu, L. and B. Tung, "Public Key Cryptography for Initial
2775	              Authentication in Kerberos (PKINIT)", RFC 4556,
2776	              DOI 10.17487/RFC4556, June 2006,
2777	              <https://www.rfc-editor.org/info/rfc4556>.

2779	   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
2780	              Description Protocol", RFC 4566, DOI 10.17487/RFC4566,
2781	              July 2006, <https://www.rfc-editor.org/info/rfc4566>.

2783	   [RFC4585]  Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey,
2784	              "Extended RTP Profile for Real-time Transport Control
2785	              Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585,
2786	              DOI 10.17487/RFC4585, July 2006,
2787	              <https://www.rfc-editor.org/info/rfc4585>.

2789	   [RFC4648]  Josefsson, S., "The Base16, Base32, and Base64 Data
2790	              Encodings", RFC 4648, DOI 10.17487/RFC4648, October 2006,
2791	              <https://www.rfc-editor.org/info/rfc4648>.

2793	   [RFC5104]  Wenger, S., Chandra, U., Westerlund, M., and B. Burman,
2794	              "Codec Control Messages in the RTP Audio-Visual Profile
2795	              with Feedback (AVPF)", RFC 5104, DOI 10.17487/RFC5104,
2796	              February 2008, <https://www.rfc-editor.org/info/rfc5104>.

2798	   [RFC5124]  Ott, J. and E. Carrara, "Extended Secure RTP Profile for
2799	              Real-time Transport Control Protocol (RTCP)-Based Feedback
2800	              (RTP/SAVPF)", RFC 5124, DOI 10.17487/RFC5124, February
2801	              2008, <https://www.rfc-editor.org/info/rfc5124>.

2803	   [RFC5576]  Lennox, J., Ott, J., and T. Schierl, "Source-Specific
2804	              Media Attributes in the Session Description Protocol
2805	              (SDP)", RFC 5576, DOI 10.17487/RFC5576, June 2009,
2806	              <https://www.rfc-editor.org/info/rfc5576>.

2808	   [RFC7656]  Lennox, J., Gross, K., Nandakumar, S., Salgueiro, G., and
2809	              B. Burman, Ed., "A Taxonomy of Semantics and Mechanisms
2810	              for Real-Time Transport Protocol (RTP) Sources", RFC 7656,
2811	              DOI 10.17487/RFC7656, November 2015,
2812	              <https://www.rfc-editor.org/info/rfc7656>.

2814	   [RFC8082]  Wenger, S., Lennox, J., Burman, B., and M. Westerlund,
2815	              "Using Codec Control Messages in the RTP Audio-Visual
2816	              Profile with Feedback with Layered Codecs", RFC 8082,
2817	              DOI 10.17487/RFC8082, March 2017,
2818	              <https://www.rfc-editor.org/info/rfc8082>.

2820	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2821	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
2822	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

2824	   [VSEI]     "Versatile supplemental enhancement information messages
2825	              for coded video bitstreams", 2020,
2826	              <http://handle.itu.int/11.1002/1000/14337>.

2828	   [VVC]      "Versatile Video Coding, ITU-T Recommendation H.266",
2829	              2020, <http://handle.itu.int/11.1002/1000/14336>.

2831	13.2.  Informative References

2833	   [CABAC]    Sole, J., et al., "Transform coefficient coding in
2834	              HEVC", IEEE Transactions on Circuits and Systems for Video
2835	              Technology, DOI 10.1109/TCSVT.2012.2223055, December
2836	              2012.

2838	   [HEVC]     "High efficiency video coding, ITU-T Recommendation
2839	              H.265", 2019, <http://handle.itu.int/11.1002/1000/14107>.

2841	   [MPEG2S]   ISO/IEC, "Information technology - Generic coding
2842	              of moving pictures and associated audio information - Part
2843	              1: Systems", ISO International Standard 13818-1, 2013.

2845	   [RFC2326]  Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time
2846	              Streaming Protocol (RTSP)", RFC 2326,
2847	              DOI 10.17487/RFC2326, April 1998,
2848	              <https://www.rfc-editor.org/info/rfc2326>.

2850	   [RFC2974]  Handley, M., Perkins, C., and E. Whelan, "Session
2851	              Announcement Protocol", RFC 2974, DOI 10.17487/RFC2974,
2852	              October 2000, <https://www.rfc-editor.org/info/rfc2974>.

2854	   [RFC6184]  Wang, Y., Even, R., Kristensen, T., and R. Jesup, "RTP
2855	              Payload Format for H.264 Video", RFC 6184,
2856	              DOI 10.17487/RFC6184, May 2011,
2857	              <https://www.rfc-editor.org/info/rfc6184>.

2859	   [RFC6190]  Wenger, S., Wang, Y., Schierl, T., and A. Eleftheriadis,
2860	              "RTP Payload Format for Scalable Video Coding", RFC 6190,
2861	              DOI 10.17487/RFC6190, May 2011,
2862	              <https://www.rfc-editor.org/info/rfc6190>.

2864	   [RFC7201]  Westerlund, M. and C. Perkins, "Options for Securing RTP
2865	              Sessions", RFC 7201, DOI 10.17487/RFC7201, April 2014,
2866	              <https://www.rfc-editor.org/info/rfc7201>.

2868	   [RFC7202]  Perkins, C. and M. Westerlund, "Securing the RTP
2869	              Framework: Why RTP Does Not Mandate a Single Media
2870	              Security Solution", RFC 7202, DOI 10.17487/RFC7202, April
2871	              2014, <https://www.rfc-editor.org/info/rfc7202>.

2873	   [RFC7798]  Wang, Y., Sanchez, Y., Schierl, T., Wenger, S., and M.
2874	              Hannuksela, "RTP Payload Format for High Efficiency Video
2875	              Coding (HEVC)", RFC 7798, DOI 10.17487/RFC7798, March
2876	              2016, <https://www.rfc-editor.org/info/rfc7798>.

2878	Appendix A.  Change History

2880	   draft-zhao-payload-rtp-vvc-00 ........ initial version

2882	   draft-zhao-payload-rtp-vvc-01 ........ editorial clarifications and
2883	   corrections

2885	   draft-ietf-payload-rtp-vvc-00 ........ initial WG draft

2887	   draft-ietf-payload-rtp-vvc-01 ........ VVC specification update

2889	   draft-ietf-payload-rtp-vvc-02 ........ VVC specification update

2891	   draft-ietf-payload-rtp-vvc-03 ........ VVC coding tool introduction
2892	   update

2894	   draft-ietf-payload-rtp-vvc-04 ........ VVC coding tool introduction
2895	   update

2897	   draft-ietf-payload-rtp-vvc-05 ........ reference update and adding
2898	   placement for open issues

2900	   draft-ietf-payload-rtp-vvc-06 ........ address editor's note

2902	   draft-ietf-payload-rtp-vvc-07 ........ address editor's notes

2904	   draft-ietf-payload-rtp-vvc-08 ........ address editor's notes

2906	   draft-ietf-payload-rtp-vvc-09 ........ address editor's notes
2907	   draft-ietf-payload-rtp-vvc-10 ........ address editor's notes

2909	Authors' Addresses

2911	   Shuai Zhao
2912	   Tencent
2913	   2747 Park Blvd
2914	   Palo Alto  94588
2915	   USA

2917	   Email: shuai.zhao@ieee.org

2919	   Stephan Wenger
2920	   Tencent
2921	   2747 Park Blvd
2922	   Palo Alto  94588
2923	   USA

2925	   Email: stewe@stewe.org

2927	   Yago Sanchez
2928	   Fraunhofer HHI
2929	   Einsteinufer 37
2930	   Berlin  10587
2931	   Germany

2933	   Email: yago.sanchez@hhi.fraunhofer.de

2935	   Ye-Kui Wang
2936	   Bytedance Inc.
2937	   8910 University Center Lane
2938	   San Diego  92122
2939	   USA

2941	   Email: yekui.wang@bytedance.com