idnits 2.17.1 

draft-ietf-avtcore-rtp-vvc-08.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document date (7 March 2021) is 1145 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 1372

  ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866)

  ** Downref: Normative reference to an Informational RFC: RFC 7656

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VSEI'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC'


     Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	avtcore                                                          S. Zhao
3	Internet-Draft                                                 S. Wenger
4	Intended status: Standards Track                                 Tencent
5	Expires: 8 September 2021                                     Y. Sanchez
6	                                                          Fraunhofer HHI
7	                                                              Y.-K. Wang
8	                                                          Bytedance Inc.
9	                                                            7 March 2021

11	          RTP Payload Format for Versatile Video Coding (VVC)
12	                     draft-ietf-avtcore-rtp-vvc-08

14	Abstract

16	   This memo describes an RTP payload format for the video coding
17	   standard ITU-T Recommendation H.266 and ISO/IEC International
18	   Standard 23090-3, both also known as Versatile Video Coding (VVC) and
19	   developed by the Joint Video Experts Team (JVET).  The RTP payload
20	   format allows for packetization of one or more Network Abstraction
21	   Layer (NAL) units in each RTP packet payload as well as fragmentation
22	   of a NAL unit into multiple RTP packets.  The payload format has wide
23	   applicability in videoconferencing, Internet video streaming, and
24	   high-bitrate entertainment-quality video, among other applications.

26	Status of This Memo

28	   This Internet-Draft is submitted in full conformance with the
29	   provisions of BCP 78 and BCP 79.

31	   Internet-Drafts are working documents of the Internet Engineering
32	   Task Force (IETF).  Note that other groups may also distribute
33	   working documents as Internet-Drafts.  The list of current Internet-
34	   Drafts is at https://datatracker.ietf.org/drafts/current/.

36	   Internet-Drafts are draft documents valid for a maximum of six months
37	   and may be updated, replaced, or obsoleted by other documents at any
38	   time.  It is inappropriate to use Internet-Drafts as reference
39	   material or to cite them other than as "work in progress."

41	   This Internet-Draft will expire on 8 September 2021.

43	Copyright Notice

45	   Copyright (c) 2021 IETF Trust and the persons identified as the
46	   document authors.  All rights reserved.

48	   This document is subject to BCP 78 and the IETF Trust's Legal
49	   Provisions Relating to IETF Documents (https://trustee.ietf.org/
50	   license-info) in effect on the date of publication of this document.
51	   Please review these documents carefully, as they describe your rights
52	   and restrictions with respect to this document.  Code Components
53	   extracted from this document must include Simplified BSD License text
54	   as described in Section 4.e of the Trust Legal Provisions and are
55	   provided without warranty as described in the Simplified BSD License.

57	Table of Contents

59	   1.  Introduction
60	     1.1.  Overview of the VVC Codec
61	       1.1.1.  Coding-Tool Features (informative)
62	       1.1.2.  Systems and Transport Interfaces (informative)
63	       1.1.3.  High-Level Picture Partitioning (informative)
64	       1.1.4.  NAL Unit Header
65	     1.2.  Overview of the Payload Format
66	   2.  Conventions
67	   3.  Definitions and Abbreviations
68	     3.1.  Definitions
69	       3.1.1.  Definitions from the VVC Specification
70	       3.1.2.  Definitions Specific to This Memo
71	     3.2.  Abbreviations
72	   4.  RTP Payload Format
73	     4.1.  RTP Header Usage
74	     4.2.  Payload Header Usage
75	     4.3.  Payload Structures
76	       4.3.1.  Single NAL Unit Packets
77	       4.3.2.  Aggregation Packets (APs)
78	       4.3.3.  Fragmentation Units
79	     4.4.  Decoding Order Number
80	   5.  Packetization Rules
81	   6.  De-packetization Process
82	   7.  Payload Format Parameters
83	     7.1.  Media Type Registration
84	     7.2.  SDP Parameters
85	       7.2.1.  Mapping of Payload Type Parameters to SDP
86	       7.2.2.  Usage with SDP Offer/Answer Model
87	   8.  Use with Feedback Messages
88	     8.1.  Picture Loss Indication (PLI)
89	     8.2.  Full Intra Request (FIR)
90	   9.  Security Considerations
91	   10. Congestion Control
92	   11. IANA Considerations
93	   12. Acknowledgements
94	   13. References
95	     13.1.  Normative References
96	     13.2.  Informative References
97	   Appendix A.  Change History
98	   Authors' Addresses

100	1.  Introduction

102	   The Versatile Video Coding [VVC] specification, formally published as
103	   both ITU-T Recommendation H.266 and ISO/IEC International Standard
104	   23090-3, is currently in the ITU-T publication process and the ISO/
105	   IEC approval process.  VVC is reported to provide significant coding
106	   efficiency gains over HEVC [HEVC], also known as H.265, and other
107	   earlier video codecs.

109	   This memo specifies an RTP payload format for VVC.  It shares its
110	   basic design with the NAL (Network Abstraction Layer) unit-based RTP
111	   payload formats of H.264 Video Coding [RFC6184], Scalable Video
112	   Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798]
113	   and their respective predecessors.  With respect to design
114	   philosophy, security, congestion control, and overall implementation
115	   complexity, it has similar properties to those earlier payload format
116	   specifications.  This is a conscious choice, as at least RFC 6184 is
117	   widely deployed and generally known in the relevant implementer
118	   communities.  Certain mechanisms known from [RFC6190] were
119	   incorporated in this memo, as VVC version 1 supports temporal,
120	   spatial, and signal-to-noise ratio (SNR) scalability.

122	1.1.  Overview of the VVC Codec

124	   VVC and HEVC share a similar hybrid video codec design.  In this
125	   memo, we provide a very brief overview of those features of VVC that
126	   are, in some form, addressed by the payload format specified herein.
127	   Implementers have to read, understand, and apply the ITU-T/ISO/IEC
128	   specifications pertaining to VVC to arrive at interoperable, well-
129	   performing implementations.

131	   Conceptually, both VVC and HEVC include a Video Coding Layer (VCL),
132	   which is often used to refer to the coding-tool features, and a NAL,
133	   which is often used to refer to the systems and transport interface
134	   aspects of the codecs.

136	1.1.1.  Coding-Tool Features (informative)

138	   Coding tool features are described below with occasional reference to
139	   the coding tool set of HEVC, which is well known in the community.

141	   Similar to earlier hybrid-video-coding-based standards, including
142	   HEVC, the following basic video coding design is employed by VVC.  A
143	   prediction signal is first formed by either intra- or motion-
144	   compensated prediction, and the residual (the difference between the
145	   original and the prediction) is then coded.  The gains in coding
146	   efficiency are achieved by redesigning and improving almost all parts
147	   of the codec over earlier designs.  In addition, VVC includes several
148	   tools to make the implementation on parallel architectures easier.

150	   Finally, VVC includes temporal, spatial, and SNR scalability as well
151	   as multiview coding support.

153	   Coding blocks and transform structure

155	   Among major coding-tool differences between HEVC and VVC, one of the
156	   important improvements is the more flexible coding tree structure in
157	   VVC, i.e., multi-type tree.  In addition to quadtree, binary and
158	   ternary trees are also supported, which contributes a significant
159	   improvement in coding efficiency.  Moreover, the maximum size of the
160	   coding tree unit (CTU) is increased from 64x64 to 128x128.  To
161	   improve the coding efficiency of the chroma signal, separate luma and
162	   chroma trees at the CTU level may be employed for intra slices.  The
163	   square transforms in HEVC are extended to non-square transforms for
164	   rectangular blocks resulting from binary and ternary tree splits.
165	   Besides, VVC supports multiple transform sets (MTS), including DCT-2,
166	   DST-7, and DCT-8 as well as the non-separable secondary transform.
167	   The transforms used in VVC can have different sizes with support for
168	   larger transform sizes.  For DCT-2, the transform sizes range from
169	   2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range from
170	   4x4 to 32x32.  In addition, VVC also supports sub-block transform for
171	   both intra and inter coded blocks.  For intra coded blocks, intra
172	   sub-partitioning (ISP) may be used to allow sub-block based intra
173	   prediction and transform.  For inter blocks, sub-block transform may
174	   be used assuming that only a part of an inter-block has non-zero
175	   transform coefficients.
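
   As a rough illustration of what the larger maximum CTU size means in
   practice, the following sketch counts the CTUs needed to cover a
   picture.  The function name and the 3840x2160 example picture size
   are assumptions made for illustration only; they do not come from
   the VVC specification.

```python
def ctu_grid(width, height, ctu_size):
    """Return (columns, rows, total) of CTUs needed to cover a picture.

    Partial CTUs at the right and bottom edges still count as one CTU,
    hence the ceiling division.
    """
    cols = -(-width // ctu_size)   # ceiling division on integers
    rows = -(-height // ctu_size)
    return cols, rows, cols * rows


# Hypothetical example: a 3840x2160 picture with the largest CTU size
# named in the text for HEVC (64x64) and for VVC (128x128).
hevc_grid = ctu_grid(3840, 2160, 64)
vvc_grid = ctu_grid(3840, 2160, 128)
print("64x64 CTUs:  ", hevc_grid)
print("128x128 CTUs:", vvc_grid)
```

   With these example numbers, the 128x128 grid holds a quarter as many
   CTUs as the 64x64 grid.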

177	   Entropy coding

179	   Similar to HEVC, VVC uses a single entropy-coding engine, which is
180	   based on context adaptive binary arithmetic coding [CABAC], but with
181	   the support of multi-window sizes.  The window sizes can be
182	   initialized differently for different context models.  With this
183	   design, the engine adapts more quickly and achieves better coding
184	   efficiency.  A joint chroma residual coding scheme is applied to
185	   further exploit the correlation between the residuals of two color
186	   components.  In VVC, different residual coding schemes are applied
187	   for regular transform coefficients and residual samples generated
188	   using transform-skip mode.

190	   In-loop filtering
191	   VVC has more feature support in loop filters than HEVC.  The
192	   deblocking filter in VVC is similar to HEVC but operates at a smaller
193	   grid.  After deblocking and sample adaptive offset (SAO), an adaptive
194	   loop filter (ALF) may be used.  As a Wiener filter, ALF reduces
195	   distortion of decoded pictures.  Besides, VVC introduces a new module
196	   before deblocking called luma mapping with chroma scaling to fully
197	   utilize the dynamic range of the signal so that rate-distortion
198	   performance of both SDR and HDR content is improved.

200	   Motion prediction and coding

202	   Compared to HEVC, VVC introduces several improvements in this area.
203	   First, there is the adaptive motion vector resolution (AMVR), which
204	   can save bit cost for motion vectors by adaptively signaling motion
205	   vector resolution.  Second, affine motion compensation is included
206	   to capture complicated motion like zooming and rotation.  Meanwhile,
207	   prediction refinement with optical flow for affine mode (PROF)
208	   is further deployed to mimic affine motion at the pixel level.
209	   Third, decoder-side motion vector refinement (DMVR) derives motion
210	   vectors at the decoder side based on block matching so that
211	   fewer bits may be spent on motion vectors.  Bi-directional optical
212	   flow (BDOF) is a similar method to PROF.  BDOF adds a sample-wise
213	   offset at 4x4 sub-block level that is derived with equations based on
214	   gradients of the prediction samples and a motion difference relative
215	   to CU motion vectors.  Furthermore, merge with motion vector
216	   difference (MMVD) is a special mode, which further signals a limited
217	   set of motion vector differences on top of merge mode.  In addition
218	   to MMVD, there are three other types of special merge modes, i.e.,
219	   sub-block merge, triangle, and combined intra-/inter-prediction
220	   (CIIP).  Sub-block merge list includes one candidate of sub-block
221	   temporal motion vector prediction (SbTMVP) and up to four candidates
222	   of affine motion vectors.  Triangle is based on triangular block
223	   motion compensation.  CIIP combines intra- and inter-predictions
224	   with weighting.  Adaptive weighting may be employed with a block-
225	   level tool called bi-prediction with CU based weighting (BCW) which
226	   provides more flexibility than in HEVC.

228	   Intra prediction and intra-coding

230	   To capture the diversified local image texture directions with finer
231	   granularity, VVC supports 65 angular directions instead of 33
232	   directions in HEVC.  The intra mode coding is based on a 6-most-
233	   probable-mode scheme, and the 6 most probable modes are derived using
234	   the neighboring intra prediction directions.  In addition, to deal
235	   with the different distributions of intra prediction angles for
236	   different block aspect ratios, a wide-angle intra prediction (WAIP)
237	   scheme is applied in VVC by including intra prediction angles beyond
238	   those present in HEVC.  Unlike HEVC which only allows using the most
239	   adjacent line of reference samples for intra prediction, VVC also
240	   allows using two further reference lines, which is known as multi-
241	   reference-line (MRL) intra prediction.  The additional reference
242	   lines can only be used for the 6 most probable intra prediction
243	   modes.  To capture the strong correlation between different colour
244	   components, in VVC, a cross-component linear mode (CCLM) is utilized
245	   which assumes a linear relationship between the luma sample values
246	   and their associated chroma samples.  For intra prediction, VVC also
247	   applies a position-dependent prediction combination (PDPC) for
248	   refining the prediction samples closer to the intra prediction block
249	   boundary.  Matrix-based intra prediction (MIP) modes are also used in
250	   VVC; these generate an intra prediction block of up to 8x8 using a
251	   weighted sum of downsampled neighboring reference samples, where the
252	   weights are hardcoded constants.

254	   Other coding-tool features

256	   VVC introduces dependent quantization (DQ) to reduce quantization
257	   error by state-based switching between two quantizers.
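
   The state-based switching can be sketched as a small state machine.
   The sketch below assumes a four-state design in which states 0 and 1
   select the first quantizer, states 2 and 3 select the second, and the
   next state depends on the parity of the current quantization level.
   The transition table and all names are assumptions made for
   illustration; the VVC specification remains the authoritative
   reference.

```python
# Assumed transition table: row = current state, column = parity of the
# absolute quantization level (0 = even, 1 = odd).
STATE_TRANSITION = ((0, 2), (2, 0), (1, 3), (3, 1))


def quantizer_choices(levels, start_state=0):
    """Return, per coefficient, which quantizer (0 or 1) the state selects.

    `levels` are quantization indices in coding order (hypothetical input).
    """
    state = start_state
    choices = []
    for level in levels:
        # States 0 and 1 map to quantizer 0; states 2 and 3 to quantizer 1.
        choices.append(0 if state < 2 else 1)
        # The parity of the level just coded drives the state update.
        state = STATE_TRANSITION[state][abs(level) & 1]
    return choices


print(quantizer_choices([1, 0, 2, 3]))
```

   Because the decoder sees the same levels in the same order, it can
   track the state without any extra signaling in this model.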

259	1.1.2.  Systems and Transport Interfaces (informative)

261	   VVC inherits the basic systems and transport interface designs from
262	   HEVC and H.264.  These include the NAL-unit-based syntax structure,
263	   the hierarchical syntax and data unit structure, the supplemental
264	   enhancement information (SEI) message mechanism, and the video
265	   buffering model based on the hypothetical reference decoder (HRD).
266	   The scalability features of VVC are conceptually similar to the
267	   scalable variant of HEVC known as SHVC.  The hierarchical syntax and
268	   data unit structure consists of parameter sets at various levels
269	   (decoder, sequence (pertaining to all), sequence (pertaining to a
270	   single), picture), picture-level header parameters, slice-level
271	   header parameters, and lower-level parameters.

273	   A number of key components that influenced the network abstraction
274	   layer design of VVC as well as this memo are described below.

276	   Decoding capability information

278	   The decoding capability information (DCI) includes parameters that
279	   stay constant for the lifetime of a Video Bitstream, which in IETF
280	   terms can translate to the lifetime of a session.  Such information
281	   includes profile, level, and sub-profile information to determine a
282	   maximum capability interop point that is guaranteed to be never
283	   exceeded, even if splicing of video sequences occurs within a
284	   session.  It further includes constraint fields (most of which are
285	   flags), which can optionally be set to indicate that the video
286	   bitstream will be constrained in the use of certain features as
287	   indicated by the values of those fields.  With this, a bitstream can
288	   be labelled as not using certain tools, which allows among other
289	   things for resource allocation in a decoder implementation.

291	   Video parameter set

293	   The video parameter set (VPS) pertains to a coded video sequence
294	   (CVS) of multiple layers covering the same range of access units, and
295	   includes, among other information, decoding dependency expressed as
296	   information for reference picture list construction of enhancement
297	   layers.  The VPS provides a "big picture" of a scalable sequence,
298	   including what types of operation points are provided, the profile,
299	   tier, and level of the operation points, and some other high-level
300	   properties of the bitstream that can be used as the basis for session
301	   negotiation and content selection, etc.  One VPS may be referenced by
302	   one or more sequence parameter sets.

304	   Sequence parameter set

306	   The sequence parameter set (SPS) contains syntax elements pertaining
307	   to a coded layer video sequence (CLVS), which is a group of pictures
308	   belonging to the same layer, starting with a random access point, and
309	   followed by pictures that may depend on each other, until the next
310	   random access point picture.  In MPEG-2, the equivalent of a CVS was
311	   a group of pictures (GOP), which normally started with an I frame and
312	   was followed by P and B frames.  While more complex in its options of
313	   random access points, VVC retains this basic concept.  One remarkable
314	   difference of VVC is that a CLVS may start with a Gradual Decoding
315	   Refresh (GDR) picture, without requiring the presence of traditional
316	   random access points in the bitstream, such as instantaneous decoding
317	   refresh (IDR) or clean random access (CRA) pictures.  In many TV-like
318	   applications, a CVS contains a few hundred milliseconds to a few
319	   seconds of video.  In video conferencing (without switching MCUs
320	   involved), a CVS can be as long in duration as the whole session.

322	   Picture and adaptation parameter set

324	   The picture parameter set and the adaptation parameter set (PPS and
325	   APS, respectively) carry information pertaining to zero or more
326	   pictures and zero or more slices, respectively.  The PPS contains
327	   information that is likely to stay constant from picture to picture,
328	   at least for pictures of a certain type, whereas the APS contains
329	   information, such as adaptive loop filter coefficients, that is
330	   likely to change from picture to picture or even within a picture.  A
331	   single APS is referenced by all slices of the same picture if that
332	   APS contains information about luma mapping with chroma scaling
333	   (LMCS) or scaling list.  Different APSs containing ALF parameters can
334	   be referenced by slices of the same picture.

336	   Picture header

338	   A Picture Header contains information that is common to all slices
339	   that belong to the same picture.  Being able to send that information
340	   as a separate NAL unit when pictures are split into several slices
341	   allows for saving bitrate, compared to repeating the same information
342	   in all slices.  However, there might be scenarios where low-bitrate
343	   video is transmitted using a single slice per picture.  Having a
344	   separate NAL unit to convey that information incurs an overhead
345	   in those cases.  For such scenarios, the picture header syntax
346	   structure is directly included in the slice header, instead of in its
347	   own NAL unit.  The mode of the picture header syntax structure being
348	   included in its own NAL unit or not can only be switched on/off for
349	   an entire CLVS, and can only be switched off when in the entire CLVS
350	   each picture contains only one slice.

352	   Profile, tier, and level

354	   The profile, tier and level syntax structures in DCI, VPS and SPS
355	   contain profile, tier, level information for all layers that refer to
356	   the DCI, for layers associated with one or more output layer sets
357	   specified by the VPS, and for any layer that refers to the SPS,
358	   respectively.

360	   Sub-profiles

362	   Within the VVC specification, a sub-profile is a 32-bit number, coded
363	   according to ITU-T Rec. T.35, that does not carry any semantics.  It is
364	   carried in the profile_tier_level structure and hence (potentially)
365	   present in the DCI, VPS, and SPS.  External registration bodies can
366	   register a T.35 codepoint with ITU-T registration authorities and
367	   associate with their registration a description of bitstream
368	   restrictions beyond the profiles defined by ITU-T and ISO/IEC.  This
369	   would allow encoder manufacturers to label the bitstreams generated
370	   by their encoder as complying with that sub-profile.  It is expected
371	   that upstream standardization organizations (such as DVB and ATSC),
372	   as well as walled-garden video services will take advantage of this
373	   labelling system.  In contrast to "normal" profiles, it is expected
374	   that sub-profiles may indicate encoder choices traditionally left
375	   open in the (decoder-centric) video coding specs, such as GOP
376	   structures, minimum/maximum QP values, and the mandatory use of
377	   certain tools or SEI messages.

379	   General constraint fields

381	   The profile_tier_level structure carries a considerable number of
382	   constraint fields (most of which are flags), which an encoder can use
383	   to indicate to a decoder that it will not use a certain tool or
384	   technology.  They were included in reaction to a perceived market
385	   need for labelling a bitstream as not exercising a certain tool that
386	   has become commercially unviable.

388	   Temporal scalability support

390	   VVC includes support of temporal scalability, by inclusion of the
391	   signaling of TemporalId in the NAL unit header, the restriction that
392	   pictures of a particular temporal sublayer cannot be used for inter
393	   prediction reference by pictures of a lower temporal sublayer, the
394	   sub-bitstream extraction process, and the requirement that each sub-
395	   bitstream extraction output be a conforming bitstream.  Media-Aware
396	   Network Elements (MANEs) can utilize the TemporalId in the NAL unit
397	   header for stream adaptation purposes based on temporal scalability.
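
   The stream adaptation described above can be sketched as a filter
   over NAL units.  The sketch assumes the two-byte VVC NAL unit header
   layout (1-bit F, 1-bit Z, 6-bit LayerId, 5-bit Type, 3-bit TID, where
   TID carries TemporalId plus 1) that Section 1.1.4 covers.  The
   function names are hypothetical, and the input is a list of bare NAL
   units; a real MANE would also have to handle aggregation packets and
   fragmentation units.

```python
def parse_nal_header(nal: bytes):
    """Return (layer_id, nal_unit_type, temporal_id) from a VVC NAL unit.

    Assumes the two-byte header layout F(1) Z(1) LayerId(6) | Type(5) TID(3).
    """
    if len(nal) < 2:
        raise ValueError("NAL unit shorter than its two-byte header")
    layer_id = nal[0] & 0x3F
    nal_unit_type = nal[1] >> 3
    tid_plus1 = nal[1] & 0x07
    if tid_plus1 == 0:
        raise ValueError("TID field (TemporalId plus 1) must not be zero")
    return layer_id, nal_unit_type, tid_plus1 - 1


def keep_up_to_temporal_id(nal_units, max_temporal_id):
    """Drop every NAL unit whose TemporalId exceeds max_temporal_id."""
    return [
        nal for nal in nal_units
        if parse_nal_header(nal)[2] <= max_temporal_id
    ]


# Hypothetical stream: three NAL units of type 1 with TemporalId 0, 1, 2.
stream = [bytes([0x00, (1 << 3) | (tid + 1)]) + b"\x80" for tid in range(3)]
reduced = keep_up_to_temporal_id(stream, max_temporal_id=1)
print(len(stream), "->", len(reduced))
```

   Dropping only the higher sublayers is safe in this model because, as
   stated above, pictures of a lower temporal sublayer never reference
   pictures of a higher one.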

399	   Reference picture resampling (RPR)

401	   In AVC and HEVC, the spatial resolution of pictures cannot change
402	   unless a new sequence using a new SPS starts, with an IRAP picture.
403	   VVC enables picture resolution change within a sequence at a position
404	   without encoding an IRAP picture, which is always intra-coded.  This
405	   feature is sometimes referred to as reference picture resampling
406	   (RPR), as the feature needs resampling of a reference picture used
407	   for inter prediction when that reference picture has a different
408	   resolution than the current picture being decoded.  RPR allows
409	   resolution change without the need of coding an IRAP picture, which
410	   causes a momentary bit rate spike in streaming or video conferencing
411	   scenarios, e.g., to cope with network condition changes.  RPR can
412	   also be used in application scenarios wherein zooming of the entire
413	   video region or some region of interest is needed.

415	   Spatial, SNR, and multiview scalability

417	   VVC includes support for spatial, SNR, and multiview scalability.
418	   Scalable video coding is widely considered to have technical benefits
419	   and enrich services for various video applications.  Until recently,
420	   however, the functionality had not been included in the first version
421	   of video codec specifications.  In VVC, by contrast, all those
422	   forms of scalability are supported in the first version of VVC
423	   natively through the signaling of the layer_id in the NAL unit
424	   header, the VPS which associates layers with given layer_ids to each
425	   other, reference picture selection, reference picture resampling for
426	   spatial scalability, and a number of other mechanisms not relevant
427	   for this memo.

429	      Spatial scalability
430	         With the existence of Reference Picture Resampling (RPR), the
431	         additional burden for scalability support is just a
432	         modification of the high-level syntax (HLS).  The inter-layer
433	         prediction is employed in a scalable system to improve the
434	         coding efficiency of the enhancement layers.  In addition to
435	         the spatial and temporal motion-compensated predictions that
436	         are available in a single-layer codec, the inter-layer
437	         prediction in VVC uses the possibly resampled video data of the
438	         reconstructed reference picture from a reference layer to
439	         predict the current enhancement layer.  The resampling process
440	         for inter-layer prediction, when used, is performed at the
441	         block-level, reusing the existing interpolation process for
442	         motion compensation in single-layer coding.  It means that no
443	         additional resampling process is needed to support spatial
444	         scalability.

446	      SNR scalability

448	         SNR scalability is similar to spatial scalability except that
449	         the resampling factors are 1:1.  In other words, there is no
450	         change in resolution, but there is inter-layer prediction.

452	      Multiview scalability

454	         The first version of VVC also supports multiview scalability,
455	         wherein a multi-layer bitstream carries layers representing
456	         multiple views, and one or more of the represented views can be
457	         output at the same time.

459	   SEI messages

461	   Supplemental enhancement information (SEI) messages are information
462	   in the bitstream that do not influence the decoding process as
463	   specified in the VVC spec, but address issues of representation/
464	   rendering of the decoded bitstream, label the bitstream for certain
465	   applications, among other, similar tasks.  The overall concept of SEI
466	   messages and many of the messages themselves have been inherited from
467	   the H.264 and HEVC specs.  Except for the SEI messages that affect
468	   the specification of the hypothetical reference decoder (HRD), other
469	   SEI messages for use in the VVC environment, which are generally
470	   useful also in other video coding technologies, are not included in
471	   the main VVC specification but in a companion specification [VSEI].

473	1.1.3.  High-Level Picture Partitioning (informative)

475	   VVC inherited the concept of tiles and wavefront parallel processing
476	   (WPP) from HEVC, with some minor to moderate differences.  The basic
477	   concept of slices was kept in VVC but designed in an essentially
478	   different form.  VVC is the first video coding standard that includes
479	   subpictures as a feature, which provides the same functionality as
480	   HEVC motion-constrained tile sets (MCTSs) but designed differently to
481	   have better coding efficiency and to be friendlier for usage in
482	   application systems.  More details of these differences are described
483	   below.

485	   Tiles and WPP

487	   As in HEVC, a picture can be split into tile rows and tile
488	   columns in VVC, in-picture prediction across tile boundaries is
489	   disallowed, etc.  However, the syntax for signaling of tile
490	   partitioning has been simplified, by using a unified syntax design
491	   for both the uniform and the non-uniform mode.  In addition,
492	   signaling of entry point offsets for tiles in the slice header is
493	   optional in VVC while it is mandatory in HEVC.  The WPP design in VVC
494	   has two differences compared to HEVC: i) The CTU row delay is reduced
495	   from two CTUs to one CTU; ii) Signaling of entry point offsets for
496	   WPP in the slice header is optional in VVC while it is mandatory in
497	   HEVC.
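
   The effect of the shorter CTU row delay can be illustrated with an
   idealized scheduling model.  The sketch assumes that every CTU takes
   one time step, that one thread is available per CTU row, and that a
   CTU may start once its left neighbor and the CTU `delay - 1` columns
   to the right in the row above have finished.  These assumptions and
   the function name are for illustration only and are not taken from
   either specification.

```python
def wpp_total_steps(rows, cols, delay):
    """Return the time steps needed to process a rows x cols CTU grid.

    `delay` is the CTU row delay: 2 models the HEVC design and 1 the
    VVC design described in the text.
    """
    finish = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            left = finish[r][c - 1] if c > 0 else 0
            if r > 0:
                # CTU in the row above that must be done first, clamped
                # to the last column of the grid.
                above = finish[r - 1][min(c + delay - 1, cols - 1)]
            else:
                above = 0
            finish[r][c] = 1 + max(left, above)
    return finish[rows - 1][cols - 1]


# Hypothetical 4-row by 8-column CTU grid.
print("two-CTU delay:", wpp_total_steps(4, 8, delay=2))
print("one-CTU delay:", wpp_total_steps(4, 8, delay=1))
```

   In this model the 4x8 grid needs 14 steps with a two-CTU delay and 11
   steps with a one-CTU delay, so each additional CTU row starts one
   step earlier under the shorter delay.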

499	   Slices

501	   In VVC, the conventional slices based on CTUs (as in HEVC) or
502	   macroblocks (as in AVC) have been removed.  The main reasoning behind
503	   this architectural change is as follows.  The advances in video
504	   coding since 2003 (the publication year of AVC v1) have been such
505	   that slice-based error concealment has become practically impossible,
506	   due to the ever-increasing number and efficiency of in-picture and
507	   inter-picture prediction mechanisms.  An error-concealed picture is
508	   the decoding result of a transmitted coded picture for which either
509	   some data of the coded picture was lost (e.g., some slices) or a
510	   reference picture for at least some part of the coded picture is not
511	   error-free (e.g., that reference picture was an error-concealed
512	   picture).  For example, when one of the multiple slices of a picture
513	   is lost, it may be error-concealed using an interpolation of the
514	   neighboring slices.  While advanced video coding prediction
515	   mechanisms provide significantly higher coding efficiency, they also
516	   make it harder for machines to estimate the quality of an error-
517	   concealed picture, which was already a hard problem with the use of
518	   simpler prediction mechanisms.  Advanced in-picture prediction
519	   mechanisms also cause the coding efficiency loss due to splitting a
520	   picture into multiple slices to be more significant.  Furthermore,
521	   network conditions have become significantly better while, at the
522	   same time, techniques for dealing with packet losses have improved
523	   significantly.  As a result, very few implementations have recently
524	   used slices for maximum transmission unit size matching.  Instead,
525	   substantially all applications where low-delay error resilience is
526	   required (e.g., video telephony and video conferencing) rely on
527	   system/transport-level error resilience (e.g., retransmission,
528	   forward error correction) and/or picture-based error resilience tools
529	   (feedback-based error resilience, insertion of IRAPs, scalability
530	   with higher protection level of the base layer, and so on).
531	   Considering all the above, nowadays it is very rare that a picture
532	   that cannot be correctly decoded is passed to the decoder, and when
533	   such a rare case occurs, the system can afford to wait for an error-
534	   free picture to be decoded and available for display without
535	   resulting in frequent and long periods of picture freezing seen by
536	   end users.

538	   Slices in VVC have two modes: rectangular slices and raster-scan
539	   slices.  The rectangular slice, as indicated by its name, covers a
540	   rectangular region of the picture.  Typically, a rectangular slice
541	   consists of several complete tiles.  However, it is also possible
542	   that a rectangular slice is a subset of a tile and consists of one or
543	   more consecutive, complete CTU rows within a tile.  A raster-scan
544	   slice consists of one or more complete tiles in a tile raster scan
545	   order; hence, the region covered by a raster-scan slice need not be
546	   rectangular, although it may happen to have
547	   the shape of a rectangle.  The concept of slices in VVC is therefore
548	   strongly linked to or based on tiles instead of CTUs (as in HEVC) or
549	   macroblocks (as in AVC).

551	   Subpictures

553	   VVC is the first video coding standard that includes the support of
554	   subpictures as a feature.  Each subpicture consists of one or more
555	   complete rectangular slices that collectively cover a rectangular
556	   region of the picture.  A subpicture may be either specified to be
557	   extractable (i.e., coded independently of other subpictures of the
558	   same picture and of earlier pictures in decoding order) or not
559	   extractable.  Regardless of whether a subpicture is extractable or
560	   not, the encoder can control whether in-loop filtering (including
561	   deblocking, SAO, and ALF) is applied across the subpicture boundaries
562	   individually for each subpicture.

564	   Functionally, subpictures are similar to the motion-constrained tile
565	   sets (MCTSs) in HEVC.  They both allow independent coding and
566	   extraction of a rectangular subset of a sequence of coded pictures,
567	   for use cases like viewport-dependent 360-degree video streaming
568	   optimization and region of interest (ROI) applications.

570	   There are several important design differences between subpictures
571	   and MCTSs.  First, the subpicture feature in VVC allows motion
572	   vectors of a coding block to point outside of the subpicture even
573	   when the subpicture is extractable; in this case, sample padding is
574	   applied at subpicture boundaries, in the same way as at picture
575	   boundaries.  Second, additional changes were introduced for the
576	   selection and derivation of motion vectors in the merge mode and in
577	   the decoder side motion vector refinement process of VVC.  This
578	   allows higher coding efficiency compared to the non-normative motion
579	   constraints applied at the encoder-side for MCTSs.  Third, rewriting
580	   of SHs (and PH NAL units, when present) is not needed when extracting
581	   one or more extractable subpictures from a sequence of pictures to
582	   create a sub-bitstream that is a conforming bitstream.  In sub-
583	   bitstream extractions based on HEVC MCTSs, rewriting of SHs is
584	   needed.  Note that in both HEVC MCTSs extraction and VVC subpictures
585	   extraction, rewriting of SPSs and PPSs is needed.  However, typically
586	   there are only a few parameter sets in a bitstream, while each
587	   picture has at least one slice; therefore, rewriting of SHs can be a
588	   significant burden for application systems.  Fourth, slices of
589	   different subpictures within a picture are allowed to have different
590	   NAL unit types.  Fifth, VVC specifies HRD and level definitions for
591	   subpicture sequences, thus the conformance of the sub-bitstream of
592	   each extractable subpicture sequence can be ensured by encoders.

594	1.1.4.  NAL Unit Header

596	   VVC maintains the NAL unit concept of HEVC with modifications.  VVC
597	   uses a two-byte NAL unit header, as shown in Figure 1.  The payload
598	   of a NAL unit refers to the NAL unit excluding the NAL unit header.

600	                     +---------------+---------------+
601	                     |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
602	                     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
603	                     |F|Z| LayerId   |  Type   | TID |
604	                     +---------------+---------------+

606	                   The Structure of the VVC NAL Unit Header.

608	                                  Figure 1

610	   The semantics of the fields in the NAL unit header are as specified
611	   in VVC and described briefly below for convenience.  In addition to
612	   the name and size of each field, the corresponding syntax element
613	   name in VVC is also provided.

615	   F: 1 bit
616	      forbidden_zero_bit.  Required to be zero in VVC.  Note that the
617	      inclusion of this bit in the NAL unit header was to enable
618	      transport of VVC video over MPEG-2 transport systems (avoidance of
619	      start code emulations) [MPEG2S].  In the context of this memo the
620	      value 1 may be used to indicate a syntax violation, e.g., for a
621	      NAL unit that results from aggregating a number of fragmentation
622	      units of a NAL unit but is missing the last fragment, as described
623	      in Section TBD.

625	   Z: 1 bit

627	      nuh_reserved_zero_bit.  Required to be zero in VVC, and reserved
628	      for future extensions by ITU-T and ISO/IEC.

630	      This memo does not overload the "Z" bit for local extensions, as
631	      a) overloading the "F" bit is sufficient and b) to preserve the
632	      usefulness of this memo to possible future versions of [VVC].

634	   LayerId: 6 bits

636	      nuh_layer_id.  Identifies the layer a NAL unit belongs to, wherein
637	      a layer may be, e.g., a spatial scalable layer, a quality scalable
638	      layer, a layer containing a different view, etc.

640	   Type: 5 bits

642	      nal_unit_type.  This field specifies the NAL unit type as defined
643	      in Table 5 of [VVC].  For a reference of all currently defined NAL
644	      unit types and their semantics, please refer to Section 7.4.2.2 in
645	      [VVC].

647	   TID: 3 bits

649	      nuh_temporal_id_plus1.  This field specifies the temporal
650	      identifier of the NAL unit plus 1.  The value of TemporalId is
651	      equal to TID minus 1.  A TID value of 0 is illegal to ensure that
652	      there is at least one bit in the NAL unit header equal to 1, so as
653	      to enable independent considerations of start code emulations in
654	      the NAL unit header and in the NAL unit payload data.
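
	   The field layout in Figure 1 can be illustrated with a short,
	   non-normative Python sketch.  The names NalHeader and
	   parse_nal_header are hypothetical and chosen for this example
	   only; they are not defined by [VVC] or by this memo.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    """Fields of the two-byte VVC NAL unit header (hypothetical type)."""
    f: int         # forbidden_zero_bit, 1 bit
    z: int         # nuh_reserved_zero_bit, 1 bit
    layer_id: int  # nuh_layer_id, 6 bits
    nal_type: int  # nal_unit_type, 5 bits
    tid: int       # nuh_temporal_id_plus1, 3 bits


def parse_nal_header(data: bytes) -> NalHeader:
    """Split the first two bytes of a NAL unit into header fields."""
    if len(data) < 2:
        raise ValueError("a NAL unit header needs two bytes")
    b0, b1 = data[0], data[1]
    hdr = NalHeader(
        f=b0 >> 7,
        z=(b0 >> 6) & 0x01,
        layer_id=b0 & 0x3F,
        nal_type=b1 >> 3,
        tid=b1 & 0x07,
    )
    if hdr.tid == 0:
        # A TID value of 0 is illegal; TemporalId is TID minus 1.
        raise ValueError("TID value of 0 is illegal")
    return hdr
```

	   For example, the two bytes 0x00 0xE1 decode to LayerId 0, Type 28,
	   and TID 1.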

656	1.2.  Overview of the Payload Format

658	   This payload format defines the following processes required for
659	   transport of VVC coded data over RTP [RFC3550]:

661	   *  Usage of RTP header with this payload format
662	   *  Packetization of VVC coded NAL units into RTP packets using three
663	      types of payload structures: a single NAL unit packet, aggregation
664	      packet, and fragmentation unit

666	   *  Transmission of VVC NAL units of the same bitstream within a
667	      single RTP stream

669	   *  Media type parameters to be used with the Session Description
670	      Protocol (SDP) [RFC4566]

672	   *  Usage of RTCP feedback messages

674	2.  Conventions

676	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
677	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
678	   "OPTIONAL" in this document are to be interpreted as described in BCP
679	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
680	   capitals, as shown above.

682	3.  Definitions and Abbreviations

684	3.1.  Definitions

686	   This document uses the terms and definitions of VVC.  Section 3.1.1
687	   lists relevant definitions from [VVC] for convenience.  Section 3.1.2
688	   provides definitions specific to this memo.

690	3.1.1.  Definitions from the VVC Specification

692	   Access unit (AU): A set of PUs that belong to different layers and
693	   contain coded pictures associated with the same time for output from
694	   the DPB.

696	   Adaptation parameter set (APS): A syntax structure containing syntax
697	   elements that apply to zero or more slices as determined by zero or
698	   more syntax elements found in slice headers.

700	   Bitstream: A sequence of bits, in the form of a NAL unit stream or a
701	   byte stream, that forms the representation of a sequence of AUs
702	   forming one or more coded video sequences (CVSs).

704	   Coded picture: A coded representation of a picture comprising VCL NAL
705	   units with a particular value of nuh_layer_id within an AU and
706	   containing all CTUs of the picture.

708	   Clean random access (CRA) PU: A PU in which the coded picture is a
709	   CRA picture.

711	   Clean random access (CRA) picture: An IRAP picture for which each VCL
712	   NAL unit has nal_unit_type equal to CRA_NUT.

714	   Coded video sequence (CVS): A sequence of AUs that consists, in
715	   decoding order, of a CVSS AU, followed by zero or more AUs that are
716	   not CVSS AUs, including all subsequent AUs up to but not including
717	   any subsequent AU that is a CVSS AU.

719	   Coded video sequence start (CVSS) AU: An AU in which there is a PU
720	   for each layer in the CVS and the coded picture in each PU is a CLVSS
721	   picture.

723	   Coded layer video sequence (CLVS): A sequence of PUs with the same
724	   value of nuh_layer_id that consists, in decoding order, of a CLVSS
725	   PU, followed by zero or more PUs that are not CLVSS PUs, including
726	   all subsequent PUs up to but not including any subsequent PU that is
727	   a CLVSS PU.

729	   Coded layer video sequence start (CLVSS) PU: A PU in which the coded
730	   picture is a CLVSS picture.

732	   Coded layer video sequence start (CLVSS) picture: A coded picture
733	   that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or
734	   a GDR picture with NoOutputBeforeRecoveryFlag equal to 1.

736	   Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs
737	   of chroma samples of a picture that has three sample arrays, or a CTB
738	   of samples of a monochrome picture or a picture that is coded using
739	   three separate colour planes and syntax structures used to code the
740	   samples.

742	   Decoding Capability Information (DCI): A syntax structure containing
743	   syntax elements that apply to the entire bitstream.

745	   Decoded picture buffer (DPB): A buffer holding decoded pictures for
746	   reference, output reordering, or output delay specified for the
747	   hypothetical reference decoder.

749	   Gradual decoding refresh (GDR) picture: A picture for which each VCL
750	   NAL unit has nal_unit_type equal to GDR_NUT.

752	   Instantaneous decoding refresh (IDR) PU: A PU in which the coded
753	   picture is an IDR picture.

755	   Instantaneous decoding refresh (IDR) picture: An IRAP picture for
756	   which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or
757	   IDR_N_LP.

759	   Intra random access point (IRAP) AU: An AU in which there is a PU for
760	   each layer in the CVS and the coded picture in each PU is an IRAP
761	   picture.

763	   Intra random access point (IRAP) PU: A PU in which the coded picture
764	   is an IRAP picture.

766	   Intra random access point (IRAP) picture: A coded picture for which
767	   all VCL NAL units have the same value of nal_unit_type in the range
768	   of IDR_W_RADL to CRA_NUT, inclusive.

770	   Layer: A set of VCL NAL units that all have a particular value of
771	   nuh_layer_id and the associated non-VCL NAL units.

773	   Network abstraction layer (NAL) unit: A syntax structure containing
774	   an indication of the type of data to follow and bytes containing that
775	   data in the form of an RBSP interspersed as necessary with emulation
776	   prevention bytes.

778	   Network abstraction layer (NAL) unit stream: A sequence of NAL units.

780	   Operation point (OP): A temporal subset of an OLS, identified by an
781	   OLS index and a highest value of TemporalId.

783	   Picture parameter set (PPS): A syntax structure containing syntax
784	   elements that apply to zero or more entire coded pictures as
785	   determined by a syntax element found in each slice header.

787	   Picture unit (PU): A set of NAL units that are associated with each
788	   other according to a specified classification rule, are consecutive
789	   in decoding order, and contain exactly one coded picture.

791	   Random access: The act of starting the decoding process for a
792	   bitstream at a point other than the beginning of the stream.

794	   Sequence parameter set (SPS): A syntax structure containing syntax
795	   elements that apply to zero or more entire CLVSs as determined by the
796	   content of a syntax element found in the PPS referred to by a syntax
797	   element found in each picture header.

799	   Slice: An integer number of complete tiles or an integer number of
800	   consecutive complete CTU rows within a tile of a picture that are
801	   exclusively contained in a single NAL unit.

803	   Slice header (SH): A part of a coded slice containing the data
804	   elements pertaining to all tiles or CTU rows within a tile
805	   represented in the slice.

807	   Sublayer: A temporal scalable layer of a temporal scalable bitstream
808	   consisting of VCL NAL units with a particular value of the TemporalId
809	   variable, and the associated non-VCL NAL units.

811	   Subpicture: A rectangular region of one or more slices within a
812	   picture.

814	   Sublayer representation: A subset of the bitstream consisting of NAL
815	   units of a particular sublayer and the lower sublayers.

817	   Tile: A rectangular region of CTUs within a particular tile column
818	   and a particular tile row in a picture.

820	   Tile column: A rectangular region of CTUs having a height equal to
821	   the height of the picture and a width specified by syntax elements in
822	   the picture parameter set.

824	   Tile row: A rectangular region of CTUs having a height specified by
825	   syntax elements in the picture parameter set and a width equal to the
826	   width of the picture.

828	   Video coding layer (VCL) NAL unit: A collective term for coded slice
829	   NAL units and the subset of NAL units that have reserved values of
830	   nal_unit_type that are classified as VCL NAL units in this
831	   Specification.

833	3.1.2.  Definitions Specific to This Memo

835	   Media-Aware Network Element (MANE): A network element, such as a
836	   middlebox, selective forwarding unit, or application-layer gateway
837	   that is capable of parsing certain aspects of the RTP payload headers
838	   or the RTP payload and reacting to their contents.

840	      Informative note: The concept of a MANE goes beyond normal routers
841	      or gateways in that a MANE has to be aware of the signaling (e.g.,
842	      to learn about the payload type mappings of the media streams),
843	      and in that it has to be trusted when working with Secure RTP
844	      (SRTP).  The advantage of using MANEs is that they allow packets
845	      to be dropped according to the needs of the media coding.  For
846	      example, if a MANE has to drop packets due to congestion on a
847	      certain link, it can identify and remove those packets whose
848	      elimination produces the least adverse effect on the user
849	      experience.  After dropping packets, MANEs must rewrite RTCP
850	      packets to match the changes to the RTP stream, as specified in
851	      Section 7 of [RFC3550].

853	   NAL unit decoding order: A NAL unit order that conforms to the
854	   constraints on NAL unit order given in Section 7.4.2.4 of [VVC] and
855	   follows the order of NAL units in the bitstream.

857	   RTP stream (See [RFC7656]): Within the scope of this memo, one RTP
858	   stream is utilized to transport a VVC bitstream, which may contain
859	   one or more layers, and each layer may contain one or more temporal
860	   sublayers.

862	   Transmission order: The order of packets in ascending RTP sequence
863	   number order (in modulo arithmetic).  Within an aggregation packet,
864	   the NAL unit transmission order is the same as the order of
865	   appearance of NAL units in the packet.
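
	   Ascending order "in modulo arithmetic" can be evaluated with a
	   serial-number style comparison of the 16-bit sequence numbers.
	   The Python sketch below shows one common approach; it is not
	   mandated by this memo.  The function name seq_precedes and the
	   half-range threshold are assumptions of this example.

```python
def seq_precedes(a: int, b: int) -> bool:
    """Return True if 16-bit RTP sequence number a comes before b.

    Assumption of this sketch: two packets are never more than half
    the sequence number space (32768) apart.
    """
    a &= 0xFFFF
    b &= 0xFFFF
    return a != b and ((b - a) & 0xFFFF) < 0x8000
```

	   With this comparison, sequence number 65535 precedes sequence
	   number 0, which reflects the wrap-around of the 16-bit field.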

867	3.2.  Abbreviations

869	   AU         Access Unit

871	   AP         Aggregation Packet

873	   APS        Adaptation Parameter Set

875	   CTU        Coding Tree Unit

877	   CVS        Coded Video Sequence

879	   DPB        Decoded Picture Buffer

881	   DCI        Decoding Capability Information

883	   DON        Decoding Order Number

885	   FIR        Full Intra Request

887	   FU         Fragmentation Unit

889	   GDR        Gradual Decoding Refresh

891	   HRD        Hypothetical Reference Decoder

893	   IDR        Instantaneous Decoding Refresh

895	   MANE       Media-Aware Network Element

897	   MTU        Maximum Transmission Unit

899	   NAL        Network Abstraction Layer
900	   NALU       Network Abstraction Layer Unit

902	   PLI        Picture Loss Indication

904	   PPS        Picture Parameter Set

906	   RPS        Reference Picture Set

908	   RPSI       Reference Picture Selection Indication

910	   SEI        Supplemental Enhancement Information

912	   SLI        Slice Loss Indication

914	   SPS        Sequence Parameter Set

916	   VCL        Video Coding Layer

918	   VPS        Video Parameter Set

920	4.  RTP Payload Format

922	4.1.  RTP Header Usage

924	   The format of the RTP header is specified in [RFC3550] (reprinted as
925	   Figure 2 for convenience).  This payload format uses the fields of
926	   the header in a manner consistent with that specification.

928	   The RTP payload (and the settings for some RTP header bits) for
929	   aggregation packets and fragmentation units are specified in
930	   Section 4.3.2 and Section 4.3.3, respectively.

932	       0                   1                   2                   3
933	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
934	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
935	      |V=2|P|X|  CC   |M|     PT      |       sequence number         |
936	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
937	      |                           timestamp                           |
938	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
939	      |           synchronization source (SSRC) identifier            |
940	      +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
941	      |            contributing source (CSRC) identifiers             |
942	      |                             ....                              |
943	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

945	                         RTP Header According to [RFC3550]

947	                                  Figure 2

949	   The RTP header fields are set according to this RTP payload format
950	   as follows:

952	   Marker bit (M): 1 bit

954	      Set for the last packet, in transmission order, among each set of
955	      packets that contain NAL units of one access unit.  This is in
956	      line with the normal use of the M bit in video formats to allow
957	      efficient playout buffer handling.

959	   Payload Type (PT): 7 bits

961	      The assignment of an RTP payload type for this new packet format
962	      is outside the scope of this document and will not be specified
963	      here.  The assignment of a payload type has to be performed either
964	      through the profile used or in a dynamic way.

966	   Sequence Number (SN): 16 bits

968	      Set and used in accordance with [RFC3550].

970	   Timestamp: 32 bits

972	      The RTP timestamp is set to the sampling timestamp of the content.
973	      A 90 kHz clock rate MUST be used.  If the NAL unit has no timing
974	      properties of its own (e.g., parameter set and SEI NAL units), the
975	      RTP timestamp MUST be set to the RTP timestamp of the coded
976	      pictures of the access unit in which the NAL unit (according to
977	      Section 7.4.2.4 of [VVC]) is included.  Receivers MUST use the RTP
978	      timestamp for the display process, even when the bitstream
979	      contains picture timing SEI messages or decoding unit information
980	      SEI messages as specified in [VVC].

982	   Synchronization source (SSRC): 32 bits

984	      Used to identify the source of the RTP packets.  A single SSRC is
985	      used for all parts of a single bitstream.
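
	   The 90 kHz timestamp rule above can be illustrated with a small,
	   non-normative Python sketch.  It assumes that the sampling
	   instant is available as a number of seconds since the start of
	   the stream and that the sender has picked an initial timestamp
	   offset; the names rtp_timestamp, sample_time, and
	   initial_timestamp are hypothetical.

```python
from fractions import Fraction

RTP_CLOCK_HZ = 90_000  # clock rate required by this payload format


def rtp_timestamp(sample_time: Fraction, initial_timestamp: int = 0) -> int:
    """Map a sampling instant (seconds) to a 32-bit RTP timestamp.

    sample_time and initial_timestamp are assumptions of this sketch:
    the former is measured from the start of the stream, the latter is
    the sender-chosen timestamp of that starting instant.
    """
    ticks = round(sample_time * RTP_CLOCK_HZ)
    return (initial_timestamp + ticks) % (1 << 32)
```

	   Under these assumptions, consecutive pictures of a 30 Hz sequence
	   are 3000 ticks apart, and pictures of a 30000/1001 Hz sequence are
	   3003 ticks apart.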

987	4.2.  Payload Header Usage

989	   The first two bytes of the payload of an RTP packet are referred to
990	   as the payload header.  The payload header consists of the same
991	   fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown
992	   in Section 1.1.4, irrespective of the type of the payload structure.

994	   The TID value indicates (among other things) the relative importance
995	   of an RTP packet, for example, because NAL units belonging to higher
996	   temporal sublayers are not used for the decoding of lower temporal
997	   sublayers.  A lower value of TID indicates a higher importance.
998	   More-important NAL units MAY be better protected against transmission
999	   losses than less-important NAL units.

1001	      For Discussion: quite possibly something similar can be said for
1002	      the Layer_id in layered coding, but perhaps not in multiview
1003	      coding.  (The relevant part of the spec is relatively new,
1004	      therefore the soft language).  However, for serious layer pruning,
1005	      interpretation of the VPS is required.  We can add language about
1006	      the need for stateful interpretation of LayerID vis-a-vis
1007	      stateless interpretation of TID later.

1009	4.3.  Payload Structures

1011	   Three different types of RTP packet payload structures are specified.
1012	   A receiver can identify the type of an RTP packet payload through the
1013	   Type field in the payload header.

1015	   The three different payload structures are as follows:

1017	   *  Single NAL unit packet: Contains a single NAL unit in the payload,
1018	      and the NAL unit header of the NAL unit also serves as the payload
1019	      header.  This payload structure is specified in Section 4.3.1.

1021	   *  Aggregation Packet (AP): Contains more than one NAL unit within
1022	      one access unit.  This payload structure is specified in
1023	      Section 4.3.2.

1025	   *  Fragmentation Unit (FU): Contains a subset of a single NAL unit.
1026	      This payload structure is specified in Section 4.3.3.

1028	4.3.1.  Single NAL Unit Packets

1030	   A single NAL unit packet contains exactly one NAL unit, and consists
1031	   of a payload header (denoted as PayloadHdr), a conditional 16-bit
1032	   DONL field (in network byte order), and the NAL unit payload data
1033	   (the NAL unit excluding its NAL unit header) of the contained NAL
1034	   unit, as shown in Figure 3.

1036	      0                   1                   2                   3
1037	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1038	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1039	     |           PayloadHdr          |      DONL (conditional)       |
1040	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1041	     |                                                               |
1042	     |                  NAL unit payload data                        |
1043	     |                                                               |
1044	     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1045	     |                               :...OPTIONAL RTP padding        |
1046	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1048	                  The Structure of a Single NAL Unit Packet

1050	                                  Figure 3

1052	   The DONL field, when present, specifies the value of the 16 least
1053	   significant bits of the decoding order number of the contained NAL
1054	   unit.  If sprop-max-don-diff is greater than 0, the DONL field MUST
1055	   be present, and the variable DON for the contained NAL unit is
1056	   derived as equal to the value of the DONL field.  Otherwise (sprop-
1057	   max-don-diff is equal to 0), the DONL field MUST NOT be present.
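
	   As a non-normative illustration, the following Python sketch
	   recovers the NAL unit and its DON from the payload of a single
	   NAL unit packet.  It assumes that any RTP padding has already
	   been removed and that the value of sprop-max-don-diff is known
	   from signaling; the function name parse_single_nal_packet is
	   hypothetical.

```python
import struct
from typing import Optional, Tuple


def parse_single_nal_packet(
    payload: bytes, sprop_max_don_diff: int = 0
) -> Tuple[Optional[int], bytes]:
    """Return (DON, NAL unit) for a single NAL unit packet payload.

    DON is None when the DONL field is absent.  The payload header
    doubles as the NAL unit header, so it is kept in the NAL unit.
    """
    if len(payload) < 2:
        raise ValueError("payload shorter than the payload header")
    if sprop_max_don_diff > 0:
        if len(payload) < 4:
            raise ValueError("payload too short for the DONL field")
        # DONL is a 16-bit field in network byte order after PayloadHdr.
        (don,) = struct.unpack_from("!H", payload, 2)
        return don, payload[:2] + payload[4:]
    return None, payload
```

	   When sprop-max-don-diff is equal to 0, the payload is the NAL unit
	   itself; otherwise, the two DONL bytes are removed to obtain it.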

1059	4.3.2.  Aggregation Packets (APs)

1061	   Aggregation Packets (APs) can reduce packetization overhead for small
1062	   NAL units, such as most of the non-VCL NAL units, which are often
1063	   only a few octets in size.

1065	   An AP aggregates NAL units of one access unit.  Each NAL unit to be
1066	   carried in an AP is encapsulated in an aggregation unit.  NAL units
1067	   aggregated in one AP are included in NAL unit decoding order.

1069	   An AP consists of a payload header (denoted as PayloadHdr) followed
1070	   by two or more aggregation units, as shown in Figure 4.

1072	     0                   1                   2                   3
1073	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1074	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1075	    |    PayloadHdr (Type=28)       |                               |
1076	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
1077	    |                                                               |
1078	    |             two or more aggregation units                     |
1079	    |                                                               |
1080	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1081	    |                               :...OPTIONAL RTP padding        |
1082	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1084	                   The Structure of an Aggregation Packet

1086	                                  Figure 4

1088	   The fields in the payload header of an AP are set as follows.  The F
1089	   bit MUST be equal to 0 if the F bit of each aggregated NAL unit is
1090	   equal to zero; otherwise, it MUST be equal to 1.  The Type field MUST
1091	   be equal to 28.

1093	   The value of LayerId MUST be equal to the lowest value of LayerId of
1094	   all the aggregated NAL units.  The value of TID MUST be the lowest
1095	   value of TID of all the aggregated NAL units.

1097	      Informative note: All VCL NAL units in an AP have the same TID
1098	      value since they belong to the same access unit.  However, an AP
1099	      may contain non-VCL NAL units for which the TID value in the NAL
1100	      unit header may be different than the TID value of the VCL NAL
1101	      units in the same AP.
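
	   The rules above for the F, LayerId, Type, and TID fields of an AP
	   payload header can be sketched in Python as follows.  This is a
	   non-normative example; the function name ap_payload_header is
	   hypothetical, and each input is assumed to be a complete NAL unit
	   that starts with its two-byte NAL unit header.

```python
from typing import Sequence

AP_TYPE = 28  # Type value of an aggregation packet


def ap_payload_header(nal_units: Sequence[bytes]) -> bytes:
    """Build the two-byte PayloadHdr for an AP carrying nal_units."""
    if len(nal_units) < 2:
        raise ValueError("an AP carries at least two NAL units")
    if any(len(n) < 2 for n in nal_units):
        raise ValueError("every NAL unit needs a two-byte header")
    # F is 0 only if the F bit of every aggregated NAL unit is 0.
    f = 1 if any(n[0] & 0x80 for n in nal_units) else 0
    # LayerId and TID are the lowest values among the aggregated units.
    layer_id = min(n[0] & 0x3F for n in nal_units)
    tid = min(n[1] & 0x07 for n in nal_units)
    # The Z bit is left at zero.
    return bytes([(f << 7) | layer_id, (AP_TYPE << 3) | tid])
```

	   For two NAL units with headers 0x00 0x79 and 0x01 0x0A, the sketch
	   yields the payload header 0x00 0xE1 (LayerId 0, Type 28, TID 1).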

1103	   An AP MUST carry at least two aggregation units and can carry as many
1104	   aggregation units as necessary; however, the total amount of data in
1105	   an AP obviously MUST fit into an IP packet, and the size SHOULD be
1106	   chosen so that the resulting IP packet is smaller than the MTU size
1107	   so as to avoid IP layer fragmentation.  An AP MUST NOT contain FUs
1108	   specified in Section 4.3.3.  APs MUST NOT be nested; i.e., an AP
1109	   cannot contain another AP.

1111	   The first aggregation unit in an AP consists of a conditional 16-bit
1112	   DONL field (in network byte order) followed by a 16-bit unsigned size
1113	   information (in network byte order) that indicates the size of the
1114	   NAL unit in bytes (excluding these two octets, but including the NAL
1115	   unit header), followed by the NAL unit itself, including its NAL unit
1116	   header, as shown in Figure 5.

1118	     0                   1                   2                   3
1119	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1120	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1121	    |               :       DONL (conditional)      |   NALU size   |
1122	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1123	    |   NALU size   |                                               |
1124	    +-+-+-+-+-+-+-+-+         NAL unit                              |
1125	    |                                                               |
1126	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1127	    |                               :
1128	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1130	           The Structure of the First Aggregation Unit in an AP

1132	                                  Figure 5

1134	   The DONL field, when present, specifies the value of the 16 least
1135	   significant bits of the decoding order number of the aggregated NAL
1136	   unit.

1138	   If sprop-max-don-diff is greater than 0, the DONL field MUST be
1139	   present in an aggregation unit that is the first aggregation unit in
1140	   an AP, and the variable DON for the aggregated NAL unit is derived as
1141	   equal to the value of the DONL field; the variable DON for the NAL
1142	   unit in an aggregation unit that is not the first aggregation unit
1143	   in an AP is derived as equal to the DON of the preceding
1144	   aggregated NAL unit in the same AP plus 1 modulo 65536.  Otherwise
1145	   (sprop-max-don-diff is equal to 0), the DONL field MUST NOT be
1146	   present in an aggregation unit that is the first aggregation unit in
1147	   an AP.

1149	   An aggregation unit that is not the first aggregation unit in an AP
1150	   consists of a 16-bit unsigned size information
1151	   (in network byte order) that indicates the size of the NAL unit in
1152	   bytes (excluding these two octets, but including the NAL unit
1153	   header), followed by the NAL unit itself, including its NAL unit
1154	   header, as shown in Figure 6.

1156	     0                   1                   2                   3
1157	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1158	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1159	    |               :       NALU size               |   NAL unit    |
1160	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
1161	    |                                                               |
1162	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1163	    |                               :
1164	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1166	         The Structure of an Aggregation Unit That Is Not the First
1167	                          Aggregation Unit in an AP

1169	                                  Figure 6

1171	   Figure 7 presents an example of an AP that contains two aggregation
1172	   units, labeled as 1 and 2 in the figure, without the DONL field being
1173	   present.

1175	     0                   1                   2                   3
1176	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1177	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1178	    |                          RTP Header                           |
1179	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1180	    |   PayloadHdr (Type=28)        |         NALU 1 Size           |
1181	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1182	    |          NALU 1 HDR           |                               |
1183	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
1184	    |                   . . .                                       |
1185	    |                                                               |
1186	    +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1187	    |  . . .        | NALU 2 Size                   | NALU 2 HDR    |
1188	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1189	    | NALU 2 HDR    |                                               |
1190	    +-+-+-+-+-+-+-+-+              NALU 2 Data                      |
1191	    |                   . . .                                       |
1192	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1193	    |                               :...OPTIONAL RTP padding        |
1194	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1196	                   An Example of an AP Containing
1197	             Two Aggregation Units without the DONL Field

1199	                                  Figure 7

1201	   Figure 8 presents an example of an AP that contains two aggregation
1202	   units, labeled as 1 and 2 in the figure, with the DONL field being
1203	   present.

1205	     0                   1                   2                   3
1206	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1207	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1208	    |                          RTP Header                           |
1209	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1210	    |   PayloadHdr (Type=28)        |        NALU 1 DONL            |
1211	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1212	    |          NALU 1 Size          |            NALU 1 HDR         |
1213	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1214	    |                                                               |
1215	    |                 NALU 1 Data   . . .                           |
1216	    |                                                               |
1217	    +        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1218	    |                               :          NALU 2 Size          |
1219	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1220	    |          NALU 2 HDR           |                               |
1221	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
1222	    |                                                               |
1223	    |        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1224	    |                               :...OPTIONAL RTP padding        |
1225	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1227	                   An Example of an AP Containing
1228	                 Two Aggregation Units with the DONL Field

1230	                                  Figure 8
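
   Informative sketch (not part of the payload format): the aggregation
   unit layout described above could be walked as in the following
   Python fragment.  The function name parse_ap_payload and its
   arguments are hypothetical.  The sketch assumes that any RTP padding
   has already been removed and that the Type field occupies the five
   most significant bits of the second payload header octet, as in the
   NAL unit header.

```python
import struct


def parse_ap_payload(payload: bytes, sprop_max_don_diff: int):
    """Split an AP into (DON, NAL unit) pairs.

    payload starts with the two-octet PayloadHdr.  DON is None when
    sprop-max-don-diff is 0, because no DONL field is present then.
    """
    if len(payload) < 2:
        raise ValueError("payload shorter than PayloadHdr")
    # Assumption: Type is the upper five bits of the second header octet.
    if payload[1] >> 3 != 28:
        raise ValueError("not an aggregation packet (Type != 28)")
    pos = 2
    don = None
    if sprop_max_don_diff > 0:
        # DONL is carried only in the first aggregation unit.
        if len(payload) < pos + 2:
            raise ValueError("truncated DONL field")
        (don,) = struct.unpack_from("!H", payload, pos)
        pos += 2
    units = []
    while pos < len(payload):
        if len(payload) < pos + 2:
            raise ValueError("truncated NALU size field")
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        # A NAL unit carries at least its two-octet NAL unit header.
        if size < 2 or pos + size > len(payload):
            raise ValueError("invalid NALU size")
        units.append((don, payload[pos:pos + size]))
        pos += size
        if don is not None:
            # Later units: DON of the preceding unit plus 1 modulo 65536.
            don = (don + 1) % 65536
    return units
```

   With sprop-max-don-diff equal to 0 the sketch follows the layout of
   Figure 7; with a value greater than 0 it follows Figure 8.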

1232	4.3.3.  Fragmentation Units

1234	   Fragmentation Units (FUs) are introduced to enable fragmenting a
1235	   single NAL unit into multiple RTP packets, possibly without
1236	   cooperation or knowledge of the [VVC] encoder.  A fragment of a NAL
1237	   unit consists of an integer number of consecutive octets of that NAL
1238	   unit.  Fragments of the same NAL unit MUST be sent in consecutive
1239	   order with ascending RTP sequence numbers (with no other RTP packets
1240	   within the same RTP stream being sent between the first and last
1241	   fragment).

1243	   When a NAL unit is fragmented and conveyed within FUs, it is referred
1244	   to as a fragmented NAL unit.  APs MUST NOT be fragmented.  FUs MUST
1245	   NOT be nested; i.e., an FU cannot contain a subset of another FU.

1247	   The RTP timestamp of an RTP packet carrying an FU is set to the NALU-
1248	   time of the fragmented NAL unit.

1250	   An FU consists of a payload header (denoted as PayloadHdr), an FU
1251	   header of one octet, a conditional 16-bit DONL field (in network byte
1252	   order), and an FU payload, as shown in Figure 9.

1254	     0                   1                   2                   3
1255	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1256	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1257	    |   PayloadHdr (Type=29)        |   FU header   | DONL (cond)   |
1258	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1259	    |   DONL (cond) |                                               |
1260	    +-+-+-+-+-+-+-+-+                                               |
1261	    |                         FU payload                            |
1262	    |                                                               |
1263	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1264	    |                               :...OPTIONAL RTP padding        |
1265	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1267	                          The Structure of an FU

1269	                                  Figure 9

1271	   The fields in the payload header are set as follows.  The Type field
1272	   MUST be equal to 29.  The fields F, LayerId, and TID MUST be equal to
1273	   the fields F, LayerId, and TID, respectively, of the fragmented NAL
1274	   unit.

1276	   The FU header consists of an S bit, an E bit, an R bit and a 5-bit
1277	   FuType field, as shown in Figure 10.

1279	                           +---------------+
1280	                           |0|1|2|3|4|5|6|7|
1281	                           +-+-+-+-+-+-+-+-+
1282	                           |S|E|R|  FuType |
1283	                           +---------------+

1285	                       The Structure of FU Header

1287	                                 Figure 10

1289	   The semantics of the FU header fields are as follows:

1291	   S: 1 bit

1293	      When set to 1, the S bit indicates the start of a fragmented NAL
1294	      unit, i.e., the first byte of the FU payload is also the first
1295	      byte of the payload of the fragmented NAL unit.  When the FU
1296	      payload is not the start of the fragmented NAL unit payload, the S
1297	      bit MUST be set to 0.

1299	   E: 1 bit

1300	      When set to 1, the E bit indicates the end of a fragmented NAL
1301	      unit, i.e., the last byte of the FU payload is also the last byte
1302	      of the fragmented NAL unit.  When the FU payload is not the last
1303	      fragment of a fragmented NAL unit, the E bit MUST be set to 0.

1305	   Reserved: 1 bit

1307	      editor-note 24: to be removed upon wg consensus

1309	      When set to 1, the R bit indicates the last NAL unit of a coded
1310	      picture, i.e., the last byte of the FU payload is also the last
1311	      byte of the coded picture.  When the FU payload is not the last
1312	      fragment of a coded picture, the R bit MUST be set to 0.

1314	   FuType: 5 bits

1316	      The field FuType MUST be equal to the field Type of the fragmented
1317	      NAL unit.

1319	   The DONL field, when present, specifies the value of the 16 least
1320	   significant bits of the decoding order number of the fragmented NAL
1321	   unit.

1323	   If sprop-max-don-diff is greater than 0, and the S bit is equal to 1,
1324	   the DONL field MUST be present in the FU, and the variable DON for
1325	   the fragmented NAL unit is derived as equal to the value of the DONL
1326	   field.  Otherwise (sprop-max-don-diff is equal to 0, or the S bit is
1327	   equal to 0), the DONL field MUST NOT be present in the FU.

1329	   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
1330	   the S bit and the E bit must not both be set to 1 in the same FU
1331	   header.

1333	   The FU payload consists of fragments of the payload of the fragmented
1334	   NAL unit so that if the FU payloads of consecutive FUs, starting with
1335	   an FU with the S bit equal to 1 and ending with an FU with the E bit
1336	   equal to 1, are sequentially concatenated, the payload of the
1337	   fragmented NAL unit can be reconstructed.  The NAL unit header of the
1338	   fragmented NAL unit is not included as such in the FU payload, but
1339	   rather the information of the NAL unit header of the fragmented NAL
1340	   unit is conveyed in the F, LayerId, and TID fields of the FU payload
1341	   headers of the FUs and the FuType field of the FU header of the FUs.
1342	   An FU payload MUST NOT be empty.

1344	   If an FU is lost, the receiver SHOULD discard all following
1345	   fragmentation units in transmission order corresponding to the same
1346	   fragmented NAL unit, unless the decoder in the receiver is known to
1347	   be prepared to gracefully handle incomplete NAL units.

1349	   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
1350	   fragments of a NAL unit to an (incomplete) NAL unit, even if fragment
1351	   n of that NAL unit is not received.  In this case, the
1352	   forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a
1353	   syntax violation.
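
   Informative sketch (not part of the payload format): the FU header
   parsing and the reconstruction of a fragmented NAL unit described in
   this section could look as follows in Python.  The names Fu,
   parse_fu, and reassemble are hypothetical.  The sketch assumes that
   all FUs of the NAL unit were received in order without loss, that
   RTP padding has been removed, and that the first payload header
   octet (F, the reserved bit, and LayerId) can be copied unchanged
   into the rebuilt NAL unit header.

```python
import struct
from typing import List, NamedTuple, Optional, Tuple


class Fu(NamedTuple):
    start: bool          # S bit
    end: bool            # E bit
    r_bit: bool          # R bit (see the editor's note above)
    fu_type: int         # Type of the fragmented NAL unit
    donl: Optional[int]  # present only when S == 1 and DONs are in use
    data: bytes          # FU payload


def parse_fu(payload: bytes, sprop_max_don_diff: int) -> Fu:
    if len(payload) < 3:
        raise ValueError("FU shorter than PayloadHdr plus FU header")
    # Assumption: Type is the upper five bits of the second header octet.
    if payload[1] >> 3 != 29:
        raise ValueError("not a fragmentation unit (Type != 29)")
    fu_header = payload[2]
    start = bool(fu_header & 0x80)
    end = bool(fu_header & 0x40)
    r_bit = bool(fu_header & 0x20)
    fu_type = fu_header & 0x1F
    if start and end:
        raise ValueError("S and E bits both set in one FU header")
    pos = 3
    donl = None
    if sprop_max_don_diff > 0 and start:
        if len(payload) < pos + 2:
            raise ValueError("truncated DONL field")
        (donl,) = struct.unpack_from("!H", payload, pos)
        pos += 2
    data = payload[pos:]
    if not data:
        raise ValueError("empty FU payload")
    return Fu(start, end, r_bit, fu_type, donl, data)


def reassemble(fu_payloads: List[bytes],
               sprop_max_don_diff: int) -> Tuple[Optional[int], bytes]:
    """Return (DON or None, reconstructed NAL unit)."""
    fus = [parse_fu(p, sprop_max_don_diff) for p in fu_payloads]
    if len(fus) < 2 or not fus[0].start or not fus[-1].end:
        raise ValueError("incomplete fragmented NAL unit")
    if any(f.start for f in fus[1:]) or any(f.end for f in fus[:-1]):
        raise ValueError("unexpected S or E bit inside the sequence")
    first = fu_payloads[0]
    # Rebuild the NAL unit header: the first octet comes from PayloadHdr,
    # the second combines FuType with the TID bits of PayloadHdr.
    nal_header = bytes([first[0], (fus[0].fu_type << 3) | (first[1] & 0x07)])
    return fus[0].donl, nal_header + b"".join(f.data for f in fus)
```

   A receiver that forwards incomplete NAL units would need additional
   handling, such as setting the forbidden_zero_bit as described above;
   the sketch simply rejects incomplete sequences.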

1355	4.4.  Decoding Order Number

1357	   For each NAL unit, the variable AbsDon is derived, representing the
1358	   decoding order number that is indicative of the NAL unit decoding
1359	   order.

1361	   Let NAL unit n be the n-th NAL unit in transmission order within an
1362	   RTP stream.

1364	   If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon
1365	   for NAL unit n, is derived as equal to n.

1367	   Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is
1368	   derived as follows, where DON[n] is the value of the variable DON for
1369	   NAL unit n:

1371	   *  If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in
1372	      transmission order), AbsDon[0] is set equal to DON[0].

1374	   *  Otherwise (n is greater than 0), the following applies for
1375	      derivation of AbsDon[n]:

1377	         If DON[n] == DON[n-1],
1378	            AbsDon[n] = AbsDon[n-1]

1380	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768),
1381	            AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1]

1383	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768),
1384	            AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n]

1386	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768),
1387	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 -
1388	            DON[n])

1390	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768),
1391	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n])
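
   Informative sketch: the five cases above can be transcribed
   directly into Python.  The function names next_abs_don and
   abs_don_sequence are hypothetical and exist only for illustration;
   the input is assumed to be the list of DON values in transmission
   order.

```python
def next_abs_don(prev_abs_don: int, prev_don: int, don: int) -> int:
    """Derive AbsDon[n] from AbsDon[n-1], DON[n-1], and DON[n]."""
    if don == prev_don:
        return prev_abs_don
    if don > prev_don and don - prev_don < 32768:
        # Forward step without wrap-around.
        return prev_abs_don + don - prev_don
    if don < prev_don and prev_don - don >= 32768:
        # Forward step across the 16-bit wrap-around.
        return prev_abs_don + 65536 - prev_don + don
    if don > prev_don and don - prev_don >= 32768:
        # Backward step across the 16-bit wrap-around.
        return prev_abs_don - (prev_don + 65536 - don)
    # Remaining case: don < prev_don and prev_don - don < 32768,
    # a backward step without wrap-around.
    return prev_abs_don - (prev_don - don)


def abs_don_sequence(dons):
    """Map DON values in transmission order to AbsDon values."""
    result = []
    for n, don in enumerate(dons):
        if n == 0:
            result.append(don)  # AbsDon[0] = DON[0]
        else:
            result.append(next_abs_don(result[-1], dons[n - 1], don))
    return result
```

   For instance, the DON values 65534, 65535, 0, 1 map to the AbsDon
   values 65534, 65535, 65536, 65537, so the wrap-around of the 16-bit
   DON does not disturb the decoding order.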

1393	   For any two NAL units m and n, the following applies:

1395	   *  AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows
1396	      NAL unit m in NAL unit decoding order.

1398	   *  When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order
1399	      of the two NAL units can be in either order.

1401	   *  AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes
1402	      NAL unit m in decoding order.

1404	         Informative note: When two consecutive NAL units in the NAL
1405	         unit decoding order have different values of AbsDon, the
1406	         absolute difference between the two AbsDon values may be
1407	         greater than or equal to 1.

1409	         Informative note: There are multiple reasons to allow for the
1410	         absolute difference of the values of AbsDon for two consecutive
1411	         NAL units in the NAL unit decoding order to be greater than
1412	         one.  An increment by one is not required, as at the time of
1413	         associating values of AbsDon to NAL units, it may not be known
1414	         whether all NAL units are to be delivered to the receiver.  For
1415	         example, a gateway might not forward VCL NAL units of higher
1416	         sublayers or some SEI NAL units when there is congestion in the
1417	         network.  In another example, the first intra-coded picture of
1418	         a pre-encoded clip is transmitted in advance to ensure that it
1419	         is readily available in the receiver, and when transmitting the
1420	         first intra-coded picture, the originator does not exactly know
1421	         how many NAL units will be encoded before the first intra-coded
1422	         picture of the pre-encoded clip follows in decoding order.
1423	         Thus, the values of AbsDon for the NAL units of the first
1424	         intra-coded picture of the pre-encoded clip have to be
1425	         estimated when they are transmitted, and gaps in values of
1426	         AbsDon may occur.

1428	5.  Packetization Rules

1430	   The following packetization rules apply:

1432	   *  If sprop-max-don-diff is greater than 0, the transmission order of
1433	      NAL units carried in the RTP stream MAY be different than the NAL
1434	      unit decoding order.  Otherwise (sprop-max-don-diff is equal to
1435	      0), the transmission order of NAL units carried in the RTP stream
1436	      MUST be the same as the NAL unit decoding order.

1438	   *  A NAL unit of a small size SHOULD be encapsulated in an
1439	      aggregation packet together with one or more other NAL units in
1440	      order to avoid the unnecessary packetization overhead for small
1441	      NAL units.  For example, non-VCL NAL units such as access unit
1442	      delimiters, parameter sets, or SEI NAL units are typically small
1443	      and can often be aggregated with VCL NAL units without violating
1444	      MTU size constraints.

1446	   *  Each non-VCL NAL unit SHOULD, when possible from an MTU size match
1447	      viewpoint, be encapsulated in an aggregation packet together with
1448	      its associated VCL NAL unit, as typically a non-VCL NAL unit would
1449	      be meaningless without the associated VCL NAL unit being
1450	      available.

1452	   *  For carrying exactly one NAL unit in an RTP packet, a single NAL
1453	      unit packet MUST be used.

1455	6.  De-packetization Process

1457	   The general concept behind de-packetization is to get the NAL units
1458	   out of the RTP packets in an RTP stream and pass them to the decoder
1459	   in the NAL unit decoding order.

1461	   The de-packetization process is implementation dependent.  Therefore,
1462	   the following description should be seen as an example of a suitable
1463	   implementation.  Other schemes may be used as well, as long as the
1464	   output for the same input is the same as the process described below.
1465	   The output is the same when the set of output NAL units and their
1466	   order are both identical.  Optimizations relative to the described
1467	   algorithms are possible.

1469	   All normal RTP mechanisms related to buffer management apply.  In
1470	   particular, duplicated or outdated RTP packets (as indicated by the
1471	   RTP sequence number and the RTP timestamp) are removed.  To
1472	   determine the exact time for decoding, factors such as a possible
1473	   intentional delay to allow for proper inter-stream synchronization
1474	   MUST be factored in.

1476	   NAL units with NAL unit type values in the range of 0 to 27,
1477	   inclusive, may be passed to the decoder.  NAL-unit-like structures
1478	   with NAL unit type values in the range of 28 to 31, inclusive, MUST
1479	   NOT be passed to the decoder.

1481	   The receiver includes a receiver buffer, which is used to compensate
1482	   for transmission delay jitter within an individual RTP stream and to
1483	   reorder NAL units from transmission order to the NAL unit decoding
1484	   order.  In this section, the receiver operation is described under
1485	   the assumption that there is no transmission delay jitter within an
1486	   RTP stream.  To distinguish it from a practical receiver buffer
1487	   that is also used for compensation of transmission delay jitter, the
1488	   receiver buffer is hereafter called the de-packetization buffer in
1489	   this section.  Receivers should also prepare for transmission delay
1490	   jitter; that is, either reserve separate buffers for transmission
1491	   delay jitter buffering and de-packetization buffering or use a
1492	   receiver buffer for both transmission delay jitter and de-
1493	   packetization.  Moreover, receivers should take transmission delay
1494	   jitter into account in the buffering operation, e.g., by additional
1495	   initial buffering before starting decoding and playback.

1497	   When sprop-max-don-diff is equal to 0, the de-packetization buffer
1498	   size is zero bytes, and the process described in the remainder of
1499	   this paragraph applies.  The NAL units carried in the single RTP
1500	   stream are directly passed to the decoder in their transmission
1501	   order, which is identical to their decoding order.

1503	   When sprop-max-don-diff is greater than 0, the process described in
1504	   the remainder of this section applies.

1506	   There are two buffering states in the receiver: initial buffering and
1507	   buffering while playing.  Initial buffering starts when the reception
1508	   is initialized.  After initial buffering, decoding and playback are
1509	   started, and the buffering-while-playing mode is used.

1511	   Regardless of the buffering state, the receiver stores incoming NAL
1512	   units in reception order into the de-packetization buffer.  NAL units
1513	   carried in RTP packets are stored in the de-packetization buffer
1514	   individually, and the value of AbsDon is calculated and stored for
1515	   each NAL unit.

1517	   Initial buffering lasts until condition A (the difference between the
1518	   greatest and smallest AbsDon values of the NAL units in the de-
1519	   packetization buffer is greater than or equal to the value of sprop-
1520	   max-don-diff) or condition B (the number of NAL units in the de-
1521	   packetization buffer is greater than the value of sprop-depack-buf-
1522	   nalus) is true.

1524	   After initial buffering, whenever condition A or condition B is true,
1525	   the following operation is repeatedly applied until both condition A
1526	   and condition B become false:

1528	   *  The NAL unit in the de-packetization buffer with the smallest
1529	      value of AbsDon is removed from the de-packetization buffer and
1530	      passed to the decoder.

1532	   When no more NAL units are flowing into the de-packetization buffer,
1533	   all NAL units remaining in the de-packetization buffer are removed
1534	   from the buffer and passed to the decoder in the order of increasing
1535	   AbsDon values.
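
   Informative sketch: one possible realization of the buffering
   process described above, for the case where sprop-max-don-diff is
   greater than 0.  The class name DepacketizationBuffer and its
   methods are hypothetical.  The sketch assumes that AbsDon has
   already been computed for each NAL unit and breaks ties between
   equal AbsDon values by arrival order, which is one of the permitted
   orders.  Because nothing is output until condition A or condition B
   first becomes true, initial buffering needs no separate state here.

```python
import bisect
from itertools import count


class DepacketizationBuffer:
    def __init__(self, sprop_max_don_diff: int, sprop_depack_buf_nalus: int):
        if sprop_max_don_diff <= 0:
            raise ValueError("this sketch covers sprop-max-don-diff > 0 only")
        self.max_don_diff = sprop_max_don_diff
        self.buf_nalus = sprop_depack_buf_nalus
        # Kept sorted by (AbsDon, arrival index); the NAL unit is payload.
        self._entries = []
        self._arrival = count()

    def _condition_a(self) -> bool:
        # Difference between greatest and smallest AbsDon in the buffer.
        if not self._entries:
            return False
        spread = self._entries[-1][0] - self._entries[0][0]
        return spread >= self.max_don_diff

    def _condition_b(self) -> bool:
        return len(self._entries) > self.buf_nalus

    def push(self, abs_don: int, nal_unit):
        """Store one NAL unit; return the NAL units released to the decoder."""
        bisect.insort(self._entries, (abs_don, next(self._arrival), nal_unit))
        released = []
        while self._condition_a() or self._condition_b():
            # Remove the NAL unit with the smallest AbsDon.
            released.append(self._entries.pop(0)[2])
        return released

    def flush(self):
        """No more input: release everything in increasing AbsDon order."""
        released = [entry[2] for entry in self._entries]
        self._entries.clear()
        return released
```

   A sorted list keeps the sketch short; an implementation concerned
   with large buffers might prefer a different data structure.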

1537	7.  Payload Format Parameters

1539	   This section specifies the optional parameters.  A mapping of the
1540	   parameters with Session Description Protocol (SDP) [RFC4566] is also
1541	   provided for applications that use SDP.

1543	7.1.  Media Type Registration

1545	   The receiver MUST ignore any parameter unspecified in this memo.

1547	   Type name:            video

1549	   Subtype name:         H266

1551	   Required parameters:  none

1553	   Optional parameters:

1555	      profile-id, tier-flag, sub-profile-id, interop-constraints, and
1556	      level-id:

1558	         These parameters indicate the profile, tier, default level,
1559	         sub-profile, and some constraints of the bitstream carried by
1560	         the RTP stream, or a specific set of the profile, tier, default
1561	         level, sub-profile and some constraints the receiver supports.

1563	         The subset of coding tools that may have been used to generate
1564	         the bitstream or that the receiver supports, as well as some
1565	         additional constraints are indicated collectively by profile-
1566	         id, sub-profile-id, and interop-constraints.

1568	            Informative note: There are 128 values of profile-id.  The
1569	            subset of coding tools identified by the profile-id can be
1570	            further constrained with up to 255 instances of sub-profile-
1571	            id.  In addition, 68 bits included in interop-constraints,
1572	            which can be extended up to 324 bits, provide means to
1573	            further restrict tools from existing profiles.  To be able
1574	            to support this fine-granular signalling of coding tool
1575	            subsets with profile-id, sub-profile-id and interop-
1576	            constraints, it would be safe to require symmetric use of
1577	            these parameters in SDP offer/answer unless recv-ols-id is
1578	            included in the SDP answer for choosing one of the layers
1579	            offered.

1581	         The tier is indicated by tier-flag.  The default level is
1582	         indicated by level-id.  The tier and the default level specify
1583	         the limits on values of syntax elements or arithmetic
1584	         combinations of values of syntax elements that are followed
1585	         when generating the bitstream or that the receiver supports.

1587	         In SDP offer/answer, when the SDP answer does not include the
1588	         recv-ols-id parameter that is less than the sprop-ols-id
1589	         parameter in the SDP offer, the following applies:

1591	         o  The tier-flag, profile-id, sub-profile-id, and interop-
1592	            constraints parameters MUST be used symmetrically, i.e., the
1593	            value of each of these parameters in the offer MUST be the
1594	            same as that in the answer, either explicitly signaled or
1595	            implicitly inferred.

1597	         o  The level-id parameter is changeable as long as the highest
1598	            level indicated by the answer is either equal to or lower
1599	            than that in the offer.  Note that a highest level higher
1600	            than level-id in the offer for receiving can be included as
1601	            max-recv-level-id.

1603	         In SDP offer/answer, when the SDP answer does include the recv-
1604	         ols-id parameter that is less than the sprop-ols-id parameter
1605	         in the SDP offer, the set of tier-flag, profile-id, sub-
1606	         profile-id, interop-constraints, and level-id parameters
1607	         included in the answer MUST be consistent with that for the
1608	         chosen output layer set as indicated in the SDP offer, with the
1609	         exception that the level-id parameter in the SDP answer is
1610	         changeable as long as the highest level indicated by the answer
1611	         is either lower than or equal to that in the offer.

1613	         More specifications of these parameters, including how they
1614	         relate to syntax elements specified in [VVC], are provided
1615	         below.

1617	      profile-id:

1619	         When profile-id is not present, a value of 1 (i.e., the Main 10
1620	         profile) MUST be inferred.

1622	         When used to indicate properties of a bitstream, profile-id is
1623	         derived from the general_profile_idc syntax element that
1624	         applies to the bitstream in an instance of the
1625	         profile_tier_level( ) syntax structure.

1627	         A profile_tier_level( ) syntax structure may be contained in an
1628	         SPS, VPS, or DCI NAL unit as specified in [VVC].  One of the
1629	         following three cases applies to the container NAL unit of the
1630	         profile_tier_level( ) syntax structure containing those PTL
1631	         syntax elements used to derive the values of profile-id, tier-
1632	         flag, level-id, sub-profile-id, or interop-constraints: 1) The
1633	         container NAL unit is an SPS, the bitstream is a single-layer
1634	         bitstream, and the profile_tier_level( ) syntax structures in
1635	         all SPSs referenced by the CVSs in the bitstream have the same
1636	         values respectively for those PTL syntax elements; 2) The
1637	         container NAL unit is a VPS, the profile_tier_level( ) syntax
1638	         structure is the one in the VPS that applies to the OLS
1639	         corresponding to the bitstream, and the profile_tier_level( )
1640	         syntax structures applicable to the OLS corresponding to the
1641	         bitstream in all VPSs referenced by the CVSs in the bitstream
1642	         have the same values respectively for those PTL syntax
1643	         elements; 3) The container NAL unit is a DCI NAL unit and the
1644	         profile_tier_level( ) syntax structures in all DCI NAL units in
1645	         the bitstream have the same values respectively for those PTL
1646	         syntax elements.

1648	      tier-flag, level-id:

1650	         The value of tier-flag MUST be in the range of 0 to 1,
1651	         inclusive.  The value of level-id MUST be in the range of 0 to
1652	         255, inclusive.

1654	         If the tier-flag and level-id parameters are used to indicate
1655	         properties of a bitstream, they indicate the tier and the
1656	         highest level the bitstream complies with.

1658	         If the tier-flag and level-id parameters are used for
1659	         capability exchange, the following applies.  If max-recv-level-
1660	         id is not present, the default level defined by level-id
1661	         indicates the highest level the codec wishes to support.
1662	         Otherwise, max-recv-level-id indicates the highest level the
1663	         codec supports for receiving.  For either receiving or sending,
1664	         all levels that are lower than the highest level supported MUST
1665	         also be supported.

1667	         If no tier-flag is present, a value of 0 MUST be inferred; if
1668	         no level-id is present, a value of 51 (i.e., level 3.1) MUST be
1669	         inferred.

1671	            Informative note: The level values currently defined in the
1672	            VVC specification are in the form of "majorNum.minorNum",
1673	            and the value of the level-id for each of the levels is
1674	            equal to majorNum * 16 + minorNum * 3.  It is expected that
1675	            if any levels are defined in the future, the same
1676	            convention will be used, but this cannot be guaranteed.
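
         Informative sketch: the convention in the note above can be
         written as a pair of small Python helpers.  The function
         names level_id and level_from_id are hypothetical.  As the
         note says, levels defined in the future might not follow this
         convention, so the reverse mapping returns None for values
         that do not fit it.

```python
from typing import Optional, Tuple


def level_id(major_num: int, minor_num: int) -> int:
    """level-id for level "majorNum.minorNum" under the stated convention."""
    return major_num * 16 + minor_num * 3


def level_from_id(value: int) -> Optional[Tuple[int, int]]:
    """Reverse mapping; None if the value does not fit the convention."""
    if not 0 <= value <= 255:
        raise ValueError("level-id out of range 0..255")
    major_num, remainder = divmod(value, 16)
    if remainder % 3 != 0:
        return None
    return major_num, remainder // 3
```

         For example, level 3.1 gives 3 * 16 + 1 * 3 = 51, which is
         the value inferred when level-id is not present.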

1678	         When used to indicate properties of a bitstream, the tier-flag
1679	         and level-id parameters are derived respectively from the
1680	         syntax element general_tier_flag, and the syntax element
1681	         general_level_idc or sub_layer_level_idc[j], that apply to the
1682	         bitstream, in an instance of the profile_tier_level( ) syntax
1683	         structure.

1685	         If the tier-flag and level-id are derived from the
1686	         profile_tier_level( ) syntax structure in a DCI NAL unit, the
1687	         following applies:

1689	         o  tier-flag = general_tier_flag

1691	         o  level-id = general_level_idc

1693	         Otherwise, if the tier-flag and level-id are derived from the
1694	         profile_tier_level( ) syntax structure in an SPS or VPS NAL
1695	         unit, and the bitstream contains the highest sub-layer
1696	         representation in the OLS corresponding to the bitstream, the
1697	         following applies:

1699	         o  tier-flag = general_tier_flag

1701	         o  level-id = general_level_idc

1703	         Otherwise, if the tier-flag and level-id are derived from the
1704	         profile_tier_level( ) syntax structure in an SPS or VPS NAL
1705	         unit, and the bitstream does not contain the highest sub-layer
1706	         representation in the OLS corresponding to the bitstream, the
1707	         following applies, with j being the value of the sprop-sub-
1708	         layer-id parameter:

1710	         o  tier-flag = general_tier_flag

1712	         o  level-id = sub_layer_level_idc[j]

1714	      sub-profile-id:

1716	         The value of the parameter is a comma-separated (',') list of
1717	         values.

1719	   editor-note 11: What is the value? integer, base32?

1721	         When used to indicate properties of a bitstream, sub-profile-id
1722	         is derived from each of the ptl_num_sub_profiles
1723	         general_sub_profile_idc[i] syntax elements that apply to the
1724	         bitstream in a profile_tier_level( ) syntax structure.

1726	      interop-constraints:

1728	         A base16 [RFC4648] (hexadecimal) representation of the data
1729	         that includes the syntax elements
1730	         ptl_frame_only_constraint_flag and ptl_multilayer_enabled_flag
1731	         and the general_constraints_info( ) syntax structure that apply
1732	         to the bitstream in an instance of the profile_tier_level( )
1733	         syntax structure.

1735	         If the interop-constraints parameter is not present, the
1736	         following MUST be inferred:

1738	         o  ptl_frame_only_constraint_flag = 0

1740	         o  ptl_multilayer_enabled_flag = 1

1742	         o  gci_present_flag in the general_constraints_info( ) syntax
1743	            structure = 1

1745	   editor-note 14: Double check the default values.  Currently, no
1746	   constraints, but actually, with the Main 10 profile as default multi-
1747	   layer not possible.

1749	         Using interop-constraints for capability exchange results in a
1750	         requirement on any bitstream to be compliant with the interop-
1751	         constraints.

1753	      sprop-sub-layer-id:

1755	         This parameter MAY be used to indicate the highest allowed
1756	         value of TID in the bitstream.  When not present, the value of
1757	         sprop-sub-layer-id is inferred to be equal to 6.

1759	         The value of sprop-sub-layer-id MUST be in the range of 0 to 6,
1760	         inclusive.

1762	      sprop-ols-id:

1764	         This parameter MAY be used to indicate the OLS that the
1765	         bitstream applies to.  When not present, the value of sprop-
1766	         ols-id is inferred to be equal to TargetOlsIdx as specified in
1767	         8.1.1 in [VVC].  If this optional parameter is present, sprop-
1768	         vps MUST also be present or its content MUST be known a priori
1769	         at the receiver.

1771	         The value of sprop-ols-id MUST be in the range of 0 to 257,
1772	         inclusive.

1774	      recv-sub-layer-id:

1776	         This parameter MAY be used to signal a receiver's choice of the
1777	         offered or declared sub-layer representations in the sprop-vps
1778	         and sprop-sps.  The value of recv-sub-layer-id indicates the
1779	         TID of the highest sub-layer of the bitstream that a receiver
1780	         supports.  When not present, the value of recv-sub-layer-id is
1781	         inferred to be equal to the value of the sprop-sub-layer-id
1782	         parameter in the SDP offer.

1784	         The value of recv-sub-layer-id MUST be in the range of 0 to 6,
1785	         inclusive.

1787	      recv-ols-id:

1789	         This parameter MAY be used to signal a receiver's choice of the
1790	         offered or declared output layer sets in the sprop-vps.  The
1791	         value of recv-ols-id indicates the OLS index of the bitstream
1792	         that a receiver supports.  When not present, the value of recv-
1793	         ols-id is inferred to be equal to the value of the sprop-ols-id
1794	         parameter in the SDP offer.  When present, the value of recv-
1795	         ols-id must be included only when sprop-ols-id was received and
1796	         must refer to an output layer set in the VPS that is in the
1797	         same dependency tree as the OLS referred to by sprop-ols-id.
1798	         If this optional parameter is present, sprop-vps must have been
1799	         received or its content must be known a priori at the receiver.

1801	         The value of recv-ols-id MUST be in the range of 0 to 257,
1802	         inclusive.
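
The ranges and default inferences stated above for the sub-layer and OLS identifiers can be sketched as follows. This is a minimal illustration, not part of the payload format: the names RANGES, check_range, and resolve_recv_sub_layer_id are assumptions of this sketch, and the offer and answer are assumed to be dictionaries of already-parsed integer values.

```python
# Inclusive ranges as stated in the text; the table name is hypothetical.
RANGES = {
    "sprop-sub-layer-id": (0, 6),
    "sprop-ols-id": (0, 257),
    "recv-sub-layer-id": (0, 6),
    "recv-ols-id": (0, 257),
}


def check_range(name, value):
    """Return value if it lies in the inclusive range for name, else raise."""
    low, high = RANGES[name]
    if not low <= value <= high:
        raise ValueError(f"{name}={value} is outside {low}..{high}")
    return value


def resolve_recv_sub_layer_id(offer, answer):
    """Effective recv-sub-layer-id for an offer/answer pair.

    sprop-sub-layer-id is inferred to be 6 when absent from the offer;
    recv-sub-layer-id is inferred to equal it when absent from the answer.
    """
    sprop = check_range("sprop-sub-layer-id",
                        offer.get("sprop-sub-layer-id", 6))
    if "recv-sub-layer-id" in answer:
        return check_range("recv-sub-layer-id", answer["recv-sub-layer-id"])
    return sprop
```

The same pattern would apply to recv-ols-id, except that the default for sprop-ols-id (TargetOlsIdx) has to come from the VVC decoding process and is therefore not modelled here.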

1804	      max-recv-level-id:

1806	         This parameter MAY be used to indicate the highest level a
1807	         receiver supports.

1809	         The value of max-recv-level-id MUST be in the range of 0 to
1810	         255, inclusive.

1812	         When max-recv-level-id is not present, the value is inferred to
1813	         be equal to level-id.

1815	         max-recv-level-id MUST NOT be present when the highest level
1816	         the receiver supports is not higher than the default level.

1818	      sprop-dci:

1820	         This parameter MAY be used to convey a decoding capability
1821	         information NAL unit of the bitstream for out-of-band
1822	         transmission.  The parameter MAY also be used for capability
1823	         exchange.  The value of the parameter is a base64 [RFC4648]
1824	         representation of the decoding capability information NAL unit
1825	         as specified in Section 7.3.2.1 of [VVC].

1827	      sprop-vps:

1829	         This parameter MAY be used to convey any video parameter set
1830	         NAL unit of the bitstream for out-of-band transmission of video
1831	         parameter sets.  The parameter MAY also be used for capability
1832	         exchange and to indicate sub-stream characteristics (i.e.,
1833	         properties of output layer sets and sublayer representations as
1834	         defined in [VVC]).  The value of the parameter is a comma-
1835	         separated (',') list of base64 [RFC4648] representations of the
1836	         video parameter set NAL units as specified in Section 7.3.2.3
1837	         of [VVC].

1839	         The sprop-vps parameter MAY contain one or more than one video
1840	         parameter set NAL unit.  However, all other video parameter
1841	         sets contained in the sprop-vps parameter MUST be consistent
1842	         with the first video parameter set in the sprop-vps parameter.
1843	         A video parameter set vpsB is said to be consistent with
1844	         another video parameter set vpsA if any decoder that conforms
1845	         to the profile, tier, level, and constraints indicated by the
1846	         12 bytes of data starting from the syntax element
1847	         general_profile_space to the syntax element general_level_idc,
1848	         inclusive, in the first profile_tier_level( ) syntax structure
1849	         in vpsA can decode any bitstream that conforms to the profile,
1850	         tier, level, and constraints indicated by the 12 bytes of data
1851	         starting from the syntax element general_profile_space to the
1852	         syntax element general_level_idc, inclusive, in the first
1853	         profile_tier_level( ) syntax structure in vpsB.
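
A receiver could split and decode the comma-separated base64 list carried in sprop-vps roughly as shown below. This is a sketch under stated assumptions: the function name decode_parameter_set_list is hypothetical, and the example bytes are placeholders rather than real video parameter set NAL units.

```python
import base64


def decode_parameter_set_list(value):
    """Decode a comma-separated list of base64 strings into byte strings."""
    nal_units = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            raise ValueError("empty entry in parameter set list")
        # validate=True rejects characters outside the base64 alphabet.
        nal_units.append(base64.b64decode(item, validate=True))
    return nal_units


# Placeholder payloads only; these are NOT valid VPS NAL units.
example_value = ",".join(
    base64.b64encode(raw).decode("ascii")
    for raw in (b"\x01\x02\x03", b"\x04\x05")
)
example_units = decode_parameter_set_list(example_value)
```

The consistency requirement between the first and any further video parameter sets is not checked here; doing so would require parsing the profile_tier_level( ) syntax structure.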

1855	      sprop-sei:

1857	         This parameter MAY be used to convey one or more SEI messages
1858	         that describe bitstream characteristics.  When present, a
1859	         decoder can rely on the bitstream characteristics that are
1860	         described in the SEI messages for the entire duration of the
1861	         session, independently from the persistence scopes of the SEI
1862	         messages as specified in [VSEI].

1864	         The value of the parameter is a comma-separated (',') list of
1865	         base64 [RFC4648] representations of SEI NAL units as specified
1866	         in [VSEI].

1868	            Informative note: Intentionally, no list of applicable or
1869	            inapplicable SEI messages is specified here.  Conveying
1870	            certain SEI messages in sprop-sei may be sensible in some
1871	            application scenarios and meaningless in others.  However, a
1872	            few examples are described below:

1874	            1) In an environment where the bitstream was created from
1875	            film-based source material, and no splicing is going to
1876	            occur during the lifetime of the session, the film grain
1877	            characteristics SEI message is likely meaningful, and
1878	            sending it in sprop-sei rather than in the bitstream at each
1879	            entry point may help with saving bits and allows one to
1880	            configure the renderer only once, avoiding unwanted
1881	            artifacts.

1883	            2) Examples for SEI messages that would be meaningless to be
1884	            conveyed in sprop-sei include the decoded picture hash SEI
1885	            message (it is close to impossible that all decoded pictures
1886	            have the same hash), the display orientation SEI message
1887	            when the device is a handheld device (as the display
1888	            orientation may change when the handheld device is turned
1889	            around), or the filler payload SEI message (as there is no
1890	            point in just having more bits in SDP).

1892	      max-lsr:

1894	         The max-lsr MAY be used to signal the capabilities of a
1895	         receiver implementation and MUST NOT be used for any other
1896	         purpose.  The value of max-lsr is an integer indicating the
1897	         maximum processing rate in units of luma samples per second.
1898	         The max-lsr parameter signals that the receiver is capable of
1899	         decoding video at a higher rate than is required by the highest
1900	         level.

1902	            Informative note: When the OPTIONAL media type parameters
1903	            are used to signal the properties of a bitstream, and max-
1904	            lsr is not present, the values of tier-flag, profile-id,
1905	            sub-profile-id, interop-constraints, and level-id must always
1906	            be such that the bitstream complies fully with the specified
1907	            profile, tier, and level.

1909	         When max-lsr is signaled, the receiver MUST be able to decode
1910	         bitstreams that conform to the highest level, with the
1911	         exception that the MaxLumaSr value in Table 136 of [VVC] for
1912	         the highest level is replaced with the value of max-lsr.
1913	         Senders MAY use this knowledge to send pictures of a given size
1914	         at a higher picture rate than is indicated in the highest
1915	         level.

1917	         When not present, the value of max-lsr is inferred to be equal
1918	         to the value of MaxLumaSr given in Table 136 of [VVC] for the
1919	         highest level.

1921	         The value of max-lsr MUST be in the range of MaxLumaSr to 16 *
1922	         MaxLumaSr, inclusive, where MaxLumaSr is given in Table 136 of
1923	         [VVC] for the highest level.
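
The inference and range rules for max-lsr can be illustrated with a short sketch. The function names are assumptions of this example, and the MaxLumaSr argument must be taken from Table 136 of [VVC] for the highest level; no table values are reproduced here. The picture-rate helper considers the luma sample rate only and ignores every other level limit.

```python
def effective_max_lsr(level_max_luma_sr, max_lsr=None):
    """Return the luma sample rate the receiver is taken to support.

    level_max_luma_sr: MaxLumaSr for the highest level (caller-supplied).
    max_lsr: the signaled max-lsr value, or None when the parameter is absent.
    """
    if max_lsr is None:
        # When not present, max-lsr is inferred to equal MaxLumaSr.
        return level_max_luma_sr
    if not level_max_luma_sr <= max_lsr <= 16 * level_max_luma_sr:
        raise ValueError("max-lsr must be within MaxLumaSr..16*MaxLumaSr")
    return max_lsr


def picture_rate_from_lsr(width, height, luma_sample_rate):
    """Pictures per second permitted by the luma sample rate alone."""
    return luma_sample_rate / (width * height)
```

A sender that learns a max-lsr of twice MaxLumaSr could, under this simplification, double the picture rate for a given picture size.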

1925	      max-fps:

1927	         The value of max-fps is an integer indicating the maximum
1928	         picture rate in units of pictures per 100 seconds that can be
1929	         effectively processed by the receiver.  The max-fps parameter
1930	         MAY be used to signal that the receiver has a constraint in
1931	         that it is not capable of processing video effectively at the
1932	         full picture rate that is implied by the highest level and,
1933	         when present, max-lsr.

1935	         The value of max-fps is not necessarily the picture rate at
1936	         which the maximum picture size can be sent; it constitutes a
1937	         constraint on maximum picture rate for all resolutions.

1939	            Informative note: The max-fps parameter is semantically
1940	            different from max-lsr in that max-fps is used to signal a
1941	            constraint, lowering the maximum picture rate from what is
1942	            implied by other parameters.

1944	         The encoder MUST use a picture rate equal to or less than this
1945	         value.  In cases where the max-fps parameter is absent, the
1946	         encoder is free to choose any picture rate according to the
1947	         highest level and any signaled optional parameters.

1949	         The value of max-fps MUST be smaller than or equal to the full
1950	         picture rate that is implied by the highest level and, when
1951	         present, max-lsr.
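
Because max-fps is expressed in pictures per 100 seconds, a value of 3000 corresponds to 30 pictures per second. The sketch below shows how an encoder might derive its picture-rate ceiling; the function name and the rate_implied_by_level argument (the full picture rate implied by the highest level and, when present, max-lsr) are assumptions of this example.

```python
def encoder_rate_ceiling(rate_implied_by_level, max_fps=None):
    """Return the highest picture rate (pictures per second) the encoder may use.

    max_fps is the signaled value in pictures per 100 seconds, or None
    when the parameter is absent.
    """
    if max_fps is None:
        # Without max-fps, only the level (and max-lsr) bound the rate.
        return rate_implied_by_level
    limit = max_fps / 100
    if limit > rate_implied_by_level:
        raise ValueError("max-fps exceeds the rate implied by level/max-lsr")
    return limit
```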

1953	      sprop-max-don-diff:

1955	         If there is no NAL unit naluA that is followed in transmission
1956	         order by any NAL unit preceding naluA in decoding order (i.e.,
1957	         the transmission order of the NAL units is the same as the
1958	         decoding order), the value of this parameter MUST be equal to
1959	         0.

1961	         Otherwise, this parameter specifies the maximum absolute
1962	         difference between the decoding order number (i.e., AbsDon)
1963	         values of any two NAL units naluA and naluB, where naluA
1964	         follows naluB in decoding order and precedes naluB in
1965	         transmission order.

1967	         The value of sprop-max-don-diff MUST be an integer in the range
1968	         of 0 to 32767, inclusive.

1970	         When not present, the value of sprop-max-don-diff is inferred
1971	         to be equal to 0.
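
The definition of sprop-max-don-diff can be made concrete with a small sketch that derives the value from a list of AbsDon values given in transmission order. It assumes that a larger AbsDon means later in decoding order; the function name is hypothetical.

```python
def max_don_diff(abs_dons_in_transmission_order):
    """Largest AbsDon(naluA) - AbsDon(naluB) over all pairs where naluA
    precedes naluB in transmission order but follows it in decoding order.
    Returns 0 when transmission order equals decoding order.
    """
    largest_seen = None
    result = 0
    for don in abs_dons_in_transmission_order:
        if largest_seen is not None and largest_seen > don:
            # An earlier-sent NAL unit follows this one in decoding order.
            result = max(result, largest_seen - don)
        if largest_seen is None or don > largest_seen:
            largest_seen = don
    if result > 32767:
        raise ValueError("sprop-max-don-diff would exceed 32767")
    return result
```

Tracking only the largest AbsDon seen so far is sufficient, because for any later NAL unit the maximum difference is always reached against that value.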

1973	      sprop-depack-buf-bytes:

1975	         This parameter signals the required size of the de-
1976	         packetization buffer in units of bytes.  The value of the
1977	         parameter MUST be greater than or equal to the maximum buffer
1978	         occupancy (in units of bytes) of the de-packetization buffer as
1979	         specified in Section 6.

1981	         The value of sprop-depack-buf-bytes MUST be an integer in the
1982	         range of 0 to 4294967295, inclusive.

1984	         When sprop-max-don-diff is present and greater than 0, this
1985	         parameter MUST be present and the value MUST be greater than 0.
1986	         When not present, the value of sprop-depack-buf-bytes is
1987	         inferred to be equal to 0.

1989	            Informative note: The value of sprop-depack-buf-bytes
1990	            indicates the required size of the de-packetization buffer
1991	            only.  When network jitter can occur, an appropriately sized
1992	            jitter buffer has to be available as well.

1994	      depack-buf-cap:

1996	         This parameter signals the capabilities of a receiver
1997	         implementation and indicates the amount of de-packetization
1998	         buffer space in units of bytes that the receiver has available
1999	         for reconstructing the NAL unit decoding order from NAL units
2000	         carried in the RTP stream.  A receiver is able to handle any
2001	         RTP stream for which the value of the sprop-depack-buf-bytes
2002	         parameter is smaller than or equal to this parameter.

2004	         When not present, the value of depack-buf-cap is inferred to be
2005	         equal to 4294967295.  The value of depack-buf-cap MUST be an
2006	         integer in the range of 1 to 4294967295, inclusive.

2008	            Informative note: depack-buf-cap indicates the maximum
2009	            possible size of the de-packetization buffer of the receiver
2010	            only, without allowing for network jitter.
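
The relationship between sprop-depack-buf-bytes, sprop-max-don-diff, and depack-buf-cap can be sketched as a compatibility check. The function names are assumptions of this example, and both arguments are assumed to be dictionaries mapping parameter names to integer or decimal-string values.

```python
UINT32_MAX = 4294967295


def sender_buffer_requirement(stream_params):
    """Return the de-packetization buffer size (bytes) the stream requires."""
    don_diff = int(stream_params.get("sprop-max-don-diff", 0))
    # Inferred to be 0 when sprop-depack-buf-bytes is absent.
    needed = int(stream_params.get("sprop-depack-buf-bytes", 0))
    if not 0 <= needed <= UINT32_MAX:
        raise ValueError("sprop-depack-buf-bytes out of range")
    if don_diff > 0 and needed == 0:
        raise ValueError("sprop-depack-buf-bytes must be present and > 0 "
                         "when sprop-max-don-diff > 0")
    return needed


def receiver_can_handle(stream_params, receiver_params):
    """True if the stream's requirement fits the receiver's depack-buf-cap."""
    cap = int(receiver_params.get("depack-buf-cap", UINT32_MAX))
    if not 1 <= cap <= UINT32_MAX:
        raise ValueError("depack-buf-cap out of range")
    return sender_buffer_requirement(stream_params) <= cap
```

Jitter buffering is outside the scope of both parameters and is therefore not modelled.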

2012	   editor-note 19: sprop-depack-buf-nalus not included but mentioned in
2013	   section 6 for startup in de-packetization process.  We should decide
2014	   on whether it needs to be included or not.

2016	7.2.  SDP Parameters

2018	   The receiver MUST ignore any parameter unspecified in this memo.

2020	7.2.1.  Mapping of Payload Type Parameters to SDP

2022	   The media type video/H266 string is mapped to fields in the Session
2023	   Description Protocol (SDP) [RFC4566] as follows:

2025	   *  The media name in the "m=" line of SDP MUST be video.

2027	   *  The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the
2028	      media subtype).

2030	   *  The clock rate in the "a=rtpmap" line MUST be 90000.

2032	   *  The OPTIONAL parameters profile-id, tier-flag, sub-profile-id,
2033	      interop-constraints, level-id, sprop-sub-layer-id, sprop-ols-id,
2034	      recv-sub-layer-id, recv-ols-id, max-recv-level-id, max-lsr, max-
2035	      fps, sprop-max-don-diff, sprop-depack-buf-bytes and depack-buf-
2036	      cap, when present, MUST be included in the "a=fmtp" line of SDP.
2037	      These parameters are expressed as a media type string, in the form
2038	      of a semicolon-separated list of parameter=value pairs.

2040	      editor-note 20: To Be updated

2042	   An example of media representation in SDP is as follows:

2044	       m=video 49170 RTP/AVP 98
2045	       a=rtpmap:98 H266/90000
2046	       a=fmtp:98 profile-id=1; sprop-vps=<video parameter sets data>
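
A receiver might parse such an "a=fmtp" line as sketched below, dropping parameters it does not recognize. The function name parse_fmtp and the KNOWN_PARAMETERS set are assumptions of this example; the set lists only parameter names that appear in this section and may be incomplete relative to the full media type registration.

```python
# Parameter names mentioned in this section; possibly incomplete.
KNOWN_PARAMETERS = frozenset({
    "profile-id", "tier-flag", "sub-profile-id", "interop-constraints",
    "level-id", "sprop-sub-layer-id", "sprop-ols-id", "recv-sub-layer-id",
    "recv-ols-id", "max-recv-level-id", "sprop-dci", "sprop-vps",
    "sprop-sps", "sprop-sei", "max-lsr", "max-fps", "sprop-max-don-diff",
    "sprop-depack-buf-bytes", "depack-buf-cap",
})


def parse_fmtp(line):
    """Return (payload_type, params) for an 'a=fmtp:<pt> <params>' line."""
    head, _, rest = line.strip().partition(" ")
    if not head.startswith("a=fmtp:"):
        raise ValueError("not an a=fmtp line")
    payload_type = int(head[len("a=fmtp:"):])
    params = {}
    for pair in rest.split(";"):
        # Split on the first '=' only: base64 values may end in '=' padding.
        name, sep, value = pair.strip().partition("=")
        if sep and name in KNOWN_PARAMETERS:
            params[name] = value.strip()
    return payload_type, params
```

Values are returned as strings; range checks and base64 decoding are left to later steps.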

2048	7.2.2.  Usage with SDP Offer/Answer Model

2050	   When [VVC] is offered over RTP using SDP in an offer/answer model
2051	   [RFC3264] for negotiation for unicast usage, the following
2052	   limitations and rules apply:

2054	      editor-note 21: the following needs to be updated

2056	   *  Parameters to identify a media format configuration as VVC:

2058	   *  Parameters as bitstream properties:

2060	   *  SDP answer for media configurations.

2062	   *  capability parameters:

2064	   *  others:

2066	8.  Use with Feedback Messages

2068	   The following subsections define the use of the Picture Loss
2069	   Indication (PLI), Slice Loss Indication (SLI), Reference Picture
2070	   Selection Indication (RPSI), and Full Intra Request (FIR) feedback
2071	   messages with [VVC].  The PLI, SLI, and RPSI messages are defined in
2072	   [RFC4585], and the FIR message is defined in [RFC5104].

2074	8.1.  Picture Loss Indication (PLI)

2076	   As specified in RFC 4585, Section 6.3.1, the reception of a PLI by a
2077	   media sender indicates "the loss of an undefined amount of coded
2078	   video data belonging to one or more pictures".  Without having any
2079	   specific knowledge of the setup of the bitstream (such as use and
2080	   location of in-band parameter sets, non-IRAP decoder refresh points,
2081	   picture structures, and so forth), a reaction to the reception of a
2082	   PLI by a [VVC] sender SHOULD be to send an IRAP picture and relevant
2083	   parameter sets, potentially with sufficient redundancy to ensure
2084	   correct reception.  However, sometimes information about the
2085	   bitstream structure is known.  For example, state could have been
2086	   established outside of the mechanisms defined in this document that
2087	   parameter sets are conveyed out of band only, and stay static for the
2088	   duration of the session.  In that case, it is obviously unnecessary
2089	   to send them in-band as a result of the reception of a PLI.  Other
2090	   examples could be devised based on a priori knowledge of different
2091	   aspects of the bitstream structure.  In all cases, the timing and
2092	   congestion control mechanisms of RFC 4585 MUST be observed.

2094	8.2.  Full Intra Request (FIR)

2096	   The purpose of the FIR message is to force an encoder to send an
2097	   independent decoder refresh point as soon as possible, while
2098	   observing applicable congestion-control-related constraints, such as
2099	   those set out in [RFC8082].

2101	   Upon reception of a FIR, a sender MUST send an IDR picture.
2102	   Parameter sets MUST also be sent, except when there is a priori
2103	   knowledge that the parameter sets have been correctly established.  A
2104	   typical example for that is an understanding between sender and
2105	   receiver, established by means outside this document, that parameter
2106	   sets are exclusively sent out-of-band.
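
The sender reactions to PLI and FIR described in Sections 8.1 and 8.2 can be summarized in a small decision function. This is an illustrative sketch only: the function name, the string labels, and the out-of-band flag are assumptions of this example, and the timing and congestion control rules of RFC 4585 are not modelled.

```python
def react_to_feedback(message, parameter_sets_out_of_band_only=False):
    """Return what a sender would transmit in response to a feedback message.

    message: "PLI" or "FIR".
    parameter_sets_out_of_band_only: True when sender and receiver have
    established, by means outside the payload format, that parameter sets
    are conveyed out of band only and stay static for the session.
    """
    if message == "FIR":
        picture = "IDR"    # MUST send an IDR picture
    elif message == "PLI":
        picture = "IRAP"   # SHOULD send an IRAP picture absent other knowledge
    else:
        raise ValueError(f"unsupported feedback message: {message}")
    return {
        "picture": picture,
        "send_parameter_sets": not parameter_sets_out_of_band_only,
    }
```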

2108	9.  Security Considerations

2110	   The scope of this Security Considerations section is limited to the
2111	   payload format itself and to one feature of [VVC] that may pose a
2112	   particularly serious security risk if implemented naively.  The
2113	   payload format, in isolation, does not form a complete system.
2114	   Implementers are advised to read and understand relevant security-
2115	   related documents, especially those pertaining to RTP (see the
2116	   Security Considerations section in [RFC3550]), and the security of
2117	   the call-control stack chosen (that may make use of the media type
2118	   registration of this memo).  Implementers should also consider known
2119	   security vulnerabilities of video coding and decoding implementations
2120	   in general and avoid those.

2122	   Within this RTP payload format, and with the exception of the user
2123	   data SEI message as described below, no security threats other than
2124	   those common to RTP payload formats are known.  In other words,
2125	   neither the various media-plane-based mechanisms, nor the signaling
2126	   part of this memo, seems to pose a security risk beyond those common
2127	   to all RTP-based systems.

2129	   RTP packets using the payload format defined in this specification
2130	   are subject to the security considerations discussed in the RTP
2131	   specification [RFC3550], and in any applicable RTP profile such as
2132	   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/
2133	   SAVPF [RFC5124].  However, as "Securing the RTP Framework: Why RTP
2134	   Does Not Mandate a Single Media Security Solution" [RFC7202]
2135	   discusses, it is not an RTP payload format's responsibility to
2136	   discuss or mandate what solutions are used to meet the basic security
2137	   goals like confidentiality, integrity and source authenticity for RTP
2138	   in general.  This responsibility lies with anyone using RTP in an
2139	   application.  They can find guidance on available security mechanisms
2140	   and important considerations in "Options for Securing RTP Sessions"
2141	   [RFC7201].  The rest of this section discusses the security-impacting
2142	   properties of the payload format itself.

2144	   Because the data compression used with this payload format is applied
2145	   end-to-end, any encryption needs to be performed after compression.
2146	   A potential denial-of-service threat exists for data encodings using
2147	   compression techniques that have non-uniform receiver-end
2148	   computational load.  The attacker can inject pathological datagrams
2149	   into the bitstream that are complex to decode and that cause the
2150	   receiver to be overloaded.  [VVC] is particularly vulnerable to such
2151	   attacks, as it is extremely simple to generate datagrams containing
2152	   NAL units that affect the decoding process of many future NAL units.
2153	   Therefore, the usage of data origin authentication and data integrity
2154	   protection of at least the RTP packet is RECOMMENDED, for example,
2155	   with SRTP [RFC3711].

2157	   Like HEVC [RFC7798], [VVC] includes a user data Supplemental
2158	   Enhancement Information (SEI) message.  This SEI message allows
2159	   inclusion of an arbitrary bitstring into the video bitstream.  Such a
2160	   bitstring could include JavaScript, machine code, and other active
2161	   content.  [VVC] leaves the handling of this SEI message to the
2162	   receiving system.  In order to avoid harmful side effects of the
2163	   user data SEI message, decoder implementations cannot naively trust
2164	   its content.  For example, it would be a bad and insecure
2165	   implementation practice to forward any JavaScript a decoder
2166	   implementation detects to a web browser.  The safest way to deal with
2167	   user data SEI messages is to simply discard them, but that can have
2168	   negative side effects on the user's quality of experience.

2170	   End-to-end security with authentication, integrity, or
2171	   confidentiality protection will prevent a MANE from performing media-
2172	   aware operations other than discarding complete packets.  In the case
2173	   of confidentiality protection, it will even be prevented from
2174	   discarding packets in a media-aware way.  To be allowed to perform
2175	   such operations, a MANE is required to be a trusted entity that is
2176	   included in the security context establishment.

2178	10.  Congestion Control

2180	   Congestion control for RTP SHALL be used in accordance with RTP
2181	   [RFC3550] and with any applicable RTP profile, e.g., AVP [RFC3551].
2182	   If best-effort service is being used, an additional requirement is
2183	   that users of this payload format MUST monitor packet loss to ensure
2184	   that the packet loss rate is within an acceptable range.  Packet loss
2185	   is considered acceptable if a TCP flow across the same network path,
2186	   and experiencing the same network conditions, would achieve an
2187	   average throughput, measured on a reasonable timescale, that is not
2188	   less than all RTP streams combined are achieving.  This condition can
2189	   be satisfied by implementing congestion-control mechanisms to adapt
2190	   the transmission rate, the number of layers subscribed for a layered
2191	   multicast session, or by arranging for a receiver to leave the
2192	   session if the loss rate is unacceptably high.

2194	   The bitrate adaptation necessary for obeying the congestion control
2195	   principle is easily achievable when real-time encoding is used, for
2196	   example, by adequately tuning the quantization parameter.  However,
2197	   when pre-encoded content is being transmitted, bandwidth adaptation
2198	   requires the pre-coded bitstream to be tailored for such adaptivity.
2199	   The key mechanisms available in [VVC] are temporal scalability and
2200	   spatial/SNR scalability.  A media sender can remove NAL units
2201	   belonging to higher temporal sublayers (i.e., those NAL units with a
2202	   high value of TID) or higher spatio-SNR layers (as indicated by
2203	   interpreting the VPS) until the sending bitrate drops to an
2204	   acceptable range.
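
The temporal-sublayer thinning described above can be sketched as follows. The function names and the data layout are assumptions of this example: the per-TID bitrates are assumed to be measured by the sender, and each NAL unit is modelled as a (tid, payload) tuple. Spatial/SNR layer removal, which requires interpreting the VPS, is not covered.

```python
def highest_tid_within_budget(bitrate_by_tid, budget_bps):
    """Return the highest TID that can be kept without exceeding budget_bps.

    bitrate_by_tid[t] is the bitrate contributed by NAL units with TID == t.
    Returns None when even the lowest sub-layer exceeds the budget.
    """
    total = 0
    chosen = None
    for tid, rate in enumerate(bitrate_by_tid):
        if total + rate > budget_bps:
            break
        total += rate
        chosen = tid
    return chosen


def drop_higher_sublayers(nal_units, max_tid):
    """Keep only the NAL units whose TID does not exceed max_tid."""
    return [unit for unit in nal_units if unit[0] <= max_tid]
```

Sub-layers are removed from the top down because lower temporal sub-layers do not depend on higher ones.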

2206	   The mechanisms mentioned above generally work within a defined
2207	   profile and level and, therefore, no renegotiation of the channel is
2208	   required.  Only when non-downgradable parameters (such as profile)
2209	   are required to be changed does it become necessary to terminate and
2210	   restart the RTP stream(s).  This may be accomplished by using
2211	   different RTP payload types.

2213	   MANEs MAY remove certain unusable packets from the RTP stream when
2214	   that RTP stream was damaged due to previous packet losses.  This can
2215	   help reduce the network load in certain special cases.  For example,
2216	   MANEs can remove those FUs where the leading FUs belonging to the
2217	   same NAL unit have been lost or those dependent slice segments when
2218	   the leading slice segments belonging to the same slice have been
2219	   lost, because the trailing FUs or dependent slice segments are
2220	   meaningless to most decoders.  MANEs can also remove higher temporal
2221	   scalable layers if the outbound transmission (from the MANE's
2222	   viewpoint) experiences congestion.

2224	11.  IANA Considerations

2226	   Placeholder

2228	12.  Acknowledgements

2230	   Dr. Byeongdoo Choi is thanked for the video codec related technical
2231	   discussion and other aspects in this memo.  Xin Zhao and Dr. Xiang Li
2232	   are thanked for their contributions to the descriptive content on
2233	   the [VVC] specification.  Spencer Dawkins is thanked for his valuable
2234	   review comments that led to great improvements of this memo.  Some
2235	   parts of this specification share text with the RTP payload format
2236	   for HEVC [RFC7798].  We thank the authors of that specification for
2237	   their excellent work.

2239	13.  References

2241	13.1.  Normative References

2243	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
2244	              Requirement Levels", BCP 14, RFC 2119,
2245	              DOI 10.17487/RFC2119, March 1997,
2246	              <https://www.rfc-editor.org/info/rfc2119>.

2248	   [RFC3264]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
2249	              with Session Description Protocol (SDP)", RFC 3264,
2250	              DOI 10.17487/RFC3264, June 2002,
2251	              <https://www.rfc-editor.org/info/rfc3264>.

2253	   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
2254	              Jacobson, "RTP: A Transport Protocol for Real-Time
2255	              Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
2256	              July 2003, <https://www.rfc-editor.org/info/rfc3550>.

2258	   [RFC3551]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
2259	              Video Conferences with Minimal Control", STD 65, RFC 3551,
2260	              DOI 10.17487/RFC3551, July 2003,
2261	              <https://www.rfc-editor.org/info/rfc3551>.

2263	   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
2264	              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
2265	              RFC 3711, DOI 10.17487/RFC3711, March 2004,
2266	              <https://www.rfc-editor.org/info/rfc3711>.

2268	   [RFC4556]  Zhu, L. and B. Tung, "Public Key Cryptography for Initial
2269	              Authentication in Kerberos (PKINIT)", RFC 4556,
2270	              DOI 10.17487/RFC4556, June 2006,
2271	              <https://www.rfc-editor.org/info/rfc4556>.

2273	   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
2274	              Description Protocol", RFC 4566, DOI 10.17487/RFC4566,
2275	              July 2006, <https://www.rfc-editor.org/info/rfc4566>.

2277	   [RFC4585]  Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey,
2278	              "Extended RTP Profile for Real-time Transport Control
2279	              Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585,
2280	              DOI 10.17487/RFC4585, July 2006,
2281	              <https://www.rfc-editor.org/info/rfc4585>.

2283	   [RFC4648]  Josefsson, S., "The Base16, Base32, and Base64 Data
2284	              Encodings", RFC 4648, DOI 10.17487/RFC4648, October 2006,
2285	              <https://www.rfc-editor.org/info/rfc4648>.

2287	   [RFC5104]  Wenger, S., Chandra, U., Westerlund, M., and B. Burman,
2288	              "Codec Control Messages in the RTP Audio-Visual Profile
2289	              with Feedback (AVPF)", RFC 5104, DOI 10.17487/RFC5104,
2290	              February 2008, <https://www.rfc-editor.org/info/rfc5104>.

2292	   [RFC5124]  Ott, J. and E. Carrara, "Extended Secure RTP Profile for
2293	              Real-time Transport Control Protocol (RTCP)-Based Feedback
2294	              (RTP/SAVPF)", RFC 5124, DOI 10.17487/RFC5124, February
2295	              2008, <https://www.rfc-editor.org/info/rfc5124>.

2297	   [RFC7656]  Lennox, J., Gross, K., Nandakumar, S., Salgueiro, G., and
2298	              B. Burman, Ed., "A Taxonomy of Semantics and Mechanisms
2299	              for Real-Time Transport Protocol (RTP) Sources", RFC 7656,
2300	              DOI 10.17487/RFC7656, November 2015,
2301	              <https://www.rfc-editor.org/info/rfc7656>.

2303	   [RFC8082]  Wenger, S., Lennox, J., Burman, B., and M. Westerlund,
2304	              "Using Codec Control Messages in the RTP Audio-Visual
2305	              Profile with Feedback with Layered Codecs", RFC 8082,
2306	              DOI 10.17487/RFC8082, March 2017,
2307	              <https://www.rfc-editor.org/info/rfc8082>.

2309	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2310	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
2311	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

2313	   [VSEI]     "ISO/IEC 23002-7 (ITU-T H.274) Versatile supplemental
2314	              enhancement information messages for coded video
2315	              bitstreams", 2020,
2316	              <https://www.iso.org/standard/79112.html>.

2318	   [VVC]      "ISO/IEC FDIS 23090-3 Information technology --- Coded
2319	              representation of immersive media --- Part 3 - Versatile
2320	              video coding", 2020,
2321	              <https://www.iso.org/standard/73022.html>.

2323	13.2.  Informative References

2325	   [CABAC]    Sole, J., et al., "Transform coefficient coding in HEVC",
2326	              IEEE Transactions on Circuits and Systems for Video
2327	              Technology, DOI 10.1109/TCSVT.2012.2223055, December
2328	              2012, <https://doi.org/10.1109/TCSVT.2012.2223055>.

2330	   [HEVC]     "High efficiency video coding, ITU-T Recommendation
2331	              H.265", April 2013.

2333	   [MPEG2S]   ISO/IEC, "Information technology - Generic coding of
2334	              moving pictures and associated audio information - Part
2335	              1: Systems", ISO International Standard 13818-1, 2013.

2337	   [RFC6184]  Wang, Y.-K., Even, R., Kristensen, T., and R. Jesup, "RTP
2338	              Payload Format for H.264 Video", RFC 6184,
2339	              DOI 10.17487/RFC6184, May 2011,
2340	              <https://www.rfc-editor.org/info/rfc6184>.

2342	   [RFC6190]  Wenger, S., Wang, Y.-K., Schierl, T., and A.
2343	              Eleftheriadis, "RTP Payload Format for Scalable Video
2344	              Coding", RFC 6190, DOI 10.17487/RFC6190, May 2011,
2345	              <https://www.rfc-editor.org/info/rfc6190>.

2347	   [RFC7201]  Westerlund, M. and C. Perkins, "Options for Securing RTP
2348	              Sessions", RFC 7201, DOI 10.17487/RFC7201, April 2014,
2349	              <https://www.rfc-editor.org/info/rfc7201>.

2351	   [RFC7202]  Perkins, C. and M. Westerlund, "Securing the RTP
2352	              Framework: Why RTP Does Not Mandate a Single Media
2353	              Security Solution", RFC 7202, DOI 10.17487/RFC7202, April
2354	              2014, <https://www.rfc-editor.org/info/rfc7202>.

2356	   [RFC7798]  Wang, Y.-K., Sanchez, Y., Schierl, T., Wenger, S., and M.
2357	              M. Hannuksela, "RTP Payload Format for High Efficiency
2358	              Video Coding (HEVC)", RFC 7798, DOI 10.17487/RFC7798,
2359	              March 2016, <https://www.rfc-editor.org/info/rfc7798>.

2361	Appendix A.  Change History

2363	   draft-zhao-payload-rtp-vvc-00 ........ initial version

2365	   draft-zhao-payload-rtp-vvc-01 ........ editorial clarifications and
2366	   corrections

2368	   draft-ietf-payload-rtp-vvc-00 ........ initial WG draft

2370	   draft-ietf-payload-rtp-vvc-01 ........ VVC specification update

2372	   draft-ietf-payload-rtp-vvc-02 ........ VVC specification update

2374	   draft-ietf-payload-rtp-vvc-03 ........ VVC coding tool introduction
2375	   update

2377	   draft-ietf-payload-rtp-vvc-04 ........ VVC coding tool introduction
2378	   update

2380	   draft-ietf-payload-rtp-vvc-05 ........ reference update and adding
2381	   placement for open issues

2383	   draft-ietf-payload-rtp-vvc-06 ........ address editor's note

2385	   draft-ietf-payload-rtp-vvc-07 ........ address editor's notes

2387	Authors' Addresses
2388	   Shuai Zhao
2389	   Tencent
2390	   2747 Park Blvd
2391	   Palo Alto,  94588
2392	   United States of America

2394	   Email: shuai.zhao@ieee.org

2396	   Stephan Wenger
2397	   Tencent
2398	   2747 Park Blvd
2399	   Palo Alto,  94588
2400	   United States of America

2402	   Email: stewe@stewe.org

2404	   Yago Sanchez
2405	   Fraunhofer HHI
2406	   Einsteinufer 37
2407	   10587 Berlin
2408	   Germany

2410	   Email: yago.sanchez@hhi.fraunhofer.de

2412	   Ye-Kui Wang
2413	   Bytedance Inc.
2414	   8910 University Center Lane
2415	   San Diego,  92122
2416	   United States of America

2418	   Email: yekui.wang@bytedance.com