idnits 2.17.1 

draft-ietf-avtcore-rtp-vvc-09.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document date (2 June 2021) is 1052 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 1370

  ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866)

  ** Downref: Normative reference to an Informational RFC: RFC 7656

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VSEI'

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC'


     Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	avtcore                                                          S. Zhao
3	Internet-Draft                                                 S. Wenger
4	Intended status: Standards Track                                 Tencent
5	Expires: 4 December 2021                                      Y. Sanchez
6	                                                          Fraunhofer HHI
7	                                                              Y.-K. Wang
8	                                                          Bytedance Inc.
9	                                                             2 June 2021

11	          RTP Payload Format for Versatile Video Coding (VVC)
12	                     draft-ietf-avtcore-rtp-vvc-09

14	Abstract

16	   This memo describes an RTP payload format for the video coding
17	   standard ITU-T Recommendation H.266 and ISO/IEC International
18	   Standard 23090-3, both also known as Versatile Video Coding (VVC) and
19	   developed by the Joint Video Experts Team (JVET).  The RTP payload
20	   format allows for packetization of one or more Network Abstraction
21	   Layer (NAL) units in each RTP packet payload as well as fragmentation
22	   of a NAL unit into multiple RTP packets.  The payload format has wide
23	   applicability in videoconferencing, Internet video streaming, and
24	   high-bitrate entertainment-quality video, among other applications.

26	Status of This Memo

28	   This Internet-Draft is submitted in full conformance with the
29	   provisions of BCP 78 and BCP 79.

31	   Internet-Drafts are working documents of the Internet Engineering
32	   Task Force (IETF).  Note that other groups may also distribute
33	   working documents as Internet-Drafts.  The list of current Internet-
34	   Drafts is at https://datatracker.ietf.org/drafts/current/.

36	   Internet-Drafts are draft documents valid for a maximum of six months
37	   and may be updated, replaced, or obsoleted by other documents at any
38	   time.  It is inappropriate to use Internet-Drafts as reference
39	   material or to cite them other than as "work in progress."

41	   This Internet-Draft will expire on 4 December 2021.

43	Copyright Notice

45	   Copyright (c) 2021 IETF Trust and the persons identified as the
46	   document authors.  All rights reserved.

48	   This document is subject to BCP 78 and the IETF Trust's Legal
49	   Provisions Relating to IETF Documents (https://trustee.ietf.org/
50	   license-info) in effect on the date of publication of this document.
51	   Please review these documents carefully, as they describe your rights
52	   and restrictions with respect to this document.  Code Components
53	   extracted from this document must include Simplified BSD License text
54	   as described in Section 4.e of the Trust Legal Provisions and are
55	   provided without warranty as described in the Simplified BSD License.

57	Table of Contents

59	   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . .   3
60	     1.1.  Overview of the VVC Codec . . . . . . . . . . . . . . . .   3
61	       1.1.1.  Coding-Tool Features (informative)  . . . . . . . . .   3
62	       1.1.2.  Systems and Transport Interfaces (informative)  . . .   6
63	       1.1.3.  High-Level Picture Partitioning (informative) . . . .  11
64	       1.1.4.  NAL Unit Header . . . . . . . . . . . . . . . . . . .  13
65	     1.2.  Overview of the Payload Format  . . . . . . . . . . . . .  14
66	   2.  Conventions . . . . . . . . . . . . . . . . . . . . . . . . .  15
67	   3.  Definitions and Abbreviations . . . . . . . . . . . . . . . .  15
68	     3.1.  Definitions . . . . . . . . . . . . . . . . . . . . . . .  15
69	       3.1.1.  Definitions from the VVC Specification  . . . . . . .  15
70	       3.1.2.  Definitions Specific to This Memo . . . . . . . . . .  18
71	     3.2.  Abbreviations . . . . . . . . . . . . . . . . . . . . . .  19
72	   4.  RTP Payload Format  . . . . . . . . . . . . . . . . . . . . .  20
73	     4.1.  RTP Header Usage  . . . . . . . . . . . . . . . . . . . .  20
74	     4.2.  Payload Header Usage  . . . . . . . . . . . . . . . . . .  21
75	     4.3.  Payload Structures  . . . . . . . . . . . . . . . . . . .  22
76	       4.3.1.  Single NAL Unit Packets . . . . . . . . . . . . . . .  22
77	       4.3.2.  Aggregation Packets (APs) . . . . . . . . . . . . . .  23
78	       4.3.3.  Fragmentation Units . . . . . . . . . . . . . . . . .  27
79	     4.4.  Decoding Order Number . . . . . . . . . . . . . . . . . .  30
80	   5.  Packetization Rules . . . . . . . . . . . . . . . . . . . . .  31
81	   6.  De-packetization Process  . . . . . . . . . . . . . . . . . .  32
82	   7.  Payload Format Parameters . . . . . . . . . . . . . . . . . .  34
83	     7.1.  Media Type Registration . . . . . . . . . . . . . . . . .  34
84	     7.2.  SDP Parameters  . . . . . . . . . . . . . . . . . . . . .  44
85	       7.2.1.  Mapping of Payload Type Parameters to SDP . . . . . .  44
86	       7.2.2.  Usage with SDP Offer/Answer Model . . . . . . . . . .  45
87	   8.  Use with Feedback Messages  . . . . . . . . . . . . . . . . .  45
88	     8.1.  Picture Loss Indication (PLI) . . . . . . . . . . . . . .  46
89	     8.2.  Full Intra Request (FIR)  . . . . . . . . . . . . . . . .  46
90	   9.  Security Considerations . . . . . . . . . . . . . . . . . . .  46
91	   10. Congestion Control  . . . . . . . . . . . . . . . . . . . . .  48
92	   11. IANA Considerations . . . . . . . . . . . . . . . . . . . . .  49
93	   12. Acknowledgements  . . . . . . . . . . . . . . . . . . . . . .  49
94	   13. References  . . . . . . . . . . . . . . . . . . . . . . . . .  49
95	     13.1.  Normative References . . . . . . . . . . . . . . . . . .  49
96	     13.2.  Informative References . . . . . . . . . . . . . . . . .  51
97	   Appendix A.  Change History . . . . . . . . . . . . . . . . . . .  52
98	   Authors' Addresses  . . . . . . . . . . . . . . . . . . . . . . .  52

100	1.  Introduction

102	   The Versatile Video Coding [VVC] specification, formally published as
103	   both ITU-T Recommendation H.266 and ISO/IEC International Standard
104	   23090-3, is currently in the ITU-T publication process and the ISO/
105	   IEC approval process.  VVC is reported to provide significant coding
106	   efficiency gains over HEVC [HEVC], also known as H.265, and other
107	   earlier video codecs.

109	   This memo specifies an RTP payload format for VVC.  It shares its
110	   basic design with the NAL (Network Abstraction Layer) unit-based RTP
111	   payload formats of H.264 Video Coding [RFC6184], Scalable Video
112	   Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798]
113	   and their respective predecessors.  With respect to design
114	   philosophy, security, congestion control, and overall implementation
115	   complexity, it has similar properties to those earlier payload format
116	   specifications.  This is a conscious choice, as at least RFC 6184 is
117	   widely deployed and generally known in the relevant implementer
118	   communities.  Certain mechanisms known from [RFC6190] were
119	   incorporated in VVC, as VVC version 1 supports temporal, spatial, and
120	   signal-to-noise ratio (SNR) scalability.

122	1.1.  Overview of the VVC Codec

124	   VVC and HEVC share a similar hybrid video codec design.  In this
125	   memo, we provide a very brief overview of those features of VVC that
126	   are, in some form, addressed by the payload format specified herein.
127	   Implementers have to read, understand, and apply the ITU-T/ISO/IEC
128	   specifications pertaining to VVC to arrive at interoperable, well-
129	   performing implementations.

131	   Conceptually, both VVC and HEVC include a Video Coding Layer (VCL),
132	   which is often used to refer to the coding-tool features, and a NAL,
133	   which is often used to refer to the systems and transport interface
134	   aspects of the codecs.

136	1.1.1.  Coding-Tool Features (informative)

138	   Coding tool features are described below with occasional reference to
139	   the coding tool set of HEVC, which is well known in the community.

141	   Similar to earlier hybrid-video-coding-based standards, including
142	   HEVC, the following basic video coding design is employed by VVC.  A
143	   prediction signal is first formed by either intra- or motion-
144	   compensated prediction, and the residual (the difference between the
145	   original and the prediction) is then coded.  The gains in coding
146	   efficiency are achieved by redesigning and improving almost all parts
147	   of the codec over earlier designs.  In addition, VVC includes several
148	   tools to make the implementation on parallel architectures easier.

150	   Finally, VVC includes temporal, spatial, and SNR scalability as well
151	   as multiview coding support.
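
   As a rough illustration of the hybrid design described above, the
   following Python sketch forms a residual from a given prediction,
   quantizes it, and reconstructs the samples.  It is a toy under stated
   assumptions: the function names, the one-dimensional sample lists,
   and the uniform quantizer with step size "step" are hypothetical and
   merely stand in for the transform, quantization, and entropy coding
   stages of a real VVC codec.

```python
from typing import List


def quantize(residual: List[int], step: int) -> List[int]:
    """Uniform scalar quantizer, rounding to nearest (toy stand-in)."""
    levels = []
    for r in residual:
        magnitude = (abs(r) + step // 2) // step
        levels.append(magnitude if r >= 0 else -magnitude)
    return levels


def encode(original: List[int], prediction: List[int], step: int) -> List[int]:
    """Residual = original minus prediction; only the residual is coded."""
    residual = [o - p for o, p in zip(original, prediction)]
    return quantize(residual, step)


def decode(levels: List[int], prediction: List[int], step: int) -> List[int]:
    """Reconstruction = prediction plus the dequantized residual."""
    return [p + lvl * step for p, lvl in zip(prediction, levels)]


if __name__ == "__main__":
    original = [52, 55, 61, 66]
    prediction = [50, 50, 60, 60]  # assumed output of intra/inter prediction
    levels = encode(original, prediction, step=4)
    print(levels, decode(levels, prediction, step=4))
```

   The better the prediction, the smaller the residual and the fewer
   bits the coded levels need, which is where most of the coding tools
   listed below aim to help.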

153	   Coding blocks and transform structure

155	   Among major coding-tool differences between HEVC and VVC, one of the
156	   important improvements is the more flexible coding tree structure in
157	   VVC, i.e., the multi-type tree.  In addition to quadtree, binary and
158	   ternary trees are also supported, which contributes significantly to
159	   coding efficiency.  Moreover, the maximum size of the
160	   coding tree unit (CTU) is increased from 64x64 to 128x128.  To
161	   improve the coding efficiency of the chroma signal, separate luma and
162	   chroma trees at CTU level may be employed for intra slices.  The
163	   square transforms in HEVC are extended to non-square transforms for
164	   rectangular blocks resulting from binary and ternary tree splits.
165	   Besides, VVC supports multiple transform selection (MTS), including
166	   DCT-2, DST-7, and DCT-8, as well as the non-separable secondary
167	   transform.  The transforms used in VVC can have different sizes, with
168	   support for larger transform sizes.  For DCT-2, the transform sizes
169	   range from 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes
170	   range from 4x4 to 32x32.  In addition, VVC also supports sub-block
171	   transforms for both intra and inter coded blocks.  For intra coded
172	   blocks, intra sub-partitioning (ISP) may be used to allow sub-block
173	   based intra prediction and transform.  For inter blocks, sub-block
174	   transform may be used assuming that only a part of an inter-block has
175	   non-zero transform coefficients.
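
   To make the multi-type tree more concrete, the sketch below lists the
   child block sizes produced by each split type and shows how binary
   and ternary splits lead to the non-square blocks mentioned above.
   The function name and the mode strings are hypothetical, the 1:2:1
   ratio for ternary splits is stated here as an assumption about the
   VVC design, and the constraints a real encoder must respect (minimum
   block sizes, allowed split depths) are ignored.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (width, height) in luma samples


def split_block(width: int, height: int, mode: str) -> List[Block]:
    """Return the child block sizes for one split of a width x height block."""
    if mode == "quad":
        return [(width // 2, height // 2)] * 4
    if mode == "binary_vertical":
        return [(width // 2, height)] * 2
    if mode == "binary_horizontal":
        return [(width, height // 2)] * 2
    if mode == "ternary_vertical":
        # Assumed 1:2:1 partition of the width.
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    if mode == "ternary_horizontal":
        # Assumed 1:2:1 partition of the height.
        return [(width, height // 4), (width, height // 2), (width, height // 4)]
    raise ValueError(f"unknown split mode: {mode}")


if __name__ == "__main__":
    for mode in ("quad", "binary_vertical", "ternary_horizontal"):
        print(mode, split_block(128, 128, mode))
```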

177	   Entropy coding

179	   Similar to HEVC, VVC uses a single entropy-coding engine, which is
180	   based on context adaptive binary arithmetic coding [CABAC], but with
181	   the support of multi-window sizes.  The window sizes can be
182	   initialized differently for different context models.  Due to such a
183	   design, it offers faster adaptation and better coding
184	   efficiency.  A joint chroma residual coding scheme is applied to
185	   further exploit the correlation between the residuals of two color
186	   components.  In VVC, different residual coding schemes are applied
187	   for regular transform coefficients and residual samples generated
188	   using transform-skip mode.

190	   In-loop filtering

191	   VVC supports more in-loop filter features than HEVC.  The
192	   deblocking filter in VVC is similar to that of HEVC but operates on a
193	   finer grid.  After deblocking and sample adaptive offset (SAO), an
194	   adaptive loop filter (ALF) may be used.  As a Wiener filter, ALF
195	   reduces distortion of decoded pictures.  Besides, VVC introduces a
196	   new module before deblocking, called luma mapping with chroma
197	   scaling, to fully utilize the dynamic range of the signal so that
198	   rate-distortion performance of both SDR and HDR content is improved.

200	   Motion prediction and coding

202	   Compared to HEVC, VVC introduces several improvements in this area.
203	   First, there is the adaptive motion vector resolution (AMVR), which
204	   can save bit cost for motion vectors by adaptively signaling motion
205	   vector resolution.  Then the affine motion compensation is included
206	   to capture complicated motion like zooming and rotation.  Meanwhile,
207	   prediction refinement with optical flow for affine mode (PROF)
208	   is further deployed to mimic affine motion at the pixel level.
209	   Third, decoder-side motion vector refinement (DMVR) is a method to
210	   derive motion vectors at the decoder side based on block matching, so
211	   fewer bits may be spent on motion vectors.  Bi-directional optical
212	   flow (BDOF) is a similar method to PROF.  BDOF adds a sample wise
213	   offset at 4x4 sub-block level that is derived with equations based on
214	   gradients of the prediction samples and a motion difference relative
215	   to CU motion vectors.  Furthermore, merge with motion vector
216	   difference (MMVD) is a special mode, which further signals a limited
217	   set of motion vector differences on top of merge mode.  In addition
218	   to MMVD, there are three other types of special merge modes, i.e.,
219	   sub-block merge, triangle, and combined intra-/inter-prediction
220	   (CIIP).  Sub-block merge list includes one candidate of sub-block
221	   temporal motion vector prediction (SbTMVP) and up to four candidates
222	   of affine motion vectors.  Triangle is based on triangular block
223	   motion compensation.  CIIP combines intra- and inter- predictions
224	   with weighting.  Adaptive weighting may be employed with a block-
225	   level tool called bi-prediction with CU based weighting (BCW) which
226	   provides more flexibility than in HEVC.

228	   Intra prediction and intra-coding

230	   To capture the diversified local image texture directions with finer
231	   granularity, VVC supports 65 angular directions instead of 33
232	   directions in HEVC.  The intra mode coding is based on a 6-most-
233	   probable-mode scheme, and the 6 most probable modes are derived using
234	   the neighboring intra prediction directions.  In addition, to deal
235	   with the different distributions of intra prediction angles for
236	   different block aspect ratios, a wide-angle intra prediction (WAIP)
237	   scheme is applied in VVC by including intra prediction angles beyond
238	   those present in HEVC.  Unlike HEVC, which only allows using the
239	   nearest line of reference samples for intra prediction, VVC also
240	   allows using two further reference lines, also known as multi-
241	   reference-line (MRL) intra prediction.  The additional reference
242	   lines can only be used for the 6 most probable intra prediction
243	   modes.  To capture the strong correlation between different colour
244	   components, in VVC, a cross-component linear mode (CCLM) is utilized
245	   which assumes a linear relationship between the luma sample values
246	   and their associated chroma samples.  For intra prediction, VVC also
247	   applies a position-dependent prediction combination (PDPC) for
248	   refining the prediction samples closer to the intra prediction block
249	   boundary.  Matrix-based intra prediction (MIP) modes are also used in
250	   VVC; they generate an intra prediction block of up to 8x8 samples
251	   using a weighted sum of downsampled neighboring reference samples,
252	   where the weights are hardcoded constants.
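
   The linear luma-to-chroma relationship assumed by CCLM can be
   sketched as follows.  This is an illustration only: the function
   name is hypothetical, floating-point arithmetic is used for
   readability, the luma samples are assumed to be already downsampled
   to the chroma grid, and the parameter derivation (a straight line
   through the neighboring samples with the smallest and largest luma
   value) is a simplification chosen for this example rather than the
   normative VVC process.

```python
from typing import List


def cclm_predict(neigh_luma: List[int], neigh_chroma: List[int],
                 block_luma: List[List[int]]) -> List[List[float]]:
    """Predict chroma as alpha * luma + beta (simplified linear model)."""
    indices = range(len(neigh_luma))
    lo = min(indices, key=neigh_luma.__getitem__)  # smallest neighbor luma
    hi = max(indices, key=neigh_luma.__getitem__)  # largest neighbor luma
    luma_span = neigh_luma[hi] - neigh_luma[lo]
    # Flat luma neighborhood: fall back to a constant prediction.
    alpha = (neigh_chroma[hi] - neigh_chroma[lo]) / luma_span if luma_span else 0.0
    beta = neigh_chroma[lo] - alpha * neigh_luma[lo]
    return [[alpha * v + beta for v in row] for row in block_luma]


if __name__ == "__main__":
    pred = cclm_predict([100, 120, 140, 160], [60, 70, 80, 90],
                        [[100, 160], [130, 110]])
    print(pred)
```

   Because the model parameters are derived from already reconstructed
   neighboring samples, a decoder can repeat the derivation without any
   parameters being transmitted.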

254	   Other coding-tool features

256	   VVC introduces dependent quantization (DQ) to reduce quantization
257	   error by state-based switching between two quantizers.

259	1.1.2.  Systems and Transport Interfaces (informative)

261	   VVC inherits the basic systems and transport interfaces designs from
262	   HEVC and H.264.  These include the NAL-unit-based syntax structure,
263	   the hierarchical syntax and data unit structure, the supplemental
264	   enhancement information (SEI) message mechanism, and the video
265	   buffering model based on the hypothetical reference decoder (HRD).
266	   The scalability features of VVC are conceptually similar to the
267	   scalable variant of HEVC known as SHVC.  The hierarchical syntax and
268	   data unit structure consists of parameter sets at various levels
269	   (decoder, sequence (pertaining to all layers), sequence (pertaining
270	   to a single layer), picture), picture-level header parameters, slice-
271	   level header parameters, and lower-level parameters.

273	   A number of key components that influenced the network abstraction
274	   layer design of VVC as well as this memo are described below.

276	   Decoding capability information

278	   The decoding capability information includes parameters that stay
279	   constant for the lifetime of a Video Bitstream, which in IETF terms
280	   can translate to the lifetime of a session.  Such information
281	   includes profile, level, and sub-profile information to determine a
282	   maximum capability interop point that is guaranteed to be never
283	   exceeded, even if splicing of video sequences occurs within a
284	   session.  It further includes constraint fields (most of which are
285	   flags), which can optionally be set to indicate that the video
286	   bitstream will be constrained in the use of certain features as
287	   indicated by the values of those fields.  With this, a bitstream can
288	   be labelled as not using certain tools, which allows among other
289	   things for resource allocation in a decoder implementation.

291	   Video parameter set

293	   The video parameter set (VPS) pertains to a coded video sequence
294	   (CVS) of multiple layers covering the same range of access units, and
295	   includes, among other information, decoding dependency expressed as
296	   information for reference picture list construction of enhancement
297	   layers.  The VPS provides a "big picture" of a scalable sequence,
298	   including what types of operation points are provided, the profile,
299	   tier, and level of the operation points, and some other high-level
300	   properties of the bitstream that can be used as the basis for session
301	   negotiation and content selection, etc.  One VPS may be referenced by
302	   one or more sequence parameter sets.

304	   Sequence parameter set

306	   The sequence parameter set (SPS) contains syntax elements pertaining
307	   to a coded layer video sequence (CLVS), which is a group of pictures
308	   belonging to the same layer, starting with a random access point, and
309	   followed by pictures that may depend on each other, until the next
310	   random access point picture.  In MPEG-2, the equivalent of a CVS was
311	   a group of pictures (GOP), which normally started with an I frame and
312	   was followed by P and B frames.  While more complex in its options of
313	   random access points, VVC retains this basic concept.  One remarkable
314	   difference of VVC is that a CLVS may start with a Gradual Decoding
315	   Refresh (GDR) picture, without requiring presence of traditional
316	   random access points in the bitstream, such as instantaneous decoding
317	   refresh (IDR) or clean random access (CRA) pictures.  In many TV-like
318	   applications, a CVS contains a few hundred milliseconds to a few
319	   seconds of video.  In video conferencing (without switching MCUs
320	   involved), a CVS can be as long in duration as the whole session.

322	   Picture and adaptation parameter set

324	   The picture parameter set and the adaptation parameter set (PPS and
325	   APS, respectively) carry information pertaining to zero or more
326	   pictures and zero or more slices, respectively.  The PPS contains
327	   information that is likely to stay constant from picture to picture,
328	   at least for pictures of a certain type, whereas the APS contains
329	   information, such as adaptive loop filter coefficients, that is
330	   likely to change from picture to picture or even within a picture.  A
331	   single APS is referenced by all slices of the same picture if that
332	   APS contains information about luma mapping with chroma scaling
333	   (LMCS) or scaling list.  Different APSs containing ALF parameters can
334	   be referenced by slices of the same picture.

336	   Picture header

338	   A Picture Header contains information that is common to all slices
339	   that belong to the same picture.  Being able to send that information
340	   as a separate NAL unit when pictures are split into several slices
341	   allows for saving bitrate, compared to repeating the same information
342	   in all slices.  However, there might be scenarios where low-bitrate
343	   video is transmitted using a single slice per picture.  Having a
344	   separate NAL unit to convey that information incurs an overhead
345	   for such scenarios.  In these cases, the picture header syntax
346	   structure is directly included in the slice header, instead of in its
347	   own NAL unit.  The mode of the picture header syntax structure being
348	   included in its own NAL unit or not can only be switched on/off for
349	   an entire CLVS, and can only be switched off when in the entire CLVS
350	   each picture contains only one slice.

352	   Profile, tier, and level

354	   The profile, tier and level syntax structures in DCI, VPS and SPS
355	   contain profile, tier, level information for all layers that refer to
356	   the DCI, for layers associated with one or more output layer sets
357	   specified by the VPS, and for any layer that refers to the SPS,
358	   respectively.

360	   Sub-profiles

362	   Within the VVC specification, a sub-profile is a 32-bit number, coded
363	   according to ITU-T Rec. T.35, that does not carry semantics.  It is
364	   carried in the profile_tier_level structure and hence (potentially)
365	   present in the DCI, VPS, and SPS.  External registration bodies can
366	   register a T.35 codepoint with ITU-T registration authorities and
367	   associate with their registration a description of bitstream
368	   restrictions beyond the profiles defined by ITU-T and ISO/IEC.  This
369	   would allow encoder manufacturers to label the bitstreams generated
370	   by their encoder as complying with such a sub-profile.  It is
371	   expected that upstream standardization organizations (such as DVB and
372	   ATSC) as well as walled-garden video services will take advantage of
373	   this labelling system.  In contrast to "normal" profiles, it is
374	   expected that sub-profiles may indicate encoder choices traditionally
375	   left open in the (decoder-centric) video coding specs, such as GOP
376	   structures, minimum/maximum QP values, and the mandatory use of
377	   certain tools or SEI messages.

379	   General constraint fields

381	   The profile_tier_level structure carries a considerable number of
382	   constraint fields (most of which are flags), which an encoder can use
383	   to indicate to a decoder that it will not use a certain tool or
384	   technology.  They were included in reaction to a perceived market
385	   need for labelling a bitstream as not exercising a certain tool that
386	   has become commercially unviable.

388	   Temporal scalability support

390	   VVC includes support of temporal scalability, by inclusion of the
391	   signaling of TemporalId in the NAL unit header, the restriction that
392	   pictures of a particular temporal sublayer cannot be used for inter
393	   prediction reference by pictures of a lower temporal sublayer, the
394	   sub-bitstream extraction process, and the requirement that each sub-
395	   bitstream extraction output be a conforming bitstream.  Media-Aware
396	   Network Elements (MANEs) can utilize the TemporalId in the NAL unit
397	   header for stream adaptation purposes based on temporal scalability.
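
   The stream adaptation that a MANE can perform on the basis of
   TemporalId can be sketched as below.  The sketch assumes the two-byte
   VVC NAL unit header layout (F, Z, and a 6-bit LayerId in the first
   byte; a 5-bit Type and a 3-bit TID in the second byte, where TID
   equals TemporalId plus 1).  The function names are hypothetical, and
   the filter only looks at TemporalId; it is not the complete sub-
   bitstream extraction process of the VVC specification.

```python
from typing import Iterable, List


def temporal_id(nal_unit: bytes) -> int:
    """Return TemporalId read from the assumed two-byte NAL unit header."""
    if len(nal_unit) < 2:
        raise ValueError("NAL unit shorter than its two-byte header")
    tid_plus1 = nal_unit[1] & 0x07  # three least significant bits of byte 1
    if tid_plus1 == 0:
        raise ValueError("TID field must not be zero")
    return tid_plus1 - 1


def drop_higher_sublayers(nal_units: Iterable[bytes], max_tid: int) -> List[bytes]:
    """Keep only NAL units whose TemporalId does not exceed max_tid."""
    return [nal for nal in nal_units if temporal_id(nal) <= max_tid]


if __name__ == "__main__":
    base = bytes([0x00, (1 << 3) | 1, 0xAA])  # Type 1, TemporalId 0
    high = bytes([0x00, (0 << 3) | 3, 0xBB])  # Type 0, TemporalId 2
    kept = drop_higher_sublayers([base, high], max_tid=1)
    print(len(kept))
```

   Dropping the highest sublayers in this way lowers the frame rate and
   bitrate while the remaining pictures stay decodable, because
   pictures of lower sublayers never reference higher sublayers.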

399	   Reference picture resampling (RPR)

401	   In AVC and HEVC, the spatial resolution of pictures cannot change
402	   unless a new sequence using a new SPS starts, with an IRAP picture.
403	   VVC enables picture resolution change within a sequence at a position
404	   without encoding an IRAP picture, which is always intra-coded.  This
405	   feature is sometimes referred to as reference picture resampling
406	   (RPR), as the feature needs resampling of a reference picture used
407	   for inter prediction when that reference picture has a different
408	   resolution than the current picture being decoded.  RPR allows
409	   resolution change without the need of coding an IRAP picture, which
410	   causes a momentary bit rate spike in streaming or video conferencing
411	   scenarios, e.g., to cope with network condition changes.  RPR can
412	   also be used in application scenarios wherein zooming of the entire
413	   video region or some region of interest is needed.

415	   Spatial, SNR, and multiview scalability

417	   VVC includes support for spatial, SNR, and multiview scalability.
418	   Scalable video coding is widely considered to have technical benefits
419	   and enrich services for various video applications.  Until recently,
420	   however, the functionality has not been included in the first version
421	   of specifications of the video codecs.  In VVC, however, all those
422	   forms of scalability are supported in the first version of VVC
423	   natively through the signaling of the layer_id in the NAL unit
424	   header, the VPS which associates layers with given layer_ids to each
425	   other, reference picture selection, reference picture resampling for
426	   spatial scalability, and a number of other mechanisms not relevant
427	   for this memo.

429	      Spatial scalability
430	         With the existence of Reference Picture Resampling (RPR), the
431	         additional burden for scalability support is just a
432	         modification of the high-level syntax (HLS).  The inter-layer
433	         prediction is employed in a scalable system to improve the
434	         coding efficiency of the enhancement layers.  In addition to
435	         the spatial and temporal motion-compensated predictions that
436	         are available in a single-layer codec, the inter-layer
437	         prediction in VVC uses the possibly resampled video data of the
438	         reconstructed reference picture from a reference layer to
439	         predict the current enhancement layer.  The resampling process
440	         for inter-layer prediction, when used, is performed at the
441	         block-level, reusing the existing interpolation process for
442	         motion compensation in single-layer coding.  It means that no
443	         additional resampling process is needed to support spatial
444	         scalability.

446	      SNR scalability

448	         SNR scalability is similar to spatial scalability except that
449	         the resampling factors are 1:1.  In other words, there is no
450	         change in resolution, but there is inter-layer prediction.

452	      Multiview scalability

454	         The first version of VVC also supports multiview scalability,
455	         wherein a multi-layer bitstream carries layers representing
456	         multiple views, and one or more of the represented views can be
457	         output at the same time.

459	   SEI messages

461	   Supplemental enhancement information (SEI) messages are information
462	   in the bitstream that do not influence the decoding process as
463	   specified in the VVC spec, but address issues of representation/
464	   rendering of the decoded bitstream, label the bitstream for certain
465	   applications, among other, similar tasks.  The overall concept of SEI
466	   messages and many of the messages themselves have been inherited from
467	   the H.264 and HEVC specs.  Except for the SEI messages that affect
468	   the specification of the hypothetical reference decoder (HRD), other
469	   SEI messages for use in the VVC environment, which are generally
470	   useful also in other video coding technologies, are not included in
471	   the main VVC specification but in a companion specification [VSEI].

473	1.1.3.  High-Level Picture Partitioning (informative)

475	   VVC inherited the concept of tiles and wavefront parallel processing
476	   (WPP) from HEVC, with some minor to moderate differences.  The basic
477	   concept of slices was kept in VVC but designed in an essentially
478	   different form.  VVC is the first video coding standard that includes
479	   subpictures as a feature, which provide the same functionality as
480	   HEVC motion-constrained tile sets (MCTSs) but are designed
481	   differently to have better coding efficiency and to be friendlier for
482	   usage in application systems.  More details of these differences are
483	   described below.

485	   Tiles and WPP

487	   As in HEVC, a picture can be split into tile rows and tile
488	   columns in VVC, in-picture prediction across tile boundaries is
489	   disallowed, etc.  However, the syntax for signaling of tile
490	   partitioning has been simplified, by using a unified syntax design
491	   for both the uniform and the non-uniform mode.  In addition,
492	   signaling of entry point offsets for tiles in the slice header is
493	   optional in VVC while it is mandatory in HEVC.  The WPP design in VVC
494	   has two differences compared to HEVC: i) The CTU row delay is reduced
495	   from two CTUs to one CTU; ii) Signaling of entry point offsets for
496	   WPP in the slice header is optional in VVC while it is mandatory in
497	   HEVC.
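
   The effect of the reduced CTU row delay on parallel decoding can be
   estimated with a small scheduling model.  The model below is a sketch
   under simplifying assumptions that are not part of the VVC
   specification: every CTU takes one time step, one thread is available
   per CTU row, and a CTU may start once its left neighbor is finished
   and "row_delay" CTUs of the row above are finished (2 for HEVC, 1
   for VVC).  The function names are hypothetical.

```python
from typing import List


def wavefront_start_times(ctu_rows: int, ctu_cols: int,
                          row_delay: int) -> List[List[int]]:
    """Earliest start step of each CTU under the assumed WPP model."""
    start = [[0] * ctu_cols for _ in range(ctu_rows)]
    for r in range(ctu_rows):
        for c in range(ctu_cols):
            t = 0
            if c > 0:
                t = start[r][c - 1] + 1  # left neighbor must be finished
            if r > 0:
                # The CTU above, shifted right by (row_delay - 1) and
                # clamped to the picture width, must be finished.
                dep = min(c + row_delay - 1, ctu_cols - 1)
                t = max(t, start[r - 1][dep] + 1)
            start[r][c] = t
    return start


def total_steps(ctu_rows: int, ctu_cols: int, row_delay: int) -> int:
    """Number of time steps until the last CTU of the picture is done."""
    return wavefront_start_times(ctu_rows, ctu_cols, row_delay)[-1][-1] + 1


if __name__ == "__main__":
    print("two-CTU delay:", total_steps(4, 8, row_delay=2))
    print("one-CTU delay:", total_steps(4, 8, row_delay=1))
```

   In this model, each additional CTU row costs one time step of ramp-up
   with a one-CTU delay instead of two, so the benefit grows with the
   number of CTU rows processed in parallel.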

499	   Slices

501	   In VVC, the conventional slices based on CTUs (as in HEVC) or
502	   macroblocks (as in AVC) have been removed.  The main reasoning behind
503	   this architectural change is as follows.  The advances in video
504	   coding since 2003 (the publication year of AVC v1) have been such
505	   that slice-based error concealment has become practically impossible,
506	   due to the ever-increasing number and efficiency of in-picture and
507	   inter-picture prediction mechanisms.  An error-concealed picture is
508	   the decoding result of a transmitted coded picture for which some
509	   data of the coded picture was lost (e.g., loss of some slices) or a
510	   reference picture for at least some part of the coded picture is not
511	   error-free (e.g., that reference picture was an error-concealed
512	   picture).  For example, when one of the multiple slices of a picture
513	   is lost, it may be error-concealed using an interpolation of the
514	   neighboring slices.  While advanced video coding prediction
515	   mechanisms provide significantly higher coding efficiency, they also
516	   make it harder for machines to estimate the quality of an error-
517	   concealed picture, which was already a hard problem with the use of
518	   simpler prediction mechanisms.  Advanced in-picture prediction
519	   mechanisms also cause the coding efficiency loss due to splitting a
520	   picture into multiple slices to be more significant.  Furthermore,
521	   network conditions have improved significantly, while at the same
522	   time techniques for dealing with packet losses have also improved
523	   markedly.  As a result, very few implementations have recently used
524	   slices for maximum transmission unit size matching.  Instead,
525	   substantially all applications where low-delay error resilience is
526	   required (e.g., video telephony and video conferencing) rely on
527	   system/transport-level error resilience (e.g., retransmission,
528	   forward error correction) and/or picture-based error resilience tools
529	   (feedback-based error resilience, insertion of IRAPs, scalability
530	   with higher protection level of the base layer, and so on).
531	   Considering all the above, nowadays it is very rare that a picture
532	   that cannot be correctly decoded is passed to the decoder, and when
533	   such a rare case occurs, the system can afford to wait for an error-
534	   free picture to be decoded and available for display without
535	   resulting in frequent and long periods of picture freezing seen by
536	   end users.

538	   Slices in VVC have two modes: rectangular slices and raster-scan
539	   slices.  The rectangular slice, as indicated by its name, covers a
540	   rectangular region of the picture.  Typically, a rectangular slice
541	   consists of several complete tiles.  However, it is also possible
542	   that a rectangular slice is a subset of a tile and consists of one or
543	   more consecutive, complete CTU rows within a tile.  A raster-scan
544	   slice consists of one or more complete tiles in a tile raster scan
545	   order; hence, the region covered by a raster-scan slice can have a
546	   non-rectangular shape, although it may also happen to have the
547	   shape of a rectangle.  The concept of slices in VVC is therefore
548	   strongly linked to or based on tiles instead of CTUs (as in HEVC) or
549	   macroblocks (as in AVC).

551	   Subpictures

553	   VVC is the first video coding standard that includes the support of
554	   subpictures as a feature.  Each subpicture consists of one or more
555	   complete rectangular slices that collectively cover a rectangular
556	   region of the picture.  A subpicture may be either specified to be
557	   extractable (i.e., coded independently of other subpictures of the
558	   same picture and of earlier pictures in decoding order) or not
559	   extractable.  Regardless of whether a subpicture is extractable or
560	   not, the encoder can control whether in-loop filtering (including
561	   deblocking, SAO, and ALF) is applied across the subpicture boundaries
562	   individually for each subpicture.

564	   Functionally, subpictures are similar to the motion-constrained tile
565	   sets (MCTSs) in HEVC.  They both allow independent coding and
566	   extraction of a rectangular subset of a sequence of coded pictures,
567	   for use cases like viewport-dependent 360-degree video streaming
568	   optimization and region of interest (ROI) applications.

570	   There are several important design differences between subpictures
571	   and MCTSs.  First, the subpictures feature in VVC allows motion
572	   vectors of a coding block to point outside of the subpicture, even
573	   when the subpicture is extractable, by applying sample padding at
574	   subpicture boundaries in this case, as is done at picture
575	   boundaries.  Second, additional changes were introduced for the
576	   selection and derivation of motion vectors in the merge mode and in
577	   the decoder side motion vector refinement process of VVC.  This
578	   allows higher coding efficiency compared to the non-normative motion
579	   constraints applied at the encoder side for MCTSs.  Third, rewriting
580	   of SHs (and PH NAL units, when present) is not needed when extracting
581	   one or more extractable subpictures from a sequence of pictures to
582	   create a sub-bitstream that is a conforming bitstream.  In sub-
583	   bitstream extractions based on HEVC MCTSs, rewriting of SHs is
584	   needed.  Note that in both HEVC MCTSs extraction and VVC subpictures
585	   extraction, rewriting of SPSs and PPSs is needed.  However, typically
586	   there are only a few parameter sets in a bitstream, while each
587	   picture has at least one slice, therefore rewriting of SHs can be a
588	   significant burden for application systems.  Fourth, slices of
589	   different subpictures within a picture are allowed to have different
590	   NAL unit types.  Fifth, VVC specifies HRD and level definitions for
591	   subpicture sequences, thus the conformance of the sub-bitstream of
592	   each extractable subpicture sequence can be ensured by encoders.

594	1.1.4.  NAL Unit Header

596	   VVC maintains the NAL unit concept of HEVC with modifications.  VVC
597	   uses a two-byte NAL unit header, as shown in Figure 1.  The payload
598	   of a NAL unit refers to the NAL unit excluding the NAL unit header.

600	                     +---------------+---------------+
601	                     |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
602	                     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
603	                     |F|Z| LayerId   |  Type   | TID |
604	                     +---------------+---------------+

606	                   The Structure of the VVC NAL Unit Header.

608	                                  Figure 1

610	   The semantics of the fields in the NAL unit header are as specified
611	   in VVC and described briefly below for convenience.  In addition to
612	   the name and size of each field, the corresponding syntax element
613	   name in VVC is also provided.

615	   F: 1 bit
616	      forbidden_zero_bit.  Required to be zero in VVC.  Note that the
617	      inclusion of this bit in the NAL unit header was to enable
618	      transport of VVC video over MPEG-2 transport systems (avoidance of
619	      start code emulations) [MPEG2S].  In the context of this memo the
620	      value 1 may be used to indicate a syntax violation, e.g., for a
621	      NAL unit resulting from aggregating a number of fragmented units
622	      of a NAL unit but missing the last fragment, as described in
623	      Section TBD.

625	   Z: 1 bit

627	      nuh_reserved_zero_bit.  Required to be zero in VVC, and reserved
628	      for future extensions by ITU-T and ISO/IEC.

630	      This memo does not overload the "Z" bit for local extensions, as
631	      a) overloading the "F" bit is sufficient and b) to preserve the
632	      usefulness of this memo to possible future versions of [VVC].

634	   LayerId: 6 bits

636	      nuh_layer_id.  Identifies the layer a NAL unit belongs to, wherein
637	      a layer may be, e.g., a spatial scalable layer, a quality scalable
638	      layer, a layer containing a different view, etc.

640	   Type: 5 bits

642	      nal_unit_type.  This field specifies the NAL unit type as defined
643	      in Table 5 of [VVC].  For a reference of all currently defined NAL
644	      unit types and their semantics, please refer to Section 7.4.2.2 in
645	      [VVC].

647	   TID: 3 bits

649	      nuh_temporal_id_plus1.  This field specifies the temporal
650	      identifier of the NAL unit plus 1.  The value of TemporalId is
651	      equal to TID minus 1.  A TID value of 0 is illegal to ensure that
652	      there is at least one bit in the NAL unit header equal to 1, so
653	      as to enable independent consideration of start code emulations in
654	      the NAL unit header and in the NAL unit payload data.
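
	   As an illustration of the field layout in Figure 1, the following
	   is a minimal, non-normative sketch of unpacking and repacking the
	   two-byte NAL unit header.  The names NalUnitHeader,
	   parse_nal_unit_header, and pack_nal_unit_header are hypothetical
	   and are not defined by [VVC] or by this memo.

```python
from typing import NamedTuple


class NalUnitHeader(NamedTuple):
    """Fields of the two-byte VVC NAL unit header (names are illustrative)."""
    f: int         # forbidden_zero_bit, 1 bit
    z: int         # nuh_reserved_zero_bit, 1 bit
    layer_id: int  # nuh_layer_id, 6 bits
    nal_type: int  # nal_unit_type, 5 bits
    tid: int       # nuh_temporal_id_plus1, 3 bits

    @property
    def temporal_id(self) -> int:
        # TemporalId is equal to TID minus 1.
        return self.tid - 1


def parse_nal_unit_header(data: bytes) -> NalUnitHeader:
    """Split the first two bytes of a NAL unit into its header fields."""
    if len(data) < 2:
        raise ValueError("a NAL unit header is two bytes long")
    b0, b1 = data[0], data[1]
    hdr = NalUnitHeader(
        f=b0 >> 7,
        z=(b0 >> 6) & 0x01,
        layer_id=b0 & 0x3F,
        nal_type=b1 >> 3,
        tid=b1 & 0x07,
    )
    if hdr.tid == 0:
        # A TID value of 0 is illegal.
        raise ValueError("TID (nuh_temporal_id_plus1) must not be 0")
    return hdr


def pack_nal_unit_header(hdr: NalUnitHeader) -> bytes:
    """Inverse of parse_nal_unit_header."""
    b0 = (hdr.f << 7) | (hdr.z << 6) | hdr.layer_id
    b1 = (hdr.nal_type << 3) | hdr.tid
    return bytes([b0, b1])
```

	   The sketch rejects a TID of 0 but deliberately does not reject
	   F equal to 1, since this memo may use that value to flag a syntax
	   violation.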

656	1.2.  Overview of the Payload Format

658	   This payload format defines the following processes required for
659	   transport of VVC coded data over RTP [RFC3550]:

661	   *  Usage of RTP header with this payload format
662	   *  Packetization of VVC coded NAL units into RTP packets using three
663	      types of payload structures: a single NAL unit packet, aggregation
664	      packet, and fragmentation unit

666	   *  Transmission of VVC NAL units of the same bitstream within a
667	      single RTP stream

669	   *  Media type parameters to be used with the Session Description
670	      Protocol (SDP) [RFC4566]

672	   *  Usage of RTCP feedback messages

674	2.  Conventions

676	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
677	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
678	   "OPTIONAL" in this document are to be interpreted as described in BCP
679	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
680	   capitals, as shown above.

682	3.  Definitions and Abbreviations

684	3.1.  Definitions

686	   This document uses the terms and definitions of VVC.  Section 3.1.1
687	   lists relevant definitions from [VVC] for convenience.  Section 3.1.2
688	   provides definitions specific to this memo.

690	3.1.1.  Definitions from the VVC Specification

692	   Access unit (AU): A set of PUs that belong to different layers and
693	   contain coded pictures associated with the same time for output from
694	   the DPB.

696	   Adaptation parameter set (APS): A syntax structure containing syntax
697	   elements that apply to zero or more slices as determined by zero or
698	   more syntax elements found in slice headers.

700	   Bitstream: A sequence of bits, in the form of a NAL unit stream or a
701	   byte stream, that forms the representation of a sequence of AUs
702	   forming one or more coded video sequences (CVSs).

704	   Coded picture: A coded representation of a picture comprising VCL NAL
705	   units with a particular value of nuh_layer_id within an AU and
706	   containing all CTUs of the picture.

708	   Clean random access (CRA) PU: A PU in which the coded picture is a
709	   CRA picture.

711	   Clean random access (CRA) picture: An IRAP picture for which each VCL
712	   NAL unit has nal_unit_type equal to CRA_NUT.

714	   Coded video sequence (CVS): A sequence of AUs that consists, in
715	   decoding order, of a CVSS AU, followed by zero or more AUs that are
716	   not CVSS AUs, including all subsequent AUs up to but not including
717	   any subsequent AU that is a CVSS AU.

719	   Coded video sequence start (CVSS) AU: An AU in which there is a PU
720	   for each layer in the CVS and the coded picture in each PU is a CLVSS
721	   picture.

723	   Coded layer video sequence (CLVS): A sequence of PUs with the same
724	   value of nuh_layer_id that consists, in decoding order, of a CLVSS
725	   PU, followed by zero or more PUs that are not CLVSS PUs, including
726	   all subsequent PUs up to but not including any subsequent PU that is
727	   a CLVSS PU.

729	   Coded layer video sequence start (CLVSS) PU: A PU in which the coded
730	   picture is a CLVSS picture.

732	   Coded layer video sequence start (CLVSS) picture: A coded picture
733	   that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or
734	   a GDR picture with NoOutputBeforeRecoveryFlag equal to 1.

736	   Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs
737	   of chroma samples of a picture that has three sample arrays, or a CTB
738	   of samples of a monochrome picture or a picture that is coded using
739	   three separate colour planes and syntax structures used to code the
740	   samples.

742	   Decoding Capability Information (DCI): A syntax structure containing
743	   syntax elements that apply to the entire bitstream.

745	   Decoded picture buffer (DPB): A buffer holding decoded pictures for
746	   reference, output reordering, or output delay specified for the
747	   hypothetical reference decoder.

749	   Gradual decoding refresh (GDR) picture: A picture for which each VCL
750	   NAL unit has nal_unit_type equal to GDR_NUT.

752	   Instantaneous decoding refresh (IDR) PU: A PU in which the coded
753	   picture is an IDR picture.

755	   Instantaneous decoding refresh (IDR) picture: An IRAP picture for
756	   which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or
757	   IDR_N_LP.

759	   Intra random access point (IRAP) AU: An AU in which there is a PU for
760	   each layer in the CVS and the coded picture in each PU is an IRAP
761	   picture.

763	   Intra random access point (IRAP) PU: A PU in which the coded picture
764	   is an IRAP picture.

766	   Intra random access point (IRAP) picture: A coded picture for which
767	   all VCL NAL units have the same value of nal_unit_type in the range
768	   of IDR_W_RADL to CRA_NUT, inclusive.

770	   Layer: A set of VCL NAL units that all have a particular value of
771	   nuh_layer_id and the associated non-VCL NAL units.

773	   Network abstraction layer (NAL) unit: A syntax structure containing
774	   an indication of the type of data to follow and bytes containing that
775	   data in the form of an RBSP interspersed as necessary with emulation
776	   prevention bytes.

778	   Network abstraction layer (NAL) unit stream: A sequence of NAL units.

780	   Operation point (OP): A temporal subset of an OLS, identified by an
781	   OLS index and a highest value of TemporalId.

783	   Picture parameter set (PPS): A syntax structure containing syntax
784	   elements that apply to zero or more entire coded pictures as
785	   determined by a syntax element found in each slice header.

787	   Picture unit (PU): A set of NAL units that are associated with each
788	   other according to a specified classification rule, are consecutive
789	   in decoding order, and contain exactly one coded picture.

791	   Random access: The act of starting the decoding process for a
792	   bitstream at a point other than the beginning of the stream.

794	   Sequence parameter set (SPS): A syntax structure containing syntax
795	   elements that apply to zero or more entire CLVSs as determined by the
796	   content of a syntax element found in the PPS referred to by a syntax
797	   element found in each picture header.

799	   Slice: An integer number of complete tiles or an integer number of
800	   consecutive complete CTU rows within a tile of a picture that are
801	   exclusively contained in a single NAL unit.

803	   Slice header (SH): A part of a coded slice containing the data
804	   elements pertaining to all tiles or CTU rows within a tile
805	   represented in the slice.

807	   Sublayer: A temporal scalable layer of a temporal scalable bitstream
808	   consisting of VCL NAL units with a particular value of the TemporalId
809	   variable, and the associated non-VCL NAL units.

811	   Subpicture: A rectangular region of one or more slices within a
812	   picture.

814	   Sublayer representation: A subset of the bitstream consisting of NAL
815	   units of a particular sublayer and the lower sublayers.

817	   Tile: A rectangular region of CTUs within a particular tile column
818	   and a particular tile row in a picture.

820	   Tile column: A rectangular region of CTUs having a height equal to
821	   the height of the picture and a width specified by syntax elements in
822	   the picture parameter set.

824	   Tile row: A rectangular region of CTUs having a height specified by
825	   syntax elements in the picture parameter set and a width equal to the
826	   width of the picture.

828	   Video coding layer (VCL) NAL unit: A collective term for coded slice
829	   NAL units and the subset of NAL units that have reserved values of
830	   nal_unit_type that are classified as VCL NAL units in this
831	   Specification.

833	3.1.2.  Definitions Specific to This Memo

835	   Media-Aware Network Element (MANE): A network element, such as a
836	   middlebox, selective forwarding unit, or application-layer gateway
837	   that is capable of parsing certain aspects of the RTP payload headers
838	   or the RTP payload and reacting to their contents.

840	      Informative note: The concept of a MANE goes beyond normal routers
841	      or gateways in that a MANE has to be aware of the signaling (e.g.,
842	      to learn about the payload type mappings of the media streams),
843	      and in that it has to be trusted when working with Secure RTP
844	      (SRTP).  The advantage of using MANEs is that they allow packets
845	      to be dropped according to the needs of the media coding.  For
846	      example, if a MANE has to drop packets due to congestion on a
847	      certain link, it can identify and remove those packets whose
848	      elimination produces the least adverse effect on the user
849	      experience.  After dropping packets, MANEs must rewrite RTCP
850	      packets to match the changes to the RTP stream, as specified in
851	      Section 7 of [RFC3550].

853	   NAL unit decoding order: A NAL unit order that conforms to the
854	   constraints on NAL unit order given in Section 7.4.2.4 in [VVC],
855	   follow the Order of NAL units in the bitstream.

857	   RTP stream (See [RFC7656]): Within the scope of this memo, one RTP
858	   stream is utilized to transport a VVC bitstream, which may contain
859	   one or more layers, and each layer may contain one or more temporal
860	   sublayers.

862	   Transmission order: The order of packets in ascending RTP sequence
863	   number order (in modulo arithmetic).  Within an aggregation packet,
864	   the NAL unit transmission order is the same as the order of
865	   appearance of NAL units in the packet.
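
	   The phrase "in modulo arithmetic" above can be illustrated with a
	   small, non-normative sketch.  It assumes the common serial-number
	   comparison for 16-bit RTP sequence numbers, in which one number
	   precedes another if the forward distance between them is less
	   than half the number space.  The function names are hypothetical.

```python
from functools import cmp_to_key


def seq_precedes(a: int, b: int) -> bool:
    """True if 16-bit sequence number a comes before b, modulo 65536."""
    return a != b and ((b - a) & 0xFFFF) < 0x8000


def _compare(a: int, b: int) -> int:
    if seq_precedes(a, b):
        return -1
    if seq_precedes(b, a):
        return 1
    return 0


def in_transmission_order(seqs):
    """Sort sequence numbers into transmission order.

    Only meaningful when all values lie within a window of fewer than
    32768 consecutive sequence numbers (an assumption of this sketch).
    """
    return sorted(seqs, key=cmp_to_key(_compare))
```

	   For example, in_transmission_order([1, 65534, 0, 65535]) yields
	   [65534, 65535, 0, 1], since the sequence number wraps from 65535
	   to 0.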

867	3.2.  Abbreviations

869	   AU         Access Unit

871	   AP         Aggregation Packet

873	   APS        Adaptation Parameter Set

875	   CTU        Coding Tree Unit

877	   CVS        Coded Video Sequence

879	   DPB        Decoded Picture Buffer

881	   DCI        Decoding Capability Information

883	   DON        Decoding Order Number

885	   FIR        Full Intra Request

887	   FU         Fragmentation Unit

889	   GDR        Gradual Decoding Refresh

891	   HRD        Hypothetical Reference Decoder

893	   IDR        Instantaneous Decoding Refresh

895	   MANE       Media-Aware Network Element

897	   MTU        Maximum Transmission Unit

899	   NAL        Network Abstraction Layer
900	   NALU       Network Abstraction Layer Unit

902	   PLI        Picture Loss Indication

904	   PPS        Picture Parameter Set

906	   RPS        Reference Picture Set

908	   RPSI       Reference Picture Selection Indication

910	   SEI        Supplemental Enhancement Information

912	   SLI        Slice Loss Indication

914	   SPS        Sequence Parameter Set

916	   VCL        Video Coding Layer

918	   VPS        Video Parameter Set

920	4.  RTP Payload Format

922	4.1.  RTP Header Usage

924	   The format of the RTP header is specified in [RFC3550] (reprinted as
925	   Figure 2 for convenience).  This payload format uses the fields of
926	   the header in a manner consistent with that specification.

928	   The RTP payload (and the settings for some RTP header bits) for
929	   aggregation packets and fragmentation units are specified in
930	   Section 4.3.2 and Section 4.3.3, respectively.

932	       0                   1                   2                   3
933	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
934	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
935	      |V=2|P|X|  CC   |M|     PT      |       sequence number         |
936	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
937	      |                           timestamp                           |
938	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
939	      |           synchronization source (SSRC) identifier            |
940	      +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
941	      |            contributing source (CSRC) identifiers             |
942	      |                             ....                              |
943	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

945	                         RTP Header According to [RFC3550]

947	                                  Figure 2

949	   For this RTP payload format, the RTP header fields are set as
950	   follows:

952	   Marker bit (M): 1 bit

954	      Set for the last packet, in transmission order, among each set of
955	      packets that contain NAL units of one access unit.  This is in
956	      line with the normal use of the M bit in video formats to allow an
957	      efficient playout buffer handling.

959	   Payload Type (PT): 7 bits

961	      The assignment of an RTP payload type for this new packet format
962	      is outside the scope of this document and will not be specified
963	      here.  The assignment of a payload type has to be performed either
964	      through the profile used or in a dynamic way.

966	   Sequence Number (SN): 16 bits

968	      Set and used in accordance with [RFC3550].

970	   Timestamp: 32 bits

972	      The RTP timestamp is set to the sampling timestamp of the content.
973	      A 90 kHz clock rate MUST be used.  If the NAL unit has no timing
974	      properties of its own (e.g., parameter set and SEI NAL units), the
975	      RTP timestamp MUST be set to the RTP timestamp of the coded
976	      pictures of the access unit in which the NAL unit (according to
977	      Section 7.4.2.4 of [VVC]) is included.  Receivers MUST use the RTP
978	      timestamp for the display process, even when the bitstream
979	      contains picture timing SEI messages or decoding unit information
980	      SEI messages as specified in [VVC].

982	   Synchronization source (SSRC): 32 bits

984	      Used to identify the source of the RTP packets.  A single SSRC is
985	      used for all parts of a single bitstream.
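
	   The header usage above can be sketched as follows.  This is a
	   non-normative illustration that reads only the 12-byte fixed part
	   of the RTP header shown in Figure 2 (CSRC identifiers and header
	   extensions are not parsed) and converts a sampling time to the
	   required 90 kHz clock.  The names RtpHeader,
	   parse_rtp_fixed_header, and rtp_timestamp_for, and the offset
	   parameter, are hypothetical.

```python
import struct
from typing import NamedTuple

VVC_RTP_CLOCK_HZ = 90_000  # a 90 kHz clock rate MUST be used


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool          # set on the last packet of an access unit
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int


def parse_rtp_fixed_header(packet: bytes) -> RtpHeader:
    """Parse the fixed 12-byte RTP header in network byte order."""
    if len(packet) < 12:
        raise ValueError("RTP fixed header is 12 bytes long")
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    version = b0 >> 6
    if version != 2:
        raise ValueError("unsupported RTP version")
    return RtpHeader(
        version=version,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


def rtp_timestamp_for(sampling_time_s: float, offset: int = 0) -> int:
    """Map a sampling time in seconds to a 32-bit, 90 kHz RTP timestamp.

    `offset` stands in for the sender's initial timestamp value.
    """
    ticks = round(sampling_time_s * VVC_RTP_CLOCK_HZ)
    return (offset + ticks) & 0xFFFFFFFF
```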

987	4.2.  Payload Header Usage

989	   The first two bytes of the payload of an RTP packet are referred to
990	   as the payload header.  The payload header consists of the same
991	   fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown
992	   in Section 1.1.4, irrespective of the type of the payload structure.

994	   The TID value indicates (among other things) the relative importance
995	   of an RTP packet, for example, because NAL units belonging to higher
996	   temporal sublayers are not used for the decoding of lower temporal
997	   sublayers.  A lower value of TID indicates a higher importance.
998	   More-important NAL units MAY be better protected against transmission
999	   losses than less-important NAL units.

1001	      For Discussion: quite possibly something similar can be said for
1002	      the Layer_id in layered coding, but perhaps not in multiview
1003	      coding.  (The relevant part of the spec is relatively new,
1004	      therefore the soft language).  However, for serious layer pruning,
1005	      interpretation of the VPS is required.  We can add language about
1006	      the need for stateful interpretation of LayerID vis-a-vis
1007	      stateless interpretation of TID later.

1009	4.3.  Payload Structures

1011	   Three different types of RTP packet payload structures are specified.
1012	   A receiver can identify the type of an RTP packet payload through the
1013	   Type field in the payload header.

1015	   The three different payload structures are as follows:

1017	   *  Single NAL unit packet: Contains a single NAL unit in the payload,
1018	      and the NAL unit header of the NAL unit also serves as the payload
1019	      header.  This payload structure is specified in Section 4.3.1.

1021	   *  Aggregation Packet (AP): Contains more than one NAL unit within
1022	      one access unit.  This payload structure is specified in
1023	      Section 4.3.2.

1025	   *  Fragmentation Unit (FU): Contains a subset of a single NAL unit.
1026	      This payload structure is specified in Section 4.3.3.
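
	   A receiver's dispatch on the Type field might be sketched as
	   below (non-normative).  The value 28 for aggregation packets is
	   taken from Figure 4.  The value 29 for fragmentation units is an
	   assumption of this sketch, since the FU payload header is
	   specified in Section 4.3.3 and not in the text above.  The
	   function name payload_structure is hypothetical.

```python
AP_TYPE = 28  # aggregation packet, per Figure 4
FU_TYPE = 29  # fragmentation unit; assumed here, see Section 4.3.3


def payload_structure(payload: bytes) -> str:
    """Classify an RTP payload by the Type field of its payload header."""
    if len(payload) < 2:
        raise ValueError("payload header is two bytes long")
    # Type occupies the five most significant bits of the second byte.
    nal_type = payload[1] >> 3
    if nal_type == AP_TYPE:
        return "aggregation packet"
    if nal_type == FU_TYPE:
        return "fragmentation unit"
    return "single NAL unit packet"
```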

1028	4.3.1.  Single NAL Unit Packets

1030	   A single NAL unit packet contains exactly one NAL unit, and consists
1031	   of a payload header (denoted as PayloadHdr), a conditional 16-bit
1032	   DONL field (in network byte order), and the NAL unit payload data
1033	   (the NAL unit excluding its NAL unit header) of the contained NAL
1034	   unit, as shown in Figure 3.

1036	      0                   1                   2                   3
1037	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1038	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1039	     |           PayloadHdr          |      DONL (conditional)       |
1040	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1041	     |                                                               |
1042	     |                  NAL unit payload data                        |
1043	     |                                                               |
1044	     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1045	     |                               :...OPTIONAL RTP padding        |
1046	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1048	                  The Structure of a Single NAL Unit Packet

1050	                                  Figure 3

1052	   The DONL field, when present, specifies the value of the 16 least
1053	   significant bits of the decoding order number of the contained NAL
1054	   unit.  If sprop-max-don-diff is greater than 0, the DONL field MUST
1055	   be present, and the variable DON for the contained NAL unit is
1056	   derived as equal to the value of the DONL field.  Otherwise (sprop-
1057	   max-don-diff is equal to 0), the DONL field MUST NOT be present.
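
	   The following non-normative sketch shows how a receiver might
	   recover the NAL unit and its DON from a single NAL unit packet.
	   It assumes that the RTP header and any RTP padding have already
	   been removed and that the value of sprop-max-don-diff is known
	   from signaling.  The function name is hypothetical.

```python
def parse_single_nal_unit_packet(payload: bytes, sprop_max_don_diff: int):
    """Return (nal_unit, don) for a single NAL unit packet.

    `don` is None when the DONL field is not present.
    """
    if len(payload) < 2:
        raise ValueError("payload header is two bytes long")
    if sprop_max_don_diff > 0:
        # DONL is present: 16 bits in network byte order after PayloadHdr.
        if len(payload) < 4:
            raise ValueError("DONL field is missing")
        don = int.from_bytes(payload[2:4], "big")
        # PayloadHdr doubles as the NAL unit header, so the NAL unit is
        # the header followed by the payload data after DONL.
        nal_unit = payload[:2] + payload[4:]
    else:
        don = None
        nal_unit = payload
    return nal_unit, don
```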

1059	4.3.2.  Aggregation Packets (APs)

1061	   Aggregation Packets (APs) can reduce packetization overhead for small
1062	   NAL units, such as most of the non-VCL NAL units, which are often
1063	   only a few octets in size.

1065	   An AP aggregates NAL units of one access unit.  Each NAL unit to be
1066	   carried in an AP is encapsulated in an aggregation unit.  NAL units
1067	   aggregated in one AP are included in NAL unit decoding order.

1069	   An AP consists of a payload header (denoted as PayloadHdr) followed
1070	   by two or more aggregation units, as shown in Figure 4.

1072	     0                   1                   2                   3
1073	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1074	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1075	    |    PayloadHdr (Type=28)       |                               |
1076	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
1077	    |                                                               |
1078	    |             two or more aggregation units                     |
1079	    |                                                               |
1080	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1081	    |                               :...OPTIONAL RTP padding        |
1082	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1084	                   The Structure of an Aggregation Packet

1086	                                  Figure 4

1088	   The fields in the payload header of an AP are set as follows.  The F
1089	   bit MUST be equal to 0 if the F bit of each aggregated NAL unit is
1090	   equal to zero; otherwise, it MUST be equal to 1.  The Type field MUST
1091	   be equal to 28.

1093	   The value of LayerId MUST be equal to the lowest value of LayerId of
1094	   all the aggregated NAL units.  The value of TID MUST be the lowest
1095	   value of TID of all the aggregated NAL units.

1097	      Informative note: All VCL NAL units in an AP have the same TID
1098	      value since they belong to the same access unit.  However, an AP
1099	      may contain non-VCL NAL units for which the TID value in the NAL
1100	      unit header may be different than the TID value of the VCL NAL
1101	      units in the same AP.
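
	   The rules above for the AP payload header can be sketched as
	   follows (non-normative).  The sketch assumes each input is a
	   complete NAL unit including its two-byte header, and it leaves
	   the Z bit at zero.  The function name build_ap_payload_header is
	   hypothetical.

```python
from typing import Sequence

AP_TYPE = 28


def build_ap_payload_header(nal_units: Sequence[bytes]) -> bytes:
    """Derive the two-byte PayloadHdr of an AP from its NAL units."""
    if len(nal_units) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    f = 0
    layer_id = 0x3F  # largest 6-bit value, lowered below
    tid = 0x07       # largest 3-bit value, lowered below
    for nalu in nal_units:
        if len(nalu) < 2:
            raise ValueError("NAL unit shorter than its header")
        b0, b1 = nalu[0], nalu[1]
        f |= b0 >> 7                          # 1 if any aggregated F bit is 1
        layer_id = min(layer_id, b0 & 0x3F)   # lowest LayerId
        tid = min(tid, b1 & 0x07)             # lowest TID
    return bytes([(f << 7) | layer_id, (AP_TYPE << 3) | tid])
```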

1103	   An AP MUST carry at least two aggregation units and can carry as many
1104	   aggregation units as necessary; however, the total amount of data in
1105	   an AP MUST fit into an IP packet, and the size SHOULD be
1106	   chosen so that the resulting IP packet is smaller than the MTU size
1107	   so as to avoid IP layer fragmentation.  An AP MUST NOT contain FUs
1108	   specified in Section 4.3.3.  APs MUST NOT be nested; i.e., an AP
1109	   cannot contain another AP.

1111	   The first aggregation unit in an AP consists of a conditional 16-bit
1112	   DONL field (in network byte order) followed by a 16-bit unsigned size
1113	   information (in network byte order) that indicates the size of the
1114	   NAL unit in bytes (excluding these two octets, but including the NAL
1115	   unit header), followed by the NAL unit itself, including its NAL unit
1116	   header, as shown in Figure 5.

1118	     0                   1                   2                   3
1119	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1120	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1121	    |               :       DONL (conditional)      |   NALU size   |
1122	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1123	    |   NALU size   |                                               |
1124	    +-+-+-+-+-+-+-+-+         NAL unit                              |
1125	    |                                                               |
1126	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1127	    |                               :
1128	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1130	           The Structure of the First Aggregation Unit in an AP

1132	                                  Figure 5

1134	   The DONL field, when present, specifies the value of the 16 least
1135	   significant bits of the decoding order number of the aggregated NAL
1136	   unit.

1138	   If sprop-max-don-diff is greater than 0, the DONL field MUST be
1139	   present in an aggregation unit that is the first aggregation unit in
1140	   an AP, and the variable DON for the aggregated NAL unit is derived as
1141	   equal to the value of the DONL field.  The variable DON for a NAL
1142	   unit in an aggregation unit that is not the first aggregation unit in
1143	   an AP is derived as equal to the DON of the preceding aggregated NAL
1144	   unit in the same AP plus 1 modulo 65536.  Otherwise
1145	   (sprop-max-don-diff is equal to 0), the DONL field MUST NOT be
1146	   present in an aggregation unit that is the first aggregation unit in
1147	   an AP.

1149	   An aggregation unit that is not the first aggregation unit in an AP
1150	   consists of a 16-bit unsigned size information
1151	   (in network byte order) that indicates the size of the NAL unit in
1152	   bytes (excluding these two octets, but including the NAL unit
1153	   header), followed by the NAL unit itself, including its NAL unit
1154	   header, as shown in Figure 6.

1156	     0                   1                   2                   3
1157	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1158	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1159	    |               :       NALU size               |   NAL unit    |
1160	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
1161	    |                                                               |
1162	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1163	    |                               :
1164	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1166	         The Structure of an Aggregation Unit That Is Not the First
1167	                          Aggregation Unit in an AP

1169	                                  Figure 6

1171	   Figure 7 presents an example of an AP that contains two aggregation
1172	   units, labeled as 1 and 2 in the figure, without the DONL field being
1173	   present.

1175	     0                   1                   2                   3
1176	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1177	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1178	    |                          RTP Header                           |
1179	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1180	    |   PayloadHdr (Type=28)        |         NALU 1 Size           |
1181	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1182	    |          NALU 1 HDR           |                               |
1183	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
1184	    |                   . . .                                       |
1185	    |                                                               |
1186	    +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1187	    |  . . .        | NALU 2 Size                   | NALU 2 HDR    |
1188	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1189	    | NALU 2 HDR    |                                               |
1190	    +-+-+-+-+-+-+-+-+              NALU 2 Data                      |
1191	    |                   . . .                                       |
1192	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1193	    |                               :...OPTIONAL RTP padding        |
1194	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1196	               An Example of an AP Packet Containing
1197	             Two Aggregation Units without the DONL Field

1199	                                  Figure 7

1201	   Figure 8 presents an example of an AP that contains two aggregation
1202	   units, labeled as 1 and 2 in the figure, with the DONL field being
1203	   present.

1205	     0                   1                   2                   3
1206	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1207	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1208	    |                          RTP Header                           |
1209	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1210	    |   PayloadHdr (Type=28)        |        NALU 1 DONL            |
1211	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1212	    |          NALU 1 Size          |            NALU 1 HDR         |
1213	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1214	    |                                                               |
1215	    |                 NALU 1 Data   . . .                           |
1216	    |                                                               |
1217	    +        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1218	    |                               :          NALU 2 Size          |
1219	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1220	    |          NALU 2 HDR           |                               |
1221	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
1222	    |                                                               |
1223	    |        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1224	    |                               :...OPTIONAL RTP padding        |
1225	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1227	                   An Example of an AP Containing
1228	                 Two Aggregation Units with the DONL Field

1230	                                  Figure 8
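
   As a non-normative illustration, the following Python sketch walks
   the aggregation units of an AP payload as laid out in Figures 5
   through 8 and derives DON as described above.  The function name
   parse_ap_payload and its donl_present argument (standing in for
   "sprop-max-don-diff is greater than 0") are assumptions of this
   sketch, and any RTP padding is assumed to have been removed already.

```python
import struct


def parse_ap_payload(payload, donl_present):
    """Split an AP payload (PayloadHdr included) into (DON, NAL unit) pairs.

    DON is None for every NAL unit when the DONL field is absent.
    """
    if len(payload) < 2:
        raise ValueError("payload shorter than the two-octet PayloadHdr")
    offset = 2  # skip PayloadHdr
    don = None
    units = []
    while offset < len(payload):
        if donl_present:
            if not units:
                # First aggregation unit: 16-bit DONL, network byte order.
                if offset + 2 > len(payload):
                    raise ValueError("truncated DONL field")
                (don,) = struct.unpack_from("!H", payload, offset)
                offset += 2
            else:
                # Later aggregation units: previous DON plus 1 modulo 65536.
                don = (don + 1) % 65536
        if offset + 2 > len(payload):
            raise ValueError("truncated NALU size field")
        (size,) = struct.unpack_from("!H", payload, offset)
        offset += 2
        if size == 0 or offset + size > len(payload):
            raise ValueError("NALU size does not fit the payload")
        units.append((don, payload[offset:offset + size]))
        offset += size
    return units
```

   The size field counts the NAL unit header but not the two size
   octets themselves, which is why the slice starts right after the
   size field.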

1232	4.3.3.  Fragmentation Units

1234	   Fragmentation Units (FUs) are introduced to enable fragmenting a
1235	   single NAL unit into multiple RTP packets, possibly without
1236	   cooperation or knowledge of the [VVC] encoder.  A fragment of a NAL
1237	   unit consists of an integer number of consecutive octets of that NAL
1238	   unit.  Fragments of the same NAL unit MUST be sent in consecutive
1239	   order with ascending RTP sequence numbers (with no other RTP packets
1240	   within the same RTP stream being sent between the first and last
1241	   fragment).

1243	   When a NAL unit is fragmented and conveyed within FUs, it is referred
1244	   to as a fragmented NAL unit.  APs MUST NOT be fragmented.  FUs MUST
1245	   NOT be nested; i.e., an FU cannot contain a subset of another FU.

1247	   The RTP timestamp of an RTP packet carrying an FU is set to the NALU-
1248	   time of the fragmented NAL unit.

1250	   An FU consists of a payload header (denoted as PayloadHdr), an FU
1251	   header of one octet, a conditional 16-bit DONL field (in network byte
1252	   order), and an FU payload, as shown in Figure 9.

1254	     0                   1                   2                   3
1255	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1256	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1257	    |   PayloadHdr (Type=29)        |   FU header   | DONL (cond)   |
1258	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1259	    |   DONL (cond) |                                               |
1260	    +-+-+-+-+-+-+-+-+                                               |
1261	    |                         FU payload                            |
1262	    |                                                               |
1263	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1264	    |                               :...OPTIONAL RTP padding        |
1265	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1267	                          The Structure of an FU

1269	                                  Figure 9

1271	   The fields in the payload header are set as follows.  The Type field
1272	   MUST be equal to 29.  The fields F, LayerId, and TID MUST be equal to
1273	   the fields F, LayerId, and TID, respectively, of the fragmented NAL
1274	   unit.

1276	   The FU header consists of an S bit, an E bit, a P bit, and a 5-bit
1277	   FuType field, as shown in Figure 10.

1279	                           +---------------+
1280	                           |0|1|2|3|4|5|6|7|
1281	                           +-+-+-+-+-+-+-+-+
1282	                           |S|E|P|  FuType |
1283	                           +---------------+

1285	                       The Structure of FU Header

1287	                                 Figure 10

1289	   The semantics of the FU header fields are as follows:

1291	   S: 1 bit

1293	      When set to 1, the S bit indicates the start of a fragmented NAL
1294	      unit, i.e., the first byte of the FU payload is also the first
1295	      byte of the payload of the fragmented NAL unit.  When the FU
1296	      payload is not the start of the fragmented NAL unit payload, the S
1297	      bit MUST be set to 0.

1299	   E: 1 bit
1300	      When set to 1, the E bit indicates the end of a fragmented NAL
1301	      unit, i.e., the last byte of the payload is also the last byte of
1302	      the fragmented NAL unit.  When the FU payload is not the last
1303	      fragment of a fragmented NAL unit, the E bit MUST be set to 0.

1305	   P: 1 bit

1307	      When set to 1, the P bit indicates the last NAL unit of a coded
1308	      picture, i.e., the last byte of the FU payload is also the last
1309	      byte of the coded picture.  When the FU payload is not the last
1310	      fragment of a coded picture, the P bit MUST be set to 0.

1312	   FuType: 5 bits

1314	      The field FuType MUST be equal to the field Type of the fragmented
1315	      NAL unit.

1317	   The DONL field, when present, specifies the value of the 16 least
1318	   significant bits of the decoding order number of the fragmented NAL
1319	   unit.

1321	   If sprop-max-don-diff is greater than 0, and the S bit is equal to 1,
1322	   the DONL field MUST be present in the FU, and the variable DON for
1323	   the fragmented NAL unit is derived as equal to the value of the DONL
1324	   field.  Otherwise (sprop-max-don-diff is equal to 0, or the S bit is
1325	   equal to 0), the DONL field MUST NOT be present in the FU.

1327	   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
1328	   the S bit and the E bit must not both be set to 1 in the same FU
1329	   header.

1331	   The FU payload consists of fragments of the payload of the fragmented
1332	   NAL unit so that if the FU payloads of consecutive FUs, starting with
1333	   an FU with the S bit equal to 1 and ending with an FU with the E bit
1334	   equal to 1, are sequentially concatenated, the payload of the
1335	   fragmented NAL unit can be reconstructed.  The NAL unit header of the
1336	   fragmented NAL unit is not included as such in the FU payload, but
1337	   rather the information of the NAL unit header of the fragmented NAL
1338	   unit is conveyed in F, LayerId, and TID fields of the FU payload
1339	   headers of the FUs and the FuType field of the FU header of the FUs.
1340	   An FU payload MUST NOT be empty.

1342	   If an FU is lost, the receiver SHOULD discard all following
1343	   fragmentation units in transmission order corresponding to the same
1344	   fragmented NAL unit, unless the decoder in the receiver is known to
1345	   be prepared to gracefully handle incomplete NAL units.

1347	   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
1348	   fragments of a NAL unit to an (incomplete) NAL unit, even if fragment
1349	   n of that NAL unit is not received.  In this case, the
1350	   forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a
1351	   syntax violation.
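
   The following non-normative Python sketch illustrates the FU rules
   above as a round trip: a sender splits one NAL unit into FUs, and a
   receiver that got every FU rebuilds the NAL unit.  The function
   names, the max_fragment argument (the maximum number of FU payload
   octets per FU), and the last_in_picture flag are assumptions of this
   sketch.  It also assumes the two-octet NAL unit header and PayloadHdr
   layout F (1 bit), Z (1 bit), LayerId (6 bits), Type (5 bits), TID
   (3 bits).  Handling of lost FUs is not covered.

```python
import struct

FU_TYPE = 29  # Type value carried in the PayloadHdr of an FU


def fragment_nal_unit(nal, max_fragment, don=None, last_in_picture=False):
    """Split one NAL unit into FUs (PayloadHdr + FU header [+ DONL] + data)."""
    header, body = nal[:2], nal[2:]
    if max_fragment < 1 or len(body) <= max_fragment:
        # S and E must not both be 1, so at least two fragments are needed.
        raise ValueError("NAL unit cannot be carried as two or more FUs")
    fu_type = header[1] >> 3
    # F, Z, LayerId, and TID are copied; only Type is replaced by 29.
    payload_hdr = bytes([header[0], (FU_TYPE << 3) | (header[1] & 0x07)])
    chunks = [body[i:i + max_fragment]
              for i in range(0, len(body), max_fragment)]
    fus = []
    for index, chunk in enumerate(chunks):
        start = int(index == 0)
        end = int(index == len(chunks) - 1)
        p_bit = int(bool(end and last_in_picture))
        fu_header = (start << 7) | (end << 6) | (p_bit << 5) | fu_type
        # DONL is present only in the FU with S equal to 1.
        donl = struct.pack("!H", don) if (start and don is not None) else b""
        fus.append(payload_hdr + bytes([fu_header]) + donl + chunk)
    return fus


def reassemble_nal_unit(fus, donl_present):
    """Rebuild (DON, NAL unit) from all FUs of one fragmented NAL unit."""
    don = None
    out = bytearray()
    for index, fu in enumerate(fus):
        fu_header = fu[2]
        start = bool(fu_header & 0x80)
        end = bool(fu_header & 0x40)
        offset = 3
        if start != (index == 0) or end != (index == len(fus) - 1):
            raise ValueError("unexpected S or E bit")
        if start:
            if donl_present:
                (don,) = struct.unpack_from("!H", fu, offset)
                offset += 2
            # Restore the NAL unit header: Type comes from FuType.
            out += bytes([fu[0], ((fu_header & 0x1F) << 3) | (fu[1] & 0x07)])
        if len(fu) <= offset:
            raise ValueError("empty FU payload")
        out += fu[offset:]
    return don, bytes(out)
```

   Note that the NAL unit header is never copied into an FU payload; it
   is rebuilt from the PayloadHdr and FuType, as the text above
   describes.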

1353	4.4.  Decoding Order Number

1355	   For each NAL unit, the variable AbsDon is derived, representing the
1356	   decoding order number that is indicative of the NAL unit decoding
1357	   order.

1359	   Let NAL unit n be the n-th NAL unit in transmission order within an
1360	   RTP stream.

1362	   If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon
1363	   for NAL unit n, is derived as equal to n.

1365	   Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is
1366	   derived as follows, where DON[n] is the value of the variable DON for
1367	   NAL unit n:

1369	   *  If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in
1370	      transmission order), AbsDon[0] is set equal to DON[0].

1372	   *  Otherwise (n is greater than 0), the following applies for
1373	      derivation of AbsDon[n]:

1375	         If DON[n] == DON[n-1],
1376	            AbsDon[n] = AbsDon[n-1]

1378	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768),
1379	            AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1]

1381	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768),
1382	            AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n]

1384	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768),
1385	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 -
1386	            DON[n])

1388	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768),
1389	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n])
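
   The five cases above can be transcribed directly, as in the
   following non-normative Python sketch.  The function names
   next_abs_don and abs_don_sequence are assumptions of this sketch;
   the input is the list of DON values in transmission order.

```python
def next_abs_don(prev_abs_don, prev_don, don):
    """Derive AbsDon[n] from AbsDon[n-1], DON[n-1], and DON[n]."""
    if don == prev_don:
        return prev_abs_don
    if don > prev_don and don - prev_don < 32768:
        # Forward step without wrap-around.
        return prev_abs_don + don - prev_don
    if don < prev_don and prev_don - don >= 32768:
        # Forward step across the 65535 -> 0 wrap.
        return prev_abs_don + 65536 - prev_don + don
    if don > prev_don and don - prev_don >= 32768:
        # Backward step across the 0 -> 65535 wrap.
        return prev_abs_don - (prev_don + 65536 - don)
    # Remaining case: don < prev_don and prev_don - don < 32768,
    # a backward step without wrap-around.
    return prev_abs_don - (prev_don - don)


def abs_don_sequence(dons):
    """Return AbsDon for each NAL unit, given DON in transmission order."""
    result = []
    for n, don in enumerate(dons):
        if n == 0:
            result.append(don)  # AbsDon[0] is set equal to DON[0]
        else:
            result.append(next_abs_don(result[-1], dons[n - 1], don))
    return result
```

   For example, the DON values 65534, 65535, 0, 1 map to the AbsDon
   values 65534, 65535, 65536, 65537, so the wrap of the 16-bit DON
   does not disturb the decoding order.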

1391	   For any two NAL units m and n, the following applies:

1393	   *  AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows
1394	      NAL unit m in NAL unit decoding order.

1396	   *  When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order
1397	      of the two NAL units can be in either order.

1399	   *  AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes
1400	      NAL unit m in decoding order.

1402	         Informative note: When two consecutive NAL units in the NAL
1403	         unit decoding order have different values of AbsDon, the
1404	         absolute difference between the two AbsDon values may be
1405	         greater than or equal to 1.

1407	         Informative note: There are multiple reasons to allow for the
1408	         absolute difference of the values of AbsDon for two consecutive
1409	         NAL units in the NAL unit decoding order to be greater than
1410	         one.  An increment by one is not required, as at the time of
1411	         associating values of AbsDon to NAL units, it may not be known
1412	         whether all NAL units are to be delivered to the receiver.  For
1413	         example, a gateway might not forward VCL NAL units of higher
1414	         sublayers or some SEI NAL units when there is congestion in the
1415	         network.  In another example, the first intra-coded picture of
1416	         a pre-encoded clip is transmitted in advance to ensure that it
1417	         is readily available in the receiver, and when transmitting the
1418	         first intra-coded picture, the originator does not exactly know
1419	         how many NAL units will be encoded before the first intra-coded
1420	         picture of the pre-encoded clip follows in decoding order.
1421	         Thus, the values of AbsDon for the NAL units of the first
1422	         intra-coded picture of the pre-encoded clip have to be
1423	         estimated when they are transmitted, and gaps in values of
1424	         AbsDon may occur.

1426	5.  Packetization Rules

1428	   The following packetization rules apply:

1430	   *  If sprop-max-don-diff is greater than 0, the transmission order of
1431	      NAL units carried in the RTP stream MAY be different than the NAL
1432	      unit decoding order.  Otherwise (sprop-max-don-diff is equal to
1433	      0), the transmission order of NAL units carried in the RTP stream
1434	      MUST be the same as the NAL unit decoding order.

1436	   *  A NAL unit of a small size SHOULD be encapsulated in an
1437	      aggregation packet together with one or more other NAL units in
1438	      order to avoid the unnecessary packetization overhead for small
1439	      NAL units.  For example, non-VCL NAL units such as access unit
1440	      delimiters, parameter sets, or SEI NAL units are typically small
1441	      and can often be aggregated with VCL NAL units without violating
1442	      MTU size constraints.

1444	   *  Each non-VCL NAL unit SHOULD, when possible from an MTU size match
1445	      viewpoint, be encapsulated in an aggregation packet together with
1446	      its associated VCL NAL unit, as typically a non-VCL NAL unit would
1447	      be meaningless without the associated VCL NAL unit being
1448	      available.

1450	   *  For carrying exactly one NAL unit in an RTP packet, a single NAL
1451	      unit packet MUST be used.
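
   One possible way to apply these rules is sketched below in Python;
   it is non-normative.  The sketch assumes sprop-max-don-diff is equal
   to 0 (so no DONL field is counted and NAL units stay in decoding
   order), uses a simple greedy grouping policy that is not mandated by
   this memo, and takes a hypothetical max_payload argument giving the
   RTP payload budget in octets.  It only decides which packet type
   carries which NAL unit.

```python
def plan_packets(nal_sizes, max_payload):
    """Map NAL unit sizes (in decoding order) to a list of packet plans.

    Each plan is (kind, indices) with kind in {"single", "ap", "fu"}.
    """
    plan = []
    pending = []        # indices of NAL units waiting to be aggregated
    pending_bytes = 2   # an AP starts with a two-octet PayloadHdr

    def flush():
        nonlocal pending, pending_bytes
        if len(pending) == 1:
            # Exactly one NAL unit: a single NAL unit packet is used.
            plan.append(("single", pending))
        elif pending:
            plan.append(("ap", pending))
        pending, pending_bytes = [], 2

    for index, size in enumerate(nal_sizes):
        if size > max_payload:
            # Too large for one packet: fragment it on its own.
            flush()
            plan.append(("fu", [index]))
            continue
        cost = 2 + size  # 16-bit NALU size field plus the NAL unit
        if pending and pending_bytes + cost > max_payload:
            flush()
        pending.append(index)
        pending_bytes += cost
    flush()
    return plan
```

   With sizes 20, 30, 1500, and 40 octets and a budget of 1200 octets,
   the first two NAL units share an AP, the third is fragmented, and
   the fourth is sent in a single NAL unit packet.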

1453	6.  De-packetization Process

1455	   The general concept behind de-packetization is to get the NAL units
1456	   out of the RTP packets in an RTP stream and pass them to the decoder
1457	   in the NAL unit decoding order.

1459	   The de-packetization process is implementation dependent.  Therefore,
1460	   the following description should be seen as an example of a suitable
1461	   implementation.  Other schemes may be used as well, as long as the
1462	   output for the same input is the same as the process described below.
1463	   The output is the same when the set of output NAL units and their
1464	   order are both identical.  Optimizations relative to the described
1465	   algorithms are possible.

1467	   All normal RTP mechanisms related to buffer management apply.  In
1468	   particular, duplicated or outdated RTP packets (as indicated by the
1469	   RTP sequence number and the RTP timestamp) are removed.  To
1470	   determine the exact time for decoding, factors such as a possible
1471	   intentional delay to allow for proper inter-stream synchronization
1472	   MUST be factored in.

1474	   NAL units with NAL unit type values in the range of 0 to 27,
1475	   inclusive, may be passed to the decoder.  NAL-unit-like structures
1476	   with NAL unit type values in the range of 28 to 31, inclusive, MUST
1477	   NOT be passed to the decoder.

1479	   The receiver includes a receiver buffer, which is used to compensate
1480	   for transmission delay jitter within an individual RTP stream and to
1481	   reorder NAL units from transmission order to the NAL unit decoding
1482	   order.  In this section, the receiver operation is described under
1483	   the assumption that there is no transmission delay jitter within an
1484	   RTP stream.  To distinguish it from a practical receiver buffer
1485	   that is also used for compensation of transmission delay jitter, the
1486	   receiver buffer is hereafter called the de-packetization buffer in
1487	   this section.  Receivers should also prepare for transmission delay
1488	   jitter; that is, either reserve separate buffers for transmission
1489	   delay jitter buffering and de-packetization buffering or use a
1490	   receiver buffer for both transmission delay jitter and de-
1491	   packetization.  Moreover, receivers should take transmission delay
1492	   jitter into account in the buffering operation, e.g., by additional
1493	   initial buffering before decoding and playback start.

1495	   When sprop-max-don-diff is equal to 0, the de-packetization buffer
1496	   size is zero bytes, and the process described in the remainder of
1497	   this paragraph applies.  The NAL units carried in the single RTP
1498	   stream are directly passed to the decoder in their transmission
1499	   order, which is identical to their decoding order.

1501	   When sprop-max-don-diff is greater than 0, the process described in
1502	   the remainder of this section applies.

1504	   There are two buffering states in the receiver: initial buffering and
1505	   buffering while playing.  Initial buffering starts when the reception
1506	   is initialized.  After initial buffering, decoding and playback are
1507	   started, and the buffering-while-playing mode is used.

1509	   Regardless of the buffering state, the receiver stores incoming NAL
1510	   units in reception order into the de-packetization buffer.  NAL units
1511	   carried in RTP packets are stored in the de-packetization buffer
1512	   individually, and the value of AbsDon is calculated and stored for
1513	   each NAL unit.

1515	   Initial buffering lasts until the difference between the greatest and
1516	   smallest AbsDon values of the NAL units in the de-packetization
1517	   buffer is greater than or equal to the value of sprop-max-don-diff.

1519	   After initial buffering, whenever condition A or condition B is true,
1520	   the following operation is repeatedly applied until both condition A
1521	   and condition B become false:

1523	   *  The NAL unit in the de-packetization buffer with the smallest
1524	      value of AbsDon is removed from the de-packetization buffer and
1525	      passed to the decoder.

1527	   When no more NAL units are flowing into the de-packetization buffer,
1528	   all NAL units remaining in the de-packetization buffer are removed
1529	   from the buffer and passed to the decoder in the order of increasing
1530	   AbsDon values.
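
   The buffering behavior for sprop-max-don-diff greater than 0 can be
   sketched in Python as follows; this is non-normative.  Because
   conditions A and B are not spelled out in this section, the sketch
   assumes that the release trigger is the same test that ends initial
   buffering, i.e., the difference between the greatest and smallest
   AbsDon values in the buffer being greater than or equal to
   sprop-max-don-diff.  The class and method names are hypothetical.

```python
import heapq


class DepacketizationBuffer:
    """Reorders NAL units from reception order to AbsDon order."""

    def __init__(self, sprop_max_don_diff):
        if sprop_max_don_diff <= 0:
            raise ValueError("no reordering when sprop-max-don-diff is 0")
        self.max_don_diff = sprop_max_don_diff
        self._heap = []      # entries: (AbsDon, arrival index, NAL unit)
        self._arrivals = 0   # tie-breaker for equal AbsDon values

    def _spread(self):
        # Greatest minus smallest AbsDon currently in the buffer.
        return max(entry[0] for entry in self._heap) - self._heap[0][0]

    def push(self, abs_don, nal):
        """Store one NAL unit; return the NAL units released to the decoder."""
        heapq.heappush(self._heap, (abs_don, self._arrivals, nal))
        self._arrivals += 1
        released = []
        # Assumed trigger: release while the AbsDon spread is too large.
        while self._heap and self._spread() >= self.max_don_diff:
            released.append(heapq.heappop(self._heap)[2])
        return released

    def flush(self):
        """No more input: drain everything in increasing AbsDon order."""
        released = []
        while self._heap:
            released.append(heapq.heappop(self._heap)[2])
        return released
```

   Under the stated assumption, initial buffering and buffering while
   playing use the same test, so the sketch does not track the
   buffering state separately.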

1532	7.  Payload Format Parameters

1534	   This section specifies the optional parameters.  A mapping of the
1535	   parameters with Session Description Protocol (SDP) [RFC4566] is also
1536	   provided for applications that use SDP.

1538	7.1.  Media Type Registration

1540	   The receiver MUST ignore any parameter unspecified in this memo.

1542	   Type name:            video

1544	   Subtype name:         H266

1546	   Required parameters:  none

1548	   Optional parameters:

1550	      profile-id, tier-flag, sub-profile-id, interop-constraints, and
1551	      level-id:

1553	         These parameters indicate the profile, tier, default level,
1554	         sub-profile, and some constraints of the bitstream carried by
1555	         the RTP stream, or a specific set of the profile, tier, default
1556	         level, sub-profile and some constraints the receiver supports.

1558	         The subset of coding tools that may have been used to generate
1559	         the bitstream or that the receiver supports, as well as some
1560	         additional constraints are indicated collectively by profile-
1561	         id, sub-profile-id, and interop-constraints.

1563	            Informative note: There are 128 values of profile-id.  The
1564	            subset of coding tools identified by the profile-id can be
1565	            further constrained with up to 255 instances of sub-profile-
1566	            id.  In addition, 68 bits included in interop-constraints,
1567	            which can be extended up to 324 bits, provide means to
1568	            further restrict tools from existing profiles.  To be able
1569	            to support this fine-granular signalling of coding tool
1570	            subsets with profile-id, sub-profile-id and interop-
1571	            constraints, it would be safe to require symmetric use of
1572	            these parameters in SDP offer/answer unless recv-ols-id is
1573	            included in the SDP answer for choosing one of the layers
1574	            offered.

1576	         The tier is indicated by tier-flag.  The default level is
1577	         indicated by level-id.  The tier and the default level specify
1578	         the limits on values of syntax elements or arithmetic
1579	         combinations of values of syntax elements that are followed
1580	         when generating the bitstream or that the receiver supports.

1582	         In SDP offer/answer, when the SDP answer does not include the
1583	         recv-ols-id parameter that is less than the sprop-ols-id
1584	         parameter in the SDP offer, the following applies:

1586	         o  The tier-flag, profile-id, sub-profile-id, and interop-
1587	            constraints parameters MUST be used symmetrically, i.e., the
1588	            value of each of these parameters in the offer MUST be the
1589	            same as that in the answer, either explicitly signaled or
1590	            implicitly inferred.

1592	         o  The level-id parameter is changeable as long as the highest
1593	            level indicated by the answer is either equal to or lower
1594	            than that in the offer.  Note that a highest level higher
1595	            than level-id in the offer for receiving can be included as
1596	            max-recv-level-id.

1598	         In SDP offer/answer, when the SDP answer does include the recv-
1599	         ols-id parameter that is less than the sprop-ols-id parameter
1600	         in the SDP offer, the set of tier-flag, profile-id, sub-
1601	         profile-id, interop-constraints, and level-id parameters
1602	         included in the answer MUST be consistent with that for the
1603	         chosen output layer set as indicated in the SDP offer, with the
1604	         exception that the level-id parameter in the SDP answer is
1605	         changeable as long as the highest level indicated by the answer
1606	         is either lower than or equal to that in the offer.

1608	         More specifications of these parameters, including how they
1609	         relate to syntax elements specified in [VVC] are provided
1610	         below.

1612	      profile-id:

1614	         When profile-id is not present, a value of 1 (i.e., the Main 10
1615	         profile) MUST be inferred.

1617	         When used to indicate properties of a bitstream, profile-id is
1618	         derived from the general_profile_idc syntax element that
1619	         applies to the bitstream in an instance of the
1620	         profile_tier_level( ) syntax structure.

1622	         A profile_tier_level( ) syntax structure may be contained in an
1623	         SPS, VPS, or DCI NAL unit as specified in [VVC].  One of the
1624	         following three cases applies to the container NAL unit of the
1625	         profile_tier_level( ) syntax structure containing those PTL
1626	         syntax elements used to derive the values of profile-id, tier-
1627	         flag, level-id, sub-profile-id, or interop-constraints: 1) The
1628	         container NAL unit is an SPS, the bitstream is a single-layer
1629	         bitstream, and the profile_tier_level( ) syntax structures in
1630	         all SPSs referenced by the CVSs in the bitstream have the same
1631	         values respectively for those PTL syntax elements; 2) The
1632	         container NAL unit is a VPS, the profile_tier_level( ) syntax
1633	         structure is the one in the VPS that applies to the OLS
1634	         corresponding to the bitstream, and the profile_tier_level( )
1635	         syntax structures applicable to the OLS corresponding to the
1636	         bitstream in all VPSs referenced by the CVSs in the bitstream
1637	         have the same values respectively for those PTL syntax
1638	         elements; 3) The container NAL unit is a DCI NAL unit and the
1639	         profile_tier_level( ) syntax structures in all DCI NAL units in
1640	         the bitstream have the same values respectively for those PTL
1641	         syntax elements.

1643	      tier-flag, level-id:

1645	         The value of tier-flag MUST be in the range of 0 to 1,
1646	         inclusive.  The value of level-id MUST be in the range of 0 to
1647	         255, inclusive.

1649	         If the tier-flag and level-id parameters are used to indicate
1650	         properties of a bitstream, they indicate the tier and the
1651	         highest level the bitstream complies with.

1653	         If the tier-flag and level-id parameters are used for
1654	         capability exchange, the following applies.  If max-recv-level-
1655	         id is not present, the default level defined by level-id
1656	         indicates the highest level the codec wishes to support.
1657	         Otherwise, max-recv-level-id indicates the highest level the
1658	         codec supports for receiving.  For either receiving or sending,
1659	         all levels that are lower than the highest level supported MUST
1660	         also be supported.

1662	         If no tier-flag is present, a value of 0 MUST be inferred; if
1663	         no level-id is present, a value of 51 (i.e., level 3.1) MUST be
1664	         inferred.

1666	            Informative note: The level values currently defined in the
1667	            VVC specification are in the form of "majorNum.minorNum",
1668	            and the value of the level-id for each of the levels is
1669	            equal to majorNum * 16 + minorNum * 3.  It is expected that
1670	            if any levels are defined in the future, the same convention
1671	            will be used, but this cannot be guaranteed.
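
   The convention in the informative note above can be checked with a
   few lines of Python; this is non-normative.  The function names are
   assumptions of this sketch, and, as the note says, levels defined in
   the future are not guaranteed to follow the same formula.

```python
def level_id_from_level(major, minor):
    """Return level-id for level "major.minor" per the stated convention."""
    return major * 16 + minor * 3


def level_from_level_id(level_id):
    """Invert the convention; reject values that do not fit it."""
    major, rest = divmod(level_id, 16)
    if rest % 3 != 0:
        raise ValueError("level-id does not follow the convention")
    return major, rest // 3
```

   For instance, level 3.1 yields 3 * 16 + 1 * 3 = 51, which matches
   the default level-id inferred when the parameter is absent.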

1673	         When used to indicate properties of a bitstream, the tier-flag
1674	         and level-id parameters are derived respectively from the
1675	         syntax element general_tier_flag, and the syntax element
1676	         general_level_idc or sub_layer_level_idc[j], that apply to the
1677	         bitstream, in an instance of the profile_tier_level( ) syntax
1678	         structure.

1680	         If the tier-flag and level-id are derived from the
1681	         profile_tier_level( ) syntax structure in a DCI NAL unit, the
1682	         following applies:

1684	         o  tier-flag = general_tier_flag

1686	         o  level-id = general_level_idc

1688	         Otherwise, if the tier-flag and level-id are derived from the
1689	         profile_tier_level( ) syntax structure in an SPS or VPS NAL
1690	         unit, and the bitstream contains the highest sub-layer
1691	         representation in the OLS corresponding to the bitstream, the
1692	         following applies:

1694	         o  tier-flag = general_tier_flag

1696	         o  level-id = general_level_idc

1698	         Otherwise, if the tier-flag and level-id are derived from the
1699	         profile_tier_level( ) syntax structure in an SPS or VPS NAL
1700	         unit, and the bitstream does not contain the highest sub-layer
1701	         representation in the OLS corresponding to the bitstream, the
1702	         following applies, with j being the value of the sprop-sub-
1703	         layer-id parameter:

1705	         o  tier-flag = general_tier_flag

1707	         o  level-id = sub_layer_level_idc[j]

1709	      sub-profile-id:

1711	         The value of the parameter is a comma-separated (',') list of
1712	         data using base64 [RFC4648] representation.

1714	         When used to indicate properties of a bitstream, sub-profile-id
1715	         is derived from each of the ptl_num_sub_profiles
1716	         general_sub_profile_idc[i] syntax elements that apply to the
1717	         bitstream in a profile_tier_level( ) syntax structure.

1719	      interop-constraints:

1721	         A base16 [RFC4648] (hexadecimal) representation of the data
1722	         that includes the syntax elements
1723	         ptl_frame_only_constraint_flag and ptl_multilayer_enabled_flag
1724	         and the general_constraints_info( ) syntax structure that apply
1725	         to the bitstream in an instance of the profile_tier_level( )
1726	         syntax structure.

1728	         If the interop-constraints parameter is not present, the
1729	         following MUST be inferred:

1731	         o  ptl_frame_only_constraint_flag = 0

1733	         o  ptl_multilayer_enabled_flag = 1

1735	         o  gci_present_flag in the general_constraints_info( ) syntax
1736	            structure = 1

1738	   editor-note 14: Double check the default values.  Currently, no
1739	   constraints, but actually, with the Main 10 profile as default multi-
1740	   layer not possible.

1742	         Using interop-constraints for capability exchange results in a
1743	         requirement on any bitstream to be compliant with the interop-
1744	         constraints.

1746	      sprop-sub-layer-id:

1748	         This parameter MAY be used to indicate the highest allowed
1749	         value of TID in the bitstream.  When not present, the value of
1750	         sprop-sub-layer-id is inferred to be equal to 6.

1752	         The value of sprop-sub-layer-id MUST be in the range of 0 to 6,
1753	         inclusive.

1755	      sprop-ols-id:

1757	         This parameter MAY be used to indicate the OLS that the
1758	         bitstream applies to.  When not present, the value of sprop-
1759	         ols-id is inferred to be equal to TargetOlsIdx as specified in
1760	         8.1.1 in [VVC].  If this optional parameter is present, sprop-
1761	         vps MUST also be present or its content MUST be known a priori
1762	         at the receiver.

1764	         The value of sprop-ols-id MUST be in the range of 0 to 257,
1765	         inclusive.

1767	      recv-sub-layer-id:

1769	         This parameter MAY be used to signal a receiver's choice of the
1770	         offered or declared sub-layer representations in the sprop-vps
1771	         and sprop-sps.  The value of recv-sub-layer-id indicates the
1772	         TID of the highest sub-layer of the bitstream that a receiver
1773	         supports.  When not present, the value of recv-sub-layer-id is
1774	         inferred to be equal to the value of the sprop-sub-layer-id
1775	         parameter in the SDP offer.

1777	         The value of recv-sub-layer-id MUST be in the range of 0 to 6,
1778	         inclusive.

1780	      recv-ols-id:

1782	         This parameter MAY be used to signal a receiver's choice of the
1783	         offered or declared output layer sets in the sprop-vps.  The
1784	         value of recv-ols-id indicates the OLS index of the bitstream
1785	         that a receiver supports.  When not present, the value of recv-
1786	         ols-id is inferred to be equal to the value of the sprop-ols-id
1787	         parameter in the SDP offer.  When present, the value of recv-
1788	         ols-id must be included only when sprop-ols-id was received and
1789	         must refer to an output layer set in the VPS that is in the
1790	         same dependency tree as the OLS referred to by sprop-ols-id.
1791	         If this optional parameter is present, sprop-vps must have been
1792	         received or its content must be known a priori at the receiver.

1794	         The value of recv-ols-id MUST be in the range of 0 to 257,
1795	         inclusive.

1797	      max-recv-level-id:

1799	         This parameter MAY be used to indicate the highest level a
1800	         receiver supports.

1802	         The value of max-recv-level-id MUST be in the range of 0 to
1803	         255, inclusive.

1805	         When max-recv-level-id is not present, the value is inferred to
1806	         be equal to level-id.

1808	         max-recv-level-id MUST NOT be present when the highest level
1809	         the receiver supports is not higher than the default level.

1811	      sprop-dci:

1813	         This parameter MAY be used to convey a decoding capability
1814	         information NAL unit of the bitstream for out-of-band
1815	         transmission.  The parameter MAY also be used for capability
1816	         exchange.  The value of the parameter a base64 [RFC4648]
1817	         representations of the decoding capability information NAL unit
1818	         as specified in Section 7.3.2.1 of [VVC].

1820	      sprop-vps:

1822	         This parameter MAY be used to convey any video parameter set
1823	         NAL unit of the bitstream for out-of-band transmission of video
1824	         parameter sets.  The parameter MAY also be used for capability
1825	         exchange and to indicate sub-stream characteristics (i.e.,
1826	         properties of output layer sets and sublayer representations as
1827	         defined in [VVC]).  The value of the parameter is a comma-
1828	         separated (',') list of base64 [RFC4648] representations of the
1829	         video parameter set NAL units as specified in Section 7.3.2.3
1830	         of [VVC].

1832	         The sprop-vps parameter MAY contain one or more than one video
1833	         parameter set NAL unit.  However, all other video parameter
1834	         sets contained in the sprop-vps parameter MUST be consistent
1835	         with the first video parameter set in the sprop-vps parameter.
1836	         A video parameter set vpsB is said to be consistent with
1837	         another video parameter set vpsA if any decoder that conforms
1838	         to the profile, tier, level, and constraints indicated by the
1839	         12 bytes of data starting from the syntax element
1840	         general_profile_space to the syntax element general_level_idc,
1841	         inclusive, in the first profile_tier_level( ) syntax structure
1842	         in vpsA can decode any bitstream that conforms to the profile,
1843	         tier, level, and constraints indicated by the 12 bytes of data
1844	         starting from the syntax element general_profile_space to the
1845	         syntax element general_level_idc, inclusive, in the first
1846	         profile_tier_level( ) syntax structure in vpsB.
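
The comma-separated base64 list format described above for sprop-vps can be unpacked as in the following minimal sketch. The function name decode_nal_unit_list is hypothetical, and the bytes in the usage example are arbitrary placeholders, not real video parameter set NAL units. The consistency check between parameter sets is not attempted here.

```python
import base64


def decode_nal_unit_list(value):
    """Split a comma-separated parameter value and base64-decode each item.

    Returns a list of byte strings, one per NAL unit, in the order listed.
    validate=True rejects characters outside the base64 alphabet.
    """
    return [base64.b64decode(item, validate=True)
            for item in value.split(",") if item]


# Placeholder payloads only (assumption): these are not real VPS NAL units.
units = decode_nal_unit_list("AAEC,AwQ=")
first_vps = units[0]  # the set that all others must be consistent with
```

The first element is kept separately because the text above requires every other video parameter set in the list to be consistent with the first one.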

1848	      sprop-sei:

1850	         This parameter MAY be used to convey one or more SEI messages
1851	         that describe bitstream characteristics.  When present, a
1852	         decoder can rely on the bitstream characteristics that are
1853	         described in the SEI messages for the entire duration of the
1854	         session, independently from the persistence scopes of the SEI
1855	         messages as specified in [VSEI].

1857	         The value of the parameter is a comma-separated (',') list of
1858	         base64 [RFC4648] representations of SEI NAL units as specified
1859	         in [VSEI].

1861	            Informative note: Intentionally, no list of applicable or
1862	            inapplicable SEI messages is specified here.  Conveying
1863	            certain SEI messages in sprop-sei may be sensible in some
1864	            application scenarios and meaningless in others.  However, a
1865	            few examples are described below:

1867	            1) In an environment where the bitstream was created from
1868	            film-based source material, and no splicing is going to
1869	            occur during the lifetime of the session, the film grain
1870	            characteristics SEI message is likely meaningful, and
1871	            sending it in sprop-sei rather than in the bitstream at each
1872	            entry point may help with saving bits and allows one to
1873	            configure the renderer only once, avoiding unwanted
1874	            artifacts.

1876	            2) Examples for SEI messages that would be meaningless to be
1877	            conveyed in sprop-sei include the decoded picture hash SEI
1878	            message (it is close to impossible that all decoded pictures
1879	            have the same hash), the display orientation SEI message
1880	            when the device is a handheld device (as the display
1881	            orientation may change when the handheld device is turned
1882	            around), or the filler payload SEI message (as there is no
1883	            point in just having more bits in SDP).

1885	      max-lsr:

1887	         The max-lsr MAY be used to signal the capabilities of a
1888	         receiver implementation and MUST NOT be used for any other
1889	         purpose.  The value of max-lsr is an integer indicating the
1890	         maximum processing rate in units of luma samples per second.
1891	         The max-lsr parameter signals that the receiver is capable of
1892	         decoding video at a higher rate than is required by the highest
1893	         level.

1895	            Informative note: When the OPTIONAL media type parameters
1896	            are used to signal the properties of a bitstream, and max-
1897	            lsr is not present, the values of tier-flag, profile-id,
1898	            sub-profile-id, interop-constraints, and level-id must always
1899	            be such that the bitstream complies fully with the specified
1900	            profile, tier, and level.

1902	         When max-lsr is signaled, the receiver MUST be able to decode
1903	         bitstreams that conform to the highest level, with the
1904	         exception that the MaxLumaSr value in Table 136 of [VVC] for
1905	         the highest level is replaced with the value of max-lsr.
1906	         Senders MAY use this knowledge to send pictures of a given size
1907	         at a higher picture rate than is indicated in the highest
1908	         level.

1910	         When not present, the value of max-lsr is inferred to be equal
1911	         to the value of MaxLumaSr given in Table 136 of [VVC] for the
1912	         highest level.

1914	         The value of max-lsr MUST be in the range of MaxLumaSr to 16 *
1915	         MaxLumaSr, inclusive, where MaxLumaSr is given in Table 136 of
1916	         [VVC] for the highest level.
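
The range rule and the inferred default for max-lsr can be sketched as follows. This is an illustration only: the function name effective_max_lsr is hypothetical, and the MaxLumaSr argument has to be looked up by the caller in Table 136 of [VVC] for the highest level. The number used in the usage example is a placeholder, not a value from that table.

```python
def effective_max_lsr(max_lsr, max_luma_sr):
    """Return the max-lsr value in effect for a receiver.

    max_lsr      -- signaled value, or None when the parameter is absent
    max_luma_sr  -- MaxLumaSr for the highest level (supplied by the caller)
    """
    if max_lsr is None:
        # When not present, max-lsr is inferred to be equal to MaxLumaSr.
        return max_luma_sr
    if not (max_luma_sr <= max_lsr <= 16 * max_luma_sr):
        raise ValueError("max-lsr outside [MaxLumaSr, 16 * MaxLumaSr]")
    return max_lsr


PLACEHOLDER_MAX_LUMA_SR = 1000  # assumption: not a real Table 136 entry
limit = effective_max_lsr(2000, PLACEHOLDER_MAX_LUMA_SR)
```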

1918	      max-fps:

1920	         The value of max-fps is an integer indicating the maximum
1921	         picture rate in units of pictures per 100 seconds that can be
1922	         effectively processed by the receiver.  The max-fps parameter
1923	         MAY be used to signal that the receiver has a constraint in
1924	         that it is not capable of processing video effectively at the
1925	         full picture rate that is implied by the highest level and,
1926	         when present, max-lsr.

1928	         The value of max-fps is not necessarily the picture rate at
1929	         which the maximum picture size can be sent; it constitutes a
1930	         constraint on maximum picture rate for all resolutions.

1932	            Informative note: The max-fps parameter is semantically
1933	            different from max-lsr in that max-fps is used to signal a
1934	            constraint, lowering the maximum picture rate from what is
1935	            implied by other parameters.

1937	         The encoder MUST use a picture rate equal to or less than this
1938	         value.  In cases where the max-fps parameter is absent, the
1939	         encoder is free to choose any picture rate according to the
1940	         highest level and any signaled optional parameters.

1942	         The value of max-fps MUST be smaller than or equal to the full
1943	         picture rate that is implied by the highest level and, when
1944	         present, max-lsr.
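
Because max-fps is expressed in pictures per 100 seconds, a value of 3000 corresponds to 30 pictures per second. The following sketch shows the unit conversion and the encoder-side check. The function names are hypothetical, and the check covers only the max-fps constraint, not the limits implied by the level or by max-lsr.

```python
def max_fps_to_pictures_per_second(max_fps):
    """Convert a max-fps value (pictures per 100 seconds) to pictures/s."""
    return max_fps / 100


def picture_rate_allowed(pictures_per_second, max_fps=None):
    """Return True if the encoder's picture rate respects max-fps.

    When max-fps is absent (None), this check imposes no limit.
    """
    if max_fps is None:
        return True
    return pictures_per_second * 100 <= max_fps
```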

1946	      sprop-max-don-diff:

1948	         If there is no NAL unit naluA that is followed in transmission
1949	         order by any NAL unit preceding naluA in decoding order (i.e.,
1950	         the transmission order of the NAL units is the same as the
1951	         decoding order), the value of this parameter MUST be equal to
1952	         0.

1954	         Otherwise, this parameter specifies the maximum absolute
1955	         difference between the decoding order number (i.e., AbsDon)
1956	         values of any two NAL units naluA and naluB, where naluA
1957	         follows naluB in decoding order and precedes naluB in
1958	         transmission order.

1960	         The value of sprop-max-don-diff MUST be an integer in the range
1961	         of 0 to 32767, inclusive.

1963	         When not present, the value of sprop-max-don-diff is inferred
1964	         to be equal to 0.
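
As an illustration of the definition above, the following sketch derives a sprop-max-don-diff value from a list of AbsDon values given in transmission order. It assumes that decoding order corresponds to increasing AbsDon; the function name max_don_diff is hypothetical. For each NAL unit, the largest AbsDon seen earlier in transmission order is compared with its own AbsDon, which covers every pair in which the earlier-transmitted unit follows the later-transmitted one in decoding order.

```python
def max_don_diff(abs_don_in_tx_order):
    """Compute the maximum AbsDon difference caused by reordering.

    abs_don_in_tx_order -- AbsDon values of NAL units, in transmission order.
    Returns 0 when transmission order equals decoding order.
    """
    max_diff = 0
    max_seen = None  # largest AbsDon among units transmitted so far
    for abs_don in abs_don_in_tx_order:
        if max_seen is not None and max_seen > abs_don:
            # An earlier-transmitted unit follows this one in decoding order.
            max_diff = max(max_diff, max_seen - abs_don)
        if max_seen is None or abs_don > max_seen:
            max_seen = abs_don
    return max_diff
```

A sender would additionally have to confirm that the result lies in the range of 0 to 32767 before signaling it.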

1966	      sprop-depack-buf-bytes:

1968	         This parameter signals the required size of the de-
1969	         packetization buffer in units of bytes.  The value of the
1970	         parameter MUST be greater than or equal to the maximum buffer
1971	         occupancy (in units of bytes) of the de-packetization buffer as
1972	         specified in Section 6.

1974	         The value of sprop-depack-buf-bytes MUST be an integer in the
1975	         range of 0 to 4294967295, inclusive.

1977	         When sprop-max-don-diff is present and greater than 0, this
1978	         parameter MUST be present and the value MUST be greater than 0.
1979	         When not present, the value of sprop-depack-buf-bytes is
1980	         inferred to be equal to 0.

1982	            Informative note: The value of sprop-depack-buf-bytes
1983	            indicates the required size of the de-packetization buffer
1984	            only.  When network jitter can occur, an appropriately sized
1985	            jitter buffer has to be available as well.

1987	      depack-buf-cap:

1989	         This parameter signals the capabilities of a receiver
1990	         implementation and indicates the amount of de-packetization
1991	         buffer space in units of bytes that the receiver has available
1992	         for reconstructing the NAL unit decoding order from NAL units
1993	         carried in the RTP stream.  A receiver is able to handle any
1994	         RTP stream for which the value of the sprop-depack-buf-bytes
1995	         parameter is smaller than or equal to this parameter.

1997	         When not present, the value of depack-buf-cap is inferred to be
1998	         equal to 4294967295.  The value of depack-buf-cap MUST be an
1999	         integer in the range of 1 to 4294967295, inclusive.

2001	            Informative note: depack-buf-cap indicates the maximum
2002	            possible size of the de-packetization buffer of the receiver
2003	            only, without allowing for network jitter.
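
The relationship between sprop-depack-buf-bytes and depack-buf-cap, including the inferred defaults stated above, can be sketched as follows. The function name receiver_can_handle is hypothetical, and the sketch covers only the buffer-size comparison, not jitter buffering.

```python
UINT32_MAX = 4294967295


def receiver_can_handle(sprop_depack_buf_bytes=None, depack_buf_cap=None):
    """Return True if the receiver's buffer capability covers the stream.

    Absent parameters (None) take the inferred values from the text:
    sprop-depack-buf-bytes defaults to 0, depack-buf-cap to 4294967295.
    """
    required = 0 if sprop_depack_buf_bytes is None else sprop_depack_buf_bytes
    available = UINT32_MAX if depack_buf_cap is None else depack_buf_cap
    if not 0 <= required <= UINT32_MAX:
        raise ValueError("sprop-depack-buf-bytes out of range")
    if not 1 <= available <= UINT32_MAX:
        raise ValueError("depack-buf-cap out of range")
    return required <= available
```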

2005	7.2.  SDP Parameters

2007	   The receiver MUST ignore any parameter unspecified in this memo.

2009	7.2.1.  Mapping of Payload Type Parameters to SDP

2011	   The media type video/H266 string is mapped to fields in the Session
2012	   Description Protocol (SDP) [RFC4566] as follows:

2014	   *  The media name in the "m=" line of SDP MUST be video.

2016	   *  The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the
2017	      media subtype).

2019	   *  The clock rate in the "a=rtpmap" line MUST be 90000.

2021	   *  The OPTIONAL parameters profile-id, tier-flag, sub-profile-id,
2022	      interop-constraints, level-id, sprop-sub-layer-id, sprop-ols-id,
2023	      recv-sub-layer-id, recv-ols-id, max-recv-level-id, max-lsr, max-
2024	      fps, sprop-max-don-diff, sprop-depack-buf-bytes and depack-buf-
2025	      cap, when present, MUST be included in the "a=fmtp" line of SDP.
2026	      These parameters are expressed as a media type string, in the form
2027	      of a semicolon-separated list of parameter=value pairs.

2029	   *  The OPTIONAL parameter sprop-vps, when present, MUST be included
2030	      in the "a=fmtp" line of SDP or conveyed using the "fmtp" source
2031	      attribute as specified in Section 6.3 of [RFC5576].  For a
2032	      particular media format (i.e., RTP payload type), sprop-vps MUST
2033	      NOT be both included in the "a=fmtp" line of SDP and conveyed
2034	      using the "fmtp" source attribute.  When included in the "a=fmtp"
2035	      line of SDP, sprop-vps is expressed as a media type string, in the
2036	      form of a parameter=value pair.  When conveyed in the "a=fmtp"
2037	      line of SDP for a particular payload type, the parameter sprop-vps
2038	      MUST be applied to each SSRC with the payload type.  When conveyed
2039	      using the "fmtp" source attribute, sprop-vps is only associated
2040	      with the given source and payload type as parts of the "fmtp"
2041	      source attribute.

2043	   An example of media representation in SDP is as follows:

2045	       m=video 49170 RTP/AVP 98
2046	       a=rtpmap:98 H266/90000
2047	       a=fmtp:98 profile-id=1; sprop-vps=<video parameter sets data>
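
A receiver-side reading of such an "a=fmtp" line can be sketched as below. This is a minimal illustration, not a complete SDP parser: the function name parse_fmtp is hypothetical, and the sketch assumes the whole attribute is on one line. Splitting on the first "=" only keeps base64 padding characters in values such as sprop-vps intact.

```python
def parse_fmtp(line):
    """Parse an "a=fmtp" line into (payload type, {parameter: value})."""
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an a=fmtp line")
    payload_type, _, params = line[len(prefix):].partition(" ")
    result = {}
    for item in params.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        # Split on the first "=" only; values may contain "=" padding.
        name, _, value = item.partition("=")
        result[name.strip()] = value.strip()
    return int(payload_type), result


pt, params = parse_fmtp("a=fmtp:98 profile-id=1; sprop-max-don-diff=0")
```

A receiver would then drop any key in the resulting dictionary that this memo does not specify, in line with the rule at the start of Section 7.2.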

2049	7.2.2.  Usage with SDP Offer/Answer Model

2051	   When [VVC] is offered over RTP using SDP in an offer/answer model
2052	   [RFC3264] for negotiation for unicast usage, the following
2053	   limitations and rules apply:

2055	      editor-note 21: the following needs to be updated

2057	   *  Parameters to identify a media format configuration as VVC:

2059	   *  Parameters as bitstream properties:

2061	   *  SDP answer for media configurations.

2063	   *  capability parameters:

2065	   *  others:

2067	8.  Use with Feedback Messages

2069	   The following subsections define the use of the Picture Loss
2070	   Indication (PLI), Slice Loss Indication (SLI), Reference Picture
2071	   Selection Indication (RPSI), and Full Intra Request (FIR) feedback
2072	   messages with [VVC].  The PLI, SLI, and RPSI messages are defined in
2073	   [RFC4585], and the FIR message is defined in [RFC5104].

2075	8.1.  Picture Loss Indication (PLI)

2077	   As specified in RFC 4585, Section 6.3.1, the reception of a PLI by a
2078	   media sender indicates "the loss of an undefined amount of coded
2079	   video data belonging to one or more pictures".  Without having any
2080	   specific knowledge of the setup of the bitstream (such as use and
2081	   location of in-band parameter sets, non-IRAP decoder refresh points,
2082	   picture structures, and so forth), a reaction to the reception of a
2083	   PLI by a [VVC] sender SHOULD be to send an IRAP picture and relevant
2084	   parameter sets, potentially with sufficient redundancy to ensure
2085	   correct reception.  However, sometimes information about the
2086	   bitstream structure is known.  For example, state could have been
2087	   established outside of the mechanisms defined in this document that
2088	   parameter sets are conveyed out of band only, and stay static for the
2089	   duration of the session.  In that case, it is obviously unnecessary
2090	   to send them in-band as a result of the reception of a PLI.  Other
2091	   examples could be devised based on a priori knowledge of different
2092	   aspects of the bitstream structure.  In all cases, the timing and
2093	   congestion control mechanisms of RFC 4585 MUST be observed.

2095	8.2.  Full Intra Request (FIR)

2097	   The purpose of the FIR message is to force an encoder to send an
2098	   independent decoder refresh point as soon as possible, while
2099	   observing applicable congestion-control-related constraints, such as
2100	   those set out in [RFC8082].

2102	   Upon reception of a FIR, a sender MUST send an IDR picture.
2103	   Parameter sets MUST also be sent, except when there is a priori
2104	   knowledge that the parameter sets have been correctly established.  A
2105	   typical example for that is an understanding between sender and
2106	   receiver, established by means outside this document, that parameter
2107	   sets are exclusively sent out-of-band.

2109	9.  Security Considerations

2111	   The scope of this Security Considerations section is limited to the
2112	   payload format itself and to one feature of [VVC] that may pose a
2113	   particularly serious security risk if implemented naively.  The
2114	   payload format, in isolation, does not form a complete system.
2115	   Implementers are advised to read and understand relevant security-
2116	   related documents, especially those pertaining to RTP (see the
2117	   Security Considerations section in [RFC3550]), and the security of
2118	   the call-control stack chosen (that may make use of the media type
2119	   registration of this memo).  Implementers should also consider known
2120	   security vulnerabilities of video coding and decoding implementations
2121	   in general and avoid those.

2123	   Within this RTP payload format, and with the exception of the user
2124	   data SEI message as described below, no security threats other than
2125	   those common to RTP payload formats are known.  In other words,
2126	   neither the various media-plane-based mechanisms, nor the signaling
2127	   part of this memo, seems to pose a security risk beyond those common
2128	   to all RTP-based systems.

2130	   RTP packets using the payload format defined in this specification
2131	   are subject to the security considerations discussed in the RTP
2132	   specification [RFC3550], and in any applicable RTP profile such as
2133	   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/
2134	   SAVPF [RFC5124].  However, as "Securing the RTP Framework: Why RTP
2135	   Does Not Mandate a Single Media Security Solution" [RFC7202]
2136	   discusses, it is not an RTP payload format's responsibility to
2137	   discuss or mandate what solutions are used to meet the basic security
2138	   goals like confidentiality, integrity and source authenticity for RTP
2139	   in general.  This responsibility lies with anyone using RTP in an
2140	   application.  They can find guidance on available security mechanisms
2141	   and important considerations in "Options for Securing RTP Sessions"
2142	   [RFC7201].  The rest of this section discusses the security-impacting
2143	   properties of the payload format itself.

2145	   Because the data compression used with this payload format is applied
2146	   end-to-end, any encryption needs to be performed after compression.
2147	   A potential denial-of-service threat exists for data encodings using
2148	   compression techniques that have non-uniform receiver-end
2149	   computational load.  The attacker can inject pathological datagrams
2150	   into the bitstream that are complex to decode and that cause the
2151	   receiver to be overloaded.  [VVC] is particularly vulnerable to such
2152	   attacks, as it is extremely simple to generate datagrams containing
2153	   NAL units that affect the decoding process of many future NAL units.
2154	   Therefore, the usage of data origin authentication and data integrity
2155	   protection of at least the RTP packet is RECOMMENDED, for example,
2156	   with SRTP [RFC3711].

2158	   Like HEVC [RFC7798], [VVC] includes a user data Supplemental
2159	   Enhancement Information (SEI) message.  This SEI message allows
2160	   inclusion of an arbitrary bitstring into the video bitstream.  Such a
2161	   bitstring could include JavaScript, machine code, and other active
2162	   content.  [VVC] leaves the handling of this SEI message to the
2163	   receiving system.  In order to avoid harmful side effects of the user
2164	   data SEI message, decoder implementations cannot naively trust its
2165	   content.  For example, it would be a bad and insecure implementation
2166	   practice to forward any JavaScript a decoder implementation detects
2167	   to a web browser.  The safest way to deal with user data SEI messages
2168	   is to simply discard them, but that can have negative side effects on
2169	   the quality of experience by the user.

2171	   End-to-end security with authentication, integrity, or
2172	   confidentiality protection will prevent a MANE from performing media-
2173	   aware operations other than discarding complete packets.  In the case
2174	   of confidentiality protection, it will even be prevented from
2175	   discarding packets in a media-aware way.  To be allowed to perform
2176	   such operations, a MANE is required to be a trusted entity that is
2177	   included in the security context establishment.

2179	10.  Congestion Control

2181	   Congestion control for RTP SHALL be used in accordance with RTP
2182	   [RFC3550] and with any applicable RTP profile, e.g., AVP [RFC3551].
2183	   If best-effort service is being used, an additional requirement is
2184	   that users of this payload format MUST monitor packet loss to ensure
2185	   that the packet loss rate is within an acceptable range.  Packet loss
2186	   is considered acceptable if a TCP flow across the same network path,
2187	   and experiencing the same network conditions, would achieve an
2188	   average throughput, measured on a reasonable timescale, that is not
2189	   less than all RTP streams combined are achieving.  This condition can
2190	   be satisfied by implementing congestion-control mechanisms to adapt
2191	   the transmission rate, the number of layers subscribed for a layered
2192	   multicast session, or by arranging for a receiver to leave the
2193	   session if the loss rate is unacceptably high.

2195	   The bitrate adaptation necessary for obeying the congestion control
2196	   principle is easily achievable when real-time encoding is used, for
2197	   example, by adequately tuning the quantization parameter.  However,
2198	   when pre-encoded content is being transmitted, bandwidth adaptation
2199	   requires the pre-coded bitstream to be tailored for such adaptivity.
2200	   The key mechanisms available in [VVC] are temporal scalability, and
2201	   spatial/SNR scalability.  A media sender can remove NAL units
2202	   belonging to higher temporal sublayers (i.e., those NAL units with a
2203	   high value of TID) or higher spatio-SNR layers (as indicated by
2204	   interpreting the VPS) until the sending bitrate drops to an
2205	   acceptable range.

2207	   The mechanisms mentioned above generally work within a defined
2208	   profile and level and, therefore, no renegotiation of the channel is
2209	   required.  Only when non-downgradable parameters (such as profile)
2210	   are required to be changed does it become necessary to terminate and
2211	   restart the RTP stream(s).  This may be accomplished by using
2212	   different RTP payload types.

2214	   MANEs MAY remove certain unusable packets from the RTP stream when
2215	   that RTP stream was damaged due to previous packet losses.  This can
2216	   help reduce the network load in certain special cases.  For example,
2217	   MANEs can remove those FUs where the leading FUs belonging to the
2218	   same NAL unit have been lost or those dependent slice segments when
2219	   the leading slice segments belonging to the same slice have been
2220	   lost, because the trailing FUs or dependent slice segments are
2221	   meaningless to most decoders.  MANEs can also remove higher temporal
2222	   scalable layers if the outbound transmission (from the MANE's
2223	   viewpoint) experiences congestion.

2225	11.  IANA Considerations

2227	   Placeholder

2229	12.  Acknowledgements

2231	   Dr. Byeongdoo Choi is thanked for the video codec related technical
2232	   discussion and other aspects in this memo.  Xin Zhao and Dr. Xiang Li
2233	   are thanked for their contributions on [VVC] specification
2234	   descriptive content.  Spencer Dawkins is thanked for his valuable
2235	   review comments that led to great improvements of this memo.  Some
2236	   parts of this specification share text with the RTP payload format
2237	   for HEVC [RFC7798].  We thank the authors of that specification for
2238	   their excellent work.

2240	13.  References

2242	13.1.  Normative References

2244	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
2245	              Requirement Levels", BCP 14, RFC 2119,
2246	              DOI 10.17487/RFC2119, March 1997,
2247	              <https://www.rfc-editor.org/info/rfc2119>.

2249	   [RFC3264]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
2250	              with Session Description Protocol (SDP)", RFC 3264,
2251	              DOI 10.17487/RFC3264, June 2002,
2252	              <https://www.rfc-editor.org/info/rfc3264>.

2254	   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
2255	              Jacobson, "RTP: A Transport Protocol for Real-Time
2256	              Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
2257	              July 2003, <https://www.rfc-editor.org/info/rfc3550>.

2259	   [RFC3551]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
2260	              Video Conferences with Minimal Control", STD 65, RFC 3551,
2261	              DOI 10.17487/RFC3551, July 2003,
2262	              <https://www.rfc-editor.org/info/rfc3551>.

2264	   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
2265	              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
2266	              RFC 3711, DOI 10.17487/RFC3711, March 2004,
2267	              <https://www.rfc-editor.org/info/rfc3711>.

2269	   [RFC4556]  Zhu, L. and B. Tung, "Public Key Cryptography for Initial
2270	              Authentication in Kerberos (PKINIT)", RFC 4556,
2271	              DOI 10.17487/RFC4556, June 2006,
2272	              <https://www.rfc-editor.org/info/rfc4556>.

2274	   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
2275	              Description Protocol", RFC 4566, DOI 10.17487/RFC4566,
2276	              July 2006, <https://www.rfc-editor.org/info/rfc4566>.

2278	   [RFC4585]  Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey,
2279	              "Extended RTP Profile for Real-time Transport Control
2280	              Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585,
2281	              DOI 10.17487/RFC4585, July 2006,
2282	              <https://www.rfc-editor.org/info/rfc4585>.

2284	   [RFC4648]  Josefsson, S., "The Base16, Base32, and Base64 Data
2285	              Encodings", RFC 4648, DOI 10.17487/RFC4648, October 2006,
2286	              <https://www.rfc-editor.org/info/rfc4648>.

2288	   [RFC5104]  Wenger, S., Chandra, U., Westerlund, M., and B. Burman,
2289	              "Codec Control Messages in the RTP Audio-Visual Profile
2290	              with Feedback (AVPF)", RFC 5104, DOI 10.17487/RFC5104,
2291	              February 2008, <https://www.rfc-editor.org/info/rfc5104>.

2293	   [RFC5124]  Ott, J. and E. Carrara, "Extended Secure RTP Profile for
2294	              Real-time Transport Control Protocol (RTCP)-Based Feedback
2295	              (RTP/SAVPF)", RFC 5124, DOI 10.17487/RFC5124, February
2296	              2008, <https://www.rfc-editor.org/info/rfc5124>.

2298	   [RFC5576]  Lennox, J., Ott, J., and T. Schierl, "Source-Specific
2299	              Media Attributes in the Session Description Protocol
2300	              (SDP)", RFC 5576, DOI 10.17487/RFC5576, June 2009,
2301	              <https://www.rfc-editor.org/info/rfc5576>.

2303	   [RFC7656]  Lennox, J., Gross, K., Nandakumar, S., Salgueiro, G., and
2304	              B. Burman, Ed., "A Taxonomy of Semantics and Mechanisms
2305	              for Real-Time Transport Protocol (RTP) Sources", RFC 7656,
2306	              DOI 10.17487/RFC7656, November 2015,
2307	              <https://www.rfc-editor.org/info/rfc7656>.

2309	   [RFC8082]  Wenger, S., Lennox, J., Burman, B., and M. Westerlund,
2310	              "Using Codec Control Messages in the RTP Audio-Visual
2311	              Profile with Feedback with Layered Codecs", RFC 8082,
2312	              DOI 10.17487/RFC8082, March 2017,
2313	              <https://www.rfc-editor.org/info/rfc8082>.

2315	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
2316	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
2317	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

2319	   [VSEI]     "ISO/IEC 23002-7 (ITU-T H.274) Versatile supplemental
2320	              enhancement information messages for coded video
2321	              bitstreams", 2020,
2322	              <https://www.iso.org/standard/79112.html>.

2324	   [VVC]      "ISO/IEC FDIS 23090-3 Information technology --- Coded
2325	              representation of immersive media --- Part 3 - Versatile
2326	              video coding", 2020,
2327	              <https://www.iso.org/standard/73022.html>.

2329	13.2.  Informative References

2331	   [CABAC]    Sole, J., et al., "Transform coefficient coding in HEVC",
2332	              IEEE Transactions on Circuits and Systems for Video
2333	              Technology, DOI 10.1109/TCSVT.2012.2223055, December
2334	              2012, <https://doi.org/10.1109/TCSVT.2012.2223055>.

2336	   [HEVC]     "High efficiency video coding, ITU-T Recommendation
2337	              H.265", April 2013.

2339	   [MPEG2S]   ISO/IEC, "Information technology - Generic coding of
2340	              moving pictures and associated audio information - Part
2341	              1: Systems", ISO International Standard 13818-1, 2013.

2343	   [RFC6184]  Wang, Y.-K., Even, R., Kristensen, T., and R. Jesup, "RTP
2344	              Payload Format for H.264 Video", RFC 6184,
2345	              DOI 10.17487/RFC6184, May 2011,
2346	              <https://www.rfc-editor.org/info/rfc6184>.

2348	   [RFC6190]  Wenger, S., Wang, Y.-K., Schierl, T., and A.
2349	              Eleftheriadis, "RTP Payload Format for Scalable Video
2350	              Coding", RFC 6190, DOI 10.17487/RFC6190, May 2011,
2351	              <https://www.rfc-editor.org/info/rfc6190>.

2353	   [RFC7201]  Westerlund, M. and C. Perkins, "Options for Securing RTP
2354	              Sessions", RFC 7201, DOI 10.17487/RFC7201, April 2014,
2355	              <https://www.rfc-editor.org/info/rfc7201>.

2357	   [RFC7202]  Perkins, C. and M. Westerlund, "Securing the RTP
2358	              Framework: Why RTP Does Not Mandate a Single Media
2359	              Security Solution", RFC 7202, DOI 10.17487/RFC7202, April
2360	              2014, <https://www.rfc-editor.org/info/rfc7202>.

2362	   [RFC7798]  Wang, Y.-K., Sanchez, Y., Schierl, T., Wenger, S., and M.
2363	              M. Hannuksela, "RTP Payload Format for High Efficiency
2364	              Video Coding (HEVC)", RFC 7798, DOI 10.17487/RFC7798,
2365	              March 2016, <https://www.rfc-editor.org/info/rfc7798>.

2367	Appendix A.  Change History

2369	   draft-zhao-payload-rtp-vvc-00 ........ initial version

2371	   draft-zhao-payload-rtp-vvc-01 ........ editorial clarifications and
2372	   corrections

2374	   draft-ietf-payload-rtp-vvc-00 ........ initial WG draft

2376	   draft-ietf-payload-rtp-vvc-01 ........ VVC specification update

2378	   draft-ietf-payload-rtp-vvc-02 ........ VVC specification update

2380	   draft-ietf-payload-rtp-vvc-03 ........ VVC coding tool introduction
2381	   update

2383	   draft-ietf-payload-rtp-vvc-04 ........ VVC coding tool introduction
2384	   update

2386	   draft-ietf-payload-rtp-vvc-05 ........ reference update and adding
2387	   placement for open issues

2389	   draft-ietf-payload-rtp-vvc-06 ........ address editor's note

2391	   draft-ietf-payload-rtp-vvc-07 ........ address editor's notes

2393	Authors' Addresses

2395	   Shuai Zhao
2396	   Tencent
2397	   2747 Park Blvd
2398	   Palo Alto,  94588
2399	   United States of America

2401	   Email: shuai.zhao@ieee.org
2402	   Stephan Wenger
2403	   Tencent
2404	   2747 Park Blvd
2405	   Palo Alto,  94588
2406	   United States of America

2408	   Email: stewe@stewe.org

2410	   Yago Sanchez
2411	   Fraunhofer HHI
2412	   Einsteinufer 37
2413	   10587 Berlin
2414	   Germany

2416	   Email: yago.sanchez@hhi.fraunhofer.de

2418	   Ye-Kui Wang
2419	   Bytedance Inc.
2420	   8910 University Center Lane
2421	   San Diego,  92122
2422	   United States of America

2424	   Email: yekui.wang@bytedance.com