idnits 2.17.1 

draft-ietf-avtcore-rtp-vvc-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([ISO23090-3]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document date (February 25, 2020) is 1523 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 1231

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO23090-3'

  ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866)

  ** Downref: Normative reference to an Informational RFC: RFC 7656

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC'


     Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	avtcore                                                          S. Zhao
3	Internet-Draft                                                 S. Wenger
4	Intended status: Standards Track                                 Tencent
5	Expires: August 28, 2020                               February 25, 2020

7	          RTP Payload Format for Versatile Video Coding (VVC)
8	                     draft-ietf-avtcore-rtp-vvc-00

10	Abstract

12	   This memo describes an RTP payload format for the video coding
13	   standard ITU-T Recommendation [H.266] and ISO/IEC International
14	   Standard [ISO23090-3], both also known as Versatile Video Coding
15	   (VVC) and developed by the Joint Video Experts Team (JVET).  The RTP
16	   payload format allows for packetization of one or more Network
17	   Abstraction Layer (NAL) units in each RTP packet payload as well as
18	   fragmentation of a NAL unit into multiple RTP packets.  The payload
19	   format has wide applicability in videoconferencing, Internet video
20	   streaming, and high-bitrate entertainment-quality video, among other
21	   applications.

23	Status of This Memo

25	   This Internet-Draft is submitted in full conformance with the
26	   provisions of BCP 78 and BCP 79.

28	   Internet-Drafts are working documents of the Internet Engineering
29	   Task Force (IETF).  Note that other groups may also distribute
30	   working documents as Internet-Drafts.  The list of current Internet-
31	   Drafts is at https://datatracker.ietf.org/drafts/current/.

33	   Internet-Drafts are draft documents valid for a maximum of six months
34	   and may be updated, replaced, or obsoleted by other documents at any
35	   time.  It is inappropriate to use Internet-Drafts as reference
36	   material or to cite them other than as "work in progress."

38	   This Internet-Draft will expire on August 28, 2020.

40	Copyright Notice

42	   Copyright (c) 2020 IETF Trust and the persons identified as the
43	   document authors.  All rights reserved.

45	   This document is subject to BCP 78 and the IETF Trust's Legal
46	   Provisions Relating to IETF Documents
47	   (https://trustee.ietf.org/license-info) in effect on the date of
48	   publication of this document.  Please review these documents
49	   carefully, as they describe your rights and restrictions with respect
50	   to this document.  Code Components extracted from this document must
51	   include Simplified BSD License text as described in Section 4.e of
52	   the Trust Legal Provisions and are provided without warranty as
53	   described in the Simplified BSD License.

55	Table of Contents

57	   1.  Introduction
58	     1.1.  Overview of the VVC Codec
59	       1.1.1.  Coding-Tool Features (informative)
60	       1.1.2.  Systems and Transport Interfaces
61	       1.1.3.  Parallel Processing Support (informative)
62	       1.1.4.  NAL Unit Header
63	     1.2.  Overview of the Payload Format
64	   2.  Conventions
65	   3.  Definitions and Abbreviations
66	     3.1.  Definitions
67	       3.1.1.  Definitions from the VVC Specification
68	       3.1.2.  Definitions Specific to This Memo
69	     3.2.  Abbreviations
70	   4.  RTP Payload Format
71	     4.1.  RTP Header Usage
72	     4.2.  Payload Header Usage
73	     4.3.  Payload Structures
74	       4.3.1.  Single NAL Unit Packets
75	       4.3.2.  Aggregation Packets (APs)
76	       4.3.3.  Fragmentation Units
77	     4.4.  Decoding Order Number
78	   5.  Packetization Rules
79	   6.  De-packetization Process
80	   7.  Payload Format Parameters
81	   8.  Use with Feedback Messages
82	     8.1.  Picture Loss Indication (PLI)
83	     8.2.  Slice Loss Indication (SLI)
84	     8.3.  Reference Picture Selection Indication (RPSI)
85	     8.4.  Full Intra Request (FIR)
86	   9.  Frame marking
87	   10. Security Considerations
88	   11. Congestion Control
89	   12. IANA Considerations
90	   13. Acknowledgements
91	   14. References
92	     14.1.  Normative References
93	     14.2.  Informative References
94	   Appendix A.  Change History
95	   Authors' Addresses

97	1.  Introduction

99	   The Versatile Video Coding [VVC] specification, formally published as
100	   both ITU-T Recommendation H.266 and ISO/IEC International Standard
101	   23090-3 [ISO23090-3], is currently in the ISO/IEC approval process
102	   and is planned for ratification in mid-2020.  H.266 is reported to
103	   provide significant coding efficiency gains over H.265 and earlier
104	   video codec formats.

106	   This memo describes an RTP payload format for VVC.  It shares its
107	   basic design with the NAL (Network Abstraction Layer) unit-based RTP
108	   payload formats of H.264 Video Coding [RFC6184], Scalable Video
109	   Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798]
110	   and their respective predecessors.  With respect to design
111	   philosophy, security, congestion control, and overall implementation
112	   complexity, it has similar properties to those earlier payload format
113	   specifications.  This is a conscious choice, as at least RFC 6184 is
114	   widely deployed and generally known in the relevant implementer
115	   communities.  Certain mechanisms known from [RFC6190] were
116	   incorporated in VVC, as VVC version 1 supports temporal, spatial, and
117	   signal-to-noise ratio (SNR) scalability.

119	1.1.  Overview of the VVC Codec

121	   [VVC] and [HEVC] share a similar hybrid video codec design.  In this
122	   memo, we provide a very brief overview of those features of VVC that
123	   are, in some form, addressed by the payload format specified herein.
124	   Implementers have to read, understand, and apply the ITU-T/ISO/IEC
125	   specifications pertaining to [VVC] to arrive at interoperable, well-
126	   performing implementations.

128	   Conceptually, both [VVC] and [HEVC] include a Video Coding Layer
129	   (VCL), which is often used to refer to the coding-tool features, and
130	   a NAL, which is often used to refer to the systems and transport
131	   interface aspects of the codecs.

133	1.1.1.  Coding-Tool Features (informative)

135	   Coding tool features are described below with occasional reference to
136	   the coding tool set of [HEVC], which is well known in the community.

138	   Similar to earlier hybrid-video-coding-based standards, including
139	   HEVC, the following basic video coding design is employed by VVC.  A
140	   prediction signal is first formed by either intra- or motion-
141	   compensated prediction, and the residual (the difference between the
142	   original and the prediction) is then coded.  The gains in coding
143	   efficiency are achieved by redesigning and improving almost all parts
144	   of the codec over earlier designs.  In addition, [VVC] includes
145	   several tools to make the implementation on parallel architectures
146	   easier.

148	   Finally, [VVC] includes temporal, spatial, and SNR scalability as
149	   well as multiview coding support.

151	   Coding blocks and transform structure

153	   Among major coding-tool differences between HEVC and VVC, one of the
154	   important improvements is the more flexible coding tree structure in
155	   VVC, i.e., the multi-type tree.  In addition to quadtree, binary and
156	   ternary trees are also supported, which contributes to a significant
157	   improvement in coding efficiency.  Moreover, the maximum size of the
158	   Coding Tree Unit (CTU) is increased from 64x64 to 128x128.  To
159	   improve the coding efficiency of the chroma signal, separate luma
160	   and chroma trees at the CTU level may be employed for intra-slices.
161	   The square transforms in HEVC are extended to non-square transforms
162	   for rectangular blocks resulting from binary and ternary tree splits.
163	   Besides, [VVC] supports multiple transform sets (MTS), including DCT-
164	   2, DST-7, and DCT-8 as well as the non-separable secondary transform.
165	   The transforms used in [VVC] can have different sizes with support
166	   for larger transform sizes.  For DCT-2, the transform sizes range
167	   from 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range
168	   from 4x4 to 32x32.  In addition, [VVC] also supports sub-block
169	   transform for both intra and inter coded blocks.  For intra coded
170	   blocks, intra sub-partitioning (ISP) may be used to allow sub-block
171	   based intra prediction and transform.  For inter blocks, sub-block
172	   transform may be used assuming that only a part of an inter-block has
173	   non-zero transform coefficients.

175	   Entropy coding

177	   Similar to HEVC, [VVC] uses a single entropy-coding engine, which is
178	   based on Context Adaptive Binary Arithmetic Coding (CABAC) [CABAC],
179	   but with the support of multi-window sizes.  The window sizes can be
180	   initialized differently for different context models.  Due to such a
181	   design, it has more efficient adaptation speed and better coding
182	   efficiency.  A joint chroma residual coding scheme is applied to
183	   further exploit the correlation between the residuals of two color
184	   components.  In VVC, different residual coding schemes are applied
185	   for regular transform coefficients and residual samples generated
186	   using transform-skip mode.

188	   In-loop filtering

190	   [VVC] has more feature support in loop filters than HEVC.  The
191	   deblocking filter in [VVC] is similar to HEVC but operates at a
192	   smaller grid.  After deblocking and sample adaptive offset (SAO), an
193	   adaptive loop filter (ALF) may be used.  As a Wiener filter, ALF
194	   reduces distortion of decoded pictures.  Besides, [VVC] introduces a
195	   new module before deblocking called luma mapping with chroma scaling
196	   to fully utilize the dynamic range of signal so that rate-distortion
197	   performance of both SDR and HDR content is improved.

199	   Motion prediction and coding

201	   Compared to HEVC, [VVC] introduces several improvements in this area.
202	   First, there is adaptive motion vector resolution (AMVR), which
203	   can save bit cost for motion vectors by adaptively signaling motion
204	   vector resolution.  Second, affine motion compensation is included
205	   to capture complicated motion like zooming and rotation.  Meanwhile,
206	   prediction refinement with optical flow for affine mode (PROF)
207	   is further deployed to mimic affine motion at the pixel level.
208	   Third, decoder-side motion vector refinement (DMVR) is a method
209	   to derive motion vectors at the decoder so fewer bits may be spent
210	   on motion vectors.  Bi-directional optical flow (BDOF) is a similar
211	   method to DMVR but at 4x4 sub-block level.  Another difference is
212	   that DMVR is based on block matching while BDOF derives MVs with
213	   equations.  Furthermore, merge with motion vector difference (MMVD)
214	   is a special mode, which further signals a limited set of motion
215	   vector differences on top of merge mode.  In addition to MMVD, there
216	   are three other types of special merge modes, i.e., sub-block
217	   merge, triangle, and combined intra-/inter- prediction (CIIP).  Sub-
218	   block merge list includes one candidate of sub-block temporal motion
219	   vector prediction (SbTMVP) and up to four candidates of affine motion
220	   vectors.  Triangle is based on triangular block motion compensation.
221	   CIIP combines intra- and inter- predictions with weighting.  Adaptive
222	   weighting may be employed with a block-level tool called bi-
223	   prediction with CU based weighting (BCW) which provides more
224	   flexibility than in HEVC.

226	   Intra prediction and intra-coding

228	   To capture the diversified local image texture directions with finer
229	   granularity, [VVC] supports 65 angular directions instead of 33
230	   directions in HEVC.  The intra mode coding is based on a 6 most
231	   probable mode scheme, and the 6 most probable modes are derived using
232	   the neighboring intra prediction directions.  In addition, to deal
233	   with the different distributions of intra prediction angles for
234	   different block aspect ratios, a wide-angle intra prediction (WAIP)
235	   scheme is applied in [VVC] by including intra prediction angles
236	   beyond those present in HEVC.  Unlike HEVC which only allows using
237	   the most adjacent line of reference samples for intra prediction,
238	   [VVC] also allows using two further reference lines, also known as
239	   multi-reference-line (MRL) intra prediction.  The additional
240	   reference lines can only be used for 6 most probable intra prediction
241	   modes.  To capture the strong correlation between different colour
242	   components, in VVC, a cross-component linear model (CCLM) is utilized
243	   which assumes a linear relationship between the luma sample
245	   values and their associated chroma samples.  For intra prediction,
246	   [VVC] also applies a position-dependent prediction combination (PDPC)
247	   for refining the prediction samples closer to the intra prediction
248	   block boundary.  Matrix-based intra prediction (MIP) modes are also
249	   used in [VVC], which generate an up to 8x8 intra prediction block
250	   using a weighted sum of downsampled neighboring reference samples,
251	   and the weightings are hardcoded constants.

253	   Other coding-tool feature

255	   [VVC] introduces dependent quantization (DQ) to reduce quantization
256	   error by state-based switching between two quantizers.

258	1.1.2.  Systems and Transport Interfaces

260	   [VVC] inherits the basic systems and transport interfaces designs
261	   from HEVC and H.264.  These include the NAL-unit-based syntax
262	   structure, the hierarchical syntax and data unit structure, the
263	   Supplemental Enhancement Information (SEI) message mechanism, and the
264	   video buffering model based on the Hypothetical Reference Decoder
265	   (HRD).  The scalability features of [VVC] are conceptually similar to
266	   the scalable variant of HEVC known as SHVC.  The hierarchical syntax
267	   and data unit structure consists of parameter sets at various levels
268	   (decoder, sequence (pertaining to all layers), sequence (pertaining
269	   to a single layer), picture), slice-level header parameters, and
270	   lower-level parameters.

272	   A number of key components that influenced the Network Abstraction
273	   Layer design of [VVC] as well as this memo are described below.

275	   Decoding Capability Information

277	   The Decoding capability information includes parameters that stay
278	   constant for the lifetime of a Video Bitstream, which in IETF terms
279	   can translate to the lifetime of a session.  Decoding capability
280	   information can include profile, level, and sub-profile information
281	   to determine a maximum complexity interop point that is guaranteed to
282	   never be exceeded, even if splicing of video sequences occurs within
283	   a session.  It further optionally includes constraint flags, which
284	   indicate that the video bitstream will be constrained in the use of
285	   certain features as indicated by the values of those flags.  With
286	   this, a bitstream can be labelled as not using certain tools, which
287	   allows among other things for resource allocation in a decoder
288	   implementation.

290	   Video parameter set

292	   The Video Parameter Set (VPS) pertains to a Coded Video Sequence
293	   (CVS) of multiple layers covering the same range of picture units,
294	   and includes, among other information, decoding dependency expressed
295	   as information for reference picture set construction of enhancement
296	   layers.  The VPS provides a "big picture" of a scalable sequence,
297	   including what types of operation points are provided, the profile,
298	   tier, and level of the operation points, and some other high-level
299	   properties of the bitstream that can be used as the basis for session
300	   negotiation and content selection, etc.  One VPS may be referenced by
301	   one or more Sequence parameter sets.

303	   Sequence parameter set

305	   The Sequence Parameter Set (SPS) contains syntax elements pertaining
306	   to a coded layer video sequence (CLVS), which is a group of pictures
307	   belonging to the same layer, starting with a random access point, and
308	   followed by pictures that may depend on each other and the random
309	   access point picture.  In MPEG-2, the equivalent of a CVS was a
310	   Group of Pictures (GOP), which normally started with an I frame and
311	   was followed by P and B frames.  While more complex in its options of
312	   random access points, VVC retains this basic concept.  In many TV-
313	   like applications, a CVS contains a few hundred milliseconds to a few
314	   seconds of video.  In video conferencing (without switching MCUs
315	   involved), a CVS can be as long in duration as the whole session.

317	   Picture and Adaptation parameter set

319	   The Picture Parameter Set and the Adaptation Parameter Set (PPS and
320	   APS, respectively) carry information pertaining to zero or more
321	   pictures and zero or more slices, respectively.  The PPS contains
322	   information that is likely to stay constant from picture to picture,
323	   at least for pictures of a certain type, whereas the APS contains
324	   information, such as adaptive loop filter coefficients, that is
325	   likely to change from picture to picture.

327	   Profile, tier, and level

329	   The profile, tier, and level syntax structures in DCI, VPS and SPS
330	   contain profile, tier, level information for all layers that refer to
331	   the DCI, for layers associated with one or more output layer sets
332	   specified by the VPS, and for the lowest layer among the layers that
333	   refer to the SPS, respectively.

335	   Sub-Profiles
336	   Within the [VVC] specification, a sub-profile is a 32-bit number
337	   coded according to ITU-T Rec. T.35 that does not carry any semantics.
338	   It is carried in the profile_tier_level structure and hence
339	   (potentially) present in the DCI, VPS, and SPS.  External
340	   registration bodies can register a T.35 codepoint with ITU-T
341	   registration authorities and associate with their registration a
342	   description of bitstream complexity restrictions beyond the profiles
343	   defined by ITU-T and ISO/IEC.  This would allow encoder manufacturers
344	   to label the bitstreams generated by their encoder as complying with
345	   such a sub-profile.  It is expected that upstream standardization
346	   organizations (such as DVB and ATSC), as well as walled-garden video
347	   services will take advantage of this labelling system.  In contrast
348	   to "normal" profiles, it is expected that sub-profiles may indicate
349	   encoder choices traditionally left open in the (decoder-centric)
350	   video coding specs, such as GOP structures, minimum/maximum QP
351	   values, and the mandatory use of certain tools or SEI messages.

353	   Constraint Flags

355	   The profile_tier_level structure optionally carries a considerable
356	   number of constraint flags, which an encoder can use to indicate to a
357	   decoder that it will not use a certain tool or technology.  They were
358	   included in reaction to a perceived market need for labelling a
359	   bitstream as not exercising a certain tool that has become
360	   commercially unviable.

362	   Temporal scalability support

364	      Editor's note: this section will need to be updated along with
365	      future VVC drafts.

367	   [VVC] includes support of temporal scalability, by inclusion of the
368	   signaling of TemporalId in the NAL unit header, the restriction that
369	   pictures of a particular temporal sub-layer cannot be used for inter
370	   prediction reference by pictures of a lower temporal sub-layer, the
371	   sub-bitstream extraction process, and the requirement that each sub-
372	   bitstream extraction output be a conforming bitstream.  Media-Aware
373	   Network Elements (MANEs) can utilize the TemporalId in the NAL unit
374	   header for stream adaptation purposes based on temporal scalability.
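
   As an illustration of the stream adaptation described above, the
   following minimal Python sketch shows how a MANE might thin a stream
   by TemporalId.  It assumes the two-byte NAL unit header layout of
   Figure 1 (TID in the three least significant bits of the second
   byte).  The function names and the max_temporal_id parameter are
   hypothetical, and this is not the normative sub-bitstream extraction
   process of [VVC].

```python
from typing import Iterable, List


def temporal_id(nal_unit: bytes) -> int:
    """Return TemporalId, i.e. the 3-bit TID field minus 1."""
    if len(nal_unit) < 2:
        raise ValueError("a NAL unit starts with a two-byte header")
    # TID occupies the 3 least significant bits of the second header byte.
    return (nal_unit[1] & 0x07) - 1


def drop_higher_sublayers(nal_units: Iterable[bytes],
                          max_temporal_id: int) -> List[bytes]:
    """Keep only NAL units whose TemporalId is <= max_temporal_id.

    max_temporal_id is a hypothetical adaptation target chosen by the MANE.
    Decoding order of the remaining NAL units is preserved.
    """
    return [n for n in nal_units if temporal_id(n) <= max_temporal_id]
```

   Because pictures of a lower temporal sub-layer do not reference
   pictures of a higher one, discarding the higher sub-layers in this
   way is expected to leave the remaining pictures decodable.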

376	   Spatial, SNR, View Scalability

378	   [VVC] includes support for spatial, SNR, and View scalability.
379	   Scalable video coding is widely considered to have technical benefits
380	   and enrich services for various video applications.  Until recently,
381	   however, the functionality has not been included in the main profiles
382	   of video codecs and not widely deployed due to additional costs.  In
383	   VVC, however, all those forms of scalability are supported natively
384	   through the signaling of the layer_id in the NAL unit header, the VPS
385	   which associates layers with given layer_ids to each other, reference
386	   picture selection, reference picture resampling for spatial
387	   scalability, and a number of other mechanisms not relevant for this
388	   memo.  Scalability support can be implemented in a single decoding
389	   "loop" and is widely considered a comparatively lightweight
390	   operation.

392	      Spatial Scalability

394	         With the existence of Reference Picture Resampling, in the
395	         "main" profile of VVC, the additional burden for scalability
396	         support is just a minor modification of the high-level syntax
397	         (HLS).  In technical aspects, the inter-layer prediction is
398	         employed in a scalable system to improve the coding efficiency
399	         of the enhancement layers.  In addition to the spatial and
400	         temporal motion-compensated predictions that are available in a
401	         single-layer codec, the inter-layer prediction in [VVC] uses
402	         the resampled video data of the reconstructed reference picture
403	         from a reference layer to predict the current enhancement
404	         layer.  Then, the resampling process for inter-layer prediction
405	         is performed at the block-level, by modifying the existing
406	         interpolation process for motion compensation.  It means that
407	         no additional resampling process is needed to support
408	         scalability.

410	      SNR Scalability

412	         SNR scalability is similar to Spatial Scalability except that
413	         the resampling factors are 1:1; in other words, there is no
414	         change in resolution, but there is inter-layer prediction.

416	   SEI Messages

418	   Supplemental Enhancement Information (SEI) messages are codepoints
419	   in the bitstream that do not influence the decoding process as
420	   specified in the [VVC] spec, but address issues of representation/
421	   rendering of the decoded bitstream, label the bitstream for certain
422	   applications, among other, similar tasks.  The overall concept of SEI
423	   messages and many of the messages themselves have been inherited from
424	   the H.264 and HEVC specs.  In the [VVC] environment, some of the SEI
425	   messages considered to be generally useful also in other video coding
426	   technologies have been moved out of the main specification into a
427	   companion document (TO DO: add reference once ITU designation is
428	   known).

430	1.1.3.  Parallel Processing Support (informative)

432	   Compared to HEVC, the [VVC] design to support parallelization offers
433	   numerous improvements.  Some of those improvements are still
434	   undergoing changes in JVET.  Information, to the extent relevant for
435	   this memo, will be added in future versions of this memo as the
436	   standardization in JVET progresses and the technology stabilizes.

438	      Editor's note: an update on sub-picture/slice/tile is needed
439	      following the new VVC draft.

441	1.1.4.  NAL Unit Header

443	   [VVC] maintains the NAL unit concept of HEVC with modifications.  VVC
444	   uses a two-byte NAL unit header, as shown in Figure 1.  The payload
445	   of a NAL unit refers to the NAL unit excluding the NAL unit header.

447	                     +---------------+---------------+
448	                     |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
449	                     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
450	                     |F|Z| LayerId   |  Type   | TID |
451	                     +---------------+---------------+

453	                   The Structure of the VVC NAL Unit Header.

455	                                 Figure 1

457	   The semantics of the fields in the NAL unit header are as specified
458	   in [VVC] and described briefly below for convenience.  In addition to
459	   the name and size of each field, the corresponding syntax element
460	   name in [VVC] is also provided.

462	   F: 1 bit

464	      forbidden_zero_bit.  Required to be zero in VVC.  Note that the
465	      inclusion of this bit in the NAL unit header was to enable
466	      transport of [VVC] video over MPEG-2 transport systems (avoidance
467	      of start code emulations) [MPEG2S].  In the context of this memo
468	      the value 1 may be used to indicate a syntax violation, e.g., for
469	      a NAL unit resulting from aggregating a number of fragmented units
470	      of a NAL unit but missing the last fragment, as described in
471	      Section TBD.

473	   Z: 1 bit
474	      nuh_reserved_zero_bit.  Required to be zero in VVC, and reserved
475	      for future extensions by ITU-T and ISO/IEC.
476	      This memo does not overload the "Z" bit for local extensions, as
477	      a) overloading the "F" bit is sufficient and b) to preserve the
478	      usefulness of this memo to possible future versions of [VVC].

480	   LayerId: 6 bits

482	      nuh_layer_id.  Identifies the layer a NAL unit belongs to, wherein
483	      a layer may be, e.g., a spatial scalable layer or a quality
484	      scalable layer.

486	   Type: 5 bits

488	      nal_unit_type.  This field specifies the NAL unit type as defined
489	      in Table 7-1 of VVC.  For a reference of all currently defined NAL
490	      unit types and their semantics, please refer to Section 7.4.2.2 in
491	      [VVC].

493	   TID: 3 bits

495	      nuh_temporal_id_plus1.  This field specifies the temporal
496	      identifier of the NAL unit plus 1.  The value of TemporalId is
497	      equal to TID minus 1.  A TID value of 0 is illegal to ensure that
498	      there is at least one bit in the NAL unit header equal to 1, so
499	      as to enable independent considerations of start code emulations
500	      in the NAL unit header and in the NAL unit payload data.
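
   The field layout above can be sketched in Python as follows.  This is
   a minimal, non-normative illustration that follows the bit positions
   drawn in Figure 1 (F and Z one bit each, LayerId six bits, Type five
   bits, TID three bits).  The class and function names are assumptions
   made for this example and do not come from [VVC].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NalUnitHeader:
    forbidden_zero_bit: int      # F, 1 bit
    nuh_reserved_zero_bit: int   # Z, 1 bit
    nuh_layer_id: int            # LayerId, 6 bits
    nal_unit_type: int           # Type, 5 bits as drawn in Figure 1
    nuh_temporal_id_plus1: int   # TID, 3 bits

    @property
    def temporal_id(self) -> int:
        # TemporalId is equal to TID minus 1.
        return self.nuh_temporal_id_plus1 - 1


def parse_nal_unit_header(data: bytes) -> NalUnitHeader:
    """Split the first two bytes of a NAL unit into the header fields."""
    if len(data) < 2:
        raise ValueError("a NAL unit header needs two bytes")
    b0, b1 = data[0], data[1]
    header = NalUnitHeader(
        forbidden_zero_bit=(b0 >> 7) & 0x01,
        nuh_reserved_zero_bit=(b0 >> 6) & 0x01,
        nuh_layer_id=b0 & 0x3F,
        nal_unit_type=(b1 >> 3) & 0x1F,
        nuh_temporal_id_plus1=b1 & 0x07,
    )
    if header.nuh_temporal_id_plus1 == 0:
        # A TID value of 0 is illegal.
        raise ValueError("TID value 0 is illegal")
    return header
```

   A receiver could, for example, call parse_nal_unit_header() on each
   NAL unit and inspect forbidden_zero_bit to detect a unit that has
   been marked as containing a syntax violation.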

502	1.2.  Overview of the Payload Format

504	   This payload format defines the following processes required for
505	   transport of [VVC] coded data over RTP [RFC3550]:

507	   o  Usage of RTP header with this payload format

509	   o  Packetization of [VVC] coded NAL units into RTP packets using
510	      three types of payload structures: single NAL unit packet,
511	      aggregation packet, and fragmentation unit

513	   o  Transmission of [VVC] NAL units of the same bitstream within a
514	      single RTP stream.

516	   o  Media type parameters to be used with the Session Description
517	      Protocol (SDP) [RFC4566]

519	   o  Frame-marking mapping [FrameMarking]
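
   To make the role of the three payload structures concrete, the
   following Python sketch chooses a structure for each NAL unit purely
   from its size.  It is an illustration under stated assumptions, not
   the packetization rules of this memo: max_payload is a hypothetical
   per-packet byte budget, all payload header overheads are ignored, and
   the function and label names are invented for this example.

```python
from typing import List, Tuple

# (payload structure, indices of the NAL units carried), in decoding order.
Plan = Tuple[str, List[int]]


def plan_payloads(nal_sizes: List[int], max_payload: int) -> List[Plan]:
    """Assign each NAL unit to a payload structure based on size only.

    Assumption: header overheads are ignored and max_payload is a
    hypothetical budget in bytes for one RTP packet payload.
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    plans: List[Plan] = []
    pending: List[int] = []   # small NAL units waiting to share a packet
    pending_size = 0

    def flush() -> None:
        nonlocal pending, pending_size
        if len(pending) == 1:
            plans.append(("single", pending))
        elif pending:
            plans.append(("aggregation", pending))
        pending, pending_size = [], 0

    for index, size in enumerate(nal_sizes):
        if size > max_payload:
            # Too large for one packet: split across several packets.
            flush()
            plans.append(("fragmentation", [index]))
            continue
        if pending_size + size > max_payload:
            flush()
        pending.append(index)
        pending_size += size
    flush()
    return plans
```

   With sizes [100, 200, 5000, 1400, 50] and a budget of 1400 bytes,
   this sketch aggregates the first two NAL units, fragments the third,
   and sends the last two as single NAL unit packets.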

521	2.  Conventions

523	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
524	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
525	   "OPTIONAL" in this document are to be interpreted as described in BCP
526	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
527	   capitals, as shown above.

529	3.  Definitions and Abbreviations

531	3.1.  Definitions

533	   This document uses the terms and definitions of VVC.  Section 3.1.1
534	   lists relevant definitions from [VVC] for convenience.  Section 3.1.2
535	   provides definitions specific to this memo.

537	3.1.1.  Definitions from the VVC Specification

539	      Editor notes:

541	   Access unit (AU): A set of PUs that belong to different layers and
542	   contain coded pictures associated with the same time for output from
543	   the DPB.

545	   Adaptation parameter set (APS): A syntax structure containing syntax
546	   elements that apply to zero or more slices as determined by zero or
547	   more syntax elements found in slice headers.

549	   Bitstream: A sequence of bits, in the form of a NAL unit stream or a
550	   byte stream, that forms the representation of a sequence of AUs
551	   forming one or more coded video sequences (CVSs).

553	   Coded picture: A coded representation of a picture comprising VCL NAL
554	   units with a particular value of nuh_layer_id within an AU and
555	   containing all CTUs of the picture.

557	   Clean random access (CRA) PU: A PU in which the coded picture is a
558	   CRA picture.

560	   Clean random access (CRA) picture: An IRAP picture for which each VCL
561	   NAL unit has nal_unit_type equal to CRA_NUT.

563	   Coded video sequence (CVS): A sequence of AUs that consists, in
564	   decoding order, of a CVSS AU, followed by zero or more AUs that are
565	   not CVSS AUs, including all subsequent AUs up to but not including
566	   any subsequent AU that is a CVSS AU.

568	   Coded video sequence start (CVSS) AU: An AU in which there is a PU
569	   for each layer in the CVS and the coded picture in each PU is a CLVSS
570	   picture.

572	   Coded layer video sequence (CLVS): A sequence of PUs with the same
573	   value of nuh_layer_id that consists, in decoding order, of a CLVSS
574	   PU, followed by zero or more PUs that are not CLVSS PUs, including
575	   all subsequent PUs up to but not including any subsequent PU that is
576	   a CLVSS PU.

578	   Coded layer video sequence start (CLVSS) PU: A PU in which the coded
579	   picture is a CLVSS picture.

581	   Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs
582	   of chroma samples of a picture that has three sample arrays, or a CTB
583	   of samples of a monochrome picture or a picture that is coded using
584	   three separate colour planes and syntax structures used to code the
585	   samples.

587	   Decoding Capability Information (DCI): A syntax structure containing
588	   syntax elements that apply to the entire bitstream.

590	   Decoded picture buffer (DPB): A buffer holding decoded pictures for
591	   reference, output reordering, or output delay specified for the
592	   hypothetical reference decoder.

594	   Instantaneous decoding refresh (IDR) PU: A PU in which the coded
595	   picture is an IDR picture.

597	   Instantaneous decoding refresh (IDR) picture: An IRAP picture for
598	   which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or
599	   IDR_N_LP.

601	   Intra random access point (IRAP) AU: An AU in which there is a PU for
602	   each layer in the CVS and the coded picture in each PU is an IRAP
603	   picture.

605	   Intra random access point (IRAP) PU: A PU in which the coded picture
606	   is an IRAP picture.

608	   Layer: A set of VCL NAL units that all have a particular value of
609	   nuh_layer_id and the associated non-VCL NAL units.

611	   Network abstraction layer (NAL) unit: A syntax structure containing
612	   an indication of the type of data to follow and bytes containing that
613	   data in the form of an RBSP interspersed as necessary with emulation
614	   prevention bytes.

616	   Network abstraction layer (NAL) unit stream: A sequence of NAL units.

618	   Operation point (OP): A temporal subset of an OLS, identified by an
619	   OLS index and a highest value of TemporalId.

621	   Picture parameter set (PPS): A syntax structure containing syntax
622	   elements that apply to zero or more entire coded pictures as
623	   determined by a syntax element found in each slice header.

625	   Picture unit (PU): A set of NAL units that are associated with each
626	   other according to a specified classification rule, are consecutive
627	   in decoding order, and contain exactly one coded picture.

629	   Random access: The act of starting the decoding process for a
630	   bitstream at a point other than the beginning of the stream.

632	   Sequence parameter set (SPS): A syntax structure containing syntax
633	   elements that apply to zero or more entire CLVSs as determined by the
634	   content of a syntax element found in the PPS referred to by a syntax
635	   element found in each picture header.

637	   Slice: An integer number of complete tiles or an integer number of
638	   consecutive complete CTU rows within a tile of a picture that are
639	   exclusively contained in a single NAL unit.

641	   Sub-layer: A temporal scalable layer of a temporal scalable bitstream
642	   consisting of VCL NAL units with a particular value of the TemporalId
643	   variable, and the associated non-VCL NAL units.

645	   Subpicture: A rectangular region of one or more slices within a
646	   picture.

648	   Sub-layer representation: A subset of the bitstream consisting of NAL
649	   units of a particular sub-layer and the lower sub-layers.

651	   Tile: A rectangular region of CTUs within a particular tile column
652	   and a particular tile row in a picture.

654	   Tile column: A rectangular region of CTUs having a height equal to
655	   the height of the picture and a width specified by syntax elements in
656	   the picture parameter set.

658	   Tile row: A rectangular region of CTUs having a height specified by
659	   syntax elements in the picture parameter set and a width equal to the
660	   width of the picture.

662	   Video coding layer (VCL) NAL unit: A collective term for coded slice
663	   NAL units and the subset of NAL units that have reserved values of
664	   nal_unit_type that are classified as VCL NAL units in this
665	   Specification.

667	3.1.2.  Definitions Specific to This Memo

669	   Media-Aware Network Element (MANE): A network element, such as a
670	   middlebox, selective forwarding unit, or application-layer gateway
671	   that is capable of parsing certain aspects of the RTP payload headers
672	   or the RTP payload and reacting to their contents.

674	      Editor Notes: the following informative note needs to be updated
675	      along with the frame marking update

677	      Informative note: The concept of a MANE goes beyond normal routers
678	      or gateways in that a MANE has to be aware of the signaling (e.g.,
679	      to learn about the payload type mappings of the media streams),
680	      and in that it has to be trusted when working with Secure RTP
681	      (SRTP).  The advantage of using MANEs is that they allow packets
682	      to be dropped according to the needs of the media coding.  For
683	      example, if a MANE has to drop packets due to congestion on a
684	      certain link, it can identify and remove those packets whose
685	      elimination produces the least adverse effect on the user
686	      experience.  After dropping packets, MANEs must rewrite RTCP
687	      packets to match the changes to the RTP stream, as specified in
688	      Section 7 of [RFC3550].

690	   NAL unit decoding order: A NAL unit order that conforms to the
691	   constraints on NAL unit order given in Section 7.4.2.4 of [VVC],
692	   i.e., the order of NAL units in the bitstream.

694	   NAL unit output order: A NAL unit order in which NAL units of
695	   different access units are in the output order of the decoded
696	   pictures corresponding to the access units, as specified in [VVC],
697	   and in which NAL units within an access unit are in their decoding
698	   order.

700	   RTP stream: See [RFC7656].  Within the scope of this memo, one RTP
701	   stream is utilized to transport one or more temporal sub-layers.

703	   Transmission order: The order of packets in ascending RTP sequence
704	   number order (in modulo arithmetic).  Within an aggregation packet,
705	   the NAL unit transmission order is the same as the order of
706	   appearance of NAL units in the packet.
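
   Informative sketch: the "ascending RTP sequence number order (in
   modulo arithmetic)" used in the definition of transmission order can
   be illustrated as follows.  The function name seq_less and the
   half-window convention (a difference of less than 2^15 counts as
   "later") are assumptions made for illustration; they are not defined
   by this memo.

```python
def seq_less(a: int, b: int) -> bool:
    """Return True if 16-bit sequence number a precedes b modulo 2^16.

    Assumption: b is considered later than a when the forward distance
    from a to b is non-zero and smaller than half the number space.
    """
    forward_distance = (b - a) & 0xFFFF
    return forward_distance != 0 and forward_distance < 0x8000


# Sequence number 65535 is followed by 0 after wrap-around.
wraps_correctly = seq_less(65535, 0) and not seq_less(0, 65535)
```

   With this convention, a packet with sequence number 0 that follows a
   packet with sequence number 65535 is still treated as later in
   transmission order.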

708	3.2.  Abbreviations

710	   AU         Access Unit

712	   AP         Aggregation Packet

714	   CTU        Coding Tree Unit

716	   CVS        Coded Video Sequence

718	   DPB        Decoded Picture Buffer

720	   DCI        Decoding Capability Information

722	   DON        Decoding Order Number

724	   DONB       Decoding Order Number Base

726	   FIR        Full Intra Request

728	   FU         Fragmentation Unit

730	   HRD        Hypothetical Reference Decoder

732	   IDR        Instantaneous Decoding Refresh

734	   MANE       Media-Aware Network Element

736	   MTU        Maximum Transfer Unit

738	   NAL        Network Abstraction Layer

740	   NALU       Network Abstraction Layer Unit

742	   PLI        Picture Loss Indication

744	   PPS        Picture Parameter Set

746	   RPS        Reference Picture Set

748	   RPSI       Reference Picture Selection Indication

750	   SEI        Supplemental Enhancement Information

752	   SLI        Slice Loss Indication

754	   SPS        Sequence Parameter Set
755	   VCL        Video Coding Layer

757	   VPS        Video Parameter Set

759	4.  RTP Payload Format

761	4.1.  RTP Header Usage

763	   The format of the RTP header is specified in [RFC3550] (reprinted as
764	   Figure 2 for convenience).  This payload format uses the fields of
765	   the header in a manner consistent with that specification.

767	   The RTP payload (and the settings for some RTP header bits) for
768	   aggregation packets and fragmentation units are specified in
769	   Section 4.3.2 and Section 4.3.3, respectively.

771	       0                   1                   2                   3
772	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
773	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
774	      |V=2|P|X|  CC   |M|     PT      |       sequence number         |
775	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
776	      |                           timestamp                           |
777	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
778	      |           synchronization source (SSRC) identifier            |
779	      +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
780	      |            contributing source (CSRC) identifiers             |
781	      |                             ....                              |
782	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

784	                          RTP Header According to [RFC3550]

786	                                 Figure 2

788	   The RTP header fields are set as follows according to this RTP
789	   payload format:

791	   Marker bit (M): 1 bit

793	      Set for the last packet of the access unit, carried in the current
794	      RTP stream.  This is in line with the normal use of the M bit in
795	      video formats to allow an efficient playout buffer handling.

797	         Editor notes: The informative note below needs updating once
798	         the NAL unit type table is stable in the [VVC] spec.

800	         Informative note: The content of a NAL unit does not tell
801	         whether or not the NAL unit is the last NAL unit, in decoding
802	         order, of an access unit.  An RTP sender implementation may
803	         obtain this information from the video encoder.  If, however,
804	         the implementation cannot obtain this information directly from
805	         the encoder, e.g., when the bitstream was pre-encoded, and also
806	         there is no timestamp allocated for each NAL unit, then the
807	         sender implementation can inspect subsequent NAL units in
808	         decoding order to determine whether or not the NAL unit is the
809	         last NAL unit of an access unit as follows.  A NAL unit is
810	         determined to be the last NAL unit of an access unit if it is
811	         the last NAL unit of the bitstream.  A NAL unit naluX is also
812	         determined to be the last NAL unit of an access unit if both
813	         the following conditions are true: 1) the next VCL NAL unit
814	         naluY in decoding order has the high-order bit of the first
815	         byte after its NAL unit header equal to 1 or nal_unit_type
816	         equal to 19, and 2) all NAL units between naluX and naluY, when
817	         present, have nal_unit_type in the range of 13 to 17, inclusive,
818	         equal to 20, equal to 23 or equal to 26.

820	   Payload Type (PT): 7 bits

822	      The assignment of an RTP payload type for this new packet format
823	      is outside the scope of this document and will not be specified
824	      here.  The assignment of a payload type has to be performed either
825	      through the profile used or in a dynamic way.

827	   Sequence Number (SN): 16 bits

829	      Set and used in accordance with [RFC3550].

831	   Timestamp: 32 bits

833	      The RTP timestamp is set to the sampling timestamp of the content.
834	      A 90 kHz clock rate MUST be used.  If the NAL unit has no timing
835	      properties of its own (e.g., parameter set and SEI NAL units), the
836	      RTP timestamp MUST be set to the RTP timestamp of the coded
837	      picture of the access unit in which the NAL unit (according to
838	      Annex D of VVC) is included.  Receivers MUST use the RTP timestamp
839	      for the display process, even when the bitstream contains picture
840	      timing SEI messages or decoding unit information SEI messages as
841	      specified in VVC.

843	   Synchronization source (SSRC): 32 bits

845	      Used to identify the source of the RTP packets.  A single SSRC is
846	      used for all parts of a single bitstream.
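
   Informative sketch: the Timestamp field described above uses a 90 kHz
   clock and is 32 bits wide, so the value wraps modulo 2^32.  The
   function name rtp_timestamp and the parameter initial_offset are
   hypothetical; the offset stands for the initial timestamp value that
   [RFC3550] leaves to the sender.

```python
RTP_CLOCK_HZ = 90_000  # clock rate required by this payload format


def rtp_timestamp(sample_time_s: float, initial_offset: int = 0) -> int:
    """Map a sampling time in seconds to a 32-bit RTP timestamp.

    initial_offset is an illustrative parameter for the sender-chosen
    starting value of the timestamp.
    """
    ticks = round(sample_time_s * RTP_CLOCK_HZ)
    return (initial_offset + ticks) & 0xFFFFFFFF


# At 30 pictures per second, consecutive pictures are 3000 ticks apart.
ticks_per_picture = rtp_timestamp(1 / 30) - rtp_timestamp(0.0)
```

   All NAL units of one access unit, including parameter set and SEI NAL
   units without timing of their own, would carry the same value.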

848	4.2.  Payload Header Usage

850	   The first two bytes of the payload of an RTP packet are referred to
851	   as the payload header.  The payload header consists of the same
852	   fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown
853	   in Section 1.1.4, irrespective of the type of the payload structure.

855	   The TID value indicates (among other things) the relative importance
856	   of an RTP packet, for example, because NAL units belonging to higher
857	   temporal sub-layers are not used for the decoding of lower temporal
858	   sub-layers.  A lower value of TID indicates a higher importance.
859	   More-important NAL units MAY be better protected against transmission
860	   losses than less-important NAL units.
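
   Informative sketch: the two-byte payload header can be packed and
   parsed as shown below.  The bit widths (F: 1, Z: 1, LayerId: 6,
   Type: 5, TID: 3, most significant bit first) are an assumption based
   on the NAL unit header layout referred to in Section 1.1.4; the
   function names are hypothetical.

```python
def pack_payload_header(f: int, z: int, layer_id: int,
                        nal_type: int, tid: int) -> bytes:
    """Pack F, Z, LayerId, Type and TID into two bytes (assumed layout)."""
    if not (0 <= layer_id < 64 and 0 <= nal_type < 32 and 1 <= tid < 8):
        # A TID value of 0 is illegal, hence the lower bound of 1.
        raise ValueError("field out of range")
    value = ((f & 1) << 15) | ((z & 1) << 14) | (layer_id << 8) \
        | (nal_type << 3) | tid
    return value.to_bytes(2, "big")


def parse_payload_header(data: bytes) -> dict:
    """Split the first two payload bytes into their fields."""
    value = int.from_bytes(data[:2], "big")
    tid = value & 0x7
    return {
        "F": (value >> 15) & 0x1,
        "Z": (value >> 14) & 0x1,
        "LayerId": (value >> 8) & 0x3F,
        "Type": (value >> 3) & 0x1F,
        "TID": tid,
        "TemporalId": tid - 1,  # TemporalId is equal to TID minus 1
    }


header = pack_payload_header(f=0, z=0, layer_id=0, nal_type=28, tid=1)
fields = parse_payload_header(header)
```

   A MANE that drops packets by temporal sub-layer would only need the
   TID field from this header, without parsing the NAL unit payload.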

862	      For Discussion: quite possibly something similar can be said for
863	      the Layer_id in layered coding, but perhaps not in multiview
864	      coding.  (The relevant part of the spec is relatively new,
865	      therefore the soft language).  However, for serious layer pruning,
866	      interpretation of the VPS is required.  We can add language about
867	      the need for stateful interpretation of LayerID vis-a-vis
868	      stateless interpretation of TID later.

870	4.3.  Payload Structures

872	   Three different types of RTP packet payload structures are specified.
873	   A receiver can identify the type of an RTP packet payload through the
874	   Type field in the payload header.

876	   The three payload structures are as follows:

878	   o  Single NAL unit packet: Contains a single NAL unit in the payload,
879	      and the NAL unit header of the NAL unit also serves as the payload
880	      header.  This payload structure is specified in Section 4.3.1.

882	   o  Aggregation Packet (AP): Contains more than one NAL unit within
883	      one access unit.  This payload structure is specified in
884	      Section 4.3.2.

886	   o  Fragmentation Unit (FU): Contains a subset of a single NAL unit.
887	      This payload structure is specified in Section 4.3.3.

889	4.3.1.  Single NAL Unit Packets

891	      Editor notes: it would be better to add a section that describes
892	      DONL and sprop-max-don-diff

894	   A single NAL unit packet contains exactly one NAL unit, and consists
895	   of a payload header (denoted as PayloadHdr), a conditional 16-bit
896	   DONL field (in network byte order), and the NAL unit payload data
897	   (the NAL unit excluding its NAL unit header) of the contained NAL
898	   unit, as shown in Figure 3.

900	      0                   1                   2                   3
901	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
902	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
903	     |           PayloadHdr          |      DONL (conditional)       |
904	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
905	     |                                                               |
906	     |                  NAL unit payload data                        |
907	     |                                                               |
908	     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
909	     |                               :...OPTIONAL RTP padding        |
910	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

912	                  The Structure of a Single NAL Unit Packet

914	                                 Figure 3

916	   The DONL field, when present, specifies the value of the 16 least
917	   significant bits of the decoding order number of the contained NAL
918	   unit.  If sprop-max-don-diff is greater than 0 for any of the RTP
919	   streams, the DONL field MUST be present, and the variable DON for the
920	   contained NAL unit is derived as equal to the value of the DONL
921	   field.  Otherwise (sprop-max-don-diff is equal to 0 for all the RTP
922	   streams), the DONL field MUST NOT be present.
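
   Informative sketch: a single NAL unit packet payload can be built and
   parsed as shown below.  The sketch assumes a two-byte NAL unit header
   and that RTP padding has already been removed; passing don=None
   stands for the case where sprop-max-don-diff is equal to 0.  The
   function names are hypothetical.

```python
from typing import Optional, Tuple


def build_single_nal_unit_payload(nal_unit: bytes,
                                  don: Optional[int] = None) -> bytes:
    """PayloadHdr (the NAL unit header), optional DONL, NAL unit payload."""
    if len(nal_unit) < 2:
        raise ValueError("NAL unit shorter than its header")
    donl = b"" if don is None else (don & 0xFFFF).to_bytes(2, "big")
    return nal_unit[:2] + donl + nal_unit[2:]


def parse_single_nal_unit_payload(
        payload: bytes, donl_present: bool) -> Tuple[bytes, Optional[int]]:
    """Return the contained NAL unit and the DON value, if signalled."""
    if donl_present:
        don = int.from_bytes(payload[2:4], "big")
        return payload[:2] + payload[4:], don
    return payload, None


nal = bytes([0x00, 0x09, 0xAA, 0xBB])
with_donl = build_single_nal_unit_payload(nal, don=0x1234)
```

   Without the DONL field, the RTP payload is byte-for-byte identical to
   the NAL unit.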

924	4.3.2.  Aggregation Packets (APs)

926	   Aggregation Packets (APs) can reduce packetization overhead for
927	   small NAL units, such as most of the non-VCL NAL units, which are
928	   often only a few octets in size.

930	   An AP aggregates NAL units of one access unit.  Each NAL unit to be
931	   carried in an AP is encapsulated in an aggregation unit.  NAL units
932	   aggregated in one AP are included in NAL unit decoding order.

934	   An AP consists of a payload header (denoted as PayloadHdr) followed
935	   by two or more aggregation units, as shown in Figure 4.

937	     0                   1                   2                   3
938	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
939	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
940	    |    PayloadHdr (Type=28)       |                               |
941	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
942	    |                                                               |
943	    |             two or more aggregation units                     |
944	    |                                                               |
945	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
946	    |                               :...OPTIONAL RTP padding        |
947	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

949	                   The Structure of an Aggregation Packet

951	                                 Figure 4

953	   The fields in the payload header of an AP are set as follows.  The F
954	   bit MUST be equal to 0 if the F bit of each aggregated NAL unit is
955	   equal to zero; otherwise, it MUST be equal to 1.  The Type field MUST
956	   be equal to 28.

958	   The value of LayerId MUST be equal to the lowest value of LayerId of
959	   all the aggregated NAL units.  The value of TID MUST be the lowest
960	   value of TID of all the aggregated NAL units.

962	      Informative note: All VCL NAL units in an AP have the same TID
963	      value since they belong to the same access unit.  However, an AP
964	      may contain non-VCL NAL units for which the TID value in the NAL
965	      unit header may be different than the TID value of the VCL NAL
966	      units in the same AP.

968	   An AP MUST carry at least two aggregation units and can carry as many
969	   aggregation units as necessary; however, the total amount of data in
970	   an AP obviously MUST fit into an IP packet, and the size SHOULD be
971	   chosen so that the resulting IP packet is smaller than the MTU size
972	   so as to avoid IP layer fragmentation.  An AP MUST NOT contain FUs
973	   specified in Section 4.3.3.  APs MUST NOT be nested; i.e., an AP
974	   cannot contain another AP.

976	   The first aggregation unit in an AP consists of a conditional 16-bit
977	   DONL field (in network byte order) followed by a 16-bit unsigned size
978	   information (in network byte order) that indicates the size of the
979	   NAL unit in bytes (excluding these two octets, but including the NAL
980	   unit header), followed by the NAL unit itself, including its NAL unit
981	   header, as shown in Figure 5.

983	     0                   1                   2                   3
984	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
985	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
986	    |               :       DONL (conditional)      |   NALU size   |
987	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
988	    |   NALU size   |                                               |
989	    +-+-+-+-+-+-+-+-+         NAL unit                              |
990	    |                                                               |
991	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
992	    |                               :
993	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

995	           The Structure of the First Aggregation Unit in an AP

997	                                 Figure 5

999	   The DONL field, when present, specifies the value of the 16 least
1000	   significant bits of the decoding order number of the aggregated NAL
1001	   unit.

1003	   If sprop-max-don-diff is greater than 0 for any of the RTP streams,
1004	   the DONL field MUST be present in an aggregation unit that is the
1005	   first aggregation unit in an AP, and the variable DON for the
1006	   aggregated NAL unit is derived as equal to the value of the DONL
1007	   field.  Otherwise (sprop-max-don-diff is equal to 0 for all the RTP
1008	   streams), the DONL field MUST NOT be present in an aggregation unit
1009	   that is the first aggregation unit in an AP.

1011	   An aggregation unit that is not the first aggregation unit in an AP
1012	   consists of a 16-bit unsigned size information field
1013	   (in network byte order) that indicates the size of the NAL unit in
1014	   bytes (excluding these two octets, but including the NAL unit
1015	   header), followed by the NAL unit itself, including its NAL unit
1016	   header, as shown in Figure 6.

1018	     0                   1                   2                   3
1019	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1020	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1021	    |               :       NALU size               |   NAL unit    |
1022	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
1023	    |                                                               |
1024	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1025	    |                               :
1026	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1028	         The Structure of an Aggregation Unit That Is Not the First
1029	                          Aggregation Unit in an AP

1031	                                 Figure 6

1033	   Figure 7 presents an example of an AP that contains two aggregation
1034	   units, labeled as 1 and 2 in the figure, without the DONL field being
1035	   present.

1037	     0                   1                   2                   3
1038	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1039	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1040	    |                          RTP Header                           |
1041	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1042	    |   PayloadHdr (Type=28)        |         NALU 1 Size           |
1043	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1044	    |          NALU 1 HDR           |                               |
1045	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
1046	    |                   . . .                                       |
1047	    |                                                               |
1048	    +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1049	    |  . . .        | NALU 2 Size                   | NALU 2 HDR    |
1050	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1051	    | NALU 2 HDR    |                                               |
1052	    +-+-+-+-+-+-+-+-+              NALU 2 Data                      |
1053	    |                   . . .                                       |
1054	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1055	    |                               :...OPTIONAL RTP padding        |
1056	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1058	                  An Example of an AP Containing
1059	             Two Aggregation Units without the DONL Field

1061	                                 Figure 7

1063	   Figure 8 presents an example of an AP that contains two aggregation
1064	   units, labeled as 1 and 2 in the figure, with the DONL field being
1065	   present.

1067	     0                   1                   2                   3
1068	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1069	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1070	    |                          RTP Header                           |
1071	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1072	    |   PayloadHdr (Type=28)        |        NALU 1 DONL            |
1073	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1074	    |          NALU 1 Size          |            NALU 1 HDR         |
1075	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1076	    |                                                               |
1077	    |                 NALU 1 Data   . . .                           |
1078	    |                                                               |
1079	    +        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1080	    |                               :          NALU 2 Size          |
1081	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1082	    |          NALU 2 HDR           |                               |
1083	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
1084	    |                                                               |
1085	    |        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1086	    |                               :...OPTIONAL RTP padding        |
1087	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1089	                   An Example of an AP Containing
1090	                 Two Aggregation Units with the DONL Field

1092	                                 Figure 8
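
   Informative sketch: the AP layout of Figures 4 through 8 can be
   produced and consumed as shown below.  The sketch assumes the
   two-byte header layout F(1), Z(1), LayerId(6), Type(5), TID(3), sets
   the Z bit of the AP payload header to 0, and assumes RTP padding has
   been removed before parsing.  The function names and the first_don
   parameter are hypothetical; first_don=None stands for
   sprop-max-don-diff equal to 0.

```python
from typing import List, Optional, Tuple

AP_TYPE = 28


def build_ap(nal_units: List[bytes],
             first_don: Optional[int] = None) -> bytes:
    """Aggregate two or more NAL units of one access unit into an AP."""
    if len(nal_units) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    headers = [int.from_bytes(n[:2], "big") for n in nal_units]
    f_bit = 1 if any((h >> 15) & 1 for h in headers) else 0
    layer_id = min((h >> 8) & 0x3F for h in headers)  # lowest LayerId
    tid = min(h & 0x7 for h in headers)               # lowest TID
    payload_hdr = (f_bit << 15) | (layer_id << 8) | (AP_TYPE << 3) | tid
    out = bytearray(payload_hdr.to_bytes(2, "big"))
    if first_don is not None:
        # DONL appears only in the first aggregation unit.
        out += (first_don & 0xFFFF).to_bytes(2, "big")
    for nal in nal_units:
        if len(nal) > 0xFFFF:
            raise ValueError("NAL unit too large for 16-bit size field")
        out += len(nal).to_bytes(2, "big") + nal
    return bytes(out)


def parse_ap(payload: bytes,
             donl_present: bool) -> Tuple[Optional[int], List[bytes]]:
    """Return the DON of the first NAL unit (if any) and all NAL units."""
    offset, don = 2, None
    if donl_present:
        don = int.from_bytes(payload[2:4], "big")
        offset = 4
    nal_units = []
    while offset < len(payload):
        size = int.from_bytes(payload[offset:offset + 2], "big")
        offset += 2
        if size == 0 or offset + size > len(payload):
            raise ValueError("malformed aggregation unit")
        nal_units.append(payload[offset:offset + size])
        offset += size
    if len(nal_units) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    return don, nal_units


param_set = bytes([0x00, 0x79, 0x01])   # illustrative NAL unit, TID 1
slice_nal = bytes([0x00, 0x0A, 0xFF])   # illustrative NAL unit, TID 2
ap = build_ap([param_set, slice_nal])
```

   A sender would additionally check that the resulting packet stays
   below the path MTU before choosing aggregation.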

1094	4.3.3.  Fragmentation Units

1096	   Fragmentation Units (FUs) are introduced to enable fragmenting a
1097	   single NAL unit into multiple RTP packets, possibly without
1098	   cooperation or knowledge of the [VVC] encoder.  A fragment of a NAL
1099	   unit consists of an integer number of consecutive octets of that NAL
1100	   unit.  Fragments of the same NAL unit MUST be sent in consecutive
1101	   order with ascending RTP sequence numbers (with no other RTP packets
1102	   within the same RTP stream being sent between the first and last
1103	   fragment).

1105	   When a NAL unit is fragmented and conveyed within FUs, it is referred
1106	   to as a fragmented NAL unit.  APs MUST NOT be fragmented.  FUs MUST
1107	   NOT be nested; i.e., an FU cannot contain a subset of another FU.

1109	   The RTP timestamp of an RTP packet carrying an FU is set to the NALU-
1110	   time of the fragmented NAL unit.

1112	   An FU consists of a payload header (denoted as PayloadHdr), an FU
1113	   header of one octet, a conditional 16-bit DONL field (in network byte
1114	   order), and an FU payload, as shown in Figure 9.

1116	     0                   1                   2                   3
1117	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1118	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1119	    |    PayloadHdr (Type=29)       |   FU header   | DONL (cond)   |
1120	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1121	    | DONL (cond)   |                                               |
1122	    +-+-+-+-+-+-+-+-+                                               |
1123	    |                         FU payload                            |
1124	    |                                                               |
1125	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1126	    |                               :...OPTIONAL RTP padding        |
1127	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1129	                          The Structure of an FU

1131	                                 Figure 9

1133	   The fields in the payload header are set as follows.  The Type field
1134	   MUST be equal to 29.  The fields F, LayerId, and TID MUST be equal to
1135	   the fields F, LayerId, and TID, respectively, of the fragmented NAL
1136	   unit.

1138	   The FU header consists of an S bit, an E bit, an R bit and a 5-bit
1139	   FuType field, as shown in Figure 10.

1141	                             +---------------+
1142	                             |0|1|2|3|4|5|6|7|
1143	                             +-+-+-+-+-+-+-+-+
1144	                             |S|E|R|  FuType |
1145	                             +---------------+

1147	                         The Structure of FU Header

1149	                                 Figure 10

1151	   The semantics of the FU header fields are as follows:

1153	   S: 1 bit

1155	      When set to 1, the S bit indicates the start of a fragmented NAL
1156	      unit, i.e., the first byte of the FU payload is also the first
1157	      byte of the payload of the fragmented NAL unit.  When the FU
1158	      payload is not the start of the fragmented NAL unit payload, the S
1159	      bit MUST be set to 0.

1161	   E: 1 bit
1162	      When set to 1, the E bit indicates the end of a fragmented NAL
1163	      unit, i.e., the last byte of the payload is also the last byte of
1164	      the fragmented NAL unit.  When the FU payload is not the last
1165	      fragment of a fragmented NAL unit, the E bit MUST be set to 0.

1167	   Reserved: 1 bit

1169	      Placeholder

1171	   FuType: 5 bits

1173	      The field FuType MUST be equal to the field Type of the fragmented
1174	      NAL unit.

1176	   The DONL field, when present, specifies the value of the 16 least
1177	   significant bits of the decoding order number of the fragmented NAL
1178	   unit.

1180	   If sprop-max-don-diff is greater than 0 for any of the RTP streams,
1181	   and the S bit is equal to 1, the DONL field MUST be present in the
1182	   FU, and the variable DON for the fragmented NAL unit is derived as
1183	   equal to the value of the DONL field.  Otherwise (sprop-max-don-diff
1184	   is equal to 0 for all the RTP streams, or the S bit is equal to 0),
1185	   the DONL field MUST NOT be present in the FU.

1187	   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
1188	   the Start bit and End bit must not both be set to 1 in the same FU
1189	   header.

1191	   The FU payload consists of fragments of the payload of the fragmented
1192	   NAL unit so that if the FU payloads of consecutive FUs, starting with
1193	   an FU with the S bit equal to 1 and ending with an FU with the E bit
1194	   equal to 1, are sequentially concatenated, the payload of the
1195	   fragmented NAL unit can be reconstructed.  The NAL unit header of the
1196	   fragmented NAL unit is not included as such in the FU payload, but
1197	   rather the information of the NAL unit header of the fragmented NAL
1198	   unit is conveyed in F, LayerId, and TID fields of the FU payload
1199	   headers of the FUs and the FuType field of the FU header of the FUs.
1200	   An FU payload MUST NOT be empty.

1202	   If an FU is lost, the receiver SHOULD discard all following
1203	   fragmentation units in transmission order corresponding to the same
1204	   fragmented NAL unit, unless the decoder in the receiver is known to
1205	   be prepared to gracefully handle incomplete NAL units.

1207	   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
1208	   fragments of a NAL unit to an (incomplete) NAL unit, even if fragment
1209	   n of that NAL unit is not received.  In this case, the
1210	   forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a
1211	   syntax violation.
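
The reassembly and loss handling described above can be sketched as follows. This is an illustrative Python sketch, not a normative procedure: it assumes the caller has already grouped the FUs of one NAL unit in transmission order and extracted the RTP sequence number, S bit, E bit, and FU payload of each. The function name reassemble_fu_payload and the tuple layout are hypothetical. The sketch takes the conservative option of discarding the whole NAL unit on any loss; rebuilding the NAL unit header and the optional forwarding of incomplete NAL units are left out.

```python
from typing import Iterable, Optional, Tuple

# (RTP sequence number, S bit, E bit, FU payload); hypothetical layout.
Fragment = Tuple[int, bool, bool, bytes]


def reassemble_fu_payload(fragments: Iterable[Fragment]) -> Optional[bytes]:
    """Return the payload of the fragmented NAL unit, or None if any
    fragment is missing (the remaining fragments are then discarded)."""
    parts = []
    expected_seq = None
    for seq, s_bit, e_bit, payload in fragments:
        if not payload:
            raise ValueError("an FU payload must not be empty")
        if s_bit and e_bit:
            raise ValueError("S and E must not both be set")
        if expected_seq is None:
            if not s_bit:
                return None  # the starting fragment was lost
        elif s_bit or seq != expected_seq:
            return None      # gap in sequence numbers: a fragment was lost
        parts.append(payload)
        if e_bit:
            return b"".join(parts)
        expected_seq = (seq + 1) & 0xFFFF  # RTP sequence numbers wrap at 2^16
    return None              # the ending fragment never arrived
```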

1213	4.4.  Decoding Order Number

1215	   For each NAL unit, the variable AbsDon is derived, representing the
1216	   decoding order number that is indicative of the NAL unit decoding
1217	   order.

1219	   Let NAL unit n be the n-th NAL unit in transmission order within an
1220	   RTP stream.

1222	   If sprop-max-don-diff is equal to 0 for all the RTP streams carrying
1223	   the [VVC] bitstream, AbsDon[n], the value of AbsDon for NAL unit n,
1224	   is derived as equal to n.

1226	   Otherwise (sprop-max-don-diff is greater than 0 for any of the RTP
1227	   streams), AbsDon[n] is derived as follows, where DON[n] is the value
1228	   of the variable DON for NAL unit n:

1230	   o  If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in
1231	      transmission order), AbsDon[0] is set equal to DON[0].

1233	   o  Otherwise (n is greater than 0), the following applies for
1234	      derivation of AbsDon[n]:

1236	         If DON[n] == DON[n-1],
1237	            AbsDon[n] = AbsDon[n-1]

1239	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768),
1240	            AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1]

1242	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768),
1243	            AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n]

1245	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768),
1246	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 -
1247	            DON[n])

1249	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768),
1250	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n])
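
The five cases above can be transcribed directly into code. The following Python sketch is one such transcription, given only as an illustration; the function names next_abs_don and derive_abs_don are hypothetical, and DON values are assumed to be 16-bit unsigned integers listed in transmission order.

```python
from typing import Iterable, List


def next_abs_don(prev_abs_don: int, prev_don: int, don: int) -> int:
    """Derive AbsDon[n] from AbsDon[n-1], DON[n-1], and DON[n]."""
    if don == prev_don:
        return prev_abs_don
    if don > prev_don and don - prev_don < 32768:
        return prev_abs_don + don - prev_don            # forward, no wrap
    if don < prev_don and prev_don - don >= 32768:
        return prev_abs_don + 65536 - prev_don + don    # forward, wrapped
    if don > prev_don and don - prev_don >= 32768:
        return prev_abs_don - (prev_don + 65536 - don)  # backward, wrapped
    return prev_abs_don - (prev_don - don)              # backward, no wrap


def derive_abs_don(dons: Iterable[int]) -> List[int]:
    """Map DON values in transmission order to AbsDon values."""
    abs_dons: List[int] = []
    prev_don = None
    for don in dons:
        if not 0 <= don <= 0xFFFF:
            raise ValueError("DON is a 16-bit value")
        if prev_don is None:
            abs_dons.append(don)  # AbsDon[0] = DON[0]
        else:
            abs_dons.append(next_abs_don(abs_dons[-1], prev_don, don))
        prev_don = don
    return abs_dons
```

For example, the DON sequence 65534, 65535, 0, 2, 1 yields the AbsDon values 65534, 65535, 65536, 65538, 65537: the wrap from 65535 to 0 is counted as a step forward, and the final step from 2 to 1 as a step backward.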

1252	   For any two NAL units m and n, the following applies:

1254	   o  AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows
1255	      NAL unit m in NAL unit decoding order.

1257	   o  When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order
1258	      of the two NAL units can be in either order.

1260	   o  AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes
1261	      NAL unit m in decoding order.

1263	         Informative note: When two consecutive NAL units in the NAL
1264	         unit decoding order have different values of AbsDon, the
1265	         absolute difference between the two AbsDon values may be
1266	         greater than or equal to 1.

1268	         Informative note: There are multiple reasons to allow for the
1269	         absolute difference of the values of AbsDon for two consecutive
1270	         NAL units in the NAL unit decoding order to be greater than
1271	         one.  An increment by one is not required, as at the time of
1272	         associating values of AbsDon to NAL units, it may not be known
1273	         whether all NAL units are to be delivered to the receiver.  For
1274	         example, a gateway might not forward VCL NAL units of higher
1275	         sub-layers or some SEI NAL units when there is congestion in
1276	         the network.  In another example, the first intra-coded picture
1277	         of a pre-encoded clip is transmitted in advance to ensure that
1278	         it is readily available in the receiver, and when transmitting
1279	         the first intra-coded picture, the originator does not exactly
1280	         know how many NAL units will be encoded before the first intra-
1281	         coded picture of the pre-encoded clip follows in decoding
1282	         order.  Thus, the values of AbsDon for the NAL units of the
1283	         first intra-coded picture of the pre-encoded clip have to be
1284	         estimated when they are transmitted, and gaps in values of
1285	         AbsDon may occur.

1287	5.  Packetization Rules

1289	   The following packetization rules apply:

1291	   o  If sprop-max-don-diff is greater than 0 for any of the RTP
1292	      streams, the transmission order of NAL units carried in the RTP
1293	      stream MAY be different than the NAL unit decoding order and the
1294	      NAL unit output order.

1296	   o  A NAL unit of a small size SHOULD be encapsulated in an
1297	      aggregation packet together with one or more other NAL units in
1298	      order to avoid the unnecessary packetization overhead for small
1299	      NAL units.  For example, non-VCL NAL units such as access unit
1300	      delimiters, parameter sets, or SEI NAL units are typically small
1301	      and can often be aggregated with VCL NAL units without violating
1302	      MTU size constraints.

1304	   o  Each non-VCL NAL unit SHOULD, when possible from an MTU size match
1305	      viewpoint, be encapsulated in an aggregation packet together with
1306	      its associated VCL NAL unit, as typically a non-VCL NAL unit would
1307	      be meaningless without the associated VCL NAL unit being
1308	      available.

1310	   o  For carrying exactly one NAL unit in an RTP packet, a single NAL
1311	      unit packet MUST be used.

1313	6.  De-packetization Process

1315	   The general concept behind de-packetization is to get the NAL units
1316	   out of the RTP packets in an RTP stream and pass them to the decoder
1317	   in the NAL unit decoding order.

1319	   The de-packetization process is implementation dependent.  Therefore,
1320	   the following description should be seen as an example of a suitable
1321	   implementation.  Other schemes may be used as well, as long as the
1322	   output for the same input is the same as the process described below.
1323	   The output is the same when the set of output NAL units and their
1324	   order are both identical.  Optimizations relative to the described
1325	   algorithms are possible.

1327	   All normal RTP mechanisms related to buffer management apply.  In
1328	   particular, duplicated or outdated RTP packets (as indicated by the
1329	   RTP sequence number and the RTP timestamp) are removed.  To
1330	   determine the exact time for decoding, factors such as a possible
1331	   intentional delay to allow for proper inter-stream synchronization
1332	   MUST be factored in.

1334	   NAL units with NAL unit type values in the range of 0 to 27,
1335	   inclusive, may be passed to the decoder.  NAL-unit-like structures
1336	   with NAL unit type values in the range of 28 to 31, inclusive, MUST
1337	   NOT be passed to the decoder.

1339	   The receiver includes a receiver buffer, which is used to compensate
1340	   for transmission delay jitter within individual RTP streams and
1341	   across RTP streams, to reorder NAL units from transmission order to
1342	   the NAL unit decoding order.  In this section, the receiver operation
1343	   is described under the assumption that there is no transmission delay
1344	   jitter within an RTP stream and across RTP streams.  To distinguish
1345	   it from a practical receiver buffer that is also used for
1346	   compensation of transmission delay jitter, the receiver buffer is
1347	   hereafter called the de-packetization buffer in this section.
1348	   Receivers should also prepare for transmission delay jitter; that is,
1349	   either reserve separate buffers for transmission delay jitter
1350	   buffering and de-packetization buffering or use a receiver buffer for
1351	   both transmission delay jitter and de-packetization.  Moreover,
1352	   receivers should take transmission delay jitter into account in the
1353	   buffering operation, e.g., by additional initial buffering before
1354	   starting of decoding and playback.

1356	   When sprop-max-don-diff is equal to 0 for all the received RTP
1357	   streams, the de-packetization buffer size is zero bytes, and the
1358	   process described in the remainder of this paragraph applies.
1359	   The NAL units carried in the single RTP stream are directly passed to
1360	   the decoder in their transmission order, which is identical to their
1361	   decoding order.  When there are several NAL units of the same RTP
1362	   stream with the same NTP timestamp, the order to pass them to the
1363	   decoder is their transmission order.

1365	      Informative note: The mapping between RTP and NTP timestamps is
1366	      conveyed in RTCP SR packets.  In addition, the mechanisms for
1367	      faster media timestamp synchronization discussed in [RFC6051] may
1368	      be used to speed up the acquisition of the RTP-to-wall-clock
1369	      mapping.

1371	   When sprop-max-don-diff is greater than 0 for any of the received RTP
1372	   streams, the process described in the remainder of this section
1373	   applies.

1375	   There are two buffering states in the receiver: initial buffering and
1376	   buffering while playing.  Initial buffering starts when the reception
1377	   is initialized.  After initial buffering, decoding and playback are
1378	   started, and the buffering-while-playing mode is used.

1380	   Regardless of the buffering state, the receiver stores incoming NAL
1381	   units, in reception order, into the de-packetization buffer.  NAL
1382	   units carried in RTP packets are stored in the de-packetization
1383	   buffer individually, and the value of AbsDon is calculated and stored
1384	   for each NAL unit.

1386	   Initial buffering lasts until condition A (the difference between the
1387	   greatest and smallest AbsDon values of the NAL units in the de-
1388	   packetization buffer is greater than or equal to the value of sprop-
1389	   max-don-diff) or condition B (the number of NAL units in the de-
1390	   packetization buffer is greater than the value of sprop-depack-buf-
1391	   nalus) is true.

1393	   After initial buffering, whenever condition A or condition B is true,
1394	   the following operation is repeatedly applied until both condition A
1395	   and condition B become false:

1397	   o  The NAL unit in the de-packetization buffer with the smallest
1398	      value of AbsDon is removed from the de-packetization buffer and
1399	      passed to the decoder.

1401	   When no more NAL units are flowing into the de-packetization buffer,
1402	   all NAL units remaining in the de-packetization buffer are removed
1403	   from the buffer and passed to the decoder in the order of increasing
1404	   AbsDon values.
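
The buffering process above can be sketched in Python as follows. This is only an illustration under the same no-jitter assumption as the text; the function name depacketize and its parameters are hypothetical, and AbsDon is assumed to have been computed for each NAL unit already. Because nothing is released until condition A or condition B first becomes true, a single loop covers both the initial-buffering and the buffering-while-playing states.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple


def depacketize(
    nal_units: Iterable[Tuple[int, bytes]],  # (AbsDon, NAL unit), reception order
    max_don_diff: int,                       # value of sprop-max-don-diff
    depack_buf_nalus: int,                   # value of sprop-depack-buf-nalus
) -> Iterator[bytes]:
    """Yield NAL units in the order in which they go to the decoder."""
    # Min-heap keyed on AbsDon; the arrival index breaks ties so that
    # NAL unit contents are never compared.
    buffer: List[Tuple[int, int, bytes]] = []

    def must_release() -> bool:
        if not buffer:
            return False
        greatest = max(entry[0] for entry in buffer)
        smallest = buffer[0][0]
        condition_a = greatest - smallest >= max_don_diff
        condition_b = len(buffer) > depack_buf_nalus
        return condition_a or condition_b

    for arrival, (abs_don, nal_unit) in enumerate(nal_units):
        heapq.heappush(buffer, (abs_don, arrival, nal_unit))
        while must_release():
            yield heapq.heappop(buffer)[2]  # smallest AbsDon first

    # No more NAL units are arriving: drain in increasing AbsDon order.
    while buffer:
        yield heapq.heappop(buffer)[2]
```

The linear scan for the greatest AbsDon keeps the sketch short; an implementation with a large buffer would more likely track that value incrementally.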

1406	7.  Payload Format Parameters

1408	   Placeholder

1410	8.  Use with Feedback Messages

1412	   The following subsections define the use of the Picture Loss
1413	   Indication (PLI), Slice Loss Indication (SLI), Reference Picture
1414	   Selection Indication (RPSI), and Full Intra Request (FIR) feedback
1415	   messages with [VVC].  The PLI, SLI, and RPSI messages are defined in
1416	   [RFC4585], and the FIR message is defined in [RFC5104].

1418	8.1.  Picture Loss Indication (PLI)

1420	   As specified in RFC 4585, Section 6.3.1, the reception of a PLI by a
1421	   media sender indicates "the loss of an undefined amount of coded
1422	   video data belonging to one or more pictures".  Without having any
1423	   specific knowledge of the setup of the bitstream (such as use and
1424	   location of in-band parameter sets, non-IRAP decoder refresh points,
1425	   picture structures, and so forth), a reaction to the reception of a
1426	   PLI by a [VVC] sender SHOULD be to send an IRAP picture and relevant
1427	   parameter sets, potentially with sufficient redundancy so as to
1428	   ensure correct reception.  However, sometimes information about the
1429	   bitstream structure is known.  For example, state could have been
1430	   established outside of the mechanisms defined in this document that
1431	   parameter sets are conveyed out of band only, and stay static for the
1432	   duration of the session.  In that case, it is obviously unnecessary
1433	   to send them in-band as a result of the reception of a PLI.  Other
1434	   examples could be devised based on a priori knowledge of different
1435	   aspects of the bitstream structure.  In all cases, the timing and
1436	   congestion control mechanisms of RFC 4585 MUST be observed.

1438	8.2.  Slice Loss Indication (SLI)

1440	   For further study.  Maybe remove as there are no known
1441	   implementations of SLI in [HEVC]-based systems.

1443	8.3.  Reference Picture Selection Indication (RPSI)

1445	   Feedback-based reference picture selection has been shown as a
1446	   powerful tool to stop temporal error propagation for improved error
1447	   resilience [Girod99] [Wang05].  In one approach, the decoder side
1448	   tracks errors in the decoded pictures and informs the encoder side
1449	   that a particular picture that has been decoded relatively earlier is
1450	   correct and still present in the decoded picture buffer; it requests
1451	   the encoder to use that correct picture-availability information when
1452	   encoding the next picture, so as to stop further temporal error
1453	   propagation.  For this approach, the decoder side should use the RPSI
1454	   feedback message.

1456	   Encoders can encode some long-term reference pictures as specified in
1457	   [VVC] for purposes described in the previous paragraph without the
1458	   need of a huge decoded picture buffer.  As shown in [Wang05], with a
1459	   flexible reference picture management scheme, as in VVC, even a
1460	   decoded picture buffer size of two picture storage buffers would work
1461	   for the approach described in the previous paragraph.

1463	   The text above is copy-paste from RFC 7798.  If we keep the RPSI
1464	   message, it needs adaptation to the [VVC] syntax.  Doing so shouldn't
1465	   be too hard as the [VVC] reference picture mechanism is not too
1466	   different from the [HEVC] one.

1468	8.4.  Full Intra Request (FIR)

1470	   The purpose of the FIR message is to force an encoder to send an
1471	   independent decoder refresh point as soon as possible, while
1472	   observing applicable congestion-control-related constraints, such as
1473	   those set out in [RFC8082].

1475	   Upon reception of a FIR, a sender MUST send an IDR picture.
1476	   Parameter sets MUST also be sent, except when there is a priori
1477	   knowledge that the parameter sets have been correctly established.  A
1478	   typical example for that is an understanding between sender and
1479	   receiver, established by means outside this document, that parameter
1480	   sets are exclusively sent out-of-band.

1482	9.  Frame marking

1484	      placeholder

1486	10.  Security Considerations

1488	   The scope of this Security Considerations section is limited to the
1489	   payload format itself and to one feature of [VVC] that may pose a
1490	   particularly serious security risk if implemented naively.  The
1491	   payload format, in isolation, does not form a complete system.
1492	   Implementers are advised to read and understand relevant security-
1493	   related documents, especially those pertaining to RTP (see the
1494	   Security Considerations section in [RFC3550]), and the security of
1495	   the call-control stack chosen (that may make use of the media type
1496	   registration of this memo).  Implementers should also consider known
1497	   security vulnerabilities of video coding and decoding implementations
1498	   in general and avoid those.

1500	   Within this RTP payload format, and with the exception of the user
1501	   data SEI message as described below, no security threats other than
1502	   those common to RTP payload formats are known.  In other words,
1503	   neither the various media-plane-based mechanisms, nor the signaling
1504	   part of this memo, seems to pose a security risk beyond those common
1505	   to all RTP-based systems.

1507	   RTP packets using the payload format defined in this specification
1508	   are subject to the security considerations discussed in the RTP
1509	   specification [RFC3550], and in any applicable RTP profile such as
1510	   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or
1511	   RTP/SAVPF [RFC5124].  However, as "Securing the RTP Framework: Why
1512	   RTP Does Not Mandate a Single Media Security Solution" [RFC7202]
1513	   discusses, it is not an RTP payload format's responsibility to
1514	   discuss or mandate what solutions are used to meet the basic security
1515	   goals like confidentiality, integrity and source authenticity for RTP
1516	   in general.  This responsibility lies with anyone using RTP in an
1517	   application.  They can find guidance on available security mechanisms
1518	   and important considerations in "Options for Securing RTP Sessions"
1519	   [RFC7201].  The rest of this section discusses the security-impacting
1520	   properties of the payload format itself.

1522	   Because the data compression used with this payload format is applied
1523	   end-to-end, any encryption needs to be performed after compression.
1524	   A potential denial-of-service threat exists for data encodings using
1525	   compression techniques that have non-uniform receiver-end
1526	   computational load.  The attacker can inject pathological datagrams
1527	   into the bitstream that are complex to decode and that cause the
1528	   receiver to be overloaded.  [VVC] is particularly vulnerable to such
1529	   attacks, as it is extremely simple to generate datagrams containing
1530	   NAL units that affect the decoding process of many future NAL units.
1531	   Therefore, the usage of data origin authentication and data integrity
1532	   protection of at least the RTP packet is RECOMMENDED, for example,
1533	   with SRTP [RFC3711].

1535	   Like HEVC [RFC7798], [VVC] includes a user data Supplemental
1536	   Enhancement Information (SEI) message.  This SEI message allows
1537	   inclusion of an arbitrary bitstring into the video bitstream.  Such a
1538	   bitstring could include JavaScript, machine code, and other active
1539	   content.  [VVC] leaves the handling of this SEI message to the
1540	   receiving system.  To avoid harmful side effects of the user
1541	   data SEI message, decoder implementations cannot naively trust its
1542	   content.  For example, it would be a bad and insecure implementation
1543	   practice to forward any JavaScript a decoder implementation detects
1544	   to a web browser.  The safest way to deal with user data SEI messages
1545	   is to simply discard them, but that can have negative side effects on
1546	   the quality of experience by the user.

1548	   End-to-end security with authentication, integrity, or
1549	   confidentiality protection will prevent a MANE from performing media-
1550	   aware operations other than discarding complete packets.  In the case
1551	   of confidentiality protection, it will even be prevented from
1552	   discarding packets in a media-aware way.  To be allowed to perform
1553	   such operations, a MANE is required to be a trusted entity that is
1554	   included in the security context establishment.

1556	11.  Congestion Control

1558	   Congestion control for RTP SHALL be used in accordance with RTP
1559	   [RFC3550] and with any applicable RTP profile, e.g., AVP [RFC3551].
1560	   If best-effort service is being used, an additional requirement is
1561	   that users of this payload format MUST monitor packet loss to ensure
1562	   that the packet loss rate is within an acceptable range.  Packet loss
1563	   is considered acceptable if a TCP flow across the same network path,
1564	   and experiencing the same network conditions, would achieve an
1565	   average throughput, measured on a reasonable timescale, that is not
1566	   less than all RTP streams combined are achieving.  This condition can
1567	   be satisfied by implementing congestion-control mechanisms to adapt
1568	   the transmission rate, the number of layers subscribed for a layered
1569	   multicast session, or by arranging for a receiver to leave the
1570	   session if the loss rate is unacceptably high.

1572	   The bitrate adaptation necessary for obeying the congestion control
1573	   principle is easily achievable when real-time encoding is used, for
1574	   example, by adequately tuning the quantization parameter.  However,
1575	   when pre-encoded content is being transmitted, bandwidth adaptation
1576	   requires the pre-coded bitstream to be tailored for such adaptivity.
1577	   The key mechanisms available in [VVC] are temporal scalability, and
1578	   spatial/SNR scalability.  A media sender can remove NAL units
1579	   belonging to higher temporal sub-layers (i.e., those NAL units with a
1580	   high value of TID) or higher spatio-SNR layers (as indicated by
1581	   interpreting the VPS) until the sending bitrate drops to an
1582	   acceptable range.
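
As a rough illustration of the temporal sub-layer removal described above, the following Python sketch lowers the highest retained TID until the remaining NAL units fit a byte budget. The budget, the (TID, size) representation, and the function name drop_higher_sublayers are assumptions made for this example only. The sketch never removes the base temporal sub-layer (TID equal to 0) and does not consider spatial or SNR layers.

```python
from typing import Iterable, List, Tuple

# (TID, size in bytes) of one NAL unit; hypothetical representation.
NalUnit = Tuple[int, int]


def drop_higher_sublayers(
    nal_units: Iterable[NalUnit], budget_bytes: int
) -> List[NalUnit]:
    """Remove NAL units of the highest temporal sub-layers, one
    sub-layer at a time, until the total size fits the budget."""
    kept = list(nal_units)
    max_tid = max((tid for tid, _ in kept), default=0)
    while sum(size for _, size in kept) > budget_bytes and max_tid > 0:
        max_tid -= 1
        kept = [(tid, size) for tid, size in kept if tid <= max_tid]
    return kept
```

If the base sub-layer alone exceeds the budget, the function returns it unchanged, and the sender would need another means of rate reduction.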

1584	   The mechanisms mentioned above generally work within a defined
1585	   profile and level and, therefore, no renegotiation of the channel is
1586	   required.  Only when non-downgradable parameters (such as profile)
1587	   are required to be changed does it become necessary to terminate and
1588	   restart the RTP stream(s).  This may be accomplished by using
1589	   different RTP payload types.

1591	   MANEs MAY remove certain unusable packets from the RTP stream when
1592	   that RTP stream was damaged due to previous packet losses.  This can
1593	   help reduce the network load in certain special cases.  For example,
1594	   MANEs can remove those FUs where the leading FUs belonging to the
1595	   same NAL unit have been lost or those dependent slice segments when
1596	   the leading slice segments belonging to the same slice have been
1597	   lost, because the trailing FUs or dependent slice segments are
1598	   meaningless to most decoders.  MANEs can also remove higher temporal
1599	   scalable layers if the outbound transmission (from the MANE's
1600	   viewpoint) experiences congestion.

1602	12.  IANA Considerations

1604	   Placeholder

1606	13.  Acknowledgements

1608	   Dr. Byeongdoo Choi is thanked for the video codec related technical
1609	   discussion and other aspects in this memo.  Xin Zhao and Dr. Xiang Li
1610	   are thanked for their contributions on [VVC] specification
1611	   descriptive content.  Spencer Dawkins is thanked for his valuable
1612	   review comments that led to great improvements of this memo.  Some
1613	   parts of this specification share text with the RTP payload format
1614	   for HEVC [RFC7798].  We thank the authors of that specification for
1615	   their excellent work.

1617	14.  References

1619	14.1.  Normative References

1621	   [H.266]    "ITU-T, Versatile Video Coding", n.d..

1623	   [ISO23090-3]
1624	              "ISO/IEC DIS Information technology --- Coded
1625	              representation of immersive media --- Part 3 Versatile
1626	              video coding", n.d.,
1627	              <https://www.iso.org/standard/73022.html>.

1629	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1630	              Requirement Levels", BCP 14, RFC 2119,
1631	              DOI 10.17487/RFC2119, March 1997,
1632	              <https://www.rfc-editor.org/info/rfc2119>.

1634	   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
1635	              Jacobson, "RTP: A Transport Protocol for Real-Time
1636	              Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
1637	              July 2003, <https://www.rfc-editor.org/info/rfc3550>.

1639	   [RFC3551]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
1640	              Video Conferences with Minimal Control", STD 65, RFC 3551,
1641	              DOI 10.17487/RFC3551, July 2003,
1642	              <https://www.rfc-editor.org/info/rfc3551>.

1644	   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
1645	              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
1646	              RFC 3711, DOI 10.17487/RFC3711, March 2004,
1647	              <https://www.rfc-editor.org/info/rfc3711>.

1649	   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
1650	              Description Protocol", RFC 4566, DOI 10.17487/RFC4566,
1651	              July 2006, <https://www.rfc-editor.org/info/rfc4566>.

1653	   [RFC4585]  Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey,
1654	              "Extended RTP Profile for Real-time Transport Control
1655	              Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585,
1656	              DOI 10.17487/RFC4585, July 2006,
1657	              <https://www.rfc-editor.org/info/rfc4585>.

1659	   [RFC5104]  Wenger, S., Chandra, U., Westerlund, M., and B. Burman,
1660	              "Codec Control Messages in the RTP Audio-Visual Profile
1661	              with Feedback (AVPF)", RFC 5104, DOI 10.17487/RFC5104,
1662	              February 2008, <https://www.rfc-editor.org/info/rfc5104>.

1664	   [RFC5124]  Ott, J. and E. Carrara, "Extended Secure RTP Profile for
1665	              Real-time Transport Control Protocol (RTCP)-Based Feedback
1666	              (RTP/SAVPF)", RFC 5124, DOI 10.17487/RFC5124, February
1667	              2008, <https://www.rfc-editor.org/info/rfc5124>.

1669	   [RFC7656]  Lennox, J., Gross, K., Nandakumar, S., Salgueiro, G., and
1670	              B. Burman, Ed., "A Taxonomy of Semantics and Mechanisms
1671	              for Real-Time Transport Protocol (RTP) Sources", RFC 7656,
1672	              DOI 10.17487/RFC7656, November 2015,
1673	              <https://www.rfc-editor.org/info/rfc7656>.

1675	   [RFC8082]  Wenger, S., Lennox, J., Burman, B., and M. Westerlund,
1676	              "Using Codec Control Messages in the RTP Audio-Visual
1677	              Profile with Feedback with Layered Codecs", RFC 8082,
1678	              DOI 10.17487/RFC8082, March 2017,
1679	              <https://www.rfc-editor.org/info/rfc8082>.

1681	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
1682	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
1683	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

1685	   [VVC]      "Versatile Video Coding (Draft 8), Joint Video Experts
1686	              Team (JVET)", January 2020.

1688	14.2.  Informative References

1690	   [CABAC]    Sole, J., et al., "Transform coefficient coding in
1691	              HEVC", IEEE Transactions on Circuits and Systems for Video
1692	              Technology, DOI 10.1109/TCSVT.2012.2223055, December
1693	              2012.

1695	   [FrameMarking]
1696	              Berger, E., Nandakumar, S., and M. Zanaty, "Frame
1697	              Marking RTP Header Extension", Work in Progress,
1698	              draft-berger-avtext-framemarking, 2015.

1700	   [Girod99]  Girod, B., et al., "Feedback-based error control for
1701	              mobile video transmission", Proceedings of the IEEE,
1702	              DOI 10.1109/5.790632, October 1999.

1704	   [HEVC]     "High efficiency video coding, ITU-T Recommendation
1705	              H.265", April 2013.

1707	   [MPEG2S]   ISO/IEC, "Information technology - Generic coding of
1708	              moving pictures and associated audio information - Part
1709	              1: Systems", ISO International Standard 13818-1, 2013.

1711	   [RFC6051]  Perkins, C. and T. Schierl, "Rapid Synchronisation of RTP
1712	              Flows", RFC 6051, DOI 10.17487/RFC6051, November 2010,
1713	              <https://www.rfc-editor.org/info/rfc6051>.

1715	   [RFC6184]  Wang, Y., Even, R., Kristensen, T., and R. Jesup, "RTP
1716	              Payload Format for H.264 Video", RFC 6184,
1717	              DOI 10.17487/RFC6184, May 2011,
1718	              <https://www.rfc-editor.org/info/rfc6184>.

1720	   [RFC6190]  Wenger, S., Wang, Y., Schierl, T., and A. Eleftheriadis,
1721	              "RTP Payload Format for Scalable Video Coding", RFC 6190,
1722	              DOI 10.17487/RFC6190, May 2011,
1723	              <https://www.rfc-editor.org/info/rfc6190>.

1725	   [RFC7201]  Westerlund, M. and C. Perkins, "Options for Securing RTP
1726	              Sessions", RFC 7201, DOI 10.17487/RFC7201, April 2014,
1727	              <https://www.rfc-editor.org/info/rfc7201>.

1729	   [RFC7202]  Perkins, C. and M. Westerlund, "Securing the RTP
1730	              Framework: Why RTP Does Not Mandate a Single Media
1731	              Security Solution", RFC 7202, DOI 10.17487/RFC7202, April
1732	              2014, <https://www.rfc-editor.org/info/rfc7202>.

1734	   [RFC7798]  Wang, Y., Sanchez, Y., Schierl, T., Wenger, S., and M.
1735	              Hannuksela, "RTP Payload Format for High Efficiency Video
1736	              Coding (HEVC)", RFC 7798, DOI 10.17487/RFC7798, March
1737	              2016, <https://www.rfc-editor.org/info/rfc7798>.

1739	   [Wang05]   Wang, YK., Zhu, C., and H. Li, "Error resilient video
1740	              coding using flexible reference frames", Visual
1741	              Communications and Image Processing 2005 (VCIP 2005),
1742	              July 2005.

1744	Appendix A.  Change History

1746	   draft-zhao-payload-rtp-vvc-00 ........ initial version

1748	Authors' Addresses

1750	   Shuai Zhao
1751	   Tencent
1752	   2747 Park Blvd
1753	   Palo Alto  94588
1754	   USA

1756	   Email: shuai.zhao@ieee.org

1758	   Stephan Wenger
1759	   Tencent
1760	   2747 Park Blvd
1761	   Palo Alto  94588

1763	   Email: stewe@stewe.org