idnits 2.17.1 

draft-ietf-avtcore-rtp-vvc-01.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([ISO23090-3]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document date (March 30, 2020) is 1488 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 1264

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO23090-3'

  ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866)

  ** Downref: Normative reference to an Informational RFC: RFC 7656

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC'


     Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	avtcore                                                          S. Zhao
3	Internet-Draft                                                 S. Wenger
4	Intended status: Standards Track                                 Tencent
5	Expires: October 1, 2020                                      Y. Sanchez
6	                                                          Fraunhofer HHI
7	                                                          March 30, 2020

9	          RTP Payload Format for Versatile Video Coding (VVC)
10	                     draft-ietf-avtcore-rtp-vvc-01

12	Abstract

14	   This memo describes an RTP payload format for the video coding
15	   standard ITU-T Recommendation [H.266] and ISO/IEC International
16	   Standard [ISO23090-3], both also known as Versatile Video Coding
17	   (VVC) and developed by the Joint Video Experts Team (JVET).  The RTP
18	   payload format allows for packetization of one or more Network
19	   Abstraction Layer (NAL) units in each RTP packet payload as well as
20	   fragmentation of a NAL unit into multiple RTP packets.  The payload
21	   format has wide applicability in videoconferencing, Internet video
22	   streaming, and high-bitrate entertainment-quality video, among other
23	   applications.

25	Status of This Memo

27	   This Internet-Draft is submitted in full conformance with the
28	   provisions of BCP 78 and BCP 79.

30	   Internet-Drafts are working documents of the Internet Engineering
31	   Task Force (IETF).  Note that other groups may also distribute
32	   working documents as Internet-Drafts.  The list of current Internet-
33	   Drafts is at https://datatracker.ietf.org/drafts/current/.

35	   Internet-Drafts are draft documents valid for a maximum of six months
36	   and may be updated, replaced, or obsoleted by other documents at any
37	   time.  It is inappropriate to use Internet-Drafts as reference
38	   material or to cite them other than as "work in progress."

40	   This Internet-Draft will expire on October 1, 2020.

42	Copyright Notice

44	   Copyright (c) 2020 IETF Trust and the persons identified as the
45	   document authors.  All rights reserved.

47	   This document is subject to BCP 78 and the IETF Trust's Legal
48	   Provisions Relating to IETF Documents
49	   (https://trustee.ietf.org/license-info) in effect on the date of
50	   publication of this document.  Please review these documents
51	   carefully, as they describe your rights and restrictions with respect
52	   to this document.  Code Components extracted from this document must
53	   include Simplified BSD License text as described in Section 4.e of
54	   the Trust Legal Provisions and are provided without warranty as
55	   described in the Simplified BSD License.

57	Table of Contents

59	   1.  Introduction
60	     1.1.  Overview of the VVC Codec
61	       1.1.1.  Coding-Tool Features (informative)
62	       1.1.2.  Systems and Transport Interfaces
63	       1.1.3.  Parallel Processing Support (informative)
64	       1.1.4.  NAL Unit Header
65	     1.2.  Overview of the Payload Format
66	   2.  Conventions
67	   3.  Definitions and Abbreviations
68	     3.1.  Definitions
69	       3.1.1.  Definitions from the VVC Specification
70	       3.1.2.  Definitions Specific to This Memo
71	     3.2.  Abbreviations
72	   4.  RTP Payload Format
73	     4.1.  RTP Header Usage
74	     4.2.  Payload Header Usage
75	     4.3.  Payload Structures
76	       4.3.1.  Single NAL Unit Packets
77	       4.3.2.  Aggregation Packets (APs)
78	       4.3.3.  Fragmentation Units
79	     4.4.  Decoding Order Number
80	   5.  Packetization Rules
81	   6.  De-packetization Process
82	   7.  Payload Format Parameters
83	   8.  Use with Feedback Messages
84	     8.1.  Picture Loss Indication (PLI)
85	     8.2.  Slice Loss Indication (SLI)
86	     8.3.  Reference Picture Selection Indication (RPSI)
87	     8.4.  Full Intra Request (FIR)
88	   9.  Frame marking
89	   10. Security Considerations
90	   11. Congestion Control
91	   12. IANA Considerations
92	   13. Acknowledgements
93	   14. References
94	     14.1.  Normative References
95	     14.2.  Informative References
96	   Appendix A.  Change History
97	   Authors' Addresses

99	1.  Introduction

101	   The Versatile Video Coding [VVC] specification, formally published as
102	   both ITU-T Recommendation H.266 and ISO/IEC International Standard
103	   23090-3 [ISO23090-3], is currently in the ISO/IEC approval process
104	   and is planned for ratification in mid-2020.  H.266 is reported to
105	   provide significant coding efficiency gains over H.265 and earlier
106	   video codec formats.

108	   This memo describes an RTP payload format for VVC.  It shares its
109	   basic design with the NAL (Network Abstraction Layer) unit-based RTP
110	   payload formats of H.264 Video Coding [RFC6184], Scalable Video
111	   Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798],
112	   and their respective predecessors.  With respect to design
113	   philosophy, security, congestion control, and overall implementation
114	   complexity, it has similar properties to those earlier payload format
115	   specifications.  This is a conscious choice, as at least RFC 6184 is
116	   widely deployed and generally known in the relevant implementer
117	   communities.  Certain mechanisms known from [RFC6190] were
118	   incorporated in VVC, as VVC version 1 supports temporal, spatial, and
119	   signal-to-noise ratio (SNR) scalability.

121	1.1.  Overview of the VVC Codec

123	   [VVC] and [HEVC] share a similar hybrid video codec design.  In this
124	   memo, we provide a very brief overview of those features of VVC that
125	   are, in some form, addressed by the payload format specified herein.
126	   Implementers have to read, understand, and apply the ITU-T/ISO/IEC
127	   specifications pertaining to [VVC] to arrive at interoperable, well-
128	   performing implementations.

130	   Conceptually, both [VVC] and [HEVC] include a Video Coding Layer
131	   (VCL), which is often used to refer to the coding-tool features, and
132	   a NAL, which is often used to refer to the systems and transport
133	   interface aspects of the codecs.

135	1.1.1.  Coding-Tool Features (informative)

137	   Coding tool features are described below with occasional reference to
138	   the coding tool set of [HEVC], which is well known in the community.

140	   Similar to earlier hybrid-video-coding-based standards, including
141	   HEVC, the following basic video coding design is employed by VVC.  A
142	   prediction signal is first formed by either intra- or motion-
143	   compensated prediction, and the residual (the difference between the
144	   original and the prediction) is then coded.  The gains in coding
145	   efficiency are achieved by redesigning and improving almost all parts
146	   of the codec over earlier designs.  In addition, [VVC] includes
147	   several tools to make the implementation on parallel architectures
148	   easier.

150	   Finally, [VVC] includes temporal, spatial, and SNR scalability as
151	   well as multiview coding support.

153	   Coding blocks and transform structure

155	   Among major coding-tool differences between HEVC and VVC, one of the
156	   important improvements is the more flexible coding tree structure in
157	   VVC, i.e., multi-type tree.  In addition to quadtree, binary and
158	   ternary trees are also supported, which contributes a significant
159	   improvement in coding efficiency.  Moreover, the maximum size of the
160	   Coding Tree Unit (CTU) is increased from 64x64 to 128x128.  To
161	   improve the coding efficiency of the chroma signal, separate luma and
162	   chroma trees at the CTU level may be employed for intra slices.  The square
163	   transforms in HEVC are extended to non-square transforms for
164	   rectangular blocks resulting from binary and ternary tree splits.
165	   Besides, [VVC] supports multiple transform sets (MTS), including DCT-
166	   2, DST-7, and DCT-8 as well as the non-separable secondary transform.
167	   The transforms used in [VVC] can have different sizes with support
168	   for larger transform sizes.  For DCT-2, the transform sizes range
169	   from 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range
170	   from 4x4 to 32x32.  In addition, [VVC] also supports sub-block
171	   transform for both intra and inter coded blocks.  For intra coded
172	   blocks, intra sub-partitioning (ISP) may be used to allow sub-block
173	   based intra prediction and transform.  For inter blocks, sub-block
174	   transform may be used assuming that only a part of an inter-block has
175	   non-zero transform coefficients.
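
   As an informal illustration of the CTU size increase described above,
   the following Python sketch counts how many CTUs cover a picture for
   the 64x64 and 128x128 maximum CTU sizes.  The function name ctu_grid
   and the 1920x1080 picture size are assumptions made for this example
   only; they do not appear in [VVC].

```python
def ctu_grid(width, height, ctu_size):
    """Return (columns, rows, total) of CTUs covering a width x height picture.

    Partial CTUs at the right and bottom edges still count as one CTU,
    hence the ceiling division.
    """
    cols = -(-width // ctu_size)   # ceiling division on integers
    rows = -(-height // ctu_size)
    return cols, rows, cols * rows


if __name__ == "__main__":
    # Hypothetical 1080p picture, compared for both maximum CTU sizes.
    for size in (64, 128):
        print(size, ctu_grid(1920, 1080, size))
```

   For this picture size, the larger CTU reduces the number of CTUs from
   510 to 135, which gives a rough idea of the coarser top-level
   partitioning that the multi-type tree then refines.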

177	   Entropy coding

179	   Similar to HEVC, [VVC] uses a single entropy-coding engine, which is
180	   based on Context Adaptive Binary Arithmetic Coding (CABAC) [CABAC],
181	   but with the support of multi-window sizes.  The window sizes can be
182	   initialized differently for different context models.  Due to such a
183	   design, it has more efficient adaptation speed and better coding
184	   efficiency.  A joint chroma residual coding scheme is applied to
185	   further exploit the correlation between the residuals of two color
186	   components.  In VVC, different residual coding schemes are applied
187	   for regular transform coefficients and residual samples generated
188	   using transform-skip mode.

190	   In-loop filtering

192	   [VVC] has more feature support in loop filters than HEVC.  The
193	   deblocking filter in [VVC] is similar to HEVC but operates at a
194	   smaller grid.  After deblocking and sample adaptive offset (SAO), an
195	   adaptive loop filter (ALF) may be used.  As a Wiener filter, ALF
196	   reduces distortion of decoded pictures.  Besides, [VVC] introduces a
197	   new module before deblocking called luma mapping with chroma scaling
198	   to fully utilize the dynamic range of signal so that rate-distortion
199	   performance of both SDR and HDR content is improved.

201	   Motion prediction and coding

203	   Compared to HEVC, [VVC] introduces several improvements in this area.
204	   First, adaptive motion vector resolution (AMVR) can save bit cost
205	   for motion vectors by adaptively signaling motion vector
206	   resolution.  Second, affine motion compensation is included to
207	   capture complicated motion such as zooming and rotation.  Meanwhile,
208	   prediction refinement with optical flow (PROF) for affine mode is
209	   further deployed to mimic affine motion at the pixel level.  Third,
210	   decoder-side motion vector refinement (DMVR) is a method to derive
211	   motion vectors at the decoder side based on block matching so that
212	   fewer bits may be spent on motion vectors.  Bi-directional optical
213	   flow (BDOF) is a similar method to PROF.  BDOF adds a sample wise
214	   offset at 4x4 sub-block level that is derived with equations based on
215	   gradients of the prediction samples and a motion difference relative
216	   to CU motion vectors.  Furthermore, merge with motion vector
217	   difference (MMVD) is a special mode, which further signals a limited
218	   set of motion vector differences on top of merge mode.  In addition
219	   to MMVD, there are three other types of special merge modes, i.e.,
220	   sub-block merge, triangle, and combined intra/inter prediction
221	   (CIIP).  The sub-block merge list includes one candidate of sub-block
222	   temporal motion vector prediction (SbTMVP) and up to four candidates
223	   of affine motion vectors.  Triangle is based on triangular block
224	   motion compensation.  CIIP combines intra and inter predictions
225	   with weighting.  Adaptive weighting may be employed with a block-
226	   level tool called bi-prediction with CU based weighting (BCW) which
227	   provides more flexibility than in HEVC.

229	   Intra prediction and intra-coding

231	   To capture the diversified local image texture directions with finer
232	   granularity, [VVC] supports 65 angular directions instead of 33
233	   directions in HEVC.  The intra mode coding is based on a 6 most
234	   probable mode scheme, and the 6 most probable modes are derived using
235	   the neighboring intra prediction directions.  In addition, to deal
236	   with the different distributions of intra prediction angles for
237	   different block aspect ratios, a wide-angle intra prediction (WAIP)
238	   scheme is applied in [VVC] by including intra prediction angles
239	   beyond those present in HEVC.  Unlike HEVC which only allows using
240	   the most adjacent line of reference samples for intra prediction,
241	   [VVC] also allows using two further reference lines, known as
242	   multi-reference-line (MRL) intra prediction.  The additional
243	   reference lines can only be used for the 6 most probable intra prediction
244	   modes.  To capture the strong correlation between different colour
245	   components, in VVC, a cross-component linear mode (CCLM) is utilized
246	   which assumes a linear relationship between the luma sample values
247	   and their associated chroma samples.  For intra prediction, [VVC]
248	   also applies a position-dependent prediction combination (PDPC) for
249	   refining the prediction samples closer to the intra prediction block
250	   boundary.  Matrix-based intra prediction (MIP) modes are also used in
251	   [VVC] which generates an up to 8x8 intra prediction block using a
252	   weighted sum of downsampled neighboring reference samples, and the
253	   weightings are hardcoded constants.

255	   Other coding-tool feature

257	   [VVC] introduces dependent quantization (DQ) to reduce quantization
258	   error by state-based switching between two quantizers.
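
   The state-based switching can be sketched as follows.  This is a
   minimal decoder-side illustration, not the normative process: the
   four-state transition table, the names STATE_TRANSITION and
   dequantize, and the single step size are assumptions of this example
   and need to be checked against [VVC].

```python
# Assumed four-state machine: next_state = STATE_TRANSITION[state][parity],
# where parity is the least significant bit of the absolute level.
STATE_TRANSITION = ((0, 2), (2, 0), (1, 3), (3, 1))


def dequantize(levels, step):
    """Reconstruct coefficient values from levels, in the given coding order."""
    state = 0
    values = []
    for level in levels:
        magnitude = abs(level)
        if magnitude == 0:
            value = 0
        else:
            # States 0 and 1 select the first quantizer (even multiples of
            # step); states 2 and 3 select the second (odd multiples).
            offset = 1 if state > 1 else 0
            value = (2 * magnitude - offset) * step
            if level < 0:
                value = -value
        values.append(value)
        # The parity of the current level drives the quantizer switch.
        state = STATE_TRANSITION[state][magnitude & 1]
    return values


if __name__ == "__main__":
    print(dequantize([1, 1, 0, -2], step=1))
```

   The point of the sketch is that the quantizer used for a coefficient
   is not signaled; it follows from the parities of the levels that
   precede it in coding order.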

260	1.1.2.  Systems and Transport Interfaces

262	   [VVC] inherits the basic systems and transport interfaces designs
263	   from HEVC and H.264.  These include the NAL-unit-based syntax
264	   structure, the hierarchical syntax and data unit structure, the
265	   Supplemental Enhancement Information (SEI) message mechanism, and the
266	   video buffering model based on the Hypothetical Reference Decoder
267	   (HRD).  The scalability features of [VVC] are conceptually similar to
268	   the scalable variant of HEVC known as SHVC.  The hierarchical syntax
269	   and data unit structure consists of parameter sets at various levels
270	   (decoder, sequence (pertaining to all), sequence (pertaining to a
271	   single), picture), picture-level header parameters, slice-level
272	   header parameters, and lower-level parameters.

274	   A number of key components that influenced the Network Abstraction
275	   Layer design of [VVC] as well as this memo are described below.

277	   Decoding Capability Information

279	   The Decoding Capability Information (DCI) includes parameters that
280	   stay constant for the lifetime of a video bitstream, which in IETF
281	   terms can translate to the lifetime of a session.  The DCI
282	   can include profile, level, and sub-profile information
283	   to determine a maximum complexity interop point that is guaranteed
284	   never to be exceeded, even if splicing of video sequences occurs within
285	   a session.  It further includes constraint flags, which can
286	   optionally be set to indicate that the video bitstream will be
287	   constrained in the use of certain features as indicated by the values
288	   of those flags.  With this, a bitstream can be labelled as not using
289	   certain tools, which allows among other things for resource
290	   allocation in a decoder implementation.

292	   Video parameter set

294	   The Video Parameter Set (VPS) pertains to a Coded Video Sequence
295	   (CVS) of multiple layers covering the same range of picture units,
296	   and includes, among other information, decoding dependency expressed
297	   as information for reference picture set construction of enhancement
298	   layers.  The VPS provides a "big picture" of a scalable sequence,
299	   including what types of operation points are provided, the profile,
300	   tier, and level of the operation points, and some other high-level
301	   properties of the bitstream that can be used as the basis for session
302	   negotiation and content selection, etc.  One VPS may be referenced by
303	   one or more Sequence parameter sets.

305	   Sequence parameter set

307	   The Sequence Parameter Set (SPS) contains syntax elements pertaining
308	   to a coded layer video sequence (CLVS), which is a group of pictures
309	   belonging to the same layer, starting with a random access point, and
310	   followed by pictures that may depend on each other and the random
311	   access point picture.  In MPEG-2, the equivalent of a CVS was a
312	   Group of Pictures (GOP), which normally started with an I frame and
313	   was followed by P and B frames.  While more complex in its options of
314	   random access points, VVC retains this basic concept.  One remarkable
315	   difference of VVC is that a CLVS may start with a Gradual Decoding
316	   Refresh (GDR) picture, without requiring presence of traditional
317	   random access points in the bitstream, such as Instantaneous Decoding
318	   Refresh (IDR) or Clean Random Access (CRA) pictures.  In many TV-like
319	   applications, a CVS contains a few hundred milliseconds to a few
320	   seconds of video.  In video conferencing (without switching MCUs
321	   involved), a CVS can be as long in duration as the whole session.

323	   Picture and Adaptation parameter set

325	   The Picture Parameter Set and the Adaptation Parameter Set (PPS and
326	   APS, respectively) carry information pertaining to zero or more
327	   pictures and zero or more slices, respectively.  The PPS contains
328	   information that is likely to stay constant from picture to picture,
329	   at least for pictures of a certain type, whereas the APS contains
330	   information, such as adaptive loop filter coefficients, that is
331	   likely to change from picture to picture or even within a picture.  A
332	   single APS can be referenced by slices of the same picture if that
333	   APS contains information about luma mapping with chroma scaling
334	   (LMCS), but different APSs can be referenced by slices of the same
335	   picture if those APSs contain information about ALF.

337	   Picture Header

339	   A Picture Header contains information that is common to all slices
340	   that belong to the same picture.  Being able to send that information
341	   as a separate NAL unit when pictures are split into several slices
342	   allows for saving bitrate, compared to repeating the same information
343	   in all slices.  However, there might be scenarios where low-bitrate
344	   video is transmitted using a single slice per picture.  Having a
345	   separate NAL unit to convey that information incurs overhead
346	   for such scenarios.  Therefore, VVC specifies signaling that
347	   indicates whether Picture Headers are present in the CLVS or not.

349	   Profile, tier, and level

351	   The profile, tier and level syntax structures in DCI, VPS and SPS
352	   contain profile, tier, level information for all layers that refer to
353	   the DCI, for layers associated with one or more output layer sets
354	   specified by the VPS, and for any layer that refers to the SPS,
355	   respectively.

357	   Sub-Profiles

359	   Within the [VVC] specification, a sub-profile is a 32-bit number
360	   coded according to ITU-T Rec. T.35, that does not carry any semantics.
361	   It is carried in the profile_tier_level structure and hence
362	   (potentially) present in the DCI, VPS, and SPS.  External
363	   registration bodies can register a T.35 codepoint with ITU-T
364	   registration authorities and associate with their registration a
365	   description of bitstream complexity restrictions beyond the profiles
366	   defined by ITU-T and ISO/IEC.  This would allow encoder manufacturers
367	   to label the bitstreams generated by their encoder as complying with
368	   such a sub-profile.  It is expected that upstream standardization
369	   organizations (such as DVB and ATSC), as well as walled-garden video
370	   services, will take advantage of this labelling system.  In contrast
371	   to "normal" profiles, it is expected that sub-profiles may indicate
372	   encoder choices traditionally left open in the (decoder-centric)
373	   video coding specs, such as GOP structures, minimum/maximum QP
374	   values, and the mandatory use of certain tools or SEI messages.

376	   Constraint Flags

378	   The profile_tier_level structure carries a considerable number of
379	   constraint flags, which an encoder can use to indicate to a decoder
380	   that it will not use a certain tool or technology.  They were
381	   included in reaction to a perceived market need for labelling a
382	   bitstream as not exercising a certain tool that has become
383	   commercially unviable.

385	   Temporal scalability support

387	      Editor notes: this will need to be updated along with the new VVC
388	      draft in the future

390	   [VVC] includes support of temporal scalability, by inclusion of the
391	   signaling of TemporalId in the NAL unit header, the restriction that
392	   pictures of a particular temporal sub-layer cannot be used for inter
393	   prediction reference by pictures of a lower temporal sub-layer, the
394	   sub-bitstream extraction process, and the requirement that each sub-
395	   bitstream extraction output be a conforming bitstream.  Media-Aware
396	   Network Elements (MANEs) can utilize the TemporalId in the NAL unit
397	   header for stream adaptation purposes based on temporal scalability.
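
   A minimal sketch of the stream adaptation mentioned above, assuming
   the two-byte NAL unit header layout shown in Figure 1 of this memo:
   a MANE reads TID from the low three bits of the second header byte
   and forwards only NAL units whose TemporalId does not exceed a
   target.  The function names and the sample byte values are
   hypothetical, and a real sub-bitstream extraction process has
   further rules (for example, for layers and non-VCL NAL units) that
   are not modeled here.

```python
def temporal_id(nal_unit: bytes) -> int:
    """Return TemporalId (TID minus 1) from a NAL unit's two-byte header."""
    if len(nal_unit) < 2:
        raise ValueError("NAL unit shorter than its two-byte header")
    tid_plus1 = nal_unit[1] & 0x07  # low three bits of the second byte
    if tid_plus1 == 0:
        raise ValueError("TID value 0 is not allowed")
    return tid_plus1 - 1


def drop_above(nal_units, max_temporal_id):
    """Keep only NAL units with TemporalId <= max_temporal_id."""
    return [n for n in nal_units if temporal_id(n) <= max_temporal_id]


if __name__ == "__main__":
    # Three hypothetical NAL units with TemporalId 0, 1, and 2.
    units = [bytes([0x00, 0x01]), bytes([0x00, 0x02]), bytes([0x00, 0x03])]
    kept = drop_above(units, max_temporal_id=1)
    print(len(kept))
```

   Because lower temporal sub-layers never reference higher ones,
   dropping the higher sub-layers in this way leaves the remaining
   pictures decodable.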

399	   Spatial, SNR, View Scalability

401	   [VVC] includes support for spatial, SNR, and View scalability.
402	   Scalable video coding is widely considered to have technical benefits
403	   and enrich services for various video applications.  Until recently,
404	   however, the functionality has not been included in the main profiles
405	   of video codecs and not widely deployed due to additional costs.  In
406	   VVC, however, all those forms of scalability are supported natively
407	   through the signaling of the layer_id in the NAL unit header, the VPS
408	   which associates layers with given layer_ids to each other, reference
409	   picture selection, reference picture resampling for spatial
410	   scalability, and a number of other mechanisms not relevant for this
411	   memo.  Scalability support can be implemented in a single decoding
412	   "loop" and is widely considered a comparatively lightweight
413	   operation.

415	      Spatial Scalability

417	         With the existence of Reference Picture Resampling (RPR), in
418	         the "main" profile of VVC, the additional burden for
419	         scalability support is just a minor modification of the high-
420	         level syntax (HLS).  In technical aspects, the inter-layer
421	         prediction is employed in a scalable system to improve the
422	         coding efficiency of the enhancement layers.  In addition to
423	         the spatial and temporal motion-compensated predictions that
424	         are available in a single-layer codec, the inter-layer
425	         prediction in [VVC] uses the resampled video data of the
426	         reconstructed reference picture from a reference layer to
427	         predict the current enhancement layer.  Then, the resampling
428	         process for inter-layer prediction is performed at the block-
429	         level, without modifying the existing interpolation process for
430	         motion compensation compared to non-scalable RPR.  It means
431	         that no additional resampling process is needed to support
432	         scalability.

434	      SNR Scalability

436	         SNR scalability is similar to Spatial Scalability except that
437	         the resampling factors are 1:1; in other words, there is no
438	         change in resolution, but there is inter-layer prediction.

440	   SEI Messages

442	   Supplemental Enhancement Information (SEI) messages are codepoints
443	   in the bitstream that do not influence the decoding process as
444	   specified in the [VVC] spec, but address issues of representation/
445	   rendering of the decoded bitstream and label the bitstream for certain
446	   applications, among other similar tasks.  The overall concept of SEI
447	   messages and many of the messages themselves have been inherited from
448	   the H.264 and HEVC specs.  In the [VVC] environment, some of the SEI
449	   messages considered to be generally useful also in other video coding
450	   technologies have been moved out of the main specification into a
451	   companion document (TO DO: add reference once ITU designation is
452	   known).

454	1.1.3.  Parallel Processing Support (informative)

456	   Compared to HEVC, the [VVC] design to support parallelization offers
457	   numerous improvements.  Some of those improvements are still
458	   undergoing changes in JVET.  Information, to the extent relevant for
459	   this memo, will be added in future versions of this memo as the
460	   standardization in JVET progresses and the technology stabilizes.

462	      Editor notes: an update on sub-picture/slice/tile is needed
463	      following the new VVC draft

465	1.1.4.  NAL Unit Header

467	   [VVC] maintains the NAL unit concept of HEVC with modifications.  VVC
468	   uses a two-byte NAL unit header, as shown in Figure 1.  The payload
469	   of a NAL unit refers to the NAL unit excluding the NAL unit header.

471	                     +---------------+---------------+
472	                     |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
473	                     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
474	                     |F|Z| LayerId   |  Type   | TID |
475	                     +---------------+---------------+

477	                   The Structure of the VVC NAL Unit Header.

479	                                 Figure 1

481	   The semantics of the fields in the NAL unit header are as specified
482	   in [VVC] and described briefly below for convenience.  In addition to
483	   the name and size of each field, the corresponding syntax element
484	   name in [VVC] is also provided.

486	   F: 1 bit

488	      forbidden_zero_bit.  Required to be zero in VVC.  Note that the
489	      inclusion of this bit in the NAL unit header was to enable
490	      transport of [VVC] video over MPEG-2 transport systems (avoidance
491	      of start code emulations) [MPEG2S].  In the context of this memo
492	      the value 1 may be used to indicate a syntax violation, e.g., for
493	      a NAL unit resulting from aggregating a number of fragmented units
494	      of a NAL unit but missing the last fragment, as described in
495	      Section TBD.

497	   Z: 1 bit

499	      nuh_reserved_zero_bit.  Required to be zero in VVC, and reserved
500	      for future extensions by ITU-T and ISO/IEC.
501	      This memo does not overload the "Z" bit for local extensions, as
502	      a) overloading the "F" bit is sufficient and b) to preserve the
503	      usefulness of this memo to possible future versions of [VVC].

505	   LayerId: 6 bits

507	      nuh_layer_id.  Identifies the layer a NAL unit belongs to, wherein
508	      a layer may be, e.g., a spatial scalable layer, a quality scalable
509	      layer.

511	   Type: 5 bits

513	      nal_unit_type.  This field specifies the NAL unit type as defined
514	      in Table 7-1 of VVC.  For a reference of all currently defined NAL
515	      unit types and their semantics, please refer to Section 7.4.2.2 in
516	      [VVC].

518	   TID: 3 bits

520	      nuh_temporal_id_plus1.  This field specifies the temporal
521	      identifier of the NAL unit plus 1.  The value of TemporalId is
522	      equal to TID minus 1.  A TID value of 0 is illegal to ensure that
523	      there is at least one bit in the NAL unit header equal to 1, so as to
524	      enable independent considerations of start code emulations in the
525	      NAL unit header and in the NAL unit payload data.
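
   The field layout above can be illustrated with a short Python sketch
   that unpacks and repacks the two header bytes.  The names NalHeader,
   parse_nal_header, and pack_nal_header, as well as the sample byte
   values, are assumptions of this example; the bit positions follow
   Figure 1 of this memo.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    f: int         # forbidden_zero_bit, 1 bit
    z: int         # nuh_reserved_zero_bit, 1 bit
    layer_id: int  # nuh_layer_id, 6 bits
    nal_type: int  # nal_unit_type, 5 bits
    tid: int       # nuh_temporal_id_plus1, 3 bits


def parse_nal_header(data: bytes) -> NalHeader:
    """Split the first two bytes of a NAL unit into the header fields."""
    if len(data) < 2:
        raise ValueError("need at least two bytes for the NAL unit header")
    b0, b1 = data[0], data[1]
    header = NalHeader(b0 >> 7, (b0 >> 6) & 0x01, b0 & 0x3F, b1 >> 3, b1 & 0x07)
    if header.tid == 0:
        # TID 0 is illegal, so the header always contains at least one 1 bit.
        raise ValueError("TID value 0 is not allowed")
    return header


def pack_nal_header(header: NalHeader) -> bytes:
    """Inverse of parse_nal_header for in-range field values."""
    b0 = (header.f << 7) | (header.z << 6) | header.layer_id
    b1 = (header.nal_type << 3) | header.tid
    return bytes([b0, b1])


if __name__ == "__main__":
    hdr = parse_nal_header(bytes([0x05, 0x49]))
    print(hdr, "TemporalId =", hdr.tid - 1)
```

   Note that the sketch rejects a TID of 0 but deliberately does not
   reject F equal to 1, since this memo allows that value to mark a
   syntax violation.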

527	1.2.  Overview of the Payload Format

529	   This payload format defines the following processes required for
530	   transport of [VVC] coded data over RTP [RFC3550]:

532	   o  Usage of RTP header with this payload format

534	   o  Packetization of [VVC] coded NAL units into RTP packets using
535	      three types of payload structures: single NAL unit packet,
536	      aggregation packet, and fragmentation unit

538	   o  Transmission of [VVC] NAL units of the same bitstream within a
539	      single RTP stream

541	   o  Media type parameters to be used with the Session Description
542	      Protocol (SDP) [RFC4566]

544	   o  Frame-marking mapping [FrameMarking]

546	2.  Conventions

548	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
549	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
550	   "OPTIONAL" in this document are to be interpreted as described in BCP
551	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
552	   capitals, as shown above.

554	3.  Definitions and Abbreviations

556	3.1.  Definitions

558	   This document uses the terms and definitions of VVC.  Section 3.1.1
559	   lists relevant definitions from [VVC] for convenience.  Section 3.1.2
560	   provides definitions specific to this memo.

562	3.1.1.  Definitions from the VVC Specification

564	      Editor notes:

566	   Access unit (AU): A set of PUs that belong to different layers and
567	   contain coded pictures associated with the same time for output from
568	   the DPB.

570	   Adaptation parameter set (APS): A syntax structure containing syntax
571	   elements that apply to zero or more slices as determined by zero or
572	   more syntax elements found in slice headers.

574	   Bitstream: A sequence of bits, in the form of a NAL unit stream or a
575	   byte stream, that forms the representation of a sequence of AUs
576	   forming one or more coded video sequences (CVSs).

578	   Coded picture: A coded representation of a picture comprising VCL NAL
579	   units with a particular value of nuh_layer_id within an AU and
580	   containing all CTUs of the picture.

582	   Clean random access (CRA) PU: A PU in which the coded picture is a
583	   CRA picture.

585	   Clean random access (CRA) picture: An IRAP picture for which each VCL
586	   NAL unit has nal_unit_type equal to CRA_NUT.

588	   Coded video sequence (CVS): A sequence of AUs that consists, in
589	   decoding order, of a CVSS AU, followed by zero or more AUs that are
590	   not CVSS AUs, including all subsequent AUs up to but not including
591	   any subsequent AU that is a CVSS AU.

593	   Coded video sequence start (CVSS) AU: An AU in which there is a PU
594	   for each layer in the CVS and the coded picture in each PU is a CLVSS
595	   picture.

597	   Coded layer video sequence (CLVS): A sequence of PUs with the same
598	   value of nuh_layer_id that consists, in decoding order, of a CLVSS
599	   PU, followed by zero or more PUs that are not CLVSS PUs, including
600	   all subsequent PUs up to but not including any subsequent PU that is
601	   a CLVSS PU.

603	   Coded layer video sequence start (CLVSS) PU: A PU in which the coded
604	   picture is a CLVSS picture.

606	   Coded layer video sequence start (CLVSS) picture: A coded picture
607	   that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or
608	   a GDR picture with NoOutputBeforeRecoveryFlag equal to 1.

610	   Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs
611	   of chroma samples of a picture that has three sample arrays, or a CTB
612	   of samples of a monochrome picture or a picture that is coded using
613	   three separate colour planes and syntax structures used to code the
614	   samples.

616	   Decoding Capability Information (DCI): A syntax structure containing
617	   syntax elements that apply to the entire bitstream.

619	   Decoded picture buffer (DPB): A buffer holding decoded pictures for
620	   reference, output reordering, or output delay specified for the
621	   hypothetical reference decoder.

623	   Gradual decoding refresh (GDR) picture: A picture for which each VCL
624	   NAL unit has nal_unit_type equal to GDR_NUT.

626	   Instantaneous decoding refresh (IDR) PU: A PU in which the coded
627	   picture is an IDR picture.

629	   Instantaneous decoding refresh (IDR) picture: An IRAP picture for
630	   which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or
631	   IDR_N_LP.

633	   Intra random access point (IRAP) AU: An AU in which there is a PU for
634	   each layer in the CVS and the coded picture in each PU is an IRAP
635	   picture.

637	   Intra random access point (IRAP) PU: A PU in which the coded picture
638	   is an IRAP picture.

640	   Intra random access point (IRAP) picture: A coded picture for which
641	   all VCL NAL units have the same value of nal_unit_type in the range
642	   of IDR_W_RADL to CRA_NUT, inclusive.

644	   Layer: A set of VCL NAL units that all have a particular value of
645	   nuh_layer_id and the associated non-VCL NAL units.

647	   Network abstraction layer (NAL) unit: A syntax structure containing
648	   an indication of the type of data to follow and bytes containing that
649	   data in the form of an RBSP interspersed as necessary with emulation
650	   prevention bytes.

652	   Network abstraction layer (NAL) unit stream: A sequence of NAL units.

654	   Operation point (OP): A temporal subset of an OLS, identified by an
655	   OLS index and a highest value of TemporalId.

657	   Picture parameter set (PPS): A syntax structure containing syntax
658	   elements that apply to zero or more entire coded pictures as
659	   determined by a syntax element found in each slice header.

661	   Picture unit (PU): A set of NAL units that are associated with each
662	   other according to a specified classification rule, are consecutive
663	   in decoding order, and contain exactly one coded picture.

665	   Random access: The act of starting the decoding process for a
666	   bitstream at a point other than the beginning of the stream.

668	   Sequence parameter set (SPS): A syntax structure containing syntax
669	   elements that apply to zero or more entire CLVSs as determined by the
670	   content of a syntax element found in the PPS referred to by a syntax
671	   element found in each picture header.

673	   Slice: An integer number of complete tiles or an integer number of
674	   consecutive complete CTU rows within a tile of a picture that are
675	   exclusively contained in a single NAL unit.

677	   Sub-layer: A temporal scalable layer of a temporal scalable bitstream
678	   consisting of VCL NAL units with a particular value of the TemporalId
679	   variable, and the associated non-VCL NAL units.

681	   Subpicture: A rectangular region of one or more slices within a
682	   picture.

684	   Sub-layer representation: A subset of the bitstream consisting of NAL
685	   units of a particular sub-layer and the lower sub-layers.

687	   Tile: A rectangular region of CTUs within a particular tile column
688	   and a particular tile row in a picture.

690	   Tile column: A rectangular region of CTUs having a height equal to
691	   the height of the picture and a width specified by syntax elements in
692	   the picture parameter set.

694	   Tile row: A rectangular region of CTUs having a height specified by
695	   syntax elements in the picture parameter set and a width equal to the
696	   width of the picture.

698	   Video coding layer (VCL) NAL unit: A collective term for coded slice
699	   NAL units and the subset of NAL units that have reserved values of
700	   nal_unit_type that are classified as VCL NAL units in this
701	   Specification.

703	3.1.2.  Definitions Specific to This Memo

705	   Media-Aware Network Element (MANE): A network element, such as a
706	   middlebox, selective forwarding unit, or application-layer gateway
707	   that is capable of parsing certain aspects of the RTP payload headers
708	   or the RTP payload and reacting to their contents.

710	      Editor Notes: the following informative note needs to be updated
711	      along with the frame marking update

713	      Informative note: The concept of a MANE goes beyond normal routers
714	      or gateways in that a MANE has to be aware of the signaling (e.g.,
715	      to learn about the payload type mappings of the media streams),
716	      and in that it has to be trusted when working with Secure RTP
717	      (SRTP).  The advantage of using MANEs is that they allow packets
718	      to be dropped according to the needs of the media coding.  For
719	      example, if a MANE has to drop packets due to congestion on a
720	      certain link, it can identify and remove those packets whose
721	      elimination produces the least adverse effect on the user
722	      experience.  After dropping packets, MANEs must rewrite RTCP
723	      packets to match the changes to the RTP stream, as specified in
724	      Section 7 of [RFC3550].

726	   NAL unit decoding order: A NAL unit order that conforms to the
727	   constraints on NAL unit order given in Section 7.4.2.4 in [VVC] and
728	   follows the order of NAL units in the bitstream.

730	   NAL unit output order: A NAL unit order in which NAL units of
731	   different access units are in the output order of the decoded
732	   pictures corresponding to the access units, as specified in [VVC],
733	   and in which NAL units within an access unit are in their decoding
734	   order.

736	   RTP stream: See [RFC7656].  Within the scope of this memo, one RTP
737	   stream is utilized to transport one or more temporal sub-layers.

739	   Transmission order: The order of packets in ascending RTP sequence
740	   number order (in modulo arithmetic).  Within an aggregation packet,
741	   the NAL unit transmission order is the same as the order of
742	   appearance of NAL units in the packet.
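
   Informative sketch (not part of this specification): "ascending RTP
   sequence number order (in modulo arithmetic)" can be tested with a
   16-bit wraparound comparison.  The sketch assumes that the two
   sequence numbers being compared are less than 2^15 apart, a common
   convention that this memo does not itself state.  The function name
   seq_precedes is hypothetical.

```python
def seq_precedes(a: int, b: int) -> bool:
    """Return True if 16-bit sequence number a comes before b.

    Assumption: a and b are less than 2**15 apart, so the shorter way
    around the 16-bit circle gives the intended order.
    """
    a &= 0xFFFF
    b &= 0xFFFF
    return a != b and ((b - a) & 0xFFFF) < 0x8000
```

   With this comparison, 65535 precedes 0, which is the behavior needed
   when the sequence number wraps.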

744	3.2.  Abbreviations

746	   AU         Access Unit

748	   AP         Aggregation Packet

750	   CTU        Coding Tree Unit

751	   CVS        Coded Video Sequence

753	   DPB        Decoded Picture Buffer

755	   DCI        Decoding Capability Information

757	   DON        Decoding Order Number

759	   FIR        Full Intra Request

761	   FU         Fragmentation Unit

763	   HRD        Hypothetical Reference Decoder

765	   IDR        Instantaneous Decoding Refresh

767	   MANE       Media-Aware Network Element

769	   MTU        Maximum Transfer Unit

771	   NAL        Network Abstraction Layer

773	   NALU       Network Abstraction Layer Unit

775	   PLI        Picture Loss Indication

777	   PPS        Picture Parameter Set

779	   RPS        Reference Picture Set

781	   RPSI       Reference Picture Selection Indication

783	   SEI        Supplemental Enhancement Information

785	   SLI        Slice Loss Indication

787	   SPS        Sequence Parameter Set

789	   VCL        Video Coding Layer

791	   VPS        Video Parameter Set

793	4.  RTP Payload Format

794	4.1.  RTP Header Usage

796	   The format of the RTP header is specified in [RFC3550] (reprinted as
797	   Figure 2 for convenience).  This payload format uses the fields of
798	   the header in a manner consistent with that specification.

800	   The RTP payload (and the settings for some RTP header bits) for
801	   aggregation packets and fragmentation units are specified in
802	   Section 4.3.2 and Section 4.3.3, respectively.

804	       0                   1                   2                   3
805	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
806	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
807	      |V=2|P|X|  CC   |M|     PT      |       sequence number         |
808	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
809	      |                           timestamp                           |
810	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
811	      |           synchronization source (SSRC) identifier            |
812	      +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
813	      |            contributing source (CSRC) identifiers             |
814	      |                             ....                              |
815	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

817	                        RTP Header According to [RFC3550]

819	                                 Figure 2

821	   The RTP header information to be set according to this RTP payload
822	   format is set as follows:

824	   Marker bit (M): 1 bit

826	      Set for the last packet of the access unit, carried in the current
827	      RTP stream.  This is in line with the normal use of the M bit in
828	      video formats to allow efficient playout buffer handling.

830	         Editor notes: The informative note below needs updating once
831	         the NAL unit type table is stable in the [VVC] spec.

833	         Informative note: The content of a NAL unit does not tell
834	         whether or not the NAL unit is the last NAL unit, in decoding
835	         order, of an access unit.  An RTP sender implementation may
836	         obtain this information from the video encoder.  If, however,
837	         the implementation cannot obtain this information directly from
838	         the encoder, e.g., when the bitstream was pre-encoded, and also
839	         there is no timestamp allocated for each NAL unit, then the
840	         sender implementation can inspect subsequent NAL units in
841	         decoding order to determine whether or not the NAL unit is the
842	         last NAL unit of an access unit as follows.  A NAL unit is
843	         determined to be the last NAL unit of an access unit if it is
844	         the last NAL unit of the bitstream.  A NAL unit naluX is also
845	         determined to be the last NAL unit of an access unit if both
846	         the following conditions are true: 1) the next VCL NAL unit
847	         naluY in decoding order has the high-order bit of the first
848	         byte after its NAL unit header equal to 1 or nal_unit_type
849	         equal to 19, and 2) all NAL units between naluX and naluY, when
850	         present, have nal_unit_type in the range of 13 to 17, inclusive,
851	         equal to 20, equal to 23 or equal to 26.

853	   Payload Type (PT): 7 bits

855	      The assignment of an RTP payload type for this new packet format
856	      is outside the scope of this document and will not be specified
857	      here.  The assignment of a payload type has to be performed either
858	      through the profile used or in a dynamic way.

860	   Sequence Number (SN): 16 bits

862	      Set and used in accordance with [RFC3550].

864	   Timestamp: 32 bits

866	      The RTP timestamp is set to the sampling timestamp of the content.
867	      A 90 kHz clock rate MUST be used.  If the NAL unit has no timing
868	      properties of its own (e.g., parameter set and SEI NAL units), the
869	      RTP timestamp MUST be set to the RTP timestamp of the coded
870	      picture of the access unit in which the NAL unit (according to
871	      Annex D of VVC) is included.  Receivers MUST use the RTP timestamp
872	      for the display process, even when the bitstream contains picture
873	      timing SEI messages or decoding unit information SEI messages as
874	      specified in VVC.
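
   Informative sketch (not part of this specification): the mapping from
   a sampling instant to a 32-bit RTP timestamp at the mandated 90 kHz
   clock rate might look as follows.  The function name rtp_timestamp
   and the initial_offset parameter (standing in for the initial
   timestamp value that [RFC3550] leaves to the sender) are assumptions
   made for illustration.

```python
RTP_CLOCK_HZ = 90_000  # clock rate required by this payload format


def rtp_timestamp(sample_time_s: float, initial_offset: int = 0) -> int:
    """Convert a sampling time in seconds to a 32-bit RTP timestamp.

    initial_offset is a hypothetical parameter for the sender-chosen
    starting value; the result wraps modulo 2**32.
    """
    ticks = round(sample_time_s * RTP_CLOCK_HZ)
    return (initial_offset + ticks) % (1 << 32)
```

   All packets that carry NAL units of the same access unit, including
   parameter set and SEI NAL units, would reuse the value computed for
   the coded picture of that access unit.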

876	   Synchronization source (SSRC): 32 bits

878	      Used to identify the source of the RTP packets.  A single SSRC is
879	      used for all parts of a single bitstream.

881	4.2.  Payload Header Usage

883	   The first two bytes of the payload of an RTP packet are referred to
884	   as the payload header.  The payload header consists of the same
885	   fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown
886	   in Section 1.1.4, irrespective of the type of the payload structure.

888	   The TID value indicates (among other things) the relative importance
889	   of an RTP packet, for example, because NAL units belonging to higher
890	   temporal sub-layers are not used for the decoding of lower temporal
891	   sub-layers.  A lower value of TID indicates a higher importance.
892	   More-important NAL units MAY be better protected against transmission
893	   losses than less-important NAL units.
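
   Informative sketch (not part of this specification): a forwarding
   element could use the TID field of the payload header to decide
   which packets to keep when thinning temporal sub-layers.  The sketch
   assumes that TID occupies the three least significant bits of the
   second payload header byte; the function name keep_packet and the
   max_tid threshold are hypothetical.

```python
def keep_packet(rtp_payload: bytes, max_tid: int) -> bool:
    """Return True if the packet's TID does not exceed max_tid.

    Assumption: TID is the low three bits of the second byte of the
    payload header. max_tid is a locally chosen threshold.
    """
    if len(rtp_payload) < 2:
        raise ValueError("payload header is two bytes long")
    tid = rtp_payload[1] & 0x07
    return tid <= max_tid
```

   Because the payload header of an aggregation packet carries the
   lowest TID of the aggregated NAL units, a packet rejected by this
   test contains no NAL unit at or below the threshold.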

895	      For Discussion: quite possibly something similar can be said for
896	      the Layer_id in layered coding, but perhaps not in multiview
897	      coding.  (The relevant part of the spec is relatively new,
898	      therefore the soft language).  However, for serious layer pruning,
899	      interpretation of the VPS is required.  We can add language about
900	      the need for stateful interpretation of LayerID vis-a-vis
901	      stateless interpretation of TID later.

903	4.3.  Payload Structures

905	   Three different types of RTP packet payload structures are specified.
906	   A receiver can identify the type of an RTP packet payload through the
907	   Type field in the payload header.

909	   The three different payload structures are as follows:

911	   o  Single NAL unit packet: Contains a single NAL unit in the payload,
912	      and the NAL unit header of the NAL unit also serves as the payload
913	      header.  This payload structure is specified in Section 4.3.1.

915	   o  Aggregation Packet (AP): Contains more than one NAL unit within
916	      one access unit.  This payload structure is specified in
917	      Section 4.3.2.

919	   o  Fragmentation Unit (FU): Contains a subset of a single NAL unit.
920	      This payload structure is specified in Section 4.3.3.
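
   Informative sketch (not part of this specification): a receiver can
   dispatch on the Type field of the payload header.  The Type values
   28 and 29 are the ones that Sections 4.3.2 and 4.3.3 of this memo
   assign to aggregation packets and fragmentation units.  The sketch
   assumes that Type occupies the five most significant bits of the
   second payload header byte; the function name payload_structure is
   hypothetical.

```python
AP_TYPE = 28  # aggregation packet
FU_TYPE = 29  # fragmentation unit


def payload_structure(rtp_payload: bytes) -> str:
    """Classify an RTP payload by the Type field of its payload header."""
    if len(rtp_payload) < 2:
        raise ValueError("payload header is two bytes long")
    nal_type = rtp_payload[1] >> 3
    if nal_type == AP_TYPE:
        return "aggregation packet"
    if nal_type == FU_TYPE:
        return "fragmentation unit"
    return "single NAL unit packet"
```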

922	4.3.1.  Single NAL Unit Packets

924	      Editor notes: it is better to add a section to describe DONL and
925	      sprop-max-don-diff

927	   A single NAL unit packet contains exactly one NAL unit, and consists
928	   of a payload header (denoted as PayloadHdr), a conditional 16-bit
929	   DONL field (in network byte order), and the NAL unit payload data
930	   (the NAL unit excluding its NAL unit header) of the contained NAL
931	   unit, as shown in Figure 3.

933	      0                   1                   2                   3
934	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
935	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
936	     |           PayloadHdr          |      DONL (conditional)       |
937	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
938	     |                                                               |
939	     |                  NAL unit payload data                        |
940	     |                                                               |
941	     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
942	     |                               :...OPTIONAL RTP padding        |
943	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

945	                  The Structure of a Single NAL Unit Packet

947	                                 Figure 3

949	   The DONL field, when present, specifies the value of the 16 least
950	   significant bits of the decoding order number of the contained NAL
951	   unit.  If sprop-max-don-diff is greater than 0 for any of the RTP
952	   streams, the DONL field MUST be present, and the variable DON for the
953	   contained NAL unit is derived as equal to the value of the DONL
954	   field.  Otherwise (sprop-max-don-diff is equal to 0 for all the RTP
955	   streams), the DONL field MUST NOT be present.
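
   Informative sketch (not part of this specification): building and
   parsing the payload of a single NAL unit packet, with the DONL field
   present only when sprop-max-don-diff is greater than 0.  The
   function names and the don argument are hypothetical; RTP header
   handling and RTP padding are out of scope for the sketch.

```python
import struct
from typing import Optional, Tuple


def build_single_nal_payload(nalu: bytes, sprop_max_don_diff: int,
                             don: int = 0) -> bytes:
    """PayloadHdr (the NAL unit header), optional DONL, NAL payload data."""
    payload_hdr, nal_payload = nalu[:2], nalu[2:]
    if sprop_max_don_diff > 0:
        # DONL carries the 16 least significant bits of the DON.
        donl = struct.pack("!H", don & 0xFFFF)
        return payload_hdr + donl + nal_payload
    return payload_hdr + nal_payload


def parse_single_nal_payload(payload: bytes, sprop_max_don_diff: int
                             ) -> Tuple[bytes, Optional[int]]:
    """Return (NAL unit, DONL or None) from a single NAL unit payload."""
    if sprop_max_don_diff > 0:
        if len(payload) < 4:
            raise ValueError("payload too short for PayloadHdr and DONL")
        (donl,) = struct.unpack_from("!H", payload, 2)
        return payload[:2] + payload[4:], donl
    if len(payload) < 2:
        raise ValueError("payload too short for PayloadHdr")
    return payload, None
```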

957	4.3.2.  Aggregation Packets (APs)

959	   Aggregation Packets (APs) can reduce packetization overhead for
960	   small NAL units, such as most of the non-VCL NAL units, which are
961	   often only a few octets in size.

963	   An AP aggregates NAL units of one access unit.  Each NAL unit to be
964	   carried in an AP is encapsulated in an aggregation unit.  NAL units
965	   aggregated in one AP are included in NAL unit decoding order.

967	   An AP consists of a payload header (denoted as PayloadHdr) followed
968	   by two or more aggregation units, as shown in Figure 4.

970	     0                   1                   2                   3
971	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
972	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
973	    |    PayloadHdr (Type=28)       |                               |
974	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
975	    |                                                               |
976	    |             two or more aggregation units                     |
977	    |                                                               |
978	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
979	    |                               :...OPTIONAL RTP padding        |
980	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

982	                   The Structure of an Aggregation Packet

984	                                 Figure 4

986	   The fields in the payload header of an AP are set as follows.  The F
987	   bit MUST be equal to 0 if the F bit of each aggregated NAL unit is
988	   equal to zero; otherwise, it MUST be equal to 1.  The Type field MUST
989	   be equal to 28.

991	   The value of LayerId MUST be equal to the lowest value of LayerId of
992	   all the aggregated NAL units.  The value of TID MUST be the lowest
993	   value of TID of all the aggregated NAL units.

995	      Informative note: All VCL NAL units in an AP have the same TID
996	      value since they belong to the same access unit.  However, an AP
997	      may contain non-VCL NAL units for which the TID value in the NAL
998	      unit header may be different than the TID value of the VCL NAL
999	      units in the same AP.
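
   Informative sketch (not part of this specification): deriving the
   payload header of an AP from the NAL units to be aggregated, using
   the F, LayerId, TID, and Type rules above.  The sketch assumes the
   MSB-first field layout of the two-byte NAL unit header and leaves
   the Z bit at zero; the function name ap_payload_header is
   hypothetical.

```python
from typing import Sequence

AP_TYPE = 28


def ap_payload_header(nalus: Sequence[bytes]) -> bytes:
    """Compute the two-byte PayloadHdr for an AP carrying nalus."""
    if len(nalus) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    # F is 0 only if the F bit of every aggregated NAL unit is 0.
    f = 1 if any(n[0] >> 7 for n in nalus) else 0
    # LayerId and TID are the lowest values among the aggregated units.
    layer_id = min(n[0] & 0x3F for n in nalus)
    tid = min(n[1] & 0x07 for n in nalus)
    return bytes([(f << 7) | layer_id, (AP_TYPE << 3) | tid])
```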

1001	   An AP MUST carry at least two aggregation units and can carry as many
1002	   aggregation units as necessary; however, the total amount of data in
1003	   an AP obviously MUST fit into an IP packet, and the size SHOULD be
1004	   chosen so that the resulting IP packet is smaller than the MTU size
1005	   so as to avoid IP layer fragmentation.  An AP MUST NOT contain FUs
1006	   specified in Section 4.3.3.  APs MUST NOT be nested; i.e., an AP
1007	   cannot contain another AP.

1009	   The first aggregation unit in an AP consists of a conditional 16-bit
1010	   DONL field (in network byte order) followed by a 16-bit unsigned size
1011	   information (in network byte order) that indicates the size of the
1012	   NAL unit in bytes (excluding these two octets, but including the NAL
1013	   unit header), followed by the NAL unit itself, including its NAL unit
1014	   header, as shown in Figure 5.

1016	     0                   1                   2                   3
1017	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1018	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1019	    |               :       DONL (conditional)      |   NALU size   |
1020	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1021	    |   NALU size   |                                               |
1022	    +-+-+-+-+-+-+-+-+         NAL unit                              |
1023	    |                                                               |
1024	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1025	    |                               :
1026	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1028	           The Structure of the First Aggregation Unit in an AP

1030	                                 Figure 5

1032	   The DONL field, when present, specifies the value of the 16 least
1033	   significant bits of the decoding order number of the aggregated NAL
1034	   unit.

1036	   If sprop-max-don-diff is greater than 0 for any of the RTP streams,
1037	   the DONL field MUST be present in an aggregation unit that is the
1038	   first aggregation unit in an AP, and the variable DON for the
1039	   aggregated NAL unit is derived as equal to the value of the DONL
1040	   field.  Otherwise (sprop-max-don-diff is equal to 0 for all the RTP
1041	   streams), the DONL field MUST NOT be present in an aggregation unit
1042	   that is the first aggregation unit in an AP.

1044	   An aggregation unit that is not the first aggregation unit in an AP
1045	   consists of a 16-bit unsigned size information (in network byte
1046	   order) that indicates the size of the NAL unit in bytes (excluding
1047	   these two octets, but including the NAL unit header), followed by
1048	   the NAL unit itself, including its NAL unit header, as shown in
1049	   Figure 6.

1051	     0                   1                   2                   3
1052	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1053	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1054	    |               :       NALU size               |   NAL unit    |
1055	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
1056	    |                                                               |
1057	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1058	    |                               :
1059	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1061	         The Structure of an Aggregation Unit That Is Not the First
1062	                          Aggregation Unit in an AP

1064	                                 Figure 6

1066	   Figure 7 presents an example of an AP that contains two aggregation
1067	   units, labeled as 1 and 2 in the figure, without the DONL field being
1068	   present.

1070	     0                   1                   2                   3
1071	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1072	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1073	    |                          RTP Header                           |
1074	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1075	    |   PayloadHdr (Type=28)        |         NALU 1 Size           |
1076	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1077	    |          NALU 1 HDR           |                               |
1078	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
1079	    |                   . . .                                       |
1080	    |                                                               |
1081	    +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1082	    |  . . .        | NALU 2 Size                   | NALU 2 HDR    |
1083	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1084	    | NALU 2 HDR    |                                               |
1085	    +-+-+-+-+-+-+-+-+              NALU 2 Data                      |
1086	    |                   . . .                                       |
1087	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1088	    |                               :...OPTIONAL RTP padding        |
1089	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1091	                   An Example of an AP Containing
1092	             Two Aggregation Units without the DONL Field

1094	                                 Figure 7

1096	   Figure 8 presents an example of an AP that contains two aggregation
1097	   units, labeled as 1 and 2 in the figure, with the DONL field being
1098	   present.

1100	     0                   1                   2                   3
1101	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1102	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1103	    |                          RTP Header                           |
1104	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1105	    |   PayloadHdr (Type=28)        |        NALU 1 DONL            |
1106	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1107	    |          NALU 1 Size          |            NALU 1 HDR         |
1108	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1109	    |                                                               |
1110	    |                 NALU 1 Data   . . .                           |
1111	    |                                                               |
1112	    +        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1113	    |                               :          NALU 2 Size          |
1114	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1115	    |          NALU 2 HDR           |                               |
1116	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
1117	    |                                                               |
1118	    |        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1119	    |                               :...OPTIONAL RTP padding        |
1120	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1122	                   An Example of an AP Containing
1123	                 Two Aggregation Units with the DONL Field

1125	                                 Figure 8
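
   Informative sketch (not part of this specification): serializing and
   parsing the aggregation units that follow the payload header of an
   AP.  Only the first aggregation unit carries a DONL field, and only
   when sprop-max-don-diff is greater than 0.  The sketch operates on
   the bytes after the two-byte PayloadHdr and assumes that any RTP
   padding has already been removed; all function names are
   hypothetical.

```python
import struct
from typing import List, Optional, Sequence, Tuple


def build_aggregation_units(nalus: Sequence[bytes],
                            first_don: Optional[int] = None) -> bytes:
    """Serialize aggregation units; first_don=None means no DONL."""
    if len(nalus) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    out = bytearray()
    for index, nalu in enumerate(nalus):
        if len(nalu) > 0xFFFF:
            raise ValueError("NAL unit too large for a 16-bit size field")
        if index == 0 and first_don is not None:
            out += struct.pack("!H", first_don & 0xFFFF)
        # Size counts the NAL unit including its NAL unit header.
        out += struct.pack("!H", len(nalu)) + nalu
    return bytes(out)


def parse_aggregation_units(body: bytes, has_donl: bool
                            ) -> Tuple[Optional[int], List[bytes]]:
    """Return (DONL of the first unit or None, list of NAL units)."""
    pos, donl, nalus = 0, None, []
    if has_donl:
        if len(body) < 2:
            raise ValueError("truncated DONL field")
        (donl,) = struct.unpack_from("!H", body, 0)
        pos = 2
    while pos < len(body):
        if pos + 2 > len(body):
            raise ValueError("truncated NALU size field")
        (size,) = struct.unpack_from("!H", body, pos)
        pos += 2
        if pos + size > len(body):
            raise ValueError("NALU size exceeds remaining payload")
        nalus.append(body[pos:pos + size])
        pos += size
    if len(nalus) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    return donl, nalus
```

   A receiver would choose has_donl from the signaled value of
   sprop-max-don-diff, not from the packet contents.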

1127	4.3.3.  Fragmentation Units

1129	   Fragmentation Units (FUs) are introduced to enable fragmenting a
1130	   single NAL unit into multiple RTP packets, possibly without
1131	   cooperation or knowledge of the [VVC] encoder.  A fragment of a NAL
1132	   unit consists of an integer number of consecutive octets of that NAL
1133	   unit.  Fragments of the same NAL unit MUST be sent in consecutive
1134	   order with ascending RTP sequence numbers (with no other RTP packets
1135	   within the same RTP stream being sent between the first and last
1136	   fragment).

1138	   When a NAL unit is fragmented and conveyed within FUs, it is referred
1139	   to as a fragmented NAL unit.  APs MUST NOT be fragmented.  FUs MUST
1140	   NOT be nested; i.e., an FU cannot contain a subset of another FU.

1142	   The RTP timestamp of an RTP packet carrying an FU is set to the NALU-
1143	   time of the fragmented NAL unit.

1145	   An FU consists of a payload header (denoted as PayloadHdr), an FU
1146	   header of one octet, a conditional 16-bit DONL field (in network byte
1147	   order), and an FU payload, as shown in Figure 9.

1149	     0                   1                   2                   3
1150	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1151	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1152	    |    PayloadHdr (Type=29)       |   FU header   | DONL (cond)   |
1153	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1154	    | DONL (cond)   |                                               |
1155	    +-+-+-+-+-+-+-+-+                                               |
1156	    |                         FU payload                            |
1157	    |                                                               |
1158	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1159	    |                               :...OPTIONAL RTP padding        |
1160	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1162	                          The Structure of an FU

1164	                                 Figure 9

1166	   The fields in the payload header are set as follows.  The Type field
1167	   MUST be equal to 29.  The fields F, LayerId, and TID MUST be equal to
1168	   the fields F, LayerId, and TID, respectively, of the fragmented NAL
1169	   unit.

1171	   The FU header consists of an S bit, an E bit, an R bit and a 5-bit
1172	   FuType field, as shown in Figure 10.

1174	                             +---------------+
1175	                             |0|1|2|3|4|5|6|7|
1176	                             +-+-+-+-+-+-+-+-+
1177	                             |S|E|R|  FuType |
1178	                             +---------------+

1180	                         The Structure of FU Header

1182	                                 Figure 10

1184	   The semantics of the FU header fields are as follows:

1186	   S: 1 bit

1188	      When set to 1, the S bit indicates the start of a fragmented NAL
1189	      unit, i.e., the first byte of the FU payload is also the first
1190	      byte of the payload of the fragmented NAL unit.  When the FU
1191	      payload is not the start of the fragmented NAL unit payload, the S
1192	      bit MUST be set to 0.

1194	   E: 1 bit
1195	      When set to 1, the E bit indicates the end of a fragmented NAL
1196	      unit, i.e., the last byte of the payload is also the last byte of
1197	      the fragmented NAL unit.  When the FU payload is not the last
1198	      fragment of a fragmented NAL unit, the E bit MUST be set to 0.

1200	   Reserved: 1 bit

1202	      Placeholder

1204	   FuType: 5 bits

1206	      The field FuType MUST be equal to the field Type of the fragmented
1207	      NAL unit.

1209	   The DONL field, when present, specifies the value of the 16 least
1210	   significant bits of the decoding order number of the fragmented NAL
1211	   unit.

1213	   If sprop-max-don-diff is greater than 0 for any of the RTP streams,
1214	   and the S bit is equal to 1, the DONL field MUST be present in the
1215	   FU, and the variable DON for the fragmented NAL unit is derived as
1216	   equal to the value of the DONL field.  Otherwise (sprop-max-don-diff
1217	   is equal to 0 for all the RTP streams, or the S bit is equal to 0),
1218	   the DONL field MUST NOT be present in the FU.

1220	   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
1221	   the Start bit and End bit must not both be set to 1 in the same FU
1222	   header.
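
   The FU header and DONL rules above can be illustrated with a short
   parsing sketch.  It is not part of the payload format.  The names
   parse_fu, FuFields, and don_in_use are hypothetical; don_in_use
   stands for "sprop-max-don-diff is greater than 0 for any of the RTP
   streams".  The input is assumed to be the FU with the 2-octet
   PayloadHdr already removed.

```python
from typing import NamedTuple, Optional


class FuFields(NamedTuple):
    start: bool           # S bit
    end: bool             # E bit
    fu_type: int          # 5-bit FuType (Type of the fragmented NAL unit)
    donl: Optional[int]   # 16-bit DONL, or None when the field is absent
    payload: bytes        # FU payload


def parse_fu(body: bytes, don_in_use: bool) -> FuFields:
    """Parse the part of an FU that follows the 2-octet PayloadHdr.

    don_in_use is a hypothetical flag: True when sprop-max-don-diff is
    greater than 0 for any of the RTP streams.
    """
    if len(body) < 1:
        raise ValueError("missing FU header")
    header = body[0]
    start = bool(header & 0x80)   # bit 0 of the octet (most significant)
    end = bool(header & 0x40)     # bit 1
    fu_type = header & 0x1F       # bits 3..7; the R bit (0x20) is ignored
    if start and end:
        raise ValueError("S and E must not both be set in one FU header")
    offset = 1
    donl = None
    if don_in_use and start:
        # DONL is present only in the first fragment, in network byte order.
        if len(body) < 3:
            raise ValueError("truncated DONL field")
        donl = int.from_bytes(body[1:3], "big")
        offset = 3
    payload = body[offset:]
    if not payload:
        raise ValueError("an FU payload must not be empty")
    return FuFields(start, end, fu_type, donl, payload)
```

   In this sketch, a first fragment carries a DONL value only when
   don_in_use is true; all other fragments yield donl equal to None.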

1224	   The FU payload consists of fragments of the payload of the fragmented
1225	   NAL unit so that if the FU payloads of consecutive FUs, starting with
1226	   an FU with the S bit equal to 1 and ending with an FU with the E bit
1227	   equal to 1, are sequentially concatenated, the payload of the
1228	   fragmented NAL unit can be reconstructed.  The NAL unit header of the
1229	   fragmented NAL unit is not included as such in the FU payload, but
1230	   rather the information of the NAL unit header of the fragmented NAL
1231	   unit is conveyed in F, LayerId, and TID fields of the FU payload
1232	   headers of the FUs and the FuType field of the FU header of the FUs.
1233	   An FU payload MUST NOT be empty.

1235	   If an FU is lost, the receiver SHOULD discard all following
1236	   fragmentation units in transmission order corresponding to the same
1237	   fragmented NAL unit, unless the decoder in the receiver is known to
1238	   be prepared to gracefully handle incomplete NAL units.

1240	   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
1241	   fragments of a NAL unit to an (incomplete) NAL unit, even if fragment
1242	   n of that NAL unit is not received.  In this case, the
1243	   forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a
1244	   syntax violation.
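
   The reassembly and loss-handling behavior described above can be
   sketched as follows.  This is an illustration under stated
   assumptions, not a normative algorithm.  The function name
   reassemble and the tuple shape (RTP sequence number, S bit, E bit,
   FU payload) are hypothetical, fragments are assumed to arrive in
   transmission order, and a gap in RTP sequence numbers is taken as
   the loss indicator.  The sketch returns only the NAL unit payload;
   rebuilding the NAL unit header from F, LayerId, TID, and FuType,
   and the optional forwarding of incomplete NAL units, are omitted.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical fragment shape: (RTP sequence number, S bit, E bit, FU payload)
Fragment = Tuple[int, bool, bool, bytes]


def reassemble(fragments: Iterable[Fragment]) -> List[bytes]:
    """Return the payloads of all completely received fragmented NAL units."""
    complete: List[bytes] = []
    parts: Optional[List[bytes]] = None   # None: not inside a usable NAL unit
    expected_seq: Optional[int] = None
    for seq, s_bit, e_bit, payload in fragments:
        if s_bit:
            # Start of a new fragmented NAL unit; drop any unfinished one.
            parts = [payload]
        elif parts is not None and seq == expected_seq:
            parts.append(payload)
        else:
            # A fragment was lost: discard the following fragments of
            # the same NAL unit until the next S bit is seen.
            parts = None
        expected_seq = (seq + 1) & 0xFFFF   # 16-bit sequence number wrap
        if e_bit and parts is not None:
            complete.append(b"".join(parts))
            parts = None
    return complete
```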

1246	4.4.  Decoding Order Number

1248	   For each NAL unit, the variable AbsDon is derived, representing the
1249	   decoding order number that is indicative of the NAL unit decoding
1250	   order.

1252	   Let NAL unit n be the n-th NAL unit in transmission order within an
1253	   RTP stream.

1255	   If sprop-max-don-diff is equal to 0 for all the RTP streams carrying
1256	   the [VVC] bitstream, AbsDon[n], the value of AbsDon for NAL unit n,
1257	   is derived as equal to n.

1259	   Otherwise (sprop-max-don-diff is greater than 0 for any of the RTP
1260	   streams), AbsDon[n] is derived as follows, where DON[n] is the value
1261	   of the variable DON for NAL unit n:

1263	   o  If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in
1264	      transmission order), AbsDon[0] is set equal to DON[0].

1266	   o  Otherwise (n is greater than 0), the following applies for
1267	      derivation of AbsDon[n]:

1269	         If DON[n] == DON[n-1],
1270	            AbsDon[n] = AbsDon[n-1]

1272	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768),
1273	            AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1]

1275	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768),
1276	            AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n]

1278	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768),
1279	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 -
1280	            DON[n])

1282	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768),
1283	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n])

1285	   For any two NAL units m and n, the following applies:

1287	   o  AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows
1288	      NAL unit m in NAL unit decoding order.

1290	   o  When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order
1291	      of the two NAL units can be in either order.

1293	   o  AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes
1294	      NAL unit m in decoding order.

1296	         Informative note: When two consecutive NAL units in the NAL
1297	         unit decoding order have different values of AbsDon, the
1298	         absolute difference between the two AbsDon values may be
1299	         greater than or equal to 1.

1301	         Informative note: There are multiple reasons to allow for the
1302	         absolute difference of the values of AbsDon for two consecutive
1303	         NAL units in the NAL unit decoding order to be greater than
1304	         one.  An increment by one is not required, as at the time of
1305	         associating values of AbsDon to NAL units, it may not be known
1306	         whether all NAL units are to be delivered to the receiver.  For
1307	         example, a gateway might not forward VCL NAL units of higher
1308	         sub-layers or some SEI NAL units when there is congestion in
1309	         the network.  In another example, the first intra-coded picture
1310	         of a pre-encoded clip is transmitted in advance to ensure that
1311	         it is readily available in the receiver, and when transmitting
1312	         the first intra-coded picture, the originator does not exactly
1313	         know how many NAL units will be encoded before the first intra-
1314	         coded picture of the pre-encoded clip follows in decoding
1315	         order.  Thus, the values of AbsDon for the NAL units of the
1316	         first intra-coded picture of the pre-encoded clip have to be
1317	         estimated when they are transmitted, and gaps in values of
1318	         AbsDon may occur.
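
   The AbsDon derivation in this section, for the case in which
   sprop-max-don-diff is greater than 0, can be sketched as follows.
   The function name abs_don_sequence is hypothetical, and the input
   is assumed to be the list of 16-bit DON values in transmission
   order.

```python
from typing import List


def abs_don_sequence(don: List[int]) -> List[int]:
    """Derive AbsDon[n] from DON[n] values given in transmission order."""
    if not don:
        return []
    abs_don = [don[0]]                       # AbsDon[0] = DON[0]
    for n in range(1, len(don)):
        cur, prev = don[n], don[n - 1]
        if cur == prev:
            delta = 0
        elif cur > prev and cur - prev < 32768:
            delta = cur - prev               # forward, no wrap
        elif cur < prev and prev - cur >= 32768:
            delta = 65536 - prev + cur       # forward across the 16-bit wrap
        elif cur > prev and cur - prev >= 32768:
            delta = -(prev + 65536 - cur)    # backward across the wrap
        else:                                # cur < prev, difference < 32768
            delta = -(prev - cur)            # backward, no wrap
        abs_don.append(abs_don[-1] + delta)
    return abs_don
```

   For example, the DON values 65534, 65535, 0, 2, 1 map to the AbsDon
   values 65534, 65535, 65536, 65538, 65537, so the wrap of the 16-bit
   DON does not disturb the derived decoding order.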

1320	5.  Packetization Rules

1322	   The following packetization rules apply:

1324	   o  If sprop-max-don-diff is greater than 0 for any of the RTP
1325	      streams, the transmission order of NAL units carried in the RTP
1326	      stream MAY be different than the NAL unit decoding order and the
1327	      NAL unit output order.

1329	   o  A NAL unit of a small size SHOULD be encapsulated in an
1330	      aggregation packet together with one or more other NAL units in order
1331	      to avoid the unnecessary packetization overhead for small NAL
1332	      units.  For example, non-VCL NAL units such as access unit
1333	      delimiters, parameter sets, or SEI NAL units are typically small
1334	      and can often be aggregated with VCL NAL units without violating
1335	      MTU size constraints.

1337	   o  Each non-VCL NAL unit SHOULD, when possible from an MTU size match
1338	      viewpoint, be encapsulated in an aggregation packet together with
1339	      its associated VCL NAL unit, as typically a non-VCL NAL unit would
1340	      be meaningless without the associated VCL NAL unit being
1341	      available.

1343	   o  For carrying exactly one NAL unit in an RTP packet, a single NAL
1344	      unit packet MUST be used.

1346	6.  De-packetization Process

1348	   The general concept behind de-packetization is to get the NAL units
1349	   out of the RTP packets in an RTP stream and pass them to the decoder
1350	   in the NAL unit decoding order.

1352	   The de-packetization process is implementation dependent.  Therefore,
1353	   the following description should be seen as an example of a suitable
1354	   implementation.  Other schemes may be used as well, as long as the
1355	   output for the same input is the same as the process described below.
1356	   The output is the same when the set of output NAL units and their
1357	   order are both identical.  Optimizations relative to the described
1358	   algorithms are possible.

1360	   All normal RTP mechanisms related to buffer management apply.  In
1361	   particular, duplicated or outdated RTP packets (as indicated by the
1362	   RTP sequence number and the RTP timestamp) are removed.  To
1363	   determine the exact time for decoding, factors such as a possible
1364	   intentional delay to allow for proper inter-stream synchronization
1365	   MUST be factored in.

1367	   NAL units with NAL unit type values in the range of 0 to 27,
1368	   inclusive, may be passed to the decoder.  NAL-unit-like structures
1369	   with NAL unit type values in the range of 28 to 31, inclusive, MUST
1370	   NOT be passed to the decoder.

1372	   The receiver includes a receiver buffer, which is used to compensate
1373	   for transmission delay jitter within individual RTP streams and
1374	   across RTP streams, to reorder NAL units from transmission order to
1375	   the NAL unit decoding order.  In this section, the receiver operation
1376	   is described under the assumption that there is no transmission delay
1377	   jitter within an RTP stream and across RTP streams.  To make a
1378	   difference from a practical receiver buffer that is also used for
1379	   compensation of transmission delay jitter, the receiver buffer is
1380	   hereafter called the de-packetization buffer in this section.
1381	   Receivers should also prepare for transmission delay jitter; that is,
1382	   either reserve separate buffers for transmission delay jitter
1383	   buffering and de-packetization buffering or use a receiver buffer for
1384	   both transmission delay jitter and de-packetization.  Moreover,
1385	   receivers should take transmission delay jitter into account in the
1386	   buffering operation, e.g., by additional initial buffering before
1387	   starting of decoding and playback.

1389	   When sprop-max-don-diff is equal to 0 for all the received RTP
1390	   streams, the de-packetization buffer size is zero bytes, and the
1391	   process described in the remainder of this paragraph applies.
1392	   The NAL units carried in the single RTP stream are directly passed to
1393	   the decoder in their transmission order, which is identical to their
1394	   decoding order.  When there are several NAL units of the same RTP
1395	   stream with the same NTP timestamp, the order to pass them to the
1396	   decoder is their transmission order.

1398	      Informative note: The mapping between RTP and NTP timestamps is
1399	      conveyed in RTCP SR packets.  In addition, the mechanisms for
1400	      faster media timestamp synchronization discussed in [RFC6051] may
1401	      be used to speed up the acquisition of the RTP-to-wall-clock
1402	      mapping.

1404	   When sprop-max-don-diff is greater than 0 for any of the received RTP
1405	   streams, the process described in the remainder of this section
1406	   applies.

1408	   There are two buffering states in the receiver: initial buffering and
1409	   buffering while playing.  Initial buffering starts when the reception
1410	   is initialized.  After initial buffering, decoding and playback are
1411	   started, and the buffering-while-playing mode is used.

1413	   Regardless of the buffering state, the receiver stores incoming NAL
1414	   units, in reception order, into the de-packetization buffer.  NAL
1415	   units carried in RTP packets are stored in the de-packetization
1416	   buffer individually, and the value of AbsDon is calculated and stored
1417	   for each NAL unit.

1419	   Initial buffering lasts until condition A (the difference between the
1420	   greatest and smallest AbsDon values of the NAL units in the de-
1421	   packetization buffer is greater than or equal to the value of sprop-
1422	   max-don-diff) or condition B (the number of NAL units in the de-
1423	   packetization buffer is greater than the value of sprop-depack-buf-
1424	   nalus) is true.

1426	   After initial buffering, whenever condition A or condition B is true,
1427	   the following operation is repeatedly applied until both condition A
1428	   and condition B become false:

1430	   o  The NAL unit in the de-packetization buffer with the smallest
1431	      value of AbsDon is removed from the de-packetization buffer and
1432	      passed to the decoder.

1434	   When no more NAL units are flowing into the de-packetization buffer,
1435	   all NAL units remaining in the de-packetization buffer are removed
1436	   from the buffer and passed to the decoder in the order of increasing
1437	   AbsDon values.
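
   The buffering process described above, for the case in which
   sprop-max-don-diff is greater than 0, can be sketched as follows.
   This is one possible illustration that ignores timing and jitter,
   not a required implementation.  The function name depacketize and
   its parameter names are hypothetical, and each input item is
   assumed to be a NAL unit paired with its already computed AbsDon
   value, in reception order.

```python
from typing import Iterable, Iterator, List, Tuple


def depacketize(
    nalus: Iterable[Tuple[int, bytes]],
    max_don_diff: int,        # value of sprop-max-don-diff (assumed > 0)
    depack_buf_nalus: int,    # value of sprop-depack-buf-nalus
) -> Iterator[bytes]:
    """Yield NAL units in the order they would be passed to the decoder."""
    buf: List[Tuple[int, bytes]] = []

    def condition_a() -> bool:
        values = [abs_don for abs_don, _ in buf]
        return max(values) - min(values) >= max_don_diff

    def condition_b() -> bool:
        return len(buf) > depack_buf_nalus

    for item in nalus:
        buf.append(item)      # store in reception order
        # Nothing is output until condition A or B first becomes true,
        # which corresponds to the end of initial buffering.
        while buf and (condition_a() or condition_b()):
            i = min(range(len(buf)), key=lambda k: buf[k][0])
            yield buf.pop(i)[1]   # smallest AbsDon goes to the decoder
    # No more NAL units are arriving: flush in increasing AbsDon order.
    for _, nalu in sorted(buf, key=lambda entry: entry[0]):
        yield nalu
```

   With sprop-max-don-diff equal to 3 and a large
   sprop-depack-buf-nalus, NAL units received with AbsDon values 2, 0,
   1, 4, 3, 6, 5 are passed on in the order 0, 1, 2, 3, 4, 5, 6.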

1439	7.  Payload Format Parameters

1441	   Placeholder

1443	8.  Use with Feedback Messages

1445	   The following subsections define the use of the Picture Loss
1446	   Indication (PLI), Slice Lost Indication (SLI), Reference Picture
1447	   Selection Indication (RPSI), and Full Intra Request (FIR) feedback
1448	   messages with [VVC].  The PLI, SLI, and RPSI messages are defined in
1449	   [RFC4585], and the FIR message is defined in [RFC5104].

1451	8.1.  Picture Loss Indication (PLI)

1453	   As specified in RFC 4585, Section 6.3.1, the reception of a PLI by a
1454	   media sender indicates "the loss of an undefined amount of coded
1455	   video data belonging to one or more pictures".  Without having any
1456	   specific knowledge of the setup of the bitstream (such as use and
1457	   location of in-band parameter sets, non-IRAP decoder refresh points,
1458	   picture structures, and so forth), a reaction to the reception of a
1459	   PLI by a [VVC] sender SHOULD be to send an IRAP picture and relevant
1460	   parameter sets, potentially with sufficient redundancy so as to ensure
1461	   correct reception.  However, sometimes information about the
1462	   bitstream structure is known.  For example, state could have been
1463	   established outside of the mechanisms defined in this document that
1464	   parameter sets are conveyed out of band only, and stay static for the
1465	   duration of the session.  In that case, it is obviously unnecessary
1466	   to send them in-band as a result of the reception of a PLI.  Other
1467	   examples could be devised based on a priori knowledge of different
1468	   aspects of the bitstream structure.  In all cases, the timing and
1469	   congestion control mechanisms of RFC 4585 MUST be observed.

1471	8.2.  Slice Loss Indication (SLI)

1473	   For further study.  Maybe remove as there are no known
1474	   implementations of SLI in [HEVC]-based systems.

1476	8.3.  Reference Picture Selection Indication (RPSI)

1478	   Feedback-based reference picture selection has been shown as a
1479	   powerful tool to stop temporal error propagation for improved error
1480	   resilience [Girod99] [Wang05].  In one approach, the decoder side
1481	   tracks errors in the decoded pictures and informs the encoder side
1482	   that a particular picture that has been decoded relatively earlier is
1483	   correct and still present in the decoded picture buffer; it requests
1484	   the encoder to use that correct picture-availability information when
1485	   encoding the next picture, so as to stop further temporal error
1486	   propagation.  For this approach, the decoder side should use the RPSI
1487	   feedback message.

1489	   Encoders can encode some long-term reference pictures as specified in
1490	   [VVC] for purposes described in the previous paragraph without the
1491	   need of a huge decoded picture buffer.  As shown in [Wang05], with a
1492	   flexible reference picture management scheme, as in VVC, even a
1493	   decoded picture buffer size of two picture storage buffers would work
1494	   for the approach described in the previous paragraph.

1496	   The text above is copy-paste from RFC 7798.  If we keep the RPSI
1497	   message, it needs adaptation to the [VVC] syntax.  Doing so shouldn't
1498	   be too hard as the [VVC] reference picture mechanism is not too
1499	   different from the [HEVC] one.

1501	8.4.  Full Intra Request (FIR)

1503	   The purpose of the FIR message is to force an encoder to send an
1504	   independent decoder refresh point as soon as possible, while
1505	   observing applicable congestion-control-related constraints, such as
1506	   those set out in [RFC8082].

1508	   Upon reception of a FIR, a sender MUST send an IDR picture.
1509	   Parameter sets MUST also be sent, except when there is a priori
1510	   knowledge that the parameter sets have been correctly established.  A
1511	   typical example for that is an understanding between sender and
1512	   receiver, established by means outside this document, that parameter
1513	   sets are exclusively sent out-of-band.

1515	9.  Frame marking

1517	      placeholder

1519	10.  Security Considerations

1521	   The scope of this Security Considerations section is limited to the
1522	   payload format itself and to one feature of [VVC] that may pose a
1523	   particularly serious security risk if implemented naively.  The
1524	   payload format, in isolation, does not form a complete system.
1525	   Implementers are advised to read and understand relevant security-
1526	   related documents, especially those pertaining to RTP (see the
1527	   Security Considerations section in [RFC3550]), and the security of
1528	   the call-control stack chosen (that may make use of the media type
1529	   registration of this memo).  Implementers should also consider known
1530	   security vulnerabilities of video coding and decoding implementations
1531	   in general and avoid those.

1533	   Within this RTP payload format, and with the exception of the user
1534	   data SEI message as described below, no security threats other than
1535	   those common to RTP payload formats are known.  In other words,
1536	   neither the various media-plane-based mechanisms, nor the signaling
1537	   part of this memo, seems to pose a security risk beyond those common
1538	   to all RTP-based systems.

1540	   RTP packets using the payload format defined in this specification
1541	   are subject to the security considerations discussed in the RTP
1542	   specification [RFC3550], and in any applicable RTP profile such as
1543	   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/
1544	   SAVPF [RFC5124].  However, as "Securing the RTP Framework: Why RTP
1545	   Does Not Mandate a Single Media Security Solution" [RFC7202]
1546	   discusses, it is not an RTP payload format's responsibility to
1547	   discuss or mandate what solutions are used to meet the basic security
1548	   goals like confidentiality, integrity and source authenticity for RTP
1549	   in general.  This responsibility lies with anyone using RTP in an
1550	   application.  They can find guidance on available security mechanisms
1551	   and important considerations in "Options for Securing RTP Sessions"
1552	   [RFC7201].  The rest of this section discusses the security-impacting
1553	   properties of the payload format itself.

1555	   Because the data compression used with this payload format is applied
1556	   end-to-end, any encryption needs to be performed after compression.
1557	   A potential denial-of-service threat exists for data encodings using
1558	   compression techniques that have non-uniform receiver-end
1559	   computational load.  The attacker can inject pathological datagrams
1560	   into the bitstream that are complex to decode and that cause the
1561	   receiver to be overloaded.  [VVC] is particularly vulnerable to such
1562	   attacks, as it is extremely simple to generate datagrams containing
1563	   NAL units that affect the decoding process of many future NAL units.
1564	   Therefore, the usage of data origin authentication and data integrity
1565	   protection of at least the RTP packet is RECOMMENDED, for example,
1566	   with SRTP [RFC3711].

1568	   Like HEVC [RFC7798], [VVC] includes a user data Supplemental
1569	   Enhancement Information (SEI) message.  This SEI message allows
1570	   inclusion of an arbitrary bitstring into the video bitstream.  Such a
1571	   bitstring could include JavaScript, machine code, and other active
1572	   content.  [VVC] leaves the handling of this SEI message to the
1573	   receiving system.  In order to avoid harmful side effects of the user
1574	   data SEI message, decoder implementations cannot naively trust its
1575	   content.  For example, it would be a bad and insecure implementation
1576	   practice to forward any JavaScript a decoder implementation detects
1577	   to a web browser.  The safest way to deal with user data SEI messages
1578	   is to simply discard them, but that can have negative side effects on
1579	   the quality of experience by the user.

1581	   End-to-end security with authentication, integrity, or
1582	   confidentiality protection will prevent a MANE from performing media-
1583	   aware operations other than discarding complete packets.  In the case
1584	   of confidentiality protection, it will even be prevented from
1585	   discarding packets in a media-aware way.  To be allowed to perform
1586	   such operations, a MANE is required to be a trusted entity that is
1587	   included in the security context establishment.

1589	11.  Congestion Control

1591	   Congestion control for RTP SHALL be used in accordance with RTP
1592	   [RFC3550] and with any applicable RTP profile, e.g., AVP [RFC3551].
1593	   If best-effort service is being used, an additional requirement is
1594	   that users of this payload format MUST monitor packet loss to ensure
1595	   that the packet loss rate is within an acceptable range.  Packet loss
1596	   is considered acceptable if a TCP flow across the same network path,
1597	   and experiencing the same network conditions, would achieve an
1598	   average throughput, measured on a reasonable timescale, that is not
1599	   less than all RTP streams combined are achieving.  This condition can
1600	   be satisfied by implementing congestion-control mechanisms to adapt
1601	   the transmission rate, the number of layers subscribed for a layered
1602	   multicast session, or by arranging for a receiver to leave the
1603	   session if the loss rate is unacceptably high.

1605	   The bitrate adaptation necessary for obeying the congestion control
1606	   principle is easily achievable when real-time encoding is used, for
1607	   example, by adequately tuning the quantization parameter.  However,
1608	   when pre-encoded content is being transmitted, bandwidth adaptation
1609	   requires the pre-coded bitstream to be tailored for such adaptivity.
1610	   The key mechanisms available in [VVC] are temporal scalability, and
1611	   spatial/SNR scalability.  A media sender can remove NAL units
1612	   belonging to higher temporal sub-layers (i.e., those NAL units with a
1613	   high value of TID) or higher spatio-SNR layers (as indicated by
1614	   interpreting the VPS) until the sending bitrate drops to an
1615	   acceptable range.

1617	   The mechanisms mentioned above generally work within a defined
1618	   profile and level and, therefore, no renegotiation of the channel is
1619	   required.  Only when non-downgradable parameters (such as profile)
1620	   are required to be changed does it become necessary to terminate and
1621	   restart the RTP stream(s).  This may be accomplished by using
1622	   different RTP payload types.

1624	   MANEs MAY remove certain unusable packets from the RTP stream when
1625	   that RTP stream was damaged due to previous packet losses.  This can
1626	   help reduce the network load in certain special cases.  For example,
1627	   MANEs can remove those FUs where the leading FUs belonging to the
1628	   same NAL unit have been lost or those dependent slice segments when
1629	   the leading slice segments belonging to the same slice have been
1630	   lost, because the trailing FUs or dependent slice segments are
1631	   meaningless to most decoders.  MANEs can also remove higher temporal
1632	   scalable layers if the outbound transmission (from the MANE's
1633	   viewpoint) experiences congestion.

1635	12.  IANA Considerations

1637	   Placeholder

1639	13.  Acknowledgements

1641	   Dr. Byeongdoo Choi is thanked for the video codec related technical
1642	   discussion and other aspects in this memo.  Xin Zhao and Dr. Xiang Li
1643	   are thanked for their contributions on [VVC] specification
1644	   descriptive content.  Spencer Dawkins is thanked for his valuable
1645	   review comments that led to great improvements of this memo.  Some
1646	   parts of this specification share text with the RTP payload format
1647	   for HEVC [RFC7798].  We thank the authors of that specification for
1648	   their excellent work.

1650	14.  References

1652	14.1.  Normative References

1654	   [H.266]    "ITU-T, Versatile Video Coding", n.d.

1656	   [ISO23090-3]
1657	              "ISO/IEC DIS Information technology --- Coded
1658	              representation of immersive media --- Part 3 Versatile
1659	              video coding", n.d.,
1660	              <https://www.iso.org/standard/73022.html>.

1662	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1663	              Requirement Levels", BCP 14, RFC 2119,
1664	              DOI 10.17487/RFC2119, March 1997,
1665	              <https://www.rfc-editor.org/info/rfc2119>.

1667	   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
1668	              Jacobson, "RTP: A Transport Protocol for Real-Time
1669	              Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
1670	              July 2003, <https://www.rfc-editor.org/info/rfc3550>.

1672	   [RFC3551]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
1673	              Video Conferences with Minimal Control", STD 65, RFC 3551,
1674	              DOI 10.17487/RFC3551, July 2003,
1675	              <https://www.rfc-editor.org/info/rfc3551>.

1677	   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
1678	              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
1679	              RFC 3711, DOI 10.17487/RFC3711, March 2004,
1680	              <https://www.rfc-editor.org/info/rfc3711>.

1682	   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
1683	              Description Protocol", RFC 4566, DOI 10.17487/RFC4566,
1684	              July 2006, <https://www.rfc-editor.org/info/rfc4566>.

1686	   [RFC4585]  Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey,
1687	              "Extended RTP Profile for Real-time Transport Control
1688	              Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585,
1689	              DOI 10.17487/RFC4585, July 2006,
1690	              <https://www.rfc-editor.org/info/rfc4585>.

1692	   [RFC5104]  Wenger, S., Chandra, U., Westerlund, M., and B. Burman,
1693	              "Codec Control Messages in the RTP Audio-Visual Profile
1694	              with Feedback (AVPF)", RFC 5104, DOI 10.17487/RFC5104,
1695	              February 2008, <https://www.rfc-editor.org/info/rfc5104>.

1697	   [RFC5124]  Ott, J. and E. Carrara, "Extended Secure RTP Profile for
1698	              Real-time Transport Control Protocol (RTCP)-Based Feedback
1699	              (RTP/SAVPF)", RFC 5124, DOI 10.17487/RFC5124, February
1700	              2008, <https://www.rfc-editor.org/info/rfc5124>.

1702	   [RFC7656]  Lennox, J., Gross, K., Nandakumar, S., Salgueiro, G., and
1703	              B. Burman, Ed., "A Taxonomy of Semantics and Mechanisms
1704	              for Real-Time Transport Protocol (RTP) Sources", RFC 7656,
1705	              DOI 10.17487/RFC7656, November 2015,
1706	              <https://www.rfc-editor.org/info/rfc7656>.

1708	   [RFC8082]  Wenger, S., Lennox, J., Burman, B., and M. Westerlund,
1709	              "Using Codec Control Messages in the RTP Audio-Visual
1710	              Profile with Feedback with Layered Codecs", RFC 8082,
1711	              DOI 10.17487/RFC8082, March 2017,
1712	              <https://www.rfc-editor.org/info/rfc8082>.

1714	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
1715	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
1716	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

1718	   [VVC]      "Versatile Video Coding (Draft 8), Joint Video Experts
1719	              Team (JVET)", January 2020.

1721	14.2.  Informative References

1723	   [CABAC]    Sole, J., et al., "Transform coefficient coding in
1724	              HEVC", IEEE Transactions on Circuits and Systems for Video
1725	              Technology, DOI 10.1109/TCSVT.2012.2223055, December
1726	              2012.

1728	   [FrameMarking]
1729	              Berger, E., Nandakumar, S., and M. Zanaty, "Frame
1730	              Marking RTP Header Extension", Work in Progress, draft-
1731	              berger-avtext-framemarking, 2015.

1733	   [Girod99]  Girod, B., et al., "Feedback-based error control for
1734	              mobile video transmission", Proceedings of the IEEE,
1735	              DOI 10.1109/5.790632, October 1999.

1737	   [HEVC]     "High efficiency video coding, ITU-T Recommendation
1738	              H.265", April 2013.

1740	   [MPEG2S]   ISO/IEC, "Information technology - Generic coding
1741	              of moving pictures and associated audio information - Part
1742	              1: Systems", ISO International Standard 13818-1, 2013.

1744	   [RFC6051]  Perkins, C. and T. Schierl, "Rapid Synchronisation of RTP
1745	              Flows", RFC 6051, DOI 10.17487/RFC6051, November 2010,
1746	              <https://www.rfc-editor.org/info/rfc6051>.

1748	   [RFC6184]  Wang, Y., Even, R., Kristensen, T., and R. Jesup, "RTP
1749	              Payload Format for H.264 Video", RFC 6184,
1750	              DOI 10.17487/RFC6184, May 2011,
1751	              <https://www.rfc-editor.org/info/rfc6184>.

1753	   [RFC6190]  Wenger, S., Wang, Y., Schierl, T., and A. Eleftheriadis,
1754	              "RTP Payload Format for Scalable Video Coding", RFC 6190,
1755	              DOI 10.17487/RFC6190, May 2011,
1756	              <https://www.rfc-editor.org/info/rfc6190>.

1758	   [RFC7201]  Westerlund, M. and C. Perkins, "Options for Securing RTP
1759	              Sessions", RFC 7201, DOI 10.17487/RFC7201, April 2014,
1760	              <https://www.rfc-editor.org/info/rfc7201>.

1762	   [RFC7202]  Perkins, C. and M. Westerlund, "Securing the RTP
1763	              Framework: Why RTP Does Not Mandate a Single Media
1764	              Security Solution", RFC 7202, DOI 10.17487/RFC7202, April
1765	              2014, <https://www.rfc-editor.org/info/rfc7202>.

1767	   [RFC7798]  Wang, Y., Sanchez, Y., Schierl, T., Wenger, S., and M.
1768	              Hannuksela, "RTP Payload Format for High Efficiency Video
1769	              Coding (HEVC)", RFC 7798, DOI 10.17487/RFC7798, March
1770	              2016, <https://www.rfc-editor.org/info/rfc7798>.

1772	   [Wang05]   Wang, Y.-K., Zhu, C., and H. Li, "Error resilient
1773	              video coding using flexible reference frames", Visual
1774	              Communications and Image Processing 2005 (VCIP 2005),
1775	              July 2005.

1777	Appendix A.  Change History

1779	   draft-zhao-payload-rtp-vvc-00 ........ initial version

1781	   draft-zhao-payload-rtp-vvc-01 ........ editorial clarifications and
1782	   corrections

1784	Authors' Addresses

1786	   Shuai Zhao
1787	   Tencent
1788	   2747 Park Blvd
1789	   Palo Alto  94588
1790	   USA

1792	   Email: shuai.zhao@ieee.org

1794	   Stephan Wenger
1795	   Tencent
1796	   2747 Park Blvd
1797	   Palo Alto  94588

1799	   Email: stewe@stewe.org

1801	   Yago Sanchez
1802	   Fraunhofer HHI
1803	   Einsteinufer 37
1804	   Berlin  10587
1805	   Germany

1807	   Email: yago.sanchez@hhi.fraunhofer.de