idnits 2.17.1 

draft-ietf-avtcore-rtp-vvc-03.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document date (October 27, 2020) is 1275 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 1269

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO23090-3'

  ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866)

  ** Downref: Normative reference to an Informational RFC: RFC 7656

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC'


     Summary: 2 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	avtcore                                                          S. Zhao
3	Internet-Draft                                                 S. Wenger
4	Intended status: Standards Track                                 Tencent
5	Expires: April 30, 2021                                       Y. Sanchez
6	                                                          Fraunhofer HHI
7	                                                        October 27, 2020

9	          RTP Payload Format for Versatile Video Coding (VVC)
10	                     draft-ietf-avtcore-rtp-vvc-03

12	Abstract

14	   This memo describes an RTP payload format for the video coding
15	   standard ITU-T Recommendation H.266 and ISO/IEC International
16	   Standard ISO23090-3, both also known as Versatile Video Coding (VVC)
17	   and developed by the Joint Video Experts Team (JVET).  The RTP
18	   payload format allows for packetization of one or more Network
19	   Abstraction Layer (NAL) units in each RTP packet payload as well as
20	   fragmentation of a NAL unit into multiple RTP packets.  The payload
21	   format has wide applicability in videoconferencing, Internet video
22	   streaming, and high-bitrate entertainment-quality video, among other
23	   applications.

25	Status of This Memo

27	   This Internet-Draft is submitted in full conformance with the
28	   provisions of BCP 78 and BCP 79.

30	   Internet-Drafts are working documents of the Internet Engineering
31	   Task Force (IETF).  Note that other groups may also distribute
32	   working documents as Internet-Drafts.  The list of current Internet-
33	   Drafts is at https://datatracker.ietf.org/drafts/current/.

35	   Internet-Drafts are draft documents valid for a maximum of six months
36	   and may be updated, replaced, or obsoleted by other documents at any
37	   time.  It is inappropriate to use Internet-Drafts as reference
38	   material or to cite them other than as "work in progress."

40	   This Internet-Draft will expire on April 30, 2021.

42	Copyright Notice

44	   Copyright (c) 2020 IETF Trust and the persons identified as the
45	   document authors.  All rights reserved.

47	   This document is subject to BCP 78 and the IETF Trust's Legal
48	   Provisions Relating to IETF Documents
49	   (https://trustee.ietf.org/license-info) in effect on the date of
50	   publication of this document.  Please review these documents
51	   carefully, as they describe your rights and restrictions with respect
52	   to this document.  Code Components extracted from this document must
53	   include Simplified BSD License text as described in Section 4.e of
54	   the Trust Legal Provisions and are provided without warranty as
55	   described in the Simplified BSD License.

57	Table of Contents

59	   1.  Introduction
60	     1.1.  Overview of the VVC Codec
61	       1.1.1.  Coding-Tool Features (informative)
62	       1.1.2.  Systems and Transport Interfaces
63	       1.1.3.  Parallel Processing Support (informative)
64	       1.1.4.  NAL Unit Header
65	     1.2.  Overview of the Payload Format
66	   2.  Conventions
67	   3.  Definitions and Abbreviations
68	     3.1.  Definitions
69	       3.1.1.  Definitions from the VVC Specification
70	       3.1.2.  Definitions Specific to This Memo
71	     3.2.  Abbreviations
72	   4.  RTP Payload Format
73	     4.1.  RTP Header Usage
74	     4.2.  Payload Header Usage
75	     4.3.  Payload Structures
76	       4.3.1.  Single NAL Unit Packets
77	       4.3.2.  Aggregation Packets (APs)
78	       4.3.3.  Fragmentation Units
79	     4.4.  Decoding Order Number
80	   5.  Packetization Rules
81	   6.  De-packetization Process
82	   7.  Payload Format Parameters
83	     7.1.  Media Type Registration
84	     7.2.  SDP Parameters
85	       7.2.1.  Mapping of Payload Type Parameters to SDP
86	       7.2.2.  Usage with SDP Offer/Answer Model
87	   8.  Use with Feedback Messages
88	     8.1.  Picture Loss Indication (PLI)
89	     8.2.  Slice Loss Indication (SLI)
90	     8.3.  Reference Picture Selection Indication (RPSI)
91	     8.4.  Full Intra Request (FIR)
92	   9.  Frame Marking
93	     9.1.  Frame Marking Short Extension
94	     9.2.  Frame Marking Long Extension
95	   10. Security Considerations
96	   11. Congestion Control
97	   12. IANA Considerations
98	   13. Acknowledgements
99	   14. References
100	     14.1.  Normative References
101	     14.2.  Informative References
102	   Appendix A.  Change History
103	   Authors' Addresses

105	1.  Introduction

107	   The Versatile Video Coding [VVC] specification, formally published as
108	   both ITU-T Recommendation H.266 and ISO/IEC International Standard
109	   23090-3 [ISO23090-3], is currently in the ITU-T publication process
110	   and the ISO/IEC approval process.  [H.266] is reported to provide
111	   significant coding efficiency gains over H.265 and earlier video
112	   codec formats.

114	   This memo specifies an RTP payload format for VVC.  It shares its
115	   basic design with the NAL (Network Abstraction Layer) unit-based RTP
116	   payload formats of H.264 Video Coding [RFC6184], Scalable Video
117	   Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798]
118	   and their respective predecessors.  With respect to design
119	   philosophy, security, congestion control, and overall implementation
120	   complexity, it has similar properties to those earlier payload format
121	   specifications.  This is a conscious choice, as at least RFC 6184 is
122	   widely deployed and generally known in the relevant implementer
123	   communities.  Certain mechanisms known from [RFC6190] were
124	   incorporated in VVC, as VVC version 1 supports temporal, spatial, and
125	   signal-to-noise ratio (SNR) scalability.

127	1.1.  Overview of the VVC Codec

129	   [VVC] and [HEVC] share a similar hybrid video codec design.  In this
130	   memo, we provide a very brief overview of those features of VVC that
131	   are, in some form, addressed by the payload format specified herein.
132	   Implementers have to read, understand, and apply the ITU-T/ISO/IEC
133	   specifications pertaining to [VVC] to arrive at interoperable, well-
134	   performing implementations.

136	   Conceptually, both [VVC] and [HEVC] include a Video Coding Layer
137	   (VCL), which is often used to refer to the coding-tool features, and
138	   a NAL, which is often used to refer to the systems and transport
139	   interface aspects of the codecs.

141	1.1.1.  Coding-Tool Features (informative)

143	   Coding tool features are described below with occasional reference to
144	   the coding tool set of [HEVC], which is well known in the community.

146	   Similar to earlier hybrid-video-coding-based standards, including
147	   HEVC, the following basic video coding design is employed by VVC.  A
148	   prediction signal is first formed by either intra- or motion-
149	   compensated prediction, and the residual (the difference between the
150	   original and the prediction) is then coded.  The gains in coding
151	   efficiency are achieved by redesigning and improving almost all parts
152	   of the codec over earlier designs.  In addition, [VVC] includes
153	   several tools to make the implementation on parallel architectures
154	   easier.

156	   Finally, [VVC] includes temporal, spatial, and SNR scalability as
157	   well as multiview coding support.

159	   Coding blocks and transform structure

161	   Among major coding-tool differences between HEVC and VVC, one of the
162	   important improvements is the more flexible coding tree structure in
163	   VVC, i.e., the multi-type tree.  In addition to the quadtree, binary
164	   and ternary trees are also supported, which contributes significantly
165	   to coding efficiency.  Moreover, the maximum size of the
166	   coding tree unit (CTU) is increased from 64x64 to 128x128.  To
167	   improve the coding efficiency of the chroma signal, separate luma and
168	   chroma trees at CTU level may be employed for intra-slices.  The square
169	   transforms in HEVC are extended to non-square transforms for
170	   rectangular blocks resulting from binary and ternary tree splits.
171	   Besides, [VVC] supports multiple transform sets (MTS), including DCT-
172	   2, DST-7, and DCT-8 as well as the non-separable secondary transform.
173	   The transforms used in [VVC] can have different sizes with support
174	   for larger transform sizes.  For DCT-2, the transform sizes range
175	   from 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range
176	   from 4x4 to 32x32.  In addition, [VVC] also supports sub-block
177	   transform for both intra and inter coded blocks.  For intra coded
178	   blocks, intra sub-partitioning (ISP) may be used to allow sub-block
179	   based intra prediction and transform.  For inter blocks, sub-block
180	   transform may be used assuming that only a part of an inter-block has
181	   non-zero transform coefficients.
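
   As a rough, non-normative illustration of the CTU size change
   described above, the following sketch counts how many CTUs are needed
   to cover a picture for the two maximum CTU sizes.  The function name
   and the 1920x1080 example picture are assumptions chosen for this
   illustration; they are not taken from [VVC].

```python
import math


def ctu_count(width: int, height: int, ctu_size: int) -> int:
    """Number of CTUs needed to cover a width x height luma picture."""
    if width <= 0 or height <= 0 or ctu_size <= 0:
        raise ValueError("dimensions and CTU size must be positive")
    # Partial CTUs at the right and bottom picture edges still count.
    return math.ceil(width / ctu_size) * math.ceil(height / ctu_size)


# Hypothetical example picture size, used only for illustration.
hevc_ctus = ctu_count(1920, 1080, 64)    # maximum CTU size in HEVC
vvc_ctus = ctu_count(1920, 1080, 128)    # maximum CTU size in VVC
print(hevc_ctus, vvc_ctus)
```

   Fewer, larger CTUs give the multi-type tree a larger root block to
   split from; the sketch says nothing about how an encoder actually
   partitions each CTU.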

183	   Entropy coding

185	   Similar to HEVC, VVC uses a single entropy-coding engine, which is
186	   based on context adaptive binary arithmetic coding [CABAC], but with
187	   the support of multi-window sizes.  The window sizes can be
188	   initialized differently for different context models.  Due to such a
189	   design, it has more efficient adaptation speed and better coding
190	   efficiency.  A joint chroma residual coding scheme is applied to
191	   further exploit the correlation between the residuals of two color
192	   components.  In VVC, different residual coding schemes are applied
193	   for regular transform coefficients and residual samples generated
194	   using transform-skip mode.

196	   In-loop filtering

198	   VVC has more feature support in loop filters than HEVC.  The
199	   deblocking filter in VVC is similar to HEVC but operates at a smaller
200	   grid.  After deblocking and sample adaptive offset (SAO), an adaptive
201	   loop filter (ALF) may be used.  As a Wiener filter, ALF reduces
202	   distortion of decoded pictures.  Besides, VVC introduces a new module
203	   before deblocking called luma mapping with chroma scaling to fully
204	   utilize the dynamic range of signal so that rate-distortion
205	   performance of both SDR and HDR content is improved.

207	   Motion prediction and coding

209	   Compared to HEVC, [VVC] introduces several improvements in this area.
210	   First, there is the adaptive motion vector resolution (AMVR), which
211	   can save bit cost for motion vectors by adaptively signaling motion
212	   vector resolution.  Then the affine motion compensation is included
213	   to capture complicated motion like zooming and rotation.  Meanwhile,
214	   prediction refinement with the optical flow with affine mode (PROF)
215	   is further deployed to mimic affine motion at the pixel level.
216	   Thirdly, decoder-side motion vector refinement (DMVR) is a method
217	   to derive motion vectors at the decoder using block matching so that
218	   fewer bits may be spent on motion vectors.  Bi-directional optical
219	   flow (BDOF) is a similar method to PROF.  BDOF adds a sample wise
220	   offset at 4x4 sub-block level that is derived with equations based on
221	   gradients of the prediction samples and a motion difference relative
222	   to CU motion vectors.  Furthermore, merge with motion vector
223	   difference (MMVD) is a special mode, which further signals a limited
224	   set of motion vector differences on top of merge mode.  In addition
225	   to MMVD, there are another three types of special merge modes, i.e.,
226	   sub-block merge, triangle, and combined intra-/inter-prediction
227	   (CIIP).  The sub-block merge list includes one candidate of sub-block
228	   temporal motion vector prediction (SbTMVP) and up to four candidates
229	   of affine motion vectors.  Triangle is based on triangular block
230	   motion compensation.  CIIP combines intra- and inter-predictions
231	   with weighting.  Adaptive weighting may be employed with a block-
232	   level tool called bi-prediction with CU based weighting (BCW) which
233	   provides more flexibility than in HEVC.

235	   Intra prediction and intra-coding
236	   To capture the diversified local image texture directions with finer
237	   granularity, [VVC] supports 65 angular directions instead of 33
238	   directions in HEVC.  The intra mode coding is based on a 6-most-
239	   probable-mode scheme, and the 6 most probable modes are derived using
240	   the neighboring intra prediction directions.  In addition, to deal
241	   with the different distributions of intra prediction angles for
242	   different block aspect ratios, a wide-angle intra prediction (WAIP)
243	   scheme is applied in [VVC] by including intra prediction angles
244	   beyond those present in HEVC.  Unlike HEVC which only allows using
245	   the most adjacent line of reference samples for intra prediction,
246	   [VVC] also allows using two further reference lines, known as
247	   multi-reference-line (MRL) intra prediction.  The additional
248	   reference lines can only be used for the 6 most probable intra
249	   prediction modes.  To capture the strong correlation between
250	   different colour components, in VVC, a cross-component linear mode
251	   (CCLM) is utilized which assumes a linear relationship between the
252	   luma sample values and their associated chroma samples.  For intra
253	   prediction, [VVC] also applies a position-dependent prediction
254	   combination (PDPC) for refining the prediction samples closer to the
255	   intra prediction block boundary.  Matrix-based intra prediction (MIP)
256	   modes are also used in [VVC]; they generate an intra prediction
257	   block of up to 8x8 using a weighted sum of downsampled neighboring
258	   reference samples, where the weights are hardcoded constants.

260	   Other coding-tool feature

262	   [VVC] introduces dependent quantization (DQ) to reduce quantization
263	   error by state-based switching between two quantizers.

265	1.1.2.  Systems and Transport Interfaces

267	   [VVC] inherits the basic systems and transport interfaces designs
268	   from HEVC and H.264.  These include the NAL-unit-based syntax
269	   structure, the hierarchical syntax and data unit structure, the
270	   supplemental enhancement information (SEI) message mechanism, and the
271	   video buffering model based on the hypothetical reference decoder
272	   (HRD).  The scalability features of [VVC] are conceptually similar to
273	   the scalable variant of HEVC known as SHVC.  The hierarchical syntax
274	   and data unit structure consists of parameter sets at various levels
275	   (decoder, sequence (pertaining to all layers), sequence (pertaining
276	   to a single layer), picture), picture-level header parameters,
277	   slice-level header parameters, and lower-level parameters.

279	   A number of key components that influenced the network abstraction
280	   layer design of [VVC] as well as this memo are described below.

282	   Decoding Capability Information
283	   The decoding capability information includes parameters that stay
284	   constant for the lifetime of a Video Bitstream, which in IETF terms
285	   can translate to the lifetime of a session.  Such information
286	   includes profile, level, and sub-profile information to determine a
287	   maximum capability interop point that is guaranteed to be never
288	   exceeded, even if splicing of video sequences occurs within a
289	   session.  It further includes constraint fields (most of which are
290	   flags), which can optionally be set to indicate that the video
291	   bitstream will be constrained in the use of certain features as
292	   indicated by the values of those fields.  With this, a bitstream can
293	   be labelled as not using certain tools, which allows among other
294	   things for resource allocation in a decoder implementation.

296	   Video parameter set

298	   The video parameter set (VPS) pertains to a coded video sequence
299	   (CVS) of multiple layers covering the same range of access units, and
300	   includes, among other information, decoding dependency expressed as
301	   information for reference picture list construction of enhancement
302	   layers.  The VPS provides a "big picture" of a scalable sequence,
303	   including what types of operation points are provided, the profile,
304	   tier, and level of the operation points, and some other high-level
305	   properties of the bitstream that can be used as the basis for session
306	   negotiation and content selection, etc.  One VPS may be referenced by
307	   one or more sequence parameter sets.

309	   Sequence parameter set

311	   The sequence parameter set (SPS) contains syntax elements pertaining
312	   to a coded layer video sequence (CLVS), which is a group of pictures
313	   belonging to the same layer, starting with a random access point, and
314	   followed by pictures that may depend on each other, until the next
315	   random access point picture.  In MPEG-2, the equivalent of a CVS was
316	   a group of pictures (GOP), which normally started with an I frame and
317	   was followed by P and B frames.  While more complex in its options of
318	   random access points, VVC retains this basic concept.  One remarkable
319	   difference of VVC is that a CLVS may start with a Gradual Decoding
320	   Refresh (GDR) picture, without requiring presence of traditional
321	   random access points in the bitstream, such as instantaneous decoding
322	   refresh (IDR) or clean random access (CRA) pictures.  In many TV-like
323	   applications, a CVS contains a few hundred milliseconds to a few
324	   seconds of video.  In video conferencing (without switching MCUs
325	   involved), a CVS can be as long in duration as the whole session.

327	   Picture and adaptation parameter set

329	   The picture parameter set and the adaptation parameter set (PPS and
330	   APS, respectively) carry information pertaining to zero or more
331	   pictures and zero or more slices, respectively.  The PPS contains
332	   information that is likely to stay constant from picture to picture
333	   (at least for pictures of a certain type), whereas the APS contains
334	   information, such as adaptive loop filter coefficients, that is
335	   likely to change from picture to picture or even within a picture.  A
336	   single APS is referenced by all slices of the same picture if that
337	   APS contains information about luma mapping with chroma scaling
338	   (LMCS) or scaling list.  Different APSs containing ALF parameters can
339	   be referenced by slices of the same picture.

341	   Picture Header

343	   A Picture Header contains information that is common to all slices
344	   that belong to the same picture.  Being able to send that information
345	   as a separate NAL unit when pictures are split into several slices
346	   allows for saving bitrate, compared to repeating the same information
347	   in all slices.  However, there might be scenarios where low-bitrate
348	   video is transmitted using a single slice per picture.  Having a
349	   separate NAL unit to convey that information incurs an overhead
350	   in such scenarios.  In that case, the picture header syntax
351	   structure is directly included in the slice header, instead of in its
352	   own NAL unit.  The mode of the picture header syntax structure being
353	   included in its own NAL unit or not can only be switched on/off for
354	   an entire CLVS, and can only be switched off when in the entire CLVS
355	   each picture contains only one slice.

357	   Profile, tier, and level

359	   The profile, tier and level syntax structures in DCI, VPS and SPS
360	   contain profile, tier, level information for all layers that refer to
361	   the DCI, for layers associated with one or more output layer sets
362	   specified by the VPS, and for any layer that refers to the SPS,
363	   respectively.

365	   Sub-Profiles

367	   Within the VVC specification, a sub-profile is a 32-bit number, coded
368	   according to ITU-T Rec. T.35, that does not carry any semantics.  It is
369	   carried in the profile_tier_level structure and hence (potentially)
370	   present in the DCI, VPS, and SPS.  External registration bodies can
371	   register a T.35 codepoint with ITU-T registration authorities and
372	   associate with their registration a description of bitstream
373	   restrictions beyond the profiles defined by ITU-T and ISO/IEC.  This
374	   would allow encoder manufacturers to label the bitstreams generated
375	   by their encoder as complying with such sub-profile.  It is expected
376	   that upstream standardization organizations (such as DVB and ATSC),
377	   as well as walled-garden video services will take advantage of this
378	   labelling system.  In contrast to "normal" profiles, it is expected
379	   that sub-profiles may indicate encoder choices traditionally left
380	   open in the (decoder-centric) video coding specs, such as GOP
381	   structures, minimum/maximum QP values, and the mandatory use of
382	   certain tools or SEI messages.

384	   Constraint Fields

386	   The profile_tier_level structure carries a considerable number of
387	   constraint fields (most of which are flags), which an encoder can use
388	   to indicate to a decoder that it will not use a certain tool or
389	   technology.  They were included in reaction to a perceived market
390	   need for labelling a bitstream as not exercising a certain tool that
391	   has become commercially unviable.

393	   Temporal scalability support

395	      Editor's note: this needs to be updated along with the new VVC
396	      draft in the future

398	   [VVC] includes support of temporal scalability, by inclusion of the
399	   signaling of TemporalId in the NAL unit header, the restriction that
400	   pictures of a particular temporal sublayer cannot be used for inter
401	   prediction reference by pictures of a lower temporal sublayer, the
402	   sub-bitstream extraction process, and the requirement that each sub-
403	   bitstream extraction output be a conforming bitstream.  Media-Aware
404	   Network Elements (MANEs) can utilize the TemporalId in the NAL unit
405	   header for stream adaptation purposes based on temporal scalability.
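
   The following non-normative sketch illustrates how a MANE might use
   the TemporalId for stream adaptation by dropping NAL units above a
   chosen temporal sublayer.  It reads the three low-order bits of the
   second NAL unit header byte, as laid out in Figure 1.  The function
   names and the list-of-bytes input are assumptions made for this
   illustration, and the sketch is a simplification: it is not the
   sub-bitstream extraction process specified in [VVC].

```python
from typing import Iterable, List


def temporal_id(nal_unit: bytes) -> int:
    """Return TemporalId, read from the two-byte NAL unit header."""
    if len(nal_unit) < 2:
        raise ValueError("NAL unit shorter than the two-byte header")
    tid_plus1 = nal_unit[1] & 0x07  # nuh_temporal_id_plus1: low 3 bits
    if tid_plus1 == 0:
        raise ValueError("nuh_temporal_id_plus1 must not be 0")
    return tid_plus1 - 1


def drop_higher_sublayers(nal_units: Iterable[bytes],
                          max_tid: int) -> List[bytes]:
    """Keep only NAL units whose TemporalId does not exceed max_tid."""
    return [nal for nal in nal_units if temporal_id(nal) <= max_tid]


# Hypothetical two-byte headers with TemporalId 0, 1, and 2.
stream = [bytes([0x00, 0x01]), bytes([0x00, 0x02]), bytes([0x00, 0x03])]
kept = drop_higher_sublayers(stream, max_tid=1)
```

   Dropping by TemporalId alone is safe with respect to inter
   prediction because pictures of a lower temporal sublayer never
   reference pictures of a higher one.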

407	   Reference picture resampling (RPR)

409	      Editor's note: to be updated

411	   Spatial, SNR, and multiview scalability

413	   [VVC] includes support for spatial, SNR, and multiview scalability.
414	   Scalable video coding is widely considered to have technical benefits
415	   and enrich services for various video applications.  Until recently,
416	   however, this functionality was not included in the first version
417	   of video codec specifications.  In VVC, all those
418	   forms of scalability are supported natively through the signaling of
419	   the layer_id in the NAL unit header, the VPS which associates layers
420	   with given layer_ids to each other, reference picture selection,
421	   reference picture resampling for spatial scalability, and a number of
422	   other mechanisms not relevant for this memo.  Scalability support can
423	   be implemented in a single decoding "loop" and is widely considered a
424	   comparatively lightweight operation.

426	      Spatial Scalability
427	         With the existence of Reference Picture Resampling (RPR), in
428	         the "main" profile of VVC, the additional burden for
429	         scalability support is just a modification of the high-level
430	         syntax (HLS).  The inter-layer prediction is employed in a
431	         scalable system to improve the coding efficiency of the
432	         enhancement layers.  In addition to the spatial and temporal
433	         motion-compensated predictions that are available in a single-
434	         layer codec, the inter-layer prediction in VVC uses the
435	         possibly resampled video data of the reconstructed reference
436	         picture from a reference layer to predict the current
437	         enhancement layer.  The resampling process for inter-layer
438	         prediction, when used, is performed at the block-level, reusing
439	         the existing interpolation process for motion compensation in
440	         single-layer coding.  It means that no additional resampling
441	         process is needed to support spatial scalability.

443	      SNR Scalability

445	         SNR scalability is similar to spatial scalability except that
446	         the resampling factors are 1:1.  In other words, there is no
447	         change in resolution, but there is inter-layer prediction.

449	   SEI Messages

451	   Supplemental enhancement information (SEI) messages carry information
452	   in the bitstream that does not influence the decoding process as
453	   specified in the VVC spec, but that addresses issues of representation/
454	   rendering of the decoded bitstream or labels the bitstream for certain
455	   applications, among other, similar tasks.  The overall concept of SEI
456	   messages and many of the messages themselves have been inherited from
457	   the H.264 and HEVC specs.  Except for the SEI messages that affect
458	   the specification of the hypothetical reference decoder (HRD), other
459	   SEI messages for use in the VVC environment, which are generally
460	   useful also in other video coding technologies, are not included in
461	   the main VVC specification.

463	1.1.3.  Parallel Processing Support (informative)

465	   Compared to HEVC, the [VVC] design to support parallelization offers
466	   numerous improvements.

468	      Editor notes: update on sub-picture/slice/tile is needed following
469	      new VVC draft

471	1.1.4.  NAL Unit Header

473	   [VVC] maintains the NAL unit concept of HEVC with modifications.  VVC
474	   uses a two-byte NAL unit header, as shown in Figure 1.  The payload
475	   of a NAL unit refers to the NAL unit excluding the NAL unit header.

477	                     +---------------+---------------+
478	                     |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
479	                     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
480	                     |F|Z| LayerID   |  Type   | TID |
481	                     +---------------+---------------+

483	                   The Structure of the VVC NAL Unit Header.

485	                                 Figure 1

487	   The semantics of the fields in the NAL unit header are as specified
488	   in [VVC] and described briefly below for convenience.  In addition to
489	   the name and size of each field, the corresponding syntax element
490	   name in [VVC] is also provided.

492	   F: 1 bit

494	      forbidden_zero_bit.  Required to be zero in VVC.  Note that the
495	      inclusion of this bit in the NAL unit header was to enable
496	      transport of [VVC] video over MPEG-2 transport systems (avoidance
497	      of start code emulations) [MPEG2S].  In the context of this memo
498	      the value 1 may be used to indicate a syntax violation, e.g., for
499	      a NAL unit resulting from aggregating a number of fragmented units
500	      of a NAL unit but missing the last fragment, as described in
501	      Section TBD.

503	   Z: 1 bit

505	      nuh_reserved_zero_bit.  Required to be zero in VVC, and reserved
506	      for future extensions by ITU-T and ISO/IEC.
507	      This memo does not overload the "Z" bit for local extensions, as
508	      a) overloading the "F" bit is sufficient and b) to preserve the
509	      usefulness of this memo to possible future versions of [VVC].

511	   LayerId: 6 bits

513	      nuh_layer_id.  Identifies the layer a NAL unit belongs to, wherein
514	      a layer may be, e.g., a spatial scalable layer or a quality scalable
515	      layer.

517	   Type: 5 bits
518	      nal_unit_type.  This field specifies the NAL unit type as defined
519	      in Table 7-1 of VVC.  For a reference of all currently defined NAL
520	      unit types and their semantics, please refer to Section 7.4.2.2 in
521	      [VVC].

523	   TID: 3 bits

525	      nuh_temporal_id_plus1.  This field specifies the temporal
526	      identifier of the NAL unit plus 1.  The value of TemporalId is
527	      equal to TID minus 1.  A TID value of 0 is illegal to ensure that
528	      there is at least one bit in the NAL unit header equal to 1, so as
529	      to enable independent considerations of start code emulations in
530	      the NAL unit header and in the NAL unit payload data.
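
   The two-byte layout described above can be illustrated with a short
   sketch.  It is not part of the payload format; the names NalHeader,
   parse_nal_header, and temporal_id are hypothetical and chosen only
   for this example.  The bit positions follow Figure 1 (F, Z, and
   LayerId in the first byte; Type and TID in the second).

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    """Fields of the two-byte VVC NAL unit header (illustrative names)."""
    f: int          # forbidden_zero_bit, 1 bit
    z: int          # nuh_reserved_zero_bit, 1 bit
    layer_id: int   # nuh_layer_id, 6 bits
    nal_type: int   # nal_unit_type, 5 bits
    tid: int        # nuh_temporal_id_plus1, 3 bits


def parse_nal_header(data: bytes) -> NalHeader:
    """Split the first two bytes of a NAL unit into header fields."""
    if len(data) < 2:
        raise ValueError("a NAL unit header needs two bytes")
    b0, b1 = data[0], data[1]
    return NalHeader(
        f=b0 >> 7,
        z=(b0 >> 6) & 0x01,
        layer_id=b0 & 0x3F,
        nal_type=b1 >> 3,
        tid=b1 & 0x07,
    )


def temporal_id(hdr: NalHeader) -> int:
    """TemporalId is TID minus 1; a TID value of 0 is illegal."""
    if hdr.tid == 0:
        raise ValueError("TID value of 0 is illegal")
    return hdr.tid - 1
```

   For instance, the bytes 0x01 0x82 would parse as LayerId 1, Type 16,
   and TID 2, which corresponds to a TemporalId of 1.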

532	1.2.  Overview of the Payload Format

534	   This payload format defines the following processes required for
535	   transport of [VVC] coded data over RTP [RFC3550]:

537	   o  Usage of RTP header with this payload format

539	   o  Packetization of [VVC] coded NAL units into RTP packets using
540	      three types of payload structures: single NAL unit packet,
541	      aggregation packet, and fragmentation unit

543	   o  Transmission of [VVC] NAL units of the same bitstream within a
544	      single RTP stream.

546	   o  Media type parameters to be used with the Session Description
547	      Protocol (SDP) [RFC4566]

549	   o  Frame-marking mapping [FrameMarking]

551	2.  Conventions

553	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
554	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
555	   "OPTIONAL" in this document are to be interpreted as described in BCP
556	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
557	   capitals, as shown above.

559	3.  Definitions and Abbreviations

561	3.1.  Definitions

563	   This document uses the terms and definitions of VVC.  Section 3.1.1
564	   lists relevant definitions from [VVC] for convenience.  Section 3.1.2
565	   provides definitions specific to this memo.

567	3.1.1.  Definitions from the VVC Specification

569	      Editor notes:

571	   Access unit (AU): A set of PUs that belong to different layers and
572	   contain coded pictures associated with the same time for output from
573	   the DPB.

575	   Adaptation parameter set (APS): A syntax structure containing syntax
576	   elements that apply to zero or more slices as determined by zero or
577	   more syntax elements found in slice headers.

579	   Bitstream: A sequence of bits, in the form of a NAL unit stream or a
580	   byte stream, that forms the representation of a sequence of AUs
581	   forming one or more coded video sequences (CVSs).

583	   Coded picture: A coded representation of a picture comprising VCL NAL
584	   units with a particular value of nuh_layer_id within an AU and
585	   containing all CTUs of the picture.

587	   Clean random access (CRA) PU: A PU in which the coded picture is a
588	   CRA picture.

590	   Clean random access (CRA) picture: An IRAP picture for which each VCL
591	   NAL unit has nal_unit_type equal to CRA_NUT.

593	   Coded video sequence (CVS): A sequence of AUs that consists, in
594	   decoding order, of a CVSS AU, followed by zero or more AUs that are
595	   not CVSS AUs, including all subsequent AUs up to but not including
596	   any subsequent AU that is a CVSS AU.

598	   Coded video sequence start (CVSS) AU: An AU in which there is a PU
599	   for each layer in the CVS and the coded picture in each PU is a CLVSS
600	   picture.

602	   Coded layer video sequence (CLVS): A sequence of PUs with the same
603	   value of nuh_layer_id that consists, in decoding order, of a CLVSS
604	   PU, followed by zero or more PUs that are not CLVSS PUs, including
605	   all subsequent PUs up to but not including any subsequent PU that is
606	   a CLVSS PU.

608	   Coded layer video sequence start (CLVSS) PU: A PU in which the coded
609	   picture is a CLVSS picture.

611	   Coded layer video sequence start (CLVSS) picture: A coded picture
612	   that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or
613	   a GDR picture with NoOutputBeforeRecoveryFlag equal to 1.

615	   Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs
616	   of chroma samples of a picture that has three sample arrays, or a CTB
617	   of samples of a monochrome picture or a picture that is coded using
618	   three separate colour planes and syntax structures used to code the
619	   samples.

621	   Decoding Capability Information (DCI): A syntax structure containing
622	   syntax elements that apply to the entire bitstream.

624	   Decoded picture buffer (DPB): A buffer holding decoded pictures for
625	   reference, output reordering, or output delay specified for the
626	   hypothetical reference decoder.

628	   Gradual decoding refresh (GDR) picture: A picture for which each VCL
629	   NAL unit has nal_unit_type equal to GDR_NUT.

631	   Instantaneous decoding refresh (IDR) PU: A PU in which the coded
632	   picture is an IDR picture.

634	   Instantaneous decoding refresh (IDR) picture: An IRAP picture for
635	   which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or
636	   IDR_N_LP.

638	   Intra random access point (IRAP) AU: An AU in which there is a PU for
639	   each layer in the CVS and the coded picture in each PU is an IRAP
640	   picture.

642	   Intra random access point (IRAP) PU: A PU in which the coded picture
643	   is an IRAP picture.

645	   Intra random access point (IRAP) picture: A coded picture for which
646	   all VCL NAL units have the same value of nal_unit_type in the range
647	   of IDR_W_RADL to CRA_NUT, inclusive.

649	   Layer: A set of VCL NAL units that all have a particular value of
650	   nuh_layer_id and the associated non-VCL NAL units.

652	   Network abstraction layer (NAL) unit: A syntax structure containing
653	   an indication of the type of data to follow and bytes containing that
654	   data in the form of an RBSP interspersed as necessary with emulation
655	   prevention bytes.

657	   Network abstraction layer (NAL) unit stream: A sequence of NAL units.

659	   Operation point (OP): A temporal subset of an OLS, identified by an
660	   OLS index and a highest value of TemporalId.

662	   Picture parameter set (PPS): A syntax structure containing syntax
663	   elements that apply to zero or more entire coded pictures as
664	   determined by a syntax element found in each slice header.

666	   Picture unit (PU): A set of NAL units that are associated with each
667	   other according to a specified classification rule, are consecutive
668	   in decoding order, and contain exactly one coded picture.

670	   Random access: The act of starting the decoding process for a
671	   bitstream at a point other than the beginning of the stream.

673	   Sequence parameter set (SPS): A syntax structure containing syntax
674	   elements that apply to zero or more entire CLVSs as determined by the
675	   content of a syntax element found in the PPS referred to by a syntax
676	   element found in each picture header.

678	   Slice: An integer number of complete tiles or an integer number of
679	   consecutive complete CTU rows within a tile of a picture that are
680	   exclusively contained in a single NAL unit.

682	   Sublayer: A temporal scalable layer of a temporal scalable bitstream
683	   consisting of VCL NAL units with a particular value of the TemporalId
684	   variable, and the associated non-VCL NAL units.

686	   Subpicture: A rectangular region of one or more slices within a
687	   picture.

689	   Sublayer representation: A subset of the bitstream consisting of NAL
690	   units of a particular sublayer and the lower sublayers.

692	   Tile: A rectangular region of CTUs within a particular tile column
693	   and a particular tile row in a picture.

695	   Tile column: A rectangular region of CTUs having a height equal to
696	   the height of the picture and a width specified by syntax elements in
697	   the picture parameter set.

699	   Tile row: A rectangular region of CTUs having a height specified by
700	   syntax elements in the picture parameter set and a width equal to the
701	   width of the picture.

703	   Video coding layer (VCL) NAL unit: A collective term for coded slice
704	   NAL units and the subset of NAL units that have reserved values of
705	   nal_unit_type that are classified as VCL NAL units in this
706	   Specification.

708	3.1.2.  Definitions Specific to This Memo

710	   Media-Aware Network Element (MANE): A network element, such as a
711	   middlebox, selective forwarding unit, or application-layer gateway
712	   that is capable of parsing certain aspects of the RTP payload headers
713	   or the RTP payload and reacting to their contents.

715	      Editor Notes: the following informative note needs to be updated
716	      along with the frame marking update

718	      Informative note: The concept of a MANE goes beyond normal routers
719	      or gateways in that a MANE has to be aware of the signaling (e.g.,
720	      to learn about the payload type mappings of the media streams),
721	      and in that it has to be trusted when working with Secure RTP
722	      (SRTP).  The advantage of using MANEs is that they allow packets
723	      to be dropped according to the needs of the media coding.  For
724	      example, if a MANE has to drop packets due to congestion on a
725	      certain link, it can identify and remove those packets whose
726	      elimination produces the least adverse effect on the user
727	      experience.  After dropping packets, MANEs must rewrite RTCP
728	      packets to match the changes to the RTP stream, as specified in
729	      Section 7 of [RFC3550].

731	   NAL unit decoding order: A NAL unit order that conforms to the
732	   constraints on NAL unit order given in Section 7.4.2.4 in [VVC],
733	   i.e., follows the order of NAL units in the bitstream.

735	   NAL unit output order: A NAL unit order in which NAL units of
736	   different access units are in the output order of the decoded
737	   pictures corresponding to the access units, as specified in [VVC],
738	   and in which NAL units within an access unit are in their decoding
739	   order.

741	   RTP stream: See [RFC7656].  Within the scope of this memo, one RTP
742	   stream is utilized to transport one or more temporal sublayers.

744	   Transmission order: The order of packets in ascending RTP sequence
745	   number order (in modulo arithmetic).  Within an aggregation packet,
746	   the NAL unit transmission order is the same as the order of
747	   appearance of NAL units in the packet.
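
   The "modulo arithmetic" in the definition of transmission order can
   be sketched as follows.  This is an illustration only: the function
   name seq_precedes is hypothetical, and treating a forward distance
   of less than half the 16-bit range as "earlier" is a common
   convention assumed here, not a rule stated in this memo.

```python
def seq_precedes(a: int, b: int) -> bool:
    """Return True if RTP sequence number a comes before b.

    Sequence numbers are 16 bits wide and wrap around, so the
    comparison is made on the forward distance from a to b modulo
    2**16.  A distance below half the range is taken to mean that
    a is earlier (assumed convention).
    """
    return a != b and ((b - a) & 0xFFFF) < 0x8000
```

   With this convention, sequence number 65535 precedes 0, so a
   wrap-around does not disturb the transmission order.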

749	3.2.  Abbreviations

751	   AU         Access Unit

753	   AP         Aggregation Packet

755	   CTU        Coding Tree Unit
756	   CVS        Coded Video Sequence

758	   DPB        Decoded Picture Buffer

760	   DCI        Decoding Capability Information

762	   DON        Decoding Order Number

764	   FIR        Full Intra Request

766	   FU         Fragmentation Unit

768	   HRD        Hypothetical Reference Decoder

770	   IDR        Instantaneous Decoding Refresh

772	   MANE       Media-Aware Network Element

774	   MTU        Maximum Transfer Unit

776	   NAL        Network Abstraction Layer

778	   NALU       Network Abstraction Layer Unit

780	   PLI        Picture Loss Indication

782	   PPS        Picture Parameter Set

784	   RPS        Reference Picture Set

786	   RPSI       Reference Picture Selection Indication

788	   SEI        Supplemental Enhancement Information

790	   SLI        Slice Loss Indication

792	   SPS        Sequence Parameter Set

794	   VCL        Video Coding Layer

796	   VPS        Video Parameter Set

798	4.  RTP Payload Format
799	4.1.  RTP Header Usage

801	   The format of the RTP header is specified in [RFC3550] (reprinted as
802	   Figure 2 for convenience).  This payload format uses the fields of
803	   the header in a manner consistent with that specification.

805	   The RTP payload (and the settings for some RTP header bits) for
806	   aggregation packets and fragmentation units are specified in
807	   Section 4.3.2 and Section 4.3.3, respectively.

809	       0                   1                   2                   3
810	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
811	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
812	      |V=2|P|X|  CC   |M|     PT      |       sequence number         |
813	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
814	      |                           timestamp                           |
815	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
816	      |           synchronization source (SSRC) identifier            |
817	      +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
818	      |            contributing source (CSRC) identifiers             |
819	      |                             ....                              |
820	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

822	                        RTP Header According to [RFC3550]

824	                                 Figure 2

826	   The RTP header information to be set according to this RTP payload
827	   format is as follows:

829	   Marker bit (M): 1 bit

831	      Set for the last packet of the access unit, carried in the current
832	      RTP stream.  This is in line with the normal use of the M bit in
833	      video formats to allow an efficient playout buffer handling.

835	         Editor notes: The informative note below needs updating once
836	         the NAL unit type table is stable in the [VVC] spec.

838	         Informative note: The content of a NAL unit does not tell
839	         whether or not the NAL unit is the last NAL unit, in decoding
840	         order, of an access unit.  An RTP sender implementation may
841	         obtain this information from the video encoder.  If, however,
842	         the implementation cannot obtain this information directly from
843	         the encoder, e.g., when the bitstream was pre-encoded, and also
844	         there is no timestamp allocated for each NAL unit, then the
845	         sender implementation can inspect subsequent NAL units in
846	         decoding order to determine whether or not the NAL unit is the
847	         last NAL unit of an access unit as follows.  A NAL unit is
848	         determined to be the last NAL unit of an access unit if it is
849	         the last NAL unit of the bitstream.  A NAL unit naluX is also
850	         determined to be the last NAL unit of an access unit if both
851	         the following conditions are true: 1) the next VCL NAL unit
852	         naluY in decoding order has the high-order bit of the first
853	         byte after its NAL unit header equal to 1 or nal_unit_type
854	         equal to 19, and 2) all NAL units between naluX and naluY, when
855	         present, have nal_unit_type in the range of 13 to 17, inclusive,
856	         equal to 20, equal to 23 or equal to 26.

858	   Payload Type (PT): 7 bits

860	      The assignment of an RTP payload type for this new packet format
861	      is outside the scope of this document and will not be specified
862	      here.  The assignment of a payload type has to be performed either
863	      through the profile used or in a dynamic way.

865	   Sequence Number (SN): 16 bits

867	      Set and used in accordance with [RFC3550].

869	   Timestamp: 32 bits

871	      The RTP timestamp is set to the sampling timestamp of the content.
872	      A 90 kHz clock rate MUST be used.  If the NAL unit has no timing
873	      properties of its own (e.g., parameter set and SEI NAL units), the
874	      RTP timestamp MUST be set to the RTP timestamp of the coded
875	      picture of the access unit in which the NAL unit (according to
876	      Annex D of VVC) is included.  Receivers MUST use the RTP timestamp
877	      for the display process, even when the bitstream contains picture
878	      timing SEI messages or decoding unit information SEI messages as
879	      specified in VVC.
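
   As a rough illustration of the 90 kHz clock requirement, the sketch
   below converts a sampling time in seconds into a 32-bit RTP
   timestamp.  The names rtp_timestamp and base are hypothetical; base
   stands for the initial timestamp offset of the stream, and the
   truncation toward zero is an assumption of this example.

```python
from fractions import Fraction

CLOCK_RATE = 90000  # 90 kHz clock rate required by this payload format


def rtp_timestamp(sample_time_s, base: int = 0) -> int:
    """Map a sampling time in seconds to a 32-bit RTP timestamp.

    Fraction keeps rates such as 1/30 s exact before the value is
    truncated to whole clock ticks and wrapped to 32 bits.
    """
    ticks = int(Fraction(sample_time_s) * CLOCK_RATE)
    return (base + ticks) & 0xFFFFFFFF
```

   All NAL units of one access unit, including parameter set and SEI
   NAL units, would then carry the value computed for the coded picture
   of that access unit.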

881	   Synchronization source (SSRC): 32 bits

883	      Used to identify the source of the RTP packets.  A single SSRC is
884	      used for all parts of a single bitstream.

886	4.2.  Payload Header Usage

888	   The first two bytes of the payload of an RTP packet are referred to
889	   as the payload header.  The payload header consists of the same
890	   fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown
891	   in Section 1.1.4, irrespective of the type of the payload structure.

893	   The TID value indicates (among other things) the relative importance
894	   of an RTP packet, for example, because NAL units belonging to higher
895	   temporal sublayers are not used for the decoding of lower temporal
896	   sublayers.  A lower value of TID indicates a higher importance.
897	   More-important NAL units MAY be better protected against transmission
898	   losses than less-important NAL units.
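
   One way a sender or MANE might act on the TID value is sketched
   below.  The function names and the max_tid threshold are
   hypothetical; the sketch only shows that TID can be read from the
   second byte of the payload header without further parsing, and it
   ignores LayerId entirely.

```python
def payload_tid(payload: bytes) -> int:
    """Read TID from the low three bits of the second header byte."""
    if len(payload) < 2:
        raise ValueError("payload header needs two bytes")
    return payload[1] & 0x07


def keep_packet(payload: bytes, max_tid: int) -> bool:
    """Keep a packet only if its TID does not exceed max_tid.

    Dropping every packet above a threshold removes whole higher
    temporal sublayers, which the lower sublayers do not depend on.
    """
    return payload_tid(payload) <= max_tid
```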

900	      For Discussion: quite possibly something similar can be said for
901	      the Layer_id in layered coding, but perhaps not in multiview
902	      coding.  (The relevant part of the spec is relatively new,
903	      therefore the soft language).  However, for serious layer pruning,
904	      interpretation of the VPS is required.  We can add language about
905	      the need for stateful interpretation of LayerID vis-a-vis
906	      stateless interpretation of TID later.

908	4.3.  Payload Structures

910	   Three different types of RTP packet payload structures are specified.
911	   A receiver can identify the type of an RTP packet payload through the
912	   Type field in the payload header.

914	   The three different payload structures are as follows:

916	   o  Single NAL unit packet: Contains a single NAL unit in the payload,
917	      and the NAL unit header of the NAL unit also serves as the payload
918	      header.  This payload structure is specified in Section 4.3.1.

920	   o  Aggregation Packet (AP): Contains more than one NAL unit within
921	      one access unit.  This payload structure is specified in
922	      Section 4.3.2.

924	   o  Fragmentation Unit (FU): Contains a subset of a single NAL unit.
925	      This payload structure is specified in Section 4.3.3.
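
   The identification step can be sketched as a small dispatch on the
   Type field.  The function name and the returned labels are
   hypothetical; the Type values 28 (AP) and 29 (FU) are those given in
   Sections 4.3.2 and 4.3.3.

```python
AP_TYPE = 28  # Type value of an aggregation packet
FU_TYPE = 29  # Type value of a fragmentation unit


def payload_structure(payload: bytes) -> str:
    """Classify an RTP payload by the Type field of its payload header."""
    if len(payload) < 2:
        raise ValueError("payload header needs two bytes")
    nal_type = payload[1] >> 3  # Type is the upper five bits of byte 1
    if nal_type == AP_TYPE:
        return "aggregation packet"
    if nal_type == FU_TYPE:
        return "fragmentation unit"
    return "single NAL unit packet"
```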

927	4.3.1.  Single NAL Unit Packets

929	      Editor notes: it is better to add a section to describe DONL and
930	      sprop-max-don-diff.  sprop-max-don-diff is used but not specified,
931	      as the parameters in Section 7 are not yet specified.  A value of
932	      sprop-max-don-diff greater than 0 indicates that the transmission
933	      order may not correspond to the decoding order and that the DON
934	      is included in the payload header.

936	   A single NAL unit packet contains exactly one NAL unit, and consists
937	   of a payload header (denoted as PayloadHdr), a conditional 16-bit
938	   DONL field (in network byte order), and the NAL unit payload data
939	   (the NAL unit excluding its NAL unit header) of the contained NAL
940	   unit, as shown in Figure 3.

942	      0                   1                   2                   3
943	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
944	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
945	     |           PayloadHdr          |      DONL (conditional)       |
946	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
947	     |                                                               |
948	     |                  NAL unit payload data                        |
949	     |                                                               |
950	     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
951	     |                               :...OPTIONAL RTP padding        |
952	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

954	                  The Structure of a Single NAL Unit Packet

956	                                 Figure 3

958	   The DONL field, when present, specifies the value of the 16 least
959	   significant bits of the decoding order number of the contained NAL
960	   unit.  If sprop-max-don-diff is greater than 0, the DONL field MUST
961	   be present, and the variable DON for the contained NAL unit is
962	   derived as equal to the value of the DONL field.  Otherwise (sprop-
963	   max-don-diff is equal to 0), the DONL field MUST NOT be present.
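
   A minimal sketch of building and parsing a single NAL unit packet
   payload under the rules above follows.  The function names are
   hypothetical, and sprop_max_don_diff is passed in as a plain
   integer, since its signaling is not yet specified in this memo.

```python
import struct


def pack_single_nal(nalu: bytes, don=None, sprop_max_don_diff: int = 0) -> bytes:
    """Build the RTP payload of a single NAL unit packet.

    The NAL unit header doubles as the payload header.  DONL (the 16
    least significant bits of the DON) is inserted after it only when
    sprop_max_don_diff is greater than 0.
    """
    if len(nalu) < 2:
        raise ValueError("NAL unit shorter than its header")
    header, body = nalu[:2], nalu[2:]
    if sprop_max_don_diff > 0:
        if don is None:
            raise ValueError("DONL MUST be present when sprop-max-don-diff > 0")
        return header + struct.pack("!H", don & 0xFFFF) + body
    return header + body


def unpack_single_nal(payload: bytes, sprop_max_don_diff: int = 0):
    """Return (DONL or None, NAL unit) from a single NAL unit packet."""
    if sprop_max_don_diff > 0:
        if len(payload) < 4:
            raise ValueError("payload too short to carry DONL")
        (donl,) = struct.unpack_from("!H", payload, 2)
        return donl, payload[:2] + payload[4:]
    if len(payload) < 2:
        raise ValueError("payload shorter than the payload header")
    return None, payload
```

   When sprop-max-don-diff is 0, the payload is byte-for-byte identical
   to the NAL unit itself.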

965	4.3.2.  Aggregation Packets (APs)

967	   Aggregation Packets (APs) can reduce packetization overhead for
968	   small NAL units, such as most of the non-VCL NAL units, which are
969	   often only a few octets in size.

971	   An AP aggregates NAL units of one access unit.  Each NAL unit to be
972	   carried in an AP is encapsulated in an aggregation unit.  NAL units
973	   aggregated in one AP are included in NAL unit decoding order.

975	   An AP consists of a payload header (denoted as PayloadHdr) followed
976	   by two or more aggregation units, as shown in Figure 4.

978	     0                   1                   2                   3
979	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
980	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
981	    |    PayloadHdr (Type=28)       |                               |
982	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
983	    |                                                               |
984	    |             two or more aggregation units                     |
985	    |                                                               |
986	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
987	    |                               :...OPTIONAL RTP padding        |
988	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

990	                   The Structure of an Aggregation Packet

992	                                 Figure 4

994	   The fields in the payload header of an AP are set as follows.  The F
995	   bit MUST be equal to 0 if the F bit of each aggregated NAL unit is
996	   equal to zero; otherwise, it MUST be equal to 1.  The Type field MUST
997	   be equal to 28.

999	   The value of LayerId MUST be equal to the lowest value of LayerId of
1000	   all the aggregated NAL units.  The value of TID MUST be the lowest
1001	   value of TID of all the aggregated NAL units.

1003	      Informative note: All VCL NAL units in an AP have the same TID
1004	      value since they belong to the same access unit.  However, an AP
1005	      may contain non-VCL NAL units for which the TID value in the NAL
1006	      unit header may be different than the TID value of the VCL NAL
1007	      units in the same AP.
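
   The rules for the AP payload header can be sketched as follows.  The
   function name is hypothetical.  Setting the Z bit to 0 is an
   assumption of this example, since the text above only constrains F,
   LayerId, TID, and Type.

```python
AP_TYPE = 28  # Type value of an aggregation packet


def ap_payload_header(nalus) -> bytes:
    """Derive the two-byte AP payload header from the aggregated NAL units.

    F is 1 if any aggregated NAL unit has F equal to 1; LayerId and
    TID are the lowest values found among the aggregated NAL units.
    """
    if len(nalus) < 2:
        raise ValueError("an AP MUST carry at least two aggregation units")
    f, layer_id, tid = 0, 0x3F, 0x07
    for nalu in nalus:
        f |= nalu[0] >> 7
        layer_id = min(layer_id, nalu[0] & 0x3F)
        tid = min(tid, nalu[1] & 0x07)
    # Z is left at 0 (assumption); Type occupies the upper five bits.
    return bytes([(f << 7) | layer_id, (AP_TYPE << 3) | tid])
```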

1009	   An AP MUST carry at least two aggregation units and can carry as many
1010	   aggregation units as necessary; however, the total amount of data in
1011	   an AP MUST fit into an IP packet, and the size SHOULD be chosen so
1012	   that the resulting IP packet is smaller than the MTU size so as to
1013	   avoid IP layer fragmentation.  An AP MUST NOT contain FUs specified
1014	   in Section 4.3.3.  APs MUST NOT be nested; i.e., an AP cannot
1015	   contain another AP.

1017	   The first aggregation unit in an AP consists of a conditional 16-bit
1018	   DONL field (in network byte order) followed by a 16-bit unsigned size
1019	   information (in network byte order) that indicates the size of the
1020	   NAL unit in bytes (excluding these two octets, but including the NAL
1021	   unit header), followed by the NAL unit itself, including its NAL unit
1022	   header, as shown in Figure 5.

1024	     0                   1                   2                   3
1025	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1026	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1027	    |               :       DONL (conditional)      |   NALU size   |
1028	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1029	    |   NALU size   |                                               |
1030	    +-+-+-+-+-+-+-+-+         NAL unit                              |
1031	    |                                                               |
1032	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1033	    |                               :
1034	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1036	           The Structure of the First Aggregation Unit in an AP

1038	                                 Figure 5

1040	   The DONL field, when present, specifies the value of the 16 least
1041	   significant bits of the decoding order number of the aggregated NAL
1042	   unit.

1044	   If sprop-max-don-diff is greater than 0, the DONL field MUST be
1045	   present in an aggregation unit that is the first aggregation unit in
1046	   an AP, and the variable DON for the aggregated NAL unit is derived as
1047	   equal to the value of the DONL field.  Otherwise (sprop-max-don-diff
1048	   is equal to 0), the DONL field MUST NOT be present in an aggregation
1049	   unit that is the first aggregation unit in an AP.

1051	   An aggregation unit that is not the first aggregation unit in an AP
1052	   consists of a 16-bit unsigned size information (in network byte
1053	   order) that indicates the size of the NAL unit in bytes (excluding
1054	   these two octets, but including the NAL unit header), followed by
1055	   the NAL unit itself, including its NAL unit header, as shown in
1056	   Figure 6.

1058	     0                   1                   2                   3
1059	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1060	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1061	    |               :       NALU size               |   NAL unit    |
1062	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
1063	    |                                                               |
1064	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1065	    |                               :
1066	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1068	         The Structure of an Aggregation Unit That Is Not the First
1069	                          Aggregation Unit in an AP

1071	                                 Figure 6

1073	   Figure 7 presents an example of an AP that contains two aggregation
1074	   units, labeled as 1 and 2 in the figure, without the DONL field being
1075	   present.

1077	     0                   1                   2                   3
1078	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1079	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1080	    |                          RTP Header                           |
1081	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1082	    |   PayloadHdr (Type=28)        |         NALU 1 Size           |
1083	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1084	    |          NALU 1 HDR           |                               |
1085	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
1086	    |                   . . .                                       |
1087	    |                                                               |
1088	    +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1089	    |  . . .        | NALU 2 Size                   | NALU 2 HDR    |
1090	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1091	    | NALU 2 HDR    |                                               |
1092	    +-+-+-+-+-+-+-+-+              NALU 2 Data                      |
1093	    |                   . . .                                       |
1094	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1095	    |                               :...OPTIONAL RTP padding        |
1096	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1098	               An Example of an AP Packet Containing
1099	             Two Aggregation Units without the DONL Field

1101	                                 Figure 7

1103	   Figure 8 presents an example of an AP that contains two aggregation
1104	   units, labeled as 1 and 2 in the figure, with the DONL field being
1105	   present.

1107	     0                   1                   2                   3
1108	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1109	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1110	    |                          RTP Header                           |
1111	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1112	    |   PayloadHdr (Type=28)        |        NALU 1 DONL            |
1113	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1114	    |          NALU 1 Size          |            NALU 1 HDR         |
1115	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1116	    |                                                               |
1117	    |                 NALU 1 Data   . . .                           |
1118	    |                                                               |
1119	    +        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1120	    |                               :          NALU 2 Size          |
1121	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1122	    |          NALU 2 HDR           |                               |
1123	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
1124	    |                                                               |
1125	    |        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1126	    |                               :...OPTIONAL RTP padding        |
1127	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1129	                   An Example of an AP Containing
1130	                 Two Aggregation Units with the DONL Field

1132	                                 Figure 8
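
   The layouts in Figures 7 and 8 can be reproduced with the following
   sketch, which serializes and parses the aggregation units of an AP.
   The function names are hypothetical.  The payload header is passed
   in already built, and the parser assumes that any RTP padding has
   been removed beforehand.

```python
import struct


def pack_ap(payload_hdr: bytes, nalus, first_don=None) -> bytes:
    """Serialize an AP: payload header, optional DONL, then size/NALU pairs."""
    if len(payload_hdr) != 2:
        raise ValueError("payload header is two bytes")
    if len(nalus) < 2:
        raise ValueError("an AP MUST carry at least two aggregation units")
    out = bytearray(payload_hdr)
    if first_don is not None:
        # DONL appears only in the first aggregation unit.
        out += struct.pack("!H", first_don & 0xFFFF)
    for nalu in nalus:
        if not 2 <= len(nalu) <= 0xFFFF:
            raise ValueError("NAL unit size does not fit the 16-bit size field")
        out += struct.pack("!H", len(nalu)) + nalu
    return bytes(out)


def unpack_ap(payload: bytes, has_donl: bool = False):
    """Return (DONL or None, list of NAL units) from an AP payload."""
    pos, donl, nalus = 2, None, []
    if has_donl:
        if len(payload) < pos + 2:
            raise ValueError("truncated DONL field")
        (donl,) = struct.unpack_from("!H", payload, pos)
        pos += 2
    while pos < len(payload):
        if pos + 2 > len(payload):
            raise ValueError("truncated NALU size field")
        (size,) = struct.unpack_from("!H", payload, pos)
        pos += 2
        if size == 0 or pos + size > len(payload):
            raise ValueError("NALU size exceeds the remaining payload")
        nalus.append(payload[pos:pos + size])
        pos += size
    if len(nalus) < 2:
        raise ValueError("an AP MUST carry at least two aggregation units")
    return donl, nalus
```

   Whether DONL is present cannot be derived from the AP itself; the
   receiver has to know the value of sprop-max-don-diff, which is why
   has_donl is an explicit argument here.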

1134	4.3.3.  Fragmentation Units

1136	   Fragmentation Units (FUs) are introduced to enable fragmenting a
1137	   single NAL unit into multiple RTP packets, possibly without
1138	   cooperation or knowledge of the [VVC] encoder.  A fragment of a NAL
1139	   unit consists of an integer number of consecutive octets of that NAL
1140	   unit.  Fragments of the same NAL unit MUST be sent in consecutive
1141	   order with ascending RTP sequence numbers (with no other RTP packets
1142	   within the same RTP stream being sent between the first and last
1143	   fragment).

1145	   When a NAL unit is fragmented and conveyed within FUs, it is referred
1146	   to as a fragmented NAL unit.  APs MUST NOT be fragmented.  FUs MUST
1147	   NOT be nested; i.e., an FU cannot contain a subset of another FU.

1149	   The RTP timestamp of an RTP packet carrying an FU is set to the NALU-
1150	   time of the fragmented NAL unit.

1152	   An FU consists of a payload header (denoted as PayloadHdr), an FU
1153	   header of one octet, a conditional 16-bit DONL field (in network byte
1154	   order), and an FU payload, as shown in Figure 9.

1156	     0                   1                   2                   3
1157	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1158	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1159	    |   PayloadHdr (Type=29)        |   FU header   | DONL (cond)   |
1160	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1161	    |   DONL (cond) |                                               |
1162	    +-+-+-+-+-+-+-+-+                                               |
1163	    |                         FU payload                            |
1164	    |                                                               |
1165	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1166	    |                               :...OPTIONAL RTP padding        |
1167	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1169	                          The Structure of an FU

1171	                                 Figure 9

1173	   The fields in the payload header are set as follows.  The Type field
1174	   MUST be equal to 29.  The fields F, LayerId, and TID MUST be equal to
1175	   the fields F, LayerId, and TID, respectively, of the fragmented NAL
1176	   unit.

1178	   The FU header consists of an S bit, an E bit, an R bit and a 5-bit
1179	   FuType field, as shown in Figure 10.

1181	                           +---------------+
1182	                           |0|1|2|3|4|5|6|7|
1183	                           +-+-+-+-+-+-+-+-+
1184	                           |S|E|R|  FuType |
1185	                           +---------------+

1187	                       The Structure of FU Header

1189	                                 Figure 10

1191	   The semantics of the FU header fields are as follows:

1193	   S: 1 bit
1194	      When set to 1, the S bit indicates the start of a fragmented NAL
1195	      unit, i.e., the first byte of the FU payload is also the first
1196	      byte of the payload of the fragmented NAL unit.  When the FU
1197	      payload is not the start of the fragmented NAL unit payload, the S
1198	      bit MUST be set to 0.

1200	   E: 1 bit

1202	      When set to 1, the E bit indicates the end of a fragmented NAL
1203	      unit, i.e., the last byte of the payload is also the last byte of
1204	      the fragmented NAL unit.  When the FU payload is not the last
1205	      fragment of a fragmented NAL unit, the E bit MUST be set to 0.

1207	   Reserved: 1 bit

1209	      Placeholder

1211	   FuType: 5 bits

1213	      The field FuType MUST be equal to the field Type of the fragmented
1214	      NAL unit.
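The one-octet FU header layout of Figure 10 can be sketched as follows. This is a minimal illustration, not a normative implementation; the names FuHeader, parse_fu_header, and build_fu_header are hypothetical and do not appear in this memo.

```python
from typing import NamedTuple


class FuHeader(NamedTuple):
    s: bool       # start of a fragmented NAL unit
    e: bool       # end of a fragmented NAL unit
    r: int        # reserved bit (semantics are a placeholder in this memo)
    fu_type: int  # 5-bit Type of the fragmented NAL unit


def parse_fu_header(octet: int) -> FuHeader:
    """Split one FU header octet into S, E, R and FuType (Figure 10)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("FU header must be a single octet")
    return FuHeader(
        s=bool(octet & 0x80),
        e=bool(octet & 0x40),
        r=(octet >> 5) & 0x01,
        fu_type=octet & 0x1F,
    )


def build_fu_header(s: bool, e: bool, fu_type: int, r: int = 0) -> int:
    """Pack S, E, R and FuType into one octet, most significant bit first."""
    if not 0 <= fu_type <= 31:
        raise ValueError("FuType is a 5-bit field")
    return (int(s) << 7) | (int(e) << 6) | ((r & 0x01) << 5) | fu_type
```

For example, the octet 0x81 parses to S=1, E=0, R=0, FuType=1.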

1216	   The DONL field, when present, specifies the value of the 16 least
1217	   significant bits of the decoding order number of the fragmented NAL
1218	   unit.

1220	   If sprop-max-don-diff is greater than 0, and the S bit is equal to 1,
1221	   the DONL field MUST be present in the FU, and the variable DON for
1222	   the fragmented NAL unit is derived as equal to the value of the DONL
1223	   field.  Otherwise (sprop-max-don-diff is equal to 0, or the S bit is
1224	   equal to 0), the DONL field MUST NOT be present in the FU.

1226	   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
1227	   the Start bit and End bit must not both be set to 1 in the same FU
1228	   header.

1230	   The FU payload consists of fragments of the payload of the fragmented
1231	   NAL unit so that if the FU payloads of consecutive FUs, starting with
1232	   an FU with the S bit equal to 1 and ending with an FU with the E bit
1233	   equal to 1, are sequentially concatenated, the payload of the
1234	   fragmented NAL unit can be reconstructed.  The NAL unit header of the
1235	   fragmented NAL unit is not included as such in the FU payload, but
1236	   rather the information of the NAL unit header of the fragmented NAL
1237	   unit is conveyed in F, LayerId, and TID fields of the FU payload
1238	   headers of the FUs and the FuType field of the FU header of the FUs.
1239	   An FU payload MUST NOT be empty.
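The fragmentation and reconstruction rules above can be sketched as follows. The sketch assumes the two-octet NAL unit header layout F(1), Z(1), LayerId(6) followed by Type(5), TID(3), which is defined outside this section. The function names and the max_fu_payload parameter are hypothetical. The caller is assumed to leave room for the payload header, the FU header, and the DONL field when choosing max_fu_payload, and to pass donl only when sprop-max-don-diff is greater than 0.

```python
from typing import List, Optional

FU_TYPE = 29  # Type value of the payload header of an FU


def fragment_nal_unit(nal: bytes, max_fu_payload: int,
                      donl: Optional[int] = None) -> List[bytes]:
    """Split one NAL unit into FUs; the NAL unit header is not copied."""
    header, payload = nal[:2], nal[2:]
    if max_fu_payload < 1:
        raise ValueError("max_fu_payload must be positive")
    if len(payload) <= max_fu_payload:
        # S and E must never both be 1 in one FU header.
        raise ValueError("fits in one packet; do not fragment")
    # F, Z, LayerId and TID are copied; Type is replaced by 29.
    payload_hdr = bytes([header[0], (FU_TYPE << 3) | (header[1] & 0x07)])
    fu_type = header[1] >> 3
    chunks = [payload[i:i + max_fu_payload]
              for i in range(0, len(payload), max_fu_payload)]
    fus = []
    for idx, chunk in enumerate(chunks):
        s = idx == 0
        e = idx == len(chunks) - 1
        fu = payload_hdr + bytes([(s << 7) | (e << 6) | fu_type])
        if s and donl is not None:
            fu += donl.to_bytes(2, "big")  # DONL only in the first FU
        fus.append(fu + chunk)
    return fus


def reassemble_nal_unit(fus: List[bytes], has_donl: bool = False) -> bytes:
    """Rebuild the NAL unit from a complete, in-order run of FUs."""
    first, last = fus[0], fus[-1]
    if not first[2] & 0x80 or not last[2] & 0x40:
        raise ValueError("run must start with S=1 and end with E=1")
    nal_header = bytes([first[0], ((first[2] & 0x1F) << 3) | (first[1] & 0x07)])
    body = bytearray()
    for idx, fu in enumerate(fus):
        offset = 3 + (2 if has_donl and idx == 0 else 0)
        body += fu[offset:]
    return nal_header + bytes(body)
```

Reassembling the output of fragment_nal_unit yields the original NAL unit, because the FuType field and the copied F, LayerId, and TID fields together carry the information of the NAL unit header.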

1241	   If an FU is lost, the receiver SHOULD discard all following
1242	   fragmentation units in transmission order corresponding to the same
1243	   fragmented NAL unit, unless the decoder in the receiver is known to
1244	   be prepared to gracefully handle incomplete NAL units.

1246	   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
1247	   fragments of a NAL unit to an (incomplete) NAL unit, even if fragment
1248	   n of that NAL unit is not received.  In this case, the
1249	   forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a
1250	   syntax violation.
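Marking an incomplete NAL unit can be sketched as follows, assuming that the forbidden_zero_bit is the most significant bit of the first NAL unit header octet (the F field). The helper name mark_incomplete is hypothetical.

```python
def mark_incomplete(nal: bytes) -> bytes:
    """Return a copy of an incomplete NAL unit with forbidden_zero_bit = 1."""
    if len(nal) < 2:
        raise ValueError("a NAL unit needs at least its two-octet header")
    # Setting the top bit of the first octet signals a syntax violation.
    return bytes([nal[0] | 0x80]) + nal[1:]
```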

1252	4.4.  Decoding Order Number

1254	   For each NAL unit, the variable AbsDon is derived, representing the
1255	   decoding order number that is indicative of the NAL unit decoding
1256	   order.

1258	   Let NAL unit n be the n-th NAL unit in transmission order within an
1259	   RTP stream.

1261	   If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon
1262	   for NAL unit n, is derived as equal to n.

1264	   Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is
1265	   derived as follows, where DON[n] is the value of the variable DON for
1266	   NAL unit n:

1268	   o  If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in
1269	      transmission order), AbsDon[0] is set equal to DON[0].

1271	   o  Otherwise (n is greater than 0), the following applies for
1272	      derivation of AbsDon[n]:

1274	         If DON[n] == DON[n-1],
1275	            AbsDon[n] = AbsDon[n-1]

1277	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768),
1278	            AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1]

1280	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768),
1281	            AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n]

1283	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768),
1284	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 -
1285	            DON[n])

1287	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768),
1288	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n])
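The five cases above can be transcribed directly into code. The following is an illustrative sketch for the case where sprop-max-don-diff is greater than 0; the function names next_abs_don and derive_abs_dons are hypothetical.

```python
from typing import List


def next_abs_don(prev_abs_don: int, prev_don: int, don: int) -> int:
    """Derive AbsDon[n] from AbsDon[n-1], DON[n-1] and DON[n]."""
    if don == prev_don:
        return prev_abs_don
    if don > prev_don and don - prev_don < 32768:
        return prev_abs_don + don - prev_don
    if don < prev_don and prev_don - don >= 32768:
        # DON wrapped around going forward.
        return prev_abs_don + 65536 - prev_don + don
    if don > prev_don and don - prev_don >= 32768:
        # DON wrapped around going backward.
        return prev_abs_don - (prev_don + 65536 - don)
    # don < prev_don and prev_don - don < 32768
    return prev_abs_don - (prev_don - don)


def derive_abs_dons(dons: List[int]) -> List[int]:
    """Map DON values, in transmission order, to AbsDon values."""
    abs_dons: List[int] = []
    for n, don in enumerate(dons):
        if n == 0:
            abs_dons.append(don)  # AbsDon[0] = DON[0]
        else:
            abs_dons.append(next_abs_don(abs_dons[-1], dons[n - 1], don))
    return abs_dons
```

For example, the DON sequence 65534, 65535, 0, 1 maps to the AbsDon sequence 65534, 65535, 65536, 65537, so the 16-bit wraparound does not disturb the ordering.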

1290	   For any two NAL units m and n, the following applies:

1292	   o  AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows
1293	      NAL unit m in NAL unit decoding order.

1295	   o  When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order
1296	      of the two NAL units can be in either order.

1298	   o  AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes
1299	      NAL unit m in decoding order.

1301	      Informative note: When two consecutive NAL units in the NAL unit
1302	      decoding order have different values of AbsDon, the absolute
1303	      difference between the two AbsDon values may be greater than or
1304	      equal to 1.

1306	      Informative note: There are multiple reasons to allow for the
1307	      absolute difference of the values of AbsDon for two consecutive
1308	      NAL units in the NAL unit decoding order to be greater than one.
1309	      An increment by one is not required, as at the time of associating
1310	      values of AbsDon to NAL units, it may not be known whether all NAL
1311	      units are to be delivered to the receiver.  For example, a gateway
1312	      might not forward VCL NAL units of higher sublayers or some SEI
1313	      NAL units when there is congestion in the network.
1314	      In another example, the first intra-coded picture of a pre-encoded
1315	      clip is transmitted in advance to ensure that it is readily
1316	      available in the receiver, and when transmitting the first intra-
1317	      coded picture, the originator does not exactly know how many NAL
1318	      units will be encoded before the first intra-coded picture of the
1319	      pre-encoded clip follows in decoding order.  Thus, the values of
1320	      AbsDon for the NAL units of the first intra-coded picture of the
1321	      pre-encoded clip have to be estimated when they are transmitted,
1322	      and gaps in values of AbsDon may occur.

1324	5.  Packetization Rules

1326	   The following packetization rules apply:

1328	   o  If sprop-max-don-diff is greater than 0, the transmission order of
1329	      NAL units carried in the RTP stream MAY be different than the NAL
1330	      unit decoding order and the NAL unit output order.

1332	   o  A NAL unit of a small size SHOULD be encapsulated in an
1333	      aggregation packet together with one or more other NAL units in
1334	      order to avoid the unnecessary packetization overhead for small NAL
1335	      units.  For example, non-VCL NAL units such as access unit
1336	      delimiters, parameter sets, or SEI NAL units are typically small
1337	      and can often be aggregated with VCL NAL units without violating
1338	      MTU size constraints.

1340	   o  Each non-VCL NAL unit SHOULD, when possible from an MTU size match
1341	      viewpoint, be encapsulated in an aggregation packet together with
1342	      its associated VCL NAL unit, as typically a non-VCL NAL unit would
1343	      be meaningless without the associated VCL NAL unit being
1344	      available.

1346	   o  For carrying exactly one NAL unit in an RTP packet, a single NAL
1347	      unit packet MUST be used.
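One possible way for a sender to choose among the packet types under these rules is sketched below. It is a simplification under stated assumptions: sprop-max-don-diff is 0 (no DONL or DOND fields), each aggregation unit adds a two-octet size field, and an aggregation packet carries a two-octet payload header. The constants and the function name plan_packets are hypothetical.

```python
from typing import List, Tuple

PAYLOAD_HDR = 2    # assumed size of the AP payload header, in octets
AP_SIZE_FIELD = 2  # assumed per-unit size field inside an AP, in octets


def plan_packets(nal_sizes: List[int],
                 max_payload: int) -> List[Tuple[str, List[int]]]:
    """Group NAL units (by index) into single, AP or FU packets."""
    plan: List[Tuple[str, List[int]]] = []
    group: List[int] = []
    group_bytes = PAYLOAD_HDR

    def flush() -> None:
        nonlocal group, group_bytes
        if len(group) == 1:
            plan.append(("single", group))  # exactly one NAL unit
        elif group:
            plan.append(("ap", group))      # several small NAL units
        group, group_bytes = [], PAYLOAD_HDR

    for i, size in enumerate(nal_sizes):
        if size > max_payload:
            flush()
            plan.append(("fu", [i]))        # too large: fragment it
            continue
        cost = AP_SIZE_FIELD + size
        if group and group_bytes + cost > max_payload:
            flush()
        group.append(i)
        group_bytes += cost
    flush()
    return plan
```

The sketch keeps NAL units in their original order, so it does not rely on the reordering that a non-zero sprop-max-don-diff would permit.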

1349	6.  De-packetization Process

1351	   The general concept behind de-packetization is to get the NAL units
1352	   out of the RTP packets in an RTP stream and pass them to the decoder
1353	   in the NAL unit decoding order.

1355	   The de-packetization process is implementation dependent.  Therefore,
1356	   the following description should be seen as an example of a suitable
1357	   implementation.  Other schemes may be used as well, as long as the
1358	   output for the same input is the same as the process described below.
1359	   The output is the same when the set of output NAL units and their
1360	   order are both identical.  Optimizations relative to the described
1361	   algorithms are possible.

1363	   All normal RTP mechanisms related to buffer management apply.  In
1364	   particular, duplicated or outdated RTP packets (as indicated by the
1365	   RTP sequence number and the RTP timestamp) are removed.  To
1366	   determine the exact time for decoding, factors such as a possible
1367	   intentional delay to allow for proper inter-stream synchronization
1368	   MUST be factored in.

1370	   NAL units with NAL unit type values in the range of 0 to 27,
1371	   inclusive, may be passed to the decoder.  NAL-unit-like structures
1372	   with NAL unit type values in the range of 28 to 31, inclusive, MUST
1373	   NOT be passed to the decoder.

1375	   The receiver includes a receiver buffer, which is used to compensate
1376	   for transmission delay jitter within individual RTP streams and
1377	   across RTP streams, to reorder NAL units from transmission order to
1378	   the NAL unit decoding order.  In this section, the receiver operation
1379	   is described under the assumption that there is no transmission delay
1380	   jitter within an RTP stream and across RTP streams.  To make a
1381	   difference from a practical receiver buffer that is also used for
1382	   compensation of transmission delay jitter, the receiver buffer is
1383	   hereafter called the de-packetization buffer in this section.
1384	   Receivers should also prepare for transmission delay jitter; that is,
1385	   either reserve separate buffers for transmission delay jitter
1386	   buffering and de-packetization buffering or use a receiver buffer for
1387	   both transmission delay jitter and de-packetization.  Moreover,
1388	   receivers should take transmission delay jitter into account in the
1389	   buffering operation, e.g., by additional initial buffering before
1390	   starting of decoding and playback.

1392	   When sprop-max-don-diff is equal to 0, the de-packetization buffer
1393	   size is zero bytes, and the process described in the remainder of
1394	   this paragraph applies.
1395	   The NAL units carried in the single RTP stream are directly passed to
1396	   the decoder in their transmission order, which is identical to their
1397	   decoding order.  When there are several NAL units of the same RTP
1398	   stream with the same NTP timestamp, the order to pass them to the
1399	   decoder is their transmission order.

1401	      Informative note: The mapping between RTP and NTP timestamps is
1402	      conveyed in RTCP SR packets.  In addition, the mechanisms for
1403	      faster media timestamp synchronization discussed in [RFC6051] may
1404	      be used to speed up the acquisition of the RTP-to-wall-clock
1405	      mapping.

1407	   When sprop-max-don-diff is greater than 0, the process described in
1408	   the remainder of this section applies.

1410	   There are two buffering states in the receiver: initial buffering and
1411	   buffering while playing.  Initial buffering starts when the reception
1412	   is initialized.  After initial buffering, decoding and playback are
1413	   started, and the buffering-while-playing mode is used.

1415	   Regardless of the buffering state, the receiver stores incoming NAL
1416	   units, in reception order, into the de-packetization buffer.  NAL
1417	   units carried in RTP packets are stored in the de-packetization
1418	   buffer individually, and the value of AbsDon is calculated and stored
1419	   for each NAL unit.

1421	   Initial buffering lasts until condition A (the difference between the
1422	   greatest and smallest AbsDon values of the NAL units in the de-
1423	   packetization buffer is greater than or equal to the value of sprop-
1424	   max-don-diff) or condition B (the number of NAL units in the de-
1425	   packetization buffer is greater than the value of sprop-depack-buf-
1426	   nalus) is true.

1428	   After initial buffering, whenever condition A or condition B is true,
1429	   the following operation is repeatedly applied until both condition A
1430	   and condition B become false:

1432	   o  The NAL unit in the de-packetization buffer with the smallest
1433	      value of AbsDon is removed from the de-packetization buffer and
1434	      passed to the decoder.

1436	   When no more NAL units are flowing into the de-packetization buffer,
1437	   all NAL units remaining in the de-packetization buffer are removed
1438	   from the buffer and passed to the decoder in the order of increasing
1439	   AbsDon values.
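The buffering process for sprop-max-don-diff greater than 0 can be sketched with a priority queue keyed on AbsDon. This is an example of a suitable implementation only. The class name DepackBuffer and its methods are hypothetical, and jitter handling is left out, as in the description above. Because nothing is released until condition A or condition B first becomes true, the same loop covers both initial buffering and buffering while playing.

```python
import heapq
from itertools import count
from typing import Any, List


class DepackBuffer:
    def __init__(self, max_don_diff: int, depack_buf_nalus: int) -> None:
        self.max_don_diff = max_don_diff          # sprop-max-don-diff
        self.depack_buf_nalus = depack_buf_nalus  # sprop-depack-buf-nalus
        self._heap: List[tuple] = []
        self._arrival = count()  # tie-breaker for equal AbsDon values

    def _a_or_b(self) -> bool:
        if not self._heap:
            return False
        smallest = self._heap[0][0]
        greatest = max(entry[0] for entry in self._heap)
        cond_a = greatest - smallest >= self.max_don_diff
        cond_b = len(self._heap) > self.depack_buf_nalus
        return cond_a or cond_b

    def push(self, abs_don: int, nal: Any) -> List[Any]:
        """Store one NAL unit; return the NAL units released to the decoder."""
        heapq.heappush(self._heap, (abs_don, next(self._arrival), nal))
        released = []
        while self._a_or_b():
            released.append(heapq.heappop(self._heap)[2])
        return released

    def flush(self) -> List[Any]:
        """Drain the buffer in increasing AbsDon order at end of stream."""
        released = []
        while self._heap:
            released.append(heapq.heappop(self._heap)[2])
        return released
```

NAL units with equal AbsDon values are released in arrival order here, which is one of the two orders the AbsDon rules allow.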

1441	7.  Payload Format Parameters

1443	   This section specifies the optional parameters.  A mapping of the
1444	   parameters with Session Description Protocol (SDP) [RFC4566] is also
1445	   provided for applications that use SDP.

1447	7.1.  Media Type Registration

1449	   The receiver MUST ignore any parameter unspecified in this memo.

1451	   Type name:            Video

1453	   Subtype name:         H266

1455	   Required parameters:  none

1457	   Optional parameters:

1459	      Editor's notes: To be added

1461	7.2.  SDP Parameters

1463	   The receiver MUST ignore any parameter unspecified in this memo.

1465	7.2.1.  Mapping of Payload Type Parameters to SDP

1467	   The media type video/H266 string is mapped to fields in the Session
1468	   Description Protocol (SDP) [RFC4566] as follows:

1470	   o  The media name in the "m=" line of SDP MUST be video.

1472	   o  The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the
1473	      media subtype).

1475	   o  The clock rate in the "a=rtpmap" line MUST be 90000.

1477	   o  OPTIONAL PARAMETERS:

1479	      Editor's notes: To be discussed here

1481	7.2.1.1.  SDP Example

1483	   An example of media representation in SDP is as follows:

1485	       m=video 49170 RTP/AVP 98
1486	       a=rtpmap:98 H266/90000
1487	       a=fmtp:98 profile-id=1; sprop-vps=<video parameter sets data>
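A receiver-side check of the "a=rtpmap" mapping rules can be sketched as follows. The function name check_rtpmap is hypothetical, and comparing the encoding name case-insensitively is an assumption of this sketch rather than a rule stated in this memo.

```python
def check_rtpmap(line: str) -> int:
    """Return the payload type if the line maps it to H266 at 90000 Hz."""
    prefix = "a=rtpmap:"
    if not line.startswith(prefix):
        raise ValueError("not an a=rtpmap line")
    pt_str, _, encoding = line[len(prefix):].strip().partition(" ")
    name, _, rest = encoding.partition("/")
    clock_rate = rest.split("/")[0]
    if name.upper() != "H266":
        raise ValueError("encoding name is not H266")
    if clock_rate != "90000":
        raise ValueError("clock rate is not 90000")
    return int(pt_str)
```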

1489	7.2.2.  Usage with SDP Offer/Answer Model

1491	   When [VVC] is offered over RTP using SDP in an offer/answer model
1492	   [RFC3264] for negotiation for unicast usage, the following
1493	   limitations and rules apply:

1495	   Placeholder: To add limitations and considerations.

1497	8.  Use with Feedback Messages

1499	   The following subsections define the use of the Picture Loss
1500	   Indication (PLI), Slice Loss Indication (SLI), Reference Picture
1501	   Selection Indication (RPSI), and Full Intra Request (FIR) feedback
1502	   messages with [VVC].  The PLI, SLI, and RPSI messages are defined in
1503	   [RFC4585], and the FIR message is defined in [RFC5104].

1505	8.1.  Picture Loss Indication (PLI)

1507	   As specified in RFC 4585, Section 6.3.1, the reception of a PLI by a
1508	   media sender indicates "the loss of an undefined amount of coded
1509	   video data belonging to one or more pictures".  Without having any
1510	   specific knowledge of the setup of the bitstream (such as use and
1511	   location of in-band parameter sets, non-IRAP decoder refresh points,
1512	   picture structures, and so forth), a reaction to the reception of a
1513	   PLI by a [VVC] sender SHOULD be to send an IRAP picture and relevant
1514	   parameter sets, potentially with sufficient redundancy to ensure
1515	   correct reception.  However, sometimes information about the
1516	   bitstream structure is known.  For example, state could have been
1517	   established outside of the mechanisms defined in this document that
1518	   parameter sets are conveyed out of band only, and stay static for the
1519	   duration of the session.  In that case, it is obviously unnecessary
1520	   to send them in-band as a result of the reception of a PLI.  Other
1521	   examples could be devised based on a priori knowledge of different
1522	   aspects of the bitstream structure.  In all cases, the timing and
1523	   congestion control mechanisms of RFC 4585 MUST be observed.

1525	8.2.  Slice Loss Indication (SLI)

1527	   For further study.  Maybe remove as there are no known
1528	   implementations of SLI in [HEVC]-based systems.

1530	8.3.  Reference Picture Selection Indication (RPSI)

1532	   Feedback-based reference picture selection has been shown as a
1533	   powerful tool to stop temporal error propagation for improved error
1534	   resilience [Girod99] [Wang05].  In one approach, the decoder side
1535	   tracks errors in the decoded pictures and informs the encoder side
1536	   that a particular picture that has been decoded relatively earlier is
1537	   correct and still present in the decoded picture buffer; it requests
1538	   the encoder to use that correct picture-availability information when
1539	   encoding the next picture, so as to stop further temporal error
1540	   propagation.  For this approach, the decoder side should use the RPSI
1541	   feedback message.

1543	   Encoders can encode some long-term reference pictures as specified in
1544	   [VVC] for purposes described in the previous paragraph without the
1545	   need of a huge decoded picture buffer.  As shown in [Wang05], with a
1546	   flexible reference picture management scheme, as in VVC, even a
1547	   decoded picture buffer size of two picture storage buffers would work
1548	   for the approach described in the previous paragraph.

1550	   The text above is copy-paste from RFC 7798.  If we keep the RPSI
1551	   message, it needs adaptation to the [VVC] syntax.  Doing so shouldn't
1552	   be too hard as the [VVC] reference picture mechanism is not too
1553	   different from the [HEVC] one.

1555	8.4.  Full Intra Request (FIR)

1557	   The purpose of the FIR message is to force an encoder to send an
1558	   independent decoder refresh point as soon as possible, while
1559	   observing applicable congestion-control-related constraints, such as
1560	   those set out in [RFC8082].

1562	   Upon reception of a FIR, a sender MUST send an IDR picture.
1563	   Parameter sets MUST also be sent, except when there is a priori
1564	   knowledge that the parameter sets have been correctly established.  A
1565	   typical example for that is an understanding between sender and
1566	   receiver, established by means outside this document, that parameter
1567	   sets are exclusively sent out-of-band.

1569	9.  Frame Marking

1571	   [FrameMarking] provides an extension mechanism for RTP.  The codec-
1572	   agnostic meta-data in the [FrameMarking] header provides valuable
1573	   video frame information.  Its usage with [VVC] is defined in this
1574	   section.  Refer to [FrameMarking] for any unspecified fields.  Two
1575	   header extensions are RECOMMENDED:

1577	   o  The short extension for non-scalable streams.

1579	   o  The long extension for scalable streams.

1581	9.1.  Frame Marking Short Extension

1583	   The fields for the short extension, as shown in Figure 11, are used
1584	   as described in the following.

1586	                          0                   1
1587	                          0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
1588	                         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1589	                         |  ID   |  L=0  |S|E|I|D|0 0 0 0|
1590	                         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1592	                    Short Frame Marking RTP Extension for [VVC]

1594	                                 Figure 11

1596	   The I bit MUST be 1 when the NAL unit type is 7-9 (inclusive),
1597	   otherwise it MUST be 0.

1599	   The D bit MUST be 1 when the syntax element ph_non_ref_pic_flag for a
1600	   picture is equal to 1, otherwise it MUST be 0.

1602	   The S bit MUST be set to 1 if any of the following conditions is true
1603	   and MUST be set to 0 otherwise:

1605	   o  The RTP packet is a single NAL unit packet and it is the first VCL
1606	      NAL unit, in decoding order, of a picture.

1608	   o  The RTP packet is an AP, and the NAL unit in the first contained
1609	      aggregation unit is the first VCL NAL unit, in decoding order, of
1610	      a picture.

1612	   o  The RTP packet is an FU with its S bit equal to 1 and the FU
1613	      payload contains a fragment of the first VCL NAL unit, in decoding
1614	      order, of a picture.

1616	   The E bit MUST be set to 1 if any of the following conditions is true
1617	   and MUST be set to 0 otherwise:

1619	   o  The RTP packet is a single NAL unit packet and it is the last VCL
1620	      NAL unit, in decoding order, of a picture.

1622	   o  The RTP packet is an AP and the NAL unit in the last contained
1623	      aggregation unit is the last VCL NAL unit, in decoding order, of a
1624	      picture.

1626	   o  The RTP packet is an FU with its E bit equal to 1 and the FU
1627	      payload contains a fragment of the last VCL NAL unit, in decoding
1628	      order, of a picture.
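The short extension of Figure 11 can be assembled as sketched below. The sketch assumes the one-byte RTP header extension form, in which ID is a negotiated 4-bit identifier in the range 1 to 14 and L is the data length minus one. The function name and its parameters are hypothetical. The S and E values are expected to be computed by the caller according to the conditions above.

```python
def short_frame_marking(ext_id: int, s: bool, e: bool,
                        nal_unit_type: int, non_ref_pic: bool) -> bytes:
    """Build the two octets of Figure 11: ID, L=0, then S|E|I|D|0000."""
    if not 1 <= ext_id <= 14:
        raise ValueError("one-byte header extension IDs are 1..14")
    i_bit = 7 <= nal_unit_type <= 9  # I = 1 for NAL unit types 7-9
    flags = (s << 7) | (e << 6) | (i_bit << 5) | (non_ref_pic << 4)
    return bytes([(ext_id << 4) | 0, flags])
```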

1630	9.2.  Frame Marking Long Extension

1632	       0                   1                   2                   3
1633	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1634	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1635	      |  ID   |  L=2  |S|E|I|D|B| TID |0|0|   LayerID |    TL0PICIDX  |
1636	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1638	                     Long Frame Marking RTP Extension for [VVC]

1640	                                 Figure 12

1642	   The fields for the long extension for scalable streams, as shown in
1643	   Figure 12, are used as described in the following.

1645	   The LayerID (6 bits) and TID (3 bits) from the NAL unit header
1646	   (Section 1.1.4) are mapped to the generic LID and TID fields in
1647	   [FrameMarking] as shown in Figure 12.

1649	   The I bit MUST be 1 when the NAL unit type is 7-9 (inclusive),
1650	   otherwise it MUST be 0.

1652	   The D bit MUST be 1 when the syntax element ph_non_ref_pic_flag for a
1653	   picture is equal to 1, otherwise it MUST be 0.

1655	   The S bit MUST be set to 1 if any of the following conditions is true
1656	   and MUST be set to 0 otherwise:

1658	   o  The RTP packet is a single NAL unit packet and it is the first VCL
1659	      NAL unit, in decoding order, of a picture.

1661	   o  The RTP packet is an AP, and the NAL unit in the first contained
1662	      aggregation unit is the first VCL NAL unit, in decoding order, of
1663	      a picture.

1665	   o  The RTP packet is an FU with its S bit equal to 1 and the FU
1666	      payload contains a fragment of the first VCL NAL unit, in decoding
1667	      order, of a picture.

1669	   The E bit MUST be set to 1 if any of the following conditions is true
1670	   and MUST be set to 0 otherwise:

1672	   o  The RTP packet is a single NAL unit packet and it is the last VCL
1673	      NAL unit, in decoding order, of a picture.

1675	   o  The RTP packet is an AP and the NAL unit in the last contained
1676	      aggregation unit is the last VCL NAL unit, in decoding order, of a
1677	      picture.

1679	   o  The RTP packet is an FU with its E bit equal to 1 and the FU
1680	      payload contains a fragment of the last VCL NAL unit, in decoding
1681	      order, of a picture.
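The long extension of Figure 12 can be assembled in the same way. The sketch again assumes the one-byte RTP header extension form with a negotiated 4-bit ID. The function name and parameters are hypothetical. The B bit and TL0PICIDX are taken as inputs because their semantics are left to [FrameMarking].

```python
def long_frame_marking(ext_id: int, s: bool, e: bool, nal_unit_type: int,
                       non_ref_pic: bool, b: bool, tid: int,
                       layer_id: int, tl0picidx: int) -> bytes:
    """Build the four octets of Figure 12 (ID, L=2, flags, LayerID, TL0PICIDX)."""
    if not 1 <= ext_id <= 14:
        raise ValueError("one-byte header extension IDs are 1..14")
    if not 0 <= tid <= 7:
        raise ValueError("TID is a 3-bit field")
    if not 0 <= layer_id <= 63:
        raise ValueError("LayerID is a 6-bit field")
    if not 0 <= tl0picidx <= 255:
        raise ValueError("TL0PICIDX is an 8-bit field")
    i_bit = 7 <= nal_unit_type <= 9  # I = 1 for NAL unit types 7-9
    flags = ((s << 7) | (e << 6) | (i_bit << 5) |
             (non_ref_pic << 4) | (b << 3) | tid)
    # The two bits ahead of LayerID are zero, so LayerID fills the low 6 bits.
    return bytes([(ext_id << 4) | 2, flags, layer_id, tl0picidx])
```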

1683	10.  Security Considerations

1685	   The scope of this Security Considerations section is limited to the
1686	   payload format itself and to one feature of [VVC] that may pose a
1687	   particularly serious security risk if implemented naively.  The
1688	   payload format, in isolation, does not form a complete system.
1689	   Implementers are advised to read and understand relevant security-
1690	   related documents, especially those pertaining to RTP (see the
1691	   Security Considerations section in [RFC3550]), and the security of
1692	   the call-control stack chosen (that may make use of the media type
1693	   registration of this memo).  Implementers should also consider known
1694	   security vulnerabilities of video coding and decoding implementations
1695	   in general and avoid those.

1697	   Within this RTP payload format, and with the exception of the user
1698	   data SEI message as described below, no security threats other than
1699	   those common to RTP payload formats are known.  In other words,
1700	   neither the various media-plane-based mechanisms, nor the signaling
1701	   part of this memo, seems to pose a security risk beyond those common
1702	   to all RTP-based systems.

1704	   RTP packets using the payload format defined in this specification
1705	   are subject to the security considerations discussed in the RTP
1706	   specification [RFC3550], and in any applicable RTP profile such as
1707	   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/
1708	   SAVPF [RFC5124].  However, as "Securing the RTP Framework: Why RTP
1709	   Does Not Mandate a Single Media Security Solution" [RFC7202]
1710	   discusses, it is not an RTP payload format's responsibility to
1711	   discuss or mandate what solutions are used to meet the basic security
1712	   goals like confidentiality, integrity and source authenticity for RTP
1713	   in general.  This responsibility lies with anyone using RTP in an
1714	   application.  They can find guidance on available security mechanisms
1715	   and important considerations in "Options for Securing RTP Sessions"
1716	   [RFC7201].  The rest of this section discusses the security-impacting
1717	   properties of the payload format itself.

1719	   Because the data compression used with this payload format is applied
1720	   end-to-end, any encryption needs to be performed after compression.
1721	   A potential denial-of-service threat exists for data encodings using
1722	   compression techniques that have non-uniform receiver-end
1723	   computational load.  The attacker can inject pathological datagrams
1724	   into the bitstream that are complex to decode and that cause the
1725	   receiver to be overloaded.  [VVC] is particularly vulnerable to such
1726	   attacks, as it is extremely simple to generate datagrams containing
1727	   NAL units that affect the decoding process of many future NAL units.
1728	   Therefore, the usage of data origin authentication and data integrity
1729	   protection of at least the RTP packet is RECOMMENDED, for example,
1730	   with SRTP [RFC3711].

1732	   Like HEVC [RFC7798], [VVC] includes a user data Supplemental
1733	   Enhancement Information (SEI) message.  This SEI message allows
1734	   inclusion of an arbitrary bitstring into the video bitstream.  Such a
1735	   bitstring could include JavaScript, machine code, and other active
1736	   content.  [VVC] leaves the handling of this SEI message to the
1737	   receiving system.  In order to avoid harmful side effects of the user
1738	   data SEI message, decoder implementations cannot naively trust its
1739	   content.  For example, it would be a bad and insecure implementation
1740	   practice to forward any JavaScript a decoder implementation detects
1741	   to a web browser.  The safest way to deal with user data SEI messages
1742	   is to simply discard them, but that can have negative side effects on
1743	   the quality of experience by the user.

1745	   End-to-end security with authentication, integrity, or
1746	   confidentiality protection will prevent a MANE from performing media-
1747	   aware operations other than discarding complete packets.  In the case
1748	   of confidentiality protection, it will even be prevented from
1749	   discarding packets in a media-aware way.  To be allowed to perform
1750	   such operations, a MANE is required to be a trusted entity that is
1751	   included in the security context establishment.

1753	11.  Congestion Control

1755	   Congestion control for RTP SHALL be used in accordance with RTP
1756	   [RFC3550] and with any applicable RTP profile, e.g., AVP [RFC3551].
1757	   If best-effort service is being used, an additional requirement is
1758	   that users of this payload format MUST monitor packet loss to ensure
1759	   that the packet loss rate is within an acceptable range.  Packet loss
1760	   is considered acceptable if a TCP flow across the same network path,
1761	   and experiencing the same network conditions, would achieve an
1762	   average throughput, measured on a reasonable timescale, that is not
1763	   less than all RTP streams combined are achieving.  This condition can
1764	   be satisfied by implementing congestion-control mechanisms to adapt
1765	   the transmission rate, the number of layers subscribed for a layered
1766	   multicast session, or by arranging for a receiver to leave the
1767	   session if the loss rate is unacceptably high.

1769	   The bitrate adaptation necessary for obeying the congestion control
1770	   principle is easily achievable when real-time encoding is used, for
1771	   example, by adequately tuning the quantization parameter.  However,
1772	   when pre-encoded content is being transmitted, bandwidth adaptation
1773	   requires the pre-coded bitstream to be tailored for such adaptivity.
1774	   The key mechanisms available in [VVC] are temporal scalability, and
1775	   spatial/SNR scalability.  A media sender can remove NAL units
1776	   belonging to higher temporal sublayers (i.e., those NAL units with a
1777	   high value of TID) or higher spatio-SNR layers (as indicated by
1778	   interpreting the VPS) until the sending bitrate drops to an
1779	   acceptable range.
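Temporal-sublayer thinning at a media sender can be sketched as below. The sketch assumes that the TID field occupies the three least significant bits of the second NAL unit header octet, a layout defined outside this section. The function name and the max_tid parameter are hypothetical. A real sender would also need to confirm that the remaining NAL units still form a decodable bitstream, which this sketch does not check.

```python
from typing import List


def drop_higher_sublayers(nal_units: List[bytes], max_tid: int) -> List[bytes]:
    """Keep only NAL units whose TID field is at most max_tid."""
    kept = []
    for nal in nal_units:
        tid = nal[1] & 0x07  # TID field of the two-octet NAL unit header
        if tid <= max_tid:
            kept.append(nal)
    return kept
```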

1781	   The mechanisms mentioned above generally work within a defined
1782	   profile and level and, therefore, no renegotiation of the channel is
1783	   required.  Only when non-downgradable parameters (such as profile)
1784	   are required to be changed does it become necessary to terminate and
1785	   restart the RTP stream(s).  This may be accomplished by using
1786	   different RTP payload types.

1788	   MANEs MAY remove certain unusable packets from the RTP stream when
1789	   that RTP stream was damaged due to previous packet losses.  This can
1790	   help reduce the network load in certain special cases.  For example,
1791	   MANEs can remove those FUs where the leading FUs belonging to the
1792	   same NAL unit have been lost or those dependent slice segments when
1793	   the leading slice segments belonging to the same slice have been
1794	   lost, because the trailing FUs or dependent slice segments are
1795	   meaningless to most decoders.  MANEs can also remove higher temporal
1796	   scalable layers if the outbound transmission (from the MANE's
1797	   viewpoint) experiences congestion.

1799	12.  IANA Considerations

1801	   Placeholder

1803	13.  Acknowledgements

1805	   Dr. Byeongdoo Choi is thanked for the video codec related technical
1806	   discussion and other aspects in this memo.  Xin Zhao and Dr. Xiang Li
1807	   are thanked for their contributions on [VVC] specification
1808	   descriptive content.  Spencer Dawkins is thanked for his valuable
1809	   review comments that led to great improvements of this memo.  Some
1810	   parts of this specification share text with the RTP payload format
1811	   for HEVC [RFC7798].  We thank the authors of that specification for
1812	   their excellent work.

1814	14.  References

1816	14.1.  Normative References

1818	   [H.266]    "ISO/IEC FDIS 23090-3 Information technology --- Coded
1819	              representation of immersive media --- Part 3 - Versatile
1820	              video coding", n.d.,
1821	              <https://www.iso.org/standard/73022.html>.

1823	   [ISO23090-3]
1824	              "ISO/IEC DIS Information technology --- Coded
1825	              representation of immersive media --- Part 3 Versatile
1826	              video coding", n.d.,
1827	              <https://www.iso.org/standard/73022.html>.

1829	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1830	              Requirement Levels", BCP 14, RFC 2119,
1831	              DOI 10.17487/RFC2119, March 1997,
1832	              <https://www.rfc-editor.org/info/rfc2119>.

1834	   [RFC3264]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
1835	              with Session Description Protocol (SDP)", RFC 3264,
1836	              DOI 10.17487/RFC3264, June 2002,
1837	              <https://www.rfc-editor.org/info/rfc3264>.

1839	   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
1840	              Jacobson, "RTP: A Transport Protocol for Real-Time
1841	              Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
1842	              July 2003, <https://www.rfc-editor.org/info/rfc3550>.

1844	   [RFC3551]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
1845	              Video Conferences with Minimal Control", STD 65, RFC 3551,
1846	              DOI 10.17487/RFC3551, July 2003,
1847	              <https://www.rfc-editor.org/info/rfc3551>.

1849	   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
1850	              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
1851	              RFC 3711, DOI 10.17487/RFC3711, March 2004,
1852	              <https://www.rfc-editor.org/info/rfc3711>.

1854	   [RFC4556]  Zhu, L. and B. Tung, "Public Key Cryptography for Initial
1855	              Authentication in Kerberos (PKINIT)", RFC 4556,
1856	              DOI 10.17487/RFC4556, June 2006,
1857	              <https://www.rfc-editor.org/info/rfc4556>.

1859	   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
1860	              Description Protocol", RFC 4566, DOI 10.17487/RFC4566,
1861	              July 2006, <https://www.rfc-editor.org/info/rfc4566>.

1863	   [RFC4585]  Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey,
1864	              "Extended RTP Profile for Real-time Transport Control
1865	              Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585,
1866	              DOI 10.17487/RFC4585, July 2006,
1867	              <https://www.rfc-editor.org/info/rfc4585>.

1869	   [RFC5104]  Wenger, S., Chandra, U., Westerlund, M., and B. Burman,
1870	              "Codec Control Messages in the RTP Audio-Visual Profile
1871	              with Feedback (AVPF)", RFC 5104, DOI 10.17487/RFC5104,
1872	              February 2008, <https://www.rfc-editor.org/info/rfc5104>.

1874	   [RFC5124]  Ott, J. and E. Carrara, "Extended Secure RTP Profile for
1875	              Real-time Transport Control Protocol (RTCP)-Based Feedback
1876	              (RTP/SAVPF)", RFC 5124, DOI 10.17487/RFC5124, February
1877	              2008, <https://www.rfc-editor.org/info/rfc5124>.

1879	   [RFC7656]  Lennox, J., Gross, K., Nandakumar, S., Salgueiro, G., and
1880	              B. Burman, Ed., "A Taxonomy of Semantics and Mechanisms
1881	              for Real-Time Transport Protocol (RTP) Sources", RFC 7656,
1882	              DOI 10.17487/RFC7656, November 2015,
1883	              <https://www.rfc-editor.org/info/rfc7656>.

1885	   [RFC8082]  Wenger, S., Lennox, J., Burman, B., and M. Westerlund,
1886	              "Using Codec Control Messages in the RTP Audio-Visual
1887	              Profile with Feedback with Layered Codecs", RFC 8082,
1888	              DOI 10.17487/RFC8082, March 2017,
1889	              <https://www.rfc-editor.org/info/rfc8082>.

1891	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
1892	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
1893	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

1895	   [VVC]      "ISO/IEC FDIS 23090-3 Information technology --- Coded
1896	              representation of immersive media --- Part 3 - Versatile
1897	              video coding", n.d.,
1898	              <https://www.iso.org/standard/73022.html>.

1900	14.2.  Informative References

1902	   [CABAC]    Sole, J., et al., "Transform coefficient coding in HEVC",
1903	              IEEE Transactions on Circuits and Systems for Video
1904	              Technology, DOI 10.1109/TCSVT.2012.2223055, December
1905	              2012.

1907	   [FrameMarking]
1908	              Berger, E., Nandakumar, S., and M. Zanaty, "Frame
1909	              Marking RTP Header Extension", Work in Progress,
1910	              draft-berger-avtext-framemarking, 2015.

1912	   [Girod99]  Girod, B., et al., "Feedback-based error control for
1913	              mobile video transmission", Proceedings of the IEEE,
1914	              DOI 10.1109/5.790632, October 1999.

1916	   [HEVC]     "High efficiency video coding", ITU-T Recommendation
1917	              H.265, April 2013.

1919	   [MPEG2S]   ISO/IEC, "Information technology - Generic coding
1920	              of moving pictures and associated audio information - Part
1921	              1: Systems", ISO International Standard 13818-1, 2013.

1923	   [RFC6051]  Perkins, C. and T. Schierl, "Rapid Synchronisation of RTP
1924	              Flows", RFC 6051, DOI 10.17487/RFC6051, November 2010,
1925	              <https://www.rfc-editor.org/info/rfc6051>.

1927	   [RFC6184]  Wang, Y., Even, R., Kristensen, T., and R. Jesup, "RTP
1928	              Payload Format for H.264 Video", RFC 6184,
1929	              DOI 10.17487/RFC6184, May 2011,
1930	              <https://www.rfc-editor.org/info/rfc6184>.

1932	   [RFC6190]  Wenger, S., Wang, Y., Schierl, T., and A. Eleftheriadis,
1933	              "RTP Payload Format for Scalable Video Coding", RFC 6190,
1934	              DOI 10.17487/RFC6190, May 2011,
1935	              <https://www.rfc-editor.org/info/rfc6190>.

1937	   [RFC7201]  Westerlund, M. and C. Perkins, "Options for Securing RTP
1938	              Sessions", RFC 7201, DOI 10.17487/RFC7201, April 2014,
1939	              <https://www.rfc-editor.org/info/rfc7201>.

1941	   [RFC7202]  Perkins, C. and M. Westerlund, "Securing the RTP
1942	              Framework: Why RTP Does Not Mandate a Single Media
1943	              Security Solution", RFC 7202, DOI 10.17487/RFC7202, April
1944	              2014, <https://www.rfc-editor.org/info/rfc7202>.

1946	   [RFC7798]  Wang, Y., Sanchez, Y., Schierl, T., Wenger, S., and M.
1947	              Hannuksela, "RTP Payload Format for High Efficiency Video
1948	              Coding (HEVC)", RFC 7798, DOI 10.17487/RFC7798, March
1949	              2016, <https://www.rfc-editor.org/info/rfc7798>.

1951	   [Wang05]   Wang, Y.-K., Zhu, C., and H. Li, "Error resilient
1952	              video coding using flexible reference frames", Visual
1953	              Communications and Image Processing 2005 (VCIP 2005),
1954	              July 2005.

1956	Appendix A.  Change History

1958	   draft-zhao-payload-rtp-vvc-00 ........ initial version

1960	   draft-zhao-payload-rtp-vvc-01 ........ editorial clarifications and
1961	   corrections

1963	   draft-ietf-payload-rtp-vvc-00 ........ initial WG draft

1965	   draft-ietf-payload-rtp-vvc-01 ........ VVC specification update

1967	   draft-ietf-payload-rtp-vvc-02 ........ VVC specification update

1969	   draft-ietf-payload-rtp-vvc-03 ........ VVC coding tool introduction
1970	   update

1972	Authors' Addresses

1974	   Shuai Zhao
1975	   Tencent
1976	   2747 Park Blvd
1977	   Palo Alto  94588
1978	   USA

1980	   Email: shuai.zhao@ieee.org

1982	   Stephan Wenger
1983	   Tencent
1984	   2747 Park Blvd
1985	   Palo Alto  94588

1987	   Email: stewe@stewe.org
1988	   Yago Sanchez
1989	   Fraunhofer HHI
1990	   Einsteinufer 37
1991	   Berlin  10587
1992	   Germany

1994	   Email: yago.sanchez@hhi.fraunhofer.de