idnits 2.17.1 

draft-ietf-avtcore-rtp-vvc-02.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** The abstract seems to contain references ([ISO23090-3]), which it
     shouldn't.  Please replace those with straight textual mentions of the
     documents in question.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == The document seems to lack the recommended RFC 2119 boilerplate, even if
     it appears to use RFC 2119 keywords -- however, there's a paragraph with
     a matching beginning. Boilerplate error?

     (The document does seem to have the reference to RFC 2119 which the
     ID-Checklist requires).
  -- The document date (July 11, 2020) is 1383 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  -- Looks like a reference, but probably isn't: '0' on line 1267

  -- Possible downref: Non-RFC (?) normative reference: ref. 'ISO23090-3'

  ** Obsolete normative reference: RFC 4566 (Obsoleted by RFC 8866)

  ** Downref: Normative reference to an Informational RFC: RFC 7656

  -- Possible downref: Non-RFC (?) normative reference: ref. 'VVC'


     Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 4 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	avtcore                                                          S. Zhao
3	Internet-Draft                                                 S. Wenger
4	Intended status: Standards Track                                 Tencent
5	Expires: January 12, 2021                                     Y. Sanchez
6	                                                          Fraunhofer HHI
7	                                                           July 11, 2020

9	          RTP Payload Format for Versatile Video Coding (VVC)
10	                     draft-ietf-avtcore-rtp-vvc-02

12	Abstract

14	   This memo describes an RTP payload format for the video coding
15	   standard ITU-T Recommendation [H.266] and ISO/IEC International
16	   Standard [ISO23090-3], both also known as Versatile Video Coding
17	   (VVC) and developed by the Joint Video Experts Team (JVET).  The RTP
18	   payload format allows for packetization of one or more Network
19	   Abstraction Layer (NAL) units in each RTP packet payload as well as
20	   fragmentation of a NAL unit into multiple RTP packets.  The payload
21	   format has wide applicability in videoconferencing, Internet video
22	   streaming, and high-bitrate entertainment-quality video, among other
23	   applications.

25	Status of This Memo

27	   This Internet-Draft is submitted in full conformance with the
28	   provisions of BCP 78 and BCP 79.

30	   Internet-Drafts are working documents of the Internet Engineering
31	   Task Force (IETF).  Note that other groups may also distribute
32	   working documents as Internet-Drafts.  The list of current Internet-
33	   Drafts is at https://datatracker.ietf.org/drafts/current/.

35	   Internet-Drafts are draft documents valid for a maximum of six months
36	   and may be updated, replaced, or obsoleted by other documents at any
37	   time.  It is inappropriate to use Internet-Drafts as reference
38	   material or to cite them other than as "work in progress."

40	   This Internet-Draft will expire on January 12, 2021.

42	Copyright Notice

44	   Copyright (c) 2020 IETF Trust and the persons identified as the
45	   document authors.  All rights reserved.

47	   This document is subject to BCP 78 and the IETF Trust's Legal
48	   Provisions Relating to IETF Documents
49	   (https://trustee.ietf.org/license-info) in effect on the date of
50	   publication of this document.  Please review these documents
51	   carefully, as they describe your rights and restrictions with respect
52	   to this document.  Code Components extracted from this document must
53	   include Simplified BSD License text as described in Section 4.e of
54	   the Trust Legal Provisions and are provided without warranty as
55	   described in the Simplified BSD License.

57	Table of Contents

59	   1.  Introduction
60	     1.1.  Overview of the VVC Codec
61	       1.1.1.  Coding-Tool Features (informative)
62	       1.1.2.  Systems and Transport Interfaces
63	       1.1.3.  Parallel Processing Support (informative)
64	       1.1.4.  NAL Unit Header
65	     1.2.  Overview of the Payload Format
66	   2.  Conventions
67	   3.  Definitions and Abbreviations
68	     3.1.  Definitions
69	       3.1.1.  Definitions from the VVC Specification
70	       3.1.2.  Definitions Specific to This Memo
71	     3.2.  Abbreviations
72	   4.  RTP Payload Format
73	     4.1.  RTP Header Usage
74	     4.2.  Payload Header Usage
75	     4.3.  Payload Structures
76	       4.3.1.  Single NAL Unit Packets
77	       4.3.2.  Aggregation Packets (APs)
78	       4.3.3.  Fragmentation Units
79	     4.4.  Decoding Order Number
80	   5.  Packetization Rules
81	   6.  De-packetization Process
82	   7.  Payload Format Parameters
83	     7.1.  Media Type Registration
84	     7.2.  SDP Parameters
85	       7.2.1.  Mapping of Payload Type Parameters to SDP
86	       7.2.2.  Usage with SDP Offer/Answer Model
87	   8.  Use with Feedback Messages
88	     8.1.  Picture Loss Indication (PLI)
89	     8.2.  Slice Loss Indication (SLI)
90	     8.3.  Reference Picture Selection Indication (RPSI)
91	     8.4.  Full Intra Request (FIR)
92	   9.  Frame Marking
93	     9.1.  Frame Marking Short Extension
94	     9.2.  Frame Marking Long Extension
95	   10. Security Considerations
96	   11. Congestion Control
97	   12. IANA Considerations
98	   13. Acknowledgements
99	   14. References
100	     14.1.  Normative References
101	     14.2.  Informative References
102	   Appendix A.  Change History
103	   Authors' Addresses

105	1.  Introduction

107	   The Versatile Video Coding [VVC] specification, formally published as
108	   both ITU-T Recommendation H.266 and ISO/IEC International Standard
109	   23090-3 [ISO23090-3], is currently in the ISO/IEC approval process
110	   and is planned for ratification in mid-2020.  H.266 is reported to
111	   provide significant coding efficiency gains over H.265 and earlier
112	   video codec formats.

114	   This memo describes an RTP payload format for VVC.  It shares its
115	   basic design with the NAL (Network Abstraction Layer) unit-based RTP
116	   payload formats of H.264 Video Coding [RFC6184], Scalable Video
117	   Coding (SVC) [RFC6190], High Efficiency Video Coding (HEVC) [RFC7798]
118	   and their respective predecessors.  With respect to design
119	   philosophy, security, congestion control, and overall implementation
120	   complexity, it has similar properties to those earlier payload format
121	   specifications.  This is a conscious choice, as at least RFC 6184 is
122	   widely deployed and generally known in the relevant implementer
123	   communities.  Certain mechanisms known from [RFC6190] were
124	   incorporated in VVC, as VVC version 1 supports temporal, spatial, and
125	   signal-to-noise ratio (SNR) scalability.

127	1.1.  Overview of the VVC Codec

129	   [VVC] and [HEVC] share a similar hybrid video codec design.  In this
130	   memo, we provide a very brief overview of those features of VVC that
131	   are, in some form, addressed by the payload format specified herein.
132	   Implementers have to read, understand, and apply the ITU-T/ISO/IEC
133	   specifications pertaining to [VVC] to arrive at interoperable, well-
134	   performing implementations.

136	   Conceptually, both [VVC] and [HEVC] include a Video Coding Layer
137	   (VCL), which is often used to refer to the coding-tool features, and
138	   a NAL, which is often used to refer to the systems and transport
139	   interface aspects of the codecs.

141	1.1.1.  Coding-Tool Features (informative)

143	   Coding tool features are described below with occasional reference to
144	   the coding tool set of [HEVC], which is well known in the community.

146	   Similar to earlier hybrid-video-coding-based standards, including
147	   HEVC, the following basic video coding design is employed by VVC.  A
148	   prediction signal is first formed by either intra- or motion-
149	   compensated prediction, and the residual (the difference between the
150	   original and the prediction) is then coded.  The gains in coding
151	   efficiency are achieved by redesigning and improving almost all parts
152	   of the codec over earlier designs.  In addition, [VVC] includes
153	   several tools to make the implementation on parallel architectures
154	   easier.

156	   Finally, [VVC] includes temporal, spatial, and SNR scalability as
157	   well as multiview coding support.

159	   Coding blocks and transform structure

161	   Among major coding-tool differences between HEVC and VVC, one of the
162	   important improvements is the more flexible coding tree structure in
163	   VVC, i.e., multi-type tree.  In addition to quadtree, binary and
164	   ternary trees are also supported, which significantly improves
165	   coding efficiency.  Moreover, the maximum size of a
166	   Coding Tree Unit (CTU) is increased from 64x64 to 128x128.  To
167	   improve the coding efficiency of the chroma signal, separate luma
168	   and chroma trees at the CTU level may be employed for intra-slices.
169	   The square transforms in HEVC are extended to non-square transforms
170	   for rectangular blocks resulting from binary and ternary tree splits.
171	   Besides, [VVC] supports multiple transform sets (MTS), including DCT-
172	   2, DST-7, and DCT-8 as well as the non-separable secondary transform.
173	   The transforms used in [VVC] can have different sizes with support
174	   for larger transform sizes.  For DCT-2, the transform sizes range
175	   from 2x2 to 64x64, and for DST-7 and DCT-8, the transform sizes range
176	   from 4x4 to 32x32.  In addition, [VVC] also supports sub-block
177	   transform for both intra and inter coded blocks.  For intra coded
178	   blocks, intra sub-partitioning (ISP) may be used to allow sub-block
179	   based intra prediction and transform.  For inter blocks, sub-block
180	   transform may be used assuming that only a part of an inter-block has
181	   non-zero transform coefficients.

183	   Entropy coding

185	   Similar to HEVC, [VVC] uses a single entropy-coding engine, which is
186	   based on Context Adaptive Binary Arithmetic Coding (CABAC) [CABAC],
187	   but with the support of multi-window sizes.  The window sizes can be
188	   initialized differently for different context models.  This design
189	   yields more efficient adaptation and better coding
190	   efficiency.  A joint chroma residual coding scheme is applied to
191	   further exploit the correlation between the residuals of two color
192	   components.  In VVC, different residual coding schemes are applied
193	   for regular transform coefficients and residual samples generated
194	   using transform-skip mode.

196	   In-loop filtering

198	   [VVC] has more feature support in loop filters than HEVC.  The
199	   deblocking filter in [VVC] is similar to HEVC's but operates on a
200	   smaller grid.  After deblocking and sample adaptive offset (SAO), an
201	   adaptive loop filter (ALF) may be used.  As a Wiener filter, ALF
202	   reduces distortion of decoded pictures.  In addition, [VVC] adds a
203	   new module before deblocking, called luma mapping with chroma
204	   scaling, that fully utilizes the dynamic range of the signal so that
205	   rate-distortion performance of both SDR and HDR content is improved.

207	   Motion prediction and coding

209	   Compared to HEVC, [VVC] introduces several improvements in this area.
210	   First, adaptive motion vector resolution (AMVR) saves bits on motion
211	   vectors by adaptively signaling the motion vector resolution.
212	   Second, affine motion compensation is included to capture complex
213	   motion such as zooming and rotation.  In addition, prediction
214	   refinement with optical flow (PROF) for affine mode is used to
215	   mimic affine motion at the pixel level.  Third, decoder-side
216	   motion vector refinement (DMVR) derives motion vectors at the
217	   decoder side based on block matching, so that fewer bits may be
218	   spent on motion vectors.  Bi-directional optical flow (BDOF) is a
219	   method similar to PROF.  BDOF adds a sample-wise offset at the 4x4
220	   sub-block level that is derived with equations based on gradients of
221	   the prediction samples and a motion difference relative to CU motion
222	   vectors.  Furthermore, merge with motion vector difference (MMVD) is
223	   a special mode that signals a limited set of motion vector
224	   differences on top of merge mode.  In addition to MMVD, there are
225	   three other special merge modes: sub-block merge, triangle, and
226	   combined intra-/inter-prediction (CIIP).  The sub-block merge list
227	   includes one candidate for sub-block temporal motion vector
228	   prediction (SbTMVP) and up to four candidates for affine motion
229	   vectors.  Triangle is based on triangular block motion compensation.
230	   CIIP combines intra- and inter-predictions with weighting.  Adaptive
231	   weighting may be employed with a block-level tool called
232	   bi-prediction with CU-based weighting (BCW), which provides more
233	   flexibility than in HEVC.

235	   Intra prediction and intra-coding
236	   To capture the diversified local image texture directions with finer
237	   granularity, [VVC] supports 65 angular directions instead of 33
238	   directions in HEVC.  The intra mode coding is based on a 6 most
239	   probable mode scheme, and the 6 most probable modes are derived using
240	   the neighboring intra prediction directions.  In addition, to deal
241	   with the different distributions of intra prediction angles for
242	   different block aspect ratios, a wide-angle intra prediction (WAIP)
243	   scheme is applied in [VVC] by including intra prediction angles
244	   beyond those present in HEVC.  Unlike HEVC which only allows using
245	   the most adjacent line of reference samples for intra prediction,
246	   [VVC] also allows using two further reference lines, known as
247	   multi-reference-line (MRL) intra prediction.  The additional
248	   reference lines can be used only for the 6 most probable intra
249	   prediction modes.  To capture the strong correlation between
250	   different colour components, VVC uses a cross-component linear mode
251	   (CCLM), which assumes a linear relationship between the luma sample
252	   values and their associated chroma samples.  For intra prediction,
253	   [VVC] also applies position-dependent prediction combination (PDPC)
254	   to refine the prediction samples closer to the intra prediction
255	   block boundary.  Matrix-based intra prediction (MIP) modes are also
256	   used in [VVC]; they generate an intra prediction block of up to 8x8
257	   using a weighted sum of downsampled neighboring reference samples,
258	   where the weights are hardcoded constants.

260	   Other coding-tool features

262	   [VVC] introduces dependent quantization (DQ) to reduce quantization
263	   error by state-based switching between two quantizers.

265	1.1.2.  Systems and Transport Interfaces

267	   [VVC] inherits the basic systems and transport interfaces designs
268	   from HEVC and H.264.  These include the NAL-unit-based syntax
269	   structure, the hierarchical syntax and data unit structure, the
270	   Supplemental Enhancement Information (SEI) message mechanism, and the
271	   video buffering model based on the Hypothetical Reference Decoder
272	   (HRD).  The scalability features of [VVC] are conceptually similar to
273	   the scalable variant of HEVC known as SHVC.  The hierarchical syntax
274	   and data unit structure consists of parameter sets at various levels
275	   (decoder, sequence (pertaining to all layers), sequence (pertaining
276	   to a single layer), picture), picture-level header parameters,
277	   slice-level header parameters, and lower-level parameters.

279	   A number of key components that influenced the Network Abstraction
280	   Layer design of [VVC] as well as this memo are described below.

282	   Decoding Capability Information
283	   The Decoding Capability Information (DCI) includes parameters that
284	   stay constant for the lifetime of a video bitstream, which in IETF
285	   terms can translate to the lifetime of a session.  The DCI can
286	   include profile, level, and sub-profile information to determine a
287	   maximum-complexity interoperability point that is guaranteed never
288	   to be exceeded, even if splicing of video sequences occurs within a
289	   session.  It further includes constraint flags, which can
290	   optionally be set to indicate that the video bitstream will be
291	   constrained in the use of certain features as indicated by the
292	   values of those flags.  With this, a bitstream can be labelled as
293	   not using certain tools, which allows, among other things, for
294	   resource allocation in a decoder implementation.

296	   Video parameter set

298	   The Video Parameter Set (VPS) pertains to a Coded Video Sequence
299	   (CVS) of multiple layers covering the same range of picture units,
300	   and includes, among other information, decoding dependency expressed
301	   as information for reference picture set construction of enhancement
302	   layers.  The VPS provides a "big picture" of a scalable sequence,
303	   including what types of operation points are provided, the profile,
304	   tier, and level of the operation points, and some other high-level
305	   properties of the bitstream that can be used as the basis for session
306	   negotiation and content selection, etc.  One VPS may be referenced by
307	   one or more Sequence Parameter Sets.

309	   Sequence parameter set

311	   The Sequence Parameter Set (SPS) contains syntax elements pertaining
312	   to a coded layer video sequence (CLVS), which is a group of pictures
313	   belonging to the same layer, starting with a random access point, and
314	   followed by pictures that may depend on each other and the random
315	   access point picture.  In MPEG-2, the equivalent of a CVS was a
316	   Group of Pictures (GOP), which normally started with an I frame and
317	   was followed by P and B frames.  While more complex in its options of
318	   random access points, VVC retains this basic concept.  One remarkable
319	   difference of VVC is that a CLVS may start with a Gradual Decoding
320	   Refresh (GDR) picture, without requiring presence of traditional
321	   random access points in the bitstream, such as Instantaneous Decoding
322	   Refresh (IDR) or Clean Random Access (CRA) pictures.  In many TV-like
323	   applications, a CVS contains a few hundred milliseconds to a few
324	   seconds of video.  In video conferencing (without switching MCUs
325	   involved), a CVS can be as long in duration as the whole session.

327	   Picture and Adaptation parameter sets

329	   The Picture Parameter Set and the Adaptation Parameter Set (PPS and
330	   APS, respectively) carry information pertaining to zero or more
331	   pictures and zero or more slices, respectively.  The PPS contains
332	   information that is likely to stay constant from picture to picture,
333	   at least for pictures of a certain type, whereas the APS contains
334	   information, such as adaptive loop filter coefficients, that is
335	   likely to change from picture to picture or even within a picture.  A
336	   single APS can be referenced by slices of the same picture if that
337	   APS contains information about luma mapping with chroma scaling
338	   (LMCS), but different APSs can be referenced by slices of the same
339	   picture if those APSs contain information about ALF.

341	   Picture Header

343	   A Picture Header contains information that is common to all slices
344	   that belong to the same picture.  Being able to send that information
345	   as a separate NAL unit when pictures are split into several slices
346	   allows for saving bitrate, compared to repeating the same information
347	   in all slices.  However, there might be scenarios where low-bitrate
348	   video is transmitted using a single slice per picture.  Having a
349	   separate NAL unit to convey that information incurs overhead in
350	   such scenarios.  Therefore, VVC specifies signaling that
351	   indicates whether Picture Headers are present in the CLVS or not.

353	   Profile, tier, and level

355	   The profile, tier and level syntax structures in DCI, VPS and SPS
356	   contain profile, tier, level information for all layers that refer to
357	   the DCI, for layers associated with one or more output layer sets
358	   specified by the VPS, and for any layer that refers to the SPS,
359	   respectively.

361	   Sub-Profiles

363	   Within the [VVC] specification, a sub-profile is a 32-bit number,
364	   coded according to ITU-T Rec. T.35, that carries no semantics.
365	   It is carried in the profile_tier_level structure and hence
366	   (potentially) present in the DCI, VPS, and SPS.  External
367	   registration bodies can register a T.35 codepoint with ITU-T
368	   registration authorities and associate with their registration a
369	   description of bitstream complexity restrictions beyond the profiles
370	   defined by ITU-T and ISO/IEC.  This would allow encoder manufacturers
371	   to label the bitstreams generated by their encoder as complying with
372	   such a sub-profile.  It is expected that upstream standardization
373	   organizations (such as DVB and ATSC), as well as walled-garden video
374	   services, will take advantage of this labelling system.  In contrast
375	   to "normal" profiles, it is expected that sub-profiles may indicate
376	   encoder choices traditionally left open in the (decoder-centric)
377	   video coding specs, such as GOP structures, minimum/maximum QP
378	   values, and the mandatory use of certain tools or SEI messages.

380	   Constraint Flags

382	   The profile_tier_level structure carries a considerable number of
383	   constraint flags, which an encoder can use to indicate to a decoder
384	   that it will not use a certain tool or technology.  They were
385	   included in reaction to a perceived market need for labelling a
386	   bitstream as not exercising a certain tool that has become
387	   commercially unviable.

389	   Temporal scalability support

391	      Editor's note: this text needs to be updated along with the new
392	      VVC draft in the future.

394	   [VVC] includes support of temporal scalability, by inclusion of the
395	   signaling of TemporalId in the NAL unit header, the restriction that
396	   pictures of a particular temporal sub-layer cannot be used for inter
397	   prediction reference by pictures of a lower temporal sub-layer, the
398	   sub-bitstream extraction process, and the requirement that each sub-
399	   bitstream extraction output be a conforming bitstream.  Media-Aware
400	   Network Elements (MANEs) can utilize the TemporalId in the NAL unit
401	   header for stream adaptation purposes based on temporal scalability.
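
   As a rough illustration of the stream adaptation described above,
   the following Python sketch drops NAL units above a target temporal
   sub-layer.  It assumes the two-byte NAL unit header layout given in
   Section 1.1.4 (TID in the low three bits of the second byte).  The
   function names are hypothetical, and the sketch deliberately ignores
   everything else that the sub-bitstream extraction process in [VVC]
   specifies; it is not a conforming extractor.

```python
from typing import Iterable, List


def temporal_id(nal_unit: bytes) -> int:
    """Return TemporalId, i.e. nuh_temporal_id_plus1 minus 1."""
    if len(nal_unit) < 2:
        raise ValueError("a VVC NAL unit header is two bytes long")
    tid_plus1 = nal_unit[1] & 0x07  # low 3 bits of the second header byte
    if tid_plus1 == 0:
        raise ValueError("nuh_temporal_id_plus1 equal to 0 is illegal")
    return tid_plus1 - 1


def drop_higher_sublayers(nal_units: Iterable[bytes],
                          max_temporal_id: int) -> List[bytes]:
    """Keep only NAL units whose TemporalId is <= max_temporal_id.

    Illustrative helper (an assumption of this sketch): a MANE could
    apply a filter like this to lower the frame rate of a stream.
    """
    return [n for n in nal_units if temporal_id(n) <= max_temporal_id]
```

   Dropping by TemporalId is safe for inter prediction because, as
   stated above, pictures of a lower temporal sub-layer never reference
   pictures of a higher one.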

403	   Spatial, SNR, View Scalability

405	   [VVC] includes support for spatial, SNR, and View scalability.
406	   Scalable video coding is widely considered to have technical benefits
407	   and enrich services for various video applications.  Until recently,
408	   however, the functionality has not been included in the main profiles
409	   of video codecs and has not been widely deployed due to additional
410	   costs.  In VVC, however, all those forms of scalability are
411	   supported natively through the signaling of the layer_id in the NAL
412	   unit header, the VPS, which associates layers with given layer_ids
413	   to each other, reference picture selection, reference picture
414	   resampling for spatial scalability, and a number of other mechanisms
415	   not relevant for this memo.  Scalability support can be implemented
416	   in a single decoding "loop" and is widely considered a comparatively
417	   lightweight operation.

419	      Spatial Scalability

421	         With the existence of Reference Picture Resampling (RPR), in
422	         the "main" profile of VVC, the additional burden for
423	         scalability support is just a minor modification of the high-
424	         level syntax (HLS).  In technical aspects, the inter-layer
425	         prediction is employed in a scalable system to improve the
426	         coding efficiency of the enhancement layers.  In addition to
427	         the spatial and temporal motion-compensated predictions that
428	         are available in a single-layer codec, the inter-layer
429	         prediction in [VVC] uses the resampled video data of the
430	         reconstructed reference picture from a reference layer to
431	         predict the current enhancement layer.  The resampling
432	         process for inter-layer prediction is performed at the block
433	         level, without modifying the existing interpolation process for
434	         motion compensation compared to non-scalable RPR.  This means
435	         that no additional resampling process is needed to support
436	         scalability.

438	      SNR Scalability

440	         SNR scalability is similar to Spatial Scalability except that
441	         the resampling factors are 1:1; in other words, there is no
442	         change in resolution, but there is inter-layer prediction.

444	   SEI Messages

446	   Supplemental Enhancement Information (SEI) messages are codepoints
447	   in the bitstream that do not influence the decoding process as
448	   specified in the [VVC] spec, but address issues of representation/
449	   rendering of the decoded bitstream and label the bitstream for
450	   certain applications, among other, similar tasks.  The overall
451	   concept of SEI messages and many of the messages themselves have
452	   been inherited from the H.264 and HEVC specs.  In the [VVC]
453	   environment, some of the SEI messages considered to be generally
454	   useful also in other video coding technologies have been moved out
455	   of the main specification into a companion document (TO DO: add
456	   reference once ITU designation is known).

458	1.1.3.  Parallel Processing Support (informative)

460	   Compared to HEVC, the [VVC] design to support parallelization offers
461	   numerous improvements.  Some of those improvements are still
462	   undergoing changes in JVET.  Information, to the extent relevant for
463	   this memo, will be added in future versions of this memo as the
464	   standardization in JVET progresses and the technology stabilizes.

466	      Editor's note: an update on sub-picture/slice/tile is needed
467	      following the new VVC draft.

469	1.1.4.  NAL Unit Header

471	   [VVC] maintains the NAL unit concept of HEVC with modifications.  VVC
472	   uses a two-byte NAL unit header, as shown in Figure 1.  The payload
473	   of a NAL unit refers to the NAL unit excluding the NAL unit header.

475	                     +---------------+---------------+
476	                     |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
477	                     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
478	                     |F|Z| LayerID   |  Type   | TID |
479	                     +---------------+---------------+

481	                   The Structure of the VVC NAL Unit Header.

483	                                 Figure 1

485	   The semantics of the fields in the NAL unit header are as specified
486	   in [VVC] and described briefly below for convenience.  In addition to
487	   the name and size of each field, the corresponding syntax element
488	   name in [VVC] is also provided.

490	   F: 1 bit

492	      forbidden_zero_bit.  Required to be zero in VVC.  Note that the
493	      inclusion of this bit in the NAL unit header was to enable
494	      transport of [VVC] video over MPEG-2 transport systems (avoidance
495	      of start code emulations) [MPEG2S].  In the context of this memo
496	      the value 1 may be used to indicate a syntax violation, e.g., for
497	      a NAL unit resulting from aggregating a number of fragmented units
498	      of a NAL unit but missing the last fragment, as described in
499	      Section TBD.

501	   Z: 1 bit

503	      nuh_reserved_zero_bit.  Required to be zero in VVC, and reserved
504	      for future extensions by ITU-T and ISO/IEC.
505	      This memo does not overload the "Z" bit for local extensions, as
506	      a) overloading the "F" bit is sufficient and b) to preserve the
507	      usefulness of this memo to possible future versions of [VVC].

509	   LayerId: 6 bits

511	      nuh_layer_id.  Identifies the layer a NAL unit belongs to, wherein
512	      a layer may be, e.g., a spatial scalable layer or a quality
513	      scalable layer.

515	   Type: 5 bits
516	      nal_unit_type.  This field specifies the NAL unit type as defined
517	      in Table 7-1 of VVC.  For a reference of all currently defined NAL
518	      unit types and their semantics, please refer to Section 7.4.2.2 in
519	      [VVC].

521	   TID: 3 bits

523	      nuh_temporal_id_plus1.  This field specifies the temporal
524	      identifier of the NAL unit plus 1.  The value of TemporalId is
525	      equal to TID minus 1.  A TID value of 0 is illegal to ensure that
526	      there is at least one bit in the NAL unit header equal to 1, to
527	      enable independent considerations of start code emulations in the
528	      NAL unit header and in the NAL unit payload data.
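
The field layout in Figure 1 can be illustrated with a short sketch. The
names NalHeader, parse_nal_header, pack_nal_header, and temporal_id are
hypothetical helpers chosen for this illustration and are not defined by
[VVC] or by this memo.

```python
from typing import NamedTuple


class NalHeader(NamedTuple):
    f: int          # forbidden_zero_bit (1 bit)
    z: int          # nuh_reserved_zero_bit (1 bit)
    layer_id: int   # nuh_layer_id (6 bits)
    nal_type: int   # nal_unit_type (5 bits)
    tid: int        # nuh_temporal_id_plus1 (3 bits)


def parse_nal_header(data: bytes) -> NalHeader:
    """Split the two-byte NAL unit header into its five fields."""
    if len(data) < 2:
        raise ValueError("a NAL unit header is two bytes long")
    b0, b1 = data[0], data[1]
    return NalHeader(
        f=b0 >> 7,
        z=(b0 >> 6) & 0x01,
        layer_id=b0 & 0x3F,
        nal_type=b1 >> 3,
        tid=b1 & 0x07,
    )


def pack_nal_header(h: NalHeader) -> bytes:
    """Inverse of parse_nal_header."""
    return bytes([(h.f << 7) | (h.z << 6) | h.layer_id,
                  (h.nal_type << 3) | h.tid])


def temporal_id(h: NalHeader) -> int:
    """TemporalId is TID minus 1; a TID value of 0 is illegal."""
    if h.tid == 0:
        raise ValueError("TID value 0 is illegal")
    return h.tid - 1
```

For example, the header bytes 0x00 0xE1 decode to F=0, Z=0, LayerId=0,
Type=28, and TID=1, which corresponds to TemporalId 0.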

530	1.2.  Overview of the Payload Format

532	   This payload format defines the following processes required for
533	   transport of [VVC] coded data over RTP [RFC3550]:

535	   o  Usage of RTP header with this payload format

537	   o  Packetization of [VVC] coded NAL units into RTP packets using
538	      three types of payload structures: single NAL unit packet,
539	      aggregation packet, and fragmentation unit

541	   o  Transmission of [VVC] NAL units of the same bitstream within a
542	      single RTP stream.

544	   o  Media type parameters to be used with the Session Description
545	      Protocol (SDP) [RFC4566]

547	   o  Frame-marking mapping [FrameMarking]

549	2.  Conventions

551	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
552	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
553	   "OPTIONAL" in this document are to be interpreted as described in BCP
554	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
555	   capitals, as shown above.

557	3.  Definitions and Abbreviations

559	3.1.  Definitions

561	   This document uses the terms and definitions of VVC.  Section 3.1.1
562	   lists relevant definitions from [VVC] for convenience.  Section 3.1.2
563	   provides definitions specific to this memo.

565	3.1.1.  Definitions from the VVC Specification

567	      Editor notes:

569	   Access unit (AU): A set of PUs that belong to different layers and
570	   contain coded pictures associated with the same time for output from
571	   the DPB.

573	   Adaptation parameter set (APS): A syntax structure containing syntax
574	   elements that apply to zero or more slices as determined by zero or
575	   more syntax elements found in slice headers.

577	   Bitstream: A sequence of bits, in the form of a NAL unit stream or a
578	   byte stream, that forms the representation of a sequence of AUs
579	   forming one or more coded video sequences (CVSs).

581	   Coded picture: A coded representation of a picture comprising VCL NAL
582	   units with a particular value of nuh_layer_id within an AU and
583	   containing all CTUs of the picture.

585	   Clean random access (CRA) PU: A PU in which the coded picture is a
586	   CRA picture.

588	   Clean random access (CRA) picture: An IRAP picture for which each VCL
589	   NAL unit has nal_unit_type equal to CRA_NUT.

591	   Coded video sequence (CVS): A sequence of AUs that consists, in
592	   decoding order, of a CVSS AU, followed by zero or more AUs that are
593	   not CVSS AUs, including all subsequent AUs up to but not including
594	   any subsequent AU that is a CVSS AU.

596	   Coded video sequence start (CVSS) AU: An AU in which there is a PU
597	   for each layer in the CVS and the coded picture in each PU is a CLVSS
598	   picture.

600	   Coded layer video sequence (CLVS): A sequence of PUs with the same
601	   value of nuh_layer_id that consists, in decoding order, of a CLVSS
602	   PU, followed by zero or more PUs that are not CLVSS PUs, including
603	   all subsequent PUs up to but not including any subsequent PU that is
604	   a CLVSS PU.

606	   Coded layer video sequence start (CLVSS) PU: A PU in which the coded
607	   picture is a CLVSS picture.

609	   Coded layer video sequence start (CLVSS) picture: A coded picture
610	   that is an IRAP picture with NoOutputBeforeRecoveryFlag equal to 1 or
611	   a GDR picture with NoOutputBeforeRecoveryFlag equal to 1.

613	   Coding tree unit (CTU): A CTB of luma samples, two corresponding CTBs
614	   of chroma samples of a picture that has three sample arrays, or a CTB
615	   of samples of a monochrome picture or a picture that is coded using
616	   three separate colour planes and syntax structures used to code the
617	   samples.

619	   Decoding Capability Information (DCI): A syntax structure containing
620	   syntax elements that apply to the entire bitstream.

622	   Decoded picture buffer (DPB): A buffer holding decoded pictures for
623	   reference, output reordering, or output delay specified for the
624	   hypothetical reference decoder.

626	   Gradual decoding refresh (GDR) picture: A picture for which each VCL
627	   NAL unit has nal_unit_type equal to GDR_NUT.

629	   Instantaneous decoding refresh (IDR) PU: A PU in which the coded
630	   picture is an IDR picture.

632	   Instantaneous decoding refresh (IDR) picture: An IRAP picture for
633	   which each VCL NAL unit has nal_unit_type equal to IDR_W_RADL or
634	   IDR_N_LP.

636	   Intra random access point (IRAP) AU: An AU in which there is a PU for
637	   each layer in the CVS and the coded picture in each PU is an IRAP
638	   picture.

640	   Intra random access point (IRAP) PU: A PU in which the coded picture
641	   is an IRAP picture.

643	   Intra random access point (IRAP) picture: A coded picture for which
644	   all VCL NAL units have the same value of nal_unit_type in the range
645	   of IDR_W_RADL to CRA_NUT, inclusive.

647	   Layer: A set of VCL NAL units that all have a particular value of
648	   nuh_layer_id and the associated non-VCL NAL units.

650	   Network abstraction layer (NAL) unit: A syntax structure containing
651	   an indication of the type of data to follow and bytes containing that
652	   data in the form of an RBSP interspersed as necessary with emulation
653	   prevention bytes.

655	   Network abstraction layer (NAL) unit stream: A sequence of NAL units.

657	   Operation point (OP): A temporal subset of an OLS, identified by an
658	   OLS index and a highest value of TemporalId.

660	   Picture parameter set (PPS): A syntax structure containing syntax
661	   elements that apply to zero or more entire coded pictures as
662	   determined by a syntax element found in each slice header.

664	   Picture unit (PU): A set of NAL units that are associated with each
665	   other according to a specified classification rule, are consecutive
666	   in decoding order, and contain exactly one coded picture.

668	   Random access: The act of starting the decoding process for a
669	   bitstream at a point other than the beginning of the stream.

671	   Sequence parameter set (SPS): A syntax structure containing syntax
672	   elements that apply to zero or more entire CLVSs as determined by the
673	   content of a syntax element found in the PPS referred to by a syntax
674	   element found in each picture header.

676	   Slice: An integer number of complete tiles or an integer number of
677	   consecutive complete CTU rows within a tile of a picture that are
678	   exclusively contained in a single NAL unit.

680	   Sub-layer: A temporal scalable layer of a temporal scalable bitstream
681	   consisting of VCL NAL units with a particular value of the TemporalId
682	   variable, and the associated non-VCL NAL units.

684	   Subpicture: A rectangular region of one or more slices within a
685	   picture.

687	   Sub-layer representation: A subset of the bitstream consisting of NAL
688	   units of a particular sub-layer and the lower sub-layers.

690	   Tile: A rectangular region of CTUs within a particular tile column
691	   and a particular tile row in a picture.

693	   Tile column: A rectangular region of CTUs having a height equal to
694	   the height of the picture and a width specified by syntax elements in
695	   the picture parameter set.

697	   Tile row: A rectangular region of CTUs having a height specified by
698	   syntax elements in the picture parameter set and a width equal to the
699	   width of the picture.

701	   Video coding layer (VCL) NAL unit: A collective term for coded slice
702	   NAL units and the subset of NAL units that have reserved values of
703	   nal_unit_type that are classified as VCL NAL units in this
704	   Specification.

706	3.1.2.  Definitions Specific to This Memo

708	   Media-Aware Network Element (MANE): A network element, such as a
709	   middlebox, selective forwarding unit, or application-layer gateway
710	   that is capable of parsing certain aspects of the RTP payload headers
711	   or the RTP payload and reacting to their contents.

713	      Editor Notes: the following informative note needs to be updated
714	      along with the frame marking update

716	      Informative note: The concept of a MANE goes beyond normal routers
717	      or gateways in that a MANE has to be aware of the signaling (e.g.,
718	      to learn about the payload type mappings of the media streams),
719	      and in that it has to be trusted when working with Secure RTP
720	      (SRTP).  The advantage of using MANEs is that they allow packets
721	      to be dropped according to the needs of the media coding.  For
722	      example, if a MANE has to drop packets due to congestion on a
723	      certain link, it can identify and remove those packets whose
724	      elimination produces the least adverse effect on the user
725	      experience.  After dropping packets, MANEs must rewrite RTCP
726	      packets to match the changes to the RTP stream, as specified in
727	      Section 7 of [RFC3550].

729	   NAL unit decoding order: A NAL unit order that conforms to the
730	   constraints on NAL unit order given in Section 7.4.2.4 in [VVC] and
731	   follows the order of NAL units in the bitstream.

733	   NAL unit output order: A NAL unit order in which NAL units of
734	   different access units are in the output order of the decoded
735	   pictures corresponding to the access units, as specified in [VVC],
736	   and in which NAL units within an access unit are in their decoding
737	   order.

739	   RTP stream: See [RFC7656].  Within the scope of this memo, one RTP
740	   stream is utilized to transport one or more temporal sub-layers.

742	   Transmission order: The order of packets in ascending RTP sequence
743	   number order (in modulo arithmetic).  Within an aggregation packet,
744	   the NAL unit transmission order is the same as the order of
745	   appearance of NAL units in the packet.
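
Because RTP sequence numbers are 16 bits wide and wrap around, "ascending
order (in modulo arithmetic)" needs a wrap-aware comparison. The following
is a minimal sketch; the function name seq_precedes is hypothetical, and
treating half of the number space as "ahead" is a common convention
assumed here rather than a rule stated in this memo.

```python
def seq_precedes(a: int, b: int) -> bool:
    """Return True if 16-bit sequence number a comes before b, modulo 2**16.

    b is considered later than a when it is between 1 and 0x7FFF steps
    ahead of a in the circular sequence number space.
    """
    diff = (b - a) & 0xFFFF
    return 0 < diff < 0x8000
```

With this helper, sequence number 65535 precedes 0, whereas 0 does not
precede 65535.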

747	3.2.  Abbreviations

749	   AU         Access Unit

751	   AP         Aggregation Packet

753	   CTU        Coding Tree Unit
754	   CVS        Coded Video Sequence

756	   DPB        Decoded Picture Buffer

758	   DCI        Decoding Capability Information

760	   DON        Decoding Order Number

762	   FIR        Full Intra Request

764	   FU         Fragmentation Unit

766	   HRD        Hypothetical Reference Decoder

768	   IDR        Instantaneous Decoding Refresh

770	   MANE       Media-Aware Network Element

772	   MTU        Maximum Transfer Unit

774	   NAL        Network Abstraction Layer

776	   NALU       Network Abstraction Layer Unit

778	   PLI        Picture Loss Indication

780	   PPS        Picture Parameter Set

782	   RPS        Reference Picture Set

784	   RPSI       Reference Picture Selection Indication

786	   SEI        Supplemental Enhancement Information

788	   SLI        Slice Loss Indication

790	   SPS        Sequence Parameter Set

792	   VCL        Video Coding Layer

794	   VPS        Video Parameter Set

796	4.  RTP Payload Format
797	4.1.  RTP Header Usage

799	   The format of the RTP header is specified in [RFC3550] (reprinted as
800	   Figure 2 for convenience).  This payload format uses the fields of
801	   the header in a manner consistent with that specification.

803	   The RTP payload (and the settings for some RTP header bits) for
804	   aggregation packets and fragmentation units are specified in
805	   Section 4.3.2 and Section 4.3.3, respectively.

807	       0                   1                   2                   3
808	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
809	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
810	      |V=2|P|X|  CC   |M|     PT      |       sequence number         |
811	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
812	      |                           timestamp                           |
813	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
814	      |           synchronization source (SSRC) identifier            |
815	      +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
816	      |            contributing source (CSRC) identifiers             |
817	      |                             ....                              |
818	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

820	                         RTP Header According to [RFC3550]

822	                                 Figure 2

824	   The RTP header information to be set according to this RTP payload
825	   format is set as follows:

827	   Marker bit (M): 1 bit

829	      Set for the last packet of the access unit, carried in the current
830	      RTP stream.  This is in line with the normal use of the M bit in
831	      video formats to allow an efficient playout buffer handling.

833	         Editor notes: The informative note below needs updating once
834	         the NAL unit type table is stable in the [VVC] spec.

836	         Informative note: The content of a NAL unit does not tell
837	         whether or not the NAL unit is the last NAL unit, in decoding
838	         order, of an access unit.  An RTP sender implementation may
839	         obtain this information from the video encoder.  If, however,
840	         the implementation cannot obtain this information directly from
841	         the encoder, e.g., when the bitstream was pre-encoded, and also
842	         there is no timestamp allocated for each NAL unit, then the
843	         sender implementation can inspect subsequent NAL units in
844	         decoding order to determine whether or not the NAL unit is the
845	         last NAL unit of an access unit as follows.  A NAL unit is
846	         determined to be the last NAL unit of an access unit if it is
847	         the last NAL unit of the bitstream.  A NAL unit naluX is also
848	         determined to be the last NAL unit of an access unit if both
849	         the following conditions are true: 1) the next VCL NAL unit
850	         naluY in decoding order has the high-order bit of the first
851	         byte after its NAL unit header equal to 1 or nal_unit_type
852	         equal to 19, and 2) all NAL units between naluX and naluY, when
853	         present, have nal_unit_type in the range of 13 to 17, inclusive,
854	         equal to 20, equal to 23 or equal to 26.

856	   Payload Type (PT): 7 bits

858	      The assignment of an RTP payload type for this new packet format
859	      is outside the scope of this document and will not be specified
860	      here.  The assignment of a payload type has to be performed either
861	      through the profile used or in a dynamic way.

863	   Sequence Number (SN): 16 bits

865	      Set and used in accordance with [RFC3550].

867	   Timestamp: 32 bits

869	      The RTP timestamp is set to the sampling timestamp of the content.
870	      A 90 kHz clock rate MUST be used.  If the NAL unit has no timing
871	      properties of its own (e.g., parameter set and SEI NAL units), the
872	      RTP timestamp MUST be set to the RTP timestamp of the coded
873	      picture of the access unit in which the NAL unit (according to
874	      Annex D of VVC) is included.  Receivers MUST use the RTP timestamp
875	      for the display process, even when the bitstream contains picture
876	      timing SEI messages or decoding unit information SEI messages as
877	      specified in VVC.
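
As an illustration of the 90 kHz clock, the sketch below derives an RTP
timestamp from a frame index and a constant frame rate. The function name
and its parameters are hypothetical. A constant frame rate and the 'base'
value (the initial timestamp offset, which [RFC3550] recommends choosing
randomly) are assumptions made for this example only.

```python
RTP_CLOCK_HZ = 90000  # clock rate required by this payload format


def rtp_timestamp(frame_index: int, fps_num: int, fps_den: int = 1,
                  base: int = 0) -> int:
    """Timestamp of frame 'frame_index' at frame rate fps_num/fps_den.

    Integer arithmetic avoids rounding drift; the result wraps at 2**32
    because the RTP timestamp field is 32 bits wide.
    """
    ticks = frame_index * RTP_CLOCK_HZ * fps_den // fps_num
    return (base + ticks) % 2**32
```

At 30 frames per second consecutive pictures are 3000 ticks apart, and at
30000/1001 frames per second they are 3003 ticks apart. All NAL units of
one access unit, including parameter sets and SEI, carry the same value.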

879	   Synchronization source (SSRC): 32 bits

881	      Used to identify the source of the RTP packets.  A single SSRC is
882	      used for all parts of a single bitstream.

884	4.2.  Payload Header Usage

886	   The first two bytes of the payload of an RTP packet are referred to
887	   as the payload header.  The payload header consists of the same
888	   fields (F, Z, LayerId, Type, and TID) as the NAL unit header as shown
889	   in Section 1.1.4, irrespective of the type of the payload structure.

891	   The TID value indicates (among other things) the relative importance
892	   of an RTP packet, for example, because NAL units belonging to higher
893	   temporal sub-layers are not used for the decoding of lower temporal
894	   sub-layers.  A lower value of TID indicates a higher importance.
895	   More-important NAL units MAY be better protected against transmission
896	   losses than less-important NAL units.

898	      For Discussion: quite possibly something similar can be said for
899	      the Layer_id in layered coding, but perhaps not in multiview
900	      coding.  (The relevant part of the spec is relatively new,
901	      therefore the soft language).  However, for serious layer pruning,
902	      interpretation of the VPS is required.  We can add language about
903	      the need for stateful interpretation of LayerID vis-a-vis
904	      stateless interpretation of TID later.

906	4.3.  Payload Structures

908	   Three different types of RTP packet payload structures are specified.
909	   A receiver can identify the type of an RTP packet payload through the
910	   Type field in the payload header.

912	   The three different payload structures are as follows:

914	   o  Single NAL unit packet: Contains a single NAL unit in the payload,
915	      and the NAL unit header of the NAL unit also serves as the payload
916	      header.  This payload structure is specified in Section 4.3.1.

918	   o  Aggregation Packet (AP): Contains more than one NAL unit within
919	      one access unit.  This payload structure is specified in
920	      Section 4.3.2.

922	   o  Fragmentation Unit (FU): Contains a subset of a single NAL unit.
923	      This payload structure is specified in Section 4.3.3.
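
A receiver-side sketch of this classification follows. It reads the Type
field from the second byte of the payload header and uses the values 28
(AP) and 29 (FU) given in Sections 4.3.2 and 4.3.3. The function name and
the returned labels are hypothetical.

```python
AP_TYPE = 28  # aggregation packet
FU_TYPE = 29  # fragmentation unit


def payload_structure(payload: bytes) -> str:
    """Classify an RTP payload by the Type field of its payload header."""
    if len(payload) < 2:
        raise ValueError("payload is shorter than the payload header")
    nal_type = payload[1] >> 3  # upper five bits of the second byte
    if nal_type == AP_TYPE:
        return "aggregation packet"
    if nal_type == FU_TYPE:
        return "fragmentation unit"
    return "single NAL unit packet"
```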

925	4.3.1.  Single NAL Unit Packets

927	      Editor notes: it is better to add a section to describe DONL and
928	      sprop-max-don-diff.  sprop-max-don-diff is used but not specified,
929	      as parameters in section 7 are not yet specified.  A value of
930	      sprop-max-don-diff greater than 0 indicates that the transmission
931	      order may not correspond to the decoding order and that the DON is
932	      included in the payload header.

934	   A single NAL unit packet contains exactly one NAL unit, and consists
935	   of a payload header (denoted as PayloadHdr), a conditional 16-bit
936	   DONL field (in network byte order), and the NAL unit payload data
937	   (the NAL unit excluding its NAL unit header) of the contained NAL
938	   unit, as shown in Figure 3.

940	      0                   1                   2                   3
941	      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
942	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
943	     |           PayloadHdr          |      DONL (conditional)       |
944	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
945	     |                                                               |
946	     |                  NAL unit payload data                        |
947	     |                                                               |
948	     |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
949	     |                               :...OPTIONAL RTP padding        |
950	     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

952	                  The Structure of a Single NAL Unit Packet

954	                                 Figure 3

956	   The DONL field, when present, specifies the value of the 16 least
957	   significant bits of the decoding order number of the contained NAL
958	   unit.  If sprop-max-don-diff is greater than 0, the DONL field MUST
959	   be present, and the variable DON for the contained NAL unit is
960	   derived as equal to the value of the DONL field.  Otherwise (sprop-
961	   max-don-diff is equal to 0), the DONL field MUST NOT be present.
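
The following sketch shows how a receiver might recover the NAL unit and
its DON from a single NAL unit packet. The function name is hypothetical,
the value of sprop-max-don-diff is assumed to be known from signaling, and
any RTP padding is assumed to have been removed already.

```python
from typing import Optional, Tuple


def parse_single_nal_packet(payload: bytes,
                            sprop_max_don_diff: int
                            ) -> Tuple[Optional[int], bytes]:
    """Return (DON or None, reconstructed NAL unit)."""
    if len(payload) < 2:
        raise ValueError("payload is shorter than the payload header")
    header = payload[:2]  # PayloadHdr is also the NAL unit header
    offset = 2
    don = None
    if sprop_max_don_diff > 0:
        # DONL is present: 16 bits in network byte order.
        if len(payload) < 4:
            raise ValueError("DONL field is missing")
        don = int.from_bytes(payload[2:4], "big")
        offset = 4
    # The NAL unit is its header followed by the NAL unit payload data.
    return don, header + payload[offset:]
```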

963	4.3.2.  Aggregation Packets (APs)

965	   Aggregation Packets (APs) can reduce packetization overhead for
966	   small NAL units, such as most of the non-VCL NAL units, which are
967	   often only a few octets in size.

969	   An AP aggregates NAL units of one access unit.  Each NAL unit to be
970	   carried in an AP is encapsulated in an aggregation unit.  NAL units
971	   aggregated in one AP are included in NAL unit decoding order.

973	   An AP consists of a payload header (denoted as PayloadHdr) followed
974	   by two or more aggregation units, as shown in Figure 4.

976	     0                   1                   2                   3
977	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
978	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
979	    |    PayloadHdr (Type=28)       |                               |
980	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
981	    |                                                               |
982	    |             two or more aggregation units                     |
983	    |                                                               |
984	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
985	    |                               :...OPTIONAL RTP padding        |
986	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

988	                   The Structure of an Aggregation Packet

990	                                 Figure 4

992	   The fields in the payload header of an AP are set as follows.  The F
993	   bit MUST be equal to 0 if the F bit of each aggregated NAL unit is
994	   equal to zero; otherwise, it MUST be equal to 1.  The Type field MUST
995	   be equal to 28.

997	   The value of LayerId MUST be equal to the lowest value of LayerId of
998	   all the aggregated NAL units.  The value of TID MUST be the lowest
999	   value of TID of all the aggregated NAL units.

1001	      Informative note: All VCL NAL units in an AP have the same TID
1002	      value since they belong to the same access unit.  However, an AP
1003	      may contain non-VCL NAL units for which the TID value in the NAL
1004	      unit header may be different than the TID value of the VCL NAL
1005	      units in the same AP.
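
The rules for the AP payload header can be sketched as follows. The
function name is hypothetical. The text above does not say how the Z bit
of the AP payload header is set, so the sketch leaves it at 0, which is an
assumption.

```python
from typing import Sequence

AP_TYPE = 28


def ap_payload_header(nal_units: Sequence[bytes]) -> bytes:
    """Derive the two-byte PayloadHdr of an AP from its NAL units."""
    if len(nal_units) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    f, layer_id, tid = 0, 0x3F, 0x07
    for nalu in nal_units:
        if len(nalu) < 2:
            raise ValueError("NAL unit is shorter than its header")
        f |= nalu[0] >> 7                        # 1 if any F bit is 1
        layer_id = min(layer_id, nalu[0] & 0x3F)  # lowest LayerId
        tid = min(tid, nalu[1] & 0x07)            # lowest TID
    # Z bit left at 0 (assumption, not stated in the text above).
    return bytes([(f << 7) | layer_id, (AP_TYPE << 3) | tid])
```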

1007	   An AP MUST carry at least two aggregation units and can carry as many
1008	   aggregation units as necessary; however, the total amount of data in
1009	   an AP obviously MUST fit into an IP packet, and the size SHOULD be
1010	   chosen so that the resulting IP packet is smaller than the MTU size
1011	   so as to avoid IP layer fragmentation.  An AP MUST NOT contain FUs
1012	   specified in Section 4.3.3.  APs MUST NOT be nested; i.e., an AP
1013	   cannot contain another AP.

1015	   The first aggregation unit in an AP consists of a conditional 16-bit
1016	   DONL field (in network byte order) followed by a 16-bit unsigned size
1017	   information (in network byte order) that indicates the size of the
1018	   NAL unit in bytes (excluding these two octets, but including the NAL
1019	   unit header), followed by the NAL unit itself, including its NAL unit
1020	   header, as shown in Figure 5.

1022	     0                   1                   2                   3
1023	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1024	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1025	    |               :       DONL (conditional)      |   NALU size   |
1026	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1027	    |   NALU size   |                                               |
1028	    +-+-+-+-+-+-+-+-+         NAL unit                              |
1029	    |                                                               |
1030	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1031	    |                               :
1032	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1034	           The Structure of the First Aggregation Unit in an AP

1036	                                 Figure 5

1038	   The DONL field, when present, specifies the value of the 16 least
1039	   significant bits of the decoding order number of the aggregated NAL
1040	   unit.

1042	   If sprop-max-don-diff is greater than 0, the DONL field MUST be
1043	   present in an aggregation unit that is the first aggregation unit in
1044	   an AP, and the variable DON for the aggregated NAL unit is derived as
1045	   equal to the value of the DONL field.  Otherwise (sprop-max-don-diff
1046	   is equal to 0), the DONL field MUST NOT be present in an aggregation
1047	   unit that is the first aggregation unit in an AP.

1049	   An aggregation unit that is not the first aggregation unit in an AP
1050	   consists of a 16-bit unsigned size information
1051	   (in network byte order) that indicates the size of the NAL unit in
1052	   bytes (excluding these two octets, but including the NAL unit
1053	   header), followed by the NAL unit itself, including its NAL unit
1054	   header, as shown in Figure 6.

1056	     0                   1                   2                   3
1057	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1058	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1059	    |               :       NALU size               |   NAL unit    |
1060	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
1061	    |                                                               |
1062	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1063	    |                               :
1064	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1066	         The Structure of an Aggregation Unit That Is Not the First
1067	                          Aggregation Unit in an AP

1069	                                 Figure 6

1071	   Figure 7 presents an example of an AP that contains two aggregation
1072	   units, labeled as 1 and 2 in the figure, without the DONL field being
1073	   present.

1075	     0                   1                   2                   3
1076	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1077	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1078	    |                          RTP Header                           |
1079	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1080	    |   PayloadHdr (Type=28)        |         NALU 1 Size           |
1081	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1082	    |          NALU 1 HDR           |                               |
1083	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+         NALU 1 Data           |
1084	    |                   . . .                                       |
1085	    |                                                               |
1086	    +               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1087	    |  . . .        | NALU 2 Size                   | NALU 2 HDR    |
1088	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1089	    | NALU 2 HDR    |                                               |
1090	    +-+-+-+-+-+-+-+-+              NALU 2 Data                      |
1091	    |                   . . .                                       |
1092	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1093	    |                               :...OPTIONAL RTP padding        |
1094	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1096	               An Example of an AP Packet Containing
1097	             Two Aggregation Units without the DONL Field

1099	                                 Figure 7

1101	   Figure 8 presents an example of an AP that contains two aggregation
1102	   units, labeled as 1 and 2 in the figure, with the DONL field being
1103	   present.

1105	     0                   1                   2                   3
1106	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1107	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1108	    |                          RTP Header                           |
1109	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1110	    |   PayloadHdr (Type=28)        |        NALU 1 DONL            |
1111	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1112	    |          NALU 1 Size          |            NALU 1 HDR         |
1113	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1114	    |                                                               |
1115	    |                 NALU 1 Data   . . .                           |
1116	    |                                                               |
1117	    +        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1118	    |                               :          NALU 2 Size          |
1119	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1120	    |          NALU 2 HDR           |                               |
1121	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+          NALU 2 Data          |
1122	    |                                                               |
1123	    |        . . .                  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1124	    |                               :...OPTIONAL RTP padding        |
1125	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1127	                   An Example of an AP Containing
1128	                 Two Aggregation Units with the DONL Field

1130	                                 Figure 8
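
The layouts in Figures 4 through 8 can be walked with a short parser. This
is a sketch under stated assumptions: the function name is hypothetical,
RTP padding is assumed to have been removed, and, following the text
above, only the first aggregation unit carries a DONL field. How the DON
of later aggregation units is derived is not covered by this sketch.

```python
from typing import List, Optional, Tuple

AP_TYPE = 28


def parse_ap(payload: bytes,
             sprop_max_don_diff: int) -> Tuple[Optional[int], List[bytes]]:
    """Return (DONL of the first aggregation unit or None, NAL units)."""
    if len(payload) < 2 or payload[1] >> 3 != AP_TYPE:
        raise ValueError("not an aggregation packet")
    pos = 2
    first_don = None
    if sprop_max_don_diff > 0:
        if pos + 2 > len(payload):
            raise ValueError("truncated DONL field")
        first_don = int.from_bytes(payload[pos:pos + 2], "big")
        pos += 2
    nal_units = []
    while pos < len(payload):
        if pos + 2 > len(payload):
            raise ValueError("truncated NALU size field")
        size = int.from_bytes(payload[pos:pos + 2], "big")
        pos += 2
        # The size counts the NAL unit including its two-byte header.
        if size < 2 or pos + size > len(payload):
            raise ValueError("invalid NALU size")
        nal_units.append(payload[pos:pos + size])
        pos += size
    if len(nal_units) < 2:
        raise ValueError("an AP carries at least two aggregation units")
    return first_don, nal_units
```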

1132	4.3.3.  Fragmentation Units

1134	   Fragmentation Units (FUs) are introduced to enable fragmenting a
1135	   single NAL unit into multiple RTP packets, possibly without
1136	   cooperation or knowledge of the [VVC] encoder.  A fragment of a NAL
1137	   unit consists of an integer number of consecutive octets of that NAL
1138	   unit.  Fragments of the same NAL unit MUST be sent in consecutive
1139	   order with ascending RTP sequence numbers (with no other RTP packets
1140	   within the same RTP stream being sent between the first and last
1141	   fragment).

1143	   When a NAL unit is fragmented and conveyed within FUs, it is referred
1144	   to as a fragmented NAL unit.  APs MUST NOT be fragmented.  FUs MUST
1145	   NOT be nested; i.e., an FU cannot contain a subset of another FU.

1147	   The RTP timestamp of an RTP packet carrying an FU is set to the NALU-
1148	   time of the fragmented NAL unit.

1150	   An FU consists of a payload header (denoted as PayloadHdr), an FU
1151	   header of one octet, a conditional 16-bit DONL field (in network byte
1152	   order), and an FU payload, as shown in Figure 9.

1154	     0                   1                   2                   3
1155	     0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1156	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1157	    |    PayloadHdr (Type=29)       |   FU header   | DONL (cond)   |
1158	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1159	    | DONL (cond)   |                                               |
1160	    +-+-+-+-+-+-+-+-+                                               |
1161	    |                         FU payload                            |
1162	    |                                                               |
1163	    |                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1164	    |                               :...OPTIONAL RTP padding        |
1165	    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1167	                          The Structure of an FU

1169	                                 Figure 9

1171	   The fields in the payload header are set as follows.  The Type field
1172	   MUST be equal to 29.  The fields F, LayerId, and TID MUST be equal to
1173	   the fields F, LayerId, and TID, respectively, of the fragmented NAL
1174	   unit.

1176	   The FU header consists of an S bit, an E bit, an R bit and a 5-bit
1177	   FuType field, as shown in Figure 10.

1179	                             +---------------+
1180	                             |0|1|2|3|4|5|6|7|
1181	                             +-+-+-+-+-+-+-+-+
1182	                             |S|E|R|  FuType |
1183	                             +---------------+

1185	                         The Structure of FU Header

1187	                                 Figure 10

1189	   The semantics of the FU header fields are as follows:

1191	   S: 1 bit

1193	      When set to 1, the S bit indicates the start of a fragmented NAL
1194	      unit, i.e., the first byte of the FU payload is also the first
1195	      byte of the payload of the fragmented NAL unit.  When the FU
1196	      payload is not the start of the fragmented NAL unit payload, the S
1197	      bit MUST be set to 0.

1199	   E: 1 bit
1200	      When set to 1, the E bit indicates the end of a fragmented NAL
1201	      unit, i.e., the last byte of the payload is also the last byte of
1202	      the fragmented NAL unit.  When the FU payload is not the last
1203	      fragment of a fragmented NAL unit, the E bit MUST be set to 0.

1205	   Reserved: 1 bit

1207	      Placeholder

1209	   FuType: 5 bits

1211	      The field FuType MUST be equal to the field Type of the fragmented
1212	      NAL unit.

1214	   The DONL field, when present, specifies the value of the 16 least
1215	   significant bits of the decoding order number of the fragmented NAL
1216	   unit.

1218	   If sprop-max-don-diff is greater than 0, and the S bit is equal to 1,
1219	   the DONL field MUST be present in the FU, and the variable DON for
1220	   the fragmented NAL unit is derived as equal to the value of the DONL
1221	   field.  Otherwise (sprop-max-don-diff is equal to 0, or the S bit is
1222	   equal to 0), the DONL field MUST NOT be present in the FU.

1224	   A non-fragmented NAL unit MUST NOT be transmitted in one FU; i.e.,
1225	   the Start bit and End bit must not both be set to 1 in the same FU
1226	   header.
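
	      Informative note: The FU header layout and the DONL presence
	      rule above can be sketched in Python as follows.  This is an
	      illustration only, not part of the payload format.  The names
	      FuHeader, parse_fu_header, and donl_present are hypothetical;
	      bit 0 in Figure 10 is taken to be the most significant bit of
	      the octet.

```python
from typing import NamedTuple


class FuHeader(NamedTuple):
    s: bool       # S bit: start of a fragmented NAL unit
    e: bool       # E bit: end of a fragmented NAL unit
    r: int        # R bit: carried through unchanged by this sketch
    fu_type: int  # 5-bit FuType: Type of the fragmented NAL unit


def parse_fu_header(octet: int) -> FuHeader:
    """Split the one-octet FU header into S, E, R, and FuType."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("FU header must be a single octet")
    hdr = FuHeader(
        s=bool(octet & 0x80),
        e=bool(octet & 0x40),
        r=(octet >> 5) & 0x01,
        fu_type=octet & 0x1F,
    )
    # A non-fragmented NAL unit must not be carried in one FU.
    if hdr.s and hdr.e:
        raise ValueError("S and E must not both be 1 in one FU header")
    return hdr


def donl_present(hdr: FuHeader, sprop_max_don_diff: int) -> bool:
    """DONL is present only when sprop-max-don-diff > 0 and S is 1."""
    return sprop_max_don_diff > 0 and hdr.s
```

	      A receiver would call donl_present() before locating the start
	      of the FU payload, since the payload offset depends on whether
	      the two DONL octets are there.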

1228	   The FU payload consists of fragments of the payload of the fragmented
1229	   NAL unit so that if the FU payloads of consecutive FUs, starting with
1230	   an FU with the S bit equal to 1 and ending with an FU with the E bit
1231	   equal to 1, are sequentially concatenated, the payload of the
1232	   fragmented NAL unit can be reconstructed.  The NAL unit header of the
1233	   fragmented NAL unit is not included as such in the FU payload, but
1234	   rather the information of the NAL unit header of the fragmented NAL
1235	   unit is conveyed in the F, LayerId, and TID fields of the FU payload
1236	   headers of the FUs and the FuType field of the FU header of the FUs.
1237	   An FU payload MUST NOT be empty.
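
	      Informative note: The reconstruction described above can be
	      sketched in Python as follows.  This is an illustration only.
	      The function name reassemble_nal_unit is hypothetical, and the
	      sketch assumes a two-octet header in which the 5-bit Type field
	      occupies the upper five bits of the second octet and TID the
	      lower three; that layout is defined elsewhere in this memo and
	      should be checked against it.

```python
from typing import Sequence


def reassemble_nal_unit(payload_hdr: bytes, fu_type: int,
                        fu_payloads: Sequence[bytes]) -> bytes:
    """Rebuild a fragmented NAL unit from its FUs.

    payload_hdr: the two-octet PayloadHdr taken from any FU of the unit.
    fu_type:     the FuType value from the FU header.
    fu_payloads: FU payloads in ascending RTP sequence number order,
                 from the FU with S=1 through the FU with E=1.
    """
    if len(payload_hdr) != 2:
        raise ValueError("PayloadHdr is assumed to be two octets")
    if not 0 <= fu_type <= 0x1F:
        raise ValueError("FuType is a 5-bit value")
    # S and E cannot both be set in one FU, so at least two FUs exist.
    if len(fu_payloads) < 2:
        raise ValueError("a fragmented NAL unit spans at least two FUs")
    if any(len(p) == 0 for p in fu_payloads):
        raise ValueError("an FU payload must not be empty")
    # F, LayerId, and TID come from PayloadHdr; Type comes from FuType.
    # Assumed layout: second octet = Type (5 bits) | TID (3 bits).
    nal_hdr = bytes([payload_hdr[0],
                     (fu_type << 3) | (payload_hdr[1] & 0x07)])
    return nal_hdr + b"".join(fu_payloads)
```

	      The caller is expected to have verified that no FU between the
	      start and end fragments was lost before concatenating.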

1239	   If an FU is lost, the receiver SHOULD discard all following
1240	   fragmentation units in transmission order corresponding to the same
1241	   fragmented NAL unit, unless the decoder in the receiver is known to
1242	   be prepared to gracefully handle incomplete NAL units.

1244	   A receiver in an endpoint or in a MANE MAY aggregate the first n-1
1245	   fragments of a NAL unit to an (incomplete) NAL unit, even if fragment
1246	   n of that NAL unit is not received.  In this case, the
1247	   forbidden_zero_bit of the NAL unit MUST be set to 1 to indicate a
1248	   syntax violation.

1250	4.4.  Decoding Order Number

1252	   For each NAL unit, the variable AbsDon is derived, representing the
1253	   decoding order number that is indicative of the NAL unit decoding
1254	   order.

1256	   Let NAL unit n be the n-th NAL unit in transmission order within an
1257	   RTP stream.

1259	   If sprop-max-don-diff is equal to 0, AbsDon[n], the value of AbsDon
1260	   for NAL unit n, is derived as equal to n.

1262	   Otherwise (sprop-max-don-diff is greater than 0), AbsDon[n] is
1263	   derived as follows, where DON[n] is the value of the variable DON for
1264	   NAL unit n:

1266	   o  If n is equal to 0 (i.e., NAL unit n is the very first NAL unit in
1267	      transmission order), AbsDon[0] is set equal to DON[0].

1269	   o  Otherwise (n is greater than 0), the following applies for
1270	      derivation of AbsDon[n]:

1272	         If DON[n] == DON[n-1],
1273	            AbsDon[n] = AbsDon[n-1]

1275	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] < 32768),
1276	            AbsDon[n] = AbsDon[n-1] + DON[n] - DON[n-1]

1278	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] >= 32768),
1279	            AbsDon[n] = AbsDon[n-1] + 65536 - DON[n-1] + DON[n]

1281	         If (DON[n] > DON[n-1] and DON[n] - DON[n-1] >= 32768),
1282	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] + 65536 -
1283	            DON[n])

1285	         If (DON[n] < DON[n-1] and DON[n-1] - DON[n] < 32768),
1286	            AbsDon[n] = AbsDon[n-1] - (DON[n-1] - DON[n])
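
	      Informative note: The derivation above for the case where
	      sprop-max-don-diff is greater than 0 can be transcribed into
	      Python as follows.  This is an illustration only; the function
	      names next_abs_don and abs_don_sequence are hypothetical.

```python
from typing import List, Sequence


def next_abs_don(prev_abs_don: int, prev_don: int, don: int) -> int:
    """Derive AbsDon[n] from AbsDon[n-1], DON[n-1], and DON[n]."""
    if don == prev_don:
        return prev_abs_don
    if don > prev_don and don - prev_don < 32768:
        return prev_abs_don + don - prev_don
    if don < prev_don and prev_don - don >= 32768:
        # DON wrapped around forwards past 65535.
        return prev_abs_don + 65536 - prev_don + don
    if don > prev_don and don - prev_don >= 32768:
        # DON wrapped around backwards past 0.
        return prev_abs_don - (prev_don + 65536 - don)
    # Remaining case: don < prev_don and prev_don - don < 32768.
    return prev_abs_don - (prev_don - don)


def abs_don_sequence(dons: Sequence[int]) -> List[int]:
    """AbsDon for NAL units listed in transmission order."""
    result: List[int] = []
    for n, don in enumerate(dons):
        if n == 0:
            result.append(don)  # AbsDon[0] = DON[0]
        else:
            result.append(next_abs_don(result[-1], dons[n - 1], don))
    return result
```

	      For example, the DON values 65534, 65535, 0, 1 yield the
	      AbsDon values 65534, 65535, 65536, 65537, so the 16-bit
	      wraparound does not disturb the ordering.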

1288	   For any two NAL units m and n, the following applies:

1290	   o  AbsDon[n] greater than AbsDon[m] indicates that NAL unit n follows
1291	      NAL unit m in NAL unit decoding order.

1293	   o  When AbsDon[n] is equal to AbsDon[m], the NAL unit decoding order
1294	      of the two NAL units can be in either order.

1296	   o  AbsDon[n] less than AbsDon[m] indicates that NAL unit n precedes
1297	      NAL unit m in decoding order.

1299	      Informative note: When two consecutive NAL units in the NAL unit
1300	      decoding order have different values of AbsDon, the absolute
1301	      difference between the two AbsDon values may be greater than or
1302	      equal to 1.

1304	      Informative note: There are multiple reasons to allow for the
1305	      absolute difference of the values of AbsDon for two consecutive
1306	      NAL units in the NAL unit decoding order to be greater than one.
1307	      An increment by one is not required, as at the time of associating
1308	      values of AbsDon to NAL units, it may not be known whether all NAL
1309	      units are to be delivered to the receiver.  For example, a gateway
1310	      might not forward VCL NAL units of higher sub-layers or some SEI
1311	      NAL units when there is congestion in the network.
1312	      In another example, the first intra-coded picture of a pre-encoded
1313	      clip is transmitted in advance to ensure that it is readily
1314	      available in the receiver, and when transmitting the first intra-
1315	      coded picture, the originator does not exactly know how many NAL
1316	      units will be encoded before the first intra-coded picture of the
1317	      pre-encoded clip follows in decoding order.  Thus, the values of
1318	      AbsDon for the NAL units of the first intra-coded picture of the
1319	      pre-encoded clip have to be estimated when they are transmitted,
1320	      and gaps in values of AbsDon may occur.

1322	5.  Packetization Rules

1324	   The following packetization rules apply:

1326	   o  If sprop-max-don-diff is greater than 0, the transmission order of
1327	      NAL units carried in the RTP stream MAY be different from the NAL
1328	      unit decoding order and the NAL unit output order.

1330	   o  A NAL unit of a small size SHOULD be encapsulated in an
1331	      aggregation packet together with one or more other NAL units in
1332	      order to avoid the unnecessary packetization overhead for small
1333	      NAL units.  For example, non-VCL NAL units such as access unit
1334	      delimiters, parameter sets, or SEI NAL units are typically small
1335	      and can often be aggregated with VCL NAL units without violating
1336	      MTU size constraints.

1338	   o  Each non-VCL NAL unit SHOULD, when possible from an MTU size match
1339	      viewpoint, be encapsulated in an aggregation packet together with
1340	      its associated VCL NAL unit, as typically a non-VCL NAL unit would
1341	      be meaningless without the associated VCL NAL unit being
1342	      available.

1344	   o  For carrying exactly one NAL unit in an RTP packet, a single NAL
1345	      unit packet MUST be used.

1347	6.  De-packetization Process

1349	   The general concept behind de-packetization is to get the NAL units
1350	   out of the RTP packets in an RTP stream and pass them to the decoder
1351	   in the NAL unit decoding order.

1353	   The de-packetization process is implementation dependent.  Therefore,
1354	   the following description should be seen as an example of a suitable
1355	   implementation.  Other schemes may be used as well, as long as the
1356	   output for the same input is the same as the process described below.
1357	   The output is the same when the set of output NAL units and their
1358	   order are both identical.  Optimizations relative to the described
1359	   algorithms are possible.

1361	   All normal RTP mechanisms related to buffer management apply.  In
1362	   particular, duplicated or outdated RTP packets (as indicated by the
1363	   RTP sequence number and the RTP timestamp) are removed.  To
1364	   determine the exact time for decoding, factors such as a possible
1365	   intentional delay to allow for proper inter-stream synchronization
1366	   MUST be factored in.

1368	   NAL units with NAL unit type values in the range of 0 to 27,
1369	   inclusive, may be passed to the decoder.  NAL-unit-like structures
1370	   with NAL unit type values in the range of 28 to 31, inclusive, MUST
1371	   NOT be passed to the decoder.

1373	   The receiver includes a receiver buffer, which is used to compensate
1374	   for transmission delay jitter within individual RTP streams and
1375	   across RTP streams, to reorder NAL units from transmission order to
1376	   the NAL unit decoding order.  In this section, the receiver operation
1377	   is described under the assumption that there is no transmission delay
1378	   jitter within an RTP stream and across RTP streams.  To distinguish
1379	   it from a practical receiver buffer that is also used for
1380	   compensation of transmission delay jitter, the receiver buffer is
1381	   hereafter called the de-packetization buffer in this section.
1382	   Receivers should also prepare for transmission delay jitter; that is,
1383	   either reserve separate buffers for transmission delay jitter
1384	   buffering and de-packetization buffering or use a receiver buffer for
1385	   both transmission delay jitter and de-packetization.  Moreover,
1386	   receivers should take transmission delay jitter into account in the
1387	   buffering operation, e.g., by additional initial buffering before
1388	   starting of decoding and playback.

1390	   When sprop-max-don-diff is equal to 0, the de-packetization buffer
1391	   size is zero bytes, and the process described in the remainder of
1392	   this paragraph applies.
1393	   The NAL units carried in the single RTP stream are directly passed to
1394	   the decoder in their transmission order, which is identical to their
1395	   decoding order.  When there are several NAL units of the same RTP
1396	   stream with the same NTP timestamp, the order to pass them to the
1397	   decoder is their transmission order.

1399	      Informative note: The mapping between RTP and NTP timestamps is
1400	      conveyed in RTCP SR packets.  In addition, the mechanisms for
1401	      faster media timestamp synchronization discussed in [RFC6051] may
1402	      be used to speed up the acquisition of the RTP-to-wall-clock
1403	      mapping.

1405	   When sprop-max-don-diff is greater than 0, the process described in
1406	   the remainder of this section applies.

1408	   There are two buffering states in the receiver: initial buffering and
1409	   buffering while playing.  Initial buffering starts when the reception
1410	   is initialized.  After initial buffering, decoding and playback are
1411	   started, and the buffering-while-playing mode is used.

1413	   Regardless of the buffering state, the receiver stores incoming NAL
1414	   units, in reception order, into the de-packetization buffer.  NAL
1415	   units carried in RTP packets are stored in the de-packetization
1416	   buffer individually, and the value of AbsDon is calculated and stored
1417	   for each NAL unit.

1419	   Initial buffering lasts until condition A (the difference between the
1420	   greatest and smallest AbsDon values of the NAL units in the de-
1421	   packetization buffer is greater than or equal to the value of sprop-
1422	   max-don-diff) or condition B (the number of NAL units in the de-
1423	   packetization buffer is greater than the value of sprop-depack-buf-
1424	   nalus) is true.

1426	   After initial buffering, whenever condition A or condition B is true,
1427	   the following operation is repeatedly applied until both condition A
1428	   and condition B become false:

1430	   o  The NAL unit in the de-packetization buffer with the smallest
1431	      value of AbsDon is removed from the de-packetization buffer and
1432	      passed to the decoder.

1434	   When no more NAL units are flowing into the de-packetization buffer,
1435	   all NAL units remaining in the de-packetization buffer are removed
1436	   from the buffer and passed to the decoder in the order of increasing
1437	   AbsDon values.
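
	      Informative note: The buffering process above can be sketched
	      in Python as follows.  This is an illustration under stated
	      assumptions, not a normative algorithm.  The class name
	      DepacketizationBuffer and its methods are hypothetical.  The
	      sketch treats the end of initial buffering and the
	      buffering-while-playing state identically, because in both
	      cases NAL units are released while condition A or condition B
	      holds.  NAL units with equal AbsDon are released in reception
	      order, which is one of the orders the text permits.

```python
import heapq
from itertools import count
from typing import Any, List


class DepacketizationBuffer:
    def __init__(self, sprop_max_don_diff: int,
                 sprop_depack_buf_nalus: int) -> None:
        if sprop_max_don_diff <= 0:
            raise ValueError("process applies only if sprop-max-don-diff > 0")
        self.max_don_diff = sprop_max_don_diff
        self.max_nalus = sprop_depack_buf_nalus
        self._heap: list = []     # entries: (AbsDon, arrival index, NAL unit)
        self._arrival = count()   # tie-break: reception order

    def _condition_a(self) -> bool:
        # Greatest minus smallest AbsDon >= sprop-max-don-diff.
        # max() is linear in the buffer size; acceptable for a sketch.
        if not self._heap:
            return False
        smallest = self._heap[0][0]
        greatest = max(entry[0] for entry in self._heap)
        return greatest - smallest >= self.max_don_diff

    def _condition_b(self) -> bool:
        # Number of buffered NAL units > sprop-depack-buf-nalus.
        return len(self._heap) > self.max_nalus

    def push(self, abs_don: int, nal_unit: Any) -> List[Any]:
        """Store one NAL unit; return the NAL units released to the decoder."""
        heapq.heappush(self._heap, (abs_don, next(self._arrival), nal_unit))
        released = []
        while self._heap and (self._condition_a() or self._condition_b()):
            released.append(heapq.heappop(self._heap)[2])
        return released

    def flush(self) -> List[Any]:
        """No more input: drain in order of increasing AbsDon."""
        released = []
        while self._heap:
            released.append(heapq.heappop(self._heap)[2])
        return released
```

	      The AbsDon value passed to push() is the one derived for each
	      NAL unit as described in Section 4.4.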

1439	7.  Payload Format Parameters

1441	   This section specifies the optional parameters.  A mapping of the
1442	   parameters with Session Description Protocol (SDP) [RFC4566] is also
1443	   provided for applications that use SDP.

1445	7.1.  Media Type Registration

1447	   The receiver MUST ignore any parameter unspecified in this memo.

1449	   Type name:            Video

1451	   Subtype name:         H266

1453	   Required parameters:  none

1455	   Optional parameters:

1457	      Editor's notes: To be added

1459	7.2.  SDP Parameters

1461	   The receiver MUST ignore any parameter unspecified in this memo.

1463	7.2.1.  Mapping of Payload Type Parameters to SDP

1465	   The media type video/H266 string is mapped to fields in the Session
1466	   Description Protocol (SDP) [RFC4566] as follows:

1468	   o  The media name in the "m=" line of SDP MUST be video.

1470	   o  The encoding name in the "a=rtpmap" line of SDP MUST be H266 (the
1471	      media subtype).

1473	   o  The clock rate in the "a=rtpmap" line MUST be 90000.

1475	   o  OPTIONAL PARAMETERS:

1477	      Editor's notes: To be discussed here

1479	7.2.1.1.  SDP Example

1481	   An example of media representation in SDP is as follows:

1483	       m=video 49170 RTP/AVP 98
1484	       a=rtpmap:98 H266/90000
1485	       a=fmtp:98 profile-id=1; sprop-vps=<video parameter sets data>

1487	7.2.2.  Usage with SDP Offer/Answer Model

1489	   When [VVC] is offered over RTP using SDP in an offer/answer model
1490	   [RFC3264] for negotiation for unicast usage, the following
1491	   limitations and rules apply:

1493	   Placeholder: To add limitations and considerations.

1495	8.  Use with Feedback Messages

1497	   The following subsections define the use of the Picture Loss
1498	   Indication (PLI), Slice Loss Indication (SLI), Reference Picture
1499	   Selection Indication (RPSI), and Full Intra Request (FIR) feedback
1500	   messages with [VVC].  The PLI, SLI, and RPSI messages are defined
1501	   in [RFC4585], and the FIR message is defined in [RFC5104].

1503	8.1.  Picture Loss Indication (PLI)

1505	   As specified in RFC 4585, Section 6.3.1, the reception of a PLI by a
1506	   media sender indicates "the loss of an undefined amount of coded
1507	   video data belonging to one or more pictures".  Without having any
1508	   specific knowledge of the setup of the bitstream (such as use and
1509	   location of in-band parameter sets, non-IRAP decoder refresh points,
1510	   picture structures, and so forth), a reaction to the reception of a
1511	   PLI by a [VVC] sender SHOULD be to send an IRAP picture and relevant
1512	   parameter sets, potentially with sufficient redundancy so as to
1513	   ensure correct reception.  However, sometimes information about the
1514	   bitstream structure is known.  For example, state could have been
1515	   established outside of the mechanisms defined in this document that
1516	   parameter sets are conveyed out of band only, and stay static for the
1517	   duration of the session.  In that case, it is obviously unnecessary
1518	   to send them in-band as a result of the reception of a PLI.  Other
1519	   examples could be devised based on a priori knowledge of different
1520	   aspects of the bitstream structure.  In all cases, the timing and
1521	   congestion control mechanisms of RFC 4585 MUST be observed.

1523	8.2.  Slice Loss Indication (SLI)

1525	   For further study.  Maybe remove as there are no known
1526	   implementations of SLI in [HEVC]-based systems.

1528	8.3.  Reference Picture Selection Indication (RPSI)

1530	   Feedback-based reference picture selection has been shown as a
1531	   powerful tool to stop temporal error propagation for improved error
1532	   resilience [Girod99] [Wang05].  In one approach, the decoder side
1533	   tracks errors in the decoded pictures and informs the encoder side
1534	   that a particular picture that has been decoded relatively earlier is
1535	   correct and still present in the decoded picture buffer; it requests
1536	   the encoder to use that correct picture-availability information when
1537	   encoding the next picture, so as to stop further temporal error
1538	   propagation.  For this approach, the decoder side should use the RPSI
1539	   feedback message.

1541	   Encoders can encode some long-term reference pictures as specified in
1542	   [VVC] for purposes described in the previous paragraph without the
1543	   need of a huge decoded picture buffer.  As shown in [Wang05], with a
1544	   flexible reference picture management scheme, as in VVC, even a
1545	   decoded picture buffer size of two picture storage buffers would work
1546	   for the approach described in the previous paragraph.

1548	   The text above is copy-paste from RFC 7798.  If we keep the RPSI
1549	   message, it needs adaptation to the [VVC] syntax.  Doing so shouldn't
1550	   be too hard as the [VVC] reference picture mechanism is not too
1551	   different from the [HEVC] one.

1553	8.4.  Full Intra Request (FIR)

1555	   The purpose of the FIR message is to force an encoder to send an
1556	   independent decoder refresh point as soon as possible, while
1557	   observing applicable congestion-control-related constraints, such as
1558	   those set out in [RFC8082].

1560	   Upon reception of a FIR, a sender MUST send an IDR picture.
1561	   Parameter sets MUST also be sent, except when there is a priori
1562	   knowledge that the parameter sets have been correctly established.  A
1563	   typical example for that is an understanding between sender and
1564	   receiver, established by means outside this document, that parameter
1565	   sets are exclusively sent out-of-band.

1567	9.  Frame Marking

1569	   [FrameMarking] provides an extension mechanism for RTP.  The codec-
1570	   agnostic meta-data in the [FrameMarking] header provides valuable
1571	   video frame information.  Its usage with [VVC] is defined in this
1572	   section.  Refer to [FrameMarking] for any unspecified fields.  Two
1573	   header extensions are RECOMMENDED:

1575	   o  The short extension for non-scalable streams.

1577	   o  The long extension for scalable streams.

1579	9.1.  Frame Marking Short Extension

1581	   The fields for the short extension, as shown in Figure 11, are used
1582	   as described in the following.

1584	                          0                   1
1585	                          0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
1586	                         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1587	                         |  ID   |  L=0  |S|E|I|D|0 0 0 0|
1588	                         +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1590	                    Short Frame Marking RTP Extension for [VVC]

1592	                                 Figure 11

1594	   The I bit MUST be 1 when the NAL unit type is 7-9 (inclusive),
1595	   otherwise it MUST be 0.

1597	   The D bit MUST be 1 when the syntax element ph_non_ref_pic_flag for a
1598	   picture is equal to 1, otherwise it MUST be 0.

1600	   The S bit MUST be set to 1 if any of the following conditions is true
1601	   and MUST be set to 0 otherwise:

1603	   o  The RTP packet is a single NAL unit packet and it is the first VCL
1604	      NAL unit, in decoding order, of a picture.

1606	   o  The RTP packet is an AP, and the NAL unit in the first contained
1607	      aggregation unit is the first VCL NAL unit, in decoding order, of
1608	      a picture.

1610	   o  The RTP packet is a FU with its S bit equal to 1 and the FU
1611	      payload contains a fragment of the first VCL NAL unit, in decoding
1612	      order, of a picture.

1614	   The E bit MUST be set to 1 if any of the following conditions is true
1615	   and MUST be set to 0 otherwise:

1617	   o  The RTP packet is a single NAL unit packet and it is the last VCL
1618	      NAL unit, in decoding order, of a picture.

1620	   o  The RTP packet is an AP and the NAL unit in the last contained
1621	      aggregation unit is the last VCL NAL unit, in decoding order, of a
1622	      picture.

1624	   o  The RTP packet is a FU with its E bit equal to 1 and the FU
1625	      payload contains a fragment of the last VCL NAL unit, in decoding
1626	      order, of a picture.
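
	      Informative note: Packing the short extension of Figure 11 can
	      be sketched in Python as follows.  This is an illustration
	      only.  The function name short_frame_marking is hypothetical.
	      The caller is assumed to have already evaluated the S and E
	      conditions listed above, and the ID value is assumed to be a
	      negotiated header extension identifier in the range 1 to 14.

```python
def short_frame_marking(ext_id: int, start: bool, end: bool,
                        nal_unit_type: int, non_ref_pic: bool) -> bytes:
    """Build the two octets of Figure 11: ID | L=0 | S E I D 0 0 0 0."""
    if not 1 <= ext_id <= 14:
        raise ValueError("extension ID assumed to be in the range 1..14")
    # I bit: 1 when the NAL unit type is 7-9 (inclusive), otherwise 0.
    i_bit = 1 if 7 <= nal_unit_type <= 9 else 0
    # D bit: 1 when ph_non_ref_pic_flag of the picture is equal to 1.
    d_bit = 1 if non_ref_pic else 0
    flags = (int(bool(start)) << 7) | (int(bool(end)) << 6) \
        | (i_bit << 5) | (d_bit << 4)
    # Low nibble of the first octet is the L field, which is 0 here.
    return bytes([(ext_id << 4) | 0, flags])
```

	      For instance, the first packet of a picture whose NAL unit
	      type is 8, sent with extension ID 3, yields the octets 0x30
	      0xA0.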

1628	9.2.  Frame Marking Long Extension

1630	       0                   1                   2                   3
1631	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1632	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1633	      |  ID   |  L=2  |S|E|I|D|B| TID |0|0|   LayerID |    TL0PICIDX  |
1634	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

1636	                     Long Frame Marking RTP Extension for [VVC]

1638	                                 Figure 12

1640	   The fields for the long extension for scalable streams, as shown in
1641	   Figure 12, are used as described in the following.

1643	   The LayerID (6 bits) and TID (3 bits) from the NAL unit header
1644	   Section 1.1.4 are mapped to the generic LID and TID fields in
1645	   [FrameMarking] as shown in Figure 12.

1647	   The I bit MUST be 1 when the NAL unit type is 7-9 (inclusive),
1648	   otherwise it MUST be 0.

1650	   The D bit MUST be 1 when the syntax element ph_non_ref_pic_flag for a
1651	   picture is equal to 1, otherwise it MUST be 0.

1653	   The S bit MUST be set to 1 if any of the following conditions is true
1654	   and MUST be set to 0 otherwise:

1656	   o  The RTP packet is a single NAL unit packet and it is the first VCL
1657	      NAL unit, in decoding order, of a picture.

1659	   o  The RTP packet is an AP, and the NAL unit in the first contained
1660	      aggregation unit is the first VCL NAL unit, in decoding order, of
1661	      a picture.

1663	   o  The RTP packet is a FU with its S bit equal to 1 and the FU
1664	      payload contains a fragment of the first VCL NAL unit, in decoding
1665	      order, of a picture.

1667	   The E bit MUST be set to 1 if any of the following conditions is true
1668	   and MUST be set to 0 otherwise:

1670	   o  The RTP packet is a single NAL unit packet and it is the last VCL
1671	      NAL unit, in decoding order, of a picture.

1673	   o  The RTP packet is an AP and the NAL unit in the last contained
1674	      aggregation unit is the last VCL NAL unit, in decoding order, of a
1675	      picture.

1677	   o  The RTP packet is a FU with its E bit equal to 1 and the FU
1678	      payload contains a fragment of the last VCL NAL unit, in decoding
1679	      order, of a picture.
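
	      Informative note: Packing the long extension of Figure 12 can
	      be sketched in Python as follows.  This is an illustration
	      only.  The function name long_frame_marking is hypothetical.
	      The B bit and TL0PICIDX are not described in this section, so
	      the sketch passes them through as supplied by the caller; the
	      S and E conditions are assumed to have been evaluated already,
	      and the ID value is assumed to be in the range 1 to 14.

```python
def long_frame_marking(ext_id: int, start: bool, end: bool,
                       nal_unit_type: int, non_ref_pic: bool,
                       b_bit: bool, tid: int, layer_id: int,
                       tl0picidx: int) -> bytes:
    """Build the four octets of Figure 12."""
    if not 1 <= ext_id <= 14:
        raise ValueError("extension ID assumed to be in the range 1..14")
    if not 0 <= tid <= 7:
        raise ValueError("TID is a 3-bit value")
    if not 0 <= layer_id <= 63:
        raise ValueError("LayerID is a 6-bit value")
    if not 0 <= tl0picidx <= 255:
        raise ValueError("TL0PICIDX is an 8-bit value")
    # I bit: 1 when the NAL unit type is 7-9 (inclusive), otherwise 0.
    i_bit = 1 if 7 <= nal_unit_type <= 9 else 0
    # D bit: 1 when ph_non_ref_pic_flag of the picture is equal to 1.
    d_bit = 1 if non_ref_pic else 0
    octet0 = (ext_id << 4) | 2                      # ID | L=2
    octet1 = (int(bool(start)) << 7) | (int(bool(end)) << 6) \
        | (i_bit << 5) | (d_bit << 4) | (int(bool(b_bit)) << 3) | tid
    octet2 = layer_id                               # 0 0 | LayerID
    return bytes([octet0, octet1, octet2, tl0picidx])
```

	      For instance, extension ID 5 with S=1, E=1, NAL unit type 9,
	      TID 2, LayerID 1, and TL0PICIDX 200 yields the octets 0x52
	      0xE2 0x01 0xC8.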

1681	10.  Security Considerations

1683	   The scope of this Security Considerations section is limited to the
1684	   payload format itself and to one feature of [VVC] that may pose a
1685	   particularly serious security risk if implemented naively.  The
1686	   payload format, in isolation, does not form a complete system.
1687	   Implementers are advised to read and understand relevant security-
1688	   related documents, especially those pertaining to RTP (see the
1689	   Security Considerations section in [RFC3550]), and the security of
1690	   the call-control stack chosen (that may make use of the media type
1691	   registration of this memo).  Implementers should also consider known
1692	   security vulnerabilities of video coding and decoding implementations
1693	   in general and avoid those.

1695	   Within this RTP payload format, and with the exception of the user
1696	   data SEI message as described below, no security threats other than
1697	   those common to RTP payload formats are known.  In other words,
1698	   neither the various media-plane-based mechanisms, nor the signaling
1699	   part of this memo, seems to pose a security risk beyond those common
1700	   to all RTP-based systems.

1702	   RTP packets using the payload format defined in this specification
1703	   are subject to the security considerations discussed in the RTP
1704	   specification [RFC3550], and in any applicable RTP profile such as
1705	   RTP/AVP [RFC3551], RTP/AVPF [RFC4585], RTP/SAVP [RFC3711], or RTP/
1706	   SAVPF [RFC5124].  However, as "Securing the RTP Framework: Why RTP
1707	   Does Not Mandate a Single Media Security Solution" [RFC7202]
1708	   discusses, it is not an RTP payload format's responsibility to
1709	   discuss or mandate what solutions are used to meet the basic security
1710	   goals like confidentiality, integrity and source authenticity for RTP
1711	   in general.  This responsibility lies with anyone using RTP in an
1712	   application.  They can find guidance on available security mechanisms
1713	   and important considerations in "Options for Securing RTP Sessions"
1714	   [RFC7201].  The rest of this section discusses the security-impacting
1715	   properties of the payload format itself.

1717	   Because the data compression used with this payload format is applied
1718	   end-to-end, any encryption needs to be performed after compression.
1719	   A potential denial-of-service threat exists for data encodings using
1720	   compression techniques that have non-uniform receiver-end
1721	   computational load.  The attacker can inject pathological datagrams
1722	   into the bitstream that are complex to decode and that cause the
1723	   receiver to be overloaded.  [VVC] is particularly vulnerable to such
1724	   attacks, as it is extremely simple to generate datagrams containing
1725	   NAL units that affect the decoding process of many future NAL units.
1726	   Therefore, the usage of data origin authentication and data integrity
1727	   protection of at least the RTP packet is RECOMMENDED, for example,
1728	   with SRTP [RFC3711].

1730	   Like HEVC [RFC7798], [VVC] includes a user data Supplemental
1731	   Enhancement Information (SEI) message.  This SEI message allows
1732	   inclusion of an arbitrary bitstring into the video bitstream.  Such a
1733	   bitstring could include JavaScript, machine code, and other active
1734	   content.  [VVC] leaves the handling of this SEI message to the
1735	   receiving system.  In order to avoid harmful side effects of the
1736	   user data SEI message, decoder implementations cannot naively trust
1737	   its content.  For example, it would be a bad and insecure
1738	   implementation practice to forward any JavaScript a decoder
1739	   implementation detects to a web browser.  The safest way to deal
1740	   with user data SEI messages is to simply discard them, but that can
1741	   have negative side effects on the quality of experience by the user.

1743	   End-to-end security with authentication, integrity, or
1744	   confidentiality protection will prevent a MANE from performing media-
1745	   aware operations other than discarding complete packets.  In the case
1746	   of confidentiality protection, it will even be prevented from
1747	   discarding packets in a media-aware way.  To be allowed to perform
1748	   such operations, a MANE is required to be a trusted entity that is
1749	   included in the security context establishment.

1751	11.  Congestion Control

1753	   Congestion control for RTP SHALL be used in accordance with RTP
1754	   [RFC3550] and with any applicable RTP profile, e.g., AVP [RFC3551].
1755	   If best-effort service is being used, an additional requirement is
1756	   that users of this payload format MUST monitor packet loss to ensure
1757	   that the packet loss rate is within an acceptable range.  Packet loss
1758	   is considered acceptable if a TCP flow across the same network path,
1759	   and experiencing the same network conditions, would achieve an
1760	   average throughput, measured on a reasonable timescale, that is not
1761	   less than all RTP streams combined are achieving.  This condition can
1762	   be satisfied by implementing congestion-control mechanisms to adapt
1763	   the transmission rate, the number of layers subscribed for a layered
1764	   multicast session, or by arranging for a receiver to leave the
1765	   session if the loss rate is unacceptably high.

1767	   The bitrate adaptation necessary for obeying the congestion control
1768	   principle is easily achievable when real-time encoding is used, for
1769	   example, by adequately tuning the quantization parameter.  However,
1770	   when pre-encoded content is being transmitted, bandwidth adaptation
1771	   requires the pre-coded bitstream to be tailored for such adaptivity.
1772	   The key mechanisms available in [VVC] are temporal scalability, and
1773	   spatial/SNR scalability.  A media sender can remove NAL units
1774	   belonging to higher temporal sub-layers (i.e., those NAL units with a
1775	   high value of TID) or higher spatio-SNR layers (as indicated by
1776	   interpreting the VPS) until the sending bitrate drops to an
1777	   acceptable range.

1779	   The mechanisms mentioned above generally work within a defined
1780	   profile and level and, therefore, no renegotiation of the channel is
1781	   required.  Only when non-downgradable parameters (such as profile)
1782	   are required to be changed does it become necessary to terminate and
1783	   restart the RTP stream(s).  This may be accomplished by using
1784	   different RTP payload types.

1786	   MANEs MAY remove certain unusable packets from the RTP stream when
1787	   that RTP stream was damaged due to previous packet losses.  This can
1788	   help reduce the network load in certain special cases.  For example,
1789	   MANEs can remove those FUs where the leading FUs belonging to the
1790	   same NAL unit have been lost or those dependent slice segments when
1791	   the leading slice segments belonging to the same slice have been
1792	   lost, because the trailing FUs or dependent slice segments are
1793	   meaningless to most decoders.  MANEs can also remove higher temporal
1794	   scalable layers if the outbound transmission (from the MANE's
1795	   viewpoint) experiences congestion.

1797	12.  IANA Considerations

1799	   Placeholder

1801	13.  Acknowledgements

1803	   Dr. Byeongdoo Choi is thanked for the video codec related technical
1804	   discussion and other aspects in this memo.  Xin Zhao and Dr. Xiang Li
1805	   are thanked for their contributions on [VVC] specification
1806	   descriptive content.  Spencer Dawkins is thanked for his valuable
1807	   review comments that led to great improvements of this memo.  Some
1808	   parts of this specification share text with the RTP payload format
1809	   for HEVC [RFC7798].  We thank the authors of that specification for
1810	   their excellent work.

1812	14.  References

1814	14.1.  Normative References

1816	   [H.266]    "ITU-T, Versatile Video Coding", n.d..

1818	   [ISO23090-3]
1819	              "ISO/IEC DIS Information technology --- Coded
1820	              representation of immersive media --- Part 3: Versatile
1821	              video coding", n.d.,
1822	              <https://www.iso.org/standard/73022.html>.

1824	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
1825	              Requirement Levels", BCP 14, RFC 2119,
1826	              DOI 10.17487/RFC2119, March 1997,
1827	              <https://www.rfc-editor.org/info/rfc2119>.

1829	   [RFC3264]  Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model
1830	              with Session Description Protocol (SDP)", RFC 3264,
1831	              DOI 10.17487/RFC3264, June 2002,
1832	              <https://www.rfc-editor.org/info/rfc3264>.

1834	   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
1835	              Jacobson, "RTP: A Transport Protocol for Real-Time
1836	              Applications", STD 64, RFC 3550, DOI 10.17487/RFC3550,
1837	              July 2003, <https://www.rfc-editor.org/info/rfc3550>.

1839	   [RFC3551]  Schulzrinne, H. and S. Casner, "RTP Profile for Audio and
1840	              Video Conferences with Minimal Control", STD 65, RFC 3551,
1841	              DOI 10.17487/RFC3551, July 2003,
1842	              <https://www.rfc-editor.org/info/rfc3551>.

1844	   [RFC3711]  Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
1845	              Norrman, "The Secure Real-time Transport Protocol (SRTP)",
1846	              RFC 3711, DOI 10.17487/RFC3711, March 2004,
1847	              <https://www.rfc-editor.org/info/rfc3711>.

1849	   [RFC4556]  Zhu, L. and B. Tung, "Public Key Cryptography for Initial
1850	              Authentication in Kerberos (PKINIT)", RFC 4556,
1851	              DOI 10.17487/RFC4556, June 2006,
1852	              <https://www.rfc-editor.org/info/rfc4556>.

1854	   [RFC4566]  Handley, M., Jacobson, V., and C. Perkins, "SDP: Session
1855	              Description Protocol", RFC 4566, DOI 10.17487/RFC4566,
1856	              July 2006, <https://www.rfc-editor.org/info/rfc4566>.

1858	   [RFC4585]  Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey,
1859	              "Extended RTP Profile for Real-time Transport Control
1860	              Protocol (RTCP)-Based Feedback (RTP/AVPF)", RFC 4585,
1861	              DOI 10.17487/RFC4585, July 2006,
1862	              <https://www.rfc-editor.org/info/rfc4585>.

1864	   [RFC5104]  Wenger, S., Chandra, U., Westerlund, M., and B. Burman,
1865	              "Codec Control Messages in the RTP Audio-Visual Profile
1866	              with Feedback (AVPF)", RFC 5104, DOI 10.17487/RFC5104,
1867	              February 2008, <https://www.rfc-editor.org/info/rfc5104>.

1869	   [RFC5124]  Ott, J. and E. Carrara, "Extended Secure RTP Profile for
1870	              Real-time Transport Control Protocol (RTCP)-Based Feedback
1871	              (RTP/SAVPF)", RFC 5124, DOI 10.17487/RFC5124, February
1872	              2008, <https://www.rfc-editor.org/info/rfc5124>.

1874	   [RFC7656]  Lennox, J., Gross, K., Nandakumar, S., Salgueiro, G., and
1875	              B. Burman, Ed., "A Taxonomy of Semantics and Mechanisms
1876	              for Real-Time Transport Protocol (RTP) Sources", RFC 7656,
1877	              DOI 10.17487/RFC7656, November 2015,
1878	              <https://www.rfc-editor.org/info/rfc7656>.

1880	   [RFC8082]  Wenger, S., Lennox, J., Burman, B., and M. Westerlund,
1881	              "Using Codec Control Messages in the RTP Audio-Visual
1882	              Profile with Feedback with Layered Codecs", RFC 8082,
1883	              DOI 10.17487/RFC8082, March 2017,
1884	              <https://www.rfc-editor.org/info/rfc8082>.

1886	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
1887	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
1888	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

1890	   [VVC]      "Versatile Video Coding (Draft 10), Joint Video Experts
1891	              Team (JVET)", July 2020.

1893	14.2.  Informative References

1895	   [CABAC]    Sole, J., et al., "Transform coefficient coding in HEVC",
1896	              IEEE Transactions on Circuits and Systems for Video
1897	              Technology, DOI 10.1109/TCSVT.2012.2223055, December
1898	              2012.

1900	   [FrameMarking]
1901	              Berger, E., Nandakumar, S., and M. Zanaty, "Frame
1902	              Marking RTP Header Extension", Work in Progress,
1903	              draft-berger-avtext-framemarking, 2015.

1905	   [Girod99]  Girod, B., et al., "Feedback-based error control for
1906	              mobile video transmission", Proceedings of the IEEE,
1907	              DOI 10.1109/5.790632, October 1999.

1909	   [HEVC]     "High efficiency video coding, ITU-T Recommendation
1910	              H.265", April 2013.

1912	   [MPEG2S]   ISO/IEC, "Information technology - Generic coding of
1913	              moving pictures and associated audio information - Part
1914	              1: Systems", ISO International Standard 13818-1, 2013.

1916	   [RFC6051]  Perkins, C. and T. Schierl, "Rapid Synchronisation of RTP
1917	              Flows", RFC 6051, DOI 10.17487/RFC6051, November 2010,
1918	              <https://www.rfc-editor.org/info/rfc6051>.

1920	   [RFC6184]  Wang, Y., Even, R., Kristensen, T., and R. Jesup, "RTP
1921	              Payload Format for H.264 Video", RFC 6184,
1922	              DOI 10.17487/RFC6184, May 2011,
1923	              <https://www.rfc-editor.org/info/rfc6184>.

1925	   [RFC6190]  Wenger, S., Wang, Y., Schierl, T., and A. Eleftheriadis,
1926	              "RTP Payload Format for Scalable Video Coding", RFC 6190,
1927	              DOI 10.17487/RFC6190, May 2011,
1928	              <https://www.rfc-editor.org/info/rfc6190>.

1930	   [RFC7201]  Westerlund, M. and C. Perkins, "Options for Securing RTP
1931	              Sessions", RFC 7201, DOI 10.17487/RFC7201, April 2014,
1932	              <https://www.rfc-editor.org/info/rfc7201>.

1934	   [RFC7202]  Perkins, C. and M. Westerlund, "Securing the RTP
1935	              Framework: Why RTP Does Not Mandate a Single Media
1936	              Security Solution", RFC 7202, DOI 10.17487/RFC7202, April
1937	              2014, <https://www.rfc-editor.org/info/rfc7202>.

1939	   [RFC7798]  Wang, Y., Sanchez, Y., Schierl, T., Wenger, S., and M.
1940	              Hannuksela, "RTP Payload Format for High Efficiency Video
1941	              Coding (HEVC)", RFC 7798, DOI 10.17487/RFC7798, March
1942	              2016, <https://www.rfc-editor.org/info/rfc7798>.

1944	   [Wang05]   Wang, YK., Zhu, C., and H. Li, "Error resilient
1945	              video coding using flexible reference frames", Visual
1946	              Communications and Image Processing 2005 (VCIP 2005),
1947	              July 2005.

1949	Appendix A.  Change History

1951	   draft-zhao-payload-rtp-vvc-00 ........ initial version

1953	   draft-zhao-payload-rtp-vvc-01 ........ editorial clarifications and
1954	   corrections

1956	   draft-ietf-payload-rtp-vvc-00 ........ initial WG draft

1958	   draft-ietf-payload-rtp-vvc-01 ........ VVC specification update

1960	Authors' Addresses

1962	   Shuai Zhao
1963	   Tencent
1964	   2747 Park Blvd
1965	   Palo Alto  94588
1966	   USA

1968	   Email: shuai.zhao@ieee.org

1970	   Stephan Wenger
1971	   Tencent
1972	   2747 Park Blvd
1973	   Palo Alto  94588

1975	   Email: stewe@stewe.org

1977	   Yago Sanchez
1978	   Fraunhofer HHI
1979	   Einsteinufer 37
1980	   Berlin  10587
1981	   Germany

1983	   Email: yago.sanchez@hhi.fraunhofer.de