AVT                                                           T. Schierl
Internet-Draft                                            Fraunhofer HHI
Intended status: Informational                                 J. Lennox
Expires: April 30, 2009                                            Vidyo
                                                        October 27, 2008

 Multi-Session and Multi-Source Transmission in the Real-Time Transport
                             Protocol (RTP)
          draft-schierl-avt-rtp-multi-session-transmission-00

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on April 30, 2009.

Abstract

   In this draft, we discuss problems related to multi-session and
   multi-source transmission using the Real-Time Transport Protocol
   (RTP). Most of the input to this draft is taken from email
   discussion. Multi-session and multi-source transmission is motivated
   by media data which allows for different transport layer treatment of
   parts of the media. This is typically the case for layered media.
   Multi-session transmission is when media data from a single media
   source is split over multiple RTP sessions. Single-session
   multi-source transmission (from now on just called "multi-source
   transmission") is when data from a single media source is sent as
   several RTP streams in the same RTP session. The main problems
   discussed are the mechanisms used for data alignment and source
   correlation. This draft further gives an overview of payload formats
   using multi-session/multi-source transmission and highlights other
   transport related issues. The draft concludes with recommendations
   for the discussed problems.

Table of Contents

   1. Introduction
   2. Definitions
   3. Terminology
   4. Existing Users of Multi-Session and Multi-Source Transmission
      4.1. Progressive Video with Hybrid (PVH)
      4.2. H.264 Scalable Video Coding (SVC)
      4.3. H.264 Multi-View Coding (MVC)
      4.4. G.718: Embedded Variable Bit-Rate (EV-VBR) Speech/Audio
           Codec
      4.5. MPEG Surround
      4.6. RTP Forward Error Correction
      4.7. RTP Retransmission
   5. Topology Overview
   6. Requirements for multi-session transmission
      6.1. Requirements on Data Alignment
      6.2. Requirements on Source Correlation
   7. Review of techniques for Data Alignment
      7.1. NTP Timestamp Alignment using RTCP Sender Report (SR)
           Packets
         7.1.1. Identified problems
      7.2. Review of other potential techniques for Data Alignment
         7.2.1. RTP Timestamp Alignment
         7.2.2. Initial RTP Timestamp or RTP Timestamp Offset
                Signaling
         7.2.3. CCM message - need NTP update
         7.2.4. Multiple early RTCP SRs
         7.2.5. Codec-Specific Mechanisms
         7.2.6. RTP header extension
   8. Review of techniques for Source Correlation
      8.1. Source Correlation using CNAME in SDES
      8.2. Review of other potential techniques for Source
           Correlation
         8.2.1. Single SSRC Space
         8.2.2. SSRC Groups
         8.2.3. CNAME in Source Attributes
         8.2.4. Application-specific Inference of Association
   9. Summary of RTP solution for Data Alignment and Source
      Correlation
      9.1. Data Alignment in RTP
      9.2. Source Correlation in RTP
      9.3. Dependency signaling
   10. Recommendations
   11. Other transport related issues for multi-session transmission
      11.1. Inter-session Jitter
      11.2. Inter-session Interleaving
   12. Security Considerations
   13. IANA Considerations
   14. References
      14.1. Normative References
      14.2. Informative References
   Appendix A. Acknowledgements
   Authors' Addresses
   Intellectual Property and Copyright Statements

1. Introduction

   Multi-session transmission is when media data from a single media
   source is split over multiple Real-Time Transport Protocol (RTP)
   [RFC3550] sessions. This is usually done because different transport
   layer treatment is desired for different aspects of the media source,
   e.g., different multicast groups or different traffic classes. If
   the traffic is being sent using multicast routing, this is often
   known as "layered multicast."

   Single-session multi-source transmission (from now on just called
   "multi-source transmission") is when data from a single media source
   is sent as several RTP streams in the same RTP session. In this
   case, the streams need to be treated differently by RTP (e.g. with
   separate RTCP statistics, or selective forwarding by RTP translators)
   but do not need different transport characteristics. This is often
   referred to as "SSRC multiplexing", after the synchronization source
   identifier (SSRC) which distinguishes sources in an RTP session.

   Such techniques are often used for "layered" or "embedded" codecs
   (the former term is typically used for video, the latter for audio).
   A lower-bitrate, and often lower-complexity, stream (known as the
   "base"), often backward-compatible with older codecs, provides basic
   media quality, while one or more additional streams (known as
   "enhancements") provide richer media or otherwise provide an enhanced
   user experience. Various layered and embedded codecs are discussed
   in Section 4.

   Multi-session and multi-source transmission are also used for stream
   robustness. Both RTP Forward Error Correction [RFC5109] and RTP
   Retransmission [RFC4588] use multi-session transmission, and the
   latter can optionally use multi-source transmission as well.

   For both multi-session and multi-source transmission, two issues
   arise: how streams are correlated, i.e. how receivers determine which
   base and enhancement streams carry data for the same media source;
   and how streams are aligned, i.e. how receivers determine which
   packets of the base stream are associated with which packets of the
   enhancement stream.

2. Definitions

   multi-session transmission: In multi-session transmission, media
      data from a single media source is split over multiple RTP
      sessions.
      The term "layered multicast" is equivalent to multi-session
      transmission for sessions using multicast addresses.

   multi-source transmission: In multi-source transmission, data from a
      single media source is sent as several RTP streams in the same RTP
      session. The sources contained in an RTP session are identified
      by their synchronization source identifiers (SSRCs) or, if
      combined by an RTP mixer, by their contributing source identifiers
      (CSRCs), as defined in RTP [RFC3550].
   associated multimedia streams: Associated multimedia streams are
      independent media sources from the same session participant, e.g.
      audio and video sources, or multiple cameras from a single
      participant. Each source can have an independent media clock,
      reflecting the device that captured the media. For live media,
      these clocks will often drift relative to each other, over and
      above their often inherently-different clock rates. In RTP, each
      stream has separate initial RTP timestamps and sequence numbers.
      Related sources are associated using the RTCP Canonical Name
      (CNAME) Source Description (SDES) field. A common time base may
      be computed using NTP timestamps, based on information carried in
      RTCP Sender Report (SR) packets. The sources are typically
      synchronized ("lip-synced") by receivers when rendered, based on
      the computed NTP timestamps.
   Data Alignment: Assembling data of the same media frame which is
      transferred in different sessions or as different sources in the
      same session as part of a layered media. The assembly of the
      media frame must be achieved before decoding, otherwise the
      decoding process typically fails or may be only possible at a
      reduced quality.
   Source Correlation: The logical association of RTP streams
      transferred as multiple separate sessions or as multiple sources
      in the same session to one layered media.

3. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

4. Existing Users of Multi-Session and Multi-Source Transmission

4.1. Progressive Video with Hybrid (PVH)

   Progressive Video with Hybrid transform (PVH) [McCa96] was used in
   the initial demonstration of multi-session transmission. PVH was the
   initial driver for adding text on layered multicast to the Real-Time
   Transport Protocol (RTP) [RFC3550]. Data Alignment was done using
   packets' RTP timestamps.

4.2. H.264 Scalable Video Coding (SVC)

   H.264 Scalable Video Coding (SVC) [I-D.ietf-avt-rtp-svc] extends the
   H.264 [RFC3984] video standard to provide spatial, temporal, and
   quality (signal-to-noise) enhancements. The base layer of SVC is
   backward-compatible with existing H.264 decoders. A base layer sent
   separately using the H.264 [RFC3984] payload format can be received
   and processed by existing devices. The Payload Format for SVC uses
   the multi-session transmission approach. Currently two basic modes
   are defined in the SVC Payload Format for decoding order recovery of
   media data received from multiple sessions:
   Data Alignment based on NTP timestamps: This method is used in the
      NI-T and NI-TC modes defined in [I-D.ietf-avt-rtp-svc]. These
      modes currently rely on exact NTP timestamp alignment in order to
      recover the decoding order.
   Cross-Session Decoding Order Number (CS-DON): This method is used in
      the NI-C, NI-TC and I-C modes defined in [I-D.ietf-avt-rtp-svc].
      These modes rely on a number (CS-DON) which is associated with
      each packet and indicates the decoding order across sessions.

4.3. H.264 Multi-View Coding (MVC)

   H.264 Multi-View Coding (MVC) [I-D.wang-avt-rtp-mvc] extends the
   H.264 [RFC3984] video standard to provide multiple views of a video
   stream, for multi-view and 3D applications. Like SVC, MVC is an
   extension of H.264 and has a backward-compatible base view, which
   can also be decoded by existing H.264 receivers. Thus it is possible
   to provide the base view of a multi-session transmission in a
   compatible way using the H.264 [RFC3984] payload format. Since
   the new coding approach is mainly based on exploiting temporal
   references to other frames of the same view or different views, there
   is not always the need to receive the base view in order to decode a
   desired view. The payload format will rely on the same approaches as
   defined in the RTP Payload Format for SVC video
   [I-D.ietf-avt-rtp-svc] for decoding order recovery when receiving
   data from multiple sessions.

4.4. G.718: Embedded Variable Bit-Rate (EV-VBR) Speech/Audio Codec

   G.718, the Embedded Variable Bit-Rate (EV-VBR) speech/audio codec
   [I-D.lakaniemi-avt-rtp-evbr] provides an embedded speech-rate
   encoder. This codec also allows for multi-session transmission. The
   current draft mandates RTCP SR for Data Alignment in multi-session
   transmission.

4.5. MPEG Surround

   MPEG Surround (Spatial Audio Coding, SAC) [I-D.ietf-avt-rtp-mps]
   enhances MPEG two-channel audio with multi-channel surround sound
   while maintaining backward compatibility with two-channel receivers.
   The payload relies on NTP timestamp alignment for multi-session
   transmission. The audio codec typically has different sampling rates
   for base and enhancements.

4.6. RTP Forward Error Correction

   RTP Generic Forward Error Correction [RFC5109] allows a supplemental
   stream to provide additional data for recovery from packet loss using
   a separate session for transmitting the FEC stream.
   The repair stream is typically sent as a separate RTP session. A
   special case is when the FEC stream is being sent as a secondary
   codec in the redundant encoding format. In this case the FEC stream
   is sent as a separate source in the same session as the redundant
   codec. Data Alignment is achieved using sequence numbers of the FEC
   protected packets.

   FEC Grouping Issues in Session Description Protocol
   [I-D.begen-mmusic-fec-grouping-issues] describes a grouping framework
   for FEC and media streams based on the Grouping of Media Lines in the
   Session Description Protocol (SDP) [RFC3388] framework. The
   framework relies on transmitting the FEC streams in separate
   sessions. Data Alignment is achieved by the FEC Framework and relies
   on the FEC scheme in use, i.e. there is a specific solution for
   associating data of the protected and the protecting packet stream.

4.7. RTP Retransmission

   RTP Retransmission [RFC4588] allows senders to retransmit RTP packets
   indicated by the receiver as lost. The re-sent packets are
   transported in a separate stream and may be transmitted within a
   separate RTP session or may be transmitted as a separate source in
   the same session as the media stream.

   If multi-source (i.e., single-session) transmission is being used,
   retransmitted packets are sent with a different SSRC. Source
   association in this case is done by sources' CNAMEs, with the further
   requirement that a receiver MUST NOT have two outstanding requests
   for the same packet sequence number in two different original streams
   before the association is resolved.

5. Topology Overview

   A number of different RTP Topologies [RFC5117] are relevant for
   consideration for multi-source and multi-session transmission.

   [Ed. TBD: more text on the relation between the approaches presented
   in the memo and the mentioned topologies.]

   o  Point-to-point - Two endpoints communicating using unicast.
   o  Point-to-multipoint via multicast - Using a multicast transport
      mechanism to send packets of one participant to all the other
      participants in the multicast group.
   o  Point-to-multipoint via RTP translator - Using [RFC3550]
      translators to send packets of one participant to other
      participants of a group. Packets of one or more participants may
      be forwarded to the group.
   o  Point-to-multipoint via RTP mixer - Using [RFC3550] mixers to send
      packets of one participant to other participants of a group.
      Packets of one or more participants may be forwarded to the group.
   o  Point-to-multipoint via Video Switching MCUs - Allows for sending
      packets from one participant to the other participants in a group,
      but typically only one participant's video data is forwarded at a
      time to the other participants.
   o  Point-to-multipoint via RTCP-terminating MCUs - Each participant
      is running a point-to-point session with the MCU. Typically, only
      one participant's video data is forwarded at a time to the other
      participants.
   o  Point-to-multipoint without a feedback channel - These channels
      typically provide IP multicast over a broadcast transmission
      medium, which naturally does not provide a bi-directional channel.
      This is the case, e.g., for DVB channels using IP over MPE over
      MPEG-2 Transport Stream, as for DVB-H or the emerging DVB-SH.

6. Requirements for multi-session transmission

6.1. Requirements on Data Alignment

   Synchronization of media streams received from multiple sessions is
   typically used for lip-synchronization of audio and video data. For
   this case, RTP provides a strong tool, which is the presence of (RTP)
   timestamps for each media frame, generated from individual clocks for
   each session.
   Additionally, RTCP Sender Report packets are sent
   periodically in each session containing (NTP) timestamps from a
   wallclock common across all of the sessions, plus a reference to the
   corresponding (RTP) timestamp that would be generated for a media
   frame with the signaled wallclock time. The interval between
   transmission of RTCP SRs is typically in the range of multiple
   seconds. For a more detailed review of RTP synchronization
   techniques, see Section 7.1.

   For the reception of layered media, either on multiple sessions or as
   multiple sources, it is absolutely essential to allow for immediate
   Data Alignment. That is, the Data Alignment must be applied before
   the decoding process of the layered media. If Data Alignment is not
   applied before decoding, the decoder may not be able to decode the
   media at all, or may only be able to produce a media representation
   at reduced quality.

6.2. Requirements on Source Correlation

   For the reception of layered media, whether on multiple sessions or
   as multiple sources, it is absolutely essential to find out prior to
   decoding which sessions and sources are correlated. That is, the
   receiver needs to know, prior to Data Alignment and decoding, the
   inter-session and the inter-source dependency. Notably, for cases in
   which multiple independent media sources are transmitted as layered
   media in the same session or set of sessions, miscorrelation of
   sources could lead to a decoder attempting to use one source's base
   layer with another source's enhancement layer.

7. Review of techniques for Data Alignment

7.1. NTP Timestamp Alignment using RTCP Sender Report (SR) Packets

   The inter-media synchronization mechanism defined in [RFC3550] uses
   RTP timestamps in the RTP packets and a combination of RTP timestamp
   and NTP wallclock carried in the RTCP Sender Report (SR) packets.
   The RTCP SR packet contains an RTP timestamp in the media timescale
   and, as a reference to absolute wallclock time, an NTP timestamp.
   The definitions for timestamp generation and synchronization in
   Sections 5.1 and 6.4.1 of [RFC3550] are summarized in the following
   list:

   o  The timestamp reflects the sampling instant of the first octet in
      the RTP data packet.
   o  The sampling instant MUST be derived from a clock that increments
      monotonically and linearly in time to allow synchronization and
      jitter calculations (see Section 6.4.1).
   o  The resolution of the clock MUST be sufficient for the desired
      synchronization accuracy and for measuring packet arrival jitter
      (one tick per video frame is typically not sufficient).
   o  If RTP packets are generated periodically, the nominal sampling
      instant as determined from the sampling clock is to be used, not a
      reading of the system clock.
   o  RTP timestamps from different media streams may advance at
      different rates and usually have independent, random offsets.
      Therefore, although these timestamps are sufficient to reconstruct
      the timing of a single stream, directly comparing RTP timestamps
      from different media is not effective for synchronization.
      Instead, for each medium the RTP timestamp is related to the
      sampling instant by pairing it with a timestamp from a reference
      clock (wallclock) that represents the time when the data
      corresponding to the RTP timestamp was sampled.
   o  Receivers should expect that the measurement accuracy of the
      timestamp may be limited to far less than the resolution of the
      NTP timestamp.
   o  On a system that has no notion of wallclock time but does have
      some system-specific clock such as "system uptime", a sender MAY
      use that clock as a reference to calculate relative NTP
      timestamps.
   o  It is important to choose a commonly used clock so that if
      separate implementations are used to produce the individual
      streams of a multimedia session, all implementations will use the
      same clock.
   o  [Ed.: The RTP timestamp in the SR] corresponds to the same time
      as the NTP timestamp (above), but in the same units and with the
      same random offset as the RTP timestamps in data packets.
   o  This correspondence may be used for intra- and inter-media
      synchronization for sources whose NTP timestamps are synchronized,
      and may be used by media-independent receivers to estimate the
      nominal RTP clock frequency.
   o  In most cases the RTP timestamp in the SR will not be equal to
      the RTP timestamp in any adjacent data packet. Rather, it MUST be
      calculated from the corresponding NTP timestamp using the
      relationship between the RTP timestamp counter and real time as
      maintained by periodically checking the wallclock time at a
      sampling instant.

   To summarize the definitions in [RFC3550]: the RTCP SR is used to
   derive the wallclock time of the media from the RTP timestamp and
   the NTP wallclock. If this synchronization mechanism is correctly
   implemented and there is no clock jitter in either the media clock
   or the NTP/system clock, so that an RTP timestamp and its NTP
   wallclock timestamp can always be guaranteed to be perfectly
   aligned, the RTP approach should work fine for Data Alignment.
   [Ed.: need more text for summary / review of text above]

7.1.1. Identified problems

7.1.1.1. Synchronization Delay

   Since [RFC3550] mandates RTCP SRs to be sent at intervals of multiple
   seconds, Data Alignment based on this information may introduce a
   delay to this process, which may lead to delayed tune-in for the
   decoding process. This is typically not the case for decoding media
   transferred in exactly one session and source, since synchronization
   is not required for decoding, but only for playout. A delay for
   playout or lip synchronization does not usually pose a fundamental
   problem.
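
   As a minimal sketch of the SR-based mapping summarized in
   Section 7.1, the following Python fragment converts an RTP timestamp
   into wallclock time using the most recent RTCP SR of its stream and
   then compares the result for two streams. The names SenderReport,
   rtp_to_wallclock and same_media_instant, the example clock rates and
   the tolerance parameter are illustrative assumptions; none of them
   is defined by [RFC3550] or by this draft.

```python
from dataclasses import dataclass

RTP_TS_MODULUS = 1 << 32  # RTP timestamps are 32-bit and wrap around


@dataclass(frozen=True)
class SenderReport:
    """Timestamp pair taken from one RTCP SR (hypothetical container)."""
    ntp_seconds: float   # NTP timestamp of the SR, converted to seconds
    rtp_timestamp: int   # RTP timestamp corresponding to the same instant
    clock_rate: int      # nominal RTP clock rate of the stream, in Hz


def rtp_to_wallclock(rtp_timestamp: int, sr: SenderReport) -> float:
    """Map an RTP timestamp to wallclock seconds using the last SR."""
    # Signed 32-bit difference, so that packets slightly older than the
    # SR and timestamp wrap-around are both handled.
    delta = (rtp_timestamp - sr.rtp_timestamp) % RTP_TS_MODULUS
    if delta >= RTP_TS_MODULUS // 2:
        delta -= RTP_TS_MODULUS
    return sr.ntp_seconds + delta / sr.clock_rate


def same_media_instant(t_a: float, t_b: float, tolerance: float = 0.0) -> bool:
    """Decide whether two computed wallclock times denote the same frame."""
    return abs(t_a - t_b) <= tolerance


# Assumed example: base layer at 90 kHz and enhancement layer at 48 kHz,
# each with its own random RTP offset. Both SRs carry the same wallclock.
base_sr = SenderReport(ntp_seconds=1000.0, rtp_timestamp=123_456, clock_rate=90_000)
enh_sr = SenderReport(ntp_seconds=1000.0, rtp_timestamp=9_000, clock_rate=48_000)

t_base = rtp_to_wallclock(123_456 + 45_000, base_sr)  # 0.5 s after the SR
t_enh = rtp_to_wallclock(9_000 + 24_000, enh_sr)      # 0.5 s after the SR
aligned = same_media_instant(t_base, t_enh)
```

   The mapping is only available once an SR has been received for each
   stream, which is the source of the tune-in delay described in
   Section 7.1.1.1. A non-zero tolerance turns the exact comparison
   into a comparison of timestamp ranges.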

7.1.1.2. Losing synchronization information

   The loss of RTCP SR packets may introduce additional delay to the
   Data Alignment process; a more robust mechanism would therefore be
   desirable.

7.1.1.3. Clock Skew

   Clock skew between the NTP/system clock and the media clock will
   affect the NTP media timestamp generation derived from RTCP SRs and
   RTP timestamps. That typically results in different NTP timestamps
   for packets of the same media frame transmitted in the different
   sessions or transferred as different sources, and leads to
   misalignment during Data Alignment. As far as we know, there is no
   way to always guarantee the presence of perfect clocks for media and
   NTP/system clock. From the standardization point of view this may
   seem to be an implementation issue. However, if it puts a burden on
   senders, such as requiring perfect clocks for generating timestamps,
   the issue needs to be solved in an easy and general way.

   Following the RTP philosophy, clock skew can be estimated by
   observing several RTCP SRs. The receiver may use the observation to
   compensate for the clock skew. However, this is only possible if
   there is no requirement for immediate synchronization of the sort
   which is essential for Data Alignment of layered codecs.

   Clock skew between media and NTP/system clocks may be overcome by
   using the same clock instance, e.g. the system clock, for
   RTP as well as NTP timestamp generation. However, this is not
   compliant with RTP, since [RFC3550] mandates the use of a media clock
   which is different from the system clock (see definitions in RTP as
   cited above in Section 7.1). Indeed, for many codecs, notably audio,
   correct decoding requires that the timestamp difference between
   subsequent frames exactly correspond to the amount of data sent in
   each frame.

7.1.1.4. Accuracy of clocks

   Assuming that we have clocks without skew, there is still the
   question of accuracy of the clock used for generating the timestamps.
   Notably, the Windows system clock is only updated on each system
   clock tick, typically every 10 or 15 milliseconds on Windows XP and
   Vista. RTP says that a receiver should not make any assumption about
   this, but an implementation which may have to cope with rounding done
   in the low-order microsecond cannot simply compare two NTP timestamps
   for being identical. An application may have to compare "ranges" of
   timestamps in order to get rid of rounding problems. However, in
   some cases the ranges of NTP timestamps required may indeed be
   greater than the time interval between consecutive media frames.

7.1.1.5. Existing RTCP SR implementations

   As far as we know, existing RTCP SR implementations show a wide range
   of alignment problems for generating exact NTP media timestamps for
   Data Alignment. NTP alignment issues can be modeled for existing
   RTCP senders by capturing the NTP and RTP timestamps in consecutive
   SR packets, projecting the NTP timestamp in one SR packet based on
   the RTP timestamp in that SR packet, the NTP and RTP timestamps in
   the previous SR packet, and the codec's nominal clock rate. Initial
   experiments have shown NTP timestamp alignment problems on the order
   of 40-50 milliseconds for several implementations.

7.2. Review of other potential techniques for Data Alignment

7.2.1. RTP Timestamp Alignment

   The idea here is to signal the same RTP timestamp for packets
   containing data of the same media time instance in the different
   sessions. That is, the same clock would have to be used for the
   multiple sessions and the same RTP random offset would have to be
   used. This method is backward compatible with using NTP timestamps
   for inter-media synchronization as well as for jitter calculation.
519 Furthermore, this is the only alternative used up to our knowledge 520 (see Section 4.1) for layered transmission of media. 522 7.2.1.1. Identified problems 524 Using the same RTP timestamp random offset may lead to getting weak 525 initialization vectors for the encryption method defined in [RFC3550] 526 if keys are shared across the sessions or streams. Additionally, 527 that it may be unnatural for some codecs to use the same clockrate 528 for the multiple sessions, for example an audio wideband enhancement 529 layer enhancing a narrow-band base layer. 531 7.2.2. Initial RTP Timestamp or RTP Timestamp Offset Signaling 533 Signaling the initial RTP timestamp or the initial offsets as an 534 media or source level attribute in SDP associated with each stream. 535 This could be done, e.g., using 537 [I-D.ietf-mmusic-sdp-source-attributes]. 539 7.2.2.1. Identified problems 541 This may have an implication for implementations, since one needs to 542 know packet stream related information as initial RTP timestamp, or 543 offset between RTP timestamps during while offering a session. This 544 may be a problem for sessions where multiple senders are present: it 545 may not always be possible for an SDP creator to include all initial 546 offsets / timestamps for all participants for sessions with multiple 547 sending parties. 549 7.2.3. CCM message - need NTP update 551 In this case, a receiver would request for immediate synchronization 552 information. This method may reduce the initial delay, but just work 553 for topologies with bi-directional channels. 555 7.2.3.1. Identified problems 557 This method is only feasible for topologies with bidirectional and 558 reasonably rapid communication channels, i.e. unicast or small-group 559 multicast. This method also assumes that the NTP timestamp alignment 560 always works. 562 7.2.4. 
Multiple early RTCP SRs 564 In this case, the sender would generate more RTCP SRs than typically 565 required and send them at an early point in the session. This method 566 also works for topologies with unidirectional communication 567 channels. 569 7.2.4.1. Identified problems 571 This method may overflow the RTCP bandwidth. The RTCP 572 sender bandwidth may be increased using SDP bandwidth parameters. 573 This method may require an adjustment of the RTCP bandwidth of the 574 session depending on the number of participants and senders. 575 Further, this approach does not solve the problem for receivers 576 tuning in to the session after it begins ("random entry"). This 577 method also assumes that the NTP timestamp alignment always works. 579 7.2.5. Codec-Specific Mechanisms 581 This mechanism exploits signaling contained within the payload's data 582 sections in order to enable Data Alignment. One example is the Cross 583 Session Decoding Order Number (CS-DON) as defined in 584 [I-D.ietf-avt-rtp-svc] or as proposed in 586 [I-D.hannuksela-avt-rtp-svc], where a timestamp or a timestamp delta 587 of the RTP packet to be aligned is carried by payload specific means. 589 7.2.5.1. Identified problems 591 This mechanism is payload specific, whereas a payload independent solution for the basic functionality of Data 592 Alignment is desirable. 594 7.2.6. RTP header extension 596 The RTP header extension may be used to add generic signaling about 597 Data Alignment to RTP packets. 599 7.2.6.1. Identified problems 601 RTP header extensions are required to be ancillary information which 602 can safely be discarded by receivers which do not understand them. 603 Data Alignment mechanisms do not satisfy this requirement. 605 8. Review of techniques for Source Correlation 607 8.1. Source Correlation using CNAME in SDES 609 In RTP, associated multimedia streams (e.g., audio and video sources 610 from a single participant) have different SSRCs, and are associated 611 using SDES CNAME fields.
While in principle the same technique can 612 be used to associate streams for multi-session or multi-source 613 transmission, several issues arise. 615 Startup latency: while slow lipsync convergence of multimedia streams 616 is often tolerable, layered sources have to be associated from the 617 start in order to be decodable, particularly for codec types such as 618 video with inter-frame decoding dependencies. 620 If multiple sources are sent from the same participant on the same 621 session or family of sessions, e.g. multiple video cameras, they will 622 have the same CNAME, because they are synchronized with each other 623 and with any other sources for the session. This makes it impossible 624 to definitively associate base and enhancement sources, as there may 625 be more than one of each with the same CNAME. This potential for 626 confusion is the reason for RTP retransmission's restriction on 627 multiple outstanding RTP NACKs before stream association has 628 completed, as described in Section 4.7. 630 8.2. Review of other potential techniques for Source Correlation 632 8.2.1. Single SSRC Space 634 Motivated by the problems with CNAME association, RTP [RFC3550] 635 specifies instead a single SSRC space for layered multicast 636 (multiple-session transmission). Furthermore, as described in 637 Section 9.2, it specifies that SSRC collision detection is performed 638 only in the base layer. 640 Applying SSRC collision detection in just the base layer when 641 using multi-session transmission seems to work for current codec 642 implementations. 644 By definition, one of the multiple views possible in MVC media 645 (Section 4.3) is the base view, and this view is backward compatible with 646 H.264. Decoding a view other than the base view may not require the 647 presence of the base view.
Although MVC is by its nature a layered 648 codec, it may not always be reasonable to require the reception of 649 the base layer for collision detection, even when it is not required 650 for decoding. 652 Currently, we do not see major relevance for the MVC codec format, 653 due to its limited coding efficiency; we therefore tend not to take MVC as 654 the driving application for new Source Correlation functionalities. 655 This means that, without taking MVC into account, the current solution of 656 using the base layer for SSRC collision detection still seems to be 657 appropriate. 659 If needed, collision detection could instead be performed across all, 660 or a subset of, the sessions used for multi-session transmission. 661 However, it is not entirely clear how this would work for senders or 662 receivers that are only participating in a subset of the sessions, 663 and this would require further study. 665 8.2.2. SSRC Groups 667 The Internet-Draft [I-D.ietf-mmusic-sdp-source-attributes] specifies 668 a mechanism by which related sources can be described as grouped in 669 SDP. For multi-source (single-session) transmission, this 670 offers an alternative way to provide source association. 672 Clearly, this will only be effective in topologies and signaling 673 architectures in which the SDP author can know about every source in 674 the session that will be used for multi-source transmission, and the 675 SDP can be updated on the addition of new sources or on SSRC 676 collisions. 678 8.2.3. CNAME in Source Attributes 680 The draft [I-D.ietf-mmusic-sdp-source-attributes] also provides a 681 mechanism for sources' SSRCs to be associated with their CNAMEs in SDP. 682 This can eliminate the startup latency of stream association for the 683 mechanism described in Section 8.1, though it does not solve the 684 problem of multiple sources for a session. It also has the same 685 architectural limitations as Section 8.2.2 in terms of using SDP. 687 8.2.4.
Application-specific Inference of Association 689 As described in Section 4.7, it is in some cases possible to use 690 mechanisms specific to a particular codec or mechanism to determine 691 stream associations. For retransmission, for instance, a NACK of a 692 packet with sequence N with SSRC A, followed by a retransmission of a 693 packet with sequence N on SSRC B, indicates that SSRC B is the 694 retransmission stream for SSRC A. Such techniques are mechanism- 695 specific and cannot easily be generalized. 697 9. Summary of RTP solution for Data Alignment and Source Correlation 699 9.1. Data Alignment in RTP 701 The text on layered multicast in [RFC3550] does not discuss Data 702 Alignment among the media data carried in the different RTP sessions. 703 We assume that the intention of the RTP specification was to use NTP 704 timestamp alignment. However, Vic, the demonstration code for 705 layered multicast using PVH, used RTP timestamp alignment for this 706 purpose. 708 9.2. Source Correlation in RTP 710 The text in section 8.3 of [RFC3550] mandates a single SSRC to be 711 used for multiple sessions containing data of the same layered media 712 source. Further, the text mandates the detection of SSRC collisions 713 using the CNAME item in SDES packets carried in the base layer: 715 For layered encodings transmitted on separate RTP sessions (see 716 Section 2.4), a single SSRC identifier space SHOULD be used across 717 the sessions of all layers and the core (base) layer SHOULD be 718 used for SSRC identifier allocation and collision resolution. 719 When a source discovers that it has collided, it transmits an RTCP 720 BYE packet on only the base layer but changes the SSRC identifier 721 to the new value in all layers. ... 723 9.3. 
Dependency signaling 725 For signaling the dependency of data transmitted using layered 726 multicast, SDP [RFC4566] contains rudimentary support, in that it 727 allows for signaling a range of transport addresses in a certain 728 media description. By definition, a higher transport address 729 identifies a higher layer in the one-dimensional hierarchy. A 730 receiver needs only to decode data conveyed over this transport 731 address and lower transport addresses to decode this Operation Point. 733 When the media data of one source is transmitted in multiple RTP 734 sessions, the mechanism defined in "Signaling media decoding 735 dependency in Session Description Protocol (SDP)" 736 [I-D.ietf-mmusic-decoding-dependency] can also be used to indicate 737 the relationship between the multiple sessions of the same media 738 type. Currently, this mechanism is used by the new Payload 739 Formats allowing multi-session transmission: [I-D.ietf-avt-rtp-svc], 740 [I-D.wang-avt-rtp-mvc], [I-D.ietf-avt-rtp-mps], and 741 [I-D.lakaniemi-avt-rtp-evbr]. By definition, the base layer is 742 signaled as the RTP session which does not depend on any other 743 session. 745 Since [RFC3550] mandates the correlation of one layered media with 746 the same source, there is no mechanism to indicate dependencies of 747 multiple sources. 749 10. Recommendations 751 For Data Alignment of media data from the same source, we recommend 752 that the same RTP timestamp be used for packets of the same time 753 instance as defined in 754 [I-D.lennox-avt-rtp-layered-encoding-timestamps]. This method comes 755 at no extra cost and can be implemented in a backward compatible way, since 756 NTP timing for synchronizing different types of media is not 757 affected. This further requires the use of the same timescale for all 758 sessions of a multi-session or multi-source transmission, which is 759 the case anyway if the layered media is identified as a unique 760 source.
Mandating the same timescale for each of the sessions in a 761 multi-session transmission may need to be discussed with respect to 762 the audio codec described in Section 4.5. 764 For Source Correlation, we suggest keeping the mechanism defined in 765 [RFC3550], i.e. all layers of a layered media source have the same 766 SSRC and the base layer is used for SSRC collision detection. 767 Further, it may be useful to have a signaling mechanism that 768 indicates the RTP session to be used for SSRC collision detection. 770 11. Other transport related issues for multi-session transmission 772 11.1. Inter-session Jitter 774 The transport of media of the same source in different sessions may 775 introduce different jitter behaviors in the different sessions. We 776 call this issue inter-session jitter. Inter-session jitter may be 777 caused by sessions taking different network paths or by any other 778 packet reordering within the network outside the control of the user. 779 RTP implementations typically use buffers for de-jittering each of 780 the sessions separately. In a simple A/V transmission scenario, 781 de-jittering the audio and the video input queue separately is not 782 problematic, since the synchronization is achieved after the decoder 783 during playout. With multi-session transmission, de-jittering and 784 synchronization (Data Alignment) are required before decoding, rather 785 than after decoding at playout time. Moreover, 786 Data Alignment via NTP timestamps must be exact to the microsecond; 787 otherwise the synchronization fails. This differs fundamentally 788 from synchronization for lip-synchronized playout of 789 audio and video. 791 11.2. Inter-session Interleaving 793 Using multi-session transmission allows for data interleaving, while 794 the data transmitted within one session can still be sent in decoding 795 order. Inter-session interleaving may also be realizable using Data 796 Alignment via timestamps.
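As an illustration of Data Alignment via timestamps across sessions, the following is a minimal sketch in Python; it is not taken from any of the referenced specifications. It assumes that all sessions of a multi-session transmission carry the same RTP timestamp for the same time instance (the recommendation in Section 10), that each session has already been de-jittered separately, that RTP timestamps are non-decreasing within each session and already unwrapped (which does not hold for streams with frame reordering or across 32-bit wraparound), and that a lower session index denotes a lower layer. The names Packet and align_sessions are hypothetical.

```python
import heapq
from typing import Iterable, Iterator, List, NamedTuple


class Packet(NamedTuple):
    """Hypothetical, simplified view of a received RTP packet."""
    session: int        # assumption: 0 = base layer session, higher = enhancement
    rtp_timestamp: int  # assumption: already unwrapped, no 32-bit wraparound
    payload: bytes


def align_sessions(queues: Iterable[Iterable[Packet]]) -> Iterator[List[Packet]]:
    """Group packets from several de-jittered sessions by RTP timestamp.

    Each input queue must be ordered by non-decreasing RTP timestamp.
    Within a group, packets are ordered by session index (base layer first).
    """
    # k-way merge of the per-session queues, keyed on (timestamp, session)
    merged = heapq.merge(*queues, key=lambda p: (p.rtp_timestamp, p.session))
    group: List[Packet] = []
    for pkt in merged:
        # A new timestamp closes the group for the previous time instance
        if group and pkt.rtp_timestamp != group[0].rtp_timestamp:
            yield group
            group = []
        group.append(pkt)
    if group:
        yield group


if __name__ == "__main__":
    base = [Packet(0, 0, b"b0"), Packet(0, 3000, b"b1")]
    enh = [Packet(1, 0, b"e0"), Packet(1, 3000, b"e1"), Packet(1, 6000, b"e2")]
    for unit in align_sessions([base, enh]):
        print([(p.session, p.rtp_timestamp) for p in unit])
```

Each yielded group holds the packets of one time instance, lowest layer first, and could then be handed to the decoder. A time instance that is present only in an enhancement session yields a group without base layer data, which a real receiver would have to handle according to the codec's dependency rules.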
798 12. Security Considerations 800 [Ed. TBD] 802 13. IANA Considerations 804 No action by IANA is required. 806 14. References 808 14.1. Normative References 810 [RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. 811 Jacobson, "RTP: A Transport Protocol for Real-Time 812 Applications", STD 64, RFC 3550, July 2003. 814 14.2. Informative References 816 [I-D.begen-mmusic-fec-grouping-issues] 817 Begen, A., "FEC Grouping Issues in Session Description 818 Protocol", draft-begen-mmusic-fec-grouping-issues-00 (work 819 in progress), February 2008. 821 [I-D.hannuksela-avt-rtp-svc] 822 Hannuksela, M. and Y. Wang, "Session Multiplexing for SVC 823 Video", draft-hannuksela-avt-rtp-svc-01 (work in 824 progress), July 2008. 826 [I-D.ietf-avt-rtp-mps] 827 Bont, F., Doehla, S., Schmidt, M., and R. Sperschneider, 828 "RTP Payload Format for Elementary Streams with MPEG 829 Surround multi-channel audio", draft-ietf-avt-rtp-mps-01 830 (work in progress), October 2008. 832 [I-D.ietf-avt-rtp-svc] 833 Wenger, S., Wang, Y., Schierl, T., and A. Eleftheriadis, 834 "RTP Payload Format for SVC Video", 835 draft-ietf-avt-rtp-svc-14 (work in progress), 836 September 2008. 838 [I-D.ietf-mmusic-decoding-dependency] 839 Schierl, T. and S. Wenger, "Signaling media decoding 840 dependency in Session Description Protocol (SDP)", 841 draft-ietf-mmusic-decoding-dependency-04 (work in 842 progress), October 2008. 844 [I-D.ietf-mmusic-sdp-source-attributes] 845 Lennox, J., Ott, J., and T. Schierl, "Source-Specific 846 Media Attributes in the Session Description Protocol 847 (SDP)", draft-ietf-mmusic-sdp-source-attributes-01 (work 848 in progress), February 2008. 850 [I-D.lakaniemi-avt-rtp-evbr] 851 Lakaniemi, A. and Y. Wang, "RTP payload format for G.718 852 speech/audio", draft-lakaniemi-avt-rtp-evbr-04 (work in 853 progress), October 2008. 855 [I-D.lennox-avt-rtp-layered-encoding-timestamps] 856 Lennox, J., Schierl, T., and S.
Ganesan, "Real-Time 857 Transport Protocol (RTP) Timestamps for Layered 858 Encodings", 859 draft-lennox-avt-rtp-layered-encoding-timestamps-00 (work 860 in progress), June 2008. 862 [I-D.wang-avt-rtp-mvc] 863 Wang, Y. and T. Schierl, "RTP Payload Format for MVC 864 Video", draft-wang-avt-rtp-mvc-02 (work in progress), 865 August 2008. 867 [McCa96] McCanne, S., "Scalable Compression and Transmission of 868 Internet Multicast Video", Report No. UCB/CSD-96-928, 869 December 1996. 871 Ph.D. Dissertation, University of California Berkeley. 873 [RFC2119] Bradner, S., "Key words for use in RFCs to Indicate 874 Requirement Levels", BCP 14, RFC 2119, March 1997. 876 [RFC3388] Camarillo, G., Eriksson, G., Holler, J., and H. 877 Schulzrinne, "Grouping of Media Lines in the Session 878 Description Protocol (SDP)", RFC 3388, December 2002. 880 [RFC3984] Wenger, S., Hannuksela, M., Stockhammer, T., Westerlund, 881 M., and D. Singer, "RTP Payload Format for H.264 Video", 882 RFC 3984, February 2005. 884 [RFC4566] Handley, M., Jacobson, V., and C. Perkins, "SDP: Session 885 Description Protocol", RFC 4566, July 2006. 887 [RFC4588] Rey, J., Leon, D., Miyazaki, A., Varsa, V., and R. 888 Hakenberg, "RTP Retransmission Payload Format", RFC 4588, 889 July 2006. 891 [RFC5109] Li, A., "RTP Payload Format for Generic Forward Error 892 Correction", RFC 5109, December 2007. 894 [RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117, 895 January 2008. 897 Appendix A. Acknowledgements 899 Funding for the RFC Editor function is provided by the IETF 900 Administrative Support Activity (IASA). Further, the author Thomas 901 Schierl of Fraunhofer HHI is sponsored by the European Commission 902 under the contract number FP7-ICT-214063, project SEA. The authors 903 want to thank Colin Perkins, Ye-Kui Wang, Randell Jesup, Ingemar 904 Johansson, Gerard Babonneau, Alex Eleftheriadis, Stefan Doehla, and 905 Roni Even for their valuable comments on the mailing list. 
907 Authors' Addresses 909 Thomas Schierl 910 Fraunhofer HHI 911 Einsteinufer 37 912 D-10587 Berlin 913 Germany 915 Phone: +49-30-31002-227 916 Email: mail@thomas-schierl.de 918 Jonathan Lennox 919 Vidyo, Inc. 920 433 Hackensack Avenue 921 Sixth Floor 922 Hackensack, NJ 07601 923 US 925 Email: jonathan@vidyo.com 927 Full Copyright Statement 929 Copyright (C) The IETF Trust (2008). 931 This document is subject to the rights, licenses and restrictions 932 contained in BCP 78, and except as set forth therein, the authors 933 retain all their rights. 935 This document and the information contained herein are provided on an 936 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS 937 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND 938 THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS 939 OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF 940 THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED 941 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 943 Intellectual Property 945 The IETF takes no position regarding the validity or scope of any 946 Intellectual Property Rights or other rights that might be claimed to 947 pertain to the implementation or use of the technology described in 948 this document or the extent to which any license under such rights 949 might or might not be available; nor does it represent that it has 950 made any independent effort to identify any such rights. Information 951 on the procedures with respect to rights in RFC documents can be 952 found in BCP 78 and BCP 79. 954 Copies of IPR disclosures made to the IETF Secretariat and any 955 assurances of licenses to be made available, or the result of an 956 attempt made to obtain a general license or permission for the use of 957 such proprietary rights by implementers or users of this 958 specification can be obtained from the IETF on-line IPR repository at 959 http://www.ietf.org/ipr. 
961 The IETF invites any interested party to bring to its attention any 962 copyrights, patents or patent applications, or other proprietary 963 rights that may cover technology that may be required to implement 964 this standard. Please address the information to the IETF at 965 ietf-ipr@ietf.org.