CLUE                                                          J. Lennox
Internet-Draft                                                    Vidyo
Intended status: Standards Track                               P. Witty
Expires: December 3, 2012                                    A. Romanow
                                                          Cisco Systems
                                                           June 1, 2012

  Real-Time Transport Protocol (RTP) Usage for Telepresence Sessions
                     draft-lennox-clue-rtp-usage-04

Abstract

   This document describes mechanisms and recommended practice for
   transmitting the media streams of telepresence sessions using the
   Real-Time Transport Protocol (RTP).

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on December 3, 2012.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.
Please review these documents 45 carefully, as they describe your rights and restrictions with respect 46 to this document. Code Components extracted from this document must 47 include Simplified BSD License text as described in Section 4.e of 48 the Trust Legal Provisions and are provided without warranty as 49 described in the Simplified BSD License. 51 Table of Contents 53 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 54 2. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 3 55 3. RTP requirements for CLUE . . . . . . . . . . . . . . . . . . 3 56 4. RTCP requirements for CLUE . . . . . . . . . . . . . . . . . . 5 57 5. Multiplexing multiple streams or multiple sessions? . . . . . 6 58 6. Use of multiple transport flows . . . . . . . . . . . . . . . 6 59 7. Use Cases . . . . . . . . . . . . . . . . . . . . . . . . . . 7 60 8. Other implementation constraints . . . . . . . . . . . . . . . 9 61 9. Requirements of a solution . . . . . . . . . . . . . . . . . . 9 62 10. Mapping streams to requested captures . . . . . . . . . . . . 11 63 10.1. Sending SSRC to capture ID mapping outside the media 64 stream . . . . . . . . . . . . . . . . . . . . . . . . . 11 65 10.2. Sending capture IDs in the media stream . . . . . . . . . 12 66 10.2.1. Multiplex ID shim . . . . . . . . . . . . . . . . . . 13 67 10.2.2. RTP header extension . . . . . . . . . . . . . . . . 13 68 10.2.3. Combined approach . . . . . . . . . . . . . . . . . . 14 69 10.3. Recommendations . . . . . . . . . . . . . . . . . . . . . 16 70 11. Security Considerations . . . . . . . . . . . . . . . . . . . 16 71 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 16 72 13. References . . . . . . . . . . . . . . . . . . . . . . . . . . 16 73 13.1. Normative References . . . . . . . . . . . . . . . . . . 16 74 13.2. Informative References . . . . . . . . . . . . . . . . . 17 75 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 18 77 1. Introduction 79 Telepresence systems, of the architecture described by 80 [I-D.ietf-clue-telepresence-use-cases] and 81 [I-D.ietf-clue-telepresence-requirements], will send and receive 82 multiple media streams, where the number of streams in use is 83 potentially large and asymmetric between endpoints, and streams can 84 come and go dynamically. These characteristics lead to a number of 85 architectural design choices which, while still in the scope of 86 potential architectures envisioned by the Real-Time Transport 87 Protocol [RFC3550], must be fairly different than those typically 88 implemented by the current generation of voice or video conferencing 89 systems. 91 Furthermore, captures, as defined by the CLUE Framework 92 [I-D.ietf-clue-framework], are a somewhat different concept than 93 RTP's concept of media streams, so there is a need to communicate the 94 associations between them. 96 This document makes recommendations, for this telepresence 97 architecture, about how streams should be encoded and transmitted in 98 RTP, and how their relation to captures should be communicated. 100 2. Terminology 102 The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", 103 "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this 104 document are to be interpreted as described in RFC 2119 [RFC2119] and 105 indicate requirement levels for compliant implementations. 107 3. 
RTP requirements for CLUE

   CLUE will permit a SIP call to include multiple media streams:
   easily dozens at a time (given, e.g., a continuous presence screen
   in a multi-point conference), potentially out of a possible pool of
   hundreds.  Furthermore, endpoints will have an asymmetric number of
   media streams.

   Two main backwards compatibility issues exist.  Firstly, on an
   initial SIP offer we cannot be sure that the far end will support
   CLUE, and therefore a CLUE endpoint must not offer a selection of
   RTP sessions which would confuse a CLUEless endpoint.  Secondly,
   there exist many SIP devices in the network through which calls may
   be routed; even if we know that the far end supports CLUE,
   re-offering with a larger selection of RTP sessions may fall foul of
   one of these middleboxes.

   We also desire to simplify NAT and firewall traversal by allowing
   endpoints to deal with only a single static address/port mapping per
   media type, rather than multiple mappings which change dynamically
   over the duration of the call.

   A SIP call in common usage today will typically offer one or two
   video RTP sessions (one for presentation, one for main video), and
   one audio session.  Each of these RTP sessions will be used to send
   either zero or one media stream in either direction, with the
   presence of these streams negotiated in the SDP (offering a
   particular session as send only, receive only, or send and receive),
   and through BFCP (for presentation video).

   In a CLUE environment this model -- sending zero or one source (in
   each direction) per RTP session -- does not scale, as discussed
   above, and mapping asymmetric numbers of sources to sessions is
   needlessly complex.

   Therefore, telepresence systems SHOULD use a single RTP session per
   media type, as shown in Figure 1, except where there is a need to
   give sessions different transport treatment.  All sources of the
   same media type, although from distinct captures, are sent over this
   single RTP session.

      Camera 1 ----\                           /---- Screen 1
                    \                         /
      Camera 2 ------+----- RTP session -----+------ Screen 2
                    /                         \
      Camera 3 ----/                           \---- Screen 3

   Figure 1: Multiplexing multiple media streams into one RTP session

   During call setup, a single RTP session is negotiated for each media
   type.  In SDP, only one media line is negotiated per media type, and
   multiple media streams are sent over the same UDP channel negotiated
   using the SDP media line.

   A number of protocol issues involved in multiplexing RTP streams
   into a single session are discussed in
   [I-D.westerlund-avtcore-multiplex-architecture] and
   [I-D.lennox-rtcweb-rtp-media-type-mux].  In the rest of this
   document we concentrate on examining the mapping of RTP streams to
   requested CLUE captures in the specific context of telepresence
   systems.

   The CLUE architecture requires more than simple source multiplexing,
   as defined by [RFC3550].  The key issue is how a receiver interprets
   the multiplexed streams it receives, and correlates them with the
   captures it has requested.  In some cases, the CLUE Framework's
   [I-D.ietf-clue-framework] concept of the "capture" maps cleanly to
   the RTP concept of an SSRC, but in many cases it does not.
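   As an illustration of the single-session negotiation described
   above, a CLUE endpoint's SDP might contain just one audio and one
   video "m" line, with all sources of a given media type multiplexed
   over that single session.  The ports, payload types, and codecs
   below are purely illustrative and are not mandated by this document:

      m=audio 49170 RTP/AVP 0
      a=rtpmap:0 PCMU/8000
      m=video 49172 RTP/AVP 96
      a=rtpmap:96 H264/90000

   Multiple sources (SSRCs) then share each of these transport flows,
   rather than each source being offered in its own "m" line.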
178 First we will consider the cases that need to be considered. We will 179 then examine the two most obvious approaches to mapping streams for 180 captures, showing their pros and cons. We then describe a third 181 possible alternative. 183 4. RTCP requirements for CLUE 185 When sending media streams, we are also required to send 186 corresponding RTCP information. However, while a unidirectional RTP 187 stream (as identified by a single SSRC) will contain a single stream 188 of media, the associated RTCP stream will include sender information 189 about the stream, but will also include feedback for streams sent in 190 the opposite direction. On a simple point-to-point case, it may be 191 possible to naively forward on RTCP in a similar manner to RTP, but 192 in more complicated use cases where multipoint devices are switching 193 streams to multiple receivers, this simple approach is insufficient. 195 As an example, receiver report messages are sent with the source SSRC 196 of a single media stream sent in the same direction as the RTCP, but 197 contain within the message zero or more receiver report blocks for 198 streams sent in the other direction. Forwarding on the receiver 199 report packets to the same endpoints which are receiving the media 200 stream tagged with that SSRC will provide no useful information to 201 endpoints receiving the messages, and does not guarantee that the 202 reports will ever reach the origin of the media streams on which they 203 are reporting. 205 CLUE therefore requires devices to more intelligently deal with 206 received RTCP messages, which will require full packet inspection, 207 including SRTCP decryption. The low rate of RTCP transmission/ 208 reception makes this feasible to do. 210 RTCP also carries information to establish clock synchronization 211 between multiple RTP streams. For CLUE, this information will be 212 crucial, not only for traditional lip-sync between video and audio, 213 but also for synchronized playout of multiple video streams from the 214 same room. This information needs to be provided even in the case of 215 switched captures, to provide clock synchronization for sources that 216 are temporarily being shown for a switched capture. 218 5. Multiplexing multiple streams or multiple sessions? 220 It may not be immediately obvious whether this problem is best 221 described as multiplexing multiple RTP sessions onto a single 222 transport layer, or as multiplexing multiple media streams onto a 223 single RTP session. Certainly, the different captures represent 224 independent purposes for the media that is sent; however, as any 225 stream may be switched into any of the multiplexed captures, we 226 maintain the requirement that all media streams within a CLUE call 227 must have a unique SSRC -- this is also a requirement for the above 228 use of RTCP. 230 Because of this, CLUE's use of RTP can best be described as 231 multiplexing multiple streams onto one RTP session, but with 232 additional data about the streams to identify their intended 233 destinations. A solution to perform this multiplexing may also be 234 sufficient to multiplex multiple RTP sessions onto one transport 235 session, but this is not a requirement. 237 6. Use of multiple transport flows 239 Most existing videoconferencing systems use separate RTP sessions for 240 main and presentation video sources, distinguished by the SDP content 241 attribute [RFC4796]. 
The use of the CLUE telepresence framework 242 [I-D.ietf-clue-framework] to describe multiplexed streams can remove 243 the need to establish separate RTP sessions (and transport flows) for 244 these sessions, as the relevant information can be provided by CLUE 245 messaging instead. 247 However, it can still be useful in many cases to establish multiple 248 RTP sessions (and transport flows) for a single CLUE session. Two 249 clear cases would be for disaggregated media (where media is being 250 sent to devices with different transport addresses), or scenarios 251 where different sources should get different quality-of-service 252 treatment. To support such scenarios, the use of multiple RTP 253 sessions, with SDP m lines with different transport addresses, would 254 be necessary. 256 To support this case, CLUE messaging needs to be able to indicate the 257 RTP session in which a requested capture is intended to be received. 259 7. Use Cases 261 There are three distinct use cases relevant for telepresence systems: 262 static stream choice, dynamically changing streams chosen from a 263 finite set, and dynamic changing streams chosen from an unbounded 264 set. 266 Static stream choice: 268 In this case, the streams sent over the multiplex are constant over 269 the complete session. An example is a triple-camera system to MCU in 270 which left, center and right streams are sent for the duration of the 271 session. 273 This describes an endpoint to endpoint, endpoint to multipoint 274 device, and equivalently a transcoding multipoint device to endpoint. 276 This is illustrated in Figure 2. 278 ,'''''''''''| +-----------Y 279 | | | | 280 | +--------+|"""""""""""""""""""""""""""|+--------+ | 281 | |EndPoint||---------------------------||EndPoint| | 282 | +--------+|"""""""""""""""""""""""""""|+--------+ | 283 | | | | 284 "-----------' "------------ 286 Figure 2: Point to Point Static Streams 288 Dynamic streams from a finite set: 290 In this case, the receiver has requested a smaller number of streams 291 than the number of media sources that are available, and expects the 292 sender to switch the sources being sent based on criteria chosen by 293 the sender. (This is called auto-switched in the CLUE Framework 294 [I-D.ietf-clue-framework].) 296 An example is a triple-camera system to two-screen system, in which 297 the sender needs to switch either LC -> LR, or CR -> LR. (Note in 298 particular, in this example, that the center camera stream could be 299 sent as either the left or the right auto-switched capture.) 301 This describes an endpoint to endpoint, endpoint to multipoint 302 device, and a transcoding device to endpoint. 304 This is illustrated in Figure 3. 306 ,'''''''''''| +-----------Y 307 | | |+--------+ | 308 | +--------+|"""""""""""""""""""""""""""||EndPoint| | 309 | |EndPoint|| |+--------+_| 310 | +--------+'''''''''' ''''''''''' 311 | |........ 312 "-----------' 314 Figure 3: Point to Point Finite Source Streams 316 Dynamic streams from an unbounded set: 318 This case describes a switched multipoint device to endpoint, in 319 which the multipoint device can choose to send any streams received 320 from any other endpoints within the conference to the endpoint. 322 For example, in an MCU to triple-screen system, the MCU could send 323 e.g. LCR of a triple-camera system -> LCR, or CCC of three single- 324 camera endpoints -> LCR. 326 This is illustrated in Figure 4. 328 +-+--+--+ 329 | |EP| `-. 330 | +--+ |`.`-. 331 +-------`. `. `. 332 `-.`. `-. 333 `.`-. `-. 334 `-.`. 
`-.-------+ +------+ 335 +--+--+---+ `.`.| +---+ ---------------| +--+ | 336 | |EP| +----.....:=. |MCU| ...............| |EP| | 337 | +--+ |"""""""""--| +---+ |______________| +--+ | 338 +---------+"""""""""";'.'.'.'---+ +------+ 339 .'.'.'.' 340 .'.'.'.' 341 / /.'.' 342 .'.::-' 343 +--+--+--+ .'.::' 344 | |EP| .'.::' 345 | +--+ .::' 346 +--------.' 348 Figure 4: Multipoint Unbounded Streams 350 Within any of these cases, every stream within the multiplexed 351 session MUST have a unique SSRC. The SSRC is chosen at random 352 [RFC3550] to ensure uniqueness (within the conference), and contains 353 no meaningful information. 355 Any source may choose to restart a stream at any time, resulting in a 356 new SSRC. For example, a transcoding MCU might, for reasons of load 357 balancing, transfer an encoder onto a different DSP, and throw away 358 all context of the encoding at this state, sending an RTCP BYE 359 message for the old SSRC, and picking a new SSRC for the stream when 360 started on the new DSP. 362 Because of this possibility of changing the SSRC at any time, all our 363 use cases can be considered as simplifications of the third and most 364 difficult case, that of dynamic streams from an unbounded set. Thus, 365 this is the primary case we will consider. 367 8. Other implementation constraints 369 To cope with receivers with limited decoding resources, for example a 370 hardware based telepresence endpoint with a fixed number of decoding 371 modules, each capable of handling only a single stream, it is 372 particularly important to ensure that the number of streams which the 373 transmitter is expecting the receiver to decode never exceeds the 374 maximum number the receiver has requested. In this case the receiver 375 will be forced to drop some of the received streams, causing a poor 376 user experience, and potentially higher bandwidth usage, should it be 377 required to retransmit I-frames. 379 On a change of stream, such a receiver can be expected to have a one- 380 out, one-in policy, so that the decoder of the stream currently being 381 received on a given capture is stopped before starting the decoder 382 for the stream replacing it. The sender MUST therefore indicate to 383 the receiver which stream will be replaced upon a stream change. 385 9. Requirements of a solution 387 This section lists, more briefly, the requirements a media 388 architecture for Clue telepresence needs to achieve, summarizing the 389 discussion of previous sections. In this section, RFC 2119 language 390 refers to requirements on a solution, not an implementation; thus, 391 requirements keywords are not written in capital letters. 393 Media-1: It must not be necessary for a Clue session to use more 394 than a single transport flow for transport of a given media type 395 (video or audio). 396 Media-2: It must, however, be possible for a Clue session to use 397 multiple transport flows for a given media type where it is 398 considered valuable (for example, for distributed media, or 399 differential quality-of-service). 400 Media-3: It must be possible for a Clue endpoint or MCU to 401 simultaneously send sources corresponding to static, to 402 composited, and to switched captures, in the same transport flow. 403 (Any given device might not necessarily be able send all of these 404 source types; but for those that can, it must be possible for them 405 to be sent simultaneously.) 406 Media-4: It must be possible for an original source to move among 407 switched captures (i.e. 
at one time be sent for one switched
      capture, and at a later time be sent for another one).
   Media-5:  It must be possible for a source to be placed into a
      switched capture even if the source is a "late joiner", i.e. was
      added to the conference after the receiver requested the switched
      source.
   Media-6:  Whenever a given source is assigned to a switched capture,
      it must be immediately possible for a receiver to determine the
      switched capture it corresponds to, and thus that any previous
      source is no longer being mapped to that switched capture.
   Media-7:  It must be possible for a receiver to identify the actual
      source that is currently being mapped to a switched capture, and
      correlate it with out-of-band (non-Clue) information such as
      rosters.
   Media-8:  It must be possible for a source to move among switched
      captures without requiring a refresh of decoder state (e.g., for
      video, a fresh I-frame), when this is unnecessary.  However, it
      must also be possible for a receiver to indicate when a refresh
      of decoder state is in fact necessary.
   Media-9:  If a given source is being sent on the same transport flow
      for more than one reason (e.g. if it corresponds to more than one
      switched capture at once, or to a static capture), it should be
      possible for a sender to send only one copy of the source.
   Media-10:  On the network, media flows should, as much as possible,
      look and behave like currently-defined usages of existing
      protocols; established semantics of existing protocols must not
      be redefined.
   Media-11:  The solution should seek to minimize the processing
      burden for boxes that distribute media to decoding hardware.
   Media-12:  If multiple sources from a single synchronization context
      are being sent simultaneously, it must be possible for a receiver
      to associate and synchronize them properly, even for sources that
      are mapped to switched captures.

10.  Mapping streams to requested captures

   The goal of any scheme is to allow the receiver to match the
   received streams to the requested captures.  As discussed in
   Section 7, during the lifetime of the transmission of one capture,
   we may see one or multiple media streams which belong to this
   capture, and during the lifetime of one media stream, it may be
   assigned to one or more captures.

   Topologically, the requirements in Section 9 are best addressed by
   implementing static and switched captures with an RTP Media
   Translator, i.e. the topology that RTP Topologies [RFC5117] defines
   as Topo-Media-Translator.  (A composited capture would be the
   topology described by Topo-Mixer; an MCU can easily produce either
   or both as appropriate, simultaneously.)  The MCU selectively
   forwards certain sources, corresponding to those sources which it
   currently assigns to the requested switched captures.

   Demultiplexing of streams is done by SSRC; each stream is known to
   have a unique SSRC.  However, the SSRC contains no information about
   capture IDs.  There are two obvious choices for providing the
   mapping from SSRC to captures: sending the mapping outside of the
   media stream, or tagging media packets with the capture ID.  (There
   may be other choices, e.g., payload type number, which might be
   appropriate for multiplexing one audio with one video stream on the
   same RTP session, but this is not relevant for the cases discussed
   here.)
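   Whichever of these two approaches is chosen, the receiver ends up
   maintaining a table from SSRC to capture ID and consulting it for
   every incoming packet.  The fragment below is a minimal,
   illustrative sketch of such a lookup; the names and the surrounding
   decoder interface are hypothetical and not defined by this document:

      # Illustrative sketch: route each incoming RTP packet to the
      # decoder handling the capture it currently belongs to.
      ssrc_to_capture = {}   # SSRC -> capture ID, however learned
      decoders = {}          # capture ID -> decoder instance

      def route_packet(ssrc, payload):
          capture_id = ssrc_to_capture.get(ssrc)
          if capture_id is None:
              # Mapping not yet known; an implementation might buffer
              # or discard the packet at this point.
              return
          decoders[capture_id].decode(payload)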
   (An alternative architecture would be to map all captures directly
   to SSRCs, and then to use a Topo-Mixer topology to represent
   switched captures as a "mixed" source with a single contributing
   CSRC.  However, such an architecture would not be able to satisfy
   the requirements Media-8, Media-9, or Media-12 described in
   Section 9 without substantial changes to the semantics of RTP.)

10.1.  Sending SSRC to capture ID mapping outside the media stream

   Every RTP packet includes an SSRC, which can be used to demultiplex
   the streams.  However, although the SSRC uniquely identifies a
   stream, it does not indicate which of the requested captures that
   stream is tied to.  If more than one capture is requested, a mapping
   from SSRC to capture ID is therefore required so that the media
   receiver can treat each received stream correctly.

   As described above, the receiver may need to know how to allocate
   its decoding resources in advance of receiving the media stream.
   Although implementations MAY cache incoming media received before
   knowing which multiplexed stream it applies to, this is optional,
   and other implementations may choose to discard the media,
   potentially requiring an expensive state refresh, such as a Full
   Intra Request (FIR) [RFC5104].

   In addition, a receiver will have to store lookup tables of SSRCs to
   stream IDs, decoders, etc.  Because of the large SSRC space (32
   bits), this will have to be in the form of something like a hash
   map, and a lookup will have to be performed for every incoming
   packet, which may prove costly, e.g., for MCUs processing large
   numbers of incoming streams.

   Consider the choices for where to put the mapping from SSRC to
   capture ID.  This mapping could be sent in the CLUE messaging.  The
   use of a reliable transport means that the sender can be sure that
   the mapping will not be lost, but if this reliability is achieved
   through retransmission, the time taken for the mapping to reach all
   receivers (particularly in a very large scale conference, e.g., with
   thousands of users) could result in very poor switching times,
   providing a bad user experience.

   A second option for sending the mapping is in RTCP, for instance as
   a new SDES item.  This is likely to follow the same path as the
   media, and therefore if the mapping data is sent slightly in advance
   of the media, it can be expected to be received in advance of the
   media.  However, because RTCP is lossy and, due to its timing rules,
   cannot always be sent immediately, the mapping may not be received
   for some time, resulting in the receiver of the media not knowing
   how to route the received media.  A system of acks and
   retransmissions could mitigate this, but this results in the same
   high switching latency behaviour as discussed for using CLUE as a
   transport for the mapping.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  CaptureID=9  |   length=4    |          Capture ID           :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

           Figure 5: SDES item for encoding of the Capture ID

10.2.  Sending capture IDs in the media stream

   The second option is to tag each media packet with the capture ID.
   This means that a receiver immediately knows how to interpret
   received media, even when an unknown SSRC is seen.
   As long as the media carries a known capture ID, it can be assumed
   that this media stream will replace the stream currently being
   received with that capture ID.

   This gives significant advantages in switching latency, as a switch
   between sources can be achieved without any form of negotiation with
   the receiver.  There is no chance of receiving media without knowing
   to which switched capture it belongs.

   However, the disadvantage of using a capture ID in the stream is
   that it introduces additional processing costs for every media
   packet: capture IDs are scoped only within one hop (i.e., within a
   cascaded conference a capture ID that is used from the source to the
   first MCU is not meaningful between two MCUs, or between an MCU and
   a receiver), and so they may need to be added or modified at every
   stage.

   Because capture IDs are chosen by the media sender, a sender that
   offers a particular capture to multiple recipients with the same ID
   needs to produce only one version of the stream (assuming outgoing
   payload type numbers match).  This reduces the cost in the multicast
   case, although it does not necessarily help in the switching case.

   An additional issue with putting capture IDs in the RTP packets
   comes from cases where a non-CLUE-aware endpoint is being switched
   by an MCU to a CLUE endpoint.  In this case, we may require up to an
   additional 12 bytes in the RTP header, which may push a media packet
   over the MTU.  However, as the MTU on either side of the switch may
   not match, it is possible that this could happen even without adding
   extra data into the RTP packet.  The 12 additional bytes per packet
   could also be a significant bandwidth increase in the case of very
   low bandwidth audio codecs.

10.2.1.  Multiplex ID shim

   As in draft-westerlund-avtcore-transport-multiplexing.

10.2.2.  RTP header extension

   The capture ID could be carried within the RTP header extension
   field, using [RFC5285].  This is negotiated in the SDP, e.g.:

      a=extmap:1 urn:ietf:params:rtp-hdrext:clue-capture-id

   Packets tagged by the sender with the capture ID will then contain a
   header extension as shown in Figure 6.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  ID=1 | L=3   |                  capture id                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  capture id   |
   +-+-+-+-+-+-+-+-+

     Figure 6: RTP header extension for encoding of the capture ID

   Adding or modifying the capture ID can be an expensive operation,
   particularly if SRTP is used to authenticate the packet.
   Modification of the contents of the RTP header requires
   reauthentication of the complete packet, and this could prove to be
   a limiting factor in the throughput of a multipoint device.
   However, it may be that reauthentication is required in any case due
   to the nature of SDP.  SDP permits the receiver to choose payload
   types, meaning that a similar need to modify the payload type in the
   packet header will also cause the need to reauthenticate.

10.2.3.  Combined approach

   The two major flaws of the above methods (the high switching latency
   of SSRC multiplexing, and the high computational cost on switching
   nodes) can be mitigated with a combined method.
   In this approach, the capture ID can be included in packets
   belonging to the first frame of media (typically an IDR or GDR),
   with only the SSRC being used to demultiplex thereafter.

10.2.3.1.  Behaviour of receivers

   A receiver of a stream should demultiplex on SSRC if it knows the
   capture ID for the given SSRC; otherwise it should look within the
   packet for the presence of the capture ID.  This has an issue where
   a stream switches from one capture to another -- for example, in the
   second use case described in Section 7, where the transmitter
   chooses to switch the center stream from the receiver's right
   capture to the left capture -- because the receiver will already
   know an incorrect mapping from that stream's SSRC to a capture ID.

   In this case the receiver should, at the RTP level, detect the
   presence of the capture ID and update its SSRC to capture ID map.
   This could potentially have issues where the demultiplexer has
   already sent the packet to the wrong physical device; this could be
   solved by checking for the presence of a capture ID in every packet,
   but this will have speed implications.  If a packet is received
   where the receiver does not already know the mapping between SSRC
   and capture ID, and the packet does not contain a capture ID, the
   receiver may discard it, and MUST request a transmission of the
   capture ID (see below).

10.2.3.2.  Choosing when to send capture IDs

   The updated capture ID needs to be known as soon as possible on a
   switch of SSRCs, as the receiver may be unable to allocate resources
   to decode the incoming stream, and may throw away the received
   packets.  It can be assumed that the incoming stream is undecodable
   until the capture ID is received.

   In common video codecs (e.g. H.264), decoder refresh frames (either
   IDR or GDR) also have this property, in that it is impossible to
   decode any video without first receiving the refresh point.  It
   therefore seems natural to include the capture ID within every
   packet of an IDR or GDR.

   For most audio codecs, where every packet can be decoded
   independently, there is no such obvious place to put this
   information.  Placing the capture ID within the first n packets of a
   stream on a switch is the simplest solution, where n needs to be
   sufficiently large that it can be expected that at least one packet
   will have reached the receiver.  For example, n=50 with 20 ms audio
   packets will give one second of capture IDs, which should give
   reasonable confidence of arrival.

   In the case where a stream is switched between captures, for reasons
   of coding efficiency it may be desirable to avoid sending a new IDR
   frame for this stream, if the receiver's architecture allows the
   same decoding state to be used for its various captures.  In this
   case, the capture ID could be sent for a small number of frames
   after the source switches capture, similarly to audio.

10.2.3.3.  Requesting Capture ID retransmits

   There will, unfortunately, always be cases where a receiver misses
   the beginning of a stream, and therefore does not have the mapping.
   One proposal could be to send the capture ID as an SDES item in
   every SDES packet; this should ensure that within roughly 5 seconds
   of receiving a stream, the capture ID will be received.  However, a
   faster method for requesting the transmission of a capture ID would
   be preferred.
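   The receiver behaviour described in Section 10.2.3.1, combined with
   such a request, can be summarised by the following illustrative
   sketch.  The packet accessors, the decoder interface, and the
   feedback-sending function are hypothetical placeholders, not defined
   by this document:

      # Illustrative sketch of combined-approach demultiplexing:
      # demultiplex on SSRC when the mapping is known, learn the
      # mapping from an in-band capture ID when one is present, and
      # otherwise discard the packet and request the capture ID.
      def handle_packet(pkt, ssrc_to_capture, decoders,
                        request_capture_id):
          capture_id = pkt.capture_id   # None if no in-band ID present
          if capture_id is not None:
              # Learn or update the SSRC -> capture ID mapping.
              ssrc_to_capture[pkt.ssrc] = capture_id
          else:
              capture_id = ssrc_to_capture.get(pkt.ssrc)
              if capture_id is None:
                  # Unknown SSRC and no in-band capture ID: discard the
                  # packet and ask the sender to (re)transmit the ID.
                  request_capture_id(pkt.ssrc)
                  return
          decoders[capture_id].decode(pkt.payload)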
   Again, we look towards the present solution to this problem with
   video.  RFC 5104 [RFC5104] provides a Full Intra Request (FIR)
   feedback message, which requests that the encoder produce the stream
   such that receivers need only the stream from that point onwards.  A
   video receiver without the start of the stream will naturally need
   to make this request, so by always including the capture ID in
   refresh frames, we can be sure that the receiver will have all the
   information it needs to decode the stream (both a refresh point and
   a capture ID).

   For audio, we can reuse this message.  If a receiver receives an
   audio stream for which it has no SSRC to capture mapping, it should
   send a FIR message for the received SSRC.  Upon receiving this, an
   audio encoder must then tag outgoing media packets with the capture
   ID for a short period of time.

   Alternatively, a new RTCP feedback message could be defined which
   would explicitly request a refresh of the capture ID mapping.

10.3.  Recommendations

   We recommend that endpoints MUST support the RTP header extension
   method of sharing capture IDs, with the extension in every media
   packet.  For low bandwidth situations, this may be considered
   excessive overhead, in which case endpoints MAY support the combined
   approach.

   This will be advertised in the SDP (in a way yet to be determined);
   if a receiver advertises support for the combined approach,
   transmitters which support sending the combined approach SHOULD use
   it in preference.

11.  Security Considerations

   The security considerations for multiplexed RTP do not appear to be
   different from those for non-multiplexed RTP.

   Capture IDs need to be integrity-protected in secure environments;
   however, they do not appear to need confidentiality.

12.  IANA Considerations

   Depending on the decisions taken, the new RTP header extension
   element, the new RTCP SDES item, and/or the new AVPF feedback
   message will need to be registered.

13.  References

13.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC3550]  Schulzrinne, H., Casner, S., Frederick, R., and V.
              Jacobson, "RTP: A Transport Protocol for Real-Time
              Applications", STD 64, RFC 3550, July 2003.

13.2.  Informative References

   [I-D.ietf-clue-framework]
              Romanow, A., Duckworth, M., Pepperell, A., and B.
              Baldino, "Framework for Telepresence Multi-Streams",
              draft-ietf-clue-framework-05 (work in progress),
              May 2012.

   [I-D.ietf-clue-telepresence-requirements]
              Romanow, A. and S. Botzko, "Requirements for Telepresence
              Multi-Streams",
              draft-ietf-clue-telepresence-requirements-01 (work in
              progress), October 2011.

   [I-D.ietf-clue-telepresence-use-cases]
              Romanow, A., Botzko, S., Duckworth, M., Even, R., and I.
              Communications, "Use Cases for Telepresence Multi-
              streams", draft-ietf-clue-telepresence-use-cases-02 (work
              in progress), January 2012.

   [I-D.lennox-rtcweb-rtp-media-type-mux]
              Lennox, J. and J. Rosenberg, "Multiplexing Multiple Media
              Types In a Single Real-Time Transport Protocol (RTP)
              Session", draft-lennox-rtcweb-rtp-media-type-mux-00 (work
              in progress), October 2011.

   [I-D.westerlund-avtcore-multiplex-architecture]
              Westerlund, M., Burman, B., and C. Perkins, "RTP
              Multiplexing Architecture",
              draft-westerlund-avtcore-multiplex-architecture-01 (work
              in progress), March 2012.
   [RFC4796]  Hautakorpi, J. and G. Camarillo, "The Session Description
              Protocol (SDP) Content Attribute", RFC 4796,
              February 2007.

   [RFC5104]  Wenger, S., Chandra, U., Westerlund, M., and B. Burman,
              "Codec Control Messages in the RTP Audio-Visual Profile
              with Feedback (AVPF)", RFC 5104, February 2008.

   [RFC5117]  Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117,
              January 2008.

   [RFC5285]  Singer, D. and H. Desineni, "A General Mechanism for RTP
              Header Extensions", RFC 5285, July 2008.

Authors' Addresses

   Jonathan Lennox
   Vidyo, Inc.
   433 Hackensack Avenue
   Seventh Floor
   Hackensack, NJ  07601
   US

   Email: jonathan@vidyo.com

   Paul Witty
   England
   UK

   Email: paul.witty@balliol.oxon.org

   Allyn Romanow
   Cisco Systems
   San Jose, CA  95134
   USA

   Email: allyn@cisco.com