idnits 2.17.1 draft-jennings-dispatch-new-media-01.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- No issues found here. Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a Security Considerations section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) == There are 3 instances of lines with non-RFC2606-compliant FQDNs in the document. ** The document seems to lack a both a reference to RFC 2119 and the recommended RFC 2119 boilerplate, even if it appears to use RFC 2119 keywords. RFC 2119 keyword, line 368: '... The transport MAY provide a compres...' RFC 2119 keyword, line 565: '... Implementation MUST support at least...' RFC 2119 keyword, line 569: '... Implementation MUST support at least...' RFC 2119 keyword, line 573: '... Video codecs MUST support any aspec...' RFC 2119 keyword, line 587: '... Video codecs MUST support a min wid...' (1 more instance...) Miscellaneous warnings: ---------------------------------------------------------------------------- == The copyright year in the IETF Trust and authors Copyright Line does not match the current year -- The document date (March 18, 2018) is 2230 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) == Outdated reference: A later version (-01) exists of draft-barnes-mls-protocol-00 == Outdated reference: A later version (-20) exists of draft-ietf-payload-flexible-fec-scheme-06 Summary: 3 errors (**), 0 flaws (~~), 4 warnings (==), 1 comment (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 2 Network Working Group C. Jennings 3 Internet-Draft Cisco 4 Intended status: Standards Track March 18, 2018 5 Expires: September 19, 2018 7 Modular Media Stack 8 draft-jennings-dispatch-new-media-01 10 Abstract 12 A sketch of a proposal for a modular media stack for interactive 13 communications. 15 Status of This Memo 17 This Internet-Draft is submitted in full conformance with the 18 provisions of BCP 78 and BCP 79. 20 Internet-Drafts are working documents of the Internet Engineering 21 Task Force (IETF). Note that other groups may also distribute 22 working documents as Internet-Drafts. The list of current Internet- 23 Drafts is at http://datatracker.ietf.org/drafts/current/. 25 Internet-Drafts are draft documents valid for a maximum of six months 26 and may be updated, replaced, or obsoleted by other documents at any 27 time. It is inappropriate to use Internet-Drafts as reference 28 material or to cite them other than as "work in progress." 30 This Internet-Draft will expire on September 19, 2018. 32 Copyright Notice 34 Copyright (c) 2018 IETF Trust and the persons identified as the 35 document authors. All rights reserved. 
37 This document is subject to BCP 78 and the IETF Trust's Legal 38 Provisions Relating to IETF Documents 39 (http://trustee.ietf.org/license-info) in effect on the date of 40 publication of this document. Please review these documents 41 carefully, as they describe your rights and restrictions with respect 42 to this document. Code Components extracted from this document must 43 include Simplified BSD License text as described in Section 4.e of 44 the Trust Legal Provisions and are provided without warranty as 45 described in the Simplified BSD License. 47 Table of Contents 49 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3 50 2. Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 51 3. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 4 52 4. Terminology . . . . . . . . . . . . . . . . . . . . . . . . . 4 53 5. Architecture . . . . . . . . . . . . . . . . . . . . . . . . 5 54 6. Connectivity Layer . . . . . . . . . . . . . . . . . . . . . 5 55 6.1. Snowflake - New ICE . . . . . . . . . . . . . . . . . . . 6 56 6.2. STUN2 . . . . . . . . . . . . . . . . . . . . . . . . . . 6 57 6.2.1. STUN2 Request . . . . . . . . . . . . . . . . . . . . 6 58 6.2.2. STUN2 Response . . . . . . . . . . . . . . . . . . . 6 59 6.3. TURN2 . . . . . . . . . . . . . . . . . . . . . . . . . . 7 60 7. Transport Layer . . . . . . . . . . . . . . . . . . . . . . . 8 61 8. Media Layer - RTP3 . . . . . . . . . . . . . . . . . . . . . 9 62 8.1. RTP Meta Data . . . . . . . . . . . . . . . . . . . . . . 12 63 8.2. Securing the messages . . . . . . . . . . . . . . . . . . 12 64 8.3. Sender requests . . . . . . . . . . . . . . . . . . . . . 12 65 8.4. Data Codecs . . . . . . . . . . . . . . . . . . . . . . . 13 66 8.5. Media Keep Alive . . . . . . . . . . . . . . . . . . . . 13 67 8.6. Forward Error Correction . . . . . . . . . . . . . . . . 13 68 8.7. MTI Codecs . . . . . . . . . . . . . . . . . . . . . . . 13 69 8.7.1. Audio . . . . . . . . . . . . . . . . 
. . . . . . . . 13 70 8.7.2. Video . . . . . . . . . . . . . . . . . . . . . . . . 13 71 8.7.3. Annotation . . . . . . . . . . . . . . . . . . . . . 14 72 8.7.4. Application Data Channels . . . . . . . . . . . . . . 14 73 8.7.5. Reverse Requests & Stats . . . . . . . . . . . . . . 14 74 8.8. Message Key Agreement . . . . . . . . . . . . . . . . . . 15 75 9. Control Layer . . . . . . . . . . . . . . . . . . . . . . . . 15 76 9.1. Transport Capabilities API . . . . . . . . . . . . . . . 15 77 9.2. Media Capabilities API . . . . . . . . . . . . . . . . . 15 78 9.3. Transport Configuration API . . . . . . . . . . . . . . . 16 79 9.4. Media Configuration API . . . . . . . . . . . . . . . . . 16 80 9.5. Transport Metrics . . . . . . . . . . . . . . . . . . . . 18 81 9.6. Flow Metrics API . . . . . . . . . . . . . . . . . . . . 18 82 9.7. Stream Metrics API . . . . . . . . . . . . . . . . . . . 19 83 10. Call Signalling - JABBER2 . . . . . . . . . . . . . . . . . . 19 84 11. Signalling Examples . . . . . . . . . . . . . . . . . . . . . 20 85 11.1. Simple Audio Example . . . . . . . . . . . . . . . . . . 20 86 11.1.1. simple audio advertisement . . . . . . . . . . . . . 20 87 11.1.2. simple audio proposal . . . . . . . . . . . . . . . 21 88 11.2. Simple Video Example . . . . . . . . . . . . . . . . . . 22 89 11.2.1. Proposal sent to camera . . . . . . . . . . . . . . 23 90 11.3. Simulcast Video Example . . . . . . . . . . . . . . . . 24 91 11.4. FEC Example . . . . . . . . . . . . . . . . . . . . . . 24 92 11.4.1. Advertisement includes a FEC codec. . . . . . . . . 24 93 11.4.2. Proposal sent to camera . . . . . . . . . . . . . . 25 94 12. Switched Forwarding Unit (SFU) . . . . . . . . . . . . . . . 26 95 12.1. Software Defined Networking . . . . . . . . . . . . . . 26 96 12.2. Vector Packet Processors . . . . . . . . . . . . . . . . 27 97 12.3. Information Centric Networking . . . . . . . . . . . . . 27 98 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 
27
   14. Other Work  . . . . . . . . . . . . . . . . . . . . . . . .  27
   15. Style of specification  . . . . . . . . . . . . . . . . . .  27
   16. Informative References  . . . . . . . . . . . . . . . . . .  28
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . .  28

1.  Introduction

   This draft is an accumulation of various ideas that some people are thinking about.  Most of them are fairly separable and could be morphed into existing protocols, but this draft takes a blank-sheet-of-paper approach and considers what the best design would be if we were starting from scratch.  With that in place, it is possible to ask which of these ideas make sense to back-patch into existing protocols.

2.  Goals

   o  Better connectivity by enabling situations where asymmetric media is possible.

   o  Design for SFUs (Switched Forwarding Units).  Design for multiparty calls first, then consider two-party calls as a specialized subcase of that.

   o  Designed for client-server deployments with server-based control of clients.

   o  Faster setup.

   o  Pluggable congestion control.

   o  Much, much simpler.

   o  End-to-end security.

   o  Remove the ability to use STUN / TURN in DDoS reflection attacks.

   o  Ability for the receiver of video to tell the sender about size changes of the display window so that the sender can match them.

   o  Eliminate the problems with the ROC in SRTP.

   o  Address the reasons people have not moved from SDES to DTLS-SRTP.

   o  Separation of call setup from ongoing call / conference control.

   o  Make codec negotiation more generic so that it works for future codecs.

   o  Remove ICE's need for global pacing, which is more or less impossible on general-purpose devices like PCs.

3.  Overview

   This draft proposes a new media stack to replace the existing stack of RTP, DTLS-SRTP, and SDP Offer/Answer.
The key parts of this stack are the connectivity layer, the transport layer, the media layer, a control API, and the signalling layer.

   The connectivity layer uses a simplified version of ICE, called snowflake [I-D.jennings-dispatch-snowflake], to find connectivity between endpoints and to move the connectivity from one address to another as different networks become available or disappear.  It is based on ideas from [I-D.jennings-mmusic-ice-fix].

   The transport layer uses QUIC to provide hop-by-hop encrypted, congestion-controlled transport of media.  Although QUIC does not currently have all of the partial reliability mechanisms to make this work, this draft assumes that they will be added to QUIC.

   The media layer uses existing codecs and packages them along with extra header information that indicates when the sequence needs to be played back, which camera it came from, and which media streams are to be synchronized.

   The control API is an abstract API that provides a way for the media stack to report its capabilities and features and a way for an application to tell the media stack how it should be configured.  Configuration includes which codec to use, the size and frame rate of video, and where to send the media.

   The signalling layer is based on an advertisement and proposal model.  Each endpoint can create an advertisement that describes what it supports, including things like supported codecs and maximum bitrates.  A proposal can be sent to an endpoint that tells the endpoint exactly what media to send and receive and where to send it.  The endpoint can accept or reject this proposal in total but cannot change any part of it.

4.  Terminology

   o  media stream: Stream of information from a single sensor.  For example, a video stream from a single camera.  A stream may have multiple encodings, for example video at different resolutions.
o  encoding: An encoded version of a stream.  A given stream may have several encodings at different resolutions.  One encoding may depend on other encodings, as with forward error correction or scalable video codecs.

   o  flow: A logical transport between two computers.  Many media streams can be transported over a single flow.  The actual IP addresses and ports used to transport data in the flow may change over time as connectivity changes.

   o  message: Some data or media to be sent across the network, along with metadata about it.  Similar to an RTP packet.

   o  media source: A camera, microphone, or other source of data on an endpoint.

   o  media sink: A speaker, screen, or other destination for data on an endpoint.

   o  TLV: Tag Length Value.  When used in this draft, the Tag, Length, and any integer values are coded as variable-length integers, similar to how this is done in CBOR.

5.  Architecture

   Much of the deployment architecture of IETF media designs is based on a distributed controller for the media stack running peer to peer in each client.  Yet nearly all deployments, be they cloud-based conferencing systems or enterprise PBXs, use a central controller that acts as an SBC to try to control each client.  The goal here is a deployment architecture that:

   o  supports a single controller that controls all the devices in a given conference or call.  The controller could be in the cloud or running on one of the endpoints.

   o  designs for multiparty conference calls first and treats two-party calls as a specialized subcase of that.

   o  designs with the assumption that a lightweight SFU (Switched Forwarding Unit) is used to distribute media for conference calls.

6.  Connectivity Layer

6.1.
Snowflake - New ICE

   All that is needed to discover connectivity is a way to:

   o  Gather some IP/ports that may work, using a TURN2 relay, STUN2, and local addresses.

   o  Have a controller, which might be running in the cloud, inform a client to send a STUN2 packet from a given source IP/port to a given destination IP/port.

   o  Have the receiver notify the controller about information on received STUN2 packets.

   o  Have the controller tell the sender the secret that was in the packet, proving the receiver's consent to receive data; the sending client can then allow media to flow over that connection.

   The actual algorithm used to decide which pairs of addresses are tested, and in what order, does not need to be agreed on by both sides of the call - only the controller needs to know it.  This allows the controller to use machine learning, past history, and heuristics to find an optimal connection much faster than something like ICE.

   The details of this approach are described in [I-D.jennings-dispatch-snowflake].  Many of the ideas in it can be traced back to [I-D.kaufman-rtcweb-traversal].

6.2.  STUN2

   The speed of setting up a new media flow is often determined by how many STUN2 checks need to be done.  If the STUN2 packets are smaller, then the STUN checks can be done faster without risk of causing congestion.

6.2.1.  STUN2 Request

   A STUN2 request consists of, well, really nothing.  The STUN client just opens a QUIC connection to the STUN server.

6.2.2.  STUN2 Response

   When the STUN2 server receives a new QUIC connection, it responds with the IP address and port that the connection came from.

   The client can check that it is talking to the correct STUN server by checking the fingerprint of the certificate.  Protocols like ICE would need to exchange these fingerprints instead of all the crazy STUN attributes.
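The request/response exchange above is small enough to sketch in a few lines.  This is a non-normative illustration: the draft defines no wire format for the response, so the JSON encoding and the function names here are assumptions, and a real implementation would sit on a QUIC stack rather than a bare function call.

```python
import json

def stun2_response(observed_ip: str, observed_port: int) -> bytes:
    # The response carries only the reflexive transport address as the
    # server observed it.  The JSON encoding is illustrative; the draft
    # does not specify one.
    return json.dumps({"ip": observed_ip, "port": observed_port}).encode()

def on_new_quic_connection(peer_addr):
    # The "request" is simply the act of opening a QUIC connection,
    # so there is nothing to parse before answering.
    ip, port = peer_addr
    return stun2_response(ip, port)

# A client behind a NAT learns its server-reflexive address:
reply = json.loads(on_new_quic_connection(("203.0.113.10", 43210)))
print(reply["ip"], reply["port"])  # 203.0.113.10 43210
```

Because there is nothing to parse, the entire "protocol" reduces to the server echoing back the 5-tuple it observed, which is why the check can be so much cheaper than a classic STUN binding request.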
Thanks to Peter Thatcher for proposing STUN over QUIC.

6.3.  TURN2

   TODO: make TURN2 run over QUIC.

   Out of band, the client tells the TURN2 server the fingerprint of the cert it uses to authenticate with.  The TURN2 server gives the client two public IP:port address pairs.  One is called inbound, and the other is called outbound.  The client connects to the outbound port and authenticates the TURN2 server using the TLS domain name of the server.  The TURN2 server authenticates the client using mutual TLS with the fingerprint of the cert provided by the client.  Any time a message or STUN packet is received on the matched inbound port, the TURN2 server forwards it to the client(s) connected to the outbound port.

   A single TURN2 connection can be used for multiple different calls or sessions at the same time, and a client could choose to allocate the TURN2 connection at the time it starts up.  This does not need to be done on a per-session basis.

   The client can not send from the TURN2 server.

   Client A            TURN Server           Client B
   (Media Receiver)                      (Media Sender)
       |                   |                   |
       |(1) OnInit Register (A's fingerprint)  |
       |------------------>|                   |
       |                   |                   |
       |(2) Register Response (Port Pair (L,R))|
       |<------------------|                   |
       |                   |                   |
       |   L (left of Server), R (Right of Server)
       |                   |                   |
       |(3) Setup TLS Connection (L port)      |
       |...................|                   |
       |                   |                   |
       |                   |        B sends media to A
       |                   |                   |
       |                   |(4) Media Tx (Received on Port R)
       |                   |<------------------|
       |                   |                   |
       |(5) Media Tx (Sent from Port L)        |
       |<------------------|                   |
       |                   |                   |

7.  Transport Layer

   The responsibility of the transport layer is to provide an end-to-end crypto layer equivalent to DTLS, and it must ensure adequate congestion control.  The transport layer brings up a flow between two computers.  This flow can be used by multiple media streams.
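The TURN2 port-pair rule from Section 6.3 is simple enough to sketch.  This is a non-normative illustration: the class and method names are invented for this sketch, and the TLS connection is abstracted to a send callback.

```python
class Turn2Relay:
    """Sketch of the TURN2 forwarding rule (Section 6.3): anything that
    arrives on the inbound port is forwarded, unmodified, to every
    client connected on the matched outbound port."""

    def __init__(self):
        self.pairs = {}  # inbound port -> pair state

    def register(self, fingerprint, inbound_port, outbound_port):
        # Out of band, the client registered the fingerprint of the
        # cert it will present over mutual TLS on the outbound port.
        self.pairs[inbound_port] = {"fingerprint": fingerprint,
                                    "outbound": outbound_port,
                                    "clients": []}

    def client_connect(self, inbound_port, presented_fingerprint, send):
        # The relay only accepts an outbound client whose cert matches
        # the registered fingerprint.
        pair = self.pairs[inbound_port]
        if presented_fingerprint != pair["fingerprint"]:
            raise PermissionError("client cert fingerprint mismatch")
        pair["clients"].append(send)

    def on_inbound(self, inbound_port, data):
        # Fan the packet out to all connected outbound clients; the
        # relay never originates traffic toward the inbound side.
        for send in self.pairs[inbound_port]["clients"]:
            send(data)
```

Because the relay does nothing but match an inbound port to its outbound clients and copy bytes, the same long-lived TURN2 allocation can serve many sessions, as the text above notes.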
The MTI transport layer is QUIC.  It assumes that QUIC has a way to deliver packets in an efficient unreliable mode as well as an optional way to deliver important metadata packets in a reliable mode.  It assumes that QUIC can report up to the rate adaptation layer a current maximum target bandwidth that QUIC can transmit at.  It is possible these are all unrealistic characteristics of QUIC, in which case a new transport protocol should be developed that provides them and is layered on top of DTLS for security.

   This is secured by checking that the fingerprints of the DTLS connection match the fingerprints provided at the control layer, or by checking that the names of the certificates match what was provided at the control layer.

   The transport layer needs to be able to set the DSCP values on transmitted packets as specified by the control layer.

   The transport MAY provide a compression mode to remove the redundancy of the non-encrypted portion of the media messages, such as the GlobalEncodingID.  For example, a GlobalEncodingID could be mapped to a QUIC channel and then removed before sending the message and added back on the receiving side.

   The transport needs to ensure that it has a very small chance of being confused with the STUN2 traffic it will be multiplexed with.  (Open issue: if STUN2 runs on top of the same transport, this becomes less of an issue.)

   The transport crypto needs to be able to export server state that can be passed out of band to the client, enabling the client to make a zero-RTT connection to the server.

8.  Media Layer - RTP3

   Each message consists of a set of TLV headers with metadata about the packet, followed by payload data such as the output of an audio or video codec.

   There are several message headers that help the receiver understand what to do with the media.
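The Terminology section says the Tag, Length, and integer values of these headers are coded as variable-length integers in the style of CBOR.  A simplified, non-normative sketch of such an encoding follows; it borrows CBOR's width-marker idea but is not the actual CBOR encoding, and the draft defines no exact format.

```python
def encode_varint(n: int) -> bytes:
    # CBOR-style variable-length integer: small values are coded in a
    # single byte, larger values get a marker byte naming their width.
    # This mirrors CBOR's approach but is a simplified sketch.
    if n < 24:
        return bytes([n])
    for marker, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * size):
            return bytes([marker]) + n.to_bytes(size, "big")
    raise ValueError("integer too large")

def encode_tlv(tag: int, value: bytes) -> bytes:
    # One TLV header: variable-length tag, variable-length length,
    # then the value bytes.
    return encode_varint(tag) + encode_varint(len(value)) + value
```

The point of the variable-length coding is that the common small tags and short headers cost one byte each, while large values such as 64-bit hashes remain representable.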
The TLV headers are the following:

   o  Conference ID: Integer that is a globally unique identifier for all applications using a common call signalling system.  This is set by the proposal.

   o  Endpoint ID: Integer to uniquely identify the endpoint within the scope of the conference ID.  This is set by the proposal.

   o  Source ID: Integer to uniquely identify the input source within the scope of an endpoint ID.  A source could be a specific camera or a microphone.  This is set by the endpoint and included in the advertisement.

   o  Sink ID: Integer to uniquely identify the sink within the scope of an endpoint ID.  A sink could be a speaker or screen.  This is set by the endpoint and included in the advertisement.  An endpoint sending media can have this set.  If it is set, it should be transmitted for 3 frames any time it changes and once every 5 seconds.  An SFU can add, modify, or delete this from any media packet.  TODO - How to use this for SFU-controlled layout - for example, if there are 100 users in a conference and the 10 most recent speakers should be put in thumbnails.  Do we need this at all?

   o  Encoding ID: Integer to uniquely identify the encoding of the stream within the scope of the source ID.  Note there may be multiple encodings of data from the same source.  This is set by the proposal.

   o  Salt: Salt to use for forming the initialization vector for the AEAD.  The salt is sent as part of the packet but need not be sent in every packet.  This is created by the endpoint sending the message.

   o  GlobalEncodingID: 64-bit hash of the concatenation of conference ID, endpoint ID, source ID, and encoding ID.

   o  Capture time: Time when the first sample in the message was captured.  It is an NTP time in ms with the high-order bits discarded.  The number of bits in the capture time needs to be large enough that it does not wrap for the lifetime of this stream.
This is set by the endpoint sending the message.

   o  Sequence ID: When the data captured for a single point in time is too large to fit in a single message, it can be split into multiple chunks, which are sequentially numbered starting at 0 for the first chunk of the message.  This is set by the endpoint sending the message.

   o  GlobalMessageID: 64-bit hash of the concatenation of conference ID, endpoint ID, encoding ID, and sequence ID.

   o  Active level: A number from 0 to 100 that indicates the probability that the sender of this media wishes it to be considered active media.  For example, for voice it would be 100 if the person was clearly speaking, 0 if not, and perhaps a value in the middle if it was uncertain.  This allows a media switch to select the active speaker in a conference call.

   o  Location: Relative or absolute location, direction of view, and field of view.  With video coming from drones, 360 cameras, VR light-field cameras, and complex video conferencing rooms, this provides the information about the camera or microphone that the receiver can use to render the correct view.  This is end-to-end encrypted.

   o  Reference Frame: Bool to indicate whether this message is part of a reference frame.  Typically, an SFU will switch to a new video stream at the start of a reference frame.

   o  DSCP: DSCP to use on transmission of this message and future messages on this GlobalEncodingID.

   o  Layer ID: Integer indicating which layer this is, for scalable video codecs.  An SFU may use this to selectively drop a frame.

   The keys used for the AEAD are unique to a given conference ID and endpoint ID.

   If the message has any of the following headers, they must occur in the following order, followed by all other headers:

   1.  GlobalEncodingID,

   2.  GlobalMessageID,

   3.  conference ID,

   4.  endpoint ID,

   5.  encoding ID,

   6.  sequence ID,

   7.
active level,

   8.  DSCP

   Every second there must be at least one message in each encoding that contains the:

   o  conference ID,

   o  endpoint ID,

   o  encoding ID,

   o  salt,

   o  and sequence ID headers,

   but they are not needed in every packet.

   The sequence ID or GlobalMessageID is required in every message, and periodically there should be a message with the capture time.

8.1.  RTP Meta Data

   We tend to end up with a few categories of data associated with the media:

   o  Stuff you need at the same time you get the media.  For example, whether this is a reference frame.

   o  Stuff you need soon but not instantly.  For example, the name of the speaker in a given rectangle of a video stream.

   And it tends to change at different rates:

   o  Stuff that you need to process the media and that may change, but does not change quickly, and that you don't need with every frame.  For example, the salt for encryption.

   o  Stuff that you need to join the media but that may never change.  For example, the resolution of the video.

   TODO - think about how to optimize the design for each type of metadata.

8.2.  Securing the messages

   The whole message is end-to-end secured with an AEAD.  The headers are authenticated, while the payload data is authenticated and encrypted.  Similar to how the IV for AES-GCM is calculated in SRTP, the IV is computed by xor'ing the salt with the concatenation of the GlobalEncodingID and the low 64 bits of the sequence ID.  The message consists of the authenticated data, followed by the encrypted data, then the authentication tag.

8.3.  Sender requests

   The control layer supports requesting retransmission of a particular media message, identified by its IDs and the capture time it would contain.

   The control layer supports requesting a maximum rate for each given encoding ID.

8.4.
Data Codecs

   Data messages, including raw bytes, XML, and SenML, can all be sent just like media by selecting an appropriate codec and a software-based source or sink.  An additional parameter to the codec can indicate whether reliable delivery is needed and whether in-order delivery is needed.

8.5.  Media Keep Alive

   Provided by the transport.

8.6.  Forward Error Correction

   A new Reed-Solomon based FEC scheme, based on [I-D.ietf-payload-flexible-fec-scheme], that provides FEC over messages needs to be defined.

8.7.  MTI Codecs

8.7.1.  Audio

   Implementations MUST support at least G.711 and Opus.

8.7.2.  Video

   Implementations MUST support at least H.264 and AV1.

   Video codecs use square pixels.

   Video codecs MUST support any aspect ratio within the limits of their max width and height.

   Video codecs can specify a maximum pixel rate, maximum frame rate, and maximum image size.  They can also specify a list of binary flags for supported features, which are defined by the codec and may be supported by the codec for encode, decode, or neither, where each feature can be independently controlled.  They can not impose constraints beyond that.  Some existing codecs like VP8 may easily fit into this, while a codec like H.264 may need some subsets defined as new codecs to meet these requirements.  It is not expected that all the nuances that could be negotiated with SDP for H.264 would be supported in this new media stack.

   Video codecs MUST support a min width and min height of 1.

   All video on the wire is oriented such that the first scan line in the frame is up and the first pixel in the scan line is on the left.

   T.38 fax and DTMF are not supported.  Fax can be sent as a TIFF image over a data channel, and DTMF can be sent as application-specific information over a data channel.
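Taken together, the rules above reduce video negotiation to a handful of numeric limits.  A short non-normative sketch of the resulting admission check follows; the capability field names are invented for this sketch, since the draft does not define a capability structure.

```python
def video_params_acceptable(caps: dict, width: int, height: int,
                            frame_rate: float) -> bool:
    """Check a proposed video configuration against advertised codec
    capabilities, per the Section 8.7 rules: any aspect ratio is
    allowed within the max width/height, the minimum dimension is 1,
    and the optional frame-rate and pixel-rate (pixels/second) caps
    must be respected.  Field names are illustrative."""
    if width < 1 or height < 1:
        return False
    if width > caps["max_width"] or height > caps["max_height"]:
        return False
    if frame_rate > caps.get("max_frame_rate", float("inf")):
        return False
    pixel_rate = width * height * frame_rate
    return pixel_rate <= caps.get("max_pixel_rate", float("inf"))

caps = {"max_width": 1920, "max_height": 1080,
        "max_frame_rate": 30, "max_pixel_rate": 62_208_000}

# 1280x720 @ 30 fps fits, and an extreme aspect ratio such as 10x1000
# must also be accepted as long as it stays within the limits.
assert video_params_acceptable(caps, 1280, 720, 30)
assert video_params_acceptable(caps, 10, 1000, 30)
assert not video_params_acceptable(caps, 1920, 1080, 60)
```

Because codecs may not impose constraints beyond these limits, a controller building a proposal only needs this arithmetic, not a per-codec negotiation grammar.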
TODO: Capture the list of what metadata video encoders produce: * whether it is a reference frame or not * resolution * frame rate? * capture time of frame

   TODO: Capture the list of what metadata video encoders need: * capture timestamp * source and target resolution * source and target frame rate * target bitrate * max bitrate * max pixel rate

8.7.3.  Annotation

   Optional support for annotation-based overlay using vector graphics such as a subset of SVG.

8.7.4.  Application Data Channels

   Need support for application-defined data in both a reliable mode and an unreliable datagram mode.

8.7.5.  Reverse Requests & Stats

   The hope is that this is not needed.

   Much of what goes in the reverse direction of the media in RTCP is used for congestion control, diagnostics, or control of the codec, such as requesting that a frame be resent or that a new intra codec frame be sent for video.  This design reduces the need for all of these.

   The congestion control information, which is needed quickly, is all handled at the QUIC layer.

   The diagnostic type of information can be reported from the endpoint to the controller and does not need to flow at the media level.

   Information that needs to be delivered reliably can be sent that way at the QUIC level, removing the need for retransmit-type requests.  Systems that use selective retransmission to recover from packet loss of media tend not to work as well for interactive media as forward error correction schemes, because of the large latency they introduce.

   Information like requesting a new intra codec frame for video often needs to come from the controller and can be sent over the signalling and control layer.

8.8.  Message Key Agreement

   The secret for encrypting messages can be provided in the proposal by value or by a reference.
The reference approach allows the client to get the secret from a messaging system where the server creating the proposal may not have access to it.  For example, it might come from a system like [I-D.barnes-mls-protocol].

9.  Control Layer

   The control layer needs an API to find out what the capabilities of the device are, and then a way to set up sending and receiving streams.  All media flows are only in one direction.  The control is broken into control of connectivity and transports, and control of media streams.

9.1.  Transport Capabilities API

   An API to get information for remote connectivity, including:

   o  set the IP, port, and credential for each TURN2 server

   o  return the IP, port tuple for the remote side to send to on the TURN2 server

   o  gather local IP, port, protocol tuples for receiving media

   o  report the SHA256 fingerprint of the local TLS certificate

   o  encryption algorithms supported

   o  report an error for a bad TURN2 credential

9.2.  Media Capabilities API

   Send and receive codecs are considered separate codecs and can have separate capabilities, though they default to the same if not specified separately.

   For each send or receive audio codec, an API to learn:

   o  codec name

   o  the max sample rate

   o  the max sample size

   o  the max bitrate

   For each send or receive video codec, an API to learn:

   o  codec name

   o  the max width

   o  the max height

   o  the max frame rate

   o  the max pixel depth

   o  the max bitrate

   o  the max pixel rate (pixels / second)

9.3.
Transport Configuration API

   To create a new flow, the information that can be configured is:

   o  TURN server to use

   o  list of IP, Port, Protocol tuples to try connecting to

   o  encryption algorithm to use

   o  TLS fingerprint of the far side

   An API to allow modification of the following attributes of a flow:

   o  total max bandwidth for the flow

   o  forward error correction scheme for the flow

   o  FEC time window

   o  retransmission scheme for the flow

   o  additional IP, Port, Protocol pairs to send to that may improve connectivity

9.4.  Media Configuration API

   For all streams:

   o  set conference ID

   o  set endpoint ID

   o  set encoding ID

   o  salt and secret for the AEAD

   o  flag to pause transmission

   For each transmitted audio stream, a way to set the:

   o  audio codec to use

   o  media source to connect to

   o  max encoded bitrate

   o  sample rate

   o  sample size

   o  number of channels to encode

   o  packetization time

   o  processing, as one of: automatically set, raw, speech, music

   o  DSCP value to use

   o  flag indicating whether to use a constant bit rate

   o  optionally, a sinkID to periodically include in the media

   For each transmitted video stream, a way to set the:

   o  video codec to use

   o  media source to connect to

   o  max width and max height

   o  max encoded bitrate

   o  max pixel rate

   o  sample rate

   o  sample size

   o  processing, as one of: automatically set, rapidly changing video, fine detail video

   o  DSCP value to use

   o  for a layered codec, a layer ID and the set of layer IDs this depends on

   o  optionally, a sinkID to periodically include in the media

   For each transmitted video stream, a way to tell it to:

   o  encode the next frame as an intra frame

   For each transmitted data stream:

   o  a way to send a data message and indicate reliable or unreliable transmission

   For each received audio stream:

   o  audio codec to use

   o  media sink to connect to

   o  lip sync flag
   For each received video stream:

   o  video codec to use

   o  media sink to connect to

   o  lip sync flag

   For each received data stream:

   o  notification of received data messages

   Note on lip sync: for any streams that have the lip sync flag set
   to true, the renderer attempts to synchronize their playback.

9.5.  Transport Metrics

   o  report gathering state and completion

9.6.  Flow Metrics API

   For each flow, report:

   o  connectivity state

   o  bits sent

   o  packets lost

   o  estimated RTT

   o  SHA-256 fingerprint for the certificate of the far side

   o  current 5-tuple in use

9.7.  Stream Metrics API

   For sending streams:

   o  bits sent

   o  packets lost

   For receiving streams:

   o  capture time of the most recently received packet

   o  endpoint ID of the most recently received packet

   o  bits received

   o  packets lost

   For video streams (send & receive):

   o  current encoded width and height

   o  current encoded frame rate

10.  Call Signalling - JABBER2

   Call signalling is out of scope for usages like WebRTC, but other
   usages may want a common REST API they can use.

   Call signalling works by having the client connect to a server when
   it starts up, send its current advertisement, and open a web socket
   to receive proposals from the server.  A client can make a REST
   call indicating the party (or parties) it wishes to connect to, and
   the server will then send proposals to all the clients to connect
   them.  The proposal tells each client exactly how to configure its
   media stack and MUST be either completely accepted or completely
   rejected.

   The signalling is based on the advertisement/proposal ideas from
   [I-D.peterson-sipcore-advprop].

   We define one round trip of signalling to be a message going from a
   client up to a server in the cloud, then down to another client,
   which returns a response along the reverse path.
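   The advertise/propose exchange described above might be modeled as
   follows.  This is a sketch only: the message shapes and helper
   names are assumptions, and only the all-or-nothing rule for
   proposals comes from this draft.

```python
# Hypothetical sketch of a JABBER2-style client's message handling.
# The advertisement shape loosely follows the examples in Section 11.

def make_advertisement(source_id, codec_names):
    """Build the advertisement a client sends when it connects."""
    return {"sources": [{"sourceID": source_id,
                         "sourceType": "audio",
                         "codecs": [{"codecName": c} for c in codec_names]}]}

def handle_proposal(proposal, supported_codecs):
    """A proposal MUST be completely accepted or completely rejected:
    if any stream requires a codec this client lacks, reject the
    whole proposal rather than accepting a subset."""
    streams = (proposal.get("sendStreams", []) +
               proposal.get("receiveStreams", []))
    if all(s["codecName"] in supported_codecs for s in streams):
        return "accept"
    return "reject"

adv = make_advertisement(1, ["opus", "g711"])
proposal = {"sendStreams": [{"codecName": "opus"}],
            "receiveStreams": [{"codecName": "opus"}]}
print(handle_proposal(proposal, {"opus", "g711"}))  # prints "accept"
```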
   With this definition, SIP takes 1.5 round trips, or more if TURN is
   needed, to set up a call, while this protocol takes 0.5 round
   trips.

11.  Signalling Examples

11.1.  Simple Audio Example

11.1.1.  Simple audio advertisement

   {
      "receiveAt":[
         {
            "relay":"2001:db8::10:443",
            "stunSecret":"s8i739dk8",
            "tlsFingerprintSHA256":"1283938"
         },
         {
            "stun":"203.0.113.10:43210",
            "stunSecret":"s8i739dk8",
            "tlsFingerprintSHA256":"1283938"
         },
         {
            "local":"192.168.0.2:443",
            "stunSecret":"s8i739dk8",
            "tlsFingerprintSHA256":"1283938"
         }
      ],
      "sources":[
         {
            "sourceID":1,
            "sourceType":"audio",
            "codecs":[
               {
                  "codecName":"opus",
                  "maxBitrate":128000
               },
               {
                  "codecName":"g711"
               }
            ]
         }
      ],
      "sinks":[
         {
            "sinkID":1,
            "sourceType":"audio",
            "codecs":[
               {
                  "codecName":"opus",
                  "maxBitrate":256000
               },
               {
                  "codecName":"g711"
               }
            ]
         }
      ]
   }

11.1.2.  Simple audio proposal

   {
      "receiveAt":[
         {
            "relay":"2001:db8::10:443",
            "stunSecret":"s8i739dk8"
         },
         {
            "stun":"203.0.113.10:43210",
            "stunSecret":"s8i739dk8"
         },
         {
            "local":"192.168.0.10:443",
            "stunSecret":"s8i739dk8"
         }
      ],
      "sendTo":[
         {
            "relay":"2001:db8::20:443",
            "stunSecret":"20kdiu83kd8",
            "tlsFingerprintSHA256":"9389739"
         },
         {
            "stun":"203.0.113.20:43210",
            "stunSecret":"20kdiu83kd8",
            "tlsFingerprintSHA256":"9389739"
         },
         {
            "local":"192.168.0.20:443",
            "stunSecret":"20kdiu83kd8",
            "tlsFingerprintSHA256":"9389739"
         }
      ],
      "sendStreams":[
         {
            "conferenceID":4638572387,
            "endpointID":23,
            "sourceID":1,
            "encodingID":1,
            "codecName":"opus",
            "AEAD":"AES128-GCM",
            "secret":"xy34",
            "maxBitrate":24000,
            "packetTime":20
         }
      ],
      "receiveStreams":[
         {
            "conferenceID":4638572387,
            "endpointID":23,
            "sinkID":1,
            "encodingID":1,
            "codecName":"opus",
            "AEAD":"AES128-GCM",
            "secret":"xy34"
         }
      ]
   }

11.2.  Simple Video Example

   Advertisement for a simple send-only camera with no audio:

   {
      "sources":[
         {
            "sourceID":1,
            "sourceType":"video",
            "codecs":[
               {
                  "codecName":"av1",
                  "maxBitrate":20000000,
                  "maxWidth":3840,
                  "maxHeight":2160,
                  "maxFrameRate":120,
                  "maxPixelRate":248832000,
                  "maxPixelDepth":8
               }
            ]
         }
      ]
   }

11.2.1.  Proposal sent to camera

   {
      "sendTo":[
         {
            "relay":"2001:db8::20:443",
            "stunSecret":"20kdiu83kd8",
            "tlsFingerprintSHA256":"9389739"
         }
      ],
      "sendStreams":[
         {
            "conferenceID":0,
            "endpointID":0,
            "sourceID":0,
            "encodingID":0,
            "codecName":"av1",
            "AEAD":"NULL",
            "width":640,
            "height":480,
            "frameRate":30
         }
      ]
   }

11.3.  Simulcast Video Example

   The advertisement is the same as for the simple camera above, but
   the proposal has two streams with different encodingIDs.

   {
      "sendTo":[
         {
            "relay":"2001:db8::20:443",
            "stunSecret":"20kdiu83kd8",
            "tlsFingerprintSHA256":"9389739"
         }
      ],
      "sendStreams":[
         {
            "conferenceID":0,
            "endpointID":0,
            "sourceID":0,
            "encodingID":1,
            "codecName":"av1",
            "AEAD":"NULL",
            "width":1920,
            "height":1080,
            "frameRate":30
         },
         {
            "conferenceID":0,
            "endpointID":0,
            "sourceID":0,
            "encodingID":2,
            "codecName":"av1",
            "AEAD":"NULL",
            "width":240,
            "height":240,
            "frameRate":15
         }
      ]
   }

11.4.  FEC Example

11.4.1.  Advertisement that includes a FEC codec

   {
      "sources":[
         {
            "sourceID":1,
            "sourceType":"video",
            "codecs":[
               {
                  "codecName":"av1",
                  "maxBitrate":20000000,
                  "maxWidth":3840,
                  "maxHeight":2160,
                  "maxFrameRate":120,
                  "maxPixelRate":248832000,
                  "maxPixelDepth":8
               },
               {
                  "codecName":"flex-fec-rs"
               }
            ]
         }
      ]
   }

11.4.2.  Proposal sent to camera

   {
      "sendTo":[
         {
            "relay":"2001:db8::20:443",
            "stunSecret":"20kdiu83kd8",
            "tlsFingerprintSHA256":"9389739"
         }
      ],
      "sendStreams":[
         {
            "conferenceID":0,
            "endpointID":0,
            "sourceID":0,
            "encodingID":1,
            "codecName":"av1",
            "AEAD":"NULL",
            "width":640,
            "height":480,
            "frameRate":30
         },
         {
            "conferenceID":0,
            "endpointID":0,
            "sourceID":0,
            "encodingID":2,
            "AEAD":"NULL",
            "codecName":"flex-fec-rs",
            "fecRepairWindow":200,
            "fecRepairEncodingIDs":[
               1
            ]
         }
      ]
   }

12.  Switched Forwarding Unit (SFU)

   When several clients are in a conference call, the SFU can forward
   packets based on which clients need a given GlobalEncodingID.
   By looking at the "active level", the SFU can figure out which
   endpoints are the active speakers and forward only those.  The SFU
   never changes anything in the message.

12.1.  Software Defined Networking

   Is it possible to use the packet recycling concepts in SDN to
   forward a single packet to multiple endpoints?  Can the way SDN
   forwarding works be adapted to use an SDN router as an SFU?

12.2.  Vector Packet Processors

   Can we use fast VPP systems like fd.io to create an SFU?

12.3.  Information Centric Networking

   What changes would be needed to map RTP2 into the prefix and suffix
   of hICN?

13.  Acknowledgements

   Thank you for input from: Harald Alvestrand, Espen Berger, Matthew
   Kaufman, Patrick Linskey, Eric Rescorla, Peter Thatcher, Malcolm
   Walters, and Martin Thomson.

14.  Other Work

   RFC 7016

   draft-kaufman-rtcweb-traversal

   Consider using terminology from RFC 7656.

   docs.google.com/presentation/
   d/1Sg_1TVCcKJvZ8Egz5oa0CP01TC2rNdv9HVu7W38Y4zA/
   edit#slide=id.g29a8672e18_22_120

   docs.google.com/presentation/d/1o-
   o5jZBLw3Py1OuenzWDkxDG6NigSmLHvGw5KemKWLw/
   edit#slide=id.g2f8f4acff1_1_249

   cs.chromium.org/chromium/src/third_party/webrtc/common_video/
   include/video_frame.h

15.  Style of Specification

   Fundamentally driven by experiments.  The proposal is to have a
   high-level overview document where we document some of the design;
   this document could be a start of that.  Then write a spec for each
   of the separable protocol parts, such as STUN2, TURN2, etc.

   The protocol specs would contain a high-level overview like you
   might find on a Wikipedia page, and the details of the protocol
   encoding would be provided in an open source reference
   implementation.  The test code for the reference implementation
   helps test the spec.
   The implementation is not optimized for performance but instead
   simply tries to clearly illustrate the protocol.  A particular
   version of the draft would be bound to a tagged version of the
   source code.  All the source code would be under the normal IETF
   IPR rules, just as if it were included directly in the draft.

16.  Informative References

   [I-D.barnes-mls-protocol]
              Barnes, R., Millican, J., Omara, E., Cohn-Gordon, K.,
              and R. Robert, "The Messaging Layer Security (MLS)
              Protocol", draft-barnes-mls-protocol-00 (work in
              progress), February 2018.

   [I-D.ietf-payload-flexible-fec-scheme]
              Zanaty, M., Singh, V., Begen, A., and G. Mandyam, "RTP
              Payload Format for Flexible Forward Error Correction
              (FEC)", draft-ietf-payload-flexible-fec-scheme-06 (work
              in progress), March 2018.

   [I-D.jennings-dispatch-snowflake]
              Jennings, C. and S. Nandakumar, "Snowflake - A
              Lightweight, Asymmetric, Flexible, Receiver Driven
              Connectivity Establishment", draft-jennings-dispatch-
              snowflake-01 (work in progress), March 2018.

   [I-D.jennings-mmusic-ice-fix]
              Jennings, C., "Proposal for Fixing ICE", draft-jennings-
              mmusic-ice-fix-00 (work in progress), July 2015.

   [I-D.kaufman-rtcweb-traversal]
              Kaufman, M. and J. Rosenberg, "NAT Traversal
              Requirements for RTCWEB", draft-kaufman-rtcweb-
              traversal-00 (work in progress), June 2011.

   [I-D.peterson-sipcore-advprop]
              Peterson, J. and C. Jennings, "The Advertisement/
              Proposal Model of Session Description", draft-peterson-
              sipcore-advprop-01 (work in progress), March 2011.

Author's Address

   Cullen Jennings
   Cisco

   Email: fluffy@iii.ca