idnits 2.17.1

draft-rosenberg-rtcweb-framework-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------

     No issues found here.

  Checking nits according to
  https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------

  ** The document seems to lack a Security Considerations section.

  ** The document seems to lack an IANA Considerations section.  (See
     Section 2.2 of https://www.ietf.org/id-info/checklist for how to
     handle the case when there are no actions for IANA.)

  ** There is 1 instance of too long lines in the document, the longest
     one being 4 characters in excess of 72.

  Miscellaneous warnings:
  ----------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line
     does not match the current year

  -- The document date (February 8, 2011) is 4826 days in the past.  Is
     this intentional?

  Checking references for intended status: Informational
  ----------------------------------------------------------------------

  -- Obsolete informational reference (is this intentional?): RFC 5389
     (Obsoleted by RFC 8489)

  == Outdated reference: A later version (-16) exists of
     draft-ietf-codec-opus-02

  -- Obsolete informational reference (is this intentional?): RFC 5245
     (Obsoleted by RFC 8445, RFC 8839)

     Summary: 3 errors (**), 0 flaws (~~), 2 warnings (==), 3 comments
     (--).

     Run idnits with the --verbose option for more detailed information
     about the items above.

------------------------------------------------------------------------

RTCWEB                                                      J. Rosenberg
Internet-Draft                                                M. Kaufman
Intended status: Informational                                   M. Hiie
Expires: August 12, 2011                                        F. Audet
                                                                   Skype
                                                        February 8, 2011

 An Architectural Framework for Browser based Real-Time Communications
                                 (RTC)
                  draft-rosenberg-rtcweb-framework-00

Abstract

   This document defines an architectural framework for browser-based
   real-time communications (RTC).  We propose a media component model,
   where the browser provides an API abstraction which models media
   components and connections.  The underlying protocols within the
   browser provide for a minimum set of functionality related to
   transport of media.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on August 12, 2011.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  The Media Component Model
   3.  The Role of Signaling
   4.  The Role of Media Transport
   5.  Benefits of the Media Component Model
     5.1.  Enabling Innovation
     5.2.  The Importance of Flexibility
   6.  Interoperability with Existing VoIP Gear
   7.  Informative References
   Authors' Addresses

1.  Introduction

   Real-time communications (RTC) remains one of the few - if not the
   only - classes of desktop applications that is not yet possible using
   the native capabilities of the web browser.  These applications run
   natively on the desktop, or are powered by plugins.  The
   functionality provided by these desktop clients is rich and complex -
   ranging from user interface, to real-time notifications, to call
   signaling and call processing, to instant messaging and presence, and
   of course - the real-time media stack itself, including codecs,
   transport, firewall and NAT traversal, security, and so on.

   Given the breadth of functionality in today's desktop RTC clients,
   careful consideration needs to be given to how that functionality
   manifests in the browser.  What functionality lives within the
   browser itself?  What functionality lives on top of it - either in
   client-side JavaScript or within servers?  What protocols are spoken
   by the browser itself?  What protocols can be implemented within the
   JavaScript?  What protocols need to be standardized, and which do
   not?  Pictorially, the question is what protocols, APIs, and
   functionality reside within the box marked "Browser RTC Function" in
   Figure 1.
   Indeed, the central question is what functionality resides in that
   box, as the functionality will ultimately dictate the protocols that
   interface to it, and the APIs which control it.

                      +------------------------+  On-the-wire
                      |                        |  Protocols
                      |      Servers           |--------->
                      |                        |
                      |                        |
                      +------------------------+
                                  ^
                                  |
                                  |
                                  | HTTP/
                                  | WebSockets
                                  |
                                  |
                                  |
                                  |
                    +----------------------------+
                    |    JavaScript/HTML/CSS     |
                    +----------------------------+
                 Other  ^                 ^RTC
                 APIs   |                 |APIs
                    +---|-----------------|------+
                    |   |                 |      |
                    |                 +---------+|
                    |                 | Browser ||  On-the-wire
                    | Browser         | RTC     ||  Protocols
                    |                 | Function|----------->
                    |                 |         ||
                    |                 |         ||
                    |                 +---------+|
                    +---------------------|------+
                                          |
                                          V
                                 Native OS Services

                        Figure 1: Browser Model

2.  The Media Component Model

   It is our position that the functionality that manifests within the
   box should be a media component model.  In this model, the browser
   implements the necessary functionality to perform the real-time
   processing of media, starting from capture/render, through
   encapsulation in real-time transport protocols sent over the
   Internet.  This functionality must be built into the browser, rather
   than within JavaScript, due to its tight timing requirements and
   complexity.  Furthermore, the functionality manifests as a set of
   loosely coupled components, each of which performs some aspect of the
   real-time processing.  Each component has APIs which allow that
   component to be configured (with sensible defaults where
   appropriate), along with APIs that allow applications to gather
   information and statistics about the performance of that module.
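
   As a purely illustrative, non-normative sketch, the shape of such a
   component API might look like the following TypeScript.  Every name
   here (MediaComponent, configure, getStats, recordStat, and the
   jitter-buffer settings and their values) is a hypothetical chosen
   for this example; none is defined by this document or implemented
   by any browser.

```typescript
// Hypothetical sketch only: these names are illustrative assumptions,
// not an API defined by this document or implemented by any browser.

// Statistics are modeled as a flat map of metric name to numeric value.
type ComponentStats = Record<string, number>;

// A loosely coupled media component: configurable (with defaults) and
// observable through a statistics snapshot.
class MediaComponent<C extends object> {
  readonly name: string;
  private config: C;
  private stats: ComponentStats = {};

  constructor(name: string, defaults: C) {
    this.name = name;
    this.config = { ...defaults }; // start from the supplied defaults
  }

  // Override only the settings the application cares about.
  configure(overrides: Partial<C>): void {
    this.config = { ...this.config, ...overrides };
  }

  // Return a copy so callers cannot mutate internal state.
  getConfig(): C {
    return { ...this.config };
  }

  // In a real browser the component would update these values itself;
  // this method merely stands in for that internal bookkeeping.
  recordStat(metric: string, value: number): void {
    this.stats[metric] = value;
  }

  getStats(): ComponentStats {
    return { ...this.stats };
  }
}

// Assumed settings for an imaginary jitter-buffer component.
interface JitterBufferConfig {
  targetDelayMs: number;
  maxDelayMs: number;
}

const jitterBuffer = new MediaComponent<JitterBufferConfig>("jitter-buffer", {
  targetDelayMs: 60,
  maxDelayMs: 200,
});

// An expert application changes one setting; the other default remains.
jitterBuffer.configure({ targetDelayMs: 40 });
jitterBuffer.recordStat("latePackets", 3);
```

   In this sketch a simple application would construct the component
   and rely on its defaults, while an advanced one would call
   configure() and read getStats() to decide on further adjustments.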

   The modules would include the codec itself, the acoustic echo
   canceller (AEC), the jitter buffer, audio and video pre-processing
   modules, and network transport components (including encryption and
   integrity protection of media) which speak specific transport
   protocols (such as the Real-Time Transport Protocol (RTP)).  The
   media component model is purposefully minimalistic.  It opts for
   maximizing the functionality that lives outside of the browser itself
   - within JavaScript or servers.  In particular, only functionality
   which is real-time - which cannot be done using JavaScript or server
   functionality - resides within the browser itself.  As explained in
   Section 5, this facilitates innovation, differentiation, and
   development velocity - all of the key characteristics that have made
   the web what it is.

   As an example, a codec component implementing Opus
   [I-D.ietf-codec-opus] might be represented by a JavaScript object
   with properties that mirror the configuration settings of the codec
   itself - the sample rate (one of narrowband, mediumband, wideband or
   super-wideband), the packet rate (number of frames per packet), the
   bitrate (which can vary between 6 and 40 kbps), a slider that adjusts
   the packet loss resilience, a Boolean which indicates whether inband
   FEC should be used, and another Boolean which indicates whether to
   apply silence suppression.  Of course, all of these parameters might
   have reasonable defaults so that non-expert programmers can just make
   it work.  However, an advanced programmer could force a mode or
   change a setting as needed.  After all, the Opus codec itself makes
   these parameters tunable exactly because there is no one right value;
   the correct setting depends on the application scenario and needs of
   the developer.

3.  The Role of Signaling

   It is our view that signaling should be accomplished using a
   combination of existing client-server web protocols (HTTP, COMET, and
   WebSockets) and standards-based server-to-server protocols, such as
   SIP.  A view of the "browser RTC Trapezoid" is shown in Figure 2.

             +-----------+             +-----------+
             |   Web/    |             |   Web/    |
             |   SIP     |     SIP     |   SIP     |
             |           |-------------|           |
             |  Server   |             |  Server   |
             |           |             |           |
             +-----------+             +-----------+
                  /                           \
                 /                             \   Proprietary over
                /                               \  HTTP/WebSockets
               /                                 \
              /  Proprietary over                 \
             /   HTTP/WebSockets                   \
            /                                       \
      +-----------+                           +-----------+
      |JS/HTML/CSS|                           |JS/HTML/CSS|
      +-----------+                           +-----------+
      +-----------+                           +-----------+
      |           |                           |           |
      |           |                           |           |
      |  Browser  | ------------------------- |  Browser  |
      |           |           Media           |           |
      |           |                           |           |
      +-----------+                           +-----------+

                    Figure 2: Browser RTC Trapezoid

   In this example, a call is placed between two different providers.
   They use a SIP-based interface to federate between them.  However,
   each of their respective browser-based clients signals to its server
   using proprietary application protocols built on top of HTTP and
   WebSockets.  For example, provider A might offer simple calling
   services, and have a very simple web services interface for placing
   calls:

   http://calling.providerA.com/call?target=joe@providerB.com&myIP=1.2.3.4:4476

   This interface takes only the called party and the local IP address
   and port as arguments.  Provider A's server infrastructure - some
   combination of web and SIP servers built in any way it likes - uses
   the identity of the target, along with previously-known information
   on the capabilities of the caller's browser learned through a
   web-services registration, to generate a SIP INVITE.  This arrives at
   provider B's server infrastructure, which alerts its browser-based
   client of the incoming call.
   Provider B might be an enterprise service provider, and offer much
   richer features and signaling.  Provider B uses a WebSocket interface
   to the browser, providing it the identity of the caller, the list of
   available codecs, and so on.  Provider B offers web-services-based
   APIs for answering the call, declining it, sending it to voicemail,
   redirecting it to another number, parking it, and so on.

   APIs within the browser allow each side to instruct the browsers to
   send media, including selection of media types and codecs.  In this
   model, there is no SIP in the browser.  It is our view that SIP has
   no place within the browser.

   SIP is an application protocol - providing call setup, registration,
   codec negotiation, chat and presence, amongst other features.  For
   each and every new feature that is desired to run between a SIP
   client and a SIP server, a new standard must be defined and then
   implemented.  The feature set is indeed vast, considering the wealth
   of potential endpoints, ranging from simple consumer "voice only"
   clients, to richer videophones, to voice and video multiparty
   conferencing (including content sharing), to low-end enterprise
   phones, to high-end executive admin phones, to contact center
   endpoints, and beyond.  Each of those requires more and more SIP
   extensions in order to function.  This has resulted in a growing
   number of specifications, with diminishing returns of
   interoperability and feature velocity.  As an example, the BLISS
   working group in the IETF was formed to tackle some basic business
   phone features - including line sharing, park, call queuing, and
   automated call handling.  Each of these individual features requires
   one or more specifications, and needs to be designed to meet the
   needs of all of the participants in the process.

   There are two important consequences of this.
   First, the requirement of standardization acts as a huge deterrent
   to innovation.  Indeed, in many ways, it is anathema to the very
   notion of how the web is supposed to work.  In the web model, the
   provider can define arbitrary content to render to users, craft
   arbitrary UI, and define arbitrary messaging from the browser back
   to the server, all without standardization or change to the web
   browser.  Google does not need to wait for the browsers to implement
   IMAP in order to provide mail service.  Facebook does not need the
   browser to have XMPP or SIP to enable presence and instant messaging.
   Why is call processing any different?  Why should Skype or any other
   real-time communications provider be constrained by standardized
   application protocols?  Each provider should be able to design and
   innovate what it needs, and not be constrained by the functionalities
   of the application protocols burned into the browser.

   While it is true that standardization will be required in order to
   extend these features between domains, that standardization process
   can be the successor - not the predecessor - to successful deployment
   and usage of the feature within a domain.  Furthermore, many features
   and services do not need to be extended between domains.  Many of the
   BLISS features are good examples of this.

   Second, inclusion of SIP in the browser for client-to-server
   signaling will also harm interoperability.  Unfortunately, SIP
   interoperability between endpoints and servers has been relatively
   poor, working only for basic call setup, teardown, and basic
   features.  Important concepts like configuration remain poorly
   standardized and almost never interoperate.  The web has certainly
   had interoperability problems, but the nature of those problems is
   different.
   In the web, content providers often need to code differently for
   different browsers, but at least they can deliver their application
   functionality.  On the other hand, with SIP phones, in many cases
   features simply do not and cannot work, and this cannot be resolved
   through software development by the SIP provider.  Interoperability
   is improved when there are fewer standards and not more.  Instead of
   adding SIP and its extensions to the browser, application providers
   can use the tools that are already there - HTTP and WebSockets - and
   then define whatever signaling functions they desire on top, without
   interoperability consequences.

   Make no mistake - SIP remains important as a glue between service
   providers, and between server infrastructure within service provider
   networks.  However, in a web context, there is simply no need for SIP
   support in the browser.

4.  The Role of Media Transport

   Unlike signaling, media transport does need to be in the browser, for
   two important reasons:

   1.  It operates in real time and does not fit well with the
       programming model of JavaScript.

   2.  It needs to flow between endpoints directly - over UDP - in order
       to achieve low latency, and therefore requires standardization in
       order to interoperate with other providers or endpoints.

   The second point is important.  Unlike most other web protocols,
   real-time media needs to be sent from the browser client to
   recipients other than the origin server or domain from which the web
   content came.  This is essential for ensuring low latency
   operations - one of the key metrics of quality in Voice over IP
   systems.  In some cases, the recipient will be another browser
   endpoint from the same provider.
   However, it could be a desktop client or mobile client from the same
   provider, or as shown in Figure 2, it could be a browser endpoint or
   desktop endpoint from another service provider.  In all cases, a
   direct connection - indeed a direct UDP connection - is important
   whenever possible.

   From a security perspective, however, the browser cannot just have an
   API that tells it to send arbitrary UDP datagrams or even
   standardized-format voice (or worse - video) media packets to an
   arbitrary IP address.  The former introduces the opportunity for
   malicious JavaScript to craft packets that mimic other application
   protocols and send them to arbitrary endpoints (for example, an
   enterprise SNMP server).  The latter would introduce a substantial
   opportunity for denial-of-service attacks.  Malicious JavaScript
   could tell the browser to "spam" an unwitting recipient with high
   bandwidth video.  In the voice literature, this is referred to as the
   voice hammer attack [RFC5245].  In existing voice systems, this
   attack is possible but not likely due to the closed nature of most of
   the software and systems.  In a web environment, where all it takes
   is one line of malicious JavaScript, the attack becomes almost a
   certainty.

   To avoid this attack, a simple handshake can be utilized.  The
   browser should support a simple STUN-based [RFC5389] connection
   handshake.  The exchange of the STUN transaction ID prior to
   transmission of media prevents the attack.

5.  Benefits of the Media Component Model

   There are several important benefits of the media component model
   proposed here.

5.1.  Enabling Innovation

   One of the reasons why the Web has been successful as a user
   interface platform is the short turn-around time to deploy new
   versions of web-based services.  Often, these new versions are
   experiments that vary small details which are important to make the
   service successful.
   It is the fine granularity of user interface elements in HTML and
   related technologies that allows this experimentation with details.
   As there is no agreed-upon configuration of real-time audio/video
   communication technologies that always delivers the best result, we
   think that it is essential to give the application developers the
   same benefit of short turn-around time and ability to experiment with
   details.  Therefore, the real-time communication primitives offered
   by user agents to web applications/services should be fine-grained
   enough to allow for enhanced configurations and possibly new
   scenarios.  Also, these interfaces to the primitives should allow
   gathering real-world data in enough detail on how the primitives are
   operating, to enable the feedback loop of
   deploy-measure-reconfigure-redeploy.

   One of the areas where perhaps the most innovation can be expected is
   signaling - one only needs to look at the plethora of standards
   around SIP.  Asking user-agent vendors to implement all these
   standards is a sure way to make the common denominator across user
   agents marginal.  Instead, the browser already has a programmability
   model (JavaScript) that can handle all these use cases, and more,
   provided the programming environment has access to the underlying
   media components as we propose here.  Drawing a parallel again from
   user interface development, there is an unsettled question of what
   should be executed by the user agent, and what by the web servers
   (e.g. validation).  A similar gray boundary between the client and
   the server exists in the field of real-time communications.
   Therefore we propose to leave standardization of signaling out of
   scope for this activity, and let the web service providers define
   signaling as they see fit.

5.2.  The Importance of Flexibility

   There are obviously tradeoffs between built-in functionality and
   programmability.
   It is often tempting to provide the web page author with a simple and
   relatively inflexible way of expressing their intent so as to
   minimize the page author's effort and accelerate adoption.  As an
   example, the "<blink>" tag was adopted much more rapidly than it
   would have been if blinking text could have only been implemented by
   writing a JavaScript timer task to manipulate the DOM style objects.

   On the other hand, such built-in functionality comes at two important
   costs.  First, each browser implementation must implement the
   functionality, and the more which is moved from JavaScript to
   built-in functionality, the more code must be present for that
   implementation.  Second, and more important, the page author is now
   restricted to the subset of functionality which is provided by these
   browser implementations.

   The "