Network Working Group                                           E. Omara
Internet-Draft                                                 J. Uberti
Intended status: Informational                                    Google
Expires: November 20, 2020                                 A. GOUAILLARD
                                                              S. Murillo
                                                          CoSMo Software
                                                            May 19, 2020


                         Secure Frame (SFrame)
                         draft-omara-sframe-00

Abstract

   This document describes the Secure Frame (SFrame) end-to-end
   encryption and authentication mechanism for media frames in a
   multiparty conference call, in which central media servers (SFUs) can
   access the media metadata needed to make forwarding decisions without
   having access to the actual media.  The proposed mechanism differs
   from other approaches through its use of media frames as the
   encryptable unit, instead of individual RTP packets, which makes it
   more bandwidth efficient and also allows use with non-RTP transports.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on November 20, 2020.

Copyright Notice

   Copyright (c) 2020 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.
   Code Components extracted from this document must include Simplified
   BSD License text as described in Section 4.e of the Trust Legal
   Provisions and are provided without warranty as described in the
   Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Terminology
   3.  Goals
   4.  SFrame
     4.1.  SFrame Format
     4.2.  SFrame Header
     4.3.  Encryption Schema
       4.3.1.  Key Derivation
       4.3.2.  Encryption
       4.3.3.  Decryption
       4.3.4.  Duplicate Frames
       4.3.5.  Key Rotation
     4.4.  Authentication
     4.5.  Ciphersuites
       4.5.1.  SFrame
       4.5.2.  DTLS-SRTP
   5.  Key Management
     5.1.  MLS-SFrame
   6.  Media Considerations
     6.1.  SFU
       6.1.1.  LastN and RTP stream reuse
       6.1.2.  Simulcast
       6.1.3.  SVC
     6.2.  Video Key Frames
     6.3.  Partial Decoding
   7.  Overhead
     7.1.  Audio
     7.2.  Video
     7.3.  SFrame vs PERC-lite
       7.3.1.  Audio
       7.3.2.  Video
   8.  Security Considerations
     8.1.  Key Management
     8.2.  Authentication tag length
   9.  IANA Considerations
   10. References
     10.1.  Normative References
     10.2.  Informative References
   Authors' Addresses

1.  Introduction

   Modern multi-party video call systems use Selective Forwarding Unit
   (SFU) servers to efficiently route RTP streams to call endpoints
   based on factors such as available bandwidth, desired video size,
   and codec support.  For the SFU to work properly, though, it needs
   to be able to access RTP metadata and RTCP feedback messages, which
   is not possible if all RTP/RTCP traffic is end-to-end encrypted.

   As such, two layers of encryption and authentication are required:

   1-  Hop-by-hop (HBH) encryption of media, metadata, and feedback
       messages between the endpoints and the SFU

   2-  End-to-end (E2E) encryption of media between the endpoints

   While DTLS-SRTP can be used as an efficient HBH mechanism, it is
   inherently point-to-point and therefore not suitable for end-to-end
   protection in an SFU context.  In addition, given the various
   scenarios in which video calling occurs, minimizing the bandwidth
   overhead of end-to-end encryption is also an important goal.
   This document proposes a new end-to-end encryption mechanism known as
   SFrame, specifically designed to work in group conference calls with
   SFUs.

     +-------------------------------+-------------------------------+^+
     |V=2|P|X|  CC   |M|     PT      |       sequence number         | |
     +-------------------------------+-------------------------------+ |
     |                           timestamp                           | |
     +---------------------------------------------------------------+ |
     |           synchronization source (SSRC) identifier            | |
     |=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=| |
     |            contributing source (CSRC) identifiers             | |
     |                               ....                            | |
     +---------------------------------------------------------------+ |
     |                   RTP extension(s) (OPTIONAL)                 | |
   +^---------------------+------------------------------------------+ |
   | |   payload header   |                                          | |
   | +--------------------+     payload  ...                         | |
   | |                                                               | |
   +^+---------------------------------------------------------------+^+
   | :                       authentication tag                      : |
   | +---------------------------------------------------------------+ |
   |                                                                   |
   ++ Encrypted Portion*                      Authenticated Portion +--+

                           SRTP packet format

2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in BCP
   14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

   SFU: Selective Forwarding Unit (AKA RTP Switch)

   IV: Initialization Vector

   MAC: Message Authentication Code

   E2EE: End to End Encryption

   HBH: Hop By Hop

   KMS: Key Management System

3.  Goals

   SFrame is designed to be a suitable E2EE protection scheme for
   conference call media in a broad range of scenarios, as outlined by
   the following goals:

   1.  Provide a secure E2EE mechanism for audio and video in
       conference calls that can be used with arbitrary SFU servers.

   2.  Decouple media encryption from key management to allow SFrame to
       be used with an arbitrary KMS.

   3.  Minimize packet expansion to allow successful conferencing in as
       many network conditions as possible.

   4.  Independence from the underlying transport, including use in non-
       RTP transports, e.g., WebTransport.

   5.  When used with RTP and its associated error resilience
       mechanisms, i.e., RTX and FEC, require no special handling for
       RTX and FEC packets.

   6.  Minimize the changes needed in SFU servers.

   7.  Minimize the changes needed in endpoints.

   8.  Work with the most popular audio and video codecs used in
       conferencing scenarios.

4.  SFrame

   We propose a frame-level encryption mechanism that provides effective
   end-to-end encryption, is simple to implement, has no dependencies on
   RTP, and minimizes encryption bandwidth overhead.  Because SFrame
   encrypts the full frame, rather than individual packets, bandwidth
   overhead is reduced by having a single IV and authentication tag for
   each media frame.

   Also, because media is encrypted prior to packetization, the
   encrypted frame is packetized using a generic RTP packetizer instead
   of codec-dependent packetization mechanisms.  With this move to a
   generic packetizer, media metadata is moved from codec-specific
   mechanisms to a generic frame RTP header extension which, while
   visible to the SFU, is authenticated end-to-end.  This extension
   includes metadata needed for SFU routing such as resolution, frame
   beginning and end markers, etc.

   The generic packetizer splits the E2E encrypted media frame into one
   or more RTP packets and adds the SFrame header to the beginning of
   the first packet and an auth tag to the end of the last packet.
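
   As an illustration of the packetization step described above, the
   following is a minimal sketch, not part of the specification.  It
   assumes the protected frame is already laid out as SFrame header,
   then ciphertext, then authentication tag, and that the packetizer
   simply cuts that buffer into chunks.  The function names and the
   max_payload parameter are hypothetical, and RTP header handling is
   omitted.

```python
from typing import List


def packetize(protected_frame: bytes, max_payload: int) -> List[bytes]:
    """Split one protected frame into payload chunks of at most max_payload bytes.

    Assumption: protected_frame = SFrame header || ciphertext || auth tag, so
    the header lands at the start of the first chunk and the tag at the end of
    the last chunk without any per-packet SFrame overhead.
    """
    if max_payload <= 0:
        raise ValueError("max_payload must be positive")
    if not protected_frame:
        raise ValueError("protected_frame must not be empty")
    return [
        protected_frame[offset:offset + max_payload]
        for offset in range(0, len(protected_frame), max_payload)
    ]


def depacketize(chunks: List[bytes]) -> bytes:
    """Reassemble the protected frame once every chunk has arrived, in order."""
    return b"".join(chunks)


# Example values (placeholders, not real SFrame output).
header = b"\x13\x05"
ciphertext = bytes(range(20))
tag = b"TTTT"
packets = packetize(header + ciphertext + tag, max_payload=10)
```

   In this sketch only the first chunk carries the header and only the
   last chunk carries the tag, which is where the per-frame (rather
   than per-packet) overhead comes from.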
      +-------------------------------------------------------+
      |                                                       |
      |  +----------+      +------------+      +-----------+  |
      |  |          |      |   SFrame   |      |Packetizer |  |       DTLS+SRTP
      |  | Encoder  +----->+    Enc     +----->+           +-------------------------+
 ,+.  |  |          |      |            |      |           |  |   +--+  +--+  +--+   |
 `|'  |  +----------+      +-----+------+      +-----------+  |   |  |  |  |  |  |   |
 /|\  |                          ^                            |   |  |  |  |  |  |   |
  +   |                          |                            |   |  |  |  |  |  |   |
 / \  |                          |                            |   +--+  +--+  +--+   |
Alice |                    +-----+------+                     | Encrypted Packets    |
      |                    |Key Manager |                     |                      |
      |                    +------------+                     |                      |
      |                         ||                            |                      |
      |                         ||                            |                      |
      |                         ||                            |                      |
      +-------------------------------------------------------+                      |
                                ||                                                   |
                                ||                                                   v
                           +------------+                                      +-----+------+
            E2EE channel   |  Messaging |                                      |   Media    |
              via the      |  Server    |                                      |   Server   |
          Messaging Server |            |                                      |            |
                           +------------+                                      +-----+------+
                                ||                                                   |
                                ||                                                   |
      +-------------------------------------------------------+                      |
      |                         ||                            |                      |
      |                         ||                            |                      |
      |                         ||                            |                      |
      |                    +------------+                     |                      |
      |                    |Key Manager |                     |                      |
 ,+.  |                    +-----+------+                     | Encrypted Packets    |
 `|'  |                          |                            |   +--+  +--+  +--+   |
 /|\  |                          |                            |   |  |  |  |  |  |   |
  +   |                          v                            |   |  |  |  |  |  |   |
 / \  |  +----------+      +-----+------+      +-----------+  |   |  |  |  |  |  |   |
 Bob  |  |          |      |   SFrame   |      |   De+     |  |   +--+  +--+  +--+   |
      |  | Decoder  +<-----+    Dec     +<-----+Packetizer +<------------------------+
      |  |          |      |            |      |           |  |        DTLS+SRTP
      |  +----------+      +------------+      +-----------+  |
      |                                                       |
      +-------------------------------------------------------+

   The E2EE keys used to encrypt the frame are exchanged out of band
   using a secure E2EE channel.

4.1.  SFrame Format

     +------------+------------------------------------------+^+
     |S|LEN|X|KID |         Frame Counter                    | |
   +^+------------+------------------------------------------+ |
   | |                                                       | |
   | |                                                       | |
   | |                                                       | |
   | |                                                       | |
   | |                   Encrypted Frame                     | |
   | |                                                       | |
   | |                                                       | |
   | |                                                       | |
   | |                                                       | |
   +^+-------------------------------------------------------+^+
   | |                 Authentication Tag                    | |
   | +-------------------------------------------------------+ |
   |                                                           |
   |                                                           |
   +----+Encrypted Portion            Authenticated Portion+---+

4.2.  SFrame Header

   Since each endpoint can send multiple media layers, each frame will
   have a unique frame counter that will be used to derive the
   encryption IV.  The frame counter must be unique and monotonically
   increasing to avoid IV reuse.

   As each sender will use their own key for encryption, the SFrame
   header will include the key id to allow the receiver to identify the
   key that needs to be used for decrypting.

   Both the frame counter and the key id are encoded in a variable
   length format to decrease the overhead, so the first byte in the
   SFrame header is fixed and contains the header metadata with the
   following format:

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |S|LEN  |X|  K  |
   +-+-+-+-+-+-+-+-+
   SFrame header metadata

   Signature flag (S): 1 bit
      This field indicates that the payload contains a signature if
      set.

   Counter Length (LEN): 3 bits
      This field indicates the length of the CTR field in bytes.

   Extended Key Id Flag (X): 1 bit
      Indicates whether the key field contains the key id or the key
      length.

   Key or Key Length (K): 3 bits
      This field contains the key id (KID) if the X flag is set to 0,
      or the key length (KLEN) if it is set to 1.

   If the X flag is 0, then the KID is in the range 0-7 and the frame
   counter (CTR) is found in the next LEN bytes:

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+---------------------------------+
   |S|LEN  |0| KID |    CTR... (length=LEN)          |
   +-+-+-+-+-+-+-+-+---------------------------------+

   Key id (KID): 3 bits
      The key id (0-7).

   Frame counter (CTR): (Variable length)
      Frame counter value up to 8 bytes long.

   If the X flag is 1, then KLEN is the length of the key id (KID),
   which is found after the SFrame header metadata byte.  After the key
   id (KID), the frame counter (CTR) is found in the next LEN bytes:

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+---------------------------+---------------------------+
   |S|LEN  |1|KLEN |   KID... (length=KLEN)    |    CTR... (length=LEN)    |
   +-+-+-+-+-+-+-+-+---------------------------+---------------------------+

   Key length (KLEN): 3 bits
      The key id length in bytes.

   Key id (KID): (Variable length)
      The key id value up to 8 bytes long.

   Frame counter (CTR): (Variable length)
      Frame counter value up to 8 bytes long.

4.3.  Encryption Schema

4.3.1.  Key Derivation

   Each client creates a 32-byte secret key K and shares it with the
   other participants via an E2EE channel.  From K, we derive 3 secrets:

   1-  Salt key used to calculate the IV

   Key = HKDF(K, 'SFrameSaltKey', 16)

   2-  Encryption key to encrypt the media frame

   Key = HKDF(K, 'SFrameEncryptionKey', 16)

   3-  Authentication key to authenticate the encrypted frame and the
       media metadata

   Key = HKDF(K, 'SFrameAuthenticationKey', 32)

   The IV is 128 bits long and calculated from the CTR field of the
   SFrame header:

   IV = CTR XOR Salt key

4.3.2.  Encryption

   After encoding the frame and before packetizing it, the necessary
   media metadata will be moved out of the encoded frame buffer, to be
   used later in the RTP generic frame header extension.  The encoded
   frame, the metadata buffer and the frame counter are passed to the
   SFrame encryptor.  The encryptor constructs the SFrame header from
   the frame counter and key id and derives the encryption IV.
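
   The header construction, key derivation and IV calculation described
   so far can be sketched as follows.  This is an illustrative sketch
   under stated assumptions, not a normative encoding: HKDF is taken to
   be RFC 5869 HKDF-SHA256 with an all-zero salt and the label used as
   the info string; CTR is left-padded with zeros to 128 bits in
   network byte order before the XOR; bit 0 of the metadata byte is the
   most significant bit; and the 3-bit LEN and KLEN fields hold the
   byte length directly, so this sketch only handles values of up to 7
   bytes.  All function names are hypothetical.

```python
import hashlib
import hmac


def hkdf_sha256(secret: bytes, label: bytes, length: int) -> bytes:
    """RFC 5869 HKDF-SHA256; using `label` as the info string is an assumption."""
    prk = hmac.new(bytes(32), secret, hashlib.sha256).digest()  # extract, zero salt
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand
        block = hmac.new(prk, block + label + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_keys(k: bytes):
    """Derive the salt, encryption and authentication keys from the 32-byte K."""
    salt_key = hkdf_sha256(k, b"SFrameSaltKey", 16)
    enc_key = hkdf_sha256(k, b"SFrameEncryptionKey", 16)
    auth_key = hkdf_sha256(k, b"SFrameAuthenticationKey", 32)
    return salt_key, enc_key, auth_key


def make_iv(ctr: int, salt_key: bytes) -> bytes:
    """IV = CTR XOR salt key, with CTR zero-padded on the left (assumption)."""
    padded = ctr.to_bytes(len(salt_key), "big")
    return bytes(a ^ b for a, b in zip(padded, salt_key))


def _byte_len(value: int) -> int:
    """Smallest number of bytes (at least 1) that can hold value."""
    return max(1, (value.bit_length() + 7) // 8)


def encode_header(kid: int, ctr: int, signed: bool = False) -> bytes:
    """Build the S|LEN|X|K metadata byte followed by optional KID and CTR."""
    if kid < 0 or ctr < 0:
        raise ValueError("kid and ctr must be non-negative")
    ctr_len = _byte_len(ctr)
    if ctr_len > 7:
        raise ValueError("counter too large for this sketch")
    first = (0x80 if signed else 0x00) | (ctr_len << 4)
    if kid < 8:  # X = 0: key id fits in the 3-bit K field
        return bytes([first | kid]) + ctr.to_bytes(ctr_len, "big")
    kid_len = _byte_len(kid)
    if kid_len > 7:
        raise ValueError("key id too large for this sketch")
    first |= 0x08 | kid_len  # X = 1: K carries KLEN
    return bytes([first]) + kid.to_bytes(kid_len, "big") + ctr.to_bytes(ctr_len, "big")


salt_key, enc_key, auth_key = derive_keys(bytes(32))
short_header = encode_header(kid=3, ctr=5)
long_header = encode_header(kid=300, ctr=0x1234, signed=True)
```

   With these assumptions, a frame with key id 3 and counter 5 costs
   two header bytes, while the extended form is only paid for when the
   key id does not fit in three bits.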
   The frame is encrypted using the encryption key, and the header, the
   encrypted frame and the media metadata are authenticated using the
   authentication key.  The authentication tag is then truncated (if
   supported by the cipher suite) and appended to the end of the
   ciphertext.

   The encrypted payload is then passed to a generic RTP packetizer,
   which constructs the RTP packets and encrypts them using SRTP keys
   for the HBH encryption to the media server.

                     +---------------+  +---------------+
                     |               |  | frame metadata+----+
                     |               |  +---------------+    |
                     |     frame     |                       |
                     |               |                       |
                     |               |                       |
                     +-------+-------+                       |
                             |                               |
    CTR +---------------> IV |Enc Key <----Master Key        |
            derive IV        |                 |             |
     +                       |                 |             |
     |                       +                 v             |
     |                    encrypt          Auth Key          |
     |                       |                 +             |
     |                       |                 |             |
     |                       v                 |             |
     |               +-------+-------+         |             |
     |               |               |         |             |
     |               |   encrypted   |         v             |
     |               |     frame     +---->Authenticate<-----+
     +               |               |           +
 encode CTR          |               |           |
     +               +-------+-------+           |
     |                       |                   |
     |                       |                   |
     |                       |                   |
     |             generic RTP packetize         |
     |                       +                   |
     |                       |                   |
     |                       |                   +----------------------------+
     +----------+            v                                                |
                |                                                             |
                |   +---------------+  +---------------+  +---------------+   |
                +-> | SFrame header |  |               |  |               |   |
                    +---------------+  |               |  |  payload N/N  |   |
                    |               |  |  payload 2/N  |  |               |   |
                    |  payload 1/N  |  |               |  +---------------+   |
                    |               |  |               |  |    auth tag   | <-+
                    +---------------+  +---------------+  +---------------+

                              Encryption flow

4.3.3.  Decryption

   The receiving client buffers all packets that belong to the same
   frame using the frame beginning and ending marks in the generic RTP
   frame header extension, and once all packets are available, it
   passes them to SFrame for decryption.  SFrame maintains multiple
   decryptor objects, one for each client in the call.  Initially the
   client might not have the mapping between the incoming streams and
   the users' keys; in this case SFrame tries all unmapped keys until it
   finds one that passes the authentication verification and uses it to
   decrypt the frame.  If the client has the mapping ready, it can push
   it down to SFrame later.

   The KeyId field in the SFrame header is used to find the right key
   for that user, which is incremented by the sender when they switch to
   a new key.

   Frames that fail to decrypt because no key is available yet are
   buffered by SFrame, which retries decrypting them once a key is
   received.

4.3.4.  Duplicate Frames

   Unlike in messaging applications, in video calls receiving a
   duplicate frame doesn't necessarily mean the client is under a replay
   attack.  There are other reasons that might cause this; for example,
   the sender might just be resending frames in case of packet loss.
   SFrame decryptors use the highest received frame counter to protect
   against replay, and accept older frames only within a short interval,
   to support out-of-order delivery.

4.3.5.  Key Rotation

   Because the E2EE keys could be rotated during the call when people
   join and leave, these new keys are exchanged using the same E2EE
   secure channel used in the initial key negotiation.  Sending fresh
   keys is an expensive operation, so the key management component
   might choose to send new keys only when other clients leave the call
   and use hash ratcheting for the join case, so that there is no need
   to send a new key to the clients who are already on the call.
   SFrame supports both modes.

4.3.5.1.  Key Ratcheting

   When an SFrame decryptor fails to decrypt one of the frames, it
   automatically ratchets the key forward and retries until one ratchet
   succeeds or it reaches the maximum allowed ratcheting window.  If a
   new ratchet passes the decryption, all previous ratchets are deleted.
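
   The retry loop described above can be sketched as follows, using
   the ratchet step K(i) = HKDF(K(i-1), 'SFrameRatchetKey', 32) defined
   in this section.  This is a sketch under assumptions: HKDF is taken
   to be RFC 5869 HKDF-SHA256 with an all-zero salt and the label as
   the info string, and the verifies callback is a hypothetical
   stand-in for "the frame authenticates and decrypts under this key".

```python
import hashlib
import hmac
from typing import Callable, Optional, Tuple


def hkdf_sha256(secret: bytes, label: bytes, length: int) -> bytes:
    """RFC 5869 HKDF-SHA256; using `label` as the info string is an assumption."""
    prk = hmac.new(bytes(32), secret, hashlib.sha256).digest()  # extract, zero salt
    okm, block, counter = b"", b"", 1
    while len(okm) < length:  # expand
        block = hmac.new(prk, block + label + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def ratchet(key: bytes) -> bytes:
    """One ratchet step: K(i) = HKDF(K(i-1), 'SFrameRatchetKey', 32)."""
    return hkdf_sha256(key, b"SFrameRatchetKey", 32)


def ratchet_until_valid(
    key: bytes,
    verifies: Callable[[bytes], bool],
    max_steps: int,
) -> Optional[Tuple[bytes, int]]:
    """Ratchet forward after a failed decryption, up to max_steps times.

    Returns (new_key, steps_taken) for the first ratchet that verifies, or
    None if the window is exhausted. On success the caller replaces its key
    with new_key, which discards all earlier ratchets.
    """
    candidate = key
    for step in range(1, max_steps + 1):
        candidate = ratchet(candidate)
        if verifies(candidate):
            return candidate, step
    return None


# Example: the sender has ratcheted twice since the receiver's current key.
receiver_key = bytes(32)
sender_key = ratchet(ratchet(receiver_key))
found = ratchet_until_valid(receiver_key, lambda k: k == sender_key, max_steps=5)
```

   Bounding the loop with max_steps keeps a stream of undecryptable
   frames from forcing unbounded HKDF work on the receiver.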
   K(i) = HKDF(K(i-1), 'SFrameRatchetKey', 32)

4.3.5.2.  New Key

   SFrame will set the key on the decryptors immediately when it is
   received and destroy the old key material, so if the key manager
   sends a new key during the call, it is recommended not to start using
   it immediately but to wait for a short time, to make sure it is
   delivered to all other clients before it is used and so decrease the
   number of decryption failures.  It is up to the application and the
   key manager to define how long this period is.

4.4.  Authentication

   Every client in the call knows the secret key for all other clients
   so it can decrypt their traffic.  This also means a malicious client
   can impersonate any other client in the call by using the victim's
   key to encrypt their traffic.  This might not be a problem for
   consumer applications where the number of clients in the call is
   small and users know each other; however, for enterprise use cases
   where large conference calls are common, an authentication mechanism
   is needed to protect against malicious users.  This authentication
   will come with extra cost.

   Adding a digital signature to each encrypted frame would be
   overkill; instead we propose adding a signature over multiple
   frames.

   The signature is calculated by concatenating the authentication tags
   of the frames that the sender wants to authenticate (in reverse sent
   order) and signing the result with the signature key.  Signature keys
   are exchanged out of band along with the encryption keys.

   Signature = Sign(Key, AuthTag(Frame N) || AuthTag(Frame N-1) || ...
                    || AuthTag(Frame N-M))

   The authentication tags for the previous frames covered by the
   signature and the signature itself will be appended at the end of
   the frame, after the current frame authentication tag, in the same
   order that the signature was calculated, and the SFrame header
   metadata signature bit (S) will be set to 1.
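
   The construction of the signed data and of the bytes appended after
   the current frame's authentication tag can be sketched as follows.
   This is an illustration only.  The sign callable is a hypothetical
   placeholder for a real signature algorithm, and the HMAC used in the
   example is merely a stand-in so the sketch runs without external
   libraries; it is not a digital signature.  The one-byte count of
   previous tags placed before the signature (NUM) follows the layout
   described later in this section.

```python
import hashlib
import hmac
from typing import Callable, Sequence


def signature_input(tags_in_sent_order: Sequence[bytes]) -> bytes:
    """Concatenate tags newest first: AuthTag(N) || AuthTag(N-1) || ... || AuthTag(N-M)."""
    if not tags_in_sent_order:
        raise ValueError("at least the current frame's tag is required")
    size = len(tags_in_sent_order[0])
    if any(len(tag) != size for tag in tags_in_sent_order):
        raise ValueError("all authentication tags must have the same size")
    return b"".join(reversed(tags_in_sent_order))


def signed_trailer(
    tags_in_sent_order: Sequence[bytes],
    sign: Callable[[bytes], bytes],
) -> bytes:
    """Bytes that follow the current frame's own tag: previous tags, NUM, signature."""
    signature = sign(signature_input(tags_in_sent_order))
    previous = list(reversed(tags_in_sent_order))[1:]  # tags N-1 .. N-M
    if len(previous) > 255:
        raise ValueError("NUM is a single byte")
    return b"".join(previous) + bytes([len(previous)]) + signature


def placeholder_sign(message: bytes) -> bytes:
    """Stand-in signer for the example only (HMAC, NOT a real signature)."""
    return hmac.new(b"demo-signing-key", message, hashlib.sha256).digest()


# Tags of three frames in the order they were sent; b"cccc" belongs to frame N.
tags = [b"aaaa", b"bbbb", b"cccc"]
trailer = signed_trailer(tags, placeholder_sign)
```

   Passing the signer in as a callable keeps the layout logic
   independent of the signature algorithm chosen by the ciphersuite.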
   +^ +------------------+
   |  | SFrame header S=1|
   |  +------------------+
   |  |  Encrypted       |
   |  |  payload         |
   |  |                  |
   |^ +------------------+ ^+
   |  |  Auth Tag N      |  |
   |  +------------------+  |
   |  |  Auth Tag N-1    |  |
   |  +------------------+  |
   |  |    ........      |  |
   |  +------------------+  |
   |  |  Auth Tag N-M    |  |
   |  +------------------+ ^|
   |  | NUM | Signature  :  |
   |  +-----+            +  |
   |  :                  |  |
   |  +------------------+  |
   |                        |
   +-> Authenticated with   +-> Signed with
       Auth Tag N               Signature

                    Encrypted Frame with Signature

   Note that the authentication tag for the current frame will only
   authenticate the SFrame header and the encrypted payload, and not the
   signature nor the previous frames' authentication tags (N-1 to N-M)
   used to calculate the signature.

   The last byte (NUM) after the authentication tag list and before the
   signature indicates the number of authentication tags from previous
   frames present in the current frame.  All the authentication tags
   MUST have the same size, which MUST be equal to the authentication
   tag size of the current frame.  The signature has a fixed size that
   depends on the signature algorithm used (for example, 64 bytes for
   Ed25519).

   The receiver has to keep track of all the frames received but not yet
   verified, by storing the authentication tags of each received frame.
   When a signature is received, the receiver will verify it with the
   signature key associated with the key id of the frame the signature
   was sent in.  If the verification is successful, the receiver will
   mark the frames as authenticated and remove them from the list of
   not-yet-verified frames.  It is up to the application to decide what
   to do when signature verification fails.

   When using SVC, the hash will be calculated over all the frames of
   the different spatial layers within the same superframe/picture.
   However, the SFU will be able to drop frames within the same stream
   (either spatial or temporal) to match the target bitrate.

   If the signature is sent on a frame whose layer is dropped by the
   SFU, the receiver will not receive it and will not be able to verify
   the signature of the other received layers.

   An easy way of solving the issue would be to perform the signature
   only on the base layer, or to take into consideration the frame
   dependency graph and send multiple signatures in parallel (each for a
   branch of the dependency graph).

   In case of simulcast or K-SVC, each spatial layer should be
   authenticated with different signatures to prevent the SFU from
   discarding frames with the signature info.

   In any case, it is possible that the frame with the signature is lost
   or the SFU drops it, so the receiver MUST be prepared to not receive
   a signature for a frame and to remove it from the pending
   verification list after a timeout.

4.5.  Ciphersuites

4.5.1.  SFrame

   Each SFrame session uses a single ciphersuite that specifies the
   following primitives:

   o  A hash function

      This is used for the key derivation and frame hashes for the
      signature.  We recommend using the SHA256 hash function.

   o  An AEAD encryption algorithm [RFC5116]

      While any AEAD algorithm can be used to encrypt the frame, we
      recommend using algorithms with safe MAC truncation, like AES-CTR
      with HMAC, to reduce the per-frame overhead.  In this case we can
      use an 80-bit MAC for video frames and a 32-bit MAC for audio
      frames, similar to the DTLS-SRTP cipher suites:

      1-  AES_CM_128_HMAC_SHA256_80

      2-  AES_CM_128_HMAC_SHA256_32

   o  [Optional] A signature algorithm

      If signature is supported, we recommend using Ed25519.

4.5.2.  DTLS-SRTP

   SRTP is used for HBH encryption.  Since the media payload is already
   encrypted and SRTP only protects the RTP headers, an implementation
   could use a 4-byte outer auth tag to decrease the overhead; however,
   it is up to the application to use other ciphers like AES-128-GCM
   with a full authentication tag.

5.  Key Management

   SFrame must be integrated with an E2EE key management framework to
   exchange and rotate the encryption keys.  This framework will
   maintain a group of participant endpoints who are in the call.  At
   call setup time, each endpoint will create fresh key material and,
   optionally, a signing key pair for that call, and will encrypt the
   key material and the public signing key to every other endpoint.
   The encrypted keys are delivered by the messaging delivery server
   using a reliable channel.

   The KMS will monitor the group changes, and exchange new keys when
   necessary.  It is up to the application to define this group; for
   example, one application could have an ephemeral group for every
   call and keep rotating keys when endpoints join or leave the call,
   while another application could have a persisted group that can be
   used for multiple calls and exchange keys with all group endpoints
   for every call.

   When new key material is created during the call, we recommend not
   to start using it immediately in SFrame, to give time for the new
   keys to be delivered.  If the application supports delivery receipts,
   they can be used to track whether the key has been delivered to all
   other endpoints on the call before using it.

   Keys must have a sequential id starting from 0 and incremented every
   time a new key is generated for this endpoint.  The key id will be
   added in the SFrame header during encryption, so the recipient knows
   which key to use for the decryption.

5.1.  MLS-SFrame

   While any other E2EE KMS can be used with SFrame, there is a big
   advantage if it is used with [MLSARCH], which natively supports very
   large groups efficiently.  When [MLSPROTO] is used, the endpoints'
   keys (AKA Application secret) can be used directly for SFrame without
   the need to exchange separate key material.  The application secret
   is rotated automatically by [MLSPROTO] when group membership changes.

6.  Media Considerations

6.1.  SFU

   Selective Forwarding Units (SFUs) as described in
   https://tools.ietf.org/html/rfc7667#section-3.7 receive the RTP
   streams from each participant and select which ones should be
   forwarded to each of the other participants.  There are several
   approaches to this stream selection, but in general, in order to do
   so, the SFU needs to access metadata associated with each frame and
   modify the RTP information of the incoming packets when they are
   transmitted to the receiving participants.

   This section describes how these normal SFU modes of operation
   interact with the E2EE provided by SFrame.

6.1.1.  LastN and RTP stream reuse

   The SFU may choose to send only a certain number of streams based on
   the voice activity of the participants.  To reduce the number of SDP
   O/A exchanges required to establish a new RTP stream, the SFU may
   decide to reuse previously existing RTP sessions or even pre-allocate
   a predefined number of RTP streams and choose at each moment in time
   which participant's media will be sent through them.  This means that
   the same RTP stream (defined by either SSRC or MID) may carry media
   from different streams of different participants.
   As different keys are used by each participant for encrypting their
   media, the receiver will be able to verify which participant is the
   sender of the media coming within the RTP stream at any given point
   in time, preventing the SFU from trying to impersonate any of the
   participants with another participant's media.  Note that in order
   to prevent impersonation by a malicious participant (not the SFU),
   usage of the signature is required.  In case of video, a new
   signature should be started each time a key frame is sent to allow
   the receiver to identify the source faster after a switch.

6.1.2.  Simulcast

   When using simulcast, the same input image will produce N different
   encoded frames (one per simulcast layer), which are processed
   independently by the frame encryptor and each assigned a unique
   counter.

6.1.3.  SVC

   In both temporal and spatial scalability, the SFU may choose to drop
   layers in order to match a certain bitrate or forward specific media
   sizes or frames per second.  In order to support this, the sender
   MUST encode each spatial layer of a given picture in a different
   frame.  That is, an RTP frame may contain more than one SFrame
   encrypted frame with an incrementing frame counter.

6.2.  Video Key Frames

   Forward and Post-Compromise Security require that the E2EE keys are
   updated any time a participant joins or leaves the call.

   The key exchange happens asynchronously and on a different path than
   the SFU signaling and media.  So it may happen that when a new
   participant joins the call and the SFU side requests a key frame, the
   sender generates the E2EE encrypted frame with a key not known by the
   receiver, so it will be discarded.  When the sender then switches to
   the new key, the first frame encrypted with it will be a non-key
   frame, so the receiver will be able to decrypt it, but not decode it.
716     The receiver will then re-request a key frame, but due to sender and
717     SFU policies, that new key frame could take some time to be generated.

719     If the sender sends a key frame when the new e2ee key is in use, the
720     time required for the new participant to display the video is
721     minimized.

723  6.3.  Partial Decoding

725     Some codecs support partial decoding, where the decoder can decode
726     individual packets without waiting for the full frame to arrive. With
727     SFrame, this won't be possible because the decoder will not access the
728     packets until the entire frame has arrived and been decrypted.

730  7.  Overhead

732     The encryption overhead will vary between audio and video streams,
733     because in audio each packet is considered a separate frame, so it
734     will always have an extra MAC and IV, whereas a video frame usually
735     consists of multiple RTP packets. The number of bytes of overhead per
736     frame is calculated as follows: 1 + FrameCounter length + 4. The
737     constant 1 is the SFrame header byte, and the constant 4 is the length
738     in bytes of the HBH authentication tag for both audio and video packets.

740  7.1.  Audio

742     The table below uses three audio frame durations, 20ms (50 packets/s),
743     40ms (25 packets/s), and 100ms (10 packets/s), with a frame counter of
744     up to 3 bytes (3.8 days of data at 20ms) and a fixed 4-byte MAC length.

746     +------------+-----------+-----------+----------+-----------+
747     | Counter len| Packets   | Overhead  | Overhead | Overhead  |
748     |            |           | bps@20ms  | bps@40ms | bps@100ms |
749     +------------+-----------+-----------+----------+-----------+
750     |          1 | 0-255     |      2400 |     1200 |       480 |
751     |          2 | 256 - 65K |      2800 |     1400 |       560 |
752     |          3 | 65K - 16M |      3200 |     1600 |       640 |
753     +------------+-----------+-----------+----------+-----------+

755  7.2.  Video

757     The per-stream overhead in bits per second is calculated for the
758     following video encodings: 30fps@1000Kbps (4 packets per frame),
759     30fps@512Kbps (2 packets per frame), 15fps@200Kbps (2 packets per
760     frame), and 7.5fps@30Kbps (1 packet per frame). Overhead bps =
761     (Counter length + 1 + 4) * 8 * fps

763     +------------+-----------+------------+------------+------------+
764     | Counter len| Frames    | Overhead   | Overhead   | Overhead   |
765     |            |           | bps@30fps  | bps@15fps  | bps@7.5fps |
766     +------------+-----------+------------+------------+------------+
767     |          1 | 0-255     |       1440 |       1440 |        720 |
768     |          2 | 256 - 65K |       1680 |       1680 |        840 |
769     |          3 | 65K - 16M |       1920 |       1920 |        960 |
770     |          4 | 16M - 4B  |       2160 |       2160 |       1080 |
771     +------------+-----------+------------+------------+------------+

773  7.3.  SFrame vs PERC-lite

775     [PERC] has significantly more overhead than SFrame because its overhead
776     is per packet, not per frame, and because of the OHB (Original Header
777     Block), which duplicates any RTP header/extension field modified by the
778     SFU. [PERCLITE] is slightly better because it doesn't
780     use the OHB anymore; however, it still does per-packet encryption
781     using SRTP. Below is the overhead in [PERCLITE] as implemented by
782     CoSMo Software, which uses an extra 11 bytes per packet to preserve the
783     PT, SEQ_NUM, TIME_STAMP, and SSRC fields, in addition to the extra MAC
784     tag per packet.

786     OverheadPerPacket = 11 + MAC length
787     Overhead bps = PacketPerSecond * OverheadPerPacket * 8

789     Similar to SFrame, we will assume the HBH authentication tag length
790     will always be 4 bytes for audio and video, even though that is not the
791     case in this [PERCLITE] implementation.

793  7.3.1.  Audio

795     +-------------------+--------------------+--------------------+
796     | Overhead bps@20ms | Overhead bps@40ms  | Overhead bps@100ms |
797     +-------------------+--------------------+--------------------+
798     |              6000 |               3000 |               1200 |
799     +-------------------+--------------------+--------------------+

801  7.3.2.  Video

803     +---------------------+----------------------+-----------------------+
804     | Overhead bps@30fps  | Overhead bps@15fps   | Overhead bps@7.5fps   |
805     |(4 packets per frame)| (2 packets per frame)| (1 packet per frame)  |
806     +---------------------+----------------------+-----------------------+
807     |               14400 |                 7200 |                  3600 |
808     +---------------------+----------------------+-----------------------+

810     For a conference with a single incoming audio stream (@ 50 pps) and 4
811     incoming video streams (@200 Kbps), the savings in overhead is 34800
812     - 9600 = ~25 Kbps, or ~3%.

814  8.  Security Considerations

816  8.1.  Key Management

818     The key exchange mechanism is out of scope for this document; however,
819     every client MUST change its keys when a client joins or leaves
820     the call for "Forward Secrecy" and "Post Compromise Security".

822  8.2.  Authentication tag length

824     The cipher suites defined in this draft use short authentication tags
825     for encryption; however, SFrame can easily support other ciphers with
826     full authentication tags if the short ones are proven insecure.

828  9.  IANA Considerations

830     This document makes no requests of IANA.

832  10.  References

834  10.1.  Normative References

836     [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
837                Requirement Levels", BCP 14, RFC 2119,
838                DOI 10.17487/RFC2119, March 1997.

841     [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
842                2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
843                May 2017.

845  10.2.  Informative References

847     [MLSARCH]  Omara, E., Barnes, R., Rescorla, E., Inguva, S., Kwon, A.,
848                and A. Duric, "Messaging Layer Security Architecture",
849                2020.

851     [MLSPROTO]
852                Barnes, R., Millican, J., Omara, E., Cohn-Gordon, K., and
853                R. Robert, "Messaging Layer Security Protocol", 2020.

855     [PERC]     Jennings, C., Jones, P., Barnes, R., and A. Roach, "PERC",
856                2020.

858     [PERCLITE]
859                GOUAILLARD, A. and S. Murillo, "PERC-Lite", 2020.
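
For reference, the overhead formulas given in Section 7 can be
evaluated directly. The following Python sketch is illustrative only:
the function names and the default 4-byte tag and MAC length
parameters are assumptions made for this example, not identifiers
defined by SFrame or [PERCLITE]. It evaluates the stated formulas and
does not attempt to reproduce every cell of the tables in Section 7.

```python
def sframe_overhead_bps(counter_len, frames_per_second, tag_len=4):
    """SFrame overhead in bits per second.

    Evaluates (Counter length + 1 + 4) * 8 * fps, where 1 is the SFrame
    header byte and tag_len defaults to the 4-byte tag assumed in
    Section 7. For audio, each packet counts as one frame.
    """
    bytes_per_frame = 1 + counter_len + tag_len
    return bytes_per_frame * 8 * frames_per_second


def perc_lite_overhead_bps(packets_per_second, mac_len=4):
    """PERC-lite overhead in bits per second.

    Evaluates PacketPerSecond * (11 + MAC length) * 8, with the MAC
    length defaulting to the 4 bytes assumed in Section 7.3.
    """
    bytes_per_packet = 11 + mac_len
    return packets_per_second * bytes_per_packet * 8


def audio_rows():
    """SFrame audio overhead at 50, 25, and 10 packets/s for counter
    lengths of 1 to 3 bytes."""
    return {
        counter_len: [
            sframe_overhead_bps(counter_len, pps) for pps in (50, 25, 10)
        ]
        for counter_len in (1, 2, 3)
    }
```

For example, sframe_overhead_bps(1, 30) evaluates to 1440 and
perc_lite_overhead_bps(30 * 4) to 14400, which are the values listed
for 30fps video at 4 packets per frame, and audio_rows() yields the
rows of the audio table in Section 7.1.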
862  Authors' Addresses

864     Emad Omara
865     Google

867     Email: emadomara@google.com

869     Justin Uberti
870     Google

872     Email: juberti@google.com

874     Alexandre GOUAILLARD
875     CoSMo Software

877     Email: Alex.GOUAILLARD@cosmosoftware.io

879     Sergio Garcia Murillo
880     CoSMo Software

882     Email: sergio.garcia.murillo@cosmosoftware.io