idnits 2.17.1 

draft-omara-sframe-00.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

     No issues found here.

  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

  ** There are 41 instances of too long lines in the document, the longest
     one being 21 characters in excess of 72.


  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the IETF Trust and authors Copyright Line does not
     match the current year

  == Line 796 has weird spacing: '...verhead  bps@4...'

  == Line 804 has weird spacing: '...verhead  bps@3...'

  -- The document date (May 19, 2020) is 1438 days in the past.  Is this
     intentional?


  Checking references for intended status: Informational
  ----------------------------------------------------------------------------

  == Missing Reference: 'RFC5116' is mentioned on line 591, but not defined

  == Missing Reference: 'Optional' is mentioned on line 601, but not defined


     Summary: 1 error (**), 0 flaws (~~), 5 warnings (==), 1 comment (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	Network Working Group                                           E. Omara
3	Internet-Draft                                                 J. Uberti
4	Intended status: Informational                                    Google
5	Expires: November 20, 2020                                 A. GOUAILLARD
6	                                                              S. Murillo
7	                                                          CoSMo Software
8	                                                            May 19, 2020

10	                         Secure Frame (SFrame)
11	                         draft-omara-sframe-00

13	Abstract

15	   This document describes the Secure Frame (SFrame) end-to-end
16	   encryption and authentication mechanism for media frames in a
17	   multiparty conference call, in which central media servers (SFUs) can
18	   access the media metadata needed to make forwarding decisions without
19	   having access to the actual media.  The proposed mechanism differs
20	   from other approaches through its use of media frames as the
21	   encryptable unit, instead of individual RTP packets, which makes it
22	   more bandwidth efficient and also allows use with non-RTP transports.

24	Status of This Memo

26	   This Internet-Draft is submitted in full conformance with the
27	   provisions of BCP 78 and BCP 79.

29	   Internet-Drafts are working documents of the Internet Engineering
30	   Task Force (IETF).  Note that other groups may also distribute
31	   working documents as Internet-Drafts.  The list of current Internet-
32	   Drafts is at https://datatracker.ietf.org/drafts/current/.

34	   Internet-Drafts are draft documents valid for a maximum of six months
35	   and may be updated, replaced, or obsoleted by other documents at any
36	   time.  It is inappropriate to use Internet-Drafts as reference
37	   material or to cite them other than as "work in progress."

39	   This Internet-Draft will expire on November 20, 2020.

41	Copyright Notice

43	   Copyright (c) 2020 IETF Trust and the persons identified as the
44	   document authors.  All rights reserved.

46	   This document is subject to BCP 78 and the IETF Trust's Legal
47	   Provisions Relating to IETF Documents
48	   (https://trustee.ietf.org/license-info) in effect on the date of
49	   publication of this document.  Please review these documents
50	   carefully, as they describe your rights and restrictions with respect
51	   to this document.  Code Components extracted from this document must
52	   include Simplified BSD License text as described in Section 4.e of
53	   the Trust Legal Provisions and are provided without warranty as
54	   described in the Simplified BSD License.

56	Table of Contents

58	   1.  Introduction
59	   2.  Terminology
60	   3.  Goals
61	   4.  SFrame
62	     4.1.  SFrame Format
63	     4.2.  SFrame Header
64	     4.3.  Encryption Schema
65	       4.3.1.  Key Derivation
66	       4.3.2.  Encryption
67	       4.3.3.  Decryption
68	       4.3.4.  Duplicate Frames
69	       4.3.5.  Key Rotation
70	     4.4.  Authentication
71	     4.5.  Ciphersuites
72	       4.5.1.  SFrame
73	       4.5.2.  DTLS-SRTP
74	   5.  Key Management
75	     5.1.  MLS-SFrame
76	   6.  Media Considerations
77	     6.1.  SFU
78	       6.1.1.  LastN and RTP stream reuse
79	       6.1.2.  Simulcast
80	       6.1.3.  SVC
81	     6.2.  Video Key Frames
82	     6.3.  Partial Decoding
83	   7.  Overhead
84	     7.1.  Audio
85	     7.2.  Video
86	     7.3.  SFrame vs PERC-lite
87	       7.3.1.  Audio
88	       7.3.2.  Video
89	   8.  Security Considerations
90	     8.1.  Key Management
91	     8.2.  Authentication tag length
92	   9.  IANA Considerations
93	   10. References
94	     10.1.  Normative References
95	     10.2.  Informative References
96	   Authors' Addresses

98	1.  Introduction

100	   Modern multi-party video call systems use Selective Forwarding Unit
101	   (SFU) servers to efficiently route RTP streams to call endpoints
102	   based on factors such as available bandwidth, desired video size,
103	   and codec support.  For the SFU to work properly, though, it needs
104	   to be able to access RTP metadata and RTCP feedback messages, which
105	   is not possible if all RTP/RTCP traffic is end-to-end encrypted.

108	   As such, two layers of encryption and authentication are required:
109	   1- Hop-by-hop (HBH) encryption of media, metadata, and feedback
110	      messages between the endpoints and the SFU
111	   2- End-to-end (E2E) encryption of media between the endpoints

113	   While DTLS-SRTP can be used as an efficient HBH mechanism, it is
114	   inherently point-to-point and therefore not suitable as an E2E
115	   mechanism in an SFU context.  In addition, given the various
116	   scenarios in which video calling occurs, minimizing the bandwidth
117	   overhead of end-to-end encryption is also an important goal.

119	   This document proposes a new end-to-end encryption mechanism known as
120	   SFrame, specifically designed to work in group conference calls with
121	   SFUs.

123	     +-------------------------------+-------------------------------+^+
124	     |V=2|P|X|  CC   |M|     PT      |       sequence number         | |
125	     +-------------------------------+-------------------------------+ |
126	     |                           timestamp                           | |
127	     +---------------------------------------------------------------+ |
128	     |           synchronization source (SSRC) identifier            | |
129	     |=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=| |
130	     |            contributing source (CSRC) identifiers             | |
131	     |                               ....                            | |
132	     +---------------------------------------------------------------+ |
133	     |                   RTP extension(s) (OPTIONAL)                 | |
134	   +^---------------------+------------------------------------------+ |
135	   | |   payload header   |                                          | |
136	   | +--------------------+     payload  ...                         | |
137	   | |                                                               | |
138	   +^+---------------------------------------------------------------+^+
139	   | :                       authentication tag                      : |
140	   | +---------------------------------------------------------------+ |
141	   |                                                                   |
142	   ++ Encrypted Portion*                      Authenticated Portion +--+

144	                           SRTP packet format

146	2.  Terminology

148	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
149	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
150	   "OPTIONAL" in this document are to be interpreted as described in BCP
151	   14 [RFC2119] [RFC8174] when, and only when, they appear in all
152	   capitals, as shown here.

154	   SFU:  Selective Forwarding Unit (AKA RTP Switch)

156	   IV:  Initialization Vector

158	   MAC:  Message Authentication Code

160	   E2EE:  End to End Encryption

162	   HBH:  Hop By Hop

164	   KMS:  Key Management System

166	3.  Goals

168	   SFrame is designed to be a suitable E2EE protection scheme for
169	   conference call media in a broad range of scenarios, as outlined by
170	   the following goals:

172	   1.  Provide a secure E2EE mechanism for audio and video in
173	       conference calls that can be used with arbitrary SFU servers.

175	   2.  Decouple media encryption from key management to allow SFrame to
176	       be used with an arbitrary KMS.

178	   3.  Minimize packet expansion to allow successful conferencing in as
179	       many network conditions as possible.

181	   4.  Independence from the underlying transport, including use in non-
182	       RTP transports, e.g., WebTransport.

184	   5.  When used with RTP and its associated error resilience
185	       mechanisms, i.e., RTX and FEC, require no special handling for
186	       RTX and FEC packets.

188	   6.  Minimize the changes needed in SFU servers.

190	   7.  Minimize the changes needed in endpoints.

192	   8.  Work with the most popular audio and video codecs used in
193	       conferencing scenarios.

195	4.  SFrame

197	   We propose a frame level encryption mechanism that provides effective
198	   end-to-end encryption, is simple to implement, has no dependencies on
199	   RTP, and minimizes encryption bandwidth overhead.  Because SFrame
200	   encrypts the full frame, rather than individual packets, bandwidth
201	   overhead is reduced by having a single IV and authentication tag for
202	   each media frame.

204	   Also, because media is encrypted prior to packetization, the
205	   encrypted frame is packetized using a generic RTP packetizer instead
206	   of codec-dependent packetization mechanisms.  With this move to a
207	   generic packetizer, media metadata is moved from codec-specific
208	   mechanisms to a generic frame RTP header extension which, while
209	   visible to the SFU, is authenticated end-to-end.  This extension
210	   includes metadata needed for SFU routing such as resolution, frame
211	   beginning and end markers, etc.

213	   The generic packetizer splits the E2E encrypted media frame into one
214	   or more RTP packets and adds the SFrame header to the beginning of
215	   the first packet and an auth tag to the end of the last packet.

217	      +-------------------------------------------------------+
218	      |                                                       |
219	      |  +----------+      +------------+      +-----------+  |
220	      |  |          |      |   SFrame   |      |Packetizer |  |       DTLS+SRTP
221	      |  | Encoder  +----->+    Enc     +----->+           +-------------------------+
222	 ,+.  |  |          |      |            |      |           |  |   +--+  +--+  +--+   |
223	 `|'  |  +----------+      +-----+------+      +-----------+  |   |  |  |  |  |  |   |
224	 /|\  |                          ^                            |   |  |  |  |  |  |   |
225	  +   |                          |                            |   |  |  |  |  |  |   |
226	 / \  |                          |                            |   +--+  +--+  +--+   |
227	Alice |                    +-----+------+                     |   Encrypted Packets  |
228	      |                    |Key Manager |                     |                      |
229	      |                    +------------+                     |                      |
230	      |                         ||                            |                      |
231	      |                         ||                            |                      |
232	      |                         ||                            |                      |
233	      +-------------------------------------------------------+                      |
234	                                ||                                                   |
235	                                ||                                                   v
236	                           +------------+                                      +-----+------+
237	            E2EE channel   |  Messaging |                                      |   Media    |
238	              via the      |  Server    |                                      |   Server   |
239	          Messaging Server |            |                                      |            |
240	                           +------------+                                      +-----+------+
241	                                ||                                                   |
242	                                ||                                                   |
243	      +-------------------------------------------------------+                      |
244	      |                         ||                            |                      |
245	      |                         ||                            |                      |
246	      |                         ||                            |                      |
247	      |                    +------------+                     |                      |
248	      |                    |Key Manager |                     |                      |
249	 ,+.  |                    +-----+------+                     |   Encrypted Packets  |
250	 `|'  |                          |                            |   +--+  +--+  +--+   |
251	 /|\  |                          |                            |   |  |  |  |  |  |   |
252	  +   |                          v                            |   |  |  |  |  |  |   |
253	 / \  |  +----------+      +-----+------+      +-----------+  |   |  |  |  |  |  |   |
254	 Bob  |  |          |      |   SFrame   |      |   De+     |  |   +--+  +--+  +--+   |
255	      |  | Decoder  +<-----+    Dec     +<-----+Packetizer +<------------------------+
256	      |  |          |      |            |      |           |  |        DTLS+SRTP
257	      |  +----------+      +------------+      +-----------+  |
258	      |                                                       |
259	      +-------------------------------------------------------+

261	   The E2EE keys used to encrypt the frame are exchanged out of band
262	   using a secure E2EE channel.

264	4.1.  SFrame Format

266	     +------------+------------------------------------------+^+
267	     |S|LEN|X|KID |         Frame Counter                    | |
268	   +^+------------+------------------------------------------+ |
269	   | |                                                       | |
270	   | |                                                       | |
271	   | |                                                       | |
272	   | |                                                       | |
273	   | |                  Encrypted Frame                      | |
274	   | |                                                       | |
275	   | |                                                       | |
276	   | |                                                       | |
277	   | |                                                       | |
278	   +^+-------------------------------------------------------+^+
279	   | |                 Authentication Tag                    | |
280	   | +-------------------------------------------------------+ |
281	   |                                                           |
282	   |                                                           |
283	   +----+Encrypted Portion            Authenticated Portion+---+

285	4.2.  SFrame Header

287	   Since each endpoint can send multiple media layers, each frame will
288	   have a unique frame counter that will be used to derive the
289	   encryption IV.  The frame counter must be unique and monotonically
290	   increasing to avoid IV reuse.

292	   Because each sender uses its own key for encryption, the SFrame
293	   header includes the key id to allow the receiver to identify the
294	   key that needs to be used for decrypting.

296	   Both the frame counter and the key id are encoded in a variable
297	   length format to decrease the overhead.  The first byte in the
298	   SFrame header is fixed and contains the header metadata with the
299	   following format:

301	    0 1 2 3 4 5 6 7
302	   +-+-+-+-+-+-+-+-+
303	   |S|LEN  |X|  K  |
304	   +-+-+-+-+-+-+-+-+
305	   SFrame header metadata

307	   Signature flag (S): 1 bit.  If set, the payload contains a signature.
308	   Counter Length (LEN): 3 bits.  The length of the CTR field in bytes.
309	   Extended Key Id Flag (X): 1 bit.  Indicates whether the key field
310	   contains the key id or the key length.
311	   Key or Key Length (K): 3 bits.  Contains the key id (KID) if the X
312	   flag is 0, or the key length (KLEN) if the X flag is 1.

314	   If X flag is 0 then the KID is in the range of 0-7 and the frame
315	   counter (CTR) is found in the next LEN bytes:

317	    0 1 2 3 4 5 6 7
318	   +-+-+-+-+-+-+-+-+---------------------------------+
319	   |S|LEN  |0| KID |    CTR... (length=LEN)          |
320	   +-+-+-+-+-+-+-+-+---------------------------------+

322	   Key id (KID): 3 bits.  The key id (0-7).
323	   Frame counter (CTR): variable length, up to 8 bytes long.

325	   If X flag is 1 then KLEN is the length of the key id (KID), which is
326	   found after the SFrame header metadata byte.  After the key id (KID),
327	   the frame counter (CTR) will be found in the next LEN bytes:

329	 0 1 2 3 4 5 6 7
330	+-+-+-+-+-+-+-+-+---------------------------+---------------------------+
331	|S|LEN  |1|KLEN |   KID... (length=KLEN)    |    CTR... (length=LEN)    |
332	+-+-+-+-+-+-+-+-+---------------------------+---------------------------+

334	   Key length (KLEN): 3 bits.  The length of the key id in bytes.
335	   Key id (KID): variable length.  The key id value, up to 8 bytes long.
336	   Frame counter (CTR): variable length, up to 8 bytes long.
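
   The header layout above can be illustrated with a minimal Python
   sketch.  It is not part of the format definition and rests on
   assumptions the text does not state: bit 0 in the diagrams is taken
   to be the most significant bit, KID and CTR are written big-endian in
   the fewest bytes that hold them, and LEN and KLEN store the byte
   count directly (so this sketch caps both at 7 bytes).  The function
   names are hypothetical.

```python
def encode_sframe_header(kid: int, ctr: int, signature: bool = False) -> bytes:
    """Serialize the metadata byte, the optional extended KID, then CTR."""
    if kid < 0 or ctr < 0:
        raise ValueError("kid and ctr must be non-negative")
    # Assumption: CTR uses the fewest big-endian bytes that hold it (min 1).
    ctr_bytes = ctr.to_bytes(max(1, (ctr.bit_length() + 7) // 8), "big")
    if len(ctr_bytes) > 7:
        raise ValueError("this sketch caps CTR at 7 bytes (3-bit LEN)")
    s_bit = 0x80 if signature else 0x00
    if kid <= 7:
        # X = 0: the 3-bit K field carries the key id itself.
        meta = s_bit | (len(ctr_bytes) << 4) | kid
        return bytes([meta]) + ctr_bytes
    # X = 1: the K field carries KLEN; the key id follows the metadata byte.
    kid_bytes = kid.to_bytes((kid.bit_length() + 7) // 8, "big")
    if len(kid_bytes) > 7:
        raise ValueError("this sketch caps KID at 7 bytes (3-bit KLEN)")
    meta = s_bit | (len(ctr_bytes) << 4) | 0x08 | len(kid_bytes)
    return bytes([meta]) + kid_bytes + ctr_bytes


def decode_sframe_header(buf: bytes):
    """Return (signature_flag, kid, ctr, header_length)."""
    if not buf:
        raise ValueError("empty buffer")
    meta = buf[0]
    signature = bool(meta & 0x80)      # S
    ctr_len = (meta >> 4) & 0x07       # LEN
    extended = bool(meta & 0x08)       # X
    k = meta & 0x07                    # KID or KLEN
    pos = 1
    if extended:
        kid = int.from_bytes(buf[pos:pos + k], "big")
        pos += k
    else:
        kid = k
    if len(buf) < pos + ctr_len:
        raise ValueError("truncated header")
    ctr = int.from_bytes(buf[pos:pos + ctr_len], "big")
    return signature, kid, ctr, pos + ctr_len
```

   With these assumptions, key id 5 and counter 0x0102 encode to three
   bytes, while a key id above 7 costs one extra byte per key id byte.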

338	4.3.  Encryption Schema

340	4.3.1.  Key Derivation

342	   Each client creates a 32-byte secret key K and shares it with the
343	   other participants via an E2EE channel.  From K, we derive 3 secrets:

345	   1- Salt key used to calculate the IV

347	   Key = HKDF(K, 'SFrameSaltKey', 16)

349	   2- Encryption key to encrypt the media frame

351	   Key = HKDF(K, 'SFrameEncryptionKey', 16)

353	   3- Authentication key to authenticate the encrypted frame and the
354	   media metadata

356	   Key = HKDF(K, 'SFrameAuthenticationKey', 32)

357	   The IV is 128 bits long and calculated from the CTR field of the
358	   SFrame header:

360	   IV = CTR XOR Salt key
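
   As an illustration only, the derivation above could be sketched in
   Python as follows.  Several details are assumptions, since the text
   does not fix them: HKDF is taken to be RFC 5869 with SHA-256, the
   label is used as the HKDF "info" input with an all-zero salt, and CTR
   is left-padded with zeros to 16 bytes before the XOR.  The function
   names are hypothetical.

```python
import hashlib
import hmac


def hkdf(key: bytes, label: bytes, length: int) -> bytes:
    """HKDF-SHA256 (RFC 5869) with an all-zero salt; label is the info."""
    prk = hmac.new(b"\x00" * 32, key, hashlib.sha256).digest()  # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                                    # expand
        block = hmac.new(prk, block + label + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def derive_secrets(k: bytes):
    """Return (salt_key, encryption_key, authentication_key) from K."""
    salt_key = hkdf(k, b"SFrameSaltKey", 16)
    enc_key = hkdf(k, b"SFrameEncryptionKey", 16)
    auth_key = hkdf(k, b"SFrameAuthenticationKey", 32)
    return salt_key, enc_key, auth_key


def derive_iv(salt_key: bytes, ctr: int) -> bytes:
    """IV = CTR XOR salt key, with CTR zero-padded to 128 bits (assumed)."""
    ctr_block = ctr.to_bytes(16, "big")
    return bytes(a ^ b for a, b in zip(ctr_block, salt_key))
```

   Because the counter is XORed into the salt key, two frames protected
   under the same K get distinct IVs as long as their counters differ.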

362	4.3.2.  Encryption

364	   After encoding the frame and before packetizing it, the necessary
365	   media metadata will be moved out of the encoded frame buffer, to be
366	   used later in the RTP generic frame header extension.  The encoded
367	   frame, the metadata buffer and the frame counter are passed to the
368	   SFrame encryptor.  The encryptor constructs the SFrame header using
369	   the frame counter and key id and derives the encryption IV.  The
370	   frame is encrypted using the encryption key, and the header, the
371	   encrypted frame and the media metadata are authenticated using the
372	   authentication key.  The authentication tag is then truncated (if
373	   supported by the cipher suite) and appended to the end of the
374	   ciphertext.

376	   The encrypted payload is then passed to a generic RTP packetizer,
377	   which constructs the RTP packets and encrypts them using SRTP keys
378	   for the HBH encryption to the media server.

380	                             +---------------+  +---------------+
381	                             |               |  | frame metadata+----+
382	                             |               |  +---------------+    |
383	                             |     frame     |                       |
384	                             |               |                       |
385	                             |               |                       |
386	                             +-------+-------+                       |
387	                                     |                               |
388	            CTR +---------------> IV |Enc Key <----Master Key        |
389	                   derive IV         |                  |            |
390	             +                       |                  |            |
391	             |                       +                  v            |
392	             |                    encrypt           Auth Key         |
393	             |                       |                  +            |
394	             |                       |                  |            |
395	             |                       v                  |            |
396	             |               +-------+-------+          |            |
397	             |               |               |          |            |
398	             |               |   encrypted   |          v            |
399	             |               |     frame     +---->Authenticate<-----+
400	             +               |               |          +
401	         encode CTR          |               |          |
402	             +               +-------+-------+          |
403	             |                       |                  |
404	             |                       |                  |
405	             |                       |                  |
406	             |              generic RTP packetize       |
407	             |                       +                  |
408	             |                       |                  |
409	             |                       |                  +--------------+
410	  +----------+                       v                                 |
411	  |                                                                    |
412	  |   +---------------+      +---------------+     +---------------+   |
413	  +-> | SFrame header |      |               |     |               |   |
414	      +---------------+      |               |     |  payload N/N  |   |
415	      |               |      |  payload 2/N  |     |               |   |
416	      |  payload 1/N  |      |               |     +---------------+   |
417	      |               |      |               |     |    auth tag   | <-+
418	      +---------------+      +---------------+     +---------------+
419	                           Encryption flow
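
   The authentication and truncation step of this flow can be sketched
   as below.  This is a hedged illustration, not the defined procedure:
   it assumes HMAC-SHA256 as the MAC, assumes the authenticated input is
   the plain concatenation header || ciphertext || metadata (the text
   does not specify the order or any length framing), and treats the
   ciphertext as already produced by the cipher.  The function names are
   hypothetical.

```python
import hashlib
import hmac


def compute_auth_tag(auth_key: bytes, header: bytes, ciphertext: bytes,
                     metadata: bytes, tag_len: int) -> bytes:
    """HMAC-SHA256 over the authenticated fields, truncated to tag_len."""
    # Assumed input order: SFrame header, encrypted frame, media metadata.
    mac = hmac.new(auth_key, header + ciphertext + metadata,
                   hashlib.sha256).digest()
    return mac[:tag_len]


def seal(auth_key: bytes, header: bytes, ciphertext: bytes,
         metadata: bytes, tag_len: int = 10) -> bytes:
    """Build header || ciphertext || truncated tag for the packetizer."""
    tag = compute_auth_tag(auth_key, header, ciphertext, metadata, tag_len)
    return header + ciphertext + tag
```

   A tag_len of 10 corresponds to an 80-bit tag and 4 to a 32-bit tag;
   the metadata is authenticated but travels in the RTP header
   extension rather than in the sealed buffer.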

421	4.3.3.  Decryption

423	   The receiving client buffers all packets that belong to the same
424	   frame using the frame beginning and ending marks in the generic RTP
425	   frame header extension, and once all packets are available, it passes
426	   them to SFrame for decryption.  SFrame maintains multiple decryptor
427	   objects, one for each client in the call.  Initially the client might
428	   not have the mapping between the incoming streams and the users'
429	   keys.  In this case SFrame tries all unmapped keys until it finds one
430	   that passes the authentication verification and uses it to decrypt
431	   the frame.  If the client has the mapping ready, it can push it down
432	   to SFrame later.

434	   The KeyId field in the SFrame header is used to find the right key
435	   for that user.  The sender increments the KeyId when it switches to
436	   a new key.

438	   For frames that fail to decrypt because no key is available yet,
439	   SFrame will buffer them and retry decrypting them once a key is
440	   received.
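
   The search over unmapped keys described in this section might look
   like the following sketch.  It assumes an HMAC-SHA256 tag truncated
   to the length of the received tag, and it takes the authenticated
   bytes as a single pre-assembled buffer; the function name and the
   user id keyed dictionary are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Optional


def find_sender(unmapped_auth_keys: Dict[int, bytes], authenticated: bytes,
                tag: bytes) -> Optional[int]:
    """Return the user id whose key verifies the tag, or None if none do."""
    for user_id, auth_key in unmapped_auth_keys.items():
        expected = hmac.new(auth_key, authenticated,
                            hashlib.sha256).digest()[:len(tag)]
        # Constant-time comparison avoids leaking how many bytes matched.
        if hmac.compare_digest(expected, tag):
            return user_id
    return None
```

   A caller would record the returned mapping so that later frames on
   the same stream go straight to the right decryptor, and would buffer
   the frame for a later retry when the result is None.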

442	4.3.4.  Duplicate Frames

444	   Unlike in messaging applications, in video calls receiving a
445	   duplicate frame doesn't necessarily mean the client is under a
446	   replay attack.  There are other reasons that might cause this; for
447	   example, the sender might just be resending frames in case of packet
448	   loss.  SFrame decryptors use the highest received frame counter to
449	   protect against replay, and accept older frames only within a short
450	   interval to support out-of-order delivery.
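
   One possible reading of this rule is sketched below.  The window
   size, the class name and the choice to silently drop an exact
   duplicate inside the window are assumptions; the text only says that
   older frames are allowed within a short interval of the highest
   received counter.

```python
class ReplayFilter:
    """Tracks the highest frame counter and a short window behind it."""

    def __init__(self, window: int = 32):
        self.window = window      # hypothetical default, in frames
        self.highest = -1
        self.seen = set()

    def accept(self, ctr: int) -> bool:
        if ctr > self.highest:
            # New highest counter: slide the window forward.
            self.highest = ctr
            self.seen.add(ctr)
            floor = self.highest - self.window
            self.seen = {c for c in self.seen if c > floor}
            return True
        if ctr <= self.highest - self.window:
            return False          # older than the allowed interval
        if ctr in self.seen:
            return False          # duplicate inside the window
        self.seen.add(ctr)        # late but acceptable (out of order)
        return True
```

   An application that treats retransmitted duplicates as harmless can
   drop them quietly on a False result rather than raising an alarm.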

452	4.3.5.  Key Rotation

454	   Because the E2EE keys could be rotated during the call when people
455	   join and leave, these new keys are exchanged using the same E2EE
456	   secure channel used in the initial key negotiation.  Sending new
457	   fresh keys is an expensive operation, so the key management component
458	   might choose to send new keys only when other clients leave the call
459	   and use hash ratcheting for the join case, so there is no need to
460	   send a new key to the clients who are already on the call.  SFrame
461	   supports both modes.

463	4.3.5.1.  Key Ratcheting

465	   When the SFrame decryptor fails to decrypt one of the frames, it
466	   automatically ratchets the key forward and retries until one
467	   ratchet succeeds or it reaches the maximum allowed ratcheting window.
468	   If a new ratchet passes the decryption, all previous ratchets are
469	   deleted.

471	   K(i) = HKDF(K(i-1), 'SFrameRatchetKey', 32)
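
   A minimal sketch of this retry loop follows.  It assumes HKDF is RFC
   5869 with SHA-256, an all-zero salt and the label as "info", and it
   models decryption as a caller-supplied callback that returns None on
   failure; try_decrypt, max_steps and the function names are
   hypothetical.

```python
import hashlib
import hmac
from typing import Callable, Optional, Tuple


def hkdf(key: bytes, label: bytes, length: int) -> bytes:
    """HKDF-SHA256 (RFC 5869) with an all-zero salt; label is the info."""
    prk = hmac.new(b"\x00" * 32, key, hashlib.sha256).digest()
    okm, block, counter = b"", b"", 1
    while len(okm) < length:
        block = hmac.new(prk, block + label + bytes([counter]),
                         hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


def ratchet(key: bytes) -> bytes:
    """K(i) = HKDF(K(i-1), 'SFrameRatchetKey', 32)."""
    return hkdf(key, b"SFrameRatchetKey", 32)


def decrypt_with_ratchet(
    key: bytes,
    try_decrypt: Callable[[bytes], Optional[bytes]],
    max_steps: int = 8,
) -> Tuple[bytes, Optional[bytes]]:
    """Try the current key, then up to max_steps ratcheted keys."""
    candidate = key
    for _ in range(max_steps + 1):
        plaintext = try_decrypt(candidate)
        if plaintext is not None:
            # Success: the caller keeps only this key and drops older ones.
            return candidate, plaintext
        candidate = ratchet(candidate)
    return key, None  # window exhausted; keep the current key
```

   Returning the winning key lets the caller replace its stored key, so
   earlier ratchet states are discarded as the text requires.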

473	4.3.5.2.  New Key

475	   SFrame will set the key on the decryptors immediately when it is
476	   received and destroy the old key material, so if the key manager
477	   sends a new key during the call, it is recommended not to start using
478	   it immediately and to wait for a short time to make sure it is
479	   delivered to all other clients before using it, to decrease the
480	   number of decryption failures.  It is up to the application and the
481	   key manager to define how long this period is.

483	4.4.  Authentication

485	   Every client in the call knows the secret key for all other clients
486	   so it can decrypt their traffic.  This also means a malicious client
487	   can impersonate any other client in the call by using the victim's
488	   key to encrypt their traffic.  This might not be a problem for
489	   consumer applications where the number of clients in the call is
490	   small and users know each other; however, for enterprise use cases
491	   where large conference calls are common, an authentication mechanism
492	   is needed to protect against malicious users.  This authentication
493	   will come with extra cost.

495	   Adding a digital signature to each encrypted frame would be
496	   overkill; instead we propose adding a signature over multiple frames.

498	   The signature is calculated by concatenating the authentication tags
499	   of the frames that the sender wants to authenticate (in reverse sent
500	   order) and signing it with the signature key.  Signature keys are
501	   exchanged out of band along with the encryption keys.

503	Signature = Sign(Key, AuthTag(Frame N) || AuthTag(Frame N-1) || ...|| AuthTag(Frame N-M))

505	   The authentication tags for the previous frames covered by the
506	   signature and the signature itself will be appended at the end of the
507	   frame, after the current frame authentication tag, in the same order
508	   that the signature was calculated, and the SFrame header metadata
509	   signature bit (S) will be set to 1.

511	       +^ +------------------+
512	       |  | SFrame header S=1|
513	       |  +------------------+
514	       |  |  Encrypted       |
515	       |  |  payload         |
516	       |  |                  |
517	       |^ +------------------+ ^+
518	       |  |  Auth Tag N      |  |
519	       |  +------------------+  |
520	       |  |  Auth Tag N-1    |  |
521	       |  +------------------+  |
522	       |  |    ........      |  |
523	       |  +------------------+  |
524	       |  |  Auth Tag N-M    |  |
525	       |  +------------------+ ^|
526	       |  | NUM | Signature  :  |
527	       |  +-----+            +  |
528	       |  :                  |  |
529	       |  +------------------+  |
530	       |                        |
531	       +-> Authenticated with   +-> Signed with
532	           Auth Tag N               Signature

534	       Encrypted Frame with Signature

536	   Note that the authentication tag for the current frame will only
537	   authenticate the SFrame header and the encrypted payload, and not the
538	   signature nor the previous frames' authentication tags (N-1 to N-M)
539	   used to calculate the signature.

541	   The last byte (NUM) after the authentication tag list and before the
542	   signature indicates the number of the authentication tags from
543	   previous frames present in the current frame.  All the
544	   authentication tags MUST have the same size, which MUST be equal to
545	   the authentication tag size of the current frame.  The signature has
546	   a fixed size depending on the signature algorithm used (for example,
547	   64 bytes for Ed25519).
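
   The trailer layout described above can be parsed from the end of the
   frame, as in this sketch.  It assumes the frame has the S bit set,
   that the caller already knows the tag size from the cipher suite, and
   that the signature is 64 bytes; the names are hypothetical and no
   signature verification is performed here.

```python
from typing import List, NamedTuple


class SignedFrame(NamedTuple):
    body: bytes                 # SFrame header + encrypted payload
    current_tag: bytes          # Auth Tag N
    previous_tags: List[bytes]  # Auth Tag N-1 ... N-M
    signature: bytes


def split_signed_frame(frame: bytes, tag_len: int,
                       sig_len: int = 64) -> SignedFrame:
    """Split a frame laid out as body | tags | NUM | signature."""
    if len(frame) < sig_len + 1 + tag_len:
        raise ValueError("frame too short for a signed trailer")
    signature = frame[-sig_len:]
    num = frame[-sig_len - 1]               # tags from previous frames
    tags_end = len(frame) - sig_len - 1
    tags_start = tags_end - (num + 1) * tag_len
    if tags_start < 0:
        raise ValueError("NUM exceeds the available bytes")
    tags = [frame[i:i + tag_len]
            for i in range(tags_start, tags_end, tag_len)]
    return SignedFrame(frame[:tags_start], tags[0], tags[1:], signature)


def signed_input(parsed: SignedFrame) -> bytes:
    """AuthTag(N) || AuthTag(N-1) || ... || AuthTag(N-M)."""
    return parsed.current_tag + b"".join(parsed.previous_tags)
```

   The bytes returned by signed_input are what a verifier would check
   against the signature with the sender's signature key.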

549	   The receiver has to keep track of all the frames received but not yet
550	   verified, by storing the authentication tags of each received frame.
551	   When a signature is received, the receiver will verify it with the
552	   signature key associated to the key id of the frame the signature was
553	   sent in.  If the verification is successful, the receiver will mark
554	   the frames as authenticated and remove them from the list of
555	   unverified frames.  It is up to the application to decide what to do
556	   when signature verification fails.

558	   When using SVC, the hash will be calculated over all the frames of
559	   the different spatial layers within the same superframe/picture.
560	   However, the SFU will be able to drop frames within the same stream
561	   (either spatial or temporal) to match the target bitrate.

563	   If the signature is sent on a frame whose layer is dropped by the
564	   SFU, the receiver will not receive it and will not be able to
565	   verify the signature of the other received layers.

567	   An easy way of solving the issue would be to perform signature only
568	   on the base layer or take into consideration the frame dependency
569	   graph and send multiple signatures in parallel (each for a branch of
570	   the dependency graph).

572	   In case of simulcast or K-SVC, each spatial layer should be
573	   authenticated with different signatures to prevent the SFU from
574	   discarding frames with the signature info.

576	   In any case, it is possible that the frame with the signature is lost
577	   or the SFU drops it, so the receiver MUST be prepared to not receive
578	   a signature for a frame and remove it from the list of frames
579	   pending verification after a timeout.
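
   The receiver bookkeeping described in this section could be sketched
   as follows.  This is only an illustration under stated assumptions:
   the class name PendingFrames, the 2-second default timeout, and the
   choice to index pending frames by their authentication tag are all
   hypothetical.  Signature verification itself is left to the caller.

```python
import time
from typing import Dict, Iterable, List, Optional


class PendingFrames:
    """Tracks authentication tags of frames received but not yet verified."""

    def __init__(self, timeout_s: float = 2.0) -> None:
        self.timeout_s = timeout_s               # assumed value, not from the draft
        self._pending: Dict[bytes, float] = {}   # tag -> arrival time

    def add(self, tag: bytes, now: Optional[float] = None) -> None:
        """Record the tag of a frame that decrypted but is not yet signed for."""
        self._pending[tag] = time.monotonic() if now is None else now

    def mark_verified(self, tags: Iterable[bytes]) -> List[bytes]:
        """Call after a signature over `tags` verified successfully.

        Returns the tags that were pending and are now authenticated.
        """
        return [t for t in tags if self._pending.pop(t, None) is not None]

    def expire(self, now: Optional[float] = None) -> List[bytes]:
        """Drop frames whose signature never arrived within the timeout."""
        now = time.monotonic() if now is None else now
        expired = [t for t, ts in self._pending.items()
                   if now - ts > self.timeout_s]
        for t in expired:
            del self._pending[t]
        return expired

    def __len__(self) -> int:
        return len(self._pending)
```

   What to do with expired or unverifiable frames remains an application
   decision, as stated above.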

581	4.5.  Ciphersuites

583	4.5.1.  SFrame

585	   Each SFrame session uses a single ciphersuite that specifies the
586	   following primitives:

588	   o  A hash function.  This is used for key derivation and frame
589	      hashes for signatures.  We recommend using SHA-256.

591	   o  An AEAD encryption algorithm [RFC5116].  While any AEAD algorithm
592	      can be used to encrypt the frame, we recommend algorithms with
593	      safe MAC truncation, such as AES-CTR with HMAC, to reduce the
594	      per-frame overhead.  In this case we can use an 80-bit MAC for
595	      video and a 32-bit MAC for audio, as in DTLS-SRTP cipher suites:

597	   1- AES_CM_128_HMAC_SHA256_80

599	   2- AES_CM_128_HMAC_SHA256_32

601	   o  [Optional] A signature algorithm.  If signatures are supported,
602	      we recommend using Ed25519.
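
   To make the MAC truncation concrete, here is a minimal Python sketch
   using only the standard library.  It shows HMAC-SHA256 truncated to
   80 bits (video) and 32 bits (audio).  It does not implement the
   AES-CTR encryption, and what exactly is fed to the MAC is an
   assumption here (`data` stands for whatever bytes the ciphersuite
   authenticates).  The function names are hypothetical.

```python
import hashlib
import hmac

VIDEO_TAG_LEN = 10  # 80 bits, as in AES_CM_128_HMAC_SHA256_80
AUDIO_TAG_LEN = 4   # 32 bits, as in AES_CM_128_HMAC_SHA256_32


def truncated_tag(auth_key: bytes, data: bytes, tag_len: int) -> bytes:
    """Return the leftmost tag_len bytes of HMAC-SHA256(auth_key, data)."""
    if not 0 < tag_len <= hashlib.sha256().digest_size:
        raise ValueError("tag_len must be between 1 and 32 bytes")
    return hmac.new(auth_key, data, hashlib.sha256).digest()[:tag_len]


def verify_tag(auth_key: bytes, data: bytes, tag: bytes) -> bool:
    """Recompute the truncated tag and compare in constant time."""
    expected = truncated_tag(auth_key, data, len(tag))
    return hmac.compare_digest(expected, tag)
```

   The shorter audio tag saves 6 bytes per frame relative to the video
   tag, which matters because every audio packet is its own frame.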

604	4.5.2.  DTLS-SRTP

606	   SRTP is used for HBH encryption.  Since the media payload is already
607	   encrypted and SRTP only protects the RTP headers, an implementation
608	   could use a 4-byte outer authentication tag to decrease the
609	   overhead.  However, it is up to the application to use other
610	   ciphers, like AES-128-GCM with a full authentication tag.

612	5.  Key Management

614	   SFrame must be integrated with an E2EE key management framework to
615	   exchange and rotate the encryption keys.  This framework will
616	   maintain a group of participant endpoints who are in the call.  At
617	   call setup time, each endpoint will create fresh key material and,
618	   optionally, a signing key pair for that call, and encrypt the key
619	   material and the public signing key to every other endpoint.  The
620	   encrypted keys are delivered by the messaging delivery server using a
621	   reliable channel.

623	   The key management system (KMS) will monitor group changes and
624	   exchange new keys when necessary.  It is up to the application to
625	   define this group.  For example, one application could have an
626	   ephemeral group for every call and rotate keys when endpoints join
627	   or leave the call, while another application could have a persisted
628	   group that can be used for multiple calls and exchange keys with all
629	   group endpoints for every call.

631	   When new key material is created during the call, we recommend not
632	   using it immediately in SFrame, to give time for the new keys to be
633	   delivered.  If the application supports delivery receipts, they can
634	   be used to track whether the key has been delivered to all other
635	   endpoints on the call before using it.

637	   Keys must have a sequential id starting from 0 and incremented every
638	   time a new key is generated for this endpoint.  The key id will be
639	   added in the SFrame header during encryption, so the recipient knows
640	   which key to use for the decryption.
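
   One possible way to model the sequential key ids and the delayed
   switch to a new key is sketched below.  The class EndpointKeyRing and
   its methods are hypothetical.  How an application learns that a key
   has been delivered (for example, through delivery receipts) is
   outside this sketch, so activation is an explicit call.

```python
from typing import Dict, Optional, Tuple


class EndpointKeyRing:
    """Holds one endpoint's keys, indexed by sequential key id."""

    def __init__(self) -> None:
        self._keys: Dict[int, bytes] = {}
        self._active_id: Optional[int] = None

    def generate(self, key_material: bytes) -> int:
        """Store new key material and return its id (0, 1, 2, ...)."""
        key_id = len(self._keys)
        self._keys[key_id] = key_material
        return key_id

    def activate(self, key_id: int) -> None:
        """Start encrypting with key_id, e.g. once delivery is confirmed."""
        if key_id not in self._keys:
            raise KeyError(f"unknown key id {key_id}")
        self._active_id = key_id

    def encryption_key(self) -> Tuple[int, bytes]:
        """Return (key id, key) to use for the next outgoing frame."""
        if self._active_id is None:
            raise RuntimeError("no key has been activated yet")
        return self._active_id, self._keys[self._active_id]

    def lookup(self, key_id: int) -> bytes:
        """Return the key for a key id read from an SFrame header."""
        return self._keys[key_id]
```

   Keeping older keys available through lookup() lets frames that were
   encrypted just before a rotation still be decrypted.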

642	5.1.  MLS-SFrame

644	   While any other E2EE KMS can be used with SFrame, there is a big
645	   advantage if it is used with [MLSARCH], which natively supports very
646	   large groups efficiently.  When [MLSPROTO] is used, the endpoints'
647	   keys (AKA application secret) can be used directly for SFrame without
648	   the need to exchange separate key material.  The application secret
649	   is rotated automatically by [MLSPROTO] when group membership changes.

651	6.  Media Considerations

653	6.1.  SFU

655	   Selective Forwarding Units (SFUs), as described in
656	   https://tools.ietf.org/html/rfc7667#section-3.7, receive the RTP
657	   streams from each participant and select which ones should be
658	   forwarded to each of the other participants.  There are several
659	   approaches to this stream selection but, in general, the SFU needs
660	   to access metadata associated with each frame and modify the RTP
661	   information of the incoming packets when they are transmitted to
662	   the receiving participants.

664	   This section describes how these normal SFU modes of operation
665	   interact with the E2EE provided by SFrame.

667	6.1.1.  LastN and RTP stream reuse

669	   The SFU may choose to send only a certain number of streams based on
670	   the voice activity of the participants.  To reduce the number of SDP
671	   O/A required to establish a new RTP stream, the SFU may decide to
672	   reuse previously existing RTP sessions or even pre-allocate a
673	   predefined number of RTP streams and choose at each moment in time
674	   which participant's media will be sent through it.  This means that
675	   the same RTP stream (defined by either SSRC or MID) may carry
676	   media from different streams of different participants.  As different
677	   keys are used by each participant for encrypting their media, the
678	   receiver will be able to verify which is the sender of the media
679	   coming within the RTP stream at any given point in time, preventing
680	   the SFU from impersonating any of the participants with another
681	   participant's media.  Note that in order to prevent impersonation by
682	   a malicious participant (not the SFU), usage of the signature is
683	   required.  In case of video, a new signature should be started
684	   each time a key frame is sent to allow the receiver to identify the
685	   source faster after a switch.

687	6.1.2.  Simulcast

689	   When using simulcast, the same input image will produce N different
690	   encoded frames (one per simulcast layer) which would be processed
691	   independently by the frame encryptor, each assigned a unique
692	   counter.

694	6.1.3.  SVC

696	   In both temporal and spatial scalability, the SFU may choose to drop
697	   layers in order to match a certain bitrate or forward specific media
698	   sizes or frames per second.  In order to support it, the sender MUST
699	   encode each spatial layer of a given picture in a different frame.
700	   That is, an RTP frame may contain more than one SFrame encrypted
701	   frame with an incrementing frame counter.

703	6.2.  Video Key Frames

705	   Forward and Post-Compromise Security require that the e2ee keys are
706	   updated any time a participant joins or leaves the call.

708	   The key exchange happens asynchronously and on a different path than
709	   the SFU signaling and media.  So it may happen that when a new
710	   participant joins the call and the SFU side requests a key frame,
711	   the sender generates the e2ee encrypted frame with a key not known
712	   by the receiver, so it will be discarded.  When the sender updates
713	   its sending key with the new key, it will send it in a non-key
714	   frame, so the receiver will be able to decrypt it, but not decode it.

716	   The receiver will then re-request a key frame, but due to sender and
717	   SFU policies, the new key frame could take some time to be generated.

719	   If the sender sends a key frame when the new e2ee key is in use, the
720	   time required for the new participant to display the video is
721	   minimized.

723	6.3.  Partial Decoding

725	   Some codecs support partial decoding, where individual packets can
726	   be decoded without waiting for the full frame to arrive.  With
727	   SFrame this won't be possible because the decoder will not access
728	   the packets until the entire frame has arrived and been decrypted.

730	7.  Overhead

732	   The encryption overhead will vary between audio and video streams.
733	   In audio, each packet is considered a separate frame, so it always
734	   carries an extra MAC and IV, whereas a video frame usually consists
735	   of multiple RTP packets.  The overhead per frame, in bytes, is:
736	   1 + FrameCounter length + 4
737	   The constant 1 is the SFrame header byte, and 4 is the length of the
738	   HBH authentication tag for both audio and video packets.

740	7.1.  Audio

742	   Using three audio frame durations: 20ms (50 packets/s), 40ms (25
743	   packets/s), and 100ms (10 packets/s).  Frame counter of up to 3
744	   bytes (3.8 days of data at 20ms) and a fixed 4-byte MAC length.

746	   +------------+-----------+-----------+----------+-----------+
747	   | Counter len| Packets   | Overhead  | Overhead | Overhead  |
748	   |            |           | bps@20ms  | bps@40ms | bps@100ms |
749	   +------------+-----------+-----------+----------+-----------+
750	   |          1 | 0-255     |      2400 |     1200 |       480 |
751	   |          2 | 256 - 65K |      2800 |     1400 |       560 |
752	   |          3 | 65K - 16M |      3200 |     1600 |       640 |
753	   +------------+-----------+-----------+----------+-----------+
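
   The audio figures above follow directly from the per-frame formula in
   Section 7.  The short Python sketch below recomputes them.  The
   function and variable names are hypothetical, and the 4-byte tag
   length is the assumption stated in the text.

```python
def sframe_overhead_bytes(counter_len: int, hbh_tag_len: int = 4) -> int:
    """Per-frame overhead: 1 header byte + frame counter + HBH auth tag."""
    return 1 + counter_len + hbh_tag_len


def sframe_overhead_bps(counter_len: int, frames_per_second: float) -> float:
    """Overhead in bits per second for a given frame rate."""
    return sframe_overhead_bytes(counter_len) * 8 * frames_per_second


# For audio, each packet is one frame, so frames/s equals packets/s.
AUDIO_RATES = {"20ms": 50, "40ms": 25, "100ms": 10}

audio_table = {
    counter_len: {
        duration: int(sframe_overhead_bps(counter_len, rate))
        for duration, rate in AUDIO_RATES.items()
    }
    for counter_len in (1, 2, 3)
}

for counter_len, row in audio_table.items():
    print(counter_len, row)
```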

755	7.2.  Video

757	   The per-stream overhead in bits per second is calculated for the
758	   following video encodings: 30fps@1000Kbps (4 packets per frame),
759	   30fps@512Kbps (2 packets per frame), 15fps@200Kbps (2 packets per
760	   frame), and 7.5fps@30Kbps (1 packet per frame), where
761	   Overhead bps = (Counter length + 1 + 4) * 8 * fps

763	   +------------+-----------+------------+------------+------------+
764	   | Counter len| Frames    | Overhead   | Overhead   | Overhead   |
765	   |            |           | bps@30fps  | bps@15fps  | bps@7.5fps |
766	   +------------+-----------+------------+------------+------------+
767	   |          1 | 0-255     |       1440 |       1440 |        720 |
768	   |          2 | 256 - 65K |       1680 |       1680 |        840 |
769	   |          3 | 65K - 16M |       1920 |       1920 |        960 |
770	   |          4 | 16M - 4B  |       2160 |       2160 |       1080 |
771	   +------------+-----------+------------+------------+------------+

773	7.3.  SFrame vs PERC-lite

775	   [PERC] has significant overhead over SFrame because the overhead is
776	   per packet, not per frame, and because of the OHB (Original Header
777	   Block), which duplicates any RTP header/extension field modified by
778	   the SFU.  [PERCLITE] <https://mailarchive.ietf.org/arch/msg/perc/
779	   SB0qMHWz6EsDtz3yIEX0HWp5IEY/> is slightly better because it doesn't
780	   use the OHB anymore; however, it still does per-packet encryption
781	   using SRTP.  Below is the overhead in [PERCLITE] as implemented by
782	   CoSMo Software, which uses an extra 11 bytes per packet to preserve
783	   the PT, SEQ_NUM, TIME_STAMP and SSRC fields in addition to the extra
784	   MAC tag per packet.

786	   OverheadPerPacket = 11 + MAC length
787	   Overhead bps = PacketPerSecond * OverheadPerPacket * 8

789	   Similar to SFrame, we will assume the HBH authentication tag length
790	   will always be 4 bytes for audio and video, even though it is not the
791	   case in this [PERCLITE] implementation.

793	7.3.1.  Audio

795	   +-------------------+--------------------+--------------------+
796	   | Overhead bps@20ms | Overhead  bps@40ms | Overhead bps@100ms |
797	   +-------------------+--------------------+--------------------+
798	   |              6000 |               3000 |               1200 |
799	   +-------------------+--------------------+--------------------+
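
   The audio values in this table can be reproduced from the per-packet
   formula in Section 7.3 with a few lines of Python.  The function name
   is hypothetical, and the 4-byte MAC length is the assumption stated
   above.

```python
def perclite_overhead_bps(packets_per_second: int, mac_len: int = 4) -> int:
    """PERC-lite overhead: (11 preserved header bytes + MAC) per packet."""
    overhead_per_packet = 11 + mac_len
    return packets_per_second * overhead_per_packet * 8


# Audio: 20ms, 40ms and 100ms frames are 50, 25 and 10 packets/s.
perclite_audio = {
    "20ms": perclite_overhead_bps(50),
    "40ms": perclite_overhead_bps(25),
    "100ms": perclite_overhead_bps(10),
}
print(perclite_audio)
```

   Because the cost is paid per packet rather than per frame, it grows
   with the number of packets in each frame.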

801	7.3.2.  Video

803	  +---------------------+----------------------+-----------------------+
804	  | Overhead  bps@30fps |  Overhead  bps@15fps |  Overhead  bps@7.5fps |
805	  |(4 packets per frame)| (2 packets per frame)| (1 packet per frame)  |
806	  +---------------------+----------------------+-----------------------+
807	  |               14400 |                 7200 |                  3600 |
808	  +---------------------+----------------------+-----------------------+

810	   For a conference with a single incoming audio stream (@ 50 pps) and 4
811	   incoming video streams (@200 Kbps), the savings in overhead is 34800
812	   - 9600 = ~25 Kbps, or ~3%.

814	8.  Security Considerations

816	8.1.  Key Management

818	   The key exchange mechanism is out of scope of this document; however,
819	   every client MUST change its keys when new clients join or leave
820	   the call for "Forward Secrecy" and "Post Compromise Security".

822	8.2.  Authentication tag length

824	   The cipher suites defined in this draft use short authentication tags
825	   for encryption; however, SFrame can easily support other ciphers with
826	   full authentication tags if the short ones are proven insecure.

828	9.  IANA Considerations

830	   This document makes no requests of IANA.

832	10.  References

834	10.1.  Normative References

836	   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
837	              Requirement Levels", BCP 14, RFC 2119,
838	              DOI 10.17487/RFC2119, March 1997,
839	              <https://www.rfc-editor.org/info/rfc2119>.

841	   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
842	              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
843	              May 2017, <https://www.rfc-editor.org/info/rfc8174>.

845	10.2.  Informative References

847	   [MLSARCH]  Omara, E., Barnes, R., Rescorla, E., Inguva, S., Kwon, A.,
848	              and A. Duric, "Messaging Layer Security Architecture",
849	              2020.

851	   [MLSPROTO]
852	              Barnes, R., Millican, J., Omara, E., Cohn-Gordon, K., and
853	              R. Robert, "Messaging Layer Security Protocol", 2020.

855	   [PERC]     Jennings, C., Jones, P., Barnes, R., and A. Roach, "PERC",
856	              2020, <https://datatracker.ietf.org/doc/rfc8723/>.

858	   [PERCLITE]
859	              GOUAILLARD, A. and S. Murillo, "PERC-Lite", 2020,
860	              <https://tools.ietf.org/html/draft-murillo-perc-lite-01>.

862	Authors' Addresses

864	   Emad Omara
865	   Google

867	   Email: emadomara@google.com

869	   Justin Uberti
870	   Google

872	   Email: juberti@google.com

874	   Alexandre GOUAILLARD
875	   CoSMo Software

877	   Email: Alex.GOUAILLARD@cosmosoftware.io

879	   Sergio Garcia Murillo
880	   CoSMo Software

882	   Email: sergio.garcia.murillo@cosmosoftware.io