DISPATCH                                                     A. Amirante
Internet-Draft                                      University of Napoli
Expires: June 17, 2013                                       T. Castaldi
                                                              L. Miniero
                                                                Meetecho
                                                            S P. Romano
                                                    University of Napoli
                                                       December 14, 2012

             Session Recording for Conferences using SMIL
                     draft-romano-dcon-recording-07

Abstract

   This document deals with session recording, specifically the
   recording of multimedia conferences, both centralized and
   distributed.  Each involved medium is recorded separately and is
   then properly tagged.
SMIL [W3C.CR-SMIL3-20080115] metadata is used to put all the separate
   recordings together and handle their synchronization, as well as the
   possibly asynchronous opening and closure of media within the
   context of a conference.  This SMIL metadata can subsequently be
   used by an interested user, by means of a compliant player, in order
   to passively receive a playout of the whole multimedia conference
   session.  The motivation for this document comes from our experience
   with our conferencing framework, Meetecho, for which we implemented
   a recording functionality.

Status of this Memo

   This Internet-Draft is submitted to IETF in full conformance with
   the provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on June 17, 2013.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.
Table of Contents

   1.  Introduction
   2.  Conventions
   3.  Terminology
   4.  Recording
       4.1.  Audio/Video
       4.2.  Chat
       4.3.  Slides
       4.4.  Whiteboard
   5.  Tagging
       5.1.  SMIL Head
       5.2.  SMIL Body
             5.2.1.  Audio/Video
             5.2.2.  Chat
             5.2.3.  Slides
             5.2.4.  Whiteboard
   6.  Playout
   7.  Security Considerations
   8.  Acknowledgements
   9.  References
   Authors' Addresses

1.  Introduction

   This document deals with session recording, specifically the
   recording of multimedia conferences, both centralized and
   distributed.  Each involved medium is recorded separately and is
   then properly tagged.  Such a functionality is required in many
   conferencing systems, and is of great interest to the XCON [RFC5239]
   Working Group.  The motivation for this document comes from our
   experience with our conferencing framework, Meetecho, for which we
   implemented a recording functionality.
Meetecho is a standards-based conferencing framework, and so we tried
   our best to implement recording in a standard fashion as well.

   In the approach presented in this document, SMIL
   [W3C.CR-SMIL3-20080115] metadata is used to put all the separate
   recordings together and handle their synchronization, as well as
   the possibly asynchronous opening and closure of media within the
   context of a conference.  This SMIL metadata can subsequently be
   used by an interested user, by means of a compliant player, in
   order to passively receive a playout of the whole multimedia
   conference session.

   The document presents the approach by sequentially describing the
   required steps.  In Section 4 the recording step is presented, with
   an overview of how each involved medium might be recorded and
   stored for future use.  As explained in the following sections,
   existing approaches might be exploited to achieve these steps
   (e.g., MEDIACTRL [RFC5567]).  Then, in Section 5 the tagging
   process is described, by showing how each medium can be addressed
   in a SMIL metadata file, with specific focus upon the timing and
   inter-media synchronization aspects.  Finally, Section 6 is devoted
   to describing how a potential player for the recorded session can
   be implemented and what it is supposed to achieve.

2.  Conventions

   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT
   RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as
   described in BCP 14, RFC 2119 [RFC2119] and indicate requirement
   levels for compliant implementations.

3.  Terminology

   TBD.

4.  Recording

   When a multimedia conference is realized over the Internet, several
   media might be involved at the same time.  Besides, these media
   might come and go asynchronously during the lifetime of the same
   conference.
This makes it quite clear that, in case such a conference needs to be
   recorded in order to allow a subsequent, possibly offline, playout,
   these media need to be recorded in a format that is aware of all
   the timing-related aspects.  A typical example is a videoconference
   with slide sharing.  While audio and video have a life of their
   own, slide changes might be triggered at a completely different
   pace.  Besides, the start of a slideshow might occur much later
   than the start of the audio/video session.  All these requirements
   must be taken into account when dealing with session recording in a
   conference.  Besides, it is important that all the individual
   recordings be taken in a standard fashion, in order to achieve the
   maximum compatibility among different solutions and avoid any
   proprietary mechanism or approach that could prevent a successful
   playout later on.

   In this document, we present our approach towards media recording
   in a conference.  Specifically, we will deal with the recording of
   the following media:

   o  audio and video streams (in Section 4.1);
   o  text chats (in Section 4.2);
   o  slide presentations (in Section 4.3);
   o  whiteboards (in Section 4.4).

   Additional media that might be involved in a conference (e.g.,
   desktop or application sharing) are not presented in this document,
   and their description is left to future extensions.

4.1.  Audio/Video

   In a conferencing system compliant with [RFC5239], audio and video
   streams contributed by participants are carried in RTP channels
   [RFC3550].  These RTP channels may or may not be secured (e.g., by
   means of SRTP/ZRTP).  Whether or not these channels are secured,
   however, is not an issue in this case.  In fact, as is usually the
   case, all the participants terminate their media streams at a
   central point (a mixer entity), with which they would have a
   secured connection.
This means that the mixer would get access to the unencrypted
   payloads, and would be able to mix and/or store them accordingly.

   From a high-level topology point of view, this is how a recorder
   for audio and video streams could be envisaged:

                  SIP    +------------+    SIP
            /------------|  XCON AS   |------------\
           /             +------------+             \
          /                    |MEDIACTRL            \
         /                     |                      \
    +-----+                 +-----+                 +-----+
    |     |       RTP       |     |       RTP       |     |
    |UA-A +<--------------->+Mixer+<--------------->+UA-B |
    |     |                 |     |                 |     |
    +-----+                 +-++--+                 +-----+
                              | |
             RTP UA-A         | |       RTP UA-B (Rx+Tx)
             (Rx+Tx)          V V
                         +----------+
                         |          |
                         | Recorder |
                         |          |
                         +----------+

                    Figure 1: Audio/Video Recorder

      [Editors' Note: this is a slightly modified version of the
      topology proposed on the DISPATCH mailing list,
      http://www.ietf.org/mail-archive/web/dispatch/current/
      msg00256.html
      where the Application Server has been specialized into an
      XCON-aware AS, and the AS<->Mixer protocol is the Media Control
      Channel Framework protocol (CFW) specified in [RFC6230].]

   That said, actually recording audio and video streams in a
   conference may be accomplished in several ways.  Two different
   approaches might be highlighted:

   1.  recording each contribution from/to each participant in a
       separate file (Figure 2);
   2.  recording the overall mix (one for audio and one for video, or
       more if several mixes for the same media type are available) in
       a dedicated file (Figure 3).
                               +-------+
                               | UAC-C |
                               +-------+
                                   "
                          C (RTP)  "
                                   "
                                   "
                                   v
   +-------+      A (RTP)     +----------+      B (RTP)     +-------+
   | UAC-A |=================>| Recorder |<=================| UAC-B |
   +-------+                  +----------+                  +-------+
                                   *
                                   *
                                   *
                                   ****> A.gsm, A.h263
                                   ****> B.g711, B.h264
                                   ****> C.amr

                 Figure 2: Recording individual streams

                               +-------+
                               | UAC-C |
                               +-------+
                                   "
                          C (RTP)  "
                                   "
                                   "
                                   v
   +-------+      A (RTP)     +----------+      B (RTP)     +-------+
   | UAC-A |=================>| Recorder |<=================| UAC-B |
   +-------+                  +----------+                  +-------+
                                   *
                                   *
                                   *
                                   ****> (A+B+C).wav, (A+B+C).h263

                   Figure 3: Recording mixed streams

   Of the two, the second is probably more feasible.  In fact, the
   first would require a potentially vast amount of separate
   recordings, which would need to be subsequently muxed and
   correlated to each other.  Besides, within the context of a
   multimedia conference, most of the times the streams are already
   mixed for all the participants, and so recording the mix directly
   would be a clear advantage.  Such an approach, of course, assumes
   that all the streams pass through a central point where the mixing
   occurs: it is the case depicted in Figure 1.  The recording would
   take place at that point.  Such a central point, the mixer (which
   in this case would also act as the recorder, or as a frontend to
   it), might be a MEDIACTRL-based [RFC5567] Media Server.
   Considering the similar nature of audio and video (both being RTP
   based and mixed by probably the same entity), they are analysed in
   the same section of this document.  The same applies to tagging and
   playout as well.  It is important to note that, in case any policy
   is involved (e.g., moderation by means of the BFCP [RFC4582]), the
   mixer would take it into account when recording.
In fact, the same policies applied to the actual conference with
   respect to the delivery of audio and video to the participants need
   to be enforced for the recording as well.

   In a more general way, if the mixer does not support direct
   recording of the mixes it prepares, recording a mix can be achieved
   by attaching the recorder entity (whatever it is) as a passive
   participant to the conference.  This would allow the recorder to
   receive all the involved audio and video streams already properly
   mixed, with policies already taken into consideration.  This
   approach is depicted in Figure 4.

                               +-------+
                               |  UAC  |
                               |   C   |
                               +-------+
                                 "   ^
                        C (RTP)  "   "
                                 "   "
                                 "   "  A+B (RTP)
                                 v   "
   +-------+     A (RTP)     +--------+     A+C (RTP)     +-------+
   |  UAC  |================>| Media  |==================>|  UAC  |
   |   A   |<===============| Server |<==================|   B   |
   +-------+    B+C (RTP)    +--------+      B (RTP)      +-------+
                                 "
                                 "
                                 "  A+B+C (RTP)
                                 "
                                 v
                            +----------+
                            | Recorder |
                            +----------+
                                 *
                                 ****> (A+B+C).wav, (A+B+C).h263

             Figure 4: Recorder as a passive participant

   Whether or not the mixer is MEDIACTRL-based, it is quite likely
   that the AS handling the multimedia conference business logic has
   some control over the mixing involved.  This means it can request
   the recording of each available audio and/or video mix in a
   conference, if only by adding the passive participant as mentioned
   above.  Besides, events occurring at the media level, or in the
   business logic of the AS itself, allow the AS to take note of
   timing information for each of the recorded media.  For instance,
   the AS may take note of when the video mixing started, in order to
   properly tag the video recording in the tagging phase.
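As a non-normative illustration (all names here are our own, not part of any protocol), the timing events list kept by the AS might be modeled as follows; each entry records when a medium appeared or disappeared, as an offset from the start of the conference:

```python
# Illustrative sketch only: a minimal event log an AS might keep while a
# conference is being recorded, for later use in the SMIL tagging phase.
from dataclasses import dataclass, field

@dataclass
class MediaEvent:
    medium: str      # e.g. "audio", "video", "slides"
    event: str       # e.g. "start", "stop"
    offset: float    # seconds since the conference started

@dataclass
class RecordingLog:
    events: list = field(default_factory=list)

    def note(self, medium: str, event: str, offset: float) -> None:
        self.events.append(MediaEvent(medium, event, offset))

    def begin(self, medium: str) -> float:
        """Offset at which a medium first appeared (maps to SMIL 'begin')."""
        return next(e.offset for e in self.events
                    if e.medium == medium and e.event == "start")

log = RecordingLog()
log.note("audio", "start", 0.0)
log.note("video", "start", 2.5)    # video mixing started later
log.note("slides", "start", 56.0)  # cf. the slide timing example in Section 4.3
```

The `begin()` offsets gathered this way would later be mapped onto the timing attributes of the corresponding SMIL media elements during the tagging phase.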
Both the recordings and the timing events list would subsequently be
   used in order to prepare the metadata information of the audio and
   video in the overall session recording description.  Such a phase
   is described in Section 5.2.1.

   In a MEDIACTRL Media Server, such a functionality might be
   accomplished by means of the Mixer Control Package
   [I-D.ietf-mediactrl-mixer-control-package].  At the end of the
   conference, URLs to the actual recordings would be made available
   for the AS to use.  The AS might then subsequently access those
   recordings according to its business logic, e.g., to store them
   somewhere else (the MS storage might be temporary) or to implement
   an offline transcoding and/or mixing of all the recordings in order
   to obtain a single file representative of the whole audio/video
   participation in the conference.  Practical examples of these
   scenarios are presented in [I-D.ietf-mediactrl-call-flows].

   Of course, if the recording of a mix is not possible or desired,
   one could still fall back to the first approach, that is,
   individually recording all the incoming contributions.  It is the
   case, for instance, of conferencing systems which don't implement
   video mixing, but instead just rely on switching/forwarding the
   potentially numerous video streams to each participant.  This
   functionality can also be achieved by means of the same control
   package previously introduced, since it allows for the recording of
   both mixes and individual connections.  Once the conference ends,
   the AS can then decide what to do with the recordings, e.g., mixing
   them all together offline (thus obtaining an overall mix) or
   leaving them as they are.  The tagging process would then
   subsequently take the decision into account, and address the
   resulting media accordingly.

4.2.  Chat

   What has been said about audio and video partially applies to text
   chats as well.
In fact, just as a central mixer is usually involved for audio and
   video, for instant messaging the contributions by all participants
   most of the times pass through a central node, from where they are
   forwarded to the other participants.  It is the case, for instance,
   of XMPP [RFC3920] and MSRP [RFC4975] based text conferences.  If
   so, recording the text part of a conference is not hard to achieve
   either.  The AS just needs to implement some form of logging, in
   order to store all the messages flowing through the text conference
   central node, together with information on the senders of these
   messages and timing-related information.  Of course, the AS may not
   directly be the text conference mixer: the same considerations
   apply, however, in the sense that the remote mixer must be able to
   implement the aforementioned logging, and must be able to receive
   related instructions from the controlling AS.  Besides, considering
   the possibly protocol-agnostic nature of the conferencing system
   (as envisaged in [RFC5239]), several different instant messaging
   protocols may be involved in the same conference.  Just as the
   conferencing system would act as a protocol gateway during the
   lifetime of the conference (i.e., provide MSRP users with the text
   coming from XMPP participants and vice versa), all the
   contributions coming from the different instant messaging protocols
   would need to be recorded in the same log, and in the same format,
   to avoid ambiguity later on.

   An example of a recorder for instant messaging is presented in
   Figure 5.

                                 +-------+
                                 | UAC-C |
                                 +-------+
                                     ^
                                     "
                         C (MSRP)    "  '10.11.24 Hi!'
                                     "
                                     v
   +-------+      A (XMPP)      +----------+      B (IRC)      +-------+
   | UAC-A |<==================>| Recorder |<=================>| UAC-B |
   +-------+  '10.11.26 Hey C'  +----------+ '10.11.30 Hey man'+-------+
                                     *
                                     *
                                     *     [..]
                                     ****> 10.11.24 Hi!
                                     ****> 10.11.26 Hey C
                                     ****> 10.11.30 Hey man
                                           [..]

                 Figure 5: Recording a text conference

   The same considerations already mentioned about the optional
   policies involved apply to text conferences as well: i.e., if a UAC
   is not allowed to contribute text to the chat, its contribution is
   excluded both from the mix the other participants receive and from
   the ongoing recording.

   Considerations about the format of the recording are left to
   Section 5.2.2.  Until then, we just assume the AS has a way to
   record text conferences in a format it is familiar with.  This
   format would subsequently be converted to another, standard, format
   that a player would be able to access.

4.3.  Slides

   Another medium typically available in a multimedia conference over
   the Internet is the slide presentation.  In fact, slides, whatever
   format they are in, are still the most common way of presenting
   something within a collaboration framework.  The problem is that,
   most of the times, these slides are deployed in a proprietary way
   (e.g., Microsoft PowerPoint and the like).  This means that,
   besides the recording aspect of the issue, the delivery itself of
   such slides can be problematic when considered in a standards-based
   conferencing framework.

   Considering that no standard way of implementing such a
   functionality is commonly available yet, we assume that a
   conferencing framework makes such slides available to the
   participants in a conference as a slideshow, that is, a series of
   static images whose appearance might be dictated by a dedicated
   protocol.  For instance, a presenter may trigger the change of a
   slide by means of an instant messaging protocol, providing each
   authorized participant with a URL from where to get the current
   slide, with optional metadata to describe its content.

   An example is presented in Figure 6.
The presenter has previously uploaded a presentation in some
   proprietary format.  The presentation has been converted to images,
   and a description of the new format has been sent back to the
   presenter (e.g., as XML metadata).  At this point, the presenter
   makes use of XMPP to inform the other participants about the
   current slide, by providing an HTTP URL to the related image.

                            +-----------+
                            | Presenter |
                            +-----------+
                                  "
                          (XMPP)  "  Current presentation: f44gf
                                  "  Current slide number: 4
                                  "  URL: http://example.com/f44gf/4.jpg
                                  "
                                  v
   +-------+      (XMPP)     +----------+      (XMPP)      +-------+
   | UAC-A |<================| ConfServ |================>| UAC-B |
   +-------+                 +----------+                 +-------+
       |                                                      |
       | HTTP GET (http://example.com/f44gf/4.jpg)            |
       v               HTTP GET (http://example.com/f44gf/4.jpg)
                                                              v

               Figure 6: Presentation sharing via XMPP

   From this assumption, the recording of each slide presentation
   would be relatively trivial to achieve.  In fact, the AS would just
   need to have access to the set of images (with the optional
   metadata involved) of each presentation, and to the additional
   information related to presenters and to when each slide was
   triggered.  For instance, the AS may take note of the fact that
   slide 4 from presentation "f44gf" of the example above has been
   presented by UAC "spromano" from second 56 of the conference to
   second 302.  Properly recording all those events would allow for
   subsequent tagging, thus allowing for the integration of this
   medium in the whole session recording description, together with
   the other media involved.  This phase will be described in
   Section 5.2.3.

4.4.  Whiteboard

   To conclude the overview on the analysed media, we consider a
   further medium which is quite commonly deployed in multimedia
   conferences: the shared whiteboard.  There are several ways of
   implementing such a functionality.
While some standard solutions exist, they are rarely used within the
   context of commercial conferencing applications, which usually
   prefer to implement this functionality in a proprietary fashion.

   Without delving into a discussion on this aspect, suffice it to say
   that, for a successful recording of a whiteboard session, most of
   the times it is enough to just record the individual contributions
   of each involved participant (together with the usual
   timing-related information).  In fact, this would allow for a
   subsequent replay of the whiteboard session in an easy way.  Unlike
   audio and video, whiteboarding is usually a very lightweight
   medium, and so recording the individual contributions rather than
   the resulting mix (as we suggested in Section 4.1) is advisable.
   These contributions may subsequently be mixed together in order to
   obtain a standard recording (e.g., a series of images, animations,
   or even a low framerate video).  An example of recording for this
   medium is presented in Figure 7.

                                +-------+
                                | UAC-C |
                                +-------+
                                    "
                        C (XMPP)    "  10.11.20: line
                                    "
                                    "
                                    v
   +-------+      A (XMPP)    +-----------+     B (XMPP)    +-------+
   | UAC-A |=================>| WB server |<================| UAC-B |
   +-------+ 10.10.56: circle +-----------+ 10.12.30: text  +-------+
                                    *
                                    *
                                    *
                                    ****> 10.10.56: circle (A)
                                    ****> 10.11.20: line   (C)
                                    ****> 10.12.30: text   (B)

                Figure 7: Recording a whiteboard session

   The recording process may be enriched by the population of a
   parallel event list.  For instance, such events might include the
   creation of a new whiteboard, the clearing of an existing
   whiteboard, or the addition of a background image that replaced the
   previously existing content.  Such events would be precious in a
   subsequent playout of the recorded steps, since they would allow
   for a more lightweight replication in case seeking is involved.
For instance, if 70 drawings have been made, but at second 560 of the
   conference the whiteboard was cleared and since then only 5
   drawings have been added, a viewer seeking to second 561 would just
   need the clear event plus those 5 drawings to be replicated.
   Further discussion of the tagging process for this medium is
   presented in Section 5.2.4.

5.  Tagging

   Once the different media have been recorded and stored, and their
   timing somehow related, this information needs to be properly
   tagged in order to allow intra-media and inter-media
   synchronization in case a playout is invoked.  Besides, it would be
   desirable to make use of standard means for achieving such a
   functionality.  For these reasons, we chose to make use of the
   Synchronized Multimedia Integration Language
   [W3C.CR-SMIL3-20080115], which fulfills all the aforementioned
   requirements, besides being a well-established W3C standard.  In
   fact, timing information is very easy to address using this
   specification, and VCR-like controls (start, pause, stop, rewind,
   fast forward, seek and the like) are all easily deployable in a
   player using the format.

   The SMIL specification provides means to address different media by
   using dedicated tags (e.g., audio, img, textstream and so on), and
   for each of these media the related timing can be easily described.
   The following subsections will describe how SMIL metadata could be
   prepared in order to map to the media recorded as described in
   Section 4.

   Specifically, considering how a SMIL file is assumed to be
   constructed, the head will be described in Section 5.1, while the
   body (with a different focus for each medium) will be presented in
   Section 5.2.

5.1.  SMIL Head

   As specified in [W3C.CR-SMIL3-20080115], a SMIL file is composed of
   two separate sections: a head and a body.
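Before detailing the two sections, an illustrative skeleton may help to fix ideas; this sketch is ours and is not taken from an actual Meetecho recording:

```xml
<!-- Illustrative SMIL 3.0 skeleton: a head (layout) and a body (timed media) -->
<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0" baseProfile="Language">
  <head>
    <layout>
      <!-- regions for the involved media are declared here (Section 5.1) -->
    </layout>
  </head>
  <body>
    <!-- timed media elements referencing the recordings go here (Section 5.2) -->
  </body>
</smil>
```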
The head, among all the needed information, also includes details
   about the allowed layouts for a multimedia presentation.
   Considering the amount of media that might have been involved in a
   single conference, properly constructing such a section definitely
   makes much sense.  In fact, all the involved media need to be
   placed so as not to prevent access to other concurrent media within
   the context of the same recording.

   For instance, this is how a series of different media might be
   placed in a layout according to different screen resolutions:

   [SMIL layout example not recoverable from the source text.]

   That said, it is important that this section of the SMIL file be
   constructed properly.  In fact, the layout description also
   contains explicit region identifiers, which are referred to when
   describing media in the body section.

   TBD. (?)

5.2.  SMIL Body

   The SMIL head section described previously is very important for
   what concerns presentation-related settings, but does not contain
   any timing-related information.  Such information, in fact, belongs
   to a separate section of the SMIL file, the so-called body.  This
   body contains information on all the media involved in the recorded
   session, and for each medium timing information is provided.  This
   timing information includes not only when each medium appears and
   when it goes away, but also details on the medium's lifetime as
   well.  By correlating the timing information for each medium, a
   SMIL reader can infer inter-media synchronization and present the
   recorded session as it was conceived to appear.

   Besides, the involved media can be grouped in the body in order to
   implement sequential and/or parallel playback involving a subset of
   the available media.  This is made possible by making use of the
   <seq> and <par> elements.
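As an illustrative sketch (the region identifiers and file names are invented for the example), concurrent media from a recorded conference could be grouped under a single parallel container:

```xml
<!-- Sketch: audio, video and a chat text stream played back in parallel -->
<par>
  <audio      src="conference.wav"  begin="0s"   region="audio_region"/>
  <video      src="conference.h263" begin="2.5s" region="video_region"/>
  <textstream src="chat.rt"         begin="0s"   region="chat_region"/>
</par>
```

The "begin" attributes carry the per-medium offsets collected during recording, while the "region" attributes refer back to the identifiers declared in the head layout.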
The <par> element in particular is of great interest to this
   document, since in a multimedia conference many media are presented
   to participants at the same time.

   That said, it is important to be able to separately address each
   involved medium.  To do so, SMIL makes use of well-specified
   elements.  For instance, a