idnits 2.17.1 

draft-ietf-avt-rtp-3gpp-timed-text-15.txt:

  Checking boilerplate required by RFC 5378 and the IETF Trust (see
  https://trustee.ietf.org/license-info):
  ----------------------------------------------------------------------------

  ** It looks like you're using RFC 3978 boilerplate.  You should update this
     to the boilerplate described in the IETF Trust License Policy document
     (see https://trustee.ietf.org/license-info), which is required now.

  -- Found old boilerplate from RFC 3978, Section 5.1 on line 14.

  -- Found old boilerplate from RFC 3978, Section 5.5 on line 2923.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 1 on line 2896.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 2 on line 2903.

  -- Found old boilerplate from RFC 3979, Section 5, paragraph 3 on line 2909.

  ** This document has an original RFC 3978 Section 5.4 Copyright Line,
     instead of the newer IETF Trust Copyright according to RFC 4748.

  ** This document has an original RFC 3978 Section 5.5 Disclaimer, instead
     of the newer disclaimer which includes the IETF Trust according to RFC
     4748.


  Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt:
  ----------------------------------------------------------------------------

  == No 'Intended status' indicated for this document; assuming Proposed
     Standard


  Checking nits according to https://www.ietf.org/id-info/checklist :
  ----------------------------------------------------------------------------

     No issues found here.

  Miscellaneous warnings:
  ----------------------------------------------------------------------------

  == The copyright year in the RFC 3978 Section 5.4 Copyright Line does not
     match the current year

  -- The exact meaning of the all-uppercase expression 'NOT REQUIRED' is not
     defined in RFC 2119.  If it is intended as a requirements expression, it
     should be rewritten using one of the combinations defined in RFC 2119;
     otherwise it should not be all-uppercase.

  -- The document seems to lack a disclaimer for pre-RFC5378 work, but may
     have content which was first submitted before 10 November 2008.  If you
     have contacted all the original authors and they are all willing to grant
     the BCP78 rights to the IETF Trust, then this is fine, and you can ignore
     this comment.  If not, you may need to add the pre-RFC5378 disclaimer. 
     (See the Legal Provisions document at
     https://trustee.ietf.org/license-info for more information.)

  -- The document date (June 13, 2005) is 6885 days in the past.  Is this
     intentional?


  Checking references for intended status: Proposed Standard
  ----------------------------------------------------------------------------

     (See RFCs 3967 and 4897 for information about using normative references
     to lower-maturity documents in RFCs)

  == Missing Reference: '129' is mentioned on line 772, but not defined

  == Missing Reference: '254' is mentioned on line 772, but not defined

  == Missing Reference: '0' is mentioned on line 1236, but not defined

  == Missing Reference: '127' is mentioned on line 1236, but not defined

  == Missing Reference: '68' is mentioned on line 1233, but not defined

  == Missing Reference: '69' is mentioned on line 1234, but not defined

  == Missing Reference: '71' is mentioned on line 1236, but not defined

  -- Looks like a reference, but probably isn't: 'SampleContents' on line 1929

  == Unused Reference: '22' is defined on line 2765, but no explicit
     reference was found in the text

  == Unused Reference: '23' is defined on line 2770, but no explicit
     reference was found in the text

  -- Possible downref: Non-RFC (?) normative reference: ref. '1'

  -- Possible downref: Non-RFC (?) normative reference: ref. '2'

  ** Obsolete normative reference: RFC 2327 (ref. '4') (Obsoleted by RFC 4566)

  ** Obsolete normative reference: RFC 3548 (ref. '6') (Obsoleted by RFC 4648)

  -- Obsolete informational reference (is this intentional?): RFC 2733 (ref.
     '7') (Obsoleted by RFC 5109)

  == Outdated reference: A later version (-12) exists of
     draft-ietf-avt-rtp-retransmission-11

  -- Obsolete informational reference (is this intentional?): RFC 2326 (ref.
     '15') (Obsoleted by RFC 7826)

  -- Obsolete informational reference (is this intentional?): RFC 2044 (ref.
     '18') (Obsoleted by RFC 2279)

  -- Obsolete informational reference (is this intentional?): RFC 3267 (ref.
     '22') (Obsoleted by RFC 4867)

  -- Obsolete informational reference (is this intentional?): RFC 3016 (ref.
     '23') (Obsoleted by RFC 6416)

  -- Obsolete informational reference (is this intentional?): RFC 2793 (ref.
     '24') (Obsoleted by RFC 4103)

  -- Obsolete informational reference (is this intentional?): RFC 3555 (ref.
     '30') (Obsoleted by RFC 4855, RFC 4856)

  == Outdated reference: A later version (-05) exists of
     draft-freed-media-type-reg-04


     Summary: 5 errors (**), 0 flaws (~~), 13 warnings (==), 18 comments (--).

     Run idnits with the --verbose option for more detailed information about
     the items above.

--------------------------------------------------------------------------------


2	   Internet Draft                                                 J. Rey
3	   draft-ietf-avt-rtp-3gpp-timed-text-15.txt                   Y. Matsui
4	                                                               Panasonic
5	   Expires: December 13, 2005                              June 13, 2005

7	                  RTP Payload Format for 3GPP Timed Text

9	   Status of this Memo

11	   By submitting this Internet-Draft, each author represents that any
12	   applicable patent or other IPR claims of which he or she is aware
13	   have been or will be disclosed, and any of which he or she becomes
14	   aware will be disclosed, in accordance with Section 6 of BCP 79.

16	   Internet-Drafts are working documents of the Internet Engineering
17	   Task Force (IETF), its areas, and its working groups. Note that other
18	   groups may also distribute working documents as Internet-Drafts.

20	   Internet-Drafts are draft documents valid for a maximum of six months
21	   and may be updated, replaced, or obsoleted by other documents at any
22	   time. It is inappropriate to use Internet-Drafts as reference
23	   material or to cite them other than as "work in progress."

25	   The list of current Internet-Drafts can be accessed at
26	   http://www.ietf.org/1id-abstracts.txt

28	   The list of Internet-Draft Shadow Directories can be accessed at
29	   http://www.ietf.org/shadow.html

31	   Abstract

33	   This document specifies an RTP payload format for the transmission of
34	   3GPP (3rd Generation Partnership Project) timed text.  3GPP timed
35	   text is a time-lined decorated text media format with defined storage
36	   in a 3GP file.  Timed Text can be synchronized with audio/video
37	   contents and used in applications such as captioning, titling and
38	   multimedia presentations.  In the following sections the problems of
39	   streaming timed text are addressed and a payload format for streaming
40	   3GPP timed text over RTP is specified.

42	   Table of Contents

44	   1. Introduction
45	   2. Motivation, Requirements and Design Rationale
46	    2.1. Motivation
47	    2.2. Basic Components of the 3GPP Timed Text Media Format
48	    2.3. Requirements
49	    2.4. Limitations
50	    2.5. Design Rationale
51	   3. Terminology
52	   4. RTP Payload Format for 3GPP Timed Text
53	    4.1. Payload Header Definitions
54	     4.1.1. Common Payload Header Fields
55	     4.1.2. TYPE 1 Header
56	     4.1.3. TYPE 2 Header
57	     4.1.4. TYPE 3 Header
58	     4.1.5. TYPE 4 Header
59	     4.1.6. TYPE 5 Header
60	    4.2. Buffering of Sample Descriptions
61	     4.2.1. Dynamic SIDX wrap-around mechanism
62	    4.3. Finding payload header values in 3GP files
63	    4.4. Fragmentation of Timed Text Samples
64	    4.5. Reassembling Text Samples at the Receiver
65	    4.6. On Aggregate Payloads
66	    4.7. Payload Examples
67	    4.8. Relation to RFC 3640
68	    4.9. Relation to RFC 2793
69	   5. Resilient Transport
70	   6. Congestion control
71	   7. Scene Description
72	    7.1. Text Rendering Position and Composition
73	    7.2. SMIL usage
74	    7.3. Finding layout values in a 3GP file
75	   8. 3GPP Timed Text Media Type
76	   9. SDP usage
77	    9.1. Mapping to SDP
78	    9.2. Parameter Usage in the SDP Offer/Answer Model
79	     9.2.1. Unicast Usage
80	     9.2.2. Multicast Usage
81	    9.3. Offer/Answer Examples
82	    9.4. Parameter Usage outside of Offer/Answer
83	   10. IANA Considerations
84	   11. Security considerations
85	   12. References
86	    12.1. Normative References
87	    12.2. Informative References
88	   13. Annexes
89	    13.1. Basics of the 3GP File Structure
90	   14. Acknowledgements
91	   15. Authors' Addresses
92	   16. IPR Notices
93	   17. Full Copyright Statement

95	   [Note to the RFC Editor:
96	    - Please replace "RFCXXXX" with the RFC designation of this document
97	      when published,
98	    - Please substitute "draft-ietf-..." references with the
99	      corresponding RFC number if available at the time of publication]

101	1. Introduction

103	   3GPP timed text is a media format for time-lined decorated text
104	   specified in the 3GPP Technical Specification TS 26.245 "Transparent
105	   end-to-end packet switched streaming service (PSS); Timed Text Format
106	   (Release 6)" [1].  Besides plain text, the 3GPP timed text format
107	   allows the creation of decorated text, e.g., for karaoke applications,
108	   scrolling text for newscasts or hyperlinked text.  These contents may
109	   or may not be synchronized with other media, like audio or video.

111	   The purpose of this draft is to provide a means to stream 3GPP timed
112	   text contents using RTP [3].  This includes the streaming of timed
113	   text being read out of a (3GP) file as well as the streaming of timed
114	   text generated in real-time, a.k.a. live streaming.

116	   Section 2 contains the motivation of this document, an overview of
117	   the media format, the requirements and the design rationale.  Section
118	   3 defines the terminology used.  Section 4 specifies the payload
119	   headers, the fragmentation and re-assembly rules for text samples,
120	   the rules for payload aggregation and the relations of this document
121	   to RFC 3640 [12] and RFC 2793 [24].  Section 5 specifies some simple
122	   schemes for resilient transport and gives pointers to other possible
123	   mechanisms.  Section 6 addresses congestion control.  Section 7
124	   specifies scene description.  Section 8 defines the media type.
125	   Section 9 specifies SDP for unicast and multicast sessions, including
126	   usage in the Offer / Answer model [13].  Sections 10 and 11 address
127	   IANA and security considerations.  Section 12 lists references.
128	   Annexes are included as Section 13.

130	2. Motivation, Requirements and Design Rationale

132	2.1. Motivation

134	   The 3GPP timed text format was developed for use in the services
135	   specified in the 3GPP Transparent End-to-end Packet-switched
136	   Streaming Services (3GPP PSS) specification [16].

138	   As of today, PSS allows downloading 3GPP timed text contents stored
139	   in 3GP files.  However, due to the lack of an RTP payload format, it
140	   is not possible to stream 3GPP timed text contents over RTP.

142	   This document specifies such a payload format.

144	2.2. Basic Components of the 3GPP Timed Text Media Format

146	   Before going into the details of the design, it is necessary to have
147	   knowledge about how the media format is constructed.  We can identify
148	   four differentiated functional components: layout information,
149	   default formatting, text strings and decoration.  Below, we briefly
150	   explain these and match them to their designations in a 3GP
151	   file:

153	        o Initial spatial layout information related to the text
154	          strings: these are the height and width of the text region
155	          where text is displayed, the position of the text region in
156	          the display and the layer or proximity of the text to the
157	          user.  In 3GP files, this information is contained in the
158	          Track Header Box (3GP file designations are capitalized for
159	          clarity).

161	        o Default settings for formatting and positioning of text:
162	          style (font, size, colour,...), background colour, horizontal
163	          and vertical justification, line width, scrolling, etcetera.
164	          For 3GP files, this corresponds to the Sample Descriptions.

166	        o The actual text strings: characters encoded using either
167	          UTF-8 [18] or UTF-16 [19], and

169	        o The decoration: if some characters have a different style,
170	          delay, blink, etcetera, this needs to be indicated.  The
171	          decoration is only present in the text samples if it is
172	          actually needed.  Otherwise, the default settings as above
173	          apply.  In 3GP files, text strings and decoration are stored
174	          inside the Text Samples, i.e., Modifier Boxes are appended to
175	          the text strings, if needed.  At the time of writing this
176	          payload format, the following modifiers are specified in the
177	          3GPP timed text media format specification [1]:

179	           - text highlight,
180	           - highlight color,
181	           - blinking text,
182	           - karaoke feature,
183	           - hyperlink,
184	           - text delay,
185	           - text style,
186	           - positioning of the text box and,
187	           - text wrap indication.

189	2.3. Requirements

191	   Once the basic components are known, it is necessary to define which
192	   requirements the payload format shall fulfill:

194	     1. It shall enable both live streaming and streaming from a 3GP
195	        file.

197	                Informative note: for the purpose of this document, the
198	                term live streaming refers to those scenarios where the
199	                timed text stream is sent from a live encoder.  Upon
200	                reception the content may or may not be stored in a 3GP
201	                file.  Typically, in live streaming applications, the
202	                sender encapsulates the timed text content in RTP
203	                packets following the guidelines given in this document.
204	                At the receiving side, a buffer is used to cancel the
205	                network delay and delay jitter.  If receiver and sender
206	                support packet loss resilience mechanisms (see Section
207	                5) it may also be possible to recover from packet
208	                losses.  Note that how sender and receiver actually
209	                manage and dimension the buffers is an implementation
210	                design choice.

212	     2. Furthermore, it shall be possible for an RTP receiver using this
213	        payload format, and capable of storing in 3GP format, to obtain
214	        all necessary information from the RTP packets for storing the
215	        received text contents according to the 3GP file format.  This
216	        file may or may not be the same as the original file.

218	                Informative note: the 3GP file format itself is based on
219	                the ISO Base Media File Format recommendation [2].
220	                Section 13.1 gives some insight into the 3GP file
221	                structure.  Further, Sections 4.3 and 7.3 specify where
222	                the information needed for filling in payload headers is
223	                found in a 3GP file.  For live streaming, appropriate
224	                values complying with the format and units described in
225	                [1] shall be used.  Where needed, clarifications on
226	                appropriate values are given in this document.

228	     3. It shall enable efficient and resilient transport of timed text
229	        contents over RTP.  In particular:

231	          a. Enable the transmission of the sample descriptions both by
232	             out-of-band and in-band means.  Sample descriptions are
233	             important information, which potentially apply to several
234	             text samples.  These default formatting settings are
235	             typically transmitted out-of-band (reliably) once at the
236	             initialization phase.  If additional sample descriptions
237	             are needed in the course of a session, these may be sent
238	             also out-of-band or in-band.  In-band transmission,
239	             although unreliable, may be more appropriate for sending
240	             sample descriptions if these should be sent frequently, as
241	             opposed to establishing an additional communication channel
242	             for SDP, for example.  It is also useful in cases where an
243	             out-of-band channel may not be available and for live
244	             streaming, where contents are not known a priori.  Thus,
245	             the payload format shall enable out-of-band and in-band
246	             transmission of sample descriptions.  Section 4.1.6
247	             specifies a payload header for transmitting sample
248	             descriptions in-band.  Section 9 specifies how sample
249	             descriptions are mapped to SDP.

251	          b. Enable the fragmentation of a text sample into several RTP
252	             packets in order to cover a wide range of applications and
253	             network environments.  In general, fragmentation should be
254	             a rare event given the low bit rates and relatively small
255	             text sample sizes.  However, the 3GPP Timed Text media
256	             format does allow for larger text samples.  Therefore, the
257	             payload format shall take this into account and provide a
258	             means for coping with fragmentation and reassembly.
259	             Section 4.4 deals with fragmentation.

261	          c. Enable the aggregation of units into an RTP packet for
262	             making the transport more efficient.  In a mobile
263	             communication environment a typical text sample size is
264	             around 100-200 bytes.  If the available bit rate and the
265	             packet size allow it, units should be aggregated into one
266	             RTP packet.  Section 4.6 deals with aggregation.

268	          d. Enable the use of resilient transport mechanisms, such as
269	             repetition, retransmission [11] and FEC [7] (see Section
270	             5.)  For a more general discussion, refer to RFC 2354 [8],
271	             which discusses available mechanisms for stream repair.

273	2.4. Limitations

275	     The payload headers have been optimized in size for RTP.  Instead
276	     of using 32-bit (S)LEN, SDUR, SIDX header fields which would carry
277	     many unused bits much of the time, it has been a design choice to
278	     reduce the size of these fields.  As a consequence, this payload
279	     format has reduced maximum values with respect to sizes and
280	     durations of (text) samples and sample descriptions.  These maximum
281	     values differ from those allowed in 3GP files, where they are
282	     expressed using 32-bit (unsigned) integers.  In some cases
283	     extension mechanisms are provided to deal with larger values.
284	     However, it is noted that the values used here should be enough for
285	     the streaming applications targeted.

287	     The following limitations apply:

289	     1. The maximum size of text samples carried in RTP packets is
290	        restricted to be a 16-bit (unsigned) integer (this includes the
291	        text strings and modifiers).  This means a maximum size for the
292	        unit would be about 64 Kbytes.  No extension mechanism is
293	        provided.

295	     2. The sample description index values are restricted to be an
296	        (unsigned) 8-bit integer.  An extension mechanism is given in
297	        Section 4.3.

299	     3. The text sample duration is restricted to be a 24-bit (unsigned)
300	        integer.  This yields a maximum duration at a timestamp
301	        clockrate of 1000 Hz of about 4.6 hours.  Nevertheless, an
302	        extension mechanism is provided in Section 4.3.

304	     4. Sample descriptions are also restricted in size: if the size
305	        cannot be expressed as a (unsigned) 16-bit integer, the sample
306	        description shall not be conveyed.  As in the case of the sample
307	        size, no extension mechanism is provided.

309	     5. A further limitation concerns the UTF-16 encodings supported:
310	        only transport of text strings following big endian byte order
311	        is supported.  See Section 4.1.1 for details.
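The limits listed above follow directly from the field widths. The following Python sketch is illustrative only and not part of this specification; the constant and function names are hypothetical. It computes the maxima and checks whether a given set of values fits the fields without using any extension mechanism.

```python
# Illustrative sketch only: all names below are hypothetical and are not
# defined by this document. Values are derived from the field widths
# stated in the limitations list.

MAX_SAMPLE_SIZE = 2**16 - 1              # 16-bit unsigned: text sample size in bytes
MAX_SIDX = 2**8 - 1                      # 8-bit unsigned: sample description index
MAX_SDUR = 2**24 - 1                     # 24-bit unsigned: duration in timestamp units
MAX_SAMPLE_DESCRIPTION_SIZE = 2**16 - 1  # 16-bit unsigned: sample description size


def max_duration_hours(clock_rate_hz: int) -> float:
    """Longest duration expressible in the 24-bit field, in hours."""
    return MAX_SDUR / clock_rate_hz / 3600.0


def fits_field_widths(sample_size: int, sidx: int, sdur: int) -> bool:
    """True if the values fit the field widths (no extension mechanism)."""
    return (0 <= sample_size <= MAX_SAMPLE_SIZE
            and 0 <= sidx <= MAX_SIDX
            and 0 <= sdur <= MAX_SDUR)


if __name__ == "__main__":
    # At a 1000 Hz timestamp clock rate this is roughly 4.66 hours,
    # in line with the "about 4.6 hours" figure quoted above.
    print(f"{max_duration_hours(1000):.2f} hours")
```

The check only covers the numeric range of each field; any further restrictions on individual values are defined in the sections that specify the headers.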

313	2.5. Design Rationale

315	   The following design choices were made:

317	     1. 'Unit' approach: the payload formats specified in this draft
318	        follow a simple scheme: a 3-byte common header (Common Payload
319	        Header) followed by a specific header for each text sample
320	        (fragment) type.  Following these headers, the text sample
321	        contents are placed (Section 4.1.1 and following).  This
322	        structure is called a 'unit'.

324	        The following units have been devised to comply with the
325	        requirements mentioned in Section 2.3:

327	          a. A TYPE 1 unit that contains one complete text sample,

329	          b. A TYPE 2 unit that contains a complete text string or a
330	             fragment thereof,

332	          c. A TYPE 3 unit that contains the complete modifiers or only
333	             the first fragment thereof,

335	          d. A TYPE 4 unit that contains one modifier fragment other
336	             than the first and,

338	          e. A TYPE 5 unit that contains one sample description.

340	        This 'unit' approach was motivated by the following reasons:

342	              1. Allows a simple classification of the text samples and
343	                text sample fragments that can be conveyed by the
344	                payload format.

346	              2. Enables easy interoperability with RFC 3640 [12].
347	                During the development of this payload format, interest
348	                was shown from MPEG-4 standardization participants in
349	                developing a common payload structure for the transport
350	                of 3GPP Timed Text.  While interoperability is not
351	                strictly necessary for this payload format to work, it
352	                has been pursued in this payload format.  Section 4.8
353	                explains how this is done.

355	     2. Character count is not implemented.  This payload format does
356	        detect lost text sample fragments, but it does not enable an RTP
357	        receiver to find out the exact number of text characters lost.
358	        In fact, the fragment size included in the payload headers does
359	        not help in finding the number of lost characters, because the
360	        UTF-8/UTF-16 [18][19] encodings used yield a variable number of
361	        bytes per character.

363	        For finding out the exact number of lost characters, an
364	        additional field reflecting the character count (and possibly
365	        the character offset) upon fragmentation would be required.
366	        This would additionally require the entity performing
367	        fragmentation to count the characters included in each text
368	        fragment.

370	        One benefit of having a character count would be that the
371	        display application would be able to replace missing characters
372	        through some other character representing character loss, e.g.:

374	             If we take the text "Some text is lost now" and assume the
375	             loss of a packet containing the text in the middle, this
376	             could be displayed (with a character count):

378	             "Some ############now"

380	             As opposed to:

382	             "Some #now"

384	             Which is what this payload format enables ("#" indicates a
385	             missing character or packet, respectively).

387	        However, it is the opinion of the authors that for applications
388	        such as subtitling applications and multimedia presentations
389	        that use this payload format, such partial error correction is
390	        not worth the cost of including two additional fields, namely
391	        character count and character offset.  Instead, it is
392	        recommended that some more overhead be invested to provide full
393	        error correction by protecting the text sample fragments
394	        using the measures outlined in Section 5.

396	     3. Fragment re-assembly: in order to re-assemble the text samples,
397	        offset information is needed.  Instead of a character or byte
398	        offset, a single byte, TOTAL/THIS, is used.  These two values
399	        indicate the total number and current index of fragments of a
400	        text sample.  This is simpler than having a character offset
401	        field in each fragment.  Details in Section 4.1.3.

403	     4. A length field, LEN, is present in the common header fields.
404	        While the length in the RTP payload format is not needed by most
405	        RTP applications (typically lower layers, like UDP, provide this
406	        information) it does ease interoperability with RFC 3640.  This
407	        is because the Access Units (AUs) used for carriage of data in
408	        RFC 3640 must include a length indication.  Details in Section
409	        4.8.

411	     5. The header fields in the specific payload headers (TYPE headers
412	        in Sections 4.1.2 to 4.1.6) have been arranged for easy
413	        processing on 32-bit machines.  For this reason the fields SIDX
414	        and SDUR are swapped in TYPE 1 unit, compared to the other
415	        units.
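To illustrate design choice 2 above (no character count), the following Python sketch shows why a fragment's size in bytes does not reveal how many characters it carried. It is an illustration only; the helper name and the sample strings are assumptions chosen for this example and do not come from this document.

```python
# Illustrative sketch only: the helper name and the sample strings are
# hypothetical. It shows that equal byte sizes can hide different
# character counts when UTF-8 is used.

def utf8_size(text: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Three strings that each occupy 4 bytes in UTF-8:
candidates = [
    "text",          # 4 characters, 1 byte each
    "\u00e9\u00e9",  # 2 characters (e with acute accent), 2 bytes each
    "\U0001F600",    # 1 character (an emoji), 4 bytes
]

if __name__ == "__main__":
    for s in candidates:
        # A receiver that only knows "4 bytes were lost" cannot tell
        # which of these character counts applies.
        print(utf8_size(s), "bytes,", len(s), "character(s)")
```

A receiver that learns only the size of a lost fragment therefore cannot choose the right number of replacement characters, which is why a separate character count field would have been required.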

417	3. Terminology

419	   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
420	   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
421	   document are to be interpreted as described in RFC 2119 [5].

423	   Furthermore, the following terms are used and have specific meaning
424	   within the context of this document:

426	   text sample or whole text sample

428	        In the 3GPP Timed Text media format [1] this term refers to a
429	        unit of timed text data as contained in the source (3GP) file.
430	        This includes the text string byte count, possibly a Byte Order
431	        Mark, the text string and any modifiers that may follow.  Its
432	        equivalent in audio/video would be a frame.

434	        In this document, however, a text sample comprises only text
435	        strings followed by zero or more modifiers.  This definition of
436	        text sample excludes the 16-bit text string byte count and the
437	        16-bit Byte Order Mark (BOM) present in 3GP file text samples
438	        (see Section 4.3 and Figure 9).  The 16-bit BOM is not
439	        transported in RTP as explained in Section 4.1.1.

441	   text strings:

443	        text strings is the term used to denote the actual text
444	        characters encoded either as UTF-8 or UTF-16.  When using this
445	        payload format, the text string does not contain any byte order
446	        mark (BOM).  See Figure 9 for details.

448	   fragment or text sample fragment:

450	        a fraction of a text sample.  A fragment may contain either text
451	        strings or modifier (decoration) contents, but not both at the
452	        same time.

454	   sample contents:

456	        general term to identify timed text data transported when using
457	        this payload format.  Sample contents may be one or several text
458	        samples, sample descriptions and sample fragments (note that, as
459	        per Section 4.6, there is only one case in which more than one
460	        fragment may be included in a payload).

462	   decoration/modifiers:

464	        the terms "decoration" and "modifiers" are used interchangeably
465	        throughout the document to denote the contents of the text
466	        sample that modify the default text formatting.  Modifiers may,
467	        for example, specify different font size for a particular
468	        sequence of characters or define karaoke timing for the sample.

470	   sample description:

472	        this term is used to denote information which is potentially
473	        shared by more than one text sample.  In a 3GP file a sample
474	        description is stored in a place where it can be shared.  It
475	        contains setup and default information such as scrolling
476	        direction, text box position, delay value, default font,
477	        background color, etc.

479	   units or transport units:

481	        the payload headers specified in this document encapsulate text
482	        samples, fragments thereof and sample descriptions by placing a
483	        common header and specific payload header (Sections 4.1.1 to
484	        4.1.6) before them and so building what is here called a
485	        (transport) unit.

487	   aggregation / aggregate packet

489	        The payload of an aggregate (RTP) packet consists of several
490	        (transport) units.

492	   track / stream

494	        3GP files contain audio/video and text tracks.  This document
495	        enables streaming of text tracks using RTP.  Therefore, both
496	        terms are used interchangeably in this document in the context
	        of 3GP files.

498	   Media Header Box / Track Header Box / ...

500	        the 3GP file format makes use of these structures defined in the
501	        ISO Base File Format [2].  When referring to these in this
502	        document, initials are capitalized for clarity.

504	4. RTP Payload Format for 3GPP Timed Text

506	   The format of an RTP packet containing 3GPP timed text is shown
507	   below:

509	       0                   1                   2                   3
510	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
511	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
512	      |V=2|P|X| CC    |M|    PT       |        sequence number        |
513	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
514	      |                           timestamp                           |
515	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
516	      |           synchronization source (SSRC) identifier            |
517	     /+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
518	    | |U|   R   | TYPE|             LEN               |               :
519	    | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               :
520	   U| :           (variable header fields depending on TYPE)          :
521	   N| :                                                               :
522	   I< +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
523	   T| |                                                               |
524	    | :                    SAMPLE CONTENTS                            :
525	    | |                                               +-+-+-+-+-+-+-+-+
526	    | |                                               |
527	     \+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
528	               Figure 1. 3GPP Timed Text RTP Packet Format.

530	   Marker bit (M): the marker bit SHALL be set to 1 if the RTP packet
531	   includes one or more whole text samples or the last fragment of a
532	   text sample; otherwise, it SHALL be set to zero (0).

534	   Timestamp: the timestamp MUST indicate the sampling instant of the
535	   earliest (or only) unit contained in the RTP packet.  The initial
536	   value SHOULD be randomly determined, as specified in RTP [3].

538	        The timestamp value should provide enough timing resolution for
539	        expressing the duration of text samples, for synchronizing text
540	        with other media and for performing RTCP measurements such as
541	        the interarrival delay jitter or the RTCP Packet Receipt Times
542	        Report Block (Section 4.3 of RFC 3611 [20]).  This complies
543	        with RTP, Section 5.1:

545	             "The resolution of the clock MUST be sufficient for the
546	             desired synchronization accuracy and for measuring packet
547	             arrival jitter (one tick per video frame is typically not
548	             sufficient)"

550	        The above observation applies both to timed text tracks included
551	        in a 3GP file and to live streaming sessions.  In the case
552	        of a 3GP timed text track, the timestamp clockrate is the value
553	        of the "timescale" parameter in the Media Header Box for that
554	        text track.  Each track in a 3GP file MAY have its own clockrate
555	        as specified in the Media Header Box.  Likewise, live streaming
556	        applications SHALL use an appropriate timestamp clockrate.  A
557	        default value of 1000 Hz is RECOMMENDED.  Other timestamp
558	        clockrates MAY be used.  In this case, the typical behavior is
559	        to match the 3GPP timed text clockrate to that used by an
560	        associated audio or video stream.

562	        In an aggregate payload, units MUST be placed in play-out order,
563	        i.e. earliest first in the payload.  If TYPE 1 units are
564	        aggregated, the timestamp of the subsequent units MUST be
565	        obtained by adding the timed text sample duration of previous
566	        samples to the RTP timestamp value.  There are two exceptions to
567	        this rule: TYPE 5 units and an aggregate payload containing two
568	        fragments of the same text sample.  The details of the timestamp
569	        calculation are given in Section 4.6.

571	        Finally, timestamp clockrates MUST be signaled by out-of-band
572	        means at session setup, e.g., using the media type "rate"
573	        parameter in SDP.  See Section 9 for details.

575	   Payload Type (PT): the payload type is set dynamically and sent by
576	   out-of-band means.

578	   The usage of the remaining RTP header fields, namely V, P, X, CC, SN
579	   and SSRC, follows the rules of RTP and the profile in use.
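
   The timestamp rule for aggregated TYPE 1 units described above can
   be sketched as follows.  This is a minimal, non-normative
   illustration: the function name and the list-based interface are
   hypothetical, and the sketch covers neither TYPE 5 units nor the
   case of two fragments of the same sample (see Section 4.6).  It
   assumes the 32-bit wrap-around of the RTP timestamp field shown in
   Figure 1.

```python
RTP_TS_MODULUS = 1 << 32  # the RTP timestamp field is 32 bits wide


def unit_timestamps(rtp_timestamp, sdurs):
    """Return the timestamp of each aggregated TYPE 1 unit.

    rtp_timestamp: timestamp of the RTP header (earliest unit).
    sdurs: SDUR values of the units, in play-out order.
    (Hypothetical helper, not defined by this document.)
    """
    timestamps = []
    current = rtp_timestamp % RTP_TS_MODULUS
    for sdur in sdurs:
        timestamps.append(current)
        # The next unit starts where the previous sample ends.
        current = (current + sdur) % RTP_TS_MODULUS
    return timestamps
```

   For example, with an RTP timestamp of 1000 and durations 2000, 500
   and 100, the three units start at 1000, 3000 and 3500.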

581	4.1. Payload Header Definitions

583	   The (transport) units specified in this document consist of a set of
584	   common fields (U, R, TYPE, LEN), followed by specific header fields
585	   (TYPES 1-5) and text sample contents.  See Figure 1 and Figure 2.

587	   In Figure 2 two example RTP packets are depicted.  The first one
588	   contains an aggregate RTP payload with two complete text samples
589	   and the second one contains one text sample fragment.  After each
590	   unit header is explained, detailed payload examples follow in Section
591	   4.7.

593	                                        +----------------------+
594	                                        |                      |
595	                                        |   RTP Header         |
596	                                        |                      |
597	                               ---------+----------------------+
598	                               |        |                      |
599	                               |        |COMMON + TYPE 1 Header|
600	                               |        ........................
601	                        UNIT 1 -        |                      |
602	                               |        |    Text Sample       |
603	                               |        |                      |
604	                               |-------\........................
605	                                -------/|                      |
606	                               |        |COMMON + TYPE 1 Header|
607	                               |        ........................
608	                        UNIT 2 -        |                      |
609	                               |        |    Text Sample       |
610	                               |        |                      |
611	                               |        |                      |
612	                               ---------+----------------------+

614	                                        +----------------------+
615	                                        |                      |
616	                                        |   RTP Header         |
617	                                        |                      |
618	                               ---------+----------------------+
619	                               |        |  COMMON + TYPE 2     |
620	                               |        |    (or 3 or 4) Hdr   |
621	                               |        ........................
622	                        UNIT 3 -        |                      |
623	                               |        | Text Sample Fragment |
624	                               |        |                      |
625	                               |        |                      |
626	                               ---------+----------------------+
627	                     Figure 2. Example RTP packets.

629	4.1.1. Common Payload Header Fields

631	   The fields common to all payload headers have the following format:

633	            0                   1                   2
634	            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
635	           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
636	           |U|   R   |TYPE |             LEN               |
637	           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
638	                     Figure 3. Common payload header fields.

640	   Where:

642	   o U (1 bit) "UTF Transformation flag": this is used to inform RTP
643	     receivers whether UTF-8 (U=0) or UTF-16 (U=1) was used to encode
644	     the text string.  UTF-16 text strings transported by this payload
645	     format MUST be serialized in big endian order, a.k.a. network byte
646	     order.

648	        Informative note:  timed text clients complying with the 3GPP
649	        Timed Text format [1] are only required to understand the big
650	        endian serialization.  Thus, in order to ease interoperability,
651	        the reverse serialization (little endian) is not supported by
652	        this payload format.

654	     For the payload formats defined in this document, the U bit is
655	     only used in TYPE 1 and TYPE 2 headers.  Senders MUST set the U
656	     bit to zero in TYPE 3, TYPE 4 and TYPE 5 headers.  Consequently,
657	     receivers MUST ignore the U bit in TYPE 3, TYPE 4 and TYPE 5
658	     headers.

660	   o R (4 bits) "Reserved bits": for future extensions.  This field
661	     MUST be set to zero (0x0) and MUST be ignored by receivers.

663	   o TYPE (3 bits) "Type Field": this field specifies which specific
664	     header fields follow.  The following TYPE values are defined:

666	        - TYPE 1, for a whole text sample
667	        - TYPE 2, for a text string fragment (without modifiers)
668	        - TYPE 3, for a whole modifier box or the first fragment of a
669	          modifier box
670	        - TYPE 4, for a modifier fragment other than first.
671	        - TYPE 5, for a sample description.  Exactly one header per
672	          sample description.
673	        - TYPE 0, 6 and 7 are reserved for future extensions.  Note that
674	          future extensions are possible, e.g., a unit that explicitly
675	          signals the number of characters present in a fragment (see
676	          Section 2.5).  In order to guarantee backwards-compatibility,
677	          it SHALL be possible for older clients to ignore (newer) units
678	          they do not understand, without invalidating the timestamp
679	          calculation mechanisms or otherwise preventing the decoding of
680	          the other units.

682	   o Finally, the LEN (16 bits) "Length Field": indicates the size (in
683	     bytes) of this header field and all the fields following, i.e. the
684	     LEN field followed by the unit payload: text strings and modifiers
685	     (if any).  This definition only excludes the initial U/R/TYPE byte
686	     of the common header.  The LEN field follows network byte order.

688	     The way in which LEN is obtained when streaming out of a 3GP file
689	     depends on the particular unit type.  This is explained for each
690	     unit in the sections below.

692	     For live streaming, both sample length and the LEN value for the
693	     current fragment MUST be calculated during the sampling process or
694	     during fragmentation.

696	     In general, LEN may take the following values:

698	      - TYPE = 1, LEN >= 8,
699	      - TYPE = 2, LEN > 9,
700	      - TYPE = 3, LEN > 6,
701	      - TYPE = 4, LEN > 6 and,
702	      - TYPE = 5, LEN > 3.

704	     Receivers MUST discard units that do not comply with these values.
705	     However, the RTP header fields and the rest of the units in the
706	     payload (if any) are still useful, as guaranteed by the
707	     requirement for future extensions above.

709	     In the following subsections the different payload headers for the
710	     values of TYPE are specified.
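
   The common header layout and the LEN checks above can be sketched
   as follows.  This is a non-normative illustration: the names
   parse_common_header, split_units and MIN_LEN are hypothetical.
   Stopping at a unit whose LEN runs past the payload is an assumption
   of this sketch, not a rule stated in this document.

```python
import struct

# Smallest acceptable LEN per TYPE, taken from the list above
# (TYPE 1: LEN >= 8; TYPE 2: > 9; TYPE 3 and 4: > 6; TYPE 5: > 3).
MIN_LEN = {1: 8, 2: 10, 3: 7, 4: 7, 5: 4}


def parse_common_header(buf, offset=0):
    """Decode U, R, TYPE and LEN from the 3-byte common header."""
    first, length = struct.unpack_from("!BH", buf, offset)
    u = first >> 7            # 1 bit: 0 = UTF-8, 1 = UTF-16
    r = (first >> 3) & 0x0F   # 4 reserved bits, ignored by receivers
    unit_type = first & 0x07  # 3 bits
    return u, r, unit_type, length


def split_units(payload):
    """Yield (u, unit_type, unit_bytes) for every acceptable unit."""
    offset = 0
    while offset + 3 <= len(payload):
        u, _r, unit_type, length = parse_common_header(payload, offset)
        # LEN excludes only the initial U/R/TYPE byte.
        end = offset + 1 + length
        if length < 2 or end > len(payload):
            break  # next unit cannot be located (sketch assumption)
        minimum = MIN_LEN.get(unit_type)
        if minimum is not None and length >= minimum:
            yield u, unit_type, payload[offset:end]
        # Reserved TYPEs and too-short units are skipped; the
        # remaining units in the payload are still processed.
        offset = end
```

   Because LEN covers everything after the first byte, a receiver can
   step over a unit it discards and continue with the next one.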

712	4.1.2. TYPE 1 Header

714	       0                   1                   2                   3
715	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
716	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
717	      |U|   R   |TYPE |       LEN  (always >=8)       |    SIDX       |
718	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
719	      |                      SDUR                     |     TLEN      |
720	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
721	      |      TLEN     |
722	      +-+-+-+-+-+-+-+-+
723	                    Figure 4. TYPE 1 Header Format.

725	   This header type is used to transport whole text samples.  This unit
726	   should be the most common case, i.e. the text sample should be
727	   usually small enough to be transported in one unit without having to
728	   separate text strings from modifiers.  In an aggregate (RTP packet)
729	   payload containing several text samples, every sample is preceded by
730	   its own TYPE 1 header (see Figure 12).

732	        Informative note: as indicated in the Terminology Section, a
733	        text sample is composed of the text strings followed by the
734	        modifiers (if any).  This is also how text samples are stored in
735	        3GP files.  The separation of a text sample into text strings
736	        and modifiers is only needed for large samples (or small
737	        available IP MTU sizes, see Section 4.4) and it is accomplished
738	        with TYPE 2 and TYPE 3 headers, as explained in the Sections
739	        below.

741	   Note that empty text samples are also considered whole text samples,
742	   although they do not contain sample contents.  Empty text samples may
743	   be used to clear the display or to put an end to samples of unknown
744	   duration, for example.  Units without sample contents SHALL have a
745	   LEN field value of 8 (0x0008).

747	   The fields above have the following meaning:

749	   o U, R and TYPE as defined in Section 4.1.1.

751	   o LEN, in this case, represents the length of the (complete) text
752	     sample plus eight (8) bytes of headers.  For finding the length of
753	     the text sample in the Sample Size Box of 3GP files, see Section
754	     4.3.

756	   o SIDX (8 bits) "Text Sample Entry Index": this is an index used to
757	     identify the sample descriptions.

759	     The SIDX field is used to find the sample description
760	     corresponding to the unit's payload.  There are two types of SIDX
761	     values: static and dynamic.

763	     Static SIDX values are used to identify sample descriptions that
764	     MUST be sent out-of-band and MUST remain active during the whole
765	     session.  A static SIDX value is unequivocally linked to one
766	     particular sample description during the whole session.  Carrying
767	     many sample descriptions out-of-band SHOULD be avoided, since
768	     these may become large and, ultimately,
769	     transport is not the goal of the out-of-band channel.  Thus, this
770	     feature is RECOMMENDED for transporting those sample descriptions
771	     that provide a set of minimum default format settings.  Static
772	     SIDX values MUST fall in the (closed) interval [129,254].

774	     Dynamic SIDX values are used for sample descriptions sent in-band.
775	     Sample descriptions MAY be sent in-band for several reasons:
776	     because they are generated in real time, for transport resiliency
777	     or both.  A dynamic SIDX value is unequivocally linked to one
778	     particular sample description during the period in which this is
779	     active in the session and it SHALL NOT be modified during that
780	     period.  This period MAY be smaller than or equal to the session
781	     duration.  This period is not known a priori.  A maximum of 64
782	     simultaneously active dynamic SIDX values is allowed at any
783	     moment.  Dynamic SIDX values MUST fall in the closed interval
784	     [0,127].  This should be enough for both recorded content and
785	     live streaming applications.  Nevertheless, a wrap-around
786	     mechanism is provided in Section 4.2.1 to handle streaming
787	     sessions where more than 64 SIDX values might be needed.  Servers
788	     MAY make use of dynamic sample descriptions.  Clients MUST be able
789	     to receive and interpret dynamic sample descriptions.

791	     Finally, SIDX values 128 and 255 are reserved for future use.

793	   o SDUR (24 bits) "Text Sample Duration": indicates the sample
794	     duration in RTP timestamp units of the text sample.  For this
795	     field, a length of 3 bytes is preferred to 2 bytes.  This is
796	     because, for a typical clockrate of 1000 Hz, 16 bits would allow
797	     for a maximum duration of just 65 seconds, which might be too
798	     short for some streams.  On the other hand, 24 bits at 1000 Hz
799	     allow for a maximum duration of about 4.6 hours, while for 90 kHz,
800	     this value is about 3 minutes.  These values should be enough for
801	     streaming applications.  However, if a larger duration is needed,
802	     the extension mechanism specified in Section 4.3 SHALL be used.

804	     Apart from defining the time period during which the text is
805	     displayed, the duration field is also used to find the timestamp
806	     of subsequent units within the aggregate RTP packet payload (if
807	     any).  This is explained in Section 4.6.

809	     Text samples generally have a known duration at the time of
810	     transmission.  However, in some cases like live streaming, the
811	     time for which a text piece shall be presented might not be known
812	     a priori.  Thus, the value zero SDUR=0 (0x000000) is reserved to
813	     signal unknown duration.  The amount of time that a sample of
814	     unknown duration is presented is determined by the timestamp of
815	     the next sample that shall be displayed at the receiver: text
816	     samples of unknown duration SHALL be displayed until the next text
817	     sample becomes active, as indicated by its timestamp.

819	     The next example illustrates how units of unknown duration MUST be
820	     presented.  If no text sample following is available, it is an
821	     implementation issue what should be displayed.  E.g. a server
822	     could send an empty sample to clear the text box.

824	        Example: imagine you are in an airport watching the latest news
825	        report while you wait for your plane.  Airports are loud, so the
826	        news report is transcribed in the lower area of the screen.
827	        This area displays two lines of text: the headlines and the
828	        words spoken by the news speaker.  As usual, the headlines are
829	        shown for a longer time than the rest.  This time is, in
830	        principle, unknown to the stream server, which is streaming
831	        live.  A headline is just replaced when the next headline is
832	        received.

834	     However, upon storing a text sample with SDUR=0 in a 3GP file, the
835	     SDUR value MUST be changed to the effective duration of the text
836	     sample, which MUST be always greater than zero (note that the ISO
837	     file format [2] explicitly forbids a sample duration of zero).
838	     The effective duration MUST be calculated as the timestamp
839	     difference between the current sample (with unknown duration) and
840	     the next text sample that is displayed.

842	     Note that samples of unknown duration SHALL NOT use features
843	     that require knowledge of the duration of the sample up front.
844	     Such features are scrolling and karaoke in [1].  This also applies
845	     to future extensions of the Timed Text format.  Furthermore, only
846	     sample descriptions (TYPE 5 units) MAY follow units of unknown
847	     duration in the same aggregate payload.  Otherwise, it would not
848	     be possible to calculate the timestamp of these other units.

850	     For text contents stored in 3GP files, see Section 4.3 for details
851	     on how to extract the duration value.  For live streaming, live
852	     encoders SHALL assign appropriate values and units according to
853	     [1] and later releases.

855	   o TLEN (16 bits), "Text String Length", is a byte-count of the text
856	     string.  The text string length is needed by the decoder to know
857	     where the modifiers in the payload start.  TLEN is not present in
858	     text string fragments (TYPE 2) since it can be deductively
859	     calculated from the LEN values of each fragment.

861	     The TLEN value is obtained from the text samples as contained in
862	     3GP files.  Refer to Section 4.3.  For live content, the TLEN MUST
863	     be obtained during the sampling process.

865	   o Finally, the actual text sample is placed after the TLEN field.
866	     As defined in Section 3, a text sample consists of a string of
867	     characters encoded using either UTF-8 or UTF-16, followed by zero
868	     or more modifiers.  Note also, that no BOM and no byte count are
869	     included in the strings carried in the payload (as opposed to text
870	     samples stored in 3GP files [1]).
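
   The TYPE 1 layout of Figure 4 and the SIDX ranges given above can
   be sketched as follows.  This is a non-normative illustration: the
   function names and the returned dictionary are hypothetical, and
   raising an exception on a malformed unit is a choice of this sketch
   (a receiver would discard the unit instead).

```python
import struct


def classify_sidx(sidx):
    """Map an SIDX value to the ranges given in the text."""
    if 0 <= sidx <= 127:
        return "dynamic"
    if 129 <= sidx <= 254:
        return "static"
    return "reserved"  # 128 and 255


def parse_type1_unit(unit):
    """Split one whole TYPE 1 unit into header fields and contents."""
    first, length, sidx = struct.unpack_from("!BHB", unit, 0)
    if first & 0x07 != 1 or length < 8 or len(unit) != 1 + length:
        raise ValueError("not a well-formed TYPE 1 unit")
    sdur = int.from_bytes(unit[4:7], "big")  # 24 bits, 0 = unknown
    (tlen,) = struct.unpack_from("!H", unit, 7)
    sample = unit[9:]  # text string followed by modifiers (if any)
    if tlen > len(sample):
        raise ValueError("TLEN exceeds the sample contents")
    # The U bit selects the encoding; UTF-16 is always big endian.
    encoding = "utf-16-be" if first >> 7 else "utf-8"
    return {
        "sidx": sidx,
        "sidx_kind": classify_sidx(sidx),
        "sdur": sdur,
        "text": sample[:tlen].decode(encoding),
        "modifiers": sample[tlen:],
    }
```

   An empty text sample (LEN = 8) yields an empty text string and no
   modifiers, since nothing follows the TLEN field.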

872	4.1.3. TYPE 2 Header

874	       0                   1                   2                   3
875	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
876	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
877	      |U|   R   |TYPE |          LEN( always >9)      | TOTAL | THIS  |
878	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
879	      |                    SDUR                       |    SIDX       |
880	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
881	      |               SLEN            |
882	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
883	                      Figure 5. TYPE 2 Header Format.

885	   This header type is used to transport either a whole text string or a
886	   fragment of it.  TYPE 2 units SHALL NOT contain modifiers.  In
887	   detail:

889	   o U, R and TYPE as defined in Section 4.1.1.

891	   o SIDX and SDUR as defined in Section 4.1.2.

893	        Note that the U, SIDX and SDUR fields are meaningful since
894	        partial text strings can also be displayed.

896	   o The LEN field (16 bits) indicates the length of the text string
897	     fragment plus nine (9) bytes of headers.  Its value is calculated
898	     upon fragmentation.  LEN MUST always be greater than nine (0x0009).
899	     Otherwise, the unit MUST be discarded.

901	     According to the guidelines in Section 4.3, text strings MUST be
902	     split at character boundaries for allowing the display of text
903	     fragments.  Hence, a text fragment MUST contain at least one
904	     character in either UTF-8 or UTF-16.  Actually, this is just a
905	     formalism since by observing the guidelines, much larger fragments
906	     should be created.

908	     Note also, that TYPE 2 units do not contain an explicit text
909	     string length, TLEN (see TYPE 1).  This is because TYPE 2 units do
910	     not contain any modifiers after the text string.  If needed, the
911	     length of the received string can be obtained using the LEN values
912	     of the TYPE 2 units.

914	   o The SLEN field (16 bits) indicates the size (in bytes) of the
915	     original (whole) text sample to which this fragment belongs.  This
916	     length comprises the text string plus any modifier boxes present
917	     (and includes neither the byte order mark nor the text string
918	     length as mentioned in the Terminology Section).

920	     Regarding the text sample length: timed text samples are neither
921	     generated at regular intervals nor is there a default sample size.
922	     If 3GP files are streamed, the length of the text samples is
923	     calculated beforehand and included in the track itself, while for
924	     live encoding it is the real time encoder that SHALL choose an
925	     appropriate size for each text sample.  In this case, the amount
926	     of text 'captured' in a sample depends on the text source and the
927	     particular application (see examples below).  Samples may, e.g.,
928	     be tailored to match the packet MTU as closely as possible or to
929	     provide a given redundancy for the available bit rate.  The
930	     encoding application MUST also take into account the delay
931	     constraints of the real-time session and assess whether FEC,
932	     retransmission or other similar techniques are reasonable options
933	     for stream repair.

935	     The following examples illustrate how a real-time encoder
936	     may choose its settings to adapt to the scenario constraints.

938	          Example: imagine a newscast scenario, where the spoken news
939	          is transcribed and synchronized with the image and voice of
940	          the reporter.  We assume that the news speaker talks at an
941	          average speed of 5 words per second with an average word
942	          length of 5 characters plus one space per word, i.e. 30
943	          characters per second.  We assume an available IP MTU of 576
944	          bytes and an available bitrate of 576*8 bits per
945	          second = 4.6 kbps.  We assume each character can be encoded
946	          using 2 bytes in UTF-16.  In this scenario, several
947	          constraints may apply, for example: available IP MTU,
948	          available bandwidth, allowable delay and required redundancy.
949	          If the target were to minimize the packet overhead, a text
950	          sample covering 8 seconds of text would be closest to the IP
951	          MTU: IP/UDP/RTP/TYPE1 Header + (8s text sample)=20+8+12+8+(~6
952	          chars/word * 5 words/s * 8s * 2 bytes/char)= 528 bytes < 576
953	          bytes.  For other scenarios, like lossy networks, it may
954	          happen that just one packet per sample provides too little
955	          redundancy.  In this case, a choice could be that the encoder
956	          'collects' text every second, thus yielding text samples
957	          (TYPE 1 units) of 68 bytes, TYPE 1 header included.  We can,
958	          e.g., include three contiguous text samples in one RTP
959	          payload: the current and last two text samples (see below).
960	          This amounts to a total IP packet size of 20+8+12+3*(8+60)=
961	          244 bytes.  Now, with the same available bitrate of 4.6 kbps,
962	          these 244-byte packets can be sent redundantly up to two times
963	          per second:

965	          RTP payload (1,2,3)(1,2,3) (2,3,4)(2,3,4) (3,4,5)(3,4,5) ...
966	          Time:       <----1s------> <----1s------> <-----1s-----> ...

968	          This means that each text sample is sent at least six times,
969	          which should provide enough redundancy.  Although not as
970	          bandwidth efficient (488*8 < 528*8 < 576*8 bps) as the
971	          previous packetization, this option increases the stream
972	          redundancy while still meeting the delay and bandwidth
973	          constraints.

975	          Another example would be a user sending timed text from a
976	          type-in area in the display.  In this case, the text sample
977	          is created as soon as the user clicks the 'send' button.
978	          Depending on the packet length, fragmentation may be needed.

980	          In a video conferencing application, text is synchronized
981	          with audio and video.  Thus, the text samples shall be
982	          displayed long enough to be read by a human, shall fit in the
983	          video screen and shall 'capture' the audio contents rendered
984	          during the time the corresponding video and audio is
985	          rendered.

987	     For stored content, see Section 4.3 for details on how to find the
988	     SLEN value in a 3GP file.  For live content, the SLEN MUST be
989	     obtained during the sampling process.

991	     Finally, note that clients MAY use SLEN to reserve buffer space for
992	     the remaining fragments of a text sample.

994	   o The fields TOTAL (4 bits) and THIS (4 bits) indicate the total
995	     number of fragments into which the original text sample (i.e. text
996	     string and its modifiers) has been split and the position of
997	     the current fragment in that sequence, respectively.
998	     Note that the sequence number alone cannot replace the
999	     functionality of the THIS field, since packets (and fragments) may
1000	     be repeated, e.g., as in repeated transmission (see Section 5).
1001	     Thus, an indication for "fragment offset" is needed.

1003	     The usual "byte offset" field is not used here for two reasons: a)
1004	     it would take one more byte and b) it does not provide any
1005	     information on the character offset.  UTF-8/UTF-16 text strings
1006	     have, in general, a variable character length ranging from 1 to 6
1007	     bytes.  Therefore, the TOTAL/THIS solution is preferred.  It could
1008	     also be argued that the LEN and SLEN fields be used for this
1009	     purpose, but while they would provide information about the
1010	     completeness of the text sample, they do not specify the order of
1011	     the fragments.

1013	     In all cases (TYPEs 2, 3 and 4), if the value of THIS is greater
1014	     than TOTAL or if TOTAL equals zero (0x0), the fragment SHALL be
1015	     discarded.

1017	   o Finally, the sample contents following the SLEN field consist of a
1018	     fragment of the UTF-8/UTF-16 character string; no modifiers
1019	     follow.
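
   The packet-size arithmetic of the newscast example in this section
   can be reproduced as follows.  This is a non-normative sketch: all
   constant and function names are hypothetical, and the figures (30
   characters per second, 2 bytes per character, 8 header bytes per
   TYPE 1 unit, 576-byte MTU) are the assumptions stated in that
   example.

```python
IP_UDP_RTP_HEADERS = 20 + 8 + 12  # bytes of IP, UDP and RTP headers
TYPE1_OVERHEAD = 8                # header bytes as counted in the example
CHARS_PER_SECOND = 6 * 5          # (5 letters + 1 space) * 5 words/s
BYTES_PER_CHAR = 2                # UTF-16 assumption of the example
MTU = 576                         # available IP MTU in bytes
BITRATE_BPS = MTU * 8             # 4608 bit/s, about 4.6 kbps


def sample_bytes(seconds):
    """Size of the text captured in a sample of the given duration."""
    return CHARS_PER_SECOND * seconds * BYTES_PER_CHAR


def packet_bytes(seconds_per_sample, samples_per_packet=1):
    """Total IP packet size for an aggregate of equal-sized samples."""
    unit = TYPE1_OVERHEAD + sample_bytes(seconds_per_sample)
    return IP_UDP_RTP_HEADERS + samples_per_packet * unit


single = packet_bytes(8)        # one 8-second sample per packet
redundant = packet_bytes(1, 3)  # three 1-second samples per packet
# How many redundant packets fit into one second of the bitrate.
sends_per_second = BITRATE_BPS // (redundant * 8)
```

   Under these assumptions an 8-second sample gives a 528-byte packet,
   a 9-second sample would exceed the MTU, and the 244-byte aggregate
   packet can be sent twice per second within the available bitrate.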

1021	4.1.4. TYPE 3 Header

1023	       0                   1                   2                   3
1024	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1025	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1026	      |U|   R   |TYPE |        LEN( always >6)        |TOTAL  |  THIS |
1027	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1028	      |                      SDUR                     |
1029	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1030	                      Figure 6. TYPE 3 Header Format.

1032	   This header type is used to transport either the entire modifier
1033	   contents present in a text sample or just the first fragment of them.
1034	   This depends on whether the modifier boxes fit in the current RTP
1035	   payload.

1037	   If a text sample containing modifiers is fragmented this header MUST
1038	   be used to transport the first fragment or, if possible, the complete
1039	   modifiers.

1041	   In detail:

1043	   o The U, R and TYPE fields are defined as in Section 4.1.1.

1045	   o LEN indicates the length of the modifier contents plus six (6)
1046	     bytes of headers.  It is obtained upon fragmentation and MUST be
1047	     greater than six (0x0006).  Otherwise, the unit MUST be discarded.

1049	   o The TOTAL/THIS field has the same meaning as for TYPE 2.

1051	     For a TYPE 3 unit containing the last (trailing) modifier fragment,
1052	     the value of TOTAL MUST be equal to that of THIS (TOTAL=THIS).  In
1053	     addition, TOTAL=THIS MUST be greater than one, because the total
1054	     number of fragments of a text sample is logically always larger
1055	     than one.

1057	     Otherwise, if TOTAL is different from THIS in a TYPE 3 unit, this
1058	     means that the unit contains the first fragment of the modifiers.

1060	   o The SDUR has the same definition as for TYPE 1.  Since fragments
1061	     are always transported in their own RTP packets, this field is only
1062	     needed to know how long this fragment is valid.  This may, e.g.,
1063	     be used to determine how long it should be kept in the display
1064	     buffer.

1066	   Note that the SLEN and SIDX fields are not present in TYPE 3 unit
1067	   headers.  This is because: a) these fragments do not contain text
1068	   strings and b) these types of fragments are applied over text string
1069	   fragments, which already contain this information.

1071	4.1.5. TYPE 4 Header

1073	       0                   1                   2                   3
1074	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1075	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1076	      |U|   R   |TYPE |        LEN( always >6)        |TOTAL  |  THIS |
1077	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1078	      |                      SDUR                     |
1079	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1080	                      Figure 7. TYPE 4 Header Format.

1082	   This header type is placed before modifier fragments, other than the
1083	   first one.

1085	   The U, R and TYPE fields are used as per Section 4.1.1.

1087	   LEN indicates, as for TYPE 3, the length of the modifier contents
1088	   and SHALL also be obtained upon fragmentation.  The LEN field MUST
1089	   be greater than six (0x0006).  Otherwise, the unit MUST be discarded.

1091	   TOTAL/THIS is used as in TYPE 2.

1093	   The SDUR field is defined as in TYPE 1. The reasoning behind the
1094	   absence of SLEN and SIDX is the same as in TYPE 3 units.
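
   Informative sketch (not part of the payload format): the following
   Python fragment shows one way a receiver might decode the 7-byte
   TYPE 3 / TYPE 4 header drawn in Figures 6 and 7 and apply the
   "LEN greater than six" rule.  The names ModifierHeader and
   parse_modifier_header are hypothetical, and returning None to signal
   "discard the unit" is an assumption of this sketch.

```python
import struct
from typing import NamedTuple, Optional


class ModifierHeader(NamedTuple):
    u: int      # U bit (1 bit)
    r: int      # R field (4 bits)
    type: int   # TYPE field (3 bits), 3 or 4 here
    len: int    # LEN field (16 bits)
    total: int  # TOTAL (upper 4 bits of the fragment byte)
    this: int   # THIS (lower 4 bits of the fragment byte)
    sdur: int   # SDUR (24 bits)


def parse_modifier_header(buf: bytes) -> Optional[ModifierHeader]:
    """Decode a TYPE 3 or TYPE 4 header; return None if the unit is discarded."""
    if len(buf) < 7:
        raise ValueError("need at least 7 header bytes")
    first, length, frag = struct.unpack_from("!BHB", buf, 0)
    unit_type = first & 0x07
    if unit_type not in (3, 4):
        raise ValueError("not a TYPE 3 or TYPE 4 unit")
    if length <= 6:
        return None  # LEN MUST be greater than six, otherwise discard
    sdur = int.from_bytes(buf[4:7], "big")  # network byte order
    return ModifierHeader(first >> 7, (first >> 3) & 0x0F, unit_type,
                          length, frag >> 4, frag & 0x0F, sdur)
```

   For instance, the bytes 83 00 0A 32 00 01 00 decode to U=1, TYPE=3,
   LEN=10, TOTAL=3, THIS=2 and SDUR=256.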

1096	4.1.6. TYPE 5 Header

1098	       0                   1                   2                   3
1099	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1100	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1101	      |U|   R   |TYPE |      LEN( always >3)          |   SIDX        |
1102	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1103	                      Figure 8. TYPE 5 Header Format.

1105	   This header type is used to transport (dynamic) sample descriptions.
1106	   Every sample description MUST have its own TYPE 5 header.

1108	   The U, R and TYPE fields are used as per Section 4.1.1.

1110	   The LEN field indicates the length of the sample description, plus
1111	   three units accounting for the SIDX and LEN field itself.  Thus, this
1112	   field MUST be greater than three (0x0003).  Otherwise, the unit MUST
1113	   be discarded.

1115	   If the sample is streamed from a 3GP file, the length of the sample
1116	   description contents (i.e. what comes after SIDX in the unit itself)
1117	   is obtained from the file (see Section 4.3).

1119	   The SIDX field contains a dynamic SIDX value assigned to the sample
1120	   description carried as sample content of this unit.  As only dynamic
1121	   sample descriptions are carried using TYPE 5, the possible SIDX
1122	   values are in the (closed) interval [0,127].

1124	   Senders MAY make use of TYPE 5 units.  All receivers MUST implement
1125	   support for TYPE 5 units, since it adds minimum complexity and it may
1126	   increase the robustness of the streaming session.

1128	   The next section specifies how SIDX values are calculated.
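
   Informative sketch: a TYPE 5 unit could be decoded roughly as shown
   below.  Because LEN covers the sample description plus the three
   bytes of the LEN and SIDX fields, the whole unit is taken to occupy
   1 + LEN bytes.  The function name parse_type5_unit is hypothetical,
   and returning None both for "LEN not greater than three" and for an
   SIDX outside [0,127] is an assumption of this sketch.

```python
import struct
from typing import Optional, Tuple


def parse_type5_unit(buf: bytes) -> Optional[Tuple[int, bytes]]:
    """Return (sidx, sample_description) or None if the unit is discarded."""
    if len(buf) < 4:
        raise ValueError("need at least 4 header bytes")
    first, length, sidx = struct.unpack_from("!BHB", buf, 0)
    if first & 0x07 != 5:
        raise ValueError("not a TYPE 5 unit")
    if length <= 3:
        return None  # LEN MUST be greater than three, otherwise discard
    if sidx > 127:
        return None  # only dynamic SIDX values [0,127] are carried in TYPE 5
    end = 1 + length  # first byte + everything counted by LEN
    if len(buf) < end:
        raise ValueError("unit is truncated")
    return sidx, buf[4:end]
```

   The description length is simply LEN - 3, which is what the slice
   buf[4:end] returns.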

1130	4.2. Buffering of Sample Descriptions

1132	   The buffering of sample descriptions is a matter of the client's
1133	   timed text codec implementation.  In order to work properly, this
1134	   payload format requires that:

1136	     o Static sample descriptions MUST be buffered at the client, at
1137	        least, for the duration of the session.

1139	     o If dynamic sample descriptions are used, their buffering and
1140	        update of the SIDX values MUST follow the mechanism described in
1141	        the next section.

1143	4.2.1. Dynamic SIDX wrap-around mechanism

1145	   The use of dynamic sample descriptions by senders is OPTIONAL.
1146	   However, if used, senders MUST implement this mechanism.  Receivers
1147	   MUST always implement it.

1149	   Dynamic SIDX values remain active either during the entire duration
1150	   of the session (if used just once) or in different intervals of it
1151	   (if used more than once).

1153	        Note: in the following SIDX means dynamic SIDX.

1155	   For choosing the wrap-around mechanism, the following rationale was
1156	   used: there are 128 dynamic SIDX values possible, [0..127].  If one
1157	   chooses to allow a maximum of 127 to be used as dynamic SIDXs, then
1158	   any reordered packet with a new sample description would make the
1159	   mechanism fail.  E.g., if the last packet received is SIDX=5, then
1160	   all 127 values except SIDX=6 would be "active".  Now, if a reordered
1161	   packet arrives with a new description, SIDX=9, it will be mistakenly
1162	   discarded, because the SIDX=9 is, at that moment, marked as "active"
1163	   and active sample descriptions shall not be re-written.  Therefore,
1164	   a "guard interval" is introduced.  This guard interval reduces the
1165	   number of active SIDXs at any point in time to 64.  Although most
1166	   timed text applications will probably need less than 64 sample
1167	   descriptions during a session (in total), a wrap-around mechanism to
1168	   handle the need for more is described here.

1170	   Thereby, a sliding window of 64 active SIDX values is used.  Values
1171	   within the window are "active"; all others are marked "inactive".  An
1172	   SIDX value becomes active if at least one sample description
1173	   identified by that SIDX has been received.  Since sample descriptions
1174	   MAY be sent redundantly, it is possible that a client receives a
1175	   given SIDX several times.  However, active sample descriptions SHALL
1176	   NOT be overwritten: the receiver SHALL ignore redundant sample
1177	   descriptions and it MUST use the already cached copy.  The "guard
1178	   interval" of (64) inactive values ensures that the correct
1179	   association SIDX <-> sample description is always used.

1181	        Informative note: as for the "guard interval" value itself, 64
1182	        as 128/2 was considered simple enough while still meeting the
1183	        expected maximum number of sample descriptions.  Besides that,
1184	        there's no other motivation for choosing 64 or a different
1185	        value.

1187	   The following algorithm is used to buffer dynamic sample descriptions
1188	   and maintain the dynamic SIDX values:

1190	   Let X be the last SIDX received that updated the range of active
1191	   sample descriptions.  Let Y be a value within the allowed range for
1192	   dynamic SIDX: [0,127], and different from X. Let Z be the SIDX of the
1193	   last received sample description.  Then:

1195	     1. Initialize all dynamic SIDX values as inactive.  For stored
1196	        contents, read the sample description index in the Sample to
1197	        Chunk box ("stsc") for that sample.  For live streaming, the
1198	        first value MAY be zero or any other value in the interval
1199	        above.  Go to step 2.

1201	     2. The first in-band sample description with SIDX=Z is received
1202	        and stored.  Set X=Z.  Go to step 3.

1204	     3. Any SIDX within the interval [X+1 modulo(128), X+64 modulo(128)]
1205	        is marked as inactive and any corresponding sample description
1206	        is deleted.  Any SIDX within the interval [X+65 modulo(128), X]
1207	        is set active.  Go to step 4 (wait state).

1209	     4. Wait for next sample description.  Once the client is
1210	        initialized, the interval of active SIDX values MUST change
1211	        whenever a sample description with an SIDX value in the inactive
1212	        set is received.  I.e., upon reception of a sample description
1213	        with SIDX=Z do:

1215	        a. If Z is in the (closed) interval [X+1 modulo(128), X+64
1216	          modulo(128)] then set X=Z, store the sample description and
1217	          go to step 3.

1219	        b. Else Z must be in the interval [X+65 modulo(128), X], thus:
1220	            i. If SIDX=Z is not stored, then store the sample
1221	               description. Go to beginning of step 4 (wait state).
1222	           ii. Else go to the beginning of step 4 (wait state).

1224	        Informative note: it is allowed to send any value of SIDX=X in
1225	        the interval [0,127].  E.g., if [64..127] is the current active
1226	        set and SIDX=0 is sent, a new sample description is defined (0)
1227	        and an old one deleted (64), thus [65..127] and [0] are active.
1228	        Similarly, one could now send SIDX=64, thus inverting the active
1229	        and inactive sets.

1231	   Example,

1233	        if X=4, any SIDX in the interval [5,68] is inactive.  Active
1234	        SIDX values are in the complementary interval [69,127] plus
1235	        [0,4].  E.g., if the client receives a SIDX=6, then the active
1236	        interval is now different: [0,6] plus [71,127].  If the received
1237	        SIDX is in the current active interval no change SHALL be
1238	        applied.
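
   Informative sketch: steps 1 to 4 above can be modelled with a small
   cache keyed by SIDX.  The class name DynamicSidxCache and its method
   names are hypothetical.  The active interval [X+65 modulo(128), X]
   is tested as (X - SIDX) mod 128 < 64, which is the same set of 64
   values.

```python
class DynamicSidxCache:
    """Sliding window of 64 active dynamic SIDX values out of 128."""

    def __init__(self):
        self.x = None            # last SIDX that moved the window (step 1)
        self.descriptions = {}   # SIDX -> stored sample description

    def is_active(self, sidx):
        if self.x is None:
            return False         # everything starts out inactive
        return (self.x - sidx) % 128 < 64

    def receive(self, z, description):
        if not 0 <= z <= 127:
            raise ValueError("dynamic SIDX must be in [0,127]")
        if self.x is None or not self.is_active(z):
            # Steps 2 and 4a: Z moves the window, then step 3 purges
            # every description whose SIDX has become inactive.
            self.x = z
            self.descriptions[z] = description
            for k in [k for k in self.descriptions if not self.is_active(k)]:
                del self.descriptions[k]
        else:
            # Step 4b: store only if missing, never overwrite.
            self.descriptions.setdefault(z, description)
```

   With X=4, is_active() holds for [69,127] and [0,4]; after receiving
   SIDX=6 it holds for [71,127] and [0,6], matching the example above.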

1240	4.3. Finding payload header values in 3GP files

1242	   For the purpose of streaming timed text contents, some values in the
1243	   boxes contained in a 3GP file are mapped to fields of this payload
1244	   header.  This section explains where to find those values.

1246	   Additionally, for the duration and sample description indexes,
1247	   extension mechanisms are provided.  All senders MUST implement the
1248	   extension mechanisms described herein.

1250	   If the contents are streamed out of a 3GP file, the following guidelines
1251	   SHALL be followed.
1252	        Note: all fields in the objects (boxes) of a 3GP file are found
1253	        in network byte order.

1255	   Information obtained from the Sample Table Box (stbl):

1257	        o Sample Descriptions and Sample Description length:  the
1258	          Sample Description box (stsd, inside the stbl) contains the
1259	          sample descriptions.  For timed text media, each element of
1260	          stsd is a timed text sample entry (type "tx3g").

1262	          The (unsigned) 32 bits of the "size" field in the stsd box
1263	          represent the length (in bytes) of the sample description, as
1264	          carried in TYPE 5 units.  On the other hand, the LEN field of
1265	          TYPE 5 units is restricted to 16 bits.  Therefore if the
1266	          value of "size" is greater than (2^16-1-3)[bytes], then the
1267	          sample description SHALL NOT be streamed with this payload
1268	          format.  There is no extension mechanism defined in this
1269	          case, since fragmentation of sample descriptions is not
1270	          defined (sample descriptions are typically up to some 200
1271	          bytes in size).  Note: the three (3) accounts for the TYPE 5
1272	          header fields included in the LEN value.

1274	        o SDUR from the Decoding Time to Sample Box (stts). The
1275	          (unsigned) 32 bits of the "sample delta" field are used for
1276	          calculating SDUR.  However, since the SDUR field is only 3
1277	          bytes long, text samples with duration values larger than
1278	          (2^24-1)/(timestamp clockrate)[seconds] cannot be streamed
1279	          directly.  The solution is simple: copies of the
1280	          corresponding text sample SHALL be sent.  Thereby, the
1281	          timestamp and duration values SHALL be adjusted so that a
1282	          continuous display is guaranteed, as if just one sample had
1283	          been sent.  I.e., a sample with timestamp TS and
1284	          duration SDUR can be sent as two samples having timestamps
1285	          TS1 and TS2 and durations SDUR1 and SDUR2, such that TS1=TS,
1286	          TS2=TS1+SDUR1 and SDUR=SDUR1+SDUR2.

1288	        o Text sample length from the Sample Size Box (stsz).  The
1289	          (unsigned) 32 bits of the "sample size" or "entry size" (one
1290	          of them, depending on whether the sample size is fixed or
1291	          variable) indicate the length (in bytes) of the 3GP text
1292	          sample.  For obtaining the length of the (actual) streamed
1293	          text sample, the lengths of the text string byte count (2
1294	          bytes) and, in case of UTF-16 strings, the length of the BOM
1295	          (also 2 bytes) SHALL be deducted.  This is illustrated in
1296	          Figure 9.

1298	          Text Sample according to 3GPP TS 26.245

1300	                               TEXT SAMPLE (length=stsz)
1301	                 .--------------------------------------------------.
1302	                /                                                    \
1303	                               TEXT STRING  (length=TBC)
1304	                    .------------------------------------.
1305	                   /                                      \
1306	                TBC BOM                                     MODIFIERS
1307	               +---+---+----------------------------------+-----------+
1308	                                     ||
1309	                                     ||    TBC BOM  -> TLEN  field
1310	                                     ||   +---+---+    U bit
1311	                                     ||
1312	                                     \/

1314	          Text Sample according to this Payload Format

1316	                                 TEXT SAMPLE (length=SLEN w/o TBC,BOM)
1317	                        .--------------------------------------------.
1318	                       /                                              \
1319	                                     TEXT STRING (length=TLEN)
1320	                        .--------------------------------.
1321	                       /                                  \
1322	                                    TEXT STRING             MODIFIERS
1323	                       +----------------------------------+-----------+

1325	              KEY:
1326	              TBC= Text string Byte Count
1327	              BOM= Byte Order Mark
1328	                    Figure 9. Text sample composition.

1330	          Moreover, since the LEN field in the TYPE 1 unit header is 16
1331	          bits long, text sample sizes larger than (2^16-1-8) [bytes]
1332	          SHALL NOT be streamed.  Also in this case, there is no
1333	          extension mechanism defined.  This is because this maximum is
1334	          considered enough for the targeted streaming applications.
1335	          (Note: the eight (8) accounts for the TYPE 1 header fields
1336	          included in the LEN value).

1338	        o SIDX from the Sample to Chunk Box (stsc): the stsc Box is
1339	          used to find samples and their corresponding sample
1340	          descriptions.  These are referenced by the "sample
1341	          description index", a (unsigned) 32-bit integer.  If possible,
1342	          these indices may be directly mapped to the SIDX field.
1343	          However, there are several cases where this may not be
1344	          possible:

1346	                a) The total number of indices used is greater than the
1347	                number of indices available, i.e., if the static sample
1348	                descriptions are more than 127 or the dynamic ones are
1349	                more than 64 or,

1351	                b) The original SIDX value ranges do not fit in the
1352	                allowed ranges for static (129-254) or dynamic (0-127)
1353	                values.

1355	          Therefore, when assigning SIDX values to the sample
1356	          descriptions, the following guidelines are provided:

1358	          o    Static sample descriptions can simply be assigned
1359	                consecutive values within the range 129-254 (closed
1360	                interval).  This range should be well enough for static
1361	                sample descriptions.

1363	          o    As for dynamic sample descriptions:

1365	                a) Streams that use less than 64 dynamic sample
1366	                descriptions SHOULD use consecutive values for SIDX
1367	                anywhere in the range 0-127 (closed interval).

1369	                b) For streams with more than 64 sample descriptions,
1370	                the SIDX values MUST be assigned in usage order, and if
1371	                any sample description shall be used after it has been
1372	                set inactive, it will need to be re-sent and assigned a
1373	                new SIDX value (according to the algorithm in
1374	                Section 4.2.1).
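
   Informative sketch: the duration extension mechanism described for
   the stts "sample delta" above (sending copies of a sample so that
   TS2=TS1+SDUR1 and SDUR=SDUR1+SDUR2) might be implemented as below.
   The function name split_duration is hypothetical.  Filling every
   copy but the last to the 24-bit maximum is a choice of this sketch;
   any split whose durations add up to SDUR satisfies the text.  The
   modulo 2^32 reflects the 32-bit RTP timestamp.

```python
MAX_SDUR = (1 << 24) - 1  # largest value that fits the 3-byte SDUR field


def split_duration(ts, sdur):
    """Return a list of (timestamp, duration) pairs covering [ts, ts+sdur)."""
    parts = []
    while sdur > MAX_SDUR:
        parts.append((ts, MAX_SDUR))
        ts = (ts + MAX_SDUR) % (1 << 32)  # RTP timestamps wrap at 32 bits
        sdur -= MAX_SDUR
    parts.append((ts, sdur))
    return parts
```

   A duration that already fits in 24 bits yields a single pair, i.e.
   the sample is sent once and unchanged.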

1376	   Information obtained from the Media Data Box:

1378	        o Text strings, TLEN, U bit and modifiers from the Media Data
1379	          Box (mdat).  Text strings, 16-bit text string byte count,
1380	          Byte Order Mark (BOM, indicating UTF encoding) and modifier
1381	          boxes can be found here.

1383	          For TYPE 1 units, the value of TLEN is extracted from the
1384	          text string byte count that precedes the text string in the
1385	          text sample, as stored in the 3GP file.  If UTF-16 encoding
1386	          is used, two (2) more bytes have to be deducted from this
1387	          byte count beforehand, in order to exclude the BOM.  See
1388	          Figure 9.
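
   Informative sketch: the mapping in Figure 9 from a stored 3GP text
   sample to the streamed form (text string byte count and BOM removed,
   TLEN derived from the byte count) could look as follows.  The
   function name strip_3gp_sample is hypothetical.  Detecting UTF-16 by
   looking for the big endian BOM 0xFEFF at the start of the text
   string, and applying the (2^16-1-8) limit to the streamed length,
   are assumptions of this sketch.

```python
import struct

MAX_TYPE1_SAMPLE = (1 << 16) - 1 - 8  # size limit given for TYPE 1 units


def strip_3gp_sample(sample: bytes):
    """Return (is_utf16, tlen, streamed_sample) for one sample from 'mdat'."""
    if len(sample) < 2:
        raise ValueError("sample shorter than the text string byte count")
    (tbc,) = struct.unpack_from("!H", sample, 0)  # 16-bit byte count
    if 2 + tbc > len(sample):
        raise ValueError("text string byte count exceeds sample size")
    text = sample[2:2 + tbc]
    modifiers = sample[2 + tbc:]
    is_utf16 = text[:2] == b"\xfe\xff"
    if is_utf16:
        text = text[2:]  # the BOM is not streamed
    streamed = text + modifiers
    if len(streamed) > MAX_TYPE1_SAMPLE:
        raise ValueError("sample too large for this payload format")
    return is_utf16, len(text), streamed
```

   For a UTF-16 sample the streamed length is the stsz size minus four
   bytes (byte count plus BOM); for UTF-8 it is the stsz size minus two.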

1390	4.4. Fragmentation of Timed Text Samples

1392	   This section explains why text samples may have to be fragmented and
1393	   discusses some of the possible approaches to do it.  A solution is
1394	   proposed together with rules and recommendations for fragmenting and
1395	   transporting text samples.

1397	   3GPP Timed Text applications are expected to operate at low bitrates.
1398	   This fact, added to the small size of timed text samples (typically
1399	   one or two hundred bytes) makes fragmentation of text samples a rare
1400	   event.  Samples should usually fit into the MTU size of the used
1401	   network path.

1403	   Nevertheless, some text strings (e.g. ending roll in a movie) and
1404	   some modifier boxes (i.e. for hyperlinks, for karaoke or for styles)
1405	   may become large.  This may also apply for future modifier boxes.  In
1406	   such cases, the first option to consider is whether it is possible to
1407	   adjust the encoding (e.g. the size of sample) in such a way that
1408	   fragmentation is avoided.  If so, this is preferred to fragmentation
1409	   and SHOULD be done.

1411	   Otherwise, if this is not possible or other constraints prevent it,
1412	   fragmentation MAY be used and the basic guidelines given in this
1413	   document MUST be followed:

1415	   o It is RECOMMENDED that text samples are fragmented as seldom as
1416	     possible, i.e. the least possible number of fragments is created
1417	     out of a text sample.

1419	   o If there is some bitrate and free space in the payload available,
1420	     sample descriptions (if at hand) SHOULD be aggregated.

1422	   o Text strings MUST be split at character boundaries, see TYPE 2
1423	     header.  Otherwise, it is not possible to display the text
1424	     contents of a fragment if a previous fragment was lost.  As a
1425	     consequence, text string fragmentation requires knowledge of the
1426	     UTF-8/UTF-16 encoding formats to determine character boundaries.

1428	   o Unlike text strings, the modifier boxes are NOT REQUIRED to split
1429	     at meaningful boundaries.  However, it is RECOMMENDED to do so
1430	     whenever possible.  This decreases the effects of packet loss.
1431	     This payload format does not ensure that partially received
1432	     modifiers be applied to text strings.  If only part of the
1433	     modifiers is received, it is an application issue how to deal with
1434	     these, i.e. whether to use them or not.

1436	        Informative note: ensuring that partially received modifiers can
1437	        be applied to text strings in all cases (for all modifier types
1438	        and for all fragment loss constellations) would place additional
1439	        requirements on the payload format.  In particular this would
1440	        require that: a) senders understand the semantics of the
1441	        modifier boxes and b) specific fragment headers for each of the
1442	        modifier boxes are defined, in addition to the payload formats
1443	        defined below.  Understanding the modifiers semantics means
1444	        knowing, e.g., where does each modifier start and end, which
1445	        text fragments are affected, which modifiers may or may not be
1446	        split or what the fields indicate.  This is necessary for being
1447	        able to split the modifiers in such a way that each fragment can
1448	        be applied independent of previous packet losses.  This would
1449	        require a more intelligent fragmentation entity and more complex
1450	        headers.  Given the low probability of fragmentation and the
1451	        desire to keep the requirements low, it does not seem reasonable
1452	        to specify such modifier box specific headers.

1454	   o Modifier and text string fragments SHOULD be protected against
1455	     packet losses, i.e. using FEC [7], retransmission [11], repetition
1456	     (Section 5) or an equivalent technique.  This minimizes the
1457	     effects of packet loss.

1459	   o An additional requirement when fragmenting text samples is that
1460	     the start of the modifiers MUST be indicated using the payload
1461	     header defined for that purpose, i.e. a TYPE 3 unit MUST be used
1462	     (see Section 4.1.4).  This enables a receiver to detect the start
1463	     of the modifiers as long as there are not two or more consecutive
1464	     packet losses.

1466	   o Finally, sample descriptions SHALL NOT be fragmented, because they
1467	     contain important information that may affect several text
1468	     samples.
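
   Informative sketch: splitting a text string at character boundaries
   requires looking at the encoding, as noted in the guidelines above.
   The fragment below is one possible approach for well-formed UTF-8
   and big endian UTF-16 input.  The function name split_text and the
   max_len parameter are hypothetical, and "character boundary" is
   interpreted here as a code point boundary, which is an assumption.

```python
from typing import List


def split_text(data: bytes, max_len: int, utf16: bool) -> List[bytes]:
    """Split an encoded text string into fragments of at most max_len bytes."""
    if max_len < 4:
        raise ValueError("max_len must hold at least one 4-byte character")
    fragments = []
    pos = 0
    while pos < len(data):
        end = min(pos + max_len, len(data))
        if end < len(data):
            if utf16:
                end -= (end - pos) % 2          # keep 16-bit code units whole
                if 0xD8 <= data[end - 2] <= 0xDB:
                    end -= 2                    # keep surrogate pairs together
            else:
                while end > pos and data[end] & 0xC0 == 0x80:
                    end -= 1                    # back off UTF-8 continuation bytes
        if end <= pos:
            raise ValueError("input is not well-formed")
        fragments.append(data[pos:end])
        pos = end
    return fragments
```

   Each returned fragment can be decoded on its own, so a receiver can
   still display it if a neighbouring fragment is lost.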

1470	4.5. Reassembling Text Samples at the Receiver

1472	   The payload headers defined in this document allow reassembling
1473	   fragmented text samples.  For this purpose, the standard RTP
1474	   timestamp, the duration field (SDUR) and the fields TOTAL/THIS in the
1475	   payload headers are used.

1477	   Units that belong to the same text sample MUST have the same
1478	   timestamp.  TYPE 5 units do not comply with this rule since they are
1479	   not part of any particular text sample.

1481	   The process for collecting the different fragments (units) of a text
1482	   sample is as follows:

1484	     1. Search for units having the same timestamp value, i.e., units
1485	        that belong to the same text sample or sample descriptions that
1486	        shall become available at that time instant.  If several units
1487	        of the same sample are repeated, only one of them SHALL be used.
1488	        Repeated units are those that have the same timestamp and the
1489	        same values for TOTAL/THIS.

1491	                Note that, as mentioned in Section 4.1.1, the receiver
1492	                SHALL ignore units with unrecognized TYPE value.
1493	                However, the RTP header fields and the rest of the units
1494	                (if any) in the payload are still useful.

1496	     2. Check within this set whether any of the units from the text
1497	        sample is missing.  This is done using the TOTAL and THIS
1498	        fields; the TOTAL field indicates how many fragments were
1499	        created out of the text sample and the THIS field indicates the
1500	        position of this fragment in the text sample.  As result of this
1501	        operation two outcomes are possible:

1503	          a. No fragment is missing.  Then the THIS field SHALL be used
1504	             to order the fragments and reassemble the text sample
1505	             before forwarding it to the decoding application.  Special
1506	             care SHALL be taken when reassembling the text string as
1507	             indicated in step 4 below.

1509	          b. One or more fragments are missing: check whether this
1510	             fragment belongs to the text string or to the modifiers:
1511	             TYPE 2 units identify text string fragments, TYPE 3 and 4
1512	             modifier fragments:

1514	              i. If the fragment or fragments missing belong to the
1515	                  text string and the modifiers were received complete,
1516	                  then the received text characters may, at least, be
1517	                  displayed as plain text.  Some modifiers may only be
1518	                  applied as long as it is possible to identify the
1519	                  character numbers, e.g. if only the last text string
1520	                  fragment is lost.  This is the case for modifiers
1521	                  defining specific font styles ('styl'), highlighted
1522	                  characters ('hlit'), karaoke feature ('krok') and
1523	                  blinking characters ('blnk').  Other modifiers such as
1524	                  'dlay' or 'tbox' can be applied without the knowledge
1525	                  of the character number.  It is an application issue
1526	                  to decide whether to apply the modifiers or not.

1528	             ii. If the fragment missing belongs to the modifiers and
1529	                  the text strings were received complete, then the
1530	                  incomplete modifiers may be used.  The text string
1531	                  SHOULD at least be displayed as plain text.  As
1532	                  mentioned in Section 4.4, modifiers may split without
1533	                  observing meaningful boundaries.  Hence, it may not
1534	                  always be possible to make use of partially received
1535	                  modifiers.  However, to avoid this, it is RECOMMENDED
1536	                  that the modifiers do split at meaningful boundaries.

1538	            iii. A third possibility is that it is not possible to
1539	                  discern whether modifiers or text strings were
1540	                  received complete.  E.g. if the TYPE 3 unit of a
1541	                  sample plus the following or preceding packet is lost,
1542	                  there is no way for the RTP receiver to know whether
1543	                  one or both lost packets belong to the modifiers or
1544	                  also carry part of the text string.  Repetition, FEC,
1545	                  retransmission or other protection mechanisms as per
1546	                  section 4.6 are RECOMMENDED to avoid this situation.

1548	             iv. Finally, if it is sure that neither text strings nor
1549	                  modifiers were received complete, then the text
1550	                  strings and the modifiers may be rendered partially or
1551	                  may be discarded.  This is an application choice.

1553	     3. Sample descriptions can be directly associated with the
1554	        reassembled text samples, via the sample description index
1555	        (SIDX).

1557	     4. Reassembling of text strings: since the text strings transported
1558	        in RTP packets MUST NOT include any byte order mark (BOM), the
1559	        receiver MUST prepend it to the reassembled UTF-16 string before
1560	        handing it to the timed text decoder (see Figure 9).  The value
1561	        of the BOM is 0xFEFF because only big endian serialization of
1562	        UTF-16 strings is supported by this payload format.
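
   Informative sketch: the "no fragment is missing" branch (step 2a)
   together with step 4 could be coded as below for units that already
   share one RTP timestamp.  The Unit class and the reassemble function
   are hypothetical names.  THIS is assumed to count fragments from 1
   up to TOTAL, and the partial-loss cases of step 2b are left to the
   application, so this sketch simply returns None for them.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Unit:
    type: int    # 2 = text string fragment, 3 or 4 = modifier fragment
    total: int   # TOTAL field
    this: int    # THIS field
    data: bytes  # fragment contents without the unit header


def reassemble(units: Iterable[Unit], utf16: bool) -> Optional[Tuple[bytes, bytes]]:
    """Return (text_string, modifiers), or None if any fragment is missing."""
    by_pos = {}
    for u in units:
        by_pos.setdefault(u.this, u)   # repeated units: use only one copy
    if not by_pos:
        return None
    totals = {u.total for u in by_pos.values()}
    if len(totals) != 1:
        return None                    # inconsistent TOTAL values
    total = totals.pop()
    if set(by_pos) != set(range(1, total + 1)):
        return None                    # at least one fragment is missing
    ordered = [by_pos[i] for i in range(1, total + 1)]
    text = b"".join(u.data for u in ordered if u.type == 2)
    mods = b"".join(u.data for u in ordered if u.type in (3, 4))
    if utf16:
        text = b"\xfe\xff" + text      # prepend the big endian BOM
    return text, mods
```

   The order in which units arrive does not matter here, since THIS
   alone determines the position of each fragment.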

1564	4.6. On Aggregate Payloads

1566	   Units SHOULD be aggregated to avoid overhead, whenever possible.  The
1567	   aggregate payloads MUST comply with one of the following ordered
1568	   configurations:

1570	   1. Zero or more sample descriptions (TYPE 5) followed by zero or more
1571	     whole text samples (TYPE 1 units).  At least one unit of either
1572	     type MUST be present.

1574	   2. Zero or more sample descriptions followed by zero or one modifier
1575	     fragment, either TYPE 3 or TYPE 4.  At least one unit MUST be
1576	     present.

1578	   3. Zero or more sample descriptions followed by zero or one text
1579	     string fragment (TYPE 2) followed by zero or one TYPE 3 unit.  If
1580	     a TYPE 2 unit and a TYPE 3 unit are present, then they MUST belong
1581	     to the same text sample.  At least one unit MUST be present.
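
   Informative sketch: the three ordered configurations above can be
   checked from the sequence of TYPE values found in a payload.  The
   function name is_valid_aggregate is hypothetical.  The additional
   requirement that a TYPE 2 unit and a TYPE 3 unit in one payload
   belong to the same text sample is not checked here.

```python
def is_valid_aggregate(types):
    """Check a payload's sequence of unit TYPE values against the rules."""
    if not types:
        return False                      # at least one unit MUST be present
    i = 0
    while i < len(types) and types[i] == 5:
        i += 1                            # leading sample descriptions
    rest = tuple(types[i:])
    if all(t == 1 for t in rest):
        return True                       # configuration 1 (or TYPE 5 only)
    # Configuration 2: one TYPE 3 or TYPE 4 unit.
    # Configuration 3: one TYPE 2 unit, optionally followed by one TYPE 3.
    return rest in {(3,), (4,), (2,), (2, 3)}
```

   For example, the sequences 5,1,1,1 and 2,3 are accepted, whereas
   1,5 and 2,4 are rejected.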

1583	   Some observations:

1585	   o Different aggregates than the ones listed above SHALL NOT be used.

1587	   o Sample descriptions MUST be placed in the aggregate payload before
1588	     the occurrence of any non-TYPE 5 units.

1590	   o Correct reception of TYPE 5 units is important since their
1591	     contents may be referenced by several other units in the stream.

1593	     Receivers are unable to use text samples until their corresponding
1594	     sample description is received.  Accordingly, a sender SHOULD send
1595	     multiple copies of a sample description to ensure reliability (see
1596	     section 5).  Receivers MAY use payload specific feedback messages
1597	     [21] to tell a sender that they have received a particular sample
1598	     description.

1600	   o Regarding timestamp calculation: in general, the rules for
1601	     calculating the timestamp of units in an aggregate payload depend
1602	     on the type of unit.  Based on the possible constellations for
1603	     aggregate payloads as above we have:

1605	           o Sample descriptions MUST receive the RTP timestamp of the
1606	             packet in which they are included.

1608	             Note that for TYPE 5 units, the timestamp actually does not
1609	             represent the instant when they are played out, but instead
1610	             the instant at which they become available for use.

1612	          o For the first configuration: the first TYPE 1 unit receives
1613	             the RTP timestamp.  The timestamp of any subsequent TYPE 1
1614	             unit MUST be obtained by adding sample duration and
1615	             timestamp, both of the preceding TYPE 1 unit.

1617	          o For the second and third configuration, all units, TYPE 2,
1618	             3 and 4, MUST receive the RTP timestamp.

1620	           Refer to detailed examples on the timestamp calculation
1621	           below.

1623	   o As per configuration 3 above, a payload MAY contain several
1624	     fragments of one (and only one) text sample.  If so, then exactly
1625	     one TYPE 2 unit followed by exactly one TYPE 3 unit are allowed in
1626	     the same payload.  This is in line with RFC 3640 [12], Section
1627	     2.4, which explicitly disallows combining fragments of different
1628	     samples in the same RTP payload.  Note that, in this special case,
1629	     no timestamp calculation is needed.  I.e., the RTP timestamp of
1630	     both units is equal to the timestamp in the packet's RTP header.

1632	   o Finally, note that the use of empty text samples allows for
1633	     aggregating non-consecutive TYPE 1 units in the same payload.  Two
1634	     text samples, with timestamps TS1 and TS3 and durations SDUR1 and
1635	     SDUR3, are not consecutive if it holds TS1+SDUR1 < TS3.  A
1636	     solution for this is to include an empty TYPE 1 unit with duration
1637	     SDUR2 between them, such that TS2+SDUR2 = TS1+SDUR1+SDUR2 = TS3.
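
   Informative sketch: the timestamp rules listed above (TYPE 5 and
   TYPE 2, 3 and 4 units take the RTP timestamp of the packet; each
   TYPE 1 unit after the first takes the timestamp plus duration of the
   preceding TYPE 1 unit) can be written as below.  The function name
   unit_timestamps and the (type, sdur) tuple layout are hypothetical.
   The modulo 2^32 reflects the 32-bit RTP timestamp.

```python
def unit_timestamps(rtp_ts, units):
    """units: list of (type, sdur); sdur is ignored for TYPE 5 units."""
    result = []
    next_type1_ts = rtp_ts
    for unit_type, sdur in units:
        if unit_type == 1:
            result.append(next_type1_ts)
            next_type1_ts = (next_type1_ts + sdur) % (1 << 32)
        else:
            result.append(rtp_ts)  # TYPE 2, 3, 4 and 5 use the packet timestamp
    return result
```

   An empty TYPE 1 unit placed between two non-consecutive samples only
   contributes its duration, which is how the gap between them is
   bridged.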

1639	   Some examples of aggregate payloads are illustrated in Figure 10.
1640	   (Note: the figure is not scaled.)
1641	      N/A    TS1   TS2     TS3
1642	    +------+-----+------+-----+
1643	    |TYPE5 |TYPE1|TYPE1 |TYPE1|
1644	    +------+-----+------+-----+
1645	      N/A   sdur1  sdur2  sdur3

1647	                                   N/A    TS4
1648	                                 +-----+-------+
1649	                                 |TYPE5| TYPE 1|                   a)
1650	                                 +-----+-------+
1651	                                   N/A   sdur4

1653	                                        TS4         TS4    TS4
1654	                                 +--------------+ +--------------+
1655	                                 |    TYPE2     | |TYPE2 |TYPE 3 | b)
1656	                                 +--------------+ +--------------+
1657	                                       sdur4       sdur4   sdur4

1659	                                        TS4             TS4
1660	                                 +--------------+ +--------------+
1661	                                 | TYPE2| TYPE 3| |     TYPE4    | c)
1662	                                 +--------------+ +--------------+
1663	                                   sdur4  sdur4        sdur4

1665	    |----------PAYLOAD 1------|  |--PAYLOAD 2---| |--PAYLOAD 3---|
1666	             rtpts1                  rtpts2          rtpts3

1668	     KEY:
1669	        TSx means Text Sample x,
1670	        rtptsy represents the standard RTP timestamp for PAYLOAD y
1671	        sdurz the duration of unit z
1672	        N/A means not applicable

1674	                  Figure 10. Example aggregate payloads.

1676	   In Figure 10 four text samples (TS1 through TS4) are sent using three
1677	   RTP packets.  These configurations have been chosen to show how the 5
1678	   TYPE headers are used.  Additionally, three different possibilities
1679	   for the last text sample, TS4, are depicted: a), b) and c).

1681	   In Figure 11, option b) from Figure 10 is chosen to illustrate how
1682	   the timestamp for each unit is found:
1683	      N/A    TS1   TS2    TS3        TS4            TS4    TS4
1684	    +------+-----+------+-----+  +--------------+ +--------------+
1685	    |TYPE5 |TYPE1|TYPE1 |TYPE1|  |    TYPE2     | |TYPE2 |TYPE 3 |
1686	    +------+-----+------+-----+  +--------------+ +--------------+
1687	      N/A   sdur1 sdur2  sdur3         sdur4       sdur4   sdur4

1689	     (#1)    (#2) (#3)   (#4)           (#5)        (#6)    (#7)

1691	    |----------PAYLOAD 1------|  |--PAYLOAD 2---| |--PAYLOAD 3---|
1692	             rtpts1                  rtpts2          rtpts3

1694	               Figure 11. Selected payloads from Figure 10.

1696	   Assuming TSx means Text Sample x, rtptsy represents the standard RTP
1697	   timestamp for PAYLOAD y and sdurz the duration of unit z, the
1698	   timestamp for unit #z, ts(#z), can be found as the sum of rtptsy and
1699	   the cumulative sum of the durations of preceding units in that
1700	   payload (except in the case of PAYLOAD 3 as per rule 3 above).  Thus,
1701	   we have:

1703	          1. for the units in the first aggregate payload, PAYLOAD 1:

1705	                        ts(#1)= rtpts1,
1706	                        ts(#2)= rtpts1,
1707	                        ts(#3)= rtpts1 + sdur1,
1708	                        ts(#4)= rtpts1 + sdur1 + sdur2,

1710	           Note that the TYPE 5 unit and the first TYPE 1 unit both
1711	           have the RTP timestamp.

1713	          2. for PAYLOAD 2:

1715	                        ts(#5)= rtpts2,

1717	          3. for PAYLOAD 3:

1719	                        ts(#6)= ts(#7)= rtpts2= rtpts3

1721	           According to configuration 3 above, the TYPE2 and the TYPE 3
1722	           units shall belong to the same sample.  Hence rtpts3 must be
1723	           equal to rtpts2.  For the same reason, the value of SDUR is
1724	           not used to calculate the timestamp of the next unit.
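
   The timestamp derivation above can be sketched as follows.  This is
   a minimal illustration, not a normative algorithm: the function name
   and the (type, sdur) tuple representation are assumptions made for
   the example.  RTP timestamps are 32 bits wide, so sums are reduced
   modulo 2^32.

```python
def unit_timestamps(rtp_ts, units):
    """Return the timestamp of each unit in one aggregate payload.

    `units` is a list of (unit_type, sdur) tuples in payload order;
    sdur is None for TYPE 5 units, which carry no duration.
    """
    types = [unit_type for unit_type, _ in units]
    if types == [2, 3]:
        # Special case: both fragments belong to the same sample, so
        # both take the RTP timestamp and SDUR is not accumulated.
        return [rtp_ts, rtp_ts]
    timestamps = []
    offset = 0
    for unit_type, sdur in units:
        timestamps.append((rtp_ts + offset) % 2**32)
        if unit_type != 5:
            offset += sdur
    return timestamps


# PAYLOAD 1 of Figure 11 with hypothetical durations 10, 20 and 30.
payload1 = [(5, None), (1, 10), (1, 20), (1, 30)]
ts = unit_timestamps(100, payload1)
assert ts[0] == ts[1] == 100           # ts(#1) = ts(#2) = rtpts1
assert ts[2] == 100 + 10               # ts(#3) = rtpts1 + sdur1
assert ts[3] == 100 + 10 + 20          # ts(#4) = rtpts1 + sdur1 + sdur2
```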

1726	4.7. Payload Examples

1728	   Some examples of payloads using the defined headers are shown below:

1730	       0                   1                   2                   3
1731	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1732	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1733	      |V=2|P|X| CC    |M|    PT       |        sequence number        |
1734	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1735	      |                           timestamp                           |
1736	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1737	      |           synchronization source (SSRC) identifier            |
1738	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1739	      |U|   R   |TYPE1|       LEN  (always >=8)       |    SIDX       |
1740	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1741	      |                     SDUR                      |     TLEN      |
1742	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1743	      |    TLEN       |                                               |
1744	      +---------------+                                               |
1745	      |                  text string (no.bytes=TLEN)                  |
1746	      |                                                               |
1747	      |                                                               |
1748	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1749	      |                   modifiers   (no.bytes=LEN - 8 - TLEN)       |
1750	      |                                                               |
1751	      |                                                               |
1752	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1753	      |U|   R   |TYPE1|       LEN  (always >=8)       |    SIDX       |
1754	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1755	      |                     SDUR                      |     TLEN      |
1756	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1757	      |    TLEN       |                                               |
1758	      +---------------+                                               |
1759	      |                  text string (no.bytes=TLEN)                  |
1760	      |                                                               |
1761	      |                                                               |
1762	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1763	      |                   modifiers   (no.bytes=LEN - 8 - TLEN)       |
1764	      |                                               +-+-+-+-+-+-+-+-+
1765	      |                                               |
1766	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1767	            Figure 12. A payload carrying two TYPE 1 units.

1769	   In Figure 12 an RTP packet carrying two TYPE 1 units is depicted.  It
1770	   can be seen how the length fields LEN and TLEN can be used to find
1771	   the start of the next unit (LEN), the start of the modifiers (TLEN)
1772	   and the length of the modifiers (LEN - 8 - TLEN).
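
   The use of LEN and TLEN described above can be sketched as follows.
   This is an illustration under stated assumptions: the field widths
   are read off Figure 12 (U 1 bit, R 4 bits, TYPE 3 bits, LEN 16 bits,
   SIDX 8 bits, SDUR 24 bits, TLEN 16 bits), and LEN is assumed to count
   every octet of the unit except the first one, which is what the
   "LEN - 8 - TLEN" annotation in the figure implies.  The function
   names and the modifier bytes are placeholders.

```python
import struct

HEADER_LEN = 9  # U/R/TYPE (1) + LEN (2) + SIDX (1) + SDUR (3) + TLEN (2)


def build_type1_unit(sidx, sdur, text, modifiers=b""):
    """Serialize one TYPE 1 unit (U and R bits left at zero)."""
    length = 8 + len(text) + len(modifiers)  # assumed: excludes first octet
    return (bytes([0x01]) + struct.pack("!H", length) + bytes([sidx])
            + sdur.to_bytes(3, "big") + struct.pack("!H", len(text))
            + text + modifiers)


def parse_type1_units(payload):
    """Walk an aggregate payload that contains only TYPE 1 units."""
    units, pos = [], 0
    while pos < len(payload):
        if pos + HEADER_LEN > len(payload):
            raise ValueError("truncated unit header")
        if payload[pos] & 0x07 != 1:
            raise ValueError("not a TYPE 1 unit")
        (length,) = struct.unpack_from("!H", payload, pos + 1)
        sidx = payload[pos + 3]
        sdur = int.from_bytes(payload[pos + 4:pos + 7], "big")
        (tlen,) = struct.unpack_from("!H", payload, pos + 7)
        mod_len = length - 8 - tlen          # length of the modifiers
        end = pos + 1 + length               # start of the next unit
        if mod_len < 0 or end > len(payload):
            raise ValueError("inconsistent LEN/TLEN")
        text_start = pos + HEADER_LEN
        mod_start = text_start + tlen        # start of the modifiers
        units.append({"sidx": sidx, "sdur": sdur,
                      "text": payload[text_start:mod_start],
                      "modifiers": payload[mod_start:end]})
        pos = end
    return units


# Two units as in Figure 12; the modifier octets are arbitrary filler.
payload = (build_type1_unit(1, 3000, b"Hello")
           + build_type1_unit(1, 1500, b"World", b"\x00\x01\x02\x03"))
units = parse_type1_units(payload)
```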

1774	       0                   1                   2                   3
1775	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1776	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1777	      |V=2|P|X| CC    |M|    PT       |        sequence number        |
1778	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1779	      |                           timestamp                           |
1780	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1781	      |           synchronization source (SSRC) identifier            |
1782	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1783	      |U|   R   |TYPE5|      LEN( always >3)          |   SIDX        |
1784	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1785	      |                                                               |
1786	      |                   sample description (no.bytes=LEN - 3)       |
1787	      |                                                               |
1788	      |                                                               |
1789	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1790	      |U|   R   |TYPE1|       LEN  (always >=8)       |    SIDX       |
1791	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1792	      |                      SDUR                     |     TLEN      |
1793	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1794	      |      TLEN     |                                               |
1795	      +-+-+-+-+-+-+-+-+                                               |
1796	      |                  text string fragment (no.bytes=TLEN)         |
1797	      |                                                               |
1798	      |                                                               |
1799	      |                                               +-+-+-+-+-+-+-+-+
1800	      |                                               |
1801	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1802	     Figure 13. An RTP packet carrying a TYPE 5 and a TYPE 1 unit.

1804	   In Figure 13, a sample description and a TYPE 1 unit are aggregated.
1805	   The TYPE 1 unit happens to contain only text strings and is small, so
1806	   an additional TYPE 5 unit is included to take advantage of the
1807	   available space in the packet.

1809	       0                   1                   2                   3
1810	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1811	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1812	      |V=2|P|X| CC    |M|    PT       |        sequence number        |
1813	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1814	      |                           timestamp                           |
1815	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1816	      |           synchronization source (SSRC) identifier            |
1817	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1818	      |U|   R   |TYPE2|          LEN( always >9)      |TOTAL=4|THIS=1 |
1819	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1820	      |                    SDUR                       |    SIDX       |
1821	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1822	      |               SLEN            |                               |
1823	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
1824	      |                  text string fragment (no.bytes=LEN - 9)      |
1825	      |                                                               |
1826	      :                                                               :
1827	      :                                                               :
1828	      |                                               +-+-+-+-+-+-+-+-+
1829	      |                                               |
1830	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1831	    Figure 14. Payload with first text string fragment of a sample.

1833	   In Figure 14, Figure 15 and Figure 16 a text sample is split into
1834	   three RTP packets.  In the first one, the text string is big and
1835	   takes the whole packet length.  In the second packet in Figure 15,
1836	   the only possibility for carrying two fragments of the same text
1837	   sample is represented (see configuration 3 in Section 4.6).  The last
1838	   packet shown carries the last modifier fragment, a TYPE 4 unit.

1840	       0                   1                   2                   3
1841	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1842	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1843	      |V=2|P|X| CC    |M|    PT       |        sequence number        |
1844	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1845	      |                           timestamp                           |
1846	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1847	      |           synchronization source (SSRC) identifier            |
1848	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1849	      |U|   R   |TYPE2|          LEN( always >9)      |TOTAL=4|THIS=2 |
1850	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1851	      |                    SDUR                       |    SIDX       |
1852	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1853	      |               SLEN            |                               |
1854	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
1855	      |                  text string fragment (no.bytes=LEN - 9)      |
1856	      |                                                               |
1857	      |                                                               |
1858	      |                                                               |
1859	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1860	      |U|   R   |TYPE3|        LEN( always >6)        |TOTAL=4|THIS=3 |
1861	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1862	      |                      SDUR                     |               |
1863	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
1864	      |                                                               |
1865	      |                    modifiers (no.bytes=LEN - 6)               |
1866	      |                                               +-+-+-+-+-+-+-+-+
1867	      |                                               |
1868	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1869	       Figure 15. An RTP packet carrying a TYPE2 unit and a TYPE 3 unit.

1871	       0                   1                   2                   3
1872	       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
1873	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1874	      |V=2|P|X| CC    |M|    PT       |        sequence number        |
1875	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1876	      |                           timestamp                           |
1877	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1878	      |           synchronization source (SSRC) identifier            |
1879	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1880	      |U|   R   |TYPE4|        LEN( always >6)        |TOTAL=4|THIS=4 |
1881	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1882	      |                      SDUR                     |               |
1883	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
1884	      |                                                               |
1885	      |                    modifiers (no.bytes=LEN - 6)               |
1886	      |                                               +-+-+-+-+-+-+-+-+
1887	      |                                               |
1888	      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
1889	     Figure 16. An RTP packet carrying last modifiers fragment (TYPE 4).
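
   The TOTAL and THIS fields shown in Figures 14 to 16 can be used to
   decide when all fragments of a sample have arrived.  The sketch below
   is illustrative only: it assumes that both fields are 4 bits wide and
   share the octet that follows LEN (as drawn in the figures), that THIS
   counts from 1, and that the caller has already grouped the fragments
   of one sample by RTP timestamp.  The function names are hypothetical.

```python
def split_total_this(octet):
    """Split the octet that follows LEN into (TOTAL, THIS)."""
    return (octet >> 4) & 0x0F, octet & 0x0F


def sample_is_complete(total_this_octets):
    """Return True once fragments 1..TOTAL of one sample have been seen.

    Duplicates (for example from repeated units) are tolerated.
    """
    seen = set()
    total = None
    for octet in total_this_octets:
        frag_total, this = split_total_this(octet)
        if total is None:
            total = frag_total
        elif frag_total != total:
            raise ValueError("fragments disagree on TOTAL")
        seen.add(this)
    return total is not None and seen == set(range(1, total + 1))


# TOTAL=4 with THIS=1..4, as in Figures 14 to 16.
assert sample_is_complete([0x41, 0x42, 0x43, 0x44])
```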

1891	4.8. Relation to RFC 3640

1893	   RFC 3640 defines a payload format for the transport of any
1894	   non-multiplexed MPEG-4 elementary stream.  One of the various MPEG-4
1895	   elementary stream types is the MPEG-4 timed text stream, specified in
1896	   MPEG-4 part 17 [28], also known as ISO/IEC 14496-17.  MPEG-4 timed
1897	   text streams are capable of carrying 3GPP timed text data, as
1898	   specified in 3GPP TS 26.245 [1].

1900	   MPEG-4 timed text streams are intentionally constructed so as to
1901	   guarantee interoperability between RFC 3640 and this payload format.
1902	   This means that the construction of the RTP packets carrying timed
1903	   text is the same.  I.e., the MPEG-4 timed text elementary stream as
1904	   per ISO/IEC 14496-17 is identical to the (aggregate) payloads
1905	   constructed using this payload format.

1907	   Figure 11 (Relation to RFC 3640, below) illustrates the process of
1908	   constructing an RTP packet containing timed text.  As can be seen in
1909	   the partition block, the (transport) units used in this payload
1910	   format are identical to the Timed Text Units (TTUs) defined in
1911	   ISO/IEC 14496-17.  Likewise, the rules for payload aggregation as per
1912	   Section 4.6 are identical to the ones defined in ISO/IEC 14496-17 and
1913	   compliant with RFC 3640.  As a result, an RTP packet that uses this
1914	   payload format is identical to an RTP packet using RFC 3640 conveying
1915	   TTUs according to ISO/IEC 14496-17.  In particular, MPEG-4 Part 17
1916	   specifies that when using RFC 3640 for transporting timed text
1917	   streams, the "streamType" parameter value is set to 0x0D and the
1918	   value of the "objectTypeIndication" in "config" takes the value 0x08.

1920	                +--------------------------------------+
1921	   Text samples | +--------------+   +--------------+  |
1922	   as per 3GPP  | |Text Sample 1 |   |Text Sample N |  |
1923	   TS 26.245    | +--------------+   +--------------+  |
1924	                +--------------------------------------+
1925	                                  \/
1926	   +-------------------------------------------------------------------+
1927	   | Partition Text Samples into units. TTU[i]= TYPE i units.          |
1928	   |                                                                   |
1929	   |[U R TYPE LEN][{TOTAL,THIS}SIDX{SDUR}{TLEN}{SLEN}][SampleContents] |
1930	   |{..} means present if applicable, [..] means always present        |
1931	   +-------------------------------------------------------------------+
1932	                   \/                                \/
1933	   +-------------------------------------------------------------------+
1934	   |                      Aggregation (if possible)                    |
1935	   +-------------------------------------------------------------------+
1936	                   \/                                \/
1937	   +-------------------------------------------------------------------+
1938	   | RTP Entity adds and fills RTP header and Sends RTP packet, where  |
1939	   |  RTP packets according to this Payload Format =                   |
1940	   |= RTP packets carrying MPEG-4 Timed Text ES over RFC3640           |
1941	   +-------------------------------------------------------------------+
1942	                     Figure 11. Relation to RFC 3640.

1944	   Note: the use of RFC 3640 for transport of ISO/IEC 14496-17 data does
1945	   not require any new SDP parameters or any new mode definition.

1947	4.9. Relation to RFC 2793

1949	   RFC 2793 [24] and its revision [25] specify a protocol for
1950	   enabling text conversation.  Typical applications of that payload
1951	   format are text communication terminals and text conferencing tools.
1952	   Text session contents are specified in ITU-T Recommendation T.140
1953	   [26].  T.140 text is UTF-8 coded as specified in T.140 [26] with no
1954	   extra framing.  The T140block contains one or more T.140 code
1955	   elements as specified in T.140.  Code elements are control sequences
1956	   such as "New Line", "Interrupt", "String Terminator" or "Start of
1957	   String".  Most T.140 code elements are single ISO 10646 [27]
1958	   characters, but some are multiple character sequences. Each character
1959	   is UTF-8 encoded [18] into one or more octets.

1961	   This payload format may also be used for conversational applications
1962	   (even for instant messaging).  However, this is not its main
1963	   target.  The differentiating feature of the 3GPP Timed Text media
1964	   format is that it allows text decoration.  This is especially useful
1965	   in multimedia presentations, karaoke, commercial banners, news
1966	   tickers, clickable text strings and captions.  T.140 text contents
1967	   used in RFC 2793 do not allow the use of text decoration.

1969	   Furthermore, the conversational text RTP payload format recommends a
1970	   method to include redundant text from already transmitted packets in
1971	   order to reduce the risk of text loss caused by packet loss.  Thereby
1972	   payloads would include a redundant copy of the last payload sent.
1973	   This payload format does not describe such a method, but it is also
1974	   applicable here.  As explained in Section 5, packet redundancy SHOULD
1975	   be used whenever possible.  The aggregation guidelines in Section 4.6
1976	   allow redundant payloads.

1978	5. Resilient Transport

1980	   Apart from the basic fragmentation guidelines described in the
1981	   section above, the simplest option for packet loss resilient
1982	   transport is packet repetition.  Such a mechanism may consist of a
1983	   strict window-based repetition mechanism or, simply, a repetition
1984	   mechanism in a wider sense, where new and old packets are mixed, for
1985	   example.

1987	   A server MAY decide to use repetition as a measure for packet loss
1988	   resilience.  Thereby, a server MAY send the same RTP payloads or just
1989	   some of the units from the payloads.

1991	   As for the case of complete payloads, single repeated units MUST
1992	   match exactly the same units sent in the first transmission, i.e., if
1993	   fragmentation is needed, it SHALL be performed only once for each
1994	   text sample.  Only then can a receiver use the already received and
1995	   the repeated units to reconstruct the original text samples.  Since
1996	   the RTP timestamp is used to group together the fragments of a
1997	   sample, care must be taken to preserve the timing of units when
1998	   constructing new RTP packets.

2000	        E.g. if a text sample was originally sent as a single
2001	        non-fragmented text sample (one TYPE 1 unit), a repetition of
2002	        that sample MUST be sent also as a single non-fragmented text
2003	        sample in one unit.  Likewise, if the original text sample was
2004	        fragmented and spread over several RTP packets, say a total of 3
2005	        units, then the repeated fragments SHALL also have the same byte
2006	        boundaries and use the same unit headers and bytes per fragment.

2008	   With repetition, repeated units resolve to the same timestamp as
2009	   their originals.  Where redundant units are available, only one of
2010	   them SHALL be used.

2012	   Regarding the RTP header fields:

2014	   o if the whole RTP payload is repeated, all payload-specific fields
2015	     in the RTP header (the M, TS and PT fields) MUST keep their
2016	     original values, except the sequence number, which MUST be
2017	     incremented to comply with RTP (the TOTAL/THIS fields enable
2018	     fragments with different sequence numbers to be re-assembled).

2020	   o in packets containing single repeated units, the general rules in
2021	     Section 3 for assigning values to the RTP header fields apply.
2022	     Particularly relevant here is to keep the value of the RTP
2023	     timestamp to preserve the timing of the units.
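
   The first rule above (repetition of a whole RTP payload) can be
   sketched as follows.  The RtpPacket class and the function name are
   hypothetical and stand in for whatever packet representation an
   implementation uses; only the handling of the header fields is of
   interest.  The RTP sequence number is 16 bits wide, hence the mask.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RtpPacket:
    """Hypothetical, simplified view of an RTP packet."""
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    payload: bytes


def repeat_whole_payload(original, next_seq):
    """Build a repetition of `original`.

    M, PT, TS and the payload are kept unchanged; only the sequence
    number is replaced by the next one in the stream.
    """
    return replace(original, sequence_number=next_seq & 0xFFFF)


first = RtpPacket(marker=True, payload_type=96, sequence_number=65535,
                  timestamp=90000, ssrc=0x1234, payload=b"unit bytes")
repeated = repeat_whole_payload(first, first.sequence_number + 1)
```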

2025	   Apart from repetition, other mechanisms such as FEC [7],
2026	   retransmission [11] or similar techniques could be used to cope with
2027	   packet losses.

2029	6. Congestion control

2031	   Congestion control for RTP SHALL be implemented in accordance with
2032	   RTP [3], and the applicable RTP profile, e.g. RTP/AVP [17].

2034	   When using this payload format, two main factors may affect
2035	   congestion control:

2037	   o    The use of (unit) aggregation may make the payload format more
2038	   bandwidth efficient, by avoiding header overhead and thus reducing
2039	   the used bitrate.

2041	   o    The use of resilient transport mechanisms: although timed text
2042	   applications typically operate at low bitrates, the increase due to
2043	   resilient transport shall be considered for congestion control
2044	   mechanisms.  This applies to all mechanisms but especially to less
2045	   efficient ones like repetition.

2047	7. Scene Description

2049	7.1. Text Rendering Position and Composition

2051	   In order to set up a timed text session, regardless of whether the
2052	   stream is stored in a 3GP file or streamed live, some initial layout
2053	   information is needed by the communicating peers.

2055	      +-------------------------------------------+
2056	      |      <-> tx                               |    +-------------+
2057	      |     +-------------------------------+     |<---|Display Area |
2058	      |  ^  |                               |     |    +-------------+
2059	      |  :  |                               |     |
2060	      |  :ty|                               |     |    +-------------+
2061	      |  :  |                               |<---------|Video track  |
2062	      |  :  |                               |     |    +-------------+
2063	      |  :  |                               |     |
2064	      |  :  |                               |     |
2065	      |  :  |                               |     |
2066	      |  v  |                               |     |
2067	      |  -  |   x-------------------------+ |     |    +-------------+
2068	      |h ^  |   |                         |<-----------|Text Track   |
2069	      |e :  +---|-------------------------|-+     |    +-------------+
2070	      |i :      | +---------------------+ |       |
2071	      |g :      | |                     | |       |    +-------------+
2072	      |h :      | |                     |<------------ |Text Box     |
2073	      |t v      | +---------------------+ |       |    +-------------+
2074	      |  -      +-------------------------+       |
2075	      +-------------------------------------------+
2076	                <........................>
2077	                        w i d t h
2078	   Figure 17. Illustration of text rendering position and composition

2080	   The parameters used for negotiating the position and size of the text
2081	   track in the display area are shown in Figure 17.  These are the
2082	   "width" and "height" of the text track, its translation values, "tx"
2083	   and "ty", and its "layer" or proximity to the user.

2085	   At the same time, the sender of the stream needs to know the
2086	   receiver's capabilities.  In this case, these are the maximum
2087	   allowable values for the text track height and width, "max-h" and
2088	   "max-w", for the stream the receiver shall display.

2090	   This layout information MUST be conveyed in a reliable form prior
2091	   to the start of the session, e.g. during session announcement or in
2092	   an Offer/Answer (O/A) exchange.  An example of a reliable transport
2093	   may be the out-of-band channel used for SDP.  Sections 8 and 9
2094	   provide details on the mapping of these parameters to SDP
2095	   descriptions and their usage in O/A.

2097	   For stored content, the layout values expressing stream properties
2098	   MUST be obtained from the Track Header Box.  See Section 7.3.

2100	   For live streaming, appropriate values as negotiated during session
2101	   set-up shall be used.

2103	7.2. SMIL usage

2105	   The attributes contained in the Track Header Boxes of a 3GP file only
2106	   specify the spatial relationship of the tracks within the given 3GP
2107	   file.

2109	   If multiple 3GP files are sent, they require spatial synchronization.
2110	   For example, for a text and video stream, the positions of the text
2111	   and video tracks in Figure 17 shall be determined.  For such purpose,
2112	   SMIL [9] MAY be used.

2114	   SMIL assigns regions in the display to each of those files and places
2115	   the tracks within those regions.  Generally, in SMIL, the position of
2116	   one track (or stream) is expressed relative to another track.  This
2117	   is different from the 3GP file, where the upper left corner is the
2118	   reference for all translation offsets.  Hence, only if the position
2119	   in SMIL is relative to the video track origin does this translation
2120	   offset have the same value as (tx, ty) in the 3GP file.

2122	   Note also that the original track header information is used for each
2123	   track only within its region, as assigned by SMIL.  Therefore, even
2124	   if SMIL scene description is used, the track header information
2125	   pieces SHOULD be sent anyway as they represent the intrinsic media
2126	   properties.  See 3GPP SMIL Language Profile in [29] for details.

2128	7.3. Finding layout values in a 3GP file

2130	   In a 3GP file, within the Track Header Box (tkhd):

2132	        o tx, ty: these values specify the translation offset of the
2133	          (text) track relative to the upper left corner of the video
2134	          track, if present.  They are the third-to-last (tx) and
2135	          second-to-last (ty) values in the unity matrix; values are
2136	          fixed-point 16.16 values, restricted to be (signed) integers
2137	          (i.e., the lower 16 bits of each value shall be all zeros).
2138	          Therefore, only the first 16 bits are used for obtaining the
2139	          value of the media type parameters.

2141	        o width, height: they have the same name in the tkhd box.  All
2142	          (unsigned) 32 bits are meaningful.

2144	        o layer: all (signed) 16 bits are used.
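
   The conversion of a 16.16 fixed-point matrix entry into the integer
   used for "tx" or "ty" can be sketched as follows.  This is a minimal
   illustration: the function name is hypothetical, and the input is
   assumed to be the raw 32-bit value already read from the Track Header
   Box as an unsigned integer.

```python
def fixed_16_16_to_int(raw):
    """Return the signed integer part of a 16.16 fixed-point value.

    The lower 16 bits (the fractional part) are required to be zero.
    """
    if not 0 <= raw <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit value")
    if raw & 0xFFFF:
        raise ValueError("fractional part must be all zeros")
    value = raw >> 16                      # keep only the first 16 bits
    return value - 0x10000 if value & 0x8000 else value


# Hypothetical offsets: 20 pixels right, 10 pixels up.
tx = fixed_16_16_to_int(0x00140000)
ty = fixed_16_16_to_int(0xFFF60000)
assert (tx, ty) == (20, -10)
```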

2146	8. 3GPP Timed Text Media Type

2148	   The media subtype for the 3GPP Timed Text codec is allocated from the
2149	   standards tree.  The top-level media type under which this payload
2150	   format is registered is 'video'.  This registration is done using the
2151	   template defined in [31] and following RFC 3555 [30].

2153	   The receiver MUST ignore any unrecognized parameter.

2155	   Media type: video

2157	   Media subtype: 3gpp-tt

2159	   Required parameters:

2161	        rate:
2162	                Refer to Section 3 in RFCXXXX.

2164	        sver:
2165	                The parameter "sver" contains a list of supported
2166	                backwards-compatible versions of the timed text format
2167	                specification (3GPP TS 26.245) that the sender is
2168	                willing to receive (and which are the same that it would
2169	                be willing to send).  The first value is the value
2170	                preferred to receive (or preferred to send).  The first
2171	                value MAY be followed by a comma-separated list of
2172	                versions that SHOULD be used as alternatives.  The order
2173	                is meaningful, the first being the most preferred and
2174	                the last the least preferred.  Each entry has the format
2175	                Zi(xi*256+yi), where "Zi" is the number of the Release,
2176	                "xi" and "yi" are taken from the 3GPP specification
2177	                version, i.e. vZi.xi.yi.  For example, for 3GPP TS
2178	                26.245 v6.0.0, Zi(xi*256+yi)=6(0), the version value is
2179	                "60".  (Note that "60" is the concatenation of the
2180	                values Zi=6 and (xi*256+yi)=0 and not their product.)

2182	                If no "sver" value is available, for example, when
2183	                streaming out of a 3GP file, the default value "60",
2184	                corresponding to the 3GPP Release 6 version of 3GPP TS
2185	                26.245, SHALL be used.
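
                The construction of an "sver" entry from a
                specification version can be sketched as follows.  The
                function names are hypothetical, and the sketch only
                builds values; it does not parse them.

```python
def sver_entry(version):
    """Build one "sver" entry from a version string such as "6.0.0".

    The result is the decimal Release number Z concatenated with the
    decimal value of x*256+y (a concatenation, not a product).
    """
    z, x, y = (int(part) for part in version.lstrip("v").split("."))
    return f"{z}{x * 256 + y}"


def sver_value(versions):
    """Join entries in preference order, most preferred first."""
    return ",".join(sver_entry(v) for v in versions)


assert sver_entry("v6.0.0") == "60"    # the example given above
```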

2187	   Optional parameters:

2189	        tx:
2190	                This parameter indicates the horizontal translation
2191	                offset in pixels of the text track with respect to the
2192	                origin of the video track.  This value is the decimal
2193	                representation of a 16-bit signed integer.  Refer to
2194	                3GPP TS 26.245 for an illustration of this parameter.

2196	        ty:
2197	                This parameter indicates the vertical translation offset
2198	                in pixels of the text track with respect to the origin
2199	                of the video track.  This value is the decimal
2200	                representation of a 16-bit signed integer.  Refer to
2201	                3GPP TS 26.245 for an illustration of this parameter.

2203	        layer:
2204	                This parameter indicates the proximity of the text track
2205	                to the viewer.  More negative values mean closer to the
2206	                viewer.  This parameter has no units.  This value is the
2207	                decimal representation of a 16-bit signed integer.

2209	        tx3g:
2210	                This parameter MUST be used for conveying sample
2211	                descriptions out-of-band.  It contains a comma-separated
2212	                list of base64-encoded entries.  The entries of this
2213	                list MAY appear in any order, and the list
2214	                SHALL NOT be empty.  Each entry is the result of running
2215	                base64 encoding over the concatenation of the (static)
2216	                SIDX value as 8-bit unsigned integer and the (static)
2217	                sample description for that SIDX, in this order.  The
2218	                format of a sample description entry can be found in
2219	                3GPP TS 26.245 Release 6 and later releases.  All
2220	                servers and clients MUST understand this parameter and
2221	                MUST be capable of using the sample description(s)
2222	                contained in it.  Please refer to RFC 3548 for details
2223	                on the base64 encoding.

2225	        width:
2226	                This parameter indicates the width in pixels of the text
2227	                track or area of the text being sent.  This value is the
2228	                decimal representation of a 32-bit unsigned integer.
2229	                Refer to 3GPP TS 26.245 for an illustration of this
2230	                parameter.

2232	        height:
2233	                This parameter indicates the height in pixels of the
2234	                text track being sent.  This value is the decimal
2235	                representation of a 32-bit unsigned integer.  Refer to
2236	                3GPP TS 26.245 for an illustration of this parameter.

2238	        max-w:
2239	                This parameter indicates display capabilities.  This is
2240	                the maximum "width" value that the sender of this
2241	                parameter supports.  This value is the decimal
2242	                representation of a 32-bit unsigned integer.
2243	        max-h:
2244	                This parameter indicates display capabilities.  This is
2245	                the maximum "height" value that the sender of this
2246	                parameter supports.  This value is the decimal
2247	                representation of a 32-bit unsigned integer.
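   The construction of the "tx3g" value described in the parameter
   list above (base64 over the SIDX octet followed by the sample
   description, entries joined by commas) might be sketched as below.
   The function names are hypothetical and the sample description
   octets are placeholders, not a valid TS 26.245 sample description.

```python
import base64
from typing import Dict


def encode_tx3g(descriptions: Dict[int, bytes]) -> str:
    """Build a tx3g value from {SIDX: sample description octets}."""
    if not descriptions:
        raise ValueError("the tx3g list SHALL NOT be empty")
    entries = []
    for sidx, sample_description in descriptions.items():
        if not 0 <= sidx <= 255:
            raise ValueError("SIDX must fit in an 8-bit unsigned integer")
        # SIDX first, then the sample description, in this order.
        raw = bytes([sidx]) + sample_description
        entries.append(base64.b64encode(raw).decode("ascii"))
    return ",".join(entries)


def decode_tx3g(value: str) -> Dict[int, bytes]:
    """Recover {SIDX: sample description octets} from a tx3g value."""
    result = {}
    for entry in value.split(","):
        raw = base64.b64decode(entry.strip(), validate=True)
        if not raw:
            raise ValueError("empty tx3g entry")
        result[raw[0]] = raw[1:]
    return result


# Placeholder octets stand in for real sample descriptions.
value = encode_tx3g({129: b"placeholder-1", 130: b"placeholder-2"})
print(value)
print(decode_tx3g(value))
```

   The standard base64 alphabet contains neither "," nor ";", so the
   entries can be split on commas without ambiguity.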

2249	   Encoding considerations:

2251	        This media type is framed (see section 4.8 in [31]) and
2252	        partially contains binary data.

2254	   Restrictions on usage:

2256	        This media type depends on RTP framing, and hence is only
2257	        defined for transfer via RTP [3]. Transport within other framing
2258	        protocols is not defined at this time.

2260	   Security considerations:

2262	        Please refer to Section 11 of RFCXXXX.

2264	   Interoperability considerations:

2266	        The 3GPP Timed Text media format and its file storage is
2267	        specified in Release 6 of 3GPP TS 26.245 "Transparent end-to-end
2268	        packet switched streaming service (PSS); Timed Text Format
2269	        (Release 6)".  Note also that 3GPP may in future Releases
2270	        specify extensions or updates to the timed text media format in
2271	        a backwards-compatible way, e.g., new modifier boxes or
2272	        extensions to the sample descriptions.  The payload format
2273	        defined in RFCXXXX allows for such extensions.  For future 3GPP
2274	        Releases of the Timed Text Format, the parameter "sver" is used
2275	        to identify the exact specification used.

2277	        The defined storage format for 3GPP Timed Text format is the
2278	        3GPP File Format (3GP) [32]. 3GP files may be transferred using
2279	        the media type video/3gpp as registered by RFC 3839 [33].  The
2280	        3GPP File Format is a container file that may contain, e.g.,
2281	        audio and video which may be synchronized with the
2282	        3GPP Timed Text.

2284	   Published specification: RFC XXXX

2286	   Applications which use this media type:

2288	        Multimedia streaming applications.

2290	   Additional information:

2292	        The 3GPP Timed Text media format is specified in 3GPP TS 26.245
2293	        "Transparent end-to-end packet switched streaming service (PSS);
2294	        Timed Text Format (Release 6)".  This document and future
2295	        extensions to the 3GPP Timed Text format are publicly available
2296	        at http://www.3gpp.org.

2298	        Magic number(s): None.

2300	        File extension(s): None.

2302	        Macintosh File Type Code(s): None.

2304	   Person & email address to contact for further information:

2306	        Jose Rey, jose.rey@eu.panasonic.com
2307	        Yoshinori Matsui, matsui.yoshinori@jp.panasonic.com
2308	        Audio/Video Transport Working Group.

2310	   Intended usage: COMMON

2312	   Authors:
2313	        Jose Rey
2314	        Yoshinori Matsui

2316	   Change controller:
2317	        IETF Audio/Video Transport Working Group delegated from the
2318	        IESG.

2320	9. SDP usage

2322	9.1. Mapping to SDP

2324	   The information carried in the media type specification has a
2325	   specific mapping to fields in SDP [4].  If SDP is used to specify
2326	   sessions using this payload format, the mapping is done as follows:

2328	   o The media type ("video") goes in the SDP "m=" as the media name.

2330	       m=video <port number> RTP/<RTP profile> <dynamic payload type>

2332	   o The media subtype ("3gpp-tt") and the timestamp clockrate "rate"
2333	     (the RECOMMENDED 1000 Hz or other value) go in SDP "a=rtpmap" line
2334	     as the encoding name and rate, respectively:

2336	       a=rtpmap:<payload type> 3gpp-tt/1000

2338	   o The REQUIRED parameter "sver" goes in the SDP "a=fmtp" attribute
2339	     by copying it directly from the media type string as a semicolon
2340	     separated parameter=value pair.

2342	   o The OPTIONAL parameters "tx", "ty", "layer", "tx3g", "width",
2343	     "height", "max-w" and "max-h" go in the SDP "a=fmtp" attribute by
2344	     copying them directly from the media type string as a semicolon
2345	     separated list of parameter=value(s) pairs:

2347	       a=fmtp:<dynamic payload type> <parameter
2348	       name>=<value>[,<value>][; <parameter name>=<value>]

2350	   o   Any parameter unknown to the device that uses the SDP SHALL be
2351	       ignored.  E.g., parameters added in later media format
2352	       specifications MAY be copied into the SDP and SHALL be ignored
2353	       by receivers that do not understand them.
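   The mapping above could be exercised with a small helper that
   builds and parses the "a=fmtp" line.  This is a sketch only: the
   function names and the KNOWN_PARAMETERS tuple are hypothetical, and
   values are kept as strings without range checking.  Unknown
   parameters are dropped on parsing, in line with the last bullet.

```python
from typing import Dict, Tuple

# Parameter names defined for this payload format in Section 8.
KNOWN_PARAMETERS = ("sver", "tx", "ty", "layer", "tx3g",
                    "width", "height", "max-w", "max-h")


def build_fmtp(payload_type: int, params: Dict[str, str]) -> str:
    """Serialize parameters as a semicolon-separated fmtp line."""
    pairs = "; ".join(f"{name}={value}" for name, value in params.items())
    return f"a=fmtp:{payload_type} {pairs}"


def parse_fmtp(line: str) -> Tuple[int, Dict[str, str]]:
    """Parse an fmtp line, ignoring parameters that are not known."""
    prefix, _, rest = line.partition(" ")
    if not prefix.startswith("a=fmtp:"):
        raise ValueError("not an a=fmtp line")
    payload_type = int(prefix[len("a=fmtp:"):])
    params = {}
    for item in rest.split(";"):
        # Split on the first "=" only, so base64 padding in tx3g survives.
        name, sep, value = item.strip().partition("=")
        if sep and name in KNOWN_PARAMETERS:
            params[name] = value
    return payload_type, params


line = build_fmtp(98, {"tx": "100", "sver": "6256,60", "future-param": "1"})
print(line)
print(parse_fmtp(line))
```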

2355	9.2. Parameter Usage in the SDP Offer/Answer Model

2357	   In this section the meaning of the SDP parameters defined in this
2358	   document within the Offer/Answer [13] context is explained.

2360	   In unicast, sender and receiver typically negotiate the streams, i.e.
2361	   which codecs and parameter values are used in the session.  This is
2362	   also possible in multicast to a lesser extent.

2364	   Additionally, the meaning of the parameters MAY vary depending on
2365	   the direction used.  In the following sections, a
2366	   "<directionality> offer" means an offer that contains a stream set to
2367	   <directionality>.  <directionality> may take the values sendrecv,
2368	   sendonly and recvonly.  Similar considerations apply for answers.
2369	   E.g., an answer to a sendonly offer is a recvonly answer.

2371	9.2.1. Unicast Usage

2373	   The following types of parameters are used in this payload format:

2375	     1. Declarative parameters: offerer and answerer declare the values
2376	        they will use for the incoming (sendrecv/recvonly) or outgoing
2377	        (sendonly) stream.  Offerer and answerer MAY use different
2378	        values.

2380	          a. "tx", "ty" and "layer": these are parameters describing
2381	             where the received text track is placed.  Depending on the
2382	             directionality:

2384	              i. MUST appear in all sendrecv offers and answers and in
2385	                  all recvonly offers and answers (thus applying to the
2386	                  incoming stream).  In the case of sendrecv offers and
2387	                  answers and in recvonly offers, these values SHOULD be
2388	                  used by the sender of the stream unless it has a
2389	                  particular preference, in which case, it MUST make
2390	                  sure that these different values do not corrupt the
2391	                  presentation.  For recvonly answers, the answerer MAY
2392	                  accept the proposed values for the incoming stream (in
2393	                  a sendonly offer, see bullet below) or respond with
2394	                  different ones.  The offerer MUST use the returned
2395	                  values.

2397	             ii. MAY appear in sendonly offers and MUST appear in
2398	                  sendonly answers.  In sendonly offers they specify the
2399	                  values that the offerer proposes for sending (see
2400	                  example in Section 9.3).  In sendonly answers these
2401	                  values SHOULD be copied from the corresponding
2402	                  recvonly offer upon accepting the stream, unless a
2403	                  particular preference by the receiver of the stream
2404	                  exists, as explained in the previous bullet.

2406	     2. Parameters describing the display capabilities, "max-h" and
2407	        "max-w", which indicate the maximum dimensions of the text track
2408	        (text display area) for the incoming stream "tx" and "ty" values
2409	        (see Figure 17).  "max-h" and "max-w" MUST be included in all
2410	        offers and answers where "tx" and "ty" refer to the incoming
2411	        stream, thus excluding sendonly offers and answers (see example
2412	        in Section 9.3), where they SHALL NOT be present.

2414	     3. Parameters describing the sent stream properties, i.e. the
2415	        sender of the stream decides upon the values of these:

2417	          a. "width" and "height", specify the text track dimensions.
2418	             They SHALL ALWAYS be present in sendrecv and sendonly
2419	             offers and answers.  For recvonly answers, the answerer
2420	             MUST include the offered parameter values (if any) verbatim
2421	             in the answer upon accepting the stream.

2423	          b. "tx3g" contains static sample descriptions.  It MAY only be
2424	             present in sendrecv and sendonly offers and answers.  This
2425	             parameter applies to the stream that offerers or answerers
2426	             send.

2428	     4. Negotiable parameters, which MUST be agreed on.  This is the
2429	        case of "sver".  This parameter MUST be present in every offer
2430	        and answer.  The answerer SHALL choose one supported value from
2431	        the offerer's list or else it MUST remove the stream or reject
2432	        the session.

2434	     5. Symmetric parameters: "rate", timestamp clockrate, belongs to
2435	        this class.  Symmetric parameters MUST be echoed verbatim in the
2436	        answer.  Otherwise the stream MUST be removed or the session
2437	        rejected.

2439	   The following Table 1 summarises all options:

2441	     +..---------------------------+----------+----------+----------+
2442	     |   ``--..__  Directionality/ | sendrecv | recvonly | sendonly |
2443	     + Type of   ``--..__   O or A +----------+----------+----------+
2444	     |    Parameter      ``--..__  |   O/A    |   O/A    |   O/A    |
2445	     +--------------+------------``+----------+----------+----------+
2446	     | Declarative  |tx, ty, layer |   M/M    |   M/M    |   m/M    |
2447	     |              |              |          |          |          |
2448	     +--------------+--------------+----------+----------+----------+
2449	     | Display      |max-h, max-w  |   M/M    |   M/M    |   -/-    |
2450	     | Capabilities |              |          |          |          |
2451	     +--------------+--------------+----------+----------+----------+
2452	     | Stream       |height, width |   M/M    |   -/(M)  |   M/M    |
2453	     | properties   |tx3g          |   m/m    |   -/-    |   m/m    |
2454	     |              |              |          |          |          |
2455	     +--------------+--------------+----------+----------+----------+
2456	     |  Negotiable  |sver          |   M/M    |   M/M    |   M/M    |
2457	     |              |              |          |          |          |
2458	     +--------------+--------------+----------+----------+----------+
2459	     |  Symmetric   |rate          |   M/M    |   M/M    |   M/M    |
2460	     +--------------+--------------+----------+----------+----------+
2461	          Table 1. Parameter usage in Unicast Offer / Answer.

2463	   Key:
2464	        o M means MUST be present
2465	        o m means MAY be present (such as proposed values)
2466	        o (M) or (m) means MUST or MAY, if applicable
2467	        o a hyphen ("-") means the parameter MUST NOT be present.
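   The presence rules of Table 1 lend themselves to a table-driven
   check.  The sketch below transcribes the fmtp rows of the table
   ("rate" is carried in "a=rtpmap" and is left out).  The names ROWS
   and "check" are hypothetical, and treating "(M)" as "no check" is an
   assumption, since whether it applies depends on the peer's offer.

```python
from typing import Dict, List, Set, Tuple


def _row(sendrecv: str, recvonly: str, sendonly: str) -> Dict[str, Tuple[str, ...]]:
    """Split "O/A" cells into (offer rule, answer rule) tuples."""
    cells = {"sendrecv": sendrecv, "recvonly": recvonly, "sendonly": sendonly}
    return {direction: tuple(cell.split("/")) for direction, cell in cells.items()}


# Transcribed from Table 1: M = MUST, m = MAY, (M) = MUST if applicable,
# "-" = MUST NOT be present.
ROWS = {
    ("tx", "ty", "layer"): _row("M/M", "M/M", "m/M"),
    ("max-h", "max-w"):    _row("M/M", "M/M", "-/-"),
    ("height", "width"):   _row("M/M", "-/(M)", "M/M"),
    ("tx3g",):             _row("m/m", "-/-", "m/m"),
    ("sver",):             _row("M/M", "M/M", "M/M"),
}


def check(direction: str, role: str, present: Set[str]) -> List[str]:
    """Return the Table 1 violations for one offer or answer."""
    if role not in ("offer", "answer"):
        raise ValueError("role must be 'offer' or 'answer'")
    index = 0 if role == "offer" else 1
    problems = []
    for names, row in ROWS.items():
        rule = row[direction][index]
        for name in names:
            if rule == "M" and name not in present:
                problems.append(f"{name} MUST be present")
            elif rule == "-" and name in present:
                problems.append(f"{name} MUST NOT be present")
    return problems


print(check("recvonly", "offer", {"tx", "ty", "layer", "max-h", "max-w", "sver"}))
print(check("sendonly", "offer", {"height", "width", "sver", "max-h"}))
```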

2469	   Other observations regarding parameter usage:

2471	     o Translation and transparency values: in sendonly offers "tx",
2472	        "ty" and "layer" indicate proposed values.  This is useful for
2473	        visually composed sessions where the different streams occupy
2474	        different parts of the display, e.g., a video stream and the
2475	        captions.  These are just suggested values because it is the
2476	        peer rendering the text that ultimately decides where to place
2477	        the text track.

2479	     o Text track (area) dimensions, "height" and "width": in the case
2480	        of sendonly offers, an answerer accepting the offer MUST be
2481	        prepared to render the stream using these values.  If any of
2482	        these conditions are not met, the stream MUST be removed or the
2483	        session rejected.

2485	     o Display capabilities, "max-h" and "max-w": an answerer sending a
2486	        stream SHALL ensure that the "height" and "width" values in the
2487	        answer are compatible with the offerer's signalled capabilities.

2489	     o Version handling via "sver": the idea is that offerer and
2490	        answerer communicate using the same version.  This is achieved
2491	        by letting the answerer choose from a list of supported
2492	        versions, "sver".  For recvonly streams, the first value in the
2493	        list is the preferred version to receive.  Consequently, for
2494	        sendonly (and sendrecv) streams the first value is the one
2495	        preferred for sending (and receiving).  The answerer MUST choose
2496	        one value and return it in the answer.  Upon receiving the
2497	        answer, the offerer SHALL be prepared to send (sendonly and
2498	        sendrecv) and receive (recvonly and sendrecv) a stream using
2499	        that version.  If none of the versions in the list is supported
2500	        the stream MUST be removed or the session rejected.  Note that,
2501	        if alternative non-compatible versions are offered, then this
2502	        SHALL be done using different payload types.
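   The unicast version handling described in the last bullet can be
   illustrated as follows.  The function name is hypothetical, and
   walking the offered list in order, so that the offerer's preference
   is honoured, is an assumption: the text only requires that the
   answerer choose one supported value.

```python
from typing import Iterable, Optional


def choose_sver(offered: str, supported: Iterable[str]) -> Optional[str]:
    """Pick one version from the offered comma-separated "sver" list.

    Returns None when no offered version is supported, in which case
    the stream is to be removed or the session rejected.
    """
    supported_set = set(supported)
    for version in (item.strip() for item in offered.split(",")):
        if version in supported_set:
            return version
    return None


print(choose_sver("6256,60", {"60"}))          # answerer supports only "60"
print(choose_sver("6256,60", {"60", "6256"}))  # first offered value wins
print(choose_sver("6256", {"60"}))             # no common version
```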

2504	9.2.2. Multicast Usage

2506	   In multicast the parameter usage is similar to the unicast case,
2507	   except in the following cases:

2509	   o the parameters "tx", "ty" and "layer" in multicast offers only
2510	     have meaning for sendrecv and recvonly streams.  In order for all
2511	     clients to have the same vision of the session, they MUST be used
2512	     symmetrically.

2514	   o for "height", "width" and the "tx3g" (for sendrecv and sendonly),
2515	     multicast offers specify which values of these parameters the
2516	     participants MUST use for sending.  Thus, if the stream is
2517	     accepted, the answerer MUST also here include them verbatim in the
2518	     answer (also "tx3g", if present).

2520	   o The capability parameters, "max-h" and "max-w", SHALL NOT be used
2521	     in multicast.  If the offered text track should change in size, a
2522	     new offer SHALL be used instead.

2524	   o Regarding version handling:

2526	     In the case of multicast offers, an answerer MAY accept a
2527	     multicast offer as long as one of the versions listed in the
2528	     "sver" is supported.  Therefore, if the stream is accepted, the
2529	     answerer MUST choose its preferred version but, unlike in unicast,
2530	     the offerer SHALL NOT change the offered stream to this chosen
2531	     version because there may be other session participants that do
2532	     support the newer extensions.  Consequently, different session
2533	     participants may end up using different backwards-compatible media
2534	     format versions.  It is RECOMMENDED that the multicast offer
2535	     contains a limited number of versions, in order for all
2536	     participants to have the same view of the session.  This is a
2537	     responsibility of the session creator.  If none of the offered
2538	     versions is supported, the stream SHALL be removed or the session
2539	     rejected.  Also in this case, if alternative non-compatible
2540	     versions are offered, then this SHALL be done using different
2541	     payload types.

2543	9.3. Offer/Answer Examples

2545	   In these unicast O/A examples the long lines are wrapped around.
2546	   Static sample descriptions are shortened for clarity.

2548	   For sendrecv:

2550	   O -> A

2552	   m=video <port> RTP/AVP 98
2553	   a=rtpmap:98 3gpp-tt/1000
2554	   a=fmtp:98 tx=100; ty=100; layer=0; height=80; width=100; max-h=120;
2555	   max-w=160; sver=6256,60; tx3g=81...
2556	   a=sendrecv

2558	   A -> O

2560	   m=video <port> RTP/AVP 98
2561	   a=rtpmap:98 3gpp-tt/1000
2562	   a=fmtp:98 tx=100; ty=95; layer=0; height=90; width=100; max-h=100;
2563	   max-w=160; sver=60; tx3g=82...
2564	   a=sendrecv

2566	   In this example the offerer is telling the answerer where it will
2567	   place the received stream and what is the maximum height and width
2568	   allowable for the stream that it will receive.  Also, it tells the
2569	   answerer the dimensions of the text track for the stream sent and
2570	   which sample description it shall use.  It offers two versions, 6256
2571	   and 60.  The answerer responds with an equivalent set of parameters
2572	   for the stream it receives.  In this case the answerer's "max-h" and
2573	   "max-w" are compatible with the offerer's "height" and "width".
2574	   Otherwise, the answerer would have to remove this stream and the
2575	   offerer would have to issue a new offer taking the answerer's
2576	   capabilities into account.  This is possible only if multiple payload
2577	   types are present in the initial offer so that at least one of them
2578	   matches the answerer's capabilities as expressed by "max-h" and
2579	   "max-w" in the negative answer.  Note also that the answerer's text
2580	   box dimensions fit within the maximum values signalled in the offer.
2581	   Finally, the answerer chooses to use version 60 of the timed text
2582	   format.
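   The compatibility check discussed for this sendrecv example can be
   written out with the values shown above.  This sketch assumes that
   "compatible" means the sent text track is no larger than the
   receiver's signalled maximum in either dimension; the helper name
   "fits" is hypothetical.

```python
def fits(height: int, width: int, max_h: int, max_w: int) -> bool:
    """True if a text track of height x width fits within max_h x max_w."""
    return height <= max_h and width <= max_w


# Values taken from the sendrecv offer and answer above.
offer = {"height": 80, "width": 100, "max-h": 120, "max-w": 160}
answer = {"height": 90, "width": 100, "max-h": 100, "max-w": 160}

# Each side's sent stream is checked against the other side's capabilities.
offer_ok = fits(offer["height"], offer["width"], answer["max-h"], answer["max-w"])
answer_ok = fits(answer["height"], answer["width"], offer["max-h"], offer["max-w"])
print(offer_ok, answer_ok)
```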

2584	   For recvonly:

2586	   O -> A

2588	   m=video <port> RTP/AVP 98
2589	   a=rtpmap:98 3gpp-tt/1000
2590	   a=fmtp:98 tx=100; ty=100; layer=0; max-h=120; max-w=160; sver=6256,60
2591	   a=recvonly

2593	   A -> O

2595	   m=video <port> RTP/AVP 98
2596	   a=rtpmap:98 3gpp-tt/1000
2597	   a=fmtp:98 tx=100; ty=100; layer=0; height=90; width=100; sver=60;
2598	   tx3g=82...
2599	   a=sendonly

2601	   In this case, the offer is different from the previous case: it does
2602	   not include the stream properties: "height", "width" and "tx3g".  The
2603	   answerer copies the "tx", "ty" and "layer" values, thus acknowledging
2604	   these.  "max-h" and "max-w" are not present in the answer because the
2605	   "tx" and "ty" (and "layer") in this special case do not apply to the
2606	   received, but to the sent stream.  Also, if offerer and answerer had
2607	   very different display sizes, it would not be possible to express
2608	   the answerer's capabilities.  In the example above and for an
2609	   answerer with a 50x50 display, the translation values are already out
2610	   of range.

2612	   For sendonly:

2614	   O -> A

2616	   m=video <port> RTP/AVP 98
2617	   a=rtpmap:98 3gpp-tt/1000
2618	   a=fmtp:98 tx=100; ty=100; layer=0; height=80; width=100;
2619	   sver=6256,60; tx3g=81...
2620	   a=sendonly
2621	   A -> O

2623	   m=video <port> RTP/AVP 98
2624	   a=rtpmap:98 3gpp-tt/1000
2625	   a=fmtp:98 tx=100; ty=100; layer=0; height=80; width=100; max-h=100;
2626	   max-w=160; sver=60
2627	   a=recvonly

2629	   Note that "max-h" and "max-w" are not present in the offer.  Also,
2630	   with this answer, the answerer would accept the offer as is (thus
2631	   echoing "tx", "ty", "height", "width" and "layer") and additionally
2632	   inform the offerer about its capabilities: "max-h" and "max-w".

2634	   Another possible answer for this case would be:

2636	   A -> O

2638	   m=video <port> RTP/AVP 98
2639	   a=rtpmap:98 3gpp-tt/1000
2640	   a=fmtp:98 tx=120; ty=105; layer=0; max-h=95; max-w=150; sver=60
2641	   a=recvonly

2643	   In this case the answerer does not accept the values offered.  The
2644	   offerer MUST use these values or else remove the stream.

2646	9.4. Parameter Usage outside of Offer/Answer

2648	   SDP may also be employed outside of the Offer/Answer context, for
2649	   instance for multimedia sessions that are announced through the
2650	   Session Announcement Protocol (SAP) [14], or streamed through the
2651	   Real Time Streaming Protocol (RTSP) [15].

2653	   In this case, the receiver of a session description is required to
2654	   support the parameters and given values for the streams or else it
2655	   MUST reject the session.  It is the responsibility of the sender (or
2656	   creator) of the session descriptions to define the session parameters
2657	   so that the probability of unsuccessful session setup is minimized.
2658	   This is out of the scope of this document.

2660	10. IANA Considerations

2662	   IANA is requested to register the media subtype name "3gpp-tt" for
2663	   the media type "video" as specified in Section 8 of this document.

2665	11. Security considerations

2667	   RTP packets using the payload format defined in this specification
2668	   are subject to the security considerations discussed in the RTP
2669	   specification [3] and any applicable RTP profile, e.g. AVP [17].

2671	   In particular, an attacker may invalidate the current set of active
2672	   sample descriptions at the client by means of repeating a packet with
2673	   an old sample description, i.e. replay attack.  This would mean that
2674	   the display of the text would be corrupted, if displayed at all.
2675	   Another form of attack may consist in sending redundant fragments,
2676	   whose boundaries do not match the exact boundaries of the originals
2677	   (as indicated by LEN) or fragments that carry different sample
2678	   lengths (SLEN).  This may cause a decoder to crash.

2680	   These types of attack may easily be avoided by using source
2681	   authentication and integrity protection.

2683	   Additionally, peers in a timed text session may desire to retain
2684	   privacy in their communication, i.e. confidentiality.

2686	   This payload format does not provide any mechanisms for achieving
2687	   these.  Confidentiality, integrity protection and authentication have
2688	   to be solved by a mechanism external to this payload format, e.g.,
2689	   SRTP [10].

2691	12. References

2693	12.1. Normative References

2695	   [1]  Transparent end-to-end packet switched streaming service (PSS);
2696	     Timed Text Format (Release 6), TS 26.245 v 6.0.0, June 2004.

2698	   [2]  ISO/IEC 14496-12:2004 Information technology - Coding of
2699	     audio-visual objects - Part 12: ISO base media file format.

2701	   [3]  H. Schulzrinne, S. Casner, R. Frederick and V. Jacobson, "RTP: A
2702	     Transport Protocol for Real-Time Applications", STD 64, RFC 3550,
2703	     July 2003.

2705	   [4]  M. Handley, V. Jacobson, "SDP: Session Description Protocol",
2706	     RFC 2327, April 1998.

2708	   [5]  S. Bradner, "Key words for use in RFCs to indicate requirement
2709	     levels," BCP 14, RFC 2119, IETF, March 1997.

2711	   [6]  S. Josefsson (Ed.), "The Base16, Base32, and Base64 Data
2712	     Encodings", RFC 3548, July 2003.

2714	12.2. Informative References

2716	   [7]  J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for Generic
2717	     Forward Error Correction", RFC 2733, December 1999.

2719	   [8]  C. Perkins, O. Hodson, "Options for Repair of Streaming Media",
2720	     RFC 2354, June 1998.

2722	   [9]  W3C, "Synchronised Multimedia Integration Language (SMIL 2.0)",
2723	     August, 2001.

2725	   [10] M. Baugher, D. A. McGrew, D. Oran, R. Blom, E. Carrara, M.
2726	     Naslund, K. Norrman, "The Secure Real-Time Transport Protocol",
2727	     RFC 3711, March 2004.

2729	   [11] J. Rey et al., "RTP Retransmission Payload Format",
2730	     draft-ietf-avt-rtp-retransmission-11.txt, work in progress, March
2731	     2005.

2733	   [12] Van der Meer et al., "RTP Payload Format for Transport of MPEG-4
2734	     Elementary Streams", RFC 3640, November 2003.

2736	   [13] J. Rosenberg, H. Schulzrinne, "An Offer/Answer Model with the
2737	     Session Description Protocol (SDP)", RFC 3264, June 2002.

2739	   [14] M. Handley, et al. "Session Announcement Protocol", RFC 2974,
2740	     October 2000.

2742	   [15] H. Schulzrinne, et al., "Real Time Streaming Protocol (RTSP)",
2743	     RFC 2326, April 1998.

2745	   [16] Transparent end-to-end packet switched streaming service (PSS);
2746	     Protocols and codecs (Release 6), TS 26.234 v 6.1.0, September
2747	     2004.

2749	   [17] H. Schulzrinne, S. Casner, "RTP Profile for Audio and Video
2750	     Conferences with Minimal Control", STD 65, RFC 3551, July 2003.

2752	   [18] F. Yergeau, "UTF-8, a transformation format of Unicode and ISO
2753	     10646", RFC 2044, October 1996.

2755	   [19] P. Hoffman, F. Yergeau, "UTF-16, an encoding of ISO 10646", RFC
2756	     2781, February 2000.

2758	   [20] Friedman, et al., "RTP Control Protocol Extended Reports (RTCP
2759	     XR)", RFC 3611, November 2003.

2761	   [21] Ott, et al., "Extended RTP Profile for RTCP-based Feedback
2762	     (RTP/AVPF)", draft-ietf-avt-rtcp-feedback-11.txt, work in
2763	     progress, August 2004.

2765	   [22] IETF RFC 3267: "Real-Time Transport Protocol (RTP) Payload
2766	     Format and File Storage Format for the Adaptive Multi-Rate (AMR)
2767	     and Adaptive Multi-Rate Wideband (AMR-WB) Audio Codecs", Sjoberg J. et
2768	     al., June 2002.

2770	   [23] IETF RFC 3016: "RTP Payload Format for MPEG-4 Audio/Visual
2771	     Streams", Kikuchi Y. et al., November 2000.

2773	   [24] G. Hellstrom, "RTP Payload for Text Conversation", RFC 2793, May
2774	     2000.

2776	   [25] G. Hellstrom, P. Jones, "RTP Payload for Text Conversation",
2777	     draft-ietf-avt-rfc2793bis-09.txt, Work In Progress, August 2004.

2779	   [26] ITU-T Recommendation T.140 (1998) - Text conversation protocol
2780	     for multimedia application, with amendment 1, (2000).

2782	   [27] ISO/IEC 10646-1: (1993), Universal Multiple Octet Coded
2783	     Character Set.

2785	   [28] ISO/IEC FCD 14496-17 Information technology - Coding of
2786	     audio-visual objects - Part 17: Streaming text format, Work in
2787	     progress, June 2004.

2789	   [29] Transparent end-to-end Packet-switched Streaming Service (PSS);
2790	     3GPP SMIL language profile, (Release 6), TS 26.246 v 6.0.0, June
2791	     2004.

2793	   [30] Casner, S. and P. Hoschka, "MIME Type Registration of RTP
2794	     Payload Formats", RFC 3555, July 2003.

2796	   [31] Freed, N. and J. Klensin, "Media Type Specifications and
2797	     Registration Procedures", draft-freed-media-type-reg-04, April
2798	     2005.

2800	   [32] Transparent end-to-end packet switched streaming service (PSS);
2801	     3GPP file format (3GP) (Release 6), TS 26.244 V6.3. March 2005.

2803	   [33] Castagno, R. and D. Singer, "MIME Type Registrations for 3rd
2804	     Generation Partnership Project (3GPP) Multimedia files", RFC 3839,
2805	     July 2004.

2807	13. Annexes

2809	13.1. Basics of the 3GP File Structure

2811	   This section provides a coarse overview of the 3GP file structure,
2812	   which follows the ISO Base Media file Format [2].

2814	   Each 3GP file consists of "Boxes".  In general, a 3GP file contains
2815	   the File Type Box (ftyp), the Movie Box (moov), and the Media Data
2816	   Box (mdat).  The File Type Box identifies the type and properties of
2817	   the 3GP file itself.  The Movie Box and the Media Data Box, serving
2818	   as containers, include their own boxes for each medium.  Boxes start
2819	   with a header, which indicates both size and type (these fields are
2820	   named "size" and "type").  Additionally, each box type may include a
2821	   number of boxes.
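   The box header layout described above can be illustrated with a
   short walker over a sequence of boxes.  This is a sketch, not a full
   3GP parser.  The name "iter_boxes" is hypothetical.  The handling of
   size 1 (a 64-bit size follows the type) and size 0 (the box runs to
   the end of its container) follows the ISO base media file format
   [2] rather than anything stated in this annex.

```python
import struct
from typing import Iterator, Optional, Tuple


def iter_boxes(data: bytes, start: int = 0,
               end: Optional[int] = None) -> Iterator[Tuple[str, int, int]]:
    """Yield (type, payload offset, payload size) for boxes in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        # Header: 32-bit big-endian "size" followed by four-character "type".
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A 64-bit size follows the type field.
            if pos + 16 > end:
                raise ValueError("truncated 64-bit box size")
            (size,) = struct.unpack_from(">Q", data, pos + 8)
            header = 16
        elif size == 0:
            # The box extends to the end of the enclosing container.
            size = end - pos
        if size < header or pos + size > end:
            raise ValueError("malformed box")
        yield box_type.decode("latin-1"), pos + header, size - header
        pos += size


# A synthetic byte string with three top-level boxes (payloads are dummies).
data = (struct.pack(">I4s", 12, b"ftyp") + b"3gp6"
        + struct.pack(">I4s", 8, b"moov")
        + struct.pack(">I4s", 11, b"mdat") + b"abc")
for box in iter_boxes(data):
    print(box)
```

   Contained boxes, e.g. the Track Boxes inside the Movie Box, can be
   walked by calling the same function on the payload range of the
   container.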

2823	   In the following, only those boxes that are useful for the purposes
2824	   of this payload format are mentioned.

2826	   The Movie Box (moov) contains one or more Track Boxes (trak), which
2827	   include information about each track.  A Track Box contains, among
2828	   others, the Track Header Box (tkhd), the Media Header Box (mdhd) and
2829	   the Media Information Box (minf).

2831	   The Track Header Box specifies the characteristics of a single track,
2832	   where a track is, in this case, the streamed text during a session.
2833	   Exactly one Track Header Box is present for a track.  It contains
2834	   information about the track, such as the spatial layout (width and
2835	   height), the video transformation matrix and the layer number.  Since
2836	   these pieces of information are essential and static, i.e. constant
2837	   for the duration of the session, they must be sent prior to the
2838	   transmission of any text samples.

2840	   The Media Header Box contains the "timescale" or number of time units
2841	   that pass in one second, i.e. cycles per second or Hertz.  The Media
2842	   Information Box includes the Sample Table Box (stbl) which contains
2843	   all the time and data indexing of the media samples in a track.
2844	   Using this box, it is possible to locate samples in time, determine
2845	   their type, their size, container, and offset into that container.
2846	   Inside the Sample Table Box we can find the Sample Description Box
2847	   (stsd, for finding sample descriptions), the Decoding Time to Sample
2848	   Box (stts, for finding sample duration), the Sample Size Box (stsz)
2849	   and the Sample to Chunk Box (stsc, for finding the sample description
2850	   index).

2852	   Finally, the Media Data Box contains the media data itself.  In timed
2853	   text tracks this box contains text samples.  Their equivalents in audio
2854	   and video are audio and video frames, respectively.  The text sample
2855	   consists of the text length, the text string, and one or several
2856	   Modifier Boxes.  The text length is the size of the text in bytes.
2857	   The text string is plain text to render.  The Modifier Box is
2858	   information to render in addition to the text such as colour, font,
2859	   etc.
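   The text sample layout described above (text length, text string,
   Modifier Boxes) might be split as in the sketch below.  It assumes
   that the text length is a 16-bit big-endian field at the start of
   the sample, which should be confirmed against 3GPP TS 26.245 [1].
   The function name is hypothetical, and the Modifier Box octets in
   the usage lines are dummy data.

```python
import struct
from typing import Tuple


def split_text_sample(sample: bytes) -> Tuple[bytes, bytes]:
    """Return (text string octets, remaining Modifier Box octets)."""
    if len(sample) < 2:
        raise ValueError("sample too short for a text length field")
    # Assumption: 16-bit big-endian text length, counted in bytes.
    (text_length,) = struct.unpack_from(">H", sample, 0)
    text_end = 2 + text_length
    if text_end > len(sample):
        raise ValueError("text length exceeds sample size")
    return sample[2:text_end], sample[text_end:]


sample = struct.pack(">H", 5) + b"hello" + struct.pack(">I4s", 8, b"styl")
text, modifiers = split_text_sample(sample)
print(text, modifiers)
```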

2861	14. Acknowledgements

2863	   The authors would like to thank Dave Singer, Jan van der Meer, Magnus
2864	   Westerlund and Colin Perkins for their comments and suggestions to
2865	   this document.

2867	   The authors would also like to thank Markus Gebhard for the free and
2868	   publicly available JavE ASCII Editor (used for the ASCII drawings in
2869	   this document) and Henrik Levkowetz for the Idnits web service.

2871	15. Authors' Addresses

2873	   Jose Rey                             jose.rey@eu.panasonic.com
2874	   Panasonic R&D Center Germany GmbH
2875	   Monzastr. 4c
2876	   D-63225 Langen, Germany
2877	   Phone: +49-6103-766-134
2878	   Fax:   +49-6103-766-166

2880	   Yoshinori Matsui             matsui.yoshinori@jp.panasonic.com
2881	   Matsushita Electric Industrial Co., LTD.
2882	   1006 Kadoma
2883	   Kadoma-shi, Osaka, Japan
2884	   Phone: +81 6 6900 9689
2885	   Fax:   +81 6 6900 9699

2887	16. IPR Notices

2889	   The IETF takes no position regarding the validity or scope of any
2890	   Intellectual Property Rights or other rights that might be claimed to
2891	   pertain to the implementation or use of the technology described in
2892	   this document or the extent to which any license under such rights
2893	   might or might not be available; nor does it represent that it has
2894	   made any independent effort to identify any such rights.  Information
2895	   on the procedures with respect to rights in RFC documents can be
2896	   found in BCP 78 and BCP 79.

2898	   Copies of IPR disclosures made to the IETF Secretariat and any
2899	   assurances of licenses to be made available, or the result of an
2900	   attempt made to obtain a general license or permission for the use of
2901	   such proprietary rights by implementers or users of this
2902	   specification can be obtained from the IETF on-line IPR repository at
2903	   http://www.ietf.org/ipr.

2905	   The IETF invites any interested party to bring to its attention any
2906	   copyrights, patents or patent applications, or other proprietary
2907	   rights that may cover technology that may be required to implement
2908	   this standard.  Please address the information to the IETF at
2909	   ietf-ipr@ietf.org.

2911	17. Full Copyright Statement

2913	   Copyright (C) The Internet Society (2005).  This document is subject
2914	   to the rights, licenses and restrictions contained in BCP 78, and
2915	   except as set forth therein, the authors retain all their rights.

2917	   This document and the information contained herein are provided on an
2918	   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
2919	   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
2920	   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
2921	   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
2922	   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2923	   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.