Internet Draft                                                    J. Rey
draft-ietf-avt-rtp-3gpp-timed-text-15.txt                      Y. Matsui
                                                               Panasonic
Expires: December 13, 2005                                 June 13, 2005


                 RTP Payload Format for 3GPP Timed Text

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.
   It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Abstract

   This document specifies an RTP payload format for the transmission of
   3GPP (3rd Generation Partnership Project) timed text. 3GPP timed
   text is a time-lined decorated text media format with defined storage
   in a 3GP file. Timed text can be synchronized with audio/video
   contents and used in applications such as captioning, titling and
   multimedia presentations. In the following sections the problems of
   streaming timed text are addressed and a payload format for streaming
   3GPP timed text over RTP is specified.

Table of Contents

   1. Introduction
   2. Motivation, Requirements and Design Rationale
      2.1. Motivation
      2.2. Basic Components of the 3GPP Timed Text Media Format
      2.3. Requirements
      2.4. Limitations
      2.5. Design Rationale
   3. Terminology
   4. RTP Payload Format for 3GPP Timed Text
      4.1. Payload Header Definitions
         4.1.1. Common Payload Header Fields
         4.1.2. TYPE 1 Header
         4.1.3. TYPE 2 Header
         4.1.4. TYPE 3 Header
         4.1.5. TYPE 4 Header
         4.1.6. TYPE 5 Header
      4.2. Buffering of Sample Descriptions
         4.2.1. Dynamic SIDX wrap-around mechanism
      4.3. Finding payload header values in 3GP files
      4.4. Fragmentation of Timed Text Samples
      4.5. Reassembling Text Samples at the Receiver
      4.6. On Aggregate Payloads
      4.7. Payload Examples
      4.8. Relation to RFC 3640
      4.9. Relation to RFC 2793
   5. Resilient Transport
   6. Congestion control
   7. Scene Description
      7.1. Text Rendering Position and Composition
      7.2. SMIL usage
      7.3. Finding layout values in a 3GP file
   8. 3GPP Timed Text Media Type
   9. SDP usage
      9.1. Mapping to SDP
      9.2. Parameter Usage in the SDP Offer/Answer Model
         9.2.1. Unicast Usage
         9.2.2. Multicast Usage
      9.3. Offer/Answer Examples
      9.4. Parameter Usage outside of Offer/Answer
   10. IANA Considerations
   11. Security considerations
   12. References
      12.1. Normative References
      12.2. Informative References
   13. Annexes
      13.1. Basics of the 3GP File Structure
   14. Acknowledgements
   15. Authors' Addresses
   16. IPR Notices
   17. Full Copyright Statement

   [Note to the RFC Editor:
   - Please replace "RFCXXXX" with the RFC designation of this document
     when published,
   - Please substitute "draft-ietf-..." references with the
     corresponding RFC number if available at the time of publication]

1. Introduction

   3GPP timed text is a media format for time-lined decorated text
   specified in the 3GPP Technical Specification TS 26.245 "Transparent
   end-to-end packet switched streaming service (PSS); Timed Text Format
   (Release 6)" [1]. Besides plain text, the 3GPP timed text format
   allows the creation of decorated text, such as text for karaoke
   applications, scrolling text for newscasts or hyperlinked text. These
   contents may or may not be synchronized with other media, like audio
   or video.

   The purpose of this draft is to provide a means to stream 3GPP timed
   text contents using RTP [3]. This includes the streaming of timed
   text being read out of a (3GP) file as well as the streaming of timed
   text generated in real-time, a.k.a. live streaming.

   Section 2 contains the motivation of this document, an overview of
   the media format, the requirements and the design rationale. Section
   3 defines the terminology used. Section 4 specifies the payload
   headers, the fragmentation and re-assembly rules for text samples,
   the rules for payload aggregation and the relations of this document
   to RFC 3640 [12] and RFC 2793 [24].
   Section 5 specifies some simple
   schemes for resilient transport and gives pointers to other possible
   mechanisms. Section 6 addresses congestion control. Section 7
   specifies scene description. Section 8 defines the media type.
   Section 9 specifies SDP for unicast and multicast sessions, including
   usage in the Offer / Answer model [13]. Sections 10 and 11 address
   IANA and security considerations. Section 12 lists references.
   Annexes are included as Section 13.

2. Motivation, Requirements and Design Rationale

2.1. Motivation

   The 3GPP timed text format was developed for use in the services
   specified in the 3GPP Transparent End-to-end Packet-switched
   Streaming Services (3GPP PSS) specification [16].

   As of today, PSS allows 3GPP timed text contents stored in 3GP files
   to be downloaded. However, due to the lack of an RTP payload format,
   it is not possible to stream 3GPP timed text contents over RTP.

   This document specifies such a payload format.

2.2. Basic Components of the 3GPP Timed Text Media Format

   Before going into the details of the design, it is necessary to
   understand how the media format is constructed. We can identify
   four distinct functional components: layout information,
   default formatting, text strings and decoration. In the following, we
   briefly explain these and match them to their designations in a 3GP
   file:

      o  Initial spatial layout information related to the text
         strings: these are the height and width of the text region
         where text is displayed, the position of the text region in
         the display and the layer or proximity of the text to the
         user. In 3GP files, this information is contained in the
         Track Header Box (3GP file designations are capitalized for
         clarity).

      o  Default settings for formatting and positioning of text:
         style (font, size, colour, ...), background colour, horizontal
         and vertical justification, line width, scrolling, etc.
         For 3GP files, this corresponds to the Sample Descriptions.

      o  The actual text strings: characters encoded using either
         UTF-8 [18] or UTF-16 [19].

      o  The decoration: if some characters have a different style,
         delay, blink, etc., this needs to be indicated. The
         decoration is only present in the text samples if it is
         actually needed. Otherwise, the default settings as above
         apply. In 3GP files, text strings and decoration are stored
         inside the Text Samples, i.e., Modifier Boxes are appended to
         the text strings, if needed. At the time of writing this
         payload format, the following modifiers are specified in the
         3GPP timed text media format specification [1]:

         - text highlight,
         - highlight color,
         - blinking text,
         - karaoke feature,
         - hyperlink,
         - text delay,
         - text style,
         - positioning of the text box and
         - text wrap indication.

2.3. Requirements

   Once the basic components are known, it is necessary to define which
   requirements the payload format shall fulfill:

   1. It shall enable both live streaming and streaming from a 3GP
      file.

         Informative note: for the purpose of this document, the
         term live streaming refers to those scenarios where the
         timed text stream is sent from a live encoder. Upon
         reception the content may or may not be stored in a 3GP
         file. Typically, in live streaming applications, the
         sender encapsulates the timed text content in RTP
         packets following the guidelines given in this document.
         At the receiving side, a buffer is used to cancel the
         network delay and delay jitter. If receiver and sender
         support packet loss resilience mechanisms (see Section
         5), it may also be possible to recover from packet
         losses.
         Note that how sender and receiver actually
         manage and dimension the buffers are implementation
         design choices.

   2. Furthermore, it shall be possible for an RTP receiver using this
      payload format, and capable of storing in 3GP format, to obtain
      all necessary information from the RTP packets for storing the
      received text contents according to the 3GP file format. This
      file may or may not be the same as the original file.

         Informative note: the 3GP file format itself is based on
         the ISO Base Media File Format recommendation [2].
         Section 13.1 gives some insight into the 3GP file
         structure. Further, Sections 4.3 and 7.3 specify where
         the information needed for filling in payload headers is
         found in a 3GP file. For live streaming, appropriate
         values complying with the format and units described in
         [1] shall be used. Where needed, clarifications on
         appropriate values are given in this document.

   3. It shall enable efficient and resilient transport of timed text
      contents over RTP. In particular:

      a. Enable the transmission of the sample descriptions both by
         out-of-band and in-band means. Sample descriptions are
         important information, which potentially apply to several
         text samples. These default formatting settings are
         typically transmitted out-of-band (reliably) once at the
         initialization phase. If additional sample descriptions
         are needed in the course of a session, these may be sent
         also out-of-band or in-band. In-band transmission,
         although unreliable, may be more appropriate for sending
         sample descriptions if these should be sent frequently, as
         opposed to establishing an additional communication channel
         for SDP, for example. It is also useful in cases where an
         out-of-band channel may not be available and for live
         streaming, where contents are not known a priori.
         Thus,
         the payload format shall enable out-of-band and in-band
         transmission of sample descriptions. Section 4.1.6
         specifies a payload header for transmitting sample
         descriptions in-band. Section 9 specifies how sample
         descriptions are mapped to SDP.

      b. Enable the fragmentation of a text sample into several RTP
         packets in order to cover a wide range of applications and
         network environments. In general, fragmentation should be
         a rare event given the low bit rates and relatively small
         text sample sizes. However, the 3GPP Timed Text media
         format does allow for larger text samples. Therefore, the
         payload format shall take this into account and provide a
         means for coping with fragmentation and reassembly.
         Section 4.4 deals with fragmentation.

      c. Enable the aggregation of units into an RTP packet for
         making the transport more efficient. In a mobile
         communication environment a typical text sample size is
         around 100-200 bytes. If the available bit rate and the
         packet size allow it, units should be aggregated into one
         RTP packet. Section 4.6 deals with aggregation.

      d. Enable the use of resilient transport mechanisms, such as
         repetition, retransmission [11] and FEC [7] (see Section
         5). For a more general discussion, refer to RFC 2354 [8],
         which discusses available mechanisms for stream repair.

2.4. Limitations

   The payload headers have been optimized in size for RTP. Instead
   of using 32-bit (S)LEN, SDUR, SIDX header fields which would carry
   many unused bits much of the time, it has been a design choice to
   reduce the size of these fields. As a consequence, this payload
   format has reduced maximum values with respect to sizes and
   durations of (text) samples and sample descriptions. These maximum
   values differ from those allowed in 3GP files, where they are
   expressed using 32-bit (unsigned) integers.
   In some cases
   extension mechanisms are provided to deal with larger values.
   However, it is noted that the values used here should be enough for
   the streaming applications targeted.

   The following limitations apply:

   1. The maximum size of text samples carried in RTP packets is
      restricted to be a 16-bit (unsigned) integer (this includes the
      text strings and modifiers). This means a maximum size for the
      unit would be about 64 Kbytes. No extension mechanism is
      provided.

   2. The sample description index values are restricted to be an
      (unsigned) 8-bit integer. An extension mechanism is given in
      Section 4.3.

   3. The text sample duration is restricted to be a 24-bit (unsigned)
      integer. This yields a maximum duration at a timestamp
      clockrate of 1000 Hz of about 4.6 hours. Nevertheless, an
      extension mechanism is provided in Section 4.3.

   4. Sample descriptions are also restricted in size: if the size
      cannot be expressed as an (unsigned) 16-bit integer, the sample
      description shall not be conveyed. As in the case of the sample
      size, no extension mechanism is provided.

   5. A further limitation concerns the UTF-16 encodings supported:
      only transport of text strings following big endian byte order
      is supported. See Section 4.1.1 for details.

2.5. Design Rationale

   The following design choices were made:

   1. 'Unit' approach: the payload formats specified in this draft
      follow a simple scheme: a 3-byte common header (Common Payload
      Header) followed by a specific header for each text sample
      (fragment) type. Following these headers, the text sample
      contents are placed (Section 4.1.1 and following). This
      structure is called a 'unit'.

      The following units have been devised to comply with the
      requirements mentioned in Section 2.3:

      a. A TYPE 1 unit that contains one complete text sample,

      b. A TYPE 2 unit that contains a complete text string or a
         fragment thereof,

      c. A TYPE 3 unit that contains the complete modifiers or only
         the first fragment thereof,

      d. A TYPE 4 unit that contains one modifier fragment other
         than the first and,

      e. A TYPE 5 unit that contains one sample description.

      This 'unit' approach was motivated by the following reasons:

         1. It allows a simple classification of the text samples and
            text sample fragments that can be conveyed by the
            payload format.

         2. It enables easy interoperability with RFC 3640 [12].
            During the development of this payload format, interest
            was shown from MPEG-4 standardization participants in
            developing a common payload structure for the transport
            of 3GPP Timed Text. While interoperability is not
            strictly necessary for this payload format to work, it
            has been pursued in this payload format. Section 4.8
            explains how this is done.

   2. Character count is not implemented. This payload format does
      detect lost text sample fragments but it does not enable an RTP
      receiver to find out the exact number of text characters lost.
      In fact, the fragment size included in the payload headers does
      not help in finding the number of lost characters, because the
      UTF-8/UTF-16 [18][19] encodings used yield a variable number of
      bytes per character.

      For finding out the exact number of lost characters, an
      additional field reflecting the character count (and possibly
      the character offset) upon fragmentation would be required.
      This would additionally require the entity performing
      fragmentation to count the characters included in each text
      fragment.

      One benefit of having a character count would be that the
      display application would be able to replace missing characters
      with some other character representing character loss, e.g.:

         If we take the text "Some text is lost now" and assume the
         loss of a packet containing the text in the middle, this
         could be displayed (with a character count):

            "Some ############now"

         As opposed to:

            "Some #now"

         Which is what this payload format enables ("#" indicates a
         missing character or packet, respectively).

      However, it is the opinion of the authors that for applications
      such as subtitling applications and multimedia presentations
      that use this payload format, such partial error correction is
      not worth the cost of including two additional fields, namely
      character count and character offset. Instead, it is
      recommended that some more overhead be invested to provide full
      error correction by protecting the text sample fragments
      using the measures outlined in Section 5.

   3. Fragment re-assembly: in order to re-assemble the text samples,
      offset information is needed. Instead of a character or byte
      offset, a single byte, TOTAL/THIS, is used. These two values
      indicate the total number and current index of fragments of a
      text sample. This is simpler than having a character offset
      field in each fragment. Details are given in Section 4.1.3.

   4. A length field, LEN, is present in the common header fields.
      While the length in the RTP payload format is not needed by most
      RTP applications (typically lower layers, like UDP, provide this
      information), it does ease interoperability with RFC 3640. This
      is because the Access Units (AUs) used for carriage of data in
      RFC 3640 must include a length indication. Details are given in
      Section 4.8.

   5. The header fields in the specific payload headers (TYPE headers
      in Sections 4.1.2 to 4.1.6) have been arranged for easy
      processing on 32-bit machines. For this reason the fields SIDX
      and SDUR are swapped in the TYPE 1 unit, compared to the other
      units.

3. Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [5].

   Furthermore, the following terms are used and have specific meaning
   within the context of this document:

   text sample or whole text sample:

      In the 3GPP Timed Text media format [1] this term refers to a
      unit of timed text data as contained in the source (3GP) file.
      This includes the text string byte count, possibly a Byte Order
      Mark, the text string and any modifiers that may follow. Its
      equivalent in audio/video would be a frame.

      In this document, however, a text sample comprises only text
      strings followed by zero or more modifiers. This definition of
      text sample excludes the 16-bit text string byte count and the
      16-bit Byte Order Mark (BOM) present in 3GP file text samples
      (see Section 4.3 and Figure 9). The 16-bit BOM is not
      transported in RTP as explained in Section 4.1.1.

   text strings:

      text strings is the term used to denote the actual text
      characters encoded either as UTF-8 or UTF-16. When using this
      payload format, the text string does not contain any byte order
      mark (BOM). See Figure 9 for details.

   fragment or text sample fragment:

      a fraction of a text sample. A fragment may contain either text
      strings or modifier (decoration) contents, but not both at the
      same time.

   sample contents:

      general term to identify timed text data transported when using
      this payload format.
      Sample contents may be one or several text
      samples, sample descriptions and sample fragments (note that, as
      per Section 4.6, there is only one case in which more than one
      fragment may be included in a payload).

   decoration/modifiers:

      the terms "decoration" and "modifiers" are used interchangeably
      throughout the document to denote the contents of the text
      sample that modify the default text formatting. Modifiers may,
      for example, specify a different font size for a particular
      sequence of characters or define karaoke timing for the sample.

   sample description:

      this term is used to denote information which is potentially
      shared by more than one text sample. In a 3GP file a sample
      description is stored in a place where it can be shared. It
      contains setup and default information such as scrolling
      direction, text box position, delay value, default font,
      background color, etc.

   units or transport units:

      the payload headers specified in this document encapsulate text
      samples, fragments thereof and sample descriptions by placing a
      common header and specific payload header (Sections 4.1.1 to
      4.1.6) before them and so building what is here called a
      (transport) unit.

   aggregation / aggregate packet:

      the payload of an aggregate (RTP) packet consists of several
      (transport) units.

   track / stream:

      3GP files contain audio/video and text tracks. This document
      enables the streaming of text tracks using RTP. Therefore both
      terms are used interchangeably in this document in the context
      of 3GP files.

   Media Header Box / Track Header Box / ...:

      the 3GP file format makes use of these structures defined in the
      ISO Base Media File Format [2]. When referring to these in this
      document, initials are capitalized for clarity.

4. RTP Payload Format for 3GPP Timed Text

   The format of an RTP packet containing 3GPP timed text is shown
   below:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |V=2|P|X|  CC   |M|     PT      |       sequence number         |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                           timestamp                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           synchronization source (SSRC) identifier            |
     /+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | |U|   R   | TYPE|             LEN               |               :
    | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               :
   U| :           (variable header fields depending on TYPE           :
   N| :                                                               :
   I< +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   T| |                                                               |
    | :                       SAMPLE CONTENTS                         :
    | |                                               +-+-+-+-+-+-+-+-+
    | |                                               |
     \+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

             Figure 1. 3GPP Timed Text RTP Packet Format.

   Marker bit (M): the marker bit SHALL be set to 1 if the RTP packet
   includes one or more whole text samples or the last fragment of a
   text sample; otherwise it SHALL be set to zero (0).

   Timestamp: the timestamp MUST indicate the sampling instant of the
   earliest (or only) unit contained in the RTP packet. The initial
   value SHOULD be randomly determined, as specified in RTP [3].

      The timestamp value should provide enough timing resolution for
      expressing the duration of text samples, for synchronizing text
      with other media and for performing RTCP measurements such as
      the interarrival delay jitter or the RTCP Packet Receipt Times
      Report Block (Section 4.3 of RFC 3611 [20]).
      This is compliant
      with RTP, Section 5.1:

         "The resolution of the clock MUST be sufficient for the
         desired synchronization accuracy and for measuring packet
         arrival jitter (one tick per video frame is typically not
         sufficient)"

      The above observation applies to both timed text tracks included
      in a 3GP file as well as live streaming sessions. In the case
      of a 3GP timed text track, the timestamp clockrate is the value
      of the "timescale" parameter in the Media Header Box for that
      text track. Each track in a 3GP file MAY have its own clockrate
      as specified in the Media Header Box. Likewise, live streaming
      applications SHALL use an appropriate timestamp clockrate. A
      default value of 1000 Hz is RECOMMENDED. Other timestamp
      clockrates MAY be used. In this case, the typical behavior
      is to match the 3GPP timed text clockrate to that used by an
      associated audio or video stream.

      In an aggregate payload, units MUST be placed in play-out order,
      i.e., earliest first in the payload. If TYPE 1 units are
      aggregated, the timestamp of the subsequent units MUST be
      obtained by adding the timed text sample duration of previous
      samples to the RTP timestamp value. There are two exceptions to
      this rule: TYPE 5 units and an aggregate payload containing two
      fragments of the same text sample. The details of the timestamp
      calculation are given in Section 4.6.

      Finally, timestamp clockrates MUST be signaled by out-of-band
      means at session setup, e.g., using the media type "rate"
      parameter in SDP. See Section 9 for details.

   Payload Type (PT): the payload type is set dynamically and sent by
   out-of-band means.

   The usage of the remaining RTP header fields, namely V, P, X, CC, SN
   and SSRC, follows the rules of RTP and the profile in use.

4.1. Payload Header Definitions

   The (transport) units specified in this document consist of a set of
   common fields (U, R, TYPE, LEN), followed by specific header fields
   (TYPES 1-5) and text sample contents. See Figure 1 and Figure 2.

   In Figure 2 two example RTP packets are depicted. The first
   one contains an aggregate RTP payload with two complete text samples
   and the second one contains one text sample fragment. After each
   unit header is explained, detailed payload examples follow in Section
   4.7.

                     +----------------------+
                     |                      |
                     |      RTP Header      |
                     |                      |
            ---------+----------------------+
            |        |                      |
            |        |COMMON + TYPE 1 Header|
            |        ........................
     UNIT 1 -        |                      |
            |        |     Text Sample      |
            |        |                      |
            |-------\........................
             -------/|                      |
            |        |COMMON + TYPE 1 Header|
            |        ........................
     UNIT 2 -        |                      |
            |        |     Text Sample      |
            |        |                      |
            |        |                      |
            ---------+----------------------+

                     +----------------------+
                     |                      |
                     |      RTP Header      |
                     |                      |
            ---------+----------------------+
            |        |   COMMON + TYPE 2    |
            |        |   (or 3 or 4) Hdr    |
            |        ........................
     UNIT 3 -        |                      |
            |        | Text Sample Fragment |
            |        |                      |
            |        |                      |
            ---------+----------------------+

                   Figure 2. Example RTP packets.

4.1.1. Common Payload Header Fields

   The fields common to all payload headers have the following format:

       0                   1                   2
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE |             LEN               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

             Figure 3. Common payload header fields.

   Where:

   o  U (1 bit) "UTF Transformation flag": this is used to inform RTP
      receivers whether UTF-8 (U=0) or UTF-16 (U=1) was used to encode
      the text string. UTF-16 text strings transported by this payload
      format MUST be serialized in big endian order, a.k.a. network byte
      order.
648 Informative note: timed text clients complying with the 3GPP 649 Timed Text format [1] are only required to understand the big 650 endian serialization. Thus, in order to ease interoperability, 651 the reverse serialization (little endian) is not supported by 652 this payload format. 654 For the payload formats defined in this document, the U bit is 655 only used in TYPE 1 and TYPE 2 headers. Senders MUST set the U 656 bit to zero in TYPE 3, TYPE 4 and TYPE 5 headers. Consequently, 657 receivers MUST ignore the U bit in TYPE 3, TYPE 4 and TYPE 5 658 headers. 660 o R (4 bits) "Reserved bits": for future extensions. This field 661 MUST be set to zero (0x0) and MUST be ignored by receivers. 663 o TYPE (3 bits) "Type Field": this field specifies which specific 664 header fields follow. The following TYPE values are defined: 666 - TYPE 1, for a whole text sample 667 - TYPE 2, for a text string fragment (without modifiers) 668 - TYPE 3, for a whole modifier box or the first fragment of a 669 modifier box 670 - TYPE 4, for a modifier fragment other than first. 671 - TYPE 5, for a sample description. Exactly one header per 672 sample description. 673 - TYPE 0, 6 and 7 are reserved for future extensions. Note that 674 future extensions are possible, e.g., a unit that explicitly 675 signals the number of characters present in a fragment (see 676 Section 2.5). In order to guarantee backwards-compatibility, 677 it SHALL be possible that older clients ignore (newer) units 678 they do not understand, without invalidating the timestamp 679 calculation mechanisms or otherwise preventing from decoding 680 the other units. 682 o Finally, the LEN (16 bits) "Length Field": indicates the size (in 683 bytes) of this header field and all the fields following, i.e. the 684 LEN field followed by the unit payload: text strings and modifiers 685 (if any). This definition only excludes the initial U/R/TYPE byte 686 of the common header. The LEN field follows network byte order. 
688 The way in which LEN is obtained when streaming out of a 3GP file 689 depends on the particular unit type. This is explained for each 690 unit in the sections below. 692 For live streaming, both sample length and the LEN value for the 693 current fragment MUST be calculated during the sampling process or 694 during fragmentation. 696 In general, LEN may take the following values: 698 - TYPE = 1, LEN >= 8, 699 - TYPE = 2, LEN > 9, 700 - TYPE = 3, LEN > 6, 701 - TYPE = 4, LEN > 6 and, 702 - TYPE = 5, LEN > 3. 704 Receivers MUST discard units that do not comply with these values. 705 However, the RTP header fields and the rest of the units in the 706 payload (if any) are still useful, as guaranteed by the 707 requirement for future extensions above. 709 In the following subsections the different payload headers for the 710 values of TYPE are specified. 712 4.1.2. TYPE 1 Header 714 0 1 2 3 715 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 716 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 717 |U| R |TYPE | LEN (always >=8) | SIDX | 718 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 719 | SDUR | TLEN | 720 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 721 | TLEN | 722 +-+-+-+-+-+-+-+-+ 723 Figure 4. TYPE 1 Header Format. 725 This header type is used to transport whole text samples. This unit 726 should be the most common case, i.e. the text sample should be 727 usually small enough to be transported in one unit without having to 728 separate text strings from modifiers. In an aggregate (RTP packet) 729 payload containing several text samples, every sample is preceded by 730 its own TYPE 1 header (see Figure 12). 732 Informative note: as indicated in the Terminology Section, a 733 text sample is composed by the text strings followed by the 734 modifiers (if any). This is also how text samples are stored in 735 3GP files. 
The separation of a text sample into text strings 736 and modifiers is only needed for large samples (or small 737 available IP MTU sizes, see Section 4.4) and it is accomplished 738 with TYPE 2 and TYPE 3 headers, as explained in the Sections 739 below. 741 Note that empty text samples are also considered whole text samples, 742 although they do not contain sample contents. Empty text samples may 743 be used to clear the display or to put an end to samples of unknown 744 duration, for example. Units without sample contents SHALL have a 745 LEN field value of 8 (0x0008). 747 The fields above have the following meaning: 749 o U, R and TYPE as defined in Section 4.1.1. 751 o LEN, in this case, represents the length of the (complete) text 752 sample plus eight (8) bytes of headers. For finding the length of 753 the text sample in the Sample Size Box of 3GP files, see Section 754 4.3. 756 o SIDX (8 bits) "Text Sample Entry Index": this is an index used to 757 identify the sample descriptions. 759 The SIDX field is used to find the sample description 760 corresponding to the unit's payload. There are two types of SIDX 761 values: static and dynamic. 763 Static SIDX values are used to identify sample descriptions that 764 MUST be sent out-of-band and MUST remain active during the whole 765 session. A static SIDX value is unequivocally linked to one 766 particular sample description during the whole session. Carrying 767 many sample descriptions out-of-band SHOULD be avoided, 768 since these may become large and, ultimately, 769 transport is not the goal of the out-of-band channel. Thus, this 770 feature is RECOMMENDED for transporting those sample descriptions 771 that provide a set of minimum default format settings. Static 772 SIDX values MUST fall in the (closed) interval [129,254]. 774 Dynamic SIDX values are used for sample descriptions sent in-band.
775 Sample descriptions MAY be sent in-band for several reasons: 776 because they are generated in real time, for transport resiliency 777 or both. A dynamic SIDX value is unequivocally linked to one 778 particular sample description during the period in which this is 779 active in the session and it SHALL NOT be modified during that 780 period. This period MAY be smaller than or equal to the session 781 duration. This period is not known a priori. A maximum of 64 782 dynamic simultaneously active SIDX values is allowed at any 783 moment. Dynamic SIDX values MUST fall in the closed interval 784 [0,127]. This should be enough for both, recorded content and 785 live streaming applications. Nevertheless, a wrap-around 786 mechanism is provided in Section 4.2.1 to handle streaming 787 sessions where more than 64 SIDX values might be needed. Servers 788 MAY make use of dynamic sample descriptions. Clients MUST be able 789 to receive and interpret dynamic sample descriptions. 791 Finally, SIDX values 128 and 255 are reserved for future use. 793 o SDUR (24 bits) "Text Sample Duration": indicates the sample 794 duration in RTP timestamp units of the text sample. For this 795 field, a length of 3 bytes is preferred to 2 bytes. This is 796 because, for a typical clockrate of 1000 Hz, 16 bits would allow 797 for a maximum duration of just 65 seconds, which might be too 798 short for some streams. On the other hand, 24 bits at 1000 Hz 799 allow for a maximum duration of about 4.6 hours, while for 90 KHz, 800 this value is about 3 minutes. These values should be enough for 801 streaming applications. However, if a larger duration is needed, 802 the extension mechanism specified in Section 4.3 SHALL be used. 804 Apart from defining the time period during which the text is 805 displayed, the duration field is also used to find the timestamp 806 of subsequent units within the aggregate RTP packet payload (if 807 any). This is explained in Section 4.6. 
809 Text samples have generally a known duration at the time of 810 transmission. However, in some cases like live streaming, the 811 time for which a text piece shall be presented might not be known 812 a priori. Thus, the value zero SDUR=0 (0x000000) is reserved to 813 signal unknown duration. The amount of time that a sample of 814 unknown duration is presented is determined by the timestamp of 815 the next sample that shall be displayed at the receiver: text 816 samples of unknown duration SHALL be displayed until the next text 817 sample becomes active, as indicated by its timestamp. 819 The next example illustrates how units of unknown duration MUST be 820 presented. If no text sample following is available, it is an 821 implementation issue what should be displayed. E.g. a server 822 could send an empty sample to clear the text box. 824 Example: imagine you are in an airport watching the latest news 825 report while you wait for your plane. Airports are loud, so the 826 news report is transcribed in the lower area of the screen. 827 This area displays two lines of text: the headlines and the 828 words spoken by the news speaker. As usual, the headlines are 829 shown for a longer time than the rest. This time is, in 830 principle, unknown to the stream server, which is streaming 831 live. A headline is just replaced when the next headline is 832 received. 834 However, upon storing a text sample with SDUR=0 in a 3GP file, the 835 SDUR value MUST be changed to the effective duration of the text 836 sample, which MUST be always greater than zero (note that the ISO 837 file format [2] explicitly forbids a sample duration of zero). 838 The effective duration MUST be calculated as the timestamp 839 difference between the current sample (with unknown duration) and 840 the next text sample that is displayed. 842 Note that samples of unknown duration SHALL NOT use features, 843 which require knowledge of the duration of the sample up front. 
844 Such features are scrolling and karaoke in [1]. This also applies 845 for future extensions of the Timed Text format. Furthermore, only 846 sample descriptions (TYPE 5 units) MAY follow units of unknown 847 duration in the same aggregate payload. Otherwise, it would not 848 be possible to calculate the timestamp of these other units. 850 For text contents stored in 3GP files, see Section 4.3 for details 851 on how to extract the duration value. For live streaming, live 852 encoders SHALL assign appropriate values and units according to 853 [1] and later releases. 855 o TLEN (16 bits), "Text String Length", is a byte-count of the text 856 string. The text string length is needed by the decoder to know 857 where the modifiers in the payload start. TLEN is not present in 858 text string fragments (TYPE 2) since it can be deductively 859 calculated from the LEN values of each fragment. 861 The TLEN value is obtained from the text samples as contained in 862 3GP files. Refer to Section 4.3. For live content, the TLEN MUST 863 be obtained during the sampling process. 865 o Finally, the actual text sample is placed after the TLEN field. 866 As defined in Section 3, a text sample consists of a string of 867 characters encoded using either UTF-8 or UTF-16, followed by zero 868 or more modifiers. Note also, that no BOM and no byte count are 869 included in the strings carried in the payload (as opposed to text 870 samples stored in 3GP files [1]). 872 4.1.3. TYPE 2 Header 874 0 1 2 3 875 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 876 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 877 |U| R |TYPE | LEN( always >9) | TOTAL | THIS | 878 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 879 | SDUR | SIDX | 880 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 881 | SLEN | 882 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 883 Figure 5. TYPE 2 Header Format. 
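As a non-normative illustration, the TYPE 2 layout of Figure 5 could be read as in the following Python sketch. The names Type2Unit and parse_type2_unit are hypothetical and are not defined by this payload format. The sketch assumes that the buffer starts at the first byte of the unit and that a unit occupies 1 + LEN bytes, following the LEN definition in Section 4.1.1.

```python
import struct
from typing import NamedTuple, Optional

# U/R/TYPE (1) + LEN (2) + TOTAL/THIS (1) + SDUR (3) + SIDX (1) + SLEN (2)
TYPE2_HEADER_SIZE = 10


class Type2Unit(NamedTuple):
    """Hypothetical container for the fields of one TYPE 2 unit."""
    u: int        # 0 = UTF-8, 1 = UTF-16 (big endian)
    total: int    # total number of fragments of the text sample
    this: int     # position of this fragment
    sdur: int     # sample duration in RTP timestamp units
    sidx: int     # sample description index
    slen: int     # size in bytes of the whole text sample
    text: bytes   # text string fragment carried by this unit


def parse_type2_unit(buf: bytes) -> Optional[Type2Unit]:
    """Return the parsed unit, or None if it is not a usable TYPE 2 unit."""
    if len(buf) < TYPE2_HEADER_SIZE:
        return None
    # All multi-byte fields are in network byte order.
    first, length, frag = struct.unpack_from("!BHB", buf, 0)
    if first & 0x07 != 2:          # TYPE is the low 3 bits of the first byte
        return None
    # LEN counts everything except the first byte and must exceed 9.
    if length <= 9 or len(buf) < 1 + length:
        return None
    sdur = int.from_bytes(buf[4:7], "big")   # 24-bit field
    sidx = buf[7]
    (slen,) = struct.unpack_from("!H", buf, 8)
    text = buf[TYPE2_HEADER_SIZE:1 + length]
    return Type2Unit(first >> 7, frag >> 4, frag & 0x0F, sdur, sidx, slen, text)
```

A receiver that wants to display the fragment could then decode it with text.decode("utf-16-be" if unit.u else "utf-8"), since fragments are split at character boundaries.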
885 This header type is used to transport either a whole text string or a 886 fragment of it. TYPE 2 units SHALL NOT contain modifiers. In 887 detail: 889 o U, R and TYPE as defined in Section 4.1.1. 891 o SIDX and SDUR as defined in Section 4.1.2. 893 Note that the U, SIDX and SDUR fields are meaningful since 894 partial text strings can also be displayed. 896 o The LEN field (16 bits) indicates the length of the text string 897 fragment plus nine (9) bytes of headers. Its value is calculated 898 upon fragmentation. LEN MUST always be greater than nine (0x0009). 899 Otherwise, the unit MUST be discarded. 901 According to the guidelines in Section 4.3, text strings MUST be 902 split at character boundaries for allowing the display of text 903 fragments. Hence, a text fragment MUST contain at least one 904 character in either UTF-8 or UTF-16. Actually, this is just a 905 formalism since by observing the guidelines, much larger fragments 906 should be created. 908 Note also, that TYPE 2 units do not contain an explicit text 909 string length, TLEN (see TYPE 1). This is because TYPE 2 units do 910 not contain any modifiers after the text string. If needed, the 911 length of the received string can be obtained using the LEN values 912 of the TYPE 2 units. 914 o The SLEN field (16 bits) indicates the size (in bytes) of the 915 original (whole) text sample to which this fragment belongs. This 916 length comprises the text string plus any modifier boxes present 917 (and includes neither the byte order mark nor the text string 918 length as mentioned in the Terminology Section). 920 Regarding the text sample length: timed text samples are neither 921 generated at regular intervals nor there is a default sample size. 922 If 3GP files are streamed, the length of the text samples is 923 calculated beforehand and included in the track itself, while for 924 live encoding it is the real time encoder that SHALL choose an 925 appropriate size for each text sample. 
In this case, the amount 926 of text 'captured' in a sample depends on the text source and the 927 particular application (see examples below). Samples may, e.g., 928 be tailored to match the packet MTU as closely as possible or to 929 provide a given redundancy for the available bit rate. The 930 encoding application MUST also take into account the delay 931 constraints of the real-time session and assess whether FEC, 932 retransmission or other similar techniques are reasonable options 933 for stream repair. 935 The following examples illustrate how a real-time encoder 936 may choose its settings to adapt to the scenario constraints. 938 Example: imagine a newscast scenario, where the spoken news 939 is transcribed and synchronized with the image and voice of 940 the reporter. We assume that the news speaker talks at an 941 average speed of 5 words per second with an average word 942 length of 5 characters plus one space per word, i.e. 30 943 characters per second. We assume an available IP MTU of 576 944 bytes and an available bitrate of 576*8 bits per 945 second=4.6Kbps. We assume each character can be encoded 946 using 2 bytes in UTF-16. In this scenario, several 947 constraints may apply, for example: available IP MTU, 948 available bandwidth, allowable delay and required redundancy. 949 If the target were to minimize the packet overhead, a text 950 sample covering 8 seconds of text would be closest to the IP 951 MTU: IP/UDP/RTP/TYPE1 Header + (8s text sample)=20+8+12+8+(~6 952 chars/word * 5 words/s * 8s * 2 bytes/char)= 528 bytes < 576 953 bytes. For other scenarios, like lossy networks, it may 954 happen that just one packet per sample provides too little 955 redundancy. In this case, a choice could be that the encoder 956 'collects' text every second, thus yielding text samples 957 (TYPE 1 units) of 68 bytes, TYPE 1 header included.
We can, 958 e.g., include three contiguous text samples in one RTP 959 payload: the current and last two text samples (see below). 960 This amounts to a total IP packet size of 20+8+12+3*(8+60)= 961 244 bytes. Now, with the same available bitrate of 4.6Kbps, 962 these 244-byte packets can be sent redundantly up to two times 963 per second: 965 RTP payload (1,2,3)(1,2,3) (2,3,4)(2,3,4) (3,4,5)(3,4,5) ... 966 Time: <----1s------> <----1s------> <-----1s-----> ... 968 This means that each text sample is sent at least six times, 969 which should provide enough redundancy. Although not as 970 bandwidth efficient (488*8 < 528*8 < 576*8 bps) as the 971 previous packetization, this option increases the stream 972 redundancy while still meeting the delay and bandwidth 973 constraints. 975 Another example would be a user sending timed text from a 976 type-in area in the display. In this case, the text sample 977 is created as soon as the user clicks the 'send' button. 978 Depending on the packet length, fragmentation may be needed. 980 In a video conferencing application, text is synchronized 981 with audio and video. Thus, the text samples shall be 982 displayed long enough to be read by a human, shall fit in the 983 video screen and shall 'capture' the audio contents rendered 984 during the time the corresponding video and audio are 985 rendered. 987 For stored content, see Section 4.3 for details on how to find the 988 SLEN value in a 3GP file. For live content, the SLEN MUST be 989 obtained during the sampling process. 991 Finally, note that clients MAY use SLEN to reserve buffer space for the 992 remaining fragments of a text sample. 994 o The fields TOTAL (4 bits) and THIS (4 bits) indicate the total 995 number of fragments into which the original text sample (i.e. text 996 string and its modifiers) has been split and the position that 997 the current fragment occupies in that sequence, respectively.
998 Note that the sequence number alone cannot replace the 999 functionality of the THIS field, since packets (and fragments) may 1000 be repeated, e.g., as in repeated transmission (see Section 5). 1001 Thus, an indication for "fragment offset" is needed. 1003 The usual "byte offset" field is not used here for two reasons: a) 1004 it would take one more byte and b) it does not provide any 1005 information on the character offset. UTF-8/UTF-16 text strings 1006 have, in general, a variable character length ranging from 1 to 6 1007 bytes. Therefore, the TOTAL/THIS solution is preferred. It could 1008 also be argued that the LEN and SLEN fields be used for this 1009 purpose, but while they would provide information about the 1010 completeness of the text sample, they do not specify the order of 1011 the fragments. 1013 In all cases (TYPEs 2, 3 and 4), if the value of THIS is greater 1014 than TOTAL or if TOTAL equals zero (0x0), the fragment SHALL be 1015 discarded. 1017 o Finally, the sample contents following the SLEN field consist of a 1018 fragment of the UTF-8/UTF-16 character string; no modifiers 1019 follow. 1021 4.1.4. TYPE 3 Header 1023 0 1 2 3 1024 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1025 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1026 |U| R |TYPE | LEN( always >6) |TOTAL | THIS | 1027 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1028 | SDUR | 1029 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1030 Figure 6. TYPE 3 Header Format. 1032 This header type is used to transport either the entire modifier 1033 contents present in a text sample or just the first fragment of them. 1034 This depends on whether the modifier boxes fit in the current RTP 1035 payload. 1037 If a text sample containing modifiers is fragmented this header MUST 1038 be used to transport the first fragment or, if possible, the complete 1039 modifiers. 
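As a non-normative illustration, the discard rules stated so far (the minimum LEN values listed in Section 4.1.1 and the TOTAL/THIS check for TYPEs 2, 3 and 4) could be collected in one helper, as in the following Python sketch. The names MIN_LEN and should_discard are hypothetical. Treating the reserved TYPE values 0, 6 and 7 as units to be skipped is an assumption of this sketch, based on the requirement that clients ignore units they do not understand.

```python
from typing import Optional

# Smallest LEN value accepted for each defined TYPE (Section 4.1.1):
# TYPE 1: LEN >= 8, TYPE 2: LEN > 9, TYPE 3 and 4: LEN > 6, TYPE 5: LEN > 3.
MIN_LEN = {1: 8, 2: 10, 3: 7, 4: 7, 5: 4}

# Unit types that carry the TOTAL and THIS fields.
FRAGMENT_TYPES = (2, 3, 4)


def should_discard(unit_type: int, length: int,
                   total: Optional[int] = None,
                   this: Optional[int] = None) -> bool:
    """Return True if a receiver should discard (or skip) the unit."""
    if unit_type not in MIN_LEN:
        # Reserved TYPE 0, 6 or 7: skipped in this sketch (assumption).
        return True
    if length < MIN_LEN[unit_type]:
        return True
    if unit_type in FRAGMENT_TYPES:
        if total is None or this is None:
            return True            # caller did not supply the fragment fields
        if total == 0 or this > total:
            return True
    return False
```

Discarding one unit does not invalidate the RTP header or the remaining units of the payload, so a receiver would continue with the next unit after skipping 1 + LEN bytes.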
1041 In detail: 1043 o The U, R and TYPE fields are defined as in Section 4.1.1. 1045 o LEN indicates the length of the modifier contents. Its value is 1046 obtained upon fragmentation. Additionally, the LEN field MUST be 1047 greater than six (0x0006). Otherwise, the unit MUST be discarded. 1049 o The TOTAL/THIS field has the same meaning as for TYPE 2. 1051 For TYPE 3 unit containing the last (trailing) modifier fragment, 1052 the value of TOTAL MUST be equal to that of THIS (TOTAL=THIS). In 1053 addition, TOTAL=THIS MUST be greater than one, because the total 1054 number of fragments of a text sample is logically always larger 1055 than one. 1057 Otherwise, if TOTAL is different from THIS in a TYPE 3 unit, this 1058 means that the unit contains the first fragment of the modifiers. 1060 o The SDUR has the same definition for TYPE 1. Since the fragments 1061 are always transported in own RTP packets, this field is only 1062 needed to know how long this fragment is valid. This may, e.g., 1063 be used to determine how long it should be kept in the display 1064 buffer. 1066 Note that the SLEN and SIDX fields are not present in TYPE 3 unit 1067 headers. This is because: a) these fragments do not contain text 1068 strings and b) these types of fragments are applied over text string 1069 fragments, which already contain this information. 1071 4.1.5. TYPE 4 Header 1073 0 1 2 3 1074 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1075 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1076 |U| R |TYPE | LEN( always >6) |TOTAL | THIS | 1077 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1078 | SDUR | 1079 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1080 Figure 7. TYPE 4 Header Format. 1082 This header type is placed before modifier fragments, other than the 1083 first one. 1085 The U, R and TYPE fields are used as per Section 4.1.1. 
1087 LEN indicates as for TYPE 3 the length of the modifier contents and 1088 SHALL also be obtained upon fragmentation. The LEN field MUST be 1089 greater than six (0x0006). Otherwise, the unit MUST be discarded. 1091 TOTAL/THIS is used as in TYPE 2. 1093 The SDUR field is defined as in TYPE 1. The reasoning behind the 1094 absence of SLEN and SIDX is the same as in TYPE 3 units. 1096 4.1.6. TYPE 5 Header 1098 0 1 2 3 1099 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1100 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1101 |U| R |TYPE | LEN( always >3) | SIDX | 1102 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1103 Figure 8. TYPE 5 Header Format. 1105 This header type is used to transport (dynamic) sample descriptions. 1106 Every sample description MUST have its own TYPE 5 header. 1108 The U, R and TYPE fields are used as per Section 4.1.1. 1110 The LEN field indicates the length of the sample description, plus 1111 three units accounting for the SIDX and LEN field itself. Thus, this 1112 field MUST be greater than three (0x0003). Otherwise, the unit MUST 1113 be discarded. 1115 If the sample is streamed from a 3GP file, the length of the sample 1116 description contents (i.e. what comes after SIDX in the unit itself) 1117 is obtained from the file (see Section 4.3). 1119 The SIDX field contains a dynamic SIDX value assigned to the sample 1120 description carried as sample content of this unit. As only dynamic 1121 sample descriptions are carried using TYPE 5, the possible SIDX 1122 values are in the (closed) interval [0,127]. 1124 Senders MAY make use of TYPE 5 units. All receivers MUST implement 1125 support for TYPE 5 units, since it adds minimum complexity and it may 1126 increase the robustness of the streaming session. 1128 The next section specifies how SIDX values are calculated. 1130 4.2. 
Buffering of Sample Descriptions 1132 The buffering of sample descriptions is a matter of the client's 1133 timed text codec implementation. In order to work properly, this 1134 payload format requires that: 1136 o Static sample descriptions MUST be buffered at the client, at 1137 least, for the duration of the session. 1139 o If dynamic sample descriptions are used, their buffering and 1140 update of the SIDX values MUST follow the mechanism described in 1141 the next section. 1143 4.2.1. Dynamic SIDX wrap-around mechanism 1145 The use of dynamic sample descriptions by senders is OPTIONAL. 1146 However, if used, senders MUST implement this mechanism. Receivers 1147 MUST always implement it. 1149 Dynamic SIDX values remain active either during the entire duration 1150 of the session (if used just once) or in different intervals of it 1151 (if used once or more). 1153 Note: in the following SIDX means dynamic SIDX. 1155 For choosing the wrap-around mechanism, the following rationale was 1156 used: there are 128 dynamic SIDX values possible, [0..127]. If one 1157 chooses to allow a maximum of 127 to be used as dynamic SIDXs, then 1158 any reordered packet with a new sample description would make the 1159 mechanism fail. E.g., if the last packet received is SIDX=5, then 1160 all 127 values except SIDX=6 would be "active". Now, if a reordered 1161 packet arrives with a new description, SIDX=9, it will be mistakenly 1162 discarded, because the SIDX=9 is, at that moment, marked as "active" 1163 and active sample descriptions shall not be re-written. Therefore, 1164 a "guard interval" is introduced. This guard interval reduces the 1165 number of active SIDXs at any point in time to 64. Although most 1166 timed text applications will probably need less than 64 sample 1167 descriptions during a session (in total), a wrap-around mechanism to 1168 handle the need for more is described here. 1170 Thereby, a sliding window of 64 active SIDX values is used. 
Values 1171 within the window are "active"; all others are marked "inactive". An 1172 SIDX value becomes active if at least one sample description 1173 identified by that SIDX has been received. Since sample descriptions 1174 MAY be sent redundantly, it is possible that a client receives a 1175 given SIDX several times. However, active sample descriptions SHALL 1176 NOT be overwritten: the receiver SHALL ignore redundant sample 1177 descriptions and it MUST use the already cached copy. The "guard 1178 interval" of (64) inactive values ensures that the correct 1179 association SIDX <-> sample description is always used. 1181 Informative note: as for the "guard interval" value itself, 64 1182 as 128/2 was considered simple enough while still meeting the 1183 expected maximum number of sample descriptions. Besides that, 1184 there's no other motivation for choosing 64 or a different 1185 value. 1187 The following algorithm is used to buffer dynamic sample descriptions 1188 and maintain the dynamic SIDX values: 1190 Let X be the last SIDX received that updated the range of active 1191 sample descriptions. Let Y be a value within the allowed range for 1192 dynamic SIDX: [0,127], and different from X. Let Z be the SIDX of the 1193 last received sample description. Then: 1195 1. Initialize all dynamic SIDX values as inactive. For stored 1196 contents, read the sample description index in the Sample to 1197 Chunk box ("stsc") for that sample. For live streaming, the 1198 first value MAY be zero or any other value in the interval 1199 above. Go to step 2. 1201 2. First in-band sample description with SIDX=Z is received and 1202 stored. Set X=Z. Go to step 3. 1204 3. Any SIDX within the interval [X+1 modulo(128), X+64 modulo(128)] 1205 is marked as inactive and any corresponding sample description 1206 is deleted. Any SIDX within the interval [X+65 modulo(128), X] 1207 is set active. Go to step 4 (wait state). 1209 4. Wait for next sample description.
Once the client is 1210 initialized, the interval of active SIDX values MUST change 1211 whenever a sample description with an SIDX value in the inactive 1212 set is received. I.e., upon reception of a sample description 1213 with SIDX=Z do: 1215 a. If Z is in the (closed) interval [X+1 modulo(128), X+64 1216 modulo(128)] then set X=Z, store the sample description and 1217 go to step 3. 1219 b. Else Z must be in the interval [X+65 modulo(128), X], thus: 1220 i. If SIDX=Z is not stored, then store the sample 1221 description. Go to beginning of step 4 (wait state). 1222 ii. Else go to the beginning of step 4 (wait state). 1224 Informative note: it is allowed to send any value of SIDX=X in 1225 the interval [0,127]. E.g., if [64..127] is the current active 1226 set and SIDX=0 is sent, a new sample description is defined (0) 1227 and an old one deleted (64), thus [65..127] and [0] are active. 1228 Similarly, one could now send SIDX=64, thus inverting the active 1229 and inactive sets. 1231 Example: 1233 if X=4, any SIDX in the interval [5,68] is inactive. Active 1234 SIDX values are in the complementary interval [69,127] plus 1235 [0,4]. E.g., if the client receives SIDX=6, then the active 1236 interval is now different: [0,6] plus [71,127]. If the received 1237 SIDX is in the current active interval no change SHALL be 1238 applied. 1240 4.3. Finding payload header values in 3GP files 1242 For the purpose of streaming timed text contents, some values in the 1243 boxes contained in a 3GP file are mapped to fields of this payload 1244 header. This section explains where to find those values. 1246 Additionally, for the duration and sample description indexes, 1247 extension mechanisms are provided. All senders MUST implement the 1248 extension mechanisms described herein. 1250 If the content is streamed out of a 3GP file, the following guidelines 1251 SHALL be followed. 1252 Note: all fields in the objects (boxes) of a 3GP file are found 1253 in network byte order.
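As a non-normative illustration of the dynamic SIDX wrap-around mechanism of Section 4.2.1, the receiver-side bookkeeping could be sketched in Python as follows. The class DynamicSidxWindow and its method names are hypothetical. Sample descriptions are treated as opaque byte strings, and the sketch covers only steps 1 to 4 of the algorithm, not the parsing of TYPE 5 units.

```python
from typing import Dict, Optional

SIDX_RANGE = 128   # dynamic SIDX values lie in [0,127]
GUARD = 64         # number of inactive values ahead of X


class DynamicSidxWindow:
    """Hypothetical receiver state for dynamic sample descriptions."""

    def __init__(self) -> None:
        # Step 1: no X yet, so every dynamic SIDX value is inactive.
        self.x: Optional[int] = None
        self.stored: Dict[int, bytes] = {}

    def is_active(self, sidx: int) -> bool:
        if self.x is None:
            return False
        distance = (sidx - self.x) % SIDX_RANGE
        # Active interval is [X+65 modulo(128), X].
        return distance == 0 or distance > GUARD

    def receive(self, z: int, description: bytes) -> None:
        if not 0 <= z < SIDX_RANGE:
            raise ValueError("dynamic SIDX must be in [0,127]")
        if self.is_active(z):
            # Step 4b: store if missing, never overwrite a cached copy.
            self.stored.setdefault(z, description)
            return
        # Steps 2 and 4a: Z is in the inactive set, so it becomes the new X.
        self.x = z
        self.stored[z] = description
        # Step 3: values in [X+1, X+64] modulo 128 become inactive and
        # any sample description stored for them is deleted.
        for k in range(1, GUARD + 1):
            self.stored.pop((z + k) % SIDX_RANGE, None)

    def get(self, sidx: int) -> Optional[bytes]:
        return self.stored.get(sidx)
```

With X=4 this sketch reports [69,127] plus [0,4] as active, and after receiving SIDX=6 it reports [71,127] plus [0,6], which matches the example given for the mechanism.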
1255 Information obtained from the Sample Table Box (stbl): 1257 o Sample Descriptions and Sample Description length: the 1258 Sample Description box (stsd, inside the stbl) contains the 1259 sample descriptions. For timed text media, each element of 1260 stsd is a timed text sample entry (type "tx3g"). 1262 The (unsigned) 32 bits of the "size" field in the stsd box 1263 represent the length (in bytes) of the sample description, as 1264 carried in TYPE 5 units. On the other hand, the LEN field of 1265 TYPE 5 units is restricted to 16 bits. Therefore if the 1266 value of "size" is greater than (2^16-1-3)[bytes], then the 1267 sample description SHALL NOT be streamed with this payload 1268 format. There is no extension mechanism defined in this 1269 case, since fragmentation of sample descriptions is not 1270 defined (sample descriptions are typically up to some 200 1271 bytes in size). Note: the three (3) accounts for the TYPE 5 1272 header fields included in the LEN value. 1274 o SDUR from the Decoding Time to Sample Box (stts). The 1275 (unsigned) 32 bits of the "sample delta" field are used for 1276 calculating SDUR. However, since SDUR field is only 3 bytes 1277 long, then text samples with duration values larger than 1278 (2^24-1)/(timestamp clockrate)[seconds] cannot be streamed 1279 directly. The solution is simple: copies of the 1280 corresponding text sample SHALL be sent. Thereby, the 1281 timestamp and duration values SHALL be adjusted so that a 1282 continuous display is guaranteed as it just one sample would 1283 have been sent. I.e., a sample with timestamp TS and 1284 duration SDUR can be sent as two samples having timestamps 1285 TS1 and TS2 and durations SDUR1 and SDUR2, such that TS1=TS, 1286 TS2=TS1+SDUR1 and SDUR=SDUR1+SDUR2. 1288 o Text sample length from the Sample Size Box (stsz). 
         The (unsigned) 32 bits of the "sample size" or "entry size"
         (one of them, depending on whether the sample size is fixed or
         variable) indicate the length (in bytes) of the 3GP text
         sample.  For obtaining the length of the (actual) streamed
         text sample, the lengths of the text string byte count (2
         bytes) and, in case of UTF-16 strings, the length of the BOM
         (also 2 bytes) SHALL be deducted.  This is illustrated in
         Figure 9.

             Text Sample according to 3GPP TS 26.245

                     TEXT SAMPLE (length=stsz)
        .--------------------------------------------------.
       /                                                    \
              TEXT STRING (length=TBC)
        .------------------------------------.
       /                                      \
        TBC BOM                                  MODIFIERS
       +---+---+----------------------------------+-----------+
                         ||
                         ||    TBC BOM  -> TLEN field
                         ||   +---+---+    U bit
                         ||
                         \/

           Text Sample according to this Payload Format

               TEXT SAMPLE (length=SLEN w/o TBC,BOM)
               .--------------------------------------------.
              /                                              \
                  TEXT STRING (length=TLEN)
               .--------------------------------.
              /                                  \
                        TEXT STRING                MODIFIERS
               +----------------------------------+-----------+

       KEY:
       TBC= Text string Byte Count
       BOM= Byte Order Mark

                Figure 9.  Text sample composition.

         Moreover, since the LEN field in the TYPE 1 unit header is 16
         bits long, text samples larger than (2^16-1-8) [bytes] SHALL
         NOT be streamed.  Also in this case, there is no extension
         mechanism defined.  This is because this maximum is
         considered enough for the targeted streaming applications.
         (Note: the eight (8) accounts for the TYPE 1 header fields
         included in the LEN value.)

      o  SIDX from the Sample to Chunk Box (stsc): the stsc Box is
         used to find samples and their corresponding sample
         descriptions.  These are referenced by the "sample
         description index", an (unsigned) 32-bit integer.  If
         possible, these indices may be directly mapped to the SIDX
         field.
         However, there are several cases where this may not be
         possible:

         a) The total number of indices used is greater than the
            number of indices available, i.e., if the static sample
            descriptions are more than 127 or the dynamic ones are
            more than 64, or

         b) The original SIDX value ranges do not fit in the
            allowed ranges for static (129-254) or dynamic (0-127)
            values.

         Therefore, when assigning SIDX values to the sample
         descriptions, the following guidelines are provided:

         o  Static sample descriptions can simply be assigned
            consecutive values within the range 129-254 (closed
            interval).  This range should be large enough for static
            sample descriptions.

         o  As for dynamic sample descriptions:

            a) Streams that use fewer than 64 dynamic sample
               descriptions SHOULD use consecutive values for SIDX
               anywhere in the range 0-127 (closed interval).

            b) For streams with more than 64 sample descriptions,
               the SIDX values MUST be assigned in usage order, and if
               any sample description is to be used after it has been
               set inactive, it will need to be re-sent and assigned a
               new SIDX value (according to the algorithm in
               Section 4.2.1).

   Information obtained from the Media Data Box:

      o  Text strings, TLEN, U bit and modifiers from the Media Data
         Box (mdat).  Text strings, 16-bit text string byte count,
         Byte Order Mark (BOM, indicating UTF encoding) and modifier
         boxes can be found here.

         For TYPE 1 units, the value of TLEN is extracted from the
         text string byte count that precedes the text string in the
         text sample, as stored in the 3GP file.  If UTF-16 encoding
         is used, two (2) more bytes have to be deducted from this
         byte count beforehand, in order to exclude the BOM.  See
         Figure 9.

4.4. Fragmentation of Timed Text Samples

   This section explains why text samples may have to be fragmented and
   discusses some of the possible approaches to do it.  A solution is
   proposed together with rules and recommendations for fragmenting and
   transporting text samples.

   3GPP Timed Text applications are expected to operate at low bitrates.
   This fact, added to the small size of timed text samples (typically
   one or two hundred bytes), makes fragmentation of text samples a rare
   event.  Samples should usually fit into the MTU size of the used
   network path.

   Nevertheless, some text strings (e.g. the ending roll in a movie) and
   some modifier boxes (i.e. for hyperlinks, for karaoke or for styles)
   may become large.  This may also apply to future modifier boxes.  In
   such cases, the first option to consider is whether it is possible to
   adjust the encoding (e.g. the size of the sample) in such a way that
   fragmentation is avoided.  If so, this is preferred to fragmentation
   and SHOULD be done.

   Otherwise, if this is not possible or other constraints prevent it,
   fragmentation MAY be used and the basic guidelines given in this
   document MUST be followed:

   o  It is RECOMMENDED that text samples are fragmented as seldom as
      possible, i.e. the least possible number of fragments is created
      out of a text sample.

   o  If there is some bitrate and free space in the payload available,
      sample descriptions (if at hand) SHOULD be aggregated.

   o  Text strings MUST split at character boundaries, see TYPE 2
      header.  Otherwise, it is not possible to display the text
      contents of a fragment if a previous fragment was lost.  As a
      consequence, text string fragmentation requires knowledge of the
      UTF-8/UTF-16 encoding formats to determine character boundaries.

   o  Unlike text strings, the modifier boxes are NOT REQUIRED to split
      at meaningful boundaries.
      However, it is RECOMMENDED to do so whenever possible.  This
      decreases the effects of packet loss.  This payload format does
      not ensure that partially received modifiers be applied to text
      strings.  If only part of the modifiers is received, it is an
      application issue how to deal with these, i.e. whether to use
      them or not.

      Informative note: ensuring that partially received modifiers can
      be applied to text strings in all cases (for all modifier types
      and for all fragment loss constellations) would place additional
      requirements on the payload format.  In particular this would
      require that: a) senders understand the semantics of the
      modifier boxes and b) specific fragment headers for each of the
      modifier boxes are defined, in addition to the payload formats
      defined below.  Understanding the modifiers semantics means
      knowing, e.g., where does each modifier start and end, which
      text fragments are affected, which modifiers may or may not be
      split or what the fields indicate.  This is necessary for being
      able to split the modifiers in such a way that each fragment can
      be applied independent of previous packet losses.  This would
      require a more intelligent fragmentation entity and more complex
      headers.  Given the low probability of fragmentation and the
      desire to keep the requirements low, it does not seem reasonable
      to specify such modifier box specific headers.

   o  Modifier and text string fragments SHOULD be protected against
      packet losses, i.e. using FEC [7], retransmission [11], repetition
      (Section 5) or an equivalent technique.  This minimizes the
      effects of packet loss.

   o  An additional requirement when fragmenting text samples is that
      the start of the modifiers MUST be indicated using the payload
      header defined for that purpose, i.e. a TYPE 3 unit MUST be used
      (see Section 4.1.4).
      This enables a receiver to detect the start of the modifiers as
      long as there are not two or more consecutive packet losses.

   o  Finally, sample descriptions SHALL NOT be fragmented, because they
      contain important information that may affect several text
      samples.

4.5. Reassembling Text Samples at the Receiver

   The payload headers defined in this document allow reassembling
   fragmented text samples.  For this purpose, the standard RTP
   timestamp, the duration field (SDUR) and the fields TOTAL/THIS in the
   payload headers are used.

   Units that belong to the same text sample MUST have the same
   timestamp.  TYPE 5 units do not comply with this rule since they are
   not part of any particular text sample.

   The process for collecting the different fragments (units) of a text
   sample is as follows:

   1.  Search for units having the same timestamp value, i.e., units
       that belong to the same text sample or sample descriptions that
       shall become available at that time instant.  If several units
       of the same sample are repeated, only one of them SHALL be used.
       Repeated units are those that have the same timestamp and the
       same values for TOTAL/THIS.

          Note that, as mentioned in Section 4.1.1, the receiver
          SHALL ignore units with unrecognized TYPE value.
          However, the RTP header fields and the rest of the units
          (if any) in the payload are still useful.

   2.  Check within this set whether any of the units from the text
       sample is missing.  This is done using the TOTAL and THIS
       fields; the TOTAL field indicates how many fragments were
       created out of the text sample and the THIS field indicates the
       position of this fragment in the text sample.  As a result of
       this operation two outcomes are possible:

       a.  No fragment is missing.
           Then the THIS field SHALL be used to order the fragments
           and reassemble the text sample before forwarding it to the
           decoding application.  Special care SHALL be taken when
           reassembling the text string as indicated in bullet 4
           below.

       b.  One or more fragments are missing: check whether the
           missing fragments belong to the text string or to the
           modifiers.  TYPE 2 units identify text string fragments;
           TYPE 3 and TYPE 4 units identify modifier fragments:

           i.   If the fragment or fragments missing belong to the
                text string and the modifiers were received complete,
                then the received text characters may, at least, be
                displayed as plain text.  Some modifiers may only be
                applied as long as it is possible to identify the
                character numbers, e.g. if only the last text string
                fragment is lost.  This is the case for modifiers
                defining specific font styles ('styl'), highlighted
                characters ('hlit'), karaoke feature ('krok') and
                blinking characters ('blnk').  Other modifiers such as
                'dlay' or 'tbox' can be applied without the knowledge
                of the character number.  It is an application issue
                to decide whether to apply the modifiers or not.

           ii.  If the fragment missing belongs to the modifiers and
                the text strings were received complete, then the
                incomplete modifiers may be used.  The text string
                SHOULD at least be displayed as plain text.  As
                mentioned in Section 4.4, modifiers may split without
                observing meaningful boundaries.  Hence, it may not
                always be possible to make use of partially received
                modifiers.  To avoid this, it is RECOMMENDED that the
                modifiers do split at meaningful boundaries.

           iii. A third possibility is that it is not possible to
                discern whether modifiers or text strings were
                received complete.  E.g. if the TYPE 3 unit of a
                sample plus the following or preceding packet is lost,
                there is no way for the RTP receiver to know whether
                one or both of the lost packets belong to the
                modifiers or whether some text string is also missing.
                Repetition, FEC, retransmission or other protection
                mechanisms as per section 4.6 are RECOMMENDED to avoid
                this situation.

           iv.  Finally, if it is sure that neither text strings nor
                modifiers were received complete, then the text
                strings and the modifiers may be rendered partially or
                may be discarded.  This is an application choice.

   3.  Sample descriptions can be directly associated with the
       reassembled text samples, via the sample description index
       (SIDX).

   4.  Reassembling of text strings: since the text strings transported
       in RTP packets MUST NOT include any byte order mark (BOM), the
       receiver MUST prepend it to the reassembled UTF-16 string before
       handing it to the timed text decoder (see Figure 9).  The value
       of the BOM is 0xFEFF because only big endian serialization of
       UTF-16 strings is supported by this payload format.

4.6. On Aggregate Payloads

   Units SHOULD be aggregated to avoid overhead, whenever possible.  The
   aggregate payloads MUST comply with one of the following ordered
   configurations:

   1. Zero or more sample descriptions (TYPE 5) followed by zero or more
      whole text samples (TYPE 1 units).  At least one unit of either
      type MUST be present.

   2. Zero or more sample descriptions followed by zero or one modifier
      fragment, either TYPE 3 or TYPE 4.  At least one unit MUST be
      present.

   3. Zero or more sample descriptions followed by zero or one text
      string fragment (TYPE 2) followed by zero or one TYPE 3 unit.  If
      a TYPE 2 unit and a TYPE 3 unit are present, then they MUST belong
      to the same text sample.  At least one unit MUST be present.
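   As an illustration only, the three ordered configurations above can
   be checked on the sequence of TYPE values found in a payload.  The
   function name is hypothetical, and the sketch does not verify that a
   TYPE 2 unit and a following TYPE 3 unit belong to the same text
   sample; that check would need the unit timestamps as well.

```python
def is_valid_aggregate(unit_types):
    """Check a payload's ordered unit TYPE values against configs 1-3."""
    if not unit_types:
        return False  # every configuration requires at least one unit

    # Skip the leading run of sample descriptions (TYPE 5 units).
    i = 0
    while i < len(unit_types) and unit_types[i] == 5:
        i += 1
    rest = list(unit_types[i:])

    if all(t == 1 for t in rest):
        return True   # configuration 1 (also covers "TYPE 5 units only")
    if rest in ([3], [4]):
        return True   # configuration 2: a single modifier fragment
    if rest in ([2], [2, 3]):
        return True   # configuration 3: TYPE 2, optionally then TYPE 3
    return False
```

   For example, the sequence 5, 1, 1, 1 is accepted, whereas 1, 5 is
   rejected because a sample description follows a non-TYPE 5 unit.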
   Some observations:

   o  Aggregates other than the ones listed above SHALL NOT be used.

   o  Sample descriptions MUST be placed in the aggregate payload before
      the occurrence of any non-TYPE 5 units.

   o  Correct reception of TYPE 5 units is important since their
      contents may be referenced by several other units in the stream.

      Receivers are unable to use text samples until their corresponding
      sample description is received.  Accordingly, a sender SHOULD send
      multiple copies of a sample description to ensure reliability (see
      section 5).  Receivers MAY use payload specific feedback messages
      [21] to tell a sender that they have received a particular sample
      description.

   o  Regarding timestamp calculation: in general, the rules for
      calculating the timestamp of units in an aggregate payload depend
      on the type of unit.  Based on the possible constellations for
      aggregate payloads as above we have:

      o  Sample descriptions MUST receive the RTP timestamp of the
         packet in which they are included.

         Note that for TYPE 5 units, the timestamp actually does not
         represent the instant when they are played out, but instead
         the instant at which they become available for use.

      o  For the first configuration: the first TYPE 1 unit receives
         the RTP timestamp.  The timestamp of any subsequent TYPE 1
         unit MUST be obtained by adding sample duration and
         timestamp, both of the preceding TYPE 1 unit.

      o  For the second and third configuration, all units, TYPE 2,
         3 and 4, MUST receive the RTP timestamp.

      Refer to detailed examples on the timestamp calculation
      below.

   o  As per configuration 3 above, a payload MAY contain several
      fragments of one (and only one) text sample.  If so, then exactly
      one TYPE 2 unit followed by exactly one TYPE 3 unit are allowed in
      the same payload.
This is in line with RFC 3640 [12], Section 1627 2.4, which explicitly disallows combining fragments of different 1628 samples in the same RTP payload. Note that, in this special case, 1629 no timestamp calculation is needed. I. e., the RTP timestamp of 1630 both units is equal to the timestamp in the packet's RTP header. 1632 o Finally, note that the use of empty text samples allows for 1633 aggregating non-consecutive TYPE 1 units in the same payload. Two 1634 text samples, with timestamps TS1 and TS3 and durations SDUR1 and 1635 SDUR3, are not consecutive if it holds TS1+SDUR1 < TS3. A 1636 solution for this is to include an empty TYPE 1 unit with duration 1637 SDUR2 between them, such that TS2+SDUR2 = TS1+SDUR1+SDUR2 = TS3. 1639 Some examples of aggregate payloads are illustrated in Figure 10 1640 (Note: the figure is not scaled.) 1641 N/A TS1 TS2 TS3 1642 +------+-----+------+-----+ 1643 |TYPE5 |TYPE1|TYPE1 |TYPE1| 1644 +------+-----+------+-----+ 1645 N/A sdur1 sdur2 sdur3 1647 N/A TS4 1648 +-----+-------+ 1649 |TYPE5| TYPE 1| a) 1650 +-----+-------+ 1651 N/A sdur4 1653 TS4 TS4 TS4 1654 +--------------+ +--------------+ 1655 | TYPE2 | |TYPE2 |TYPE 3 | b) 1656 +--------------+ +--------------+ 1657 sdur4 sdur4 sdur4 1659 TS4 TS4 1660 +--------------+ +--------------+ 1661 | TYPE2| TYPE 3| | TYPE4 | c) 1662 +--------------+ +--------------+ 1663 sdur4 sdur4 sdur4 1665 |----------PAYLOAD 1------| |--PAYLOAD 2---| |--PAYLOAD 3---| 1666 rtpts1 rtpts2 rtpts3 1668 KEY: 1669 TSx means Text Sample x, 1670 rtptsy represents the standard RTP timestamp for PAYLOAD y 1671 sdurz the duration of unit z 1672 N/A means not applicable 1674 Figure 10. Example aggregate payloads. 1676 In Figure 10 four text samples (TS1 through TS4) are sent using three 1677 RTP packets. These configurations have been chosen to show how the 5 1678 TYPE headers are used. Additionally, three different possibilities 1679 for the last text sample, TS4, are depicted: a), b) and c). 
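   The timestamp rules given in the observations above can be sketched
   as follows.  This is a minimal illustration, not a normative
   procedure: the function names and the (type, duration) tuple layout
   are hypothetical, and the arithmetic is taken modulo 2^32 because the
   RTP timestamp is a 32-bit field.

```python
RTP_TS_MODULUS = 2 ** 32  # RTP timestamps are 32-bit and wrap around


def unit_timestamps(rtp_timestamp, units):
    """Return one timestamp per unit of an aggregate payload.

    'units' is a list of (unit_type, sdur) tuples in payload order;
    sdur is ignored for TYPE 5 units and may be None.
    """
    timestamps = []
    next_type1_ts = rtp_timestamp
    for unit_type, sdur in units:
        if unit_type == 1:
            # First TYPE 1 unit gets the RTP timestamp; each later one
            # gets timestamp + duration of the preceding TYPE 1 unit.
            timestamps.append(next_type1_ts)
            next_type1_ts = (next_type1_ts + sdur) % RTP_TS_MODULUS
        else:
            # TYPE 5 units and fragments (TYPE 2, 3, 4) take the RTP
            # timestamp of the packet unchanged.
            timestamps.append(rtp_timestamp)
    return timestamps


def empty_sample_duration(ts1, sdur1, ts3):
    """Duration SDUR2 of an empty TYPE 1 unit bridging TS1+SDUR1 to TS3."""
    return (ts3 - ts1 - sdur1) % RTP_TS_MODULUS
```

   For a payload holding one TYPE 5 unit and three TYPE 1 units with
   durations 10, 20 and 30 and an RTP timestamp of 1000, the sketch
   yields 1000, 1000, 1010 and 1030.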
   In Figure 11, option b) from Figure 10 is chosen to illustrate how
   the timestamp for each unit is found:

     N/A    TS1   TS2    TS3          TS4            TS4     TS4
   +------+-----+------+-----+  +--------------+  +--------------+
   |TYPE5 |TYPE1|TYPE1 |TYPE1|  |    TYPE2     |  |TYPE2 |TYPE 3 |
   +------+-----+------+-----+  +--------------+  +--------------+
     N/A   sdur1 sdur2  sdur3        sdur4         sdur4   sdur4

     (#1)  (#2)  (#3)   (#4)         (#5)          (#6)    (#7)

   |----------PAYLOAD 1------|  |--PAYLOAD 2---|  |--PAYLOAD 3---|
     rtpts1                       rtpts2            rtpts3

           Figure 11.  Selected payloads from Figure 10.

   Assuming TSx means Text Sample x, rtptsy represents the standard RTP
   timestamp for PAYLOAD y and sdurz the duration of unit z, the
   timestamp for unit #z, ts(#z), can be found as the sum of rtptsy and
   the cumulative sum of the durations of preceding units in that
   payload (except in the case of PAYLOAD 3, as per configuration 3
   above).  Thus, we have:

   1. for the units in the first aggregate payload, PAYLOAD 1:

         ts(#1)= rtpts1,
         ts(#2)= rtpts1,
         ts(#3)= rtpts1 + sdur1,
         ts(#4)= rtpts1 + sdur1 + sdur2,

      Note that the TYPE 5 and the first TYPE 1 unit both have the
      RTP timestamp.

   2. for PAYLOAD 2:

         ts(#5)= rtpts2,

   3. for PAYLOAD 3:

         ts(#6)= ts(#7)= rtpts2= rtpts3

      According to configuration 3 above, the TYPE 2 and the TYPE 3
      units shall belong to the same sample.  Hence rtpts3 must be
      equal to rtpts2.  For the same reason, the value of SDUR is
      not used to calculate the timestamp of the next unit.

4.7.
Payload Examples 1728 Some example of payloads using the defined headers are shown below: 1730 0 1 2 3 1731 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1732 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1733 |V=2|P|X| CC |M| PT | sequence number | 1734 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1735 | timestamp | 1736 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1737 | synchronization source (SSRC) identifier | 1738 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1739 |U| R |TYPE1| LEN (always >=8) | SIDX | 1740 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1741 | SDUR | TLEN | 1742 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1743 | TLEN | | 1744 +---------------+ | 1745 | text string (no.bytes=TLEN) | 1746 | | 1747 | | 1748 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1749 | modifiers (no.bytes=LEN - 8 - TLEN) | 1750 | | 1751 | | 1752 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1753 |U| R |TYPE1| LEN (always >=8) | SIDX | 1754 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1755 | SDUR | TLEN | 1756 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1757 | TLEN | | 1758 +---------------+ | 1759 | text string (no.bytes=TLEN) | 1760 | | 1761 | | 1762 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1763 | modifiers (no.bytes=LEN - 8 - TLEN) | 1764 | +-+-+-+-+-+-+-+-+ 1765 | | 1766 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1767 Figure 12. A payload carrying two TYPE 1 units. 1769 In Figure 12 an RTP packet carrying two TYPE 1 units is depicted. It 1770 can be seen how the length fields LEN and TLEN can be used to find 1771 the start of the next unit (LEN), find the start of the modifiers 1772 (TLEN) and find the length of the modifiers (LEN-TLEN). 
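   As a worked illustration of how LEN and TLEN locate the parts of a
   TYPE 1 unit, the following Python sketch builds and walks a payload
   like the one in Figure 12.  It assumes the field widths suggested by
   the figure (U 1 bit, R 4 bits, TYPE 3 bits, LEN 16 bits, SIDX 8 bits,
   SDUR 24 bits, TLEN 16 bits) and assumes that LEN counts the LEN field
   and everything after it, but not the first octet, which is what makes
   the modifiers LEN - 8 - TLEN bytes long.  Function names are
   hypothetical.

```python
import struct

TYPE1_FIXED = 8  # LEN(2) + SIDX(1) + SDUR(3) + TLEN(2), counted in LEN


def build_type1_unit(sidx, sdur, text, modifiers=b"", utf16=False):
    """Serialize one TYPE 1 unit (reserved R bits set to zero)."""
    first = (0x80 if utf16 else 0x00) | 0x01      # U bit, R=0, TYPE=1
    length = TYPE1_FIXED + len(text) + len(modifiers)
    return (struct.pack(">BHB", first, length, sidx)
            + sdur.to_bytes(3, "big")
            + struct.pack(">H", len(text))
            + text + modifiers)


def parse_type1_units(payload):
    """Split an RTP payload made only of TYPE 1 units into dicts."""
    units, offset = [], 0
    while offset < len(payload):
        if len(payload) - offset < 1 + TYPE1_FIXED:
            raise ValueError("truncated TYPE 1 header")
        first = payload[offset]
        (length,) = struct.unpack_from(">H", payload, offset + 1)
        if first & 0x07 != 1 or length < TYPE1_FIXED:
            raise ValueError("not a well-formed TYPE 1 unit")
        sidx = payload[offset + 3]
        sdur = int.from_bytes(payload[offset + 4:offset + 7], "big")
        (tlen,) = struct.unpack_from(">H", payload, offset + 7)
        text_start = offset + 1 + TYPE1_FIXED     # text follows TLEN
        unit_end = offset + 1 + length            # LEN gives next unit
        if text_start + tlen > unit_end or unit_end > len(payload):
            raise ValueError("inconsistent LEN/TLEN")
        units.append({
            "utf16": bool(first & 0x80),
            "len": length, "sidx": sidx, "sdur": sdur,
            "text": payload[text_start:text_start + tlen],
            # modifiers occupy LEN - 8 - TLEN bytes
            "modifiers": payload[text_start + tlen:unit_end],
        })
        offset = unit_end
    return units
```

   Parsing the concatenation of two units built with build_type1_unit
   returns two entries whose text and modifier bytes match the inputs.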
1774 0 1 2 3 1775 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1776 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1777 |V=2|P|X| CC |M| PT | sequence number | 1778 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1779 | timestamp | 1780 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1781 | synchronization source (SSRC) identifier | 1782 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1783 |U| R |TYPE5| LEN( always >3) | SIDX | 1784 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1785 | | 1786 | sample description (no.bytes=LEN - 3) | 1787 | | 1788 | | 1789 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1790 |U| R |TYPE1| LEN (always >=8) | SIDX | 1791 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1792 | SDUR | TLEN | 1793 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1794 | TLEN | | 1795 +-+-+-+-+-+-+-+-+ | 1796 | text string fragment (no.bytes=TLEN) | 1797 | | 1798 | | 1799 | +-+-+-+-+-+-+-+-+ 1800 | | 1801 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1802 Figure 13. An RTP packet carrying a TYPE 5 and a TYPE 1 unit. 1804 In Figure 13, a sample description and a TYPE 1 unit are aggregated. 1805 The TYPE 1 unit happens to contain only text strings and is small so 1806 that an additional the TYPE 5 unit is included for taking advantage 1807 of the available bits in the packet. 
1809 0 1 2 3 1810 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1811 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1812 |V=2|P|X| CC |M| PT | sequence number | 1813 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1814 | timestamp | 1815 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1816 | synchronization source (SSRC) identifier | 1817 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1818 |U| R |TYPE2| LEN( always >9) |TOTAL=4|THIS=1 | 1819 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1820 | SDUR | SIDX | 1821 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1822 | SLEN | | 1823 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 1824 | text string fragment (no.bytes=LEN - 9) | 1825 | | 1826 : : 1827 : : 1828 | +-+-+-+-+-+-+-+-+ 1829 | | 1830 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1831 Figure 14. Payload with first text string fragment of a sample. 1833 In Figure 14, Figure 15 and Figure 16 a text sample is split into 1834 three RTP packets. In the first one, the text string is big and 1835 takes the whole packet length. In the second packet in Figure 15, 1836 the only possibility for carrying two fragments of the same text 1837 sample is represented (see configuration 3 in Section 4.6). The last 1838 packet showed carries the last modifier fragment, a TYPE 4. 
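   For a fragmented sample such as the one spread over Figures 14 to
   16, a receiver following Section 4.5 could collect and order the
   units roughly as sketched below.  The dictionary keys and the
   function name are hypothetical, THIS is assumed to count from 1 as
   in the figures, and whether the string is UTF-16 is passed in by the
   caller rather than derived from the header.

```python
UTF16_BE_BOM = b"\xfe\xff"  # only big endian UTF-16 is supported


def reassemble_sample(units, utf16):
    """Reassemble fragments (TYPE 2/3/4) that share one RTP timestamp.

    Each unit is a dict with keys "type", "total", "this" and "data".
    Returns (text, modifiers), or None if any fragment is missing.
    """
    if not units:
        return None

    # Repeated units carry the same TOTAL/THIS; keep only one of them.
    by_position = {}
    for unit in units:
        by_position.setdefault(unit["this"], unit)

    total = units[0]["total"]
    if set(by_position) != set(range(1, total + 1)):
        return None  # incomplete: handling is an application choice

    ordered = [by_position[i] for i in range(1, total + 1)]
    text = b"".join(u["data"] for u in ordered if u["type"] == 2)
    modifiers = b"".join(u["data"] for u in ordered if u["type"] in (3, 4))

    if utf16:
        # The BOM is not transported; prepend it before decoding.
        text = UTF16_BE_BOM + text
    return text, modifiers
```

   Returning None for an incomplete sample is only one possible policy;
   Section 4.5 leaves partial rendering to the application.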
1840 0 1 2 3 1841 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1842 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1843 |V=2|P|X| CC |M| PT | sequence number | 1844 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1845 | timestamp | 1846 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1847 | synchronization source (SSRC) identifier | 1848 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1849 |U| R |TYPE2| LEN( always >9) |TOTAL=4|THIS=2 | 1850 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1851 | SDUR | SIDX | 1852 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1853 | SLEN | | 1854 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 1855 | text string fragment (no.bytes=LEN - 9) | 1856 | | 1857 | | 1858 | | 1859 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1860 |U| R |TYPE3| LEN( always >6) |TOTAL=4|THIS=3 | 1861 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1862 | SDUR | | 1863 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 1864 | | 1865 | modifiers (no.bytes=LEN - 6) | 1866 | +-+-+-+-+-+-+-+-+ 1867 | | 1868 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1869 Figure 15. An RTP packet carrying a TYPE2 unit and a TYPE 3 unit. 
1871 0 1 2 3 1872 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 1873 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1874 |V=2|P|X| CC |M| PT | sequence number | 1875 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1876 | timestamp | 1877 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1878 | synchronization source (SSRC) identifier | 1879 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1880 |U| R |TYPE4| LEN( always >6) |TOTAL=4|THIS=4 | 1881 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1882 | SDUR | | 1883 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 1884 | | 1885 | modifiers (no.bytes=LEN - 6) | 1886 | +-+-+-+-+-+-+-+-+ 1887 | | 1888 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 1889 Figure 16. An RTP packet carrying last modifiers fragment (TYPE 4). 1891 4.8. Relation to RFC 3640 1893 RFC 3640 defines a payload format for the transport of any 1894 non-multiplexed MPEG-4 elementary stream. One of the various MPEG-4 1895 elementary streams types are MPEG-4 timed text streams, specified in 1896 MPEG-4 part 17 [28], also known as ISO/IEC 14496-17. MPEG-4 timed 1897 text streams are capable of carrying 3GPP timed text data, as 1898 specified in 3GPP TS 26.245 [1]. 1900 MPEG-4 timed text streams are intentionally constructed so as to 1901 guarantee interoperability between RFC 3640 and this payload format. 1902 This means that the construction of the RTP packets carrying timed 1903 text is the same. I.e., the MPEG-4 timed text elementary stream as 1904 per ISO/IEC 14496-17 is identical to the (aggregate) payloads 1905 constructed using this payload format. 1907 Figure 11 illustrates the process of constructing an RTP packet 1908 containing timed text. As it can be seen in the partition block, the 1909 (transport) units used in this payload format are identical to the 1910 Timed Text Units (TTUs) defined in ISO/IEC 14496-17. 
Likewise, the 1911 rules for payload aggregation as per Section 4.6 are identical to the 1912 ones defined in ISO/IEC 14496-17 and compliant with RFC 3640. As a 1913 result, an RTP packet that uses this payload format is identical to 1914 and RTP packet using RFC 3640 conveying TTUs according to ISO/IEC 1915 14496-17. In particular, MPEG-4 Part 17 specifies that when using 1916 RFC 3640 for transporting timed text streams, the "streamType" 1917 parameter value is set to 0x0D and the value of the 1918 "objectTypeIndication" in "config" takes the value 0x08. 1920 +--------------------------------------+ 1921 Text samples | +--------------+ +--------------+ | 1922 as per 3GPP | |Text Sample 1 | |Text Sample N | | 1923 TS 26245 | +--------------+ +--------------+ | 1924 +--------------------------------------+ 1925 \/ 1926 +-------------------------------------------------------------------+ 1927 | Partition Text Samples into units. TTU[i]= TYPE i units. | 1928 | | 1929 |[U R TYPE LEN][{TOTAL,THIS}SIDX{SDUR}{TLEN}{SLEN}][SampleContents] | 1930 |{..} means present if applicable, [..] means always present | 1931 +-------------------------------------------------------------------+ 1932 \/ \/ 1933 +-------------------------------------------------------------------+ 1934 | Aggregation (if possible) | 1935 +-------------------------------------------------------------------+ 1936 \/ \/ 1937 +-------------------------------------------------------------------+ 1938 | RTP Entity adds and fills RTP header and Sends RTP packet, where | 1939 | RTP packets according to this Payload Format = | 1940 |= RTP packets carrying MPEG-4 Timed Text ES over RFC3640 | 1941 +-------------------------------------------------------------------+ 1942 Figure 11. Relation to RFC 3640. 1944 Note: the use of RFC 3640 for transport of ISO/IEC 14496-17 data does 1945 not require any new SDP parameters or any new mode definition. 1947 4.9. 
Relation to RFC 2793

   RFC 2793 [24] and its revision [25] specify a protocol for
   enabling text conversation.  Typical applications of that payload
   format are text communication terminals and text conferencing tools.
   Text session contents are specified in ITU-T Recommendation T.140
   [26].  T.140 text is UTF-8 coded as specified in T.140 [26] with no
   extra framing.  The T140block contains one or more T.140 code
   elements as specified in T.140.  Code elements are control sequences
   such as "New Line", "Interrupt", "String Terminator" or "Start of
   String".  Most T.140 code elements are single ISO 10646 [27]
   characters, but some are multiple character sequences.  Each
   character is UTF-8 encoded [18] into one or more octets.

   This payload format may also be used for conversational applications
   (even for instant messaging).  However, this is not its main target.
   The differentiating feature of the 3GPP Timed Text media format
   is that it allows text decoration.  This is especially useful in
   multimedia presentations, karaoke, commercial banners, news tickers,
   clickable text strings and captions.  T.140 text contents
   used in RFC 2793 do not allow the use of text decoration.

   Furthermore, the conversational text RTP payload format recommends a
   method to include redundant text from already transmitted packets in
   order to reduce the risk of text loss caused by packet loss.  Thereby
   payloads would include a redundant copy of the last payload sent.
   This payload format does not describe such a method, but it is also
   applicable here.  As explained in Section 5, packet redundancy SHOULD
   be used whenever possible.  The aggregation guidelines in Section 4.6
   allow redundant payloads.

5. Resilient Transport

   Apart from the basic fragmentation guidelines described in the
   section above, the simplest option for packet loss resilient
   transport is packet repetition.  Such a mechanism may consist of a
   strict window-based repetition mechanism or, simply, a repetition
   mechanism in a wider sense, where new and old packets are mixed, for
   example.

   A server MAY decide to use repetition as a measure for packet loss
   resilience.  Thereby, a server MAY send the same RTP payloads or just
   some of the units from the payloads.

   As for the case of complete payloads, single repeated units MUST
   match exactly the same units sent in the first transmission, i.e. if
   fragmentation is needed, it SHALL be performed only once for each
   text sample.  Only then can a receiver use the already received and
   the repeated units to reconstruct the original text samples.  Since
   the RTP timestamp is used to group together the fragments of a
   sample, care must be taken to preserve the timing of units when
   constructing new RTP packets.

      E.g. if a text sample was originally sent as a single
      non-fragmented text sample (one TYPE 1 unit), a repetition of
      that sample MUST be sent also as a single non-fragmented text
      sample in one unit.  Likewise, if the original text sample was
      fragmented and spread over several RTP packets, say a total of 3
      units, then the repeated fragments SHALL also have the same byte
      boundaries and use the same unit headers and bytes per fragment.

   With repetition, repeated units resolve to the same timestamp as
   their originals.  Where redundant units are available, only one of
   them SHALL be used.

   Regarding the RTP header fields:

   o if the whole RTP payload is repeated, all payload-specific fields
     in the RTP header (the M, TS and PT fields) MUST keep their
     original values.  The sequence number, however, MUST be
     incremented to comply with RTP (the TOTAL/THIS fields make it
     possible to re-assemble fragments with different sequence
     numbers).

   o in packets containing single repeated units, the general rules in
     Section 3 for assigning values to the RTP header fields apply.
     Particularly relevant here is to keep the value of the RTP
     timestamp to preserve the timing of the units.

   Apart from repetition, other mechanisms such as FEC [7],
   retransmission [11] or similar techniques could be used to cope with
   packet losses.

6. Congestion control

   Congestion control for RTP SHALL be implemented in accordance with
   RTP [3], and the applicable RTP profile, e.g. RTP/AVP [17].

   When using this payload format, two main factors may affect
   congestion control:

   o The use of (unit) aggregation may make the payload format more
     bandwidth efficient, by avoiding header overhead and thus reducing
     the used bitrate.

   o The use of resilient transport mechanisms: although timed text
     applications typically operate at low bitrates, the increase due to
     resilient transport shall be considered for congestion control
     mechanisms.  This applies to all mechanisms but especially to less
     efficient ones like repetition.

7. Scene Description

7.1. Text Rendering Position and Composition

   In order to set up a timed text session, regardless of the stream
   being stored in a 3GP file or streamed live, some initial layout
   information is needed by the communicating peers.

   +-------------------------------------------+
   |      <-> tx                               |    +-------------+
   |     +-------------------------------+     |<---|Display Area |
   |  ^  |                               |     |    +-------------+
   |  :  |                               |     |
   |  :ty|                               |     |    +-------------+
   |  :  |                               |<---------|Video track  |
   |  :  |                               |     |    +-------------+
   |  :  |                               |     |
   |  :  |                               |     |
   |  :  |                               |     |
   |  v  |                               |     |
   |  -  |   x-------------------------+ |     |    +-------------+
   |h ^  |   |                         |<-----------|Text Track   |
   |e :  +---|-------------------------|-+     |    +-------------+
   |i :      | +---------------------+ |       |
   |g :      | |                     | |       |    +-------------+
   |h :      | |                     |<------------ |Text Box     |
   |t v      | +---------------------+ |       |    +-------------+
   |  -      +-------------------------+       |
   +-------------------------------------------+
             <........................>
                      w i d t h

   Figure 17. Illustration of text rendering position and composition

   The parameters used for negotiating the position and size of the text
   track in the display area are shown in Figure 17.  These are the
   "width" and "height" of the text track, its translation values, "tx"
   and "ty", and its "layer" or proximity to the user.

   At the same time, the sender of the stream needs to know the
   receiver's capabilities, in this case the maximum allowable values
   for the text track height and width, "max-h" and "max-w", for the
   stream the receiver shall display.

   This layout information MUST be conveyed in a reliable form prior
   to the start of the session, e.g. during session announcement or in
   an Offer/Answer (O/A) exchange.  An example of a reliable transport
   may be the out-of-band channel used for SDP.  Sections 8 and 9
   provide details on the mapping of these parameters to SDP
   descriptions and their usage in O/A.

   For stored content, the layout values expressing stream properties
   MUST be obtained from the Track Header Box.  See Section 7.3.

   For live streaming, appropriate values as negotiated during session
   set-up shall be used.

7.2. SMIL usage

   The attributes contained in the Track Header Boxes of a 3GP file only
   specify the spatial relationship of the tracks within the given 3GP
   file.

   If multiple 3GP files are sent, they require spatial synchronization.
   For example, for a text and video stream, the positions of the text
   and video tracks in Figure 17 shall be determined.  For such purpose,
   SMIL [9] MAY be used.

   SMIL assigns regions in the display to each of those files and places
   the tracks within those regions.  Generally, in SMIL, the position of
   one track (or stream) is expressed relative to another track.  This
   is different from the 3GP file, where the upper left corner is the
   reference for all translation offsets.  Hence, only if the position
   in SMIL is relative to the video track origin does this translation
   offset have the same value as (tx, ty) in the 3GP file.

   Note also that the original track header information is used for each
   track only within its region, as assigned by SMIL.  Therefore, even
   if SMIL scene description is used, the track header information
   pieces SHOULD be sent anyway as they represent the intrinsic media
   properties.  See 3GPP SMIL Language Profile in [29] for details.

7.3. Finding layout values in a 3GP file

   In a 3GP file, within the Track Header Box (tkhd):

      o tx, ty: these values specify the translation offset of the
        (text) track relative to the upper left corner of the video
        track, if present.  They are the second but last and third
        but last values in the unity matrix; values are fixed-point
        16.16 values, restricted to be (signed) integers (i.e., the
        lower 16 bits of each value shall be all zeros).  Therefore,
        only the first 16 bits are used for obtaining the value of
        the media type parameters.

      o width, height: they have the same name in the tkhd box.  All
        (unsigned) 32 bits are meaningful.

      o layer: all (signed) 16 bits are used.

8. 3GPP Timed Text Media Type

   The media subtype for the 3GPP Timed Text codec is allocated from the
   standards tree.  The top-level media type under which this payload
   format is registered is 'video'.  This registration is done using the
   template defined in [31] and following RFC 3555 [30].

   The receiver MUST ignore any unrecognized parameter.

   Media type: video

   Media subtype: 3gpp-tt

   Required parameters

      rate:
         Refer to Section 3 in RFCXXXX.

      sver:
         The parameter "sver" contains a list of supported
         backwards-compatible versions of the timed text format
         specification (3GPP TS 26.245) that the sender is willing
         to receive (and which are the same that it would be
         willing to send).  The first value is the value
         preferred to receive (or preferred to send).  The first
         value MAY be followed by a comma-separated list of
         versions that SHOULD be used as alternatives.  The order
         is meaningful, the first being the most preferred and the
         last the least preferred.  Each entry has the format
         Zi(xi*256+yi), where "Zi" is the number of the Release,
         and "xi" and "yi" are taken from the 3GPP specification
         version, i.e. vZi.xi.yi.  For example, for 3GPP TS
         26.245 v6.0.0, Zi(xi*256+yi)=6(0), the version value is
         "60".  (Note that "60" is the concatenation of the
         values Zi=6 and (xi*256+yi)=0 and not their product.)

         If no "sver" value is available, for example, when
         streaming out of a 3GP file, the default value "60",
         corresponding to the 3GPP Release 6 version of 3GPP TS
         26.245, SHALL be used.

   Optional parameters:

      tx:
         This parameter indicates the horizontal translation
         offset in pixels of the text track with respect to the
         origin of the video track.  This value is the decimal
         representation of a 16-bit signed integer.  Refer to
         3GPP TS 26.245 for an illustration of this parameter.

      ty:
         This parameter indicates the vertical translation offset
         in pixels of the text track with respect to the origin
         of the video track.  This value is the decimal
         representation of a 16-bit signed integer.  Refer to
         3GPP TS 26.245 for an illustration of this parameter.

      layer:
         This parameter indicates the proximity of the text track
         to the viewer.  More negative values mean closer to the
         viewer.  This parameter has no units.  This value is the
         decimal representation of a 16-bit signed integer.

      tx3g:
         This parameter MUST be used for conveying sample
         descriptions out-of-band.  It contains a comma-separated
         list of base64-encoded entries.  The entries of this
         list MAY follow any particular order and the list
         SHALL NOT be empty.  Each entry is the result of running
         base64 encoding over the concatenation of the (static)
         SIDX value as 8-bit unsigned integer and the (static)
         sample description for that SIDX, in this order.  The
         format of a sample description entry can be found in
         3GPP TS 26.245 Release 6 and later releases.  All
         servers and clients MUST understand this parameter and
         MUST be capable of using the sample description(s)
         contained in it.  Please refer to RFC 3548 for details
         on the base64 encoding.

      width:
         This parameter indicates the width in pixels of the text
         track or area of the text being sent.  This value is the
         decimal representation of a 32-bit unsigned integer.
         Refer to 3GPP TS 26.245 for an illustration of this
         parameter.

      height:
         This parameter indicates the height in pixels of the
         text track being sent.  This value is the decimal
         representation of a 32-bit unsigned integer.  Refer to
         3GPP TS 26.245 for an illustration of this parameter.

      max-w:
         This parameter indicates display capabilities.  This is
         the maximum "width" value that the sender of this
         parameter supports.  This value is the decimal
         representation of a 32-bit unsigned integer.

      max-h:
         This parameter indicates display capabilities.  This is
         the maximum "height" value that the sender of this
         parameter supports.  This value is the decimal
         representation of a 32-bit unsigned integer.

   Encoding considerations:

      This media type is framed (see section 4.8 in [31]) and
      partially contains binary data.

   Restrictions on usage:

      This media type depends on RTP framing, and hence is only
      defined for transfer via RTP [3].  Transport within other framing
      protocols is not defined at this time.

   Security considerations:

      Please refer to Section 11 of RFCXXXX.

   Interoperability considerations:

      The 3GPP Timed Text media format and its file storage are
      specified in Release 6 of 3GPP TS 26.245 "Transparent end-to-end
      packet switched streaming service (PSS); Timed Text Format
      (Release 6)".  Note also that 3GPP may in future Releases
      specify extensions or updates to the timed text media format in
      a backwards-compatible way, e.g. new modifier boxes or
      extensions to the sample descriptions.  The payload format
      defined in RFCXXXX allows for such extensions.  For future 3GPP
      Releases of the Timed Text Format, the parameter "sver" is used
      to identify the exact specification used.

      The defined storage format for 3GPP Timed Text format is the
      3GPP File Format (3GP) [32].  3GP files may be transferred using
      the media type video/3gpp as registered by RFC 3839 [33].  The
      3GPP File Format is a container file that may contain, e.g.,
      audio and video which may be synchronized with the
      3GPP Timed Text.
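
      The value formats of the "sver" and "tx3g" parameters defined in
      this registration can be illustrated with a short Python sketch.
      It is illustrative only: the function names are hypothetical,
      and the sample description bytes used in any call are
      placeholders rather than real 3GPP TS 26.245 sample
      descriptions.

```python
import base64


def encode_sver(release, x, y):
    """Version value for 3GPP TS 26.245 v<release>.<x>.<y>.

    The release number and (x*256 + y) are concatenated as decimal
    strings, not multiplied.
    """
    return f"{release}{x * 256 + y}"


def encode_tx3g_entry(sidx, sample_description):
    """One "tx3g" list entry: base64 over SIDX (one octet) followed by
    the sample description for that SIDX."""
    if not 0 <= sidx <= 255:
        raise ValueError("SIDX must fit in an 8-bit unsigned integer")
    return base64.b64encode(bytes([sidx]) + sample_description).decode("ascii")


def encode_tx3g(entries):
    """Comma-separated "tx3g" value from (sidx, description) pairs."""
    if not entries:
        raise ValueError('the "tx3g" list must not be empty')
    return ",".join(encode_tx3g_entry(s, d) for s, d in entries)
```

      With these helpers, encode_sver(6, 0, 0) yields "60", the
      default value given above for Release 6.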

   Published specification: RFC XXXX

   Applications which use this media type:

      Multimedia streaming applications.

   Additional information:

      The 3GPP Timed Text media format is specified in 3GPP TS 26.245
      "Transparent end-to-end packet switched streaming service (PSS);
      Timed Text Format (Release 6)".  This document and future
      extensions to the 3GPP Timed Text format are publicly available
      at http://www.3gpp.org.

      Magic number(s): None.

      File extension(s): None.

      Macintosh File Type Code(s): None.

   Person & email address to contact for further information:

      Jose Rey, jose.rey@eu.panasonic.com
      Yoshinori Matsui, matsui.yoshinori@jp.panasonic.com
      Audio/Video Transport Working Group.

   Intended usage: COMMON

   Authors:
      Jose Rey
      Yoshinori Matsui

   Change controller:
      IETF Audio/Video Transport Working Group delegated from the
      IESG.

9. SDP usage

9.1. Mapping to SDP

   The information carried in the media type specification has a
   specific mapping to fields in SDP [4].  If SDP is used to specify
   sessions using this payload format, the mapping is done as follows:

   o The media type ("video") goes in the SDP "m=" as the media name.

        m=video <port number> RTP/<RTP profile> <dynamic payload type>

   o The media subtype ("3gpp-tt") and the timestamp clockrate "rate"
     (the RECOMMENDED 1000 Hz or other value) go in the SDP "a=rtpmap"
     line as the encoding name and rate, respectively:

        a=rtpmap:<payload type> 3gpp-tt/1000

   o The REQUIRED parameter "sver" goes in the SDP "a=fmtp" attribute
     by copying it directly from the media type string as a semicolon
     separated parameter=value pair.

   o The OPTIONAL parameters "tx", "ty", "layer", "tx3g", "width",
     "height", "max-w" and "max-h" go in the SDP "a=fmtp" attribute by
     copying them directly from the media type string as a semicolon
     separated list of parameter=value(s) pairs:

        a=fmtp:<dynamic payload type> <parameter
        name>=<value>[,<value>][; <parameter name>=<value>]

   o Any parameter unknown to the device that uses the SDP SHALL be
     ignored.  E.g. parameters added in later specifications of the
     media format MAY be copied into the SDP and SHALL be ignored
     by receivers that do not understand them.

9.2. Parameter Usage in the SDP Offer/Answer Model

   This section explains the meaning of the SDP parameters defined in
   this document within the Offer/Answer [13] context.

   In unicast, sender and receiver typically negotiate the streams, i.e.
   which codecs and parameter values are used in the session.  This is
   also possible in multicast, to a lesser extent.

   Additionally, the meaning of the parameters MAY vary depending on
   the direction in which they are used.  In the following sections, a
   "<directionality> offer" means an offer that contains a stream set
   to <directionality>.  <directionality> may take the values sendrecv,
   sendonly and recvonly.  Similar considerations apply for answers.
   E.g. an answer to a sendonly offer is a recvonly answer.

9.2.1. Unicast Usage

   The following types of parameters are used in this payload format:

   1. Declarative parameters: offerer and answerer declare the values
      they will use for the incoming (sendrecv/recvonly) or outgoing
      (sendonly) stream.  Offerer and answerer MAY use different
      values.

      a. "tx", "ty" and "layer": these are parameters describing
         where the received text track is placed.  Depending on the
         directionality, they:

         i.  MUST appear in all sendrecv offers and answers and in
             all recvonly offers and answers (thus applying to the
             incoming stream).  In the case of sendrecv offers and
             answers and in recvonly offers, these values SHOULD be
             used by the sender of the stream unless it has a
             particular preference, in which case, it MUST make
             sure that these different values do not corrupt the
             presentation.  For recvonly answers, the answerer MAY
             accept the proposed values for the incoming stream (in
             a sendonly offer, see bullet below) or respond with
             different ones.  The offerer MUST use the returned
             values.

         ii. MAY appear in sendonly offers and MUST appear in
             sendonly answers.  In sendonly offers they specify the
             values that the offerer proposes for sending (see
             example in Section 9.3).  In sendonly answers these
             values SHOULD be copied from the corresponding
             recvonly offer upon accepting the stream, unless a
             particular preference by the receiver of the stream
             exists, as explained in the previous bullet.

   2. Parameters describing the display capabilities, "max-h" and
      "max-w", which indicate the maximum dimensions of the text track
      (text display area) for the incoming stream "tx" and "ty" values
      (see Figure 17).  "max-h" and "max-w" MUST be included in all
      offers and answers where "tx" and "ty" refer to the incoming
      stream, thus excluding sendonly offers and answers (see example
      in Section 9.3), where they SHALL NOT be present.

   3. Parameters describing the sent stream properties, i.e. the
      sender of the stream decides upon the values of these:

      a. "width" and "height" specify the text track dimensions.
         They SHALL always be present in sendrecv and sendonly
         offers and answers.  For recvonly answers, the answerer
         MUST include the offered parameter values (if any) verbatim
         in the answer upon accepting the stream.

      b. "tx3g" contains static sample descriptions.  It MAY only be
         present in sendrecv and sendonly offers and answers.  This
         parameter applies to the stream that offerers or answerers
         send.

   4. Negotiable parameters, which MUST be agreed on.  This is the
      case of "sver".  This parameter MUST be present in every offer
      and answer.  The answerer SHALL choose one supported value from
      the offerer's list or else it MUST remove the stream or reject
      the session.

   5. Symmetric parameters: "rate", the timestamp clockrate, belongs
      to this class.  Symmetric parameters MUST be echoed verbatim in
      the answer.  Otherwise the stream MUST be removed or the session
      rejected.

   The following Table 1 summarises all options:

   +..---------------------------+----------+----------+----------+
   |  ``--..__   Directionality/ | sendrecv | recvonly | sendonly |
   +  Type of    ``--..__  O or A +----------+----------+----------+
   |  Parameter         ``--..__ |   O/A    |   O/A    |   O/A    |
   +--------------+------------``+----------+----------+----------+
   | Declarative  |tx, ty, layer |   M/M    |   M/M    |   m/M    |
   |              |              |          |          |          |
   +--------------+--------------+----------+----------+----------+
   | Display      |max-h, max-w  |   M/M    |   M/M    |   -/-    |
   | Capabilities |              |          |          |          |
   +--------------+--------------+----------+----------+----------+
   | Stream       |height, width |   M/M    |  -/(M)   |   M/M    |
   | properties   |tx3g          |   m/m    |   -/-    |   m/m    |
   |              |              |          |          |          |
   +--------------+--------------+----------+----------+----------+
   | Negotiable   |sver          |   M/M    |   M/M    |   M/M    |
   |              |              |          |          |          |
   +--------------+--------------+----------+----------+----------+
   | Symmetric    |rate          |   M/M    |   M/M    |   M/M    |
   +--------------+--------------+----------+----------+----------+

   Table 1. Parameter usage in Unicast Offer / Answer.

   Key:
      o M means MUST be present
      o m means MAY be present (such as proposed values)
      o (M) or (m) means MUST or MAY, if applicable
      o a hyphen ("-") means the parameter MUST NOT be present.
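
   Table 1 lends itself to a mechanical presence check.  The following
   Python sketch is one possible way to encode it and is not part of
   this specification: the names GROUPS and violations() are
   hypothetical, each column is taken to be the directionality of the
   offer or answer being checked, and the conditional "(M)" entries
   are deliberately left unchecked because whether they apply depends
   on the corresponding offer.

```python
# Table 1, one cell per directionality, written as "offer/answer".
GROUPS = [
    (("tx", "ty", "layer"),
     {"sendrecv": "M/M", "recvonly": "M/M", "sendonly": "m/M"}),
    (("max-h", "max-w"),
     {"sendrecv": "M/M", "recvonly": "M/M", "sendonly": "-/-"}),
    (("height", "width"),
     {"sendrecv": "M/M", "recvonly": "-/(M)", "sendonly": "M/M"}),
    (("tx3g",),
     {"sendrecv": "m/m", "recvonly": "-/-", "sendonly": "m/m"}),
    (("sver",),
     {"sendrecv": "M/M", "recvonly": "M/M", "sendonly": "M/M"}),
    (("rate",),
     {"sendrecv": "M/M", "recvonly": "M/M", "sendonly": "M/M"}),
]


def violations(direction, role, present):
    """List Table 1 violations for one offer or answer.

    direction: "sendrecv", "recvonly" or "sendonly"
    role:      "offer" or "answer"
    present:   set of parameter names found in the description
               ("rate" is carried in a=rtpmap, the rest in a=fmtp)
    """
    column = 0 if role == "offer" else 1
    problems = []
    for names, cells in GROUPS:
        rule = cells[direction].split("/")[column]
        for name in names:
            if rule == "M" and name not in present:
                problems.append(f"{name} MUST be present")
            elif rule == "-" and name in present:
                problems.append(f"{name} MUST NOT be present")
            # "m" and "(M)" entries are optional or conditional.
    return problems
```

   For instance, a recvonly offer carrying "tx", "ty", "layer",
   "max-h", "max-w", "sver" and "rate" produces an empty list, whereas
   a sendonly offer that also carries "max-h" is flagged.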

   Other observations regarding parameter usage:

   o Translation and transparency values: in sendonly offers "tx",
     "ty" and "layer" indicate proposed values.  This is useful for
     visually composed sessions where the different streams occupy
     different parts of the display, e.g., a video stream and the
     captions.  These are just suggested values because it is the
     peer rendering the text that ultimately decides where to place
     the text track.

   o Text track (area) dimensions, "height" and "width": in the case
     of sendonly offers, an answerer accepting the offer MUST be
     prepared to render the stream using these values.  If any of
     these conditions are not met, the stream MUST be removed or the
     session rejected.

   o Display capabilities, "max-h" and "max-w": an answerer sending a
     stream SHALL ensure that the "height" and "width" values in the
     answer are compatible with the offerer's signalled capabilities.

   o Version handling via "sver": the idea is that offerer and
     answerer communicate using the same version.  This is achieved
     by letting the answerer choose from a list of supported
     versions, "sver".  For recvonly streams, the first value in the
     list is the preferred version to receive.  Consequently, for
     sendonly (and sendrecv) streams the first value is the one
     preferred for sending (and receiving).  The answerer MUST choose
     one value and return it in the answer.  Upon receiving the
     answer, the offerer SHALL be prepared to send (sendonly and
     sendrecv) and receive (recvonly and sendrecv) a stream using
     that version.  If none of the versions in the list is supported,
     the stream MUST be removed or the session rejected.  Note that,
     if alternative non-compatible versions are offered, then this
     SHALL be done using different payload types.

9.2.2. Multicast Usage

   In multicast the parameter usage is similar to the unicast case,
   except in the following cases:

   o the parameters "tx", "ty" and "layer" in multicast offers only
     have meaning for sendrecv and recvonly streams.  In order for all
     clients to have the same vision of the session, they MUST be used
     symmetrically.

   o for "height", "width" and "tx3g" (for sendrecv and sendonly),
     multicast offers specify which values of these parameters the
     participants MUST use for sending.  Thus, if the stream is
     accepted, the answerer MUST also here include them verbatim in the
     answer (also "tx3g", if present).

   o The capability parameters, "max-h" and "max-w", SHALL NOT be used
     in multicast.  If the offered text track should change in size, a
     new offer SHALL be used instead.

   o Regarding version handling:

     In the case of multicast offers, an answerer MAY accept a
     multicast offer as long as one of the versions listed in the
     "sver" is supported.  Therefore, if the stream is accepted, the
     answerer MUST choose its preferred version but, unlike in unicast,
     the offerer SHALL NOT change the offered stream to this chosen
     version because there may be other session participants that do
     support the newer extensions.  Consequently, different session
     participants may end up using different backwards-compatible media
     format versions.  It is RECOMMENDED that the multicast offer
     contains a limited number of versions, in order for all
     participants to have the same view of the session.  This is a
     responsibility of the session creator.  If none of the offered
     versions is supported, the stream SHALL be removed or the session
     rejected.  Also in this case, if alternative non-compatible
     versions are offered, then this SHALL be done using different
     payload types.

9.3. Offer/Answer Examples

   In these unicast O/A examples the long lines are wrapped around.
   Static sample descriptions are shortened for clarity.

   For sendrecv:

   O -> A

      m=video <port> RTP/AVP 98
      a=rtpmap:98 3gpp-tt/1000
      a=fmtp:98 tx=100; ty=100; layer=0; height=80; width=100; max-h=120;
      max-w=160; sver=6256,60; tx3g=81...
      a=sendrecv

   A -> O

      m=video <port> RTP/AVP 98..
      a=rtpmap:98 3gpp-tt/1000
      a=fmtp:98 tx=100; ty=95; layer=0; height=90; width=100; max-h=100;
      max-w=160; sver=60; tx3g=82...
      a=sendrecv

   In this example the offerer is telling the answerer where it will
   place the received stream and what the maximum height and width
   allowable for the stream that it will receive are.  Also, it tells
   the answerer the dimensions of the text track for the stream sent
   and which sample description it shall use.  It offers two versions,
   6256 and 60.  The answerer responds with an equivalent set of
   parameters for the stream it receives.  In this case the answerer's
   "max-h" and "max-w" are compatible with the offerer's "height" and
   "width".  Otherwise, the answerer would have to remove this stream
   and the offerer would have to issue a new offer taking the answerer's
   capabilities into account.  This is possible only if multiple payload
   types are present in the initial offer so that at least one of them
   matches the answerer's capabilities as expressed by "max-h" and
   "max-w" in the negative answer.  Note also that the answerer's text
   box dimensions fit within the maximum values signalled in the offer.
   Finally, the answerer chooses to use version 60 of the timed text
   format.

   For recvonly:

   O -> A

      m=video <port> RTP/AVP 98
      a=rtpmap:98 3gpp-tt/1000
      a=fmtp:98 tx=100; ty=100; layer=0; max-h=120; max-w=160; sver=6256,60
      a=recvonly

   A -> O

      m=video <port> RTP/AVP 98..
      a=rtpmap:98 3gpp-tt/1000
      a=fmtp:98 tx=100; ty=100; layer=0; height=90; width=100; sver=60;
      tx3g=82...
      a=sendonly

   In this case, the offer is different from the previous case: it does
   not include the stream properties "height", "width" and "tx3g".  The
   answerer copies the "tx", "ty" and "layer" values, thus acknowledging
   these.  "max-h" and "max-w" are not present in the answer because the
   "tx" and "ty" (and "layer") in this special case do not apply to the
   received, but to the sent stream.  Also, if offerer and answerer had
   very different display sizes, it would not be possible to express
   the answerer's capabilities.  In the example above and for an
   answerer with a 50x50 display, the translation values are already out
   of range.

   For sendonly:

   O -> A

      m=video <port> RTP/AVP 98
      a=rtpmap:98 3gpp-tt/1000
      a=fmtp:98 tx=100; ty=100; layer=0; height=80; width=100;
      sver=6256,60; tx3g=81...
      a=sendonly

   A -> O

      m=video <port> RTP/AVP 98..
      a=rtpmap:98 3gpp-tt/1000
      a=fmtp:98 tx=100; ty=100; layer=0; height=80; width=100; max-h=100;
      max-w=160; sver=60
      a=recvonly

   Note that "max-h" and "max-w" are not present in the offer.  Also,
   with this answer, the answerer would accept the offer as is (thus
   echoing "tx", "ty", "height", "width" and "layer") and additionally
   inform the offerer about its capabilities: "max-h" and "max-w".

   Another possible answer for this case would be:

   A -> O

      m=video <port> RTP/AVP 98..
      a=rtpmap:98 3gpp-tt/1000
      a=fmtp:98 tx=120; ty=105; layer=0; max-h=95; max-w=150; sver=60
      a=recvonly

   In this case the answerer does not accept the values offered.  The
   offerer MUST use these values or else remove the stream.

9.4. Parameter Usage outside of Offer/Answer

   SDP may also be employed outside of the Offer/Answer context, for
   instance for multimedia sessions that are announced through the
   Session Announcement Protocol (SAP) [14], or streamed through the
   Real Time Streaming Protocol (RTSP) [15].

   In this case, the receiver of a session description is required to
   support the parameters and given values for the streams or else it
   MUST reject the session.  It is the responsibility of the sender (or
   creator) of the session descriptions to define the session parameters
   so that the probability of unsuccessful session setup is minimized.
   This is out of the scope of this document.

10. IANA Considerations

   IANA is requested to register the media subtype name "3gpp-tt" for
   the media type "video" as specified in Section 8 of this document.

11. Security considerations

   RTP packets using the payload format defined in this specification
   are subject to the security considerations discussed in the RTP
   specification [3] and any applicable RTP profile, e.g. AVP [17].

   In particular, an attacker may invalidate the current set of active
   sample descriptions at the client by means of repeating a packet with
   an old sample description, i.e. a replay attack.  This would mean
   that the display of the text would be corrupted, if displayed at all.
   Another form of attack may consist of sending redundant fragments
   whose boundaries do not match the exact boundaries of the originals
   (as indicated by LEN) or fragments that carry different sample
   lengths (SLEN).  This may cause a decoder to crash.

   These types of attack may easily be avoided by using source
   authentication and integrity protection.

   Additionally, peers in a timed text session may desire to retain
   privacy in their communication, i.e. confidentiality.

   This payload format does not provide any mechanisms for achieving
   these.  Confidentiality, integrity protection and authentication have
   to be solved by a mechanism external to this payload format, e.g.,
   SRTP [10].

12. References

12.1. Normative References

   [1] Transparent end-to-end packet switched streaming service (PSS);
       Timed Text Format (Release 6), TS 26.245 v 6.0.0, June 2004.

   [2] ISO/IEC 14496-12:2004 Information technology - Coding of
       audio-visual objects - Part 12: ISO base media file format.

   [3] H. Schulzrinne, S. Casner, R. Frederick and V. Jacobson, "RTP: A
       Transport Protocol for Real-Time Applications", STD 64, RFC 3550,
       July 2003.

   [4] M. Handley, V. Jacobson, "SDP: Session Description Protocol",
       RFC 2327, April 1998.

   [5] S. Bradner, "Key words for use in RFCs to indicate requirement
       levels", BCP 14, RFC 2119, IETF, March 1997.

   [6] S. Josefsson (Ed.), "The Base16, Base32, and Base64 Data
       Encodings", RFC 3548, July 2003.

12.2. Informative References

   [7] J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for Generic
       Forward Error Correction", RFC 2733, December 1999.

   [8] C. Perkins, O. Hodson, "Options for Repair of Streaming Media",
       RFC 2354, June 1998.

   [9] W3C, "Synchronised Multimedia Integration Language (SMIL 2.0)",
       August, 2001.

   [10] M. Baugher, D. A. McGrew, D. Oran, R. Blom, E. Carrara, M.
        Naslund, K. Norrman, "The Secure Real-Time Transport Protocol",
        RFC 3711, March 2004.

   [11] J. Rey et al., "RTP Retransmission Payload Format",
        draft-ietf-avt-rtp-retransmission-11.txt, work in progress,
        March 2005.

   [12] Van der Meer et al., "RTP Payload Format for Transport of MPEG-4
        Elementary Streams", RFC 3640, November 2003.

   [13] J. Rosenberg, H. Schulzrinne, "An Offer/Answer Model with the
        Session Description Protocol (SDP)", RFC 3264, June 2002.

   [14] M. Handley, et al., "Session Announcement Protocol", RFC 2974,
        October 2000.

   [15] H. Schulzrinne, et al., "Real Time Streaming Protocol (RTSP)",
        RFC 2326, April 1998.

   [16] Transparent end-to-end packet switched streaming service (PSS);
        Protocols and codecs (Release 6), TS 26.234 v 6.1.0, September
        2004.

   [17] H. Schulzrinne, S. Casner, "RTP Profile for Audio and Video
        Conferences with Minimal Control", STD 65, RFC 3551, July 2003.

   [18] F. Yergeau, "UTF-8, a transformation format of Unicode and ISO
        10646", RFC 2044, October 1996.

   [19] P. Hoffman, F. Yergeau, "UTF-16, an encoding of ISO 10646", RFC
        2781, February 2000.

   [20] Friedman, et al., "RTP Control Protocol Extended Reports (RTCP
        XR)", RFC 3611, November 2003.

   [21] Ott, et al., "Extended RTP Profile for RTCP-based Feedback
        (RTP/AVPF)", draft-ietf-avt-rtcp-feedback-11.txt, work in
        progress, August 2004.

   [22] IETF RFC 3267: "Real-Time Transport Protocol (RTP) Payload
        Format and File Storage Format for the Adaptive Multi-Rate (AMR)
        Adaptive Multi-Rate Wideband (AMR-WB) Audio Codecs", Sjoberg J.
        et al., June 2002.

   [23] IETF RFC 3016: "RTP Payload Format for MPEG-4 Audio/Visual
        Streams", Kikuchi Y. et al., November 2000.

   [24] G. Hellstrom, "RTP Payload for Text Conversation", RFC 2793, May
        2000.

   [25] G. Hellstrom, P. Jones, "RTP Payload for Text Conversation",
        draft-ietf-avt-rfc2793bis-09.txt, Work In Progress, August 2004.

   [26] ITU-T Recommendation T.140 (1998) - Text conversation protocol
        for multimedia application, with amendment 1, (2000).

   [27] ISO/IEC 10646-1: (1993), Universal Multiple Octet Coded
        Character Set.

   [28] ISO/IEC FCD 14496-17 Information technology - Coding of
        audio-visual objects - Part 17: Streaming text format, Work in
        progress, June 2004.
2789 [29] Transparent end-to-end Packet-switched Streaming Service (PSS);
2790 3GPP SMIL language profile (Release 6), TS 26.246 v 6.0.0, June
2791 2004.

2793 [30] Casner, S. and P. Hoschka, "MIME Type Registration of RTP
2794 Payload Formats", RFC 3555, July 2003.

2796 [31] Freed, N. and J. Klensin, "Media Type Specifications and
2797 Registration Procedures", draft-freed-media-type-reg-04, April
2798 2005.

2800 [32] Transparent end-to-end packet switched streaming service (PSS);
2801 3GPP file format (3GP) (Release 6), TS 26.244 V6.3. March 2005.

2803 [33] Castagno, R. and D. Singer, "MIME Type Registrations for 3rd
2804 Generation Partnership Project (3GPP) Multimedia files", RFC 3839,
2805 July 2004.

2807 13. Annexes

2809 13.1. Basics of the 3GP File Structure

2811 This section provides a coarse overview of the 3GP file structure,
2812 which follows the ISO Base Media File Format [2].

2814 Each 3GP file consists of "Boxes". In general, a 3GP file contains
2815 the File Type Box (ftyp), the Movie Box (moov), and the Media Data
2816 Box (mdat). The File Type Box identifies the type and properties of
2817 the 3GP file itself. The Movie Box and the Media Data Box, serving
2818 as containers, include their own boxes for each medium. Boxes start
2819 with a header, which indicates both size and type (these fields are
2820 called "size" and "type", respectively). Additionally, each box may
2821 itself contain a number of boxes.

2823 In the following, only those boxes are mentioned that are useful
2824 for the purposes of this payload format.

2826 The Movie Box (moov) contains one or more Track Boxes (trak), which
2827 include information about each track. A Track Box contains, among
2828 others, the Track Header Box (tkhd), the Media Header Box (mdhd) and
2829 the Media Information Box (minf).

2831 The Track Header Box specifies the characteristics of a single track,
2832 where a track is, in this case, the streamed text during a session.
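As a rough illustration of the box layout described above, the following Python sketch walks the boxes of a byte buffer by reading each header's "size" and "type" fields. It assumes the header conventions of the ISO base media file format [2]: a 32-bit big-endian size that counts the header itself, followed by a four-character type, with size 1 signalling a 64-bit size and size 0 a box that runs to the end of its container. The helper names (iter_boxes, list_box_paths, make_box), the set of container types and the synthetic sample buffer are illustrative assumptions and are not defined by this payload format.

```python
import struct

# Box types that this sketch treats as pure containers of other boxes.
# Assumption taken from the ISO base media file format, where "mdia"
# sits between "trak" and "minf"; not a list defined by this document.
CONTAINER_TYPES = {b"moov", b"trak", b"mdia", b"minf", b"stbl"}


def iter_boxes(data, start=0, end=None):
    """Yield (type, payload_start, payload_end) for each box in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        # Header: 32-bit big-endian "size", then four-character "type".
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:
            # A 64-bit size follows the type field.
            if pos + 16 > end:
                raise ValueError("truncated 64-bit box header")
            size = struct.unpack_from(">Q", data, pos + 8)[0]
            header = 16
        elif size == 0:
            # The box extends to the end of the enclosing container.
            size = end - pos
        if size < header or pos + size > end:
            raise ValueError("box %r has an invalid size" % box_type)
        yield box_type, pos + header, pos + size
        pos += size


def list_box_paths(data, start=0, end=None, prefix=""):
    """Return slash-separated paths of all boxes, descending into containers."""
    paths = []
    for box_type, p_start, p_end in iter_boxes(data, start, end):
        path = prefix + "/" + box_type.decode("latin-1")
        paths.append(path)
        if box_type in CONTAINER_TYPES:
            paths.extend(list_box_paths(data, p_start, p_end, path))
    return paths


def make_box(box_type, payload=b""):
    """Build a box with a plain 32-bit size header (hypothetical test helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic buffer with dummy payloads; it is not a valid 3GP file.
sample = (
    make_box(b"ftyp", b"3gp6" + bytes(4))
    + make_box(b"moov", make_box(b"trak", make_box(b"tkhd", bytes(4))))
    + make_box(b"mdat", b"payload")
)

print(list_box_paths(sample))
# prints ['/ftyp', '/moov', '/moov/trak', '/moov/trak/tkhd', '/mdat']
```

Treating only a fixed set of types as containers keeps the sketch short; a real reader would also need to handle boxes that carry version and flags fields before their contents.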
2833 Exactly one Track Header Box is present for a track. It contains
2834 information about the track, such as the spatial layout (width and
2835 height), the video transformation matrix and the layer number. Since
2836 these pieces of information are essential and static, i.e., constant
2837 for the duration of the session, they must be sent prior to the
2838 transmission of any text samples.

2840 The Media Header Box contains the "timescale", the number of time
2841 units that pass in one second, i.e., cycles per second or Hertz. The
2842 Media Information Box includes the Sample Table Box (stbl), which
2843 contains all the time and data indexing of the media samples in a
2844 track. Using this box, it is possible to locate samples in time and
2845 to determine their type, size, container, and offset into that
2846 container. Inside the Sample Table Box are the Sample Description Box
2847 (stsd, for finding sample descriptions), the Decoding Time to Sample
2848 Box (stts, for finding sample duration), the Sample Size Box (stsz)
2849 and the Sample to Chunk Box (stsc, for finding the sample description
2850 index).

2852 Finally, the Media Data Box contains the media data itself. In timed
2853 text tracks this box contains text samples; the equivalents for audio
2854 and video tracks are audio and video frames, respectively. The text
2855 sample consists of the text length, the text string, and one or
2856 several Modifier Boxes. The text length is the size of the text in
2857 bytes. The text string is the plain text to render. A Modifier Box
2858 carries additional rendering information for the text, such as
2859 colour, font, etc.

2861 14. Acknowledgements

2863 The authors would like to thank Dave Singer, Jan van der Meer, Magnus
2864 Westerlund and Colin Perkins for their comments and suggestions on
2865 this document.
2867 The authors would also like to thank Markus Gebhard for the free and
2868 publicly available JavE ASCII Editor (used for the ASCII drawings in
2869 this document) and Henrik Levkowetz for the Idnits web service.

2871 15. Authors' Addresses

2873 Jose Rey jose.rey@eu.panasonic.com
2874 Panasonic R&D Center Germany GmbH
2875 Monzastr. 4c
2876 D-63225 Langen, Germany
2877 Phone: +49-6103-766-134
2878 Fax: +49-6103-766-166

2880 Yoshinori Matsui matsui.yoshinori@jp.panasonic.com
2881 Matsushita Electric Industrial Co., LTD.
2882 1006 Kadoma
2883 Kadoma-shi, Osaka, Japan
2884 Phone: +81 6 6900 9689
2885 Fax: +81 6 6900 9699

2887 16. IPR Notices

2889 The IETF takes no position regarding the validity or scope of any
2890 Intellectual Property Rights or other rights that might be claimed to
2891 pertain to the implementation or use of the technology described in
2892 this document or the extent to which any license under such rights
2893 might or might not be available; nor does it represent that it has
2894 made any independent effort to identify any such rights. Information
2895 on the procedures with respect to rights in RFC documents can be
2896 found in BCP 78 and BCP 79.

2898 Copies of IPR disclosures made to the IETF Secretariat and any
2899 assurances of licenses to be made available, or the result of an
2900 attempt made to obtain a general license or permission for the use of
2901 such proprietary rights by implementers or users of this
2902 specification can be obtained from the IETF on-line IPR repository at
2903 http://www.ietf.org/ipr.

2905 The IETF invites any interested party to bring to its attention any
2906 copyrights, patents or patent applications, or other proprietary
2907 rights that may cover technology that may be required to implement
2908 this standard. Please address the information to the IETF at
2909 ietf-ipr@ietf.org.

2911 17. Full Copyright Statement

2913 Copyright (C) The Internet Society (2005). This document is subject
2914 to the rights, licenses and restrictions contained in BCP 78, and
2915 except as set forth therein, the authors retain all their rights.

2917 This document and the information contained herein are provided on an
2918 "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
2919 OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
2920 ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
2921 INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
2922 INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
2923 WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.