idnits 2.17.1 draft-ietf-avt-muxissues-00.txt: Checking boilerplate required by RFC 5378 and the IETF Trust (see https://trustee.ietf.org/license-info): ---------------------------------------------------------------------------- ** Cannot find the required boilerplate sections (Copyright, IPR, etc.) in this document. Expected boilerplate is as follows today (2024-04-20) according to https://trustee.ietf.org/license-info : IETF Trust Legal Provisions of 28-dec-2009, Section 6.a: This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 2: Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved. IETF Trust Legal Provisions of 28-dec-2009, Section 6.b(i), paragraph 3: This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License. Checking nits according to https://www.ietf.org/id-info/1id-guidelines.txt: ---------------------------------------------------------------------------- ** Missing expiration date. The document expiration date should appear on the first and last page. ** The document seems to lack a 1id_guidelines paragraph about Internet-Drafts being working documents. ** The document seems to lack a 1id_guidelines paragraph about 6 months document validity -- however, there's a paragraph with a matching beginning. Boilerplate error? ** The document seems to lack a 1id_guidelines paragraph about the list of current Internet-Drafts. 
** The document seems to lack a 1id_guidelines paragraph about the list of Shadow Directories. ** The document is more than 15 pages and seems to lack a Table of Contents. == No 'Intended status' indicated for this document; assuming Proposed Standard == The page length should not exceed 58 lines per page, but there was 26 longer pages, the longest (page 12) being 63 lines == It seems as if not all pages are separated by form feeds - found 0 form feeds but 27 pages Checking nits according to https://www.ietf.org/id-info/checklist : ---------------------------------------------------------------------------- ** The document seems to lack a Security Considerations section. ** The document seems to lack an IANA Considerations section. (See Section 2.2 of https://www.ietf.org/id-info/checklist for how to handle the case when there are no actions for IANA.) ** There are 13 instances of too long lines in the document, the longest one being 11 characters in excess of 72. == There are 1 instance of lines with non-RFC6890-compliant IPv4 addresses in the document. If these are example addresses, they should be changed. Miscellaneous warnings: ---------------------------------------------------------------------------- == Line 853 has weird spacing: '...ibuting sourc...' == Line 855 has weird spacing: '...ibuting sourc...' == Line 859 has weird spacing: '...ibuting sourc...' -- The document seems to lack a disclaimer for pre-RFC5378 work, but may have content which was first submitted before 10 November 2008. If you have contacted all the original authors and they are all willing to grant the BCP78 rights to the IETF Trust, then this is fine, and you can ignore this comment. If not, you may need to add the pre-RFC5378 disclaimer. (See the Legal Provisions document at https://trustee.ietf.org/license-info for more information.) -- The document date (October 1, 1998) is 9333 days in the past. Is this intentional? 
Checking references for intended status: Proposed Standard ---------------------------------------------------------------------------- (See RFCs 3967 and 4897 for information about using normative references to lower-maturity documents in RFCs) -- Missing reference section? '1' on line 1181 looks like a reference -- Missing reference section? '2' on line 1185 looks like a reference -- Missing reference section? '3' on line 1189 looks like a reference -- Missing reference section? '4' on line 1193 looks like a reference -- Missing reference section? '5' on line 1197 looks like a reference -- Missing reference section? '6' on line 1201 looks like a reference Summary: 10 errors (**), 0 flaws (~~), 7 warnings (==), 8 comments (--). Run idnits with the --verbose option for more detailed information about the items above. -------------------------------------------------------------------------------- 1 Internet Engineering Task Force AVT Working Group 2 Internet Draft J.Rosenberg, H.Schulzrinne 3 draft-ietf-avt-muxissues-00.txt Bell Labs/Columbia U. 4 October 1, 1998 5 Expires: March 1999 7 Issues and Options for RTP Multiplexing 9 STATUS OF THIS MEMO 11 This document is an Internet-Draft. Internet-Drafts are working docu- 12 ments of the Internet Engineering Task Force (IETF), its areas, and 13 its working groups. Note that other groups may also distribute work- 14 ing documents as Internet-Drafts. 16 Internet-Drafts are draft documents valid for a maximum of six months 17 and may be updated, replaced, or obsoleted by other documents at any 18 time. It is inappropriate to use Internet-Drafts as reference mate- 19 rial or to cite them other than as ``work in progress''. 
21 To learn the current status of any Internet-Draft, please check the 22 ``1id-abstracts.txt'' listing contained in the Internet-Drafts Shadow 23 Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), 24 munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or 25 ftp.isi.edu (US West Coast). 27 Distribution of this document is unlimited. 29 ABSTRACT 31 This memorandum discusses the issues and options involved 32 in the design of a new transport protocol for multiplexed 33 voice within a single packet. The intended application is 34 the interconnection of devices which provide trunking or 35 long distance telephone service over the Internet. Such 36 devices have many voice connections simultaneously between 37 them. Multiplexing them into the same connection improves 38 efficiency, enables the use of low bitrate voice 39 codecs, and improves scalability. Options and issues con- 40 cerning timestamping, payload type identification, length 41 indication, and channel identification are discussed. Sev- 42 eral possible header formats are identified, and their 43 efficiencies are compared. 45 1 Introduction 47 Internet telephone gateways (ITGs) allow a public switched telephone 48 network (PSTN) user to contact another PSTN user, with the long distance 49 portion of the call routed over the Internet. Such a scenario is 50 depicted in Figure 1. 52 ~~~~~~~~ ------- ~~~~~~~~~~ ------- ~~~~~~~~ 53 A --| | | | | | | | | |-- C 54 | PSTN |--| ITG |---| IP NET |---| ITG |--| PSTN | 55 B --| X | | J | | | | K | | Y |-- D 56 ~~~~~~~~ ------- ~~~~~~~~~~ ------- ~~~~~~~~ 58 Figure 1: Internet telephony gateway architecture 60 Subscribers A and B connect to ITG J via their local telephone net- 61 work, X. A wishes to speak with user C, and B wishes to speak with 62 user D, both of whom are connected to local phone network Y. To 63 complete the calls, ITG J packetizes and transports the voice to and 64 from A and B through the IP network, to remote gateway K. 
There, ITG 65 K completes the calls to C and D through PSTN Y. This type of 66 arrangement, with its common destination, may be particularly common for 67 connecting the PBXs of corporate branch offices across the Internet. 69 In this scenario, ITGs J and K act as Internet hosts, which are 70 effectively proxies for the telephone users connected to them. Unlike 71 typical Internet telephony, however, there will often be multiple 72 active calls between a pair of gateways, each representing a differ- 73 ent pair of users. Gateways can signal calls using SIP [1], H.323 or 74 proprietary signalling protocols. Media data is transported via a 75 separate RTP [2] session for each user. 77 We observe that using a separate RTP session for each user connected 78 between a pair of gateways is wasteful. Rather, it would be more 79 efficient to multiplex users between a pair of gateways into a single 80 RTP session. A number of proposals have been made for RTP extensions 81 to accomplish this multiplexing [3][4][5]. 83 This memo discusses some of the issues and options for multiplexing 84 users within RTP between a pair of gateways. There are other applica- 85 tions for RTP multiplexing, such as transport of RTP in a switched 86 RTP network, depicted in Figure 2. In this scenario, an entity which 87 we call an RTP Switch receives some number of RTP muxed connections. 88 It extracts the multiplexed payloads from each of the received multi- 89 plex streams, switches the payloads, and generates a new set of RTP 90 Multiplexed streams. These streams may be destined for other RTP 91 Switches, or for telephony gateways. 93 The switched scenario allows better network utilization. By allowing 94 RTP multiplexing only between pairs of gateways, there is an effec- 95 tive full mesh RTP network, with the number of multiplexed users 96 between a pair of gateways potentially growing small with a large num- 97 ber of gateways. An RTP Switched network would allow for greater mul- 98 tiplexing. 
However, it comes at the significant cost of management, 99 dynamic routing, and central point of failure requirements. 101 These scenarios have differing requirements. In this document, we 102 focus on the gateway to gateway case in Figure 1. 104 ------- -------- 105 | | RTPMux --------- RTPMux | | 106 | ITG |--------| |--------| ITG 3 | 107 | 1 | | | -------- 108 ------- | | RTPMux -------- 109 ------- | RTP |--------| | 110 | | RTPMux | Switch | | ITG 4 | 111 | ITG |--------| | -------- 112 | 2 | | | RTPMux -------- 113 ------- | |--------| | 114 | | | ITG 5 | 115 --------- -------- 117 Figure 2: Internet telephony gateway architecture 119 2 Terminology 121 oUser: One of the individuals who has data within the RTP packet. 123 oConnection: The point to point RTP session between two ITGs. 125 oChannel: A virtual connection which is established by allowing a 126 user to send data within a packet. There are many channels per 127 connection - this represents the multiplexing. 129 oChannel Identifier: A number which identifies a channel. 131 oBlock: The section of the payload of a packet which contains data 132 for a particular user. 134 3 Requirements 136 The transport protocol must provide, at a minimum, the following 137 functionality: 139 1. Delineation: Data from different users must be clearly delin- 140 eated. 142 2. Identification. The channel to which the data belongs must be 143 identified. 145 3. Variable lengths: The protocol should support variable length 146 blocks from a particular user. This allows for variable rate 147 codecs and adjustment of packetization delays. 149 4. Low overhead: Since the protocol is designed for low rate 150 voice, it should have low overhead. This issue is extremely 151 important. New coders are emerging which can support near toll 152 quality at 5.3 kbps, and acceptable quality at rates even as 153 low as 4 kbps. It is desirable to support such codecs, as they 154 can reduce the cost of providing an ITG service. 
Furthermore, 155 advances in coding technology indicate that it is desirable to 156 send very low bitrate information (1 kbps or less) during 157 silence periods, so that background noise can be reproduced 158 well (as opposed to sending nothing). Support of such rates 159 requires a protocol with low overhead. 161 5. Marker: A general purpose marker bit should be available for 162 all users within the connection. 164 6. Payload Identification. The codec in use for each user should 165 be indicated somehow. It is a requirement to allow for the 166 coding type to change during the lifetime of a channel. 168 4 Issues 170 This section identifies a number of issues which have an 171 impact on the design of the protocol. It also identifies a variety of 172 options for providing the specific services of the protocol. 174 4.1 Payload type identification 176 There are a number of ways to identify the coding of the payload. 177 They include in-band static types, in-band dynamic types, and out-of- 178 band signaling. The in-band approaches are based on some kind of payload type 179 identifier, the semantics of which are either known a priori (static), 180 or signaled ahead of time (dynamic). The out-of-band techniques sig- 181 nal a binding between the channel identifier and a coder at the 182 beginning of (or even during) the lifetime of the connection. 184 With out-of-band signaling, synchronizing the signaling with the 185 media stream is a major issue. The synchronization can be accom- 186 plished with either timestamps or sequence numbers. 188 One approach to performing the synchronization is as follows: The 189 source sends a message reliably to the receiver, indicating that it 190 will change codings at timestamp N, where N is some future timestamp 191 (or SN). N should be chosen far enough into the future to guaran- 192 tee that the receiver will get the TCP message before time N. 
The 193 farther away N is, the more robust the system becomes, but the source 194 also loses its ability to adapt quickly. There are also several 195 options for simple in-band signaling methods which can assist in 196 error recovery. These are based on the assumption that it is better for 197 the receiver to know that the encoding has changed (even though it 198 doesn't know to what), than to know nothing. This avoids playing gar- 199 bage out. A one or two bit coding sequence number can be used in the 200 header. Such a number starts at zero. At the timestamp where the 201 encoding changes, the SN increments, and stays incremented until the 202 next change. In this fashion, we are guaranteed that the receiver will 203 never play out data using the wrong coding type. Probably just one or 204 two bits of this SN are necessary. 206 Using in-band payload types allows the coding to be explicitly indi- 207 cated for each packet. This eliminates synchronization problems, and 208 allows the sender to change encodings without out-of-band signaling. 209 Its flexibility is the reason in-band payload types were used for 210 generic RTP in the first place. By using dynamic types, the number of 211 bits for the encoding can be reduced by limiting the number of codecs 212 that can be used simultaneously during a session. 214 Our conclusion is that it is desirable to have the PTI field in the 215 payload (i.e., in-band). This makes it possible to do more robust rate 216 control, which becomes a significant issue when multiple connections 217 are multiplexed together (and therefore the aggregate bitrate 218 increases). It also makes sense to signal a table of encodings for 219 the payload type at the beginning of the connection. Any particular 220 pair of ITGs will generally only support a few codecs. Therefore, 221 dynamically setting the codings of the PTI bits makes a more compact 222 representation possible without restricting the set of codecs which 223 may be used. 
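As a rough illustration of the dynamically signaled PTI table described in Section 4.1, the following Python sketch numbers the codecs a pair of gateways has agreed on and reports how many bits the PTI field would then need. The function names and the placeholder codec names are assumptions made for illustration and do not appear in the memo; the wrap-around of the coding sequence number is likewise an assumption that follows from its one or two bit width.

```python
def build_pti_table(codecs):
    """Map small payload type identifiers (PTIs) to codec names.

    `codecs` stands for the list signaled at the start of the connection;
    the names used below are placeholders, not values from the memo.
    """
    if not codecs:
        raise ValueError("at least one codec is required")
    table = dict(enumerate(codecs))
    # Bits needed to number len(codecs) entries, with a minimum of one bit.
    bits = max(1, (len(codecs) - 1).bit_length())
    return table, bits


def next_coding_sn(sn, bits=1):
    """Advance the in-band coding sequence number, wrapping at 2**bits (assumed)."""
    return (sn + 1) % (1 << bits)


table, bits = build_pti_table(["codec-A", "codec-B", "codec-C"])
print(table, bits)
```

With three codecs in the table, two PTI bits suffice, which is the kind of saving over a fixed static payload type space that the text argues for.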
225 4.2 Timestamps 227 Timing is a very complex issue for the multiplexing protocol. The 228 first question related to it is whether the protocol will support 229 mixing of media derived from separate clocks (i.e., voice and video). 230 Although doing this seems attractive, it is complex and in opposition 231 to the philosophy under which RTP was developed. RTP explicitly 232 states that separate media should be placed in separate RTP streams. 233 This allows for different QoS to be requested for each media, and for 234 clocks to be defined based on the media type. Furthermore, this pro- 235 file is geared towards the aggregation of voice traffic generated 236 from the POTS across the Internet. As a result, the only source of 237 data is from a single, 125us clock. 239 The next basic question is whether timestamps are needed globally, 240 i.e., just one per packet independent of the number of users, or 241 locally, whereby each user within a packet needs their own timestamp. 242 A separate question is the representation of these timestamps in an 243 efficient manner. When considering these questions, the criteria to 244 keep in mind are: 246 1. Can silence periods be recovered correctly 248 2. Can resynchronization occur in the face of packet loss 250 3. What is the impact on playout buffering and jitter computation 252 The answer to this question depends on the desired capabilities of the 253 protocol. In the most general case, it is possible to have different 254 frame sizes for each user (for example, 20ms, 10ms, and 15ms) within the 255 same packet. These frames can be arbitrarily aligned in time with 256 respect to each other (i.e., the 20ms frame starts 5.3 ms after the 257 beginning of another user's 10 ms frame). The user can send packets off 258 at any point, containing data from those users whose frames have been 259 generated before the packet departure time. 
A somewhat more restrictive 260 capability is to allow for different frame sizes and time alignments, 261 but to require that any packet contains all the same frame sizes, all 262 aligned in time. The most restrictive case is to require separate RTP 263 sessions for users with different frame sizes. This requires a channel 264 to be torn down and re-setup when it changes codec. The desire to per- 265 form flow control on a channel-by- channel basis makes this approach 266 unacceptable, and it is not considered further. 268 4.2.1 General Case 270 First consider the general case. Packets can contain frames from some 271 or all of the users, and those frames are not the same length nor 272 time aligned in any way. An example of such a scenario is depicted in 273 Figure 3. In the figure, there are three sources, and the ti corre- 274 spond to the times of packet emissions. When packets are lost, the 275 variability in the amount and time alignment of data in each packet 276 makes it impossible to reconstruct how much time had elapsed based 277 solely on sequence numbers (such reconstruction IS possible in the 278 single user case). Furthermore, the amount of time elapsed can easily 279 vary from user to user, and therefore local timestamps are needed. 281 The general case introduces further complications which have to do 282 with jitter and delay computation. Such computations are needed for 283 RTCP reporting and possibly for the estimation of network delays, 284 used in dynamic playout buffers. In the single user case, the jitter 285 is computed between each packet as: 287 D(i,j) = (Rj - Ri) - (Sj - Si) 289 Where the Ri correspond to the reception times at the receiver mea- 290 sured in RTP time, and the Si are the RTP timestamps in the data 291 packets. The delay is computed as the difference between the arrival 292 time at the receiver and generation time, as indicated by the RTP 293 timestamp. 
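The single user computation above can be sketched in Python as follows. The function delay_difference implements D(i,j) exactly as defined in the text. The smoothing step in update_jitter is the running estimate used by the RTP specification for RTCP reports, not something this memo defines, and the sample arrival times are invented for illustration.

```python
def delay_difference(r_i, s_i, r_j, s_j):
    """D(i,j) = (Rj - Ri) - (Sj - Si), all values in RTP timestamp units."""
    return (r_j - r_i) - (s_j - s_i)


def update_jitter(jitter, d):
    """Running jitter estimate, smoothed with gain 1/16 as in the RTP spec."""
    return jitter + (abs(d) - jitter) / 16.0


# Hypothetical packets as (arrival time R, RTP timestamp S); 125us clock,
# 20ms frames, so consecutive timestamps differ by 160 ticks.
packets = [(1000, 0), (1170, 160), (1325, 320)]
jitter = 0.0
for (r_i, s_i), (r_j, s_j) in zip(packets, packets[1:]):
    jitter = update_jitter(jitter, delay_difference(r_i, s_i, r_j, s_j))
print(jitter)
```

In the multiplexed case the same computation would have to be repeated per block, since each block j in packet i can carry its own sending time.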
295 In the multiple user case, these definitions no longer make sense, as 296 there is no single RTP timestamp any longer. Each arriving packet 297 will have a single arrival time (Ri), but multiple sending times 298 (Si,j) for each block j in the ith packet. There are a number of 299 alternatives for delay and jitter computation in this case: compute 300 such information for all users, compute such information for a single 301 user, or generate a single delay and jitter estimate, but have it be 302 based on information from all users. There are pros and cons to each 303 approach. 305 First of all, it is possible for different blocks to experience dif- 306 ferent delays (and jitters) even though they are within the same 307 packet. This is because the general scenario allows for significant 308 variability, whereby blocks may either vary in size from packet to 309 packet and within a packet, or not be transmitted immediately after 310 their completion (the latter happens to source B in Figure 3). Thus, 311 it is arguable that it may be desirable to perform adaptive playout 312 buffering separately for each user, which would require the storage 313 and computation of delays for each user. 315 The second alternative is to compute the delays for a single user, 316 and use that information to size all of the other playout buffers. 317 This may be sub-optimal in terms of delay and loss, depending on what 318 fraction of the total delay and jitter is introduced by the packeti- 319 zation itself. There is a second disadvantage to this approach, how- 320 ever. When that particular user enters a silence period, delay and 321 jitter information is no longer being received, and so estimates of 322 network delay stop adapting. This implies that delay estimates will 323 be stale for certain periods of time. An alternative is to change the 324 user from which delay and jitter estimates are being collected. 
326 The third alternative is to compute delay estimates based on some 327 measure derived from all of the users. There are several reasonable 328 approaches. For example, the delay estimate can be computed as: 330 Delay(i) = max over j of (Ri - Si,j) 331 which would yield a conservative estimate of the delay for some 332 users. This approach requires storage of only a single set of delay 333 information, although computation still grows with the number of 334 users in a packet. 336 --------------------------------- 337 Source A | | | 338 --------------------------------- 339 Source B | | | | | | | 340 --------------------------------- 341 Source C | | | | | | 342 --------------------------------- 344 | | | | | | | | 345 t1 t2 t3 t4 t5 t6 t7 t8 347 --------------------------------------> time 349 Figure 3: Global Timestamp Problem 351 Sending local timestamps also requires extra bits in the block head- 352 ers. It is possible, however, to use offsets for the local times- 353 tamps. A global timestamp can be used in the RTP header (the field 354 already exists), and each user has a modifier to indicate position in 355 time relative to that timestamp. 357 A related question is how big to make the offset field. This offset 358 is bounded by the difference in time between the earliest and latest 359 samples within a packet. Clearly, this itself is bounded by the pack- 360 etization delay at the source. For this application, if we assume a 361 125us sample clock, and bound packetization delays to 100ms, the off- 362 set field is bounded by 800 ticks, requiring 10 bits. 364 4.2.2 More Restrictive Case 366 As a more restrictive case, we allow blocks to be present in a packet 367 if their frame sizes are identical and aligned in time. Note that 368 this does not imply identical codecs or identical block sizes in 369 terms of bytes; many voice codecs operate with a 20ms or 50ms frame 370 size. 
This case would allow all frames of the same size and time 371 alignment, independent of the codec, into a packet. 373 This simplifies the timing issue tremendously. Now, the scenario is 374 much more like the single user application. The sequence numbers and 375 the frame size completely determine the timing when at least one user 376 is active. But, when all users enter silence, a global timestamp is 377 needed to indicate the duration of the silence period. The global 378 timestamp is sufficient to reconstruct the timing in the face of 379 losses. Therefore, in this case, only a global timestamp is required. 381 It is desirable to support a variety of different frame sizes within 382 such an aggregated connection, however. The way to do this in this 383 case is to simply mandate that different packets can contain differ- 384 ent frame sizes; the only restriction is within a packet. This is not 385 as simple as it may seem at first. Once this is done, the relation- 386 ship between sequence numbers and timing is lost. Consider an exam- 387 ple. There are two frame sizes, 10ms and 30ms. Packet N contains 10ms 388 frames, as do packets N+1 and N+2; however, packet N+3 contains 30ms 389 frames. Thus, although the difference in sequence number between the 390 first and fourth is three, the relative timing is not 10ms*3 or 391 30ms*3. Due to this fact, the measurement of jitter is complicated 392 (for the same reasons described in Section 4.2.1), as it should not 393 be done between two packets with different frame sizes. It also makes 394 recovery techniques based on sequence number more complex. To resolve 395 this problem, we use a natural concept in RTP, which is the synchro- 396 nization source (SSRC). The approach is to have a separate SSRC for 397 each frame size in use. Then, sequence numbers are interpreted for 398 each SSRC separately. This resolves the problem with the relationship 399 between timing and sequence numbering. 
It also makes jitter and delay 400 computations simpler - they are now done for each SSRC separately. 401 Furthermore, multiple jitter (and delay, loss, etc.) values are 402 reported to the source, one for each frame size. This is also desir- 403 able, since the different frame sizes will cause different packetiza- 404 tion delays and packet sizes, which may cause those packets to see 405 different delays and losses in the network than other packets. 407 This case has both advantages and drawbacks when compared to the gen- 408 eral case. As an advantage, timing is greatly simplified, and the 409 approach falls much in line with the original intentions of RTP. How- 410 ever, it causes losses in efficiency for systems with a variety of 411 different frame sizes in operation simultaneously. Such a situation 412 arises naturally when flow control is applied to each source individ- 413 ually, as opposed to altering the rate and codec type for all of the 414 active sources. 416 4.3 Channel ID 418 The question of channel identification may seem at first trivial - 419 simply use a 32 bit number, much like the SSRC, and be done with it. 420 However, 32 bits adds significant overhead. Reduction of the number 421 of bits for the channel ID becomes a complex issue. Unlike the single 422 user case, the connection may remain active for long periods of time 423 (days or months). The result is that channel IDs will need to be 424 reused during the lifetime of the connection. It is critical to 425 ensure that data from different channels is not confused because of 426 this. Large channel ID spacing helps to resolve this issue (although 427 it can not eliminate it), so an added side effect of reducing the 428 number of channel IDs possible is an increase in the likelihood of 429 such confusion. 431 The first question to be addressed is how many simultaneous users can 432 one expect to find in a single packet. 
434 4.3.1 Number of Users 436 There are several ways to come up with some minimums and maximums. 438 4.3.1.1 Delay-bound 440 Clearly, as we add more users, the store and forward delays increase 441 since the packet size gets larger. Therefore, if we bound the per-hop 442 delay, and provide a lower bound on the codec bitrate and packetiza- 443 tion delay, an upper bound on the number of users can be obtained. 444 Consider a 2.4 kbps codec, with a 20ms frame size. This is a reason- 445 able minimum combination. Next, consider 50ms store and forward 446 delays. For a T1, this limits the number of users within a packet to 447 965. For a T3, it is 30 times this, or nearly 29,000. If silence sup- 448 pression is used, the number of users within a packet is roughly half 449 the number of active users (on average), thus requiring twice as many 450 channel identifiers (1930 and 58,000). This bound doesn't seem too 451 tight. Intuitively, even 965 seems too large. 453 4.3.1.2 Efficiency bound 455 The entire purpose of multiplexing is to improve upon efficiency. 456 Therefore, we should be able to support at least as many users as is 457 necessary to get good efficiency. Consider a typical case, a 16 kbps 458 codec, with a 20ms packetization delay. This results in 320 bits of 459 data per user. If we assume IP/UDP/RTP (20+8+12=40 bytes = 320 bits), 460 plus an additional word (32 bits) of overhead per user, the effi- 461 ciency vs. N becomes: 463 E = (320N / ((320 + 32)N + 320)) 465 This reaches an asymptote of about 90 percent. Achieving most of this, say 88 percent of the asymptote, takes about 7 users per packet, so 466 that we must support at least 14 active channels (again, due to stat 467 mux). The lower bound, therefore, on the number of users is around 468 14. 470 4.3.1.3 MTU Bound 472 In many cases, there is a maximum packet size. This is usually around 473 1500 bytes. 
If we consider a very low bitrate codec, the minimum 474 block size from any particular user is 32 bits (otherwise, overheads 475 become very large, and we lose word alignment, so 32 bits is a good 476 minimum). Dividing 1500 bytes by 4 bytes, we obtain a maximum of 375 477 users. Multiplying by two, the number of active channels needed is 478 around 750. 480 Based on these bounds, we need to simultaneously support at least 10 481 users, and at most 750. This would imply that at least 8 to 10 bits 482 of channel ID are required. 484 4.3.2 Channel ID Reuse Problem 486 It is important to guarantee that data from a particular channel is 487 never routed to a different channel; this would mean that a user may 488 hear pieces of conversations from different users, an error we con- 489 sider catastrophic. Such misrouting becomes possible when a channel 490 is torn down, and a new channel is set up soon after using the same 491 channel ID. Such a scenario is depicted in Figure 4. Sometime after 492 channel K is torn down, a new channel is set up using the same chan- 493 nel ID, K. If the data packets (dotted lines) are being delayed sig- 494 nificantly, blocks from the old channel K may still be present in the 495 data stream after the new channel K is established. These blocks will 496 then be played out to the new user of channel K. Protocol support is 497 needed to guarantee that this can never happen. 499 The solution lies in an intelligent signaling protocol. The protocol 500 must support a two-way handshake for all control messages. In addi- 501 tion, three simple rules must be obeyed at a source when setting up 502 or tearing down connections: 504 1. When a source sends a teardown message, it stops sending data 505 in the UDP stream for that channel. Furthermore, in the sig- 506 naling message, it indicates the sequence number of the packet 507 which contained the last block for that channel, call this 508 sequence number K. 510 2. 
A source cannot re-use a channel identifier until it has 511 received an acknowledgment from the destination that that partic- 512 ular channel was successfully torn down. 514 | | 515 t1 |------------- teardown K | 516 |. --------------X| 517 | .old K data |t2 518 | . -------| 519 | ACK TD K --------- | 520 t3 |X----------- | 521 | . | 522 | . | 523 |------------- setup K | 524 | .--------------X| 525 | ....... |t4 526 | ACK SET K --------------| 527 |X------------- .... | 528 |...... ..X| 529 | ........data new K | 530 | .............X| 532 Figure 4: Channel ID Reuse Problem 534 3. A source cannot begin to send data from a particular 535 channel in the UDP stream until it has received an acknowledgment 536 from the destination that the setup is complete. 538 A few simple rules must also be used at the receiver: 540 1. When a receiver gets a teardown message, it checks the highest 541 SN received so far (call this sequence number M). If M > K, 542 the channel is torn down, and any further blocks containing 543 that channel ID are discarded. If M < K, blocks from that 544 channel are accepted until the received SN exceeds K. Once 545 this happens, the channel is torn down and no further blocks 546 with that channel ID are accepted. 548 2. When a setup message is received, the destination will begin 549 to accept blocks with the given channel identifier, but only 550 if the sequence numbers of the packets in which they ride are 551 greater than K. 553 The use of the sequence numbers allows the receiver to separate the old 554 channel K blocks from the new ones. This guarantees that the destination 555 will not misroute packets. An additional benefit is that the end of 556 speech will not be clipped if the last data packets arrive after the 557 teardown is received. This protocol is quite simple to implement, 558 although it requires a table at the receiver of the values of K for each 559 channel ID. 
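The receiver rules above can be sketched in Python as a small table keyed by channel ID. This is only an illustration under stated assumptions: the class and method names are hypothetical, sequence number wrap-around is ignored, and the case M = K, which the rules do not cover, is treated like M < K.

```python
class ReceiverChannelTable:
    """Tracks, per channel ID, whether blocks may still be accepted."""

    def __init__(self):
        self.highest_sn = -1   # M: highest packet sequence number seen so far
        self.state = {}        # channel ID -> "open", "draining" or "closed"
        self.last_sn = {}      # channel ID -> K from the most recent teardown

    def on_setup(self, cid):
        # Rule 2: start accepting blocks, but only above the stored K.
        self.state[cid] = "open"

    def on_teardown(self, cid, k):
        # Rule 1: close at once if M > K, otherwise keep draining old blocks.
        self.last_sn[cid] = k
        self.state[cid] = "closed" if self.highest_sn > k else "draining"

    def on_block(self, cid, sn):
        """Return True if a block for `cid` in packet `sn` is accepted."""
        self.highest_sn = max(self.highest_sn, sn)
        state = self.state.get(cid, "closed")
        k = self.last_sn.get(cid)
        if state == "draining":
            if self.highest_sn <= k:
                return True            # tail of the old channel, not clipped
            self.state[cid] = "closed"  # received SN has exceeded K
            return False
        if state == "open":
            return k is None or sn > k
        return False


rx = ReceiverChannelTable()
rx.on_setup(5)
results = [rx.on_block(5, 10)]
rx.on_teardown(5, 12)                  # M = 10 < K = 12, so keep draining
results += [rx.on_block(5, 11), rx.on_block(5, 12)]
rx.on_setup(5)                         # channel ID 5 is reused
results += [rx.on_block(5, 12), rx.on_block(5, 13)]
print(results)
```

The only state kept per channel ID beyond open or closed is K, which corresponds to the table the text says the receiver must maintain.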

   Alternate solutions to this reuse problem exist which can operate
   when the above restrictions are relaxed. The simplest approach is to
   have the source keep a linked list of free channel IDs. The list is
   initialized to contain all channel IDs, in order. When a new channel
   is required to be established, the channel ID is taken from the top
   of the list. When a channel is torn down, its ID is placed at the
   bottom of the list. This makes the time between channel ID reuse as
   long as possible, and reduces the probability of confusion. With
   this method, it is no longer necessary to include sequence numbers
   in the teardown messages. Also, the receiver does not need to
   maintain a table.

4.3.3 Channel ID Coding

   This section discusses some of the options for coding the channel ID
   field.

4.3.3.1 Fixed Length

   The fixed length approach is the most straightforward. A fixed
   number of bits is assigned to the channel ID. Issues surrounding the
   number of bits required have been discussed above.

4.3.3.2 Implicit + Present Mask

   In reality, the channel IDs are very redundant. Both source and
   destination know the set of active connections and their channel
   identifiers from the signaling messages. Therefore, if the blocks
   are placed in the packet in order of increasing channel ID, very
   little information actually needs to be sent. In fact, without
   silence suppression, channel activity and the presence of a block in
   a packet are likely to be equivalent, in which case NO information
   actually needs to be sent about channel IDs.

   Unfortunately, there are some practical problems with this. First,
   silence suppression is used. Secondly, even if it weren't, it is
   possible for the voice codecs at the ITG not to have their framing
   synchronized (as in the general case above), so that a packet may
   not contain data from all users.
   Thirdly, the source and destination do NOT have a consistent view of
   the state of the system. There is a delay while signaling messages
   are in transit.

   A few simple mechanisms can be used to overcome these complexities.
   In the header of the packet, a mask is sent. Each bit in the mask
   indicates whether data from a channel is present in the packet or
   not. Mapping of channel IDs to bits is done by sorting the channel
   IDs, and mapping the lowest number to the first bit, the next lowest
   to the second, etc. Therefore, if a channel has no data for that
   packet, its bit is set to zero. Given that the source and
   destination agree on how many connections are active at all points
   in time, the number of bits required is known to both sides.

   The next step is to deal with the differences in state. An
   additional field, called the state-number, perhaps 5 bits, is sent
   in the header of the packet. This field starts at zero. Let us say
   that at some point in time, its value is N. The source wishes to
   tear down a channel. It sends the teardown message to the
   destination, but continues to send data for that channel (or it may
   choose to send nothing, but must set the appropriate bit in the mask
   to zero). When the destination receives the message, it replies with
   an acknowledgment. When the acknowledgment is received by the
   source, the source considers the channel torn down, and no longer
   sends data for it, nor considers it in computing the mask. In the
   packet where this happens, the source also increments the
   state-number field to N+1. The destination knows that the source
   will do this, and will therefore consider the state changed for all
   packets whose value of the field is N+1 or greater. When the next
   signaling message takes effect, the field is further increased.
   Even if packets are lost, the value of the state-number field for
   any correctly received packet completely tells the destination the
   state of the system as seen in that packet. Furthermore, it is not
   necessary to wait for a particular setup or teardown to be
   acknowledged before requesting another setup or teardown.

   The number of bits for the state-number field should be set large
   enough to represent the maximum number of state changes which can
   have taken effect during a round trip time. As an alternative, an
   additional exchange can occur. After the destination receives a
   packet with state number greater than N, it destroys the state
   related to N, and sends back, reliably, a free-state N message,
   indicating to the source that state N is now de-allocated, and can
   be used again. Until such a message is received, the source cannot
   reuse state N. This is essentially a window based flow control,
   where the flow is equal to changes in state. With this addition, the
   number of bits for the state number can be safely reduced, and it is
   guaranteed that the destination will never confuse the state,
   independent of the number of state-number bits used. However, the
   use of too few state bits can cause call blocking or delay the
   teardown of inactive channels.

   This problem in state difference appears to be similar to the
   channel ID reuse problem described in Section 4.3.2. However, there
   is an important difference. In the channel ID reuse problem, if the
   packet containing the last block of a user arrives before the
   signaling message tearing down that connection, there is no problem.
   The destination will generally play out silence until the signaling
   message is received. Here, however, the destination must know that
   blocks are no longer present in the data stream independent of when
   the signaling messages arrive.
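
   The mask and state-number mechanism of this section can be
   illustrated with a short Python sketch. It is a minimal example
   under stated assumptions: the names encode_mask, decode_mask and
   MaskReceiver are hypothetical, the mask is modeled as a list of bits
   rather than a packed header field, and the receiver is assumed to
   learn from signaling which set of channels belongs to each
   state-number.

```python
def encode_mask(active, present):
    """One bit per active channel, lowest channel ID first."""
    return [1 if cid in present else 0 for cid in sorted(active)]


def decode_mask(active, mask):
    """Recover the channel IDs whose blocks are in the packet, in block order."""
    ordered = sorted(active)
    if len(mask) != len(ordered):
        raise ValueError("mask length does not match the number of active channels")
    return [cid for cid, bit in zip(ordered, mask) if bit]


class MaskReceiver:
    """Keeps one set of active channels per state-number (sketch)."""

    def __init__(self, active, state_bits=5):
        self.modulus = 1 << state_bits          # state-number wraps at 2**state_bits
        self.states = {0: frozenset(active)}    # the field starts at zero

    def on_signaling(self, state_number, active):
        # Assumed: a setup or teardown announces the channel set that is
        # valid from this state-number onward.
        self.states[state_number % self.modulus] = frozenset(active)

    def channels_in(self, state_number, mask):
        # The state-number in the packet selects the channel set used to
        # interpret the mask, even if earlier packets were lost.
        return decode_mask(self.states[state_number % self.modulus], mask)
```

   For instance, with active channels 4, 9 and 17 and data present only
   for 4 and 17, encode_mask produces the bits 1, 0, 1, and decode_mask
   maps them back to channels 4 and 17.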

   There are some drawbacks to this approach. It requires the source
   and destination to maintain state. Any error in processing at either
   end, or a hardware failure, causes a complete loss of
   synchronization. This hard-state nature of the protocol can be
   relaxed by having the source send the complete state of the system
   with each signaling message, along with the state-number field for
   which this state takes effect. This guarantees that even in the
   event of end-system failure, the system state will be refreshed
   whenever a new connection is set up or torn down. Furthermore, the
   state can be sent periodically to improve performance.

4.4 Length Indicators

   There are many ways to actually code the length indicators. The
   first question, however, is the range of lengths which must be
   coded.

4.4.1 Range of Length Indicators

   Here, there is a clear tradeoff between flexibility and efficiency.
   A larger range can accommodate a variety of different media (such as
   video) where lengths may be large. However, this comes at the
   expense of a long length field, which may require another word of
   header to hold. For voice, one would expect a maximum bitrate of 64
   kbps, and around 50 ms packetization delay. This yields exactly 100
   words of data (64 kbps x 50 ms = 3200 bits = 100 32-bit words).
   Therefore, an eight bit field is probably sufficient for most voice
   applications.

4.4.2 PTI Based Lengths

   In many applications, the amount of data present depends on the
   voice codec in use. Frame based coders will generally send a frame
   at a time. Since the codec type is indicated by the PTI field, it
   may not always be necessary to send length information at all. Even
   for non-frame based codecs, such as PCM, default data sizes can be
   set in the standard (as in RFC 1890 [6]). An extension bit can be
   used to indicate a non-standard length, so that when set, a length
   field follows.
   This allows for efficient coding of the most common cases, but
   allows for variable lengths with little additional cost.

4.4.3 Variable Length w/ Indicator

   In this approach, a variable length header is used. All of the
   length indicators for all of the blocks are placed together in the
   beginning of the packet. However, the first four bits of this header
   field indicate the number of bits used for each length field. What
   follows are the length fields themselves, each using the number of
   bits indicated by the first four bits. This approach scales well,
   using a small overhead when the block lengths are small, and a
   larger overhead when they are larger. The drawback is a variable
   length header field, plus additional complexity in the parsing. An
   example of this technique is depicted in Figure 5. In the first
   example, the four bit indicator field has a value of three, so that
   the length fields are all three bits long. The four lengths are then
   2, 6, 3, and 4. In the second example, the four bit indicator has a
   value of two, so that the length fields are all two bits long. The
   four lengths are thus 3, 2, 1, and 3.

                  4b  3b  3b  3b  3b
                 --------------------
      Example 1 |0011|010|110|011|100|
                 --------------------

                  4b  2b 2b 2b 2b
                 ----------------
      Example 2 |0010|11|10|01|11|
                 ----------------

              Figure 5: Variable Length w/ Indicator

4.4.4 Remaining Packet Length Based Lengths

   UDP always informs RTP of how many bytes are in the payload. This
   itself restricts the possible length of the first block, since its
   length must be less than the total packet length minus the RTP
   header. Furthermore, as each block is placed into the packet, the
   possible set of lengths that it can have shrinks - it must always be
   less than the remaining length in the packet.
   This approach, therefore, codes each length field with log2 of the
   number of bits remaining in the packet. This approach works
   extremely well when there is a long block followed by several
   shorter ones, whereas the previous approach performs poorly in this
   case. Furthermore, it eliminates the length indicator present in the
   previous approach. However, it is even more complex than the
   previous technique. It can result in no savings under some
   conditions, especially since the header fields must be rounded to 32
   bits.

   Consider an example. The total size of the packet is 31 words.
   Inside of it are three blocks, the first whose length is 17, the
   second 8, and the third, 6. We would code the first length field
   with 5 bits. After this block is read, the remaining amount of data
   in the packet is 14 words. Therefore, the next length field is coded
   with 4 bits. After this block, the remaining amount of data in the
   packet is 6 words, so the final length field is coded with three
   bits. The total is therefore 5+4+3 = 12 bits. In the previous
   approach (Section 4.4.3), the entire length field would have
   required 4 bits for the indicator (whose value would be 5), followed
   by 3 five bit fields, for a total of 19 bits.

   One may question this example since the overhead of the length
   fields itself is not taken into account when computing the remaining
   length of the packet. While this can be incorporated, it makes
   things even more complex, and it is not actually necessary. All that
   is required is that the length fields are coded with log2(M), where
   M is any bound on the remaining amount of data which can be
   deterministically computed from past information. A simple bound is
   the packet length minus the data seen thus far (one can also
   subtract away any fixed length fields), precisely the metric used in
   the example above.
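
   The two length codings discussed so far can be compared with a small
   Python sketch. It is illustrative only: the function names are
   hypothetical, headers are modeled as strings of '0' and '1'
   characters, and the field width for a bound M is taken as the number
   of bits needed to write M in binary, which is how the 31-word
   example above arrives at 5, 4 and 3 bits.

```python
def decode_indicator_header(bits):
    """Indicator approach: 4-bit width indicator, then equal-width length fields."""
    width = int(bits[:4], 2)
    if width == 0:
        raise ValueError("an indicator of zero leaves no room for length fields")
    body = bits[4:]
    if len(body) % width:
        raise ValueError("length fields do not fill the header evenly")
    return [int(body[i:i + width], 2) for i in range(0, len(body), width)]


def indicator_cost(lengths):
    """Header bits for the indicator approach: 4 plus one field per block."""
    return 4 + len(lengths) * max(lengths).bit_length()


def remaining_length_widths(total, lengths):
    """Width of each length field when coded against the data not yet consumed."""
    widths = []
    remaining = total
    for length in lengths:
        if length > remaining:
            raise ValueError("block is longer than the remaining packet")
        widths.append(remaining.bit_length())  # bits needed for values up to `remaining`
        remaining -= length
    return widths
```

   For the packet of 31 words holding blocks of 17, 8 and 6 words,
   remaining_length_widths(31, [17, 8, 6]) gives widths of 5, 4 and 3
   bits, 12 in total, while indicator_cost([17, 8, 6]) gives 19 bits.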

4.4.5 Table Based Approach

   Realistically, most systems will operate with codecs that generate
   data in a fixed set of lengths (a frame size, for example). In that
   case, the set of lengths which can appear in the packet is usually
   very restricted. To take advantage of this fact, a table can be
   transmitted to the receiver reliably before transmission commences.
   This table can indicate the actual length of a block, and its
   coding. The symbols transmitted in the data packets are then used in
   this table to look up the actual lengths. This can reduce the length
   field to 2 or 3 bits. These lengths then all occur next to each
   other in the header. The technique now relies on state at the
   receiver, and the parsing process is further complicated by table
   lookups. In addition, the approach only works if you know the set of
   lengths before the system begins operation. If you allow the table
   to be dynamically modified during a session, synchronization
   problems occur, and the system becomes quite complex.

   Further gains can be achieved through the use of Huffman codes
   instead of fixed length codes. This only makes sense when different
   codecs (and correspondingly different lengths) are used with
   different frequencies. An example of such a situation is when the
   codec changes to a higher rate because of music-on-hold; a rare
   event in general.

4.5 Marker Bit

   The marker bit has a general functionality, but is normally used to
   indicate the beginning of a talkspurt. It seems like a good idea to
   include this bit for each user.

4.6 Location of Per User Overhead

   There will generally be overhead on a per-user basis (information
   such as channel ID, length, etc.). This information can be located
   in one of three places. First, it can all reside in front of the
   block to which it is applicable.
   Second, it can all be pasted together and reside up front in the
   header of the packet. The third is a hybrid solution, where some of
   it resides up front (such as channel ID), and some resides in front
   of the data. There are various pros and cons to the different
   approaches. The hybrid approach can be complex, since data is split
   into multiple places. The case where all the header is up front has
   a few minor advantages. First, it allows for a complete separation
   of the data from the header. The implementation is likely to be a
   little less complex, since extracting blocks does not require
   actually moving through the payload.

5 Options

5.1 Option I: Mixer Based

   This option is the most straightforward to implement, but has the
   most overhead. The basic premise is to reuse the mixer concept
   introduced in RTP. Each user is considered a contributing source,
   and the gateway is considered a mixer. However, instead of mixing
   the media, separate data from each user appear in the payload. The
   32 bit CSRC identifies each user, acting as the channel ID. Data
   from each user is organized into blocks. Each block has its own 32
   bit header, which includes the length (12 bits) in units of 32 bit
   words, Marker bit (1b), TimeStamp Offset (12b), and Payload Type
   (7b). Furthermore, the payload type and marker bit are stricken from
   the RTP header (since they only make sense for an individual user),
   and the CC field expanded to fill the freed bits. This allows for a
   12 bit CC field, or 4096 users in a packet. Thus, the packet would
   look like:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            contributing source (CSRC) identifier 1            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            contributing source (CSRC) identifier 2            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                          ..........                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            contributing source (CSRC) identifier N            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Length         |   Timestamp Offset    |M|     PT      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 1                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |        Length         |   Timestamp Offset    |M|     PT      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 2                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 6: Option I

   This approach allows for the greatest generality in terms of
   variable length coders and coders with different frame sizes (see
   Section 4.3.1). The channel ID is longer than necessary, but using
   the concept of a contributing source for the channel ID necessitates
   the use of the additional bits. There are several variations on
   Option I, many of which have been mentioned above:

   I.A: Put the CSRC with each 32 bit length+M+PT field, instead of all
   of them being at the beginning. This has some pros and cons. As an
   interesting artifact of this change, it is no longer necessary to
   have a CC field. The length passed up by UDP is sufficient to
   recover the point at which you stop checking for additional blocks
   from users in the payload. In fact, the length field in the last
   block is not strictly necessary either.

   I.B: Do the opposite of I.A.
   Put the length+M+PT field up front along with the CSRC fields, with
   the pattern being CSRC 1, length 1, CSRC 2, length 2, etc. Here
   again, the CC field is not strictly necessary.

   I.C: The CSRC field can be shrunk to 8 bits. This allows four
   channel IDs to be coded in the space of one word, whereas only one
   fits at the current size of the field.

   I.D: The CSRC field can be shrunk to 16 bits, which allows two
   channel IDs to be coded in the space of one word.

5.2 Option II: One Word Header

   This option eliminates the large channel ID field present in the
   previous option. In the RTP header, the CC field is set to zero, the
   marker bit has no meaning, and the payload type is TBD (possible
   uses include an indication of the number of blocks in the packet).
   The RTP timestamp corresponds to the generation of the first sample,
   among all blocks, enclosed in this packet. A one word header
   precedes each block of data. The number of blocks is known by
   parsing them until the end of the RTP packet. The one word field has
   a channel ID (8 bits), length (8 bits), Marker (1 bit), timestamp
   offset (11 bits), and payload type (4 bits). Channel ID number 255
   is reserved, and causes the header to be expanded to allow for
   greater length, payload type, and possibly channel ID encodings. The
   specific format for this expanded header is for further study. Given
   the compacted payload type space, it may be a good idea to allow
   negotiation of the meaning for the payload type at the beginning of
   the connection. It may be worthwhile to expand the length field at
   the expense of the channel ID - this issue is for further study.
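
   As an illustration of how the Option II one word header could be
   packed and parsed, consider the Python sketch below. Several points
   are assumptions rather than statements of this document: the field
   order (length, timestamp offset, channel ID, marker, payload type,
   most significant bits first) follows the order shown in the Option
   II figure, the length is assumed to count 32 bit words of payload as
   it does in Option I, and all function names are hypothetical.

```python
import struct

RESERVED_CID = 255  # per the text: signals an expanded header, format not yet defined


def pack_block_header(length, ts_offset, cid, marker, pti):
    """Pack the 32-bit per-block header: 8 + 11 + 8 + 1 + 4 bits (assumed order)."""
    if not (0 <= length < 256 and 0 <= ts_offset < 2048 and 0 <= cid < 256
            and marker in (0, 1) and 0 <= pti < 16):
        raise ValueError("field out of range")
    word = (length << 24) | (ts_offset << 13) | (cid << 5) | (marker << 4) | pti
    return struct.pack("!I", word)  # network byte order


def unpack_block_header(data):
    """Split the first four bytes of `data` back into the five header fields."""
    (word,) = struct.unpack("!I", data[:4])
    return {
        "length": word >> 24,
        "ts_offset": (word >> 13) & 0x7FF,
        "cid": (word >> 5) & 0xFF,
        "marker": (word >> 4) & 0x1,
        "pti": word & 0xF,
    }


def iter_blocks(payload):
    """Walk the blocks until the end of the RTP payload is reached."""
    pos = 0
    while pos < len(payload):
        header = unpack_block_header(payload[pos:pos + 4])
        if header["cid"] == RESERVED_CID:
            raise NotImplementedError("expanded header is for further study")
        start = pos + 4
        end = start + 4 * header["length"]  # assumption: length counts 32-bit words
        if end > len(payload):
            raise ValueError("block runs past the end of the packet")
        yield header, payload[start:end]
        pos = end
```

   Note that iter_blocks needs no block count: it relies only on the
   payload length passed up by UDP, as described above.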

   The format of the packet is thus:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Length     |  Timestamp Offset   |      CID      |M|  PTI  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 1                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    Length     |  Timestamp Offset   |      CID      |M|  PTI  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 2                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 7: Option II

5.3 Option III - Restricted Case

   Option II has the advantage of being able to support multiple frame
   sizes within a single packet. However, it comes at the expense of a
   32 bit header (which can be large for low bitrate codecs), and at a
   reduced payload type field. This option has a 16 bit header, but
   does not support different frame sizes within a packet. It therefore
   falls into the category described in Section 4.3.2. Of the 16 bit
   header, the first bit is an expand bit (to be described shortly),
   and the second bit is the marker bit. The following 6 bits indicate
   payload type, and the remaining 8 are for channel ID. When the
   expand bit is set, an additional 16 bits are present, which indicate
   the length of the block. When expand is clear, the length is derived
   from the payload type. Since there is no timestamp offset, all the
   blocks in the packet must be time aligned and have the same frame
   lengths.
   Different sized frames are supported by using a different SSRC for
   each frame length (see Section 4.3.2). In the RTP header, the CC
   field is always zero. The marker bit and payload type are undefined.
   The timestamp indicates the time of generation of the first sample
   of each block. SSRC is randomly chosen, but always different for
   each frame size.

   The block headers are all located at the beginning of the packet,
   and follow each other. If the total length of the fields is not a
   multiple of 32 bits, it is padded out to 32. The structure of the
   header is such that fields never break across word boundaries. An
   example of such a packet is given in Figure 8. There are 7 blocks in
   this example. The first two have standard lengths based on the PT
   field. The next one uses the expansion bit to indicate the length.
   The fourth uses the PT field, the fifth the expansion bit, and the
   last two use the PT field. The last 16 bits of the header are padded
   out.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |E|M|    PT     |      ID       |E|M|    PT     |      ID       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |E|M|    PT     |      ID       |            Length             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |E|M|    PT     |      ID       |E|M|    PT     |      ID       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |            Length             |E|M|    PT     |      ID       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |E|M|    PT     |      ID       |              PAD              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 1                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 2                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 3                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 4                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 5                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 6                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 7                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                        Figure 8: Option III

5.4 Option IV - Stacked RTP

   This approach uses a duplicate of the RTP header as the per-block
   header. It is therefore extremely inefficient (12 bytes per block),
   but has several advantages: different media types can be mixed,
   since the timestamps are no longer related, and little processing is
   required if the sources being combined came from a single user RTP
   source. It also works well when one of the users is actually a mixer
   (for example, a conference bridge), since the CSRC can be used. Its
   main advantage is the reduction in overhead due to the IP and UDP
   headers. In addition to the standard RTP header, an additional
   header is required for length indication. This header has a number
   of 16 bit fields, each of which indicates a length for its
   corresponding block (including the 12 byte RTP header).
   The number of such 16 bit length fields is known by continuing to
   look for additional length fields until the total length of the
   packet passed up from UDP has been accounted for. If an odd number
   of such length fields is required, then an additional 16 bits of
   padding is inserted to make the length header a multiple of 32 bits.

   The format of such a packet is given in Figure 9.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Length 1            |           Length 2            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           Length 3            |              PAD              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 1                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                           Payload 2                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 9: Option IV

5.5 Option V: Compacted

   This option uses the Implicit + Mask approach outlined in Section
   4.3.3.2 to code the channel ID. In all other respects it is similar
   to Option III. Now, however, the per-block header can be reduced to
   one byte: 1 bit of expansion, 1 bit of marker, and 6 bits of payload
   type.
   Furthermore, the length field (present when the expansion bit is
   set) is reduced to 8 bits from 16 in Option III. This reduction
   saves on space, but it also guarantees that fields remain aligned on
   byte boundaries. The mask bits are present in the beginning of the
   packet, and they are preceded by an 8 bit state-number. If the
   number of active channels is not a multiple of 32, the mask field is
   padded out to a full word. This approach is extremely efficient, but
   the channel identification procedure is more complex and requires
   additional signaling support.

   A diagram of a typical packet for this option is given in Figure 10.
   The mask bits are indicated with lowercase m's. There are four
   active channels, each of which is present in this packet (all four
   mask bits would then be 1). The first block has a standard length,
   but the second has its expansion bit set, so that an 8 bit length
   field follows. The remaining two blocks have normal 8 bit headers.
   The last 24 bits of the header are padded to a word boundary.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |V=2|P|X|  CC   |M|     PT      |       sequence number         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           synchronization source (SSRC) identifier            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   State Num   |m|m|m|m|                  Pad                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |E|M|    PT     |E|M|    PT     |E|M|    PT     |E|M|    PT     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |E|M|    PT     |                      PAD                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                            Block 1                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            Block 2                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                            Block 3                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   |                            Block 4                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 10: Option V

6 Comparison of Options

   In this section, the options are compared in terms of efficiency.
   Issues relating to complexity, scalability, and generality have
   already been discussed in previous sections. The analysis here
   consists of a series of tables, indicating the efficiency of each
   option for a variety of speech codecs. Several tables are included
   for different numbers of users.

6.1 Specific Codecs

   In both Table 1 and Table 2, the efficiency vs. codec for all of the
   options is tabulated. For G.711, G.726, G.728 and G.722, the frame
   size listed is a multiple of the actual frame size of the codec,
   which is too small to be sent one at a time. The efficiency is
   computed as the number of words of payload such a codec would
   occupy, times the number of users, divided by the total packet size
   (i.e., it does not consider inefficiencies due to padding the
   payload portion). Note that Option V is always superior in
   efficiency. The efficiencies are generally 1 to 10 percent apart.
   Table 1 considers the case where there are 10 users, and Table 2
   considers the case where there are 24.

   Codec|rate|Frame(ms)|  I   | I.C  | I.D  |  II  | III  |  IV  |  V
   G.711| 64 |   20    |93.02 |94.56 |94.12 |95.24 |96.39 |90.50 |96.84
   G.726| 32 |   20    |86.96 |89.69 |88.89 |90.91 |93.02 |82.64 |93.88
   G.728| 16 |  18.75  |76.92 |81.30 |80.00 |83.33 |86.96 |70.42 |88.47
   G.729|  8 |   10    |50.00 |56.60 |54.55 |60.00 |66.67 |41.67 |69.72
   G.723|5.3 |   30    |62.50 |68.49 |66.67 |71.43 |76.92 |54.35 |79.33
   G.723|6.3 |   30    |66.67 |72.29 |70.59 |75.00 |80.00 |58.82 |82.16
   ITU  |  4 |   20    |50.00 |56.60 |54.55 |60.00 |66.67 |41.67 |69.72
   G.722| 64 |   15    |90.91 |92.88 |92.31 |93.75 |95.24 |87.72 |95.84
   GSM F| 13 |   20    |75.00 |79.65 |78.26 |81.82 |85.71 |68.18 |87.35
   IS54 |7.95|   20    |62.50 |68.49 |66.67 |71.43 |76.92 |54.35 |79.33
   IS96 |8.5 |   20    |66.67 |72.29 |70.59 |75.00 |80.00 |58.82 |82.16

        Table 1: 10 Users (rate in kbps, efficiency in percent)

   Codec|rate|Frame(ms)|  I   | I.C  | I.D  |  II  | III  |  IV  |  V
   G.711| 64 |   20    |94.30 |96.00 |95.43 |96.58 |97.76 |91.34 |98.26
   G.726| 32 |   20    |89.22 |92.31 |91.25 |93.39 |95.62 |84.06 |96.57
   G.728| 16 |  18.75  |80.54 |85.71 |83.92 |87.59 |91.60 |72.51 |93.37
   G.729|  8 |   10    |55.38 |64.29 |61.02 |67.92 |76.60 |44.17 |80.87
   G.723|5.3 |   30    |67.42 |75.00 |72.29 |77.92 |84.51 |56.87 |87.57
   G.723|6.3 |   30    |71.29 |78.26 |75.79 |80.90 |86.75 |61.28 |89.42
   ITU  |  4 |   20    |55.38 |64.29 |61.02 |67.92 |76.60 |44.17 |80.87
   G.722| 64 |   15    |92.54 |94.74 |93.99 |95.49 |97.04 |88.78 |97.69
   GSM F| 13 |   20    |78.83 |84.38 |82.44 |86.40 |90.76 |70.36 |92.69
   IS54 |7.95|   20    |67.42 |75.00 |72.29 |77.92 |84.51 |56.87 |87.57
   IS96 |8.5 |   20    |71.29 |78.26 |75.79 |80.90 |86.75 |61.28 |89.42

        Table 2: 24 Users (rate in kbps, efficiency in percent)

7 Authors' Addresses

   Jonathan Rosenberg
   Rm. 4C-526
   Bell Laboratories, Lucent Technologies
   101 Crawfords Corner Rd.
   Holmdel, NJ 07733
   electronic mail: jdrosen@bell-labs.com

   Henning Schulzrinne
   Dept. of Computer Science
   Columbia University
   1214 Amsterdam Avenue
   New York, NY 10027
   USA
   electronic mail: schulzrinne@cs.columbia.edu

8 Bibliography

   [1] M. Handley and V.
   Jacobson, SDP: session description protocol, Request for Comments
   (Proposed Standard) 2327, Internet Engineering Task Force, Apr.
   1998.

   [2] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, RTP: a
   transport protocol for real-time applications, Request for Comments
   (Proposed Standard) 1889, Internet Engineering Task Force, Jan.
   1996.

   [3] B. Subbiah and S. Sengodan, User multiplexing in RTP payload
   between IP telephony gateways, Internet Draft, Internet Engineering
   Task Force, Aug. 1998. Work in progress.

   [4] J. Rosenberg and H. Schulzrinne, An RTP payload format for user
   multiplexing, Internet Draft, Internet Engineering Task Force, May
   1998. Work in progress.

   [5] K. Tanigawa, T. Hoshi, and K. Tsukada, An RTP simple
   multiplexing transfer method for internet telephony gateway,
   Internet Draft, Internet Engineering Task Force, June 1998. Work in
   progress.

   [6] H. Schulzrinne, RTP profile for audio and video conferences with
   minimal control, Request for Comments (Proposed Standard) 1890,
   Internet Engineering Task Force, Jan. 1996.